new Assessing Climate Transition Risks in the Colombian Processed Food Sector: A Fuzzy Logic and Multicriteria Decision-Making Approach

Authors: Juan F. P\'erez-P\'erez, Pablo Isaza G\'omez, Isis Bonet, Mar\'ia Solange S\'anchez-Pinz\'on, Fabio Caraffini, Christian Lochmuller

Abstract: Climate risk assessment is becoming increasingly important. For organisations, identifying and assessing climate-related risks is challenging, as they can come from multiple sources. This study identifies and assesses the main climate transition risks in the colombian processed food sector. As transition risks are vague, our approach uses Fuzzy Logic and compares it to various multi-criteria decision-making methods to classify the different climate transition risks an organisation may be exposed to. This approach allows us to use linguistic expressions for risk analysis and to better describe risks and their consequences. The results show that the risks ranked as the most critical for this organisation in their order were price volatility and raw materials availability, the change to less carbon-intensive production or consumption patterns, the increase in carbon taxes and technological change, and the associated development or implementation costs. These risks show a critical risk level, which implies that they are the most significant risks for the organisation in the case study. These results highlight the importance of investments needed to meet regulatory requirements, which are the main drivers for organisations at the financial level.

new SemEval-2024 Task 9: BRAINTEASER: A Novel Task Defying Common Sense

Authors: Yifan Jiang, Filip Ilievski, Kaixin Ma

Abstract: While vertical thinking relies on logical and commonsense reasoning, lateral thinking requires systems to defy commonsense associations and overwrite them through unconventional thinking. Lateral thinking has been shown to be challenging for current models but has received little attention. A recent benchmark, BRAINTEASER, aims to evaluate current models' lateral thinking ability in a zero-shot setting. In this paper, we split the original benchmark to also support fine-tuning setting and present SemEval Task 9: BRAIN-TEASER(S), the first task at this competition designed to test the system's reasoning and lateral thinking ability. As a popular task, BRAINTEASER(S)'s two subtasks receive 483 team submissions from 182 participants during the competition. This paper provides a fine-grained system analysis of the competition results, together with a reflection on what this means for the ability of the systems to reason laterally. We hope that the BRAINTEASER(S) subtasks and findings in this paper can stimulate future work on lateral thinking and robust reasoning by computational models.

new Playing Board Games with the Predict Results of Beam Search Algorithm

Authors: Sergey Pastukhov

Abstract: This paper introduces a novel algorithm for two-player deterministic games with perfect information, which we call PROBS (Predict Results of Beam Search). Unlike existing methods that predominantly rely on Monte Carlo Tree Search (MCTS) for decision processes, our approach leverages a simpler beam search algorithm. We evaluate the performance of our algorithm across a selection of board games, where it consistently demonstrates an increased winning ratio against baseline opponents. A key result of this study is that the PROBS algorithm operates effectively, even when the beam search size is considerably smaller than the average number of turns in the game.

new Learning World Models With Hierarchical Temporal Abstractions: A Probabilistic Perspective

Authors: Vaisakh Shaj

Abstract: Machines that can replicate human intelligence with type 2 reasoning capabilities should be able to reason at multiple levels of spatio-temporal abstractions and scales using internal world models. Devising formalisms to develop such internal world models, which accurately reflect the causal hierarchies inherent in the dynamics of the real world, is a critical research challenge in the domains of artificial intelligence and machine learning. This thesis identifies several limitations with the prevalent use of state space models (SSMs) as internal world models and propose two new probabilistic formalisms namely Hidden-Parameter SSMs and Multi-Time Scale SSMs to address these drawbacks. The structure of graphical models in both formalisms facilitates scalable exact probabilistic inference using belief propagation, as well as end-to-end learning via backpropagation through time. This approach permits the development of scalable, adaptive hierarchical world models capable of representing nonstationary dynamics across multiple temporal abstractions and scales. Moreover, these probabilistic formalisms integrate the concept of uncertainty in world states, thus improving the system's capacity to emulate the stochastic nature of the real world and quantify the confidence in its predictions. The thesis also discuss how these formalisms are in line with related neuroscience literature on Bayesian brain hypothesis and predicitive processing. Our experiments on various real and simulated robots demonstrate that our formalisms can match and in many cases exceed the performance of contemporary transformer variants in making long-range future predictions. We conclude the thesis by reflecting on the limitations of our current models and suggesting directions for future research.

new Knowledge Graph Completion using Structural and Textual Embeddings

Authors: Sakher Khalil Alqaaidi, Krzysztof Kochut

Abstract: Knowledge Graphs (KGs) are widely employed in artificial intelligence applications, such as question-answering and recommendation systems. However, KGs are frequently found to be incomplete. While much of the existing literature focuses on predicting missing nodes for given incomplete KG triples, there remains an opportunity to complete KGs by exploring relations between existing nodes, a task known as relation prediction. In this study, we propose a relations prediction model that harnesses both textual and structural information within KGs. Our approach integrates walks-based embeddings with language model embeddings to effectively represent nodes. We demonstrate that our model achieves competitive results in the relation prediction task when evaluated on a widely used dataset.

new ReZero: Boosting MCTS-based Algorithms by Just-in-Time and Speedy Reanalyze

Authors: Chunyu Xuan, Yazhe Niu, Yuan Pu, Shuai Hu, Jing Yang

Abstract: MCTS-based algorithms, such as MuZero and its derivatives, have achieved widespread success in various decision-making domains. These algorithms employ the reanalyze process to enhance sample efficiency, albeit at the expense of significant wall-clock time consumption. To address this issue, we propose a general approach named ReZero to boost MCTS-based algorithms. Specifically, we propose a new scheme that simplifies data collecting and reanalyzing, which significantly reduces the search cost while guarantees the performance as well. Furthermore, to accelerate each search process, we conceive a method to reuse the subsequent information in the trajectory. The corresponding analysis conducted on the bandit model also provides auxiliary theoretical substantiation for our design. Experiments conducted on Atari environments and board games demonstrates that ReZero substantially improves training speed while maintaining high sample efficiency. The code is available as part of the LightZero benchmark at https://github.com/opendilab/LightZero.

URLs: https://github.com/opendilab/LightZero.

new Optimal and Bounded Suboptimal Any-Angle Multi-agent Pathfinding

Authors: Konstantin Yakovlev, Anton Andreychuk, Roni Stern

Abstract: Multi-agent pathfinding (MAPF) is the problem of finding a set of conflict-free paths for a set of agents. Typically, the agents' moves are limited to a pre-defined graph of possible locations and allowed transitions between them, e.g. a 4-neighborhood grid. We explore how to solve MAPF problems when each agent can move between any pair of possible locations as long as traversing the line segment connecting them does not lead to the collision with the obstacles. This is known as any-angle pathfinding. We present the first optimal any-angle multi-agent pathfinding algorithm. Our planner is based on the Continuous Conflict-based Search (CCBS) algorithm and an optimal any-angle variant of the Safe Interval Path Planning (TO-AA-SIPP). The straightforward combination of those, however, scales poorly since any-angle path finding induces search trees with a very large branching factor. To mitigate this, we adapt two techniques from classical MAPF to the any-angle setting, namely Disjoint Splitting and Multi-Constraints. Experimental results on different combinations of these techniques show they enable solving over 30% more problems than the vanilla combination of CCBS and TO-AA-SIPP. In addition, we present a bounded-suboptimal variant of our algorithm, that enables trading runtime for solution cost in a controlled manner.

new Label-Free Topic-Focused Summarization Using Query Augmentation

Authors: Wenchuan Mu, Kwan Hui Lim

Abstract: In today's data and information-rich world, summarization techniques are essential in harnessing vast text to extract key information and enhance decision-making and efficiency. In particular, topic-focused summarization is important due to its ability to tailor content to specific aspects of an extended text. However, this usually requires extensive labelled datasets and considerable computational power. This study introduces a novel method, Augmented-Query Summarization (AQS), for topic-focused summarization without the need for extensive labelled datasets, leveraging query augmentation and hierarchical clustering. This approach facilitates the transferability of machine learning models to the task of summarization, circumventing the need for topic-specific training. Through real-world tests, our method demonstrates the ability to generate relevant and accurate summaries, showing its potential as a cost-effective solution in data-rich environments. This innovation paves the way for broader application and accessibility in the field of topic-focused summarization technology, offering a scalable, efficient method for personalized content extraction.

new SIDEs: Separating Idealization from Deceptive Explanations in xAI

Authors: Emily Sullivan

Abstract: Explainable AI (xAI) methods are important for establishing trust in using black-box models. However, recent criticism has mounted against current xAI methods that they disagree, are necessarily false, and can be manipulated, which has started to undermine the deployment of black-box models. Rudin (2019) goes so far as to say that we should stop using black-box models altogether in high-stakes cases because xAI explanations "must be wrong". However, strict fidelity to the truth is historically not a desideratum in science. Idealizations -- the intentional distortions introduced to scientific theories and models -- are commonplace in the natural sciences and are seen as a successful scientific tool. Thus, it is not falsehood qua falsehood that is the issue. In this paper, I outline the need for xAI research to engage in idealization evaluation. Drawing on the use of idealizations in the natural sciences and philosophy of science, I introduce a novel framework for evaluating whether xAI methods engage in successful idealizations or deceptive explanations (SIDEs). SIDEs evaluates whether the limitations of xAI methods, and the distortions that they introduce, can be part of a successful idealization or are indeed deceptive distortions as critics suggest. I discuss the role that existing research can play in idealization evaluation and where innovation is necessary. Through a qualitative analysis we find that leading feature importance methods and counterfactual explanations are subject to idealization failure and suggest remedies for ameliorating idealization failure.

new Neural Interaction Energy for Multi-Agent Trajectory Prediction

Authors: Kaixin Shen, Ruijie Quan, Linchao Zhu, Jun Xiao, Yi Yang

Abstract: Maintaining temporal stability is crucial in multi-agent trajectory prediction. Insufficient regularization to uphold this stability often results in fluctuations in kinematic states, leading to inconsistent predictions and the amplification of errors. In this study, we introduce a framework called Multi-Agent Trajectory prediction via neural interaction Energy (MATE). This framework assesses the interactive motion of agents by employing neural interaction energy, which captures the dynamics of interactions and illustrates their influence on the future trajectories of agents. To bolster temporal stability, we introduce two constraints: inter-agent interaction constraint and intra-agent motion constraint. These constraints work together to ensure temporal stability at both the system and agent levels, effectively mitigating prediction fluctuations inherent in multi-agent systems. Comparative evaluations against previous methods on four diverse datasets highlight the superior prediction accuracy and generalization capabilities of our model.

new Learning to Beat ByteRL: Exploitability of Collectible Card Game Agents

Authors: Radovan Haluska, Martin Schmid

Abstract: While Poker, as a family of games, has been studied extensively in the last decades, collectible card games have seen relatively little attention. Only recently have we seen an agent that can compete with professional human players in Hearthstone, one of the most popular collectible card games. Although artificial agents must be able to work with imperfect information in both of these genres, collectible card games pose another set of distinct challenges. Unlike in many poker variants, agents must deal with state space so vast that even enumerating all states consistent with the agent's beliefs is intractable, rendering the current search methods unusable and requiring the agents to opt for other techniques. In this paper, we investigate the strength of such techniques for this class of games. Namely, we present preliminary analysis results of ByteRL, the state-of-the-art agent in Legends of Code and Magic and Hearthstone. Although ByteRL beat a top-10 Hearthstone player from China, we show that its play in Legends of Code and Magic is highly exploitable.

new Distilling Privileged Information for Dubins Traveling Salesman Problems with Neighborhoods

Authors: Min Kyu Shin, Su-Jeong Park, Seung-Keol Ryu, Heeyeon Kim, Han-Lim Choi

Abstract: This paper presents a novel learning approach for Dubins Traveling Salesman Problems(DTSP) with Neighborhood (DTSPN) to quickly produce a tour of a non-holonomic vehicle passing through neighborhoods of given task points. The method involves two learning phases: initially, a model-free reinforcement learning approach leverages privileged information to distill knowledge from expert trajectories generated by the LinKernighan heuristic (LKH) algorithm. Subsequently, a supervised learning phase trains an adaptation network to solve problems independently of privileged information. Before the first learning phase, a parameter initialization technique using the demonstration data was also devised to enhance training efficiency. The proposed learning method produces a solution about 50 times faster than LKH and substantially outperforms other imitation learning and RL with demonstration schemes, most of which fail to sense all the task points.

cross A Survey on Generative AI and LLM for Video Generation, Understanding, and Streaming

Authors: Pengyuan Zhou, Lin Wang, Zhi Liu, Yanbin Hao, Pan Hui, Sasu Tarkoma, Jussi Kangasharju

Abstract: This paper offers an insightful examination of how currently top-trending AI technologies, i.e., generative artificial intelligence (Generative AI) and large language models (LLMs), are reshaping the field of video technology, including video generation, understanding, and streaming. It highlights the innovative use of these technologies in producing highly realistic videos, a significant leap in bridging the gap between real-world dynamics and digital creation. The study also delves into the advanced capabilities of LLMs in video understanding, demonstrating their effectiveness in extracting meaningful information from visual content, thereby enhancing our interaction with videos. In the realm of video streaming, the paper discusses how LLMs contribute to more efficient and user-centric streaming experiences, adapting content delivery to individual viewer preferences. This comprehensive review navigates through the current achievements, ongoing challenges, and future possibilities of applying Generative AI and LLMs to video-related tasks, underscoring the immense potential these technologies hold for advancing the field of video technology related to multimedia, networking, and AI communities.

cross Forklift: An Extensible Neural Lifter

Authors: Jordi Armengol-Estap\'e, Rodrigo C. O. Rocha, Jackson Woodruff, Pasquale Minervini, Michael F. P. O'Boyle

Abstract: The escalating demand to migrate legacy software across different Instruction Set Architectures (ISAs) has driven the development of assembly-to-assembly translators to map between their respective assembly languages. However, the development of these tools requires substantial engineering effort. State-of-the-art approaches use lifting, a technique where source assembly code is translated to an architecture-independent intermediate representation (IR) (for example, the LLVM IR) and use a pre-existing compiler to recompile the IR to the target ISA. However, the hand-written rules these lifters employ are sensitive to the particular compiler and optimization level used to generate the code and require significant engineering effort to support each new ISA. We propose Forklift, the first neural lifter that learns how to translate assembly to LLVM IR using a token-level encoder-decoder Transformer. We show how to incrementally add support to new ISAs by fine tuning the assembly encoder and freezing the IR decoder, improving the overall accuracy and efficiency. We collect millions of parallel LLVM IR, x86, ARM, and RISC-V programs across compilers and optimization levels to train Forklift and set up an input/output-based accuracy harness. We evaluate Forklift on two challenging benchmark suites and translate 2.5x more x86 programs than a state-of-the-art hand-written lifter and 4.4x more x86 programs than GPT-4 as well as enabling translation from new ISAs.

cross How explainable AI affects human performance: A systematic review of the behavioural consequences of saliency maps

Authors: Romy M\"uller

Abstract: Saliency maps can explain how deep neural networks classify images. But are they actually useful for humans? The present systematic review of 68 user studies found that while saliency maps can enhance human performance, null effects or even costs are quite common. To investigate what modulates these effects, the empirical outcomes were organised along several factors related to the human tasks, AI performance, XAI methods, images to be classified, human participants and comparison conditions. In image-focused tasks, benefits were less common than in AI-focused tasks, but the effects depended on the specific cognitive requirements. Moreover, benefits were usually restricted to incorrect AI predictions in AI-focused tasks but to correct ones in image-focused tasks. XAI-related factors had surprisingly little impact. The evidence was limited for image- and human-related factors and the effects were highly dependent on the comparison conditions. These findings may support the design of future user studies.

cross Elicitron: An LLM Agent-Based Simulation Framework for Design Requirements Elicitation

Authors: Mohammadmehdi Ataei, Hyunmin Cheong, Daniele Grandi, Ye Wang, Nigel Morris, Alexander Tessier

Abstract: Requirements elicitation, a critical, yet time-consuming and challenging step in product development, often fails to capture the full spectrum of user needs. This may lead to products that fall short of expectations. This paper introduces a novel framework that leverages Large Language Models (LLMs) to automate and enhance the requirements elicitation process. LLMs are used to generate a vast array of simulated users (LLM agents), enabling the exploration of a much broader range of user needs and unforeseen use cases. These agents engage in product experience scenarios, through explaining their actions, observations, and challenges. Subsequent agent interviews and analysis uncover valuable user needs, including latent ones. We validate our framework with three experiments. First, we explore different methodologies for diverse agent generation, discussing their advantages and shortcomings. We measure the diversity of identified user needs and demonstrate that context-aware agent generation leads to greater diversity. Second, we show how our framework effectively mimics empathic lead user interviews, identifying a greater number of latent needs than conventional human interviews. Third, we showcase that LLMs can be used to analyze interviews, capture needs, and classify them as latent or not. Our work highlights the potential of using LLM agents to accelerate early-stage product development, reduce costs, and increase innovation.

cross From "AI" to Probabilistic Automation: How Does Anthropomorphization of Technical Systems Descriptions Influence Trust?

Authors: Nanna Inie, Stefania Druga, Peter Zukerman, Emily M. Bender

Abstract: This paper investigates the influence of anthropomorphized descriptions of so-called "AI" (artificial intelligence) systems on people's self-assessment of trust in the system. Building on prior work, we define four categories of anthropomorphization (1. Properties of a cognizer, 2. Agency, 3. Biological metaphors, and 4. Properties of a communicator). We use a survey-based approach (n=954) to investigate whether participants are likely to trust one of two (fictitious) "AI" systems by randomly assigning people to see either an anthropomorphized or a de-anthropomorphized description of the systems. We find that participants are no more likely to trust anthropomorphized over de-anthropmorphized product descriptions overall. The type of product or system in combination with different anthropomorphic categories appears to exert greater influence on trust than anthropomorphizing language alone, and age is the only demographic factor that significantly correlates with people's preference for anthropomorphized or de-anthropomorphized descriptions. When elaborating on their choices, participants highlight factors such as lesser of two evils, lower or higher stakes contexts, and human favoritism as driving motivations when choosing between product A and B, irrespective of whether they saw an anthropomorphized or a de-anthropomorphized description of the product. Our results suggest that "anthropomorphism" in "AI" descriptions is an aggregate concept that may influence different groups differently, and provide nuance to the discussion of whether anthropomorphization leads to higher trust and over-reliance by the general public in systems sold as "AI".

cross GUIDE: Graphical User Interface Data for Execution

Authors: Rajat Chawla, Adarsh Jha, Muskaan Kumar, Mukunda NS, Ishaan Bhola

Abstract: In this paper, we introduce GUIDE, a novel dataset tailored for the advancement of Multimodal Large Language Model (MLLM) applications, particularly focusing on Robotic Process Automation (RPA) use cases. Our dataset encompasses diverse data from various websites including Apollo(62.67\%), Gmail(3.43\%), Calendar(10.98\%) and Canva(22.92\%). Each data entry includes an image, a task description, the last action taken, CoT and the next action to be performed along with grounding information of where the action needs to be executed. The data is collected using our in-house advanced annotation tool NEXTAG (Next Action Grounding and Annotation Tool). The data is adapted for multiple OS, browsers and display types. It is collected by multiple annotators to capture the variation of design and the way person uses a website. Through this dataset, we aim to facilitate research and development in the realm of LLMs for graphical user interfaces, particularly in tasks related to RPA. The dataset's multi-platform nature and coverage of diverse websites enable the exploration of cross-interface capabilities in automation tasks. We believe that our dataset will serve as a valuable resource for advancing the capabilities of multi-platform LLMs in practical applications, fostering innovation in the field of automation and natural language understanding. Using GUIDE, we build V-Zen, the first RPA model to automate multiple websites using our in-House Automation tool AUTONODE

cross Human Latency Conversational Turns for Spoken Avatar Systems

Authors: Derek Jacoby, Tianyi Zhang, Aanchan Mohan, Yvonne Coady

Abstract: A problem with many current Large Language Model (LLM) driven spoken dialogues is the response time. Some efforts such as Groq address this issue by lightning fast processing of the LLM, but we know from the cognitive psychology literature that in human-to-human dialogue often responses occur prior to the speaker completing their utterance. No amount of delay for LLM processing is acceptable if we wish to maintain human dialogue latencies. In this paper, we discuss methods for understanding an utterance in close to real time and generating a response so that the system can comply with human-level conversational turn delays. This means that the information content of the final part of the speaker's utterance is lost to the LLM. Using the Google NaturalQuestions (NQ) database, our results show GPT-4 can effectively fill in missing context from a dropped word at the end of a question over 60% of the time. We also provide some examples of utterances and the impacts of this information loss on the quality of LLM response in the context of an avatar that is currently under development. These results indicate that a simple classifier could be used to determine whether a question is semantically complete, or requires a filler phrase to allow a response to be generated within human dialogue time constraints.

cross LlamaTouch: A Faithful and Scalable Testbed for Mobile UI Automation Task Evaluation

Authors: Li Zhang, Shihe Wang, Xianqing Jia, Zhihan Zheng, Yunhe Yan, Longxi Gao, Yuanchun Li, Mengwei Xu

Abstract: The emergent large language/multimodal models facilitate the evolution of mobile agents, especially in the task of mobile UI automation. However, existing evaluation approaches, which rely on human validation or established datasets to compare agent-predicted actions with predefined ones, are unscalable and unfaithful. To overcome these limitations, this paper presents LlamaTouch, a testbed for on-device agent execution and faithful, scalable agent evaluation. By observing that the task execution process only transfers UI states, LlamaTouch employs a novel evaluation approach that only assesses whether an agent traverses all manually annotated, essential application/system states. LlamaTouch comprises three key techniques: (1) On-device task execution that enables mobile agents to interact with real mobile environments for task completion. (2) Fine-grained UI component annotation that merges pixel-level screenshots and textual screen hierarchies to explicitly identify and precisely annotate essential UI components with a rich set of designed annotation primitives. (3) A multi-level state matching algorithm that utilizes exact and fuzzy matching to accurately detect critical information in each screen with unpredictable UI layout/content dynamics. LlamaTouch currently incorporates four mobile agents and 495 UI automation tasks, encompassing both tasks in the widely-used datasets and our self-constructed ones for more diverse mobile applications. Evaluation results demonstrate the LlamaTouch's high faithfulness of evaluation in real environments and its better scalability than human validation. LlamaTouch also enables easy task annotation and integration of new mobile agents. Code and dataset are publicly available at https://github.com/LlamaTouch/LlamaTouch.

URLs: https://github.com/LlamaTouch/LlamaTouch.

cross LuminLab: An AI-Powered Building Retrofit and Energy Modelling Platform

Authors: Kevin Credit, Qian Xiao, Jack Lehane, Juan Vazquez, Dan Liu, Leo De Figueiredo

Abstract: This paper describes the technical and conceptual development of the LuminLab platform, an online tool that integrates a purpose-fit human-centric AI chatbot and predictive energy model into a streamlined front-end that can rapidly produce and discuss building retrofit plans in natural language. The platform provides users with the ability to engage with a range of possible retrofit pathways tailored to their individual budget and building needs on-demand. Given the complicated and costly nature of building retrofit projects, which rely on a variety of stakeholder groups with differing goals and incentives, we feel that AI-powered tools such as this have the potential to pragmatically de-silo knowledge, improve communication, and empower individual homeowners to undertake incremental retrofit projects that might not happen otherwise.

cross Layout2Rendering: AI-aided Greenspace design

Authors: Ran Chen, Zeke Lian, Yueheng He, Xiao Ling, Fuyu Yang, Xueqi Yao, Xingjian Yi, Jing Zhao

Abstract: In traditional human living environment landscape design, the establishment of three-dimensional models is an essential step for designers to intuitively present the spatial relationships of design elements, as well as a foundation for conducting landscape analysis on the site. Rapidly and effectively generating beautiful and realistic landscape spaces is a significant challenge faced by designers. Although generative design has been widely applied in related fields, they mostly generate three-dimensional models through the restriction of indicator parameters. However, the elements of landscape design are complex and have unique requirements, making it difficult to generate designs from the perspective of indicator limitations. To address these issues, this study proposes a park space generative design system based on deep learning technology. This system generates design plans based on the topological relationships of landscape elements, then vectorizes the plan element information, and uses Grasshopper to generate three-dimensional models while synchronously fine-tuning parameters, rapidly completing the entire process from basic site conditions to model effect analysis. Experimental results show that: (1) the system, with the aid of AI-assisted technology, can rapidly generate space green space schemes that meet the designer's perspective based on site conditions; (2) this study has vectorized and three-dimensionalized various types of landscape design elements based on semantic information; (3) the analysis and visualization module constructed in this study can perform landscape analysis on the generated three-dimensional models and produce node effect diagrams, allowing users to modify the design in real time based on the effects, thus enhancing the system's interactivity.

cross Interactive Visual Learning for Stable Diffusion

Authors: Seongmin Lee, Benjamin Hoover, Hendrik Strobelt, Zijie J. Wang, ShengYun Peng, Austin Wright, Kevin Li, Haekyu Park, Haoyang Yang, Polo Chau

Abstract: Diffusion-based generative models' impressive ability to create convincing images has garnered global attention. However, their complex internal structures and operations often pose challenges for non-experts to grasp. We introduce Diffusion Explainer, the first interactive visualization tool designed to elucidate how Stable Diffusion transforms text prompts into images. It tightly integrates a visual overview of Stable Diffusion's complex components with detailed explanations of their underlying operations. This integration enables users to fluidly transition between multiple levels of abstraction through animations and interactive elements. Offering real-time hands-on experience, Diffusion Explainer allows users to adjust Stable Diffusion's hyperparameters and prompts without the need for installation or specialized hardware. Accessible via users' web browsers, Diffusion Explainer is making significant strides in democratizing AI education, fostering broader public access. More than 7,200 users spanning 113 countries have used our open-sourced tool at https://poloclub.github.io/diffusion-explainer/. A video demo is available at https://youtu.be/MbkIADZjPnA.

URLs: https://poloclub.github.io/diffusion-explainer/., https://youtu.be/MbkIADZjPnA.

cross Augmenting the Author: Exploring the Potential of AI Collaboration in Academic Writing

Authors: Joseph Tu, Hilda Hadan, Derrick M. Wang, Sabrina A Sgandurra, Reza Hadi Mogavi, Lennart E. Nacke

Abstract: This workshop paper presents a critical examination of the integration of Generative AI (Gen AI) into the academic writing process, focusing on the use of AI as a collaborative tool. It contrasts the performance and interaction of two AI models, Gemini and ChatGPT, through a collaborative inquiry approach where researchers engage in facilitated sessions to design prompts that elicit specific AI responses for crafting research outlines. This case study highlights the importance of prompt design, output analysis, and recognizing the AI's limitations to ensure responsible and effective AI integration in scholarly work. Preliminary findings suggest that prompt variation significantly affects output quality and reveals distinct capabilities and constraints of each model. The paper contributes to the field of Human-Computer Interaction by exploring effective prompt strategies and providing a comparative analysis of Gen AI models, ultimately aiming to enhance AI-assisted academic writing and prompt a deeper dialogue within the HCI community.

cross Explaining AI Decisions: Towards Achieving Human-Centered Explainability in Smart Home Environments

Authors: Md Shajalal, Alexander Boden, Gunnar Stevens, Delong Du, Dean-Robin Kern

Abstract: Smart home systems are gaining popularity as homeowners strive to enhance their living and working environments while minimizing energy consumption. However, the adoption of artificial intelligence (AI)-enabled decision-making models in smart home systems faces challenges due to the complexity and black-box nature of these systems, leading to concerns about explainability, trust, transparency, accountability, and fairness. The emerging field of explainable artificial intelligence (XAI) addresses these issues by providing explanations for the models' decisions and actions. While state-of-the-art XAI methods are beneficial for AI developers and practitioners, they may not be easily understood by general users, particularly household members. This paper advocates for human-centered XAI methods, emphasizing the importance of delivering readily comprehensible explanations to enhance user satisfaction and drive the adoption of smart home systems. We review state-of-the-art XAI methods and prior studies focusing on human-centered explanations for general users in the context of smart home applications. Through experiments on two smart home application scenarios, we demonstrate that explanations generated by prominent XAI techniques might not be effective in helping users understand and make decisions. We thus argue for the necessity of a human-centric approach in representing explanations in smart home systems and highlight relevant human-computer interaction (HCI) methodologies, including user studies, prototyping, technology probes analysis, and heuristic evaluation, that can be employed to generate and present human-centered explanations to users.

cross Semantic Evolvement Enhanced Graph Autoencoder for Rumor Detection

Authors: Xiang Tao, Liang Wang, Qiang Liu, Shu Wu, Liang Wang

Abstract: Due to the rapid spread of rumors on social media, rumor detection has become an extremely important challenge. Recently, numerous rumor detection models which utilize textual information and the propagation structure of events have been proposed. However, these methods overlook the importance of semantic evolvement information of event in propagation process, which is often challenging to be truly learned in supervised training paradigms and traditional rumor detection methods. To address this issue, we propose a novel semantic evolvement enhanced Graph Autoencoder for Rumor Detection (GARD) model in this paper. The model learns semantic evolvement information of events by capturing local semantic changes and global semantic evolvement information through specific graph autoencoder and reconstruction strategies. By combining semantic evolvement information and propagation structure information, the model achieves a comprehensive understanding of event propagation and perform accurate and robust detection, while also detecting rumors earlier by capturing semantic evolvement information in the early stages. Moreover, in order to enhance the model's ability to learn the distinct patterns of rumors and non-rumors, we introduce a uniformity regularizer to further improve the model's performance. Experimental results on three public benchmark datasets confirm the superiority of our GARD method over the state-of-the-art approaches in both overall performance and early rumor detection.

cross Enhancing Diagnosis through AI-driven Analysis of Reflectance Confocal Microscopy

Authors: Hong-Jun Yoon, Chris Keum, Alexander Witkowski, Joanna Ludzik, Tracy Petrie, Heidi A. Hanson, Sancy A. Leachman

Abstract: Reflectance Confocal Microscopy (RCM) is a non-invasive imaging technique used in biomedical research and clinical dermatology. It provides virtual high-resolution images of the skin and superficial tissues, reducing the need for physical biopsies. RCM employs a laser light source to illuminate the tissue, capturing the reflected light to generate detailed images of microscopic structures at various depths. Recent studies explored AI and machine learning, particularly CNNs, for analyzing RCM images. Our study proposes a segmentation strategy based on textural features to identify clinically significant regions, empowering dermatologists in effective image interpretation and boosting diagnostic confidence. This approach promises to advance dermatological diagnosis and treatment.

cross Mamba-360: Survey of State Space Models as Transformer Alternative for Long Sequence Modelling: Methods, Applications, and Challenges

Authors: Badri Narayana Patro, Vijay Srinivas Agneeswaran

Abstract: Sequence modeling is a crucial area across various domains, including Natural Language Processing (NLP), speech recognition, time series forecasting, music generation, and bioinformatics. Recurrent Neural Networks (RNNs) and Long Short Term Memory Networks (LSTMs) have historically dominated sequence modeling tasks like Machine Translation, Named Entity Recognition (NER), etc. However, the advancement of transformers has led to a shift in this paradigm, given their superior performance. Yet, transformers suffer from $O(N^2)$ attention complexity and challenges in handling inductive bias. Several variations have been proposed to address these issues which use spectral networks or convolutions and have performed well on a range of tasks. However, they still have difficulty in dealing with long sequences. State Space Models(SSMs) have emerged as promising alternatives for sequence modeling paradigms in this context, especially with the advent of S4 and its variants, such as S4nd, Hippo, Hyena, Diagnol State Spaces (DSS), Gated State Spaces (GSS), Linear Recurrent Unit (LRU), Liquid-S4, Mamba, etc. In this survey, we categorize the foundational SSMs based on three paradigms namely, Gating architectures, Structural architectures, and Recurrent architectures. This survey also highlights diverse applications of SSMs across domains such as vision, video, audio, speech, language (especially long sequence modeling), medical (including genomics), chemical (like drug design), recommendation systems, and time series analysis, including tabular data. Moreover, we consolidate the performance of SSMs on benchmark datasets like Long Range Arena (LRA), WikiText, Glue, Pile, ImageNet, Kinetics-400, sstv2, as well as video datasets such as Breakfast, COIN, LVU, and various time series datasets. The project page for Mamba-360 work is available on this webpage.\url{https://github.com/badripatro/mamba360}.

URLs: https://github.com/badripatro/mamba360

cross Online Personalizing White-box LLMs Generation with Neural Bandits

Authors: Zekai Chen, Weeden Daniel, Po-yu Chen, Francois Buet-Golfouse

Abstract: The advent of personalized content generation by LLMs presents a novel challenge: how to efficiently adapt text to meet individual preferences without the unsustainable demand of creating a unique model for each user. This study introduces an innovative online method that employs neural bandit algorithms to dynamically optimize soft instruction embeddings based on user feedback, enhancing the personalization of open-ended text generation by white-box LLMs. Through rigorous experimentation on various tasks, we demonstrate significant performance improvements over baseline strategies. NeuralTS, in particular, leads to substantial enhancements in personalized news headline generation, achieving up to a 62.9% improvement in terms of best ROUGE scores and up to 2.76% increase in LLM-agent evaluation against the baseline.

cross Classifying Human-Generated and AI-Generated Election Claims in Social Media

Authors: Alphaeus Dmonte, Marcos Zampieri, Kevin Lybarger, Massimiliano Albanese

Abstract: Politics is one of the most prevalent topics discussed on social media platforms, particularly during major election cycles, where users engage in conversations about candidates and electoral processes. Malicious actors may use this opportunity to disseminate misinformation to undermine trust in the electoral process. The emergence of Large Language Models (LLMs) exacerbates this issue by enabling malicious actors to generate misinformation at an unprecedented scale. Artificial intelligence (AI)-generated content is often indistinguishable from authentic user content, raising concerns about the integrity of information on social networks. In this paper, we present a novel taxonomy for characterizing election-related claims. This taxonomy provides an instrument for analyzing election-related claims, with granular categories related to jurisdiction, equipment, processes, and the nature of claims. We introduce ElectAI, a novel benchmark dataset that consists of 9,900 tweets, each labeled as human- or AI-generated. For AI-generated tweets, the specific LLM variant that produced them is specified. We annotated a subset of 1,550 tweets using the proposed taxonomy to capture the characteristics of election-related claims. We explored the capabilities of LLMs in extracting the taxonomy attributes and trained various machine learning models using ElectAI to distinguish between human- and AI-generated posts and identify the specific LLM variant.

cross FairDeDup: Detecting and Mitigating Vision-Language Fairness Disparities in Semantic Dataset Deduplication

Authors: Eric Slyman, Stefan Lee, Scott Cohen, Kushal Kafle

Abstract: Recent dataset deduplication techniques have demonstrated that content-aware dataset pruning can dramatically reduce the cost of training Vision-Language Pretrained (VLP) models without significant performance losses compared to training on the original dataset. These results have been based on pruning commonly used image-caption datasets collected from the web -- datasets that are known to harbor harmful social biases that may then be codified in trained models. In this work, we evaluate how deduplication affects the prevalence of these biases in the resulting trained models and introduce an easy-to-implement modification to the recent SemDeDup algorithm that can reduce the negative effects that we observe. When examining CLIP-style models trained on deduplicated variants of LAION-400M, we find our proposed FairDeDup algorithm consistently leads to improved fairness metrics over SemDeDup on the FairFace and FACET datasets while maintaining zero-shot performance on CLIP benchmarks.

cross From Local to Global: A Graph RAG Approach to Query-Focused Summarization

Authors: Darren Edge, Ha Trinh, Newman Cheng, Joshua Bradley, Alex Chao, Apurva Mody, Steven Truitt, Jonathan Larson

Abstract: The use of retrieval-augmented generation (RAG) to retrieve relevant information from an external knowledge source enables large language models (LLMs) to answer questions over private and/or previously unseen document collections. However, RAG fails on global questions directed at an entire text corpus, such as "What are the main themes in the dataset?", since this is inherently a query-focused summarization (QFS) task, rather than an explicit retrieval task. Prior QFS methods, meanwhile, fail to scale to the quantities of text indexed by typical RAG systems. To combine the strengths of these contrasting methods, we propose a Graph RAG approach to question answering over private text corpora that scales with both the generality of user questions and the quantity of source text to be indexed. Our approach uses an LLM to build a graph-based text index in two stages: first to derive an entity knowledge graph from the source documents, then to pregenerate community summaries for all groups of closely-related entities. Given a question, each community summary is used to generate a partial response, before all partial responses are again summarized in a final response to the user. For a class of global sensemaking questions over datasets in the 1 million token range, we show that Graph RAG leads to substantial improvements over a na\"ive RAG baseline for both the comprehensiveness and diversity of generated answers. An open-source, Python-based implementation of both global and local Graph RAG approaches is forthcoming at https://aka.ms/graphrag.

URLs: https://aka.ms/graphrag.

cross AFU: Actor-Free critic Updates in off-policy RL for continuous control

Authors: Nicolas Perrin-Gilbert

Abstract: This paper presents AFU, an off-policy deep RL algorithm addressing in a new way the challenging "max-Q problem" in Q-learning for continuous action spaces, with a solution based on regression and conditional gradient scaling. AFU has an actor but its critic updates are entirely independent from it. As a consequence, the actor can be chosen freely. In the initial version, AFU-alpha, we employ the same stochastic actor as in Soft Actor-Critic (SAC), but we then study a simple failure mode of SAC and show how AFU can be modified to make actor updates less likely to become trapped in local optima, resulting in a second version of the algorithm, AFU-beta. Experimental results demonstrate the sample efficiency of both versions of AFU, marking it as the first model-free off-policy algorithm competitive with state-of-the-art actor-critic methods while departing from the actor-critic perspective.

cross Domain-Specific Improvement on Psychotherapy Chatbot Using Assistant

Authors: Cheng Kang, Daniel Novak, Katerina Urbanova, Yuqing Cheng, Yong Hu

Abstract: Large language models (LLMs) have demonstrated impressive generalization capabilities on specific tasks with human-written instruction data. However, the limited quantity, diversity, and professional expertise of such instruction data raise concerns about the performance of LLMs in psychotherapy tasks when provided with domain-specific instructions. To address this, we firstly propose Domain-Specific Assistant Instructions based on AlexanderStreet therapy, and secondly, we use an adaption fine-tuning method and retrieval augmented generation method to improve pre-trained LLMs. Through quantitative evaluation of linguistic quality using automatic and human evaluation, we observe that pre-trained LLMs on Psychotherapy Assistant Instructions outperform state-of-the-art LLMs response baselines. Our Assistant-Instruction approach offers a half-annotation method to align pre-trained LLMs with instructions and provide pre-trained LLMs with more psychotherapy knowledge.

cross Scaling Lifelong Multi-Agent Path Finding to More Realistic Settings: Research Challenges and Opportunities

Authors: He Jiang, Yulun Zhang, Rishi Veerapaneni, Jiaoyang Li

Abstract: Multi-Agent Path Finding (MAPF) is the problem of moving multiple agents from starts to goals without collisions. Lifelong MAPF (LMAPF) extends MAPF by continuously assigning new goals to agents. We present our winning approach to the 2023 League of Robot Runners LMAPF competition, which leads us to several interesting research challenges and future directions. In this paper, we outline three main research challenges. The first challenge is to search for high-quality LMAPF solutions within a limited planning time (e.g., 1s per step) for a large number of agents (e.g., 10,000) or extremely high agent density (e.g., 97.7%). We present future directions such as developing more competitive rule-based and anytime MAPF algorithms and parallelizing state-of-the-art MAPF algorithms. The second challenge is to alleviate congestion and the effect of myopic behaviors in LMAPF algorithms. We present future directions, such as developing moving guidance and traffic rules to reduce congestion, incorporating future prediction and real-time search, and determining the optimal agent number. The third challenge is to bridge the gaps between the LMAPF models used in the literature and real-world applications. We present future directions, such as dealing with more realistic kinodynamic models, execution uncertainty, and evolving systems.

cross Towards a Holistic Evaluation of LLMs on Factual Knowledge Recall

Authors: Jiaqing Yuan, Lin Pan, Chung-Wei Hang, Jiang Guo, Jiarong Jiang, Bonan Min, Patrick Ng, Zhiguo Wang

Abstract: Large language models (LLMs) have shown remarkable performance on a variety of NLP tasks, and are being rapidly adopted in a wide range of use cases. It is therefore of vital importance to holistically evaluate the factuality of their generated outputs, as hallucinations remain a challenging issue. In this work, we focus on assessing LLMs' ability to recall factual knowledge learned from pretraining, and the factors that affect this ability. To that end, we construct FACT-BENCH, a representative benchmark covering 20 domains, 134 property types, 3 answer types, and different knowledge popularity levels. We benchmark 31 models from 10 model families and provide a holistic assessment of their strengths and weaknesses. We observe that instruction-tuning hurts knowledge recall, as pretraining-only models consistently outperform their instruction-tuned counterparts, and positive effects of model scaling, as larger models outperform smaller ones for all model families. However, the best performance from GPT-4 still represents a large gap with the upper-bound. We additionally study the role of in-context exemplars using counterfactual demonstrations, which lead to significant degradation of factual knowledge recall for large models. By further decoupling model known and unknown knowledge, we find the degradation is attributed to exemplars that contradict a model's known knowledge, as well as the number of such exemplars. Lastly, we fine-tune LLaMA-7B in different settings of known and unknown knowledge. In particular, fine-tuning on a model's known knowledge is beneficial, and consistently outperforms fine-tuning on unknown and mixed knowledge. We will make our benchmark publicly available.

cross The Over-Certainty Phenomenon in Modern UDA Algorithms

Authors: Fin Amin, Jung-Eun Kim

Abstract: When neural networks are confronted with unfamiliar data that deviate from their training set, this signifies a domain shift. While these networks output predictions on their inputs, they typically fail to account for their level of familiarity with these novel observations. This challenge becomes even more pronounced in resource-constrained settings, such as embedded systems or edge devices. To address such challenges, we aim to recalibrate a neural network's decision boundaries in relation to its cognizance of the data it observes, introducing an approach we coin as certainty distillation. While prevailing works navigate unsupervised domain adaptation (UDA) with the goal of curtailing model entropy, they unintentionally birth models that grapple with calibration inaccuracies - a dilemma we term the over-certainty phenomenon. In this paper, we probe the drawbacks of this traditional learning model. As a solution to the issue, we propose a UDA algorithm that not only augments accuracy but also assures model calibration, all while maintaining suitability for environments with limited computational resources.

cross Advancing Recommender Systems by mitigating Shilling attacks

Authors: Aditya Chichani, Juzer Golwala, Tejas Gundecha, Kiran Gawande

Abstract: Considering the premise that the number of products offered grow in an exponential fashion and the amount of data that a user can assimilate before making a decision is relatively small, recommender systems help in categorizing content according to user preferences. Collaborative filtering is a widely used method for computing recommendations due to its good performance. But, this method makes the system vulnerable to attacks which try to bias the recommendations. These attacks, known as 'shilling attacks' are performed to push an item or nuke an item in the system. This paper proposes an algorithm to detect such shilling profiles in the system accurately and also study the effects of such profiles on the recommendations.

cross ABCD: Trust enhanced Attention based Convolutional Autoencoder for Risk Assessment

Authors: Sarala Naidu, Ning Xiong

Abstract: Anomaly detection in industrial systems is crucial for preventing equipment failures, ensuring risk identification, and maintaining overall system efficiency. Traditional monitoring methods often rely on fixed thresholds and empirical rules, which may not be sensitive enough to detect subtle changes in system health and predict impending failures. To address this limitation, this paper proposes, a novel Attention-based convolutional autoencoder (ABCD) for risk detection and map the risk value derive to the maintenance planning. ABCD learns the normal behavior of conductivity from historical data of a real-world industrial cooling system and reconstructs the input data, identifying anomalies that deviate from the expected patterns. The framework also employs calibration techniques to ensure the reliability of its predictions. Evaluation results demonstrate that with the attention mechanism in ABCD a 57.4% increase in performance and a reduction of false alarms by 9.37% is seen compared to without attention. The approach can effectively detect risks, the risk priority rank mapped to maintenance, providing valuable insights for cooling system designers and service personnel. Calibration error of 0.03% indicates that the model is well-calibrated and enhances model's trustworthiness, enabling informed decisions about maintenance strategies

cross Pearls from Pebbles: Improved Confidence Functions for Auto-labeling

Authors: Harit Vishwakarma (Yi), Reid (Yi), Chen, Sui Jiet Tay, Satya Sai Srinath Namburi, Frederic Sala, Ramya Korlakai Vinayak

Abstract: Auto-labeling is an important family of techniques that produce labeled training sets with minimum manual labeling. A prominent variant, threshold-based auto-labeling (TBAL), works by finding a threshold on a model's confidence scores above which it can accurately label unlabeled data points. However, many models are known to produce overconfident scores, leading to poor TBAL performance. While a natural idea is to apply off-the-shelf calibration methods to alleviate the overconfidence issue, such methods still fall short. Rather than experimenting with ad-hoc choices of confidence functions, we propose a framework for studying the \emph{optimal} TBAL confidence function. We develop a tractable version of the framework to obtain \texttt{Colander} (Confidence functions for Efficient and Reliable Auto-labeling), a new post-hoc method specifically designed to maximize performance in TBAL systems. We perform an extensive empirical evaluation of our method \texttt{Colander} and compare it against methods designed for calibration. \texttt{Colander} achieves up to 60\% improvements on coverage over the baselines while maintaining auto-labeling error below $5\%$ and using the same amount of labeled data as the baselines.

cross Improving Multi-label Recognition using Class Co-Occurrence Probabilities

Authors: Samyak Rawlekar, Shubhang Bhatnagar, Vishnuvardhan Pogunulu Srinivasulu, Narendra Ahuja

Abstract: Multi-label Recognition (MLR) involves the identification of multiple objects within an image. To address the additional complexity of this problem, recent works have leveraged information from vision-language models (VLMs) trained on large text-images datasets for the task. These methods learn an independent classifier for each object (class), overlooking correlations in their occurrences. Such co-occurrences can be captured from the training data as conditional probabilities between a pair of classes. We propose a framework to extend the independent classifiers by incorporating the co-occurrence information for object pairs to improve the performance of independent classifiers. We use a Graph Convolutional Network (GCN) to enforce the conditional probabilities between classes, by refining the initial estimates derived from image and text sources obtained using VLMs. We validate our method on four MLR datasets, where our approach outperforms all state-of-the-art methods.

cross ApisTox: a new benchmark dataset for the classification of small molecules toxicity on honey bees

Authors: Jakub Adamczyk, Jakub Poziemski, Pawe{\l} Siedlecki

Abstract: The global decline in bee populations poses significant risks to agriculture, biodiversity, and environmental stability. To bridge the gap in existing data, we introduce ApisTox, a comprehensive dataset focusing on the toxicity of pesticides to honey bees (Apis mellifera). This dataset combines and leverages data from existing sources such as ECOTOX and PPDB, providing an extensive, consistent, and curated collection that surpasses the previous datasets. ApisTox incorporates a wide array of data, including toxicity levels for chemicals, details such as time of their publication in literature, and identifiers linking them to external chemical databases. This dataset may serve as an important tool for environmental and agricultural research, but also can support the development of policies and practices aimed at minimizing harm to bee populations. Finally, ApisTox offers a unique resource for benchmarking molecular property prediction methods on agrochemical compounds, facilitating advancements in both environmental science and cheminformatics. This makes it a valuable tool for both academic research and practical applications in bee conservation.

cross GPU-RANC: A CUDA Accelerated Simulation Framework for Neuromorphic Architectures

Authors: Sahil Hassan, Michael Inouye, Miguel C. Gonzalez, Ilkin Aliyev, Joshua Mack, Maisha Hafiz, Ali Akoglu

Abstract: Open-source simulation tools play a crucial role for neuromorphic application engineers and hardware architects to investigate performance bottlenecks and explore design optimizations before committing to silicon. Reconfigurable Architecture for Neuromorphic Computing (RANC) is one such tool that offers ability to execute pre-trained Spiking Neural Network (SNN) models within a unified ecosystem through both software-based simulation and FPGA-based emulation. RANC has been utilized by the community with its flexible and highly parameterized design to study implementation bottlenecks, tune architectural parameters or modify neuron behavior based on application insights and study the trade space on hardware performance and network accuracy. In designing architectures for use in neuromorphic computing, there are an incredibly large number of configuration parameters such as number and precision of weights per neuron, neuron and axon counts per core, network topology, and neuron behavior. To accelerate such studies and provide users with a streamlined productive design space exploration, in this paper we introduce the GPU-based implementation of RANC. We summarize our parallelization approach and quantify the speedup gains achieved with GPU-based tick-accurate simulations across various use cases. We demonstrate up to 780 times speedup compared to serial version of the RANC simulator based on a 512 neuromorphic core MNIST inference application. We believe that the RANC ecosystem now provides a much more feasible avenue in the research of exploring different optimizations for accelerating SNNs and performing richer studies by enabling rapid convergence to optimized neuromorphic architectures.

cross Efficient NAS with FaDE on Hierarchical Spaces

Authors: Simon Neumeyer, Julian Stier, Michael Granitzer

Abstract: Neural architecture search (NAS) is a challenging problem. Hierarchical search spaces allow for cheap evaluations of neural network sub modules to serve as surrogate for architecture evaluations. Yet, sometimes the hierarchy is too restrictive or the surrogate fails to generalize. We present FaDE which uses differentiable architecture search to obtain relative performance predictions on finite regions of a hierarchical NAS space. The relative nature of these ranks calls for a memory-less, batch-wise outer search algorithm for which we use an evolutionary algorithm with pseudo-gradient descent. FaDE is especially suited on deep hierarchical, respectively multi-cell search spaces, which it can explore by linear instead of exponential cost and therefore eliminates the need for a proxy search space. Our experiments show that firstly, FaDE-ranks on finite regions of the search space correlate with corresponding architecture performances and secondly, the ranks can empower a pseudo-gradient evolutionary search on the complete neural architecture search space.

cross AutoGluon-Multimodal (AutoMM): Supercharging Multimodal AutoML with Foundation Models

Authors: Zhiqiang Tang, Haoyang Fang, Su Zhou, Taojiannan Yang, Zihan Zhong, Tony Hu, Katrin Kirchhoff, George Karypis

Abstract: AutoGluon-Multimodal (AutoMM) is introduced as an open-source AutoML library designed specifically for multimodal learning. Distinguished by its exceptional ease of use, AutoMM enables fine-tuning of foundational models with just three lines of code. Supporting various modalities including image, text, and tabular data, both independently and in combination, the library offers a comprehensive suite of functionalities spanning classification, regression, object detection, semantic matching, and image segmentation. Experiments across diverse datasets and tasks showcases AutoMM's superior performance in basic classification and regression tasks compared to existing AutoML tools, while also demonstrating competitive results in advanced tasks, aligning with specialized toolboxes designed for such purposes.

cross URL: Universal Referential Knowledge Linking via Task-instructed Representation Compression

Authors: Zhuoqun Li, Hongyu Lin, Tianshu Wang, Boxi Cao, Yaojie Lu, Weixiang Zhou, Hao Wang, Zhenyu Zeng, Le Sun, Xianpei Han

Abstract: Linking a claim to grounded references is a critical ability to fulfill human demands for authentic and reliable information. Current studies are limited to specific tasks like information retrieval or semantic matching, where the claim-reference relationships are unique and fixed, while the referential knowledge linking (RKL) in real-world can be much more diverse and complex. In this paper, we propose universal referential knowledge linking (URL), which aims to resolve diversified referential knowledge linking tasks by one unified model. To this end, we propose a LLM-driven task-instructed representation compression, as well as a multi-view learning approach, in order to effectively adapt the instruction following and semantic understanding abilities of LLMs to referential knowledge linking. Furthermore, we also construct a new benchmark to evaluate ability of models on referential knowledge linking tasks across different scenarios. Experiments demonstrate that universal RKL is challenging for existing approaches, while the proposed framework can effectively resolve the task across various scenarios, and therefore outperforms previous approaches by a large margin.

cross Investigating the prompt leakage effect and black-box defenses for multi-turn LLM interactions

Authors: Divyansh Agarwal, Alexander R. Fabbri, Philippe Laban, Shafiq Joty, Caiming Xiong, Chien-Sheng Wu

Abstract: Prompt leakage in large language models (LLMs) poses a significant security and privacy threat, particularly in retrieval-augmented generation (RAG) systems. However, leakage in multi-turn LLM interactions along with mitigation strategies has not been studied in a standardized manner. This paper investigates LLM vulnerabilities against prompt leakage across 4 diverse domains and 10 closed- and open-source LLMs. Our unique multi-turn threat model leverages the LLM's sycophancy effect and our analysis dissects task instruction and knowledge leakage in the LLM response. In a multi-turn setting, our threat model elevates the average attack success rate (ASR) to 86.2%, including a 99% leakage with GPT-4 and claude-1.3. We find that some black-box LLMs like Gemini show variable susceptibility to leakage across domains - they are more likely to leak contextual knowledge in the news domain compared to the medical domain. Our experiments measure specific effects of 6 black-box defense strategies, including a query-rewriter in the RAG scenario. Our proposed multi-tier combination of defenses still has an ASR of 5.3% for black-box LLMs, indicating room for enhancement and future direction for LLM security research.

cross Translation of Multifaceted Data without Re-Training of Machine Translation Systems

Authors: Hyeonseok Moon, Seungyoon Lee, Seongtae Hong, Seungjun Lee, Chanjun Park, Heuiseok Lim

Abstract: Translating major language resources to build minor language resources becomes a widely-used approach. Particularly in translating complex data points composed of multiple components, it is common to translate each component separately. However, we argue that this practice often overlooks the interrelation between components within the same data point. To address this limitation, we propose a novel MT pipeline that considers the intra-data relation in implementing MT for training data. In our MT pipeline, all the components in a data point are concatenated to form a single translation sequence and subsequently reconstructed to the data components after translation. We introduce a Catalyst Statement (CS) to enhance the intra-data relation, and Indicator Token (IT) to assist the decomposition of a translated sequence into its respective data components. Through our approach, we have achieved a considerable improvement in translation quality itself, along with its effectiveness as training data. Compared with the conventional approach that translates each data component separately, our method yields better training data that enhances the performance of the trained model by 2.690 points for the web page ranking (WPR) task, and 0.845 for the question generation (QG) task in the XGLUE benchmark.

cross OmniSearchSage: Multi-Task Multi-Entity Embeddings for Pinterest Search

Authors: Prabhat Agarwal, Minhazul Islam Sk, Nikil Pancha, Kurchi Subhra Hazra, Jiajing Xu, Chuck Rosenberg

Abstract: In this paper, we present OmniSearchSage, a versatile and scalable system for understanding search queries, pins, and products for Pinterest search. We jointly learn a unified query embedding coupled with pin and product embeddings, leading to an improvement of $>8\%$ relevance, $>7\%$ engagement, and $>5\%$ ads CTR in Pinterest's production search system. The main contributors to these gains are improved content understanding, better multi-task learning, and real-time serving. We enrich our entity representations using diverse text derived from image captions from a generative LLM, historical engagement, and user-curated boards. Our multitask learning setup produces a single search query embedding in the same space as pin and product embeddings and compatible with pre-existing pin and product embeddings. We show the value of each feature through ablation studies, and show the effectiveness of a unified model compared to standalone counterparts. Finally, we share how these embeddings have been deployed across the Pinterest search stack, from retrieval to ranking, scaling to serve $300k$ requests per second at low latency. Our implementation of this work is available at https://github.com/pinterest/atg-research/tree/main/omnisearchsage.

URLs: https://github.com/pinterest/atg-research/tree/main/omnisearchsage.

cross An Efficient Reconstructed Differential Evolution Variant by Some of the Current State-of-the-art Strategies for Solving Single Objective Bound Constrained Problems

Authors: Sichen Tao, Ruihan Zhao, Kaiyu Wang, Shangce Gao

Abstract: Complex single-objective bounded problems are often difficult to solve. In evolutionary computation methods, since the proposal of differential evolution algorithm in 1997, it has been widely studied and developed due to its simplicity and efficiency. These developments include various adaptive strategies, operator improvements, and the introduction of other search methods. After 2014, research based on LSHADE has also been widely studied by researchers. However, although recently proposed improvement strategies have shown superiority over their previous generation's first performance, adding all new strategies may not necessarily bring the strongest performance. Therefore, we recombine some effective advances based on advanced differential evolution variants in recent years and finally determine an effective combination scheme to further promote the performance of differential evolution. In this paper, we propose a strategy recombination and reconstruction differential evolution algorithm called reconstructed differential evolution (RDE) to solve single-objective bounded optimization problems. Based on the benchmark suite of the 2024 IEEE Congress on Evolutionary Computation (CEC2024), we tested RDE and several other advanced differential evolution variants. The experimental results show that RDE has superior performance in solving complex optimization problems.

cross LLM-Based Section Identifiers Excel on Open Source but Stumble in Real World Applications

Authors: Saranya Krishnamoorthy, Ayush Singh, Shabnam Tafreshi

Abstract: Electronic health records (EHR) even though a boon for healthcare practitioners, are growing convoluted and longer every day. Sifting around these lengthy EHRs is taxing and becomes a cumbersome part of physician-patient interaction. Several approaches have been proposed to help alleviate this prevalent issue either via summarization or sectioning, however, only a few approaches have truly been helpful in the past. With the rise of automated methods, machine learning (ML) has shown promise in solving the task of identifying relevant sections in EHR. However, most ML methods rely on labeled data which is difficult to get in healthcare. Large language models (LLMs) on the other hand, have performed impressive feats in natural language processing (NLP), that too in a zero-shot manner, i.e. without any labeled data. To that end, we propose using LLMs to identify relevant section headers. We find that GPT-4 can effectively solve the task on both zero and few-shot settings as well as segment dramatically better than state-of-the-art methods. Additionally, we also annotate a much harder real world dataset and find that GPT-4 struggles to perform well, alluding to further research and harder benchmarks.

cross Research on Splicing Image Detection Algorithms Based on Natural Image Statistical Characteristics

Authors: Ao Xiang, Jingyu Zhang, Qin Yang, Liyang Wang, Yu Cheng

Abstract: With the development and widespread application of digital image processing technology, image splicing has become a common method of image manipulation, raising numerous security and legal issues. This paper introduces a new splicing image detection algorithm based on the statistical characteristics of natural images, aimed at improving the accuracy and efficiency of splicing image detection. By analyzing the limitations of traditional methods, we have developed a detection framework that integrates advanced statistical analysis techniques and machine learning methods. The algorithm has been validated using multiple public datasets, showing high accuracy in detecting spliced edges and locating tampered areas, as well as good robustness. Additionally, we explore the potential applications and challenges faced by the algorithm in real-world scenarios. This research not only provides an effective technological means for the field of image tampering detection but also offers new ideas and methods for future related research.

cross When Fuzzing Meets LLMs: Challenges and Opportunities

Authors: Yu Jiang, Jie Liang, Fuchen Ma, Yuanliang Chen, Chijin Zhou, Yuheng Shen, Zhiyong Wu, Jingzhou Fu, Mingzhe Wang, ShanShan Li, Quan Zhang

Abstract: Fuzzing, a widely-used technique for bug detection, has seen advancements through Large Language Models (LLMs). Despite their potential, LLMs face specific challenges in fuzzing. In this paper, we identified five major challenges of LLM-assisted fuzzing. To support our findings, we revisited the most recent papers from top-tier conferences, confirming that these challenges are widespread. As a remedy, we propose some actionable recommendations to help improve applying LLM in Fuzzing and conduct preliminary evaluations on DBMS fuzzing. The results demonstrate that our recommendations effectively address the identified challenges.

cross Semantic Segmentation Refiner for Ultrasound Applications with Zero-Shot Foundation Models

Authors: Hedda Cohen Indelman, Elay Dahan, Angeles M. Perez-Agosto, Carmit Shiran, Doron Shaked, Nati Daniel

Abstract: Despite the remarkable success of deep learning in medical imaging analysis, medical image segmentation remains challenging due to the scarcity of high-quality labeled images for supervision. Further, the significant domain gap between natural and medical images in general and ultrasound images in particular hinders fine-tuning models trained on natural images to the task at hand. In this work, we address the performance degradation of segmentation models in low-data regimes and propose a prompt-less segmentation method harnessing the ability of segmentation foundation models to segment abstract shapes. We do that via our novel prompt point generation algorithm which uses coarse semantic segmentation masks as input and a zero-shot prompt-able foundation model as an optimization target. We demonstrate our method on a segmentation findings task (pathologic anomalies) in ultrasound images. Our method's advantages are brought to light in varying degrees of low-data regime experiments on a small-scale musculoskeletal ultrasound images dataset, yielding a larger performance gain as the training set size decreases.

cross IMWA: Iterative Model Weight Averaging Benefits Class-Imbalanced Learning Tasks

Authors: Zitong Huang, Ze Chen, Bowen Dong, Chaoqi Liang, Erjin Zhou, Wangmeng Zuo

Abstract: Model Weight Averaging (MWA) is a technique that seeks to enhance model's performance by averaging the weights of multiple trained models. This paper first empirically finds that 1) the vanilla MWA can benefit the class-imbalanced learning, and 2) performing model averaging in the early epochs of training yields a greater performance improvement than doing that in later epochs. Inspired by these two observations, in this paper we propose a novel MWA technique for class-imbalanced learning tasks named Iterative Model Weight Averaging (IMWA). Specifically, IMWA divides the entire training stage into multiple episodes. Within each episode, multiple models are concurrently trained from the same initialized model weight, and subsequently averaged into a singular model. Then, the weight of this average model serves as a fresh initialization for the ensuing episode, thus establishing an iterative learning paradigm. Compared to vanilla MWA, IMWA achieves higher performance improvements with the same computational cost. Moreover, IMWA can further enhance the performance of those methods employing EMA strategy, demonstrating that IMWA and EMA can complement each other. Extensive experiments on various class-imbalanced learning tasks, i.e., class-imbalanced image classification, semi-supervised class-imbalanced image classification and semi-supervised object detection tasks showcase the effectiveness of our IMWA.

cross AI Coders Are Among Us: Rethinking Programming Language Grammar Towards Efficient Code Generation

Authors: Zhensu Sun, Xiaoning Du, Zhou Yang, Li Li, David Lo

Abstract: Besides humans and machines, Artificial Intelligence (AI) models have emerged to be another important audience of programming languages, as we come to the era of large language models (LLMs). LLMs can now excel at coding competitions and even program like developers to address various tasks, such as math calculation. Yet, the grammar and layout of existing programs are designed for humans. Particularly, abundant grammar tokens and formatting tokens are included to make the code more readable to humans. While beneficial, such a human-centric design imposes an unnecessary computational burden on LLMs where each token, either consumed or generated, consumes computational resources. To improve inference efficiency and reduce computational costs, we propose the concept of AI-oriented grammar, which aims to represent the code in a way that better suits the working mechanism of AI models. Code written with AI-oriented grammar discards formats and uses a minimum number of tokens to convey code semantics effectively. To demonstrate the feasibility of this concept, we explore and implement the first AI-oriented grammar for Python, named Simple Python (SimPy). SimPy is crafted by revising the original Python grammar through a series of heuristic rules. Programs written in SimPy maintain identical Abstract Syntax Tree (AST) structures to those in standard Python, allowing execution via a modified AST parser. In addition, we explore methods to enable existing LLMs to proficiently understand and use SimPy, and ensure the changes remain imperceptible for human developers. Compared with the original Python, SimPy not only reduces token usage by 13.5% and 10.4% for CodeLlama and GPT-4, but can also achieve equivalent, even improved, performance over the models trained on Python code.

cross Training-Free Unsupervised Prompt for Vision-Language Models

Authors: Sifan Long, Linbin Wang, Zhen Zhao, Zichang Tan, Yiming Wu, Shengsheng Wang, Jingdong Wang

Abstract: Prompt learning has become the most effective paradigm for adapting large pre-trained vision-language models (VLMs) to downstream tasks. Recently, unsupervised prompt tuning methods, such as UPL and POUF, directly leverage pseudo-labels as supervisory information to fine-tune additional adaptation modules on unlabeled data. However, inaccurate pseudo labels easily misguide the tuning process and result in poor representation capabilities. In light of this, we propose Training-Free Unsupervised Prompts (TFUP), which maximally preserves the inherent representation capabilities and enhances them with a residual connection to similarity-based prediction probabilities in a training-free and labeling-free manner. Specifically, we integrate both instance confidence and prototype scores to select representative samples, which are used to customize a reliable Feature Cache Model (FCM) for training-free inference. Then, we design a Multi-level Similarity Measure (MSM) that considers both feature-level and semantic-level similarities to calculate the distance between each test image and the cached sample as the weight of the corresponding cached label to generate similarity-based prediction probabilities. In this way, TFUP achieves surprising performance, even surpassing the training-base method on multiple classification datasets. Based on our TFUP, we propose a training-based approach (TFUP-T) to further boost the adaptation performance. In addition to the standard cross-entropy loss, TFUP-T adopts an additional marginal distribution entropy loss to constrain the model from a global perspective. Our TFUP-T achieves new state-of-the-art classification performance compared to unsupervised and few-shot adaptation approaches on multiple benchmarks. In particular, TFUP-T improves the classification accuracy of POUF by 3.3% on the most challenging Domain-Net dataset.

cross Light-weight Retinal Layer Segmentation with Global Reasoning

Authors: Xiang He, Weiye Song, Yiming Wang, Fabio Poiesi, Ji Yi, Manishi Desai, Quanqing Xu, Kongzheng Yang, Yi Wan

Abstract: Automatic retinal layer segmentation with medical images, such as optical coherence tomography (OCT) images, serves as an important tool for diagnosing ophthalmic diseases. However, it is challenging to achieve accurate segmentation due to low contrast and blood flow noises presented in the images. In addition, the algorithm should be light-weight to be deployed for practical clinical applications. Therefore, it is desired to design a light-weight network with high performance for retinal layer segmentation. In this paper, we propose LightReSeg for retinal layer segmentation which can be applied to OCT images. Specifically, our approach follows an encoder-decoder structure, where the encoder part employs multi-scale feature extraction and a Transformer block for fully exploiting the semantic information of feature maps at all scales and making the features have better global reasoning capabilities, while the decoder part, we design a multi-scale asymmetric attention (MAA) module for preserving the semantic information at each encoder scale. The experiments show that our approach achieves a better segmentation performance compared to the current state-of-the-art method TransUnet with 105.7M parameters on both our collected dataset and two other public datasets, with only 3.3M parameters.

cross Integration of Mixture of Experts and Multimodal Generative AI in Internet of Vehicles: A Survey

Authors: Minrui Xu (Sherman), Dusit Niyato (Sherman), Jiawen Kang (Sherman), Zehui Xiong (Sherman), Abbas Jamalipour (Sherman), Yuguang Fang (Sherman), Dong In Kim (Sherman), Xuemin (Sherman), Shen

Abstract: Generative AI (GAI) can enhance the cognitive, reasoning, and planning capabilities of intelligent modules in the Internet of Vehicles (IoV) by synthesizing augmented datasets, completing sensor data, and making sequential decisions. In addition, the mixture of experts (MoE) can enable the distributed and collaborative execution of AI models without performance degradation between connected vehicles. In this survey, we explore the integration of MoE and GAI to enable Artificial General Intelligence in IoV, which can enable the realization of full autonomy for IoV with minimal human supervision and applicability in a wide range of mobility scenarios, including environment monitoring, traffic management, and autonomous driving. In particular, we present the fundamentals of GAI, MoE, and their interplay applications in IoV. Furthermore, we discuss the potential integration of MoE and GAI in IoV, including distributed perception and monitoring, collaborative decision-making and planning, and generative modeling and simulation. Finally, we present several potential research directions for facilitating the integration.

cross VISLA Benchmark: Evaluating Embedding Sensitivity to Semantic and Lexical Alterations

Authors: Sri Harsha Dumpala, Aman Jaiswal, Chandramouli Sastry, Evangelos Milios, Sageev Oore, Hassan Sajjad

Abstract: Despite their remarkable successes, state-of-the-art language models face challenges in grasping certain important semantic details. This paper introduces the VISLA (Variance and Invariance to Semantic and Lexical Alterations) benchmark, designed to evaluate the semantic and lexical understanding of language models. VISLA presents a 3-way semantic (in)equivalence task with a triplet of sentences associated with an image, to evaluate both vision-language models (VLMs) and unimodal language models (ULMs). An evaluation involving 34 VLMs and 20 ULMs reveals surprising difficulties in distinguishing between lexical and semantic variations. Spatial semantics encoded by language models also appear to be highly sensitive to lexical information. Notably, text encoders of VLMs demonstrate greater sensitivity to semantic and lexical variations than unimodal text encoders. Our contributions include the unification of image-to-text and text-to-text retrieval tasks, an off-the-shelf evaluation without fine-tuning, and assessing LMs' semantic (in)variance in the presence of lexical alterations. The results highlight strengths and weaknesses across diverse vision and unimodal language models, contributing to a deeper understanding of their capabilities. % VISLA enables a rigorous evaluation, shedding light on language models' capabilities in handling semantic and lexical nuances. Data and code will be made available at https://github.com/Sri-Harsha/visla_benchmark.

URLs: https://github.com/Sri-Harsha/visla_benchmark.

cross Guarding Graph Neural Networks for Unsupervised Graph Anomaly Detection

Authors: Yuanchen Bei, Sheng Zhou, Jinke Shi, Yao Ma, Haishuai Wang, Jiajun Bu

Abstract: Unsupervised graph anomaly detection aims at identifying rare patterns that deviate from the majority in a graph without the aid of labels, which is important for a variety of real-world applications. Recent advances have utilized Graph Neural Networks (GNNs) to learn effective node representations by aggregating information from neighborhoods. This is motivated by the hypothesis that nodes in the graph tend to exhibit consistent behaviors with their neighborhoods. However, such consistency can be disrupted by graph anomalies in multiple ways. Most existing methods directly employ GNNs to learn representations, disregarding the negative impact of graph anomalies on GNNs, resulting in sub-optimal node representations and anomaly detection performance. While a few recent approaches have redesigned GNNs for graph anomaly detection under semi-supervised label guidance, how to address the adverse effects of graph anomalies on GNNs in unsupervised scenarios and learn effective representations for anomaly detection are still under-explored. To bridge this gap, in this paper, we propose a simple yet effective framework for Guarding Graph Neural Networks for Unsupervised Graph Anomaly Detection (G3AD). Specifically, G3AD introduces two auxiliary networks along with correlation constraints to guard the GNNs from inconsistent information encoding. Furthermore, G3AD introduces an adaptive caching module to guard the GNNs from solely reconstructing the observed data that contains anomalies. Extensive experiments demonstrate that our proposed G3AD can outperform seventeen state-of-the-art methods on both synthetic and real-world datasets.

cross List Items One by One: A New Data Source and Learning Paradigm for Multimodal LLMs

Authors: An Yan, Zhengyuan Yang, Junda Wu, Wanrong Zhu, Jianwei Yang, Linjie Li, Kevin Lin, Jianfeng Wang, Julian McAuley, Jianfeng Gao, Lijuan Wang

Abstract: Set-of-Mark (SoM) Prompting unleashes the visual grounding capability of GPT-4V, by enabling the model to associate visual objects with tags inserted on the image. These tags, marked with alphanumerics, can be indexed via text tokens for easy reference. Despite the extraordinary performance from GPT-4V, we observe that other Multimodal Large Language Models (MLLMs) struggle to understand these visual tags. To promote the learning of SoM prompting for open-source models, we propose a new learning paradigm: "list items one by one," which asks the model to enumerate and describe all visual tags placed on the image following the alphanumeric orders of tags. By integrating our curated dataset with other visual instruction tuning datasets, we are able to equip existing MLLMs with the SoM prompting ability. Furthermore, we evaluate our finetuned SoM models on five MLLM benchmarks. We find that this new dataset, even in a relatively small size (10k-30k images with tags), significantly enhances visual reasoning capabilities and reduces hallucinations for MLLMs. Perhaps surprisingly, these improvements persist even when the visual tags are omitted from input images during inference. This suggests the potential of "list items one by one" as a new paradigm for training MLLMs, which strengthens the object-text alignment through the use of visual tags in the training stage. Finally, we conduct analyses by probing trained models to understand the working mechanism of SoM. Our code and data are available at \url{https://github.com/zzxslp/SoM-LLaVA}.

URLs: https://github.com/zzxslp/SoM-LLaVA

cross SwarmRL: Building the Future of Smart Active Systems

Authors: Samuel Tovey, Christoph Lohrmann, Tobias Merkt, David Zimmer, Konstantin Nikolaou, Simon Koppenh\"ofer, Anna Bushmakina, Jonas Scheunemann, Christian Holm

Abstract: This work introduces SwarmRL, a Python package designed to study intelligent active particles. SwarmRL provides an easy-to-use interface for developing models to control microscopic colloids using classical control and deep reinforcement learning approaches. These models may be deployed in simulations or real-world environments under a common framework. We explain the structure of the software and its key features and demonstrate how it can be used to accelerate research. With SwarmRL, we aim to streamline research into micro-robotic control while bridging the gap between experimental and simulation-driven sciences. SwarmRL is available open-source on GitHub at https://github.com/SwarmRL/SwarmRL.

URLs: https://github.com/SwarmRL/SwarmRL.

cross Fuzzy Inference System for Test Case Prioritization in Software Testing

Authors: Aron Karatayev, Anna Ogorodova, Pakizar Shamoi

Abstract: In the realm of software development, testing is crucial for ensuring software quality and adherence to requirements. However, it can be time-consuming and resource-intensive, especially when dealing with large and complex software systems. Test case prioritization (TCP) is a vital strategy to enhance testing efficiency by identifying the most critical test cases for early execution. This paper introduces a novel fuzzy logic-based approach to automate TCP, using fuzzy linguistic variables and expert-derived fuzzy rules to establish a link between test case characteristics and their prioritization. Our methodology utilizes two fuzzy variables - failure rate and execution time - alongside two crisp parameters: Prerequisite Test Case and Recently Updated Flag. Our findings demonstrate the proposed system capacity to rank test cases effectively through experimental validation on a real-world software system. The results affirm the practical applicability of our approach in optimizing the TCP and reducing the resource intensity of software testing.

cross Offline Reinforcement Learning with Behavioral Supervisor Tuning

Authors: Padmanaba Srinivasan, William Knottenbelt

Abstract: Offline reinforcement learning (RL) algorithms are applied to learn performant, well-generalizing policies when provided with a static dataset of interactions. Many recent approaches to offline RL have seen substantial success, but with one key caveat: they demand substantial per-dataset hyperparameter tuning to achieve reported performance, which requires policy rollouts in the environment to evaluate; this can rapidly become cumbersome. Furthermore, substantial tuning requirements can hamper the adoption of these algorithms in practical domains. In this paper, we present TD3 with Behavioral Supervisor Tuning (TD3-BST), an algorithm that trains an uncertainty model and uses it to guide the policy to select actions within the dataset support. TD3-BST can learn more effective policies from offline datasets compared to previous methods and achieves the best performance across challenging benchmarks without requiring per-dataset tuning.

cross Constructing Optimal Noise Channels for Enhanced Robustness in Quantum Machine Learning

Authors: David Winderl, Nicola Franco, Jeanette Miriam Lorenz

Abstract: With the rapid advancement of Quantum Machine Learning (QML), the critical need to enhance security measures against adversarial attacks and protect QML models becomes increasingly evident. In this work, we outline the connection between quantum noise channels and differential privacy (DP), by constructing a family of noise channels which are inherently $\epsilon$-DP: $(\alpha, \gamma)$-channels. Through this approach, we successfully replicate the $\epsilon$-DP bounds observed for depolarizing and random rotation channels, thereby affirming the broad generality of our framework. Additionally, we use a semi-definite program to construct an optimally robust channel. In a small-scale experimental evaluation, we demonstrate the benefits of using our optimal noise channel over depolarizing noise, particularly in enhancing adversarial accuracy. Moreover, we assess how the variables $\alpha$ and $\gamma$ affect the certifiable robustness and investigate how different encoding methods impact the classifier's robustness.

cross Leveraging tropical reef, bird and unrelated sounds for superior transfer learning in marine bioacoustics

Authors: Ben Williams, Bart van Merri\"enboer, Vincent Dumoulin, Jenny Hamer, Eleni Triantafillou, Abram B. Fleishman, Matthew McKown, Jill E. Munger, Aaron N. Rice, Ashlee Lillis, Clemency E. White, Catherine A. D. Hobbs, Tries B. Razak, Kate E. Jones, Tom Denton

Abstract: Machine learning has the potential to revolutionize passive acoustic monitoring (PAM) for ecological assessments. However, high annotation and compute costs limit the field's efficacy. Generalizable pretrained networks can overcome these costs, but high-quality pretraining requires vast annotated libraries, limiting its current applicability primarily to bird taxa. Here, we identify the optimum pretraining strategy for a data-deficient domain using coral reef bioacoustics. We assemble ReefSet, a large annotated library of reef sounds, though modest compared to bird libraries at 2% of the sample count. Through testing few-shot transfer learning performance, we observe that pretraining on bird audio provides notably superior generalizability compared to pretraining on ReefSet or unrelated audio alone. However, our key findings show that cross-domain mixing which leverages bird, reef and unrelated audio during pretraining maximizes reef generalizability. SurfPerch, our pretrained network, provides a strong foundation for automated analysis of marine PAM data with minimal annotation and compute costs.

cross Contextual Categorization Enhancement through LLMs Latent-Space

Authors: Zineddine Bettouche, Anas Safi, Andreas Fischer

Abstract: Managing the semantic quality of the categorization in large textual datasets, such as Wikipedia, presents significant challenges in terms of complexity and cost. In this paper, we propose leveraging transformer models to distill semantic information from texts in the Wikipedia dataset and its associated categories into a latent space. We then explore different approaches based on these encodings to assess and enhance the semantic identity of the categories. Our graphical approach is powered by Convex Hull, while we utilize Hierarchical Navigable Small Worlds (HNSWs) for the hierarchical approach. As a solution to the information loss caused by the dimensionality reduction, we modulate the following mathematical solution: an exponential decay function driven by the Euclidean distances between the high-dimensional encodings of the textual categories. This function represents a filter built around a contextual category and retrieves items with a certain Reconsideration Probability (RP). Retrieving high-RP items serves as a tool for database administrators to improve data groupings by providing recommendations and identifying outliers within a contextual framework.

cross Latent Modulated Function for Computational Optimal Continuous Image Representation

Authors: Zongyao He, Zhi Jin

Abstract: The recent work Local Implicit Image Function (LIIF) and subsequent Implicit Neural Representation (INR) based works have achieved remarkable success in Arbitrary-Scale Super-Resolution (ASSR) by using MLP to decode Low-Resolution (LR) features. However, these continuous image representations typically implement decoding in High-Resolution (HR) High-Dimensional (HD) space, leading to a quadratic increase in computational cost and seriously hindering the practical applications of ASSR. To tackle this problem, we propose a novel Latent Modulated Function (LMF), which decouples the HR-HD decoding process into shared latent decoding in LR-HD space and independent rendering in HR Low-Dimensional (LD) space, thereby realizing the first computational optimal paradigm of continuous image representation. Specifically, LMF utilizes an HD MLP in latent space to generate latent modulations of each LR feature vector. This enables a modulated LD MLP in render space to quickly adapt to any input feature vector and perform rendering at arbitrary resolution. Furthermore, we leverage the positive correlation between modulation intensity and input image complexity to design a Controllable Multi-Scale Rendering (CMSR) algorithm, offering the flexibility to adjust the decoding efficiency based on the rendering precision. Extensive experiments demonstrate that converting existing INR-based ASSR methods to LMF can reduce the computational cost by up to 99.9%, accelerate inference by up to 57 times, and save up to 76% of parameters, while maintaining competitive performance. The code is available at https://github.com/HeZongyao/LMF.

URLs: https://github.com/HeZongyao/LMF.

cross Canonical Decision Diagrams Modulo Theories

Authors: Massimo Michelutti, Gabriele Masina, Giuseppe Spallitta, Roberto Sebastiani

Abstract: Decision diagrams (DDs) are powerful tools to represent effectively propositional formulas, which are largely used in many domains, in particular in formal verification and in knowledge compilation. Some forms of DDs (e.g., OBDDs, SDDs) are canonical, that is, (under given conditions on the atom list) they univocally represent equivalence classes of formulas. Given the limited expressiveness of propositional logic, a few attempts to leverage DDs to SMT level have been presented in the literature. Unfortunately, these techniques still suffer from some limitations: most procedures are theory-specific; some produce theory DDs (T-DDs) which do not univocally represent T-valid formulas or T-inconsistent formulas; none of these techniques provably produces theory-canonical T-DDs, which (under given conditions on the T-atom list) univocally represent T-equivalence classes of formulas. Also, these procedures are not easy to implement, and very few implementations are actually available. In this paper, we present a novel very-general technique to leverage DDs to SMT level, which has several advantages: it is very easy to implement on top of an AllSMT solver and a DD package, which are used as blackboxes; it works for every form of DDs and every theory, or combination thereof, supported by the AllSMT solver; it produces theory-canonical T-DDs if the propositional DD is canonical. We have implemented a prototype tool for both T-OBDDs and T-SDDs on top of OBDD and SDD packages and the MathSAT SMT solver. Some preliminary empirical evaluation supports the effectiveness of the approach.

cross Towards Precise Observations of Neural Model Robustness in Classification

Authors: Wenchuan Mu, Kwan Hui Lim

Abstract: In deep learning applications, robustness measures the ability of neural models that handle slight changes in input data, which could lead to potential safety hazards, especially in safety-critical applications. Pre-deployment assessment of model robustness is essential, but existing methods often suffer from either high costs or imprecise results. To enhance safety in real-world scenarios, metrics that effectively capture the model's robustness are needed. To address this issue, we compare the rigour and usage conditions of various assessment methods based on different definitions. Then, we propose a straightforward and practical metric utilizing hypothesis testing for probabilistic robustness and have integrated it into the TorchAttacks library. Through a comparative analysis of diverse robustness assessment methods, our approach contributes to a deeper understanding of model robustness in safety-critical applications.

cross A Dual Perspective of Reinforcement Learning for Imposing Policy Constraints

Authors: Bram De Cooman, Johan Suykens

Abstract: Model-free reinforcement learning methods lack an inherent mechanism to impose behavioural constraints on the trained policies. While certain extensions exist, they remain limited to specific types of constraints, such as value constraints with additional reward signals or visitation density constraints. In this work we try to unify these existing techniques and bridge the gap with classical optimization and control theory, using a generic primal-dual framework for value-based and actor-critic reinforcement learning methods. The obtained dual formulations turn out to be especially useful for imposing additional constraints on the learned policy, as an intrinsic relationship between such dual constraints (or regularization terms) and reward modifications in the primal is reveiled. Furthermore, using this framework, we are able to introduce some novel types of constraints, allowing to impose bounds on the policy's action density or on costs associated with transitions between consecutive states and actions. From the adjusted primal-dual optimization problems, a practical algorithm is derived that supports various combinations of policy constraints that are automatically handled throughout training using trainable reward modifications. The resulting $\texttt{DualCRL}$ method is examined in more detail and evaluated under different (combinations of) constraints on two interpretable environments. The results highlight the efficacy of the method, which ultimately provides the designer of such systems with a versatile toolbox of possible policy constraints.

cross DiffSeg: A Segmentation Model for Skin Lesions Based on Diffusion Difference

Authors: Zhihao Shuai, Yinan Chen, Shunqiang Mao, Yihan Zho, Xiaohong Zhang

Abstract: Weakly supervised medical image segmentation (MIS) using generative models is crucial for clinical diagnosis. However, the accuracy of the segmentation results is often limited by insufficient supervision and the complex nature of medical imaging. Existing models also only provide a single outcome, which does not allow for the measurement of uncertainty. In this paper, we introduce DiffSeg, a segmentation model for skin lesions based on diffusion difference which exploits diffusion model principles to ex-tract noise-based features from images with diverse semantic information. By discerning difference between these noise features, the model identifies diseased areas. Moreover, its multi-output capability mimics doctors' annotation behavior, facilitating the visualization of segmentation result consistency and ambiguity. Additionally, it quantifies output uncertainty using Generalized Energy Distance (GED), aiding interpretability and decision-making for physicians. Finally, the model integrates outputs through the Dense Conditional Random Field (DenseCRF) algorithm to refine the segmentation boundaries by considering inter-pixel correlations, which improves the accuracy and optimizes the segmentation results. We demonstrate the effectiveness of DiffSeg on the ISIC 2018 Challenge dataset, outperforming state-of-the-art U-Net-based methods.

cross Evaluating Consistency and Reasoning Capabilities of Large Language Models

Authors: Yash Saxena, Sarthak Chopra, Arunendra Mani Tripathi

Abstract: Large Language Models (LLMs) are extensively used today across various sectors, including academia, research, business, and finance, for tasks such as text generation, summarization, and translation. Despite their widespread adoption, these models often produce incorrect and misleading information, exhibiting a tendency to hallucinate. This behavior can be attributed to several factors, with consistency and reasoning capabilities being significant contributors. LLMs frequently lack the ability to generate explanations and engage in coherent reasoning, leading to inaccurate responses. Moreover, they exhibit inconsistencies in their outputs. This paper aims to evaluate and compare the consistency and reasoning capabilities of both public and proprietary LLMs. The experiments utilize the Boolq dataset as the ground truth, comprising questions, answers, and corresponding explanations. Queries from the dataset are presented as prompts to the LLMs, and the generated responses are evaluated against the ground truth answers. Additionally, explanations are generated to assess the models' reasoning abilities. Consistency is evaluated by repeatedly presenting the same query to the models and observing for variations in their responses. For measuring reasoning capabilities, the generated explanations are compared to the ground truth explanations using metrics such as BERT, BLEU, and F-1 scores. The findings reveal that proprietary models generally outperform public models in terms of both consistency and reasoning capabilities. However, even when presented with basic general knowledge questions, none of the models achieved a score of 90\% in both consistency and reasoning. This study underscores the direct correlation between consistency and reasoning abilities in LLMs and highlights the inherent reasoning challenges present in current language models.

cross Exploring the Dynamics of Data Transmission in 5G Networks: A Conceptual Analysis

Authors: Nikita Smirnov, Sven Tomforde

Abstract: This conceptual analysis examines the dynamics of data transmission in 5G networks. It addresses various aspects of sending data from cameras and LiDARs installed on a remote-controlled ferry to a land-based control center. The range of topics includes all stages of video and LiDAR data processing from acquisition and encoding to final decoding, all aspects of their transmission and reception via the WebRTC protocol, and all possible types of network problems such as handovers or congestion that could affect the quality of experience for end-users. A series of experiments were conducted to evaluate the key aspects of the data transmission. These include simulation-based reproducible runs and real-world experiments conducted using open-source solutions we developed: "Gymir5G" - an OMNeT++-based 5G simulation and "GstWebRTCApp" - a GStreamer-based application for adaptive control of media streams over the WebRTC protocol. One of the goals of this study is to formulate the bandwidth and latency requirements for reliable real-time communication and to estimate their approximate values. This goal was achieved through simulation-based experiments involving docking maneuvers in the Bay of Kiel, Germany. The final latency for the entire data processing pipeline was also estimated during the real tests. In addition, a series of simulation-based experiments showed the impact of key WebRTC features and demonstrated the effectiveness of the WebRTC protocol, while the conducted video codec comparison showed that the hardware-accelerated H.264 codec is the best. Finally, the research addresses the topic of adaptive communication, where the traditional congestion avoidance and deep reinforcement learning approaches were analyzed. The comparison in a sandbox scenario shows that the AI-based solution outperforms the WebRTC baseline GCC algorithm in terms of data rates, latency, and packet loss.

cross Global Concept Explanations for Graphs by Contrastive Learning

Authors: Jonas Teufel, Pascal Friederich

Abstract: Beyond improving trust and validating model fairness, xAI practices also have the potential to recover valuable scientific insights in application domains where little to no prior human intuition exists. To that end, we propose a method to extract global concept explanations from the predictions of graph neural networks to develop a deeper understanding of the tasks underlying structure-property relationships. We identify concept explanations as dense clusters in the self-explaining Megan models subgraph latent space. For each concept, we optimize a representative prototype graph and optionally use GPT-4 to provide hypotheses about why each structure has a certain effect on the prediction. We conduct computational experiments on synthetic and real-world graph property prediction tasks. For the synthetic tasks we find that our method correctly reproduces the structural rules by which they were created. For real-world molecular property regression and classification tasks, we find that our method rediscovers established rules of thumb. More specifically, our results for molecular mutagenicity prediction indicate more fine-grained resolution of structural details than existing explainability methods, consistent with previous results from chemistry literature. Overall, our results show promising capability to extract the underlying structure-property relationships for complex graph property prediction tasks.

cross Developing Acoustic Models for Automatic Speech Recognition in Swedish

Authors: Giampiero Salvi

Abstract: This paper is concerned with automatic continuous speech recognition using trainable systems. The aim of this work is to build acoustic models for spoken Swedish. This is done employing hidden Markov models and using the SpeechDat database to train their parameters. Acoustic modeling has been worked out at a phonetic level, allowing general speech recognition applications, even though a simplified task (digits and natural number recognition) has been considered for model evaluation. Different kinds of phone models have been tested, including context independent models and two variations of context dependent models. Furthermore many experiments have been done with bigram language models to tune some of the system parameters. System performance over various speaker subsets with different sex, age and dialect has also been examined. Results are compared to previous similar studies showing a remarkable improvement.

cross Energy-Latency Manipulation of Multi-modal Large Language Models via Verbose Samples

Authors: Kuofeng Gao, Jindong Gu, Yang Bai, Shu-Tao Xia, Philip Torr, Wei Liu, Zhifeng Li

Abstract: Despite the exceptional performance of multi-modal large language models (MLLMs), their deployment requires substantial computational resources. Once malicious users induce high energy consumption and latency time (energy-latency cost), it will exhaust computational resources and harm availability of service. In this paper, we investigate this vulnerability for MLLMs, particularly image-based and video-based ones, and aim to induce high energy-latency cost during inference by crafting an imperceptible perturbation. We find that high energy-latency cost can be manipulated by maximizing the length of generated sequences, which motivates us to propose verbose samples, including verbose images and videos. Concretely, two modality non-specific losses are proposed, including a loss to delay end-of-sequence (EOS) token and an uncertainty loss to increase the uncertainty over each generated token. In addition, improving diversity is important to encourage longer responses by increasing the complexity, which inspires the following modality specific loss. For verbose images, a token diversity loss is proposed to promote diverse hidden states. For verbose videos, a frame feature diversity loss is proposed to increase the feature diversity among frames. To balance these losses, we propose a temporal weight adjustment algorithm. Experiments demonstrate that our verbose samples can largely extend the length of generated sequences.

cross DeepKalPose: An Enhanced Deep-Learning Kalman Filter for Temporally Consistent Monocular Vehicle Pose Estimation

Authors: Leandro Di Bella, Yangxintong Lyu, Adrian Munteanu

Abstract: This paper presents DeepKalPose, a novel approach for enhancing temporal consistency in monocular vehicle pose estimation applied on video through a deep-learning-based Kalman Filter. By integrating a Bi-directional Kalman filter strategy utilizing forward and backward time-series processing, combined with a learnable motion model to represent complex motion patterns, our method significantly improves pose accuracy and robustness across various conditions, particularly for occluded or distant vehicles. Experimental validation on the KITTI dataset confirms that DeepKalPose outperforms existing methods in both pose accuracy and temporal consistency.

cross Understanding Privacy Risks of Embeddings Induced by Large Language Models

Authors: Zhihao Zhu, Ninglu Shao, Defu Lian, Chenwang Wu, Zheng Liu, Yi Yang, Enhong Chen

Abstract: Large language models (LLMs) show early signs of artificial general intelligence but struggle with hallucinations. One promising solution to mitigate these hallucinations is to store external knowledge as embeddings, aiding LLMs in retrieval-augmented generation. However, such a solution risks compromising privacy, as recent studies experimentally showed that the original text can be partially reconstructed from text embeddings by pre-trained language models. The significant advantage of LLMs over traditional pre-trained models may exacerbate these concerns. To this end, we investigate the effectiveness of reconstructing original knowledge and predicting entity attributes from these embeddings when LLMs are employed. Empirical findings indicate that LLMs significantly improve the accuracy of two evaluated tasks over those from pre-trained models, regardless of whether the texts are in-distribution or out-of-distribution. This underscores a heightened potential for LLMs to jeopardize user privacy, highlighting the negative consequences of their widespread use. We further discuss preliminary strategies to mitigate this risk.

cross SFMViT: SlowFast Meet ViT in Chaotic World

Authors: Jiaying Lin, Jiajun Wen, Mengyuan Liu, Jinfu Liu, Baiqiao Yin, Yue Li

Abstract: The task of spatiotemporal action localization in chaotic scenes is a challenging task toward advanced video understanding. Paving the way with high-quality video feature extraction and enhancing the precision of detector-predicted anchors can effectively improve model performance. To this end, we propose a high-performance dual-stream spatiotemporal feature extraction network SFMViT with an anchor pruning strategy. The backbone of our SFMViT is composed of ViT and SlowFast with prior knowledge of spatiotemporal action localization, which fully utilizes ViT's excellent global feature extraction capabilities and SlowFast's spatiotemporal sequence modeling capabilities. Secondly, we introduce the confidence maximum heap to prune the anchors detected in each frame of the picture to filter out the effective anchors. These designs enable our SFMViT to achieve a mAP of 26.62% in the Chaotic World dataset, far exceeding existing models. Code is available at https://github.com/jfightyr/SlowFast-Meet-ViT.

URLs: https://github.com/jfightyr/SlowFast-Meet-ViT.

cross Hippocrates: An Open-Source Framework for Advancing Large Language Models in Healthcare

Authors: Emre Can Acikgoz, Osman Batur \.Ince, Rayene Bench, Arda An{\i}l Boz, \.Ilker Kesen, Aykut Erdem, Erkut Erdem

Abstract: The integration of Large Language Models (LLMs) into healthcare promises to transform medical diagnostics, research, and patient care. Yet, the progression of medical LLMs faces obstacles such as complex training requirements, rigorous evaluation demands, and the dominance of proprietary models that restrict academic exploration. Transparent, comprehensive access to LLM resources is essential for advancing the field, fostering reproducibility, and encouraging innovation in healthcare AI. We present Hippocrates, an open-source LLM framework specifically developed for the medical domain. In stark contrast to previous efforts, it offers unrestricted access to its training datasets, codebase, checkpoints, and evaluation protocols. This open approach is designed to stimulate collaborative research, allowing the community to build upon, refine, and rigorously evaluate medical LLMs within a transparent ecosystem. Also, we introduce Hippo, a family of 7B models tailored for the medical domain, fine-tuned from Mistral and LLaMA2 through continual pre-training, instruction tuning, and reinforcement learning from human and AI feedback. Our models outperform existing open medical LLMs models by a large-margin, even surpassing models with 70B parameters. Through Hippocrates, we aspire to unlock the full potential of LLMs not just to advance medical knowledge and patient care but also to democratize the benefits of AI research in healthcare, making them available across the globe.

cross Legal Aspects for Software Developers Interested in Generative AI Applications

Authors: Steffen Herbold, Brian Valerius, Anamaria Mojica-Hanke, Isabella Lex, Joel Mittel

Abstract: Recent successes in Generative Artificial Intelligence (GenAI) have led to new technologies capable of generating high-quality code, natural language, and images. The next step is to integrate GenAI technology into products, a task typically conducted by software developers. Such product development always comes with a certain risk of liability. Within this article, we want to shed light on the current state of two such risks: data protection and copyright. Both aspects are crucial for GenAI. This technology deals with data for both model training and generated output. We summarize key aspects regarding our current knowledge that every software developer involved in product development using GenAI should be aware of to avoid critical mistakes that may expose them to liability claims.

cross Tele-FLM Technical Report

Authors: Xiang Li, Yiqun Yao, Xin Jiang, Xuezhi Fang, Chao Wang, Xinzhang Liu, Zihan Wang, Yu Zhao, Xin Wang, Yuyao Huang, Shuangyong Song, Yongxiang Li, Zheng Zhang, Bo Zhao, Aixin Sun, Yequan Wang, Zhongjiang He, Zhongyuan Wang, Xuelong Li, Tiejun Huang

Abstract: Large language models (LLMs) have showcased profound capabilities in language understanding and generation, facilitating a wide array of applications. However, there is a notable paucity of detailed, open-sourced methodologies on efficiently scaling LLMs beyond 50 billion parameters with minimum trial-and-error cost and computational resources. In this report, we introduce Tele-FLM (aka FLM-2), a 52B open-sourced multilingual large language model that features a stable, efficient pre-training paradigm and enhanced factual judgment capabilities. Tele-FLM demonstrates superior multilingual language modeling abilities, measured by BPB on textual corpus. Besides, in both English and Chinese foundation model evaluation, it is comparable to strong open-sourced models that involve larger pre-training FLOPs, such as Llama2-70B and DeepSeek-67B. In addition to the model weights, we share the core designs, engineering practices, and training details, which we expect to benefit both the academic and industrial communities.

cross An\'alise de ambiguidade lingu\'istica em modelos de linguagem de grande escala (LLMs)

Authors: Lav\'inia de Carvalho Moraes, Irene Cristina Silv\'erio, Rafael Alexandre Sousa Marques, Bianca de Castro Anaia, Dandara Freitas de Paula, Maria Carolina Schincariol de Faria, Iury Cleveston, Alana de Santana Correia, Raquel Meister Ko Freitag

Abstract: Linguistic ambiguity continues to represent a significant challenge for natural language processing (NLP) systems, notwithstanding the advancements in architectures such as Transformers and BERT. Inspired by the recent success of instructional models like ChatGPT and Gemini (In 2023, the artificial intelligence was called Bard.), this study aims to analyze and discuss linguistic ambiguity within these models, focusing on three types prevalent in Brazilian Portuguese: semantic, syntactic, and lexical ambiguity. We create a corpus comprising 120 sentences, both ambiguous and unambiguous, for classification, explanation, and disambiguation. The models capability to generate ambiguous sentences was also explored by soliciting sets of sentences for each type of ambiguity. The results underwent qualitative analysis, drawing on recognized linguistic references, and quantitative assessment based on the accuracy of the responses obtained. It was evidenced that even the most sophisticated models, such as ChatGPT and Gemini, exhibit errors and deficiencies in their responses, with explanations often providing inconsistent. Furthermore, the accuracy peaked at 49.58 percent, indicating the need for descriptive studies for supervised learning.

cross A Self-Organizing Clustering System for Unsupervised Distribution Shift Detection

Authors: Sebasti\'an Basterrech, Line Clemmensen, Gerardo Rubino

Abstract: Modeling non-stationary data is a challenging problem in the field of continual learning, and data distribution shifts may result in negative consequences on the performance of a machine learning model. Classic learning tools are often vulnerable to perturbations of the input covariates, and are sensitive to outliers and noise, and some tools are based on rigid algebraic assumptions. Distribution shifts are frequently occurring due to changes in raw materials for production, seasonality, a different user base, or even adversarial attacks. Therefore, there is a need for more effective distribution shift detection techniques. In this work, we propose a continual learning framework for monitoring and detecting distribution changes. We explore the problem in a latent space generated by a bio-inspired self-organizing clustering and statistical aspects of the latent space. In particular, we investigate the projections made by two topology-preserving maps: the Self-Organizing Map and the Scale Invariant Map. Our method can be applied in both a supervised and an unsupervised context. We construct the assessment of changes in the data distribution as a comparison of Gaussian signals, making the proposed method fast and robust. We compare it to other unsupervised techniques, specifically Principal Component Analysis (PCA) and Kernel-PCA. Our comparison involves conducting experiments using sequences of images (based on MNIST and injected shifts with adversarial samples), chemical sensor measurements, and the environmental variable related to ozone levels. The empirical study reveals the potential of the proposed approach.

cross ProbGate at EHRSQL 2024: Enhancing SQL Query Generation Accuracy through Probabilistic Threshold Filtering and Error Handling

Authors: Sangryul Kim, Donghee Han, Sehyun Kim

Abstract: Recently, deep learning-based language models have significantly enhanced text-to-SQL tasks, with promising applications in retrieving patient records within the medical domain. One notable challenge in such applications is discerning unanswerable queries. Through fine-tuning model, we demonstrate the feasibility of converting medical record inquiries into SQL queries. Additionally, we introduce an entropy-based method to identify and filter out unanswerable results. We further enhance result quality by filtering low-confidence SQL through log probability-based distribution, while grammatical and schema errors are mitigated by executing queries on the actual database. We experimentally verified that our method can filter unanswerable questions, which can be widely utilized even when the parameters of the model are not accessible, and that it can be effectively utilized in practice.

cross Benchmarking Mobile Device Control Agents across Diverse Configurations

Authors: Juyong Lee, Taywon Min, Minyong An, Changyeon Kim, Kimin Lee

Abstract: Developing autonomous agents for mobile devices can significantly enhance user interactions by offering increased efficiency and accessibility. However, despite the growing interest in mobile device control agents, the absence of a commonly adopted benchmark makes it challenging to quantify scientific progress in this area. In this work, we introduce B-MoCA: a novel benchmark designed specifically for evaluating mobile device control agents. To create a realistic benchmark, we develop B-MoCA based on the Android operating system and define 60 common daily tasks. Importantly, we incorporate a randomization feature that changes various aspects of mobile devices, including user interface layouts and language settings, to assess generalization performance. We benchmark diverse agents, including agents employing large language models (LLMs) or multi-modal LLMs as well as agents trained from scratch using human expert demonstrations. While these agents demonstrate proficiency in executing straightforward tasks, their poor performance on complex tasks highlights significant opportunities for future research to enhance their effectiveness. Our source code is publicly available at https://b-moca.github.io.

URLs: https://b-moca.github.io.

cross Formal Specification, Assessment, and Enforcement of Fairness for Generative AIs

Authors: Chih-Hong Cheng, Changshun Wu, Harald Ruess, Xingyu Zhao, Saddek Bensalem

Abstract: The risk of reinforcing or exacerbating societal biases and inequalities is growing as generative AI increasingly produces content that resembles human output, from text to images and beyond. Here we formally characterize the notion of fairness for generative AI as a basis for monitoring and enforcing fairness. We define two levels of fairness utilizing the concept of infinite words. The first is the fairness demonstrated on the generated sequences, which is only evaluated on the outputs while agnostic to the prompts/models used. The second is the inherent fairness of the generative AI model, which requires that fairness be manifested when input prompts are neutral, that is, they do not explicitly instruct the generative AI to produce a particular type of output. We also study relative intersectional fairness to counteract the combinatorial explosion of fairness when considering multiple categories together with lazy fairness enforcement. Our implemented specification monitoring and enforcement tool shows interesting results when tested against several generative AI models.

cross EmoVIT: Revolutionizing Emotion Insights with Visual Instruction Tuning

Authors: Hongxia Xie, Chu-Jun Peng, Yu-Wen Tseng, Hung-Jen Chen, Chan-Feng Hsu, Hong-Han Shuai, Wen-Huang Cheng

Abstract: Visual Instruction Tuning represents a novel learning paradigm involving the fine-tuning of pre-trained language models using task-specific instructions. This paradigm shows promising zero-shot results in various natural language processing tasks but is still unexplored in vision emotion understanding. In this work, we focus on enhancing the model's proficiency in understanding and adhering to instructions related to emotional contexts. Initially, we identify key visual clues critical to visual emotion recognition. Subsequently, we introduce a novel GPT-assisted pipeline for generating emotion visual instruction data, effectively addressing the scarcity of annotated instruction data in this domain. Expanding on the groundwork established by InstructBLIP, our proposed EmoVIT architecture incorporates emotion-specific instruction data, leveraging the powerful capabilities of Large Language Models to enhance performance. Through extensive experiments, our model showcases its proficiency in emotion classification, adeptness in affective reasoning, and competence in comprehending humor. The comparative analysis provides a robust benchmark for Emotion Visual Instruction Tuning in the era of LLMs, providing valuable insights and opening avenues for future exploration in this domain. Our code is available at \url{https://github.com/aimmemotion/EmoVIT}.

URLs: https://github.com/aimmemotion/EmoVIT

cross Multi-scale HSV Color Feature Embedding for High-fidelity NIR-to-RGB Spectrum Translation

Authors: Huiyu Zhai, Mo Chen, Xingxing Yang, Gusheng Kang

Abstract: The NIR-to-RGB spectral domain translation is a formidable task due to the inherent spectral mapping ambiguities within NIR inputs and RGB outputs. Thus, existing methods fail to reconcile the tension between maintaining texture detail fidelity and achieving diverse color variations. In this paper, we propose a Multi-scale HSV Color Feature Embedding Network (MCFNet) that decomposes the mapping process into three sub-tasks, including NIR texture maintenance, coarse geometry reconstruction, and RGB color prediction. Thus, we propose three key modules for each corresponding sub-task: the Texture Preserving Block (TPB), the HSV Color Feature Embedding Module (HSV-CFEM), and the Geometry Reconstruction Module (GRM). These modules contribute to our MCFNet methodically tackling spectral translation through a series of escalating resolutions, progressively enriching images with color and texture fidelity in a scale-coherent fashion. The proposed MCFNet demonstrates substantial performance gains over the NIR image colorization task. Code is released at: https://github.com/AlexYangxx/MCFNet.

URLs: https://github.com/AlexYangxx/MCFNet.

cross Influence of Solution Efficiency and Valence of Instruction on Additive and Subtractive Solution Strategies in Humans and GPT-4

Authors: Lydia Uhler, Verena Jordan, J\"urgen Buder, Markus Huff, Frank Papenmeier

Abstract: We explored the addition bias, a cognitive tendency to prefer adding elements over removing them to alter an initial state or structure, by conducting four preregistered experiments examining the problem-solving behavior of both humans and OpenAl's GPT-4 large language model. The experiments involved 588 participants from the U.S. and 680 iterations of the GPT-4 model. The problem-solving task was either to create symmetry within a grid (Experiments 1 and 3) or to edit a summary (Experiments 2 and 4). As hypothesized, we found that overall, the addition bias was present. Solution efficiency (Experiments 1 and 2) and valence of the instruction (Experiments 3 and 4) played important roles. Human participants were less likely to use additive strategies when subtraction was relatively more efficient than when addition and subtraction were equally efficient. GPT-4 exhibited the opposite behavior, with a strong addition bias when subtraction was more efficient. In terms of instruction valence, GPT-4 was more likely to add words when asked to "improve" compared to "edit", whereas humans did not show this effect. When we looked at the addition bias under different conditions, we found more biased responses for GPT-4 compared to humans. Our findings highlight the importance of considering comparable and sometimes superior subtractive alternatives, as well as reevaluating one's own and particularly the language models' problem-solving behavior.

cross Report on Candidate Computational Indicators for Conscious Valenced Experience

Authors: Andres Campero

Abstract: This report enlists 13 functional conditions cashed out in computational terms that have been argued to be constituent of conscious valenced experience. These are extracted from existing empirical and theoretical literature on, among others, animal sentience, medical disorders, anaesthetics, philosophy, evolution, neuroscience, and artificial intelligence.

cross Layer Skip: Enabling Early Exit Inference and Self-Speculative Decoding

Authors: Mostafa Elhoushi, Akshat Shrivastava, Diana Liskovich, Basil Hosmer, Bram Wasti, Liangzhen Lai, Anas Mahmoud, Bilge Acun, Saurabh Agarwal, Ahmed Roman, Ahmed A Aly, Beidi Chen, Carole-Jean Wu

Abstract: We present LayerSkip, an end-to-end solution to speed-up inference of large language models (LLMs). First, during training we apply layer dropout, with low dropout rates for earlier layers and higher dropout rates for later layers, and an early exit loss where all transformer layers share the same exit. Second, during inference, we show that this training recipe increases the accuracy of early exit at earlier layers, without adding any auxiliary layers or modules to the model. Third, we present a novel self-speculative decoding solution where we exit at early layers and verify and correct with remaining layers of the model. Our proposed self-speculative decoding approach has less memory footprint than other speculative decoding approaches and benefits from shared compute and activations of the draft and verification stages. We run experiments on different Llama model sizes on different types of training: pretraining from scratch, continual pretraining, finetuning on specific data domain, and finetuning on specific task. We implement our inference solution and show speedups of up to 2.16x on summarization for CNN/DM documents, 1.82x on coding, and 2.0x on TOPv2 semantic parsing task.

cross Embracing Diversity: Interpretable Zero-shot classification beyond one vector per class

Authors: Mazda Moayeri, Michael Rabbat, Mark Ibrahim, Diane Bouchacourt

Abstract: Vision-language models enable open-world classification of objects without the need for any retraining. While this zero-shot paradigm marks a significant advance, even today's best models exhibit skewed performance when objects are dissimilar from their typical depiction. Real world objects such as pears appear in a variety of forms -- from diced to whole, on a table or in a bowl -- yet standard VLM classifiers map all instances of a class to a \it{single vector based on the class label}. We argue that to represent this rich diversity within a class, zero-shot classification should move beyond a single vector. We propose a method to encode and account for diversity within a class using inferred attributes, still in the zero-shot setting without retraining. We find our method consistently outperforms standard zero-shot classification over a large suite of datasets encompassing hierarchies, diverse object states, and real-world geographic diversity, as well finer-grained datasets where intra-class diversity may be less prevalent. Importantly, our method is inherently interpretable, offering faithful explanations for each inference to facilitate model debugging and enhance transparency. We also find our method scales efficiently to a large number of attributes to account for diversity -- leading to more accurate predictions for atypical instances. Finally, we characterize a principled trade-off between overall and worst class accuracy, which can be tuned via a hyperparameter of our method. We hope this work spurs further research into the promise of zero-shot classification beyond a single class vector for capturing diversity in the world, and building transparent AI systems without compromising performance.

cross Features Fusion for Dual-View Mammography Mass Detection

Authors: Arina Varlamova, Valery Belotsky, Grigory Novikov, Anton Konushin, Evgeny Sidorov

Abstract: Detection of malignant lesions on mammography images is extremely important for early breast cancer diagnosis. In clinical practice, images are acquired from two different angles, and radiologists can fully utilize information from both views, simultaneously locating the same lesion. However, for automatic detection approaches such information fusion remains a challenge. In this paper, we propose a new model called MAMM-Net, which allows the processing of both mammography views simultaneously by sharing information not only on an object level, as seen in existing works, but also on a feature level. MAMM-Net's key component is the Fusion Layer, based on deformable attention and designed to increase detection precision while keeping high recall. Our experiments show superior performance on the public DDSM dataset compared to the previous state-of-the-art model, while introducing new helpful features such as lesion annotation on pixel-level and classification of lesions malignancy.

cross Prefix Text as a Yarn: Eliciting Non-English Alignment in Foundation Language Model

Authors: Runzhe Zhan, Xinyi Yang, Derek F. Wong, Lidia S. Chao, Yue Zhang

Abstract: While supervised fine-tuning (SFT) has been a straightforward approach for tailoring the output of foundation large language model (LLM) to specific preferences, concerns have been raised about the depth of this alignment, with some critiques suggesting it is merely "superficial". We critically examine this hypothesis within the scope of cross-lingual generation tasks, proposing that the effectiveness of SFT may be constrained by its reliance on prior tokens to guide cross-lingual generation. Based on this crucial insight, and in response to the challenges posed by the costly and limited availability of non-English data for SFT, we introduce a novel training-free alignment method named PreTTY, which employs minimal task-related prior tokens to bridge the foundation LLM and the SFT LLM, achieving comparable performance without training. Experiments on machine translation and part-of-speech tagging across eight languages demonstrate the efficacy of PreTTY in cross-lingual settings. Remarkably, by initiating the decoding process with only one or two prior tokens, foundation LLMs can achieve performance comparable to their SFT counterparts. This method presents a cost-effective alternative to SFT and advances the democratization of multilingual LLMs.

cross Redefining Safety for Autonomous Vehicles

Authors: Philip Koopman, William Widen

Abstract: Existing definitions and associated conceptual frameworks for computer-based system safety should be revisited in light of real-world experiences from deploying autonomous vehicles. Current terminology used by industry safety standards emphasizes mitigation of risk from specifically identified hazards, and carries assumptions based on human-supervised vehicle operation. Operation without a human driver dramatically increases the scope of safety concerns, especially due to operation in an open world environment, a requirement to self-enforce operational limits, participation in an ad hoc sociotechnical system of systems, and a requirement to conform to both legal and ethical constraints. Existing standards and terminology only partially address these new challenges. We propose updated definitions for core system safety concepts that encompass these additional considerations as a starting point for evolving safe-ty approaches to address these additional safety challenges. These results might additionally inform framing safety terminology for other autonomous system applications.

cross ConsistentID: Portrait Generation with Multimodal Fine-Grained Identity Preserving

Authors: Jiehui Huang, Xiao Dong, Wenhui Song, Hanhui Li, Jun Zhou, Yuhao Cheng, Shutao Liao, Long Chen, Yiqiang Yan, Shengcai Liao, Xiaodan Liang

Abstract: Diffusion-based technologies have made significant strides, particularly in personalized and customized facialgeneration. However, existing methods face challenges in achieving high-fidelity and detailed identity (ID)consistency, primarily due to insufficient fine-grained control over facial areas and the lack of a comprehensive strategy for ID preservation by fully considering intricate facial details and the overall face. To address these limitations, we introduce ConsistentID, an innovative method crafted for diverseidentity-preserving portrait generation under fine-grained multimodal facial prompts, utilizing only a single reference image. ConsistentID comprises two key components: a multimodal facial prompt generator that combines facial features, corresponding facial descriptions and the overall facial context to enhance precision in facial details, and an ID-preservation network optimized through the facial attention localization strategy, aimed at preserving ID consistency in facial regions. Together, these components significantly enhance the accuracy of ID preservation by introducing fine-grained multimodal ID information from facial regions. To facilitate training of ConsistentID, we present a fine-grained portrait dataset, FGID, with over 500,000 facial images, offering greater diversity and comprehensiveness than existing public facial datasets. % such as LAION-Face, CelebA, FFHQ, and SFHQ. Experimental results substantiate that our ConsistentID achieves exceptional precision and diversity in personalized facial generation, surpassing existing methods in the MyStyle dataset. Furthermore, while ConsistentID introduces more multimodal ID information, it maintains a fast inference speed during generation.

cross DrS: Learning Reusable Dense Rewards for Multi-Stage Tasks

Authors: Tongzhou Mu, Minghua Liu, Hao Su

Abstract: The success of many RL techniques heavily relies on human-engineered dense rewards, which typically demand substantial domain expertise and extensive trial and error. In our work, we propose DrS (Dense reward learning from Stages), a novel approach for learning reusable dense rewards for multi-stage tasks in a data-driven manner. By leveraging the stage structures of the task, DrS learns a high-quality dense reward from sparse rewards and demonstrations if given. The learned rewards can be \textit{reused} in unseen tasks, thus reducing the human effort for reward engineering. Extensive experiments on three physical robot manipulation task families with 1000+ task variants demonstrate that our learned rewards can be reused in unseen tasks, resulting in improved performance and sample efficiency of RL algorithms. The learned rewards even achieve comparable performance to human-engineered rewards on some tasks. See our project page (https://sites.google.com/view/iclr24drs) for more details.

URLs: https://sites.google.com/view/iclr24drs)

cross Continual Learning of Large Language Models: A Comprehensive Survey

Authors: Haizhou Shi, Zihao Xu, Hengyi Wang, Weiyi Qin, Wenyuan Wang, Yibin Wang, Hao Wang

Abstract: The recent success of large language models (LLMs) trained on static, pre-collected, general datasets has sparked numerous research directions and applications. One such direction addresses the non-trivial challenge of integrating pre-trained LLMs into dynamic data distributions, task structures, and user preferences. Pre-trained LLMs, when tailored for specific needs, often experience significant performance degradation in previous knowledge domains -- a phenomenon known as "catastrophic forgetting". While extensively studied in the continual learning (CL) community, it presents new manifestations in the realm of LLMs. In this survey, we provide a comprehensive overview of the current research progress on LLMs within the context of CL. This survey is structured into four main sections: we first describe an overview of continually learning LLMs, consisting of two directions of continuity: vertical continuity (or vertical continual learning), i.e., continual adaptation from general to specific capabilities, and horizontal continuity (or horizontal continual learning), i.e., continual adaptation across time and domains (Section 3). We then summarize three stages of learning LLMs in the context of modern CL: Continual Pre-Training (CPT), Domain-Adaptive Pre-training (DAP), and Continual Fine-Tuning (CFT) (Section 4). Then we provide an overview of evaluation protocols for continual learning with LLMs, along with the current available data sources (Section 5). Finally, we discuss intriguing questions pertaining to continual learning for LLMs (Section 6). The full list of papers examined in this survey is available at https://github.com/Wang-ML-Lab/llm-continual-learning-survey.

URLs: https://github.com/Wang-ML-Lab/llm-continual-learning-survey.

cross Weak-to-Strong Extrapolation Expedites Alignment

Authors: Chujie Zheng, Ziqi Wang, Heng Ji, Minlie Huang, Nanyun Peng

Abstract: Although the capabilities of large language models (LLMs) ideally scale up with increasing data and compute, they are inevitably constrained by limited resources in reality. Suppose we have a moderately trained LLM (e.g., trained to align with human preference) in hand, can we further exploit its potential and cheaply acquire a stronger model? In this paper, we propose a simple method called ExPO to boost LLMs' alignment with human preference. ExPO assumes that a medium-aligned model can be interpolated between a less-aligned (weaker) model, e.g., the initial SFT model, and a better-aligned (stronger) one, thereby directly obtaining this stronger model by extrapolating from the weights of the former two relatively weaker models. On the AlpacaEval 2.0 benchmark, we show that ExPO pushes models trained with less preference data (e.g., 10% or 20%) to reach and even surpass the fully-trained one, without any additional training. Furthermore, ExPO also significantly improves off-the-shelf DPO/RLHF models and exhibits decent scalability across model sizes from 7B to 70B. Our work demonstrates the efficacy of model extrapolation in exploiting LLMs' capabilities, suggesting a promising direction that deserves future exploration.

cross AAPL: Adding Attributes to Prompt Learning for Vision-Language Models

Authors: Gahyeon Kim, Sohee Kim, Seokju Lee

Abstract: Recent advances in large pre-trained vision-language models have demonstrated remarkable performance on zero-shot downstream tasks. Building upon this, recent studies, such as CoOp and CoCoOp, have proposed the use of prompt learning, where context within a prompt is replaced with learnable vectors, leading to significant improvements over manually crafted prompts. However, the performance improvement for unseen classes is still marginal, and to tackle this problem, data augmentation has been frequently used in traditional zero-shot learning techniques. Through our experiments, we have identified important issues in CoOp and CoCoOp: the context learned through traditional image augmentation is biased toward seen classes, negatively impacting generalization to unseen classes. To address this problem, we propose adversarial token embedding to disentangle low-level visual augmentation features from high-level class information when inducing bias in learnable prompts. Through our novel mechanism called "Adding Attributes to Prompt Learning", AAPL, we guide the learnable context to effectively extract text features by focusing on high-level features for unseen classes. We have conducted experiments across 11 datasets, and overall, AAPL shows favorable performances compared to the existing methods in few-shot learning, zero-shot learning, cross-dataset, and domain generalization tasks.

cross Make Your LLM Fully Utilize the Context

Authors: Shengnan An, Zexiong Ma, Zeqi Lin, Nanning Zheng, Jian-Guang Lou

Abstract: While many contemporary large language models (LLMs) can process lengthy input, they still struggle to fully utilize information within the long context, known as the lost-in-the-middle challenge. We hypothesize that it stems from insufficient explicit supervision during the long-context training, which fails to emphasize that any position in a long context can hold crucial information. Based on this intuition, our study presents information-intensive (IN2) training, a purely data-driven solution to overcome lost-in-the-middle. Specifically, IN2 training leverages a synthesized long-context question-answer dataset, where the answer requires (1) fine-grained information awareness on a short segment (~128 tokens) within a synthesized long context (4K-32K tokens), and (2) the integration and reasoning of information from two or more short segments. Through applying this information-intensive training on Mistral-7B, we present FILM-7B (FILl-in-the-Middle). To thoroughly assess the ability of FILM-7B for utilizing long contexts, we design three probing tasks that encompass various context styles (document, code, and structured-data context) and information retrieval patterns (forward, backward, and bi-directional retrieval). The probing results demonstrate that FILM-7B can robustly retrieve information from different positions in its 32K context window. Beyond these probing tasks, FILM-7B significantly improves the performance on real-world long-context tasks (e.g., 23.5->26.9 F1 score on NarrativeQA), while maintaining a comparable performance on short-context tasks (e.g., 59.3->59.2 accuracy on MMLU). Github Link: https://github.com/microsoft/FILM.

URLs: https://github.com/microsoft/FILM.

cross Learning Visuotactile Skills with Two Multifingered Hands

Authors: Toru Lin, Yu Zhang, Qiyang Li, Haozhi Qi, Brent Yi, Sergey Levine, Jitendra Malik

Abstract: Aiming to replicate human-like dexterity, perceptual experiences, and motion patterns, we explore learning from human demonstrations using a bimanual system with multifingered hands and visuotactile data. Two significant challenges exist: the lack of an affordable and accessible teleoperation system suitable for a dual-arm setup with multifingered hands, and the scarcity of multifingered hand hardware equipped with touch sensing. To tackle the first challenge, we develop HATO, a low-cost hands-arms teleoperation system that leverages off-the-shelf electronics, complemented with a software suite that enables efficient data collection; the comprehensive software suite also supports multimodal data processing, scalable policy learning, and smooth policy deployment. To tackle the latter challenge, we introduce a novel hardware adaptation by repurposing two prosthetic hands equipped with touch sensors for research. Using visuotactile data collected from our system, we learn skills to complete long-horizon, high-precision tasks which are difficult to achieve without multifingered dexterity and touch feedback. Furthermore, we empirically investigate the effects of dataset size, sensing modality, and visual input preprocessing on policy learning. Our results mark a promising step forward in bimanual multifingered manipulation from visuotactile data. Videos, code, and datasets can be found at https://toruowo.github.io/hato/ .

URLs: https://toruowo.github.io/hato/

cross Make-it-Real: Unleashing Large Multimodal Model's Ability for Painting 3D Objects with Realistic Materials

Authors: Ye Fang, Zeyi Sun, Tong Wu, Jiaqi Wang, Ziwei Liu, Gordon Wetzstein, Dahua Lin

Abstract: Physically realistic materials are pivotal in augmenting the realism of 3D assets across various applications and lighting conditions. However, existing 3D assets and generative models often lack authentic material properties. Manual assignment of materials using graphic software is a tedious and time-consuming task. In this paper, we exploit advancements in Multimodal Large Language Models (MLLMs), particularly GPT-4V, to present a novel approach, Make-it-Real: 1) We demonstrate that GPT-4V can effectively recognize and describe materials, allowing the construction of a detailed material library. 2) Utilizing a combination of visual cues and hierarchical text prompts, GPT-4V precisely identifies and aligns materials with the corresponding components of 3D objects. 3) The correctly matched materials are then meticulously applied as reference for the new SVBRDF material generation according to the original diffuse map, significantly enhancing their visual authenticity. Make-it-Real offers a streamlined integration into the 3D content creation workflow, showcasing its utility as an essential tool for developers of 3D assets.

replace Large Language Models can Learn Rules

Authors: Zhaocheng Zhu, Yuan Xue, Xinyun Chen, Denny Zhou, Jian Tang, Dale Schuurmans, Hanjun Dai

Abstract: When prompted with a few examples and intermediate steps, large language models (LLMs) have demonstrated impressive performance in various reasoning tasks. However, prompting methods that rely on implicit knowledge in an LLM often generate incorrect answers when the implicit knowledge is wrong or inconsistent with the task. To tackle this problem, we present Hypotheses-to-Theories (HtT), a framework that learns a rule library for reasoning with LLMs. HtT contains two stages, an induction stage and a deduction stage. In the induction stage, an LLM is first asked to generate and verify rules over a set of training examples. Rules that appear and lead to correct answers sufficiently often are collected to form a rule library. In the deduction stage, the LLM is then prompted to employ the learned rule library to perform reasoning to answer test questions. Experiments on relational reasoning, numerical reasoning and concept learning problems show that HtT improves existing prompting methods, with an absolute gain of 10-30% in accuracy. The learned rules are also transferable to different models and to different forms of the same problem.

replace ChemLLM: A Chemical Large Language Model

Authors: Di Zhang, Wei Liu, Qian Tan, Jingdan Chen, Hang Yan, Yuliang Yan, Jiatong Li, Weiran Huang, Xiangyu Yue, Wanli Ouyang, Dongzhan Zhou, Shufei Zhang, Mao Su, Han-Sen Zhong, Yuqiang Li

Abstract: Large language models (LLMs) have made impressive progress in chemistry applications. However, the community lacks an LLM specifically designed for chemistry. The main challenges are two-fold: firstly, most chemical data and scientific knowledge are stored in structured databases, which limits the model's ability to sustain coherent dialogue when used directly. Secondly, there is an absence of objective and fair benchmark that encompass most chemistry tasks. Here, we introduce ChemLLM, a comprehensive framework that features the first LLM dedicated to chemistry. It also includes ChemData, a dataset specifically designed for instruction tuning, and ChemBench, a robust benchmark covering nine essential chemistry tasks. ChemLLM is adept at performing various tasks across chemical disciplines with fluid dialogue interaction. Notably, ChemLLM achieves results comparable to GPT-4 on the core chemical tasks and demonstrates competitive performance with LLMs of similar size in general scenarios. ChemLLM paves a new path for exploration in chemical studies, and our method of incorporating structured chemical knowledge into dialogue systems sets a new standard for developing LLMs in various scientific fields. Codes, Datasets, and Model weights are publicly accessible at https://hf.co/AI4Chem

URLs: https://hf.co/AI4Chem

replace Even-Ifs From If-Onlys: Are the Best Semi-Factual Explanations Found Using Counterfactuals As Guides?

Authors: Saugat Aryal, Mark T. Keane

Abstract: Recently, counterfactuals using "if-only" explanations have become very popular in eXplainable AI (XAI), as they describe which changes to feature-inputs of a black-box AI system result in changes to a (usually negative) decision-outcome. Even more recently, semi-factuals using "even-if" explanations have gained more attention. They elucidate the feature-input changes that do not change the decision-outcome of the AI system, with a potential to suggest more beneficial recourses. Some semi-factual methods use counterfactuals to the query-instance to guide semi-factual production (so-called counterfactual-guided methods), whereas others do not (so-called counterfactual-free methods). In this work, we perform comprehensive tests of 8 semi-factual methods on 7 datasets using 5 key metrics, to determine whether counterfactual guidance is necessary to find the best semi-factuals. The results of these tests suggests not, but rather that computing other aspects of the decision space lead to better semi-factual XAI.

replace How Far Are We on the Decision-Making of LLMs? Evaluating LLMs' Gaming Ability in Multi-Agent Environments

Authors: Jen-tse Huang, Eric John Li, Man Ho Lam, Tian Liang, Wenxuan Wang, Youliang Yuan, Wenxiang Jiao, Xing Wang, Zhaopeng Tu, Michael R. Lyu

Abstract: Decision-making, a complicated task requiring various types of abilities, presents an excellent framework for assessing Large Language Models (LLMs). Our research investigates LLMs' decision-making capabilities through the lens of a well-established field, Game Theory. We focus specifically on games that support the participation of more than two agents simultaneously. Subsequently, we introduce our framework, GAMA-Bench, including eight classical multi-agent games. We design a scoring scheme to assess a model's performance in these games quantitatively. Through GAMA-Bench, we investigate LLMs' robustness, generalizability, and enhancement strategies. Results reveal that while GPT-3.5 shows satisfying robustness, its generalizability is relatively limited. However, its performance can be improved through approaches such as Chain-of-Thought. Additionally, we conduct evaluations across various LLMs and find that GPT-4 outperforms other models on GAMA-Bench, achieving a score of 60.5. Moreover, Gemini-1.0-Pro and GPT-3.5 (0613, 1106, 0125) demonstrate similar intelligence on GAMA-Bench. The code and experimental results are made publicly available via https://github.com/CUHK-ARISE/GAMABench.

URLs: https://github.com/CUHK-ARISE/GAMABench.

replace Demonstration of DB-GPT: Next Generation Data Interaction System Empowered by Large Language Models

Authors: Siqiao Xue, Danrui Qi, Caigao Jiang, Wenhui Shi, Fangyin Cheng, Keting Chen, Hongjun Yang, Zhiping Zhang, Jianshan He, Hongyang Zhang, Ganglin Wei, Wang Zhao, Fan Zhou, Hong Yi, Shaodong Liu, Hongjun Yang, Faqiang Chen

Abstract: The recent breakthroughs in large language models (LLMs) are positioned to transition many areas of software. The technologies of interacting with data particularly have an important entanglement with LLMs as efficient and intuitive data interactions are paramount. In this paper, we present DB-GPT, a revolutionary and product-ready Python library that integrates LLMs into traditional data interaction tasks to enhance user experience and accessibility. DB-GPT is designed to understand data interaction tasks described by natural language and provide context-aware responses powered by LLMs, making it an indispensable tool for users ranging from novice to expert. Its system design supports deployment across local, distributed, and cloud environments. Beyond handling basic data interaction tasks like Text-to-SQL with LLMs, it can handle complex tasks like generative data analysis through a Multi-Agents framework and the Agentic Workflow Expression Language (AWEL). The Service-oriented Multi-model Management Framework (SMMF) ensures data privacy and security, enabling users to employ DB-GPT with private LLMs. Additionally, DB-GPT offers a series of product-ready features designed to enable users to integrate DB-GPT within their product environments easily. The code of DB-GPT is available at Github(https://github.com/eosphoros-ai/DB-GPT) which already has over 10.7k stars. Please install DB-GPT for your own usage with the instructions(https://github.com/eosphoros-ai/DB-GPT#install) and watch a 5-minute introduction video on Youtube(https://youtu.be/n_8RI1ENyl4) to further investigate DB-GPT.

URLs: https://github.com/eosphoros-ai/DB-GPT), https://github.com/eosphoros-ai/DB-GPT, https://youtu.be/n_8RI1ENyl4)

replace Personalized Forgetting Mechanism with Concept-Driven Knowledge Tracing

Authors: Shanshan Wang, Ying Hu, Xun Yang, Zhongzhou Zhang, Keyang Wang, Xingyi Zhang

Abstract: Knowledge Tracing (KT) aims to trace changes in students' knowledge states throughout their entire learning process by analyzing their historical learning data and predicting their future learning performance. Existing forgetting curve theory based knowledge tracing models only consider the general forgetting caused by time intervals, ignoring the individualization of students and the causal relationship of the forgetting process. To address these problems, we propose a Concept-driven Personalized Forgetting knowledge tracing model (CPF) which integrates hierarchical relationships between knowledge concepts and incorporates students' personalized cognitive abilities. First, we integrate the students' personalized capabilities into both the learning and forgetting processes to explicitly distinguish students' individual learning gains and forgetting rates according to their cognitive abilities. Second, we take into account the hierarchical relationships between knowledge points and design a precursor-successor knowledge concept matrix to simulate the causal relationship in the forgetting process, while also integrating the potential impact of forgetting prior knowledge points on subsequent ones. The proposed personalized forgetting mechanism can not only be applied to the learning of specifc knowledge concepts but also the life-long learning process. Extensive experimental results on three public datasets show that our CPF outperforms current forgetting curve theory based methods in predicting student performance, demonstrating CPF can better simulate changes in students' knowledge status through the personalized forgetting mechanism.

replace FlagVNE: A Flexible and Generalizable Reinforcement Learning Framework for Network Resource Allocation

Authors: Tianfu Wang, Qilin Fan, Chao Wang, Long Yang, Leilei Ding, Nicholas Jing Yuan, Hui Xiong

Abstract: Virtual network embedding (VNE) is an essential resource allocation task in network virtualization, aiming to map virtual network requests (VNRs) onto physical infrastructure. Reinforcement learning (RL) has recently emerged as a promising solution to this problem. However, existing RL-based VNE methods are limited by the unidirectional action design and one-size-fits-all training strategy, resulting in restricted searchability and generalizability. In this paper, we propose a FLexible And Generalizable RL framework for VNE, named FlagVNE. Specifically, we design a bidirectional action-based Markov decision process model that enables the joint selection of virtual and physical nodes, thus improving the exploration flexibility of solution space. To tackle the expansive and dynamic action space, we design a hierarchical decoder to generate adaptive action probability distributions and ensure high training efficiency. Furthermore, to overcome the generalization issue for varying VNR sizes, we propose a meta-RL-based training method with a curriculum scheduling strategy, facilitating specialized policy training for each VNR size. Finally, extensive experimental results show the effectiveness of FlagVNE across multiple key metrics. Our code is available at GitHub (https://github.com/GeminiLight/flag-vne).

URLs: https://github.com/GeminiLight/flag-vne).

replace-cross Digital Over-the-Air Federated Learning in Multi-Antenna Systems

Authors: Sihua Wang, Mingzhe Chen, Cong Shen, Changchuan Yin, Christopher G. Brinton

Abstract: In this paper, the performance optimization of federated learning (FL), when deployed over a realistic wireless multiple-input multiple-output (MIMO) communication system with digital modulation and over-the-air computation (AirComp) is studied. In particular, a MIMO system is considered in which edge devices transmit their local FL models (trained using their locally collected data) to a parameter server (PS) using beamforming to maximize the number of devices scheduled for transmission. The PS, acting as a central controller, generates a global FL model using the received local FL models and broadcasts it back to all devices. Due to the limited bandwidth in a wireless network, AirComp is adopted to enable efficient wireless data aggregation. However, fading of wireless channels can produce aggregate distortions in an AirComp-based FL scheme. To tackle this challenge, we propose a modified federated averaging (FedAvg) algorithm that combines digital modulation with AirComp to mitigate wireless fading while ensuring the communication efficiency. This is achieved by a joint transmit and receive beamforming design, which is formulated as an optimization problem to dynamically adjust the beamforming matrices based on current FL model parameters so as to minimize the transmitting error and ensure the FL performance. To achieve this goal, we first analytically characterize how the beamforming matrices affect the performance of the FedAvg in different iterations. Based on this relationship, an artificial neural network (ANN) is used to estimate the local FL models of all devices and adjust the beamforming matrices at the PS for future model transmission. The algorithmic advantages and improved performance of the proposed methodologies are demonstrated through extensive numerical experiments.

replace-cross Understanding Sparse Neural Networks from their Topology via Multipartite Graph Representations

Authors: Elia Cunegatti, Matteo Farina, Doina Bucur, Giovanni Iacca

Abstract: Pruning-at-Initialization (PaI) algorithms provide Sparse Neural Networks (SNNs) which are computationally more efficient than their dense counterparts, and try to avoid performance degradation. While much emphasis has been directed towards \emph{how} to prune, we still do not know \emph{what topological metrics} of the SNNs characterize \emph{good performance}. From prior work, we have layer-wise topological metrics by which SNN performance can be predicted: the Ramanujan-based metrics. To exploit these metrics, proper ways to represent network layers via Graph Encodings (GEs) are needed, with Bipartite Graph Encodings (BGEs) being the \emph{de-facto} standard at the current stage. Nevertheless, existing BGEs neglect the impact of the inputs, and do not characterize the SNN in an end-to-end manner. Additionally, thanks to a thorough study of the Ramanujan-based metrics, we discover that they are only as good as the \emph{layer-wise density} as performance predictors, when paired with BGEs. To close both gaps, we design a comprehensive topological analysis for SNNs with both linear and convolutional layers, via (i) a new input-aware Multipartite Graph Encoding (MGE) for SNNs and (ii) the design of new end-to-end topological metrics over the MGE. With these novelties, we show the following: (a) The proposed MGE allows to extract topological metrics that are much better predictors of the accuracy drop than metrics computed from current input-agnostic BGEs; (b) Which metrics are important at different sparsity levels and for different architectures; (c) A mixture of our topological metrics can rank PaI algorithms more effectively than Ramanujan-based metrics. The codebase is publicly available at https://github.com/eliacunegatti/mge-snn.

URLs: https://github.com/eliacunegatti/mge-snn.

replace-cross Structure in Deep Reinforcement Learning: A Survey and Open Problems

Authors: Aditya Mohan, Amy Zhang, Marius Lindauer

Abstract: Reinforcement Learning (RL), bolstered by the expressive capabilities of Deep Neural Networks (DNNs) for function approximation, has demonstrated considerable success in numerous applications. However, its practicality in addressing various real-world scenarios, characterized by diverse and unpredictable dynamics, noisy signals, and large state and action spaces, remains limited. This limitation stems from poor data efficiency, limited generalization capabilities, a lack of safety guarantees, and the absence of interpretability, among other factors. To overcome these challenges and improve performance across these crucial metrics, one promising avenue is to incorporate additional structural information about the problem into the RL learning process. Various sub-fields of RL have proposed methods for incorporating such inductive biases. We amalgamate these diverse methodologies under a unified framework, shedding light on the role of structure in the learning problem, and classify these methods into distinct patterns of incorporating structure. By leveraging this comprehensive framework, we provide valuable insights into the challenges of structured RL and lay the groundwork for a design pattern perspective on RL research. This novel perspective paves the way for future advancements and aids in developing more effective and efficient RL algorithms that can potentially handle real-world scenarios better.

replace-cross Game-Theoretic Robust Reinforcement Learning Handles Temporally-Coupled Perturbations

Authors: Yongyuan Liang, Yanchao Sun, Ruijie Zheng, Xiangyu Liu, Benjamin Eysenbach, Tuomas Sandholm, Furong Huang, Stephen McAleer

Abstract: Deploying reinforcement learning (RL) systems requires robustness to uncertainty and model misspecification, yet prior robust RL methods typically only study noise introduced independently across time. However, practical sources of uncertainty are usually coupled across time. We formally introduce temporally-coupled perturbations, presenting a novel challenge for existing robust RL methods. To tackle this challenge, we propose GRAD, a novel game-theoretic approach that treats the temporally-coupled robust RL problem as a partially observable two-player zero-sum game. By finding an approximate equilibrium within this game, GRAD optimizes for general robustness against temporally-coupled perturbations. Experiments on continuous control tasks demonstrate that, compared with prior methods, our approach achieves a higher degree of robustness to various types of attacks on different attack domains, both in settings with temporally-coupled perturbations and decoupled perturbations.

replace-cross DOMAIN: MilDly COnservative Model-BAsed OfflINe Reinforcement Learning

Authors: Xiao-Yin Liu, Xiao-Hu Zhou, Xiao-Liang Xie, Shi-Qi Liu, Zhen-Qiu Feng, Hao Li, Mei-Jiang Gui, Tian-Yu Xiang, De-Xing Huang, Zeng-Guang Hou

Abstract: Model-based reinforcement learning (RL), which learns environment model from offline dataset and generates more out-of-distribution model data, has become an effective approach to the problem of distribution shift in offline RL. Due to the gap between the learned and actual environment, conservatism should be incorporated into the algorithm to balance accurate offline data and imprecise model data. The conservatism of current algorithms mostly relies on model uncertainty estimation. However, uncertainty estimation is unreliable and leads to poor performance in certain scenarios, and the previous methods ignore differences between the model data, which brings great conservatism. Therefore, this paper proposes a milDly cOnservative Model-bAsed offlINe RL algorithm (DOMAIN) without estimating model uncertainty to address the above issues. DOMAIN introduces adaptive sampling distribution of model samples, which can adaptively adjust the model data penalty. In this paper, we theoretically demonstrate that the Q value learned by the DOMAIN outside the region is a lower bound of the true Q value, the DOMAIN is less conservative than previous model-based offline RL algorithms and has the guarantee of security policy improvement. The results of extensive experiments show that DOMAIN outperforms prior RL algorithms on the D4RL dataset benchmark, and achieves better performance than other RL algorithms on tasks that require generalization.

replace-cross Lemur: Integrating Large Language Models in Automated Program Verification

Authors: Haoze Wu, Clark Barrett, Nina Narodytska

Abstract: The demonstrated code-understanding capability of LLMs raises the question of whether they can be used for automated program verification, a task that demands high-level abstract reasoning about program properties that is challenging for verification tools. We propose a general methodology to combine the power of LLMs and automated reasoners for automated program verification. We formally describe this methodology as a set of transition rules and prove its soundness. We instantiate the calculus as a sound automated verification procedure and demonstrate practical improvements on a set of synthetic and competition benchmarks.

replace-cross Prompt Cache: Modular Attention Reuse for Low-Latency Inference

Authors: In Gim, Guojun Chen, Seung-seob Lee, Nikhil Sarda, Anurag Khandelwal, Lin Zhong

Abstract: We present Prompt Cache, an approach for accelerating inference for large language models (LLM) by reusing attention states across different LLM prompts. Many input prompts have overlapping text segments, such as system messages, prompt templates, and documents provided for context. Our key insight is that by precomputing and storing the attention states of these frequently occurring text segments on the inference server, we can efficiently reuse them when these segments appear in user prompts. Prompt Cache employs a schema to explicitly define such reusable text segments, called prompt modules. The schema ensures positional accuracy during attention state reuse and provides users with an interface to access cached states in their prompt. Using a prototype implementation, we evaluate Prompt Cache across several LLMs. We show that Prompt Cache significantly reduce latency in time-to-first-token, especially for longer prompts such as document-based question answering and recommendations. The improvements range from 8x for GPU-based inference to 60x for CPU-based inference, all while maintaining output accuracy and without the need for model parameter modifications.

replace-cross Data-driven building energy efficiency prediction using physics-informed neural networks

Authors: Vasilis Michalakopoulos, Sotiris Pelekis, Giorgos Kormpakis, Vagelis Karakolis, Spiros Mouzakitis, Dimitris Askounis

Abstract: The analytical prediction of building energy performance in residential buildings based on the heat losses of its individual envelope components is a challenging task. It is worth noting that this field is still in its infancy, with relatively limited research conducted in this specific area to date, especially when it comes for data-driven approaches. In this paper we introduce a novel physics-informed neural network model for addressing this problem. Through the employment of unexposed datasets that encompass general building information, audited characteristics, and heating energy consumption, we feed the deep learning model with general building information, while the model's output consists of the structural components and several thermal properties that are in fact the basic elements of an energy performance certificate (EPC). On top of this neural network, a function, based on physics equations, calculates the energy consumption of the building based on heat losses and enhances the loss function of the deep learning model. This methodology is tested on a real case study for 256 buildings located in Riga, Latvia. Our investigation comes up with promising results in terms of prediction accuracy, paving the way for automated, and data-driven energy efficiency performance prediction based on basic properties of the building, contrary to exhaustive energy efficiency audits led by humans, which are the current status quo.

replace-cross On the Relationship Between Interpretability and Explainability in Machine Learning

Authors: Benjamin Leblanc, Pascal Germain

Abstract: Interpretability and explainability have gained more and more attention in the field of machine learning as they are crucial when it comes to high-stakes decisions and troubleshooting. Since both provide information about predictors and their decision process, they are often seen as two independent means for one single end. This view has led to a dichotomous literature: explainability techniques designed for complex black-box models, or interpretable approaches ignoring the many explainability tools. In this position paper, we challenge the common idea that interpretability and explainability are substitutes for one another by listing their principal shortcomings and discussing how both of them mitigate the drawbacks of the other. In doing so, we call for a new perspective on interpretability and explainability, and works targeting both topics simultaneously, leveraging each of their respective assets.

replace-cross Perspective in Opinion Dynamics on Complex Convex Domains of Time Networks for Addiction, Forgetting

Authors: Yasuko Kawahata

Abstract: This paper revises previous work and introduces changes in spatio-temporal scales. The paper presents a model that includes layers A and B with varying degrees of forgetting and dependence over time. We also model changes in dependence and forgetting in layers A, A', B, and B' under certain conditions. In addition, to discuss the formation of opinion clusters that have reinforcing or obstructive behaviors of forgetting and dependence and are conservative or brainwashing or detoxifying and less prone to filter bubbling, new clusters C and D that recommend, obstruct, block, or incite forgetting and dependence over time are Introduction. This introduction allows us to test hypotheses regarding the expansion of opinions in two dimensions over time and space, the state of development of opinion space, and the expansion of public opinion. Challenges in consensus building will be highlighted, emphasizing the dynamic nature of opinions and the need to consider factors such as dissent, distrust, and media influence. The paper proposes an extended framework that incorporates trust, distrust, and media influence into the consensus building model. We introduce network analysis using dimerizing as a method to gain deeper insights. In this context, we discuss network clustering, media influence, and consensus building. The location and distribution of dimers will be analyzed to gain insight into the structure and dynamics of the network. Dimertiling has been applied in various fields other than network analysis, such as physics and sociology. The paper concludes by emphasizing the importance of diverse perspectives, network analysis, and influential entities in consensus building. It also introduces torus-based visualizations that aid in understanding complex network structures.

replace-cross BELT: Old-School Backdoor Attacks can Evade the State-of-the-Art Defense with Backdoor Exclusivity Lifting

Authors: Huming Qiu, Junjie Sun, Mi Zhang, Xudong Pan, Min Yang

Abstract: Deep neural networks (DNNs) are susceptible to backdoor attacks, where malicious functionality is embedded to allow attackers to trigger incorrect classifications. Old-school backdoor attacks use strong trigger features that can easily be learned by victim models. Despite robustness against input variation, the robustness however increases the likelihood of unintentional trigger activations. This leaves traces to existing defenses, which find approximate replacements for the original triggers that can activate the backdoor without being identical to the original trigger via, e.g., reverse engineering and sample overlay. In this paper, we propose and investigate a new characteristic of backdoor attacks, namely, backdoor exclusivity, which measures the ability of backdoor triggers to remain effective in the presence of input variation. Building upon the concept of backdoor exclusivity, we propose Backdoor Exclusivity LifTing (BELT), a novel technique which suppresses the association between the backdoor and fuzzy triggers to enhance backdoor exclusivity for defense evasion. Extensive evaluation on three popular backdoor benchmarks validate, our approach substantially enhances the stealthiness of four old-school backdoor attacks, which, after backdoor exclusivity lifting, is able to evade seven state-of-the-art backdoor countermeasures, at almost no cost of the attack success rate and normal utility. For example, one of the earliest backdoor attacks BadNet, enhanced by BELT, evades most of the state-of-the-art defenses including ABS and MOTH which would otherwise recognize the backdoored model.

replace-cross Genixer: Empowering Multimodal Large Language Models as a Powerful Data Generator

Authors: Henry Hengyuan Zhao, Pan Zhou, Mike Zheng Shou

Abstract: Instruction tuning data is essential for training the Multimodal Large Language Models (MLLMs). However, the creation of high-quality instruction tuning data presents significant challenges. Asking the human to label the instruction tuning data is label-intensive and time-consuming. Some works prompted to GPT-4 for data generation were not only costly but also lacked satisfactory performance in complex tasks (i.e., grounding-based reasoning tasks). To address the challenges of data creation, we are the first to explore the potential of empowering MLLMs with the ability to generate instruction-tuning data by following user instructions. Specifically, we developed an innovative data generation pipeline Genixer to generate various high-quality instruction tuning data, including nine representative tasks, e.g., Common VQA, REC, REG, and PointQ. Genixer provides a unified solution for data generation with four key steps: (i) instruction data collection, (ii) instruction template design, (iii) empowering MLLM, and (iv) data generation and filtering. To validate the effectiveness of generated data, we conducted the human evaluation and user preference study to assess the quality of generated data. Subsequently, we generated two instruction-tuning datasets for the training of two representative MLLMs, LLaVA1.5 and Shikra, and noted consistent improvements across various VQA tasks and multimodal benchmarks. For instance, performance on the VizWiz benchmark improved from 50.0% to 53.8%, and on ScienceQA, it increased from 66.8% to 69.7%, reconfirming the quality of the generated instruction tuning data. The data, code, and models will be released.

replace-cross Indoor and Outdoor 3D Scene Graph Generation via Language-Enabled Spatial Ontologies

Authors: Jared Strader, Nathan Hughes, William Chen, Alberto Speranzon, Luca Carlone

Abstract: This paper proposes an approach to build 3D scene graphs in arbitrary indoor and outdoor environments. Such extension is challenging; the hierarchy of concepts that describe an outdoor environment is more complex than for indoors, and manually defining such hierarchy is time-consuming and does not scale. Furthermore, the lack of training data prevents the straightforward application of learning-based tools used in indoor settings. To address these challenges, we propose two novel extensions. First, we develop methods to build a spatial ontology defining concepts and relations relevant for indoor and outdoor robot operation. In particular, we use a Large Language Model (LLM) to build such an ontology, thus largely reducing the amount of manual effort required. Second, we leverage the spatial ontology for 3D scene graph construction using Logic Tensor Networks (LTN) to add logical rules, or axioms (e.g., "a beach contains sand"), which provide additional supervisory signals at training time thus reducing the need for labelled data, providing better predictions, and even allowing predicting concepts unseen at training time. We test our approach in a variety of datasets, including indoor, rural, and coastal environments, and show that it leads to a significant increase in the quality of the 3D scene graph generation with sparsely annotated data.

replace-cross Training towards significance with the decorrelated event classifier transformer neural network

Authors: Jaebak Kim

Abstract: Experimental particle physics uses machine learning for many of tasks, where one application is to classify signal and background events. The classification can be used to bin an analysis region to enhance the expected significance for a mass resonance search. In natural language processing, one of the leading neural network architectures is the transformer. In this work, an event classifier transformer is proposed to bin an analysis region, in which the network is trained with special techniques. The techniques developed here can enhance the significance and reduce the correlation between the network's output and the reconstructed mass. It is found that this trained network can perform better than boosted decision trees and feed-forward networks.

replace-cross The Anatomy Spread of Online Opinion Polarization: The Pivotal Role of Super-Spreaders in Social Networks

Authors: Yasuko Kawahata

Abstract: The study investigates the role of 'superspreaders' in shaping opinions within networks, distinguishing three types: A, B, and C. Type A has a significant influence in shaping opinions, Type B acts as a counterbalance to A, and Type C functions like media, providing an objective viewpoint and potentially regulating A and B's influence. The research uses a confidence coefficient and z-score to survey superspreaders' behaviors, with a focus on the conditions affecting group dynamics and opinion formation, including environmental factors and forgetfulness over time. The findings offer insights for improving online communication security and understanding social influence. This paper is partially an attempt to utilize "Generative AI" and was written with educational intent. There are currently no plans for it to become a peer-reviewed paper.

replace-cross VerAs: Verify then Assess STEM Lab Reports

Authors: Berk Atil, Mahsa Sheikhi Karizaki, Rebecca J. Passonneau

Abstract: With an increasing focus in STEM education on critical thinking skills, science writing plays an ever more important role in curricula that stress inquiry skills. A recently published dataset of two sets of college level lab reports from an inquiry-based physics curriculum relies on analytic assessment rubrics that utilize multiple dimensions, specifying subject matter knowledge and general components of good explanations. Each analytic dimension is assessed on a 6-point scale, to provide detailed feedback to students that can help them improve their science writing skills. Manual assessment can be slow, and difficult to calibrate for consistency across all students in large classes. While much work exists on automated assessment of open-ended questions in STEM subjects, there has been far less work on long-form writing such as lab reports. We present an end-to-end neural architecture that has separate verifier and assessment modules, inspired by approaches to Open Domain Question Answering (OpenQA). VerAs first verifies whether a report contains any content relevant to a given rubric dimension, and if so, assesses the relevant sentences. On the lab reports, VerAs outperforms multiple baselines based on OpenQA systems or Automated Essay Scoring (AES). VerAs also performs well on an analytic rubric for middle school physics essays.

replace-cross The WMDP Benchmark: Measuring and Reducing Malicious Use With Unlearning

Authors: Nathaniel Li, Alexander Pan, Anjali Gopal, Summer Yue, Daniel Berrios, Alice Gatti, Justin D. Li, Ann-Kathrin Dombrowski, Shashwat Goel, Long Phan, Gabriel Mukobi, Nathan Helm-Burger, Rassin Lababidi, Lennart Justen, Andrew B. Liu, Michael Chen, Isabelle Barrass, Oliver Zhang, Xiaoyuan Zhu, Rishub Tamirisa, Bhrugu Bharathi, Adam Khoja, Zhenqi Zhao, Ariel Herbert-Voss, Cort B. Breuer, Samuel Marks, Oam Patel, Andy Zou, Mantas Mazeika, Zifan Wang, Palash Oswal, Weiran Liu, Adam A. Hunt, Justin Tienken-Harder, Kevin Y. Shih, Kemper Talley, John Guan, Russell Kaplan, Ian Steneker, David Campbell, Brad Jokubaitis, Alex Levinson, Jean Wang, William Qian, Kallol Krishna Karmakar, Steven Basart, Stephen Fitz, Mindy Levine, Ponnurangam Kumaraguru, Uday Tupakula, Vijay Varadharajan, Yan Shoshitaishvili, Jimmy Ba, Kevin M. Esvelt, Alexandr Wang, Dan Hendrycks

Abstract: The White House Executive Order on Artificial Intelligence highlights the risks of large language models (LLMs) empowering malicious actors in developing biological, cyber, and chemical weapons. To measure these risks of malicious use, government institutions and major AI labs are developing evaluations for hazardous capabilities in LLMs. However, current evaluations are private, preventing further research into mitigating risk. Furthermore, they focus on only a few, highly specific pathways for malicious use. To fill these gaps, we publicly release the Weapons of Mass Destruction Proxy (WMDP) benchmark, a dataset of 4,157 multiple-choice questions that serve as a proxy measurement of hazardous knowledge in biosecurity, cybersecurity, and chemical security. WMDP was developed by a consortium of academics and technical consultants, and was stringently filtered to eliminate sensitive information prior to public release. WMDP serves two roles: first, as an evaluation for hazardous knowledge in LLMs, and second, as a benchmark for unlearning methods to remove such hazardous knowledge. To guide progress on unlearning, we develop CUT, a state-of-the-art unlearning method based on controlling model representations. CUT reduces model performance on WMDP while maintaining general capabilities in areas such as biology and computer science, suggesting that unlearning may be a concrete path towards reducing malicious use from LLMs. We release our benchmark and code publicly at https://wmdp.ai

URLs: https://wmdp.ai

replace-cross Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context

Authors: Gemini Team, Machel Reid (Dima), Nikolay Savinov (Dima), Denis Teplyashin (Dima), Dmitry (Dima), Lepikhin, Timothy Lillicrap, Jean-baptiste Alayrac, Radu Soricut, Angeliki Lazaridou, Orhan Firat, Julian Schrittwieser, Ioannis Antonoglou, Rohan Anil, Sebastian Borgeaud, Andrew Dai, Katie Millican, Ethan Dyer, Mia Glaese, Thibault Sottiaux, Benjamin Lee, Fabio Viola, Malcolm Reynolds, Yuanzhong Xu, James Molloy, Jilin Chen, Michael Isard, Paul Barham, Tom Hennigan, Ross McIlroy, Melvin Johnson, Johan Schalkwyk, Eli Collins, Eliza Rutherford, Erica Moreira, Kareem Ayoub, Megha Goel, Clemens Meyer, Gregory Thornton, Zhen Yang, Henryk Michalewski, Zaheer Abbas, Nathan Schucher, Ankesh Anand, Richard Ives, James Keeling, Karel Lenc, Salem Haykal, Siamak Shakeri, Pranav Shyam, Aakanksha Chowdhery, Roman Ring, Stephen Spencer, Eren Sezener, Luke Vilnis, Oscar Chang, Nobuyuki Morioka, George Tucker, Ce Zheng, Oliver Woodman, Nithya Attaluri, Tomas Kocisky, Evgenii Eltyshev, Xi Chen, Timothy Chung, Vittorio Selo, Siddhartha Brahma, Petko Georgiev, Ambrose Slone, Zhenkai Zhu, James Lottes, Siyuan Qiao, Ben Caine, Sebastian Riedel, Alex Tomala, Martin Chadwick, Juliette Love, Peter Choy, Sid Mittal, Neil Houlsby, Yunhao Tang, Matthew Lamm, Libin Bai, Qiao Zhang, Luheng He, Yong Cheng, Peter Humphreys, Yujia Li, Sergey Brin, Albin Cassirer, Yingjie Miao, Lukas Zilka, Taylor Tobin, Kelvin Xu, Lev Proleev, Daniel Sohn, Alberto Magni, Lisa Anne Hendricks, Isabel Gao, Santiago Ontanon, Oskar Bunyan, Nathan Byrd, Abhanshu Sharma, Biao Zhang, Mario Pinto, Rishika Sinha, Harsh Mehta, Dawei Jia, Sergi Caelles, Albert Webson, Alex Morris, Becca Roelofs, Yifan Ding, Robin Strudel, Xuehan Xiong, Marvin Ritter, Mostafa Dehghani, Rahma Chaabouni, Abhijit Karmarkar, Guangda Lai, Fabian Mentzer, Bibo Xu, YaGuang Li, Yujing Zhang, Tom Le Paine, Alex Goldin, Behnam Neyshabur, Kate Baumli, Anselm Levskaya, Michael Laskin, Wenhao Jia, Jack W. Rae, Kefan Xiao, Antoine He, Skye Giordano, Lakshman Yagati, Jean-Baptiste Lespiau, Paul Natsev, Sanjay Ganapathy, Fangyu Liu, Danilo Martins, Nanxin Chen, Yunhan Xu, Megan Barnes, Rhys May, Arpi Vezer, Junhyuk Oh, Ken Franko, Sophie Bridgers, Ruizhe Zhao, Boxi Wu, Basil Mustafa, Sean Sechrist, Emilio Parisotto, Thanumalayan Sankaranarayana Pillai, Chris Larkin, Chenjie Gu, Christina Sorokin, Maxim Krikun, Alexey Guseynov, Jessica Landon, Romina Datta, Alexander Pritzel, Phoebe Thacker, Fan Yang, Kevin Hui, Anja Hauth, Chih-Kuan Yeh, David Barker, Justin Mao-Jones, Sophia Austin, Hannah Sheahan, Parker Schuh, James Svensson, Rohan Jain, Vinay Ramasesh, Anton Briukhov, Da-Woon Chung, Tamara von Glehn, Christina Butterfield, Priya Jhakra, Matthew Wiethoff, Justin Frye, Jordan Grimstad, Beer Changpinyo, Charline Le Lan, Anna Bortsova, Yonghui Wu, Paul Voigtlaender, Tara Sainath, Shane Gu, Charlotte Smith, Will Hawkins, Kris Cao, James Besley, Srivatsan Srinivasan, Mark Omernick, Colin Gaffney, Gabriela Surita, Ryan Burnell, Bogdan Damoc, Junwhan Ahn, Andrew Brock, Mantas Pajarskas, Anastasia Petrushkina, Seb Noury, Lorenzo Blanco, Kevin Swersky, Arun Ahuja, Thi Avrahami, Vedant Misra, Raoul de Liedekerke, Mariko Iinuma, Alex Polozov, Sarah York, George van den Driessche, Paul Michel, Justin Chiu, Rory Blevins, Zach Gleicher, Adri\`a Recasens, Alban Rrustemi, Elena Gribovskaya, Aurko Roy, Wiktor Gworek, S\'ebastien M. R. Arnold, Lisa Lee, James Lee-Thorp, Marcello Maggioni, Enrique Piqueras, Kartikeya Badola, Sharad Vikram, Lucas Gonzalez, Anirudh Baddepudi, Evan Senter, Jacob Devlin, James Qin, Michael Azzam, Maja Trebacz, Martin Polacek, Kashyap Krishnakumar, Shuo-yiin Chang, Matthew Tung, Ivo Penchev, Rishabh Joshi, Kate Olszewska, Carrie Muir, Mateo Wirth, Ale Jakse Hartman, Josh Newlan, Sheleem Kashem, Vijay Bolina, Elahe Dabir, Joost van Amersfoort, Zafarali Ahmed, James Cobon-Kerr, Aishwarya Kamath, Arnar Mar Hrafnkelsson, Le Hou, Ian Mackinnon, Alexandre Frechette, Eric Noland, Xiance Si, Emanuel Taropa, Dong Li, Phil Crone, Anmol Gulati, S\'ebastien Cevey, Jonas Adler, Ada Ma, David Silver, Simon Tokumine, Richard Powell, Stephan Lee, Kiran Vodrahalli, Samer Hassan, Diana Mincu, Antoine Yang, Nir Levine, Jenny Brennan, Mingqiu Wang, Sarah Hodkinson, Jeffrey Zhao, Josh Lipschultz, Aedan Pope, Michael B. Chang, Cheng Li, Laurent El Shafey, Michela Paganini, Sholto Douglas, Bernd Bohnet, Fabio Pardo, Seth Odoom, Mihaela Rosca, Cicero Nogueira dos Santos, Kedar Soparkar, Arthur Guez, Tom Hudson, Steven Hansen, Chulayuth Asawaroengchai, Ravi Addanki, Tianhe Yu, Wojciech Stokowiec, Mina Khan, Justin Gilmer, Jaehoon Lee, Carrie Grimes Bostock, Keran Rong, Jonathan Caton, Pedram Pejman, Filip Pavetic, Geoff Brown, Vivek Sharma, Mario Lu\v{c}i\'c, Rajkumar Samuel, Josip Djolonga, Amol Mandhane, Lars Lowe Sj\"osund, Elena Buchatskaya, Elspeth White, Natalie Clay, Jiepu Jiang, Hyeontaek Lim, Ross Hemsley, Zeyncep Cankara, Jane Labanowski, Nicola De Cao, David Steiner, Sayed Hadi Hashemi, Jacob Austin, Anita Gergely, Tim Blyth, Joe Stanton, Kaushik Shivakumar, Aditya Siddhant, Anders Andreassen, Carlos Araya, Nikhil Sethi, Rakesh Shivanna, Steven Hand, Ankur Bapna, Ali Khodaei, Antoine Miech, Garrett Tanzer, Andy Swing, Shantanu Thakoor, Lora Aroyo, Zhufeng Pan, Zachary Nado, Jakub Sygnowski, Stephanie Winkler, Dian Yu, Mohammad Saleh, Loren Maggiore, Yamini Bansal, Xavier Garcia, Mehran Kazemi, Piyush Patil, Ishita Dasgupta, Iain Barr, Minh Giang, Thais Kagohara, Ivo Danihelka, Amit Marathe, Vladimir Feinberg, Mohamed Elhawaty, Nimesh Ghelani, Dan Horgan, Helen Miller, Lexi Walker, Richard Tanburn, Mukarram Tariq, Disha Shrivastava, Fei Xia, Qingze Wang, Chung-Cheng Chiu, Zoe Ashwood, Khuslen Baatarsukh, Sina Samangooei, Rapha\"el Lopez Kaufman, Fred Alcober, Axel Stjerngren, Paul Komarek, Katerina Tsihlas, Anudhyan Boral, Ramona Comanescu, Jeremy Chen, Ruibo Liu, Chris Welty, Dawn Bloxwich, Charlie Chen, Yanhua Sun, Fangxiaoyu Feng, Matthew Mauger, Xerxes Dotiwalla, Vincent Hellendoorn, Michael Sharman, Ivy Zheng, Krishna Haridasan, Gabe Barth-Maron, Craig Swanson, Dominika Rogozi\'nska, Alek Andreev, Paul Kishan Rubenstein, Ruoxin Sang, Dan Hurt, Gamaleldin Elsayed, Renshen Wang, Dave Lacey, Anastasija Ili\'c, Yao Zhao, Adam Iwanicki, Alejandro Lince, Alexander Chen, Christina Lyu, Carl Lebsack, Jordan Griffith, Meenu Gaba, Paramjit Sandhu, Phil Chen, Anna Koop, Ravi Rajwar, Soheil Hassas Yeganeh, Solomon Chang, Rui Zhu, Soroush Radpour, Elnaz Davoodi, Ving Ian Lei, Yang Xu, Daniel Toyama, Constant Segal, Martin Wicke, Hanzhao Lin, Anna Bulanova, Adri\`a Puigdom\`enech Badia, Nemanja Raki\'cevi\'c, Pablo Sprechmann, Angelos Filos, Shaobo Hou, V\'ictor Campos, Nora Kassner, Devendra Sachan, Meire Fortunato, Chimezie Iwuanyanwu, Vitaly Nikolaev, Balaji Lakshminarayanan, Sadegh Jazayeri, Mani Varadarajan, Chetan Tekur, Doug Fritz, Misha Khalman, David Reitter, Kingshuk Dasgupta, Shourya Sarcar, Tina Ornduff, Javier Snaider, Fantine Huot, Johnson Jia, Rupert Kemp, Nejc Trdin, Anitha Vijayakumar, Lucy Kim, Christof Angermueller, Li Lao, Tianqi Liu, Haibin Zhang, David Engel, Somer Greene, Ana\"is White, Jessica Austin, Lilly Taylor, Shereen Ashraf, Dangyi Liu, Maria Georgaki, Irene Cai, Yana Kulizhskaya, Sonam Goenka, Brennan Saeta, Ying Xu, Christian Frank, Dario de Cesare, Brona Robenek, Harry Richardson, Mahmoud Alnahlawi, Christopher Yew, Priya Ponnapalli, Marco Tagliasacchi, Alex Korchemniy, Yelin Kim, Dinghua Li, Bill Rosgen, Kyle Levin, Jeremy Wiesner, Praseem Banzal, Praveen Srinivasan, Hongkun Yu, \c{C}a\u{g}lar \"Unl\"u, David Reid, Zora Tung, Daniel Finchelstein, Ravin Kumar, Andre Elisseeff, Jin Huang, Ming Zhang, Ricardo Aguilar, Mai Gim\'enez, Jiawei Xia, Olivier Dousse, Willi Gierke, Damion Yates, Komal Jalan, Lu Li, Eri Latorre-Chimoto, Duc Dung Nguyen, Ken Durden, Praveen Kallakuri, Yaxin Liu, Matthew Johnson, Tomy Tsai, Alice Talbert, Jasmine Liu, Alexander Neitz, Chen Elkind, Marco Selvi, Mimi Jasarevic, Livio Baldini Soares, Albert Cui, Pidong Wang, Alek Wenjiao Wang, Xinyu Ye, Krystal Kallarackal, Lucia Loher, Hoi Lam, Josef Broder, Dan Holtmann-Rice, Nina Martin, Bramandia Ramadhana, Mrinal Shukla, Sujoy Basu, Abhi Mohan, Nick Fernando, Noah Fiedel, Kim Paterson, Hui Li, Ankush Garg, Jane Park, DongHyun Choi, Diane Wu, Sankalp Singh, Zhishuai Zhang, Amir Globerson, Lily Yu, John Carpenter, F\'elix de Chaumont Quitry, Carey Radebaugh, Chu-Cheng Lin, Alex Tudor, Prakash Shroff, Drew Garmon, Dayou Du, Neera Vats, Han Lu, Shariq Iqbal, Alex Yakubovich, Nilesh Tripuraneni, James Manyika, Haroon Qureshi, Nan Hua, Christel Ngani, Maria Abi Raad, Hannah Forbes, Jeff Stanway, Mukund Sundararajan, Victor Ungureanu, Colton Bishop, Yunjie Li, Balaji Venkatraman, Bo Li, Chloe Thornton, Salvatore Scellato, Nishesh Gupta, Yicheng Wang, Ian Tenney, Xihui Wu, Ashish Shenoy, Gabriel Carvajal, Diana Gage Wright, Ben Bariach, Zhuyun Xiao, Peter Hawkins, Sid Dalmia, Clement Farabet, Pedro Valenzuela, Quan Yuan, Ananth Agarwal, Mia Chen, Wooyeol Kim, Brice Hulse, Nandita Dukkipati, Adam Paszke, Andrew Bolt, Kiam Choo, Jennifer Beattie, Jennifer Prendki, Harsha Vashisht, Rebeca Santamaria-Fernandez, Luis C. Cobo, Jarek Wilkiewicz, David Madras, Ali Elqursh, Grant Uy, Kevin Ramirez, Matt Harvey, Tyler Liechty, Heiga Zen, Jeff Seibert, Clara Huiyi Hu, Andrey Khorlin, Maigo Le, Asaf Aharoni, Megan Li, Lily Wang, Sandeep Kumar, Norman Casagrande, Jay Hoover, Dalia El Badawy, David Soergel, Denis Vnukov, Matt Miecnikowski, Jiri Simsa, Praveen Kumar, Thibault Sellam, Daniel Vlasic, Samira Daruki, Nir Shabat, John Zhang, Guolong Su, Jiageng Zhang, Jeremiah Liu, Yi Sun, Evan Palmer, Alireza Ghaffarkhah, Xi Xiong, Victor Cotruta, Michael Fink, Lucas Dixon, Ashwin Sreevatsa, Adrian Goedeckemeyer, Alek Dimitriev, Mohsen Jafari, Remi Crocker, Nicholas FitzGerald, Aviral Kumar, Sanjay Ghemawat, Ivan Philips, Frederick Liu, Yannie Liang, Rachel Sterneck, Alena Repina, Marcus Wu, Laura Knight, Marin Georgiev, Hyo Lee, Harry Askham, Abhishek Chakladar, Annie Louis, Carl Crous, Hardie Cate, Dessie Petrova, Michael Quinn, Denese Owusu-Afriyie, Achintya Singhal, Nan Wei, Solomon Kim, Damien Vincent, Milad Nasr, Christopher A. Choquette-Choo, Reiko Tojo, Shawn Lu, Diego de Las Casas, Yuchung Cheng, Tolga Bolukbasi, Katherine Lee, Saaber Fatehi, Rajagopal Ananthanarayanan, Miteyan Patel, Charbel Kaed, Jing Li, Shreyas Rammohan Belle, Zhe Chen, Jaclyn Konzelmann, Siim P\~oder, Roopal Garg, Vinod Koverkathu, Adam Brown, Chris Dyer, Rosanne Liu, Azade Nova, Jun Xu, Alanna Walton, Alicia Parrish, Mark Epstein, Sara McCarthy, Slav Petrov, Demis Hassabis, Koray Kavukcuoglu, Jeffrey Dean, Oriol Vinyals

Abstract: In this report, we present the latest model of the Gemini family, Gemini 1.5 Pro, a highly compute-efficient multimodal mixture-of-experts model capable of recalling and reasoning over fine-grained information from millions of tokens of context, including multiple long documents and hours of video and audio. Gemini 1.5 Pro achieves near-perfect recall on long-context retrieval tasks across modalities, improves the state-of-the-art in long-document QA, long-video QA and long-context ASR, and matches or surpasses Gemini 1.0 Ultra's state-of-the-art performance across a broad set of benchmarks. Studying the limits of Gemini 1.5 Pro's long-context ability, we find continued improvement in next-token prediction and near-perfect retrieval (>99%) up to at least 10M tokens, a generational leap over existing models such as Claude 2.1 (200k) and GPT-4 Turbo (128k). Finally, we highlight surprising new capabilities of large language models at the frontier; when given a grammar manual for Kalamang, a language with fewer than 200 speakers worldwide, the model learns to translate English to Kalamang at a similar level to a person who learned from the same content.

replace-cross CleanAgent: Automating Data Standardization with LLM-based Agents

Authors: Danrui Qi, Jiannan Wang

Abstract: Data standardization is a crucial part in data science life cycle. While tools like Pandas offer robust functionalities, their complexity and the manual effort required for customizing code to diverse column types pose significant challenges. Although large language models (LLMs) like ChatGPT have shown promise in automating this process through natural language understanding and code generation, it still demands expert-level programming knowledge and continuous interaction for prompt refinement. To solve these challenges, our key idea is to propose a Python library with declarative, unified APIs for standardizing column types, simplifying the code generation of LLM with concise API calls. We first propose Dataprep.Clean which is written as a component of the Dataprep Library, offers a significant reduction in complexity by enabling the standardization of specific column types with a single line of code. Then we introduce the CleanAgent framework integrating Dataprep.Clean and LLM-based agents to automate the data standardization process. With CleanAgent, data scientists need only provide their requirements once, allowing for a hands-free, automatic standardization process.

replace-cross BLINK: Multimodal Large Language Models Can See but Not Perceive

Authors: Xingyu Fu, Yushi Hu, Bangzheng Li, Yu Feng, Haoyu Wang, Xudong Lin, Dan Roth, Noah A. Smith, Wei-Chiu Ma, Ranjay Krishna

Abstract: We introduce Blink, a new benchmark for multimodal language models (LLMs) that focuses on core visual perception abilities not found in other evaluations. Most of the Blink tasks can be solved by humans "within a blink" (e.g., relative depth estimation, visual correspondence, forensics detection, and multi-view reasoning). However, we find these perception-demanding tasks cast significant challenges for current multimodal LLMs because they resist mediation through natural language. Blink reformats 14 classic computer vision tasks into 3,807 multiple-choice questions, paired with single or multiple images and visual prompting. While humans get 95.70% accuracy on average, Blink is surprisingly challenging for existing multimodal LLMs: even the best-performing GPT-4V and Gemini achieve accuracies of 51.26% and 45.72%, only 13.17% and 7.63% higher than random guessing, indicating that such perception abilities have not "emerged" yet in recent multimodal LLMs. Our analysis also highlights that specialist CV models could solve these problems much better, suggesting potential pathways for future improvements. We believe Blink will stimulate the community to help multimodal LLMs catch up with human-level visual perception.

replace-cross Leveraging Large Language Model as Simulated Patients for Clinical Education

Authors: Yanzeng Li, Cheng Zeng, Jialun Zhong, Ruoyu Zhang, Minhao Zhang, Lei Zou

Abstract: Simulated Patients (SPs) play a crucial role in clinical medical education by providing realistic scenarios for student practice. However, the high cost of training and hiring qualified SPs, along with the heavy workload and potential risks they face in consistently portraying actual patients, limit students' access to this type of clinical training. Consequently, the integration of computer program-based simulated patients has emerged as a valuable educational tool in recent years. With the rapid development of Large Language Models (LLMs), their exceptional capabilities in conversational artificial intelligence and role-playing have been demonstrated, making them a feasible option for implementing Virtual Simulated Patient (VSP). In this paper, we present an integrated model-agnostic framework called CureFun that harnesses the potential of LLMs in clinical medical education. This framework facilitates natural conversations between students and simulated patients, evaluates their dialogue, and provides suggestions to enhance students' clinical inquiry skills. Through comprehensive evaluations, our approach demonstrates more authentic and professional SP-scenario dialogue flows compared to other LLM-based chatbots, thus proving its proficiency in simulating patients. Additionally, leveraging CureFun's evaluation ability, we assess several medical LLMs and discuss the possibilities and limitations of using LLMs as virtual doctors from the perspective of their diagnostic abilities.

replace-cross FedTAD: Topology-aware Data-free Knowledge Distillation for Subgraph Federated Learning

Authors: Yinlin Zhu, Xunkai Li, Zhengyu Wu, Di Wu, Miao Hu, Rong-Hua Li

Abstract: Subgraph federated learning (subgraph-FL) is a new distributed paradigm that facilitates the collaborative training of graph neural networks (GNNs) by multi-client subgraphs. Unfortunately, a significant challenge of subgraph-FL arises from subgraph heterogeneity, which stems from node and topology variation, causing the impaired performance of the global GNN. Despite various studies, they have not yet thoroughly investigated the impact mechanism of subgraph heterogeneity. To this end, we decouple node and topology variation, revealing that they correspond to differences in label distribution and structure homophily. Remarkably, these variations lead to significant differences in the class-wise knowledge reliability of multiple local GNNs, misguiding the model aggregation with varying degrees. Building on this insight, we propose topology-aware data-free knowledge distillation technology (FedTAD), enhancing reliable knowledge transfer from the local model to the global model. Extensive experiments on six public datasets consistently demonstrate the superiority of FedTAD over state-of-the-art baselines.

replace-cross FlashSpeech: Efficient Zero-Shot Speech Synthesis

Authors: Zhen Ye, Zeqian Ju, Haohe Liu, Xu Tan, Jianyi Chen, Yiwen Lu, Peiwen Sun, Jiahao Pan, Weizhen Bian, Shulin He, Qifeng Liu, Yike Guo, Wei Xue

Abstract: Recent progress in large-scale zero-shot speech synthesis has been significantly advanced by language models and diffusion models. However, the generation process of both methods is slow and computationally intensive. Efficient speech synthesis using a lower computing budget to achieve quality on par with previous work remains a significant challenge. In this paper, we present FlashSpeech, a large-scale zero-shot speech synthesis system with approximately 5\% of the inference time compared with previous work. FlashSpeech is built on the latent consistency model and applies a novel adversarial consistency training approach that can train from scratch without the need for a pre-trained diffusion model as the teacher. Furthermore, a new prosody generator module enhances the diversity of prosody, making the rhythm of the speech sound more natural. The generation processes of FlashSpeech can be achieved efficiently with one or two sampling steps while maintaining high audio quality and high similarity to the audio prompt for zero-shot speech generation. Our experimental results demonstrate the superior performance of FlashSpeech. Notably, FlashSpeech can be about 20 times faster than other zero-shot speech synthesis systems while maintaining comparable performance in terms of voice quality and similarity. Furthermore, FlashSpeech demonstrates its versatility by efficiently performing tasks like voice conversion, speech editing, and diverse speech sampling. Audio samples can be found in https://flashspeech.github.io/.

URLs: https://flashspeech.github.io/.

replace-cross The Promise and Challenges of Using LLMs to Accelerate the Screening Process of Systematic Reviews

Authors: Aleksi Huotala, Miikka Kuutila, Paul Ralph, Mika M\"antyl\"a

Abstract: Systematic review (SR) is a popular research method in software engineering (SE). However, conducting an SR takes an average of 67 weeks. Thus, automating any step of the SR process could reduce the effort associated with SRs. Our objective is to investigate if Large Language Models (LLMs) can accelerate title-abstract screening by simplifying abstracts for human screeners, and automating title-abstract screening. We performed an experiment where humans screened titles and abstracts for 20 papers with both original and simplified abstracts from a prior SR. The experiment with human screeners was reproduced with GPT-3.5 and GPT-4 LLMs to perform the same screening tasks. We also studied if different prompting techniques (Zero-shot (ZS), One-shot (OS), Few-shot (FS), and Few-shot with Chain-of-Thought (FS-CoT)) improve the screening performance of LLMs. Lastly, we studied if redesigning the prompt used in the LLM reproduction of screening leads to improved performance. Text simplification did not increase the screeners' screening performance, but reduced the time used in screening. Screeners' scientific literacy skills and researcher status predict screening performance. Some LLM and prompt combinations perform as well as human screeners in the screening tasks. Our results indicate that the GPT-4 LLM is better than its predecessor, GPT-3.5. Additionally, Few-shot and One-shot prompting outperforms Zero-shot prompting. Using LLMs for text simplification in the screening process does not significantly improve human performance. Using LLMs to automate title-abstract screening seems promising, but current LLMs are not significantly more accurate than human screeners. To recommend the use of LLMs in the screening process of SRs, more research is needed. We recommend future SR studies publish replication packages with screening data to enable more conclusive experimenting with LLM screening.

replace-cross Retrieval and Distill: A Temporal Data Shift-Free Paradigm for Online Recommendation System

Authors: Lei Zheng, Ning Li, Weinan Zhang, Yong Yu

Abstract: Current recommendation systems are significantly affected by a serious issue of temporal data shift, which is the inconsistency between the distribution of historical data and that of online data. Most existing models focus on utilizing updated data, overlooking the transferable, temporal data shift-free information that can be learned from shifting data. We propose the Temporal Invariance of Association theorem, which suggests that given a fixed search space, the relationship between the data and the data in the search space keeps invariant over time. Leveraging this principle, we designed a retrieval-based recommendation system framework that can train a data shift-free relevance network using shifting data, significantly enhancing the predictive performance of the original model in the recommendation system. However, retrieval-based recommendation models face substantial inference time costs when deployed online. To address this, we further designed a distill framework that can distill information from the relevance network into a parameterized module using shifting data. The distilled model can be deployed online alongside the original model, with only a minimal increase in inference time. Extensive experiments on multiple real datasets demonstrate that our framework significantly improves the performance of the original model by utilizing shifting data.

replace-cross HDBN: A Novel Hybrid Dual-branch Network for Robust Skeleton-based Action Recognition

Authors: Jinfu Liu, Baiqiao Yin, Jiaying Lin, Jiajun Wen, Yue Li, Mengyuan Liu

Abstract: Skeleton-based action recognition has gained considerable traction thanks to its utilization of succinct and robust skeletal representations. Nonetheless, current methodologies often lean towards utilizing a solitary backbone to model skeleton modality, which can be limited by inherent flaws in the network backbone. To address this and fully leverage the complementary characteristics of various network architectures, we propose a novel Hybrid Dual-Branch Network (HDBN) for robust skeleton-based action recognition, which benefits from the graph convolutional network's proficiency in handling graph-structured data and the powerful modeling capabilities of Transformers for global information. In detail, our proposed HDBN is divided into two trunk branches: MixGCN and MixFormer. The two branches utilize GCNs and Transformers to model both 2D and 3D skeletal modalities respectively. Our proposed HDBN emerged as one of the top solutions in the Multi-Modal Video Reasoning and Analyzing Competition (MMVRAC) of 2024 ICME Grand Challenge, achieving accuracies of 47.95% and 75.36% on two benchmarks of the UAV-Human dataset by outperforming most existing methods. Our code will be publicly available at: https://github.com/liujf69/ICMEW2024-Track10.

URLs: https://github.com/liujf69/ICMEW2024-Track10.

replace-cross What Makes Multimodal In-Context Learning Work?

Authors: Folco Bertini Baldassini, Mustafa Shukor, Matthieu Cord, Laure Soulier, Benjamin Piwowarski

Abstract: Large Language Models have demonstrated remarkable performance across various tasks, exhibiting the capacity to swiftly acquire new skills, such as through In-Context Learning (ICL) with minimal demonstration examples. In this work, we present a comprehensive framework for investigating Multimodal ICL (M-ICL) in the context of Large Multimodal Models. We consider the best open-source multimodal models (e.g., IDEFICS, OpenFlamingo) and a wide range of multimodal tasks. Our study unveils several noteworthy findings: (1) M-ICL primarily relies on text-driven mechanisms, showing little to no influence from the image modality. (2) When used with advanced-ICL strategy (like RICES), M-ICL is not better than a simple strategy based on majority voting over context examples. Moreover, we identify several biases and limitations of M-ICL that warrant consideration prior to deployment. Code available at https://gitlab.com/folbaeni/multimodal-icl

URLs: https://gitlab.com/folbaeni/multimodal-icl

replace-cross Unexplored Faces of Robustness and Out-of-Distribution: Covariate Shifts in Environment and Sensor Domains

Authors: Eunsu Baek, Keondo Park, Jiyoon Kim, Hyung-Sin Kim

Abstract: Computer vision applications predict on digital images acquired by a camera from physical scenes through light. However, conventional robustness benchmarks rely on perturbations in digitized images, diverging from distribution shifts occurring in the image acquisition process. To bridge this gap, we introduce a new distribution shift dataset, ImageNet-ES, comprising variations in environmental and camera sensor factors by directly capturing 202k images with a real camera in a controllable testbed. With the new dataset, we evaluate out-of-distribution (OOD) detection and model robustness. We find that existing OOD detection methods do not cope with the covariate shifts in ImageNet-ES, implying that the definition and detection of OOD should be revisited to embrace real-world distribution shifts. We also observe that the model becomes more robust in both ImageNet-C and -ES by learning environment and sensor variations in addition to existing digital augmentations. Lastly, our results suggest that effective shift mitigation via camera sensor control can significantly improve performance without increasing model size. With these findings, our benchmark may aid future research on robustness, OOD, and camera sensor control for computer vision. Our code and dataset are available at https://github.com/Edw2n/ImageNet-ES.

URLs: https://github.com/Edw2n/ImageNet-ES.

replace-cross Decentralized Personalized Federated Learning based on a Conditional Sparse-to-Sparser Scheme

Authors: Qianyu Long, Qiyuan Wang, Christos Anagnostopoulos, Daning Bi

Abstract: Decentralized Federated Learning (DFL) has become popular due to its robustness and avoidance of centralized coordination. In this paradigm, clients actively engage in training by exchanging models with their networked neighbors. However, DFL introduces increased costs in terms of training and communication. Existing methods focus on minimizing communication often overlooking training efficiency and data heterogeneity. To address this gap, we propose a novel \textit{sparse-to-sparser} training scheme: DA-DPFL. DA-DPFL initializes with a subset of model parameters, which progressively reduces during training via \textit{dynamic aggregation} and leads to substantial energy savings while retaining adequate information during critical learning periods. Our experiments showcase that DA-DPFL substantially outperforms DFL baselines in test accuracy, while achieving up to $5$ times reduction in energy costs. We provide a theoretical analysis of DA-DPFL's convergence by solidifying its applicability in decentralized and personalized learning. The code is available at:https://github.com/EricLoong/da-dpfl

URLs: https://github.com/EricLoong/da-dpfl