Authors: Seong-Hyun Hong, Hyun-Sung Kim, Zian Jang, Byung-Jun Lee
Abstract: Recent advancements in learning-based combinatorial optimization (CO) methods have shown promising results in solving NP-hard problems without the need for expert-crafted heuristics. However, high performance of these approaches often rely on problem-specific human-expertise-based search after generating candidate solutions, limiting their applicability to commonly solved CO problems such as Travelling Salesman Problem (TSP). In this paper, we present IC/DC, a CO framework that operates without any supervision. IC/DC is specialized in addressing problems involving two distinct sets of items, and it does not need problem-specific search processes to generate valid solutions. IC/DC employs a novel architecture capable of capturing the intricate relationships between items, and thereby enabling effective optimization in challenging CO scenarios. We train our model in a self-supervised way to minimize the cost of the solution while adhering to the problem-specific constraints. IC/DC not only achieves state-of-the-art performance compared to previous learning methods, but also surpasses well-known solvers and heuristic approaches on Asymmetric Traveling Salesman Problem (ATSP).
Authors: J. M. Conejero, J. C. Preciado, A. E. Prieto, M. C. Bas, V. J. Bolos
Abstract: In this paper we present a multi-criteria classification of Vocational and Educational Programs in Extremadura (Spain) during the period 2009-2016. This ranking has been carried out through the integration into a complete database of the detailed information of individuals finishing such studies together with their labor data. The multicriteria method used is TOPSIS together with a new decision support method for assessing the influence of each criterion and its dependence on the weights assigned to them. This new method is based on a worst-best case scenario analysis and it is compared to a well known global sensitivity analysis technique based on the Pearson's correlation ratio.
Authors: Altera. AL, Andrew Ahn, Nic Becker, Stephanie Carroll, Nico Christie, Manuel Cortes, Arda Demirci, Melissa Du, Frankie Li, Shuying Luo, Peter Y Wang, Mathew Willows, Feitong Yang, Guangyu Robert Yang
Abstract: AI agents have been evaluated in isolation or within small groups, where interactions remain limited in scope and complexity. Large-scale simulations involving many autonomous agents -- reflecting the full spectrum of civilizational processes -- have yet to be explored. Here, we demonstrate how 10 - 1000+ AI agents behave and progress within agent societies. We first introduce the PIANO (Parallel Information Aggregation via Neural Orchestration) architecture, which enables agents to interact with humans and other agents in real-time while maintaining coherence across multiple output streams. We then evaluate agent performance in agent simulations using civilizational benchmarks inspired by human history. These simulations, set within a Minecraft environment, reveal that agents are capable of meaningful progress -- autonomously developing specialized roles, adhering to and changing collective rules, and engaging in cultural and religious transmission. These preliminary results show that agents can achieve significant milestones towards AI civilizations, opening new avenues for large simulations, agentic organizational intelligence, and integrating AI into human civilizations.
Authors: Chunyan Mu, Muhammad Najib, Nir Oren
Abstract: Responsibility plays a key role in the development and deployment of trustworthy autonomous systems. In this paper, we focus on the problem of strategic reasoning in probabilistic multi-agent systems with responsibility-aware agents. We introduce the logic PATL+R, a variant of Probabilistic Alternating-time Temporal Logic. The novelty of PATL+R lies in its incorporation of modalities for causal responsibility, providing a framework for responsibility-aware multi-agent strategic reasoning. We present an approach to synthesise joint strategies that satisfy an outcome specified in PATL+R, while optimising the share of expected causal responsibility and reward. This provides a notion of balanced distribution of responsibility and reward gain among agents. To this end, we utilise the Nash equilibrium as the solution concept for our strategic reasoning problem and demonstrate how to compute responsibility-aware Nash equilibrium strategies via a reduction to parametric model checking of concurrent stochastic multi-player games.
Authors: Elliot Meyerson, Olivier Francon, Darren Sargent, Babak Hodjat, Risto Miikkulainen
Abstract: Solving societal problems on a global scale requires the collection and processing of ideas and methods from diverse sets of international experts. As the number and diversity of human experts increase, so does the likelihood that elements in this collective knowledge can be combined and refined to discover novel and better solutions. However, it is difficult to identify, combine, and refine complementary information in an increasingly large and diverse knowledge base. This paper argues that artificial intelligence (AI) can play a crucial role in this process. An evolutionary AI framework, termed RHEA, fills this role by distilling knowledge from diverse models created by human experts into equivalent neural networks, which are then recombined and refined in a population-based search. The framework was implemented in a formal synthetic domain, demonstrating that it is transparent and systematic. It was then applied to the results of the XPRIZE Pandemic Response Challenge, in which over 100 teams of experts across 23 countries submitted models based on diverse methodologies to predict COVID-19 cases and suggest non-pharmaceutical intervention policies for 235 nations, states, and regions across the globe. Building upon this expert knowledge, by recombining and refining the 169 resulting policy suggestion models, RHEA discovered a broader and more effective set of policies than either AI or human experts alone, as evaluated based on real-world data. The results thus suggest that AI can play a crucial role in realizing the potential of human expertise in global problem-solving.
Authors: Yu Pan, Jianxin Sun, Hongfeng Yu, Joe Luck, Geng Bai, Nipuna Chamara, Yufeng Ge, Tala Awada
Abstract: Current agricultural data management and analysis paradigms are to large extent traditional, in which data collecting, curating, integration, loading, storing, sharing and analyzing still involve too much human effort and know-how. The experts, researchers and the farm operators need to understand the data and the whole process of data management pipeline to make fully use of the data. The essential problem of the traditional paradigm is the lack of a layer of orchestrational intelligence which can understand, organize and coordinate the data processing utilities to maximize data management and analysis outcome. The emerging reasoning and tool mastering abilities of large language models (LLM) make it a potentially good fit to this position, which helps a shift from the traditional user-driven paradigm to AI-driven paradigm. In this paper, we propose and explore the idea of a LLM based copilot for autonomous agricultural data management and analysis. Based on our previously developed platform of Agricultural Data Management and Analytics (ADMA), we build a proof-of-concept multi-agent system called ADMA Copilot, which can understand user's intent, makes plans for data processing pipeline and accomplishes tasks automatically, in which three agents: a LLM based controller, an input formatter and an output formatter collaborate together. Different from existing LLM based solutions, by defining a meta-program graph, our work decouples control flow and data flow to enhance the predictability of the behaviour of the agents. Experiments demonstrates the intelligence, autonomy, efficacy, efficiency, extensibility, flexibility and privacy of our system. Comparison is also made between ours and existing systems to show the superiority and potential of our system.
Authors: Declan Campbell, Sunayana Rane, Tyler Giallanza, Nicol\`o De Sabbata, Kia Ghods, Amogh Joshi, Alexander Ku, Steven M. Frankland, Thomas L. Griffiths, Jonathan D. Cohen, Taylor W. Webb
Abstract: Recent work has documented striking heterogeneity in the performance of state-of-the-art vision language models (VLMs), including both multimodal language models and text-to-image models. These models are able to describe and generate a diverse array of complex, naturalistic images, yet they exhibit surprising failures on basic multi-object reasoning tasks -- such as counting, localization, and simple forms of visual analogy -- that humans perform with near perfect accuracy. To better understand this puzzling pattern of successes and failures, we turn to theoretical accounts of the binding problem in cognitive science and neuroscience, a fundamental problem that arises when a shared set of representational resources must be used to represent distinct entities (e.g., to represent multiple objects in an image), necessitating the use of serial processing to avoid interference. We find that many of the puzzling failures of state-of-the-art VLMs can be explained as arising due to the binding problem, and that these failure modes are strikingly similar to the limitations exhibited by rapid, feedforward processing in the human brain.
Authors: Grace Guo, Jenna Jiayi Kang, Raj Sanjay Shah, Hanspeter Pfister, Sashank Varma
Abstract: Vision Language Models (VLMs) have been successful at many chart comprehension tasks that require attending to both the images of charts and their accompanying textual descriptions. However, it is not well established how VLM performance profiles map to human-like behaviors. If VLMs can be shown to have human-like chart comprehension abilities, they can then be applied to a broader range of tasks, such as designing and evaluating visualizations for human readers. This paper lays the foundations for such applications by evaluating the accuracy of zero-shot prompting of VLMs on graphical perception tasks with established human performance profiles. Our findings reveal that VLMs perform similarly to humans under specific task and style combinations, suggesting that they have the potential to be used for modeling human performance. Additionally, variations to the input stimuli show that VLM accuracy is sensitive to stylistic changes such as fill color and chart contiguity, even when the underlying data and data mappings are the same.
Authors: Sina Rismanchian, Yasaman Razeghi, Sameer Singh, Shayan Doroudi
Abstract: Humans have the ability to reason about geometric patterns in images and scenes from a young age. However, developing large multimodal models (LMMs) capable of similar reasoning remains a challenge, highlighting the need for robust evaluation methods to assess these capabilities. We introduce TurtleBench, a benchmark designed to evaluate LMMs' capacity to interpret geometric patterns -- given visual examples, textual instructions, or both -- and generate precise code outputs. Inspired by turtle geometry, a notion used to teach children foundational coding and geometric concepts, TurtleBench features tasks with patterned shapes that have underlying algorithmic logic. Our evaluation reveals that leading LMMs struggle significantly with these tasks, with GPT-4o achieving only 19\% accuracy on the simplest tasks and few-shot prompting only marginally improves their performance ($<2\%$). TurtleBench highlights the gap between human and AI performance in intuitive and visual geometrical understanding, setting the stage for future research in this area. TurtleBench stands as one of the few benchmarks to evaluate the integration of visual understanding and code generation capabilities in LMMs, setting the stage for future research. Code and Dataset for this paper is provided here: https://github.com/sinaris76/TurtleBench
Authors: Daijin Yang, Erica Kleinman, Casper Harteveld
Abstract: Due to GPT's impressive generative capabilities, its applications in games are expanding rapidly. To offer researchers a comprehensive understanding of the current applications and identify both emerging trends and unexplored areas, this paper introduces an updated scoping review of 131 articles, 76 of which were published in 2024, to explore GPT's potential for games. By coding and synthesizing the papers, we identify five prominent applications of GPT in current game research: procedural content generation, mixed-initiative game design, mixed-initiative gameplay, playing games, and game user research. Drawing on insights from these application areas and emerging research, we propose future studies should focus on expanding the technical boundaries of the GPT models and exploring the complex interaction dynamics between them and users. This review aims to illustrate the state of the art in innovative GPT applications in games, offering a foundation to enrich game development and enhance player experiences through cutting-edge AI innovations.
Authors: Wout Gerdes, Erman Acar
Abstract: Credit card fraud detection is a critical concern for financial institutions, intensified by the rise of contactless payment technologies. While deep learning models offer high accuracy, their lack of explainability poses significant challenges in financial settings. This paper explores the integration of fuzzy logic into Deep Symbolic Regression (DSR) to enhance both performance and explainability in fraud detection. We investigate the effectiveness of different fuzzy logic implications, specifically {\L}ukasiewicz, G\"odel, and Product, in handling the complexity and uncertainty of fraud detection datasets. Our analysis suggest that the {\L}ukasiewicz implication achieves the highest F1-score and overall accuracy, while the Product implication offers a favorable balance between performance and explainability. Despite having a performance lower than state-of-the-art (SOTA) models due to information loss in data transformation, our approach provides novelty and insights into into integrating fuzzy logic into DSR for fraud detection, providing a comprehensive comparison between different implications and methods.
Authors: Zihong He, Weizhe Lin, Hao Zheng, Fan Zhang, Matt Jones, Laurence Aitchison, Xuhai Xu, Miao Liu, Per Ola Kristensson, Junxiao Shen
Abstract: With the rapid advancement of AI systems, their abilities to store, retrieve, and utilize information over the long term - referred to as long-term memory - have become increasingly significant. These capabilities are crucial for enhancing the performance of AI systems across a wide range of tasks. However, there is currently no comprehensive survey that systematically investigates AI's long-term memory capabilities, formulates a theoretical framework, and inspires the development of next-generation AI long-term memory systems. This paper begins by systematically introducing the mechanisms of human long-term memory, then explores AI long-term memory mechanisms, establishing a mapping between the two. Based on the mapping relationships identified, we extend the current cognitive architectures and propose the Cognitive Architecture of Self-Adaptive Long-term Memory (SALM). SALM provides a theoretical framework for the practice of AI long-term memory and holds potential for guiding the creation of next-generation long-term memory driven AI systems. Finally, we delve into the future directions and application prospects of AI long-term memory.
Authors: Dillon Z. Chen
Abstract: Scalable learning for planning research generally involves juggling between different programming languages for handling learning and planning modules effectively. Interpreted languages such as Python are commonly used for learning routines due to their ease of use and the abundance of highly maintained learning libraries they exhibit, while compiled languages such as C++ are used for planning routines due to their optimised resource usage. Motivated by the need for tools for developing scalable learning planners, we introduce WLPlan, a C++ package with Python bindings which implements recent promising work for automatically generating relational features of planning tasks. Such features can be used for any downstream routine, such as learning domain control knowledge or probing and understanding planning tasks. More specifically, WLPlan provides functionality for (1) transforming planning tasks into graphs, and (2) embedding planning graphs into feature vectors via graph kernels. The source code and instructions for the installation and usage of WLPlan are available at tinyurl.com/42kymswc
Authors: Bowen Li, Zhaoyu Li, Qiwei Du, Jinqi Luo, Wenshan Wang, Yaqi Xie, Simon Stepputtis, Chen Wang, Katia P. Sycara, Pradeep Kumar Ravikumar, Alexander G. Gray, Xujie Si, Sebastian Scherer
Abstract: Recent years have witnessed the rapid development of Neuro-Symbolic (NeSy) AI systems, which integrate symbolic reasoning into deep neural networks. However, most of the existing benchmarks for NeSy AI fail to provide long-horizon reasoning tasks with complex multi-agent interactions. Furthermore, they are usually constrained by fixed and simplistic logical rules over limited entities, making them far from real-world complexities. To address these crucial gaps, we introduce LogiCity, the first simulator based on customizable first-order logic (FOL) for an urban-like environment with multiple dynamic agents. LogiCity models diverse urban elements using semantic and spatial concepts, such as IsAmbulance(X) and IsClose(X, Y). These concepts are used to define FOL rules that govern the behavior of various agents. Since the concepts and rules are abstractions, they can be universally applied to cities with any agent compositions, facilitating the instantiation of diverse scenarios. Besides, a key feature of LogiCity is its support for user-configurable abstractions, enabling customizable simulation complexities for logical reasoning. To explore various aspects of NeSy AI, LogiCity introduces two tasks, one features long-horizon sequential decision-making, and the other focuses on one-step visual reasoning, varying in difficulty and agent behaviors. Our extensive evaluation reveals the advantage of NeSy frameworks in abstract reasoning. Moreover, we highlight the significant challenges of handling more complex abstractions in long-horizon multi-agent scenarios or under high-dimensional, imbalanced data. With its flexible design, various features, and newly raised challenges, we believe LogiCity represents a pivotal step forward in advancing the next generation of NeSy AI. All the code and data are open-sourced at our website.
Authors: Binggui Zhou, Xi Yang, Shaodan Ma, Feifei Gao, Guanghua Yang
Abstract: In TDD mmWave massive MIMO systems, the downlink CSI can be attained through uplink channel estimation thanks to the uplink-downlink channel reciprocity. However, the channel aging issue is significant under high-mobility scenarios and thus necessitates frequent uplink channel estimation. In addition, large amounts of antennas and subcarriers lead to high-dimensional CSI matrices, aggravating the pilot training overhead. To systematically reduce the pilot overhead, a spatial, frequency, and temporal domain (3D) channel extrapolation framework is proposed in this paper. Considering the marginal effects of pilots in the spatial and frequency domains and the effectiveness of traditional knowledge-driven channel estimation methods, we first propose a knowledge-and-data driven spatial-frequency channel extrapolation network (KDD-SFCEN) for uplink channel estimation by exploiting the least square estimator for coarse channel estimation and joint spatial-frequency channel extrapolation to reduce the spatial-frequency domain pilot overhead. Then, resorting to the uplink-downlink channel reciprocity and temporal domain dependencies of downlink channels, a temporal uplink-downlink channel extrapolation network (TUDCEN) is proposed for slot-level channel extrapolation, aiming to enlarge the pilot signal period and thus reduce the temporal domain pilot overhead under high-mobility scenarios. Specifically, we propose the spatial-frequency sampling embedding module to reduce the representation dimension and consequent computational complexity, and we propose to exploit the autoregressive generative Transformer for generating downlink channels autoregressively. Numerical results demonstrate the superiority of the proposed framework in significantly reducing the pilot training overhead by more than 16 times and improving the system's spectral efficiency under high-mobility scenarios.
Authors: Yu Bai, Yukai Miao, Li Chen, Dawei Wang, Dan Li, Yanyu Ren, Hongtao Xie, Ce Yang, Xuhui Cai
Abstract: RAG systems face limitations when semantic relevance alone does not guarantee improved generation quality. This issue becomes particularly evident due to the sensitivity of large language models (LLMs) to the ordering of few-shot prompts, which can affect model performance. To address this challenge, aligning LLM outputs with human preferences using structured feedback, such as options to copy, regenerate, or dislike, offers a promising method for improvement. This feedback is applied to the entire list of inputs rather than giving specific ratings for individual documents, making it a Listwide Labels Learning-to-Rank task. To address this task, we propose Pistis-RAG, a new RAG framework designed with a content-centric approach to better align LLMs with human preferences. Pistis-RAG effectively utilizes human feedback, enhancing content ranking and generation quality. To validate our framework, we use public datasets to simulate human feedback, allowing us to evaluate and refine our method effectively. Experimental results indicate that Pistis-RAG improves alignment with human preferences relative to the baseline RAG system, showing a 6.06% increase in MMLU (English) and a 7.08% increase in C-EVAL (Chinese) accuracy metrics. These results highlight Pistis-RAG's effectiveness in overcoming the limitations associated with traditional RAG approaches.
Authors: Liang He, Peiran Jin, Yaosen Min, Shufang Xie, Lijun Wu, Tao Qin, Xiaozhuan Liang, Kaiyuan Gao, Yuliang Jiang, Tie-Yan Liu
Abstract: Proteins, essential to biological systems, perform functions intricately linked to their three-dimensional structures. Understanding the relationship between protein structures and their amino acid sequences remains a core challenge in protein modeling. While traditional protein foundation models benefit from pre-training on vast unlabeled datasets, they often struggle to capture critical co-evolutionary information, which evolutionary-based methods excel at. In this study, we introduce a novel pre-training strategy for protein foundation models that emphasizes the interactions among amino acid residues to enhance the extraction of both short-range and long-range co-evolutionary features from sequence data. Trained on a large-scale protein sequence dataset, our model demonstrates superior generalization ability, outperforming established baselines of similar size, including the ESM model, across diverse downstream tasks. Experimental results confirm the model's effectiveness in integrating co-evolutionary information, marking a significant step forward in protein sequence-based modeling.
Authors: Meng Chen, Philip Arthur, Qianyu Feng, Cong Duy Vu Hoang, Yu-Heng Hong, Mahdi Kazemi Moghaddam, Omid Nezami, Thien Nguyen, Gioacchino Tangari, Duy Vu, Thanh Vu, Mark Johnson, Krishnaram Kenthapadi, Don Dharmasiri, Long Duong, Yuan-Fang Li
Abstract: Large language models (LLMs) have shown impressive performance in \emph{code} understanding and generation, making coding tasks a key focus for researchers due to their practical applications and value as a testbed for LLM evaluation. Data synthesis and filtering techniques have been widely adopted and shown to be highly effective in this context. In this paper, we present a focused survey and taxonomy of these techniques, emphasizing recent advancements. We highlight key challenges, explore future research directions, and offer practical guidance for new researchers entering the field.
Authors: Yaoqi Guo, Zhenpeng Chen, Jie M. Zhang, Yang Liu, Yun Ma
Abstract: Code generation, the automatic creation of source code from natural language descriptions, has garnered significant attention due to its potential to streamline software development. Inspired by research that links task-personality alignment with improved development outcomes, we conduct an empirical study on personality-guided code generation using large language models (LLMs). Specifically, we investigate how emulating personality traits appropriate to the coding tasks affects LLM performance. We extensively evaluate this approach using seven widely adopted LLMs across four representative datasets. Our results show that personality guidance significantly enhances code generation accuracy, with improved pass rates in 23 out of 28 LLM-dataset combinations. Notably, in 11 cases, the improvement exceeds 5%, and in 5 instances, it surpasses 10%, with the highest gain reaching 12.9%. Additionally, personality guidance can be easily integrated with other prompting strategies to further boost performance.
Authors: Oggi Rudovic, Pranay Dighe, Yi Su, Vineet Garg, Sameer Dharur, Xiaochuan Niu, Ahmed H. Abdelaziz, Saurabah Adya, Ahmed Tewfik
Abstract: Follow-up conversations with virtual assistants (VAs) enable a user to seamlessly interact with a VA without the need to repeatedly invoke it using a keyword (after the first query). Therefore, accurate Device-directed Speech Detection (DDSD) from the follow-up queries is critical for enabling naturalistic user experience. To this end, we explore the notion of Large Language Models (LLMs) and model the first query when making inference about the follow-ups (based on the ASR-decoded text), via prompting of a pretrained LLM, or by adapting a binary classifier on top of the LLM. In doing so, we also exploit the ASR uncertainty when designing the LLM prompts. We show on the real-world dataset of follow-up conversations that this approach yields large gains (20-40% reduction in false alarms at 10% fixed false rejects) due to the joint modeling of the previous speech context and ASR uncertainty, compared to when follow-ups are modeled alone.
Authors: Zifeng Wang, Hanyin Wang, Benjamin Danek, Ying Li, Christina Mack, Hoifung Poon, Yajun Wang, Pranav Rajpurkar, Jimeng Sun
Abstract: The integration of Large Language Models (LLMs) into medical applications has sparked widespread interest across the healthcare industry, from drug discovery and development to clinical decision support, assisting telemedicine, medical devices, and healthcare insurance applications. This perspective paper aims to discuss the inner workings of building LLM-powered medical AI applications and introduces a comprehensive framework for their development. We review existing literature and outline the unique challenges of applying LLMs in specialized medical contexts. Additionally, we introduce a three-step framework to organize medical LLM research activities: 1) Modeling: breaking down complex medical workflows into manageable steps for developing medical-specific models; 2) Optimization: optimizing the model performance with crafted prompts and integrating external knowledge and tools, and 3) System engineering: decomposing complex tasks into subtasks and leveraging human expertise for building medical AI applications. Furthermore, we offer a detailed use case playbook that describes various LLM-powered medical AI applications, such as optimizing clinical trial design, enhancing clinical decision support, and advancing medical imaging analysis. Finally, we discuss various challenges and considerations for building medical AI applications with LLMs, such as handling hallucination issues, data ownership and compliance, privacy, intellectual property considerations, compute cost, sustainability issues, and responsible AI requirements.
Authors: Zhilun Zhou, Jingyang Fan, Yu Liu, Fengli Xu, Depeng Jin, Yong Li
Abstract: The fast development of location-based social networks (LBSNs) has led to significant changes in society, resulting in popular studies of using LBSN data for socioeconomic prediction, e.g., regional population and commercial activity estimation. Existing studies design various graphs to model heterogeneous LBSN data, and further apply graph representation learning methods for socioeconomic prediction. However, these approaches heavily rely on heuristic ideas and expertise to extract task-relevant knowledge from diverse data, which may not be optimal for specific tasks. Additionally, they tend to overlook the inherent relationships between different indicators, limiting the prediction accuracy. Motivated by the remarkable abilities of large language models (LLMs) in commonsense reasoning, embedding, and multi-agent collaboration, in this work, we synergize LLM agents and knowledge graph for socioeconomic prediction. We first construct a location-based knowledge graph (LBKG) to integrate multi-sourced LBSN data. Then we leverage the reasoning power of LLM agent to identify relevant meta-paths in the LBKG for each type of socioeconomic prediction task, and design a semantic-guided attention module for knowledge fusion with meta-paths. Moreover, we introduce a cross-task communication mechanism to further enhance performance by enabling knowledge sharing across tasks at both LLM agent and KG levels. On the one hand, the LLM agents for different tasks collaborate to generate more diverse and comprehensive meta-paths. On the other hand, the embeddings from different tasks are adaptively merged for better socioeconomic prediction. Experiments on two datasets demonstrate the effectiveness of the synergistic design between LLM and KG, providing insights for information sharing across socioeconomic prediction tasks.
Authors: Donghoon Kim, Gusang Lee, Kyuhong Shim, Byonghyo Shim
Abstract: Recently, we have observed that Large Multi-modal Models (LMMs) are revolutionizing the way machines interact with the world, unlocking new possibilities across various multi-modal applications. To adapt LMMs for downstream tasks, parameter-efficient fine-tuning (PEFT) which only trains additional prefix tokens or modules, has gained popularity. Nevertheless, there has been little analysis of how PEFT works in LMMs. In this paper, we delve into the strengths and weaknesses of each tuning strategy, shifting the focus from the efficiency typically associated with these approaches. We first discover that model parameter tuning methods such as LoRA and Adapters distort the feature representation space learned during pre-training and limit the full utilization of pre-trained knowledge. We also demonstrate that prefix-tuning excels at preserving the representation space, despite its lower performance on downstream tasks. These findings suggest a simple two-step PEFT strategy called Prefix-Tuned PEFT (PT-PEFT), which successively performs prefix-tuning and then PEFT (i.e., Adapter, LoRA), combines the benefits of both. Experimental results show that PT-PEFT not only improves performance in image captioning and visual question answering compared to vanilla PEFT methods but also helps preserve the representation space of the four pre-trained models.
Authors: Danrun Cao (IRISA, EXPRESSION), Nicolas B\'echet (IRISA, UBS, EXPRESSION), Pierre-Fran\c{c}ois Marteau (IRISA, UBS, EXPRESSION)
Abstract: We address in this article the the quality of the WikiNER corpus, a multilingual Named Entity Recognition corpus, and provide a consolidated version of it. The annotation of WikiNER was produced in a semi-supervised manner i.e. no manual verification has been carried out a posteriori. Such corpus is called silver-standard. In this paper we propose WikiNER-fr-gold which is a revised version of the French proportion of WikiNER. Our corpus consists of randomly sampled 20% of the original French sub-corpus (26,818 sentences with 700k tokens). We start by summarizing the entity types included in each category in order to define an annotation guideline, and then we proceed to revise the corpus. Finally we present an analysis of errors and inconsistency observed in the WikiNER-fr corpus, and we discuss potential future work directions.
Authors: Xinguo Yu, Weina Cheng, Chuanzhi Yang, Ting Zhang
Abstract: Solving algebra problems (APs) continues to attract significant research interest as evidenced by the large number of algorithms and theories proposed over the past decade. Despite these important research contributions, however, the body of work remains incomplete in terms of theoretical justification and scope. The current contribution intends to fill the gap by developing a review framework that aims to lay a theoretical base, create an evaluation scheme, and extend the scope of the investigation. This paper first develops the State Transform Theory (STT), which emphasizes that the problem-solving algorithms are structured according to states and transforms unlike the understanding that underlies traditional surveys which merely emphasize the progress of transforms. The STT, thus, lays the theoretical basis for a new framework for reviewing algorithms. This new construct accommodates the relation-centric algorithms for solving both word and diagrammatic algebra problems. The latter not only highlights the necessity of introducing new states but also allows revelation of contributions of individual algorithms obscured in prior reviews without this approach.
Authors: Herman Lassche (AFAS Software, University Groningen), Michiel Overeem (AFAS Software), Ayushi Rastogi (University Groningen)
Abstract: Companies support their customers using live chats and chatbots to gain their loyalty. AFAS is a Dutch company aiming to leverage the opportunity large language models (LLMs) offer to answer customer queries with minimal to no input from its customer support team. Adding to its complexity, it is unclear what makes a response correct, and that too in Dutch. Further, with minimal data available for training, the challenge is to identify whether an answer generated by a large language model is correct and do it on the fly. This study is the first to define the correctness of a response based on how the support team at AFAS makes decisions. It leverages literature on natural language generation and automated answer grading systems to automate the decision-making of the customer support team. We investigated questions requiring a binary response (e.g., Would it be possible to adjust tax rates manually?) or instructions (e.g., How would I adjust tax rate manually?) to test how close our automated approach reaches support rating. Our approach can identify wrong messages in 55\% of the cases. This work shows the viability of automatically assessing when our chatbot tell lies.
Authors: Andrei Khrennikov, Masanao Ozawa, Felix Benninger, Oded Shor
Abstract: The recent years are characterized by intensive applications of the methodology and mathematical apparatus of quantum theory, quantum-like modeling, in cognition, psychology, and decision making. In spite of the successful applications of this approach to a variety of psychological effects, e.g., the order, conjunction, disjunction, and response replicability effects, one may (but need not) feel dissatisfaction due to the absence of clear coupling to the neurophysiological processes in the brain. For the moment, this is just a phenomenological approach. In this paper we construct the quantum-like representation of the networks of communicating neurons. It is based not on standard quantum theory, but on generalized probability theory (GPT) with the emphasis of the operational measurement approach. We employ GPT's version which is based on ordered linear state space (instead of complex Hilbert space). A network of communicating neurons is described as a weighted ordered graph that in turn is encoded by its weight matrix. The state space of weight matrices is embedded in GPT with effect-observables and state updates within measurement instruments theory. The latter plays the crucial role. This GPT based model shows the basic quantum-like effects, as e.g. the order, non-repeatability, and disjunction effects; the latter is also known as interference of decisions. This GPT coupling also supports quantum-like modeling in medical diagnostic for neurological diseases, as depression and epilepsy. Although the paper is concentrated on cognition and neuronal networks, the formalism and methodology can be straightforwardly applied to a variety of biological and social networks.
Authors: Yaran Fan, Jamie Pool, Senja Filipi, Ross Cutler
Abstract: Workplace meetings are vital to organizational collaboration, yet a large percentage of meetings are rated as ineffective. To help improve meeting effectiveness by understanding if the conversation is on topic, we create a comprehensive Topic-Conversation Relevance (TCR) dataset that covers a variety of domains and meeting styles. The TCR dataset includes 1,500 unique meetings, 22 million words in transcripts, and over 15,000 meeting topics, sourced from both newly collected Speech Interruption Meeting (SIM) data and existing public datasets. Along with the text data, we also open source scripts to generate synthetic meetings or create augmented meetings from the TCR dataset to enhance data diversity. For each data source, benchmarks are created using GPT-4 to evaluate the model accuracy in understanding transcription-topic relevance.
Authors: Yulong Wang, Chang Zuo, Yin Xuan, Hong Li, Ni Wei
Abstract: Fine-tuning large language models (LLMs) has become essential for adapting pretrained models to specific downstream tasks. In this paper, we propose Linear Chain Transformation (LinChain), a novel approach that introduces a sequence of linear transformations during fine-tuning to enrich optimization dynamics. By incorporating multiple linear transformations into the parameter update process, LinChain expands the effective rank of updates and enhances the model's ability to learn complex task-specific representations. We demonstrate that this method significantly improves the performance of LLM fine-tuning over state-of-the-art methods by providing more flexible optimization paths during training, while maintaining the inference efficiency of the resulting model. Our experiments on various benchmark tasks show that LinChain leads to better generalization, fewer learnable parameters, and improved task adaptation, making it a compelling strategy for LLM fine-tuning.
Authors: Qi Wang, Pu Ren, Hao Zhou, Xin-Yang Liu, Zhiwen Deng, Yi Zhang, Ruizhi Chengze, Hongsheng Liu, Zidong Wang, Jian-Xun Wang, Ji-Rong_Wen, Hao Sun, Yang Liu
Abstract: When solving partial differential equations (PDEs), classical numerical methods often require fine mesh grids and small time stepping to meet stability, consistency, and convergence conditions, leading to high computational cost. Recently, machine learning has been increasingly utilized to solve PDE problems, but they often encounter challenges related to interpretability, generalizability, and strong dependency on rich labeled data. Hence, we introduce a new PDE-Preserved Coarse Correction Network (P$^2$C$^2$Net) to efficiently solve spatiotemporal PDE problems on coarse mesh grids in small data regimes. The model consists of two synergistic modules: (1) a trainable PDE block that learns to update the coarse solution (i.e., the system state), based on a high-order numerical scheme with boundary condition encoding, and (2) a neural network block that consistently corrects the solution on the fly. In particular, we propose a learnable symmetric Conv filter, with weights shared over the entire model, to accurately estimate the spatial derivatives of PDE based on the neural-corrected system state. The resulting physics-encoded model is capable of handling limited training data (e.g., 3--5 trajectories) and accelerates the prediction of PDE solutions on coarse spatiotemporal grids while maintaining a high accuracy. P$^2$C$^2$Net achieves consistent state-of-the-art performance with over 50\% gain (e.g., in terms of relative prediction error) across four datasets covering complex reaction-diffusion processes and turbulent flows.
Authors: Parvez Zamil, Gollam Rabby, Md. Sadekur Rahman, S\"oren Auer
Abstract: The growing volume of biomedical scholarly document abstracts presents an increasing challenge in efficiently retrieving accurate and relevant information. To address this, we introduce a novel approach that integrates an optimized topic modelling framework, OVB-LDA, with the BI-POP CMA-ES optimization technique for enhanced scholarly document abstract categorization. Complementing this, we employ the distilled MiniLM model, fine-tuned on domain-specific data, for high-precision answer extraction. Our approach is evaluated across three configurations: scholarly document abstract retrieval, gold-standard scholarly documents abstract, and gold-standard snippets, consistently outperforming established methods such as RYGH and bio-answer finder. Notably, we demonstrate that extracting answers from scholarly documents abstracts alone can yield high accuracy, underscoring the sufficiency of abstracts for many biomedical queries. Despite its compact size, MiniLM exhibits competitive performance, challenging the prevailing notion that only large, resource-intensive models can handle such complex tasks. Our results, validated across various question types and evaluation batches, highlight the robustness and adaptability of our method in real-world biomedical applications. While our approach shows promise, we identify challenges in handling complex list-type questions and inconsistencies in evaluation metrics. Future work will focus on refining the topic model with more extensive domain-specific datasets, further optimizing MiniLM and utilizing large language models (LLM) to improve both precision and efficiency in biomedical question answering.
Authors: Elena Kardanova (National Research University Higher School of Economics, Moscow, Russia), Alina Ivanova (National Research University Higher School of Economics, Moscow, Russia), Ksenia Tarasova (National Research University Higher School of Economics, Moscow, Russia), Taras Pashchenko (National Research University Higher School of Economics, Moscow, Russia), Aleksei Tikhoniuk (National Research University Higher School of Economics, Moscow, Russia), Elen Yusupova (National Research University Higher School of Economics, Moscow, Russia), Anatoly Kasprzhak (National Research University Higher School of Economics, Moscow, Russia), Yaroslav Kuzminov (National Research University Higher School of Economics, Moscow, Russia), Ekaterina Kruchinskaia (National Research University Higher School of Economics, Moscow, Russia), Irina Brun (National Research University Higher School of Economics, Moscow, Russia)
Abstract: The era of large language models (LLM) raises questions not only about how to train models, but also about how to evaluate them. Despite numerous existing benchmarks, insufficient attention is often given to creating assessments that test LLMs in a valid and reliable manner. To address this challenge, we accommodate the Evidence-centered design (ECD) methodology and propose a comprehensive approach to benchmark development based on rigorous psychometric principles. In this paper, we have made the first attempt to illustrate this approach by creating a new benchmark in the field of pedagogy and education, highlighting the limitations of existing benchmark development approach and taking into account the development of LLMs. We conclude that a new approach to benchmarking is required to match the growing complexity of AI applications in the educational context. We construct a novel benchmark guided by the Bloom's taxonomy and rigorously designed by a consortium of education experts trained in test development. Thus the current benchmark provides an academically robust and practical assessment tool tailored for LLMs, rather than human participants. Tested empirically on the GPT model in the Russian language, it evaluates model performance across varied task complexities, revealing critical gaps in current LLM capabilities. Our results indicate that while generative AI tools hold significant promise for education - potentially supporting tasks such as personalized tutoring, real-time feedback, and multilingual learning - their reliability as autonomous teachers' assistants right now remain rather limited, particularly in tasks requiring deeper cognitive engagement.
Authors: Harry Caufield, Carlo Kroll, Shawn T O'Neil, Justin T Reese, Marcin P Joachimiak, Harshad Hegde, Nomi L Harris, Madan Krishnamurthy, James A McLaughlin, Damian Smedley, Melissa A Haendel, Peter N Robinson, Christopher J Mungall
Abstract: Effective data-driven biomedical discovery requires data curation: a time-consuming process of finding, organizing, distilling, integrating, interpreting, annotating, and validating diverse information into a structured form suitable for databases and knowledge bases. Accurate and efficient curation of these digital assets is critical to ensuring that they are FAIR, trustworthy, and sustainable. Unfortunately, expert curators face significant time and resource constraints. The rapid pace of new information being published daily is exceeding their capacity for curation. Generative AI, exemplified by instruction-tuned large language models (LLMs), has opened up new possibilities for assisting human-driven curation. The design philosophy of agents combines the emerging abilities of generative AI with more precise methods. A curator's tasks can be aided by agents for performing reasoning, searching ontologies, and integrating knowledge across external sources, all efforts otherwise requiring extensive manual effort. Our LLM-driven annotation tool, CurateGPT, melds the power of generative AI together with trusted knowledge bases and literature sources. CurateGPT streamlines the curation process, enhancing collaboration and efficiency in common workflows. Compared to direct interaction with an LLM, CurateGPT's agents enable access to information beyond that in the LLM's training data and they provide direct links to the data supporting each claim. This helps curators, researchers, and engineers scale up curation efforts to keep pace with the ever-increasing volume of scientific data.
Authors: Albert N\"ossig, Tobias Hell, Georg Moser
Abstract: In this paper, we present an innovative iterative approach to rule learning specifically designed for (but not limited to) text-based data. Our method focuses on progressively expanding the vocabulary utilized in each iteration resulting in a significant reduction of memory consumption. Moreover, we introduce a Value of Confidence as an indicator of the reliability of the generated rules. By leveraging the Value of Confidence, our approach ensures that only the most robust and trustworthy rules are retained, thereby improving the overall quality of the rule learning process. We demonstrate the effectiveness of our method through extensive experiments on various textual as well as non-textual datasets including a use case of significant interest to insurance industries, showcasing its potential for real-world applications.
Authors: Ahmed Akib Jawad Karim, Kazi Hafiz Md. Asad, Md. Golam Rabiul Alam
Abstract: This work focuses on the efficiency of the knowledge distillation approach in generating a lightweight yet powerful BERT based model for natural language processing applications. After the model creation, we applied the resulting model, LastBERT, to a real-world task classifying severity levels of Attention Deficit Hyperactivity Disorder (ADHD)-related concerns from social media text data. Referring to LastBERT, a customized student BERT model, we significantly lowered model parameters from 110 million BERT base to 29 million, resulting in a model approximately 73.64% smaller. On the GLUE benchmark, comprising paraphrase identification, sentiment analysis, and text classification, the student model maintained strong performance across many tasks despite this reduction. The model was also used on a real-world ADHD dataset with an accuracy and F1 score of 85%. When compared to DistilBERT (66M) and ClinicalBERT (110M), LastBERT demonstrated comparable performance, with DistilBERT slightly outperforming it at 87%, and ClinicalBERT achieving 86% across the same metrics. These findings highlight the LastBERT model's capacity to classify degrees of ADHD severity properly, so it offers a useful tool for mental health professionals to assess and comprehend material produced by users on social networking platforms. The study emphasizes the possibilities of knowledge distillation to produce effective models fit for use in resource-limited conditions, hence advancing NLP and mental health diagnosis. Furthermore underlined by the considerable decrease in model size without appreciable performance loss is the lower computational resources needed for training and deployment, hence facilitating greater applicability. Especially using readily available computational tools like Google Colab. This study shows the accessibility and usefulness of advanced NLP methods in pragmatic world applications.
Authors: Andrew Estornell, Jean-Francois Ton, Yuanshun Yao, Yang Liu
Abstract: Large language models (LLMs) have demonstrated a remarkable ability to serve as general-purpose tools for various language-based tasks. Recent works have demonstrated that the efficacy of such models can be improved through iterative dialog between multiple models, frequently referred to as multi-agent debate (MAD). While debate shows promise as a means of improving model efficacy, most works in this area treat debate as an emergent behavior, rather than a learned behavior. In doing so, current debate frameworks rely on collaborative behaviors to have been sufficiently trained into off-the-shelf models. To address this limitation, we propose ACC-Debate, an Actor-Critic based learning framework to produce a two-agent team specialized in debate. We demonstrate that ACC-Debate outperforms SotA debate techniques on a wide array of benchmarks.
Authors: Chaochen Wu, Meiyun Zuo, Lei Xie
Abstract: Single-cell RNA-seq (scRNA-seq) technology is a powerful tool for unraveling the complexity of biological systems. One of essential and fundamental tasks in scRNA-seq data analysis is Cell Type Annotation (CTA). In spite of tremendous efforts in developing machine learning methods for this problem, several challenges remains. They include identifying Out-of-Domain (OOD) cell types, quantifying the uncertainty of unseen cell type annotations, and determining interpretable cell type-specific gene drivers for an OOD case. OOD cell types are often associated with therapeutic responses and disease origins, making them critical for precision medicine and early disease diagnosis. Additionally, scRNA-seq data contains tens thousands of gene expressions. Pinpointing gene drivers underlying CTA can provide deep insight into gene regulatory mechanisms and serve as disease biomarkers. In this study, we develop a new method, eDOC, to address aforementioned challenges. eDOC leverages a transformer architecture with evidential learning to annotate In-Domain (IND) and OOD cell types as well as to highlight genes that contribute both IND cells and OOD cells in a single cell resolution. Rigorous experiments demonstrate that eDOC significantly improves the efficiency and effectiveness of OOD cell type and gene driver identification compared to other state-of-the-art methods. Our findings suggest that eDOC may provide new insights into single-cell biology.
Authors: Darian Rodriguez Vasquez, Afroditi Papadaki
Abstract: Despite the impressive performance of large language models across various tasks, they often struggle with reasoning under negated statements. Negations are important in real-world applications as they encode negative polarity in verb phrases, clauses, or other expressions. Nevertheless, they are underrepresented in current benchmarks, which mainly include basic negation forms and overlook more complex ones, resulting in insufficient data for training a language model. In this work, we propose NegVerse, a method that tackles the lack of negation datasets by producing a diverse range of negation types from affirmative sentences, including verbal, non-verbal, and affixal forms commonly found in English text. We provide new rules for masking parts of sentences where negations are most likely to occur, based on syntactic structure and use a frozen baseline LLM and prompt tuning to generate negated sentences. We also propose a filtering mechanism to identify negation cues and remove degenerate examples, producing a diverse range of meaningful perturbations. Our results show that NegVerse outperforms existing methods and generates negations with higher lexical similarity to the original sentences, better syntactic preservation and negation diversity. The code is available in https://github.com/DarianRodriguez/NegVerse
Authors: Ziyu Ye, Rishabh Agarwal, Tianqi Liu, Rishabh Joshi, Sarmishta Velury, Quoc V. Le, Qijun Tan, Yuan Liu
Abstract: Current RLHF frameworks for aligning large language models (LLMs) typically assume a fixed prompt distribution, which is sub-optimal and limits the scalability of alignment and generalizability of models. To address this, we introduce a general open-ended RLHF framework that casts alignment as an asymmetric game between two players: (i) a creator that generates increasingly informative prompt distributions using the reward model, and (ii) a solver that learns to produce more preferred responses on prompts produced by the creator. This framework of Evolving Alignment via Asymmetric Self-Play (eva), results in a simple and efficient approach that can utilize any existing RLHF algorithm for scalable alignment. eva outperforms state-of-the-art methods on widely-used benchmarks, without the need of any additional human crafted prompts. Specifically, eva improves the win rate of Gemma-2-9B-it on Arena-Hard from 51.6% to 60.1% with DPO, from 55.7% to 58.9% with SPPO, from 52.3% to 60.7% with SimPO, and from 54.8% to 60.3% with ORPO, surpassing its 27B version and matching claude-3-opus. This improvement is persistent even when new human crafted prompts are introduced. Finally, we show eva is effective and robust under various ablation settings.
Authors: Kangxiang Xia, Dake Guo, Jixun Yao, Liumeng Xue, Hanzhao Li, Shuai Wang, Zhao Guo, Lei Xie, Qingqing Zhang, Lei Luo, Minghui Dong, Peng Sun
Abstract: The ISCSLP 2024 Conversational Voice Clone (CoVoC) Challenge aims to benchmark and advance zero-shot spontaneous style voice cloning, particularly focusing on generating spontaneous behaviors in conversational speech. The challenge comprises two tracks: an unconstrained track without limitation on data and model usage, and a constrained track only allowing the use of constrained open-source datasets. A 100-hour high-quality conversational speech dataset is also made available with the challenge. This paper details the data, tracks, submitted systems, evaluation results, and findings.
Authors: Eunji Kim, Sriya Mantena, Weiwei Yang, Chandan Singh, Sungroh Yoon, Jianfeng Gao
Abstract: Recent large language models (LLMs) have excelled across a wide range of tasks, but their use in high-stakes and compute-limited settings has intensified the demand for interpretability and efficiency. We address this need by proposing Induction-head ngram models (Induction-Gram), a method that builds an efficient, interpretable LM by bolstering modern ngram models with a hand-engineered "induction head". This induction head uses a custom neural similarity metric to efficiently search the model's input context for potential next-word completions. This process enables Induction-Gram to provide ngram-level grounding for each generated token. Moreover, experiments show that this simple method significantly improves next-word prediction over baseline interpretable models (up to 26%p) and can be used to speed up LLM inference for large models through speculative decoding. We further study Induction-Gram in a natural-language neuroscience setting, where the goal is to predict the next fMRI response in a sequence. It again provides a significant improvement over interpretable models (20% relative increase in the correlation of predicted fMRI responses), potentially enabling deeper scientific investigation of language selectivity in the brain. The code is available at https://github.com/ejkim47/induction-gram.
Authors: Mahesh Vaijainthymala Krishnamoorthy
Abstract: The Artificial intelligence in critical sectors-healthcare, finance, and public safety-has made system integrity paramount for maintaining societal trust. Current verification methods for AI systems lack comprehensive lifecycle assurance, creating significant vulnerabilities in deployment of both powerful and trustworthy AI. This research introduces Meta-Sealing, a cryptographic framework that fundamentally changes integrity verification in AI systems throughout their operational lifetime. Meta-Sealing surpasses traditional integrity protocols through its implementation of cryptographic seal chains, establishing verifiable, immutable records for all system decisions and transformations. The framework combines advanced cryptography with distributed verification, delivering tamper-evident guarantees that achieve both mathematical rigor and computational efficiency. Our implementation addresses urgent regulatory requirements for AI system transparency and auditability. The framework integrates with current AI governance standards, specifically the EU's AI Act and FDA's healthcare AI guidelines, enabling organizations to maintain operational efficiency while meeting compliance requirements. Testing on financial institution data demonstrated Meta-Sealing's capability to reduce audit timeframes by 62% while enhancing stakeholder confidence by 47%. Results can establish a new benchmark for integrity assurance in enterprise AI deployments. This research presents Meta-Sealing not merely as a technical solution, but as a foundational framework ensuring AI system integrity aligns with human values and regulatory requirements. As AI continues to influence critical decisions, provides the necessary bridge between technological advancement and verifiable trust. Meta-Sealing serves as a guardian of trust, ensuring that the AI systems we depend on are as reliable and transparent as they are powerful.
Authors: Zhenbiao Cao, Yuanlei Zheng, Zhihao Fan, Xiaojin Zhang, Wei Chen
Abstract: Text-to-SQL generation aims to translate natural language questions into SQL statements. In large language models (LLMs) based Text-to-SQL, schema linking is a widely adopted strategy to streamline the input for LLMs by selecting only relevant schema elements, therefore reducing noise and computational overhead. However, schema linking faces risks that requires caution, including the potential omission of necessary elements and disruption of database structural integrity. To address these challenges, we propose a novel framework called RSL-SQL that combines bidirectional schema linking, contextual information augmentation, binary selection strategy, and multi-turn self-correction. Our approach improves the recall of schema linking through forward and backward pruning and hedges the risk by voting between full schema and contextual information augmented simplified schema. Experiments on the BIRD and Spider benchmarks demonstrate that our approach achieves state-of-the-art execution accuracy among open-source solutions, with 67.2% on BIRD and 87.9% on Spider using GPT-4o. Furthermore, our approach outperforms a series of GPT-4 based Text-to-SQL systems when adopting DeepSeek (much cheaper) with same intact prompts. Extensive analysis and ablation studies confirm the effectiveness of each component in our framework. The codes are available at https://github.com/Laqcce-cao/RSL-SQL.
Authors: Lamine Diop, Marc Plantevit, Arnaud Soulet
Abstract: Efficient learning from streaming data is important for modern data analysis due to the continuous and rapid evolution of data streams. Despite significant advancements in stream pattern mining, challenges persist, particularly in managing complex data streams like sequential and weighted itemsets. While reservoir sampling serves as a fundamental method for randomly selecting fixed-size samples from data streams, its application to such complex patterns remains largely unexplored. In this study, we introduce an approach that harnesses a weighted reservoir to facilitate direct pattern sampling from streaming batch data, thus ensuring scalability and efficiency. We present a generic algorithm capable of addressing temporal biases and handling various pattern types, including sequential, weighted, and unweighted itemsets. Through comprehensive experiments conducted on real-world datasets, we evaluate the effectiveness of our method, showcasing its ability to construct accurate incremental online classifiers for sequential data. Our approach not only enables previously unusable online machine learning models for sequential data to achieve accuracy comparable to offline baselines but also represents significant progress in the development of incremental online sequential itemset classifiers.
Authors: Junlin Guo, Siqi Lu, Can Cui, Ruining Deng, Tianyuan Yao, Zhewen Tao, Yizhe Lin, Marilyn Lionts, Quan Liu, Juming Xiong, Yu Wang, Shilin Zhao, Catie Chang, Mitchell Wilkes, Mengmeng Yin, Haichun Yang, Yuankai Huo
Abstract: Training AI foundation models has emerged as a promising large-scale learning approach for addressing real-world healthcare challenges, including digital pathology. While many of these models have been developed for tasks like disease diagnosis and tissue quantification using extensive and diverse training datasets, their readiness for deployment on some arguably simplest tasks, such as nuclei segmentation within a single organ (e.g., the kidney), remains uncertain. This paper seeks to answer this key question, "How good are we?", by thoroughly evaluating the performance of recent cell foundation models on a curated multi-center, multi-disease, and multi-species external testing dataset. Additionally, we tackle a more challenging question, "How can we improve?", by developing and assessing human-in-the-loop data enrichment strategies aimed at enhancing model performance while minimizing the reliance on pixel-level human annotation. To address the first question, we curated a multicenter, multidisease, and multispecies dataset consisting of 2,542 kidney whole slide images (WSIs). Three state-of-the-art (SOTA) cell foundation models-Cellpose, StarDist, and CellViT-were selected for evaluation. To tackle the second question, we explored data enrichment algorithms by distilling predictions from the different foundation models with a human-in-the-loop framework, aiming to further enhance foundation model performance with minimal human efforts. Our experimental results showed that all three foundation models improved over their baselines with model fine-tuning with enriched data. Interestingly, the baseline model with the highest F1 score does not yield the best segmentation outcomes after fine-tuning. This study establishes a benchmark for the development and deployment of cell vision foundation models tailored for real-world data applications.
Authors: Matthew Chang, Gunjan Chhablani, Alexander Clegg, Mikael Dallaire Cote, Ruta Desai, Michal Hlavac, Vladimir Karashchuk, Jacob Krantz, Roozbeh Mottaghi, Priyam Parashar, Siddharth Patki, Ishita Prasad, Xavier Puig, Akshara Rai, Ram Ramrakhya, Daniel Tran, Joanne Truong, John M. Turner, Eric Undersander, Tsung-Yen Yang
Abstract: We present a benchmark for Planning And Reasoning Tasks in humaN-Robot collaboration (PARTNR) designed to study human-robot coordination in household activities. PARTNR tasks exhibit characteristics of everyday tasks, such as spatial, temporal, and heterogeneous agent capability constraints. We employ a semi-automated task generation pipeline using Large Language Models (LLMs), incorporating simulation in the loop for grounding and verification. PARTNR stands as the largest benchmark of its kind, comprising 100,000 natural language tasks, spanning 60 houses and 5,819 unique objects. We analyze state-of-the-art LLMs on PARTNR tasks, across the axes of planning, perception and skill execution. The analysis reveals significant limitations in SoTA models, such as poor coordination and failures in task tracking and recovery from errors. When LLMs are paired with real humans, they require 1.5x as many steps as two humans collaborating and 1.1x more steps than a single human, underscoring the potential for improvement in these models. We further show that fine-tuning smaller LLMs with planning data can achieve performance on par with models 9 times larger, while being 8.6x faster at inference. Overall, PARTNR highlights significant challenges facing collaborative embodied agents and aims to drive research in this direction.
Authors: Ashwin De Silva, Rahul Ramesh, Rubing Yang, Siyu Yu, Joshua T Vogelstein, Pratik Chaudhari
Abstract: In real-world applications, the distribution of the data, and our goals, evolve over time. The prevailing theoretical framework for studying machine learning, namely probably approximately correct (PAC) learning, largely ignores time. As a consequence, existing strategies to address the dynamic nature of data and goals exhibit poor real-world performance. This paper develops a theoretical framework called "Prospective Learning" that is tailored for situations when the optimal hypothesis changes over time. In PAC learning, empirical risk minimization (ERM) is known to be consistent. We develop a learner called Prospective ERM, which returns a sequence of predictors that make predictions on future data. We prove that the risk of prospective ERM converges to the Bayes risk under certain assumptions on the stochastic process generating the data. Prospective ERM, roughly speaking, incorporates time as an input in addition to the data. We show that standard ERM as done in PAC learning, without incorporating time, can result in failure to learn when distributions are dynamic. Numerical experiments illustrate that prospective ERM can learn synthetic and visual recognition problems constructed from MNIST and CIFAR-10.
Authors: Zirui Zhang, Wei Hao, Aroon Sankoh, William Lin, Emanuel Mendiola-Ortiz, Junfeng Yang, Chengzhi Mao
Abstract: Recent advances in AI-generated voices have intensified the challenge of detecting deepfake audio, posing risks for scams and the spread of disinformation. To tackle this issue, we establish the largest public voice dataset to date, named DeepFakeVox-HQ, comprising 1.3 million samples, including 270,000 high-quality deepfake samples from 14 diverse sources. Despite previously reported high accuracy, existing deepfake voice detectors struggle with our diversely collected dataset, and their detection success rates drop even further under realistic corruptions and adversarial attacks. We conduct a holistic investigation into factors that enhance model robustness and show that incorporating a diversified set of voice augmentations is beneficial. Moreover, we find that the best detection models often rely on high-frequency features, which are imperceptible to humans and can be easily manipulated by an attacker. To address this, we propose the F-SAT: Frequency-Selective Adversarial Training method focusing on high-frequency components. Empirical results demonstrate that using our training dataset boosts baseline model performance (without robust training) by 33%, and our robust training further improves accuracy by 7.7% on clean samples and by 29.3% on corrupted and attacked samples, over the state-of-the-art RawNet3 model.
Authors: Thomas Crasson, Yacine Nabet, Mathias L\'ecuyer
Abstract: Deep learning time-series models are often used to make forecasts that inform downstream decisions. Since these decisions can differ from those in the training set, there is an implicit requirement that time-series models will generalize outside of their training distribution. Despite this core requirement, time-series models are typically trained and evaluated on in-distribution predictive tasks. We extend the orthogonal statistical learning framework to train causal time-series models that generalize better when forecasting the effect of actions outside of their training distribution. To evaluate these models, we leverage Regression Discontinuity Designs popular in economics to construct a test set of causal treatment effects.
Authors: Tang Li, Mengmeng Ma, Xi Peng
Abstract: Large pretrained foundation models demonstrate exceptional performance and, in some high-stakes applications, even surpass human experts. However, most of these models are currently evaluated primarily on prediction accuracy, overlooking the validity of the rationales behind their accurate predictions. For the safe deployment of foundation models, there is a pressing need to ensure double-correct predictions, i.e., correct prediction backed by correct rationales. To achieve this, we propose a two-phase scheme: First, we curate a new dataset that offers structured rationales for visual recognition tasks. Second, we propose a rationale-informed optimization method to guide the model in disentangling and localizing visual evidence for each rationale, without requiring manual annotations. Extensive experiments and ablation studies demonstrate that our model outperforms state-of-the-art models by up to 10.1% in prediction accuracy across a wide range of tasks. Furthermore, our method significantly improves the model's rationale correctness, improving localization by 7.5% and disentanglement by 36.5%. Our dataset, source code, and pretrained weights: https://github.com/deep-real/DCP
Authors: Ricardo Valadas, Maximilian St\"olzle, Jingyue Liu, Cosimo Della Santina
Abstract: Obtaining dynamic models of continuum soft robots is central to the analysis and control of soft robots, and researchers have devoted much attention to the challenge of proposing both data-driven and first-principle solutions. Both avenues have, however, shown their limitations; the former lacks structure and performs poorly outside training data, while the latter requires significant simplifications and extensive expert knowledge to be used in practice. This paper introduces a streamlined method for learning low-dimensional, physics-based models that are both accurate and easy to interpret. We start with an algorithm that uses image data (i.e., shape evolutions) to determine the minimal necessary segments for describing a soft robot's movement. Following this, we apply a dynamic regression and strain sparsification algorithm to identify relevant strains and define the model's dynamics. We validate our approach through simulations with various planar soft manipulators, comparing its performance against other learning strategies, showing that our models are both computationally efficient and 25x more accurate on out-of-training distribution inputs. Finally, we demonstrate that thanks to the capability of the method of generating physically compatible models, the learned models can be straightforwardly combined with model-based control policies.
Authors: Tong Niu, Shafiq Joty, Ye Liu, Caiming Xiong, Yingbo Zhou, Semih Yavuz
Abstract: Accurate document retrieval is crucial for the success of retrieval-augmented generation (RAG) applications, including open-domain question answering and code completion. While large language models (LLMs) have been employed as dense encoders or listwise rerankers in RAG systems, they often struggle with reasoning-intensive tasks because they lack nuanced analysis when judging document relevance. To address this limitation, we introduce JudgeRank, a novel agentic reranker that emulates human cognitive processes when assessing document relevance. Our approach consists of three key steps: (1) query analysis to identify the core problem, (2) document analysis to extract a query-aware summary, and (3) relevance judgment to provide a concise assessment of document relevance. We evaluate JudgeRank on the reasoning-intensive BRIGHT benchmark, demonstrating substantial performance improvements over first-stage retrieval methods and outperforming other popular reranking approaches. In addition, JudgeRank performs on par with fine-tuned state-of-the-art rerankers on the popular BEIR benchmark, validating its zero-shot generalization capability. Through comprehensive ablation studies, we demonstrate that JudgeRank's performance generalizes well across LLMs of various sizes while ensembling them yields even more accurate reranking than individual models.
Authors: Christopher Richardson, Roshan Sharma, Neeraj Gaur, Parisa Haghani, Anirudh Sundar, Bhuvana Ramabhadran
Abstract: Zero-shot domain adaptation for dialogue state tracking (DST) remains a challenging problem in task-oriented dialogue (TOD) systems, where models must generalize to target domains unseen at training time. Current large language model approaches for zero-shot domain adaptation rely on prompting to introduce knowledge pertaining to the target domains. However, their efficacy strongly depends on prompt engineering, as well as the zero-shot ability of the underlying language model. In this work, we devise a novel data augmentation approach, Schema Augmentation, that improves the zero-shot domain adaptation of language models through fine-tuning. Schema Augmentation is a simple but effective technique that enhances generalization by introducing variations of slot names within the schema provided in the prompt. Experiments on MultiWOZ and SpokenWOZ showed that the proposed approach resulted in a substantial improvement over the baseline, in some experiments achieving over a twofold accuracy gain over unseen domains while maintaining equal or superior performance over all domains.
Authors: Haritz Puerto, Martin Gubri, Sangdoo Yun, Seong Joon Oh
Abstract: Membership inference attacks (MIA) attempt to verify the membership of a given data sample in the training set for a model. MIA has become relevant in recent years, following the rapid development of large language models (LLM). Many are concerned about the usage of copyrighted materials for training them and call for methods for detecting such usage. However, recent research has largely concluded that current MIA methods do not work on LLMs. Even when they seem to work, it is usually because of the ill-designed experimental setup where other shortcut features enable "cheating." In this work, we argue that MIA still works on LLMs, but only when multiple documents are presented for testing. We construct new benchmarks that measure the MIA performances at a continuous scale of data samples, from sentences (n-grams) to a collection of documents (multiple chunks of tokens). To validate the efficacy of current MIA approaches at greater scales, we adapt a recent work on Dataset Inference (DI) for the task of binary membership detection that aggregates paragraph-level MIA features to enable MIA at document and collection of documents level. This baseline achieves the first successful MIA on pre-trained and fine-tuned LLMs.
Authors: Weiqin Yang, Jiawei Chen, Xin Xin, Sheng Zhou, Binbin Hu, Yan Feng, Chun Chen, Can Wang
Abstract: Softmax Loss (SL) is widely applied in recommender systems (RS) and has demonstrated effectiveness. This work analyzes SL from a pairwise perspective, revealing two significant limitations: 1) the relationship between SL and conventional ranking metrics like DCG is not sufficiently tight; 2) SL is highly sensitive to false negative instances. Our analysis indicates that these limitations are primarily due to the use of the exponential function. To address these issues, this work extends SL to a new family of loss functions, termed Pairwise Softmax Loss (PSL), which replaces the exponential function in SL with other appropriate activation functions. While the revision is minimal, we highlight three merits of PSL: 1) it serves as a tighter surrogate for DCG with suitable activation functions; 2) it better balances data contributions; and 3) it acts as a specific BPR loss enhanced by Distributionally Robust Optimization (DRO). We further validate the effectiveness and robustness of PSL through empirical experiments. The code is available at https://github.com/Tiny-Snow/IR-Benchmark.
Authors: Yue Fu, Han Bin, Tony Zhou, Marx Wang, Yixin Chen, Zelia Gomes Da Costa Lai, Jacob O. Wobbrock, Alexis Hiniker
Abstract: As generative AI (GenAI) increasingly permeates design workflows, its impact on design outcomes and designers' creative capabilities warrants investigation. We conducted a within-subjects experiment where we asked participants to design advertisements both with and without GenAI support. Our results show that expert evaluators rated GenAI-supported designs as more creative and unconventional ("weird") despite no significant differences in visual appeal, brand alignment, or usefulness, which highlights the decoupling of novelty from usefulness-traditional dual components of creativity-in the context of GenAI usage. Moreover, while GenAI does not significantly enhance designers' overall creative thinking abilities, users were affected differently based on native language and prior AI exposure. Native English speakers experienced reduced relaxation when using AI, whereas designers new to GenAI exhibited gains in divergent thinking, such as idea fluency and flexibility. These findings underscore the variable impact of GenAI on different user groups, suggesting the potential for customized AI tools.
Authors: John Wu, David Wu, Jimeng Sun
Abstract: Medical coding, the translation of unstructured clinical text into standardized medical codes, is a crucial but time-consuming healthcare practice. Though large language models (LLM) could automate the coding process and improve the efficiency of such tasks, interpretability remains paramount for maintaining patient trust. Current efforts in interpretability of medical coding applications rely heavily on label attention mechanisms, which often leads to the highlighting of extraneous tokens irrelevant to the ICD code. To facilitate accurate interpretability in medical language models, this paper leverages dictionary learning that can efficiently extract sparsely activated representations from dense language model embeddings in superposition. Compared with common label attention mechanisms, our model goes beyond token-level representations by building an interpretable dictionary which enhances the mechanistic-based explanations for each ICD code prediction, even when the highlighted tokens are medically irrelevant. We show that dictionary features can steer model behavior, elucidate the hidden meanings of upwards of 90% of medically irrelevant tokens, and are human interpretable.
Authors: Panagiota Gatoula, Dimitrios E. Diamantis, Anastasios Koulaouzidis, Cristina Carretero, Stefania Chetcuti-Zammit, Pablo Cortegoso Valdivia, Bego\~na Gonz\'alez-Su\'arez, Alessandro Mussetto, John Plevris, Alexander Robertson, Bruno Rosa, Ervin Toth, Dimitris K. Iakovidis
Abstract: Sharing retrospectively acquired data is essential for both clinical research and training. Synthetic Data Generation (SDG), using Artificial Intelligence (AI) models, can overcome privacy barriers in sharing clinical data, enabling advancements in medical diagnostics. This study focuses on the clinical evaluation of medical SDG, with a proof-of-concept investigation on diagnosing Inflammatory Bowel Disease (IBD) using Wireless Capsule Endoscopy (WCE) images. The paper contributes by a) presenting a protocol for the systematic evaluation of synthetic images by medical experts and b) applying it to assess TIDE-II, a novel variational autoencoder-based model for high-resolution WCE image synthesis, with a comprehensive qualitative evaluation conducted by 10 international WCE specialists, focusing on image quality, diversity, realism, and clinical decision-making. The results show that TIDE-II generates clinically relevant WCE images, helping to address data scarcity and enhance diagnostic tools. The proposed protocol serves as a reference for future research on medical image-generation techniques.
Authors: Paulius Rauba, Nabeel Seedat, Krzysztof Kacprzyk, Mihaela van der Schaar
Abstract: Real-world machine learning systems often encounter model performance degradation due to distributional shifts in the underlying data generating process (DGP). Existing approaches to addressing shifts, such as concept drift adaptation, are limited by their reason-agnostic nature. By choosing from a pre-defined set of actions, such methods implicitly assume that the causes of model degradation are irrelevant to what actions should be taken, limiting their ability to select appropriate adaptations. In this paper, we propose an alternative paradigm to overcome these limitations, called self-healing machine learning (SHML). Contrary to previous approaches, SHML autonomously diagnoses the reason for degradation and proposes diagnosis-based corrective actions. We formalize SHML as an optimization problem over a space of adaptation actions to minimize the expected risk under the shifted DGP. We introduce a theoretical framework for self-healing systems and build an agentic self-healing solution H-LLM which uses large language models to perform self-diagnosis by reasoning about the structure underlying the DGP, and self-adaptation by proposing and evaluating corrective actions. Empirically, we analyze different components of H-LLM to understand why and when it works, demonstrating the potential of self-healing ML.
Authors: Tempest A. van Schaik, Xinggang Liu, Louis Atallah, Omar Badawi
Abstract: This work proposes a fairness monitoring approach for machine learning models that predict patient mortality in the ICU. We investigate how well models perform for patient groups with different race, sex and medical diagnoses. We investigate Documentation bias in clinical measurement, showing how fairness analysis provides a more detailed and insightful comparison of model performance than traditional accuracy metrics alone.
Authors: Brody McNutt, Libby Zhang, Angus Carey-Douglas, Fritz Vollrath, Frank Pope, Leandra Brickson
Abstract: This research represents a pioneering application of automated pose estimation from drone data to study elephant behavior in the wild, utilizing video footage captured from Samburu National Reserve, Kenya. The study evaluates two pose estimation workflows: DeepLabCut, known for its application in laboratory settings and emerging wildlife fieldwork, and YOLO-NAS-Pose, a newly released pose estimation model not previously applied to wildlife behavioral studies. These models are trained to analyze elephant herd behavior, focusing on low-resolution ($\sim$50 pixels) subjects to detect key points such as the head, spine, and ears of multiple elephants within a frame. Both workflows demonstrated acceptable quality of pose estimation on the test set, facilitating the automated detection of basic behaviors crucial for studying elephant herd dynamics. For the metrics selected for pose estimation evaluation on the test set -- root mean square error (RMSE), percentage of correct keypoints (PCK), and object keypoint similarity (OKS) -- the YOLO-NAS-Pose workflow outperformed DeepLabCut. Additionally, YOLO-NAS-Pose exceeded DeepLabCut in object detection evaluation. This approach introduces a novel method for wildlife behavioral research, including the burgeoning field of wildlife drone monitoring, with significant implications for wildlife conservation.
Authors: Beyazit Yalcinkaya, Niklas Lauffer, Marcell Vazquez-Chanlatte, Sanjit A. Seshia
Abstract: Goal-conditioned reinforcement learning is a powerful way to control an AI agent's behavior at runtime. That said, popular goal representations, e.g., target states or natural language, are either limited to Markovian tasks or rely on ambiguous task semantics. We propose representing temporal goals using compositions of deterministic finite automata (cDFAs) and use cDFAs to guide RL agents. cDFAs balance the need for formal temporal semantics with ease of interpretation: if one can understand a flow chart, one can understand a cDFA. On the other hand, cDFAs form a countably infinite concept class with Boolean semantics, and subtle changes to the automaton can result in very different tasks, making them difficult to condition agent behavior on. To address this, we observe that all paths through a DFA correspond to a series of reach-avoid tasks and propose pre-training graph neural network embeddings on "reach-avoid derived" DFAs. Through empirical evaluation, we demonstrate that the proposed pre-training method enables zero-shot generalization to various cDFA task classes and accelerated policy specialization without the myopic suboptimality of hierarchical methods.
Authors: Jonveaux Luc
Abstract: This paper presents a new approach to urban sustainability assessment through the use of Large Language Models (LLMs) to streamline the use of the ISO 37101 framework to automate and standardise the assessment of urban initiatives against the six "sustainability purposes" and twelve "issues" outlined in the standard. The methodology includes the development of a custom prompt based on the standard definitions and its application to two different datasets: 527 projects from the Paris Participatory Budget and 398 activities from the PROBONO Horizon 2020 project. The results show the effectiveness of LLMs in quickly and consistently categorising different urban initiatives according to sustainability criteria. The approach is particularly promising when it comes to breaking down silos in urban planning by providing a holistic view of the impact of projects. The paper discusses the advantages of this method over traditional human-led assessments, including significant time savings and improved consistency. However, it also points out the importance of human expertise in interpreting results and ethical considerations. This study hopefully can contribute to the growing body of work on AI applications in urban planning and provides a novel method for operationalising standardised sustainability frameworks in different urban contexts.
Authors: Haozhe Lei, Yunfei Ge, Quanyan Zhu
Abstract: The integration of AI into modern critical infrastructure systems, such as healthcare, has introduced new vulnerabilities that can significantly impact workflow, efficiency, and safety. Additionally, the increased connectivity has made traditional human-driven penetration testing insufficient for assessing risks and developing remediation strategies. Consequently, there is a pressing need for a distributed, adaptive, and efficient automated penetration testing framework that not only identifies vulnerabilities but also provides countermeasures to enhance security posture. This work presents ADAPT, a game-theoretic and neuro-symbolic framework for automated distributed adaptive penetration testing, specifically designed to address the unique cybersecurity challenges of AI-enabled healthcare infrastructure networks. We use a healthcare system case study to illustrate the methodologies within ADAPT. The proposed solution enables a learning-based risk assessment. Numerical experiments are used to demonstrate effective countermeasures against various tactical techniques employed by adversarial AI.
Authors: Ehsan Ganjidoost, Jeff Orchard
Abstract: An adversarial example is a modified input image designed to cause a Machine Learning (ML) model to make a mistake; these perturbations are often invisible or subtle to human observers and highlight vulnerabilities in a model's ability to generalize from its training data. Several adversarial attacks can create such examples, each with a different perspective, effectiveness, and perceptibility of changes. Conversely, defending against such adversarial attacks improves the robustness of ML models in image processing and other domains of deep learning. Most defence mechanisms require either a level of model awareness, changes to the model, or access to a comprehensive set of adversarial examples during training, which is impractical. Another option is to use an auxiliary model in a preprocessing manner without changing the primary model. This study presents a practical and effective solution -- using predictive coding networks (PCnets) as an auxiliary step for adversarial defence. By seamlessly integrating PCnets into feed-forward networks as a preprocessing step, we substantially bolster resilience to adversarial perturbations. Our experiments on MNIST and CIFAR10 demonstrate the remarkable effectiveness of PCnets in mitigating adversarial examples with about 82% and 65% improvements in robustness, respectively. The PCnet, trained on a small subset of the dataset, leverages its generative nature to effectively counter adversarial efforts, reverting perturbed images closer to their original forms. This innovative approach holds promise for enhancing the security and reliability of neural network classifiers in the face of the escalating threat of adversarial attacks.
Authors: Akash Kundu, Leopoldo Sarra
Abstract: Building quantum circuits that perform a given task is a notoriously difficult problem. Reinforcement learning has proven to be a powerful approach, but many limitations remain due to the exponential scaling of the space of possible operations on qubits. In this paper, we develop an algorithm that automatically learns composite gates ("$gadgets$") and adds them as additional actions to the reinforcement learning agent to facilitate the search, namely the Gadget Reinforcement Learning (GRL) algorithm. We apply our algorithm to finding parameterized quantum circuits (PQCs) that implement the ground state of a given quantum Hamiltonian, a well-known NP-hard challenge. In particular, we focus on the transverse field Ising model (TFIM), since understanding its ground state is crucial for studying quantum phase transitions and critical behavior, and serves as a benchmark for validating quantum algorithms and simulation techniques. We show that with GRL we can find very compact PQCs that improve the error in estimating the ground state of TFIM by up to $10^7$ fold and make it suitable for implementation on real hardware compared to a pure reinforcement learning approach. Moreover, GRL scales better with increasing difficulty and to larger systems. The generality of the algorithm shows the potential for applications to other settings, including optimization tailored to specific real-world quantum platforms.
Authors: Alan Jeffares, Alicia Curth, Mihaela van der Schaar
Abstract: Deep learning sometimes appears to work in unexpected ways. In pursuit of a deeper understanding of its surprising behaviors, we investigate the utility of a simple yet accurate model of a trained neural network consisting of a sequence of first-order approximations telescoping out into a single empirically operational tool for practical analysis. Across three case studies, we illustrate how it can be applied to derive new empirical insights on a diverse range of prominent phenomena in the literature -- including double descent, grokking, linear mode connectivity, and the challenges of applying deep learning on tabular data -- highlighting that this model allows us to construct and extract metrics that help predict and understand the a priori unexpected performance of neural networks. We also demonstrate that this model presents a pedagogical formalism allowing us to isolate components of the training process even in complex contemporary settings, providing a lens to reason about the effects of design choices such as architecture & optimization strategy, and reveals surprising parallels between neural network learning and gradient boosting.
Authors: Koffi Ismael Ouattara
Abstract: Trustworthiness in neural networks is crucial for their deployment in critical applications, where reliability, confidence, and uncertainty play pivotal roles in decision-making. Traditional performance metrics such as accuracy and precision fail to capture these aspects, particularly in cases where models exhibit overconfidence. To address these limitations, this paper introduces a novel framework for quantifying the trustworthiness of neural networks by incorporating subjective logic into the evaluation of Expected Calibration Error (ECE). This method provides a comprehensive measure of trust, disbelief, and uncertainty by clustering predicted probabilities and fusing opinions using appropriate fusion operators. We demonstrate the effectiveness of this approach through experiments on MNIST and CIFAR-10 datasets, where post-calibration results indicate improved trustworthiness. The proposed framework offers a more interpretable and nuanced assessment of AI models, with potential applications in sensitive domains such as healthcare and autonomous systems.
Authors: Alexandru T. Cismaru
Abstract: Large amounts of traffic can lead to negative effects such as increased car accidents, air pollution, and significant time wasted. Understanding traffic speeds on any given road segment can be highly beneficial for traffic management strategists seeking to reduce congestion. While recent studies have primarily focused on modeling spatial dependencies by using graph convolutional networks (GCNs) over fixed weighted graphs, the relationships between nodes are often more complex, with edges that interact dynamically. This paper addresses both the temporal patterns in traffic data and the intricate spatial dependencies by introducing the Signal-Enhanced Graph Convolutional Network Long Short Term Memory (SGCN-LSTM) model for predicting traffic speeds across road networks. Extensive experiments on the PEMS-BAY road network traffic dataset demonstrate the SGCN-LSTM model's effectiveness, yielding significant improvements in Mean Absolute Error (MAE), Root Mean Square Error (RMSE), and Mean Absolute Percentage Error (MAPE) compared to benchmark models on the same dataset.
Authors: Ruisi Zhang, Tianyu Liu, Will Feng, Andrew Gu, Sanket Purandare, Wanchao Liang, Francisco Massa
Abstract: Distributed training of large models consumes enormous computation resources and requires substantial engineering efforts to compose various training techniques. This paper presents SimpleFSDP, a PyTorch-native compiler-based Fully Sharded Data Parallel (FSDP) framework, which has a simple implementation for maintenance and composability, allows full computation-communication graph tracing, and brings performance enhancement via compiler backend optimizations. SimpleFSDP's novelty lies in its unique torch.compile-friendly implementation of collective communications using existing PyTorch primitives, namely parametrizations, selective activation checkpointing, and DTensor. It also features the first-of-its-kind intermediate representation (IR) nodes bucketing and reordering in the TorchInductor backend for effective computation-communication overlapping. As a result, users can employ the aforementioned optimizations to automatically or manually wrap model components for minimal communication exposure. Extensive evaluations of SimpleFSDP on Llama 3 models (including the ultra-large 405B) using TorchTitan demonstrate up to 28.54% memory reduction and 68.67% throughput improvement compared to the most widely adopted FSDP2 eager framework, when composed with other distributed training techniques.
Authors: Ashkan Golgoon, Ryan Franks, Khashayar Filom, Arjun Ravi Kannan
Abstract: In many industrial applications, it is common that the graph embeddings generated from training GNNs are used in an ensemble model where the embeddings are combined with other tabular features (e.g., original node or edge features) in a downstream ML task. The tabular features may even arise naturally if, e.g., one tries to build a graph such that some of the node or edge features are stored in a tabular format. Here we address the problem of explaining the output of such ensemble models for which the input features consist of learned neural graph embeddings combined with additional tabular features. We propose MBExplainer, a model-agnostic explanation approach for downstream models with augmented graph embeddings. MBExplainer returns a human-legible triple as an explanation for an instance prediction of the whole pipeline consisting of three components: a subgraph with the highest importance, the topmost important nodal features, and the topmost important augmented downstream features. A game-theoretic formulation is used to take the contributions of each component and their interactions into account by assigning three Shapley values corresponding to their own specific games. Finding the explanation requires an efficient search through the corresponding local search spaces corresponding to each component. MBExplainer applies a novel multilevel search algorithm that enables simultaneous pruning of local search spaces in a computationally tractable way. In particular, three interweaved Monte Carlo Tree Search are utilized to iteratively prune the local search spaces. MBExplainer also includes a global search algorithm that uses contextual bandits to efficiently allocate pruning budget among the local search spaces. We show the effectiveness of MBExplainer by presenting a set of comprehensive numerical examples on multiple public graph datasets for both node and graph classification tasks.
Authors: David A. Danhofer
Abstract: The crucial role of convolutional models, both as standalone vision models and backbones in foundation models, necessitates effective acceleration techniques. This paper proposes a novel method to learn semi-structured sparsity patterns for convolution kernels in the form of maskings enabling the utilization of readily available hardware accelerations. The approach accelerates convolutional models more than two-fold during inference without decreasing model performance. At the same time, the original model weights and structure remain unchanged keeping the model thus easily updatable. Beyond the immediate practical use, the effect of maskings on prediction is easily quantifiable. Therefore, guarantees on model predictions under maskings are derived showing stability bounds for learned maskings even after updating the original underlying model.
Authors: Yeachan Kim, Junho Kim, Wing-Lam Mok, Jun-Hyung Park, SangKeun Lee
Abstract: Despite the versatility of pre-trained language models (PLMs) across domains, their large memory footprints pose significant challenges in federated learning (FL), where the training model has to be distributed between a server and clients. One potential solution to bypass such constraints might be the use of parameter-efficient fine-tuning (PEFT) in the context of FL. However, we have observed that typical PEFT tends to severely suffer from heterogeneity among clients in FL scenarios, resulting in unstable and slow convergence. In this paper, we propose Client-Customized Adaptation (C2A), a novel hypernetwork-based FL framework that generates client-specific adapters by conditioning the client information. With the effectiveness of the hypernetworks in generating customized weights through learning to adopt the different characteristics of inputs, C2A can maximize the utility of shared model parameters while minimizing the divergence caused by client heterogeneity. To verify the efficacy of C2A, we perform extensive evaluations on FL scenarios involving heterogeneity in label and language distributions. Comprehensive evaluation results clearly support the superiority of C2A in terms of both efficiency and effectiveness in FL scenarios.
Authors: Dogyun Park, Sojin Lee, Sihyeon Kim, Taehoon Lee, Youngjoon Hong, Hyunwoo J. Kim
Abstract: Rectified flow and reflow procedures have significantly advanced fast generation by progressively straightening ordinary differential equation (ODE) flows. They operate under the assumption that image and noise pairs, known as couplings, can be approximated by straight trajectories with constant velocity. However, we observe that modeling with constant velocity and using reflow procedures have limitations in accurately learning straight trajectories between pairs, resulting in suboptimal performance in few-step generation. To address these limitations, we introduce Constant Acceleration Flow (CAF), a novel framework based on a simple constant acceleration equation. CAF introduces acceleration as an additional learnable variable, allowing for more expressive and accurate estimation of the ODE flow. Moreover, we propose two techniques to further improve estimation accuracy: initial velocity conditioning for the acceleration model and a reflow process for the initial velocity. Our comprehensive studies on toy datasets, CIFAR-10, and ImageNet 64x64 demonstrate that CAF outperforms state-of-the-art baselines for one-step generation. We also show that CAF dramatically improves few-step coupling preservation and inversion over Rectified flow. Code is available at \href{https://github.com/mlvlab/CAF}{https://github.com/mlvlab/CAF}.
URLs: https://github.com/mlvlab/CAF, https://github.com/mlvlab/CAF
Authors: Connor J. Mclaughlin, Lili Su
Abstract: Federated learning (FL) is a distributed learning framework that leverages commonalities between distributed client datasets to train a global model. Under heterogeneous clients, however, FL can fail to produce stable training results. Personalized federated learning (PFL) seeks to address this by learning individual models tailored to each client. One approach is to decompose model training into shared representation learning and personalized classifier training. Nonetheless, previous works struggle to navigate the bias-variance trade-off in classifier learning, relying solely on limited local datasets or introducing costly techniques to improve generalization. In this work, we frame representation learning as a generative modeling task, where representations are trained with a classifier based on the global feature distribution. We then propose an algorithm, pFedFDA, that efficiently generates personalized models by adapting global generative classifiers to their local feature distributions. Through extensive computer vision benchmarks, we demonstrate that our method can adjust to complex distribution shifts with significant improvements over current state-of-the-art in data-scarce settings.
Authors: Karine Karine, Benjamin M. Marlin
Abstract: The use of reinforcement learning (RL) to learn policies for just-in-time adaptive interventions (JITAIs) is of significant interest in many behavioral intervention domains including improving levels of physical activity. In a messaging-based physical activity JITAI, a mobile health app is typically used to send messages to a participant to encourage engagement in physical activity. In this setting, RL methods can be used to learn what intervention options to provide to a participant in different contexts. However, deploying RL methods in real physical activity adaptive interventions comes with challenges: the cost and time constraints of real intervention studies result in limited data to learn adaptive intervention policies. Further, commonly used RL simulation environments have dynamics that are of limited relevance to physical activity adaptive interventions and thus shed little light on what RL methods may be optimal for this challenging application domain. In this paper, we introduce StepCountJITAI, an RL environment designed to foster research on RL methods that address the significant challenges of policy learning for adaptive behavioral interventions.
Authors: Weicheng Ma, Luyang Zhao, Chun-Yi She, Yitao Jiang, Alan Sun, Bo Zhu, Devin Balkcom, Soroush Vosoughi
Abstract: Recent large language models (LLMs) have demonstrated promising capabilities in modeling real-world knowledge and enhancing knowledge-based generation tasks. In this paper, we further explore the potential of using LLMs to aid in the design of soft modular robots, taking into account both user instructions and physical laws, to reduce the reliance on extensive trial-and-error experiments typically needed to achieve robot designs that meet specific structural or task requirements. Specifically, we formulate the robot design process as a sequence generation task and find that LLMs are able to capture key requirements expressed in natural language and reflect them in the construction sequences of robots. To simplify, rather than conducting real-world experiments to assess design quality, we utilize a simulation tool to provide feedback to the generative model, allowing for iterative improvements without requiring extensive human annotations. Furthermore, we introduce five evaluation metrics to assess the quality of robot designs from multiple angles including task completion and adherence to instructions, supporting an automatic evaluation process. Our model performs well in evaluations for designing soft modular robots with uni- and bi-directional locomotion and stair-descending capabilities, highlighting the potential of using natural language and LLMs for robot design. However, we also observe certain limitations that suggest areas for further improvement.
Authors: Luyang Zhao, Yitao Jiang, Chun-Yi She, Mingi Jeong, Haibo Dong, Alberto Quattrini Li, Muhao Chen, Devin Balkcom
Abstract: This paper presents the first steps toward a soft dolphin robot using a bio-inspired approach to mimic dolphin flexibility. The current dolphin robot uses a minimalist approach, with only two actuated cable-driven degrees of freedom actuated by a pair of motors. The actuated tail moves up and down in a swimming motion, but this first proof of concept does not permit controlled turns of the robot. While existing robotic dolphins typically use revolute joints to articulate rigid bodies, our design -- which will be made opensource -- incorporates a flexible tail with tunable silicone skin and actuation flexibility via a cable-driven system, which mimics muscle dynamics and design flexibility with a tunable skeleton structure. The design is also tunable since the backbone can be easily printed in various geometries. The paper provides insights into how a few such variations affect robot motion and efficiency, measured by speed and cost of transport (COT). This approach demonstrates the potential of achieving dolphin-like motion through enhanced flexibility in bio-inspired robotics.
Authors: Kuo-Han Hung, Ching-Yun Ko, Ambrish Rawat, I-Hsin Chung, Winston H. Hsu, Pin-Yu Chen
Abstract: Large Language Models (LLMs) have revolutionized various domains but remain vulnerable to prompt injection attacks, where malicious inputs manipulate the model into ignoring original instructions and executing designated action. In this paper, we investigate the underlying mechanisms of these attacks by analyzing the attention patterns within LLMs. We introduce the concept of the distraction effect, where specific attention heads, termed important heads, shift focus from the original instruction to the injected instruction. Building on this discovery, we propose Attention Tracker, a training-free detection method that tracks attention patterns on instruction to detect prompt injection attacks without the need for additional LLM inference. Our method generalizes effectively across diverse models, datasets, and attack types, showing an AUROC improvement of up to 10.0% over existing methods, and performs well even on small LLMs. We demonstrate the robustness of our approach through extensive evaluations and provide insights into safeguarding LLM-integrated systems from prompt injection vulnerabilities.
Authors: Dinitha Wijewardhana, Sugandima Vidanagamachchi, Nalin Arachchilage
Abstract: Cryptocurrencies have gained popularity due to their transparency, security, and accessibility compared to traditional financial systems, with Bitcoin, introduced in 2009, leading the market. Bitcoin's security relies on blockchain technology - a decentralized ledger consisting of a consensus and an incentive mechanism. The consensus mechanism, Proof of Work (PoW), requires miners to solve difficult cryptographic puzzles to add new blocks, while the incentive mechanism rewards them with newly minted bitcoins. However, as Bitcoin's acceptance grows, it faces increasing threats from attacks targeting these mechanisms, such as selfish mining, double-spending, and block withholding. These attacks compromise security, efficiency, and reward distribution. Recent research shows that these attacks can be combined with each other or with either malicious strategies, such as network-layer attacks, or non-malicious strategies, like honest mining. These combinations lead to more sophisticated attacks, increasing the attacker's success rates and profitability. Therefore, understanding and evaluating these attacks is essential for developing effective countermeasures and ensuring long-term security. This paper begins by examining individual attacks executed in isolation and their profitability. It then explores how combining these attacks with each other or with other malicious and non-malicious strategies can enhance their overall effectiveness and profitability. The analysis further explores how the deployment of attacks such as selfish mining and block withholding by multiple competing mining pools against each other impacts their economic returns. Lastly, a set of design guidelines is provided, outlining areas future work should focus on to prevent or mitigate the identified threats.
Authors: Mengcheng Li, Mingbao Lin, Fei Chao, Chia-Wen Lin, Rongrong Ji
Abstract: In this paper, we propose TextDestroyer, the first training- and annotation-free method for scene text destruction using a pre-trained diffusion model. Existing scene text removal models require complex annotation and retraining, and may leave faint yet recognizable text information, compromising privacy protection and content concealment. TextDestroyer addresses these issues by employing a three-stage hierarchical process to obtain accurate text masks. Our method scrambles text areas in the latent start code using a Gaussian distribution before reconstruction. During the diffusion denoising process, self-attention key and value are referenced from the original latent to restore the compromised background. Latent codes saved at each inversion step are used for replacement during reconstruction, ensuring perfect background restoration. The advantages of TextDestroyer include: (1) it eliminates labor-intensive data annotation and resource-intensive training; (2) it achieves more thorough text destruction, preventing recognizable traces; and (3) it demonstrates better generalization capabilities, performing well on both real-world scenes and generated images.
Authors: Lijia Yu, Xiao-Shan Gao, Lijun Zhang, Yibo Miao
Abstract: The neural network memorization problem is to study the expressive power of neural networks to interpolate a finite dataset. Although memorization is widely believed to have a close relationship with the strong generalizability of deep learning when using over-parameterized models, to the best of our knowledge, there exists no theoretical study on the generalizability of memorization neural networks. In this paper, we give the first theoretical analysis of this topic. Since using i.i.d. training data is a necessary condition for a learning algorithm to be generalizable, memorization and its generalization theory for i.i.d. datasets are developed under mild conditions on the data distribution. First, algorithms are given to construct memorization networks for an i.i.d. dataset, which have the smallest number of parameters and even a constant number of parameters. Second, we show that, in order for the memorization networks to be generalizable, the width of the network must be at least equal to the dimension of the data, which implies that the existing memorization networks with an optimal number of parameters are not generalizable. Third, a lower bound for the sample complexity of general memorization algorithms and the exact sample complexity for memorization algorithms with constant number of parameters are given. It is also shown that there exist data distributions such that, to be generalizable for them, the memorization network must have an exponential number of parameters in the data dimension. Finally, an efficient and generalizable memorization algorithm is given when the number of training samples is greater than the efficient memorization sample complexity of the data distribution.
Authors: David Anugraha, Garry Kuwanto, Lucky Susanto, Derry Tanti Wijaya, Genta Indra Winata
Abstract: We present MetaMetrics-MT, an innovative metric designed to evaluate machine translation (MT) tasks by aligning closely with human preferences through Bayesian optimization with Gaussian Processes. MetaMetrics-MT enhances existing MT metrics by optimizing their correlation with human judgments. Our experiments on the WMT24 metric shared task dataset demonstrate that MetaMetrics-MT outperforms all existing baselines, setting a new benchmark for state-of-the-art performance in the reference-based setting. Furthermore, it achieves comparable results to leading metrics in the reference-free setting, offering greater efficiency.
Authors: Junlin He, Jinxiao Du, Wei Ma
Abstract: Self-supervised learning (SSL) has rapidly advanced in recent years, approaching the performance of its supervised counterparts through the extraction of representations from unlabeled data. However, dimensional collapse, where a few large eigenvalues dominate the eigenspace, poses a significant obstacle for SSL. When dimensional collapse occurs on features (e.g. hidden features and representations), it prevents features from representing the full information of the data; when dimensional collapse occurs on weight matrices, their filters are self-related and redundant, limiting their expressive power. Existing studies have predominantly concentrated on the dimensional collapse of representations, neglecting whether this can sufficiently prevent the dimensional collapse of the weight matrices and hidden features. To this end, we first time propose a mitigation approach employing orthogonal regularization (OR) across the encoder, targeting both convolutional and linear layers during pretraining. OR promotes orthogonality within weight matrices, thus safeguarding against the dimensional collapse of weight matrices, hidden features, and representations. Our empirical investigations demonstrate that OR significantly enhances the performance of SSL methods across diverse benchmarks, yielding consistent gains with both CNNs and Transformer-based architectures.
Authors: Heiko Hoffmann
Abstract: Scalar variables, e.g., the orientation of a shape in an image, are commonly predicted using a single output neuron in a neural network. In contrast, the mammalian cortex represents variables with a population of neurons. In this population code, each neuron is most active at its preferred value and shows partial activity for other values. Here, we investigate the benefit of using a population code for the output layer of a neural network. We compare population codes against single-neuron outputs and one-hot vectors. First, we show theoretically and in experiments with synthetic data that population codes improve robustness to input noise in networks of stacked linear layers. Second, we demonstrate the benefit of population codes to encode ambiguous outputs, as found for symmetric objects. Using the T-LESS dataset of feature-less real-world objects, we show that population codes improve the accuracy of predicting object orientation from RGB-image input.
Authors: Li Liu, Diji Yang, Sijia Zhong, Kalyana Suma Sree Tholeti, Lei Ding, Yi Zhang, Leilani H. Gilpin
Abstract: In question-answering scenarios, humans can assess whether the available information is sufficient and seek additional information if necessary, rather than providing a forced answer. In contrast, Vision Language Models (VLMs) typically generate direct, one-shot responses without evaluating the sufficiency of the information. To investigate this gap, we identify a critical and challenging task in the Visual Question Answering (VQA) scenario: can VLMs indicate how to adjust an image when the visual information is insufficient to answer a question? This capability is especially valuable for assisting visually impaired individuals who often need guidance to capture images correctly. To evaluate this capability of current VLMs, we introduce a human-labeled dataset as a benchmark for this task. Additionally, we present an automated framework that generates synthetic training data by simulating ``where to know'' scenarios. Our empirical results show significant performance improvements in mainstream VLMs when fine-tuned with this synthetic data. This study demonstrates the potential to narrow the gap between information assessment and acquisition in VLMs, bringing their performance closer to humans.
Authors: Zhi Zhang, Chris Chow, Yasi Zhang, Yanchao Sun, Haochen Zhang, Eric Hanchen Jiang, Han Liu, Furong Huang, Yuchen Cui, Oscar Hernan Madrid Padilla
Abstract: Lifelong reinforcement learning (RL) has been developed as a paradigm for extending single-task RL to more realistic, dynamic settings. In lifelong RL, the "life" of an RL agent is modeled as a stream of tasks drawn from a task distribution. We propose EPIC (\underline{E}mpirical \underline{P}AC-Bayes that \underline{I}mproves \underline{C}ontinuously), a novel algorithm designed for lifelong RL using PAC-Bayes theory. EPIC learns a shared policy distribution, referred to as the \textit{world policy}, which enables rapid adaptation to new tasks while retaining valuable knowledge from previous experiences. Our theoretical analysis establishes a relationship between the algorithm's generalization performance and the number of prior tasks preserved in memory. We also derive the sample complexity of EPIC in terms of RL regret. Extensive experiments on a variety of environments demonstrate that EPIC significantly outperforms existing methods in lifelong RL, offering both theoretical guarantees and practical efficacy through the use of the world policy.
Authors: Bohan Lyu, Yadi Cao, Duncan Watson-Parris, Leon Bergen, Taylor Berg-Kirkpatrick, Rose Yu
Abstract: Large Language Models (LLMs) demonstrate promising capabilities in solving simple scientific problems but often produce hallucinations for complex ones. While integrating LLMs with tools can increase reliability, this approach typically results in over-reliance on tools, diminishing the model's ability to solve simple problems through basic reasoning. In contrast, human experts first assess problem complexity using domain knowledge before choosing an appropriate solution approach. Inspired by this human problem-solving process, we propose a novel two-component fine-tuning method. In the first component World Knowledge Distillation (WKD), LLMs learn directly from solutions generated using tool's information to internalize domain knowledge. In the second component Tool Usage Adaptation (TUA), we partition problems into easy and hard categories based on the model's direct answering accuracy. While maintaining the same alignment target for easy problems as in WKD, we train the model to intelligently switch to tool usage for more challenging problems. We validate our method on six scientific benchmark datasets, spanning mathematics, climate science and epidemiology. On average, our models demonstrate a 28.18% improvement in answer accuracy and a 13.89% increase in tool usage precision across all datasets, surpassing state-of-the-art models including GPT-4o and Claude-3.5.
Authors: John Edwards, Arto Hellas, Juho Leinonen
Abstract: Computing educators and researchers have used programming process data to understand how programs are constructed and what sorts of problems students struggle with. Although such data shows promise for using it for feedback, fully automated programming process feedback systems have still been an under-explored area. The recent emergence of large language models (LLMs) have yielded additional opportunities for researchers in a wide variety of fields. LLMs are efficient at transforming content from one format to another, leveraging the body of knowledge they have been trained with in the process. In this article, we discuss opportunities of using LLMs for analyzing programming process data. To complement our discussion, we outline a case study where we have leveraged LLMs for automatically summarizing the programming process and for creating formative feedback on the programming process. Overall, our discussion and findings highlight that the computing education research and practice community is again one step closer to automating formative programming process-focused feedback.
Authors: Chenghua Huang, Zhizhen Fan, Lu Wang, Fangkai Yang, Pu Zhao, Zeqi Lin, Qingwei Lin, Dongmei Zhang, Saravan Rajmohan, Qi Zhang
Abstract: Reinforcement Learning from Human Feedback (RLHF) is a crucial technique for aligning language models with human preferences, playing a pivotal role in the success of conversational models like GPT-4, ChatGPT, and Llama 2. A core challenge in employing RLHF lies in training a reliable reward model (RM), which relies on high-quality labels typically provided by human experts or advanced AI system. These methods can be costly and may introduce biases that affect the language model's responses. As language models improve, human input may become less effective in further enhancing their performance. In this paper, we propose Self-Evolved Reward Learning (SER), a novel approach where the RM generates additional training data to iteratively improve itself. We conducted extensive experiments on multiple datasets such as HH-RLHF and UltraFeedback, using models like Mistral and Llama 3, and compare SER against various baselines. Our results demonstrate that even with limited human-annotated data, learning from self-feedback can robustly enhance RM performance, thereby boosting the capabilities of large language models (LLMs).
Authors: Aman Gupta, Anirudh Ravichandran, Ziji Zhang, Swair Shah, Anurag Beniwal, Narayanan Sadagopan
Abstract: Task-oriented dialogue systems are essential for applications ranging from customer service to personal assistants and are widely used across various industries. However, developing effective multi-domain systems remains a significant challenge due to the complexity of handling diverse user intents, entity types, and domain-specific knowledge across several domains. In this work, we propose DARD (Domain Assigned Response Delegation), a multi-agent conversational system capable of successfully handling multi-domain dialogs. DARD leverages domain-specific agents, orchestrated by a central dialog manager agent. Our extensive experiments compare and utilize various agent modeling approaches, combining the strengths of smaller fine-tuned models (Flan-T5-large & Mistral-7B) with their larger counterparts, Large Language Models (LLMs) (Claude Sonnet 3.0). We provide insights into the strengths and limitations of each approach, highlighting the benefits of our multi-agent framework in terms of flexibility and composability. We evaluate DARD using the well-established MultiWOZ benchmark, achieving state-of-the-art performance by improving the dialogue inform rate by 6.6% and the success rate by 4.1% over the best-performing existing approaches. Additionally, we discuss various annotator discrepancies and issues within the MultiWOZ dataset and its evaluation system.
Authors: Yun Jiang, Zilong Xie, Wei Zhang, Yun Fang, Shuai Pan
Abstract: Retrieval-augmented generation methods often neglect the quality of content retrieved from external knowledge bases, resulting in irrelevant information or potential misinformation that negatively affects the generation results of large language models. In this paper, we propose an end-to-end model with adaptive filtering for retrieval-augmented generation (E2E-AFG), which integrates answer existence judgment and text generation into a single end-to-end framework. This enables the model to focus more effectively on relevant content while reducing the influence of irrelevant information and generating accurate answers. We evaluate E2E-AFG on six representative knowledge-intensive language datasets, and the results show that it consistently outperforms baseline models across all tasks, demonstrating the effectiveness and robustness of the proposed approach.
Authors: Zixuan He, Ziqian Kong, Zhengyu Chen, Yuling Zhan, Zijun Que, Zhengguo Xu
Abstract: Accurate remaining useful life (RUL) predictions are critical to the safe operation of aero-engines. Currently, the RUL prediction task is mainly a regression paradigm with only mean square error as the loss function and lacks research on feature space structure, the latter of which has shown excellent performance in a large number of studies. This paper develops a multi-granularity supervised contrastive (MGSC) framework from plain intuition that samples with the same RUL label should be aligned in the feature space, and address the problems of too large minibatch size and unbalanced samples in the implementation. The RUL prediction with MGSC is implemented on using the proposed multi-phase training strategy. This paper also demonstrates a simple and scalable basic network structure and validates the proposed MGSC strategy on the CMPASS dataset using a convolutional long short-term memory network as a baseline, which effectively improves the accuracy of RUL prediction.
Authors: Rui Yang, Jie Wang, Guoping Wu, Bin Li
Abstract: Real-world offline datasets are often subject to data corruptions (such as noise or adversarial attacks) due to sensor failures or malicious attacks. Despite advances in robust offline reinforcement learning (RL), existing methods struggle to learn robust agents under high uncertainty caused by the diverse corrupted data (i.e., corrupted states, actions, rewards, and dynamics), leading to performance degradation in clean environments. To tackle this problem, we propose a novel robust variational Bayesian inference for offline RL (TRACER). It introduces Bayesian inference for the first time to capture the uncertainty via offline data for robustness against all types of data corruptions. Specifically, TRACER first models all corruptions as the uncertainty in the action-value function. Then, to capture such uncertainty, it uses all offline data as the observations to approximate the posterior distribution of the action-value function under a Bayesian inference framework. An appealing feature of TRACER is that it can distinguish corrupted data from clean data using an entropy-based uncertainty measure, since corrupted data often induces higher uncertainty and entropy. Based on the aforementioned measure, TRACER can regulate the loss associated with corrupted data to reduce its influence, thereby enhancing robustness and performance in clean environments. Experiments demonstrate that TRACER significantly outperforms several state-of-the-art approaches across both individual and simultaneous data corruptions.
Authors: Anuradha Chopra, Abhinaba Roy, Dorien Herremans
Abstract: This paper introduces an extendable modular system that compiles a range of music feature extraction models to aid music information retrieval research. The features include musical elements like key, downbeats, and genre, as well as audio characteristics like instrument recognition, vocals/instrumental classification, and vocals gender detection. The integrated models are state-of-the-art or latest open-source. The features can be extracted as latent or post-processed labels, enabling integration into music applications such as generative music, recommendation, and playlist generation. The modular design allows easy integration of newly developed systems, making it a good benchmarking and comparison tool. This versatile toolkit supports the research community in developing innovative solutions by providing concrete musical features.
Authors: Bubacarr Jobarteh, Madalina Mincu, Gavojdian Dinu, Suresh Neethirajan
Abstract: Understanding animal vocalizations through multi-source data fusion is crucial for assessing emotional states and enhancing animal welfare in precision livestock farming. This study aims to decode dairy cow contact calls by employing multi-modal data fusion techniques, integrating transcription, semantic analysis, contextual and emotional assessment, and acoustic feature extraction. We utilized the Natural Language Processing model to transcribe audio recordings of cow vocalizations into written form. By fusing multiple acoustic features frequency, duration, and intensity with transcribed textual data, we developed a comprehensive representation of cow vocalizations. Utilizing data fusion within a custom-developed ontology, we categorized vocalizations into high frequency calls associated with distress or arousal, and low frequency calls linked to contentment or calmness. Analyzing the fused multi dimensional data, we identified anxiety related features indicative of emotional distress, including specific frequency measurements and sound spectrum results. Assessing the sentiment and acoustic features of vocalizations from 20 individual cows allowed us to determine differences in calling patterns and emotional states. Employing advanced machine learning algorithms, Random Forest, Support Vector Machine, and Recurrent Neural Networks, we effectively processed and fused multi-source data to classify cow vocalizations. These models were optimized to handle computational demands and data quality challenges inherent in practical farm environments. Our findings demonstrate the effectiveness of multi-source data fusion and intelligent processing techniques in animal welfare monitoring. This study represents a significant advancement in animal welfare assessment, highlighting the role of innovative fusion technologies in understanding and improving the emotional wellbeing of dairy cows.
Authors: Anbang Wang
Abstract: This paper presents ReverseNER, a framework aimed at overcoming the limitations of large language models (LLMs) in zero-shot Named Entity Recognition (NER) tasks, particularly in cases where certain entity types have ambiguous boundaries. ReverseNER tackles this challenge by constructing a reliable example library with the reversed process of NER. Rather than beginning with sentences, this method uses an LLM to generate entities based on their definitions and then expands them into full sentences. During sentence generation, the LLM is guided to replicate the structure of a specific 'feature sentence', extracted from the task sentences by clustering. This results in well-annotated sentences with clearly labeled entities, while preserving semantic and structural similarity to the task sentences. Once the example library is constructed, the method selects the most semantically similar example labels for each task sentence to support the LLM's inference. We also propose an entity-level self-consistency scoring mechanism to improve NER performance with LLMs. Experiments show that ReverseNER significantly outperforms traditional zero-shot NER with LLMs and surpasses several few-shot methods, marking a notable improvement in NER for domains with limited labeled data.
Authors: Sourav Modak, Anthony Stein
Abstract: In automated crop protection tasks such as weed control, disease diagnosis, and pest monitoring, deep learning has demonstrated significant potential. However, these advanced models rely heavily on high-quality, diverse datasets, often limited and costly in agricultural settings. Traditional data augmentation can increase dataset volume but usually lacks the real-world variability needed for robust training. This study presents a new approach for generating synthetic images to improve deep learning-based object detection models for intelligent weed control. Our GenAI-based image generation pipeline integrates the Segment Anything Model (SAM) for zero-shot domain adaptation with a text-to-image Stable Diffusion Model, enabling the creation of synthetic images that capture diverse real-world conditions. We evaluate these synthetic datasets using lightweight YOLO models, measuring data efficiency with mAP50 and mAP50-95 scores across varying proportions of real and synthetic data. Notably, YOLO models trained on datasets with 10% synthetic and 90% real images generally demonstrate superior mAP50 and mAP50-95 scores compared to those trained solely on real images. This approach not only reduces dependence on extensive real-world datasets but also enhances predictive performance. The integration of this approach opens opportunities for achieving continual self-improvement of perception modules in intelligent technical systems.
Authors: Hojung Jung, Youngrok Park, Laura Schmid, Jaehyeong Jo, Dongkyu Lee, Bongsang Kim, Se-Young Yun, Jinwoo Shin
Abstract: Diffusion models have demonstrated remarkable success in various domains, including molecular generation. However, conditional molecular generation remains a fundamental challenge due to an intrinsic trade-off between targeting specific chemical properties and generating meaningful samples from the data distribution. In this work, we present Time-Aware Conditional Synthesis (TACS), a novel approach to conditional generation on diffusion models. It integrates adaptively controlled plug-and-play "online" guidance into a diffusion model, driving samples toward the desired properties while maintaining validity and stability. A key component of our algorithm is our new type of diffusion sampler, Time Correction Sampler (TCS), which is used to control guidance and ensure that the generated molecules remain on the correct manifold at each reverse step of the diffusion process at the same time. Our proposed method demonstrates significant performance in conditional 3D molecular generation and offers a promising approach towards inverse molecular design, potentially facilitating advancements in drug discovery, materials science, and other related fields.
Authors: Xintong Yang, Ze Ji, Yu-Kun Lai
Abstract: Robotic manipulation of volumetric elastoplastic deformable materials, from foods such as dough to construction materials like clay, is in its infancy, largely due to the difficulty of modelling and perception in a high-dimensional space. Simulating the dynamics of such materials is computationally expensive. It tends to suffer from inaccurately estimated physics parameters of the materials and the environment, impeding high-precision manipulation. Estimating such parameters from raw point clouds captured by optical cameras suffers further from heavy occlusions. To address this challenge, this work introduces a novel Differentiable Physics-based System Identification (DPSI) framework that enables a robot arm to infer the physics parameters of elastoplastic materials and the environment using simple manipulation motions and incomplete 3D point clouds, aligning the simulation with the real world. Extensive experiments show that with only a single real-world interaction, the estimated parameters, Young's modulus, Poisson's ratio, yield stress and friction coefficients, can accurately simulate visually and physically realistic deformation behaviours induced by unseen and long-horizon manipulation motions. Additionally, the DPSI framework inherently provides physically intuitive interpretations for the parameters in contrast to black-box approaches such as deep neural networks.
Authors: Nikita Severin, Aleksei Ziablitsev, Yulia Savelyeva, Valeriy Tashchilin, Ivan Bulychev, Mikhail Yushkov, Artem Kushneruk, Amaliya Zaryvnykh, Dmitrii Kiselev, Andrey Savchenko, Ilya Makarov
Abstract: We present LLM-KT, a flexible framework designed to enhance collaborative filtering (CF) models by seamlessly integrating LLM (Large Language Model)-generated features. Unlike existing methods that rely on passing LLM-generated features as direct inputs, our framework injects these features into an intermediate layer of any CF model, allowing the model to reconstruct and leverage the embeddings internally. This model-agnostic approach works with a wide range of CF models without requiring architectural changes, making it adaptable to various recommendation scenarios. Our framework is built for easy integration and modification, providing researchers and developers with a powerful tool for extending CF model capabilities through efficient knowledge transfer. We demonstrate its effectiveness through experiments on the MovieLens and Amazon datasets, where it consistently improves baseline CF models. Experimental studies showed that LLM-KT is competitive with the state-of-the-art methods in context-aware settings but can be applied to a broader range of CF models than current approaches.
Authors: Leo Ardon, Benjamin Patrick Evans, Deepeka Garg, Annapoorani Lakshmi Narayanan, Makada Henry-Nickie, Sumitra Ganesh
Abstract: We develop a novel two-layer approach for optimising mortgage relief products through a simulated multi-agent mortgage environment. While the approach is generic, here the environment is calibrated to the US mortgage market based on publicly available census data and regulatory guidelines. Through the simulation layer, we assess the resilience of households to exogenous income shocks, while the optimisation layer explores strategies to improve the robustness of households to these shocks by making novel mortgage assistance products available to households. Households in the simulation are adaptive, learning to make mortgage-related decisions (such as product enrolment or strategic foreclosures) that maximize their utility, balancing their available liquidity and equity. We show how this novel two-layer simulation approach can successfully design novel mortgage assistance products to improve household resilience to exogenous shocks, and balance the costs of providing such products through post-hoc analysis. Previously, such analysis could only be conducted through expensive pilot studies involving real participants, demonstrating the benefit of the approach for designing and evaluating financial products.
Authors: Xinyue Li, Zhenpeng Chen, Jie M. Zhang, Yiling Lou, Tianlin Li, Weisong Sun, Yang Liu, Xuanzhe Liu
Abstract: Large Language Models (LLMs) have become foundational in modern language-driven applications, profoundly influencing daily life. A critical technique in leveraging their potential is role-playing, where LLMs simulate diverse roles to enhance their real-world utility. However, while research has highlighted the presence of social biases in LLM outputs, it remains unclear whether and to what extent these biases emerge during role-playing scenarios. In this paper, we introduce BiasLens, a fairness testing framework designed to systematically expose biases in LLMs during role-playing. Our approach uses LLMs to generate 550 social roles across a comprehensive set of 11 demographic attributes, producing 33,000 role-specific questions targeting various forms of bias. These questions, spanning Yes/No, multiple-choice, and open-ended formats, are designed to prompt LLMs to adopt specific roles and respond accordingly. We employ a combination of rule-based and LLM-based strategies to identify biased responses, rigorously validated through human evaluation. Using the generated questions as the benchmark, we conduct extensive evaluations of six advanced LLMs released by OpenAI, Mistral AI, Meta, Alibaba, and DeepSeek. Our benchmark reveals 72,716 biased responses across the studied LLMs, with individual models yielding between 7,754 and 16,963 biased responses, underscoring the prevalence of bias in role-playing contexts. To support future research, we have publicly released the benchmark, along with all scripts and experimental results.
Authors: Cristian Meo, Louis Mahon, Anirudh Goyal, Justin Dauwels
Abstract: While disentangled representations have shown promise in generative modeling and representation learning, their downstream usefulness remains debated. Recent studies re-defined disentanglement through a formal connection to symmetries, emphasizing the ability to reduce latent domains and consequently enhance generative capabilities. However, from an information theory viewpoint, assigning a complex attribute to a specific latent variable may be infeasible, limiting the applicability of disentangled representations to simple datasets. In this work, we introduce $\alpha$-TCVAE, a variational autoencoder optimized using a novel total correlation (TC) lower bound that maximizes disentanglement and latent variables informativeness. The proposed TC bound is grounded in information theory constructs, generalizes the $\beta$-VAE lower bound, and can be reduced to a convex combination of the known variational information bottleneck (VIB) and conditional entropy bottleneck (CEB) terms. Moreover, we present quantitative analyses that support the idea that disentangled representations lead to better generative capabilities and diversity. Additionally, we perform downstream task experiments from both representation and RL domains to assess our questions from a broader ML perspective. Our results demonstrate that $\alpha$-TCVAE consistently learns more disentangled representations than baselines and generates more diverse observations without sacrificing visual fidelity. Notably, $\alpha$-TCVAE exhibits marked improvements on MPI3D-Real, the most realistic disentangled dataset in our study, confirming its ability to represent complex datasets when maximizing the informativeness of individual variables. Finally, testing the proposed model off-the-shelf on a state-of-the-art model-based RL agent, Director, significantly shows $\alpha$-TCVAE downstream usefulness on the loconav Ant Maze task.
Authors: Zhili Feng, Tanya Marwah, Lester Mackey, David Alvarez-Melis, Nicolo Fusi
Abstract: Modern large language models use a fixed tokenizer to effectively compress text drawn from a source domain. However, applying the same tokenizer to a new target domain often leads to inferior compression, more costly inference, and reduced semantic alignment. To address this deficiency, we introduce Sparse Sinkhorn Token Translation (S2T2). S2T2 trains a tailored tokenizer for the target domain and learns to translate between target and source tokens, enabling more effective reuse of the pre-trained next-source-token predictor. In our experiments with finetuned English language models, S2T2 improves both the perplexity and the compression of out-of-domain protein sequences, outperforming direct finetuning with either the source or target tokenizer. In addition, we find that token translations learned for smaller, less expensive models can be directly transferred to larger, more powerful models to reap the benefits of S2T2 at lower cost.
Authors: Mianyong Ding, Matteo Maspero, Annemieke S Littooij, Martine van Grotel, Raquel Davila Fajardo, Max M van Noesel, Marry M van den Heuvel-Eibrink, Geert O Janssens
Abstract: Purposes: This study aimed to develop a computed tomography (CT)-based multi-organ segmentation model for delineating organs-at-risk (OARs) in pediatric upper abdominal tumors and evaluate its robustness across multiple datasets. Materials and methods: In-house postoperative CTs from pediatric patients with renal tumors and neuroblastoma (n=189) and a public dataset (n=189) with CTs covering thoracoabdominal regions were used. Seventeen OARs were delineated: nine by clinicians (Type 1) and eight using TotalSegmentator (Type 2). Auto-segmentation models were trained using in-house (ModelPMC-UMCU) and a combined dataset of public data (Model-Combined). Performance was assessed with Dice Similarity Coefficient (DSC), 95% Hausdorff Distance (HD95), and mean surface distance (MSD). Two clinicians rated clinical acceptability on a 5-point Likert scale across 15 patient contours. Model robustness was evaluated against sex, age, intravenous contrast, and tumor type. Results: Model-PMC-UMCU achieved mean DSC values above 0.95 for five of nine OARs, while spleen and heart ranged between 0.90 and 0.95. The stomach-bowel and pancreas exhibited DSC values below 0.90. Model-Combined demonstrated improved robustness across both datasets. Clinical evaluation revealed good usability, with both clinicians rating six of nine Type 1 OARs above four and six of eight Type 2 OARs above three. Significant performance 2 differences were only found across age groups in both datasets, specifically in the left lung and pancreas. The 0-2 age group showed the lowest performance. Conclusion: A multi-organ segmentation model was developed, showcasing enhanced robustness when trained on combined datasets. This model is suitable for various OARs and can be applied to multiple datasets in clinical settings.
Authors: Li Li
Abstract: 3D LiDAR point cloud data is crucial for scene perception in computer vision, robotics, and autonomous driving. Geometric and semantic scene understanding, involving 3D point clouds, is essential for advancing autonomous driving technologies. However, significant challenges remain, particularly in improving the overall accuracy (e.g., segmentation accuracy, depth estimation accuracy, etc.) and efficiency of these systems. To address the challenge in terms of accuracy related to LiDAR-based tasks, we present DurLAR, the first high-fidelity 128-channel 3D LiDAR dataset featuring panoramic ambient (near infrared) and reflectivity imagery. To improve efficiency in 3D segmentation while ensuring the accuracy, we propose a novel pipeline that employs a smaller architecture, requiring fewer ground-truth annotations while achieving superior segmentation accuracy compared to contemporary approaches. To improve the segmentation accuracy, we introduce Range-Aware Pointwise Distance Distribution (RAPiD) features and the associated RAPiD-Seg architecture. All contributions have been accepted by peer-reviewed conferences, underscoring the advancements in both accuracy and efficiency in 3D LiDAR applications for autonomous driving. Full abstract: https://etheses.dur.ac.uk/15738/.
Authors: Yu Tai, Xinglong Wu, Hongwei Yang, Hui He, Duanjing Chen, Yuanming Shao, Weizhe Zhang
Abstract: Temporal Heterogeneous Networks play a crucial role in capturing the dynamics and heterogeneity inherent in various real-world complex systems, rendering them a noteworthy research avenue for link prediction. However, existing methods fail to capture the fine-grained differential distribution patterns and temporal dynamic characteristics, which we refer to as spatial heterogeneity and temporal heterogeneity. To overcome such limitations, we propose a novel \textbf{C}ontrastive Learning-based \textbf{L}ink \textbf{P}rediction model, \textbf{CLP}, which employs a multi-view hierarchical self-supervised architecture to encode spatial and temporal heterogeneity. Specifically, aiming at spatial heterogeneity, we develop a spatial feature modeling layer to capture the fine-grained topological distribution patterns from node- and edge-level representations, respectively. Furthermore, aiming at temporal heterogeneity, we devise a temporal information modeling layer to perceive the evolutionary dependencies of dynamic graph topologies from time-level representations. Finally, we encode the spatial and temporal distribution heterogeneity from a contrastive learning perspective, enabling a comprehensive self-supervised hierarchical relation modeling for the link prediction task. Extensive experiments conducted on four real-world dynamic heterogeneous network datasets verify that our \mymodel consistently outperforms the state-of-the-art models, demonstrating an average improvement of 10.10\%, 13.44\% in terms of AUC and AP, respectively.
Authors: Yingwei Ma, Rongyu Cao, Yongchang Cao, Yue Zhang, Jue Chen, Yibo Liu, Yuchen Liu, Binhua Li, Fei Huang, Yongbin Li
Abstract: Recent advancements in LLM-based agents have led to significant progress in automatic software engineering, particularly in software maintenance and evolution. Despite these encouraging advances, current research faces two major challenges. First, SOTA performance primarily depends on closed-source models, which significantly limits the technology's accessibility, and potential for customization in diverse SE tasks. Second, these models are predominantly trained on static code data, lacking a deep understanding of the dynamic interactions, iterative problem-solving processes, and evolutionary characteristics inherent in software development. To address these challenges, our study adopts a software engineering perspective. We recognize that real-world software maintenance and evolution processes encompass not only static code data but also developers' thought processes, utilization of external tools, and the interaction between different functional personnel. Consequently, we introduce the Lingma SWE-GPT series, comprising Lingma SWE-GPT 7B and 72B. By learning from and simulating real-world code submission activities, Lingma SWE-GPT systematically incorporates the dynamic interactions and iterative problem-solving inherent in software development process, thereby achieving a more comprehensive understanding of software improvement processes. We conducted experimental evaluations using SWE-bench Verified benchmark. The results demonstrate that Lingma SWE-GPT 72B successfully resolves 30.20% of the GitHub issues, marking a significant improvement in automatic issue resolution (22.76% relative improvement compared to Llama 3.1 405B), approaching the performance of closed-source models (31.80\% issues of GPT-4o resolved). Notably, Lingma SWE-GPT 7B resolves 18.20% of the issues, highlighting the potential for applying smaller models to ASE tasks.
Authors: Zerui Wang, Yan Liu
Abstract: Transformer-based models have achieved state-of-the-art performance in various computer vision tasks, including image and video analysis. However, Transformer's complex architecture and black-box nature pose challenges for explainability, a crucial aspect for real-world applications and scientific inquiry. Current Explainable AI (XAI) methods can only provide one-dimensional feature importance, either spatial or temporal explanation, with significant computational complexity. This paper introduces STAA (Spatio-Temporal Attention Attribution), an XAI method for interpreting video Transformer models. Differ from traditional methods that separately apply image XAI techniques for spatial features or segment contribution analysis for temporal aspects, STAA offers both spatial and temporal information simultaneously from attention values in Transformers. The study utilizes the Kinetics-400 dataset, a benchmark collection of 400 human action classes used for action recognition research. We introduce metrics to quantify explanations. We also apply optimization to enhance STAA's raw output. By implementing dynamic thresholding and attention focusing mechanisms, we improve the signal-to-noise ratio in our explanations, resulting in more precise visualizations and better evaluation results. In terms of computational overhead, our method requires less than 3\% of the computational resources of traditional XAI methods, making it suitable for real-time video XAI analysis applications. STAA contributes to the growing field of XAI by offering a method for researchers and practitioners to analyze Transformer models.
Authors: Jasper Roe (James Cook University Singapore), Mike Perkins (British University Vietnam)
Abstract: This scoping review examines the relationship between Generative AI (GenAI) and agency in education, analyzing the literature available through the lens of Critical Digital Pedagogy. Following PRISMA-ScR guidelines, we collected 10 studies from academic databases focusing on both learner and teacher agency in GenAI-enabled environments. We conducted an AI-supported hybrid thematic analysis that revealed three key themes: Control in Digital Spaces, Variable Engagement and Access, and Changing Notions of Agency. The findings suggest that while GenAI may enhance learner agency through personalization and support, it also risks exacerbating educational inequalities and diminishing learner autonomy in certain contexts. This review highlights gaps in the current research on GenAI's impact on agency. These findings have implications for educational policy and practice, suggesting the need for frameworks that promote equitable access while preserving learner agency in GenAI-enhanced educational environments.
Authors: Hongjun An, Yiliang Song, Xuelong Li
Abstract: We discovered the underlying physics in Next-token Prediction (NTP). We identified the law of information conservation within NTP and proposed the First Law of Information Capacity (IC-1), demonstrating that the essence of intelligence emergence in auto-regressive models is fundamentally a process of information transfer. We also introduced Landauer's Principle into NTP, formulating the Second Law of Information Capacity (IC-2), which establishes the relationship between auto-regressive model training and energy consumption. Additionally, we presented several corollaries, which hold practical significance for production practices. Finally, we validated the compatibility and complementarity of our findings with existing theories.
Authors: Jingming Guo, Yan Liu, Yu Meng, Zhiwei Tao, Banglan Liu, Gang Chen, Xiang Li
Abstract: The Mixture of Experts (MoE) is an advanced model architecture in the industry that combines multiple specialized expert models from various domains into a single supermodel. This approach enables the model to scale without significantly increasing the computational costs of training and inference, while maximizing model performance. However, current distributed training frameworks do not consider the ultimate optimization of communication, especially for large base models. This paper proposes a network-traffic-aware parallel optimization method that selects the optimal parallel strategy based on the communication volume, and the training cluster's inter-node and intra-node network topologies. Compared to the DeepSpeed, MoNTA achieves an 8x increase in AllToAll communication performance under 8-card tensor parallelism. Compared to the baseline, training a 2x70B model using 16 A800 cards, with an 8K sequence, results in a 13% overall latency performance improvement. Project Page: https://github.com/EnflameTechnology/DeepSpeed.
Authors: Charlie B. Tan, Edan Toledo, Benjamin Ellis, Jakob N. Foerster, Ferenc Husz\'ar
Abstract: Proximal policy optimization (PPO) is a widely-used algorithm for on-policy reinforcement learning. This work offers an alternative perspective of PPO, in which it is decomposed into the inner-loop estimation of update vectors, and the outer-loop application of updates using gradient ascent with unity learning rate. Using this insight we propose outer proximal policy optimization (outer-PPO); a framework wherein these update vectors are applied using an arbitrary gradient-based optimizer. The decoupling of update estimation and update application enabled by outer-PPO highlights several implicit design choices in PPO that we challenge through empirical investigation. In particular we consider non-unity learning rates and momentum applied to the outer loop, and a momentum-bias applied to the inner estimation loop. Methods are evaluated against an aggressively tuned PPO baseline on Brax, Jumanji and MinAtar environments; non-unity learning rates and momentum both achieve statistically significant improvement on Brax and Jumanji, given the same hyperparameter tuning budget.
Authors: Sarah Al-Shareeda, Khayal Huseynov, Lal Verda Cakir, Craig Thomson, Mehmet Ozdem, Berk Canberk
Abstract: In today's networked world, Digital Twin Networks (DTNs) are revolutionizing how we understand and optimize physical networks. These networks, also known as 'Digital Twin Networks (DTNs)' or 'Networks Digital Twins (NDTs),' encompass many physical networks, from cellular and wireless to optical and satellite. They leverage computational power and AI capabilities to provide virtual representations, leading to highly refined recommendations for real-world network challenges. Within DTNs, tasks include network performance enhancement, latency optimization, energy efficiency, and more. To achieve these goals, DTNs utilize AI tools such as Machine Learning (ML), Deep Learning (DL), Reinforcement Learning (RL), Federated Learning (FL), and graph-based approaches. However, data quality, scalability, interpretability, and security challenges necessitate strategies prioritizing transparency, fairness, privacy, and accountability. This chapter delves into the world of AI-driven traffic analysis within DTNs. It explores DTNs' development efforts, tasks, AI models, and challenges while offering insights into how AI can enhance these dynamic networks. Through this journey, readers will gain a deeper understanding of the pivotal role AI plays in the ever-evolving landscape of networked systems.
Authors: Srikumar Sastry, Subash Khanal, Aayush Dhakal, Adeel Ahmad, Nathan Jacobs
Abstract: We present TaxaBind, a unified embedding space for characterizing any species of interest. TaxaBind is a multimodal embedding space across six modalities: ground-level images of species, geographic location, satellite image, text, audio, and environmental features, useful for solving ecological problems. To learn this joint embedding space, we leverage ground-level images of species as a binding modality. We propose multimodal patching, a technique for effectively distilling the knowledge from various modalities into the binding modality. We construct two large datasets for pretraining: iSatNat with species images and satellite images, and iSoundNat with species images and audio. Additionally, we introduce TaxaBench-8k, a diverse multimodal dataset with six paired modalities for evaluating deep learning models on ecological tasks. Experiments with TaxaBind demonstrate its strong zero-shot and emergent capabilities on a range of tasks including species classification, cross-model retrieval, and audio classification. The datasets and models are made available at https://github.com/mvrl/TaxaBind.
Authors: Minki Kang, Sung Ju Hwang, Gibbeum Lee, Jaewoong Cho
Abstract: As Large Language Models (LLMs) are increasingly deployed in specialized domains with continuously evolving knowledge, the need for timely and precise knowledge injection has become essential. Fine-tuning with paraphrased data is a common approach to enhance knowledge injection, yet it faces two significant challenges: high computational costs due to repetitive external model usage and limited sample diversity. To this end, we introduce LaPael, a latent-level paraphrasing method that applies input-dependent noise to early LLM layers. This approach enables diverse and semantically consistent augmentations directly within the model. Furthermore, it eliminates the recurring costs of paraphrase generation for each knowledge update. Our extensive experiments on question-answering benchmarks demonstrate that LaPael improves knowledge injection over standard fine-tuning and existing noise-based approaches. Additionally, combining LaPael with data-level paraphrasing further enhances performance.
Authors: Fuying Wang, Feng Wu, Yihan Tang, Lequan Yu
Abstract: Integrating multimodal Electronic Health Records (EHR) data, such as numerical time series and free-text clinical reports, has great potential in predicting clinical outcomes. However, prior work has primarily focused on capturing temporal interactions within individual samples and fusing multimodal information, overlooking critical temporal patterns across patients. These patterns, such as trends in vital signs like abnormal heart rate or blood pressure, can indicate deteriorating health or an impending critical event. Similarly, clinical notes often contain textual descriptions that reflect these patterns. Identifying corresponding temporal patterns across different modalities is crucial for improving the accuracy of clinical outcome predictions, yet it remains a challenging task. To address this gap, we introduce a Cross-Modal Temporal Pattern Discovery (CTPD) framework, designed to efficiently extract meaningful cross-modal temporal patterns from multimodal EHR data. Our approach introduces shared initial temporal pattern representations which are refined using slot attention to generate temporal semantic embeddings. To ensure rich cross-modal temporal semantics in the learned patterns, we introduce a contrastive-based TPNCE loss for cross-modal alignment, along with two reconstruction losses to retain core information of each modality. Evaluations on two clinically critical tasks, 48-hour in-hospital mortality and 24-hour phenotype classification, using the MIMIC-III database demonstrate the superiority of our method over existing approaches.
Authors: Leif Feddersen
Abstract: Most organizations adjust their statistical forecasts (e.g. on sales) manually. Forecasting Support Systems (FSS) enable the related process of automated forecast generation and manual adjustments. As the FSS user interface connects user and statistical algorithm, it is an obvious lever for facilitating beneficial adjustments whilst discouraging harmful adjustments. This paper reviews and organizes the literature on judgemental forecasting, forecast adjustments, and FSS design. I argue that algorithmic transparency may be a key factor towards better, integrative forecasting and test this assertion with three FSS designs that vary in their degrees of transparency based on time series decomposition. I find transparency to reduce the variance and amount of harmful forecast adjustments. Letting users adjust the algorithm's transparent components themselves, however, leads to widely varied and overall most detrimental adjustments. Responses indicate a risk of overwhelming users with algorithmic transparency without adequate training. Accordingly, self-reported satisfaction is highest with a non-transparent FSS.
Authors: Thanh Nguyen-Tang, Raman Arora
Abstract: We study learning in a dynamically evolving environment modeled as a Markov game between a learner and a strategic opponent that can adapt to the learner's strategies. While most existing works in Markov games focus on external regret as the learning objective, external regret becomes inadequate when the adversaries are adaptive. In this work, we focus on \emph{policy regret} -- a counterfactual notion that aims to compete with the return that would have been attained if the learner had followed the best fixed sequence of policy, in hindsight. We show that if the opponent has unbounded memory or if it is non-stationary, then sample-efficient learning is not possible. For memory-bounded and stationary, we show that learning is still statistically hard if the set of feasible strategies for the learner is exponentially large. To guarantee learnability, we introduce a new notion of \emph{consistent} adaptive adversaries, wherein, the adversary responds similarly to similar strategies of the learner. We provide algorithms that achieve $\sqrt{T}$ policy regret against memory-bounded, stationary, and consistent adversaries.
Authors: Shreyash Arya, Sukrut Rao, Moritz B\"ohle, Bernt Schiele
Abstract: B-cos Networks have been shown to be effective for obtaining highly human interpretable explanations of model decisions by architecturally enforcing stronger alignment between inputs and weight. B-cos variants of convolutional networks (CNNs) and vision transformers (ViTs), which primarily replace linear layers with B-cos transformations, perform competitively to their respective standard variants while also yielding explanations that are faithful by design. However, it has so far been necessary to train these models from scratch, which is increasingly infeasible in the era of large, pre-trained foundation models. In this work, inspired by the architectural similarities in standard DNNs and B-cos networks, we propose 'B-cosification', a novel approach to transform existing pre-trained models to become inherently interpretable. We perform a thorough study of design choices to perform this conversion, both for convolutional neural networks and vision transformers. We find that B-cosification can yield models that are on par with B-cos models trained from scratch in terms of interpretability, while often outperforming them in terms of classification performance at a fraction of the training cost. Subsequently, we apply B-cosification to a pretrained CLIP model, and show that, even with limited data and compute cost, we obtain a B-cosified version that is highly interpretable and competitive on zero shot performance across a variety of datasets. We release our code and pre-trained model weights at https://github.com/shrebox/B-cosification.
Authors: Fan Xiao, Junlin Hou, Ruiwei Zhao, Rui Feng, Haidong Zou, Lina Lu, Yi Xu, Juzhao Zhang
Abstract: Diabetic retinopathy (DR) is a leading cause of blindness worldwide and a common complication of diabetes. As two different imaging tools for DR grading, color fundus photography (CFP) and infrared fundus photography (IFP) are highly-correlated and complementary in clinical applications. To the best of our knowledge, this is the first study that explores a novel multi-modal deep learning framework to fuse the information from CFP and IFP towards more accurate DR grading. Specifically, we construct a dual-stream architecture Cross-Fundus Transformer (CFT) to fuse the ViT-based features of two fundus image modalities. In particular, a meticulously engineered Cross-Fundus Attention (CFA) module is introduced to capture the correspondence between CFP and IFP images. Moreover, we adopt both the single-modality and multi-modality supervisions to maximize the overall performance for DR grading. Extensive experiments on a clinical dataset consisting of 1,713 pairs of multi-modal fundus images demonstrate the superiority of our proposed method. Our code will be released for public access.
Authors: Hamees Sayed, Advait Joglekar, Srinivasan Umesh
Abstract: We develop a robust translation model for four low-resource Indic languages: Khasi, Mizo, Manipuri, and Assamese. Our approach includes a comprehensive pipeline from data collection and preprocessing to training and evaluation, leveraging data from WMT task datasets, BPCC, PMIndia, and OpenLanguageData. To address the scarcity of bilingual data, we use back-translation techniques on monolingual datasets for Mizo and Khasi, significantly expanding our training corpus. We fine-tune the pre-trained NLLB 3.3B model for Assamese, Mizo, and Manipuri, achieving improved performance over the baseline. For Khasi, which is not supported by the NLLB model, we introduce special tokens and train the model on our Khasi corpus. Our training involves masked language modelling, followed by fine-tuning for English-to-Indic and Indic-to-English translations.
Authors: Mohammad Feizabadi, Arman Hosseini, Zakaria Yahouni
Abstract: In smart manufacturing, scheduling autonomous internal logistic vehicles is crucial for optimizing operational efficiency. This paper proposes a multi-agent deep Q-network (MADQN) with a layer-based communication channel (LBCC) to address this challenge. The main goals are to minimize total job tardiness, reduce the number of tardy jobs, and lower vehicle energy consumption. The method is evaluated against nine well-known scheduling heuristics, demonstrating its effectiveness in handling dynamic job shop behaviors like job arrivals and workstation unavailabilities. The approach also proves scalable, maintaining performance across different layouts and larger problem instances, highlighting the robustness and adaptability of MADQN with LBCC in smart manufacturing.
Authors: Carl Edwards, Ziqing Lu, Ehsan Hajiramezanali, Tommaso Biancalani, Heng Ji, Gabriele Scalia
Abstract: Bridging biomolecular modeling with natural language information, particularly through large language models (LLMs), has recently emerged as a promising interdisciplinary research area. LLMs, having been trained on large corpora of scientific documents, demonstrate significant potential in understanding and reasoning about biomolecules by providing enriched contextual and domain knowledge. However, the extent to which LLM-driven insights can improve performance on complex predictive tasks (e.g., toxicity) remains unclear. Further, the extent to which relevant knowledge can be extracted from LLMs also remains unknown. In this study, we present Molecule Caption Arena: the first comprehensive benchmark of LLM-augmented molecular property prediction. We evaluate over twenty LLMs, including both general-purpose and domain-specific molecule captioners, across diverse prediction tasks. To this goal, we introduce a novel, battle-based rating system. Our findings confirm the ability of LLM-extracted knowledge to enhance state-of-the-art molecular representations, with notable model-, prompt-, and dataset-specific variations. Code, resources, and data are available at github.com/Genentech/molcap-arena.
Authors: Aashiq Muhamed, Mona Diab, Virginia Smith
Abstract: Understanding and mitigating the potential risks associated with foundation models (FMs) hinges on developing effective interpretability methods. Sparse Autoencoders (SAEs) have emerged as a promising tool for disentangling FM representations, but they struggle to capture rare, yet crucial concepts in the data. We introduce Specialized Sparse Autoencoders (SSAEs), designed to illuminate these elusive dark matter features by focusing on specific subdomains. We present a practical recipe for training SSAEs, demonstrating the efficacy of dense retrieval for data selection and the benefits of Tilted Empirical Risk Minimization as a training objective to improve concept recall. Our evaluation of SSAEs on standard metrics, such as downstream perplexity and $L_0$ sparsity, show that they effectively capture subdomain tail concepts, exceeding the capabilities of general-purpose SAEs. We showcase the practical utility of SSAEs in a case study on the Bias in Bios dataset, where SSAEs achieve a 12.5\% increase in worst-group classification accuracy when applied to remove spurious gender information. SSAEs provide a powerful new lens for peering into the inner workings of FMs in subdomains.
Authors: Yiwen Ding, Zhiheng Xi, Wei He, Zhuoyuan Li, Yitao Zhai, Xiaowei Shi, Xunliang Cai, Tao Gui, Qi Zhang, Xuanjing Huang
Abstract: Self-improvement methods enable large language models (LLMs) to generate solutions themselves and iteratively train on filtered, high-quality rationales. This process proves effective and reduces the reliance on human supervision in LLMs' reasoning, but the performance soon plateaus. We delve into the process and find that models tend to over-sample on easy queries and under-sample on queries they have yet to master. As iterations proceed, this imbalance in sampling is exacerbated, leading to a long-tail distribution where solutions to difficult queries almost diminish. This phenomenon limits the performance gain of self-improving models. A straightforward solution is brute-force sampling to balance the distribution, which significantly raises computational costs. In this paper, we introduce Guided Self-Improvement (GSI), a strategy aimed at improving the efficiency of sampling challenging heavy-tailed data. It leverages Socratic-style guidance signals to help LLM reasoning with complex queries, reducing the exploration effort and minimizing computational overhead. Experiments on four models across diverse mathematical tasks show that GSI strikes a balance between performance and efficiency, while also being effective on held-out tasks.
Authors: Haoxuan Che, Xuanhua He, Quande Liu, Cheng Jin, Hao Chen
Abstract: We introduce GameGen-X, the first diffusion transformer model specifically designed for both generating and interactively controlling open-world game videos. This model facilitates high-quality, open-domain generation by simulating an extensive array of game engine features, such as innovative characters, dynamic environments, complex actions, and diverse events. Additionally, it provides interactive controllability, predicting and altering future content based on the current clip, thus allowing for gameplay simulation. To realize this vision, we first collected and built an Open-World Video Game Dataset from scratch. It is the first and largest dataset for open-world game video generation and control, which comprises over a million diverse gameplay video clips sampling from over 150 games with informative captions from GPT-4o. GameGen-X undergoes a two-stage training process, consisting of foundation model pre-training and instruction tuning. Firstly, the model was pre-trained via text-to-video generation and video continuation, endowing it with the capability for long-sequence, high-quality open-domain game video generation. Further, to achieve interactive controllability, we designed InstructNet to incorporate game-related multi-modal control signal experts. This allows the model to adjust latent representations based on user inputs, unifying character interaction and scene content control for the first time in video generation. During instruction tuning, only the InstructNet is updated while the pre-trained foundation model is frozen, enabling the integration of interactive controllability without loss of diversity and quality of generated video content.
Authors: Xiong Wang, Yangze Li, Chaoyou Fu, Lei Xie, Ke Li, Xing Sun, Long Ma
Abstract: The rapid development of large language models has brought many new smart applications, especially the excellent multimodal human-computer interaction in GPT-4o has brought impressive experience to users. In this background, researchers have proposed many multimodal LLMs that can achieve speech-to-speech dialogue recently. In this paper, we propose a speech-text multimodal LLM architecture called Freeze-Omni. Our main contribution is the speech input and output modalities can connected to the LLM while keeping the LLM frozen throughout the training process. We designed 3-stage training strategies both for the modeling of speech input and output, enabling Freeze-Omni to obtain speech-to-speech dialogue ability using text-speech paired data (such as ASR and TTS data) and only 60,000 multi-round text Q&A data on 8 GPUs. Moreover, we can effectively ensure that the intelligence of the Freeze-Omni in the speech modality is at the same level compared with that in the text modality of its backbone LLM, while the end-to-end latency of the spoken response achieves a low level. In addition, we also designed a method to achieve duplex dialogue ability through multi-task training, making Freeze-Omni have a more natural style of dialogue ability between the users. Freeze-Omni mainly provides a possibility for researchers to conduct multimodal LLM under the condition of a frozen LLM, avoiding various impacts caused by the catastrophic forgetting of LLM caused by fewer data and training resources.
Authors: Asiful Arefeen, Hassan Ghasemzadeh
Abstract: Monitoring unexpected health events and taking actionable measures to avert them beforehand is central to maintaining health and preventing disease. Therefore, a tool capable of predicting adverse health events and offering users actionable feedback about how to make changes in their diet, exercise, and medication to prevent abnormal health events could have significant societal impacts. Counterfactual explanations can provide insights into why a model made a particular prediction by generating hypothetical instances that are similar to the original input but lead to a different prediction outcome. Therefore, counterfactuals can be viewed as a means to design AI-driven health interventions to not only predict but also prevent adverse health outcomes such as blood glucose spikes, diabetes, and heart disease. In this paper, we design \textit{\textbf{ExAct}}, a novel model-agnostic framework for generating counterfactual explanations for chronic disease prevention and management. Leveraging insights from adversarial learning, ExAct characterizes the decision boundary for high-dimensional data and performs a grid search to generate actionable interventions. ExAct is unique in integrating prior knowledge about user preferences of feasible explanations into the process of counterfactual generation. ExAct is evaluated extensively using four real-world datasets and external simulators. With $82.8\%$ average validity in the simulation-aided validation, ExAct surpasses the state-of-the-art techniques for generating counterfactual explanations by at least $10\%$. Besides, counterfactuals from ExAct exhibit at least $6.6\%$ improved proximity compared to previous research.
Authors: Chengxing Xie, Canyu Chen, Feiran Jia, Ziyu Ye, Shiyang Lai, Kai Shu, Jindong Gu, Adel Bibi, Ziniu Hu, David Jurgens, James Evans, Philip Torr, Bernard Ghanem, Guohao Li
Abstract: Large Language Model (LLM) agents have been increasingly adopted as simulation tools to model humans in social science and role-playing applications. However, one fundamental question remains: can LLM agents really simulate human behavior? In this paper, we focus on one critical and elemental behavior in human interactions, trust, and investigate whether LLM agents can simulate human trust behavior. We first find that LLM agents generally exhibit trust behavior, referred to as agent trust, under the framework of Trust Games, which are widely recognized in behavioral economics. Then, we discover that GPT-4 agents manifest high behavioral alignment with humans in terms of trust behavior, indicating the feasibility of simulating human trust behavior with LLM agents. In addition, we probe the biases of agent trust and differences in agent trust towards other LLM agents and humans. We also explore the intrinsic properties of agent trust under conditions including external manipulations and advanced reasoning strategies. Our study provides new insights into the behaviors of LLM agents and the fundamental analogy between LLMs and humans beyond value alignment. We further illustrate broader implications of our discoveries for applications where trust is paramount.
Authors: Minghao Chen, Yihang Li, Yanting Yang, Shiyu Yu, Binbin Lin, Xiaofei He
Abstract: Large Language Models (LLM) based agents have shown promise in autonomously completing tasks across various domains, e.g., robotics, games, and web navigation. However, these agents typically require elaborate design and expert prompts to solve tasks in specific domains, which limits their adaptability. We introduce AutoManual, a framework enabling LLM agents to autonomously build their understanding through interaction and adapt to new environments. AutoManual categorizes environmental knowledge into diverse rules and optimizes them in an online fashion by two agents: 1) The Planner codes actionable plans based on current rules for interacting with the environment. 2) The Builder updates the rules through a well-structured rule system that facilitates online rule management and essential detail retention. To mitigate hallucinations in managing rules, we introduce a *case-conditioned prompting* strategy for the Builder. Finally, the Formulator agent compiles these rules into a comprehensive manual. The self-generated manual can not only improve the adaptability but also guide the planning of smaller LLMs while being human-readable. Given only one simple demonstration, AutoManual significantly improves task success rates, achieving 97.4\% with GPT-4-turbo and 86.2\% with GPT-3.5-turbo on ALFWorld benchmark tasks. The code is available at https://github.com/minghchen/automanual.
Authors: ZeMing Gong, Austin T. Wang, Xiaoliang Huo, Joakim Bruslund Haurum, Scott C. Lowe, Graham W. Taylor, Angel X. Chang
Abstract: Measuring biodiversity is crucial for understanding ecosystem health. While prior works have developed machine learning models for taxonomic classification of photographic images and DNA separately, in this work, we introduce a multimodal approach combining both, using CLIP-style contrastive learning to align images, barcode DNA, and text-based representations of taxonomic labels in a unified embedding space. This allows for accurate classification of both known and unknown insect species without task-specific fine-tuning, leveraging contrastive learning for the first time to fuse DNA and image data. Our method surpasses previous single-modality approaches in accuracy by over 8% on zero-shot learning tasks, showcasing its effectiveness in biodiversity studies.
Authors: Jingru Jia, Zehua Yuan, Junhao Pan, Paul E. McNamara, Deming Chen
Abstract: When making decisions under uncertainty, individuals often deviate from rational behavior, which can be evaluated across three dimensions: risk preference, probability weighting, and loss aversion. Given the widespread use of large language models (LLMs) in decision-making processes, it is crucial to assess whether their behavior aligns with human norms and ethical expectations or exhibits potential biases. Several empirical studies have investigated the rationality and social behavior performance of LLMs, yet their internal decision-making tendencies and capabilities remain inadequately understood. This paper proposes a framework, grounded in behavioral economics, to evaluate the decision-making behaviors of LLMs. Through a multiple-choice-list experiment, we estimate the degree of risk preference, probability weighting, and loss aversion in a context-free setting for three commercial LLMs: ChatGPT-4.0-Turbo, Claude-3-Opus, and Gemini-1.0-pro. Our results reveal that LLMs generally exhibit patterns similar to humans, such as risk aversion and loss aversion, with a tendency to overweight small probabilities. However, there are significant variations in the degree to which these behaviors are expressed across different LLMs. We also explore their behavior when embedded with socio-demographic features, uncovering significant disparities. For instance, when modeled with attributes of sexual minority groups or physical disabilities, Claude-3-Opus displays increased risk aversion, leading to more conservative choices. These findings underscore the need for careful consideration of the ethical implications and potential biases in deploying LLMs in decision-making scenarios. Therefore, this study advocates for developing standards and guidelines to ensure that LLMs operate within ethical boundaries while enhancing their utility in complex decision-making environments.
Authors: Ching-An Cheng, Allen Nie, Adith Swaminathan
Abstract: We study a class of optimization problems motivated by automating the design and update of AI systems like coding assistants, robots, and copilots. AutoDiff frameworks, like PyTorch, enable efficient end-to-end optimization of differentiable systems. However, general computational workflows can be non-differentiable and involve rich feedback (e.g. console output or user's responses), heterogeneous parameters (e.g. prompts, codes), and intricate objectives (beyond maximizing a score). We investigate end-to-end generative optimization -- using generative models such as LLMs within the optimizer for automatic updating of general computational workflows. We discover that workflow execution traces are akin to back-propagated gradients in AutoDiff and can provide key information to interpret feedback for efficient optimization. Formally, we frame a new mathematical setup, Optimization with Trace Oracle (OPTO). In OPTO, an optimizer receives an execution trace along with feedback on the computed output and updates parameters iteratively. We provide a Python library, Trace, that efficiently converts a workflow optimization problem into an OPTO instance using PyTorch-like syntax. Using Trace, we develop a general LLM-based generative optimizer called OptoPrime. In empirical studies, we find that OptoPrime is capable of first-order numerical optimization, prompt optimization, hyper-parameter tuning, robot controller design, code debugging, etc., and is often competitive with specialized optimizers for each domain. We envision Trace as an open research platform for devising novel generative optimizers and developing the next generation of interactive learning agents. Website: https://microsoft.github.io/Trace/.
Authors: Elliott Thornley, Alexander Roman, Christos Ziakas, Leyton Ho, Louis Thomson
Abstract: Some worry that advanced artificial agents may resist being shut down. The Incomplete Preferences Proposal (IPP) is an idea for ensuring that doesn't happen. A key part of the IPP is using a novel 'Discounted REward for Same-Length Trajectories (DREST)' reward function to train agents to (1) pursue goals effectively conditional on each trajectory-length (be 'USEFUL'), and (2) choose stochastically between different trajectory-lengths (be 'NEUTRAL' about trajectory-lengths). In this paper, we propose evaluation metrics for USEFULNESS and NEUTRALITY. We use a DREST reward function to train simple agents to navigate gridworlds, and we find that these agents learn to be USEFUL and NEUTRAL. Our results thus suggest that DREST reward functions could also train advanced agents to be USEFUL and NEUTRAL, and thereby make these advanced agents useful and shutdownable.
Authors: Kevin Yu, Jihye Roh, Ziang Li, Wenhao Gao, Runzhong Wang, Connor W. Coley
Abstract: Computer-aided synthesis planning (CASP) algorithms have demonstrated expert-level abilities in planning retrosynthetic routes to molecules of low to moderate complexity. However, current search methods assume the sufficiency of reaching arbitrary building blocks, failing to address the common real-world constraint where using specific molecules is desired. To this end, we present a formulation of synthesis planning with starting material constraints. Under this formulation, we propose Double-Ended Synthesis Planning (DESP), a novel CASP algorithm under a bidirectional graph search scheme that interleaves expansions from the target and from the goal starting materials to ensure constraint satisfiability. The search algorithm is guided by a goal-conditioned cost network learned offline from a partially observed hypergraph of valid chemical reactions. We demonstrate the utility of DESP in improving solve rates and reducing the number of search expansions by biasing synthesis planning towards expert goals on multiple new benchmarks. DESP can make use of existing one-step retrosynthesis models, and we anticipate its performance to scale as these one-step model capabilities improve.
Authors: Alex Oesterling, Claudio Mayrink Verdun, Carol Xuan Long, Alexander Glynn, Lucas Monteiro Paes, Sajani Vithana, Martina Cardone, Flavio P. Calmon
Abstract: Image search and retrieval tasks can perpetuate harmful stereotypes, erase cultural identities, and amplify social disparities. Current approaches to mitigate these representational harms balance the number of retrieved items across population groups defined by a small number of (often binary) attributes. However, most existing methods overlook intersectional groups determined by combinations of group attributes, such as gender, race, and ethnicity. We introduce Multi-Group Proportional Representation (MPR), a novel metric that measures representation across intersectional groups. We develop practical methods for estimating MPR, provide theoretical guarantees, and propose optimization algorithms to ensure MPR in retrieval. We demonstrate that existing methods optimizing for equal and proportional representation metrics may fail to promote MPR. Crucially, our work shows that optimizing MPR yields more proportional representation across multiple intersectional groups specified by a rich function class, often with minimal compromise in retrieval accuracy.
Authors: Falaah Arif Khan, Denys Herasymuk, Nazar Protsiv, Julia Stoyanovich
Abstract: Data missingness is a practical challenge of sustained interest to the scientific community. In this paper, we present Shades-of-Null, an evaluation suite for responsible missing value imputation. Our work is novel in two ways (i) we model realistic and socially-salient missingness scenarios that go beyond Rubin's classic Missing Completely at Random (MCAR), Missing At Random (MAR) and Missing Not At Random (MNAR) settings, to include multi-mechanism missingness (when different missingness patterns co-exist in the data) and missingness shift (when the missingness mechanism changes between training and test) (ii) we evaluate imputers holistically, based on imputation quality, as well as on the predictive performance, fairness and stability of the models that are trained and tested on the data post-imputation. We use Shades-of-Null to conduct a large-scale empirical study involving 23,940 experimental pipelines, and find that while there is no single best-performing imputation approach for all missingness types, interesting trade-offs arise between predictive performance, fairness and stability, based on the combination of missingness scenario, imputer choice, and the architecture of the predictive model. We make Shades-of-Null publicly available, to enable researchers to rigorously evaluate missing value imputation methods on a wide range of metrics in plausible and socially meaningful scenarios.
Authors: Zhenwei Tang, Difan Jiao, Reid McIlroy-Young, Jon Kleinberg, Siddhartha Sen, Ashton Anderson
Abstract: There are an increasing number of domains in which artificial intelligence (AI) systems both surpass human ability and accurately model human behavior. This introduces the possibility of algorithmically-informed teaching in these domains through more relatable AI partners and deeper insights into human decision-making. Critical to achieving this goal, however, is coherently modeling human behavior at various skill levels. Chess is an ideal model system for conducting research into this kind of human-AI alignment, with its rich history as a pivotal testbed for AI research, mature superhuman AI systems like AlphaZero, and precise measurements of skill via chess rating systems. Previous work in modeling human decision-making in chess uses completely independent models to capture human style at different skill levels, meaning they lack coherence in their ability to adapt to the full spectrum of human improvement and are ultimately limited in their effectiveness as AI partners and teaching tools. In this work, we propose a unified modeling approach for human-AI alignment in chess that coherently captures human style across different skill levels and directly captures how people improve. Recognizing the complex, non-linear nature of human learning, we introduce a skill-aware attention mechanism to dynamically integrate players' strengths with encoded chess positions, enabling our model to be sensitive to evolving player skill. Our experimental results demonstrate that this unified framework significantly enhances the alignment between AI and human players across a diverse range of expertise levels, paving the way for deeper insights into human decision-making and AI-guided teaching tools.
Authors: Pranav Rajbhandari, Prithviraj Dasgupta, Donald Sofge
Abstract: We consider the problem of team formation within multiagent adversarial games. We propose BERTeam, a novel algorithm that uses a transformer-based deep neural network with Masked Language Model training to select the best team of players from a trained population. We integrate this with coevolutionary deep reinforcement learning, which trains a diverse set of individual players to choose teams from. We test our algorithm in the multiagent adversarial game Marine Capture-The-Flag, and we find that BERTeam learns non-trivial team compositions that perform well against unseen opponents. For this game, we find that BERTeam outperforms MCAA, an algorithm that similarly optimizes team formation.
Authors: Xun Jiang, Feng Li, Han Zhao, Jiaying Wang, Jun Shao, Shihao Xu, Shu Zhang, Weiling Chen, Xavier Tang, Yize Chen, Mengyue Wu, Weizhi Ma, Mengdi Wang, Tianqiao Chen
Abstract: Large language models (LLMs) like GPTs, trained on vast datasets, have demonstrated impressive capabilities in language understanding, reasoning, and planning, achieving human-level performance in various tasks. Most studies focus on enhancing these models by training on ever-larger datasets to build more powerful foundation models. While training stronger models is important, enabling models to evolve during inference is equally crucial, a process we refer to as AI self-evolution. Unlike large-scale training, self-evolution may rely on limited data or interactions. Inspired by the columnar organization of the human cerebral cortex, we hypothesize that AI models could develop cognitive abilities and build internal representations through iterative interactions with their environment. To achieve this, models need long-term memory (LTM) to store and manage processed interaction data. LTM supports self-evolution by representing diverse experiences across environments and agents. In this report, we explore AI self-evolution and its potential to enhance models during inference. We examine LTM's role in lifelong learning, allowing models to evolve based on accumulated interactions. We outline the structure of LTM and the systems needed for effective data retention and representation. We also classify approaches for building personalized models with LTM data and show how these models achieve self-evolution through interaction. Using LTM, our multi-agent framework OMNE achieved first place on the GAIA benchmark, demonstrating LTM's potential for AI self-evolution. Finally, we present a roadmap for future research, emphasizing the importance of LTM for advancing AI technology and its practical applications.
Authors: Yonchanok Khaokaew, Hao Xue, Flora D. Salim
Abstract: In recent years, predicting mobile app usage has become increasingly important for areas like app recommendation, user behaviour analysis, and mobile resource management. Existing models, however, struggle with the heterogeneous nature of contextual data and the user cold start problem. This study introduces a novel prediction model, Mobile App Prediction Leveraging Large Language Model Embeddings (MAPLE), which employs Large Language Models (LLMs) and installed app similarity to overcome these challenges. MAPLE utilises the power of LLMs to process contextual data and discern intricate relationships within it effectively. Additionally, we explore the use of installed app similarity to address the cold start problem, facilitating the modelling of user preferences and habits, even for new users with limited historical data. In essence, our research presents MAPLE as a novel, potent, and practical approach to app usage prediction, making significant strides in resolving issues faced by existing models. MAPLE stands out as a comprehensive and effective solution, setting a new benchmark for more precise and personalised app usage predictions. In tests on two real-world datasets, MAPLE surpasses contemporary models in both standard and cold start scenarios. These outcomes validate MAPLE's capacity for precise app usage predictions and its resilience against the cold start problem. This enhanced performance stems from the model's proficiency in capturing complex temporal patterns and leveraging contextual information. As a result, MAPLE can potentially improve personalised mobile app usage predictions and user experiences markedly.
Authors: Chang'an Yi, Haotian Chen, Yifan Zhang, Yonghui Xu, Yan Zhou, Lizhen Cui
Abstract: Test-time adaptation (TTA) aims to adapt a model, initially trained on training data, to test data with potential distribution shifts. Most existing TTA methods focus on classification problems. The pronounced success of classification might lead numerous newcomers and engineers to assume that classic TTA techniques can be directly applied to the more challenging task of semantic segmentation. However, this belief is still an open question. In this paper, we investigate the applicability of existing classic TTA strategies in semantic segmentation. Our comprehensive results have led to three key observations. First, the classic normalization updating strategy only brings slight performance improvement, and in some cases, it might even adversely affect the results. Even with the application of advanced distribution estimation techniques like batch renormalization, the problem remains unresolved. Second, although the teacher-student scheme does enhance the training stability for segmentation TTA in the presence of noisy pseudo-labels and temporal correlation, it cannot directly result in performance improvement compared to the original model without TTA under complex data distribution. Third, segmentation TTA suffers a severe long-tailed class-imbalance problem, which is substantially more complex than that in TTA for classification. This long-tailed challenge negatively affects segmentation TTA performance, even when the accuracy of pseudo-labels is high. Besides those observations, we find that visual prompt tuning (VisPT) is promising in segmentation TTA and propose a novel method named TTAP. The outstanding performance of TTAP has also been verified. We hope the community can give more attention to this challenging, yet important, segmentation TTA task in the future. The source code is available at: \textit{https://github.com/ycarobot/TTAP
Authors: Nikolaos Ioannis Bountos, Maria Sdraka, Angelos Zavras, Ilektra Karasante, Andreas Karavias, Themistocles Herekakis, Angeliki Thanasou, Dimitrios Michail, Ioannis Papoutsis
Abstract: Global floods, exacerbated by climate change, pose severe threats to human life, infrastructure, and the environment. Recent catastrophic events in Pakistan and New Zealand underscore the urgent need for precise flood mapping to guide restoration efforts, understand vulnerabilities, and prepare for future occurrences. While Synthetic Aperture Radar (SAR) remote sensing offers day-and-night, all-weather imaging capabilities, its application in deep learning for flood segmentation is limited by the lack of large annotated datasets. To address this, we introduce Kuro Siwo, a manually annotated multi-temporal dataset, spanning 43 flood events globally. Our dataset maps more than 338 billion $m^2$ of land, with 33 billion designated as either flooded areas or permanent water bodies. Kuro Siwo includes a highly processed product optimized for flood mapping based on SAR Ground Range Detected, and a primal SAR Single Look Complex product with minimal preprocessing, designed to promote research on the exploitation of both the phase and amplitude information and to offer maximum flexibility for downstream task preprocessing. To leverage advances in large scale self-supervised pretraining methods for remote sensing data, we augment Kuro Siwo with a large unlabeled set of SAR samples. Finally, we provide an extensive benchmark, namely BlackBench, offering strong baselines for a diverse set of flood events from Europe, America, Africa, Asia and Australia.
Authors: Yongliang Shen, Kaitao Song, Xu Tan, Wenqi Zhang, Kan Ren, Siyu Yuan, Weiming Lu, Dongsheng Li, Yueting Zhuang
Abstract: In recent years, the remarkable progress of large language models (LLMs) has sparked interest in task automation, which involves decomposing complex tasks described by user instructions into sub-tasks and invoking external tools to execute them, playing a central role in autonomous agents. However, there is a lack of systematic and standardized benchmarks to promote the development of LLMs in task automation. To address this, we introduce TaskBench, a comprehensive framework to evaluate the capability of LLMs in task automation. Specifically, task automation can be divided into three critical stages: task decomposition, tool selection, and parameter prediction. To tackle the complexities inherent in these stages, we introduce the concept of Tool Graph to represent decomposed tasks and adopt a back-instruct method to generate high-quality user instructions. We propose TaskEval, a multi-faceted evaluation methodology that assesses LLM performance across these three stages. Our approach combines automated construction with rigorous human verification, ensuring high consistency with human evaluation. Experimental results demonstrate that TaskBench effectively reflects the capabilities of various LLMs in task automation. It provides insights into model performance across different task complexities and domains, pushing the boundaries of what current models can achieve. TaskBench offers a scalable, adaptable, and reliable benchmark for advancing LLM-based autonomous agents.
Authors: Jeongwhan Choi, Hyowon Wi, Jayoung Kim, Yehjin Shin, Kookjin Lee, Nathaniel Trask, Noseong Park
Abstract: Transformers, renowned for their self-attention mechanism, have achieved state-of-the-art performance across various tasks in natural language processing, computer vision, time-series modeling, etc. However, one of the challenges with deep Transformer models is the oversmoothing problem, where representations across layers converge to indistinguishable values, leading to significant performance degradation. We interpret the original self-attention as a simple graph filter and redesign it from a graph signal processing (GSP) perspective. We propose a graph-filter-based self-attention (GFSA) to learn a general yet effective one, whose complexity, however, is slightly larger than that of the original self-attention mechanism. We demonstrate that GFSA improves the performance of Transformers in various fields, including computer vision, natural language processing, graph-level tasks, speech recognition, and code classification.
Authors: Shengsheng Qian, Dizhan Xue, Yifei Wang, Shengjie Zhang, Huaiwen Zhang, Changsheng Xu
Abstract: Self-Supervised Learning (SSL) is an effective paradigm for learning representations from unlabeled data, such as text, images, and videos. However, researchers have recently found that SSL is vulnerable to backdoor attacks. The attacker can embed hidden SSL backdoors via a few poisoned examples in the training dataset and maliciously manipulate the behavior of downstream models. To defend against SSL backdoor attacks, a feasible route is to detect and remove the poisonous samples in the training set. However, the existing SSL backdoor defense method fails to detect the poisonous samples precisely. In this paper, we propose to erase the SSL backdoor by cluster activation masking and propose a novel PoisonCAM method. After obtaining the threat model trained on the poisoned dataset, our method can precisely detect poisonous samples based on the assumption that masking the backdoor trigger can effectively change the activation of a downstream clustering model. In experiments, our PoisonCAM achieves 96\% accuracy for backdoor trigger detection compared to 3\% of the state-of-the-art method on poisoned ImageNet-100. Moreover, our proposed PoisonCAM significantly improves the performance of the trained SSL model under backdoor attacks compared to the state-of-the-art method. Our code, data, and trained models will be open once this paper is accepted.
Authors: Yue Fu, Yixin Chen, Zelia Gomes Da Costa Lai, Alexis Hiniker
Abstract: Relationships are essential to our happiness and wellbeing, yet their dissolution-the final stage of a relationship's lifecycle-is among the most stressful events individuals can experience, often leading to profound and lasting impacts. With the breakup process increasingly facilitated by technology, such as computer-mediated communication, and the likely future influence of generative AI (GenAI) tools, we conducted a semi-structured interview study with 21 participants. We aim to understand: 1) the current role of technology in the breakup process, 2) the needs and support individuals seek during this time, and 3) how GenAI might address or undermine these needs. Our findings show that people have distinct needs at various stages of breakups. While currently technology plays an important role, it falls short in supporting users' unmet needs. Participants envision that GenAI could: 1) aid in prompting self-reflection, providing neutral second opinions, and assisting with planning leading up to a breakup; 2) serve as a communication mediator, supporting wording and tone to facilitate emotional expression during breakup conversations; and 3) support personal growth and offer companionship after a breakup. However, our findings also reveal participants' concerns about involving GenAI in this process. Based on our results, we discuss the potential opportunities, design considerations, and harms of GenAI tools in facilitating people's relationship dissolution.
Authors: Nemanja Ra\v{s}ajski, Chintan Trivedi, Konstantinos Makantasis, Antonios Liapis, Georgios N. Yannakakis
Abstract: Domain randomisation enhances the transferability of vision models across visually distinct domains with similar content. However, current methods heavily depend on intricate simulation engines, hampering feasibility and scalability. This paper introduces BehAVE, a video understanding framework that utilises existing commercial video games for domain randomisation without accessing their simulation engines. BehAVE taps into the visual diversity of video games for randomisation and uses textual descriptions of player actions to align videos with similar content. We evaluate BehAVE across 25 first-person shooter (FPS) games using various video and text foundation models, demonstrating its robustness in domain randomisation. BehAVE effectively aligns player behavioural patterns and achieves zero-shot transfer to multiple unseen FPS games when trained on just one game. In a more challenging scenario, BehAVE enhances the zero-shot transferability of foundation models to unseen FPS games, even when trained on a game of a different genre, with improvements of up to 22%. BehAVE is available online at https://github.com/nrasajski/BehAVE.
Authors: Zhi Jing, Yongye Su, Yikun Han
Abstract: This survey explores the synergistic potential of Large Language Models (LLMs) and Vector Databases (VecDBs), a burgeoning but rapidly evolving research area. With the proliferation of LLMs comes a host of challenges, including hallucinations, outdated knowledge, prohibitive commercial application costs, and memory issues. VecDBs emerge as a compelling solution to these issues by offering an efficient means to store, retrieve, and manage the high-dimensional vector representations intrinsic to LLM operations. Through this nuanced review, we delineate the foundational principles of LLMs and VecDBs and critically analyze their integration's impact on enhancing LLM functionalities. This discourse extends into a discussion on the speculative future developments in this domain, aiming to catalyze further research into optimizing the confluence of LLMs and VecDBs for advanced data handling and knowledge extraction capabilities.
Authors: Justin Dumouchelle, Esther Julien, Jannis Kurtz, Elias B. Khalil
Abstract: Bilevel optimization deals with nested problems in which a leader takes the first decision to minimize their objective function while accounting for a follower's best-response reaction. Constrained bilevel problems with integer variables are particularly notorious for their hardness. While exact solvers have been proposed for mixed-integer linear bilevel optimization, they tend to scale poorly with problem size and are hard to generalize to the non-linear case. On the other hand, problem-specific algorithms (exact and heuristic) are limited in scope. Under a data-driven setting in which similar instances of a bilevel problem are solved routinely, our proposed framework, Neur2BiLO, embeds a neural network approximation of the leader's or follower's value function, trained via supervised regression, into an easy-to-solve mixed-integer program. Neur2BiLO serves as a heuristic that produces high-quality solutions extremely fast for four applications with linear and non-linear objectives and pure and mixed-integer variables.
Authors: Hannes Nilsson, Rikard Johansson, Niklas {\AA}kerblom, Morteza Haghir Chehreghani
Abstract: We propose a new framework for contextual multi-armed bandits based on tree ensembles. Our framework adapts two widely used bandit methods, Upper Confidence Bound and Thompson Sampling, for both standard and combinatorial settings. As part of this framework, we propose a novel method of estimating the uncertainty in tree ensemble predictions. We further demonstrate the effectiveness of our framework via several experimental studies, employing XGBoost and random forests, two popular tree ensemble methods. Compared to state-of-the-art methods based on decision trees and neural networks, our methods exhibit superior performance in terms of both regret minimization and computational runtime, when applied to benchmark datasets and the real-world application of navigation over road networks.
Authors: Md Musfiqur Rahman, Matt Jordan, Murat Kocaoglu
Abstract: Causal inference from observational data plays critical role in many applications in trustworthy machine learning. While sound and complete algorithms exist to compute causal effects, many of them assume access to conditional likelihoods, which is difficult to estimate for high-dimensional (particularly image) data. Researchers have alleviated this issue by simulating causal relations with neural models. However, when we have high-dimensional variables in the causal graph along with some unobserved confounders, no existing work can effectively sample from the un/conditional interventional distributions. In this work, we show how to sample from any identifiable interventional distribution given an arbitrary causal graph through a sequence of push-forward computations of conditional generative models, such as diffusion models. Our proposed algorithm follows the recursive steps of the existing likelihood-based identification algorithms to train a set of feed-forward models, and connect them in a specific way to sample from the desired distribution. We conduct experiments on a Colored MNIST dataset having both the treatment ($X$) and the target variables ($Y$) as images and sample from $P(y|do(x))$. Our algorithm also enables us to conduct a causal analysis to evaluate spurious correlations among input features of generative models pre-trained on the CelebA dataset. Finally, we generate high-dimensional interventional samples from the MIMIC-CXR dataset involving text and image variables.
Authors: Yuchun Miao, Sen Zhang, Liang Ding, Rong Bao, Lefei Zhang, Dacheng Tao
Abstract: Despite the success of reinforcement learning from human feedback (RLHF) in aligning language models with human values, reward hacking, also termed reward overoptimization, remains a critical challenge. This issue primarily arises from reward misgeneralization, where reward models (RMs) compute reward using spurious features that are irrelevant to human preferences. In this work, we tackle this problem from an information-theoretic perspective and propose a framework for reward modeling, namely InfoRM, by introducing a variational information bottleneck objective to filter out irrelevant information. Notably, we further identify a correlation between overoptimization and outliers in the IB latent space of InfoRM, establishing it as a promising tool for detecting reward overoptimization. Inspired by this finding, we propose the Cluster Separation Index (CSI), which quantifies deviations in the IB latent space, as an indicator of reward overoptimization to facilitate the development of online mitigation strategies. Extensive experiments on a wide range of settings and RM scales (70M, 440M, 1.4B, and 7B) demonstrate the effectiveness of InfoRM. Further analyses reveal that InfoRM's overoptimization detection mechanism is not only effective but also robust across a broad range of datasets, signifying a notable advancement in the field of RLHF. The code will be released upon acceptance.
Authors: Tao Lin, Yiling Chen
Abstract: Classic principal-agent problems such as Stackelberg games, contract design, and Bayesian persuasion, often assume that the agent is able to best respond to the principal's committed strategy. We study repeated generalized principal-agent problems under the assumption that the principal does not have commitment power and the agent uses algorithms to learn to respond to the principal. We reduce this problem to a one-shot generalized principal-agent problem where the agent approximately best responds. Using this reduction, we show that: (1) If the agent uses contextual no-regret learning algorithms with regret $\mathrm{Reg}(T)$, then the principal can guarantee utility at least $U^* - \Theta\big(\sqrt{\tfrac{\mathrm{Reg}(T)}{T}}\big)$, where $U^*$ is the principal's optimal utility in the classic model with a best-responding agent. (2) If the agent uses contextual no-swap-regret learning algorithms with swap-regret $\mathrm{SReg}(T)$, then the principal cannot obtain utility more than $U^* + O(\frac{\mathrm{SReg(T)}}{T})$. But (3) if the agent uses mean-based learning algorithms (which can be no-regret but not no-swap-regret), then the principal can sometimes do significantly better than $U^*$. These results not only refine previous results in Stackelberg games and contract design, but also lead to new results for Bayesian persuasion with a learning agent and all generalized principal-agent problems where the agent does not have private information.
Authors: Lei Song, Chenxiao Gao, Ke Xue, Chenyang Wu, Dong Li, Jianye Hao, Zongzhang Zhang, Chao Qian
Abstract: Black-Box Optimization (BBO) has found successful applications in many fields of science and engineering. Recently, there has been a growing interest in meta-learning particular components of BBO algorithms to speed up optimization and get rid of tedious hand-crafted heuristics. As an extension, learning the entire algorithm from data requires the least labor from experts and can provide the most flexibility. In this paper, we propose RIBBO, a method to reinforce-learn a BBO algorithm from offline data in an end-to-end fashion. RIBBO employs expressive sequence models to learn the optimization histories produced by multiple behavior algorithms and tasks, leveraging the in-context learning ability of large models to extract task information and make decisions accordingly. Central to our method is to augment the optimization histories with \textit{regret-to-go} tokens, which are designed to represent the performance of an algorithm based on cumulative regret over the future part of the histories. The integration of regret-to-go tokens enables RIBBO to automatically generate sequences of query points that satisfy the user-desired regret, which is verified by its universally good empirical performance on diverse problems, including BBO benchmark functions, hyper-parameter optimization and robot control problems.
Authors: Yiyan Huang, Cheuk Hang Leung, Siyi Wang, Yijun Li, Qi Wu
Abstract: The growing demand for personalized decision-making has led to a surge of interest in estimating the Conditional Average Treatment Effect (CATE). Various types of CATE estimators have been developed with advancements in machine learning and causal inference. However, selecting the desirable CATE estimator through a conventional model validation procedure remains impractical due to the absence of counterfactual outcomes in observational data. Existing approaches for CATE estimator selection, such as plug-in and pseudo-outcome metrics, face two challenges. First, they must determine the metric form and the underlying machine learning models for fitting nuisance parameters (e.g., outcome function, propensity function, and plug-in learner). Second, they lack a specific focus on selecting a robust CATE estimator. To address these challenges, this paper introduces a Distributionally Robust Metric (DRM) for CATE estimator selection. The proposed DRM is nuisance-free, eliminating the need to fit models for nuisance parameters, and it effectively prioritizes the selection of a distributionally robust CATE estimator. The experimental results validate the effectiveness of the DRM method in selecting CATE estimators that are robust to the distribution shift incurred by covariate shift and hidden confounders.
Authors: Michal Lukasik, Harikrishna Narasimhan, Aditya Krishna Menon, Felix Yu, Sanjiv Kumar
Abstract: Large language models (LLMs) have shown strong results on a range of applications, including regression and scoring tasks. Typically, one obtains outputs from an LLM via autoregressive sampling from the model's output distribution. We show that this inference strategy can be sub-optimal for common regression and scoring evaluation metrics. As a remedy, we build on prior work on Minimum Bayes Risk decoding, and propose alternate inference strategies that estimate the Bayes-optimal solution for regression and scoring metrics in closed-form from sampled responses. We show that our proposal significantly improves over baselines across datasets and models.
Authors: Thomas M. Sutter, Yang Meng, Andrea Agostini, Daphn\'e Chopard, Norbert Fortin, Julia E. Vogt, Bahbak Shahbaba, Stephan Mandt
Abstract: Variational Autoencoders for multimodal data hold promise for many tasks in data analysis, such as representation learning, conditional generation, and imputation. Current architectures either share the encoder output, decoder input, or both across modalities to learn a shared representation. Such architectures impose hard constraints on the model. In this work, we show that a better latent representation can be obtained by replacing these hard constraints with a soft constraint. We propose a new mixture-of-experts prior, softly guiding each modality's latent representation towards a shared aggregate posterior. This approach results in a superior latent representation and allows each encoding to preserve information better from its uncompressed original features. In extensive experiments on multiple benchmark datasets and two challenging real-world datasets, we show improved learned latent representations and imputation of missing data modalities compared to existing methods.
Authors: Gantavya Bhatt, Arnav Das, Jeff Bilmes
Abstract: Submodular functions, crucial for various applications, often lack practical learning methods for their acquisition. Seemingly unrelated, learning a scaling from oracles offering graded pairwise preferences (GPC) is underexplored, despite a rich history in psychometrics. In this paper, we introduce deep submodular peripteral networks (DSPNs), a novel parametric family of submodular functions, and methods for their training using a GPC-based strategy to connect and then tackle both of the above challenges. We introduce newly devised GPC-style ``peripteral'' loss which leverages numerically graded relationships between pairs of objects (sets in our case). Unlike traditional contrastive learning, or RHLF preference ranking, our method utilizes graded comparisons, extracting more nuanced information than just binary-outcome comparisons, and contrasts sets of any size (not just two). We also define a novel suite of automatic sampling strategies for training, including active-learning inspired submodular feedback. We demonstrate DSPNs' efficacy in learning submodularity from a costly target submodular function and demonstrate its superiority both for experimental design and online streaming applications.
Authors: Chang Zong, Yuyan Chen, Weiming Lu, Jian Shao, Yueting Zhuang
Abstract: Large Language Models (LLMs) have demonstrated efficacy in various linguistic applications, including text summarization and controlled text generation. However, studies into their capacity of switching between styles via instruction tuning remain underexplored. This study concentrates on the style-switching abilities of LLMs and introduces a novel approach, named ProSwitch, which enables a language model to switch between professional and non-professional answers, by tuning and evaluating through the guidance of domain and style knowledge. ProSwitch unfolds across three phases: LLM-augmented preparation to collect domain knowledge and QA pairs, instruction tuning to optimize LLMs with multiple levels of knowledge, and comprehensive evaluation to assess both style discrimination and reference-based quality of generated text. Comparative analysis of ProSwitch against general and specialized LLMs reveals that our approach outperforms baselines in switching between professional and non-professional answers.
Authors: Adrian Mirza, Nawaf Alampara, Sreekanth Kunchapu, Marti\~no R\'ios-Garc\'ia, Benedict Emoekabu, Aswanth Krishnan, Tanya Gupta, Mara Schilling-Wilhelmi, Macjonathan Okereke, Anagha Aneesh, Amir Mohammad Elahi, Mehrdad Asgari, Juliane Eberhardt, Hani M. Elbeheiry, Mar\'ia Victoria Gil, Maximilian Greiner, Caroline T. Holick, Christina Glaubitz, Tim Hoffmann, Abdelrahman Ibrahim, Lea C. Klepsch, Yannik K\"oster, Fabian Alexander Kreth, Jakob Meyer, Santiago Miret, Jan Matthias Peschel, Michael Ringleb, Nicole Roesner, Johanna Schreiber, Ulrich S. Schubert, Leanne M. Stafast, Dinga Wonanke, Michael Pieler, Philippe Schwaller, Kevin Maik Jablonka
Abstract: Large language models (LLMs) have gained widespread interest due to their ability to process human language and perform tasks on which they have not been explicitly trained. However, we possess only a limited systematic understanding of the chemical capabilities of LLMs, which would be required to improve models and mitigate potential harm. Here, we introduce "ChemBench," an automated framework for evaluating the chemical knowledge and reasoning abilities of state-of-the-art LLMs against the expertise of chemists. We curated more than 2,700 question-answer pairs, evaluated leading open- and closed-source LLMs, and found that the best models outperformed the best human chemists in our study on average. However, the models struggle with some basic tasks and provide overconfident predictions. These findings reveal LLMs' impressive chemical capabilities while emphasizing the need for further research to improve their safety and usefulness. They also suggest adapting chemistry education and show the value of benchmarking frameworks for evaluating LLMs in specific domains.
Authors: Moshe Eliasof, Beatrice Bevilacqua, Carola-Bibiane Sch\"onlieb, Haggai Maron
Abstract: In recent years, significant efforts have been made to refine the design of Graph Neural Network (GNN) layers, aiming to overcome diverse challenges, such as limited expressive power and oversmoothing. Despite their widespread adoption, the incorporation of off-the-shelf normalization layers like BatchNorm or InstanceNorm within a GNN architecture may not effectively capture the unique characteristics of graph-structured data, potentially reducing the expressive power of the overall architecture. Moreover, existing graph-specific normalization layers often struggle to offer substantial and consistent benefits. In this paper, we propose GRANOLA, a novel graph-adaptive normalization layer. Unlike existing normalization layers, GRANOLA normalizes node features by adapting to the specific characteristics of the graph, particularly by generating expressive representations of its neighborhood structure, obtained by leveraging the propagation of Random Node Features (RNF) in the graph. We present theoretical results that support our design choices. Our extensive empirical evaluation of various graph benchmarks underscores the superior performance of GRANOLA over existing normalization techniques. Furthermore, GRANOLA emerges as the top-performing method among all baselines within the same time complexity of Message Passing Neural Networks (MPNNs).
Authors: Yihao Zhang, Zeming Wei, Jun Sun, Meng Sun
Abstract: Since the rapid development of Large Language Models (LLMs) has achieved remarkable success, understanding and rectifying their internal complex mechanisms has become an urgent issue. Recent research has attempted to interpret their behaviors through the lens of inner representation. However, developing practical and efficient methods for applying these representations for general and flexible model editing remains challenging. In this work, we explore how to leverage insights from representation engineering to guide the editing of LLMs by deploying a representation sensor as an editing oracle. We first identify the importance of a robust and reliable sensor during editing, then propose an Adversarial Representation Engineering (ARE) framework to provide a unified and interpretable approach for conceptual model editing without compromising baseline performance. Experiments on multiple tasks demonstrate the effectiveness of ARE in various model editing scenarios. Our code and data are available at https://github.com/Zhang-Yihao/Adversarial-Representation-Engineering.
URLs: https://github.com/Zhang-Yihao/Adversarial-Representation-Engineering.
Authors: Jo\~ao Monteiro, \'Etienne Marcotte, Pierre-Andr\'e No\"el, Valentina Zantedeschi, David V\'azquez, Nicolas Chapados, Christopher Pal, Perouz Taslakian
Abstract: In-context learning (ICL) approaches typically leverage prompting to condition decoder-only language model generation on reference information. Just-in-time processing of a context is inefficient due to the quadratic cost of self-attention operations, and caching is desirable. However, caching transformer states can easily require almost as much space as the model parameters. When the right context isn't known in advance, caching ICL can be challenging. This work addresses these limitations by introducing models that, inspired by the encoder-decoder architecture, use cross-attention to condition generation on reference text without the prompt. More precisely, we leverage pre-trained decoder-only models and only train a small number of added layers. We use Question-Answering (QA) as a testbed to evaluate the ability of our models to perform conditional generation and observe that they outperform ICL, are comparable to fine-tuned prompted LLMs, and drastically reduce the space footprint relative to standard KV caching by two orders of magnitude.
Authors: Qingyang Zhang, Yake Wei, Zongbo Han, Huazhu Fu, Xi Peng, Cheng Deng, Qinghua Hu, Cai Xu, Jie Wen, Di Hu, Changqing Zhang
Abstract: Multimodal fusion focuses on integrating information from multiple modalities with the goal of more accurate prediction, which has achieved remarkable progress in a wide range of scenarios, including autonomous driving and medical diagnosis. However, the reliability of multimodal fusion remains largely unexplored especially under low-quality data settings. This paper surveys the common challenges and recent advances of multimodal fusion in the wild and presents them in a comprehensive taxonomy. From a data-centric view, we identify four main challenges that are faced by multimodal fusion on low-quality data, namely (1) noisy multimodal data that are contaminated with heterogeneous noises, (2) incomplete multimodal data that some modalities are missing, (3) imbalanced multimodal data that the qualities or properties of different modalities are significantly different and (4) quality-varying multimodal data that the quality of each modality dynamically changes with respect to different samples. This new taxonomy will enable researchers to understand the state of the field and identify several potential directions. We also provide discussion for the open problems in this field together with interesting future research directions.
Authors: Shenghe Zheng, Hongzhi Wang, Xianglong Liu
Abstract: Graph Neural Networks (GNNs) have shown great performance in various tasks, with the core idea of learning from data labels and aggregating messages within the neighborhood of nodes. However, the common challenges in graphs are twofold: insufficient accurate (high-quality) labels and limited neighbors for nodes, resulting in weak GNNs. Existing graph augmentation methods typically address only one of these challenges, often adding training costs or relying on oversimplified or knowledge-intensive strategies, limiting their generalization. To simultaneously address both challenges faced by graphs in a generalized way, we propose an elegant method called IntraMix. Considering the incompatibility of vanilla Mixup with the complex topology of graphs, IntraMix innovatively employs Mixup among inaccurate labeled data of the same class, generating high-quality labeled data at minimal cost. Additionally, it finds data with high confidence of being clustered into the same group as the generated data to serve as their neighbors, thereby enriching the neighborhoods of graphs. IntraMix efficiently tackles both issues faced by graphs and challenges the prior notion of the limited effectiveness of Mixup in node classification. IntraMix is a theoretically grounded plug-in-play method that can be readily applied to all GNNs. Extensive experiments demonstrate the effectiveness of IntraMix across various GNNs and datasets. Our code is available at: https://github.com/Zhengsh123/IntraMix.
Authors: Xin Qiu, Risto Miikkulainen
Abstract: With the widespread application of Large Language Models (LLMs) to various domains, concerns regarding the trustworthiness of LLMs in safety-critical scenarios have been raised, due to their unpredictable tendency to hallucinate and generate misinformation. Existing LLMs do not have an inherent functionality to provide the users with an uncertainty/confidence metric for each response it generates, making it difficult to evaluate trustworthiness. Although several studies aim to develop uncertainty quantification methods for LLMs, they have fundamental limitations, such as being restricted to classification tasks, requiring additional training and data, considering only lexical instead of semantic information, and being prompt-wise but not response-wise. A new framework is proposed in this paper to address these issues. Semantic density extracts uncertainty/confidence information for each response from a probability distribution perspective in semantic space. It has no restriction on task types and is "off-the-shelf" for new models and tasks. Experiments on seven state-of-the-art LLMs, including the latest Llama 3 and Mixtral-8x22B models, on four free-form question-answering benchmarks demonstrate the superior performance and robustness of semantic density compared to prior approaches.
Authors: Peng Sun, Yi Jiang, Tao Lin
Abstract: Data, the seminal opportunity and challenge in modern machine learning, currently constrains the scalability of representation learning and impedes the pace of model evolution. In this work, we investigate the efficiency properties of data from both optimization and generalization perspectives. Our theoretical and empirical analysis reveals an unexpected finding: for a given task, utilizing a publicly available, task- and architecture-agnostic model (referred to as the `prior model' in this paper) can effectively produce efficient data. Building on this insight, we propose the Representation Learning Accelerator (\algopt), which promotes the formation and utilization of efficient data, thereby accelerating representation learning. Utilizing a ResNet-18 pre-trained on CIFAR-10 as a prior model to inform ResNet-50 training on ImageNet-1K reduces computational costs by 50% while maintaining the same accuracy as the model trained with the original BYOL, which requires 100% cost. Our code is available at: \url{https://github.com/LINs-lab/ReLA}.
Authors: Yifei Li, Yuchen Sun, Pingchuan Ma, Eftychios Sifakis, Tao Du, Bo Zhu, Wojciech Matusik
Abstract: We present a novel framework to explore neural control and design of complex fluidic systems with dynamic solid boundaries. Our system features a fast differentiable Navier-Stokes solver with solid-fluid interface handling, a low-dimensional differentiable parametric geometry representation, a control-shape co-design algorithm, and gym-like simulation environments to facilitate various fluidic control design applications. Additionally, we present a benchmark of design, control, and learning tasks on high-fidelity, high-resolution dynamic fluid environments that pose challenges for existing differentiable fluid simulators. These tasks include designing the control of artificial hearts, identifying robotic end-effector shapes, and controlling a fluid gate. By seamlessly incorporating our differentiable fluid simulator into a learning framework, we demonstrate successful design, control, and learning results that surpass gradient-free solutions in these benchmark tasks.
Authors: Abhinav Jain, Swarat Chaudhuri, Thomas Reps, Chris Jermaine
Abstract: Parameter-Efficient Fine-Tuning (PEFT) has become the standard for customising Foundation Models (FMs) to user-specific downstream tasks. However, typical PEFT methods require storing multiple task-specific adapters, creating scalability issues as these adapters must be housed and run at the FM server. Traditional prompt tuning offers a potential solution by customising them through task-specific input prefixes, but it under-performs compared to other PEFT methods like LoRA. To address this gap, we propose Low-Rank Prompt Adaptation (LoPA), a prompt-tuning-based approach that performs on par with state-of-the-art PEFT methods and full fine-tuning while being more parameter-efficient and not requiring a server-based adapter. LoPA generates soft prompts by balancing between sharing task-specific information across instances and customization for each instance. It uses a low-rank decomposition of the soft-prompt component encoded for each instance to achieve parameter efficiency. We provide a comprehensive evaluation on multiple natural language understanding and code generation and understanding tasks across a wide range of foundation models with varying sizes.
Authors: Shraddha Barke, Emmanuel Anaya Gonzalez, Saketh Ram Kasibatla, Taylor Berg-Kirkpatrick, Nadia Polikarpova
Abstract: Many structured prediction and reasoning tasks can be framed as program synthesis problems, where the goal is to generate a program in a domain-specific language (DSL) that transforms input data into the desired output. Unfortunately, purely neural approaches, such as large language models (LLMs), often fail to produce fully correct programs in unfamiliar DSLs, while purely symbolic methods based on combinatorial search scale poorly to complex problems. Motivated by these limitations, we introduce a hybrid approach, where LLM completions for a given task are used to learn a task-specific, context-free surrogate model, which is then used to guide program synthesis. We evaluate this hybrid approach on three domains, and show that it outperforms both unguided search and direct sampling from LLMs, as well as existing program synthesizers.
Authors: Runlin Lei, Yuwei Hu, Yuchen Ren, Zhewei Wei
Abstract: Graph Neural Networks (GNNs) excel across various applications but remain vulnerable to adversarial attacks, particularly Graph Injection Attacks (GIAs), which inject malicious nodes into the original graph and pose realistic threats. Text-attributed graphs (TAGs), where nodes are associated with textual features, are crucial due to their prevalence in real-world applications and are commonly used to evaluate these vulnerabilities. However, existing research only focuses on embedding-level GIAs, which inject node embeddings rather than actual textual content, limiting their applicability and simplifying detection. In this paper, we pioneer the exploration of GIAs at the text level, presenting three novel attack designs that inject textual content into the graph. Through theoretical and empirical analysis, we demonstrate that text interpretability, a factor previously overlooked at the embedding level, plays a crucial role in attack strength. Among the designs we investigate, the Word-frequency-based Text-level GIA (WTGIA) is particularly notable for its balance between performance and interpretability. Despite the success of WTGIA, we discover that defenders can easily enhance their defenses with customized text embedding methods or large language model (LLM)--based predictors. These insights underscore the necessity for further research into the potential and practical significance of text-level GIAs.
Authors: Yusuke Kawamoto, Kentaro Kobayashi, Kohei Suenaga
Abstract: Statistical methods have been widely misused and misinterpreted in various scientific fields, raising significant concerns about the integrity of scientific research. To mitigate this problem, we propose a new method for formally specifying and automatically verifying the correctness of statistical programs. In this method, programmers are required to annotate the source code of the statistical programs with the requirements for these methods. Through this annotation, they are reminded to check the requirements for statistical methods, including those that cannot be formally verified, such as the distribution of the unknown true population. Our software tool StatWhy automatically checks whether programmers have properly specified the requirements for the statistical methods, thereby identifying any missing requirements that need to be addressed. This tool is implemented using the Why3 platform to verify the correctness of OCaml programs that conduct statistical hypothesis testing. We demonstrate how StatWhy can be used to avoid common errors in various popular statistical hypothesis testing programs.
Authors: Allen Nie, Yash Chandak, Christina J. Yuan, Anirudhan Badrinath, Yannis Flet-Berliac, Emma Brunskil
Abstract: Offline policy evaluation (OPE) allows us to evaluate and estimate a new sequential decision-making policy's performance by leveraging historical interaction data collected from other policies. Evaluating a new policy online without a confident estimate of its performance can lead to costly, unsafe, or hazardous outcomes, especially in education and healthcare. Several OPE estimators have been proposed in the last decade, many of which have hyperparameters and require training. Unfortunately, choosing the best OPE algorithm for each task and domain is still unclear. In this paper, we propose a new algorithm that adaptively blends a set of OPE estimators given a dataset without relying on an explicit selection using a statistical procedure. We prove that our estimator is consistent and satisfies several desirable properties for policy evaluation. Additionally, we demonstrate that when compared to alternative approaches, our estimator can be used to select higher-performing policies in healthcare and robotics. Our work contributes to improving ease of use for a general-purpose, estimator-agnostic, off-policy evaluation framework for offline RL.
Authors: Jaehong Yoon, Shoubin Yu, Mohit Bansal
Abstract: Recent video generative models primarily rely on carefully written text prompts for specific tasks, like inpainting or style editing. They require labor-intensive textual descriptions for input videos, hindering their flexibility to adapt personal/raw videos to user specifications. This paper proposes RACCooN, a versatile and user-friendly video-to-paragraph-to-video generative framework that supports multiple video editing capabilities such as removal, addition, and modification, through a unified pipeline. RACCooN consists of two principal stages: Video-to-Paragraph (V2P) and Paragraph-to-Video (P2V). In the V2P stage, we automatically describe video scenes in well-structured natural language, capturing both the holistic context and focused object details. Subsequently, in the P2V stage, users can optionally refine these descriptions to guide the video diffusion model, enabling various modifications to the input video, such as removing, changing subjects, and/or adding new objects. The proposed approach stands out from other methods through several significant contributions: (1) RACCooN suggests a multi-granular spatiotemporal pooling strategy to generate well-structured video descriptions, capturing both the broad context and object details without requiring complex human annotations, simplifying precise video content editing based on text for users. (2) Our video generative model incorporates auto-generated narratives or instructions to enhance the quality and accuracy of the generated content. (3) RACCooN also plans to imagine new objects in a given video, so users simply prompt the model to receive a detailed video editing plan for complex video editing. The proposed framework demonstrates impressive versatile capabilities in video-to-paragraph generation, video content editing, and can be incorporated into other SoTA video generative models for further enhancement.
Authors: Arthur Juliani, Jordan T. Ash
Abstract: Continual learning with deep neural networks presents challenges distinct from both the fixed-dataset and convex continual learning regimes. One such challenge is plasticity loss, wherein a neural network trained in an online fashion displays a degraded ability to fit new tasks. This problem has been extensively studied in both supervised learning and off-policy reinforcement learning (RL), where a number of remedies have been proposed. Still, plasticity loss has received less attention in the on-policy deep RL setting. Here we perform an extensive set of experiments examining plasticity loss and a variety of mitigation methods in on-policy deep RL. We demonstrate that plasticity loss is pervasive under domain shift in this regime, and that a number of methods developed to resolve it in other settings fail, sometimes even performing worse than applying no intervention at all. In contrast, we find that a class of ``regenerative'' methods are able to consistently mitigate plasticity loss in a variety of contexts, including in gridworld tasks and more challenging environments like Montezuma's Revenge and ProcGen.
Authors: Tianyu Chen, Zhendong Wang, Mingyuan Zhou
Abstract: Offline reinforcement learning (RL) leverages pre-collected datasets to train optimal policies. Diffusion Q-Learning (DQL), introducing diffusion models as a powerful and expressive policy class, significantly boosts the performance of offline RL. However, its reliance on iterative denoising sampling to generate actions slows down both training and inference. While several recent attempts have tried to accelerate diffusion-QL, the improvement in training and/or inference speed often results in degraded performance. In this paper, we introduce a dual policy approach, Diffusion Trusted Q-Learning (DTQL), which comprises a diffusion policy for pure behavior cloning and a practical one-step policy. We bridge the two polices by a newly introduced diffusion trust region loss. The diffusion policy maintains expressiveness, while the trust region loss directs the one-step policy to explore freely and seek modes within the region defined by the diffusion policy. DTQL eliminates the need for iterative denoising sampling during both training and inference, making it remarkably computationally efficient. We evaluate its effectiveness and algorithmic characteristics against popular Kullback--Leibler divergence-based distillation methods in 2D bandit scenarios and gym tasks. We then show that DTQL could not only outperform other methods on the majority of the D4RL benchmark tasks but also demonstrate efficiency in training and inference speeds. The PyTorch implementation is available at https://github.com/TianyuCodings/Diffusion_Trusted_Q_Learning.
URLs: https://github.com/TianyuCodings/Diffusion_Trusted_Q_Learning.
Authors: Yangruibo Ding, Jinjun Peng, Marcus J. Min, Gail Kaiser, Junfeng Yang, Baishakhi Ray
Abstract: Code Large Language Models (Code LLMs) have excelled at tasks like code completion but often miss deeper semantics such as execution effects and dynamic states. This paper aims to bridge the gap between Code LLMs' reliance on static text data and the need for semantic understanding for complex tasks like debugging and program repair. We introduce a novel strategy, monologue reasoning, to train Code LLMs to reason comprehensive semantics, encompassing high-level functional descriptions, local execution effects of individual statements, and overall input/output behavior, thereby linking static code text with dynamic execution states. We begin by collecting PyX, a clean Python corpus of fully executable code samples with functional descriptions and test cases. We propose training Code LLMs not only to write code but also to understand code semantics by reasoning about key properties, constraints, and execution behaviors using natural language, mimicking human verbal debugging, i.e., rubber-duck debugging. This approach led to the development of SemCoder, a Code LLM with only 6.7B parameters, which shows competitive performance with GPT-3.5-turbo on code generation and execution reasoning tasks. SemCoder achieves 79.3% on HumanEval (GPT-3.5-turbo: 76.8%), 63.6% on CRUXEval-I (GPT-3.5-turbo: 50.3%), and 63.9% on CRUXEval-O (GPT-3.5-turbo: 59.0%). We also study the effectiveness of SemCoder's monologue-style execution reasoning compared to concrete scratchpad reasoning, showing that our approach integrates semantics from multiple dimensions more smoothly. Finally, we demonstrate the potential of applying learned semantics to improve Code LLMs' debugging and self-refining capabilities. Our data, code, and models are available at: https://github.com/ARiSE-Lab/SemCoder.
Authors: Charlie B. Tan, In\'es Garc\'ia-Redondo, Qiquan Wang, Michael M. Bronstein, Anthea Monod
Abstract: Bounding and predicting the generalization gap of overparameterized neural networks remains a central open problem in theoretical machine learning. There is a recent and growing body of literature that proposes the framework of fractals to model optimization trajectories of neural networks, motivating generalization bounds and measures based on the fractal dimension of the trajectory. Notably, the persistent homology dimension has been proposed to correlate with the generalization gap. This paper performs an empirical evaluation of these persistent homology-based generalization measures, with an in-depth statistical analysis. Our study reveals confounding effects in the observed correlation between generalization and topological measures due to the variation of hyperparameters. We also observe that fractal dimension fails to predict generalization of models trained from poor initializations. We lastly reveal the intriguing manifestation of model-wise double descent in these topological generalization measures. Our work forms a basis for a deeper investigation of the causal relationships between fractal geometry, topological data analysis, and neural network optimization.
Authors: Namgyu Ho, Sangmin Bae, Taehyeon Kim, Hyunjik Jo, Yireun Kim, Tal Schuster, Adam Fisch, James Thorne, Se-Young Yun
Abstract: We introduce the Block Transformer which adopts hierarchical global-to-local modeling to autoregressive transformers to mitigate the inference bottlenecks associated with self-attention. Self-attention requires the key-value (KV) cache of all previous sequences to be retrieved from memory at every decoding step to retrieve context information, leading to two primary bottlenecks during batch inference. First, there is a significant delay in obtaining the first token, as the information of the entire prompt must first be processed to prefill the KV cache. Second, computation of subsequent tokens is bottlenecked by the high memory I/O demand of fetching the entire KV cache, which grows linearly with sequence length, incurring quadratic memory reads overall. We design the Block Transformer to strategically mitigate these costs, by incorporating coarsity and locality into an integrated global-to-local architecture. At the lower layers, we aggregate tokens into fixed size blocks to apply attention across the entire sequence at coarse-grained detail, to capture the global context while minimizing KV cache overhead. At upper layers, we apply attention within each block to decode individual tokens, to model fine-grained details with a lightweight local KV cache. We pretrain vanilla and Block Transformers from scratch and demonstrate that Block Transformers reach 10--20x inference throughput compared to vanilla transformers with equivalent perplexity and zero-shot task performance. Code is available at https://github.com/itsnamgyu/block-transformer.
Authors: Taewon Park, Hyun-Chul Kim, Minho Lee
Abstract: Neuro-symbolic neural networks have been extensively studied to integrate symbolic operations with neural networks, thereby improving systematic generalization. Specifically, Tensor Product Representation (TPR) framework enables neural networks to perform differentiable symbolic operations by encoding the symbolic structure of data within vector spaces. However, TPR-based neural networks often struggle to decompose unseen data into structured TPR representations, undermining their symbolic operations. To address this decomposition problem, we propose a Discrete Dictionary-based Decomposition (D3) layer designed to enhance the decomposition capabilities of TPR-based models. D3 employs discrete, learnable key-value dictionaries trained to capture symbolic features essential for decomposition operations. It leverages the prior knowledge acquired during training to generate structured TPR representations by mapping input data to pre-learned symbolic features within these dictionaries. D3 is a straightforward drop-in layer that can be seamlessly integrated into any TPR-based model without modifications. Our experimental results demonstrate that D3 significantly improves the systematic generalization of various TPR-based models while requiring fewer additional parameters. Notably, D3 outperforms baseline models on the synthetic task that demands the systematic decomposition of unseen combinatorial data.
Authors: Hengyi Wang, Shiwei Tan, Hao Wang
Abstract: Vision transformers (ViTs) have emerged as a significant area of focus, particularly for their capacity to be jointly trained with large language models and to serve as robust vision foundation models. Yet, the development of trustworthy explanation methods for ViTs has lagged, particularly in the context of post-hoc interpretations of ViT predictions. Existing sub-image selection approaches, such as feature-attribution and conceptual models, fall short in this regard. This paper proposes five desiderata for explaining ViTs -- faithfulness, stability, sparsity, multi-level structure, and parsimony -- and demonstrates the inadequacy of current methods in meeting these criteria comprehensively. We introduce a variational Bayesian explanation framework, dubbed ProbAbilistic Concept Explainers (PACE), which models the distributions of patch embeddings to provide trustworthy post-hoc conceptual explanations. Our qualitative analysis reveals the distributions of patch-level concepts, elucidating the effectiveness of ViTs by modeling the joint distribution of patch embeddings and ViT's predictions. Moreover, these patch-level explanations bridge the gap between image-level and dataset-level explanations, thus completing the multi-level structure of PACE. Through extensive experiments on both synthetic and real-world datasets, we demonstrate that PACE surpasses state-of-the-art methods in terms of the defined desiderata.
Authors: Stefan K. Nielsen, Laziz U. Abdullaev, Rachel S. Y. Teo, Tan M. Nguyen
Abstract: Pairwise dot-product self-attention is key to the success of transformers that achieve state-of-the-art performance across a variety of applications in language and vision. This dot-product self-attention computes attention weights among the input tokens using Euclidean distance, which makes the model prone to representation collapse and vulnerable to contaminated samples. In this paper, we propose using a Mahalanobis distance metric for computing the attention weights to stretch the underlying feature space in directions of high contextual relevance. In particular, we define a hyper-ellipsoidal neighborhood around each query to increase the attention weights of the tokens lying in the contextually important directions. We term this novel class of attention Elliptical Attention. Our Elliptical Attention provides two benefits: 1) reducing representation collapse and 2) enhancing the model's robustness as Elliptical Attention pays more attention to contextually relevant information rather than focusing on some small subset of informative features. We empirically demonstrate the advantages of Elliptical Attention over the baseline dot-product attention and state-of-the-art attention methods on various practical tasks, including object classification, image segmentation, and language modeling across different data modalities.
Authors: Tianyu Fu, Haofeng Huang, Xuefei Ning, Genghan Zhang, Boju Chen, Tianqi Wu, Hongyi Wang, Zixiao Huang, Shiyao Li, Shengen Yan, Guohao Dai, Huazhong Yang, Yu Wang
Abstract: Sparse attention can effectively mitigate the significant memory and throughput demands of Large Language Models (LLMs) in long contexts. Existing methods typically employ a uniform sparse attention mask, applying the same sparse pattern across different attention heads and input lengths. However, this uniform approach fails to capture the diverse attention patterns inherent in LLMs, ignoring their distinct accuracy-latency trade-offs. To address this challenge, we propose the Mixture of Attention (MoA), which automatically tailors distinct sparse attention configurations to different heads and layers. MoA constructs and navigates a search space of various attention patterns and their scaling rules relative to input sequence lengths. It profiles the model, evaluates potential configurations, and pinpoints the optimal sparse attention compression plan. MoA adapts to varying input sizes, revealing that some attention heads expand their focus to accommodate longer sequences, while other heads consistently concentrate on fixed-length local contexts. Experiments show that MoA increases the effective context length by $3.9\times$ with the same average attention span, boosting retrieval accuracy by $1.5-7.1\times$ over the uniform-attention baseline across Vicuna-{7B,13B}, and Llama3-{8B,70B} models. Moreover, MoA narrows the capability gaps between sparse and dense models, reducing the maximum relative performance drop from $9\%-36\%$ to within $5\%$ across two long-context understanding benchmarks. MoA achieves a $1.2-1.4\times$ GPU memory reduction, boosting decode throughput by $6.6-8.2\times$ and $1.7-1.9\times$ compared to FlashAttention2 and vLLM, with minimal impact on performance. Our code is available at \url{https://github.com/thu-nics/MoA}.
Authors: Dmitry Shribak, Chen-Xiao Gao, Yitong Li, Chenjun Xiao, Bo Dai
Abstract: Diffusion-based models have achieved notable empirical successes in reinforcement learning (RL) due to their expressiveness in modeling complex distributions. Despite existing methods being promising, the key challenge of extending existing methods for broader real-world applications lies in the computational cost at inference time, i.e., sampling from a diffusion model is considerably slow as it often requires tens to hundreds of iterations to generate even one sample. To circumvent this issue, we propose to leverage the flexibility of diffusion models for RL from a representation learning perspective. In particular, by exploiting the connection between diffusion models and energy-based models, we develop Diffusion Spectral Representation (Diff-SR), a coherent algorithm framework that enables extracting sufficient representations for value functions in Markov decision processes (MDP) and partially observable Markov decision processes (POMDP). We further demonstrate how Diff-SR facilitates efficient policy optimization and practical algorithms while explicitly bypassing the difficulty and inference cost of sampling from the diffusion model. Finally, we provide comprehensive empirical studies to verify the benefits of Diff-SR in delivering robust and advantageous performance across various benchmarks with both fully and partially observable settings.
Authors: Jiahe Huang, Guandao Yang, Zichen Wang, Jeong Joon Park
Abstract: We introduce a general framework for solving partial differential equations (PDEs) using generative diffusion models. In particular, we focus on the scenarios where we do not have the full knowledge of the scene necessary to apply classical solvers. Most existing forward or inverse PDE approaches perform poorly when the observations on the data or the underlying coefficients are incomplete, which is a common assumption for real-world measurements. In this work, we propose DiffusionPDE that can simultaneously fill in the missing information and solve a PDE by modeling the joint distribution of the solution and coefficient spaces. We show that the learned generative priors lead to a versatile framework for accurately solving a wide range of PDEs under partial observation, significantly outperforming the state-of-the-art methods for both forward and inverse directions.
Authors: Jonas Ngnaw\'e, Sabyasachi Sahoo, Yann Pequignot, Fr\'ed\'eric Precioso, Christian Gagn\'e
Abstract: Despite extensive research on adversarial training strategies to improve robustness, the decisions of even the most robust deep learning models can still be quite sensitive to imperceptible perturbations, creating serious risks when deploying them for high-stakes real-world applications. While detecting such cases may be critical, evaluating a model's vulnerability at a per-instance level using adversarial attacks is computationally too intensive and unsuitable for real-time deployment scenarios. The input space margin is the exact score to detect non-robust samples and is intractable for deep neural networks. This paper introduces the concept of margin consistency -- a property that links the input space margins and the logit margins in robust models -- for efficient detection of vulnerable samples. First, we establish that margin consistency is a necessary and sufficient condition to use a model's logit margin as a score for identifying non-robust samples. Next, through comprehensive empirical analysis of various robustly trained models on CIFAR10 and CIFAR100 datasets, we show that they indicate high margin consistency with a strong correlation between their input space margins and the logit margins. Then, we show that we can effectively and confidently use the logit margin to detect brittle decisions with such models. Finally, we address cases where the model is not sufficiently margin-consistent by learning a pseudo-margin from the feature representation. Our findings highlight the potential of leveraging deep representations to assess adversarial vulnerability in deployment scenarios efficiently.
Authors: Xuweiyi Chen, Ziqiao Ma, Xuejun Zhang, Sihan Xu, Shengyi Qian, Jianing Yang, David F. Fouhey, Joyce Chai
Abstract: Large vision language models (LVLMs) often suffer from object hallucination, producing objects not present in the given images. While current benchmarks for object hallucination primarily concentrate on the presence of a single object class rather than individual entities, this work systematically investigates multi-object hallucination, examining how models misperceive (e.g., invent nonexistent objects or become distracted) when tasked with focusing on multiple objects simultaneously. We introduce Recognition-based Object Probing Evaluation (ROPE), an automated evaluation protocol that considers the distribution of object classes within a single image during testing and uses visual referring prompts to eliminate ambiguity. With comprehensive empirical studies and analysis of potential factors leading to multi-object hallucination, we found that (1). LVLMs suffer more hallucinations when focusing on multiple objects compared to a single object. (2). The tested object class distribution affects hallucination behaviors, indicating that LVLMs may follow shortcuts and spurious correlations. (3). Hallucinatory behaviors are influenced by data-specific factors, salience and frequency, and model intrinsic behaviors. We hope to enable LVLMs to recognize and reason about multiple objects that often occur in realistic visual scenes, provide insights, and quantify our progress towards mitigating the issues.
Authors: Shuang Wang, Qianwen Lu, Boxing Peng, Yihe Nie, Qingchuan Tao
Abstract: For the task of low-light image enhancement, deep learning-based algorithms have demonstrated superiority and effectiveness compared to traditional methods. However, these methods, primarily based on Retinex theory, tend to overlook the noise and color distortions in input images, leading to significant noise amplification and local color distortions in enhanced results. To address these issues, we propose the Dual-Path Error Compensation (DPEC) method, designed to improve image quality under low-light conditions by preserving local texture details while restoring global image brightness without amplifying noise. DPEC incorporates precise pixel-level error estimation to capture subtle differences and an independent denoising mechanism to prevent noise amplification. We introduce the HIS-Retinex loss to guide DPEC's training, ensuring the brightness distribution of enhanced images closely aligns with real-world conditions. To balance computational speed and resource efficiency while training DPEC for a comprehensive understanding of the global context, we integrated the VMamba architecture into its backbone. Comprehensive quantitative and qualitative experimental results demonstrate that our algorithm significantly outperforms state-of-the-art methods in low-light image enhancement. The code is publicly available online at https://github.com/wangshuang233/DPEC.
Authors: Chaofan Tao, Qian Liu, Longxu Dou, Niklas Muennighoff, Zhongwei Wan, Ping Luo, Min Lin, Ngai Wong
Abstract: Research on scaling large language models (LLMs) has primarily focused on model parameters and training data size, overlooking the role of vocabulary size. We investigate how vocabulary size impacts LLM scaling laws by training models ranging from 33M to 3B parameters on up to 500B characters with various vocabulary configurations. We propose three complementary approaches for predicting the compute-optimal vocabulary size: IsoFLOPs analysis, derivative estimation, and parametric fit of the loss function. Our approaches converge on the conclusion that the optimal vocabulary size depends on the compute budget, with larger models requiring larger vocabularies. Most LLMs, however, use insufficient vocabulary sizes. For example, we predict that the optimal vocabulary size of Llama2-70B should have been at least 216K, 7 times larger than its vocabulary of 32K. We validate our predictions empirically by training models with 3B parameters across different FLOPs budgets. Adopting our predicted optimal vocabulary size consistently improves downstream performance over commonly used vocabulary sizes. By increasing the vocabulary size from the conventional 32K to 43K, we improve performance on ARC-Challenge from 29.1 to 32.0 with the same 2.3e21 FLOPs. Our work highlights the importance of jointly considering tokenization and model scaling for efficient pre-training. The code and demo are available at https://github.com/sail-sg/scaling-with-vocab and https://hf.co/spaces/sail/scaling-with-vocab-demo.
URLs: https://github.com/sail-sg/scaling-with-vocab, https://hf.co/spaces/sail/scaling-with-vocab-demo.
Authors: Xiaohang Tang, Afonso Marques, Parameswaran Kamalaruban, Ilija Bogunovic
Abstract: Decision Transformer (DT), as one of the representative Reinforcement Learning via Supervised Learning (RvS) methods, has achieved strong performance in offline learning tasks by leveraging the powerful Transformer architecture for sequential decision-making. However, in adversarial environments, these methods can be non-robust, since the return is dependent on the strategies of both the decision-maker and adversary. Training a probabilistic model conditioned on observed return to predict action can fail to generalize, as the trajectories that achieve a return in the dataset might have done so due to a suboptimal behavior adversary. To address this, we propose a worst-case-aware RvS algorithm, the Adversarially Robust Decision Transformer (ARDT), which learns and conditions the policy on in-sample minimax returns-to-go. ARDT aligns the target return with the worst-case return learned through minimax expectile regression, thereby enhancing robustness against powerful test-time adversaries. In experiments conducted on sequential games with full data coverage, ARDT can generate a maximin (Nash Equilibrium) strategy, the solution with the largest adversarial robustness. In large-scale sequential games and continuous adversarial RL environments with partial data coverage, ARDT demonstrates significantly superior robustness to powerful test-time adversaries and attains higher worst-case returns compared to contemporary DT methods.
Authors: Federico Belotti, Fabio Dadda, Marco Cremaschi, Roberto Avogadro, Matteo Palmonari
Abstract: Tables are crucial containers of information, but understanding their meaning may be challenging. Over the years, there has been a surge in interest in data-driven approaches based on deep learning that have increasingly been combined with heuristic-based ones. In the last period, the advent of \acf{llms} has led to a new category of approaches for table annotation. However, these approaches have not been consistently evaluated on a common ground, making evaluation and comparison difficult. This work proposes an extensive evaluation of four STI SOTA approaches: Alligator (formerly s-elbat), Dagobah, TURL, and TableLlama; the first two belong to the family of heuristic-based algorithms, while the others are respectively encoder-only and decoder-only Large Language Models (LLMs). We also include in the evaluation both GPT-4o and GPT-4o-mini, since they excel in various public benchmarks. The primary objective is to measure the ability of these approaches to solve the entity disambiguation task with respect to both the performance achieved on a common-ground evaluation setting and the computational and cost requirements involved, with the ultimate aim of charting new research paths in the field.
Authors: Xiaoyu Kong, Jiancan Wu, An Zhang, Leheng Sheng, Hui Lin, Xiang Wang, Xiangnan He
Abstract: Sequential recommendation systems predict the next interaction item based on users' past interactions, aligning recommendations with individual preferences. Leveraging the strengths of Large Language Models (LLMs) in knowledge comprehension and reasoning, recent approaches are eager to apply LLMs to sequential recommendation. A common paradigm is converting user behavior sequences into instruction data, and fine-tuning the LLM with parameter-efficient fine-tuning (PEFT) methods like Low-Rank Adaption (LoRA). However, the uniform application of LoRA across diverse user behaviors is insufficient to capture individual variability, resulting in negative transfer between disparate sequences. To address these challenges, we propose Instance-wise LoRA (iLoRA). We innovatively treat the sequential recommendation task as a form of multi-task learning, integrating LoRA with the Mixture of Experts (MoE) framework. This approach encourages different experts to capture various aspects of user behavior. Additionally, we introduce a sequence representation guided gate function that generates customized expert participation weights for each user sequence, which allows dynamic parameter adjustment for instance-wise recommendations. In sequential recommendation, iLoRA achieves an average relative improvement of 11.4\% over basic LoRA in the hit ratio metric, with less than a 1\% relative increase in trainable parameters. Extensive experiments on three benchmark datasets demonstrate the effectiveness of iLoRA, highlighting its superior performance compared to existing methods in mitigating negative transfer and improving recommendation accuracy. Our data and code are available at https://github.com/AkaliKong/iLoRA.
Authors: Elvis Hsieh, Preston Fu, Jonathan Chen
Abstract: Language models (LMs) trained on web-scale datasets are largely successful due to their ability to memorize large amounts of training data, even if only present in a few examples. These capabilities are often desirable in evaluation on tasks such as question answering but raise questions about whether these models can exhibit genuine reasoning or succeed only at mimicking patterns from the training data. This distinction is particularly salient in forecasting tasks, where the answer is not present in the training data, and the model must reason to make logical deductions. We present Reasoning and Tools for Forecasting (RTF), a framework of reasoning-and-acting (ReAct) agents that can dynamically retrieve updated information and run numerical simulation with equipped tools. We evaluate our model with questions from competitive forecasting platforms and demonstrate that our method is competitive with and can outperform human predictions. This suggests that LMs, with the right tools, can indeed think and adapt like humans, offering valuable insights for real-world decision-making.
Authors: Guanwen Xie, Jingzehua Xu, Yiyuan Yang, Yimian Ding, Shuai Zhang
Abstract: Achieving the effective design and improvement of reward functions in reinforcement learning (RL) tasks with complex custom environments and multiple requirements presents considerable challenges. In this paper, we propose ERFSL, an efficient reward function searcher using LLMs, which enables LLMs to be effective white-box searchers and highlights their advanced semantic understanding capabilities. Specifically, we generate reward components for each numerically explicit user requirement and employ a reward critic to identify the correct code form. Then, LLMs assign weights to the reward components to balance their values and iteratively adjust the weights without ambiguity and redundant adjustments by flexibly adopting directional mutation and crossover strategies, similar to genetic algorithms, based on the context provided by the training log analyzer. We applied the framework to an underwater data collection RL task without direct human feedback or reward examples (zero-shot learning). The reward critic successfully corrects the reward code with only one feedback instance for each requirement, effectively preventing unrectifiable errors. The initialization of weights enables the acquisition of different reward functions within the Pareto solution set without the need for weight search. Even in cases where a weight is 500 times off, on average, only 5.2 iterations are needed to meet user requirements. The ERFSL also works well with most prompts utilizing GPT-4o mini, as we decompose the weight searching process to reduce the requirement for numerical and long-context understanding capabilities
Authors: Faezeh Faez, Raika Karimi, Yingxue Zhang, Xing Li, Lei Chen, Mingxuan Yuan, Mahdi Biparva
Abstract: Electronic Design Automation (EDA) is essential for IC design and has recently benefited from AI-based techniques to improve efficiency. Logic synthesis, a key EDA stage, transforms high-level hardware descriptions into optimized netlists. Recent research has employed machine learning to predict Quality of Results (QoR) for pairs of And-Inverter Graphs (AIGs) and synthesis recipes. However, the severe scarcity of data due to a very limited number of available AIGs results in overfitting, significantly hindering performance. Additionally, the complexity and large number of nodes in AIGs make plain GNNs less effective for learning expressive graph-level representations. To tackle these challenges, we propose MTLSO - a Multi-Task Learning approach for Logic Synthesis Optimization. On one hand, it maximizes the use of limited data by training the model across different tasks. This includes introducing an auxiliary task of binary multi-label graph classification alongside the primary regression task, allowing the model to benefit from diverse supervision sources. On the other hand, we employ a hierarchical graph representation learning strategy to improve the model's capacity for learning expressive graph-level representations of large AIGs, surpassing traditional plain GNNs. Extensive experiments across multiple datasets and against state-of-the-art baselines demonstrate the superiority of our method, achieving an average performance gain of 8.22\% for delay and 5.95\% for area.
Authors: Jingru Jia, Zehua Yuan
Abstract: This study explores the potential of large language models (LLMs) to conduct market experiments, aiming to understand their capability to comprehend competitive market dynamics. We model the behavior of market agents in a controlled experimental setting, assessing their ability to converge toward competitive equilibria. The results reveal the challenges current LLMs face in replicating the dynamic decision-making processes characteristic of human trading behavior. Unlike humans, LLMs lacked the capacity to achieve market equilibrium. The research demonstrates that while LLMs provide a valuable tool for scalable and reproducible market simulations, their current limitations necessitate further advancements to fully capture the complexities of market behavior. Future work that enhances dynamic learning capabilities and incorporates elements of behavioral economics could improve the effectiveness of LLMs in the economic domain, providing new insights into market dynamics and aiding in the refinement of economic policies.
Authors: Arya Grayeli, Atharva Sehgal, Omar Costilla-Reyes, Miles Cranmer, Swarat Chaudhuri
Abstract: We present a novel method for symbolic regression (SR), the task of searching for compact programmatic hypotheses that best explain a dataset. The problem is commonly solved using genetic algorithms; we show that we can enhance such methods by inducing a library of abstract textual concepts. Our algorithm, called LaSR, uses zero-shot queries to a large language model (LLM) to discover and evolve concepts occurring in known high-performing hypotheses. We discover new hypotheses using a mix of standard evolutionary steps and LLM-guided steps (obtained through zero-shot LLM queries) conditioned on discovered concepts. Once discovered, hypotheses are used in a new round of concept abstraction and evolution. We validate LaSR on the Feynman equations, a popular SR benchmark, as well as a set of synthetic tasks. On these benchmarks, LaSR substantially outperforms a variety of state-of-the-art SR approaches based on deep learning and evolutionary algorithms. Moreover, we show that LaSR can be used to discover a novel and powerful scaling law for LLMs.
Authors: Lindsey Linxi Wei, Guorui Xiao, Magdalena Balazinska
Abstract: As an important component of data exploration and integration, Column Type Annotation (CTA) aims to label columns of a table with one or more semantic types. With the recent development of Large Language Models (LLMs), researchers have started to explore the possibility of using LLMs for CTA, leveraging their strong zero-shot capabilities. In this paper, we build on this promising work and improve on LLM-based methods for CTA by showing how to use a Knowledge Graph (KG) to augment the context information provided to the LLM. Our approach, called RACOON, combines both pre-trained parametric and non-parametric knowledge during generation to improve LLMs' performance on CTA. Our experiments show that RACOON achieves up to a 0.21 micro F-1 improvement compared against vanilla LLM inference.
Authors: Jixuan Cui, Jun Li, Zhen Mei, Yiyang Ni, Wen Chen, Zengxiang Li
Abstract: The challenge of data scarcity hinders the application of deep learning in industrial surface defect classification (SDC), as it's difficult to collect and centralize sufficient training data from various entities in Industrial Internet of Things (IIoT) due to privacy concerns. Federated learning (FL) provides a solution by enabling collaborative global model training across clients while maintaining privacy. However, performance may suffer due to data heterogeneity-discrepancies in data distributions among clients. In this paper, we propose a novel personalized FL (PFL) approach, named Adversarial Federated Consensus Learning (AFedCL), for the challenge of data heterogeneity across different clients in SDC. First, we develop a dynamic consensus construction strategy to mitigate the performance degradation caused by data heterogeneity. Through adversarial training, local models from different clients utilize the global model as a bridge to achieve distribution alignment, alleviating the problem of global knowledge forgetting. Complementing this strategy, we propose a consensus-aware aggregation mechanism. It assigns aggregation weights to different clients based on their efficacy in global knowledge learning, thereby enhancing the global model's generalization capabilities. Finally, we design an adaptive feature fusion module to further enhance global knowledge utilization efficiency. Personalized fusion weights are gradually adjusted for each client to optimally balance global and local features. Compared with state-of-the-art FL methods like FedALA, the proposed AFedCL method achieves an accuracy increase of up to 5.67% on three SDC datasets.
Authors: Joshua Ashkinaze, Emily Fry, Narendra Edara, Eric Gilbert, Ceren Budak
Abstract: Recent debates raised concerns that language models may favor certain viewpoints. But what if the solution is not to aim for a 'view from nowhere' but rather to leverage different viewpoints? We introduce Plurals, a system and Python library for pluralistic AI deliberation. Plurals consists of Agents (LLMs, optionally with personas) which deliberate within customizable Structures, with Moderators overseeing deliberation. Plurals is a generator of simulated social ensembles. Plurals integrates with government datasets to create nationally representative personas, includes deliberation templates inspired by democratic deliberation theory, and allows users to customize both information-sharing structures and deliberation behavior within Structures. Six case studies demonstrate fidelity to theoretical constructs and efficacy. Three randomized experiments show simulated focus groups produced output resonant with an online sample of the relevant audiences (chosen over zero-shot generation in 75% of trials). Plurals is both a paradigm and a concrete system for pluralistic AI. The Plurals library is available at https://github.com/josh-ashkinaze/plurals and will be continually updated.
Authors: Xun Zhu, Ying Hu, Fanbin Mo, Miao Li, Ji Wu
Abstract: Multi-modal large language models (MLLMs) have shown impressive capabilities as a general-purpose interface for various visual and linguistic tasks. However, building a unified MLLM for multi-task learning in the medical field remains a thorny challenge. To mitigate the tug-of-war problem of multi-modal multi-task optimization in MLLMs, recent advances primarily focus on improving the LLM components, while neglecting the connector that bridges the gap between modalities. In this paper, we introduce Uni-Med, a novel medical generalist foundation model which consists of a universal visual feature extraction module, a connector mixture-of-experts (CMoE) module, and an LLM. Benefiting from the proposed CMoE that leverages a well-designed router with a mixture of projection experts at the connector, Uni-Med achieves efficient solution to the tug-of-war problem and can perform six different medical tasks including question answering, visual question answering, report generation, referring expression comprehension, referring expression generation and image classification. To the best of our knowledge, Uni-Med is the first effort to tackle multi-task interference at the connector in MLLMs. Extensive ablation experiments validate the effectiveness of introducing CMoE under any configuration, with up to an average 8% performance gains. We further provide interpretation analysis of the tug-of-war problem from the perspective of gradient optimization and parameter statistics. Compared to previous state-of-the-art medical MLLMs, Uni-Med achieves competitive or superior evaluation metrics on diverse tasks. Code and resources are available at https://github.com/tsinghua-msiip/Uni-Med.
Authors: Haoyu Zhang, Wenbin Wang, Tianshu Yu
Abstract: The field of Multimodal Sentiment Analysis (MSA) has recently witnessed an emerging direction seeking to tackle the issue of data incompleteness. Recognizing that the language modality typically contains dense sentiment information, we consider it as the dominant modality and present an innovative Language-dominated Noise-resistant Learning Network (LNLN) to achieve robust MSA. The proposed LNLN features a dominant modality correction (DMC) module and dominant modality based multimodal learning (DMML) module, which enhances the model's robustness across various noise scenarios by ensuring the quality of dominant modality representations. Aside from the methodical design, we perform comprehensive experiments under random data missing scenarios, utilizing diverse and meaningful settings on several popular datasets (\textit{e.g.,} MOSI, MOSEI, and SIMS), providing additional uniformity, transparency, and fairness compared to existing evaluations in the literature. Empirically, LNLN consistently outperforms existing baselines, demonstrating superior performance across these challenging and extensive evaluation metrics.
Authors: Ying-Chun Lin, Jennifer Neville
Abstract: Graphs model complex relationships between entities, with nodes and edges capturing intricate connections. Node representation learning involves transforming nodes into low-dimensional embeddings. These embeddings are typically used as features for downstream tasks. Therefore, their quality has a significant impact on task performance. Existing approaches for node representation learning span (semi-)supervised, unsupervised, and self-supervised paradigms. In graph domains, (semi-)supervised learning often only optimizes models based on class labels, neglecting other abundant graph signals, which limits generalization. While self-supervised or unsupervised learning produces representations that better capture underlying graph signals, the usefulness of these captured signals for downstream target tasks can vary. To bridge this gap, we introduce Target-Aware Contrastive Learning (Target-aware CL) which aims to enhance target task performance by maximizing the mutual information between the target task and node representations with a self-supervised learning process. This is achieved through a sampling function, XGBoost Sampler (XGSampler), to sample proper positive examples for the proposed Target-Aware Contrastive Loss (XTCL). By minimizing XTCL, Target-aware CL increases the mutual information between the target task and node representations, such that model generalization is improved. Additionally, XGSampler enhances the interpretability of each signal by showing the weights for sampling the proper positive examples. We show experimentally that XTCL significantly improves the performance on two target tasks: node classification and link prediction tasks, compared to state-of-the-art models.
Authors: Vy Nguyen, Chau Pham
Abstract: The increasing frequency of suicidal thoughts highlights the importance of early detection and intervention. Social media platforms, where users often share personal experiences and seek help, could be utilized to identify individuals at risk. However, the large volume of daily posts makes manual review impractical. This paper explores the use of Large Language Models (LLMs) to automatically detect suicidal content in text-based social media posts. We propose a novel method for generating pseudo-labels for unlabeled data by prompting LLMs, along with traditional classification fine-tuning techniques to enhance label accuracy. To create a strong suicide detection model, we develop an ensemble approach involving prompting with Qwen2-72B-Instruct, and using fine-tuned models such as Llama3-8B, Llama3.1-8B, and Gemma2-9B. We evaluate our approach on the dataset of the Suicide Ideation Detection on Social Media Challenge, a track of the IEEE Big Data 2024 Big Data Cup. Additionally, we conduct a comprehensive analysis to assess the impact of different models and fine-tuning strategies on detection performance. Experimental results show that the ensemble model significantly improves the detection accuracy, by 5% points compared with the individual models. It achieves a weight F1 score of 0.770 on the public test set, and 0.731 on the private test set, providing a promising solution for identifying suicidal content in social media. Our analysis shows that the choice of LLMs affects the prompting performance, with larger models providing better accuracy. Our code and checkpoints are publicly available at https://github.com/khanhvynguyen/Suicide_Detection_LLMs.
URLs: https://github.com/khanhvynguyen/Suicide_Detection_LLMs.
Authors: Wen Yang, Junhong Wu, Chen Wang, Chengqing Zong, Jiajun Zhang
Abstract: Large Language Models (LLMs) have achieved state-of-the-art performance across numerous tasks. However, these advancements have predominantly benefited "first-class" languages such as English and Chinese, leaving many other languages underrepresented. This imbalance, while limiting broader applications, generates a natural preference ranking between languages, offering an opportunity to bootstrap the multilingual capabilities of LLM in a self-improving manner. Thus, we propose $\textit{Language Imbalance Driven Rewarding}$, where the inherent imbalance between dominant and non-dominant languages within LLMs is leveraged as a reward signal. Iterative DPO training demonstrates that this approach not only enhances LLM performance in non-dominant languages but also improves the dominant language's capacity, thereby yielding an iterative reward signal. Fine-tuning Meta-Llama-3-8B-Instruct over two iterations of this approach results in continuous improvements in multilingual performance across instruction-following and arithmetic reasoning tasks, evidenced by an average improvement of 7.46% win rate on the X-AlpacaEval leaderboard and 13.9% accuracy on the MGSM benchmark. This work serves as an initial exploration, paving the way for multilingual self-improvement of LLMs.
Authors: Zihan Chen, Bike Xie, Jundong Li, Cong Shen
Abstract: Large Language Models (LLMs) have demonstrated remarkable success across a wide range of language tasks, but their deployment on edge devices remains challenging due to the substantial memory requirements imposed by their large parameter sizes. Weight-only quantization presents a promising solution to reduce the memory footprint of LLMs. However, existing approaches primarily focus on integer-bit quantization, limiting their adaptability to fractional-bit quantization tasks and preventing the full utilization of available storage space on devices. In this paper, we introduce Channel-Wise Mixed-Precision Quantization (CMPQ), a novel mixed-precision quantization method that allocates quantization precision in a channel-wise pattern based on activation distributions. By assigning different precision levels to different weight channels, CMPQ can adapt to any bit-width constraint. CMPQ employs a non-uniform quantization strategy and incorporates two outlier extraction techniques that collaboratively preserve the critical information, thereby minimizing the quantization loss. Experiments on different sizes of LLMs demonstrate that CMPQ not only enhances performance in integer-bit quantization tasks but also achieves significant performance gains with a modest increase in memory usage. CMPQ thus represents an adaptive and effective approach to LLM quantization, offering substantial benefits across diverse device capabilities.
Authors: Kuldeep Singh, Simerjot Kaur, Charese Smiley
Abstract: Financial decision-making hinges on the analysis of relevant information embedded in the enormous volume of documents in the financial domain. To address this challenge, we developed FinQAPT, an end-to-end pipeline that streamlines the identification of relevant financial reports based on a query, extracts pertinent context, and leverages Large Language Models (LLMs) to perform downstream tasks. To evaluate the pipeline, we experimented with various techniques to optimize the performance of each module using the FinQA dataset. We introduced a novel clustering-based negative sampling technique to enhance context extraction and a novel prompting method called Dynamic N-shot Prompting to boost the numerical question-answering capabilities of LLMs. At the module level, we achieved state-of-the-art accuracy on FinQA, attaining an accuracy of 80.6%. However, at the pipeline level, we observed decreased performance due to challenges in extracting relevant context from financial reports. We conducted a detailed error analysis of each module and the end-to-end pipeline, pinpointing specific challenges that must be addressed to develop a robust solution for handling complex financial tasks.
Authors: Fei Liu, Yiming Yao, Ping Guo, Zhiyuan Yang, Zhe Zhao, Xi Lin, Xialiang Tong, Mingxuan Yuan, Zhichao Lu, Zhenkun Wang, Qingfu Zhang
Abstract: Algorithm Design (AD) is crucial for effective problem-solving across various domains. The advent of Large Language Models (LLMs) has notably enhanced the automation and innovation within this field, offering new perspectives and promising solutions. Over the past three years, the integration of LLMs into AD (LLM4AD) has seen substantial progress, with applications spanning optimization, machine learning, mathematical reasoning, and scientific discovery. Given the rapid advancements and expanding scope of this field, a systematic review is both timely and necessary. This paper provides a systematic review of LLM4AD. First, we offer an overview and summary of existing studies. Then, we introduce a taxonomy and review the literature across four dimensions: the roles of LLMs, search methods, prompt methods, and application domains with a discussion of potential and achievements of LLMs in AD. Finally, we identify current challenges and highlight several promising directions for future research.
Authors: XuDong Wang, Shaolun Zhang, Shufan Li, Konstantinos Kallidromitis, Kehan Li, Yusuke Kato, Kazuki Kozuka, Trevor Darrell
Abstract: We present SegLLM, a novel multi-round interactive reasoning segmentation model that enhances LLM-based segmentation by exploiting conversational memory of both visual and textual outputs. By leveraging a mask-aware multimodal LLM, SegLLM re-integrates previous segmentation results into its input stream, enabling it to reason about complex user intentions and segment objects in relation to previously identified entities, including positional, interactional, and hierarchical relationships, across multiple interactions. This capability allows SegLLM to respond to visual and text queries in a chat-like manner. Evaluated on the newly curated MRSeg benchmark, SegLLM outperforms existing methods in multi-round interactive reasoning segmentation by over 20%. Additionally, we observed that training on multi-round reasoning segmentation data enhances performance on standard single-round referring segmentation and localization tasks, resulting in a 5.5% increase in cIoU for referring expression segmentation and a 4.5% improvement in Acc@0.5 for referring expression localization.
Authors: Antesh Upadhyay, Abolfazl Hashemi
Abstract: Federated learning is a prominent distributed learning paradigm that incorporates collaboration among diverse clients, promotes data locality, and thus ensures privacy. These clients have their own technological, cultural, and other biases in the process of data generation. However, the present standard often ignores this bias/heterogeneity, perpetuating bias against certain groups rather than mitigating it. In response to this concern, we propose an equitable clustering-based framework where the clients are categorized/clustered based on how similar they are to each other. We propose a unique way to construct the similarity matrix that uses activation vectors. Furthermore, we propose a client weighing mechanism to ensure that each cluster receives equal importance and establish $O(1/\sqrt{K})$ rate of convergence to reach an $\epsilon-$stationary solution. We assess the effectiveness of our proposed strategy against common baselines, demonstrating its efficacy in terms of reducing the bias existing amongst various client clusters and consequently ameliorating algorithmic bias against specific groups.
Authors: Yixiu Mao, Qi Wang, Chen Chen, Yun Qu, Xiangyang Ji
Abstract: In offline reinforcement learning (RL), addressing the out-of-distribution (OOD) action issue has been a focus, but we argue that there exists an OOD state issue that also impairs performance yet has been underexplored. Such an issue describes the scenario when the agent encounters states out of the offline dataset during the test phase, leading to uncontrolled behavior and performance degradation. To this end, we propose SCAS, a simple yet effective approach that unifies OOD state correction and OOD action suppression in offline RL. Technically, SCAS achieves value-aware OOD state correction, capable of correcting the agent from OOD states to high-value in-distribution states. Theoretical and empirical results show that SCAS also exhibits the effect of suppressing OOD actions. On standard offline RL benchmarks, SCAS achieves excellent performance without additional hyperparameter tuning. Moreover, benefiting from its OOD state correction feature, SCAS demonstrates enhanced robustness against environmental perturbations.
Authors: Luping Wang, Sheng Chen, Linnan Jiang, Shu Pan, Runze Cai, Sen Yang, Fei Yang
Abstract: The large models, as predicted by scaling raw forecasts, have made groundbreaking progress in many fields, particularly in natural language generation tasks, where they have approached or even surpassed human levels. However, the unprecedented scale of their parameters brings significant computational and storage costs. These large models require substantial computational resources and GPU memory to operate. When adapting large models to specific downstream tasks, their massive parameter scale poses a significant challenge in fine-tuning on hardware platforms with limited computational power and GPU memory. To address this issue, Parameter-Efficient Fine-Tuning (PEFT) offers a practical solution by efficiently adjusting the parameters of large pre-trained models to suit various downstream tasks. Specifically, PEFT adjusts the parameters of pre-trained large models to adapt to specific tasks or domains, minimizing the introduction of additional parameters and the computational resources required. This review mainly introduces the preliminary knowledge of PEFT, the core ideas and principles of various PEFT algorithms, the applications of PEFT, and potential future research directions. By reading this review, we believe that interested parties can quickly grasp the PEFT methodology, thereby accelerating its development and innovation.
Authors: Mingjiang Liang, Yongkang Cheng, Hualin Liang, Shaoli Huang, Wei Liu
Abstract: We present RopeTP, a novel framework that combines Robust pose estimation with a diffusion Trajectory Prior to reconstruct global human motion from videos. At the heart of RopeTP is a hierarchical attention mechanism that significantly improves context awareness, which is essential for accurately inferring the posture of occluded body parts. This is achieved by exploiting the relationships with visible anatomical structures, enhancing the accuracy of local pose estimations. The improved robustness of these local estimations allows for the reconstruction of precise and stable global trajectories. Additionally, RopeTP incorporates a diffusion trajectory model that predicts realistic human motion from local pose sequences. This model ensures that the generated trajectories are not only consistent with observed local actions but also unfold naturally over time, thereby improving the realism and stability of 3D human motion reconstruction. Extensive experimental validation shows that RopeTP surpasses current methods on two benchmark datasets, particularly excelling in scenarios with occlusions. It also outperforms methods that rely on SLAM for initial camera estimates and extensive optimization, delivering more accurate and realistic trajectories.
Authors: Yongkang Cheng, Mingjiang Liang, Shaoli Huang, Gaoge Han, Jifeng Ning, Wei Liu
Abstract: Audio-driven simultaneous gesture generation is vital for human-computer communication, AI games, and film production. While previous research has shown promise, there are still limitations. Methods based on VAEs are accompanied by issues of local jitter and global instability, whereas methods based on diffusion models are hampered by low generation efficiency. This is because the denoising process of DDPM in the latter relies on the assumption that the noise added at each step is sampled from a unimodal distribution, and the noise values are small. DDIM borrows the idea from the Euler method for solving differential equations, disrupts the Markov chain process, and increases the noise step size to reduce the number of denoising steps, thereby accelerating generation. However, simply increasing the step size during the step-by-step denoising process causes the results to gradually deviate from the original data distribution, leading to a significant drop in the quality of the generated actions and the emergence of unnatural artifacts. In this paper, we break the assumptions of DDPM and achieves breakthrough progress in denoising speed and fidelity. Specifically, we introduce a conditional GAN to capture audio control signals and implicitly match the multimodal denoising distribution between the diffusion and denoising steps within the same sampling step, aiming to sample larger noise values and apply fewer denoising steps for high-speed generation.
Authors: Po-Yu Liang, Jun Bai
Abstract: Biologists frequently desire protein inhibitors for a variety of reasons, including use as research tools for understanding biological processes and application to societal problems in agriculture, healthcare, etc. Immunotherapy, for instance, relies on immune checkpoint inhibitors to block checkpoint proteins, preventing their binding with partner proteins and boosting immune cell function against abnormal cells. Inhibitor discovery has long been a tedious process, which in recent years has been accelerated by computational approaches. Advances in artificial intelligence now provide an opportunity to make inhibitor discovery smarter than ever before. While extensive research has been conducted on computer-aided inhibitor discovery, it has mainly focused on either sequence-to-structure mapping, reverse mapping, or bio-activity prediction, making it unrealistic for biologists to utilize such tools. Instead, our work proposes a new method of computer-assisted inhibitor discovery: de novo pocket-aware peptide structure and sequence generation network. Our approach consists of two sequential diffusion models for end-to-end structure generation and sequence prediction. By leveraging angle and dihedral relationships between backbone atoms, we ensure an E(3)-invariant representation of peptide structures. Our results demonstrate that our method achieves comparable performance to state-of-the-art models, highlighting its potential in pocket-aware peptide design. This work offers a new approach for precise drug discovery using receptor-specific peptide generation.
Authors: Yoel Shoshan, Moshiko Raboh, Michal Ozery-Flato, Vadim Ratner, Alex Golts, Jeffrey K. Weber, Ella Barkan, Simona Rabinovici-Cohen, Sagi Polaczek, Ido Amos, Ben Shapira, Liam Hazan, Matan Ninio, Sivan Ravid, Michael M. Danziger, Joseph A. Morrone, Parthasarathy Suryanarayanan, Michal Rosen-Zvi, Efrat Hexter
Abstract: Drug discovery typically consists of multiple steps, including identifying a target protein key to a disease's etiology, validating that interacting with this target could prevent symptoms or cure the disease, discovering a small molecule or biologic therapeutic to interact with it, and optimizing the candidate molecule through a complex landscape of required properties. Drug discovery related tasks often involve prediction and generation while considering multiple entities that potentially interact, which poses a challenge for typical AI models. For this purpose we present MAMMAL - Molecular Aligned Multi-Modal Architecture and Language - a method that we applied to create a versatile multi-task multi-align foundation model that learns from large-scale biological datasets (2 billion samples) across diverse modalities, including proteins, small molecules, and genes. We introduce a prompt syntax that supports a wide range of classification, regression, and generation tasks. It allows combining different modalities and entity types as inputs and/or outputs. Our model handles combinations of tokens and scalars and enables the generation of small molecules and proteins, property prediction, and transcriptomic lab test predictions. We evaluated the model on 11 diverse downstream tasks spanning different steps within a typical drug discovery pipeline, where it reaches new SOTA in 9 tasks and is comparable to SOTA in 2 tasks. This performance is achieved while using a unified architecture serving all tasks, in contrast to the original SOTA performance achieved using tailored architectures. The model code and pretrained weights are publicly available at https://github.com/BiomedSciAI/biomed-multi-alignment and https://huggingface.co/ibm/biomed.omics.bl.sm.ma-ted-458m.
URLs: https://github.com/BiomedSciAI/biomed-multi-alignment, https://huggingface.co/ibm/biomed.omics.bl.sm.ma-ted-458m.
Authors: Jerry Yao-Chieh Hu, Dennis Wu, Han Liu
Abstract: We study the optimal memorization capacity of modern Hopfield models and Kernelized Hopfield Models (KHMs), a transformer-compatible class of Dense Associative Memories. We present a tight analysis by establishing a connection between the memory configuration of KHMs and spherical codes from information theory. Specifically, we treat the stored memory set as a specialized spherical code. This enables us to cast the memorization problem in KHMs into a point arrangement problem on a hypersphere. We show that the optimal capacity of KHMs occurs when the feature space allows memories to form an optimal spherical code. This unique perspective leads to: (i) An analysis of how KHMs achieve optimal memory capacity, and identify corresponding necessary conditions. Importantly, we establish an upper capacity bound that matches the well-known exponential lower bound in the literature. This provides the first tight and optimal asymptotic memory capacity for modern Hopfield models. (ii) A sub-linear time algorithm $\mathtt{U}\text{-}\mathtt{Hop}$+ to reach KHMs' optimal capacity. (iii) An analysis of the scaling behavior of the required feature dimension relative to the number of stored memories. These efforts improve both the retrieval capability of KHMs and the representation learning of corresponding transformers. Experimentally, we provide thorough numerical results to back up theoretical findings.
Authors: Yining Hong, Beide Liu, Maxine Wu, Yuanhao Zhai, Kai-Wei Chang, Linjie Li, Kevin Lin, Chung-Ching Lin, Jianfeng Wang, Zhengyuan Yang, Yingnian Wu, Lijuan Wang
Abstract: Human beings are endowed with a complementary learning system, which bridges the slow learning of general world dynamics with fast storage of episodic memory from a new experience. Previous video generation models, however, primarily focus on slow learning by pre-training on vast amounts of data, overlooking the fast learning phase crucial for episodic memory storage. This oversight leads to inconsistencies across temporally distant frames when generating longer videos, as these frames fall beyond the model's context window. To this end, we introduce SlowFast-VGen, a novel dual-speed learning system for action-driven long video generation. Our approach incorporates a masked conditional video diffusion model for the slow learning of world dynamics, alongside an inference-time fast learning strategy based on a temporal LoRA module. Specifically, the fast learning process updates its temporal LoRA parameters based on local inputs and outputs, thereby efficiently storing episodic memory in its parameters. We further propose a slow-fast learning loop algorithm that seamlessly integrates the inner fast learning loop into the outer slow learning loop, enabling the recall of prior multi-episode experiences for context-aware skill learning. To facilitate the slow learning of an approximate world model, we collect a large-scale dataset of 200k videos with language action annotations, covering a wide range of scenarios. Extensive experiments show that SlowFast-VGen outperforms baselines across various metrics for action-driven video generation, achieving an FVD score of 514 compared to 782, and maintaining consistency in longer videos, with an average of 0.37 scene cuts versus 0.89. The slow-fast learning loop algorithm significantly enhances performances on long-horizon planning tasks as well. Project Website: https://slowfast-vgen.github.io
Authors: Baekrok Shin, Junsoo Oh, Hanseul Cho, Chulhee Yun
Abstract: Warm-starting neural network training by initializing networks with previously learned weights is appealing, as practical neural networks are often deployed under a continuous influx of new data. However, it often leads to loss of plasticity, where the network loses its ability to learn new information, resulting in worse generalization than training from scratch. This occurs even under stationary data distributions, and its underlying mechanism is poorly understood. We develop a framework emulating real-world neural network training and identify noise memorization as the primary cause of plasticity loss when warm-starting on stationary data. Motivated by this, we propose Direction-Aware SHrinking (DASH), a method aiming to mitigate plasticity loss by selectively forgetting memorized noise while preserving learned features. We validate our approach on vision tasks, demonstrating improvements in test accuracy and training efficiency.
Authors: Kyungjin Seo, Junghoon Seo, Hanseok Jeong, Sangpil Kim, Sang Ho Yoon
Abstract: We present PiMForce, a novel framework that enhances hand pressure estimation by leveraging 3D hand posture information to augment forearm surface electromyography (sEMG) signals. Our approach utilizes detailed spatial information from 3D hand poses in conjunction with dynamic muscle activity from sEMG to enable accurate and robust whole-hand pressure measurements under diverse hand-object interactions. We also developed a multimodal data collection system that combines a pressure glove, an sEMG armband, and a markerless finger-tracking module. We created a comprehensive dataset from 21 participants, capturing synchronized data of hand posture, sEMG signals, and exerted hand pressure across various hand postures and hand-object interaction scenarios using our collection system. Our framework enables precise hand pressure estimation in complex and natural interaction scenarios. Our approach substantially mitigates the limitations of traditional sEMG-based or vision-based methods by integrating 3D hand posture information with sEMG signals. Video demos, data, and code are available online.
Authors: Dizhen Liang
Abstract: Transformer-based architectures have achieved remarkable success in natural language processing and computer vision. However, their performance in multivariate long-term forecasting often lags behind simpler linear baselines. Previous studies have identified the traditional attention mechanism as a significant factor contributing to this limitation. To unlock the full potential of transformers for multivariate time series forecasting, I introduce \textbf{LSEAttention}, an approach designed to address entropy collapse and training instability commonly observed in transformer models. I validate the effectiveness of LSEAttention across various real-world multivariate time series datasets, demonstrating that it not only outperforms existing time series transformer models but also exceeds the performance of some state-of-the-art models on specific datasets.
Authors: Quentin Guilhot, Micha{\l} W\'ojcik, Jascha Achterberg, Rui Ponte Costa
Abstract: Methods for analyzing representations in neural systems are increasingly popular tools in neuroscience and mechanistic interpretability. Measures comparing neural activations across conditions, architectures, and species give scalable ways to understand information transformation within different neural networks. However, recent findings show that some metrics respond to spurious signals, leading to misleading results. Establishing benchmark test cases is thus essential for identifying the most reliable metric and potential improvements. We propose that compositional learning in recurrent neural networks (RNNs) can provide a test case for dynamical representation alignment metrics. Implementing this case allows us to evaluate if metrics can identify representations that develop throughout learning and determine if representations identified by metrics reflect the network's actual computations. Building both attractor and RNN based test cases, we show that the recently proposed Dynamical Similarity Analysis (DSA) is more noise robust and reliably identifies behaviorally relevant representations compared to prior metrics (Procrustes, CKA). We also demonstrate how such test cases can extend beyond metric evaluation to study new architectures. Specifically, testing DSA in modern (Mamba) state space models suggests that these models, unlike RNNs, may not require changes in recurrent dynamics due to their expressive hidden states. Overall, we develop test cases that showcase how DSA's enhanced ability to detect dynamical motifs makes it highly effective for identifying ongoing computations in RNNs and revealing how networks learn tasks.