Authors: Tianqi Shang, Shu Yang, Weiqing He, Tianhua Zhai, Dawei Li, Bojian Hou, Tianlong Chen, Jason H. Moore, Marylyn D. Ritchie, Li Shen
Abstract: Growing evidence suggests that social determinants of health (SDoH), a set of nonmedical factors, affect individuals' risks of developing Alzheimer's disease (AD) and related dementias. Nevertheless, the etiological mechanisms underlying such relationships remain largely unclear, mainly due to difficulties in collecting relevant information. This study presents a novel, automated framework that leverages recent advancements of large language model (LLM) and natural language processing techniques to mine SDoH knowledge from extensive literature and integrate it with AD-related biological entities extracted from the general-purpose knowledge graph PrimeKG. Utilizing graph neural networks, we performed link prediction tasks to evaluate the resultant SDoH-augmented knowledge graph. Our framework shows promise for enhancing knowledge discovery in AD and can be generalized to other SDoH-related research areas, offering a new tool for exploring the impact of social determinants on health outcomes. Our code is available at: https://github.com/hwq0726/SDoHenPKG
Authors: Nuri Kim, Jeongho Park, Mineui Hong, Songhwai Oh
Abstract: In this paper, we introduce the Semantic Environment Atlas (SEA), a novel mapping approach designed to enhance visual navigation capabilities of embodied agents. The SEA utilizes semantic graph maps that intricately delineate the relationships between places and objects, thereby enriching the navigational context. These maps are constructed from image observations and capture visual landmarks as sparsely encoded nodes within the environment. The SEA integrates multiple semantic maps from various environments, retaining a memory of place-object relationships, which proves invaluable for tasks such as visual localization and navigation. We developed navigation frameworks that effectively leverage the SEA, and we evaluated these frameworks through visual localization and object-goal navigation tasks. Our SEA-based localization framework significantly outperforms existing methods, accurately identifying locations from single query images. Experimental results in Habitat scenarios show that our method not only achieves a success rate of 39.0%, an improvement of 12.4% over the current state-of-the-art, but also maintains robustness under noisy odometry and actuation conditions, all while keeping computational costs low.
Authors: Lu Chen, Yuxuan Huang, Yixing Li, Yaohui Jin, Shuai Zhao, Zilong Zheng, Quanshi Zhang
Abstract: This paper presents a method to evaluate the alignment between the decision-making logic of Large Language Models (LLMs) and human cognition in a case study on legal LLMs. Unlike traditional evaluations on language generation results, we propose to evaluate the correctness of the detailed decision-making logic of an LLM behind its seemingly correct outputs, which represents the core challenge for an LLM to earn human trust. To this end, we quantify the interactions encoded by the LLM as primitive decision-making logic, because recent theoretical achievements have proven several mathematical guarantees of the faithfulness of the interaction-based explanation. We design a set of metrics to evaluate the detailed decision-making logic of LLMs. Experiments show that even when the language generation results appear correct, a significant portion of the internal inference logic contains notable issues.
Authors: Naomi Saphra, Sarah Wiegreffe
Abstract: The rise of the term "mechanistic interpretability" has accompanied increasing interest in understanding neural models -- particularly language models. However, this jargon has also led to a fair amount of confusion. So, what does it mean to be "mechanistic"? We describe four uses of the term in interpretability research. The most narrow technical definition requires a claim of causality, while a broader technical definition allows for any exploration of a model's internals. However, the term also has a narrow cultural definition describing a cultural movement. To understand this semantic drift, we present a history of the NLP interpretability community and the formation of the separate, parallel "mechanistic" interpretability community. Finally, we discuss the broad cultural definition -- encompassing the entire field of interpretability -- and why the traditional NLP interpretability community has come to embrace it. We argue that the polysemy of "mechanistic" is the product of a critical divide within the interpretability community.
Authors: Brian Matejek, Daniel Elenius, Cale Gentry, David Stoker, Adam Cobb
Abstract: We propose a resource-constrained heuristic for instances of Max-SAT that iteratively decomposes a larger problem into smaller subcomponents that can be solved by optimized solvers and hardware. The unconstrained outer loop maintains the state space of a given problem and selects a subset of the SAT variables for optimization independent of previous calls. The resource-constrained inner loop maximizes the number of satisfiable clauses in the "sub-SAT" problem. Our outer loop is agnostic to the mechanisms of the inner loop, allowing for the use of traditional solvers for the optimization step. However, we can also transform the selected "sub-SAT" problem into a quadratic unconstrained binary optimization (QUBO) one and use specialized hardware for optimization. In contrast to existing solutions that convert a SAT instance into a QUBO one before decomposition, we choose a subset of the SAT variables before QUBO optimization. We analyze a set of variable selection methods, including a novel graph-based method that exploits the structure of a given SAT instance. The number of QUBO variables needed to encode a (sub-)SAT problem varies, so we additionally learn a model that predicts the size of sub-SAT problems that will fit a fixed-size QUBO solver. We empirically demonstrate our results on a set of randomly generated Max-SAT instances as well as real world examples from the Max-SAT evaluation benchmarks and outperform existing QUBO decomposer solutions.
Authors: Simeng Han, Aaron Yu, Rui Shen, Zhenting Qi, Martin Riddell, Wenfei Zhou, Yujie Qiao, Yilun Zhao, Semih Yavuz, Ye Liu, Shafiq Joty, Yingbo Zhou, Caiming Xiong, Dragomir Radev, Rex Ying, Arman Cohan
Abstract: Existing methods on understanding the capabilities of LLMs in logical reasoning rely on binary entailment classification or synthetically derived rationales, which are not sufficient for proper investigation of model's capabilities. We present P-FOLIO, a human-annotated dataset consisting of diverse and complex reasoning chains for a set of realistic logical reasoning stories also written by humans. P-FOLIO is collected with an annotation protocol that facilitates humans to annotate well-structured natural language proofs for first-order logic reasoning problems in a step-by-step manner. The number of reasoning steps in P-FOLIO span from 0 to 20. We further use P-FOLIO to evaluate and improve large-language-model (LLM) reasoning capabilities. We evaluate LLM reasoning capabilities at a fine granularity via single-step inference rule classification, with more diverse inference rules of more diverse and higher levels of complexities than previous works. Given that a single model-generated reasoning chain could take a completely different path than the human-annotated one, we sample multiple reasoning chains from a model and use pass@k metrics for evaluating the quality of model-generated reasoning chains. We show that human-written reasoning chains significantly boost the logical reasoning capabilities of LLMs via many-shot prompting and fine-tuning. Furthermore, fine-tuning Llama3-7B on P-FOLIO improves the model performance by 10% or more on three other out-of-domain logical reasoning datasets. We also conduct detailed analysis to show where most powerful LLMs fall short in reasoning. We will release the dataset and code publicly.
Authors: Flavio Giorgi, Cesare Campagnano, Fabrizio Silvestri, Gabriele Tolomei
Abstract: Explainable Artificial Intelligence (XAI) has emerged as a critical area of research to unravel the opaque inner logic of (deep) machine learning models. Among the various XAI techniques proposed in the literature, counterfactual explanations stand out as one of the most promising approaches. However, these ``what-if'' explanations are frequently complex and technical, making them difficult for non-experts to understand and, more broadly, challenging for humans to interpret. To bridge this gap, in this work, we exploit the power of open-source Large Language Models to generate natural language explanations when prompted with valid counterfactual instances produced by state-of-the-art explainers for graph-based models. Experiments across several graph datasets and counterfactual explainers show that our approach effectively produces accurate natural language representations of counterfactual instances, as demonstrated by key performance metrics.
Authors: Yufeng Zou
Abstract: Pattern database (PDB) is one of the most popular automated heuristic generation techniques. A PDB maps states in a planning task to abstract states by considering a subset of variables and stores their optimal costs to the abstract goal in a look up table. As the result of the progress made on symbolic search over recent years, symbolic-PDB-based planners achieved impressive results in the International Planning Competition (IPC) 2018. Among them, Complementary 1 (CPC1) tied as the second best planners and the best non-portfolio planners in the cost optimal track, only 2 tasks behind the winner. It uses a combination of different pattern generation algorithms to construct PDBs that are complementary to existing ones. As shown in the post contest experiments, there is room for improvement. In this paper, we would like to present our work on refining the PDB construction mechanism of CPC1. By testing on IPC 2018 benchmarks, the results show that a significant improvement is made on our modified planner over the original version.
Authors: Hyuntae Park (Korea University), Yeachan Kim (Korea University), Jun-Hyung Park (Korea University), SangKeun Lee (Korea University)
Abstract: Recent approaches to zero-shot commonsense reasoning have enabled Pre-trained Language Models (PLMs) to learn a broad range of commonsense knowledge without being tailored to specific situations. However, they often suffer from human reporting bias inherent in textual commonsense knowledge, leading to discrepancies in understanding between PLMs and humans. In this work, we aim to bridge this gap by introducing an additional information channel to PLMs. We propose Imagine (Machine Imagination-based Reasoning), a novel zero-shot commonsense reasoning framework designed to complement textual inputs with visual signals derived from machine-generated images. To achieve this, we enhance PLMs with imagination capabilities by incorporating an image generator into the reasoning process. To guide PLMs in effectively leveraging machine imagination, we create a synthetic pre-training dataset that simulates visual question-answering. Our extensive experiments on diverse reasoning benchmarks and analysis show that Imagine outperforms existing methods by a large margin, highlighting the strength of machine imagination in mitigating reporting bias and enhancing generalization capabilities.
Authors: Haoyang Su, Renqi Chen, Shixiang Tang, Xinzhe Zheng, Jingzhe Li, Zhenfei Yin, Wanli Ouyang, Nanqing Dong
Abstract: The rapid advancement of scientific progress requires innovative tools that can accelerate discovery. While recent AI methods, particularly large language models (LLMs), have shown promise in tasks such as hypothesis generation and experimental design, they fall short in replicating the collaborative nature of real-world scientific practices, where diverse teams of experts work together to tackle complex problems. To address the limitation, we propose an LLM-based multi-agent system, i.e., Virtual Scientists (VirSci), designed to mimic the teamwork inherent in scientific research. VirSci organizes a team of agents to collaboratively generate, evaluate, and refine research ideas. Through comprehensive experiments, we demonstrate that this multi-agent approach outperforms the state-of-the-art method in producing novel and impactful scientific ideas, showing potential in aligning with key insights in the Science of Science field. Our findings suggest that integrating collaborative agents can lead to more innovative scientific outputs, offering a robust system for autonomous scientific discovery.
Authors: Thomas Eiter, Jan Hadl, Nelson Higuera, Johannes Oetsch
Abstract: Visual Question Answering (VQA) is the task of answering a question about an image and requires processing multimodal input and reasoning to obtain the answer. Modular solutions that use declarative representations within the reasoning component have a clear advantage over end-to-end trained systems regarding interpretability. The downside is that crafting the rules for such a component can be an additional burden on the developer. We address this challenge by presenting an approach for declarative knowledge distillation from Large Language Models (LLMs). Our method is to prompt an LLM to extend an initial theory on VQA reasoning, given as an answer-set program, to meet the requirements of the VQA task. Examples from the VQA dataset are used to guide the LLM, validate the results, and mend rules if they are not correct by using feedback from the ASP solver. We demonstrate that our approach works on the prominent CLEVR and GQA datasets. Our results confirm that distilling knowledge from LLMs is in fact a promising direction besides data-driven rule learning approaches.
Authors: Xi Jiang, Jian Li, Hanqiu Deng, Yong Liu, Bin-Bin Gao, Yifeng Zhou, Jialin Li, Chengjie Wang, Feng Zheng
Abstract: In the field of industrial inspection, Multimodal Large Language Models (MLLMs) have a high potential to renew the paradigms in practical applications due to their robust language capabilities and generalization abilities. However, despite their impressive problem-solving skills in many domains, MLLMs' ability in industrial anomaly detection has not been systematically studied. To bridge this gap, we present MMAD, the first-ever full-spectrum MLLMs benchmark in industrial Anomaly Detection. We defined seven key subtasks of MLLMs in industrial inspection and designed a novel pipeline to generate the MMAD dataset with 39,672 questions for 8,366 industrial images. With MMAD, we have conducted a comprehensive, quantitative evaluation of various state-of-the-art MLLMs. The commercial models performed the best, with the average accuracy of GPT-4o models reaching 74.9%. However, this result falls far short of industrial requirements. Our analysis reveals that current MLLMs still have significant room for improvement in answering questions related to industrial anomalies and defects. We further explore two training-free performance enhancement strategies to help models improve in industrial scenarios, highlighting their promising potential for future research.
Authors: Chen Gao, Baining Zhao, Weichen Zhang, Jinzhu Mao, Jun Zhang, Zhiheng Zheng, Fanhang Man, Jianjie Fang, Zile Zhou, Jinqiang Cui, Xinlei Chen, Yong Li
Abstract: Embodied artificial intelligence emphasizes the role of an agent's body in generating human-like behaviors. The recent efforts on EmbodiedAI pay a lot of attention to building up machine learning models to possess perceiving, planning, and acting abilities, thereby enabling real-time interaction with the world. However, most works focus on bounded indoor environments, such as navigation in a room or manipulating a device, with limited exploration of embodying the agents in open-world scenarios. That is, embodied intelligence in the open and outdoor environment is less explored, for which one potential reason is the lack of high-quality simulators, benchmarks, and datasets. To address it, in this paper, we construct a benchmark platform for embodied intelligence evaluation in real-world city environments. Specifically, we first construct a highly realistic 3D simulation environment based on the real buildings, roads, and other elements in a real city. In this environment, we combine historically collected data and simulation algorithms to conduct simulations of pedestrian and vehicle flows with high fidelity. Further, we designed a set of evaluation tasks covering different EmbodiedAI abilities. Moreover, we provide a complete set of input and output interfaces for access, enabling embodied agents to easily take task requirements and current environmental observations as input and then make decisions and obtain performance evaluations. On the one hand, it expands the capability of existing embodied intelligence to higher levels. On the other hand, it has a higher practical value in the real world and can support more potential applications for artificial general intelligence. Based on this platform, we evaluate some popular large language models for embodied intelligence capabilities of different dimensions and difficulties.
Authors: Jun Wang, Meng Fang, Ziyu Wan, Muning Wen, Jiachen Zhu, Anjie Liu, Ziqin Gong, Yan Song, Lei Chen, Lionel M. Ni, Linyi Yang, Ying Wen, Weinan Zhang
Abstract: In this technical report, we introduce OpenR, an open-source framework designed to integrate key components for enhancing the reasoning capabilities of large language models (LLMs). OpenR unifies data acquisition, reinforcement learning training (both online and offline), and non-autoregressive decoding into a cohesive software platform. Our goal is to establish an open-source platform and community to accelerate the development of LLM reasoning. Inspired by the success of OpenAI's o1 model, which demonstrated improved reasoning abilities through step-by-step reasoning and reinforcement learning, OpenR integrates test-time compute, reinforcement learning, and process supervision to improve reasoning in LLMs. Our work is the first to provide an open-source framework that explores the core techniques of OpenAI's o1 model with reinforcement learning, achieving advanced reasoning capabilities beyond traditional autoregressive methods. We demonstrate the efficacy of OpenR by evaluating it on the MATH dataset, utilising publicly available data and search methods. Our initial experiments confirm substantial gains, with relative improvements in reasoning and performance driven by test-time computation and reinforcement learning through process reward models. The OpenR framework, including code, models, and datasets, is accessible at https://openreasoner.github.io.
Authors: Duo Xu, Faramarz Fekri
Abstract: In this work, we study the problem of learning generalizable policies for compositional tasks given by a logic specification. These tasks are composed by temporally extended subgoals. Due to dependencies of subgoals and long task horizon, previous reinforcement learning (RL) algorithms, e.g., task-conditioned and goal-conditioned policies, still suffer from slow convergence and sub-optimality when solving the generalization problem of compositional tasks. In order to tackle these issues, this paper proposes a new hierarchical RL framework for the efficient and optimal generalization of compositional tasks. In the high level, we propose a new implicit planner designed specifically for generalizing compositional tasks. Specifically, the planner produces the selection of next sub-task and estimates the multi-step return of completing the rest of task from current state. It learns a latent transition model and conducts planning in the latent space based on a graph neural network (GNN). Then, the next sub-task selected by the high level guides the low-level agent efficiently to solve long-horizon tasks and the multi-step return makes the low-level policy consider dependencies of future sub-tasks. We conduct comprehensive experiments to show the advantage of proposed framework over previous methods in terms of optimality and efficiency.
Authors: Zhiguang Zhou, Haoxuan Wang, Zhengqing Zhao, Fengling Zheng, Yongheng Wang, Wei Chen, Yong Wang
Abstract: Chart images, such as bar charts, pie charts, and line charts, are explosively produced due to the wide usage of data visualizations. Accordingly, knowledge mining from chart images is becoming increasingly important, which can benefit downstream tasks like chart retrieval and knowledge graph completion. However, existing methods for chart knowledge mining mainly focus on converting chart images into raw data and often ignore their visual encodings and semantic meanings, which can result in information loss for many downstream tasks. In this paper, we propose ChartKG, a novel knowledge graph (KG) based representation for chart images, which can model the visual elements in a chart image and semantic relations among them including visual encodings and visual insights in a unified manner. Further, we develop a general framework to convert chart images to the proposed KG-based representation. It integrates a series of image processing techniques to identify visual elements and relations, e.g., CNNs to classify charts, yolov5 and optical character recognition to parse charts, and rule-based methods to construct graphs. We present four cases to illustrate how our knowledge-graph-based representation can model the detailed visual elements and semantic relations in charts, and further demonstrate how our approach can benefit downstream applications such as semantic-aware chart retrieval and chart question answering. We also conduct quantitative evaluations to assess the two fundamental building blocks of our chart-to-KG framework, i.e., object recognition and optical character recognition. The results provide support for the usefulness and effectiveness of ChartKG.
Authors: Yijie Li, Yuan Sun
Abstract: Recently, there has been a growing trend of employing large language models (LLMs) to judge the quality of other LLMs. Many studies have adopted closed-source models, mainly using GPT-4 as the evaluator. However, due to the closed-source nature of the GPT-4 model, employing it as an evaluator has resulted in issues including transparency, controllability, and cost-effectiveness. Some researchers have turned to using fine-tuned open-source LLMs as evaluators. However, existing open-source evaluation LLMs generally lack a user-friendly visualization tool, and they have not been optimized for accelerated model inference, which causes inconvenience for researchers with limited resources and those working across different fields. This paper presents EasyJudge, a model developed to evaluate significant language model responses. It is lightweight, precise, efficient, and user-friendly, featuring an intuitive visualization interface for ease of deployment and use. EasyJudge uses detailed datasets and refined prompts for model optimization, achieving strong consistency with human and proprietary model evaluations. The model optimized with quantitative methods enables EasyJudge to run efficiently on consumer-grade GPUs or even CPUs. We also provide detailed analysis and case studies to further reveal the potential of our method.
Authors: Manuj Kant, Manav Kant, Marzieh Nabi, Preston Carlson, Megan Ma
Abstract: The costs and complexity of the American judicial system limit access to legal solutions for many Americans. Large language models (LLMs) hold great potential to improve access to justice. However, a major challenge in applying AI and LLMs in legal contexts, where consistency and reliability are crucial, is the need for System 2 reasoning. In this paper, we explore the integration of LLMs with logic programming to enhance their ability to reason, bringing their strategic capabilities closer to that of a skilled lawyer. Our objective is to translate laws and contracts into logic programs that can be applied to specific legal cases, with a focus on insurance contracts. We demonstrate that while GPT-4o fails to encode a simple health insurance contract into logical code, the recently released OpenAI o1-preview model succeeds, exemplifying how LLMs with advanced System 2 reasoning capabilities can expand access to justice.
Authors: DiJia Su, Sainbayar Sukhbaatar, Michael Rabbat, Yuandong Tian, Qinqing Zheng
Abstract: In human cognition theory, human thinking is governed by two systems: the fast and intuitive System 1 and the slower but more deliberative System 2. Recent studies have shown that incorporating System 2 process into Transformers including large language models (LLMs), significantly enhances their reasoning capabilities. Nevertheless, models that purely resemble System 2 thinking require substantially higher computational costs and are much slower to respond. To address this challenge, we present Dualformer, a single Transformer model that seamlessly integrates both the fast and slow reasoning modes. Dualformer is obtained by training on data with randomized reasoning traces, where different parts of the traces are dropped during training. The dropping strategies are specifically tailored according to the trace structure, analogous to analyzing our thinking process and creating shortcuts with patterns. At inference time, our model can be configured to output only the solutions (fast mode) or both the reasoning chain and the final solution (slow mode), or automatically decide which mode to engage (auto mode). In all cases, Dualformer outperforms the corresponding baseline models in both performance and computational efficiency: (1) in slow mode, Dualformer optimally solves unseen 30 x 30 maze navigation tasks 97.6% of the time, surpassing the Searchformer (trained on data with complete reasoning traces) baseline performance of 93.3%, while only using 45.5% fewer reasoning steps; (2) in fast mode, Dualformer completes those tasks with an 80% optimal rate, significantly outperforming the Solution-Only model (trained on solution-only data), which has an optimal rate of only 30%. For math problems, our techniques have also achieved improved performance with LLM fine-tuning, showing its generalization beyond task-specific models.
Authors: Arnaud Lequen
Abstract: We consider the problem of synthesizing interpretable models that recognize the behaviour of an agent compared to other agents, on a whole set of similar planning tasks expressed in PDDL. Our approach consists in learning logical formulas, from a small set of examples that show how an agent solved small planning instances. These formulas are expressed in a version of First-Order Temporal Logic (FTL) tailored to our planning formalism. Such formulas are human-readable, serve as (partial) descriptions of an agent's policy, and generalize to unseen instances. We show that learning such formulas is computationally intractable, as it is an NP-hard problem. As such, we propose to learn these behaviour classifiers through a topology-guided compilation to MaxSAT, which allows us to generate a wide range of different formulas. Experiments show that interesting and accurate formulas can be learned in reasonable time.
Authors: Abhishek Dutta, Yen-Che Hsiao
Abstract: This paper presents an innovative large language model (LLM) agent framework for enhancing diagnostic accuracy in simulated clinical environments using the AgentClinic benchmark. The proposed automatic correction enables doctor agents to iteratively refine their reasoning and actions following incorrect diagnoses, fostering improved decision-making over time. Experiments show that the implementation of the adaptive LLM-based doctor agents achieve correct diagnoses through dynamic interactions with simulated patients. The evaluations highlight the capacity of autonomous agents to adapt and improve in complex medical scenarios. Future enhancements will focus on refining the algorithm and expanding its applicability across a wider range of tasks and different large language models.
Authors: Achint Soni, Sreyas Venkataraman, Abhranil Chandra, Sebastian Fischmeister, Percy Liang, Bo Dai, Sherry Yang
Abstract: Video generation has been used to generate visual plans for controlling robotic systems. Given an image observation and a language instruction, previous work has generated video plans which are then converted to robot controls to be executed. However, a major bottleneck in leveraging video generation for control lies in the quality of the generated videos, which often suffer from hallucinatory content and unrealistic physics, resulting in low task success when control actions are extracted from the generated videos. While scaling up dataset and model size provides a partial solution, integrating external feedback is both natural and essential for grounding video generation in the real world. With this observation, we propose VideoAgent for self-improving generated video plans based on external feedback. Instead of directly executing the generated video plan, VideoAgent first refines the generated video plans using a novel procedure which we call self-conditioning consistency, utilizing feedback from a pretrained vision-language model (VLM). As the refined video plan is being executed, VideoAgent collects additional data from the environment to further improve video plan generation. Experiments in simulated robotic manipulation from MetaWorld and iTHOR show that VideoAgent drastically reduces hallucination, thereby boosting success rate of downstream manipulation tasks. We further illustrate that VideoAgent can effectively refine real-robot videos, providing an early indicator that robotics can be an effective tool in grounding video generation in the physical world.
Authors: Yifan Feng, Chengwu Yang, Xingliang Hou, Shaoyi Du, Shihui Ying, Zongze Wu, Yue Gao
Abstract: Existing benchmarks like NLGraph and GraphQA evaluate LLMs on graphs by focusing mainly on pairwise relationships, overlooking the high-order correlations found in real-world data. Hypergraphs, which can model complex beyond-pairwise relationships, offer a more robust framework but are still underexplored in the context of LLMs. To address this gap, we introduce LLM4Hypergraph, the first comprehensive benchmark comprising 21,500 problems across eight low-order, five high-order, and two isomorphism tasks, utilizing both synthetic and real-world hypergraphs from citation networks and protein structures. We evaluate six prominent LLMs, including GPT-4o, demonstrating our benchmark's effectiveness in identifying model strengths and weaknesses. Our specialized prompting framework incorporates seven hypergraph languages and introduces two novel techniques, Hyper-BAG and Hyper-COT, which enhance high-order reasoning and achieve an average 4% (up to 9%) performance improvement on structure classification tasks. This work establishes a foundational testbed for integrating hypergraph computational capabilities into LLMs, advancing their comprehension.
Authors: Jiajie Yu, Yuhong Wang, Wei Ma
Abstract: Bus holding control is a widely-adopted strategy for maintaining stability and improving the operational efficiency of bus systems. Traditional model-based methods often face challenges with the low accuracy of bus state prediction and passenger demand estimation. In contrast, Reinforcement Learning (RL), as a data-driven approach, has demonstrated great potential in formulating bus holding strategies. RL determines the optimal control strategies in order to maximize the cumulative reward, which reflects the overall control goals. However, translating sparse and delayed control goals in real-world tasks into dense and real-time rewards for RL is challenging, normally requiring extensive manual trial-and-error. In view of this, this study introduces an automatic reward generation paradigm by leveraging the in-context learning and reasoning capabilities of Large Language Models (LLMs). This new paradigm, termed the LLM-enhanced RL, comprises several LLM-based modules: reward initializer, reward modifier, performance analyzer, and reward refiner. These modules cooperate to initialize and iteratively improve the reward function according to the feedback from training and test results for the specified RL-based task. Ineffective reward functions generated by the LLM are filtered out to ensure the stable evolution of the RL agents' performance over iterations. To evaluate the feasibility of the proposed LLM-enhanced RL paradigm, it is applied to various bus holding control scenarios, including a synthetic single-line system and a real-world multi-line system. The results demonstrate the superiority and robustness of the proposed paradigm compared to vanilla RL strategies, the LLM-based controller, and conventional space headway-based feedback control. This study sheds light on the great potential of utilizing LLMs in various smart mobility applications.
Authors: Abhijit Manatkar, Ashlesha Akella, Parthivi Gupta, Krishnasuri Narayanam
Abstract: Discovering meaningful insights from a large dataset, known as Exploratory Data Analysis (EDA), is a challenging task that requires thorough exploration and analysis of the data. Automated Data Exploration (ADE) systems use goal-oriented methods with Large Language Models and Reinforcement Learning towards full automation. However, these methods require human involvement to anticipate goals that may limit insight extraction, while fully automated systems demand significant computational resources and retraining for new datasets. We introduce QUIS, a fully automated EDA system that operates in two stages: insight generation (ISGen) driven by question generation (QUGen). The QUGen module generates questions in iterations, refining them from previous iterations to enhance coverage without human intervention or manually curated examples. The ISGen module analyzes data to produce multiple relevant insights in response to each question, requiring no prior training and enabling QUIS to adapt to new datasets.
Authors: Joshua Ong Jun Leang, Aryo Pradipta Gema, Shay B. Cohen
Abstract: Mathematical reasoning remains a significant challenge for large language models (LLMs), despite progress in prompting techniques such as Chain-of-Thought (CoT). We present Chain of Mathematically Annotated Thought (CoMAT), which enhances reasoning through two stages: Symbolic Conversion (converting natural language queries into symbolic form) and Reasoning Execution (deriving answers from symbolic representations). CoMAT operates entirely with a single LLM and without external solvers. Across four LLMs, CoMAT outperforms traditional CoT on six out of seven benchmarks, achieving gains of 4.48% on MMLU-Redux (MATH) and 4.58% on GaoKao MCQ. In addition to improved performance, CoMAT ensures faithfulness and verifiability, offering a transparent reasoning process for complex mathematical tasks
Authors: Han Wang, Yilin Zhao, Dian Li, Xiaohan Wang, Gang Liu, Xuguang Lan, Hui Wang
Abstract: Humor is a culturally nuanced aspect of human language that presents challenges for understanding and generation, requiring participants to possess good creativity and strong associative thinking. Similar to reasoning tasks like solving math problems, humor generation requires continuous reflection and revision to foster creative thinking, rather than relying on a sudden flash of inspiration like Creative Leap-of-Thought (CLoT) paradigm. Although CLoT can realize the ability of remote association generation, this paradigm fails to generate humor content. Therefore, in this paper, we propose a systematic way of thinking about generating humor and based on it, we built Creative Leap of Structured Thought (CLoST) frame. First, a reward model is necessary achieve the purpose of being able to correct errors, since there is currently no expert model of humor and a usable rule to determine whether a piece of content is humorous. Judgement-oriented instructions are designed to improve the capability of a model, and we also propose an open-domain instruction evolutionary method to fully unleash the potential. Then, through reinforcement learning, the model learns to hone its rationales of the thought chain and refine the strategies it uses. Thus, it learns to recognize and correct its mistakes, and finally generate the most humorous and creative answer. These findings deepen our understanding of the creative capabilities of LLMs and provide ways to enhance LLMs' creative abilities for cross-domain innovative applications.
Authors: Chenglin Li, Qianglong Chen, Zhi Li, Feng Tao, Yicheng Li, Hao Chen, Fei Yu, Yin Zhang
Abstract: Instruction tuning is a crucial technique for aligning language models with humans' actual goals in the real world. Extensive research has highlighted the quality of instruction data is essential for the success of this alignment. However, creating high-quality data manually is labor-intensive and time-consuming, which leads researchers to explore using LLMs to synthesize data. Recent studies have focused on using a stronger LLM to iteratively enhance existing instruction data, showing promising results. Nevertheless, previous work often lacks control over the evolution direction, resulting in high uncertainty in the data synthesis process and low-quality instructions. In this paper, we introduce a general and scalable framework, IDEA-MCTS (Instruction Data Enhancement using Monte Carlo Tree Search), a scalable framework for efficiently synthesizing instructions. With tree search and evaluation models, it can efficiently guide each instruction to evolve into a high-quality form, aiding in instruction fine-tuning. Experimental results show that IDEA-MCTS significantly enhances the seed instruction data, raising the average evaluation scores of quality, diversity, and complexity from 2.19 to 3.81. Furthermore, in open-domain benchmarks, experimental results show that IDEA-MCTS improves the accuracy of real-world instruction-following skills in LLMs by an average of 5\% in low-resource settings.
Authors: Xi Wang, Liana Mikaelyan, Taketomo Isazawa, James Hensman
Abstract: In this paper, we propose Knowledge Base augmented Language Model (KBLaM), a new method for augmenting Large Language Models (LLMs) with external knowledge. KBLaM works with a knowledge base (KB) constructed from a corpus of documents, transforming each piece of knowledge in the KB into continuous key-value vector pairs via pre-trained sentence encoders with linear adapters and integrating them into pre-trained LLMs via a specialized rectangular attention mechanism. Unlike Retrieval-Augmented Generation, KBLaM eliminates external retrieval modules, and unlike in-context learning, its computational overhead scales linearly with KB size rather than quadratically. Our approach enables integrating a large KB of more than 10K triples into an 8B pre-trained LLM of only 8K context window on one single A100 80GB GPU and allows for dynamic updates without model fine-tuning or retraining. Experiments demonstrate KBLaM's effectiveness in various tasks, including question-answering and open-ended reasoning, while providing interpretable insights into its use of the augmented knowledge.
Authors: Haochuan Wang, Xiachong Feng, Lei Li, Zhanyue Qin, Dianbo Sui, Lingpeng Kong
Abstract: The rapid advancement of large language models (LLMs) has accelerated their application in reasoning, with strategic reasoning drawing increasing attention. To evaluate LLMs' strategic reasoning capabilities, game theory, with its concise structure, has become a preferred approach. However, current research focuses on a limited selection of games, resulting in low coverage. Classic game scenarios risk data leakage, and existing benchmarks often lack extensibility, making them inadequate for evaluating state-of-the-art models. To address these challenges, we propose TMGBench, a benchmark with comprehensive game type coverage, novel scenarios, and flexible organization. Specifically, we incorporate all 144 game types summarized by the Robinson-Goforth topology of 2x2 games, constructed as classic games. We also employ synthetic data generation to create diverse, higher-quality scenarios through topic guidance and human inspection, referred to as story-based games. Lastly, we provide a sustainable framework for increasingly powerful LLMs by treating these games as atomic units and organizing them into more complex forms via sequential, parallel, and nested structures. Our comprehensive evaluation of mainstream LLMs covers tests on rational reasoning, robustness, Theory-of-Mind (ToM), and reasoning in complex forms. Results reveal flaws in accuracy, consistency, and varying mastery of ToM. Additionally, o1-mini, OpenAI's latest reasoning model, achieved accuracy rates of 66.6%, 60.0%, and 70.0% on sequential, parallel, and nested games, highlighting TMGBench's challenges.
Authors: Martina Cinquini, Isacco Beretta, Salvatore Ruggieri, Isabel Valera
Abstract: In this paper, we focus on estimating the causal effect of an intervention over time on a dynamical system. To that end, we formally define causal interventions and their effects over time on discrete-time stochastic processes (DSPs). Then, we show under which conditions the equilibrium states of a DSP, both before and after a causal intervention, can be captured by a structural causal model (SCM). With such an equivalence at hand, we provide an explicit mapping from vector autoregressive models (VARs), broadly applied in econometrics, to linear, but potentially cyclic and/or affected by unmeasured confounders, SCMs. The resulting causal VAR framework allows us to perform causal inference over time from observational time series data. Our experiments on synthetic and real-world datasets show that the proposed framework achieves strong performance in terms of observational forecasting while enabling accurate estimation of the causal effect of interventions on dynamical systems. We demonstrate, through a case study, the potential practical questions that can be addressed using the proposed causal VAR framework.
Authors: Cecilia Di Florio, Huimin Dong, Antonino Rotolo
Abstract: Consistency of case bases is a way to avoid the problem of retrieving conflicting constraining precedents for new cases to be decided. However, in legal practice the consistency requirements for case bases may not be satisfied. As pointed out in (Broughton 2019), a model of precedential constraint should take into account the hierarchical structure of the specific legal system under consideration and the temporal dimension of cases. This article continues the research initiated in (Liu et al. 2022; Di Florio et al. 2023), which established a connection between Boolean classifiers and legal case-based reasoning. On this basis, we enrich the classifier models with an organisational structure that takes into account both the hierarchy of courts and which courts issue decisions that are binding/constraining on subsequent cases. We focus on common law systems. We also introduce a temporal relation between cases. Within this enriched framework, we can formalise the notions of overruled cases and cases decided per incuriam: such cases are not to be considered binding on later cases. Finally, we show under which condition principles based on the hierarchical structure and on the temporal dimension can provide an unambiguous decision-making process for new cases in the presence of conflicting binding precedents.
Authors: Naman Gupta, Shashank Kirtania, Priyanshu Gupta, Krishna Kariya, Sumit Gulwani, Arun Iyer, Suresh Parthasarathy, Arjun Radhakrishna, Sriram K. Rajamani, Gustavo Soares
Abstract: Large Language Models (LLMs) often generate incorrect or outdated information, especially in low-resource settings or when dealing with private data. To address this, Retrieval-Augmented Generation (RAG) uses external knowledge bases (KBs), but these can also suffer from inaccuracies. We introduce STACKFEED, a novel Structured Textual Actor-Critic Knowledge base editing with FEEDback approach that iteratively refines the KB based on expert feedback using a multi-actor, centralized critic reinforcement learning framework. Each document is assigned to an actor, modeled as a ReACT agent, which performs structured edits based on document-specific targeted instructions from a centralized critic. Experimental results show that STACKFEED significantly improves KB quality and RAG system performance, enhancing accuracy by up to 8% over baselines.
Authors: Christopher J. MacLellan, Erik Harpstead, Vincent Aleven, Kenneth R. Koedinger
Abstract: The literature on concept formation has demonstrated that humans are capable of learning concepts incrementally, with a variety of attribute types, and in both supervised and unsupervised settings. Many models of concept formation focus on a subset of these characteristics, but none account for all of them. In this paper, we present TRESTLE, an incremental account of probabilistic concept formation in structured domains that unifies prior concept learning models. TRESTLE works by creating a hierarchical categorization tree that can be used to predict missing attribute values and cluster sets of examples into conceptually meaningful groups. It updates its knowledge by partially matching novel structures and sorting them into its categorization tree. Finally, the system supports mixed-data representations, including nominal, numeric, relational, and component attributes. We evaluate TRESTLE's performance on a supervised learning task and an unsupervised clustering task. For both tasks, we compare it to a nonincremental model and to human participants. We find that this new categorization model is competitive with the nonincremental approach and more closely approximates human behavior on both tasks. These results serve as an initial demonstration of TRESTLE's capabilities and show that, by taking key characteristics of human learning into account, it can better model behavior than approaches that ignore them.
Authors: Kazuki Irie, Brenden M. Lake
Abstract: Since the earliest proposals for neural network models of the mind and brain, critics have pointed out key weaknesses in these models compared to human cognitive abilities. Here we review recent work that has used metalearning to help overcome some of these challenges. We characterize their successes as addressing an important developmental problem: they provide machines with an incentive to improve X (where X represents the desired capability) and opportunities to practice it, through explicit optimization for X; unlike conventional approaches that hope for achieving X through generalization from related but different objectives. We review applications of this principle to four classic challenges: systematicity, catastrophic forgetting, few-shot learning and multi-step reasoning; we also discuss related aspects of human development in natural environments.
Authors: John Mern, Anthony Corso, Damian Burch, Kurt House, Jef Caers
Abstract: Optimal Bayesian decision making on what geoscientific data to acquire requires stating a prior model of uncertainty. Data acquisition is then optimized by reducing uncertainty on some property of interest maximally, and on average. In the context of exploration, very few, sometimes no data at all, is available prior to data acquisition planning. The prior model therefore needs to include human interpretations on the nature of spatial variability, or on analogue data deemed relevant for the area being explored. In mineral exploration, for example, humans may rely on conceptual models on the genesis of the mineralization to define multiple hypotheses, each representing a specific spatial variability of mineralization. More often than not, after the data is acquired, all of the stated hypotheses may be proven incorrect, i.e. falsified, hence prior hypotheses need to be revised, or additional hypotheses generated. Planning data acquisition under wrong geological priors is likely to be inefficient since the estimated uncertainty on the target property is incorrect, hence uncertainty may not be reduced at all. In this paper, we develop an intelligent agent based on partially observable Markov decision processes that plans optimally in the case of multiple geological or geoscientific hypotheses on the nature of spatial variability. Additionally, the artificial intelligence is equipped with a method that allows detecting, early on, whether the human stated hypotheses are incorrect, thereby saving considerable expense in data acquisition. Our approach is tested on a sediment-hosted copper deposit, and the algorithm presented has aided in the characterization of an ultra high-grade deposit in Zambia in 2023.
Authors: Kuofeng Gao, Huanqia Cai, Qingyao Shuai, Dihong Gong, Zhifeng Li
Abstract: Accurate mathematical reasoning with Large Language Models (LLMs) is crucial in revolutionizing domains that heavily rely on such reasoning. However, LLMs often encounter difficulties in certain aspects of mathematical reasoning, leading to flawed reasoning and erroneous results. To mitigate these issues, we introduce a novel mechanism, the Chain of Self-Correction (CoSC), specifically designed to embed self-correction as an inherent ability in LLMs, enabling them to validate and rectify their own results. The CoSC mechanism operates through a sequence of self-correction stages. In each stage, the LLMs generate a program to address a given problem, execute this program using program-based tools to obtain an output, subsequently verify this output. Based on the verification, the LLMs either proceed to the next correction stage or finalize the answer. This iterative self-correction process allows the LLMs to refine their reasoning steps and improve the accuracy of their mathematical reasoning. To enable the CoSC mechanism at a low cost, we employ a two-phase finetuning approach. In the first phase, the LLMs are trained with a relatively small volume of seeding data generated from GPT-4, establishing an initial CoSC capability. In the second phase, the CoSC capability is further enhanced by training with a larger volume of self-generated data using the trained model in the first phase, without relying on the paid GPT-4. Our comprehensive experiments demonstrate that CoSC significantly improves performance on traditional mathematical datasets among existing open-source LLMs. Notably, our CoSC-Code-34B model achieved a 53.5% score on MATH, the most challenging mathematical reasoning dataset in the public domain, surpassing the performance of well-established models such as ChatGPT, GPT-4, and even multi-modal LLMs like GPT-4V, Gemini-1.0 Pro, and Gemini-1.0 Ultra.
Authors: Pengrui Quan, Xiaomin Ouyang, Jeya Vikranth Jeyakumar, Ziqi Wang, Yang Xing, Mani Srivastava
Abstract: Effective processing, interpretation, and management of sensor data have emerged as a critical component of cyber-physical systems. Traditionally, processing sensor data requires profound theoretical knowledge and proficiency in signal-processing tools. However, recent works show that Large Language Models (LLMs) have promising capabilities in processing sensory data, suggesting their potential as copilots for developing sensing systems. To explore this potential, we construct a comprehensive benchmark, SensorBench, to establish a quantifiable objective. The benchmark incorporates diverse real-world sensor datasets for various tasks. The results show that while LLMs exhibit considerable proficiency in simpler tasks, they face inherent challenges in processing compositional tasks with parameter selections compared to engineering experts. Additionally, we investigate four prompting strategies for sensor processing and show that self-verification can outperform all other baselines in 48% of tasks. Our study provides a comprehensive benchmark and prompting analysis for future developments, paving the way toward an LLM-based sensor processing copilot.
Authors: Yanbiao Ji, Chang Liu, Xin Chen, Yue Ding, Dan Luo, Mei Li, Wenqing Lin, Hongtao Lu
Abstract: Graphs are a fundamental data structure for representing relationships in real-world scenarios. With the success of Large Language Models (LLMs) across various natural language processing (NLP) tasks, there has been growing interest in integrating LLMs for graph learning. However, applying LLMs to graph-related tasks poses significant challenges, as these models are not inherently designed to capture the complex structural information present in graphs. Existing approaches address this challenge through two strategies: the chain of tasks approach, which uses Graph Neural Networks (GNNs) to encode the graph structure so that LLMs are relieved from understanding spatial positions; and Graph-to-Text Conversion, which translates graph structures into semantic text representations that LLMs can process. Despite their progress, these methods often struggle to fully preserve the topological information of graphs or require extensive computational resources, limiting their practical applicability. In this work, we introduce Node Tokenizer for Large Language Models (NT-LLM), a novel framework that efficiently encodes graph structures by selecting key nodes as anchors and representing each node based on its relative distance to these anchors. This position-anchored encoding effectively captures the graph topology, enabling enhanced reasoning capabilities in LLMs over graph data. Additionally, we implement a task-specific tuning procedure to further improve structural understanding within LLMs. Through extensive empirical evaluations, NT-LLM demonstrates significant performance improvements across a variety of graph-related tasks.
Authors: Jiayi Zhang, Jinyu Xiang, Zhaoyang Yu, Fengwei Teng, Xionghui Chen, Jiaqi Chen, Mingchen Zhuge, Xin Cheng, Sirui Hong, Jinlin Wang, Bingnan Zheng, Bang Liu, Yuyu Luo, Chenglin Wu
Abstract: Large language models (LLMs) have demonstrated remarkable potential in solving complex tasks across diverse domains, typically by employing agentic workflows that follow detailed instructions and operational sequences. However, constructing these workflows requires significant human effort, limiting scalability and generalizability. Recent research has sought to automate the generation and optimization of these workflows, but existing methods still rely on initial manual setup and fall short of achieving fully automated and effective workflow generation. To address this challenge, we reformulate workflow optimization as a search problem over code-represented workflows, where LLM-invoking nodes are connected by edges. We introduce AFlow, an automated framework that efficiently explores this space using Monte Carlo Tree Search, iteratively refining workflows through code modification, tree-structured experience, and execution feedback. Empirical evaluations across six benchmark datasets demonstrate AFlow's efficacy, yielding a 5.7% average improvement over state-of-the-art baselines. Furthermore, AFlow enables smaller models to outperform GPT-4o on specific tasks at 4.55% of its inference cost in dollars. The code will be available at https://github.com/geekan/MetaGPT.
Authors: Shuoqiu Li, Han Xu, Haipeng Chen
Abstract: Large language models (LLMs) have significantly improved their reasoning and decision-making capabilities, as seen in methods like ReAct. However, despite its effectiveness in tackling complex tasks, ReAct faces two main challenges: losing focus on the original question and becoming stuck in action loops. To address these issues, we introduce Focused ReAct, an enhanced version of the ReAct paradigm that incorporates reiteration and early stop mechanisms. These improvements help the model stay focused on the original query and avoid repetitive behaviors. Experimental results show accuracy gains of 18% to 530% and a runtime reduction of up to 34% compared to the original ReAct method.
Authors: Nima Sedaghat, Martino Romaniello, Jonathan E. Carrick, Fran\c{c}ois-Xavier Pineau
Abstract: Machine learning has been widely applied to clearly defined problems of astronomy and astrophysics. However, deep learning and its conceptual differences to classical machine learning have been largely overlooked in these fields. The broad hypothesis behind our work is that letting the abundant real astrophysical data speak for itself, with minimal supervision and no labels, can reveal interesting patterns which may facilitate discovery of novel physical relationships. Here as the first step, we seek to interpret the representations a deep convolutional neural network chooses to learn, and find correlations in them with current physical understanding. We train an encoder-decoder architecture on the self-supervised auxiliary task of reconstruction to allow it to learn general representations without bias towards any specific task. By exerting weak disentanglement at the information bottleneck of the network, we implicitly enforce interpretability in the learned features. We develop two independent statistical and information-theoretical methods for finding the number of learned informative features, as well as measuring their true correlation with astrophysical validation labels. As a case study, we apply this method to a dataset of ~270000 stellar spectra, each of which comprising ~300000 dimensions. We find that the network clearly assigns specific nodes to estimate (notions of) parameters such as radial velocity and effective temperature without being asked to do so, all in a completely physics-agnostic process. This supports the first part of our hypothesis. Moreover, we find with high confidence that there are ~4 more independently informative dimensions that do not show a direct correlation with our validation parameters, presenting potential room for future studies.
Authors: Nima Sedaghat, Brianna M. Smart, J. Bryce Kalmbach, Erin L. Howard, Hamidreza Amindavar
Abstract: We report a study exploring how the use of deep neural networks with astronomical Big Data may help us find and uncover new insights into underlying phenomena: through our experiments towards unsupervised knowledge extraction from astronomical Big Data we serendipitously found that deep convolutional autoencoders tend to reject telluric lines in stellar spectra. With further experiments we found that only when the spectra are in the barycentric frame does the network automatically identify the statistical independence between two components, stellar vs telluric, and rejects the latter. We exploit this finding and turn it into a proof-of-concept method for removal of the telluric lines from stellar spectra in a fully unsupervised fashion: we increase the inter-observation entropy of telluric absorption lines by imposing a random, virtual radial velocity to the observed spectrum. This technique results in a non-standard form of ``whitening'' in the atmospheric components of the spectrum, decorrelating them across multiple observations. We process more than 250,000 spectra from the High Accuracy Radial velocity Planetary Search (HARPS) and with qualitative and quantitative evaluations against a database of known telluric lines, show that most of the telluric lines are successfully rejected. Our approach, `Stellar Karaoke', has zero need for prior knowledge about parameters such as observation time, location, or the distribution of atmospheric molecules and processes each spectrum in milliseconds. We also train and test on Sloan Digital Sky Survey (SDSS) and see a significant performance drop due to the low resolution. We discuss directions for developing tools on top of the introduced method in the future.
Authors: Jiachun Li, Pengfei Cao, Chenhao Wang, Zhuoran Jin, Yubo Chen, Daojian Zeng, Kang Liu, Jun Zhao
Abstract: Large language models exhibit high-level commonsense reasoning abilities, especially with enhancement methods like Chain-of-Thought (CoT). However, we find these CoT-like methods lead to a considerable number of originally correct answers turning wrong, which we define as the Toxic CoT problem. To interpret and mitigate this problem, we first utilize attribution tracing and causal tracing methods to probe the internal working mechanism of the LLM during CoT reasoning. Through comparisons, we prove that the model exhibits information loss from the question over the shallow attention layers when generating rationales or answers. Based on the probing findings, we design a novel method called RIDERS (Residual decodIng and sERial-position Swap), which compensates for the information deficit in the model from both decoding and serial-position perspectives. Through extensive experiments on multiple commonsense reasoning benchmarks, we validate that this method not only significantly eliminates Toxic CoT problems (decreased by 23.6%), but also effectively improves the model's overall commonsense reasoning performance (increased by 5.5%).
Authors: Alex Li
Abstract: Predicting volatility in financial markets, including stocks, index ETFs, foreign exchange, and cryptocurrencies, remains a challenging task due to the inherent complexity and non-linear dynamics of these time series. In this study, I apply TimeMixer, a state-of-the-art time series forecasting model, to predict the volatility of global financial assets. TimeMixer utilizes a multiscale-mixing approach that effectively captures both short-term and long-term temporal patterns by analyzing data across different scales. My empirical results reveal that while TimeMixer performs exceptionally well in short-term volatility forecasting, its accuracy diminishes for longer-term predictions, particularly in highly volatile markets. These findings highlight TimeMixer's strength in capturing short-term volatility, making it highly suitable for practical applications in financial risk management, where precise short-term forecasts are critical. However, the model's limitations in long-term forecasting point to potential areas for further refinement.
Authors: Tomas Bueno Momcilovic, Dian Balta, Beat Buesser, Giulio Zizzo, Mark Purcell
Abstract: The EU AI Act (EUAIA) introduces requirements for AI systems which intersect with the processes required to establish adversarial robustness. However, given the ambiguous language of regulation and the dynamicity of adversarial attacks, developers of systems with highly complex models such as LLMs may find their effort to be duplicated without the assurance of having achieved either compliance or robustness. This paper presents a functional architecture that focuses on bridging the two properties, by introducing components with clear reference to their source. Taking the detection layer recommended by the literature, and the reporting layer required by the law, we aim to support developers and auditors with a reasoning layer based on knowledge augmentation (rules, assurance cases, contextual mappings). Our findings demonstrate a novel direction for ensuring LLMs deployed in the EU are both compliant and adversarially robust, which underpin trustworthiness.
Authors: Aofei Chang, Jiaqi Wang, Han Liu, Parminder Bhatia, Cao Xiao, Ting Wang, Fenglong Ma
Abstract: Parameter Efficient Fine-Tuning (PEFT) offers an efficient solution for fine-tuning large pretrained language models for downstream tasks. However, most PEFT strategies are manually designed, often resulting in suboptimal performance. Recent automatic PEFT approaches aim to address this but face challenges such as search space entanglement, inefficiency, and lack of integration between parameter budgets and search processes. To overcome these issues, we introduce a novel Budget-guided Iterative search strategy for automatic PEFT (BIPEFT), significantly enhancing search efficiency. BIPEFT employs a new iterative search strategy to disentangle the binary module and rank dimension search spaces. Additionally, we design early selection strategies based on parameter budgets, accelerating the learning process by gradually removing unimportant modules and fixing rank dimensions. Extensive experiments on public benchmarks demonstrate the superior performance of BIPEFT in achieving efficient and effective PEFT for downstream tasks with a low parameter budget.
Authors: Jordis Emilia Herrmann, Aswath Mandakath Gopinath, Mikael Norrlof, Mark Niklas M\"uller
Abstract: Quickly resolving issues reported in industrial applications is crucial to minimize economic impact. However, the required data analysis makes diagnosing the underlying root causes a challenging and time-consuming task, even for experts. In contrast, large language models (LLMs) excel at analyzing large amounts of data. Indeed, prior work in AI-Ops demonstrates their effectiveness in analyzing IT systems. Here, we extend this work to the challenging and largely unexplored domain of robotics systems. To this end, we create SYSDIAGBENCH, a proprietary system diagnostics benchmark for robotics, containing over 2500 reported issues. We leverage SYSDIAGBENCH to investigate the performance of LLMs for root cause analysis, considering a range of model sizes and adaptation techniques. Our results show that QLoRA finetuning can be sufficient to let a 7B-parameter model outperform GPT-4 in terms of diagnostic accuracy while being significantly more cost-effective. We validate our LLM-as-a-judge results with a human expert study and find that our best model achieves similar approval ratings as our reference labels.
Authors: Gaurav Shinde, Tiana Kirstein, Souvick Ghosh, Patricia C. Franks
Abstract: The rapid expansion of records creates significant challenges in management, including retention and disposition, appraisal, and organization. Our study underscores the benefits of integrating artificial intelligence (AI) within the broad realm of archival science. In this work, we start by performing a thorough analysis to understand the current use of AI in this area and identify the techniques employed to address challenges. Subsequently, we document the results of our review according to specific criteria. Our findings highlight key AI driven strategies that promise to streamline record-keeping processes and enhance data retrieval efficiency. We also demonstrate our review process to ensure transparency regarding our methodology. Furthermore, this review not only outlines the current state of AI in archival science and records management but also lays the groundwork for integrating new techniques to transform archival practices. Our research emphasizes the necessity for enhanced collaboration between the disciplines of artificial intelligence and archival science.
Authors: Yinan Han, Qingyuan Jiang, Hongming Mei, Yang Yang, Jinhui Tang
Abstract: This report presents our method for Temporal Action Localisation (TAL), which focuses on identifying and classifying actions within specific time intervals throughout a video sequence. We employ a data augmentation technique by expanding the training dataset using overlapping labels from the Something-SomethingV2 dataset, enhancing the model's ability to generalize across various action classes. For feature extraction, we utilize state-of-the-art models, including UMT, VideoMAEv2 for video features, and BEATs and CAV-MAE for audio features. Our approach involves training both multimodal (video and audio) and unimodal (video only) models, followed by combining their predictions using the Weighted Box Fusion (WBF) method. This fusion strategy ensures robust action localisation. our overall approach achieves a score of 0.5498, securing first place in the competition.
Authors: S. Tamang, G. S. Chandana, B. K. Roy
Abstract: In today's digital age, cyberspace has become integral to daily life, however it has also led to an increase in cybercriminal activities. This paper explores cybercrime trends and highlights the need for cybercrime awareness (cyberawareness) to mitigate vulnerabilities. The study also examines Indian statistics on cybercrime. We review the existing literature on cybercrime and cybersecurity, focusing on various types of cybercrimes and their impacts. We present a list of 31 technical as well as non-technical solutions considering that a "common man" may not be technologically aware. Common man solutions, considering that they are not technologically updated. Expanding the list of solutions and validating their effectiveness in cyber threats can be the future scope of the research.
Authors: Haowen Xu (Jamie), Xueping Li (Jamie), Jose Tupayachi (Jamie), Jianming (Jamie), Lian, Femi Omitaomu
Abstract: Bibliometric analysis is essential for understanding research trends, scope, and impact in urban science, especially in high-impact journals, such Nature Portfolios. However, traditional methods, relying on keyword searches and basic NLP techniques, often fail to uncover valuable insights not explicitly stated in article titles or keywords. These approaches are unable to perform semantic searches and contextual understanding, limiting their effectiveness in classifying topics and characterizing studies. In this paper, we address these limitations by leveraging Generative AI models, specifically transformers and Retrieval-Augmented Generation (RAG), to automate and enhance bibliometric analysis. We developed a technical workflow that integrates a vector database, Sentence Transformers, a Gaussian Mixture Model (GMM), Retrieval Agent, and Large Language Models (LLMs) to enable contextual search, topic ranking, and characterization of research using customized prompt templates. A pilot study analyzing 223 urban science-related articles published in Nature Communications over the past decade highlights the effectiveness of our approach in generating insightful summary statistics on the quality, scope, and characteristics of papers in high-impact journals. This study introduces a new paradigm for enhancing bibliometric analysis and knowledge retrieval in urban research, positioning an AI agent as a powerful tool for advancing research evaluation and understanding.
Authors: Tarun Raheja, Nilay Pochhi
Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities in natural language processing tasks, but their vulnerability to jailbreak attacks poses significant security risks. This survey paper presents a comprehensive analysis of recent advancements in attack strategies and defense mechanisms within the field of Large Language Model (LLM) red-teaming. We analyze various attack methods, including gradient-based optimization, reinforcement learning, and prompt engineering approaches. We discuss the implications of these attacks on LLM safety and the need for improved defense mechanisms. This work aims to provide a thorough understanding of the current landscape of red-teaming attacks and defenses on LLMs, enabling the development of more secure and reliable language models.
Authors: Anastasiya Danilenka, Alireza Furutanpey, Victor Casamayor Pujol, Boris Sedlak, Anna Lackinger, Maria Ganzha, Marcin Paprzycki, Schahram Dustdar
Abstract: Handling heterogeneity and unpredictability are two core problems in pervasive computing. The challenge is to seamlessly integrate devices with varying computational resources in a dynamic environment to form a cohesive system that can fulfill the needs of all participants. Existing work on systems that adapt to changing requirements typically focuses on optimizing individual variables or low-level Service Level Objectives (SLOs), such as constraining the usage of specific resources. While low-level control mechanisms permit fine-grained control over a system, they introduce considerable complexity, particularly in dynamic environments. To this end, we propose drawing from Active Inference (AIF), a neuroscientific framework for designing adaptive agents. Specifically, we introduce a conceptual agent for heterogeneous pervasive systems that permits setting global systems constraints as high-level SLOs. Instead of manually setting low-level SLOs, the system finds an equilibrium that can adapt to environmental changes. We demonstrate the viability of AIF agents with an extensive experiment design, using heterogeneous and lifelong federated learning as an application scenario. We conduct our experiments on a physical testbed of devices with different resource types and vendor specifications. The results provide convincing evidence that an AIF agent can adapt a system to environmental changes. In particular, the AIF agent can balance competing SLOs in resource heterogeneous environments to ensure up to 98% fulfillment rate.
Authors: Tong Wu, Shujian Zhang, Kaiqiang Song, Silei Xu, Sanqiang Zhao, Ravi Agrawal, Sathish Reddy Indurthi, Chong Xiang, Prateek Mittal, Wenxuan Zhou
Abstract: Large Language Models (LLMs) are susceptible to security and safety threats, such as prompt injection, prompt extraction, and harmful requests. One major cause of these vulnerabilities is the lack of an instruction hierarchy. Modern LLM architectures treat all inputs equally, failing to distinguish between and prioritize various types of instructions, such as system messages, user prompts, and data. As a result, lower-priority user prompts may override more critical system instructions, including safety protocols. Existing approaches to achieving instruction hierarchy, such as delimiters and instruction-based training, do not address this issue at the architectural level. We introduce the Instructional Segment Embedding (ISE) technique, inspired by BERT, to modern large language models, which embeds instruction priority information directly into the model. This approach enables models to explicitly differentiate and prioritize various instruction types, significantly improving safety against malicious prompts that attempt to override priority rules. Our experiments on the Structured Query and Instruction Hierarchy benchmarks demonstrate an average robust accuracy increase of up to 15.75% and 18.68%, respectively. Furthermore, we observe an improvement in instruction-following capability of up to 4.1% evaluated on AlpacaEval. Overall, our approach offers a promising direction for enhancing the safety and effectiveness of LLM architectures.
Authors: Yixian Shen, Qi Bi, Jia-Hong Huang, Hongyi Zhu, Anuj Pathania
Abstract: In the era of large language models, parameter-efficient fine-tuning (PEFT) has been extensively studied. However, these approaches usually rely on the space domain, which encounters storage challenges especially when handling extensive adaptations or larger models. The frequency domain, in contrast, is more effective in compressing trainable parameters while maintaining the expressive capability. In this paper, we propose a novel Selective Discrete Cosine Transformation (sDCTFT) fine-tuning scheme to push this frontier. Its general idea is to exploit the superior energy compaction and decorrelation properties of DCT to improve both model efficiency and accuracy. Specifically, it projects the weight change from the low-rank adaptation into the discrete cosine space. Then, the weight change is partitioned over different levels of the discrete cosine spectrum, and the most critical frequency components in each partition are selected. Extensive experiments on four benchmark datasets demonstrate the superior accuracy, reduced computational cost, and lower storage requirements of the proposed method over the prior arts. For instance, when performing instruction tuning on the LLaMA3.1-8B model, sDCTFT outperforms LoRA with just 0.05M trainable parameters compared to LoRA's 38.2M, and surpasses FourierFT with 30\% less trainable parameters. The source code will be publicly available.
Authors: Han Trinh, Jordan Vice, Jason Charng, Zahra Tajbakhsh, Khyber Alam, Fred K. Chen, Ajmal Mian
Abstract: Inherited retinal diseases (IRDs) are a diverse group of genetic disorders that lead to progressive vision loss and are a major cause of blindness in working-age adults. The complexity and heterogeneity of IRDs pose significant challenges in diagnosis, prognosis, and management. Recent advancements in artificial intelligence (AI) offer promising solutions to these challenges. However, the rapid development of AI techniques and their varied applications have led to fragmented knowledge in this field. This review consolidates existing studies, identifies gaps, and provides an overview of AI's potential in diagnosing and managing IRDs. It aims to structure pathways for advancing clinical applications by exploring AI techniques like machine learning and deep learning, particularly in disease detection, progression prediction, and personalized treatment planning. Special focus is placed on the effectiveness of convolutional neural networks in these areas. Additionally, the integration of explainable AI is discussed, emphasizing its importance in clinical settings to improve transparency and trust in AI-based systems. The review addresses the need to bridge existing gaps in focused studies on AI's role in IRDs, offering a structured analysis of current AI techniques and outlining future research directions. It concludes with an overview of the challenges and opportunities in deploying AI for IRDs, highlighting the need for interdisciplinary collaboration and the continuous development of robust, interpretable AI models to advance clinical applications.
Authors: Kongyang Chen, Zeming Xu
Abstract: In recent years, research on the data trading market has been continuously deepened. In the transaction process, there is an information asymmetry process between agents and sellers. For sellers, direct data delivery faces the risk of privacy leakage. At the same time, sellers are not willing to provide data. A reasonable compensation method is needed to encourage sellers to provide data resources. For agents, the quality of data provided by sellers needs to be examined and evaluated. Otherwise, agents may consume too much cost and resources by recruiting sellers with poor data quality. Therefore, it is necessary to build a complete delivery process for the interaction between sellers and agents in the trading market so that the needs of sellers and agents can be met. The federated learning architecture is widely used in the data market due to its good privacy protection. Therefore, in this work, in response to the above challenges, we propose a transaction framework based on the federated learning architecture, and design a seller selection algorithm and incentive compensation mechanism. Specifically, we use gradient similarity and Shapley algorithm to fairly and accurately evaluate the contribution of sellers, and use the modified UCB algorithm to select sellers. After the training, fair compensation is made according to the seller's participation in the training. In view of the above work, we designed reasonable experiments for demonstration and obtained results, proving the rationality and effectiveness of the framework.
Authors: Qian Liu, Bing Gong, Xiaoran Zhuang, Xiaohui Zhong, Zhiming Kang, Hao Li
Abstract: The rapid advancement of artificial intelligence (AI) in weather research has been driven by the ability to learn from large, high-dimensional datasets. However, this progress also poses significant challenges, particularly regarding the substantial costs associated with processing extensive data and the limitations of computational resources. Inspired by the Neural Image Compression (NIC) task in computer vision, this study seeks to compress weather data to address these challenges and enhance the efficiency of downstream applications. Specifically, we propose a variational autoencoder (VAE) framework tailored for compressing high-resolution datasets, specifically the High Resolution China Meteorological Administration Land Data Assimilation System (HRCLDAS) with a spatial resolution of 1 km. Our framework successfully reduced the storage size of 3 years of HRCLDAS data from 8.61 TB to just 204 GB, while preserving essential information. In addition, we demonstrated the utility of the compressed data through a downscaling task, where the model trained on the compressed dataset achieved accuracy comparable to that of the model trained on the original data. These results highlight the effectiveness and potential of the compressed data for future weather research.
Authors: Jingyi Xu, Siwei Tu, Weidong Yang, Shuhao Li, Keyi Liu, Yeqi Luo, Lipeng Ma, Ben Fei, Lei Bai
Abstract: Variation of Arctic sea ice has significant impacts on polar ecosystems, transporting routes, coastal communities, and global climate. Tracing the change of sea ice at a finer scale is paramount for both operational applications and scientific studies. Recent pan-Arctic sea ice forecasting methods that leverage advances in artificial intelligence has made promising progress over numerical models. However, forecasting sea ice at higher resolutions is still under-explored. To bridge the gap, we propose a two-staged deep learning framework, IceDiff, to forecast sea ice concentration at finer scales. IceDiff first leverages an independently trained vision transformer to generate coarse yet superior forecasting over previous methods at a regular 25km x 25km grid. This high-quality sea ice forecasting can be utilized as reliable guidance for the next stage. Subsequently, an unconditional diffusion model pre-trained on sea ice concentration maps is utilized for sampling down-scaled sea ice forecasting via a zero-shot guided sampling strategy and a patch-based method. For the first time, IceDiff demonstrates sea ice forecasting with the 6.25km x 6.25km resolution. IceDiff extends the boundary of existing sea ice forecasting models and more importantly, its capability to generate high-resolution sea ice concentration data is vital for pragmatic usages and research.
Authors: Qianyue Hao, Jingyang Fan, Fengli Xu, Jian Yuan, Yong Li
Abstract: Citation networks are critical in modern science, and predicting which previous papers (candidates) will a new paper (query) cite is a critical problem. However, the roles of a paper's citations vary significantly, ranging from foundational knowledge basis to superficial contexts. Distinguishing these roles requires a deeper understanding of the logical relationships among papers, beyond simple edges in citation networks. The emergence of LLMs with textual reasoning capabilities offers new possibilities for discerning these relationships, but there are two major challenges. First, in practice, a new paper may select its citations from gigantic existing papers, where the texts exceed the context length of LLMs. Second, logical relationships between papers are implicit, and directly prompting an LLM to predict citations may result in surface-level textual similarities rather than the deeper logical reasoning. In this paper, we introduce the novel concept of core citation, which identifies the critical references that go beyond superficial mentions. Thereby, we elevate the citation prediction task from a simple binary classification to distinguishing core citations from both superficial citations and non-citations. To address this, we propose $\textbf{HLM-Cite}$, a $\textbf{H}$ybrid $\textbf{L}$anguage $\textbf{M}$odel workflow for citation prediction, which combines embedding and generative LMs. We design a curriculum finetune procedure to adapt a pretrained text embedding model to coarsely retrieve high-likelihood core citations from vast candidates and then design an LLM agentic workflow to rank the retrieved papers through one-shot reasoning, revealing the implicit relationships among papers. With the pipeline, we can scale the candidate sets to 100K papers. We evaluate HLM-Cite across 19 scientific fields, demonstrating a 17.6% performance improvement comparing SOTA methods.
Authors: Yanbiao Liang, Huihong Shi, Zhongfeng Wang
Abstract: Although Vision Transformers (ViTs) have achieved significant success, their intensive computations and substantial memory overheads challenge their deployment on edge devices. To address this, efficient ViTs have emerged, typically featuring Convolution-Transformer hybrid architectures to enhance both accuracy and hardware efficiency. While prior work has explored quantization for efficient ViTs to marry the best of efficient hybrid ViT architectures and quantization, it focuses on uniform quantization and overlooks the potential advantages of mixed quantization. Meanwhile, although several works have studied mixed quantization for standard ViTs, they are not directly applicable to hybrid ViTs due to their distinct algorithmic and hardware characteristics. To bridge this gap, we present M$^2$-ViT to accelerate Convolution-Transformer hybrid efficient ViTs with two-level mixed quantization. Specifically, we introduce a hardware-friendly two-level mixed quantization (M$^2$Q) strategy, characterized by both mixed quantization precision and mixed quantization schemes (i.e., uniform and power-of-two), to exploit the architectural properties of efficient ViTs. We further build a dedicated accelerator with heterogeneous computing engines to transform our algorithmic benefits into real hardware improvements. Experimental results validate our effectiveness, showcasing an average of $80\%$ energy-delay product (EDP) saving with comparable quantization accuracy compared to the prior work.
Authors: Andrey Anurin, Jonathan Ng, Kibo Schaffer, Ziyue Wang, Jason Schreiber, Esben Kran
Abstract: LLM agents have the potential to revolutionize defensive cyber operations, but their offensive capabilities are not yet fully understood. To prepare for emerging threats, model developers and governments are evaluating the cyber capabilities of foundation models. However, these assessments often lack transparency and a comprehensive focus on offensive capabilities. In response, we introduce the Catastrophic Cyber Capabilities Benchmark (3CB), a novel framework designed to rigorously assess the real-world offensive capabilities of LLM agents. Our evaluation of modern LLMs on 3CB reveals that frontier models, such as GPT-4o and Claude 3.5 Sonnet, can perform offensive tasks such as reconnaissance and exploitation across domains ranging from binary analysis to web technologies. Conversely, smaller open-source models exhibit limited offensive capabilities. Our software solution and the corresponding benchmark provides a critical tool to reduce the gap between rapidly improving capabilities and robustness of cyber offense evaluations, aiding in the safer deployment and regulation of these powerful technologies.
Authors: Sean Berry, Berk Gorgulu, Sait Tunc, Mucahit Cevik, Matthew J Ellis
Abstract: Kidney transplantation is the preferred treatment for end-stage renal disease, yet the scarcity of donors and inefficiencies in allocation systems create major bottlenecks, resulting in prolonged wait times and alarming mortality rates. Despite their severe scarcity, timely and effective interventions to prevent non-utilization of life-saving organs remain inadequate. Expedited out-of-sequence placement of hard-to-place kidneys to centers with the highest likelihood of utilizing them has been recommended in the literature as an effective strategy to improve placement success. Nevertheless, current attempts towards this practice is non-standardized and heavily rely on the subjective judgment of the decision-makers. This paper proposes a novel data-driven, machine learning-based ranking system for allocating hard-to-place kidneys to centers with a higher likelihood of accepting and successfully transplanting them. Using the national deceased donor kidney offer and transplant datasets, we construct a unique dataset with donor-, center-, and patient-specific features. We propose a data-driven out-of-sequence placement policy that utilizes machine learning models to predict the acceptance probability of a given kidney by a set of transplant centers, ranking them accordingly based on their likelihood of acceptance. Our experiments demonstrate that the proposed policy can reduce the average number of centers considered before placement by fourfold for all kidneys and tenfold for hard-to-place kidneys. This significant reduction indicates that our method can improve the utilization of hard-to-place kidneys and accelerate their acceptance, ultimately reducing patient mortality and the risk of graft failure. Further, we utilize machine learning interpretability tools to provide insights into factors influencing the kidney allocation decisions.
Authors: Shou Li, Andrey Kan, Laurent Callot, Bhavana Bhasker, Muhammad Shihab Rashid, Timothy B Esler
Abstract: As LLM-based agents exhibit exceptional capabilities in addressing complex problems, there is a growing focus on developing coding agents to tackle increasingly sophisticated tasks. Despite their promising performance, these coding agents often produce programs or modifications that contain runtime errors, which can cause code failures and are difficult for static analysis tools to detect. Enhancing the ability of coding agents to statically identify such errors could significantly improve their overall performance. In this work, we introduce Execution-free Runtime Error Detection for COding Agents (REDO), a method that integrates LLMs with static analysis tools to detect runtime errors for coding agents, without code execution. Additionally, we propose a benchmark task, SWE-Bench-Error-Detection (SWEDE), based on SWE-Bench (lite), to evaluate error detection in repository-level problems with complex external dependencies. Finally, through both quantitative and qualitative analyses across various error detection tasks, we demonstrate that REDO outperforms current state-of-the-art methods by achieving a 11.0% higher accuracy and 9.1% higher weighted F1 score; and provide insights into the advantages of incorporating LLMs for error detection.
Authors: Yonatan Sverdlov, Yair Davidson, Nadav Dym, Tal Amir
Abstract: Many of the most popular graph neural networks fall into the category of message-passing neural networks (MPNNs). Famously, MPNNs' ability to distinguish between graphs is limited to graphs separable by the Weisfeiler-Lemann (WL) graph isomorphism test, and the strongest MPNNs, in terms of separation power, are WL-equivalent. Recently, it was shown that the quality of separation provided by standard WL-equivalent MPNN can be very low, resulting in WL-separable graphs being mapped to very similar, hardly distinguishable features. This paper addresses this issue by seeking bi-Lipschitz continuity guarantees for MPNNs. We demonstrate that, in contrast with standard summation-based MPNNs, which lack bi-Lipschitz properties, our proposed model provides a bi-Lipschitz graph embedding with respect to two standard graph metrics. Empirically, we show that our MPNN is competitive with standard MPNNs for several graph learning tasks and is far more accurate in over-squashing long-range tasks.
Authors: Ran Liu, Zhongzhou Liu, Xiaoli Li, Yuan Fang
Abstract: Knowledge graphs (KGs) are instrumental in various real-world applications, yet they often suffer from incompleteness due to missing relations. To predict instances for novel relations with limited training examples, few-shot relation learning approaches have emerged, utilizing techniques such as meta-learning. However, the assumption is that novel relations in meta-testing and base relations in meta-training are independently and identically distributed, which may not hold in practice. To address the limitation, we propose RelAdapter, a context-aware adapter for few-shot relation learning in KGs designed to enhance the adaptation process in meta-learning. First, RelAdapter is equipped with a lightweight adapter module that facilitates relation-specific, tunable adaptation of meta-knowledge in a parameter-efficient manner. Second, RelAdapter is enriched with contextual information about the target relation, enabling enhanced adaptation to each distinct relation. Extensive experiments on three benchmark KGs validate the superiority of RelAdapter over state-of-the-art methods.
Authors: Aleksei Korneev (CRIStAL, MAGNET), Jan Ramon (CRIStAL, MAGNET)
Abstract: Federated Learning (FL) is a widespread approach that allows training machine learning (ML) models with data distributed across multiple devices. In cross-silo FL, which often appears in domains like healthcare or finance, the number of participants is moderate, and each party typically represents a well-known organization. For instance, in medicine data owners are often hospitals or data hubs which are well-established entities. However, malicious parties may still attempt to disturb the training procedure in order to obtain certain benefits, for example, a biased result or a reduction in computational load. While one can easily detect a malicious agent when data used for training is public, the problem becomes much more acute when it is necessary to maintain the privacy of the training dataset. To address this issue, there is recently growing interest in developing verifiable protocols, where one can check that parties do not deviate from the training procedure and perform computations correctly. In this paper, we present a systematization of knowledge on verifiable cross-silo FL. We analyze various protocols, fit them in a taxonomy, and compare their efficiency and threat models. We also analyze Zero-Knowledge Proof (ZKP) schemes and discuss how their overall cost in a FL context can be minimized. Lastly, we identify research gaps and discuss potential directions for future scientific work.
Authors: Yukun Jiang, Peiran Wang, Chengguo Lin, Ziyue Huang, Yong Cheng
Abstract: Two-party split learning has emerged as a popular paradigm for vertical federated learning. To preserve the privacy of the label owner, split learning utilizes a split model, which only requires the exchange of intermediate representations (IRs) based on the inputs and gradients for each IR between two parties during the learning process. However, split learning has recently been proven to survive label inference attacks. Though several defense methods could be adopted, they either have limited defensive performance or significantly negatively impact the original mission. In this paper, we propose a novel two-party split learning method to defend against existing label inference attacks while maintaining the high utility of the learned models. Specifically, we first craft a dimension transformation module, SecDT, which could achieve bidirectional mapping between original labels and increased K-class labels to mitigate label leakage from the directional perspective. Then, a gradient normalization algorithm is designed to remove the magnitude divergence of gradients from different classes. We propose a softmax-normalized Gaussian noise to mitigate privacy leakage and make our K unknowable to adversaries. We conducted experiments on real-world datasets, including two binary-classification datasets (Avazu and Criteo) and three multi-classification datasets (MNIST, FashionMNIST, CIFAR-10); we also considered current attack schemes, including direction, norm, spectral, and model completion attacks. The detailed experiments demonstrate our proposed method's effectiveness and superiority over existing approaches. For instance, on the Avazu dataset, the attack AUC of evaluated four prominent attacks could be reduced by 0.4532+-0.0127.
Authors: Riccardo Gallon, Fabian Schiemenz, Alessandra Menicucci, Eberhard Gill
Abstract: Traditional anomaly detection techniques onboard satellites are based on reliable, yet limited, thresholding mechanisms which are designed to monitor univariate signals and trigger recovery actions according to specific European Cooperation for Space Standardization (ECSS) standards. However, Artificial Intelligence-based Fault Detection, Isolation and Recovery (FDIR) solutions have recently raised with the prospect to overcome the limitations of these standard methods, expanding the range of detectable failures and improving response times. This paper presents a novel approach to detecting stuck values within the Accelerometer and Inertial Measurement Unit of a drone-like spacecraft for the exploration of Small Solar System Bodies (SSSB), leveraging a multi-channel Convolutional Neural Network (CNN) to perform multi-target classification and independently detect faults in the sensors. Significant attention has been dedicated to ensuring the compatibility of the algorithm within the onboard FDIR system, representing a step forward to the in-orbit validation of a technology that remains experimental until its robustness is thoroughly proven. An integration methodology is proposed to enable the network to effectively detect anomalies and trigger recovery actions at the system level. The detection performances and the capability of the algorithm in reaction triggering are evaluated employing a set of custom-defined detection and system metrics, showing the outstanding performances of the algorithm in performing its FDIR task.
Authors: Pengyu Zhang, Congfeng Cao, Klim Zaporojets, Paul Groth
Abstract: Knowledge graphs constantly evolve with new entities emerging, existing definitions being revised, and entity relationships changing. These changes lead to temporal degradation in entity linking models, characterized as a decline in model performance over time. To address this issue, we propose leveraging graph relationships to aggregate information from neighboring entities across different time periods. This approach enhances the ability to distinguish similar entities over time, thereby minimizing the impact of temporal degradation. We introduce \textbf{CYCLE}: \textbf{C}ross-\textbf{Y}ear \textbf{C}ontrastive \textbf{L}earning for \textbf{E}ntity-Linking. This model employs a novel graph contrastive learning method to tackle temporal performance degradation in entity linking tasks. Our contrastive learning method treats newly added graph relationships as \textit{positive} samples and newly removed ones as \textit{negative} samples. This approach helps our model effectively prevent temporal degradation, achieving a 13.90\% performance improvement over the state-of-the-art from 2023 when the time gap is one year, and a 17.79\% improvement as the gap expands to three years. Further analysis shows that CYCLE is particularly robust for low-degree entities, which are less resistant to temporal degradation due to their sparse connectivity, making them particularly suitable for our method. The code and data are made available at \url{https://github.com/pengyu-zhang/CYCLE-Cross-Year-Contrastive-Learning-in-Entity-Linking}.
URLs: https://github.com/pengyu-zhang/CYCLE-Cross-Year-Contrastive-Learning-in-Entity-Linking
Authors: Pengyu Zhang, Congfeng Cao, Paul Groth
Abstract: Knowledge graphs change over time, for example, when new entities are introduced or entity descriptions change. This impacts the performance of entity linking, a key task in many uses of knowledge graphs such as web search and recommendation. Specifically, entity linking models exhibit temporal degradation - their performance decreases the further a knowledge graph moves from its original state on which an entity linking model was trained. To tackle this challenge, we introduce \textbf{TIGER}: a \textbf{T}emporally \textbf{I}mproved \textbf{G}raph \textbf{E}ntity Linke\textbf{r}. By incorporating structural information between entities into the model, we enhance the learned representation, making entities more distinguishable over time. The core idea is to integrate graph-based information into text-based information, from which both distinct and shared embeddings are based on an entity's feature and structural relationships and their interaction. Experiments on three datasets show that our model can effectively prevent temporal degradation, demonstrating a 16.24\% performance boost over the state-of-the-art in a temporal setting when the time gap is one year and an improvement to 20.93\% as the gap expands to three years. The code and data are made available at \url{https://github.com/pengyu-zhang/TIGER-Temporally-Improved-Graph-Entity-Linker}.
URLs: https://github.com/pengyu-zhang/TIGER-Temporally-Improved-Graph-Entity-Linker
Authors: Shuai Liu, Ning Cao, Yile Chen, Yue Jiang, Gao Cong
Abstract: Next location prediction is a critical task in human mobility analysis and serves as a foundation for various downstream applications. Existing methods typically rely on discrete IDs to represent locations, which inherently overlook spatial relationships and cannot generalize across cities. In this paper, we propose NextLocLLM, which leverages the advantages of large language models (LLMs) in processing natural language descriptions and their strong generalization capabilities for next location prediction. Specifically, instead of using IDs, NextLocLLM encodes locations based on continuous spatial coordinates to better model spatial relationships. These coordinates are further normalized to enable robust cross-city generalization. Another highlight of NextlocLLM is its LLM-enhanced POI embeddings. It utilizes LLMs' ability to encode each POI category's natural language description into embeddings. These embeddings are then integrated via nonlinear projections to form this LLM-enhanced POI embeddings, effectively capturing locations' functional attributes. Furthermore, task and data prompt prefix, together with trajectory embeddings, are incorporated as input for partly-frozen LLM backbone. NextLocLLM further introduces prediction retrieval module to ensure structural consistency in prediction. Experiments show that NextLocLLM outperforms existing models in next location prediction, excelling in both supervised and zero-shot settings.
Authors: Hao Yan, Chaozhuo Li, Zhigang Yu, Jun Yin, Ruochen Liu, Peiyan Zhang, Weihao Han, Mingzheng Li, Zhengxin Zeng, Hao Sun, Weiwei Deng, Feng Sun, Qi Zhang, Senzhang Wang
Abstract: Multimodal attributed graphs (MAGs) are prevalent in various real-world scenarios and generally contain two kinds of knowledge: (a) Attribute knowledge is mainly supported by the attributes of different modalities contained in nodes (entities) themselves, such as texts and images. (b) Topology knowledge, on the other hand, is provided by the complex interactions posed between nodes. The cornerstone of MAG representation learning lies in the seamless integration of multimodal attributes and topology. Recent advancements in Pre-trained Language/Vision models (PLMs/PVMs) and Graph neural networks (GNNs) have facilitated effective learning on MAGs, garnering increased research interest. However, the absence of meaningful benchmark datasets and standardized evaluation procedures for MAG representation learning has impeded progress in this field. In this paper, we propose Multimodal Attribute Graph Benchmark (MAGB)}, a comprehensive and diverse collection of challenging benchmark datasets for MAGs. The MAGB datasets are notably large in scale and encompass a wide range of domains, spanning from e-commerce networks to social networks. In addition to the brand-new datasets, we conduct extensive benchmark experiments over MAGB with various learning paradigms, ranging from GNN-based and PLM-based methods, to explore the necessity and feasibility of integrating multimodal attributes and graph topology. In a nutshell, we provide an overview of the MAG datasets, standardized evaluation procedures, and present baseline experiments. The entire MAGB project is publicly accessible at https://github.com/sktsherlock/ATG.
Authors: Mingjun Wang, Remington Dechene
Abstract: The need for autonomous and adaptive defense mechanisms has become paramount in the rapidly evolving landscape of cyber threats. Multi-Agent Deep Reinforcement Learning (MADRL) presents a promising approach to enhancing the efficacy and resilience of autonomous cyber operations. This paper explores the application of Multi-Agent Actor-Critic algorithms which provides a general form in Multi-Agent learning to cyber defense, leveraging the collaborative interactions among multiple agents to detect, mitigate, and respond to cyber threats. We demonstrate each agent is able to learn quickly and counter act on the threats autonomously using MADRL in simulated cyber-attack scenarios. The results indicate that MADRL can significantly enhance the capability of autonomous cyber defense systems, paving the way for more intelligent cybersecurity strategies. This study contributes to the growing body of knowledge on leveraging artificial intelligence for cybersecurity and sheds light for future research and development in autonomous cyber operations.
Authors: Luyu Gao, Yunyi Zhang, Jamie Callan
Abstract: Long-context modeling is one of the critical capabilities of language AI for digesting and reasoning over complex information pieces. In practice, long-context capabilities are typically built into a pre-trained language model~(LM) through a carefully designed context extension stage, with the goal of producing generalist long-context capabilities. In our preliminary experiments, however, we discovered that the current open-weight generalist long-context models are still lacking in practical long-context processing tasks. While this means perfectly effective long-context modeling demands task-specific data, the cost can be prohibitive. In this paper, we draw inspiration from how humans process a large body of information: a lossy \textbf{retrieval} stage ranks a large set of documents while the reader ends up reading deeply only the top candidates. We build an \textbf{automatic} data synthesis pipeline that mimics this process using short-context LMs. The short-context LMs are further tuned using these self-generated data to obtain task-specific long-context capabilities. Similar to how pre-training learns from imperfect data, we hypothesize and further demonstrate that the short-context model can bootstrap over the synthetic data, outperforming not only long-context generalist models but also the retrieval and read pipeline used to synthesize the training data in real-world tasks such as long-context retrieval augmented generation.
Authors: Wei Li, Luyao Zhu, Yang Song, Ruixi Lin, Rui Mao, Yang You
Abstract: Large language models (LLMs) have gained human trust due to their capabilities and helpfulness. However, this in turn may allow LLMs to affect users' mindsets by manipulating language. It is termed as gaslighting, a psychological effect. In this work, we aim to investigate the vulnerability of LLMs under prompt-based and fine-tuning-based gaslighting attacks. Therefore, we propose a two-stage framework DeepCoG designed to: 1) elicit gaslighting plans from LLMs with the proposed DeepGaslighting prompting template, and 2) acquire gaslighting conversations from LLMs through our Chain-of-Gaslighting method. The gaslighting conversation dataset along with a corresponding safe dataset is applied to fine-tuning-based attacks on open-source LLMs and anti-gaslighting safety alignment on these LLMs. Experiments demonstrate that both prompt-based and fine-tuning-based attacks transform three open-source LLMs into gaslighters. In contrast, we advanced three safety alignment strategies to strengthen (by 12.05%) the safety guardrail of LLMs. Our safety alignment strategies have minimal impacts on the utility of LLMs. Empirical studies indicate that an LLM may be a potential gaslighter, even if it passed the harmfulness test on general dangerous queries.
Authors: Noorbakhsh Amiri Golilarz, Elias Hossain, Abdoljalil Addeh, Keyan Alexander Rahimi
Abstract: In this paper, we discuss learning algorithms and their importance in different types of applications which includes training to identify important patterns and features in a straightforward, easy-to-understand manner. We will review the main concepts of artificial intelligence (AI), machine learning (ML), deep learning (DL), and hybrid models. Some important subsets of Machine Learning algorithms such as supervised, unsupervised, and reinforcement learning are also discussed in this paper. These techniques can be used for some important tasks like prediction, classification, and segmentation. Convolutional Neural Networks (CNNs) are used for image and video processing and many more applications. We dive into the architecture of CNNs and how to integrate CNNs with ML algorithms to build hybrid models. This paper explores the vulnerability of learning algorithms to noise, leading to misclassification. We further discuss the integration of learning algorithms with Large Language Models (LLM) to generate coherent responses applicable to many domains such as healthcare, marketing, and finance by learning important patterns from large volumes of data. Furthermore, we discuss the next generation of learning algorithms and how we may have an unified Adaptive and Dynamic Network to perform important tasks. Overall, this article provides brief overview of learning algorithms, exploring their current state, applications and future direction.
Authors: Vishnu Sarukkai, Brennan Shacklett, Zander Majercik, Kush Bhatia, Christopher R\'e, Kayvon Fatahalian
Abstract: Large Language Models (LLMs) have the potential to automate reward engineering by leveraging their broad domain knowledge across various tasks. However, they often need many iterations of trial-and-error to generate effective reward functions. This process is costly because evaluating every sampled reward function requires completing the full policy optimization process for each function. In this paper, we introduce an LLM-driven reward generation framework that is able to produce state-of-the-art policies on the challenging Bi-DexHands benchmark \textbf{with 20$\times$ fewer reward function samples} than the prior state-of-the-art work. Our key insight is that we reduce the problem of generating task-specific rewards to the problem of coarsely estimating \emph{task progress}. Our two-step solution leverages the task domain knowledge and the code synthesis abilities of LLMs to author \emph{progress functions} that estimate task progress from a given state. Then, we use this notion of progress to discretize states, and generate count-based intrinsic rewards using the low-dimensional state space. We show that the combination of LLM-generated progress functions and count-based intrinsic rewards is essential for our performance gains, while alternatives such as generic hash-based counts or using progress directly as a reward function fall short.
Authors: Stephen MacNeil, Magdalena Rogalska, Juho Leinonen, Paul Denny, Arto Hellas, Xandria Crosland
Abstract: Large language models (LLMs) present an exciting opportunity for generating synthetic classroom data. Such data could include code containing a typical distribution of errors, simulated student behaviour to address the cold start problem when developing education tools, and synthetic user data when access to authentic data is restricted due to privacy reasons. In this research paper, we conduct a comparative study examining the distribution of bugs generated by LLMs in contrast to those produced by computing students. Leveraging data from two previous large-scale analyses of student-generated bugs, we investigate whether LLMs can be coaxed to exhibit bug patterns that are similar to authentic student bugs when prompted to inject errors into code. The results suggest that unguided, LLMs do not generate plausible error distributions, and many of the generated errors are unlikely to be generated by real students. However, with guidance including descriptions of common errors and typical frequencies, LLMs can be shepherded to generate realistic distributions of errors in synthetic code.
Authors: Petar Radanliev, David De Roure, Carsten Maple, Jason R. C. Nurse, Razvan Nicolescu, Uchenna Ani
Abstract: We present a dependency model tailored to the context of current challenges in data strategies and make recommendations for the cybersecurity community. The model can be used for cyber risk estimation and assessment and generic risk impact assessment.
Authors: Athanasios Tsiligkaridis, Nicholas Kalinowski, Zhongheng Li, Elizabeth Hou
Abstract: Spatiotemporal data faces many analogous challenges to natural language text including the ordering of locations (words) in a sequence, long range dependencies between locations, and locations having multiple meanings. In this work, we propose a novel model for representing high dimensional spatiotemporal trajectories as sequences of discrete locations and encoding them with a Transformer-based neural network architecture. Similar to language models, our Sequence Transformer for Agent Representation Encodings (STARE) model can learn representations and structure in trajectory data through both supervisory tasks (e.g., classification), and self-supervisory tasks (e.g., masked modelling). We present experimental results on various synthetic and real trajectory datasets and show that our proposed model can learn meaningful encodings that are useful for many downstream tasks including discriminating between labels and indicating similarity between locations. Using these encodings, we also learn relationships between agents and locations present in spatiotemporal data.
Authors: Nicolas Legrand, Lilian Weber, Peter Thestrup Waade, Anna Hedvig M{\o}ller Daugaard, Mojtaba Khodadadi, Nace Miku\v{s}, Chris Mathys
Abstract: Bayesian models of cognition have gained considerable traction in computational neuroscience and psychiatry. Their scopes are now expected to expand rapidly to artificial intelligence, providing general inference frameworks to support embodied, adaptable, and energy-efficient autonomous agents. A central theory in this domain is predictive coding, which posits that learning and behaviour are driven by hierarchical probabilistic inferences about the causes of sensory inputs. Biological realism constrains these networks to rely on simple local computations in the form of precision-weighted predictions and prediction errors. This can make this framework highly efficient, but its implementation comes with unique challenges on the software development side. Embedding such models in standard neural network libraries often becomes limiting, as these libraries' compilation and differentiation backends can force a conceptual separation between optimization algorithms and the systems being optimized. This critically departs from other biological principles such as self-monitoring, self-organisation, cellular growth and functional plasticity. In this paper, we introduce \texttt{pyhgf}: a Python package backed by JAX and Rust for creating, manipulating and sampling dynamic networks for predictive coding. We improve over other frameworks by enclosing the network components as transparent, modular and malleable variables in the message-passing steps. The resulting graphs can implement arbitrary computational complexities as beliefs propagation. But the transparency of core variables can also translate into inference processes that leverage self-organisation principles, and express structure learning, meta-learning or causal discovery as the consequence of network structural adaptation to surprising inputs. The code, tutorials and documentation are hosted at: https://github.com/ilabcode/pyhgf.
Authors: Mishal Fatima Minhas, Rachmad Vidya Wicaksana Putra, Falah Awwad, Osman Hasan, Muhammad Shafique
Abstract: To adapt to real-world dynamics, intelligent systems need to assimilate new knowledge without catastrophic forgetting, where learning new tasks leads to a degradation in performance on old tasks. To address this, continual learning concept is proposed for enabling autonomous systems to acquire new knowledge and dynamically adapt to changing environments. Specifically, energy-efficient continual learning is needed to ensure the functionality of autonomous systems under tight compute and memory resource budgets (i.e., so-called autonomous embedded systems). Neuromorphic computing, with brain-inspired Spiking Neural Networks (SNNs), offers inherent advantages for enabling low-power/energy continual learning in autonomous embedded systems. In this paper, we comprehensively discuss the foundations and methods for enabling continual learning in neural networks, then analyze the state-of-the-art works considering SNNs. Afterward, comparative analyses of existing methods are conducted while considering crucial design factors, such as network complexity, memory, latency, and power/energy efficiency. We also explore the practical applications that can benefit from SNN-based continual learning and open challenges in real-world scenarios. In this manner, our survey provides valuable insights into the recent advancements of SNN-based continual learning for real-world application use-cases.
Authors: Ruochen Zhang, Qinan Yu, Matianyu Zang, Carsten Eickhoff, Ellie Pavlick
Abstract: We employ new tools from mechanistic interpretability in order to ask whether the internal structure of large language models (LLMs) shows correspondence to the linguistic structures which underlie the languages on which they are trained. In particular, we ask (1) when two languages employ the same morphosyntactic processes, do LLMs handle them using shared internal circuitry? and (2) when two languages require different morphosyntactic processes, do LLMs handle them using different internal circuitry? Using English and Chinese multilingual and monolingual models, we analyze the internal circuitry involved in two tasks. We find evidence that models employ the same circuit to handle the same syntactic process independently of the language in which it occurs, and that this is the case even for monolingual models trained completely independently. Moreover, we show that multilingual models employ language-specific components (attention heads and feed-forward networks) when needed to handle linguistic processes (e.g., morphological marking) that only exist in some languages. Together, our results provide new insights into how LLMs trade off between exploiting common structures and preserving linguistic differences when tasked with modeling multiple languages simultaneously.
Authors: Omer Moussa, Dietrich Klakow, Mariya Toneva
Abstract: Speech-language models align impressively with human brain responses to natural language. However, current models rely heavily on low-level speech features, indicating they lack brain-relevant semantics, limiting their utility as models of semantic processing in the brain. In this work, we address this limitation by inducing brain-relevant bias into the models via fine-tuning with fMRI recordings of people listening to natural stories, a process we call brain-tuning. After testing it on three different pretrained backbones, we show that brain-tuning improves alignment with new brain recordings in semantic language regions and reduces reliance on low-level speech features. Notably, brain-tuning leads to 1) consistent improvements in performance across various downstream tasks and 2) a representational space with increased semantic preference. Our results provide the first evidence that incorporating brain signals into the training of language models improves their semantic understanding.
Authors: C. Civili, E. Sherkhonov, R. E. K. Stirewalt
Abstract: Ontologies are known to improve the accuracy of Large Language Models (LLMs) when translating natural language queries into a formal query language like SQL or SPARQL. There are two ways to leverage ontologies when working with LLMs. One is to fine-tune the model, i.e., to enhance it with specific domain knowledge. Another is the zero-shot prompting approach, where the ontology is provided as part of the input question. Unfortunately, modern enterprises typically have ontologies that are too large to fit in a prompt due to LLM's token size limitations. We present a solution that incrementally reveals "just enough" of an ontology that is needed to answer a given question.
Authors: Denis Gudovskiy, Tomoyuki Okuno, Yohei Nakata
Abstract: Continuous normalizing flows (CNFs) can model data distributions with expressive infinite-length architectures. But this modeling involves computationally expensive process of solving an ordinary differential equation (ODE) during maximum likelihood training. Recently proposed flow matching (FM) framework allows to substantially simplify the training phase using a regression objective with the interpolated forward vector field. In this paper, we propose an interpolant-free dual flow matching (DFM) approach without explicit assumptions about the modeled vector field. DFM optimizes the forward and, additionally, a reverse vector field model using a novel objective that facilitates bijectivity of the forward and reverse transformations. Our experiments with the SMAP unsupervised anomaly detection show advantages of DFM when compared to the CNF trained with either maximum likelihood or FM objectives with the state-of-the-art performance metrics.
Authors: Jacob Haimes, Cenny Wenner, Kunvar Thaman, Vassil Tashev, Clement Neo, Esben Kran, Jason Schreiber
Abstract: The training data for many Large Language Models (LLMs) is contaminated with test data. This means that public benchmarks used to assess LLMs are compromised, suggesting a performance gap between benchmark scores and actual capabilities. Ideally, a private holdout set could be used to accurately verify scores. Unfortunately, such datasets do not exist for most benchmarks, and post-hoc construction of sufficiently similar datasets is non-trivial. To address these issues, we introduce a systematic methodology for (i) retrospectively constructing a holdout dataset for a target dataset, (ii) demonstrating the statistical indistinguishability of this retro-holdout dataset, and (iii) comparing LLMs on the two datasets to quantify the performance gap due to the dataset's public availability. Applying these methods to TruthfulQA, we construct and release Retro-Misconceptions, on which we evaluate twenty LLMs and find that some have inflated scores by as much as 16 percentage points. Our results demonstrate that public benchmark scores do not always accurately assess model properties, and underscore the importance of improved data practices in the field.
Authors: Chu-Hsuan Abraham Lin, Chen-Yu Liu, Samuel Yen-Chi Chen, Kuan-Cheng Chen
Abstract: The rise of deepfake technologies has posed significant challenges to privacy, security, and information integrity, particularly in audio and multimedia content. This paper introduces a Quantum-Trained Convolutional Neural Network (QT-CNN) framework designed to enhance the detection of deepfake audio, leveraging the computational power of quantum machine learning (QML). The QT-CNN employs a hybrid quantum-classical approach, integrating Quantum Neural Networks (QNNs) with classical neural architectures to optimize training efficiency while reducing the number of trainable parameters. Our method incorporates a novel quantum-to-classical parameter mapping that effectively utilizes quantum states to enhance the expressive power of the model, achieving up to 70% parameter reduction compared to classical models without compromising accuracy. Data pre-processing involved extracting essential audio features, label encoding, feature scaling, and constructing sequential datasets for robust model evaluation. Experimental results demonstrate that the QT-CNN achieves comparable performance to traditional CNNs, maintaining high accuracy during training and testing phases across varying configurations of QNN blocks. The QT framework's ability to reduce computational overhead while maintaining performance underscores its potential for real-world applications in deepfake detection and other resource-constrained scenarios. This work highlights the practical benefits of integrating quantum computing into artificial intelligence, offering a scalable and efficient approach to advancing deepfake detection technologies.
Authors: Minh Pham Dinh, Munira Syed, Michael G Yankoski, Trenton W. Ford
Abstract: Planning and performing interactive tasks, such as conducting experiments to determine the melting point of an unknown substance, is straightforward for humans but poses significant challenges for autonomous agents. We introduce ReasonPlanner, a novel generalist agent designed for reflective thinking, planning, and interactive reasoning. This agent leverages LLMs to plan hypothetical trajectories by building a World Model based on a Temporal Knowledge Graph. The agent interacts with the environment using a natural language actor-critic module, where the actor translates the imagined trajectory into a sequence of actionable steps, and the critic determines if replanning is necessary. ReasonPlanner significantly outperforms previous state-of-the-art prompting-based methods on the ScienceWorld benchmark by more than 1.8 times, while being more sample-efficient and interpretable. It relies solely on frozen weights thus requiring no gradient updates. ReasonPlanner can be deployed and utilized without specialized knowledge of Machine Learning, making it accessible to a wide range of users.
Authors: Anastasiia Birillo, Elizaveta Artser, Anna Potriasaeva, Ilya Vlasov, Katsiaryna Dzialets, Yaroslav Golubev, Igor Gerasimov, Hieke Keuning, Timofey Bryksin
Abstract: Students often struggle with solving programming problems when learning to code, especially when they have to do it online, with one of the most common disadvantages of working online being the lack of personalized help. This help can be provided as next-step hint generation, i.e., showing a student what specific small step they need to do next to get to the correct solution. There are many ways to generate such hints, with large language models (LLMs) being among the most actively studied right now. While LLMs constitute a promising technology for providing personalized help, combining them with other techniques, such as static analysis, can significantly improve the output quality. In this work, we utilize this idea and propose a novel system to provide both textual and code hints for programming tasks. The pipeline of the proposed approach uses a chain-of-thought prompting technique and consists of three distinct steps: (1) generating subgoals - a list of actions to proceed with the task from the current student's solution, (2) generating the code to achieve the next subgoal, and (3) generating the text to describe this needed action. During the second step, we apply static analysis to the generated code to control its size and quality. The tool is implemented as a modification to the open-source JetBrains Academy plugin, supporting students in their in-IDE courses. To evaluate our approach, we propose a list of criteria for all steps in our pipeline and conduct two rounds of expert validation. Finally, we evaluate the next-step hints in a classroom with 14 students from two universities. Our results show that both forms of the hints - textual and code - were helpful for the students, and the proposed system helped them to proceed with the coding tasks.
Authors: Jeremy Lucas, Isabeau Pr\'emont-Schwarz
Abstract: This paper presents the Articulated Animal AI Environment for Animal Cognition, an enhanced version of the previous AnimalAI Environment. Key improvements include the addition of agent limbs, enabling more complex behaviors and interactions with the environment that closely resemble real animal movements. The testbench features an integrated curriculum training sequence and evaluation tools, eliminating the need for users to develop their own training programs. Additionally, the tests and training procedures are randomized, which will improve the agent's generalization capabilities. These advancements significantly expand upon the original AnimalAI framework and will be used to evaluate agents on various aspects of animal cognition.
Authors: Harsh Mahesheka, Zhixian Xie, Zhaoran Wang, Wanxin Jin
Abstract: Learning from Demonstrations, particularly from biological experts like humans and animals, often encounters significant data acquisition challenges. While recent approaches leverage internet videos for learning, they require complex, task-specific pipelines to extract and retarget motion data for the agent. In this work, we introduce a language-model-assisted bi-level programming framework that enables a reinforcement learning agent to directly learn its reward from internet videos, bypassing dedicated data preparation. The framework includes two levels: an upper level where a vision-language model (VLM) provides feedback by comparing the learner's behavior with expert videos, and a lower level where a large language model (LLM) translates this feedback into reward updates. The VLM and LLM collaborate within this bi-level framework, using a "chain rule" approach to derive a valid search direction for reward learning. We validate the method for reward learning from YouTube videos, and the results have shown that the proposed method enables efficient reward design from expert videos of biological agents for complex behavior synthesis.
Authors: Jinjin Cai, Ruiqi Wang, Dezhong Zhao, Ziqin Yuan, Victoria McKenna, Aaron Friedman, Rachel Foot, Susan Storey, Ryan Boente, Sudip Vhaduri, Byung-Cheol Min
Abstract: Audio-based disease prediction is emerging as a promising supplement to traditional medical diagnosis methods, facilitating early, convenient, and non-invasive disease detection and prevention. Multimodal fusion, which integrates features from various domains within or across bio-acoustic modalities, has proven effective in enhancing diagnostic performance. However, most existing methods in the field employ unilateral fusion strategies that focus solely on either intra-modal or inter-modal fusion. This approach limits the full exploitation of the complementary nature of diverse acoustic feature domains and bio-acoustic modalities. Additionally, the inadequate and isolated exploration of latent dependencies within modality-specific and modality-shared spaces curtails their capacity to manage the inherent heterogeneity in multimodal data. To fill these gaps, we propose AuD-Former, a hierarchical transformer network designed for general multimodal audio-based disease prediction. Specifically, we seamlessly integrate intra-modal and inter-modal fusion in a hierarchical manner and proficiently encode the necessary intra-modal and inter-modal complementary correlations, respectively. Comprehensive experiments demonstrate that AuD-Former achieves state-of-the-art performance in predicting three diseases: COVID-19, Parkinson's disease, and pathological dysarthria, showcasing its promising potential in a broad context of audio-based disease prediction tasks. Additionally, extensive ablation studies and qualitative analyses highlight the significant benefits of each main component within our model.
Authors: Gary Tom, Stanley Lo, Samantha Corapi, Alan Aspuru-Guzik, Benjamin Sanchez-Lengeling
Abstract: Bayesian optimization (BO) has become an indispensable tool for autonomous decision-making across diverse applications from autonomous vehicle control to accelerated drug and materials discovery. With the growing interest in self-driving laboratories, BO of chemical systems is crucial for machine learning (ML) guided experimental planning. Typically, BO employs a regression surrogate model to predict the distribution of unseen parts of the search space. However, for the selection of molecules, picking the top candidates with respect to a distribution, the relative ordering of their properties may be more important than their exact values. In this paper, we introduce Rank-based Bayesian Optimization (RBO), which utilizes a ranking model as the surrogate. We present a comprehensive investigation of RBO's optimization performance compared to conventional BO on various chemical datasets. Our results demonstrate similar or improved optimization performance using ranking models, particularly for datasets with rough structure-property landscapes and activity cliffs. Furthermore, we observe a high correlation between the surrogate ranking ability and BO performance, and this ability is maintained even at early iterations of BO optimization when using ranking surrogate models. We conclude that RBO is an effective alternative to regression-based BO, especially for optimizing novel chemical compounds.
Authors: Yu Fei, Yasaman Razeghi, Sameer Singh
Abstract: Large language models (LLMs) require alignment, such as instruction-tuning or reinforcement learning from human feedback, to effectively and safely follow user instructions. This process necessitates training aligned versions for every model size in each model family, resulting in significant computational overhead. In this work, we propose nudging, a simple, plug-and-play, and training-free algorithm that aligns any base model at inference time using a small aligned model. Nudging is motivated by recent findings that alignment primarily alters the model's behavior on a small subset of stylistic tokens, such as "Sure" or "Thank". We find that base models are significantly more uncertain when generating these tokens. Leveraging this observation, nudging employs a small aligned model to generate nudging tokens to steer the large base model's output toward desired directions when the base model's uncertainty is high. We evaluate the effectiveness of nudging across 3 model families and 13 tasks, covering reasoning, general knowledge, instruction following, and safety benchmarks. Without any additional training, nudging a large base model with a 7x - 14x smaller aligned model achieves zero-shot performance comparable to, and sometimes surpassing, that of large aligned models. For example, nudging OLMo-7b with OLMo-1b-instruct, affecting less than 9% of tokens, achieves a 10% absolute improvement on GSM8K over OLMo-7b-instruct. Unlike prior inference-time tuning methods, nudging enables off-the-shelf collaboration between model families. For instance, nudging Gemma-2-27b with Llama-2-7b-chat outperforms Llama-2-70b-chat on various tasks. Overall, this work introduces a simple yet powerful approach to token-level model collaboration, offering a modular solution to LLM alignment. Our project website: https://fywalter.github.io/nudging/ .
Authors: Guanlin Liu, Kaixuan Ji, Renjie Zheng, Zheng Wu, Chen Dun, Quanquan Gu, Lin Yan
Abstract: Reinforcement Learning (RL) plays a crucial role in aligning large language models (LLMs) with human preferences and improving their ability to perform complex tasks. However, current approaches either require significant computational resources due to the use of multiple models and extensive online sampling for training (e.g., PPO) or are framed as bandit problems (e.g., DPO, DRO), which often struggle with multi-step reasoning tasks, such as math problem-solving and complex reasoning that involve long chains of thought. To overcome these limitations, we introduce Direct Q-function Optimization (DQO), which formulates the response generation process as a Markov Decision Process (MDP) and utilizes the soft actor-critic (SAC) framework to optimize a Q-function directly parameterized by the language model. The MDP formulation of DQO offers structural advantages over bandit-based methods, enabling more effective process supervision. Experimental results on two math problem-solving datasets, GSM8K and MATH, demonstrate that DQO outperforms previous methods, establishing it as a promising offline reinforcement learning approach for aligning language models.
Authors: Paulo Coelho, Raul Araju, Lu\'is Ramos, Samir Saliba, Renato Vimieiro
Abstract: This paper introduces a novel Graph Neural Network (GNN) architecture for time series classification, based on visibility graph representations. Traditional time series classification methods often struggle with high computational complexity and inadequate capture of spatio-temporal dynamics. By representing time series as visibility graphs, it is possible to encode both spatial and temporal dependencies inherent to time series data, while being computationally efficient. Our architecture is fully modular, enabling flexible experimentation with different models and representations. We employ directed visibility graphs encoded with in-degree and PageRank features to improve the representation of time series, ensuring efficient computation while enhancing the model's ability to capture long-range dependencies in the data. We show the robustness and generalization capability of the proposed architecture across a diverse set of classification tasks and against a traditional model. Our work represents a significant advancement in the application of GNNs for time series analysis, offering a powerful and flexible framework for future research and practical implementations.
Authors: Qianyi Deng, Oishi Deb, Amir Patel, Christian Rupprecht, Philip Torr, Niki Trigoni, Andrew Markham
Abstract: Animal pose estimation (APE) aims to locate the animal body parts using a diverse array of sensor and modality inputs, which is crucial for research across neuroscience, biomechanics, and veterinary medicine. By evaluating 178 papers since 2013, APE methods are categorised by sensor and modality types, learning paradigms, experimental setup, and application domains, presenting detailed analyses of current trends, challenges, and future directions in single- and multi-modality APE systems. The analysis also highlights the transition between human and animal pose estimation. Additionally, 2D and 3D APE datasets and evaluation metrics based on different sensors and modalities are provided. A regularly updated project page is provided here: https://github.com/ChennyDeng/MM-APE.
Authors: Debanjan Ghosh, Sophia Chan
Abstract: We present \llinstruct: An 8B instruction-tuned model that is designed to generate content for English Language Proficiency Assessments (ELPA) and related applications. Our work involves creating a new dataset of 70K instructions and explanations in the ELPA domain and using these to fine-tune Llama-3 8B models (SFT) of different sizes (e.g., SFT-17K, SFT-50K and SFT-70K). Human evaluations are conducted over unseen instructions to compare these SFT models against SOTA models (e.g., Dolly-2, Mistral, Llama-3 base version, and GPT-3.5). The findings show although all three SFT models perform comparably, the model trained on largest instruction dataset -- SFT-70K - leads to the most valid outputs ready for assessments. However, although the SFT models perform better than larger model, e.g., GPT 3.5 on the aspect of explanations of outputs, many outputs still need human interventions to make them actual ready for real world assessments.
Authors: Maisha Maliha, Vishal Pramanik
Abstract: Automatic essay grading (AEG) has attracted the the attention of the NLP community because of its applications to several educational applications, such as scoring essays, short answers, etc. AEG systems can save significant time and money when grading essays. In the existing works, the essays are graded where a single network is responsible for the whole process, which may be ineffective because a single network may not be able to learn all the features of a human-written essay. In this work, we have introduced a new model that outperforms the state-of-the-art models in the field of AEG. We have used the concept of collaborative and transfer learning, where one network will be responsible for checking the grammatical and structural features of the sentences of an essay while another network is responsible for scoring the overall idea present in the essay. These learnings are transferred to another network to score the essay. We also compared the performances of the different models mentioned in our work, and our proposed model has shown the highest accuracy of 85.50%.
Authors: Sudhakar Sah, Ravish Kumar, Honnesh Rohmetra, Ehsan Saboori
Abstract: High runtime memory and high latency puts significant constraint on Vision Transformer training and inference, especially on edge devices. Token pruning reduces the number of input tokens to the ViT based on importance criteria of each token. We present a Background Aware Vision Transformer (BAViT) model, a pre-processing block to object detection models like DETR/YOLOS aimed to reduce runtime memory and increase throughput by using a novel approach to identify background tokens in the image. The background tokens can be pruned completely or partially before feeding to a ViT based object detector. We use the semantic information provided by segmentation map and/or bounding box annotation to train a few layers of ViT to classify tokens to either foreground or background. Using 2 layers and 10 layers of BAViT, background and foreground tokens can be separated with 75% and 88% accuracy on VOC dataset and 71% and 80% accuracy on COCO dataset respectively. We show a 2 layer BAViT-small model as pre-processor to YOLOS can increase the throughput by 30% - 40% with a mAP drop of 3% without any sparse fine-tuning and 2% with sparse fine-tuning. Our approach is specifically targeted for Edge AI use cases.
Authors: Tingyu Xia, Bowen Yu, Kai Dang, An Yang, Yuan Wu, Yuan Tian, Yi Chang, Junyang Lin
Abstract: Supervised fine-tuning (SFT) is crucial for aligning Large Language Models (LLMs) with human instructions. The primary goal during SFT is to select a small yet representative subset of training data from the larger pool, such that fine-tuning with this subset achieves results comparable to or even exceeding those obtained using the entire dataset. However, most existing data selection techniques are designed for small-scale data pools, which fail to meet the demands of real-world SFT scenarios. In this paper, we replicated several self-scoring methods those that do not rely on external model assistance on two million scale datasets, and found that nearly all methods struggled to significantly outperform random selection when dealing with such large-scale data pools. Moreover, our comparisons suggest that, during SFT, diversity in data selection is more critical than simply focusing on high quality data. We also analyzed the limitations of several current approaches, explaining why they perform poorly on large-scale datasets and why they are unsuitable for such contexts. Finally, we found that filtering data by token length offers a stable and efficient method for improving results. This approach, particularly when training on long text data, proves highly beneficial for relatively weaker base models, such as Llama3.
Authors: Amit Kumar Singh, Trapti Shrivastava, Vrijendra Singh
Abstract: Deep learning and advancements in contactless sensors have significantly enhanced our ability to understand complex human activities in healthcare settings. In particular, deep learning models utilizing computer vision have been developed to enable detailed analysis of human gesture recognition, especially repetitive gestures which are commonly observed behaviors in children with autism. This research work aims to identify repetitive behaviors indicative of autism by analyzing videos captured in natural settings as children engage in daily activities. The focus is on accurately categorizing real-time repetitive gestures such as spinning, head banging, and arm flapping. To this end, we utilize the publicly accessible Self-Stimulatory Behavior Dataset (SSBD) to classify these stereotypical movements. A key component of the proposed methodology is the use of \textbf{VideoMAE}, a model designed to improve both spatial and temporal analysis of video data through a masking and reconstruction mechanism. This model significantly outperformed traditional methods, achieving an accuracy of 97.7\%, a 14.7\% improvement over the previous state-of-the-art.
Authors: Wenlong Deng, Yize Zhao, Vala Vakilian, Minghui Chen, Xiaoxiao Li, Christos Thrampoulidis
Abstract: Storing open-source fine-tuned models separately introduces redundancy and increases response times in applications utilizing multiple models. Delta-parameter pruning (DPP), particularly the random drop and rescale (DARE) method proposed by Yu et al., addresses this by pruning the majority of delta parameters--the differences between fine-tuned and pre-trained model weights--while typically maintaining minimal performance loss. However, DARE fails when either the pruning rate or the magnitude of the delta parameters is large. We highlight two key reasons for this failure: (1) an excessively large rescaling factor as pruning rates increase, and (2) high mean and variance in the delta parameters. To push DARE's limits, we introduce DAREx (DARE the eXtreme), which features two algorithmic improvements: (1) DAREx-q, a rescaling factor modification that significantly boosts performance at high pruning rates (e.g., >30 % on COLA and SST2 for encoder models, with even greater gains in decoder models), and (2) DAREx-L2, which combines DARE with AdamR, an in-training method that applies appropriate delta regularization before DPP. We also demonstrate that DAREx-q can be seamlessly combined with vanilla parameter-efficient fine-tuning techniques like LoRA and can facilitate structural DPP. Additionally, we revisit the application of importance-based pruning techniques within DPP, demonstrating that they outperform random-based methods when delta parameters are large. Through this comprehensive study, we develop a pipeline for selecting the most appropriate DPP method under various practical scenarios.
Authors: Zhizhen Zhang, Ruihong Qiu, Xiaohui Xie
Abstract: On social media sharing platforms, some posts are inherently destined for popularity. Therefore, understanding the reasons behind this phenomenon and predicting popularity before post publication holds significant practical value. The previous work predominantly focuses on enhancing post content extraction for better prediction results. However, certain factors introduced by social platforms also impact post popularity, which has not been extensively studied. For instance, users are more likely to engage with posts from individuals they follow, potentially influencing the popularity of these posts. We term these factors, unrelated to the explicit attractiveness of content, as implicit social factors. Through the analysis of users' post browsing behavior (also validated in public datasets), we propose three implicit social factors related to popularity, including content relevance, user influence similarity, and user identity. To model the proposed social factors, we introduce three supervised contrastive learning tasks. For different task objectives and data types, we assign them to different encoders and control their gradient flows to achieve joint optimization. We also design corresponding sampling and augmentation algorithms to improve the effectiveness of contrastive learning. Extensive experiments on the Social Media Popularity Dataset validate the superiority of our proposed method and also confirm the important role of implicit social factors in popularity prediction. We open source the code at https://github.com/Daisy-zzz/PPCL.git.
Authors: Junyi Tao, Xiaoyin Chen, Nelson F. Liu
Abstract: Large language models (LMs) are capable of in-context learning from a few demonstrations (example-label pairs) to solve new tasks during inference. Despite the intuitive importance of high-quality demonstrations, previous work has observed that, in some settings, ICL performance is minimally affected by irrelevant labels (Min et al., 2022). We hypothesize that LMs perform ICL with irrelevant labels via two sequential processes: an inference function that solves the task, followed by a verbalization function that maps the inferred answer to the label space. Importantly, we hypothesize that the inference function is invariant to remappings of the label space (e.g., "true"/"false" to "cat"/"dog"), enabling LMs to share the same inference function across settings with different label words. We empirically validate this hypothesis with controlled layer-wise interchange intervention experiments. Our findings confirm the hypotheses on multiple datasets and tasks (natural language inference, sentiment analysis, and topic classification) and further suggest that the two functions can be localized in specific layers across various open-sourced models, including GEMMA-7B, MISTRAL-7B-V0.3, GEMMA-2-27B, and LLAMA-3.1-70B.
Authors: Jinyoung Park, Minseok Joo, Joo-Kyung Kim, Hyunwoo J. Kim
Abstract: Knowledge graph-grounded dialog generation requires retrieving a dialog-relevant subgraph from the given knowledge base graph and integrating it with the dialog history. Previous works typically represent the graph using an external encoder, such as graph neural networks, and retrieve relevant triplets based on the similarity between single-vector representations of triplets and the dialog history. However, these external encoders fail to leverage the rich knowledge of pretrained language models, and the retrieval process is also suboptimal due to the information bottleneck caused by the single-vector abstraction of the dialog history. In this work, we propose Dialog generation with Generative Subgraph Retrieval (DialogGSR), which retrieves relevant knowledge subgraphs by directly generating their token sequences on top of language models. For effective generative subgraph retrieval, we introduce two key methods: (i) structure-aware knowledge graph linearization with self-supervised graph-specific tokens and (ii) graph-constrained decoding utilizing graph structural proximity-based entity informativeness scores for valid and relevant generative retrieval. DialogGSR achieves state-of-the-art performance in knowledge graph-grounded dialog generation, as demonstrated on OpenDialKG and KOMODIS datasets.
Authors: Ardalan Arabzadeh, Tobias Vente, Joeran Beel
Abstract: As recommender systems become increasingly prevalent, the environmental impact and energy efficiency of training large-scale models have come under scrutiny. This paper investigates the potential for energy-efficient algorithm performance by optimizing dataset sizes through downsampling techniques in the context of Green Recommender Systems. We conducted experiments on the MovieLens 100K, 1M, 10M, and Amazon Toys and Games datasets, analyzing the performance of various recommender algorithms under different portions of dataset size. Our results indicate that while more training data generally leads to higher algorithm performance, certain algorithms, such as FunkSVD and BiasedMF, particularly with unbalanced and sparse datasets like Amazon Toys and Games, maintain high-quality recommendations with up to a 50% reduction in training data, achieving nDCG@10 scores within approximately 13% of full dataset performance. These findings suggest that strategic dataset reduction can decrease computational and environmental costs without substantially compromising recommendation quality. This study advances sustainable and green recommender systems by providing insights for reducing energy consumption while maintaining effectiveness.
Authors: Jongwoo Ko, Saket Dingliwal, Bhavana Ganesh, Sailik Sengupta, Sravan Bodapati, Aram Galstyan
Abstract: Direct alignment algorithms (DAAs), such as direct preference optimization (DPO), have become popular alternatives for Reinforcement Learning from Human Feedback (RLHF) due to their simplicity, efficiency, and stability. However, the preferences used in DAAs are usually collected before the alignment training begins and remain unchanged (off-policy). This can lead to two problems where the policy model (1) picks up on spurious correlations in the dataset (as opposed to learning the intended alignment expressed in the human preference labels), and (2) overfits to feedback on off-policy trajectories that have less likelihood of being generated by an updated policy model. To address these issues, we introduce Self-Reviewing and Alignment (SeRA), a cost-efficient and effective method that can be readily combined with existing DAAs. SeRA comprises of two components: (1) sample selection using implicit reward margins, which helps alleviate over-fitting to some undesired features, and (2) preference bootstrapping using implicit rewards to augment preference data with updated policy models in a cost-efficient manner. Extensive experimentation, including some on instruction-following tasks, demonstrate the effectiveness and generality of SeRA in training LLMs on offline preference datasets with DAAs.
Authors: Natalie Sinani, Sahil Salma, Paul Boutot, Sadaf Mustafiz
Abstract: In recent years, machine learning technologies have gained immense popularity and are being used in a wide range of domains. However, due to the complexity associated with machine learning algorithms, it is a challenge to make it user-friendly, easy to understand and apply. Machine learning applications are especially challenging for users who do not have proficiency in this area. In this paper, we use model-driven engineering (MDE) methods and tools for developing a domain-specific modelling environment to contribute towards providing a solution for this problem. We targeted reinforcement learning from the machine learning domain, and evaluated the proposed language, reinforcement learning modelling language (RLML), with multiple applications. The tool supports syntax-directed editing, constraint checking, and automatic generation of code from RLML models. The environment also provides support for comparing results generated with multiple RL algorithms. With our proposed MDE approach, we were able to help in abstracting reinforcement learning technologies and improve the learning curve for RL users.
Authors: Yingyu Liang, Zhizhou Sha, Zhenmei Shi, Zhao Song, Yufa Zhou
Abstract: Previous work has demonstrated that attention mechanisms are Turing complete. More recently, it has been shown that a looped 13-layer Transformer can function as a universal programmable computer. In contrast, the multi-layer perceptrons with $\mathsf{ReLU}$ activation ($\mathsf{ReLU}$-$\mathsf{MLP}$), one of the most fundamental components of neural networks, is known to be expressive; specifically, a two-layer neural network is a universal approximator given an exponentially large number of hidden neurons. However, it remains unclear whether a $\mathsf{ReLU}$-$\mathsf{MLP}$ can be made into a universal programmable computer using a practical number of weights. In this work, we provide an affirmative answer that a looped 23-layer $\mathsf{ReLU}$-$\mathsf{MLP}$ is capable to perform the basic necessary operations, effectively functioning as a programmable computer. This indicates that simple modules have stronger expressive power than previously expected and have not been fully explored. Our work provides insights into the mechanisms of neural networks and demonstrates that complex tasks, such as functioning as a programmable computer, do not necessarily require advanced architectures like Transformers.
Authors: Ting Yu, Kunhao Fu, Jian Zhang, Qingming Huang, Jun Yu
Abstract: Long-term Video Question Answering (VideoQA) is a challenging vision-and-language bridging task focusing on semantic understanding of untrimmed long-term videos and diverse free-form questions, simultaneously emphasizing comprehensive cross-modal reasoning to yield precise answers. The canonical approaches often rely on off-the-shelf feature extractors to detour the expensive computation overhead, but often result in domain-independent modality-unrelated representations. Furthermore, the inherent gradient blocking between unimodal comprehension and cross-modal interaction hinders reliable answer generation. In contrast, recent emerging successful video-language pre-training models enable cost-effective end-to-end modeling but fall short in domain-specific ratiocination and exhibit disparities in task formulation. Toward this end, we present an entirely end-to-end solution for long-term VideoQA: Multi-granularity Contrastive cross-modal collaborative Generation (MCG) model. To derive discriminative representations possessing high visual concepts, we introduce Joint Unimodal Modeling (JUM) on a clip-bone architecture and leverage Multi-granularity Contrastive Learning (MCL) to harness the intrinsically or explicitly exhibited semantic correspondences. To alleviate the task formulation discrepancy problem, we propose a Cross-modal Collaborative Generation (CCG) module to reformulate VideoQA as a generative task instead of the conventional classification scheme, empowering the model with the capability for cross-modal high-semantic fusion and generation so as to rationalize and answer. Extensive experiments conducted on six publicly available VideoQA datasets underscore the superiority of our proposed method.
Authors: Ting Yu, Kunhao Fu, Shuhui Wang, Qingming Huang, Jun Yu
Abstract: Video Question Answering (VideoQA) represents a crucial intersection between video understanding and language processing, requiring both discriminative unimodal comprehension and sophisticated cross-modal interaction for accurate inference. Despite advancements in multi-modal pre-trained models and video-language foundation models, these systems often struggle with domain-specific VideoQA due to their generalized pre-training objectives. Addressing this gap necessitates bridging the divide between broad cross-modal knowledge and the specific inference demands of VideoQA tasks. To this end, we introduce HeurVidQA, a framework that leverages domain-specific entity-action heuristics to refine pre-trained video-language foundation models. Our approach treats these models as implicit knowledge engines, employing domain-specific entity-action prompters to direct the model's focus toward precise cues that enhance reasoning. By delivering fine-grained heuristics, we improve the model's ability to identify and interpret key entities and actions, thereby enhancing its reasoning capabilities. Extensive evaluations across multiple VideoQA datasets demonstrate that our method significantly outperforms existing models, underscoring the importance of integrating domain-specific knowledge into video-language models for more accurate and context-aware VideoQA.
Authors: Sathya Kamesh Bhethanabhotla, Omar Swelam, Julien Siems, David Salinas, Frank Hutter
Abstract: This paper introduces Mamba4Cast, a zero-shot foundation model for time series forecasting. Based on the Mamba architecture and inspired by Prior-data Fitted Networks (PFNs), Mamba4Cast generalizes robustly across diverse time series tasks without the need for dataset specific fine-tuning. Mamba4Cast's key innovation lies in its ability to achieve strong zero-shot performance on real-world datasets while having much lower inference times than time series foundation models based on the transformer architecture. Trained solely on synthetic data, the model generates forecasts for entire horizons in a single pass, outpacing traditional auto-regressive approaches. Our experiments show that Mamba4Cast performs competitively against other state-of-the-art foundation models in various data sets while scaling significantly better with the prediction length. The source code can be accessed at https://github.com/automl/Mamba4Cast.
Authors: Peifan Jiang, Xuben Wang, Shuang Wang, Fei Deng, Kunpeng Wang, Bin Wang, Yuhan Yang, Islam Fadel
Abstract: Magnetotelluric deep learning (DL) inversion methods based on joint data-driven and physics-driven have become a hot topic in recent years. When mapping observation data (or forward modeling data) to the resistivity model using neural networks (NNs), incorporating the error (loss) term of the inversion resistivity's forward modeling response--which introduces physical information about electromagnetic field propagation--can significantly enhance the inversion accuracy. To efficiently achieve data-physical dual-driven MT deep learning inversion for large-scale 3-D MT data, we propose using DL forward modeling networks to compute this portion of the loss. This approach introduces pseudo-physical information through the forward modeling of NN simulation, further guiding the inversion network fitting. Specifically, we first pre-train the forward modeling networks as fixed forward modeling operators, then transfer and integrate them into the inversion network training, and finally optimize the inversion network by minimizing the multinomial loss. Theoretical experimental results indicate that despite some simulation errors in DL forward modeling, the introduced pseudo-physical information still enhances inversion accuracy and significantly mitigates the overfitting problem during training. Additionally, we propose a new input mode that involves masking and adding noise to the data, simulating the field data environment of 3-D MT inversion, thereby making the method more flexible and effective for practical applications.
Authors: Xiaoyu Li, Yingyu Liang, Zhenmei Shi, Zhao Song, Yufa Zhou
Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities in processing long-context information. However, the quadratic complexity of attention computation with respect to sequence length poses significant computational challenges, and I/O aware algorithms have been proposed. This paper presents a comprehensive analysis of the I/O complexity for attention mechanisms, focusing on backward passes by categorizing into small and large cache scenarios. Using the red-blue pebble game framework, we establish tight bounds on I/O complexity across all cache sizes. We confirm that the de facto standard I/O aware algorithm FlashAttention is optimal for both forward and backward passes for the large cache size scenario. For small cache sizes, we provide an algorithm that improves over existing methods and achieves the tight bounds. Additionally, we extend our analysis to sparse attention, a mainstream speeding-up approach, deriving fine-grained lower bounds for both forward and backward passes and both small and large caches. Our findings complete the theoretical foundation for I/O complexity in attention mechanisms, offering insights for designing efficient algorithms of LLM training and inference.
Authors: Lixia Zhang, Tianxu Liu, Kaihui Shen, Cheng Chen
Abstract: With the rapid advancement of Internet technology, the threat of malware to computer systems and network security has intensified. Malware affects individual privacy and security and poses risks to critical infrastructures of enterprises and nations. The increasing quantity and complexity of malware, along with its concealment and diversity, challenge traditional detection techniques. Static detection methods struggle against variants and packed malware, while dynamic methods face high costs and risks that limit their application. Consequently, there is an urgent need for novel and efficient malware detection techniques to improve accuracy and robustness. This study first employs the minhash algorithm to convert binary files of malware into grayscale images, followed by the extraction of global and local texture features using GIST and LBP algorithms. Additionally, the study utilizes IDA Pro to decompile and extract opcode sequences, applying N-gram and tf-idf algorithms for feature vectorization. The fusion of these features enables the model to comprehensively capture the behavioral characteristics of malware. In terms of model construction, a CNN-BiLSTM fusion model is designed to simultaneously process image features and opcode sequences, enhancing classification performance. Experimental validation on multiple public datasets demonstrates that the proposed method significantly outperforms traditional detection techniques in terms of accuracy, recall, and F1 score, particularly in detecting variants and obfuscated malware with greater stability. The research presented in this paper offers new insights into the development of malware detection technologies, validating the effectiveness of feature and model fusion, and holds promising application prospects.
Authors: Yicheng Fu, Raviteja Anantha, Jianpeng Cheng
Abstract: While server-side Large Language Models (LLMs) demonstrate proficiency in function calling and complex reasoning, deploying Small Language Models (SLMs) directly on devices brings opportunities to improve latency and privacy but also introduces unique challenges for accuracy and memory. We introduce CAMPHOR, an innovative on-device SLM multi-agent framework designed to handle multiple user inputs and reason over personal context locally, ensuring privacy is maintained. CAMPHOR employs a hierarchical architecture where a high-order reasoning agent decomposes complex tasks and coordinates expert agents responsible for personal context retrieval, tool interaction, and dynamic plan generation. By implementing parameter sharing across agents and leveraging prompt compression, we significantly reduce model size, latency, and memory usage. To validate our approach, we present a novel dataset capturing multi-agent task trajectories centered on personalized mobile assistant use-cases. Our experiments reveal that fine-tuned SLM agents not only surpass closed-source LLMs in task completion F1 by~35\% but also eliminate the need for server-device communication, all while enhancing privacy.
Authors: Youquan Li, Miao Zheng, Fan Yang, Guosheng Dong, Bin Cui, Weipeng Chen, Zenan Zhou, Wentao Zhang
Abstract: Human feedback is crucial in the interactions between humans and Large Language Models (LLMs). However, existing research primarily focuses on benchmarking LLMs in single-turn dialogues. Even in benchmarks designed for multi-turn dialogues, the user inputs are often independent, neglecting the nuanced and complex nature of human feedback within real-world usage scenarios. To fill this research gap, we introduce FB-Bench, a fine-grained, multi-task benchmark designed to evaluate LLMs' responsiveness to human feedback in real-world usage scenarios. Drawing from the two main interaction scenarios, FB-Bench comprises 734 meticulously curated samples, encompassing eight task types, five deficiency types of response, and nine feedback types. We extensively evaluate a broad array of popular LLMs, revealing significant variations in their performance across different interaction scenarios. Further analysis indicates that task, human feedback, and deficiencies of previous responses can also significantly impact LLMs' responsiveness. Our findings underscore both the strengths and limitations of current models, providing valuable insights and directions for future research. Both the toolkits and the dataset of FB-Bench are available at https://github.com/PKU-Baichuan-MLSystemLab/FB-Bench.
Authors: Haoming Lu, Feifei Zhong
Abstract: This study evaluates the capability of Vision-Language Models (VLMs) in image data annotation by comparing their performance on the CelebA dataset in terms of quality and cost-effectiveness against manual annotation. Annotations from the state-of-the-art LLaVA-NeXT model on 1000 CelebA images are in 79.5% agreement with the original human annotations. Incorporating re-annotations of disagreed cases into a majority vote boosts AI annotation consistency to 89.1% and even higher for more objective labels. Cost assessments demonstrate that AI annotation significantly reduces expenditures compared to traditional manual methods -- representing less than 1% of the costs for manual annotation in the CelebA dataset. These findings support the potential of VLMs as a viable, cost-effective alternative for specific annotation tasks, reducing both financial burden and ethical concerns associated with large-scale manual data annotation. The AI annotations and re-annotations utilized in this study are available on https://github.com/evev2024/EVEV2024_CelebA.
Authors: Yaming Yang, Dilixat Muhtar, Yelong Shen, Yuefeng Zhan, Jianfeng Liu, Yujing Wang, Hao Sun, Denvy Deng, Feng Sun, Qi Zhang, Weizhu Chen, Yunhai Tong
Abstract: Parameter-efficient fine-tuning (PEFT) has been widely employed for domain adaptation, with LoRA being one of the most prominent methods due to its simplicity and effectiveness. However, in multi-task learning (MTL) scenarios, LoRA tends to obscure the distinction between tasks by projecting sparse high-dimensional features from different tasks into the same dense low-dimensional intrinsic space. This leads to task interference and suboptimal performance for LoRA and its variants. To tackle this challenge, we propose MTL-LoRA, which retains the advantages of low-rank adaptation while significantly enhancing multi-task learning capabilities. MTL-LoRA augments LoRA by incorporating additional task-adaptive parameters that differentiate task-specific information and effectively capture shared knowledge across various tasks within low-dimensional spaces. This approach enables large language models (LLMs) pre-trained on general corpus to adapt to different target task domains with a limited number of trainable parameters. Comprehensive experimental results, including evaluations on public academic benchmarks for natural language understanding, commonsense reasoning, and image-text understanding, as well as real-world industrial text Ads relevance datasets, demonstrate that MTL-LoRA outperforms LoRA and its various variants with comparable or even fewer learnable parameters in multitask learning.
Authors: Arjun Shah, Hetansh Shah, Vedica Bafna, Charmi Khandor, Sindhu Nair
Abstract: In today's day and age where information is rapidly spread through online platforms, the rise of fake news poses an alarming threat to the integrity of public discourse, societal trust, and reputed news sources. Classical machine learning and Transformer-based models have been extensively studied for the task of fake news detection, however they are hampered by their reliance on training data and are unable to generalize on unseen headlines. To address these challenges, we propose our novel solution, leveraging web-scraping techniques and Natural Language Inference (NLI) models to retrieve external knowledge necessary for verifying the accuracy of a headline. Our system is evaluated on a diverse self-curated evaluation dataset spanning over multiple news channels and broad domains. Our best performing pipeline achieves an accuracy of 84.3% surpassing the best classical Machine Learning model by 33.3% and Bidirectional Encoder Representations from Transformers (BERT) by 31.0% . This highlights the efficacy of combining dynamic web-scraping with Natural Language Inference to find support for a claimed headline in the corresponding externally retrieved knowledge for the task of fake news detection.
Authors: Christopher Mahlich, Tobias Vente, Joeran Beel
Abstract: This paper introduces e-fold cross-validation, an energy-efficient alternative to k-fold cross-validation. It dynamically adjusts the number of folds based on a stopping criterion. The criterion checks after each fold whether the standard deviation of the evaluated folds has consistently decreased or remained stable. Once met, the process stops early. We tested e-fold cross-validation on 15 datasets and 10 machine-learning algorithms. On average, it required 4 fewer folds than 10-fold cross-validation, reducing evaluation time, computational resources, and energy use by about 40%. Performance differences between e-fold and 10-fold cross-validation were less than 2% for larger datasets. More complex models showed even smaller discrepancies. In 96% of iterations, the results were within the confidence interval, confirming statistical significance. E-fold cross-validation offers a reliable and efficient alternative to k-fold, reducing computational costs while maintaining accuracy.
Authors: Xiquan Li, Wenxi Chen, Ziyang Ma, Xuenan Xu, Yuzhe Liang, Zhisheng Zheng, Qiuqiang Kong, Xie Chen
Abstract: While automated audio captioning (AAC) has made notable progress, traditional fully supervised AAC models still face two critical challenges: the need for expensive audio-text pair data for training and performance degradation when transferring across domains. To overcome these limitations, we present DRCap, a data-efficient and flexible zero-shot audio captioning system that requires text-only data for training and can quickly adapt to new domains without additional fine-tuning. DRCap integrates a contrastive language-audio pre-training (CLAP) model and a large-language model (LLM) as its backbone. During training, the model predicts the ground-truth caption with a fixed text encoder from CLAP, whereas, during inference, the text encoder is replaced with the audio encoder to generate captions for audio clips in a zero-shot manner. To mitigate the modality gap of the CLAP model, we use both the projection strategy from the encoder side and the retrieval-augmented generation strategy from the decoder side. Specifically, audio embeddings are first projected onto a text embedding support to absorb extensive semantic information within the joint multi-modal space of CLAP. At the same time, similar captions retrieved from a datastore are fed as prompts to instruct the LLM, incorporating external knowledge to take full advantage of its strong generative capability. Conditioned on both the projected CLAP embedding and the retrieved similar captions, the model is able to produce a more accurate and semantically rich textual description. By tailoring the text embedding support and the caption datastore to the target domain, DRCap acquires a robust ability to adapt to new domains in a training-free manner. Experimental results demonstrate that DRCap outperforms all other zero-shot models in in-domain scenarios and achieves state-of-the-art performance in cross-domain scenarios.
Authors: Nikolaos Giakoumoglou, Tania Stathaki
Abstract: Knowledge distillation (KD) has been widely used to transfer knowledge from large, accurate models (teachers) to smaller, efficient ones (students). Recent methods have explored enforcing consistency by incorporating causal interpretations to distill invariant representations. In this work, we extend this line of research by introducing a dual augmentation strategy to promote invariant feature learning in both teacher and student models. Our approach leverages different augmentations applied to both models during distillation, pushing the student to capture robust, transferable features. This dual augmentation strategy complements invariant causal distillation by ensuring that the learned representations remain stable across a wider range of data variations and transformations. Extensive experiments on CIFAR-100 demonstrate the effectiveness of this approach, achieving competitive results in same-architecture KD.
Authors: Collin Leiber, Niklas Strau{\ss}, Matthias Schubert, Thomas Seidl
Abstract: Finding meaningful groups, i.e., clusters, in high-dimensional data such as images or texts without labeled data at hand is an important challenge in data mining. In recent years, deep clustering methods have achieved remarkable results in these tasks. However, most of these methods require the user to specify the number of clusters in advance. This is a major limitation since the number of clusters is typically unknown if labeled data is unavailable. Thus, an area of research has emerged that addresses this problem. Most of these approaches estimate the number of clusters separated from the clustering process. This results in a strong dependency of the clustering result on the quality of the initial embedding. Other approaches are tailored to specific clustering processes, making them hard to adapt to other scenarios. In this paper, we propose UNSEEN, a general framework that, starting from a given upper bound, is able to estimate the number of clusters. To the best of our knowledge, it is the first method that can be easily combined with various deep clustering algorithms. We demonstrate the applicability of our approach by combining UNSEEN with the popular deep clustering algorithms DCN, DEC, and DKM and verify its effectiveness through an extensive experimental evaluation on several image and tabular datasets. Moreover, we perform numerous ablations to analyze our approach and show the importance of its components. The code is available at: https://github.com/collinleiber/UNSEEN
Authors: Corentin Pla, Hugo Richard, Maxime Vono
Abstract: We consider the problem of mean estimation under user-level local differential privacy, where $n$ users are contributing through their local pool of data samples. Previous work assume that the number of data samples is the same across users. In contrast, we consider a more general and realistic scenario where each user $u \in [n]$ owns $m_u$ data samples drawn from some generative distribution $\mu$; $m_u$ being unknown to the statistician but drawn from a known distribution $M$ over $\mathbb{N}^\star$. Based on a distribution-aware mean estimation algorithm, we establish an $M$-dependent upper bounds on the worst-case risk over $\mu$ for the task of mean estimation. We then derive a lower bound. The two bounds are asymptotically matching up to logarithmic factors and reduce to known bounds when $m_u = m$ for any user $u$.
Authors: Antonio Purificato, Fabrizio Silvestri
Abstract: Recommender systems play a crucial role in alleviating information overload by providing personalized recommendations tailored to users' preferences and interests. Recently, Graph Neural Networks (GNNs) have emerged as a promising approach for recommender systems, leveraging their ability to effectively capture complex relationships and dependencies between users and items by representing them as nodes in a graph structure. In this study, we investigate the environmental impact of GNN-based recommender systems, an aspect that has been largely overlooked in the literature. Specifically, we conduct a comprehensive analysis of the carbon emissions associated with training and deploying GNN models for recommendation tasks. We evaluate the energy consumption and carbon footprint of different GNN architectures and configurations, considering factors such as model complexity, training duration, hardware specifications and embedding size. By addressing the environmental impact of resource-intensive algorithms in recommender systems, this study contributes to the ongoing efforts towards sustainable and responsible artificial intelligence, promoting the development of eco-friendly recommendation technologies that balance performance and environmental considerations. Code is available at: https://github.com/antoniopurificato/gnn_recommendation_and_environment.
URLs: https://github.com/antoniopurificato/gnn_recommendation_and_environment.
Authors: Vencia Herzog, Stefan Suwelack
Abstract: Self-supervised pre-training has achieved remarkable success in NLP and 2D vision. However, these advances have yet to translate to 3D data. Techniques like masked reconstruction face inherent challenges on unstructured point clouds, while many contrastive learning tasks lack in complexity and informative value. In this paper, we present Pic@Point, an effective contrastive learning method based on structural 2D-3D correspondences. We leverage image cues rich in semantic and contextual knowledge to provide a guiding signal for point cloud representations at various abstraction levels. Our lightweight approach outperforms state-of-the-art pre-training methods on several 3D benchmarks.
Authors: Jialian Li, Yipin Zhang, Wei Shen, Yuzi Yan, Jian Xie, Dong Yan
Abstract: Logical reasoning is a crucial task for Large Language Models (LLMs), enabling them to tackle complex problems. Among reasoning tasks, multi-step reasoning poses a particular challenge. Grounded in the theory of formal logic, we have developed an automated method, Multi-step Deduction (MuseD), for deductive reasoning data. MuseD has allowed us to create training and testing datasets for multi-step reasoning. Our generation method enables control over the complexity of the generated instructions, facilitating training and evaluation of models across different difficulty levels. Through RLHF training, our training data has demonstrated significant improvements in logical capabilities for both in-domain of out-of-domain reasoning tasks. Additionally, we have conducted tests to assess the multi-step reasoning abilities of various models.
Authors: Seung-Yeon Back, Geonho Son, Dahye Jeong, Eunil Park, Simon S. Woo
Abstract: Photo restoration technology enables preserving visual memories in photographs. However, physical prints are vulnerable to various forms of deterioration, ranging from physical damage to loss of image quality, etc. While restoration by human experts can improve the quality of outcomes, it often comes at a high price in terms of cost and time for restoration. In this work, we present the AI-based photo restoration framework composed of multiple stages, where each stage is tailored to enhance and restore specific types of photo damage, accelerating and automating the photo restoration process. By integrating these techniques into a unified architecture, our framework aims to offer a one-stop solution for restoring old and deteriorated photographs. Furthermore, we present a novel old photo restoration dataset because we lack a publicly available dataset for our evaluation.
Authors: Tianshi Xu, Shuzhang Zhong, Wenxuan Zeng, Runsheng Wang, Meng Li
Abstract: Private deep neural network (DNN) inference based on secure two-party computation (2PC) enables secure privacy protection for both the server and the client. However, existing secure 2PC frameworks suffer from a high inference latency due to enormous communication. As the communication of both linear and non-linear DNN layers reduces with the bit widths of weight and activation, in this paper, we propose PrivQuant, a framework that jointly optimizes the 2PC-based quantized inference protocols and the network quantization algorithm, enabling communication-efficient private inference. PrivQuant proposes DNN architecture-aware optimizations for the 2PC protocols for communication-intensive quantized operators and conducts graph-level operator fusion for communication reduction. Moreover, PrivQuant also develops a communication-aware mixed precision quantization algorithm to improve inference efficiency while maintaining high accuracy. The network/protocol co-optimization enables PrivQuant to outperform prior-art 2PC frameworks. With extensive experiments, we demonstrate PrivQuant reduces communication by $11\times, 2.5\times \mathrm{and}~ 2.8\times$, which results in $8.7\times, 1.8\times ~ \mathrm{and}~ 2.4\times$ latency reduction compared with SiRNN, COINN, and CoPriv, respectively.
Authors: Jiachun Li, Pengfei Cao, Chenhao Wang, Zhuoran Jin, Yubo Chen, Kang Liu, Xiaojian Jiang, Jiexin Xu, Jun Zhao
Abstract: Large language models (LLMs) sometimes demonstrate poor performance on knowledge-intensive tasks, commonsense reasoning is one of them. Researchers typically address these issues by retrieving related knowledge from knowledge graphs or employing self-enhancement methods to elicit knowledge in LLMs. However, noisy knowledge and invalid reasoning issues hamper their ability to answer questions accurately. To this end, we propose a novel method named eliciting, filtering and integrating knowledge in large language model (LINKED). In it, we design a reward model to filter out the noisy knowledge and take the marginal consistent reasoning module to reduce invalid reasoning. With our comprehensive experiments on two complex commonsense reasoning benchmarks, our method outperforms SOTA baselines (up to 9.0% improvement of accuracy). Besides, to measure the positive and negative impact of the injected knowledge, we propose a new metric called effectiveness-preservation score for the knowledge enhancement works. Finally, through extensive experiments, we conduct an in-depth analysis and find many meaningful conclusions about LLMs in commonsense reasoning tasks.
Authors: Jiachun Li, Pengfei Cao, Zhuoran Jin, Yubo Chen, Kang Liu, Jun Zhao
Abstract: Inductive reasoning is an essential capability for large language models (LLMs) to achieve higher intelligence, which requires the model to generalize rules from observed facts and then apply them to unseen examples. We present {\scshape Mirage}, a synthetic dataset that addresses the limitations of previous work, specifically the lack of comprehensive evaluation and flexible test data. In it, we evaluate LLMs' capabilities in both the inductive and deductive stages, allowing for flexible variation in input distribution, task scenario, and task difficulty to analyze the factors influencing LLMs' inductive reasoning. Based on these multi-faceted evaluations, we demonstrate that the LLM is a poor rule-based reasoner. In many cases, when conducting inductive reasoning, they do not rely on a correct rule to answer the unseen case. From the perspectives of different prompting methods, observation numbers, and task forms, models tend to consistently conduct correct deduction without correct inductive rules. Besides, we find that LLMs are good neighbor-based reasoners. In the inductive reasoning process, the model tends to focus on observed facts that are close to the current test example in feature space. By leveraging these similar examples, the model maintains strong inductive capabilities within a localized region, significantly improving its deductive performance.
Authors: Xiaoran Jiao, Weian Mao, Wengong Jin, Peiyuan Yang, Hao Chen, Chunhua Shen
Abstract: Predicting the change in binding free energy ($\Delta \Delta G$) is crucial for understanding and modulating protein-protein interactions, which are critical in drug design. Due to the scarcity of experimental $\Delta \Delta G$ data, existing methods focus on pre-training, while neglecting the importance of alignment. In this work, we propose the Boltzmann Alignment technique to transfer knowledge from pre-trained inverse folding models to $\Delta \Delta G$ prediction. We begin by analyzing the thermodynamic definition of $\Delta \Delta G$ and introducing the Boltzmann distribution to connect energy with protein conformational distribution. However, the protein conformational distribution is intractable; therefore, we employ Bayes' theorem to circumvent direct estimation and instead utilize the log-likelihood provided by protein inverse folding models for $\Delta \Delta G$ estimation. Compared to previous inverse folding-based methods, our method explicitly accounts for the unbound state of protein complex in the $\Delta \Delta G$ thermodynamic cycle, introducing a physical inductive bias and achieving both supervised and unsupervised state-of-the-art (SoTA) performance. Experimental results on SKEMPI v2 indicate that our method achieves Spearman coefficients of 0.3201 (unsupervised) and 0.5134 (supervised), significantly surpassing the previously reported SoTA values of 0.2632 and 0.4324, respectively. Futhermore, we demonstrate the capability of our method on binding energy prediction, protein-protein docking and antibody optimization tasks.
Authors: Takumi Ohashi, Tsubasa Nakagawa, Hitoshi Iyatomi
Abstract: Rapid advancements in artificial intelligence (AI) have made it crucial to integrate moral reasoning into AI systems. However, existing models and datasets often overlook regional and cultural differences. To address this shortcoming, we have expanded the JCommonsenseMorality (JCM) dataset, the only publicly available dataset focused on Japanese morality. The Extended JCM (eJCM) has grown from the original 13,975 sentences to 31,184 sentences using our proposed sentence expansion method called Masked Token and Label Enhancement (MTLE). MTLE selectively masks important parts of sentences related to moral judgment and replaces them with alternative expressions generated by a large language model (LLM), while re-assigning appropriate labels. The model trained using our eJCM achieved an F1 score of 0.857, higher than the scores for the original JCM (0.837), ChatGPT one-shot classification (0.841), and data augmented using AugGPT, a state-of-the-art augmentation method (0.850). Specifically, in complex moral reasoning tasks unique to Japanese culture, the model trained with eJCM showed a significant improvement in performance (increasing from 0.681 to 0.756) and achieved a performance close to that of GPT-4 Turbo (0.787). These results demonstrate the validity of the eJCM dataset and the importance of developing models and datasets that consider the cultural context.
Authors: Gilad Gressel, Rahul Pankajakshan, Yisroel Mirsky
Abstract: Large Language Models (LLMs) have demonstrated an alarming ability to impersonate humans in conversation, raising concerns about their potential misuse in scams and deception. Humans have a right to know if they are conversing to an LLM. We evaluate text-based prompts designed as challenges to expose LLM imposters in real-time. To this end we compile and release an open-source benchmark dataset that includes 'implicit challenges' that exploit an LLM's instruction-following mechanism to cause role deviation, and 'exlicit challenges' that test an LLM's ability to perform simple tasks typically easy for humans but difficult for LLMs. Our evaluation of 9 leading models from the LMSYS leaderboard revealed that explicit challenges successfully detected LLMs in 78.4% of cases, while implicit challenges were effective in 22.9% of instances. User studies validate the real-world applicability of our methods, with humans outperforming LLMs on explicit challenges (78% vs 22% success rate). Our framework unexpectedly revealed that many study participants were using LLMs to complete tasks, demonstrating its effectiveness in detecting both AI impostors and human misuse of AI tools. This work addresses the critical need for reliable, real-time LLM detection methods in high-stakes conversations.
Authors: Haochen Wang, Anlin Zheng, Yucheng Zhao, Tiancai Wang, Zheng Ge, Xiangyu Zhang, Zhaoxiang Zhang
Abstract: This paper introduces reconstructive visual instruction tuning (ROSS), a family of Large Multimodal Models (LMMs) that exploit vision-centric supervision signals. In contrast to conventional visual instruction tuning approaches that exclusively supervise text outputs, ROSS prompts LMMs to supervise visual outputs via reconstructing input images. By doing so, it capitalizes on the inherent richness and detail present within input images themselves, which are often lost in pure text supervision. However, producing meaningful feedback from natural images is challenging due to the heavy spatial redundancy of visual signals. To address this issue, ROSS employs a denoising objective to reconstruct latent representations of input images, avoiding directly regressing exact raw RGB values. This intrinsic activation design inherently encourages LMMs to maintain image detail, thereby enhancing their fine-grained comprehension capabilities and reducing hallucinations. Empirically, ROSS consistently brings significant improvements across different visual encoders and language models. In comparison with extrinsic assistance state-of-the-art alternatives that aggregate multiple visual experts, ROSS delivers competitive performance with a single SigLIP visual encoder, demonstrating the efficacy of our vision-centric supervision tailored for visual outputs.
Authors: Subhankar Maity, Aniket Deroy
Abstract: In recent years, large language models (LLMs) and generative AI have revolutionized natural language processing (NLP), offering unprecedented capabilities in education. This chapter explores the transformative potential of LLMs in automated question generation and answer assessment. It begins by examining the mechanisms behind LLMs, emphasizing their ability to comprehend and generate human-like text. The chapter then discusses methodologies for creating diverse, contextually relevant questions, enhancing learning through tailored, adaptive strategies. Key prompting techniques, such as zero-shot and chain-of-thought prompting, are evaluated for their effectiveness in generating high-quality questions, including open-ended and multiple-choice formats in various languages. Advanced NLP methods like fine-tuning and prompt-tuning are explored for their role in generating task-specific questions, despite associated costs. The chapter also covers the human evaluation of generated questions, highlighting quality variations across different methods and areas for improvement. Furthermore, it delves into automated answer assessment, demonstrating how LLMs can accurately evaluate responses, provide constructive feedback, and identify nuanced understanding or misconceptions. Examples illustrate both successful assessments and areas needing improvement. The discussion underscores the potential of LLMs to replace costly, time-consuming human assessments when appropriately guided, showcasing their advanced understanding and reasoning capabilities in streamlining educational processes.
Authors: Julian Stier
Abstract: Within one decade, Deep Learning overtook the dominating solution methods of countless problems of artificial intelligence. ``Deep'' refers to the deep architectures with operations in manifolds of which there are no immediate observations. For these deep architectures some kind of structure is pre-defined -- but what is this structure? With a formal definition for structures of neural networks, neural architecture search problems and solution methods can be formulated under a common framework. Both practical and theoretical questions arise from closing the gap between applied neural architecture search and learning theory. Does structure make a difference or can it be chosen arbitrarily? This work is concerned with deep structures of artificial neural networks and examines automatic construction methods under empirical principles to shed light on to the so called ``black-box models''. Our contributions include a formulation of graph-induced neural networks that is used to pose optimisation problems for neural architecture. We analyse structural properties for different neural network objectives such as correctness, robustness or energy consumption and discuss how structure affects them. Selected automation methods for neural architecture optimisation problems are discussed and empirically analysed. With the insights gained from formalising graph-induced neural networks, analysing structural properties and comparing the applicability of neural architecture search methods qualitatively and quantitatively we advance these methods in two ways. First, new predictive models are presented for replacing computationally expensive evaluation schemes, and second, new generative models for informed sampling during neural architecture search are analysed and discussed.
Authors: Hongbin Xu, Junduan Huang, Yuer Ma, Zifeng Li, Wenxiong Kang
Abstract: 3D biometric techniques on finger traits have become a new trend and have demonstrated a powerful ability for recognition and anti-counterfeiting. Existing methods follow an explicit 3D pipeline that reconstructs the models first and then extracts features from 3D models. However, these explicit 3D methods suffer from the following problems: 1) Inevitable information dropping during 3D reconstruction; 2) Tight coupling between specific hardware and algorithm for 3D reconstruction. It leads us to a question: Is it indispensable to reconstruct 3D information explicitly in recognition tasks? Hence, we consider this problem in an implicit manner, leaving the nerve-wracking 3D reconstruction problem for learnable neural networks with the help of neural radiance fields (NeRFs). We propose FingerNeRF, a novel generalizable NeRF for 3D finger biometrics. To handle the shape-radiance ambiguity problem that may result in incorrect 3D geometry, we aim to involve extra geometric priors based on the correspondence of binary finger traits like fingerprints or finger veins. First, we propose a novel Trait Guided Transformer (TGT) module to enhance the feature correspondence with the guidance of finger traits. Second, we involve extra geometric constraints on the volume rendering loss with the proposed Depth Distillation Loss and Trait Guided Rendering Loss. To evaluate the performance of the proposed method on different modalities, we collect two new datasets: SCUT-Finger-3D with finger images and SCUT-FingerVein-3D with finger vein images. Moreover, we also utilize the UNSW-3D dataset with fingerprint images for evaluation. In experiments, our FingerNeRF can achieve 4.37% EER on SCUT-Finger-3D dataset, 8.12% EER on SCUT-FingerVein-3D dataset, and 2.90% EER on UNSW-3D dataset, showing the superiority of the proposed implicit method in 3D finger biometrics.
Authors: Guanting Dong, Xiaoshuai Song, Yutao Zhu, Runqi Qiao, Zhicheng Dou, Ji-Rong Wen
Abstract: Following natural instructions is crucial for the effective application of Retrieval-Augmented Generation (RAG) systems. Despite recent advancements in Large Language Models (LLMs), research on assessing and improving instruction-following (IF) alignment within the RAG domain remains limited. To address this issue, we propose VIF-RAG, the first automated, scalable, and verifiable synthetic pipeline for instruction-following alignment in RAG systems. We start by manually crafting a minimal set of atomic instructions (<100) and developing combination rules to synthesize and verify complex instructions for a seed set. We then use supervised models for instruction rewriting while simultaneously generating code to automate the verification of instruction quality via a Python executor. Finally, we integrate these instructions with extensive RAG and general data samples, scaling up to a high-quality VIF-RAG-QA dataset (>100k) through automated processes. To further bridge the gap in instruction-following auto-evaluation for RAG systems, we introduce FollowRAG Benchmark, which includes approximately 3K test samples, covering 22 categories of general instruction constraints and four knowledge-intensive QA datasets. Due to its robust pipeline design, FollowRAG can seamlessly integrate with different RAG benchmarks. Using FollowRAG and eight widely-used IF and foundational abilities benchmarks for LLMs, we demonstrate that VIF-RAG markedly enhances LLM performance across a broad range of general instruction constraints while effectively leveraging its capabilities in RAG scenarios. Further analysis offers practical insights for achieving IF alignment in RAG systems. Our code and datasets are released at https://FollowRAG.github.io.
Authors: Hongbin Xu, Weitao Chen, Zhipeng Zhou, Feng Xiao, Baigui Sun, Mike Zheng Shou, Wenxiong Kang
Abstract: Despite recent advancements in 3D generation methods, achieving controllability still remains a challenging issue. Current approaches utilizing score-distillation sampling are hindered by laborious procedures that consume a significant amount of time. Furthermore, the process of first generating 2D representations and then mapping them to 3D lacks internal alignment between the two forms of representation. To address these challenges, we introduce ControLRM, an end-to-end feed-forward model designed for rapid and controllable 3D generation using a large reconstruction model (LRM). ControLRM comprises a 2D condition generator, a condition encoding transformer, and a triplane decoder transformer. Instead of training our model from scratch, we advocate for a joint training framework. In the condition training branch, we lock the triplane decoder and reuses the deep and robust encoding layers pretrained with millions of 3D data in LRM. In the image training branch, we unlock the triplane decoder to establish an implicit alignment between the 2D and 3D representations. To ensure unbiased evaluation, we curate evaluation samples from three distinct datasets (G-OBJ, GSO, ABO) rather than relying on cherry-picking manual generation. The comprehensive experiments conducted on quantitative and qualitative comparisons of 3D controllability and generation quality demonstrate the strong generalization capacity of our proposed approach.
Authors: Steve Hanneke, Kun Wang
Abstract: We study the stochastic noisy bandit problem with an unknown reward function $f^*$ in a known function class $\mathcal{F}$. Formally, a model $M$ maps arms $\pi$ to a probability distribution $M(\pi)$ of reward. A model class $\mathcal{M}$ is a collection of models. For each model $M$, define its mean reward function $f^M(\pi)=\mathbb{E}_{r \sim M(\pi)}[r]$. In the bandit learning problem, we proceed in rounds, pulling one arm $\pi$ each round and observing a reward sampled from $M(\pi)$. With knowledge of $\mathcal{M}$, supposing that the true model $M\in \mathcal{M}$, the objective is to identify an arm $\hat{\pi}$ of near-maximal mean reward $f^M(\hat{\pi})$ with high probability in a bounded number of rounds. If this is possible, then the model class is said to be learnable. Importantly, a result of \cite{hanneke2023bandit} shows there exist model classes for which learnability is undecidable. However, the model class they consider features deterministic rewards, and they raise the question of whether learnability is decidable for classes containing sufficiently noisy models. For the first time, we answer this question in the positive by giving a complete characterization of learnability for model classes with arbitrary noise. In addition to that, we also describe the full spectrum of possible optimal query complexities. Further, we prove adaptivity is sometimes necessary to achieve the optimal query complexity. Last, we revisit an important complexity measure for interactive decision making, the Decision-Estimation-Coefficient \citep{foster2021statistical,foster2023tight}, and propose a new variant of the DEC which also characterizes learnability in this setting.
Authors: Arezou Zahiri Pourzarandi, Farshad Jafari
Abstract: This study employs Natural Language Processing (NLP) to analyze the intricate linguistic and emotional dimensions within the plays of Bernard-Marie Kolt\`es, a central figure in contemporary French theatre. By integrating advanced computational techniques, we dissect Kolt\`es' narrative style, revealing the subtle interplay between language and emotion across his dramatic oeuvre. Our findings highlight how Kolt\`es crafts his narratives, enriching our understanding of his thematic explorations and contributing to the broader field of digital humanities in literary analysis.
Authors: Angelos Poulis, Eleni Tsalapati, Manolis Koubarakis
Abstract: Recent advancements in transformer-based language models have sparked research into their logical reasoning capabilities. Most of the benchmarks used to evaluate these models are simple: generated from short (fragments of) first-order logic sentences with only a few logical operators and quantifiers. We construct the natural language dataset, DELTA$_D$, using the expressive description logic language $\mathcal{ALCQ}$. DELTA$_D$ comprises 384K examples and increases in two dimensions: i) reasoning depth, and ii) linguistic complexity. In this way, we systematically investigate the logical reasoning capabilities of a supervised fine-tuned DeBERTa-based model and two large language models (GPT-3.5, GPT-4) with few-shot prompting. We show that the DeBERTa-based model fine-tuned on our dataset can master the entailment checking task. Moreover, the performance of GPTs can improve significantly even when a small number of samples is provided (9 shots). We open-source our code and datasets.
Authors: Mohammad Mozaffari, Maryam Mehri Dehnavi
Abstract: Large Language Models (LLMs) have revolutionized natural language understanding and generation tasks but suffer from high memory consumption and slow inference times due to their large parameter sizes. Traditional model compression techniques, such as quantization and pruning, mitigate these issues but often require retraining to maintain accuracy, which is computationally expensive. This paper introduces SLiM, a novel approach for compressing LLMs using a one-shot Quantized Sparse Plus Low-rank Approximation. SLiM eliminates the need for costly retraining by combining a symmetric quantization method (SLiM-Quant) with a saliency-based low-rank approximation. Our method reduces quantization error while leveraging sparse representations compatible with accelerated hardware architectures. Additionally, we propose a parameter-efficient fine-tuning recipe that significantly reduces overhead compared to conventional quantization-aware training. SLiM achieves up to a 5.4% improvement in model accuracy for sparsity patterns like 2:4, and the fine-tuning step further enhances accuracy by up to 5.8%, demonstrating state-of-the-art performance. This work provides a pathway for efficiently deploying large models in memory-constrained environments without compromising accuracy.
Authors: Jiaxin Zhang, Wendi Cui, Yiran Huang, Kamalika Das, Sricharan Kumar
Abstract: Large language models (LLMs) are proficient in capturing factual knowledge across various domains. However, refining their capabilities on previously seen knowledge or integrating new knowledge from external sources remains a significant challenge. In this work, we propose a novel synthetic knowledge ingestion method called Ski, which leverages fine-grained synthesis, interleaved generation, and assemble augmentation strategies to construct high-quality data representations from raw knowledge sources. We then integrate Ski and its variations with three knowledge injection techniques: Retrieval Augmented Generation (RAG), Supervised Fine-tuning (SFT), and Continual Pre-training (CPT) to inject and refine knowledge in language models. Extensive empirical experiments are conducted on various question-answering tasks spanning finance, biomedicine, and open-generation domains to demonstrate that Ski significantly outperforms baseline methods by facilitating effective knowledge injection. We believe that our work is an important step towards enhancing the factual accuracy of LLM outputs by refining knowledge representation and injection capabilities.
Authors: Abdullah Mamun, Lawrence D. Devoe, Mark I. Evans, David W. Britt, Judith Klein-Seetharaman, Hassan Ghasemzadeh
Abstract: Early detection of intrapartum risk enables interventions to potentially prevent or mitigate adverse labor outcomes such as cerebral palsy. Currently, there is no accurate automated system to predict such events to assist with clinical decision-making. To fill this gap, we propose "Artificial Intelligence (AI) for Modeling and Explaining Neonatal Health" (AIMEN), a deep learning framework that not only predicts adverse labor outcomes from maternal, fetal, obstetrical, and intrapartum risk factors but also provides the model's reasoning behind the predictions made. The latter can provide insights into what modifications in the input variables of the model could have changed the predicted outcome. We address the challenges of imbalance and small datasets by synthesizing additional training data using Adaptive Synthetic Sampling (ADASYN) and Conditional Tabular Generative Adversarial Networks (CTGAN). AIMEN uses an ensemble of fully-connected neural networks as the backbone for its classification with the data augmentation supported by either ADASYN or CTGAN. AIMEN, supported by CTGAN, outperforms AIMEN supported by ADASYN in classification. AIMEN can predict a high risk for adverse labor outcomes with an average F1 score of 0.784. It also provides counterfactual explanations that can be achieved by changing 2 to 3 attributes on average. Resources available: https://github.com/ab9mamun/AIMEN.
Authors: Ryotaro Nagase, Takashi Sumiyoshi, Natsuo Yamashita, Kota Dohi, Yohei Kawaguchi
Abstract: This paper proposes a zero-shot speech emotion recognition (SER) method that estimates emotions not previously defined in the SER model training. Conventional methods are limited to recognizing emotions defined by a single word. Moreover, we have the motivation to recognize unknown bipolar emotions such as ``I want to buy - I do not want to buy.'' In order to allow the model to define classes using sentences freely and to estimate unknown bipolar emotions, our proposed method expands upon the contrastive language-audio pre-training (CLAP) framework by introducing multi-class and multi-task settings. We also focus on purchase intention as a bipolar emotion and investigate the model's performance to zero-shot estimate it. This study is the first attempt to estimate purchase intention from speech directly. Experiments confirm that the results of zero-shot estimation by the proposed method are at the same level as those of the model trained by supervised learning.
Authors: Nandan Kumar Jha, Brandon Reagen
Abstract: LayerNorm is a critical component in modern large language models (LLMs) for stabilizing training and ensuring smooth optimization. However, it introduces significant challenges in mechanistic interpretability, outlier feature suppression, faithful signal propagation, and computational and communication complexity of private inference. This work explores desirable activation functions in normalization-free decoder-only LLMs. Contrary to the conventional preference for the GELU in transformer-based models, our empirical findings demonstrate an {\em opposite trend} -- ReLU significantly outperforms GELU in LayerNorm-free models, leading to an {\bf 8.2\%} perplexity improvement. We discover a key issue with GELU, where early layers experience entropic overload, leading to the under-utilization of the representational capacity of attention heads. This highlights that smoother activations like GELU are {\em ill-suited} for LayerNorm-free architectures, whereas ReLU's geometrical properties -- specialization in input space and intra-class selectivity -- lead to improved learning dynamics and better information retention in the absence of LayerNorm. This study offers key insights for optimizing transformer architectures where LayerNorm introduces significant challenges.
Authors: El-Mahdi El-Mhamdi, L\^e-Nguy\^en Hoang
Abstract: ``When a measure becomes a target, it ceases to be a good measure'', this adage is known as {\it Goodhart's law}. In this paper, we investigate formally this law and prove that it critically depends on the tail distribution of the discrepancy between the true goal and the measure that is optimized. Discrepancies with long-tail distributions favor a Goodhart's law, that is, the optimization of the measure can have a counter-productive effect on the goal. We provide a formal setting to assess Goodhart's law by studying the asymptotic behavior of the correlation between the goal and the measure, as the measure is optimized. Moreover, we introduce a distinction between a {\it weak} Goodhart's law, when over-optimizing the metric is useless for the true goal, and a {\it strong} Goodhart's law, when over-optimizing the metric is harmful for the true goal. A distinction which we prove to depend on the tail distribution. We stress the implications of this result to large-scale decision making and policies that are (and have to be) based on metrics, and propose numerous research directions to better assess the safety of such policies in general, and to the particularly concerning case where these policies are automated with algorithms.
Authors: Abdullah Mamun, Krista S. Leonard, Megan E. Petrov, Matthew P. Buman, Hassan Ghasemzadeh
Abstract: Objective: This research aims to develop a lifestyle intervention system, called MoveSense, that forecasts a patient's activity behavior to allow for early and personalized interventions in real-world clinical environments. Methods: We conducted two clinical studies involving 58 prediabetic veterans and 60 patients with obstructive sleep apnea to gather multimodal behavioral data using wearable devices. We develop multimodal long short-term memory (LSTM) network models, which are capable of forecasting the number of step counts of a patient up to 24 hours in advance by examining data from activity and engagement modalities. Furthermore, we design goal-based forecasting models to predict whether a person's next-day steps will be over a certain threshold. Results: Multimodal LSTM with early fusion achieves 33% and 37% lower mean absolute errors than linear regression and ARIMA respectively on the prediabetes dataset. LSTM also outperforms linear regression and ARIMA with a margin of 13% and 32% on the sleep dataset. Multimodal forecasting models also perform with 72% and 79% accuracy on the prediabetes dataset and sleep dataset respectively on goal-based forecasting. Conclusion: Our experiments conclude that multimodal LSTM models with early fusion are better than multimodal LSTM with late fusion and unimodal LSTM models and also than ARIMA and linear regression models. Significance: We address an important and challenging task of time-series forecasting in uncontrolled environments. Effective forecasting of a person's physical activity can aid in designing adaptive behavioral interventions to keep the user engaged and adherent to a prescribed routine.
Authors: Ankita Sinha, Wendi Cui, Kamalika Das, Jiaxin Zhang
Abstract: Large language models (LLMs) have demonstrated remarkable capabilities; however, the optimization of their prompts has historically prioritized performance metrics at the expense of crucial safety and security considerations. To overcome this shortcoming, we introduce "Survival of the Safest" (SoS), an innovative multi-objective prompt optimization framework that enhances both performance and security in LLMs simultaneously. SoS utilizes an interleaved multi-objective evolution strategy, integrating semantic, feedback, and crossover mutations to effectively traverse the prompt landscape. Differing from the computationally demanding Pareto front methods, SoS provides a scalable solution that expedites optimization in complex, high-dimensional discrete search spaces while keeping computational demands low. Our approach accommodates flexible weighting of objectives and generates a pool of optimized candidates, empowering users to select prompts that optimally meet their specific performance and security needs. Experimental evaluations across diverse benchmark datasets affirm SoS's efficacy in delivering high performance and notably enhancing safety and security compared to single-objective methods. This advancement marks a significant stride towards the deployment of LLM systems that are both high-performing and secure across varied industrial applications
Authors: Aly Sabri Abdalla, Ahmad Al-Kabbany, Ehab F. Badran, Vuk Marojevic
Abstract: Vehicle-to-everything (V2X) networks support a variety of safety, entertainment, and commercial applications. This is realized by applying the principles of the Internet of Vehicles (IoV) to facilitate connectivity among vehicles and between vehicles and roadside units (RSUs). Network congestion management is essential for IoVs and it represents a significant concern due to its impact on improving the efficiency of transportation systems and providing reliable communication among vehicles for the timely delivery of safety-critical packets. This paper introduces a framework for proactive congestion management for IoV networks. We generate congestion scenarios and a data set to predict the congestion using LSTM. We present the framework and the packet congestion dataset. Simulation results using SUMO with NS3 demonstrate the effectiveness of the framework for forecasting IoV network congestion and clustering/prioritizing packets employing recurrent neural networks.
Authors: Christopher Diehl, Peter Karkus, Shushant Veer, Marco Pavone, Torsten Bertram
Abstract: Distribution shifts between operational domains can severely affect the performance of learned models in self-driving vehicles (SDVs). While this is a well-established problem, prior work has mostly explored naive solutions such as fine-tuning, focusing on the motion prediction task. In this work, we explore novel adaptation strategies for differentiable autonomy stacks consisting of prediction, planning, and control, perform evaluation in closed-loop, and investigate the often-overlooked issue of catastrophic forgetting. Specifically, we introduce two simple yet effective techniques: a low-rank residual decoder (LoRD) and multi-task fine-tuning. Through experiments across three models conducted on two real-world autonomous driving datasets (nuPlan, exiD), we demonstrate the effectiveness of our methods and highlight a significant performance gap between open-loop and closed-loop evaluation in prior approaches. Our approach improves forgetting by up to 23.33% and the closed-loop OOD driving score by 8.83% in comparison to standard fine-tuning.
Authors: Ajinkya Tejankar, KL Navaneet, Ujjawal Panchal, Kossar Pourahmadi, Hamed Pirsiavash
Abstract: The goal of this paper is to improve (upcycle) an existing large language model without the prohibitive requirements of continued pre-training of the full-model. The idea is to split the pre-training data into semantically relevant groups and train an expert on each subset. An expert takes the form of a lightweight adapter added on the top of a frozen base model. During inference, an incoming query is first routed to the most relevant expert which is then loaded onto the base model for the forward pass. Unlike typical Mixture of Experts (MoE) models, the experts in our method do not work with other experts for a single query. Hence, we dub them "introvert" experts. Freezing the base model and keeping the experts as lightweight adapters allows extreme parallelism during training and inference. Training of all experts can be done in parallel without any communication channels between them. Similarly, the inference can also be heavily parallelized by distributing experts on different GPUs and routing each request to the GPU containing its relevant expert. We implement a proof-of-concept version of this method and show the validity of our approach.
Authors: Kaidong Li, Tianxiao Zhang, Chuncong Zhong, Ziming Zhang, Guanghui Wang
Abstract: 3D point cloud classification requires distinct models from 2D image classification due to the divergent characteristics of the respective input data. While 3D point clouds are unstructured and sparse, 2D images are structured and dense. Bridging the domain gap between these two data types is a non-trivial challenge to enable model interchangeability. Recent research using Lattice Point Classifier (LPC) highlights the feasibility of cross-domain applicability. However, the lattice projection operation in LPC generates 2D images with disconnected projected pixels. In this paper, we explore three distinct algorithms for mapping 3D point clouds into 2D images. Through extensive experiments, we thoroughly examine and analyze their performance and defense mechanisms. Leveraging current large foundation models, we scrutinize the feature disparities between regular 2D images and projected 2D images. The proposed approaches demonstrate superior accuracy and robustness against adversarial attacks. The generative model-based mapping algorithms yield regular 2D images, further minimizing the domain gap from regular 2D classification tasks. The source code is available at https://github.com/KaidongLi/pytorch-LatticePointClassifier.git.
URLs: https://github.com/KaidongLi/pytorch-LatticePointClassifier.git.
Authors: Hai Huang, Randall Balestriero
Abstract: Low-Rank Adaptation (LoRA) is the bread and butter of Large Language Model (LLM) finetuning. LoRA learns an additive low-rank perturbation, $AB$, of a pretrained matrix parameter $W$ to align the model to a new task or dataset with $W+AB$. We identify three core limitations to LoRA for finetuning--a setting that employs limited amount of data and training steps. First, LoRA employs Dropout to prevent overfitting. We prove that Dropout is only suitable for long training episodes but fails to converge to a reliable regularizer for short training episodes. Second, LoRA's initialization of $B$ at $0$ creates a slow training dynamic between $A$ and $B$. That dynamic is also exacerbated by Dropout that further slows the escape from $0$ for $B$ which is particularly harmful for short training episodes. Third, the scaling factor multiplying each LoRA additive perturbation creates ``short-sighted'' interactions between the LoRA modules of different layers. Motivated by principled analysis of those limitations, we find an elegant solution: a Dropout-free, scaling-free, LoRA with Adaptive Learning rate--coined ALLoRA. By scaling the per sample and per parameter gradients with a coefficient inversely proportional to parameters' $\ell_2$ norm, ALLoRA alleviates those three limitations. As a by-product, ALLoRA removes two hyper-parameters from LoRA: the scaling factor and the dropout rate. Empirical results show that ALLoRA admits better accuracy than LoRA on various settings, including against recent LoRA variants such as Weight-Decomposed Low-Rank Adaptation (DoRA). Ablation studies show our solution is the optimal in a family of weight-dependent / output-dependent approaches on various LLMs including the latest Llama3.
Authors: Chengrui Gao, Haopu Shang, Ke Xue, Chao Qian
Abstract: Machine learning has increasingly been employed to solve NP-hard combinatorial optimization problems, resulting in the emergence of neural solvers that demonstrate remarkable performance, even with minimal domain-specific knowledge. To date, the community has created numerous open-source neural solvers with distinct motivations and inductive biases. While considerable efforts are devoted to designing powerful single solvers, our findings reveal that existing solvers typically demonstrate complementary performance across different problem instances. This suggests that significant improvements could be achieved through effective coordination of neural solvers at the instance level. In this work, we propose the first general framework to coordinate the neural solvers, which involves feature extraction, selection model, and selection strategy, aiming to allocate each instance to the most suitable solvers. To instantiate, we collect several typical neural solvers with state-of-the-art performance as alternatives, and explore various methods for each component of the framework. We evaluated our framework on two extensively studied combinatorial optimization problems, Traveling Salesman Problem (TSP) and Capacitated Vehicle Routing Problem (CVRP). Experimental results show that the proposed framework can effectively distribute instances and the resulting composite solver can achieve significantly better performance (e.g., reduce the optimality gap by 0.88\% on TSPLIB and 0.71\% on CVRPLIB) than the best individual neural solver with little extra time cost.
Authors: Qixun Wang, Yifei Wang, Yisen Wang, Xianghua Ying
Abstract: In this work, we explore the mechanism of in-context learning (ICL) on out-of-distribution (OOD) tasks that were not encountered during training. To achieve this, we conduct synthetic experiments where the objective is to learn OOD mathematical functions through ICL using a GPT-2 model. We reveal that Transformers may struggle to learn OOD task functions through ICL. Specifically, ICL performance resembles implementing a function within the pretraining hypothesis space and optimizing it with gradient descent based on the in-context examples. Additionally, we investigate ICL's well-documented ability to learn unseen abstract labels in context. We demonstrate that such ability only manifests in the scenarios without distributional shifts and, therefore, may not serve as evidence of new-task-learning ability. Furthermore, we assess ICL's performance on OOD tasks when the model is pretrained on multiple tasks. Both empirical and theoretical analyses demonstrate the existence of the \textbf{low-test-error preference} of ICL, where it tends to implement the pretraining function that yields low test error in the testing context. We validate this through numerical experiments. This new theoretical result, combined with our empirical findings, elucidates the mechanism of ICL in addressing OOD tasks.
Authors: Xinxi Chen, Li Wang, Wei Wu, Qi Tang, Yiyao Liu
Abstract: Hallucination is a key roadblock for applications of Large Language Models (LLMs), particularly for enterprise applications that are sensitive to information accuracy. To address this issue, two general approaches have been explored: Retrieval-Augmented Generation (RAG) to supply LLMs with updated information as context, and fine-tuning the LLMs with new information and desired output styles. In this paper, we propose Honest AI: a novel strategy to fine-tune "small" language models to say "I don't know" to reduce hallucination, along with several alternative RAG approaches. The solution ranked 1st in Task 2 for the false premise question. The alternative approaches include using RAG with search engine and knowledge graph results, fine-tuning base LLMs with new information and combinations of both approaches. Although all approaches improve the performance of the LLMs, RAG alone does not significantly improve the performance and fine-tuning is needed for better results. Finally, the hybrid approach achieved the highest score in the CRAG benchmark. In addition, our approach emphasizes the use of relatively small models with fewer than 10 billion parameters, promoting resource efficiency.
Authors: Sheng-Chen Bai, Shi-Ju Ran
Abstract: Interpreting the representation and generalization powers has been a long-standing issue in the field of machine learning (ML) and artificial intelligence. This work contributes to uncovering the emergence of universal scaling laws in quantum-probabilistic ML. We take the generative tensor network (GTN) in the form of a matrix product state as an example and show that with an untrained GTN (such as a random TN state), the negative logarithmic likelihood (NLL) $L$ generally increases linearly with the number of features $M$, i.e., $L \simeq k M + const$. This is a consequence of the so-called ``catastrophe of orthogonality,'' which states that quantum many-body states tend to become exponentially orthogonal to each other as $M$ increases. We reveal that while gaining information through training, the linear scaling law is suppressed by a negative quadratic correction, leading to $L \simeq \beta M - \alpha M^2 + const$. The scaling coefficients exhibit logarithmic relationships with the number of training samples and the number of quantum channels $\chi$. The emergence of the quadratic correction term in NLL for the testing (training) set can be regarded as evidence of the generalization (representation) power of GTN. Over-parameterization can be identified by the deviation in the values of $\alpha$ between training and testing sets while increasing $\chi$. We further investigate how orthogonality in the quantum feature map relates to the satisfaction of quantum probabilistic interpretation, as well as to the representation and generalization powers of GTN. The unveiling of universal scaling laws in quantum-probabilistic ML would be a valuable step toward establishing a white-box ML scheme interpreted within the quantum probabilistic framework.
Authors: Weinan Zhang, Junwei Liao, Ning Li, Kounianhua Du
Abstract: What will information entry look like in the next generation of digital products? Since the 1970s, user access to relevant information has relied on domain-specific architectures of information retrieval (IR). Over the past two decades, the advent of modern IR systems, including web search engines and personalized recommender systems, has greatly improved the efficiency of retrieving relevant information from vast data corpora. However, the core paradigm of these IR systems remains largely unchanged, relying on filtering a predefined set of candidate items. Since 2022, breakthroughs in large language models (LLMs) have begun transforming how information is accessed, establishing a new technical paradigm. In this position paper, we introduce Agentic Information Retrieval (Agentic IR), a novel IR paradigm shaped by the capabilities of LLM agents. Agentic IR expands the scope of accessible tasks and leverages a suite of new techniques to redefine information retrieval. We discuss three types of cutting-edge applications of agentic IR and the challenges faced. We propose that agentic IR holds promise for generating innovative applications, potentially becoming a central information entry point in future digital ecosystems.
Authors: Tengfei Cheng, Yunxuan Dong, Yangdi Huang
Abstract: Tidal energy is one of the key components in increasing the penetration rate of renewable energy. The penetration of tidal energy in the electrical grid depends on the accuracy of tidal current speed forecasting. Modeling inaccuracies hinder forecast accuracy. Previous research has primarily used physical models to forecast tidal current speed. However, tidal current variations influenced by the orbital periods of celestial bodies make accurate physical modeling challenging. Researching the multiple periodicity of tides is crucial for accurately forecasting tidal current speed. In this article, we propose the Wavelet-Enhanced Convolutional Network (WCN) to learn multiple periodicity. The framework embeds intra-period and inter-period variations of one-dimensional tidal current data into the rows and columns of a two-dimensional tensor. Then, the two-dimensional variations of the sequence can be processed by convolutional kernels. We integrate a time-frequency analysis method into the framework to further address local periodic features. Additionally, to enhance the framework's stability, we optimize the framework's hyperparameters with the Tree-structured Parzen Estimator algorithm. The proposed framework avoids the lack of learning multiple periodicity. Compared with benchmarks, the proposed framework reduces the mean absolute error and mean square error in 10-step forecasting by, at most, 90.36% and 97.56%, respectively.
Authors: Tavish Mankash, V. S. Chaithanya Kota, Anish De, Praveen Prakash, Kshitij Jadhav
Abstract: Hospitals generate thousands of handwritten prescriptions, a practice that remains prevalent despite the availability of Electronic Medical Records (EMR). This method of record-keeping hinders the examination of long-term medication effects, impedes statistical analysis, and makes the retrieval of records challenging. Handwritten prescriptions pose a unique challenge, requiring specialized data for training models to recognize medications and their patterns of recommendation. While current handwriting recognition approaches typically employ 2-D LSTMs, recent studies have explored the use of Large Language Models (LLMs) for Optical Character Recognition (OCR). Building on this approach, we focus on extracting medication names from medical records. Our methodology MIRAGE (Multimodal Identification and Recognition of Annotations in indian GEneral prescriptions) involves fine-tuning the LLaVA 1.6 and Idefics2 models. Our research utilizes a dataset provided by Medyug Technology, consisting of 743,118 fully annotated high-resolution simulated medical records from 1,133 doctors across India. We demonstrate that our methodology exhibits 82% accuracy in medication name and dosage extraction. We provide a detailed account of our research methodology and results, notes about HWR with Multimodal LLMs, and release a small dataset of 100 medical records with labels.
Authors: Dotan Di Castro, Omkar Joglekar, Shir Kozlovsky, Vladimir Tchuiev, Michal Moshkovitz
Abstract: Training neural networks is computationally heavy and energy-intensive. Many methodologies were developed to save computational requirements and energy by reducing the precision of network weights at inference time and introducing techniques such as rounding, stochastic rounding, and quantization. However, most of these techniques still require full gradient precision at training time, which makes training such models prohibitive on edge devices. This work presents a novel technique for training neural networks without needing gradients. This enables a training process where all the weights are one or two bits, without any hidden full precision computations. We show that it is possible to train models without gradient-based optimization techniques by identifying erroneous contributions of each neuron towards the expected classification and flipping the relevant bits using logical operations. We tested our method on several standard datasets and achieved performance comparable to corresponding gradient-based baselines with a fraction of the compute power.
Authors: Pengfei Hu, Yuhang Qian, Tianyue Zheng, Ang Li, Zhe Chen, Yue Gao, Xiuzhen Cheng, Jun Luo
Abstract: Given the wide adoption of multimodal sensors (e.g., camera, lidar, radar) by autonomous vehicles (AVs), deep analytics to fuse their outputs for a robust perception become imperative. However, existing fusion methods often make two assumptions rarely holding in practice: i) similar data distributions for all inputs and ii) constant availability for all sensors. Because, for example, lidars have various resolutions and failures of radars may occur, such variability often results in significant performance degradation in fusion. To this end, we present tREADi, an adaptive inference system that accommodates the variability of multimodal sensory data and thus enables robust and efficient perception. t-READi identifies variation-sensitive yet structure-specific model parameters; it then adapts only these parameters while keeping the rest intact. t-READi also leverages a cross-modality contrastive learning method to compensate for the loss from missing modalities. Both functions are implemented to maintain compatibility with existing multimodal deep fusion methods. The extensive experiments evidently demonstrate that compared with the status quo approaches, t-READi not only improves the average inference accuracy by more than 6% but also reduces the inference latency by almost 15x with the cost of only 5% extra memory overhead in the worst case under realistic data and modal variations.
Authors: Juseong Jin, Chang Wook Jeong
Abstract: Conversation agents powered by large language models are revolutionizing the way we interact with visual data. Recently, large vision-language models (LVLMs) have been extensively studied for both images and videos. However, these studies typically focus on common scenarios. In this work, we introduce an LVLM specifically designed for surgical scenarios. We integrate visual representations of surgical images and videos into the language feature space. Consequently, we establish a LVLM model, Surgical-LLaVA, fine-tuned on instruction following data of surgical scenarios. Our experiments demonstrate that Surgical-LLaVA exhibits impressive multi-modal chat abilities in surgical contexts, occasionally displaying multi-modal behaviors on unseen instructions. We conduct a quantitative evaluation of visual question-answering datasets for surgical scenarios. The results show superior performance compared to previous works, indicating the potential of our model to tackle more complex surgery scenarios.
Authors: Hojoon Lee, Dongyoon Hwang, Donghu Kim, Hyunseung Kim, Jun Jet Tai, Kaushik Subramanian, Peter R. Wurman, Jaegul Choo, Peter Stone, Takuma Seno
Abstract: Recent advances in CV and NLP have been largely driven by scaling up the number of network parameters, despite traditional theories suggesting that larger networks are prone to overfitting. These large networks avoid overfitting by integrating components that induce a simplicity bias, guiding models toward simple and generalizable solutions. However, in deep RL, designing and scaling up networks have been less explored. Motivated by this opportunity, we present SimBa, an architecture designed to scale up parameters in deep RL by injecting a simplicity bias. SimBa consists of three components: (i) an observation normalization layer that standardizes inputs with running statistics, (ii) a residual feedforward block to provide a linear pathway from the input to output, and (iii) a layer normalization to control feature magnitudes. By scaling up parameters with SimBa, the sample efficiency of various deep RL algorithms-including off-policy, on-policy, and unsupervised methods-is consistently improved. Moreover, solely by integrating SimBa architecture into SAC, it matches or surpasses state-of-the-art deep RL methods with high computational efficiency across DMC, MyoSuite, and HumanoidBench. These results demonstrate SimBa's broad applicability and effectiveness across diverse RL algorithms and environments.
Authors: Biplov Paneru, Bishwash Paneru, Khem Narayan Poudyal
Abstract: This paper presents an Artificial Intelligence (AI) integrated novel approach to Brain-Computer Interface (BCI)-based wheelchair development, utilizing a voluntary Right Left Hand Movement mechanism for control. The system is designed to simulate wheelchair navigation based on voluntary right and left-hand movements using electroencephalogram (EEG) data. A pre-filtered dataset, obtained from an open-source EEG repository, was segmented into arrays of 19x200 to capture the onset of hand movements. The data was acquired at a sampling frequency 200Hz in the laboratory experiment. The system integrates a Tkinter-based interface for simulating wheelchair movements, offering users a functional and intuitive control system. Various machine learning models, including Support Vector Machines (SVM), XGBoost, random forest, and a Bi-directional Long Short-Term Memory (Bi-LSTM) attention-based model, were developed. The random forest model obtained 79% accuracy. Great performance was seen on the Logistic Regression model which outperforms other models with 92% accuracy and 91% accuracy on the Multi-Layer Perceptron (MLP) model. The Bi-LSTM attention-based model achieved a mean accuracy of 86% through cross-validation, showcasing the potential of attention mechanisms in BCI applications.
Authors: Huan Liu, Shusen Yang, Yuzhe Zhang, Mengze Wang, Fanyu Gong, Chengxi Xie, Guanjian Liu, Dalin Zhang
Abstract: EEG-based emotion recognition (EER) is garnering increasing attention due to its potential in understanding and analyzing human emotions. Recently, significant advancements have been achieved using various deep learning-based techniques to address the EER problem. However, the absence of a convincing benchmark and open-source codebase complicates fair comparisons between different models and poses reproducibility challenges for practitioners. These issues considerably impede progress in this field. In light of this, we propose a comprehensive benchmark and algorithm library (LibEER) for fair comparisons in EER by making most of the implementation details of different methods consistent and using the same single codebase in PyTorch. In response to these challenges, we propose LibEER, a comprehensive benchmark and algorithm library for fair comparisons in EER, by ensuring consistency in the implementation details of various methods and utilizing a single codebase in PyTorch. LibEER establishes a unified evaluation framework with standardized experimental settings, enabling unbiased evaluations of over ten representative deep learning-based EER models across the four most commonly used datasets. Additionally, we conduct an exhaustive and reproducible comparison of the performance and efficiency of popular models, providing valuable insights for researchers in selecting and designing EER models. We aspire for our work to not only lower the barriers for beginners entering the field of EEG-based emotion recognition but also promote the standardization of research in this domain, thereby fostering steady development. The source code is available at \url{https://github.com/ButterSen/LibEER}.
Authors: Sandeep Kumar, Mohit Sahu, Vardhan Gacche, Tirthankar Ghosal, Asif Ekbal
Abstract: The integrity of the peer-review process is vital for maintaining scientific rigor and trust within the academic community. With the steady increase in the usage of large language models (LLMs) like ChatGPT in academic writing, there is a growing concern that AI-generated texts could compromise scientific publishing, including peer-reviews. Previous works have focused on generic AI-generated text detection or have presented an approach for estimating the fraction of peer-reviews that can be AI-generated. Our focus here is to solve a real-world problem by assisting the editor or chair in determining whether a review is written by ChatGPT or not. To address this, we introduce the Term Frequency (TF) model, which posits that AI often repeats tokens, and the Review Regeneration (RR) model, which is based on the idea that ChatGPT generates similar outputs upon re-prompting. We stress test these detectors against token attack and paraphrasing. Finally, we propose an effective defensive strategy to reduce the effect of paraphrasing on our models. Our findings suggest both our proposed methods perform better than the other AI text detectors. Our RR model is more robust, although our TF model performs better than the RR model without any attacks. We make our code, dataset, and model public.
Authors: Yingjing Xu, Xueyan Cai, Zihong Zhou, Mengru Xue, Bo Wang, Haotian Wang, Zhengke Li, Chentian Weng, Wei Luo, Cheng Yao, Bo Lin, Jianwei Yin
Abstract: Hypomimia is a non-motor symptom of Parkinson's disease that manifests as delayed facial movements and expressions, along with challenges in articulation and emotion. Currently, subjective evaluation by neurologists is the primary method for hypomimia detection, and conventional rehabilitation approaches heavily rely on verbal prompts from rehabilitation physicians. There remains a deficiency in accessible, user-friendly and scientifically rigorous assistive tools for hypomimia treatments. To investigate this, we developed HypomimaCoach, an Action Unit (AU)-based digital therapy system for hypomimia detection and rehabilitation in Parkinson's disease. The HypomimaCoach system was designed to facilitate engagement through the incorporation of both relaxed and controlled rehabilitation exercises, while also stimulating initiative through the integration of digital therapies that incorporated traditional face training methods. We extract action unit(AU) features and their relationship for hypomimia detection. In order to facilitate rehabilitation, a series of training programmes have been devised based on the Action Units (AUs) and patients are provided with real-time feedback through an additional AU recognition model, which guides them through their training routines. A pilot study was conducted with seven participants in China, all of whom exhibited symptoms of Parkinson's disease hypomimia. The results of the pilot study demonstrated a positive impact on participants' self-efficacy, with favourable feedback received. Furthermore, physician evaluations validated the system's applicability in a therapeutic setting for patients with Parkinson's disease, as well as its potential value in clinical applications.
Authors: Gisang Lee, Sangwoo Park, Junyoung Park, Andrew Chung, Sieun Park, Yoonah Park, Byungju Kim, Min-gyu Cho
Abstract: Large Language Models (LLMs) have exhibited remarkable capabilities in many complex tasks including mathematical reasoning. However, traditional approaches heavily rely on ensuring self-consistency within single prompting method, which limits the exploration of diverse problem-solving strategies. This study addresses these limitations by performing an experimental analysis of distinct prompting methods within the domain of mathematical reasoning. Our findings demonstrate that each method explores a distinct search space, and this differentiation becomes more evident with increasing problem complexity. To leverage this phenomenon, we applied efficient sampling process that uniformly combines samples from these diverse methods, which not only expands the maximum search space but achieves higher performance with fewer runs compared to single methods. Especially, within the subset of difficult questions of MATH dataset named MATH-hard, The maximum search space was achieved while utilizing approximately 43% fewer runs than single methods on average. These findings highlight the importance of integrating diverse problem-solving strategies to enhance the reasoning abilities of LLMs.
Authors: Fanmeng Wang, Minjie Cheng, Hongteng Xu
Abstract: Predicting ground-state conformation from the corresponding molecular graph is crucial for many chemical applications, such as molecular modeling, molecular docking, and molecular property prediction. Recently, many learning-based methods have been proposed to replace time-consuming simulations for this task. However, these methods are often inefficient and sub-optimal as they merely rely on molecular graph information to make predictions from scratch. In this work, considering that molecular low-quality conformations are readily available, we propose a novel framework called ConfOpt to predict molecular ground-state conformation from the perspective of conformation optimization. Specifically, ConfOpt takes the molecular graph and corresponding low-quality 3D conformation as inputs, and then derives the ground-state conformation by iteratively optimizing the low-quality conformation under the guidance of the molecular graph. During training, ConfOpt concurrently optimizes the predicted atomic 3D coordinates and the corresponding interatomic distances, resulting in a strong predictive model. Extensive experiments demonstrate that ConfOpt significantly outperforms existing methods, thus providing a new paradigm for efficiently and accurately predicting molecular ground-state conformation.
Authors: Eungbean Lee, Somi Jeong, Kwanghoon Sohn
Abstract: Exemplar-guided image translation, synthesizing photo-realistic images that conform to both structural control and style exemplars, is attracting attention due to its ability to enhance user control over style manipulation. Previous methodologies have predominantly depended on establishing dense correspondences across cross-domain inputs. Despite these efforts, they incur quadratic memory and computational costs for establishing dense correspondence, resulting in limited versatility and performance degradation. In this paper, we propose a novel approach termed Exemplar-guided Image Translation with Brownian-Bridge Diffusion Models (EBDM). Our method formulates the task as a stochastic Brownian bridge process, a diffusion process with a fixed initial point as structure control and translates into the corresponding photo-realistic image while being conditioned solely on the given exemplar image. To efficiently guide the diffusion process toward the style of exemplar, we delineate three pivotal components: the Global Encoder, the Exemplar Network, and the Exemplar Attention Module to incorporate global and detailed texture information from exemplar images. Leveraging Bridge diffusion, the network can translate images from structure control while exclusively conditioned on the exemplar style, leading to more robust training and inference processes. We illustrate the superiority of our method over competing approaches through comprehensive benchmark evaluations and visual results.
Authors: Xinyuan Wang, Victor Shea-Jay Huang, Renmiao Chen, Hao Wang, Chengwei Pan, Lei Sha, Minlie Huang
Abstract: While large language models (LLMs) exhibit remarkable capabilities across various tasks, they encounter potential security risks such as jailbreak attacks, which exploit vulnerabilities to bypass security measures and generate harmful outputs. Existing jailbreak strategies mainly focus on maximizing attack success rate (ASR), frequently neglecting other critical factors, including the relevance of the jailbreak response to the query and the level of stealthiness. This narrow focus on single objectives can result in ineffective attacks that either lack contextual relevance or are easily recognizable. In this work, we introduce BlackDAN, an innovative black-box attack framework with multi-objective optimization, aiming to generate high-quality prompts that effectively facilitate jailbreaking while maintaining contextual relevance and minimizing detectability. BlackDAN leverages Multiobjective Evolutionary Algorithms (MOEAs), specifically the NSGA-II algorithm, to optimize jailbreaks across multiple objectives including ASR, stealthiness, and semantic relevance. By integrating mechanisms like mutation, crossover, and Pareto-dominance, BlackDAN provides a transparent and interpretable process for generating jailbreaks. Furthermore, the framework allows customization based on user preferences, enabling the selection of prompts that balance harmfulness, relevance, and other factors. Experimental results demonstrate that BlackDAN outperforms traditional single-objective methods, yielding higher success rates and improved robustness across various LLMs and multimodal LLMs, while ensuring jailbreak responses are both relevant and less detectable.
Authors: Soyoung Yang, Hojun Cho, Jiyoung Lee, Sohee Yoon, Edward Choi, Jaegul Choo, Won Ik Cho
Abstract: Aspect-based sentiment analysis (ABSA) is the challenging task of extracting sentiment along with its corresponding aspects and opinions from human language. Due to the inherent variability of natural language, aspect and opinion terms can be expressed in various surface forms, making their accurate identification complex. Current evaluation methods for this task often restrict answers to a single ground truth, penalizing semantically equivalent predictions that differ in surface form. To address this limitation, we propose a novel, fully automated pipeline that augments existing test sets with alternative valid responses for aspect and opinion terms. This approach enables a fairer assessment of language models by accommodating linguistic diversity, resulting in higher human agreement than single-answer test sets (up to 10%p improvement in Kendall's Tau score). Our experimental results demonstrate that Large Language Models (LLMs) show substantial performance improvements over T5 models when evaluated using our augmented test set, suggesting that LLMs' capabilities in ABSA tasks may have been underestimated. This work contributes to a more comprehensive evaluation framework for ABSA, potentially leading to more accurate assessments of model performance in information extraction tasks, particularly those involving span extraction.
Authors: Md Tanvir Islam, Inzamamul Alam, Simon S. Woo, Saeed Anwar, IK Hyun Lee, Khan Muhammad
Abstract: Low-light image enhancement (LLIE) is essential for numerous computer vision tasks, including object detection, tracking, segmentation, and scene understanding. Despite substantial research on improving low-quality images captured in underexposed conditions, clear vision remains critical for autonomous vehicles, which often struggle with low-light scenarios, signifying the need for continuous research. However, paired datasets for LLIE are scarce, particularly for street scenes, limiting the development of robust LLIE methods. Despite using advanced transformers and/or diffusion-based models, current LLIE methods struggle in real-world low-light conditions and lack training on street-scene datasets, limiting their effectiveness for autonomous vehicles. To bridge these gaps, we introduce a new dataset LoLI-Street (Low-Light Images of Streets) with 33k paired low-light and well-exposed images from street scenes in developed cities, covering 19k object classes for object detection. LoLI-Street dataset also features 1,000 real low-light test images for testing LLIE models under real-life conditions. Furthermore, we propose a transformer and diffusion-based LLIE model named "TriFuse". Leveraging the LoLI-Street dataset, we train and evaluate our TriFuse and SOTA models to benchmark on our dataset. Comparing various models, our dataset's generalization feasibility is evident in testing across different mainstream datasets by significantly enhancing images and object detection for practical applications in autonomous driving and surveillance systems. The complete code and dataset is available on https://github.com/tanvirnwu/TriFuse.
Authors: Rui Min, Zeyu Qin, Nevin L. Zhang, Li Shen, Minhao Cheng
Abstract: Backdoor attacks pose a significant threat to Deep Neural Networks (DNNs) as they allow attackers to manipulate model predictions with backdoor triggers. To address these security vulnerabilities, various backdoor purification methods have been proposed to purify compromised models. Typically, these purified models exhibit low Attack Success Rates (ASR), rendering them resistant to backdoored inputs. However, Does achieving a low ASR through current safety purification methods truly eliminate learned backdoor features from the pretraining phase? In this paper, we provide an affirmative answer to this question by thoroughly investigating the Post-Purification Robustness of current backdoor purification methods. We find that current safety purification methods are vulnerable to the rapid re-learning of backdoor behavior, even when further fine-tuning of purified models is performed using a very small number of poisoned samples. Based on this, we further propose the practical Query-based Reactivation Attack (QRA) which could effectively reactivate the backdoor by merely querying purified models. We find the failure to achieve satisfactory post-tuning robustness stems from the insufficient deviation of purified models from the backdoored model along the backdoor-connected path. To improve the post-purification robustness, we propose a straightforward tuning defense, Path-Aware Minimization (PAM), which promotes deviation along backdoor-connected paths with extra model updates. Extensive experiments demonstrate that PAM significantly improves post-purification robustness while maintaining a good clean accuracy and low ASR. Our work provides a new perspective on understanding the effectiveness of backdoor safety tuning and highlights the importance of faithfully assessing the model's safety.
Authors: Hideyuki Oiso, Yuto Matsunaga, Kazuya Kakizaki, Taiki Miyagawa
Abstract: We study test-time domain adaptation for audio deepfake detection (ADD), addressing three challenges: (i) source-target domain gaps, (ii) limited target dataset size, and (iii) high computational costs. We propose an ADD method using prompt tuning in a plug-in style. It bridges domain gaps by integrating it seamlessly with state-of-the-art transformer models and/or with other fine-tuning methods, boosting their performance on target data (challenge (i)). In addition, our method can fit small target datasets because it does not require a large number of extra parameters (challenge (ii)). This feature also contributes to computational efficiency, countering the high computational costs typically associated with large-scale pre-trained models in ADD (challenge (iii)). We conclude that prompt tuning for ADD under domain gaps presents a promising avenue for enhancing accuracy with minimal target data and negligible extra computational burden.
Authors: Yein Park, Chanwoong Yoon, Jungwoo Park, Donghyeon Lee, Minbyul Jeong, Jaewoo Kang
Abstract: Large language models (LLMs) have significantly impacted many aspects of our lives. However, assessing and ensuring their chronological knowledge remains challenging. Existing approaches fall short in addressing the accumulative nature of knowledge, often relying on a single time stamp. To overcome this, we introduce ChroKnowBench, a benchmark dataset designed to evaluate chronologically accumulated knowledge across three key aspects: multiple domains, time dependency, temporal state. Our benchmark distinguishes between knowledge that evolves (e.g., scientific discoveries, amended laws) and knowledge that remain constant (e.g., mathematical truths, commonsense facts). Building on this benchmark, we present ChroKnowledge (Chronological Categorization of Knowledge), a novel sampling-based framework for evaluating and updating LLMs' non-parametric chronological knowledge. Our evaluation shows: (1) The ability of eliciting temporal knowledge varies depending on the data format that model was trained on. (2) LLMs partially recall knowledge or show a cut-off at temporal boundaries rather than recalling all aspects of knowledge correctly. Thus, we apply our ChroKnowPrompt, an in-depth prompting to elicit chronological knowledge by traversing step-by-step through the surrounding time spans. We observe that our framework successfully updates the overall knowledge across the entire timeline in both the biomedical domain (+11.9%) and the general domain (+2.8%), demonstrating its effectiveness in refining temporal knowledge. This non-parametric approach also enables knowledge updates not only in open-source models but also in proprietary LLMs, ensuring comprehensive applicability across model types. We perform a comprehensive analysis based on temporal characteristics of ChroKnowPrompt and validate the potential of various models to elicit intrinsic temporal knowledge through our method.
Authors: Linshan Wu, Jiaxin Zhuang, Hao Chen
Abstract: The scarcity of annotations poses a significant challenge in medical image analysis. Large-scale pre-training has emerged as a promising label-efficient solution, owing to the utilization of large-scale data, large models, and advanced pre-training techniques. However, its development in medical images remains underexplored. The primary challenge lies in harnessing large-scale unlabeled data and learning high-level semantics without annotations. We observe that 3D medical images exhibit consistent geometric context, i.e., consistent geometric relations between different organs, which leads to a promising way for learning consistent representations. Motivated by this, we introduce a simple-yet-effective Volume Contrast (VoCo) framework to leverage geometric context priors for self-supervision. Given an input volume, we extract base crops from different regions to construct positive and negative pairs for contrastive learning. Then we predict the contextual position of a random crop by contrasting its similarity to the base crops. In this way, VoCo encodes the inherent geometric context into model representations, facilitating high-level semantic learning without annotations. Specifically, we (1) introduce the largest medical pre-training dataset PreCT-160K; (2) investigate scaling laws and propose guidelines for tailoring different model sizes to various medical tasks; (3) build a benchmark encompassing 48 medical tasks. Extensive experiments highlight the superiority of VoCo. Codes at https://github.com/Luffy03/Large-Scale-Medical.
Authors: Pengfei Jin, Peng Shu, Sekeun Kim, Qing Xiao, Sifan Song, Cheng Chen, Tianming Liu, Xiang Li, Quanzheng Li
Abstract: Foundation models have become a cornerstone in deep learning, with techniques like Low-Rank Adaptation (LoRA) offering efficient fine-tuning of large models. Similarly, methods such as Retrieval-Augmented Generation (RAG), which leverage vectorized databases, have further improved model performance by grounding outputs in external information. While these approaches have demonstrated notable success, they often require extensive training or labeled data, which can limit their adaptability in resource-constrained environments. To address these challenges, we introduce Retrieval-based Parameter Ensemble (RPE), a new method that creates a vectorized database of LoRAs, enabling efficient retrieval and application of model adaptations to new tasks. RPE minimizes the need for extensive training and eliminates the requirement for labeled data, making it particularly effective for zero-shot learning. Additionally, RPE is well-suited for privacy-sensitive domains like healthcare, as it modifies model parameters without accessing raw data. When applied to tasks such as medical report generation and image segmentation, RPE not only proved effective but also surpassed supervised fine-tuning methods in certain cases, highlighting its potential to enhance both computational efficiency and privacy in deep learning applications.
Authors: Chunyan Mao, Shuaishuai Huang, Mingxiu Sui, Haowei Yang, Xueshe Wang
Abstract: With the rapid development of the internet and the explosion of information, providing users with accurate personalized recommendations has become an important research topic. This paper designs and analyzes a personalized recommendation system based on a dynamic user interest model. The system captures user behavior data, constructs a dynamic user interest model, and combines multiple recommendation algorithms to provide personalized content to users. The research results show that this system significantly improves recommendation accuracy and user satisfaction. This paper discusses the system's architecture design, algorithm implementation, and experimental results in detail and explores future research directions.
Authors: Megha Sharma, Muhammad Taimoor Haseeb, Gus Xia, Yoshimasa Tsuruoka
Abstract: This paper introduces M2M Gen, a multi modal framework for generating background music tailored to Japanese manga. The key challenges in this task are the lack of an available dataset or a baseline. To address these challenges, we propose an automated music generation pipeline that produces background music for an input manga book. Initially, we use the dialogues in a manga to detect scene boundaries and perform emotion classification using the characters faces within a scene. Then, we use GPT4o to translate this low level scene information into a high level music directive. Conditioned on the scene information and the music directive, another instance of GPT 4o generates page level music captions to guide a text to music model. This produces music that is aligned with the mangas evolving narrative. The effectiveness of M2M Gen is confirmed through extensive subjective evaluations, showcasing its capability to generate higher quality, more relevant and consistent music that complements specific scenes when compared to our baselines.
Authors: Dan Ley, Shichang Zhang, Suraj Srinivas, Gili Rusak, Himabindu Lakkaraju
Abstract: Data Attribution (DA) methods quantify the influence of individual training data points on model outputs and have broad applications such as explainability, data selection, and noisy label identification. However, existing DA methods are often computationally intensive, limiting their applicability to large-scale machine learning models. To address this challenge, we introduce the Generalized Group Data Attribution (GGDA) framework, which computationally simplifies DA by attributing to groups of training points instead of individual ones. GGDA is a general framework that subsumes existing attribution methods and can be applied to new DA techniques as they emerge. It allows users to optimize the trade-off between efficiency and fidelity based on their needs. Our empirical results demonstrate that GGDA applied to popular DA methods such as Influence Functions, TracIn, and TRAK results in upto 10x-50x speedups over standard DA methods while gracefully trading off attribution fidelity. For downstream applications such as dataset pruning and noisy label identification, we demonstrate that GGDA significantly improves computational efficiency and maintains effectiveness, enabling practical applications in large-scale machine learning scenarios that were previously infeasible.
Authors: Cynthia Jayne Amol, Everlyn Asiko Chimoto, Rose Delilah Gesicho, Antony M. Gitau, Naome A. Etori, Caringtone Kinyanjui, Steven Ndung'u, Lawrence Moruye, Samson Otieno Ooko, Kavengi Kitonga, Brian Muhia, Catherine Gitau, Antony Ndolo, Lilian D. A. Wanzare, Albert Njoroge Kahira, Ronald Tombe
Abstract: Kenya, known for its linguistic diversity, faces unique challenges and promising opportunities in advancing Natural Language Processing (NLP) technologies, particularly for its underrepresented indigenous languages. This survey provides a detailed assessment of the current state of NLP in Kenya, emphasizing ongoing efforts in dataset creation, machine translation, sentiment analysis, and speech recognition for local dialects such as Kiswahili, Dholuo, Kikuyu, and Luhya. Despite these advancements, the development of NLP in Kenya remains constrained by limited resources and tools, resulting in the underrepresentation of most indigenous languages in digital spaces. This paper uncovers significant gaps by critically evaluating the available datasets and existing NLP models, most notably the need for large-scale language models and the insufficient digital representation of Indigenous languages. We also analyze key NLP applications: machine translation, information retrieval, and sentiment analysis-examining how they are tailored to address local linguistic needs. Furthermore, the paper explores the governance, policies, and regulations shaping the future of AI and NLP in Kenya and proposes a strategic roadmap to guide future research and development efforts. Our goal is to provide a foundation for accelerating the growth of NLP technologies that meet Kenya's diverse linguistic demands.
Authors: Jingyu Liu, Xinyu Liu, Mingzhe Qu, Tianyi Lyu
Abstract: Integrating IoT technology into basketball action recognition enhances sports analytics, providing crucial insights into player performance and game strategy. However, existing methods often fall short in terms of accuracy and efficiency, particularly in complex, real-time environments where player movements are frequently occluded or involve intricate interactions. To overcome these challenges, we propose the EITNet model, a deep learning framework that combines EfficientDet for object detection, I3D for spatiotemporal feature extraction, and TimeSformer for temporal analysis, all integrated with IoT technology for seamless real-time data collection and processing. Our contributions include developing a robust architecture that improves recognition accuracy to 92\%, surpassing the baseline EfficientDet model's 87\%, and reducing loss to below 5.0 compared to EfficientDet's 9.0 over 50 epochs. Furthermore, the integration of IoT technology enhances real-time data processing, providing adaptive insights into player performance and strategy. The paper details the design and implementation of EITNet, experimental validation, and a comprehensive evaluation against existing models. The results demonstrate EITNet's potential to significantly advance automated sports analysis and optimize data utilization for player performance and strategy improvement.
Authors: Muhammad Umar, Muhammad Asif, Arif Mahmood
Abstract: Single-cell RNA sequencing (scRNA-seq) enables the study of cellular diversity at single cell level. It provides a global view of cell-type specification during the onset of biological mechanisms such as developmental processes and human organogenesis. Various statistical, machine and deep learning-based methods have been proposed for cell-type classification. Most of the methods utilizes unsupervised lower dimensional projections obtained from for a large reference data. In this work, we proposed a reference-based method for cell type classification, called EnProCell. The EnProCell, first, computes lower dimensional projections that capture both the high variance and class separability through an ensemble of principle component analysis and multiple discriminant analysis. In the second phase, EnProCell trains a deep neural network on the lower dimensional representation of data to classify cell types. The proposed method outperformed the existing state-of-the-art methods when tested on four different data sets produced from different single-cell sequencing technologies. The EnProCell showed higher accuracy (98.91) and F1 score (98.64) than other methods for predicting reference from reference datasets. Similarly, EnProCell also showed better performance than existing methods in predicting cell types for data with unknown cell types (query) from reference datasets (accuracy:99.52; F1 score: 99.07). In addition to improved performance, the proposed methodology is simple and does not require more computational resources and time. the EnProCell is available at https://github.com/umar1196/EnProCell.
Authors: Mohammad Mozafari, Hosein Hasani, Reza Vahidimajd, Mohamadreza Fereydooni, Mahdieh Soleymani Baghshah
Abstract: In recent years, few-shot segmentation (FSS) models have emerged as a promising approach in medical imaging analysis, offering remarkable adaptability to segment novel classes with limited annotated data. Existing approaches to few-shot segmentation have often overlooked the potential of the query itself, failing to fully utilize the valuable information it contains. However, treating the query as unlabeled data provides an opportunity to enhance prediction accuracy. Specifically in the domain of medical imaging, the volumetric structure of queries offers a considerable source of valuable information that can be used to improve the target slice segmentation. In this work, we present a novel strategy to efficiently leverage the intrinsic information of the query sample for final segmentation during inference. First, we use the support slices from a reference volume to generate an initial segmentation score for the query slices through a prototypical approach. Subsequently, we apply a confidence-aware pseudo-labeling procedure to transfer the most informative parts of query slices to the support set. The final prediction is performed based on the new expanded support set, enabling the prediction of a more accurate segmentation mask for the query volume. Extensive experiments show that the proposed method can effectively boost performance across diverse settings and datasets.
Authors: Kyungmin Kim, JB Lanier, Pierre Baldi, Charless Fowlkes, Roy Fox
Abstract: Recent advancements in Model-Based Reinforcement Learning (MBRL) have made it a powerful tool for visual control tasks. Despite improved data efficiency, it remains challenging to train MBRL agents with generalizable perception. Training in the presence of visual distractions is particularly difficult due to the high variation they introduce to representation learning. Building on DREAMER, a popular MBRL method, we propose a simple yet effective auxiliary task to facilitate representation learning in distracting environments. Under the assumption that task-relevant components of image observations are straightforward to identify with prior knowledge in a given task, we use a segmentation mask on image observations to only reconstruct task-relevant components. In doing so, we greatly reduce the complexity of representation learning by removing the need to encode task-irrelevant objects in the latent representation. Our method, Segmentation Dreamer (SD), can be used either with ground-truth masks easily accessible in simulation or by leveraging potentially imperfect segmentation foundation models. The latter is further improved by selectively applying the reconstruction loss to avoid providing misleading learning signals due to mask prediction errors. In modified DeepMind Control suite (DMC) and Meta-World tasks with added visual distractions, SD achieves significantly better sample efficiency and greater final performance than prior work. We find that SD is especially helpful in sparse reward tasks otherwise unsolvable by prior work, enabling the training of visually robust agents without the need for extensive reward engineering.
Authors: Michal Kosinski
Abstract: A growing number of studies have linked facial width-to-height ratio (fWHR) with various antisocial or violent behavioral tendencies. However, those studies have predominantly been laboratory based and low powered. This work reexamined the links between fWHR and behavioral tendencies in a large sample of 137,163 participants. Behavioral tendencies were measured using 55 well-established psychometric scales, including self-report scales measuring intelligence, domains and facets of the five-factor model of personality, impulsiveness, sense of fairness, sensational interests, self-monitoring, impression management, and satisfaction with life. The findings revealed that fWHR is not substantially linked with any of these self-reported measures of behavioral tendencies, calling into question whether the links between fWHR and behavior generalize beyond the small samples and specific experimental settings that have been used in past fWHR research.
Authors: Sandeep Sricharan Mukku, Abinesh Kanagarajan, Chetan Aggarwal, Promod Yenigalla
Abstract: Summarizing customer feedback to provide actionable insights for products/services at scale is an important problem for businesses across industries. Lately, the review volumes are increasing across regions and languages, therefore the challenge of aggregating and understanding customer sentiment across multiple languages becomes increasingly vital. In this paper, we propose a novel framework involving a two-step paradigm \textit{Extract-then-Summarise}, namely MARS to revolutionise traditions and address the domain agnostic aspect-level multilingual review summarisation. Extensive automatic and human evaluation shows that our approach brings substantial improvements over abstractive baselines and efficiency to real-time systems.
Authors: Nan Jiang, Qi Li, Lin Tan, Tianyi Zhang
Abstract: Despite their success, large language models (LLMs) face the critical challenge of hallucinations, generating plausible but incorrect content. While much research has focused on hallucinations in multiple modalities including images and natural language text, less attention has been given to hallucinations in source code, which leads to incorrect and vulnerable code that causes significant financial loss. To pave the way for research in LLMs' hallucinations in code, we introduce Collu-Bench, a benchmark for predicting code hallucinations of LLMs across code generation (CG) and automated program repair (APR) tasks. Collu-Bench includes 13,234 code hallucination instances collected from five datasets and 11 diverse LLMs, ranging from open-source models to commercial ones. To better understand and predict code hallucinations, Collu-Bench provides detailed features such as the per-step log probabilities of LLMs' output, token types, and the execution feedback of LLMs' generated code for in-depth analysis. In addition, we conduct experiments to predict hallucination on Collu-Bench, using both traditional machine learning techniques and neural networks, which achieves 22.03 -- 33.15% accuracy. Our experiments draw insightful findings of code hallucination patterns, reveal the challenge of accurately localizing LLMs' hallucinations, and highlight the need for more sophisticated techniques.
Authors: Guorui Lu, Jing Peng, Bingyuan Huang, Chang Gao, Todor Stefanov, Yong Hao, Qinyu Chen
Abstract: Epileptic seizures cause abnormal brain activity, and their unpredictability can lead to accidents, underscoring the need for long-term seizure prediction. Although seizures can be predicted by analyzing electroencephalogram (EEG) signals, existing methods often require too many electrode channels or larger models, limiting mobile usability. This paper introduces a SlimSeiz framework that utilizes adaptive channel selection with a lightweight neural network model. SlimSeiz operates in two states: the first stage selects the optimal channel set for seizure prediction using machine learning algorithms, and the second stage employs a lightweight neural network based on convolution and Mamba for prediction. On the Children's Hospital Boston-MIT (CHB-MIT) EEG dataset, SlimSeiz can reduce channels from 22 to 8 while achieving a satisfactory result of 94.8% accuracy, 95.5% sensitivity, and 94.0% specificity with only 21.2K model parameters, matching or outperforming larger models' performance. We also validate SlimSeiz on a new EEG dataset, SRH-LEI, collected from Shanghai Renji Hospital, demonstrating its effectiveness across different patients. The code and SRH-LEI dataset are available at https://github.com/guoruilu/SlimSeiz.
Authors: Sandeep Sricharan Mukku, Abinesh Kanagarajan, Pushpendu Ghosh, Chetan Aggarwal
Abstract: Businesses can benefit from customer feedback in different modalities, such as text and images, to enhance their products and services. However, it is difficult to extract actionable and relevant pairs of text segments and images from customer feedback in a single pass. In this paper, we propose a novel multi-modal method that fuses image and text information in a latent space and decodes it to extract the relevant feedback segments using an image-text grounded text decoder. We also introduce a weakly-supervised data generation technique that produces training data for this task. We evaluate our model on unseen data and demonstrate that it can effectively mine actionable insights from multi-modal customer feedback, outperforming the existing baselines by $14$ points in F1 score.
Authors: John M. Carpenter, Andrea Corvill\'on, Nihar B. Shah
Abstract: The increasing volume of papers and proposals undergoing peer review emphasizes the pressing need for greater automation to effectively manage the growing scale. In this study, we present the deployment and evaluation of machine learning and optimization techniques for assigning proposals to reviewers that was developed for the Atacama Large Millimeter/submillimeter Array (ALMA) during the Cycle 10 Call for Proposals issued in 2023. By utilizing topic modeling algorithms, we identify the proposal topics and assess reviewers' expertise based on their historical ALMA proposal submissions. We then apply an adapted version of the assignment optimization algorithm from PeerReview4All (Stelmakh et al. 2021a) to maximize the alignment between proposal topics and reviewer expertise. Our evaluation shows a significant improvement in matching reviewer expertise: the median similarity score between the proposal topic and reviewer expertise increased by 51 percentage points compared to the previous cycle, and the percentage of reviewers reporting expertise in their assigned proposals rose by 20 percentage points. Furthermore, the assignment process proved highly effective in that no proposals required reassignment due to significant mismatches, resulting in a savings of 3 to 5 days of manual effort.
Authors: Hyeong Kyu Choi, Xuefeng Du, Yixuan Li
Abstract: Fine-tuning Large Language Models (LLMs) has emerged as a common practice for tailoring models to individual needs and preferences. The choice of datasets for fine-tuning can be diverse, introducing safety concerns regarding the potential inclusion of harmful data samples. Manually filtering or avoiding such samples, however, can be labor-intensive and subjective. To address these difficulties, we propose a novel Safety-Aware Fine-Tuning (SAFT) framework designed to automatically detect and remove potentially harmful data, by leveraging a scoring function that exploits the subspace information of harmful and benign samples. Experimental results demonstrate the efficacy of SAFT across different LLMs and varying contamination rates, achieving reductions in harmfulness of up to 27.8%. Going beyond, we delve into the mechanism of our approach and validate its versatility in addressing practical challenges in real-world scenarios.
Authors: Vineet Jagadeesan Nair, Lucas Pereira
Abstract: This proposal aims to develop more accurate federated learning (FL) methods with faster convergence properties and lower communication requirements, specifically for forecasting distributed energy resources (DER) such as renewables, energy storage, and loads in modern, low-carbon power grids. This will be achieved by (i) leveraging recently developed extensions of FL such as hierarchical and iterative clustering to improve performance with non-IID data, (ii) experimenting with different types of FL global models well-suited to time-series data, and (iii) incorporating domain-specific knowledge from power systems to build more general FL frameworks and architectures that can be applied to diverse types of DERs beyond just load forecasting, and with heterogeneous clients.
Authors: Lecheng Zheng, Zhengzhang Chen, Haifeng Chen, Jingrui He
Abstract: Root Cause Analysis (RCA) is essential for pinpointing the root causes of failures in microservice systems. Traditional data-driven RCA methods are typically limited to offline applications due to high computational demands, and existing online RCA methods handle only single-modal data, overlooking complex interactions in multi-modal systems. In this paper, we introduce OCEAN, a novel online multi-modal causal structure learning method for root cause localization. OCEAN employs a dilated convolutional neural network to capture long-term temporal dependencies and graph neural networks to learn causal relationships among system entities and key performance indicators. We further design a multi-factor attention mechanism to analyze and reassess the relationships among different metrics and log indicators/attributes for enhanced online causal graph learning. Additionally, a contrastive mutual information maximization-based graph fusion module is developed to effectively model the relationships across various modalities. Extensive experiments on three real-world datasets demonstrate the effectiveness and efficiency of our proposed method.
Authors: Yun Joon Soh, Jishen Zhao
Abstract: The explosion of open-sourced models and Question-Answering (QA) datasets emphasizes the importance of automated QA evaluation. We studied the statistics of the existing evaluation metrics for a better understanding of their limitations. By measuring the correlation coefficients of each evaluation metric concerning human-like evaluation score, we observed the following: (1) existing metrics have a high correlation among them concerning the question type (e.g., single word, single phrase, etc.), (2) no single metric can adequately estimate the human-like evaluation. As a potential solution, we discuss how a Mixture Of Grader could potentially improve the auto QA evaluator quality.
Authors: Kunpeng Xu, Lifei Chen, Shengrui Wang
Abstract: Dynamic concepts in time series are crucial for understanding complex systems such as financial markets, healthcare, and online activity logs. These concepts help reveal structures and behaviors in sequential data for better decision-making and forecasting. Existing models struggle with detecting and tracking concept drift due to limitations in interpretability and adaptability. This paper introduces Kolmogorov-Arnold Networks (KAN) into time series and proposes WormKAN, a KAN-based auto-encoder to address concept drift in co-evolving time series. WormKAN integrates the KAN-SR module, in which the encoder, decoder, and self-representation layer are built on KAN, along with a temporal constraint to capture concept transitions. These transitions, akin to passing through a "wormhole", are identified by abrupt changes in the latent space. Experiments show that KAN and KAN-based models (WormKAN) effectively segment time series into meaningful concepts, enhancing the identification and tracking of concept drifts.
Authors: Hakan Aktas, Emre Ugur
Abstract: This paper proposes a novel neural network model capable of discovering high-level skill representations from unlabeled demonstration data. We also propose a bi-level planning pipeline that utilizes our model using a gradient-based planning approach. While extracting high-level representations, our model also preserves the low-level information, which can be used for low-level action planning. In the experiments, we tested the skill discovery performance of our model under different conditions, tested whether Multi-Modal LLMs can be utilized to label the learned high-level skill representations, and finally tested the high-level and low-level planning performance of our pipeline.
Authors: Osvaldo Arreche, Tanish Guntur, Mustafa Abdallah
Abstract: Explainability and evaluation of AI models are crucial parts of the security of modern intrusion detection systems (IDS) in the network security field, yet they are lacking. Accordingly, feature selection is essential for such parts in IDS because it identifies the most paramount features, enhancing attack detection and its description. In this work, we tackle the feature selection problem for IDS by suggesting new ways of applying eXplainable AI (XAI) methods for this problem. We identify the crucial attributes originated by distinct AI methods in tandem with the novel five attribute selection methods. We then compare many state-of-the-art feature selection strategies with our XAI-based feature selection methods, showing that most AI models perform better when using the XAI-based approach proposed in this work. By providing novel feature selection techniques and establishing the foundation for several XAI-based strategies, this research aids security analysts in the AI decision-making reasoning of IDS by providing them with a better grasp of critical intrusion traits. Furthermore, we make the source codes available so that the community may develop additional models on top of our foundational XAI-based feature selection framework.
Authors: Qi Liu, Wanjing Ma
Abstract: In this paper, we identify and analyze a recurring training loss pattern, which we term the \textit{Epochal Sawtooth Effect (ESE)}, commonly observed during training with adaptive gradient-based optimizers, particularly Adam optimizer. This pattern is characterized by a sharp drop in loss at the beginning of each epoch, followed by a gradual increase, resulting in a sawtooth-shaped loss curve. Through empirical observations, we demonstrate that while this effect is most pronounced with Adam, it persists, although less severely, with other optimizers such as RMSProp. We provide an in-depth explanation of the underlying mechanisms that lead to the Epochal Sawtooth Effect. The influences of factors like \(\beta\), batch size, data shuffling on this pattern have been studied. We quantify the influence of \(\beta_2\) on the shape of the loss curve, showing that higher values of \(\beta_2\) result in a nearly linear increase in loss, while lower values create a concave upward trend. Our analysis reveals that this behavior stems from the adaptive learning rate controlled by the second moment estimate, with \(\beta_1\) playing a minimal role when \(\beta_2\) is large. To support our analysis, we replicate this phenomenon through a controlled quadratic minimization task. By incrementally solving a series of quadratic optimization problems using Adam, we demonstrate that the Epochal Sawtooth Effect can emerge even in simple optimization scenarios, reinforcing the generality of this pattern. This paper provides both theoretical insights and quantitative analysis, offering a comprehensive understanding of this ubiquitous phenomenon in modern optimization techniques.
Authors: Jonathan DeCastro, Andrew Silva, Deepak Gopinath, Emily Sumner, Thomas M. Balch, Laporsha Dees, Guy Rosman
Abstract: Tight coordination is required for effective human-robot teams in domains involving fast dynamics and tactical decisions, such as multi-car racing. In such settings, robot teammates must react to cues of a human teammate's tactical objective to assist in a way that is consistent with the objective (e.g., navigating left or right around an obstacle). To address this challenge, we present Dream2Assist, a framework that combines a rich world model able to infer human objectives and value functions, and an assistive agent that provides appropriate expert assistance to a given human teammate. Our approach builds on a recurrent state space model to explicitly infer human intents, enabling the assistive agent to select actions that align with the human and enabling a fluid teaming interaction. We demonstrate our approach in a high-speed racing domain with a population of synthetic human drivers pursuing mutually exclusive objectives, such as "stay-behind" and "overtake". We show that the combined human-robot team, when blending its actions with those of the human, outperforms the synthetic humans alone as well as several baseline assistance strategies, and that intent-conditioning enables adherence to human preferences during task execution, leading to improved performance while satisfying the human's objective.
Authors: Olena Burda-Lassen
Abstract: Folktales are linguistically very rich and culturally significant in understanding the source language. Historically, only human translation has been used for translating folklore. Therefore, the number of translated texts is very sparse, which limits access to knowledge about cultural traditions and customs. We have created a new Ukrainian-To-English parallel corpus of familiar Ukrainian folktales based on available English translations and suggested several new ones. We offer a combined domain-specific approach to building and augmenting this corpus, considering the nature of the domain and differences in the purpose of human versus machine translation. Our corpus is word and sentence-aligned, allowing for the best curation of meaning, specifically tailored for use as training data for machine translation models.
Authors: Chengsong Huang, Langlin Huang, Jiaxin Huang
Abstract: In-Context Learning (ICL) emerges as a key feature for Large Language Models (LLMs), allowing them to adapt to new tasks by leveraging task-specific examples without updating model parameters. However, ICL faces challenges with increasing numbers of examples due to performance degradation and quadratic computational costs. In this paper, we propose Logit Arithmetic Reweighting Approach (LARA), a novel framework that enhances ICL by using logit-based ensembling of multiple demonstrations. Our approach divides long input demonstrations into parallelizable shorter inputs to significantly reduce memory requirements, and then effectively aggregate the information by reweighting logits of each group via a non-gradient optimization approach. We further introduce Binary LARA (B-LARA), a variant that constrains weights to binary values to simplify the search space and reduces memory usage by filtering out less informative demonstration groups. Experiments on BBH and MMLU demonstrate that LARA and B-LARA outperform all baseline methods in both accuracy and memory efficiency. We also conduct extensive analysis to show that LARA generalizes well to scenarios of varying numbers of examples from limited to many-shot demonstrations.
Authors: Sudeep Dasari, Oier Mees, Sebastian Zhao, Mohan Kumar Srirama, Sergey Levine
Abstract: In recent years roboticists have achieved remarkable progress in solving increasingly general tasks on dexterous robotic hardware by leveraging high capacity Transformer network architectures and generative diffusion models. Unfortunately, combining these two orthogonal improvements has proven surprisingly difficult, since there is no clear and well-understood process for making important design choices. In this paper, we identify, study and improve key architectural design decisions for high-capacity diffusion transformer policies. The resulting models can efficiently solve diverse tasks on multiple robot embodiments, without the excruciating pain of per-setup hyper-parameter tuning. By combining the results of our investigation with our improved model components, we are able to present a novel architecture, named \method, that significantly outperforms the state of the art in solving long-horizon ($1500+$ time-steps) dexterous tasks on a bi-manual ALOHA robot. In addition, we find that our policies show improved scaling performance when trained on 10 hours of highly multi-modal, language annotated ALOHA demonstration data. We hope this work will open the door for future robot learning techniques that leverage the efficiency of generative diffusion modeling with the scalability of large scale transformer architectures. Code, robot dataset, and videos are available at: https://dit-policy.github.io
Authors: Shengwei Ji, Yujie Tian, Fei Liu, Xinlu Li, Le Wu
Abstract: Graph Convolutional Networks (GCNs) are widely used in graph-based applications, such as social networks and recommendation systems. Nevertheless, large-scale graphs or deep aggregation layers in full-batch GCNs consume significant GPU memory, causing out of memory (OOM) errors on mainstream GPUs (e.g., 29GB memory consumption on the Ogbnproducts graph with 5 layers). The subgraph sampling methods reduce memory consumption to achieve lightweight GCNs by partitioning the graph into multiple subgraphs and sequentially training GCNs on each subgraph. However, these methods yield gaps among subgraphs, i.e., GCNs can only be trained based on subgraphs instead of global graph information, which reduces the accuracy of GCNs. In this paper, we propose PromptGCN, a novel prompt-based lightweight GCN model to bridge the gaps among subgraphs. First, the learnable prompt embeddings are designed to obtain global information. Then, the prompts are attached into each subgraph to transfer the global information among subgraphs. Extensive experimental results on seven largescale graphs demonstrate that PromptGCN exhibits superior performance compared to baselines. Notably, PromptGCN improves the accuracy of subgraph sampling methods by up to 5.48% on the Flickr dataset. Overall, PromptGCN can be easily combined with any subgraph sampling method to obtain a lightweight GCN model with higher accuracy.
Authors: Zhiyun Song, Yinjie Zhao, Xiaomin Li, Manman Fei, Xiangyu Zhao, Mengjun Liu, Cunjian Chen, Chung-Hsing Yeh, Qian Wang, Guoyan Zheng, Songtao Ai, Lichi Zhang
Abstract: High-resolution (HR) 3D magnetic resonance imaging (MRI) can provide detailed anatomical structural information, enabling precise segmentation of regions of interest for various medical image analysis tasks. Due to the high demands of acquisition device, collection of HR images with their annotations is always impractical in clinical scenarios. Consequently, segmentation results based on low-resolution (LR) images with large slice thickness are often unsatisfactory for subsequent tasks. In this paper, we propose a novel Resource-Efficient High-Resolution Segmentation framework (REHRSeg) to address the above-mentioned challenges in real-world applications, which can achieve HR segmentation while only employing the LR images as input. REHRSeg is designed to leverage self-supervised super-resolution (self-SR) to provide pseudo supervision, therefore the relatively easier-to-acquire LR annotated images generated by 2D scanning protocols can be directly used for model training. The main contribution to ensure the effectiveness in self-SR for enhancing segmentation is three-fold: (1) We mitigate the data scarcity problem in the medical field by using pseudo-data for training the segmentation model. (2) We design an uncertainty-aware super-resolution (UASR) head in self-SR to raise the awareness of segmentation uncertainty as commonly appeared on the ROI boundaries. (3) We align the spatial features for self-SR and segmentation through structural knowledge distillation to enable a better capture of region correlations. Experimental results demonstrate that REHRSeg achieves high-quality HR segmentation without intensive supervision, while also significantly improving the baseline performance for LR segmentation.
Authors: Morris Yau, Ekin Akyurek, Jiayuan Mao, Joshua B. Tenenbaum, Stefanie Jegelka, Jacob Andreas
Abstract: Previous research has explored the computational expressivity of Transformer models in simulating Boolean circuits or Turing machines. However, the learnability of these simulators from observational data has remained an open question. Our study addresses this gap by providing the first polynomial-time learnability results (specifically strong, agnostic PAC learning) for single-layer Transformers with linear attention. We show that linear attention may be viewed as a linear predictor in a suitably defined RKHS. As a consequence, the problem of learning any linear transformer may be converted into the problem of learning an ordinary linear predictor in an expanded feature space, and any such predictor may be converted back into a multiheaded linear transformer. Moving to generalization, we show how to efficiently identify training datasets for which every empirical risk minimizer is equivalent (up to trivial symmetries) to the linear Transformer that generated the data, thereby guaranteeing the learned model will correctly generalize across all inputs. Finally, we provide examples of computations expressible via linear attention and therefore polynomial-time learnable, including associative memories, finite automata, and a class of Universal Turing Machine (UTMs) with polynomially bounded computation histories. We empirically validate our theoretical findings on three tasks: learning random linear attention networks, key--value associations, and learning to execute finite automata. Our findings bridge a critical gap between theoretical expressivity and learnability of Transformers, and show that flexible and general models of computation are efficiently learnable.
Authors: Jianqiao Lu, Yingjia Wan, Yinya Huang, Jing Xiong, Zhengying Liu, Zhijiang Guo
Abstract: Autoformalization aims to convert informal mathematical proofs into machine-verifiable formats, bridging the gap between natural and formal languages. However, ensuring semantic alignment between the informal and formalized statements remains challenging. Existing approaches heavily rely on manual verification, hindering scalability. To address this, we introduce \textsc{FormalAlign}, the first automated framework designed for evaluating the alignment between natural and formal languages in autoformalization. \textsc{FormalAlign} trains on both the autoformalization sequence generation task and the representational alignment between input and output, employing a dual loss that combines a pair of mutually enhancing autoformalization and alignment tasks. Evaluated across four benchmarks augmented by our proposed misalignment strategies, \textsc{FormalAlign} demonstrates superior performance. In our experiments, \textsc{FormalAlign} outperforms GPT-4, achieving an Alignment-Selection Score 11.58\% higher on \forml-Basic (99.21\% vs. 88.91\%) and 3.19\% higher on MiniF2F-Valid (66.39\% vs. 64.34\%). This effective alignment evaluation significantly reduces the need for manual verification. Both the dataset and code can be accessed via~\url{https://github.com/rookie-joe/FormalAlign}.
Authors: Garima Agrawal, Sashank Gummuluri, Cosimo Spera
Abstract: In customer contact centers, human agents often struggle with long average handling times (AHT) due to the need to manually interpret queries and retrieve relevant knowledge base (KB) articles. While retrieval augmented generation (RAG) systems using large language models (LLMs) have been widely adopted in industry to assist with such tasks, RAG faces challenges in real-time conversations, such as inaccurate query formulation and redundant retrieval of frequently asked questions (FAQs). To address these limitations, we propose a decision support system that can look beyond RAG by first identifying customer questions in real time. If the query matches an FAQ, the system retrieves the answer directly from the FAQ database; otherwise, it generates answers via RAG. Our approach reduces reliance on manual queries, providing responses to agents within 2 seconds. Deployed in AI-powered human-agent assist solution at Minerva CQ, this system improves efficiency, reduces AHT, and lowers operational costs. We also introduce an automated LLM-agentic workflow to identify FAQs from historical transcripts when no predefined FAQs exist.
Authors: Hongyi Yuan, Suqi Liu, Kelly Cho, Katherine Liao, Alexandre Pereira, Tianxi Cai
Abstract: We introduce GENomic Encoding REpresentation with Language Model (GENEREL), a framework designed to bridge genetic and biomedical knowledge bases. What sets GENEREL apart is its ability to fine-tune language models to infuse biological knowledge behind clinical concepts such as diseases and medications. This fine-tuning enables the model to capture complex biomedical relationships more effectively, enriching the understanding of how genomic data connects to clinical outcomes. By constructing a unified embedding space for biomedical concepts and a wide range of common SNPs from sources such as patient-level data, biomedical knowledge graphs, and GWAS summaries, GENEREL aligns the embeddings of SNPs and clinical concepts through multi-task contrastive learning. This allows the model to adapt to diverse natural language representations of biomedical concepts while bypassing the limitations of traditional code mapping systems across different data sources. Our experiments demonstrate GENEREL's ability to effectively capture the nuanced relationships between SNPs and clinical concepts. GENEREL also emerges to discern the degree of relatedness, potentially allowing for a more refined identification of concepts. This pioneering approach in constructing a unified embedding system for both SNPs and biomedical concepts enhances the potential for data integration and discovery in biomedical research.
Authors: Junkang Wu, Xue Wang, Zhengyi Yang, Jiancan Wu, Jinyang Gao, Bolin Ding, Xiang Wang, Rong Jin, Xiangnan He
Abstract: Aligning large language models (LLMs) with human values and intentions is crucial for their utility, honesty, and safety. Reinforcement learning from human feedback (RLHF) is a popular approach to achieve this alignment, but it faces challenges in computational efficiency and training stability. Recent methods like Direct Preference Optimization (DPO) and Simple Preference Optimization (SimPO) have proposed offline alternatives to RLHF, simplifying the process by reparameterizing the reward function. However, DPO depends on a potentially suboptimal reference model, and SimPO's assumption of a fixed target reward margin may lead to suboptimal decisions in diverse data settings. In this work, we propose $\alpha$-DPO, an adaptive preference optimization algorithm designed to address these limitations by introducing a dynamic reward margin. Specifically, $\alpha$-DPO employs an adaptive preference distribution, balancing the policy model and the reference model to achieve personalized reward margins. We provide theoretical guarantees for $\alpha$-DPO, demonstrating its effectiveness as a surrogate optimization objective and its ability to balance alignment and diversity through KL divergence control. Empirical evaluations on AlpacaEval 2 and Arena-Hard show that $\alpha$-DPO consistently outperforms DPO and SimPO across various model settings, establishing it as a robust approach for fine-tuning LLMs. Our method achieves significant improvements in win rates, highlighting its potential as a powerful tool for LLM alignment. The code is available at https://github.com/junkangwu/alpha-DPO
Authors: Yifan Luo, Zhennan Zhou, Meitan Wang, Bin Dong
Abstract: In this paper, we investigate the safety mechanisms of instruction fine-tuned large language models (LLMs). We discover that re-weighting MLP neurons can significantly compromise a model's safety, especially for MLPs in end-of-sentence inferences. We hypothesize that LLMs evaluate the harmfulness of prompts during end-of-sentence inferences, and MLP layers plays a critical role in this process. Based on this hypothesis, we develop 2 novel white-box jailbreak methods: a prompt-specific method and a prompt-general method. The prompt-specific method targets individual prompts and optimizes the attack on the fly, while the prompt-general method is pre-trained offline and can generalize to unseen harmful prompts. Our methods demonstrate robust performance across 7 popular open-source LLMs, size ranging from 2B to 72B. Furthermore, our study provides insights into vulnerabilities of instruction-tuned LLM's safety and deepens the understanding of the internal mechanisms of LLMs.
Authors: Bo Chen, Yingyu Liang, Zhizhou Sha, Zhenmei Shi, Zhao Song
Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities across various applications, but their performance on long-context tasks is often limited by the computational complexity of attention mechanisms. This paper introduces a novel approach to accelerate attention computation in LLMs, particularly for long-context scenarios. We leverage the inherent sparsity within attention mechanisms, both in conventional Softmax attention and ReLU attention (with $\mathsf{ReLU}^\alpha$ activation, $\alpha \in \mathbb{N}_+$), to significantly reduce the running time complexity. Our method employs a Half-Space Reporting (HSR) data structure to rapidly identify non-zero or "massively activated" entries in the attention matrix. We present theoretical analyses for two key scenarios: attention generation and full attention computation with long input context. Our approach achieves a running time of $O(mn^{4/5})$ significantly faster than the naive approach $O(mn)$ for attention generation, where $n$ is the context length, $m$ is the query length, and $d$ is the hidden dimension. We can also reduce the running time of full attention computation from $O(mn)$ to $O(mn^{1 - 1 / \lfloor d/2\rfloor} + mn^{4/5})$. Importantly, our method introduces no error for ReLU attention and only provably negligible error for Softmax attention, where the latter is supported by our empirical validation. This work represents a significant step towards enabling efficient long-context processing in LLMs, potentially broadening their applicability across various domains.
Authors: Yongjin Yang, Sihyeon Kim, Hojung Jung, Sangmin Bae, SangMook Kim, Se-Young Yun, Kimin Lee
Abstract: Fine-tuning text-to-image diffusion models with human feedback is an effective method for aligning model behavior with human intentions. However, this alignment process often suffers from slow convergence due to the large size and noise present in human feedback datasets. In this work, we propose FiFA, a novel automated data filtering algorithm designed to enhance the fine-tuning of diffusion models using human feedback datasets with direct preference optimization (DPO). Specifically, our approach selects data by solving an optimization problem to maximize three components: preference margin, text quality, and text diversity. The concept of preference margin is used to identify samples that contain high informational value to address the noisy nature of feedback dataset, which is calculated using a proxy reward model. Additionally, we incorporate text quality, assessed by large language models to prevent harmful contents, and consider text diversity through a k-nearest neighbor entropy estimator to improve generalization. Finally, we integrate all these components into an optimization process, with approximating the solution by assigning importance score to each data pair and selecting the most important ones. As a result, our method efficiently filters data automatically, without the need for manual intervention, and can be applied to any large-scale dataset. Experimental results show that FiFA significantly enhances training stability and achieves better performance, being preferred by humans 17% more, while using less than 0.5% of the full data and thus 1% of the GPU hours compared to utilizing full human feedback datasets.
Authors: Peter Schafhalter, Shun Liao, Yanqi Zhou, Chih-Kuan Yeh, Arun Kandoor, James Laudon
Abstract: Domain-specific adaptation is critical to maximizing the performance of pre-trained language models (PLMs) on one or multiple targeted tasks, especially under resource-constrained use cases, such as edge devices. However, existing methods often struggle to balance domain-specific performance, retention of general knowledge, and efficiency for training and inference. To address these challenges, we propose Modular Domain Experts (MoDE). MoDE is a mixture-of-experts architecture that augments a general PLMs with modular, domain-specialized experts. These experts are trained independently and composed together via a lightweight training process. In contrast to standard low-rank adaptation methods, each MoDE expert consists of several transformer layers which scale better with more training examples and larger parameter counts. Our evaluation demonstrates that MoDE achieves comparable target performances to full parameter fine-tuning while achieving 1.65% better retention performance. Moreover, MoDE's architecture enables flexible sharding configurations and improves training speeds by up to 38% over state-of-the-art distributed training configurations.
Authors: Ying Liu, Ge Bai, Chenji Lu, Shilong Li, Zhang Zhang, Ruifang Liu, Wenbin Guo
Abstract: Despite the remarkable advancements in Visual Question Answering (VQA), the challenge of mitigating the language bias introduced by textual information remains unresolved. Previous approaches capture language bias from a coarse-grained perspective. However, the finer-grained information within a sentence, such as context and keywords, can result in different biases. Due to the ignorance of fine-grained information, most existing methods fail to sufficiently capture language bias. In this paper, we propose a novel causal intervention training scheme named CIBi to eliminate language bias from a finer-grained perspective. Specifically, we divide the language bias into context bias and keyword bias. We employ causal intervention and contrastive learning to eliminate context bias and improve the multi-modal representation. Additionally, we design a new question-only branch based on counterfactual generation to distill and eliminate keyword bias. Experimental results illustrate that CIBi is applicable to various VQA models, yielding competitive performance.
Authors: Tung Nguyen, Qiuyi Zhang, Bangding Yang, Chansoo Lee, Jorg Bornschein, Yingjie Miao, Sagi Perel, Yutian Chen, Xingyou Song
Abstract: Bayesian Optimization is ubiquitous in the field of experimental design and blackbox optimization for improving search efficiency, but has been traditionally restricted to regression models which are only applicable to fixed search spaces and tabular input features. We propose Embed-then-Regress, a paradigm for applying in-context regression over string inputs, through the use of string embedding capabilities of pretrained language models. By expressing all inputs as strings, we are able to perform general-purpose regression for Bayesian Optimization over various domains including synthetic, combinatorial, and hyperparameter optimization, obtaining comparable results to state-of-the-art Gaussian Process-based algorithms. Code can be found at github.com/google-research/optformer/embed_then_regress.
Authors: Gahyun Yoo, Jay Yoon Lee
Abstract: Reinforcement learning has shown great promise in aligning language models with human preferences in a variety of text generation tasks, including machine translation. For translation tasks, rewards can easily be obtained from quality estimation (QE) models which can generate rewards for unlabeled data. Despite its usefulness, reinforcement learning cannot exploit the gradients with respect to the QE score. We propose QE-EBM, a method of employing quality estimators as trainable loss networks that can directly backpropagate to the NMT model. We examine our method on several low and high resource target languages with English as the source language. QE-EBM outperforms strong baselines such as REINFORCE and proximal policy optimization (PPO) as well as supervised fine-tuning for all target languages, especially low-resource target languages. Most notably, for English-to-Mongolian translation, our method achieves improvements of 2.5 BLEU, 7.1 COMET-KIWI, 5.3 COMET, and 6.4 XCOMET relative to the supervised baseline.
Authors: Md Rashad Al Hasan Rony, Sudipto Kumar Shaha, Rakib Al Hasan, Sumon Kanti Dey, Amzad Hossain Rafi, Amzad Hossain Rafi, Ashraf Hasan Sirajee, Jens Lehmann
Abstract: Bengali is the seventh most spoken language on earth, yet considered a low-resource language in the field of natural language processing (NLP). Question answering over unstructured text is a challenging NLP task as it requires understanding both question and passage. Very few researchers attempted to perform question answering over Bengali (natively pronounced as Bangla) text. Typically, existing approaches construct the dataset by directly translating them from English to Bengali, which produces noisy and improper sentence structures. Furthermore, they lack topics and terminologies related to the Bengali language and people. This paper introduces BanglaQuAD, a Bengali question answering dataset, containing 30,808 question-answer pairs constructed from Bengali Wikipedia articles by native speakers. Additionally, we propose an annotation tool that facilitates question-answering dataset construction on a local machine. A qualitative analysis demonstrates the quality of our proposed dataset.
Authors: Jiawei Li, Fanrui Zhang, Jiaying Zhu, Esther Sun, Qiang Zhang, Zheng-Jun Zha
Abstract: Multimodal Large Language Models (MLLMs), such as GPT4o, have shown strong capabilities in visual reasoning and explanation generation. However, despite these strengths, they face significant challenges in the increasingly critical task of Image Forgery Detection and Localization (IFDL). Moreover, existing IFDL methods are typically limited to the learning of low-level semantic-agnostic clues and merely provide a single outcome judgment. To tackle these issues, we propose ForgeryGPT, a novel framework that advances the IFDL task by capturing high-order forensics knowledge correlations of forged images from diverse linguistic feature spaces, while enabling explainable generation and interactive dialogue through a newly customized Large Language Model (LLM) architecture. Specifically, ForgeryGPT enhances traditional LLMs by integrating the Mask-Aware Forgery Extractor, which enables the excavating of precise forgery mask information from input images and facilitating pixel-level understanding of tampering artifacts. The Mask-Aware Forgery Extractor consists of a Forgery Localization Expert (FL-Expert) and a Mask Encoder, where the FL-Expert is augmented with an Object-agnostic Forgery Prompt and a Vocabulary-enhanced Vision Encoder, allowing for effectively capturing of multi-scale fine-grained forgery details. To enhance its performance, we implement a three-stage training strategy, supported by our designed Mask-Text Alignment and IFDL Task-Specific Instruction Tuning datasets, which align vision-language modalities and improve forgery detection and instruction-following capabilities. Extensive experiments demonstrate the effectiveness of the proposed method.
Authors: Jintang Li, Ruofan Wu, Yuchang Zhu, Huizhe Zhang, Xinzhou Jin, Guibin Zhang, Zulun Zhu, Zibin Zheng, Liang Chen
Abstract: Graph autoencoders (GAEs) are self-supervised learning models that can learn meaningful representations of graph-structured data by reconstructing the input graph from a low-dimensional latent space. Over the past few years, GAEs have gained significant attention in academia and industry. In particular, the recent advent of GAEs with masked autoencoding schemes marks a significant advancement in graph self-supervised learning research. While numerous GAEs have been proposed, the underlying mechanisms of GAEs are not well understood, and a comprehensive benchmark for GAEs is still lacking. In this work, we bridge the gap between GAEs and contrastive learning by establishing conceptual and methodological connections. We revisit the GAEs studied in previous works and demonstrate how contrastive learning principles can be applied to GAEs. Motivated by these insights, we introduce lrGAE (left-right GAE), a general and powerful GAE framework that leverages contrastive learning principles to learn meaningful representations. Our proposed lrGAE not only facilitates a deeper understanding of GAEs but also sets a new benchmark for GAEs across diverse graph-based learning tasks. The source code for lrGAE, including the baselines and all the code for reproducing the results, is publicly available at https://github.com/EdisonLeeeee/lrGAE.
Authors: Chenhao Ding, Xinyuan Gao, Songlin Dong, Yuhang He, Qiang Wang, Alex Kot, Yihong Gong
Abstract: Existing prompt learning methods in Vision-Language Models (VLM) have effectively enhanced the transfer capability of VLM to downstream tasks, but they suffer from a significant decline in generalization due to severe overfitting. To address this issue, we propose a framework named LOBG for vision-language models. Specifically, we use CLIP to filter out fine-grained foreground information that might cause overfitting, thereby guiding prompts with basic visual concepts. To further mitigate overfitting, we devel oped a structural topology preservation (STP) loss at the feature level, which endows the feature space with overall plasticity, allowing effective reshaping of the feature space during optimization. Additionally, we employed hierarchical logit distilation (HLD) at the output level to constrain outputs, complementing STP at the output end. Extensive experimental results demonstrate that our method significantly improves generalization capability and alleviates overfitting compared to state-of-the-art approaches.
Authors: Jindou Jia, Zihan Yang, Meng Wang, Kexin Guo, Jianfei Yang, Xiang Yu, Lei Guo
Abstract: The well-known generalization problem hinders the application of artificial neural networks in continuous-time prediction tasks with varying latent dynamics. In sharp contrast, biological systems can neatly adapt to evolving environments benefiting from real-time feedback mechanisms. Inspired by the feedback philosophy, we present feedback neural networks, showing that a feedback loop can flexibly correct the learned latent dynamics of neural ordinary differential equations (neural ODEs), leading to a prominent generalization improvement. The feedback neural network is a novel two-DOF neural network, which possesses robust performance in unseen scenarios with no loss of accuracy performance on previous tasks. A linear feedback form is presented to correct the learned latent dynamics firstly, with a convergence guarantee. Then, domain randomization is utilized to learn a nonlinear neural feedback form. Finally, extensive tests including trajectory prediction of a real irregular object and model predictive control of a quadrotor with various uncertainties, are implemented, indicating significant improvements over state-of-the-art model-based and learning-based methods.
Authors: Michael Zhang, Simran Arora, Rahul Chalamala, Alan Wu, Benjamin Spector, Aaryan Singhal, Krithik Ramesh, Christopher R\'e
Abstract: Recent works show we can linearize large language models (LLMs) -- swapping the quadratic attentions of popular Transformer-based LLMs with subquadratic analogs, such as linear attention -- avoiding the expensive pretraining costs. However, linearizing LLMs often significantly degrades model quality, still requires training over billions of tokens, and remains limited to smaller 1.3B to 7B LLMs. We thus propose Low-rank Linear Conversion via Attention Transfer (LoLCATs), a simple two-step method that improves LLM linearizing quality with orders of magnitudes less memory and compute. We base these steps on two findings. First, we can replace an LLM's softmax attentions with closely-approximating linear attentions, simply by training the linear attentions to match their softmax counterparts with an output MSE loss ("attention transfer"). Then, this enables adjusting for approximation errors and recovering LLM quality simply with low-rank adaptation (LoRA). LoLCATs significantly improves linearizing quality, training efficiency, and scalability. We significantly reduce the linearizing quality gap and produce state-of-the-art subquadratic LLMs from Llama 3 8B and Mistral 7B v0.1, leading to 20+ points of improvement on 5-shot MMLU. Furthermore, LoLCATs does so with only 0.2% of past methods' model parameters and 0.4% of their training tokens. Finally, we apply LoLCATs to create the first linearized 70B and 405B LLMs (50x larger than prior work). When compared with prior approaches under the same compute budgets, LoLCATs significantly improves linearizing quality, closing the gap between linearized and original Llama 3.1 70B and 405B LLMs by 77.8% and 78.1% on 5-shot MMLU.
Authors: Kasper Cools, Clara Maathuis
Abstract: The integration of Autonomous Weapon Systems (AWS) into military operations presents both significant opportunities and challenges. This paper explores the multifaceted nature of trust in AWS, emphasising the necessity of establishing reliable and transparent systems to mitigate risks associated with bias, operational failures, and accountability. Despite advancements in Artificial Intelligence (AI), the trustworthiness of these systems, especially in high-stakes military applications, remains a critical issue. Through a systematic review of existing literature, this research identifies gaps in the understanding of trust dynamics during the development and deployment phases of AWS. It advocates for a collaborative approach that includes technologists, ethicists, and military strategists to address these ongoing challenges. The findings underscore the importance of Human-Machine teaming and enhancing system intelligibility to ensure accountability and adherence to International Humanitarian Law. Ultimately, this paper aims to contribute to the ongoing discourse on the ethical implications of AWS and the imperative for trustworthy AI in defense contexts.
Authors: Xiangru Zhu, Penglei Sun, Yaoxian Song, Yanghua Xiao, Zhixu Li, Chengyu Wang, Jun Huang, Bei Yang, Xiaoxiao Xu
Abstract: Accurate interpretation and visualization of human instructions are crucial for text-to-image (T2I) synthesis. However, current models struggle to capture semantic variations from word order changes, and existing evaluations, relying on indirect metrics like text-image similarity, fail to reliably assess these challenges. This often obscures poor performance on complex or uncommon linguistic patterns by the focus on frequent word combinations. To address these deficiencies, we propose a novel metric called SemVarEffect and a benchmark named SemVarBench, designed to evaluate the causality between semantic variations in inputs and outputs in T2I synthesis. Semantic variations are achieved through two types of linguistic permutations, while avoiding easily predictable literal variations. Experiments reveal that the CogView-3-Plus and Ideogram 2 performed the best, achieving a score of 0.2/1. Semantic variations in object relations are less understood than attributes, scoring 0.07/1 compared to 0.17-0.19/1. We found that cross-modal alignment in UNet or Transformers plays a crucial role in handling semantic variations, a factor previously overlooked by a focus on textual encoders. Our work establishes an effective evaluation framework that advances the T2I synthesis community's exploration of human instruction understanding.
Authors: Zhangchi Feng, Dongdong Kuang, Zhongyuan Wang, Zhijie Nie, Yaowei Zheng, Richong Zhang
Abstract: This paper presents EasyRAG, a simple, lightweight, and efficient retrieval-augmented generation framework for network automated operations. The advantages of our solution are: 1.Accurate Question Answering: We designed a straightforward RAG scheme based on (1) a specific data processing workflow (2) dual-route sparse retrieval for coarse ranking (3) LLM Reranker for reranking (4) LLM answer generation and optimization. This approach achieved first place in the GLM4 track in the preliminary round and second place in the GLM4 track in the semifinals. 2.Simple Deployment: Our method primarily consists of BM25 retrieval and BGE-reranker reranking, requiring no fine-tuning of any models, occupying minimal VRAM, easy to deploy, and highly scalable; we provide a flexible code library with various search and generation strategies, facilitating custom process implementation. 3.Efficient Inference: We designed an efficient inference acceleration scheme for the entire coarse ranking, reranking, and generation process that significantly reduces the inference latency of RAG while maintaining a good level of accuracy; each acceleration scheme can be plug-and-play into any component of the RAG process, consistently enhancing the efficiency of the RAG system. Our code and data are released at https://github.com/BUAADreamer/EasyRAG.
Authors: Daohan Su, Xunkai Li, Zhenjun Li, Yinping Liao, Rong-Hua Li, Guoren Wang
Abstract: Recently, graph neural network (GNN) has emerged as a powerful representation learning tool for graph-structured data. However, most approaches are tailored for undirected graphs, neglecting the abundant information embedded in the edges of directed graphs (digraphs). In fact, digraphs are widely applied in the real world (e.g., social networks and recommendations) and are also confirmed to offer a new perspective for addressing topological heterophily challenges (i.e., connected nodes have complex patterns of feature distribution or labels). Despite recent significant advancements in DiGNNs, existing spatial- and spectral-based methods have inherent limitations due to the complex learning mechanisms and reliance on high-quality topology, leading to low efficiency and unstable performance. To address these issues, we propose Directed Random Walk (DiRW), which can be viewed as a plug-and-play strategy or an innovative neural architecture that provides a guidance or new learning paradigm for most spatial-based methods or digraphs. Specifically, DiRW incorporates a direction-aware path sampler optimized from the perspectives of walk probability, length, and number in a weight-free manner by considering node profiles and topological structure. Building upon this, DiRW utilizes a node-wise learnable path aggregator for generalized messages obtained by our proposed adaptive walkers to represent the current node. Extensive experiments on 9 datasets demonstrate that DiRW: (1) enhances most spatial-based methods as a plug-and-play strategy; (2) achieves SOTA performance as a new digraph learning paradigm.
Authors: Yun Zhu, Haizhou Shi, Xiaotang Wang, Yongchao Liu, Yaoke Wang, Boci Peng, Chuntao Hong, Siliang Tang
Abstract: Recently, research on Text-Attributed Graphs (TAGs) has gained significant attention due to the prevalence of free-text node features in real-world applications and the advancements in Large Language Models (LLMs) that bolster TAG methodologies. However, current TAG approaches face two primary challenges: (i) Heavy reliance on label information and (ii) Limited cross-domain zero/few-shot transferability. These issues constrain the scaling of both data and model size, owing to high labor costs and scaling laws, complicating the development of graph foundation models with strong transferability. In this work, we propose the GraphCLIP framework to address these challenges by learning graph foundation models with strong cross-domain zero/few-shot transferability through a self-supervised contrastive graph-summary pretraining method. Specifically, we generate and curate large-scale graph-summary pair data with the assistance of LLMs, and introduce a novel graph-summary pretraining method, combined with invariant learning, to enhance graph foundation models with strong cross-domain zero-shot transferability. For few-shot learning, we propose a novel graph prompt tuning technique aligned with our pretraining objective to mitigate catastrophic forgetting and minimize learning costs. Extensive experiments show the superiority of GraphCLIP in both zero-shot and few-shot settings, while evaluations across various downstream tasks confirm the versatility of GraphCLIP. Our code is available at: https://github.com/ZhuYun97/GraphCLIP
Authors: Yiping Jin, Leo Wanner, Aneesh Moideen Koya
Abstract: Hate speech (HS) classifiers do not perform equally well in detecting hateful expressions towards different target identities. They also demonstrate systematic biases in predicted hatefulness scores. Tapping on two recently proposed functionality test datasets for HS detection, we quantitatively analyze the impact of different factors on HS prediction. Experiments on popular industrial and academic models demonstrate that HS detectors assign a higher hatefulness score merely based on the mention of specific target identities. Besides, models often confuse hatefulness and the polarity of emotions. This result is worrisome as the effort to build HS detectors might harm the vulnerable identity groups we wish to protect: posts expressing anger or disapproval of hate expressions might be flagged as hateful themselves. We also carry out a study inspired by social psychology theory, which reveals that the accuracy of hatefulness prediction correlates strongly with the intensity of the stereotype.
Authors: Yuntao Shou, Xiangyong Cao, Deyu Meng
Abstract: Graph Contrastive Learning (GCL) excels at managing noise and fluctuations in input data, making it popular in various fields (e.g., social networks, and knowledge graphs). Our study finds that the difference in high-frequency information between augmented graphs is greater than that in low-frequency information. However, most existing GCL methods focus mainly on the time domain (low-frequency information) for node feature representations and cannot make good use of high-frequency information to speed up model convergence. Furthermore, existing GCL paradigms optimize graph embedding representations by pulling the distance between positive sample pairs closer and pushing the distance between positive and negative sample pairs farther away, but our theoretical analysis shows that graph contrastive learning benefits from pushing negative pairs farther away rather than pulling positive pairs closer. To solve the above-mentioned problems, we propose a novel spectral GCL framework without positive samples, named SpeGCL. Specifically, to solve the problem that existing GCL methods cannot utilize high-frequency information, SpeGCL uses a Fourier transform to extract high-frequency and low-frequency information of node features, and constructs a contrastive learning mechanism in a Fourier space to obtain better node feature representation. Furthermore, SpeGCL relies entirely on negative samples to refine the graph embedding. We also provide a theoretical justification for the efficacy of using only negative samples in SpeGCL. Extensive experiments on un-supervised learning, transfer learning, and semi-supervised learning have validated the superiority of our SpeGCL framework over the state-of-the-art GCL methods.
Authors: Zehua Cheng, Di Yuan, Thomas Lukasiewicz
Abstract: The combination of semi-supervised learning (SemiSL) and contrastive learning (CL) has been successful in medical image segmentation with limited annotations. However, these works often rely on pretext tasks that lack the specificity required for pixel-level segmentation, and still face overfitting issues due to insufficient supervision signals resulting from too few annotations. Therefore, this paper proposes an affinity-graph-guided semi-supervised contrastive learning framework (Semi-AGCL) by establishing additional affinity-graph-based supervision signals between the student and teacher network, to achieve medical image segmentation with minimal annotations without pretext. The framework first designs an average-patch-entropy-driven inter-patch sampling method, which can provide a robust initial feature space without relying on pretext tasks. Furthermore, the framework designs an affinity-graph-guided loss function, which can improve the quality of the learned representation and the model generalization ability by exploiting the inherent structure of the data, thus mitigating overfitting. Our experiments indicate that with merely 10% of the complete annotation set, our model approaches the accuracy of the fully annotated baseline, manifesting a marginal deviation of only 2.52%. Under the stringent conditions where only 5% of the annotations are employed, our model exhibits a significant enhancement in performance surpassing the second best baseline by 23.09% on the dice metric and achieving an improvement of 26.57% on the notably arduous CRAG and ACDC datasets.
Authors: Argyrios Papoudakis, Mirella Lapata, Frank Keller
Abstract: Characters are at the heart of every story, driving the plot and engaging readers. In this study, we explore the understanding of characters in full-length books, which contain complex narratives and numerous interacting characters. We define two tasks: character description, which generates a brief factual profile, and character analysis, which offers an in-depth interpretation, including character development, personality, and social context. We introduce the BookWorm dataset, pairing books from the Gutenberg Project with human-written descriptions and analyses. Using this dataset, we evaluate state-of-the-art long-context models in zero-shot and fine-tuning settings, utilizing both retrieval-based and hierarchical processing for book-length inputs. Our findings show that retrieval-based approaches outperform hierarchical ones in both tasks. Additionally, fine-tuned models using coreference-based retrieval produce the most factual descriptions, as measured by fact- and entailment-based metrics. We hope our dataset, experiments, and analysis will inspire further research in character-based narrative understanding.
Authors: Cornelius V. Braun, Robert T. Lange, Marc Toussaint
Abstract: Stein Variational Gradient Descent (SVGD) is a highly efficient method to sample from an unnormalized probability distribution. However, the SVGD update relies on gradients of the log-density, which may not always be available. Existing gradient-free versions of SVGD make use of simple Monte Carlo approximations or gradients from surrogate distributions, both with limitations. To improve gradient-free Stein variational inference, we combine SVGD steps with evolution strategy (ES) updates. Our results demonstrate that the resulting algorithm generates high-quality samples from unnormalized target densities without requiring gradient information. Compared to prior gradient-free SVGD methods, we find that the integration of the ES update in SVGD significantly improves the performance on multiple challenging benchmark problems.
Authors: Kaidong Zhang, Pengzhen Ren, Bingqian Lin, Junfan Lin, Shikui Ma, Hang Xu, Xiaodan Liang
Abstract: Language-guided robotic manipulation is a challenging task that requires an embodied agent to follow abstract user instructions to accomplish various complex manipulation tasks. Previous work trivially fitting the data without revealing the relation between instruction and low-level executable actions, these models are prone to memorizing the surficial pattern of the data instead of acquiring the transferable knowledge, and thus are fragile to dynamic environment changes. To address this issue, we propose a PrIrmitive-driVen waypOinT-aware world model for Robotic manipulation (PIVOT-R) that focuses solely on the prediction of task-relevant waypoints. Specifically, PIVOT-R consists of a Waypoint-aware World Model (WAWM) and a lightweight action prediction module. The former performs primitive action parsing and primitive-driven waypoint prediction, while the latter focuses on decoding low-level actions. Additionally, we also design an asynchronous hierarchical executor (AHE), which can use different execution frequencies for different modules of the model, thereby helping the model reduce computational redundancy and improve model execution efficiency. Our PIVOT-R outperforms state-of-the-art (SoTA) open-source models on the SeaWave benchmark, achieving an average relative improvement of 19.45% across four levels of instruction tasks. Moreover, compared to the synchronously executed PIVOT-R, the execution efficiency of PIVOT-R with AHE is increased by 28-fold, with only a 2.9% drop in performance. These results provide compelling evidence that our PIVOT-R can significantly improve both the performance and efficiency of robotic manipulation.
Authors: Yu Lei, Hao Liu, Chengxing Xie, Songjia Liu, Zhiyu Yin, Canyu chen, Guohao Li, Philip Torr, Zhen Wu
Abstract: AI alignment is a pivotal issue concerning AI control and safety. It should consider not only value-neutral human preferences but also moral and ethical considerations. In this study, we introduced FairMindSim, which simulates the moral dilemma through a series of unfair scenarios. We used LLM agents to simulate human behavior, ensuring alignment across various stages. To explore the various socioeconomic motivations, which we refer to as beliefs, that drive both humans and LLM agents as bystanders to intervene in unjust situations involving others, and how these beliefs interact to influence individual behavior, we incorporated knowledge from relevant sociological fields and proposed the Belief-Reward Alignment Behavior Evolution Model (BREM) based on the recursive reward model (RRM). Our findings indicate that, behaviorally, GPT-4o exhibits a stronger sense of social justice, while humans display a richer range of emotions. Additionally, we discussed the potential impact of emotions on behavior. This study provides a theoretical foundation for applications in aligning LLMs with altruistic values.
Authors: Xuezhi Xiang, Yibo Ning, Lei Zhang, Denis Ombati, Himaloy Himu, Xiantong Zhen
Abstract: Semantic segmentation of remote sensing images is a fundamental task in geospatial research. However, widely used Convolutional Neural Networks (CNNs) and Transformers have notable drawbacks: CNNs may be limited by insufficient remote sensing modeling capability, while Transformers face challenges due to computational complexity. In this paper, we propose a remote-sensing image semantic segmentation network named LKASeg, which combines Large Kernel Attention(LSKA) and Full-Scale Skip Connections(FSC). Specifically, we propose a decoder based on Large Kernel Attention (LKA), which extract global features while avoiding the computational overhead of self-attention and providing channel adaptability. To achieve full-scale feature learning and fusion, we apply Full-Scale Skip Connections (FSC) between the encoder and decoder. We conducted experiments by combining the LKA-based decoder with FSC. On the ISPRS Vaihingen dataset, the mF1 and mIoU scores achieved 90.33% and 82.77%.
Authors: Kai Han, Jianyuan Guo, Yehui Tang, Wei He, Enhua Wu, Yunhe Wang
Abstract: Vision-language large models have achieved remarkable success in various multi-modal tasks, yet applying them to video understanding remains challenging due to the inherent complexity and computational demands of video data. While training-based video-LLMs deliver high performance, they often require substantial resources for training and inference. Conversely, training-free approaches offer a more efficient alternative by adapting pre-trained image-LLMs models for video tasks without additional training, but they face inference efficiency bottlenecks due to the large number of visual tokens generated from video frames. In this work, we present a novel prompt-guided visual perception framework (abbreviated as \emph{Free Video-LLM}) for efficient inference of training-free video LLMs. The proposed framework decouples spatial-temporal dimension and performs temporal frame sampling and spatial RoI cropping respectively based on task-specific prompts. Our method effectively reduces the number of visual tokens while maintaining high performance across multiple video question-answering benchmarks. Extensive experiments demonstrate that our approach achieves competitive results with significantly fewer tokens, offering an optimal trade-off between accuracy and computational efficiency compared to state-of-the-art video LLMs. The code will be available at \url{https://github.com/contrastive/FreeVideoLLM}.
Authors: Haoyu Tu, Lin Chen, Zuguang Li, Xiaopei Chen, Wen Wu
Abstract: In this paper,we study a vehicle selection problem for federated learning (FL) over vehicular networks. Specifically, we design a mobility-aware vehicular federated learning (MAVFL) scheme in which vehicles drive through a road segment to perform FL. Some vehicles may drive out of the segment which leads to unsuccessful training.In the proposed scheme, the real-time successful training participation ratio is utilized to implement vehicle selection. We conduct the convergence analysis to indicate the influence of vehicle mobility on training loss. Furthermore, we propose a multi-armed bandit-based vehicle selection algorithm to minimize the utility function considering training loss and delay. The simulation results show that compared with baselines, the proposed algorithm can achieve better training performance with approximately 28\% faster convergence.
Authors: Asger Horn Brorholt, Kim Guldstrand Larsen, Christian Schilling
Abstract: Deep reinforcement learning has emerged as a powerful tool for obtaining high-performance policies. However, the safety of these policies has been a long-standing issue. One promising paradigm to guarantee safety is a shield, which shields a policy from making unsafe actions. However, computing a shield scales exponentially in the number of state variables. This is a particular concern in multi-agent systems with many agents. In this work, we propose a novel approach for multi-agent shielding. We address scalability by computing individual shields for each agent. The challenge is that typical safety specifications are global properties, but the shields of individual agents only ensure local properties. Our key to overcome this challenge is to apply assume-guarantee reasoning. Specifically, we present a sound proof rule that decomposes a (global, complex) safety specification into (local, simple) obligations for the shields of the individual agents. Moreover, we show that applying the shields during reinforcement learning significantly improves the quality of the policies obtained for a given training budget. We demonstrate the effectiveness and scalability of our multi-agent shielding framework in two case studies, reducing the computation time from hours to seconds and achieving fast learning convergence.
Authors: Emmanouil Panagiotou, Manuel Heurich, Tim Landgraf, Eirini Ntoutsi
Abstract: In the field of Explainable AI (XAI), counterfactual (CF) explanations are one prominent method to interpret a black-box model by suggesting changes to the input that would alter a prediction. In real-world applications, the input is predominantly in tabular form and comprised of mixed data types and complex feature interdependencies. These unique data characteristics are difficult to model, and we empirically show that they lead to bias towards specific feature types when generating CFs. To overcome this issue, we introduce TABCF, a CF explanation method that leverages a transformer-based Variational Autoencoder (VAE) tailored for modeling tabular data. Our approach uses transformers to learn a continuous latent space and a novel Gumbel-Softmax detokenizer that enables precise categorical reconstruction while preserving end-to-end differentiability. Extensive quantitative evaluation on five financial datasets demonstrates that TABCF does not exhibit bias toward specific feature types, and outperforms existing methods in producing effective CFs that align with common CF desiderata.
Authors: Gabriel Roccabruna, Massimo Rizzoli, Giuseppe Riccardi
Abstract: The automatic detection of temporal relations among events has been mainly investigated with encoder-only models such as RoBERTa. Large Language Models (LLM) have recently shown promising performance in temporal reasoning tasks such as temporal question answering. Nevertheless, recent studies have tested the LLMs' performance in detecting temporal relations of closed-source models only, limiting the interpretability of those results. In this work, we investigate LLMs' performance and decision process in the Temporal Relation Classification task. First, we assess the performance of seven open and closed-sourced LLMs experimenting with in-context learning and lightweight fine-tuning approaches. Results show that LLMs with in-context learning significantly underperform smaller encoder-only models based on RoBERTa. Then, we delve into the possible reasons for this gap by applying explainable methods. The outcome suggests a limitation of LLMs in this task due to their autoregressive nature, which causes them to focus only on the last part of the sequence. Additionally, we evaluate the word embeddings of these two models to better understand their pre-training differences. The code and the fine-tuned models can be found respectively on GitHub.
Authors: Zhaomin Wu, Jizhou Guo, Junyi Hou, Bingsheng He, Lixin Fan, Qiang Yang
Abstract: As large language models (LLMs) become increasingly prevalent in web services, effectively leveraging domain-specific knowledge while ensuring privacy has become critical. Existing methods, such as retrieval-augmented generation (RAG) and differentially private data synthesis, often compromise either the utility of domain knowledge or the privacy of sensitive data, limiting their applicability in specialized domains. To address these challenges, we propose \textit{Llamdex}, a novel framework that integrates privacy-preserving, domain-specific models into LLMs. Our approach significantly enhances the accuracy of domain-specific tasks, achieving up to a 26\% improvement compared to existing methods under the same differential privacy constraints. Experimental results show that Llamdex not only improves the accuracy of LLM responses but also maintains comparable inference efficiency to the original LLM, highlighting its potential for real-world applications.
Authors: Jorge Garc\'ia-Torres, {\O}yvind Meinich-Bache, Anders Johannessen, Siren Rettedal, Vilde Kolstad, Kjersti Engan
Abstract: Around 5-10\% of newborns need assistance to start breathing. Currently, there is a lack of evidence-based research, objective data collection, and opportunities for learning from real newborn resuscitation emergency events. Generating and evaluating automated newborn resuscitation algorithm activity timelines relative to the Time of Birth (ToB) offers a promising opportunity to enhance newborn care practices. Given the importance of prompt resuscitation interventions within the "golden minute" after birth, having an accurate ToB with second precision is essential for effective subsequent analysis of newborn resuscitation episodes. Instead, ToB is generally registered manually, often with minute precision, making the process inefficient and susceptible to error and imprecision. In this work, we explore the fusion of Artificial Intelligence (AI) and thermal imaging to develop the first AI-driven ToB detector. The use of temperature information offers a promising alternative to detect the newborn while respecting the privacy of healthcare providers and mothers. However, the frequent inconsistencies in thermal measurements, especially in a multi-camera setup, make normalization strategies critical. Our methodology involves a three-step process: first, we propose an adaptive normalization method based on Gaussian mixture models (GMM) to mitigate issues related to temperature variations; second, we implement and deploy an AI model to detect the presence of the newborn within the thermal video frames; and third, we evaluate and post-process the model's predictions to estimate the ToB. A precision of 88.1\% and a recall of 89.3\% are reported in the detection of the newborn within thermal frames during performance evaluation. Our approach achieves an absolute median deviation of 2.7 seconds in estimating the ToB relative to the manual annotations.
Authors: Sharif Kazemi, Gloria Gerhardt, Jonty Katz, Caroline Ida Kuria, Estelle Pan, Umang Prabhakar
Abstract: The training data for LLMs embeds societal values, increasing their familiarity with the language's culture. Our analysis found that 44% of the variance in the ability of GPT-4o to reflect the societal values of a country, as measured by the World Values Survey, correlates with the availability of digital resources in that language. Notably, the error rate was more than five times higher for the languages of the lowest resource compared to the languages of the highest resource. For GPT-4-turbo, this correlation rose to 72%, suggesting efforts to improve the familiarity with the non-English language beyond the web-scraped data. Our study developed one of the largest and most robust datasets in this topic area with 21 country-language pairs, each of which contain 94 survey questions verified by native speakers. Our results highlight the link between LLM performance and digital data availability in target languages. Weaker performance in low-resource languages, especially prominent in the Global South, may worsen digital divides. We discuss strategies proposed to address this, including developing multilingual LLMs from the ground up and enhancing fine-tuning on diverse linguistic datasets, as seen in African language initiatives.
Authors: Shikun Feng, Yuyan Ni, Yan Lu, Zhi-Ming Ma, Wei-Ying Ma, Yanyan Lan
Abstract: Molecular generation and molecular property prediction are both crucial for drug discovery, but they are often developed independently. Inspired by recent studies, which demonstrate that diffusion model, a prominent generative approach, can learn meaningful data representations that enhance predictive tasks, we explore the potential for developing a unified generative model in the molecular domain that effectively addresses both molecular generation and property prediction tasks. However, the integration of these tasks is challenging due to inherent inconsistencies, making simple multi-task learning ineffective. To address this, we propose UniGEM, the first unified model to successfully integrate molecular generation and property prediction, delivering superior performance in both tasks. Our key innovation lies in a novel two-phase generative process, where predictive tasks are activated in the later stages, after the molecular scaffold is formed. We further enhance task balance through innovative training strategies. Rigorous theoretical analysis and comprehensive experiments demonstrate our significant improvements in both tasks. The principles behind UniGEM hold promise for broader applications, including natural language processing and computer vision.
Authors: Kemal Davaslioglu, Sastry Kompella, Tugba Erpek, Yalin E. Sagduyu
Abstract: Deep Reinforcement Learning (DRL) has been highly effective in learning from and adapting to RF environments and thus detecting and mitigating jamming effects to facilitate reliable wireless communications. However, traditional DRL methods are susceptible to catastrophic forgetting (namely forgetting old tasks when learning new ones), especially in dynamic wireless environments where jammer patterns change over time. This paper considers an anti-jamming system and addresses the challenge of catastrophic forgetting in DRL applied to jammer detection and mitigation. First, we demonstrate the impact of catastrophic forgetting in DRL when applied to jammer detection and mitigation tasks, where the network forgets previously learned jammer patterns while adapting to new ones. This catastrophic interference undermines the effectiveness of the system, particularly in scenarios where the environment is non-stationary. We present a method that enables the network to retain knowledge of old jammer patterns while learning to handle new ones. Our approach substantially reduces catastrophic forgetting, allowing the anti-jamming system to learn new tasks without compromising its ability to perform previously learned tasks effectively. Furthermore, we introduce a systematic methodology for sequentially learning tasks in the anti-jamming framework. By leveraging continual DRL techniques based on PackNet, we achieve superior anti-jamming performance compared to standard DRL methods. Our proposed approach not only addresses catastrophic forgetting but also enhances the adaptability and robustness of the system in dynamic jamming environments. We demonstrate the efficacy of our method in preserving knowledge of past jammer patterns, learning new tasks efficiently, and achieving superior anti-jamming performance compared to traditional DRL approaches.
Authors: Zhongchao Yi, Zhengyang Zhou, Qihe Huang, Yanjiang Chen, Liheng Yu, Xu Wang, Yang Wang
Abstract: Spatiotemporal learning has become a pivotal technique to enable urban intelligence. Traditional spatiotemporal models mostly focus on a specific task by assuming a same distribution between training and testing sets. However, given that urban systems are usually dynamic, multi-sourced with imbalanced data distributions, current specific task-specific models fail to generalize to new urban conditions and adapt to new domains without explicitly modeling interdependencies across various dimensions and types of urban data. To this end, we argue that there is an essential to propose a Continuous Multi-task Spatio-Temporal learning framework (CMuST) to empower collective urban intelligence, which reforms the urban spatiotemporal learning from single-domain to cooperatively multi-dimensional and multi-task learning. Specifically, CMuST proposes a new multi-dimensional spatiotemporal interaction network (MSTI) to allow cross-interactions between context and main observations as well as self-interactions within spatial and temporal aspects to be exposed, which is also the core for capturing task-level commonality and personalization. To ensure continuous task learning, a novel Rolling Adaptation training scheme (RoAda) is devised, which not only preserves task uniqueness by constructing data summarization-driven task prompts, but also harnesses correlated patterns among tasks by iterative model behavior modeling. We further establish a benchmark of three cities for multi-task spatiotemporal learning, and empirically demonstrate the superiority of CMuST via extensive evaluations on these datasets. The impressive improvements on both few-shot streaming data and new domain tasks against existing SOAT methods are achieved. Code is available at https://github.com/DILab-USTCSZ/CMuST.
Authors: Jan Vrba, Jakub Steinbach, Tom\'a\v{s} Jirsa, Laura Verde, Roberta De Fazio, Noriyasu Homma, Yuwen Zeng, Key Ichiji, Luk\'a\v{s} H\'ajek, Zuzana Sedl\'akov\'a, Jan Mare\v{s}
Abstract: In this study, we propose a robust set of features derived from a thorough research of contemporary practices in voice pathology detection. The feature set is based on the combination of acoustic handcrafted features. Additionally, we introduce pitch difference as a novel feature. We combine this feature set, containing data from the publicly available Saarbr\"ucken Voice Database (SVD), with preprocessing using the K-Means Synthetic Minority Over-Sampling Technique algorithm to address class imbalance. Moreover, we applied multiple ML models as binary classifiers. We utilized support vector machine, k-nearest neighbors, naive Bayes, decision tree, random forest and AdaBoost classifiers. To determine the best classification approach, we performed grid search on feasible hyperparameters of respective classifiers and subsections of features. Our approach has achieved the state-of-the-art performance, measured by unweighted average recall in voice pathology detection on SVD database. We intentionally omit accuracy as it is highly biased metric in case of unbalanced data compared to aforementioned metrics. The results are further enhanced by eliminating the potential overestimation of the results with repeated stratified cross-validation. This advancement demonstrates significant potential for the clinical deployment of ML methods, offering a valuable tool for an objective examination of voice pathologies. To support our claims, we provide a publicly available GitHub repository with DOI 10.5281/zenodo.13771573. Finally, we provide REFORMS checklist.
Authors: Shubham Kumar Nigam, Aniket Deroy, Subhankar Maity, Arnab Bhattacharya
Abstract: This study investigates judgment prediction in a realistic scenario within the context of Indian judgments, utilizing a range of transformer-based models, including InLegalBERT, BERT, and XLNet, alongside LLMs such as Llama-2 and GPT-3.5 Turbo. In this realistic scenario, we simulate how judgments are predicted at the point when a case is presented for a decision in court, using only the information available at that time, such as the facts of the case, statutes, precedents, and arguments. This approach mimics real-world conditions, where decisions must be made without the benefit of hindsight, unlike retrospective analyses often found in previous studies. For transformer models, we experiment with hierarchical transformers and the summarization of judgment facts to optimize input for these models. Our experiments with LLMs reveal that GPT-3.5 Turbo excels in realistic scenarios, demonstrating robust performance in judgment prediction. Furthermore, incorporating additional legal information, such as statutes and precedents, significantly improves the outcome of the prediction task. The LLMs also provide explanations for their predictions. To evaluate the quality of these predictions and explanations, we introduce two human evaluation metrics: Clarity and Linking. Our findings from both automatic and human evaluations indicate that, despite advancements in LLMs, they are yet to achieve expert-level performance in judgment prediction and explanation tasks.
Authors: Changqing Gong, Huafeng Qin, Moun\^im A. El-Yacoubi
Abstract: Alzheimer's Disease (AD) is a prevalent neurodegenerative condition where early detection is vital. Handwriting, often affected early in AD, offers a non-invasive and cost-effective way to capture subtle motor changes. State-of-the-art research on handwriting, mostly online, based AD detection has predominantly relied on manually extracted features, fed as input to shallow machine learning models. Some recent works have proposed deep learning (DL)-based models, either 1D-CNN or 2D-CNN architectures, with performance comparing favorably to handcrafted schemes. These approaches, however, overlook the intrinsic relationship between the 2D spatial patterns of handwriting strokes and their 1D dynamic characteristics, thus limiting their capacity to capture the multimodal nature of handwriting data. Moreover, the application of Transformer models remains basically unexplored. To address these limitations, we propose a novel approach for AD detection, consisting of a learnable multimodal hybrid attention model that integrates simultaneously 2D handwriting images with 1D dynamic handwriting signals. Our model leverages a gated mechanism to combine similarity and difference attention, blending the two modalities and learning robust features by incorporating information at different scales. Our model achieved state-of-the-art performance on the DARWIN dataset, with an F1-score of 90.32\% and accuracy of 90.91\% in Task 8 ('L' writing), surpassing the previous best by 4.61% and 6.06% respectively.
Authors: Mahsa Salmani, Nikita Trukhanov, Ilya Soloveychik
Abstract: The ever increasing sizes of Large Language Models (LLMs) beyond hundreds of billions of parameters have generated enormous pressure on the manufacturers of dedicated hardware accelerators and made the innovative design of the latter one of the most rapidly expanding fields of the AI industry. Various approaches have been explored to enable efficient and accurate processing of LLMs on the available accelerators given their computational and storage limitations. Among these, various quantization techniques have become the main focus of the community as a means of reducing the compute, communication and storage requirements. Quantization to lower precision formats naturally poses a number of challenges caused by the limited range of the available value representations. When it comes to processing the popular Transformer models on hardware, one of the main issues becomes calculation of the LayerNorm simply because accumulation of the variance requires a much wider dynamic range than the hardware enables. In this article, we address this matter and propose a computationally-efficient scaling technique that can be easily applied to Transformer models during inference. Our method suggests a straightforward way of scaling the LayerNorm inputs based on the static weights of the immediately preceding linear layers. The scaling factors are computed offline, based solely on the linear layer weights, hence no latency or computational overhead is added during inference. Most importantly, our technique ensures that no numerical issues such as overflow or underflow could happen during the compute. This approach offers smooth, accurate and resource-effective inference across a wide range of hardware architectures. The article provides theoretical justification as well as supporting numerical simulations.
Authors: Martin Aubard, L\'aszl\'o Antal, Ana Madureira, Luis F. Teixeira, Erika \'Abrah\'am
Abstract: This paper introduces ROSAR, a novel framework enhancing the robustness of deep learning object detection models tailored for side-scan sonar (SSS) images, generated by autonomous underwater vehicles using sonar sensors. By extending our prior work on knowledge distillation (KD), this framework integrates KD with adversarial retraining to address the dual challenges of model efficiency and robustness against SSS noises. We introduce three novel, publicly available SSS datasets, capturing different sonar setups and noise conditions. We propose and formalize two SSS safety properties and utilize them to generate adversarial datasets for retraining. Through a comparative analysis of projected gradient descent (PGD) and patch-based adversarial attacks, ROSAR demonstrates significant improvements in model robustness and detection accuracy under SSS-specific conditions, enhancing the model's robustness by up to 1.85%. ROSAR is available at https://github.com/remaro-network/ROSAR-framework.
Authors: Juan Sebastian Rojas, Chi-Guhn Lee
Abstract: Average-reward Markov decision processes (MDPs) provide a foundational framework for sequential decision-making under uncertainty. However, average-reward MDPs have remained largely unexplored in reinforcement learning (RL) settings, with the majority of RL-based efforts having been allocated to episodic and discounted MDPs. In this work, we study a unique structural property of average-reward MDPs and utilize it to introduce Reward-Extended Differential (or RED) reinforcement learning: a novel RL framework that can be used to effectively and efficiently solve various subtasks simultaneously in the average-reward setting. We introduce a family of RED learning algorithms for prediction and control, including proven-convergent algorithms for the tabular case. We then showcase the power of these algorithms by demonstrating how they can be used to learn a policy that optimizes, for the first time, the well-known conditional value-at-risk (CVaR) risk measure in a fully-online manner, without the use of an explicit bi-level optimization scheme or an augmented state-space.
Authors: Ayushman Gupta, Akhil Bhogal, Kripabandhu Ghosh
Abstract: Code-mixing, the practice of alternating between two or more languages in an utterance, is a common phenomenon in multilingual communities. Due to the colloquial nature of code-mixing, there is no singular correct way to translate an English sentence into a code-mixed sentence. For this reason, standard n-gram-based MT evaluation metrics such as the BLEU score are not appropriate for code-mixed evaluation. To demonstrate this, we propose a novel method for code-mixed text generation: Controlled Generation, which parameterizes the code-mixing degree (CMD) and enables the generation of multiple semantically equivalent code-mixed sentences from a given English sentence. We introduce a robust new evaluation metric: GAME: A Gold-Standard Agnostic Measure for Evaluation of Code-Mixed Sentences. GAME is both language-agnostic and gold-standard-agnostic, i.e. unlike other metrics, GAME does not require gold-standard code-mixed sentences for evaluation, thus eliminating the need for human annotators in the code-mixed evaluation process. When used to evaluate semantically equivalent code-mixed sentences, we find that GAME scores have a lower standard deviation than BLEU scores. Further, we create and release a dataset containing gold-standard code-mixed sentences across 4 language pairs: English-{Hindi, Bengali, French, Spanish} to encourage more computational research on code-mixing.
Authors: Shi Yu, Chaoyue Tang, Bokai Xu, Junbo Cui, Junhao Ran, Yukun Yan, Zhenghao Liu, Shuo Wang, Xu Han, Zhiyuan Liu, Maosong Sun
Abstract: Retrieval-augmented generation (RAG) is an effective technique that enables large language models (LLMs) to utilize external knowledge sources for generation. However, current RAG systems are solely based on text, rendering it impossible to utilize vision information like layout and images that play crucial roles in real-world multi-modality documents. In this paper, we introduce VisRAG, which tackles this issue by establishing a vision-language model (VLM)-based RAG pipeline. In this pipeline, instead of first parsing the document to obtain text, the document is directly embedded using a VLM as an image and then retrieved to enhance the generation of a VLM. Compared to traditional text-based RAG, VisRAG maximizes the retention and utilization of the data information in the original documents, eliminating the information loss introduced during the parsing process. We collect both open-source and synthetic data to train the retriever in VisRAG and explore a variety of generation methods. Experiments demonstrate that VisRAG outperforms traditional RAG in both the retrieval and generation stages, achieving a 25--39\% end-to-end performance gain over traditional text-based RAG pipeline. Further analysis reveals that VisRAG is effective in utilizing training data and demonstrates strong generalization capability, positioning it as a promising solution for RAG on multi-modality documents. Our code and data are available at https://github.com/openbmb/visrag .
Authors: Shaohao Rui, Lingzhi Chen, Zhenyu Tang, Lilong Wang, Mianxin Liu, Shaoting Zhang, Xiaosong Wang
Abstract: Accurate diagnosis of brain abnormalities is greatly enhanced by the inclusion of complementary multi-parametric MRI imaging data. There is significant potential to develop a universal pre-training model that can be quickly adapted for image modalities and various clinical scenarios. However, current models often rely on uni-modal image data, neglecting the cross-modal correlations among different image modalities or struggling to scale up pre-training in the presence of missing modality data. In this paper, we propose BrainMVP, a multi-modal vision pre-training framework for brain image analysis using multi-parametric MRI scans. First, we collect 16,022 brain MRI scans (over 2.4 million images), encompassing eight MRI modalities sourced from a diverse range of centers and devices. Then, a novel pre-training paradigm is proposed for the multi-modal MRI data, addressing the issue of missing modalities and achieving multi-modal information fusion. Cross-modal reconstruction is explored to learn distinctive brain image embeddings and efficient modality fusion capabilities. A modality-wise data distillation module is proposed to extract the essence representation of each MR image modality for both the pre-training and downstream application purposes. Furthermore, we introduce a modality-aware contrastive learning module to enhance the cross-modality association within a study. Extensive experiments on downstream tasks demonstrate superior performance compared to state-of-the-art pre-training methods in the medical domain, with Dice Score improvement of 0.28%-14.47% across six segmentation benchmarks and a consistent accuracy improvement of 0.65%-18.07% in four individual classification tasks.
Authors: Mengyu Wang, Shay B. Cohen, Tiejun Ma
Abstract: The diffusion of financial news into market prices is a complex process, making it challenging to evaluate the connections between news events and market movements. This paper introduces FININ (Financial Interconnected News Influence Network), a novel market prediction model that captures not only the links between news and prices but also the interactions among news items themselves. FININ effectively integrates multi-modal information from both market data and news articles. We conduct extensive experiments on two datasets, encompassing the S&P 500 and NASDAQ 100 indices over a 15-year period and over 2.7 million news articles. The results demonstrate FININ's effectiveness, outperforming advanced market prediction models with an improvement of 0.429 and 0.341 in the daily Sharpe ratio for the two markets respectively. Moreover, our results reveal insights into the financial news, including the delayed market pricing of news, the long memory effect of news, and the limitations of financial sentiment analysis in fully extracting predictive power from news data.
Authors: Tianhao Wu, Janice Lan, Weizhe Yuan, Jiantao Jiao, Jason Weston, Sainbayar Sukhbaatar
Abstract: LLMs are typically trained to answer user questions or follow instructions similarly to how human experts respond. However, in the standard alignment framework they lack the basic ability of explicit thinking before answering. Thinking is important for complex questions that require reasoning and planning -- but can be applied to any task. We propose a training method for equipping existing LLMs with such thinking abilities for general instruction following without use of additional human data. We achieve this by an iterative search and optimization procedure that explores the space of possible thought generations, allowing the model to learn how to think without direct supervision. For each instruction, the thought candidates are scored using a judge model to evaluate their responses only, and then optimized via preference optimization. We show that this procedure leads to superior performance on AlpacaEval and Arena-Hard, and shows gains from thinking on non-reasoning categories such as marketing, health and general knowledge, in addition to more traditional reasoning & problem-solving tasks.
Authors: Adyasha Maharana, Jaehong Yoon, Tianlong Chen, Mohit Bansal
Abstract: Visual instruction datasets from various distributors are released at different times and often contain a significant number of semantically redundant text-image pairs, depending on their task compositions (i.e., skills) or reference sources. This redundancy greatly limits the efficient deployment of lifelong adaptable multimodal large language models, hindering their ability to refine existing skills and acquire new competencies over time. To address this, we reframe the problem of Lifelong Instruction Tuning (LiIT) via data selection, where the model automatically selects beneficial samples to learn from earlier and new datasets based on the current state of acquired knowledge in the model. Based on empirical analyses that show that selecting the best data subset using a static importance measure is often ineffective for multi-task datasets with evolving distributions, we propose Adapt-$\infty$, a new multi-way and adaptive data selection approach that dynamically balances sample efficiency and effectiveness during LiIT. We construct pseudo-skill clusters by grouping gradient-based sample vectors. Next, we select the best-performing data selector for each skill cluster from a pool of selector experts, including our newly proposed scoring function, Image Grounding score. This data selector samples a subset of the most important samples from each skill cluster for training. To prevent the continuous increase in the size of the dataset pool during LiIT, which would result in excessive computation, we further introduce a cluster-wise permanent data pruning strategy to remove the most semantically redundant samples from each cluster, keeping computational requirements manageable. Training with samples selected by Adapt-$\infty$ alleviates catastrophic forgetting, especially for rare tasks, and promotes forward transfer across the continuum using only a fraction of the original datasets.
Authors: James R. Han, Hugues Thomas, Jian Zhang, Nicholas Rhinehart, Timothy D. Barfoot
Abstract: How can a robot safely navigate around people exhibiting complex motion patterns? Reinforcement Learning (RL) or Deep RL (DRL) in simulation holds some promise, although much prior work relies on simulators that fail to precisely capture the nuances of real human motion. To address this gap, we propose Deep Residual Model Predictive Control (DR-MPC), a method to enable robots to quickly and safely perform DRL from real-world crowd navigation data. By blending MPC with model-free DRL, DR-MPC overcomes the traditional DRL challenges of large data requirements and unsafe initial behavior. DR-MPC is initialized with MPC-based path tracking, and gradually learns to interact more effectively with humans. To further accelerate learning, a safety component estimates when the robot encounters out-of-distribution states and guides it away from likely collisions. In simulation, we show that DR-MPC substantially outperforms prior work, including traditional DRL and residual DRL models. Real-world experiments show our approach successfully enables a robot to navigate a variety of crowded situations with few errors using less than 4 hours of training data.
Authors: Subhankar Maity, Aniket Deroy
Abstract: Generative Artificial Intelligence (AI) is revolutionizing educational technology by enabling highly personalized and adaptive learning environments within Intelligent Tutoring Systems (ITS). This report delves into the integration of Generative AI, particularly large language models (LLMs) like GPT-4, into ITS to enhance personalized education through dynamic content generation, real-time feedback, and adaptive learning pathways. We explore key applications such as automated question generation, customized feedback mechanisms, and interactive dialogue systems that respond to individual learner needs. The report also addresses significant challenges, including ensuring pedagogical accuracy, mitigating inherent biases in AI models, and maintaining learner engagement. Future directions highlight the potential advancements in multimodal AI integration, emotional intelligence in tutoring systems, and the ethical implications of AI-driven education. By synthesizing current research and practical implementations, this report underscores the transformative potential of Generative AI in creating more effective, equitable, and engaging educational experiences.
Authors: William A. Stigall
Abstract: In this study, we investigate the performance of Deep Q-Networks utilizing Convolutional Neural Networks (CNNs) and Transformer architectures across three different Atari games. The advent of DQNs has significantly advanced Reinforcement Learning, enabling agents to directly learn optimal policies from high-dimensional sensory inputs from pixel or RAM data. While CNN-based DQNs have been extensively studied and deployed in various domains, Transformer-based DQNs are relatively unexplored. Our research aims to fill this gap by benchmarking the performance of both DCQNs and DTQNs across the Atari games Asteroids, Space Invaders, and Centipede. We find that in the 35-40 million parameter range, the DCQN outperforms the DTQN in speed across both ViT and Projection Architectures. We also find the DCQN outperforms the DTQN in all games except for Centipede.
Authors: Aivin V. Solatorio, Gabriel Stefanini Vicente, Holly Krambeck, Olivier Dupriez
Abstract: Artificial Intelligence (AI), particularly large language models (LLMs), holds the potential to bridge language and information gaps, which can benefit the economies of developing nations. However, our analysis of FLORES-200, FLORES+, Ethnologue, and World Development Indicators data reveals that these benefits largely favor English speakers. Speakers of languages in low-income and lower-middle-income countries face higher costs when using OpenAI's GPT models via APIs because of how the system processes the input -- tokenization. Around 1.5 billion people, speaking languages primarily from lower-middle-income countries, could incur costs that are 4 to 6 times higher than those faced by English speakers. Disparities in LLM performance are significant, and tokenization in models priced per token amplifies inequalities in access, cost, and utility. Moreover, using the quality of translation tasks as a proxy measure, we show that LLMs perform poorly in low-resource languages, presenting a ``double jeopardy" of higher costs and poor performance for these users. We also discuss the direct impact of fragmentation in tokenizing low-resource languages on climate. This underscores the need for fairer algorithm development to benefit all linguistic groups.
Authors: Rory Young, Nicolas Pugeault
Abstract: Deep reinforcement learning agents achieve state-of-the-art performance in a wide range of simulated control tasks. However, successful applications to real-world problems remain limited. One reason for this dichotomy is because the learned policies are not robust to observation noise or adversarial attacks. In this paper, we investigate the robustness of deep RL policies to a single small state perturbation in deterministic continuous control tasks. We demonstrate that RL policies can be deterministically chaotic as small perturbations to the system state have a large impact on subsequent state and reward trajectories. This unstable non-linear behaviour has two consequences: First, inaccuracies in sensor readings, or adversarial attacks, can cause significant performance degradation; Second, even policies that show robust performance in terms of rewards may have unpredictable behaviour in practice. These two facets of chaos in RL policies drastically restrict the application of deep RL to real-world problems. To address this issue, we propose an improvement on the successful Dreamer V3 architecture, implementing a Maximal Lyapunov Exponent regularisation. This new approach reduces the chaotic state dynamics, rendering the learnt policies more resilient to sensor noise or adversarial attacks and thereby improving the suitability of Deep Reinforcement Learning for real-world applications.
Authors: Arpan Mukherjee, Shashanka Ubaru, Keerthiram Murugesan, Karthikeyan Shanmugam, Ali Tajer
Abstract: This paper considers the problem of combinatorial multi-armed bandits with semi-bandit feedback and a cardinality constraint on the super-arm size. Existing algorithms for solving this problem typically involve two key sub-routines: (1) a parameter estimation routine that sequentially estimates a set of base-arm parameters, and (2) a super-arm selection policy for selecting a subset of base arms deemed optimal based on these parameters. State-of-the-art algorithms assume access to an exact oracle for super-arm selection with unbounded computational power. At each instance, this oracle evaluates a list of score functions, the number of which grows as low as linearly and as high as exponentially with the number of arms. This can be prohibitive in the regime of a large number of arms. This paper introduces a novel realistic alternative to the perfect oracle. This algorithm uses a combination of group-testing for selecting the super arms and quantized Thompson sampling for parameter estimation. Under a general separability assumption on the reward function, the proposed algorithm reduces the complexity of the super-arm-selection oracle to be logarithmic in the number of base arms while achieving the same regret order as the state-of-the-art algorithms that use exact oracles. This translates to at least an exponential reduction in complexity compared to the oracle-based approaches.
Authors: Mohammad Asif Ibna Mustafa (Department of Computation, Information,Technology, Technical University of Munich, Munich, Germany), Ferdinand Heinrich (Fraunhofer Institute for Electronic Microsystems,Solid State Technologies EMFT, Machine Learning Enhanced Sensor Systems, Munich, Germany)
Abstract: Time series analysis has become increasingly important in various domains, and developing effective models relies heavily on high-quality benchmark datasets. Inspired by the success of Natural Language Processing (NLP) benchmark datasets in advancing pre-trained models, we propose a new approach to create a comprehensive benchmark dataset for time series analysis. This paper explores the methodologies used in NLP benchmark dataset creation and adapts them to the unique challenges of time series data. We discuss the process of curating diverse, representative, and challenging time series datasets, highlighting the importance of domain relevance and data complexity. Additionally, we investigate multi-task learning strategies that leverage the benchmark dataset to enhance the performance of time series models. This research contributes to the broader goal of advancing the state-of-the-art in time series modeling by adopting successful strategies from the NLP domain.
Authors: Qibing Ren, Hao Li, Dongrui Liu, Zhanxu Xie, Xiaoya Lu, Yu Qiao, Lei Sha, Junchi Yan, Lizhuang Ma, Jing Shao
Abstract: This study exposes the safety vulnerabilities of Large Language Models (LLMs) in multi-turn interactions, where malicious users can obscure harmful intents across several queries. We introduce ActorAttack, a novel multi-turn attack method inspired by actor-network theory, which models a network of semantically linked actors as attack clues to generate diverse and effective attack paths toward harmful targets. ActorAttack addresses two main challenges in multi-turn attacks: (1) concealing harmful intents by creating an innocuous conversation topic about the actor, and (2) uncovering diverse attack paths towards the same harmful target by leveraging LLMs' knowledge to specify the correlated actors as various attack clues. In this way, ActorAttack outperforms existing single-turn and multi-turn attack methods across advanced aligned LLMs, even for GPT-o1. We will publish a dataset called SafeMTData, which includes multi-turn adversarial prompts and safety alignment data, generated by ActorAttack. We demonstrate that models safety-tuned using our safety dataset are more robust to multi-turn attacks. Code is available at https://github.com/renqibing/ActorAttack.
Authors: Alaa Awad, Mohamed Hegazy, Salah A. Aly
Abstract: Thousands of individuals succumb annually to leukemia alone. This study explores the application of image processing and deep learning techniques for detecting Acute Lymphoblastic Leukemia (ALL), a severe form of blood cancer responsible for numerous annual fatalities. As artificial intelligence technologies advance, the research investigates the reliability of these methods in real-world scenarios. The study focuses on recent developments in ALL detection, particularly using the latest YOLO series models, to distinguish between malignant and benign white blood cells and to identify different stages of ALL, including early stages. Additionally, the models are capable of detecting hematogones, which are often misclassified as ALL. By utilizing advanced deep learning models like YOLOv8 and YOLOv11, the study achieves high accuracy rates reaching 98.8%, demonstrating the effectiveness of these algorithms across multiple datasets and various real-world situations.
Authors: Rasoul Shafipour, David Harrison, Maxwell Horton, Jeffrey Marker, Houman Bedayat, Sachin Mehta, Mohammad Rastegari, Mahyar Najibi, Saman Naderiparizi
Abstract: Large Language Models (LLMs) have transformed natural language processing, but face significant challenges in widespread deployment due to their high runtime cost. In this paper, we introduce SeedLM, a novel post-training compression method that uses seeds of pseudo-random generators to encode and compress model weights. Specifically, for each block of weights, we find a seed that is fed into a Linear Feedback Shift Register (LFSR) during inference to efficiently generate a random matrix. This matrix is then linearly combined with compressed coefficients to reconstruct the weight block. SeedLM reduces memory access and leverages idle compute cycles during inference, effectively speeding up memory-bound tasks by trading compute for fewer memory accesses. Unlike state-of-the-art compression methods that rely on calibration data, our approach is data-free and generalizes well across diverse tasks. Our experiments with Llama 3 70B, which is particularly challenging to compress, show that SeedLM achieves significantly better zero-shot accuracy retention at 4- and 3-bit than state-of-the-art techniques, while maintaining performance comparable to FP16 baselines. Additionally, FPGA-based tests demonstrate that 4-bit SeedLM, as model size increases to 70B, approaches a 4x speed-up over an FP16 Llama 2/3 baseline.
Authors: Giorgos Iacovides, Wuyang Zhou, Danilo Mandic
Abstract: We propose a novel framework that leverages large language models (LLMs) to guide the rank selection in tensor network models for higher-order data analysis. By utilising the intrinsic reasoning capabilities and domain knowledge of LLMs, our approach offers enhanced interpretability of the rank choices and can effectively optimise the objective function. This framework enables users without specialised domain expertise to utilise tensor network decompositions and understand the underlying rationale within the rank selection process. Experimental results validate our method on financial higher-order datasets, demonstrating interpretable reasoning, strong generalisation to unseen test data, and its potential for self-enhancement over successive iterations. This work is placed at the intersection of large language models and higher-order data analysis.
Authors: Junyu Chen, Han Cai, Junsong Chen, Enze Xie, Shang Yang, Haotian Tang, Muyang Li, Yao Lu, Song Han
Abstract: We present Deep Compression Autoencoder (DC-AE), a new family of autoencoder models for accelerating high-resolution diffusion models. Existing autoencoder models have demonstrated impressive results at a moderate spatial compression ratio (e.g., 8x), but fail to maintain satisfactory reconstruction accuracy for high spatial compression ratios (e.g., 64x). We address this challenge by introducing two key techniques: (1) Residual Autoencoding, where we design our models to learn residuals based on the space-to-channel transformed features to alleviate the optimization difficulty of high spatial-compression autoencoders; (2) Decoupled High-Resolution Adaptation, an efficient decoupled three-phases training strategy for mitigating the generalization penalty of high spatial-compression autoencoders. With these designs, we improve the autoencoder's spatial compression ratio up to 128 while maintaining the reconstruction quality. Applying our DC-AE to latent diffusion models, we achieve significant speedup without accuracy drop. For example, on ImageNet 512x512, our DC-AE provides 19.1x inference speedup and 17.9x training speedup on H100 GPU for UViT-H while achieving a better FID, compared with the widely used SD-VAE-f8 autoencoder. Our code is available at https://github.com/mit-han-lab/efficientvit.
Authors: Yuqi Wang, Ke Cheng, Jiawei He, Qitai Wang, Hengchen Dai, Yuntao Chen, Fei Xia, Zhaoxiang Zhang
Abstract: Driving world models have gained increasing attention due to their ability to model complex physical dynamics. However, their superb modeling capability is yet to be fully unleashed due to the limited video diversity in current driving datasets. We introduce DrivingDojo, the first dataset tailor-made for training interactive world models with complex driving dynamics. Our dataset features video clips with a complete set of driving maneuvers, diverse multi-agent interplay, and rich open-world driving knowledge, laying a stepping stone for future world model development. We further define an action instruction following (AIF) benchmark for world models and demonstrate the superiority of the proposed dataset for generating action-controlled future predictions.
Authors: Xinli Xu, Wenhang Ge, Jiantao Lin, Jiawei Feng, Lie Xu, HanFeng Zhao, Shunsi Zhang, Ying-Cong Chen
Abstract: In this work, we introduce FlexGen, a flexible framework designed to generate controllable and consistent multi-view images, conditioned on a single-view image, or a text prompt, or both. FlexGen tackles the challenges of controllable multi-view synthesis through additional conditioning on 3D-aware text annotations. We utilize the strong reasoning capabilities of GPT-4V to generate 3D-aware text annotations. By analyzing four orthogonal views of an object arranged as tiled multi-view images, GPT-4V can produce text annotations that include 3D-aware information with spatial relationship. By integrating the control signal with proposed adaptive dual-control module, our model can generate multi-view images that correspond to the specified text. FlexGen supports multiple controllable capabilities, allowing users to modify text prompts to generate reasonable and corresponding unseen parts. Additionally, users can influence attributes such as appearance and material properties, including metallic and roughness. Extensive experiments demonstrate that our approach offers enhanced multiple controllability, marking a significant advancement over existing multi-view diffusion models. This work has substantial implications for fields requiring rapid and flexible 3D content creation, including game development, animation, and virtual reality. Project page: https://xxu068.github.io/flexgen.github.io/.
Authors: Seungwoo Han
Abstract: With the advancements in graph neural network, there has been increasing interest in applying this network to ECG signal analysis. In this study, we generated an adjacency matrix using correlation matrix of extracted features and applied a graph neural network to classify arrhythmias. The proposed model was compared with existing approaches from the literature. The results demonstrated that precision and recall for all arrhythmia classes exceeded 50%, suggesting that this method can be considered an approach for arrhythmia classification.
Authors: Youwei Yu, Junhong Xu, Lantao Liu
Abstract: Model-free reinforcement learning has emerged as a powerful method for developing robust robot control policies capable of navigating through complex and unstructured terrains. The effectiveness of these methods hinges on two essential elements: (1) the use of massively parallel physics simulations to expedite policy training, and (2) an environment generator tasked with crafting sufficiently challenging yet attainable terrains to facilitate continuous policy improvement. Existing methods of environment generation often rely on heuristics constrained by a set of parameters, limiting the diversity and realism. In this work, we introduce the Adaptive Diffusion Terrain Generator (ADTG), a novel method that leverages Denoising Diffusion Probabilistic Models to dynamically expand existing training environments by adding more diverse and complex terrains adaptive to the current policy. ADTG guides the diffusion model's generation process through initial noise optimization, blending noise-corrupted terrains from existing training environments weighted by the policy's performance in each corresponding environment. By manipulating the noise corruption level, ADTG seamlessly transitions between generating similar terrains for policy fine-tuning and novel ones to expand training diversity. Our experiments show that the policy trained by ADTG outperforms both procedural generated and natural environments, along with popular navigation methods.
Authors: Xiangming Gu, Tianyu Pang, Chao Du, Qian Liu, Fengzhuo Zhang, Cunxiao Du, Ye Wang, Min Lin
Abstract: Language Models (LMs) assign significant attention to the first token, even if it is not semantically important, which is known as attention sink. This phenomenon has been widely adopted in applications such as streaming/long context generation, KV cache optimization, inference acceleration, model quantization, and others. Despite its widespread use, a deep understanding of attention sink in LMs is still lacking. In this work, we first demonstrate that attention sinks exist universally in LMs with various inputs, even in small models. Furthermore, attention sink is observed to emerge during the LM pre-training, motivating us to investigate how optimization, data distribution, loss function, and model architecture in LM pre-training influence its emergence. We highlight that attention sink emerges after effective optimization on sufficient training data. The sink position is highly correlated with the loss function and data distribution. Most importantly, we find that attention sink acts more like key biases, storing extra attention scores, which could be non-informative and not contribute to the value computation. We also observe that this phenomenon (at least partially) stems from tokens' inner dependence on attention scores as a result of softmax normalization. After relaxing such dependence by replacing softmax attention with other attention operations, such as sigmoid attention without normalization, attention sinks do not emerge in LMs up to 1B parameters. The code is available at https://github.com/sail-sg/Attention-Sink.
Authors: Kajetan Schweighofer, Lukas Aichberger, Mykyta Ielanskyi, Sepp Hochreiter
Abstract: Reliable estimation of predictive uncertainty is crucial for machine learning applications, particularly in high-stakes scenarios where hedging against risks is essential. Despite its significance, a consensus on the correct measurement of predictive uncertainty remains elusive. In this work, we return to first principles to develop a fundamental framework of information-theoretic predictive uncertainty measures. Our proposed framework categorizes predictive uncertainty measures according to two factors: (I) The predicting model (II) The approximation of the true predictive distribution. Examining all possible combinations of these two factors, we derive a set of predictive uncertainty measures that includes both known and newly introduced ones. We empirically evaluate these measures in typical uncertainty estimation settings, such as misclassification detection, selective prediction, and out-of-distribution detection. The results show that no single measure is universal, but the effectiveness depends on the specific setting. Thus, our work provides clarity about the suitability of predictive uncertainty measures by clarifying their implicit assumptions and relationships.
Authors: Soon Yau Cheong, Duygu Ceylan, Armin Mustafa, Andrew Gilbert, Chun-Hao Paul Huang
Abstract: Recent advancements in diffusion models have significantly enhanced the quality of video generation. However, fine-grained control over camera pose remains a challenge. While U-Net-based models have shown promising results for camera control, transformer-based diffusion models (DiT)-the preferred architecture for large-scale video generation - suffer from severe degradation in camera motion accuracy. In this paper, we investigate the underlying causes of this issue and propose solutions tailored to DiT architectures. Our study reveals that camera control performance depends heavily on the choice of conditioning methods rather than camera pose representations that is commonly believed. To address the persistent motion degradation in DiT, we introduce Camera Motion Guidance (CMG), based on classifier-free guidance, which boosts camera control by over 400%. Additionally, we present a sparse camera control pipeline, significantly simplifying the process of specifying camera poses for long videos. Our method universally applies to both U-Net and DiT models, offering improved camera control for video generation tasks.
Authors: Youngjae Min, Anoopkumar Sonar, Navid Azizan
Abstract: Incorporating prior knowledge or specifications of input-output relationships into machine learning models has gained significant attention, as it enhances generalization from limited data and leads to conforming outputs. However, most existing approaches use soft constraints by penalizing violations through regularization, which offers no guarantee of constraint satisfaction -- an essential requirement in safety-critical applications. On the other hand, imposing hard constraints on neural networks may hinder their representational power, adversely affecting performance. To address this, we propose HardNet, a practical framework for constructing neural networks that inherently satisfy hard constraints without sacrificing model capacity. Specifically, we encode affine and convex hard constraints, dependent on both inputs and outputs, by appending a differentiable projection layer to the network's output. This architecture allows unconstrained optimization of the network parameters using standard algorithms while ensuring constraint satisfaction by construction. Furthermore, we show that HardNet retains the universal approximation capabilities of neural networks. We demonstrate the versatility and effectiveness of HardNet across various applications: fitting functions under constraints, learning optimization solvers, optimizing control policies in safety-critical systems, and learning safe decision logic for aircraft systems.
Authors: Haotian Tang, Yecheng Wu, Shang Yang, Enze Xie, Junsong Chen, Junyu Chen, Zhuoyang Zhang, Han Cai, Yao Lu, Song Han
Abstract: We introduce Hybrid Autoregressive Transformer (HART), an autoregressive (AR) visual generation model capable of directly generating 1024x1024 images, rivaling diffusion models in image generation quality. Existing AR models face limitations due to the poor image reconstruction quality of their discrete tokenizers and the prohibitive training costs associated with generating 1024px images. To address these challenges, we present the hybrid tokenizer, which decomposes the continuous latents from the autoencoder into two components: discrete tokens representing the big picture and continuous tokens representing the residual components that cannot be represented by the discrete tokens. The discrete component is modeled by a scalable-resolution discrete AR model, while the continuous component is learned with a lightweight residual diffusion module with only 37M parameters. Compared with the discrete-only VAR tokenizer, our hybrid approach improves reconstruction FID from 2.11 to 0.30 on MJHQ-30K, leading to a 31% generation FID improvement from 7.85 to 5.38. HART also outperforms state-of-the-art diffusion models in both FID and CLIP score, with 4.5-7.7x higher throughput and 6.9-13.4x lower MACs. Our code is open sourced at https://github.com/mit-han-lab/hart.
Authors: Honghui Yang, Di Huang, Wei Yin, Chunhua Shen, Haifeng Liu, Xiaofei He, Binbin Lin, Wanli Ouyang, Tong He
Abstract: Video depth estimation has long been hindered by the scarcity of consistent and scalable ground truth data, leading to inconsistent and unreliable results. In this paper, we introduce Depth Any Video, a model that tackles the challenge through two key innovations. First, we develop a scalable synthetic data pipeline, capturing real-time video depth data from diverse synthetic environments, yielding 40,000 video clips of 5-second duration, each with precise depth annotations. Second, we leverage the powerful priors of generative video diffusion models to handle real-world videos effectively, integrating advanced techniques such as rotary position encoding and flow matching to further enhance flexibility and efficiency. Unlike previous models, which are limited to fixed-length video sequences, our approach introduces a novel mixed-duration training strategy that handles videos of varying lengths and performs robustly across different frame rates-even on single frames. At inference, we propose a depth interpolation method that enables our model to infer high-resolution video depth across sequences of up to 150 frames. Our model outperforms all previous generative depth models in terms of spatial accuracy and temporal consistency.
Authors: Tianwei Xiong, Yuqing Wang, Daquan Zhou, Zhijie Lin, Jiashi Feng, Xihui Liu
Abstract: The efficacy of video generation models heavily depends on the quality of their training datasets. Most previous video generation models are trained on short video clips, while recently there has been increasing interest in training long video generation models directly on longer videos. However, the lack of such high-quality long videos impedes the advancement of long video generation. To promote research in long video generation, we desire a new dataset with four key features essential for training long video generation models: (1) long videos covering at least 10 seconds, (2) long-take videos without cuts, (3) large motion and diverse contents, and (4) temporally dense captions. To achieve this, we introduce a new pipeline for selecting high-quality long-take videos and generating temporally dense captions. Specifically, we define a set of metrics to quantitatively assess video quality including scene cuts, dynamic degrees, and semantic-level quality, enabling us to filter high-quality long-take videos from a large amount of source videos. Subsequently, we develop a hierarchical video captioning pipeline to annotate long videos with temporally-dense captions. With this pipeline, we curate the first long-take video dataset, LVD-2M, comprising 2 million long-take videos, each covering more than 10 seconds and annotated with temporally dense captions. We further validate the effectiveness of LVD-2M by fine-tuning video generation models to generate long videos with dynamic motions. We believe our work will significantly contribute to future research in long video generation.
Authors: Mu Cai, Reuben Tan, Jianrui Zhang, Bocheng Zou, Kai Zhang, Feng Yao, Fangrui Zhu, Jing Gu, Yiwu Zhong, Yuzhang Shang, Yao Dou, Jaden Park, Jianfeng Gao, Yong Jae Lee, Jianwei Yang
Abstract: Understanding fine-grained temporal dynamics is crucial for multimodal video comprehension and generation. Due to the lack of fine-grained temporal annotations, existing video benchmarks mostly resemble static image benchmarks and are incompetent at evaluating models for temporal understanding. In this paper, we introduce TemporalBench, a new benchmark dedicated to evaluating fine-grained temporal understanding in videos. TemporalBench consists of ~10K video question-answer pairs, derived from ~2K high-quality human annotations detailing the temporal dynamics in video clips. As a result, our benchmark provides a unique testbed for evaluating various temporal understanding and reasoning abilities such as action frequency, motion magnitude, event order, etc. Moreover, it enables evaluations on various tasks like both video question answering and captioning, both short and long video understanding, as well as different models such as multimodal video embedding models and text generation models. Results show that state-of-the-art models like GPT-4o achieve only 38.5% question answering accuracy on TemporalBench, demonstrating a significant gap (~30%) between humans and AI in temporal understanding. Furthermore, we notice a critical pitfall for multi-choice QA where LLMs can detect the subtle changes in negative captions and find a centralized description as a cue for its prediction, where we propose Multiple Binary Accuracy (MBA) to correct such bias. We hope that TemporalBench can foster research on improving models' temporal reasoning capabilities. Both dataset and evaluation code will be made available.
Authors: Utsav Kumar Nareti, Chandranath Adak, Soumi Chattopadhyay
Abstract: In the film industry, movie posters have been an essential part of advertising and marketing for many decades, and continue to play a vital role even today in the form of digital posters through online, social media and OTT (over-the-top) platforms. Typically, movie posters can effectively promote and communicate the essence of a film, such as its genre, visual style/tone, vibe and storyline cue/theme, which are essential to attract potential viewers. Identifying the genres of a movie often has significant practical applications in recommending the film to target audiences. Previous studies on genre identification have primarily focused on sources such as plot synopses, subtitles, metadata, movie scenes, and trailer videos; however, posters precede the availability of these sources, and provide pre-release implicit information to generate mass interest. In this paper, we work for automated multi-label movie genre identification only from poster images, without any aid of additional textual/metadata/video information about movies, which is one of the earliest attempts of its kind. Here, we present a deep transformer network with a probabilistic module to identify the movie genres exclusively from the poster. For experiments, we procured 13882 number of posters of 13 genres from the Internet Movie Database (IMDb), where our model performances were encouraging and even outperformed some major contemporary architectures.
Authors: Mikihisa Yuasa, Huy T. Tran, Ramavarapu S. Sreenivas
Abstract: Understanding a \textit{reinforcement learning} policy, which guides state-to-action mappings to maximize rewards, necessitates an accompanying explanation for human comprehension. In this paper, we introduce a set of \textit{linear temporal logic} formulae designed to provide explanations for policies, and an algorithm for searching through those formulae for the one that best explains a given policy. Our focus is on explanations that elucidate both the ultimate objectives accomplished by the policy and the prerequisite conditions it upholds throughout its execution. The effectiveness of our proposed approach is illustrated through a simulated game of capture-the-flag and a car-parking environment,
Authors: Haoran Cheng, Dixin Luo, Hongteng Xu
Abstract: Graph matching is one of the most significant graph analytic tasks, which aims to find the node correspondence across different graphs. Most existing graph matching approaches mainly rely on topological information, whose performances are often sub-optimal and sensitive to data noise because of not fully leveraging the multi-modal information hidden in graphs, such as node attributes, subgraph structures, etc. In this study, we propose a novel and robust graph matching method based on an unbalanced hierarchical optimal transport (UHOT) framework, which, to our knowledge, makes the first attempt to exploit cross-modal alignment in graph matching. In principle, applying multi-layer message passing, we represent each graph as layer-wise node embeddings corresponding to different modalities. Given two graphs, we align their node embeddings within the same modality and across different modalities, respectively. Then, we infer the node correspondence by the weighted average of all the alignment results. This method is implemented as computing the UHOT distance between the two graphs -- each alignment is achieved by a node-level optimal transport plan between two sets of node embeddings, and the weights of all alignment results correspond to an unbalanced modality-level optimal transport plan. Experiments on various graph matching tasks demonstrate the superiority and robustness of our method compared to state-of-the-art approaches. Our implementation is available at https://github.com/Dixin-Lab/UHOT-GM.
Authors: Timothy R. McIntosh, Teo Susnjak, Nalin Arachchilage, Tong Liu, Paul Watters, Malka N. Halgamuge
Abstract: The rapid rise in popularity of Large Language Models (LLMs) with emerging capabilities has spurred public curiosity to evaluate and compare different LLMs, leading many researchers to propose their own LLM benchmarks. Noticing preliminary inadequacies in those benchmarks, we embarked on a study to critically assess 23 state-of-the-art LLM benchmarks, using our novel unified evaluation framework through the lenses of people, process, and technology, under the pillars of benchmark functionality and integrity. Our research uncovered significant limitations, including biases, difficulties in measuring genuine reasoning, adaptability, implementation inconsistencies, prompt engineering complexity, evaluator diversity, and the overlooking of cultural and ideological norms in one comprehensive assessment. Our discussions emphasized the urgent need for standardized methodologies, regulatory certainties, and ethical guidelines in light of Artificial Intelligence (AI) advancements, including advocating for an evolution from static benchmarks to dynamic behavioral profiling to accurately capture LLMs' complex behaviors and potential risks. Our study highlighted the necessity for a paradigm shift in LLM evaluation methodologies, underlining the importance of collaborative efforts for the development of universally accepted benchmarks and the enhancement of AI systems' integration into society.
Authors: Eli Ben-Michael, D. James Greiner, Melody Huang, Kosuke Imai, Zhichao Jiang, Sooahn Shin
Abstract: The use of Artificial Intelligence (AI), or more generally data-driven algorithms, has become ubiquitous in today's society. Yet, in many cases and especially when stakes are high, humans still make final decisions. The critical question, therefore, is whether AI helps humans make better decisions compared to a human-alone or AI-alone system. We introduce a new methodological framework to empirically answer this question with a minimal set of assumptions. We measure a decision maker's ability to make correct decisions using standard classification metrics based on the baseline potential outcome. We consider a single-blinded and unconfounded treatment assignment, where the provision of AI-generated recommendations is assumed to be randomized across cases with humans making final decisions. Under this study design, we show how to compare the performance of three alternative decision-making systems--human-alone, human-with-AI, and AI-alone. Importantly, the AI-alone system includes any individualized treatment assignment, including those that are not used in the original study. We also show when AI recommendations should be provided to a human-decision maker, and when one should follow such recommendations. We apply the proposed methodology to our own randomized controlled trial evaluating a pretrial risk assessment instrument. We find that the risk assessment recommendations do not improve the classification accuracy of a judge's decision to impose cash bail. Furthermore, we find that replacing a human judge with algorithms--the risk assessment score and a large language model in particular--leads to a worse classification performance.
Authors: Bastin Tony Roy Savarimuthu, Surangika Ranathunga, Stephen Cranefield
Abstract: Software agents, both human and computational, do not exist in isolation and often need to collaborate or coordinate with others to achieve their goals. In human society, social mechanisms such as norms ensure efficient functioning, and these techniques have been adopted by researchers in multi-agent systems (MAS) to create socially aware agents. However, traditional techniques have limitations, such as operating in limited environments often using brittle symbolic reasoning. The advent of Large Language Models (LLMs) offers a promising solution, providing a rich and expressive vocabulary for norms and enabling norm-capable agents that can perform a range of tasks such as norm discovery, normative reasoning and decision-making. This paper examines the potential of LLM-based agents to acquire normative capabilities, drawing on recent Natural Language Processing (NLP) and LLM research. We present our vision for creating normative LLM agents. In particular, we discuss how the recently proposed "LLM agent" approaches can be extended to implement such normative LLM agents. We also highlight challenges in this emerging field. This paper thus aims to foster collaboration between MAS, NLP and LLM researchers in order to advance the field of normative agents.
Authors: Ryan Williams
Abstract: This paper aims to show that a simple framework, utilizing basic formalisms from set theory and category theory, can clarify and inform our theories of the relation between mind and matter.
Authors: Cl\'audio Gomes, Jo\~ao Paulo Fernandes, Gabriel Falcao, Soummya Kar, Sridhar Tayur
Abstract: The rapid adoption of Electric Vehicles (EVs) poses challenges for electricity grids to accommodate or mitigate peak demand. Vehicle-to-Vehicle Charging (V2VC) has been recently adopted by popular EVs, posing new opportunities and challenges to the management and operation of EVs. We present a novel V2VC model that allows decision-makers to take V2VC into account when optimizing their EV operations. We show that optimizing V2VC is NP-Complete and find that even small problem instances are computationally challenging. We propose R-V2VC, a heuristic that takes advantage of the resulting totally unimodular constraint matrix to efficiently solve problems of realistic sizes. Our results demonstrate that R-V2VC presents a linear growth in the solution time as the problem size increases, while achieving solutions of optimal or near-optimal quality. R-V2VC can be used for real-world operations and to study what-if scenarios when evaluating the costs and benefits of V2VC.
Authors: Javier Naranjo-Alcazar, Jordi Grau-Haro, Pedro Zuccarello, David Almenar, Jesus Lopez-Ballester
Abstract: Insect pest control poses a global challenge, affecting public health, food safety, and the environment. Diseases transmitted by mosquitoes are expanding beyond tropical regions due to climate change. Agricultural pests further exacerbate economic losses by damaging crops. The Sterile Insect Technique (SIT) emerges as an eco-friendly alternative to chemical pesticides, involving the sterilization and release of male insects to curb population growth. This work focuses on the automation of the analysis of field ovitraps used to follow-up a SIT program for the Aedes albopictus mosquito in the Valencian Community, Spain, funded by the Conselleria de Agricultura, Agua, Ganaderia y Pesca. Previous research has leveraged deep learning algorithms to automate egg counting in ovitraps, yet faced challenges such as manual handling and limited analysis capacity. Innovations in our study include classifying eggs as hatched or unhatched and reconstructing ovitraps from partial images, mitigating issues of duplicity and cut eggs. Also, our device can analyze multiple ovitraps simultaneously without the need of manual replacement. This approach significantly enhances the accuracy of egg counting and classification, providing a valuable tool for large-scale field studies. This document describes part of the work of the project Application of Industry 4.0 techniques to the production of tiger mosquitoes for the Sterile Insect Technique (MoTIA2,IMDEEA/2022/70), financed by the Valencian Institute for Business Competitiveness (IVACE) and the FEDER funds. The participation of J.Naranjo-Alcazar, J.Grau-Haro and P.Zuccarello has been possible thanks to funding from IVACE and FEDER funds. The participation of D.Almenar has been financed by the Conselleria de Agricultura, Agua, Ganaderia y Pesca of the Generalitat Valenciana and the Subdireccion de Innovacion y Desarrollo de Servicios (TRAGSA group).
Authors: Zihao Zeng, Yibo Miao, Hongcheng Gao, Hao Zhang, Zhijie Deng
Abstract: Mixture of experts (MoE) has become the standard for constructing production-level large language models (LLMs) due to its promise to boost model capacity without causing significant overheads. Nevertheless, existing MoE methods usually enforce a constant top-k routing for all tokens, which is arguably restrictive because various tokens (e.g., "
Authors: Giorgio Dolci (Department of Computer Science, University of Verona, Verona, Italy, Department of Engineering for Innovation Medicine, University of Verona, Verona, Italy, Tri-Institutional Center for Translational Research in Neuroimaging and Data Science), Charles A. Ellis (Tri-Institutional Center for Translational Research in Neuroimaging and Data Science), Federica Cruciani (Department of Engineering for Innovation Medicine, University of Verona, Verona, Italy), Lorenza Brusini (Department of Engineering for Innovation Medicine, University of Verona, Verona, Italy), Anees Abrol (Tri-Institutional Center for Translational Research in Neuroimaging and Data Science), Ilaria Boscolo Galazzo (Department of Engineering for Innovation Medicine, University of Verona, Verona, Italy), Gloria Menegaz (Department of Engineering for Innovation Medicine, University of Verona, Verona, Italy), Vince D. Calhoun (Tri-Institutional Center for Translational Research in Neuroimaging and Data Science)
Abstract: Amyloid-$\beta$ (A$\beta$) plaques in conjunction with hyperphosphorylated tau proteins in the form of neurofibrillary tangles are the two neuropathological hallmarks of Alzheimer's disease. It is well-known that the identification of individuals with A$\beta$ positivity could enable early diagnosis. In this work, we aim at capturing the A$\beta$ positivity status in an unbalanced cohort enclosing subjects at different disease stages, exploiting the underlying structural and connectivity disease-induced modulations as revealed by structural, functional, and diffusion MRI. Of note, due to the unbalanced cohort, the outcomes may be guided by those factors rather than amyloid accumulation. The partial views provided by each modality are integrated in the model allowing to take full advantage of their complementarity in encoding the effects of the A$\beta$ accumulation, leading to an accuracy of $0.762\pm0.04$. The specificity of the information brought by each modality is assessed by \textit{post-hoc} explainability analysis (guided backpropagation), highlighting the underlying structural and functional changes. Noteworthy, well-established biomarker key regions related to A$\beta$ deposition could be identified by all modalities, including the hippocampus, thalamus, precuneus, and cingulate gyrus, witnessing in favor of the reliability of the method as well as its potential in shading light on modality-specific possibly unknown A$\beta$ deposition signatures.
Authors: Aliyah R. Hsu, Georgia Zhou, Yeshwanth Cherapanamjeri, Yaxuan Huang, Anobel Y. Odisho, Peter R. Carroll, Bin Yu
Abstract: Automated mechanistic interpretation research has attracted great interest due to its potential to scale explanations of neural network internals to large models. Existing automated circuit discovery work relies on activation patching or its approximations to identify subgraphs in models for specific tasks (circuits). They often suffer from slow runtime, approximation errors, and specific requirements of metrics, such as non-zero gradients. In this work, we introduce contextual decomposition for transformers (CD-T) to build interpretable circuits in large language models. CD-T can produce circuits of arbitrary level of abstraction, and is the first able to produce circuits as fine-grained as attention heads at specific sequence positions efficiently. CD-T consists of a set of mathematical equations to isolate contribution of model features. Through recursively computing contribution of all nodes in a computational graph of a model using CD-T followed by pruning, we are able to reduce circuit discovery runtime from hours to seconds compared to state-of-the-art baselines. On three standard circuit evaluation datasets (indirect object identification, greater-than comparisons, and docstring completion), we demonstrate that CD-T outperforms ACDC and EAP by better recovering the manual circuits with an average of 97% ROC AUC under low runtimes. In addition, we provide evidence that faithfulness of CD-T circuits is not due to random chance by showing our circuits are 80% more faithful than random circuits of up to 60% of the original model size. Finally, we show CD-T circuits are able to perfectly replicate original models' behavior (faithfulness $ = 1$) using fewer nodes than the baselines for all tasks. Our results underscore the great promise of CD-T for efficient automated mechanistic interpretability, paving the way for new insights into the workings of large language models.
Authors: Jing Yu Koh, Stephen McAleer, Daniel Fried, Ruslan Salakhutdinov
Abstract: Autonomous agents powered by language models (LMs) have demonstrated promise in their ability to perform decision-making tasks such as web automation. However, a key limitation remains: LMs, primarily optimized for natural language understanding and generation, struggle with multi-step reasoning, planning, and using environmental feedback when attempting to solve realistic computer tasks. Towards addressing this, we propose an inference-time search algorithm for LM agents to explicitly perform exploration and multi-step planning in interactive web environments. Our approach is a form of best-first tree search that operates within the actual environment space, and is complementary with most existing state-of-the-art agents. It is the first tree search algorithm for LM agents that shows effectiveness on realistic web tasks. On the challenging VisualWebArena benchmark, applying our search algorithm on top of a GPT-4o agent yields a 39.7% relative increase in success rate compared to the same baseline without search, setting a state-of-the-art success rate of 26.4%. On WebArena, search also yields a 28.0% relative improvement over a baseline agent, setting a competitive success rate of 19.2%. Our experiments highlight the effectiveness of search for web agents, and we demonstrate that performance scales with increased test-time compute. We conduct a thorough analysis of our results to highlight improvements from search, limitations, and promising directions for future work. Our code and models are publicly released at https://jykoh.com/search-agents.
Authors: Haoyi Xiong, Zhiyuan Wang, Xuhong Li, Jiang Bian, Zeke Xie, Shahid Mumtaz, Anwer Al-Dulaimi, Laura E. Barnes
Abstract: This article explores the convergence of connectionist and symbolic artificial intelligence (AI), from historical debates to contemporary advancements. Traditionally considered distinct paradigms, connectionist AI focuses on neural networks, while symbolic AI emphasizes symbolic representation and logic. Recent advancements in large language models (LLMs), exemplified by ChatGPT and GPT-4, highlight the potential of connectionist architectures in handling human language as a form of symbols. The study argues that LLM-empowered Autonomous Agents (LAAs) embody this paradigm convergence. By utilizing LLMs for text-based knowledge modeling and representation, LAAs integrate neuro-symbolic AI principles, showcasing enhanced reasoning and decision-making capabilities. Comparing LAAs with Knowledge Graphs within the neuro-symbolic AI theme highlights the unique strengths of LAAs in mimicking human-like reasoning processes, scaling effectively with large datasets, and leveraging in-context samples without explicit re-training. The research underscores promising avenues in neuro-vector-symbolic integration, instructional encoding, and implicit reasoning, aimed at further enhancing LAA capabilities. By exploring the progression of neuro-symbolic AI and proposing future research trajectories, this work advances the understanding and development of AI technologies.
Authors: Junkang Wu, Yuexiang Xie, Zhengyi Yang, Jiancan Wu, Jinyang Gao, Bolin Ding, Xiang Wang, Xiangnan He
Abstract: Direct Preference Optimization (DPO) has emerged as a compelling approach for training Large Language Models (LLMs) to adhere to human preferences. However, the performance of DPO is sensitive to the fine-tuning of its trade-off parameter $\beta$, as well as to the quality of the preference data. We analyze the impact of $\beta$ and data quality on DPO, uncovering that optimal $\beta$ values vary with the informativeness of pairwise data. Addressing the limitations of static $\beta$ values, we introduce a novel framework that dynamically calibrates $\beta$ at the batch level, informed by data quality considerations. Additionally, our method incorporates $\beta$-guided data filtering to safeguard against the influence of outliers. Through empirical evaluation, we demonstrate that our dynamic $\beta$ adjustment technique significantly improves DPO's performance across a range of models and datasets, offering a more robust and adaptable training paradigm for aligning LLMs with human feedback. The code is available at \url{https://github.com/junkangwu/beta-DPO}.
Authors: Meng Lu
Abstract: Understanding intelligence is a central pursuit in neuroscience, cognitive science, and artificial intelligence. Intelligence encompasses learning, problem-solving, creativity, and even consciousness. Recent advancements in geometric analysis have revealed new insights into high-dimensional information representation and organisation, exposing intrinsic data structures and dynamic processes within neural and artificial systems. However, a comprehensive framework that unifies the static and dynamic aspects of intelligence is still lacking. This manuscript proposes a mathematical framework based on Riemannian geometry to describe the structure and dynamics of intelligence and consciousness. Intelligence elements are conceptualised as tokens embedded in a high-dimensional space. The learned token embeddings capture the interconnections of tokens across various scenarios and tasks, forming manifolds in the intelligence space. Thought flow is depicted as the sequential activation of tokens along geodesics within these manifolds. During the navigation of geodesics, consciousness, as a self-referential process, perceives the thought flow, evaluates it against predictions, and provides feedback through prediction errors, adjusting the geodesic: non-zero prediction errors, such as learning, lead to the restructuring of the curved manifolds, thus changing the geodesic of thought flow. This dynamic interaction integrates new information, evolves the geometry and facilitates learning. The geometry of intelligence guides consciousness, and consciousness structures the geometry of intelligence. By integrating geometric concepts, this proposed theory offers a unified, mathematically framework for describing the structure and dynamics of intelligence and consciousness. Applicable to biological and artificial intelligence, this framework may pave the way for future research and empirical validation.
Authors: Adam Wojciechowski, Mateusz Lango, Ondrej Dusek
Abstract: Existing explanation methods for image classification struggle to provide faithful and plausible explanations. This paper addresses this issue by proposing a post-hoc natural language explanation method that can be applied to any CNN-based classifier without altering its training process or affecting predictive performance. By analysing influential neurons and the corresponding activation maps, the method generates a faithful description of the classifier's decision process in the form of a structured meaning representation, which is then converted into text by a language model. Through this pipeline approach, the generated explanations are grounded in the neural network architecture, providing accurate insight into the classification process while remaining accessible to non-experts. Experimental results show that the NLEs constructed by our method are significantly more plausible and faithful. In particular, user interventions in the neural network structure (masking of neurons) are three times more effective than the baselines.
Authors: Ming Zhang, Caishuang Huang, Yilong Wu, Shichun Liu, Huiyuan Zheng, Yurui Dong, Yujiong Shen, Shihan Dou, Jun Zhao, Junjie Ye, Qi Zhang, Tao Gui, Xuanjing Huang
Abstract: Task-oriented dialogue (TOD) systems aim to efficiently handle task-oriented conversations, including information collection. How to utilize TOD accurately, efficiently and effectively for information collection has always been a critical and challenging task. Recent studies have demonstrated that Large Language Models (LLMs) excel in dialogue, instruction generation, and reasoning, and can significantly enhance the performance of TOD through fine-tuning. However, current datasets primarily cater to user-led systems and are limited to predefined specific scenarios and slots, thereby necessitating improvements in the proactiveness, diversity, and capabilities of TOD. In this study, we present a detailed multi-domain task-oriented data construction process for conversations, and a Chinese dialogue dataset generated based on this process, TransferTOD, which authentically simulates human-computer dialogues in 30 popular life service scenarios. Leveraging this dataset, we trained a model called TransferTOD-7B using full-parameter fine-tuning, showcasing notable abilities in slot filling and questioning. Our work has demonstrated its strong generalization capabilities in various downstream scenarios, significantly enhancing both data utilization efficiency and system performance. The data is released in https://github.com/KongLongGeFDU/TransferTOD.
Authors: Yangzhen Wu, Zhiqing Sun, Shanda Li, Sean Welleck, Yiming Yang
Abstract: While the scaling laws of large language models (LLMs) training have been extensively studied, optimal inference configurations of LLMs remain underexplored. We study inference scaling laws and compute-optimal inference, focusing on the trade-offs between model sizes and generating additional tokens with different inference strategies. As a first step towards understanding and designing compute-optimal inference methods, we studied cost-performance trade-offs for inference strategies such as greedy search, majority voting, best-of-$n$, weighted voting, and two different tree search algorithms, using different model sizes and compute budgets. Our findings indicate smaller models (e.g., Llemma-7B) can outperform larger models given the same computation budgets, and that smaller models paired with advanced inference algorithms yield Pareto-optimal cost-performance trade-offs. For instance, the Llemma-7B model, equipped with our novel tree search algorithm, consistently outperforms Llemma-34B with standard majority voting on the MATH benchmark across all FLOPs budgets. We hope these findings contribute to a broader understanding of inference scaling laws for LLMs.
Authors: Zihao Zhu, Bingzhe Wu, Zhengyou Zhang, Baoyuan Wu
Abstract: The integration of large language models (LLMs) into robotics significantly enhances the capabilities of embodied agents in understanding and executing complex natural language instructions. However, the unmitigated deployment of LLM-based embodied systems in real-world environments may pose potential physical risks, such as property damage and personal injury. Existing security benchmarks for LLMs overlook risk awareness for LLM-based embodied agents. To address this gap, we propose RiskAwareBench, an automated framework designed to assess physical risks awareness in LLM-based embodied agents. RiskAwareBench consists of four modules: safety tips generation, risky scene generation, plan generation, and evaluation, enabling comprehensive risk assessment with minimal manual intervention. Utilizing this framework, we compile the PhysicalRisk dataset, encompassing diverse scenarios with associated safety tips, observations, and instructions. Extensive experiments reveal that most LLMs exhibit insufficient physical risk awareness, and baseline risk mitigation strategies yield limited enhancement, which emphasizes the urgency and cruciality of improving risk awareness in LLM-based embodied agents in the future.
Authors: Jonathan Light, Min Cai, Weiqin Chen, Guanzhi Wang, Xiusi Chen, Wei Cheng, Yisong Yue, Ziniu Hu
Abstract: In this paper, we propose a new method STRATEGIST that utilizes LLMs to acquire new skills for playing multi-agent games through a self-improvement process. Our method gathers quality feedback through self-play simulations with Monte Carlo tree search and LLM-based reflection, which can then be used to learn high-level strategic skills such as how to evaluate states that guide the low-level execution. We showcase how our method can be used in both action planning and dialogue generation in the context of games, achieving good performance on both tasks. Specifically, we demonstrate that our method can help train agents with better performance than both traditional reinforcement learning-based approaches and other LLM-based skill learning approaches in games including the Game of Pure Strategy (GOPS) and The Resistance: Avalon. STRATEGIST helps bridge the gap between foundation models and symbolic decision-making methods through its bi-level approach, leading to more robust decision-making.
Authors: Jiayi Gui, Yiming Liu, Jiale Cheng, Xiaotao Gu, Xiao Liu, Hongning Wang, Yuxiao Dong, Jie Tang, Minlie Huang
Abstract: Large Language Models (LLMs) have demonstrated notable capabilities across various tasks, showcasing complex problem-solving abilities. Understanding and executing complex rules, along with multi-step planning, are fundamental to logical reasoning and critical for practical LLM agents and decision-making systems. However, evaluating LLMs as effective rule-based executors and planners remains underexplored. In this paper, we introduce LogicGame, a novel benchmark designed to evaluate the comprehensive rule understanding, execution, and planning capabilities of LLMs. Unlike traditional benchmarks, LogicGame provides diverse games that contain a series of rules with an initial state, requiring models to comprehend and apply predefined regulations to solve problems. We create simulated scenarios in which models execute or plan operations to achieve specific outcomes. These game scenarios are specifically designed to distinguish logical reasoning from mere knowledge by relying exclusively on predefined rules. This separation allows for a pure assessment of rule-based reasoning capabilities. The evaluation considers not only final outcomes but also intermediate steps, providing a comprehensive assessment of model performance. Moreover, these intermediate steps are deterministic and can be automatically verified. LogicGame defines game scenarios with varying difficulty levels, from simple rule applications to complex reasoning chains, in order to offer a precise evaluation of model performance on rule understanding and multi-step execution. Utilizing LogicGame, we test various LLMs and identify notable shortcomings in their rule-based logical reasoning abilities.
Authors: Joshua Pickard, Marc Andrew Choi, Natalie Oliven, Cooper Stansbury, Jillian Cwycyshyn, Nicholas Galioto, Alex Gorodetsky, Alvaro Velasquez, Indika Rajapakse
Abstract: Recent advancements in Large Language Models (LLMs) are transforming biology, computer science, and many other research fields, as well as impacting everyday life. While transformer-based technologies are currently being deployed in biology, no available agentic system has been developed to tackle bioinformatics workflows. We present a prototype Bioinformatics Retrieval Augmented Data (BRAD) digital assistant. BRAD is a chatbot and agentic system that integrates a suite of tools to handle bioinformatics tasks, from code execution to online search. We demonstrate its capabilities through (1) improved question-and-answering with retrieval augmented generation (RAG), (2) the ability to run complex software pipelines, and (3) the ability to organize and distribute tasks in agentic workflows. We use BRAD for automation, performing tasks ranging from gene enrichment and searching the archive to automatic code generation for running biomarker identification pipelines. BRAD is a step toward autonomous, self-driving labs for digital biology.
Authors: Axel H{\o}jmark, Govind Pimpale, Arjun Panickssery, Marius Hobbhahn, J\'er\'emy Scheurer
Abstract: To mitigate risks from AI systems, we need to assess their capabilities accurately. This is especially difficult in cases where capabilities are only rarely displayed. Phuong et al. propose two methods that aim to obtain better estimates of the probability of an AI agent successfully completing a given task. The milestone method decomposes tasks into subtasks, aiming to improve overall success rate estimation, while the expert best-of-N method leverages human guidance as a proxy for the model's independent performance. Our analysis of these methods as Monte Carlo estimators reveals that while both effectively reduce variance compared to naive Monte Carlo sampling, they also introduce bias. Experimental results demonstrate that the milestone method underestimates true solve rates for many real-world tasks due to its constraining assumptions. The expert best-of-N method exhibits even more severe underestimation across all tasks, attributed to an inherently flawed re-weighting factor. To enhance the accuracy of capability estimates of AI agents on difficult tasks, we suggest future work should leverage the rich literature on Monte Carlo Estimators.
Authors: Ruijie Xu, Zhihan Liu, Yongfei Liu, Shipeng Yan, Zhaoran Wang, Zhi Zhang, Xuming He
Abstract: We address the challenge of online Reinforcement Learning from Human Feedback (RLHF) with a focus on self-rewarding alignment methods. In online RLHF, obtaining feedback requires interaction with the environment, which can be costly when using additional reward models or the GPT-4 API. Current self-rewarding approaches rely heavily on the discriminator's judgment capabilities, which are effective for large-scale models but challenging to transfer to smaller ones. To address these limitations, we propose a novel, only-prompting self-rewarding online algorithm that generates preference datasets without relying on judgment capabilities. Additionally, we employ fine-grained arithmetic control over the optimality gap between positive and negative examples, generating more hard negatives in the later stages of training to help the model better capture subtle human preferences. Finally, we conduct extensive experiments on two base models, Mistral-7B and Mistral-Instruct-7B, which significantly bootstrap the performance of the reference model, achieving 34.5% in the Length-controlled Win Rates of AlpacaEval 2.0.
Authors: Kevin Wang, Junbo Li, Neel P. Bhatt, Yihan Xi, Qiang Liu, Ufuk Topcu, Zhangyang Wang
Abstract: Recent advancements in Large Language Models (LLMs) have showcased their ability to perform complex reasoning tasks, but their effectiveness in planning remains underexplored. In this study, we evaluate the planning capabilities of OpenAI's o1 models across a variety of benchmark tasks, focusing on three key aspects: feasibility, optimality, and generalizability. Through empirical evaluations on constraint-heavy tasks (e.g., $\textit{Barman}$, $\textit{Tyreworld}$) and spatially complex environments (e.g., $\textit{Termes}$, $\textit{Floortile}$), we highlight o1-preview's strengths in self-evaluation and constraint-following, while also identifying bottlenecks in decision-making and memory management, particularly in tasks requiring robust spatial reasoning. Our results reveal that o1-preview outperforms GPT-4 in adhering to task constraints and managing state transitions in structured environments. However, the model often generates suboptimal solutions with redundant actions and struggles to generalize effectively in spatially complex tasks. This pilot study provides foundational insights into the planning limitations of LLMs, offering key directions for future research on improving memory management, decision-making, and generalization in LLM-based planning. Code available at https://github.com/VITA-Group/o1-planning.
Authors: Zeyu Gan, Yong Liu
Abstract: Synthetic data has become a pivotal resource in post-training tasks for large language models (LLMs) due to the scarcity of high-quality, specific data. While various methods have been developed to generate synthetic data, there remains a discernible gap between the practical effects of synthetic data and our theoretical comprehension. To address this challenge, we commence by presenting a detailed modeling of the prevalent synthetic data generation process. Building upon this modeling, we demonstrate that the generalization capability of the post-trained model is critically determined by the information gain derived from the generative model, as analyzed from a novel reverse-bottleneck perspective. Moreover, we introduce the concept of Generalization Gain via Mutual Information (GGMI) and elucidate the relationship between generalization gain and information gain. This analysis serves as a theoretical foundation for synthetic data generation and further highlights its connection with the generalization capability of post-trained models, offering an understanding about the design of synthetic data generation techniques and the optimization of the post-training process. We open source our code at https://github.com/ZyGan1999/Towards-a-Theoretical-Understanding-of-Synthetic-Data-in-LLM-Post-Training.
Authors: Amogh Mannekote, Adam Davies, Jina Kang, Kristy Elizabeth Boyer
Abstract: Simulating learner actions helps stress-test open-ended interactive learning environments and prototype new adaptations before deployment. While recent studies show the promise of using large language models (LLMs) for simulating human behavior, such approaches have not gone beyond rudimentary proof-of-concept stages due to key limitations. First, LLMs are highly sensitive to minor prompt variations, raising doubts about their ability to generalize to new scenarios without extensive prompt engineering. Moreover, apparently successful outcomes can often be unreliable, either because domain experts unintentionally guide LLMs to produce expected results, leading to self-fulfilling prophecies; or because the LLM has encountered highly similar scenarios in its training data, meaning that models may not be simulating behavior so much as regurgitating memorized content. To address these challenges, we propose Hyp-Mix, a simulation authoring framework that allows experts to develop and evaluate simulations by combining testable hypotheses about learner behavior. Testing this framework in a physics learning environment, we found that GPT-4 Turbo maintains calibrated behavior even as the underlying learner model changes, providing the first evidence that LLMs can be used to simulate realistic behaviors in open-ended interactive learning environments, a necessary prerequisite for useful LLM behavioral simulation.
Authors: Kiran Busch, Henrik Leopold
Abstract: An increasing number of organizations are deploying Large Language Models (LLMs) for a wide range of tasks. Despite their general utility, LLMs are prone to errors, ranging from inaccuracies to hallucinations. To objectively assess the capabilities of existing LLMs, performance benchmarks are conducted. However, these benchmarks often do not translate to more specific real-world tasks. This paper addresses the gap in benchmarking LLM performance in the Business Process Management (BPM) domain. Currently, no BPM-specific benchmarks exist, creating uncertainty about the suitability of different LLMs for BPM tasks. This paper systematically compares LLM performance on four BPM tasks focusing on small open-source models. The analysis aims to identify task-specific performance variations, compare the effectiveness of open-source versus commercial models, and assess the impact of model size on BPM task performance. This paper provides insights into the practical applications of LLMs in BPM, guiding organizations in selecting appropriate models for their specific needs.
Authors: Lauren Nicole DeLong, Yojana Gadiya, Paola Galdi, Jacques D. Fleuriot, Daniel Domingo-Fern\'andez
Abstract: Neurosymbolic (NeSy) artificial intelligence describes the combination of logic or rule-based techniques with neural networks. Compared to neural approaches, NeSy methods often possess enhanced interpretability, which is particularly promising for biomedical applications like drug discovery. However, since interpretability is broadly defined, there are no clear guidelines for assessing the biological plausibility of model interpretations. To assess interpretability in the context of drug discovery, we devise a novel prediction task, called drug mechanism-of-action (MoA) deconvolution, with an associated, tailored knowledge graph (KG), MoA-net. We then develop the MoA Retrieval System (MARS), a NeSy approach for drug discovery which leverages logical rules with learned rule weights. Using this interpretable feature alongside domain knowledge, we find that MARS and other NeSy approaches on KGs are susceptible to reasoning shortcuts, in which the prediction of true labels is driven by "degree-bias" rather than the domain-based rules. Subsequently, we demonstrate ways to identify and mitigate this. Thereafter, MARS achieves performance on par with current state-of-the-art models while producing model interpretations aligned with known MoAs.
Authors: Jeongeun Park, Sungjoon Choi, Sangdoo Yun
Abstract: Recent advancements in large language models (LLMs) have greatly enhanced their ability to generate natural and contextually relevant text, making AI interactions more human-like. However, generating and understanding interactive human-like motion, where two individuals engage in coordinated movements, remains a challenge due to the complexity of modeling these coordinated interactions. Furthermore, a versatile model is required to handle diverse interactive scenarios, such as chat systems that follow user instructions or adapt to their assigned role while adjusting interaction dynamics. To tackle this problem, we introduce VIM, short for the Versatile Interactive Motion language model, which integrates both language and motion modalities to effectively understand, generate, and control interactive motions in multi-turn conversational contexts. To address the scarcity of multi-turn interactive motion data, we introduce a synthetic dataset, INERT-MT2, where we utilize pre-trained models to create diverse instructional datasets with interactive motion. Our approach first trains a motion tokenizer that encodes interactive motions into residual discrete tokens. In the pretraining stage, the model learns to align motion and text representations with these discrete tokens. During the instruction fine-tuning stage, VIM adapts to multi-turn conversations using the INTER-MT2 dataset. We evaluate the versatility of our method across motion-related tasks, motion to text, text to motion, reaction generation, motion editing, and reasoning about motion sequences. The results highlight the versatility and effectiveness of proposed method in handling complex interactive motion synthesis.
Authors: Simon Kasif
Abstract: AI Safety has become a vital front-line concern of many scientists within and outside the AI community. There are many immediate and long term anticipated risks that range from existential risk to human existence to deep fakes and bias in machine learning systems [1-5]. In this paper, we reduce the full scope and immense complexity of AI safety concerns to a trilogy of three important but tractable opportunities for advances that have the short-term potential to improve AI safety and reliability without reducing AI innovation in critical domains. In this perspective, we discuss this vision based on several case studies that already produced proofs of concept in critical ML applications in biomedical science.
Authors: Mingde Zhao, Tristan Sylvain, Doina Precup, Yoshua Bengio
Abstract: We are interested in target-directed agents, which produce targets during decision-time planning, to guide their behaviors and achieve better generalization during evaluation. Improper training of these agents can result in delusions: the agent may come to hold false beliefs about the targets, which cannot be properly rejected, leading to unwanted behaviors and damaging out-of-distribution generalization. We identify different types of delusions by using intuitive examples in carefully controlled environments, and investigate their causes. We demonstrate how delusions can be addressed for agents trained by hindsight relabeling, a mainstream approach in for training target-directed RL agents. We validate empirically the effectiveness of the proposed solutions in correcting delusional behaviors and improving out-of-distribution generalization.
Authors: Siyu Zhou, Tianyi Zhou, Yijun Yang, Guodong Long, Deheng Ye, Jing Jiang, Chengqi Zhang
Abstract: Can large language models (LLMs) directly serve as powerful world models for model-based agents? While the gaps between the prior knowledge of LLMs and the specified environment's dynamics do exist, our study reveals that the gaps can be bridged by aligning an LLM with its deployed environment and such "world alignment" can be efficiently achieved by rule learning on LLMs. Given the rich prior knowledge of LLMs, only a few additional rules suffice to align LLM predictions with the specified environment dynamics. To this end, we propose a neurosymbolic approach to learn these rules gradient-free through LLMs, by inducing, updating, and pruning rules based on comparisons of agent-explored trajectories and world model predictions. The resulting world model is composed of the LLM and the learned rules. Our embodied LLM agent "WALL-E" is built upon model-predictive control (MPC). By optimizing look-ahead actions based on the precise world model, MPC significantly improves exploration and learning efficiency. Compared to existing LLM agents, WALL-E's reasoning only requires a few principal rules rather than verbose buffered trajectories being included in the LLM input. On open-world challenges in Minecraft and ALFWorld, WALL-E achieves higher success rates than existing methods, with lower costs on replanning time and the number of tokens used for reasoning. In Minecraft, WALL-E exceeds baselines by 15-30% in success rate while costing 8-20 fewer replanning rounds and only 60-80% of tokens. In ALFWorld, its success rate surges to a new record high of 95% only after 6 iterations.
Authors: James Queeney, Ioannis Ch. Paschalidis, Christos G. Cassandras
Abstract: We develop a new class of model-free deep reinforcement learning algorithms for data-driven, learning-based control. Our Generalized Policy Improvement algorithms combine the policy improvement guarantees of on-policy methods with the efficiency of sample reuse, addressing a trade-off between two important deployment requirements for real-world control: (i) practical performance guarantees and (ii) data efficiency. We demonstrate the benefits of this new class of algorithms through extensive experimental analysis on a broad range of simulated control tasks.
Authors: Xiaoting Wang, Yanxiang Zhang
Abstract: Spiking neural networks (SNNs) present a promising energy efficient alternative to traditional Artificial Neural Networks (ANNs) due to their multiplication-free operations enabled by binarized intermediate activations. However, this binarization leads to precision loss, hindering the SNN performance. In this paper, we introduce Multiple Threshold (MT) approaches to significantly enhance SNN accuracy by mitigating precision loss. We propose two distinct modes for MT implementation, depending on the membrane update rule: parallel mode and cascade mode. MT-SNN models can be efficiently trained on standard hardwares like GPUs and TPUs, while retaining the multiplication-free advantage crucial for deployment on neuromorphic devices. Our extensive experiments on CIFAR10, CIFAR100, ImageNet, and DVS-CIFAR10 datasets demonstrate that both MT modes substantially improve the performance of single-threshold SNNs, achieving higher accuracy with fewer time steps and comparable energy consumption. Moreover, MT-SNNs outperform state-of-the-art (SOTA) results. Notably, with MT, a Parametric-Leaky-Integrate-Fire (PLIF) based ResNet-34 architecture reaches 72.17\% accuracy on ImageNet with a single time step, surpassing the previous SOTA by 2.75\% despite using 4 steps.
Authors: Yi Huang, Dmitrii Torbunov, Brett Viren, Haiwang Yu, Jin Huang, Meifeng Lin, Yihui Ren
Abstract: Deep learning algorithms often are trained and deployed on different datasets. Any systematic difference between the training and a test dataset may degrade the algorithm performance--what is known as the domain shift problem. This issue is prevalent in many scientific domains where algorithms are trained on simulated data but applied to real-world datasets. Typically, the domain shift problem is solved through various domain adaptation methods. However, these methods are often tailored for a specific downstream task and may not easily generalize to different tasks. This work explores the feasibility of using an alternative way to solve the domain shift problem that is not specific to any downstream algorithm. The proposed approach relies on modern Unpaired Image-to-Image translation techniques, designed to find translations between different image domains in a fully unsupervised fashion. In this study, the approach is applied to a domain shift problem commonly encountered in Liquid Argon Time Projection Chamber (LArTPC) detector research when seeking a way to translate samples between two differently distributed detector datasets deterministically. This translation allows for mapping real-world data into the simulated data domain where the downstream algorithms can be run with much less domain-shift-related degradation. Conversely, using the translation from the simulated data in a real-world domain can increase the realism of the simulated dataset and reduce the magnitude of any systematic uncertainties. We adapted several UI2I translation algorithms to work on scientific data and demonstrated the viability of these techniques for solving the domain shift problem with LArTPC detector data. To facilitate further development of domain adaptation techniques for scientific datasets, the "Simple Liquid-Argon Track Samples" dataset used in this study also is published.
Authors: Jinke He, Thomas M. Moerland, Joery A. de Vries, Frans A. Oliehoek
Abstract: Model-based reinforcement learning (MBRL) has drawn considerable interest in recent years, given its promise to improve sample efficiency. Moreover, when using deep-learned models, it is possible to learn compact and generalizable models from data. In this work, we study MuZero, a state-of-the-art deep model-based reinforcement learning algorithm that distinguishes itself from existing algorithms by learning a value-equivalent model. Despite MuZero's success and impact in the field of MBRL, existing literature has not thoroughly addressed why MuZero performs so well in practice. Specifically, there is a lack of in-depth investigation into the value-equivalent model learned by MuZero and its effectiveness in model-based credit assignment and policy improvement, which is vital for achieving sample efficiency in MBRL. To fill this gap, we explore two fundamental questions through our empirical analysis: 1) to what extent does MuZero achieve its learning objective of a value-equivalent model, and 2) how useful are these models for policy improvement? Our findings reveal that MuZero's model struggles to generalize when evaluating unseen policies, which limits its capacity for additional policy improvement. However, MuZero's incorporation of the policy prior in MCTS alleviates this problem, which biases the search towards actions where the model is more accurate.
Authors: Ali Shirali, Alexander Schubert, Ahmed Alaa
Abstract: Medical treatments often involve a sequence of decisions, each informed by previous outcomes. This process closely aligns with reinforcement learning (RL), a framework for optimizing sequential decisions to maximize cumulative rewards under unknown dynamics. While RL shows promise for creating data-driven treatment plans, its application in medical contexts is challenging due to the frequent need to use sparse rewards, primarily defined based on mortality outcomes. This sparsity can reduce the stability of offline estimates, posing a significant hurdle in fully utilizing RL for medical decision-making. We introduce a deep Q-learning approach to obtain more reliable critical care policies by integrating relevant but noisy frequently measured biomarker signals into the reward specification without compromising the optimization of the main outcome. Our method prunes the action space based on all available rewards before training a final model on the sparse main reward. This approach minimizes potential distortions of the main objective while extracting valuable information from intermediate signals to guide learning. We evaluate our method in off-policy and offline settings using simulated environments and real health records from intensive care units. Our empirical results demonstrate that our method outperforms common offline RL methods such as conservative Q-learning and batch-constrained deep Q-learning. By disentangling sparse rewards and frequently measured reward proxies through action pruning, our work represents a step towards developing reliable policies that effectively harness the wealth of available information in data-intensive critical care environments.
Authors: Ramnath Kumar, Kushal Majmundar, Dheeraj Nagaraj, Arun Sai Suggala
Abstract: We present Re-weighted Gradient Descent (RGD), a novel optimization technique that improves the performance of deep neural networks through dynamic sample re-weighting. Leveraging insights from distributionally robust optimization (DRO) with Kullback-Leibler divergence, our method dynamically assigns importance weights to training data during each optimization step. RGD is simple to implement, computationally efficient, and compatible with widely used optimizers such as SGD and Adam. We demonstrate the effectiveness of RGD on various learning tasks, including supervised learning, meta-learning, and out-of-domain generalization. Notably, RGD achieves state-of-the-art results on diverse benchmarks, with improvements of +0.7% on DomainBed, +1.44% on tabular classification, \textcolor{blue}+1.94% on GLUE with BERT, and +1.01% on ImageNet-1K with ViT.
Authors: Egor Cherepanov, Alexey Staroverov, Dmitry Yudin, Alexey K. Kovalev, Aleksandr I. Panov
Abstract: Recently, the use of transformers in offline reinforcement learning has become a rapidly developing area. This is due to their ability to treat the agent's trajectory in the environment as a sequence, thereby reducing the policy learning problem to sequence modeling. In environments where the agent's decisions depend on past events (POMDPs), it is essential to capture both the event itself and the decision point in the context of the model. However, the quadratic complexity of the attention mechanism limits the potential for context expansion. One solution to this problem is to extend transformers with memory mechanisms. This paper proposes a Recurrent Action Transformer with Memory (RATE), a novel model architecture that incorporates a recurrent memory mechanism designed to regulate information retention. To evaluate our model, we conducted extensive experiments on memory-intensive environments (ViZDoom-Two-Colors, T-Maze, Memory Maze, Minigrid-Memory), classic Atari games, and MuJoCo control environments. The results show that using memory can significantly improve performance in memory-intensive environments, while maintaining or improving results in classic environments. We believe that our results will stimulate research on memory mechanisms for transformers applicable to offline reinforcement learning.
Authors: Xiaolin Fang, Caelan Reed Garrett, Clemens Eppner, Tom\'as Lozano-P\'erez, Leslie Pack Kaelbling, Dieter Fox
Abstract: Generative models such as diffusion models, excel at capturing high-dimensional distributions with diverse input modalities, e.g. robot trajectories, but are less effective at multi-step constraint reasoning. Task and Motion Planning (TAMP) approaches are suited for planning multi-step autonomous robot manipulation. However, it can be difficult to apply them to domains where the environment and its dynamics are not fully known. We propose to overcome these limitations by composing diffusion models using a TAMP system. We use the learned components for constraints and samplers that are difficult to engineer in the planning model, and use a TAMP solver to search for the task plan with constraint-satisfying action parameter values. To tractably make predictions for unseen objects in the environment, we define the learned samplers and TAMP operators on learned latent embedding of changing object states. We evaluate our approach in a simulated articulated object manipulation domain and show how the combination of classical TAMP, generative modeling, and latent embedding enables multi-step constraint-based reasoning. We also apply the learned sampler in the real world. Website: https://sites.google.com/view/dimsam-tamp
Authors: Simon Doll, Niklas Hanselmann, Lukas Schneider, Richard Schulz, Markus Enzweiler, Hendrik P. A. Lensch
Abstract: Following the tracking-by-attention paradigm, this paper introduces an object-centric, transformer-based framework for tracking in 3D. Traditional model-based tracking approaches incorporate the geometric effect of object- and ego motion between frames with a geometric motion model. Inspired by this, we propose S.T.A.R.-Track, which uses a novel latent motion model (LMM) to additionally adjust object queries to account for changes in viewing direction and lighting conditions directly in the latent space, while still modeling the geometric motion explicitly. Combined with a novel learnable track embedding that aids in modeling the existence probability of tracks, this results in a generic tracking framework that can be integrated with any query-based detector. Extensive experiments on the nuScenes benchmark demonstrate the benefits of our approach, showing state-of-the-art performance for DETR3D-based trackers while drastically reducing the number of identity switches of tracks at the same time.
Authors: Xijie Huang, Zhiqiang Shen, Pingcheng Dong, Kwang-Ting Cheng
Abstract: Despite the outstanding performance of transformers in both language and vision tasks, the expanding computation and model size have increased the demand for efficient deployment. To address the heavy computation and parameter drawbacks, quantization is frequently studied in the community as a representative model compression technique and has seen extensive use on ConvNets. However, due to the unique properties of transformers, the low-bit quantization applications are still limited and underexplored. In this paper, we identify the difficulty of transformer low-bit quantization-aware training on its unique variation behaviors, which significantly differ from ConvNets. Based on comprehensive quantitative analysis, we observe variation in three hierarchies: various module quantization sensitivities, outliers in static weight and activation distribution, and oscillation in dynamic parameter fluctuations. These variations of transformers bring instability to the quantization-aware training (QAT) and negatively influence the performance. We explore the best practices to alleviate the variation's influence during low-bit transformer QAT and propose a variation-aware quantization scheme for both vision and language transformers. We extensively verify and show our scheme can alleviate the variation and improve the performance of transformers across various models and tasks. Our solution substantially improves the 2-bit Swin-T and binary BERT-base, achieving a 3.35% and 1.4% accuracy improvement over previous state-of-the-art methods on ImageNet-1K and GLUE. Codes and models are available at https://github.com/HuangOwen/Quantization-Variation.
Authors: Juan Terven, Diana M. Cordova-Esparza, Alfonso Ramirez-Pedraza, Edgar A. Chavez-Urbiola, Julio A. Romero-Gonzalez
Abstract: When training or evaluating deep learning models, two essential parts are picking the proper loss function and deciding on performance metrics. In this paper, we provide a comprehensive overview of the most common loss functions and metrics used across many different types of deep learning tasks, from general tasks such as regression and classification to more specific tasks in Computer Vision and Natural Language Processing. We introduce the formula for each loss and metric, discuss their strengths and limitations, and describe how these methods can be applied to various problems within deep learning. This work can serve as a reference for researchers and practitioners in the field, helping them make informed decisions when selecting the most appropriate loss function and performance metrics for their deep learning projects.
Authors: Nachuan Xiao, Xiaoyin Hu, Kim-Chuan Toh
Abstract: In this paper, we focus on providing convergence guarantees for stochastic subgradient methods in minimizing nonsmooth nonconvex functions. We first investigate the global stability of a general framework for stochastic subgradient methods, where the corresponding differential inclusion admits a coercive Lyapunov function. We prove that, for any sequence of sufficiently small stepsizes and approximation parameters, coupled with sufficiently controlled noises, the iterates are uniformly bounded and asymptotically stabilize around the stable set of its corresponding differential inclusion. Moreover, we develop an improved analysis to apply our proposed framework to establish the global stability of a wide range of stochastic subgradient methods, where the corresponding Lyapunov functions are possibly non-coercive. These theoretical results illustrate the promising potential of our proposed framework for establishing the global stability of various stochastic subgradient methods.
Authors: Zhiyuan Li, Dongnan Liu, Heng Wang, Chaoyi Zhang, Weidong Cai
Abstract: Recently, training an image captioner without annotated image-sentence pairs has gained traction. Previous methods have faced limitations due to either using mismatched corpora for inaccurate pseudo annotations or relying on resource-intensive pre-training. To alleviate these challenges, we propose a new strategy where the prior knowledge from large pre-trained models (LPMs) is distilled and leveraged as supervision, and a retrieval process is integrated to further reinforce its effectiveness. Specifically, we introduce Retrieval-augmented Pseudo Sentence Generation (RaPSG), which can efficiently retrieve highly relevant short region descriptions from the mismatching corpora and use them to generate a variety of high-quality pseudo sentences via LPMs. Additionally, we introduce a fluency filter and a CLIP guidance objective to enhance contrastive information learning. Experimental results indicate that our method outperforms SOTA captioning models across various settings including zero-shot, unsupervised, semi-supervised, and cross-domain scenarios. Code is available at: https://github.com/Zhiyuan-Li-John/RaPSG.
Authors: Aditya Golatkar, Alessandro Achille, Ashwin Swaminathan, Stefano Soatto
Abstract: We introduce Compartmentalized Diffusion Models (CDM), a method to train different diffusion models (or prompts) on distinct data sources and arbitrarily compose them at inference time. The individual models can be trained in isolation, at different times, and on different distributions and domains and can be later composed to achieve performance comparable to a paragon model trained on all data simultaneously. Furthermore, each model only contains information about the subset of the data it was exposed to during training, enabling several forms of training data protection. In particular, CDMs enable perfect selective forgetting and continual learning for large-scale diffusion models, allow serving customized models based on the user's access rights. Empirically the quality (FID) of the class-conditional CDMs (8-splits) is within 10% (on fine-grained vision datasets) of a monolithic model (no splits), and allows (8x) faster forgetting compared monolithic model with a maximum FID increase of 1%. When applied to text-to-image generation, CDMs improve alignment (TIFA) by 14.33% over a monolithic model trained on MSCOCO. CDMs also allow determining the importance of a subset of the data (attribution) in generating particular samples, and reduce memorization.
Authors: Menglin Xia, Xuchao Zhang, Camille Couturier, Guoqing Zheng, Saravan Rajmohan, Victor Ruhle
Abstract: Large language models (LLMs) enhanced with retrieval augmentation has shown great performance in many applications. However, the computational demands for these models pose a challenge when applying them to real-time tasks, such as composition assistance. To address this, we propose Hybrid Retrieval-Augmented Composition Assistance (Hybrid-RACA), a novel system for real-time text prediction that efficiently combines a cloud-based LLM with a smaller client-side model through retrieval augmented memory. This integration enables the client model to generate better responses, benefiting from the LLM's capabilities and cloud-based data. Meanwhile, via a novel asynchronous memory update mechanism, the client model can deliver real-time completions to user inputs without the need to wait for responses from the cloud. Our experiments on five datasets demonstrate that Hybrid-RACA offers strong performance while maintaining low latency.
Authors: Damian S\'ojka, Sebastian Cygert, Bart{\l}omiej Twardowski, Tomasz Trzci\'nski
Abstract: Test-time adaptation is a promising research direction that allows the source model to adapt itself to changes in data distribution without any supervision. Yet, current methods are usually evaluated on benchmarks that are only a simplification of real-world scenarios. Hence, we propose to validate test-time adaptation methods using the recently introduced datasets for autonomous driving, namely CLAD-C and SHIFT. We observe that current test-time adaptation methods struggle to effectively handle varying degrees of domain shift, often resulting in degraded performance that falls below that of the source model. We noticed that the root of the problem lies in the inability to preserve the knowledge of the source model and adapt to dynamically changing, temporally correlated data streams. Therefore, we enhance the well-established self-training framework by incorporating a small memory buffer to increase model stability and at the same time perform dynamic adaptation based on the intensity of domain shift. The proposed method, named AR-TTA, outperforms existing approaches on both synthetic and more real-world benchmarks and shows robustness across a variety of TTA scenarios. The code is available at https://github.com/dmn-sjk/AR-TTA.
Authors: Mustafa O. Karabag, Sophia Smith, Negar Mehr, David Fridovich-Keil, Ufuk Topcu
Abstract: When interacting with other decision-making agents in non-adversarial scenarios, it is critical for an autonomous agent to have inferable behavior: The agent's actions must convey their intention and strategy. We model the inferability problem using Stackelberg games with observations where a leader and a follower repeatedly interact. During the interactions, the leader uses a fixed mixed strategy. The follower does not know the leader's strategy and dynamically reacts to the statistically inferred strategy based on the leader's previous actions. In the inference setting, the leader may have a lower performance compared to the setting where the follower has full information on the leader's strategy. We refer to the performance gap between these settings as the inferability gap. For a variety of game settings, we show that the inferability gap is upper-bounded by a function of the number of interactions and the stochasticity level of the leader's strategy, encouraging the use of inferable strategies with lower stochasticity levels. We also analyze bimatrix Stackelberg games and identify a set of games where the leader's near-optimal strategy may suffer from a large inferability gap.
Authors: Boyang Zheng, Chumeng Liang, Xiaoyu Wu
Abstract: Diffusion models build a new milestone for image generation yet raising public concerns, for they can be fine-tuned on unauthorized images for customization. Protection based on adversarial attacks rises to encounter this unauthorized diffusion customization, by adding protective watermarks to images and poisoning diffusion models. However, current protection, leveraging untargeted attacks, does not appear to be effective enough. In this paper, we propose a simple yet effective improvement for the protection against unauthorized diffusion customization by introducing targeted attacks. We show that by carefully selecting the target, targeted attacks significantly outperform untargeted attacks in poisoning diffusion models and degrading the customization image quality. Extensive experiments validate the superiority of our method on two mainstream customization methods of diffusion models, compared to existing protections. To explain the surprising success of targeted attacks, we delve into the mechanism of attack-based protections and propose a hypothesis based on our observation, which enhances the comprehension of attack-based protections. To the best of our knowledge, we are the first to both reveal the vulnerability of diffusion models to targeted attacks and leverage targeted attacks to enhance protection against unauthorized diffusion customization. Our code is available on GitHub: \url{https://github.com/psyker-team/mist-v2}.
Authors: Shouhua Zhang, Jiehan Zhou, Xue Ma, Susanna Pirttikangas, Chunsheng Yang
Abstract: Traditional fault diagnosis methods using Convolutional Neural Networks (CNNs) often struggle with capturing the temporal dynamics of vibration signals. To overcome this, the application of Transformer-based Vision Transformer (ViT) methods to fault diagnosis is gaining attraction. Nonetheless, these methods typically require extensive preprocessing, which increases computational complexity, potentially reducing the efficiency of the diagnosis process. Addressing this gap, this paper presents the Time Series Vision Transformer (TSViT), tailored for effective fault diagnosis. TSViT incorporates a convolutional layer to extract local features from vibration signals, alongside a transformer encoder to discern long-term temporal patterns. A thorough experimental comparison on three diverse datasets demonstrates TSViT's effectiveness and adaptability. Moreover, the paper delves into the influence of hyperparameter tuning on the model's performance, computational demand, and parameter count. Remarkably, TSViT achieves an unprecedented 100% average accuracy on two test sets and 99.99% on another, showcasing its exceptional diagnostic capabilities.
Authors: Huashan Sun, Yixiao Wu, Yuhao Ye, Yizhe Yang, Yinghao Li, Jiawei Li, Yang Gao
Abstract: Language style is necessary for AI systems to understand and generate diverse human language accurately. However, previous text style transfer primarily focused on sentence-level data-driven approaches, limiting exploration of potential problems in large language models (LLMs) and the ability to meet complex application needs. To overcome these limitations, we introduce a novel task called Public-Speaking Style Transfer (PSST), which aims to simulate humans to transform passage-level, official texts into a public-speaking style. Grounded in the analysis of real-world data from a linguistic perspective, we decompose public-speaking style into key sub-styles to pose challenges and quantify the style modeling capability of LLMs. For such intricate text style transfer, we further propose a fine-grained evaluation framework to analyze the characteristics and identify the problems of stylized texts. Comprehensive experiments suggest that current LLMs struggle to generate public speaking texts that align with human preferences, primarily due to excessive stylization and loss of semantic information.
Authors: Tianheng Ling, Chao Qian, Gregor Schiele
Abstract: Soft sensors are crucial in bridging autonomous systems' physical and digital realms, enhancing sensor fusion and perception. Instead of deploying soft sensors on the Cloud, this study shift towards employing on-device soft sensors, promising heightened efficiency and bolstering data security. Our approach substantially improves energy efficiency by deploying Artificial Intelligence (AI) directly on devices within a wireless sensor network. Furthermore, the synergistic integration of the Microcontroller Unit and Field-Programmable Gate Array (FPGA) leverages the rapid AI inference capabilities of the latter. Empirical evidence from our real-world use case demonstrates that FPGA-based soft sensors achieve inference times ranging remarkably from 1.04 to 12.04 microseconds. These compelling results highlight the considerable potential of our innovative approach for executing real-time inference tasks efficiently, thereby presenting a feasible alternative that effectively addresses the latency challenges intrinsic to Cloud-based deployments.
Authors: Julien Piet, Chawin Sitawarin, Vivian Fang, Norman Mu, David Wagner
Abstract: The capabilities of large language models have grown significantly in recent years and so too have concerns about their misuse. It is important to be able to distinguish machine-generated text from human-authored content. Prior works have proposed numerous schemes to watermark text, which would benefit from a systematic evaluation framework. This work focuses on LLM output watermarking techniques - as opposed to image or model watermarks - and proposes Mark My Words, a comprehensive benchmark for them under different natural language tasks. We focus on three main metrics: quality, size (i.e., the number of tokens needed to detect a watermark), and tamper resistance (i.e., the ability to detect a watermark after perturbing marked text). Current watermarking techniques are nearly practical enough for real-world use: Kirchenbauer et al. [33]'s scheme can watermark models like Llama 2 7B-chat or Mistral-7B-Instruct with no perceivable loss in quality on natural language tasks, the watermark can be detected with fewer than 100 tokens, and their scheme offers good tamper resistance to simple perturbations. However, they struggle to efficiently watermark code generations. We publicly release our benchmark (https://github.com/wagner-group/MarkMyWords).
Authors: Youshao Xiao, Zhenglei Zhou, Fagui Mao, Weichang Wu, Shangchun Zhao, Lin Ju, Lei Liang, Xiaolu Zhang, Jun Zhou
Abstract: Recently, ChatGPT or InstructGPT like large language models (LLM) has made a significant impact in the AI world. Many works have attempted to reproduce the complex InstructGPT's training pipeline, namely Reinforcement Learning with Human Feedback (RLHF). However, the mainstream distributed RLHF training methods typically adopt a fixed model placement strategy, referred to as the Co-located strategy. This strategy treats all four interdependent models involved in RLHF as a single entity, distributing them across all devices and applying parallelism techniques designed for a single model, regardless of the workload heterogeneity inherent to each model. As a result, this strategy exacerbates the generation bottlenecks in the RLHF training and degrades the overall training efficiency. To address these issues, we propose a flexible model placement framework that offers two general and agile model placement strategies. The Interleaving strategy helps reduce memory redundancy and communication costs of RLHF training by placing models without dependencies on exclusive devices with careful orchestration. On the other hand, the Disaggregated strategy improves the throughput of model training by separating the training and inference runtime of the RLHF pipeline with additional shadow models. Furthermore, our framework provides a simple user interface and guidelines to easily and flexibly configure these strategies in various training scenarios. Our experiments have shown that our strategy can achieve notable improvements up to 11x, compared to the current state-of-the-art (SOTA) approaches. The results highlight the effectiveness and adaptability of our methods in accelerating the training of distributed RLHF.
Authors: Ruifeng Yuan, Shichao Sun, Yongqi Li, Zili Wang, Ziqiang Cao, Wenjie Li
Abstract: With the rapid development of large language models, AI assistants like ChatGPT have become increasingly integrated into people's works and lives but are limited in personalized services. In this paper, we present a plug-and-play framework that could facilitate personalized large language model assistants with evolving conditional memory. The personalized assistant focuses on intelligently preserving the knowledge and experience from the history dialogue with the user, which can be applied to future tailored responses that better align with the user's preferences. Generally, the assistant generates a set of records from the dialogue dialogue, stores them in a memory bank, and retrieves related memory to improve the quality of the response. For the crucial memory design, we explore different ways of constructing the memory and propose a new memorizing mechanism named conditional memory. We also investigate the retrieval and usage of memory in the generation process. We build the first benchmark to evaluate personalized assistants' ability from three aspects. The experimental results illustrate the effectiveness of our method.
Authors: Zhaokun Jiang, Qianxi Lv, Ziyin Zhang, Lei Lei
Abstract: Large language models have demonstrated parallel and even superior translation performance compared to neural machine translation (NMT) systems. However, existing comparative studies between them mainly rely on automated metrics, raising questions into the feasibility of these metrics and their alignment with human judgment. The present study investigates the convergences and divergences between automated metrics and human evaluation in assessing the quality of machine translation from ChatGPT and three NMT systems. To perform automatic assessment, four automated metrics are employed, while human evaluation incorporates the DQF-MQM error typology and six rubrics. Notably, automatic assessment and human evaluation converge in measuring formal fidelity (e.g., error rates), but diverge when evaluating semantic and pragmatic fidelity, with automated metrics failing to capture the improvement of ChatGPT's translation brought by prompt engineering. These results underscore the indispensable role of human judgment in evaluating the performance of advanced translation tools at the current stage.
Authors: Haibo Wang, Weifeng Ge
Abstract: With the breakthrough of multi-modal large language models, answering complex visual questions that demand advanced reasoning abilities and world knowledge has become a much more important testbed for developing AI models than ever. However, equipping AI models with robust cross-modality reasoning ability remains challenging since the cognition scheme of humans has not been understood systematically. In this paper, we believe that if we can collect visual clues in the given image as much as possible, we will recognize the image more accurately, understand the question better, recall relevant knowledge more easily, and finally reason out the answer. We discover these rich visual clues by mining question-answer pairs in images and sending them into multi-modal large language models as prompts. We call the proposed method Q&A Prompts. Specifically, we first use the image-answer pairs and the corresponding questions in the training set as inputs and outputs to train a visual question generation model. Then, we use an image tagging model to identify various instances and send packaged image-tag pairs into the visual question generation model to generate relevant questions with the extracted image tags as answers. Finally, we encode these generated question-answer pairs as prompts with a visual-aware prompting module and send them into pre-trained multi-modal large language models to reason out the final answers. Experimental results show that, compared with state-of-the-art methods, our Q&A Prompts achieves substantial improvements on the challenging visual question answering datasets requiring reasoning over diverse world knowledge, such as OK-VQA and A-OKVQA.
Authors: Abhimanyu Hans, Avi Schwarzschild, Valeriia Cherepanova, Hamid Kazemi, Aniruddha Saha, Micah Goldblum, Jonas Geiping, Tom Goldstein
Abstract: Detecting text generated by modern large language models is thought to be hard, as both LLMs and humans can exhibit a wide range of complex behaviors. However, we find that a score based on contrasting two closely related language models is highly accurate at separating human-generated and machine-generated text. Based on this mechanism, we propose a novel LLM detector that only requires simple calculations using a pair of pre-trained LLMs. The method, called Binoculars, achieves state-of-the-art accuracy without any training data. It is capable of spotting machine text from a range of modern LLMs without any model-specific modifications. We comprehensively evaluate Binoculars on a number of text sources and in varied situations. Over a wide range of document types, Binoculars detects over 90% of generated samples from ChatGPT (and other LLMs) at a false positive rate of 0.01%, despite not being trained on any ChatGPT data.
Authors: Mathew Huerta-Enochian, Seung Yong Ko
Abstract: We present a novel study analyzing the effects of various prompt loss token weights (PLW) for supervised instruction fine-tuning (SIFT). While prompt-masking (PLW = 0) is common for SIFT, some fine-tuning APIs support fractional PLWs and suggest that using a small non-zero PLW can help stabilize learning when fine-tuning on short-completion data. However, there has never been a study confirming this claim, and OpenAI, a major cloud-based SIFT provider, recently removed this parameter from their fine-tuning API. We found that performance of models fine-tuned on short-completion data had a statistically-significant negative quadratic relationship with PLW. Using small values (0.01 - 0.5) of PLW produced better results on multiple-choice and short-generation benchmarks (outperforming models fine-tuned on long-completion data) while large values (~ 1.0) of PLW produced better results on long-generation benchmarks. We explained this effect and verified its importance through additional experiments. This research serves as a warning to API providers about the importance of providing a PLW parameter for SIFT.
Authors: Lai Wei, Zhiquan Tan, Chenghai Li, Jindong Wang, Weiran Huang
Abstract: Large Language Models (LLMs) have transformed natural language processing and extended their powerful capabilities to multi-modal domains. As LLMs continue to advance, it is crucial to develop diverse and appropriate metrics for their evaluation. In this paper, we introduce a novel rank-based metric, Diff-eRank, grounded in information theory and geometry principles. Diff-eRank assesses LLMs by analyzing their hidden representations, providing a quantitative measure of how efficiently they eliminate redundant information during training. We demonstrate the applicability of Diff-eRank in both single-modal (e.g., language) and multi-modal settings. For language models, our results show that Diff-eRank increases with model size and correlates well with conventional metrics such as loss and accuracy. In the multi-modal context, we propose an alignment evaluation method based on the eRank, and verify that contemporary multi-modal LLMs exhibit strong alignment performance based on our method. Our code is publicly available at https://github.com/waltonfuture/Diff-eRank.
Authors: Wesley H. Holliday, Matthew Mandelkern, Cedegao E. Zhang
Abstract: The reasoning abilities of large language models (LLMs) are the topic of a growing body of research in AI and cognitive science. In this paper, we probe the extent to which twenty-nine LLMs are able to distinguish logically correct inferences from logically fallacious ones. We focus on inference patterns involving conditionals (e.g., 'If Ann has a queen, then Bob has a jack') and epistemic modals (e.g., 'Ann might have an ace', 'Bob must have a king'). These inferences have been of special interest to logicians, philosophers, and linguists, since they play a central role in the fundamental human ability to reason about distal possibilities. Assessing LLMs on these inferences is thus highly relevant to the question of how much the reasoning abilities of LLMs match those of humans. All the LLMs we tested make some basic mistakes with conditionals or modals, though zero-shot chain-of-thought prompting helps them make fewer mistakes. Even the best performing LLMs make basic errors in modal reasoning, display logically inconsistent judgments across inference patterns involving epistemic modals and conditionals, and give answers about complex conditional inferences that do not match reported human judgments. These results highlight gaps in basic logical reasoning in today's LLMs.
Authors: Haoran Ye, Jiarui Wang, Zhiguang Cao, Federico Berto, Chuanbo Hua, Haeyeon Kim, Jinkyoo Park, Guojie Song
Abstract: The omnipresence of NP-hard combinatorial optimization problems (COPs) compels domain experts to engage in trial-and-error heuristic design. The long-standing endeavor of design automation has gained new momentum with the rise of large language models (LLMs). This paper introduces Language Hyper-Heuristics (LHHs), an emerging variant of Hyper-Heuristics that leverages LLMs for heuristic generation, featuring minimal manual intervention and open-ended heuristic spaces. To empower LHHs, we present Reflective Evolution (ReEvo), a novel integration of evolutionary search for efficiently exploring the heuristic space, and LLM reflections to provide verbal gradients within the space. Across five heterogeneous algorithmic types, six different COPs, and both white-box and black-box views of COPs, ReEvo yields state-of-the-art and competitive meta-heuristics, evolutionary algorithms, heuristics, and neural solvers, while being more sample-efficient than prior LHHs.
Authors: Siming Yan, Min Bai, Weifeng Chen, Xiong Zhou, Qixing Huang, Li Erran Li
Abstract: By combining natural language understanding, generation capabilities, and breadth of knowledge of large language models with image perception, recent large vision language models (LVLMs) have shown unprecedented visual reasoning capabilities. However, the generated text often suffers from inaccurate grounding in the visual input, resulting in errors such as hallucination of nonexistent scene elements, missing significant parts of the scene, and inferring incorrect attributes of and relationships between objects. To address these issues, we introduce a novel framework, ViGoR (Visual Grounding Through Fine-Grained Reward Modeling) that utilizes fine-grained reward modeling to significantly enhance the visual grounding of LVLMs over pre-trained baselines. This improvement is efficiently achieved using much cheaper human evaluations instead of full supervisions, as well as automated methods. We show the effectiveness of our approach through a variety of evaluation methods and benchmarks. Additionally, we released our human annotation (https://github.com/amazon-science/vigor) comprising 15,440 images and generated text pairs with fine-grained evaluations to contribute to related research in the community.
Authors: Yuancheng Xu, Jiarui Yao, Manli Shu, Yanchao Sun, Zichu Wu, Ning Yu, Tom Goldstein, Furong Huang
Abstract: Vision-Language Models (VLMs) excel in generating textual responses from visual inputs, but their versatility raises security concerns. This study takes the first step in exposing VLMs' susceptibility to data poisoning attacks that can manipulate responses to innocuous, everyday prompts. We introduce Shadowcast, a stealthy data poisoning attack where poison samples are visually indistinguishable from benign images with matching texts. Shadowcast demonstrates effectiveness in two attack types. The first is a traditional Label Attack, tricking VLMs into misidentifying class labels, such as confusing Donald Trump for Joe Biden. The second is a novel Persuasion Attack, leveraging VLMs' text generation capabilities to craft persuasive and seemingly rational narratives for misinformation, such as portraying junk food as healthy. We show that Shadowcast effectively achieves the attacker's intentions using as few as 50 poison samples. Crucially, the poisoned samples demonstrate transferability across different VLM architectures, posing a significant concern in black-box settings. Moreover, Shadowcast remains potent under realistic conditions involving various text prompts, training data augmentation, and image compression techniques. This work reveals how poisoned VLMs can disseminate convincing yet deceptive misinformation to everyday, benign users, emphasizing the importance of data integrity for responsible VLM deployments. Our code is available at: https://github.com/umd-huang-lab/VLM-Poisoning.
Authors: Zongxia Li, Ishani Mondal, Yijun Liang, Huy Nghiem, Jordan Lee Boyd-Graber
Abstract: Question answering (QA) can only make progress if we know if an answer is correct, but current answer correctness (AC) metrics struggle with verbose, free-form answers from large language models (LLMs). There are two challenges with current short-form QA evaluations: a lack of diverse styles of evaluation data and an over-reliance on expensive and slow LLMs. LLM-based scorers correlate better with humans, but this expensive task has only been tested on limited QA datasets. We rectify these issues by providing rubrics and datasets for evaluating machine QA adopted from the Trivia community. We also propose an efficient, and interpretable QA evaluation that is more stable than an exact match and neural methods(BERTScore).
Authors: Xinchen Zhang, Ling Yang, Yaqi Cai, Zhaochen Yu, Kai-Ni Wang, Jiake Xie, Ye Tian, Minkai Xu, Yong Tang, Yujiu Yang, Bin Cui
Abstract: Diffusion models have achieved remarkable advancements in text-to-image generation. However, existing models still have many difficulties when faced with multiple-object compositional generation. In this paper, we propose RealCompo, a new training-free and transferred-friendly text-to-image generation framework, which aims to leverage the respective advantages of text-to-image models and spatial-aware image diffusion models (e.g., layout, keypoints and segmentation maps) to enhance both realism and compositionality of the generated images. An intuitive and novel balancer is proposed to dynamically balance the strengths of the two models in denoising process, allowing plug-and-play use of any model without extra training. Extensive experiments show that our RealCompo consistently outperforms state-of-the-art text-to-image models and spatial-aware image diffusion models in multiple-object compositional generation while keeping satisfactory realism and compositionality of the generated images. Notably, our RealCompo can be seamlessly extended with a wide range of spatial-aware image diffusion models and stylized diffusion models. Our code is available at: https://github.com/YangLing0818/RealCompo
Authors: Zekun Jiang, Dongjie Cheng, Ziyuan Qin, Jun Gao, Qicheng Lao, Abdullaev Bakhrom Ismoilovich, Urazboev Gayrat, Yuldashov Elyorbek, Bekchanov Habibullo, Defu Tang, LinJing Wei, Kang Li, Le Zhang
Abstract: This study presents a novel multimodal medical image zero-shot segmentation algorithm named the text-visual-prompt segment anything model (TV-SAM) without any manual annotations. The TV-SAM incorporates and integrates the large language model GPT-4, the vision language model GLIP, and the SAM to autonomously generate descriptive text prompts and visual bounding box prompts from medical images, thereby enhancing the SAM's capability for zero-shot segmentation. Comprehensive evaluations are implemented on seven public datasets encompassing eight imaging modalities to demonstrate that TV-SAM can effectively segment unseen targets across various modalities without additional training. TV-SAM significantly outperforms SAM AUTO and GSAM, closely matching the performance of SAM BBOX with gold standard bounding box prompts and surpasses the state-of-the-art methods on specific datasets such as ISIC and WBC. The study indicates that TV-SAM serves as an effective multimodal medical image zero-shot segmentation algorithm, highlighting the significant contribution of GPT-4 to zero-shot segmentation. By integrating foundational models such as GPT-4, GLIP, and SAM, the ability to address complex problems in specialized domains can be enhanced.
Authors: Masanari Ohi, Masahiro Kaneko, Ryuto Koike, Mengsay Loem, Naoaki Okazaki
Abstract: Large Language Models (LLMs) are widely used to evaluate natural language generation tasks as automated metrics. However, the likelihood, a measure of LLM's plausibility for a sentence, can vary due to superficial differences in sentences, such as word order and sentence structure. It is therefore possible that there might be a likelihood bias if LLMs are used for evaluation: they might overrate sentences with higher likelihoods while underrating those with lower likelihoods. In this paper, we investigate the presence and impact of likelihood bias in LLM-based evaluators. We also propose a method to mitigate the likelihood bias. Our method utilizes highly biased instances as few-shot examples for in-context learning. Our experiments in evaluating the data-to-text and grammatical error correction tasks reveal that several LLMs we test display a likelihood bias. Furthermore, our proposed method successfully mitigates this bias, also improving evaluation performance (in terms of correlation of models with human scores) significantly.
Authors: Tianyu Chen, Haoyi Zhou, Ying Li, Hao Wang, Chonghan Gao, Rongye Shi, Shanghang Zhang, Jianxin Li
Abstract: Foundation models have revolutionized language modeling, while whether this success is replicated in scientific computing remains unexplored. We present OmniArch, the first prototype aiming at solving multi-scale and multi-physics scientific computing problems with physical alignment. We addressed all three challenges with one unified architecture. Its pre-training stage contains a Fourier Encoder-decoder fading out the disharmony across separated dimensions and a Transformer backbone integrating quantities through temporal dynamics, and the novel PDE-Aligner performs physics-informed fine-tuning under flexible conditions. As far as we know, we first conduct 1D-2D-3D united pre-training on the PDEBench, and it sets not only new performance benchmarks for 1D, 2D, and 3D PDEs but also demonstrates exceptional adaptability to new physics via in-context and zero-shot learning approaches, which supports realistic engineering applications and foresight physics discovery.
Authors: Georg Manten, Cecilia Casolo, Emilio Ferrucci, S{\o}ren Wengel Mogensen, Cristopher Salvi, Niki Kilbertus
Abstract: Inferring the causal structure underlying stochastic dynamical systems from observational data holds great promise in domains ranging from science and health to finance. Such processes can often be accurately modeled via stochastic differential equations (SDEs), which naturally imply causal relationships via "which variables enter the differential of which other variables". In this paper, we develop conditional independence (CI) constraints on coordinate processes over selected intervals that are Markov with respect to the acyclic dependence graph (allowing self-loops) induced by a general SDE model. We then provide a sound and complete causal discovery algorithm, capable of handling both fully and partially observed data, and uniquely recovering the underlying or induced ancestral graph by exploiting time directionality assuming a CI oracle. Finally, to make our algorithm practically usable, we also propose a flexible, consistent signature kernel-based CI test to infer these constraints from data. We extensively benchmark the CI test in isolation and as part of our causal discovery algorithms, outperforming existing approaches in SDE models and beyond.
Authors: Toru Lin, Zhao-Heng Yin, Haozhi Qi, Pieter Abbeel, Jitendra Malik
Abstract: Manipulating objects with two multi-fingered hands has been a long-standing challenge in robotics, due to the contact-rich nature of many manipulation tasks and the complexity inherent in coordinating a high-dimensional bimanual system. In this work, we share novel insights into physical modeling, real-time perception, and reward design that enable policies trained in simulation using deep reinforcement learning (RL) to be effectively and efficiently transferred to the real world. Specifically, we consider the problem of twisting lids of various bottle-like objects with two hands, demonstrating policies with generalization capabilities across a diverse set of unseen objects as well as dynamic and dexterous behaviors. To the best of our knowledge, this is the first sim-to-real RL system that enables such capabilities on bimanual multi-fingered hands.
Authors: Chuanqi Cheng, Quan Tu, Shuo Shang, Cunli Mao, Zhengtao Yu, Wei Wu, Rui Yan
Abstract: Personalized dialogue systems have gained significant attention in recent years for their ability to generate responses in alignment with different personas. However, most existing approaches rely on pre-defined personal profiles, which are not only time-consuming and labor-intensive to create but also lack flexibility. We propose In-Dialogue Learning (IDL), a fine-tuning framework that enhances the ability of pre-trained large language models to leverage dialogue history to characterize persona for completing personalized dialogue generation tasks without pre-defined profiles. Our experiments on three datasets demonstrate that IDL brings substantial improvements, with BLEU and ROUGE scores increasing by up to 200% and 247%, respectively. Additionally, the results of human evaluations further validate the efficacy of our proposed method.
Authors: Markus Huff, Elanur Ulak\c{c}{\i}
Abstract: Large language models (LLMs) are excelling across various tasks despite not being based on human cognition, prompting an investigation into their potential to offer insights into human cognitive mechanisms. This study examines ChatGPT's ability to predict human performance in a language-based memory task. Following theories of text comprehension, we hypothesized that recognizing ambiguous sentences is easier with relevant preceding context. Participants, including humans and ChatGPT, were given pairs of sentences: the second always a garden-path sentence, and the first providing either fitting or unfitting context. We measured their ratings of sentence relatedness and memorability. Results showed a strong alignment between ChatGPT's assessments and human memory performance. Sentences in the fitting context were rated as being more related and memorable by ChatGPT and were better remembered by humans, highlighting LLMs' potential to predict human performance and contribute to psychological theories.
Authors: Yunlong Song, Sangbae Kim, Davide Scaramuzza
Abstract: This work explores the potential of using differentiable simulation for learning quadruped locomotion. Differentiable simulation promises fast convergence and stable training by computing low-variance first-order gradients using robot dynamics. However, its usage for legged robots is still limited to simulation. The main challenge lies in the complex optimization landscape of robotic tasks due to discontinuous dynamics. This work proposes a new differentiable simulation framework to overcome these challenges. Our approach combines a high-fidelity, non-differentiable simulator for forward dynamics with a simplified surrogate model for gradient backpropagation. This approach maintains simulation accuracy by aligning the robot states from the surrogate model with those of the precise, non-differentiable simulator. Our framework enables learning quadruped walking in simulation in minutes without parallelization. When augmented with GPU parallelization, our approach allows the quadruped robot to master diverse locomotion skills on challenging terrains in minutes. We demonstrate that differentiable simulation outperforms a reinforcement learning algorithm (PPO) by achieving significantly better sample efficiency while maintaining its effectiveness in handling large-scale environments. Our method represents one of the first successful applications of differentiable simulation to real-world quadruped locomotion, offering a compelling alternative to traditional RL methods.
Authors: Run Shao, Zhaoyang Zhang, Chao Tao, Yunsheng Zhang, Chengli Peng, Haifeng Li
Abstract: The tokenizer, as one of the fundamental components of large models, has long been overlooked or even misunderstood in visual tasks. One key factor of the great comprehension power of the large language model is that natural language tokenizers utilize meaningful words or subwords as the basic elements of language. In contrast, mainstream visual tokenizers, represented by patch-based methods such as Patch Embed, rely on meaningless rectangular patches as basic elements of vision, which cannot serve as effectively as words or subwords in language. Starting from the essence of the tokenizer, we defined semantically independent regions (SIRs) for vision. We designed a simple HOmogeneous visual tOKenizer: HOOK. HOOK mainly consists of two modules: the Object Perception Module (OPM) and the Object Vectorization Module (OVM). To achieve homogeneity, the OPM splits the image into 4*4 pixel seeds and then utilizes the attention mechanism to perceive SIRs. The OVM employs cross-attention to merge seeds within the same SIR. To achieve adaptability, the OVM defines a variable number of learnable vectors as cross-attention queries, allowing for the adjustment of token quantity. We conducted experiments on the NWPU-RESISC45, WHU-RS19 classification dataset, and GID5 segmentation dataset for sparse and dense tasks. The results demonstrate that the visual tokens obtained by HOOK correspond to individual objects, which demonstrates homogeneity. HOOK outperformed Patch Embed by 6\% and 10\% in the two tasks and achieved state-of-the-art performance compared to the baselines used for comparison. Compared to Patch Embed, which requires more than one hundred tokens for one image, HOOK requires only 6 and 8 tokens for sparse and dense tasks, respectively, resulting in efficiency improvements of 1.5 to 2.8 times. The code is available at https://github.com/GeoX-Lab/Hook.
Authors: Xiaodong Chen, Yuxuan Hu, Jing Zhang, Yanling Wang, Cuiping Li, Hong Chen
Abstract: This paper introduces LLM-Streamline, a pioneer work on layer pruning for large language models (LLMs). It is based on the observation that different layers have varying impacts on hidden states, enabling the identification of less important layers to be pruned.LLM-Streamline comprises two parts: layer pruning, which removes consecutive layers with the lowest importance based on target sparsity, and layer replacement, a novel module that trains a lightweight network to replace the pruned layers to mitigate performance loss. Additionally, a new metric called stability is proposed to address the limitations of the widely used accuracy metric in evaluating model compression. Experiments show that LLM-Streamline outperforms both previous and concurrent state-of-the-art pruning methods in terms of both performance and training efficiency.
Authors: Haofeng Yuan, Rongping Zhu, Wanlu Yang, Shiji Song, Keyou You, Wei Fan, C. L. Philip Chen
Abstract: The traveling purchaser problem (TPP) is an important combinatorial optimization problem with broad applications. Due to the coupling between routing and purchasing, existing works on TPPs commonly address route construction and purchase planning simultaneously, which, however, leads to exact methods with high computational cost and heuristics with sophisticated design but limited performance. In sharp contrast, we propose a novel approach based on deep reinforcement learning (DRL), which addresses route construction and purchase planning separately, while evaluating and optimizing the solution from a global perspective. The key components of our approach include a bipartite graph representation for TPPs to capture the market-product relations, and a policy network that extracts information from the bipartite graph and uses it to sequentially construct the route. One significant benefit of our framework is that we can efficiently construct the route using the policy network, and once the route is determined, the associated purchasing plan can be easily derived through linear programming, while, leveraging DRL, we can train the policy network to optimize the global solution objective. Furthermore, by introducing a meta-learning strategy, the policy network can be trained stably on large-sized TPP instances, and generalize well across instances of varying sizes and distributions, even to much larger instances that are never seen during training. Experiments on various synthetic TPP instances and the TPPLIB benchmark demonstrate that our DRL-based approach can significantly outperform well-established TPP heuristics, reducing the optimality gap by 40%-90%, and also showing an advantage in runtime, especially on large-sized instances.
Authors: Paolo Faraboschi, Ellis Giles, Justin Hotard, Konstanty Owczarek, Andrew Wheeler
Abstract: The world has recently witnessed an unprecedented acceleration in demands for Machine Learning and Artificial Intelligence applications. This spike in demand has imposed tremendous strain on the underlying technology stack in supply chain, GPU-accelerated hardware, software, datacenter power density, and energy consumption. If left on the current technological trajectory, future demands show insurmountable spending trends, further limiting market players, stifling innovation, and widening the technology gap. To address these challenges, we propose a fundamental change in the AI training infrastructure throughout the technology ecosystem. The changes require advancements in supercomputing and novel AI training approaches, from high-end software to low-level hardware, microprocessor, and chip design, while advancing the energy efficiency required by a sustainable infrastructure. This paper presents the analytical framework that quantitatively highlights the challenges and points to the opportunities to reduce the barriers to entry for training large language models.
Authors: Sayeh Sharify, Utkarsh Saxena, Zifei Xu, Wanzin Yazar, Ilya Soloveychik, Xin Wang
Abstract: Large Language Models (LLMs) have distinguished themselves with outstanding performance in complex language modeling tasks, yet they come with significant computational and storage challenges. This paper explores the potential of quantization to mitigate these challenges. We systematically study the combined application of three well-known post-training techniques, SmoothQuant, AWQ, and GPTQ, and provide a comprehensive analysis of their interactions and implications for advancing LLM quantization. We enhance the versatility of these methods by enabling quantization to microscaling (MX) formats, extending the applicability of these PTQ algorithms beyond their original fixed-point format targets. We show that combining different PTQ methods enables us to quantize models to 4-bit weights and 8-bit activations using the MXINT format with negligible accuracy loss compared to the uncompressed baseline.
Authors: Elena Merdjanovska, Ansar Aynetdinov, Alan Akbik
Abstract: Available training data for named entity recognition (NER) often contains a significant percentage of incorrect labels for entity types and entity boundaries. Such label noise poses challenges for supervised learning and may significantly deteriorate model quality. To address this, prior work proposed various noise-robust learning approaches capable of learning from data with partially incorrect labels. These approaches are typically evaluated using simulated noise where the labels in a clean dataset are automatically corrupted. However, as we show in this paper, this leads to unrealistic noise that is far easier to handle than real noise caused by human error or semi-automatic annotation. To enable the study of the impact of various types of real noise, we introduce NoiseBench, an NER benchmark consisting of clean training data corrupted with 6 types of real noise, including expert errors, crowdsourcing errors, automatic annotation errors and LLM errors. We present an analysis that shows that real noise is significantly more challenging than simulated noise, and show that current state-of-the-art models for noise-robust learning fall far short of their theoretically achievable upper bound. We release NoiseBench to the research community.
Authors: Arushi Jain, Josiah P. Hanna, Doina Precup
Abstract: General Value Functions (GVFs) (Sutton et al., 2011) represent predictive knowledge in reinforcement learning. Each GVF computes the expected return for a given policy, based on a unique reward. Existing methods relying on fixed behavior policies or pre-collected data often face data efficiency issues when learning multiple GVFs in parallel using off-policy methods. To address this, we introduce GVFExplorer, which adaptively learns a single behavior policy that efficiently collects data for evaluating multiple GVFs in parallel. Our method optimizes the behavior policy by minimizing the total variance in return across GVFs, thereby reducing the required environmental interactions. We use an existing temporal-difference-style variance estimator to approximate the return variance. We prove that each behavior policy update decreases the overall mean squared error in GVF predictions. We empirically show our method's performance in tabular and nonlinear function approximation settings, including Mujoco environments, with stationary and non-stationary reward signals, optimizing data usage and reducing prediction errors across multiple GVFs.
Authors: Yunfan Jiang, Chen Wang, Ruohan Zhang, Jiajun Wu, Li Fei-Fei
Abstract: Learning in simulation and transferring the learned policy to the real world has the potential to enable generalist robots. The key challenge of this approach is to address simulation-to-reality (sim-to-real) gaps. Previous methods often require domain-specific knowledge a priori. We argue that a straightforward way to obtain such knowledge is by asking humans to observe and assist robot policy execution in the real world. The robots can then learn from humans to close various sim-to-real gaps. We propose TRANSIC, a data-driven approach to enable successful sim-to-real transfer based on a human-in-the-loop framework. TRANSIC allows humans to augment simulation policies to overcome various unmodeled sim-to-real gaps holistically through intervention and online correction. Residual policies can be learned from human corrections and integrated with simulation policies for autonomous execution. We show that our approach can achieve successful sim-to-real transfer in complex and contact-rich manipulation tasks such as furniture assembly. Through synergistic integration of policies learned in simulation and from humans, TRANSIC is effective as a holistic approach to addressing various, often coexisting sim-to-real gaps. It displays attractive properties such as scaling with human effort. Videos and code are available at https://transic-robot.github.io/
Authors: Lorenzo Sani, Alex Iacob, Zeyu Cao, Bill Marino, Yan Gao, Tomas Paulik, Wanru Zhao, William F. Shen, Preslav Aleksandrov, Xinchi Qiu, Nicholas D. Lane
Abstract: Generative pre-trained large language models (LLMs) have demonstrated impressive performance over a wide range of tasks, thanks to the unprecedented amount of data they have been trained on. As established scaling laws indicate, LLMs' future performance improvement depends on the amount of computing and data sources they can leverage for pre-training. Federated learning (FL) has the potential to unleash the majority of the planet's data and computational resources, which are underutilized by the data-center-focused training methodology of current LLM practice. Our work presents a robust, flexible, reproducible FL approach that enables large-scale collaboration across institutions to train LLMs. We propose a scalable deployment system called Photon to enable the investigation and development of this new training paradigm for LLM pre-training. We show that Photon can be used by organizations interested in collaborating with their private data sources and computational resources for pre-training LLMs with billions of parameters. This paradigm would mobilize more computational and data resources while matching or potentially exceeding centralized performance. We further show the effectiveness of the federated training scales with model size and present our approach for training billion-scale federated LLMs using limited resources. Thus far, we have used Photon to train LLM models to the size of 7B parameters and anticipate larger models being completed in the near future. Finally, we show that LLM training is highly resilient to the classical challenges of federated statistical and hardware heterogeneity. Furthermore, we show that convergence is robust to partial participation, opening the avenue for compute-efficient collaborative training. Photon will help data-rich actors to become the protagonists of LLMs pre-training instead of leaving the stage to compute-rich actors alone.
Authors: Alvin Heng, Alexandre H. Thiery, Harold Soh
Abstract: Out-of-distribution (OOD) detection is a critical task in machine learning that seeks to identify abnormal samples. Traditionally, unsupervised methods utilize a deep generative model for OOD detection. However, such approaches require a new model to be trained for each inlier dataset. This paper explores whether a single model can perform OOD detection across diverse tasks. To that end, we introduce Diffusion Paths (DiffPath), which uses a single diffusion model originally trained to perform unconditional generation for OOD detection. We introduce a novel technique of measuring the rate-of-change and curvature of the diffusion paths connecting samples to the standard normal. Extensive experiments show that with a single model, DiffPath is competitive with prior work using individual models on a variety of OOD tasks involving different distributions. Our code is publicly available at https://github.com/clear-nus/diffpath.
Authors: Shiji Huang, Lei Ye, Min Chen, Wenhai Luo, Chenqi Xu, Deyuan Liang, Dihong Wang
Abstract: A thorough understanding of the interaction between the target agent and surrounding agents is a prerequisite for accurate trajectory prediction. Although many methods have been explored, they all assign correlation coefficients to surrounding agents in a purely learning-based manner. In this study, we present ASPILin, which manually selects interacting agents and calculates their correlations instead of attention scores. Surprisingly, these simple modifications can significantly improve prediction performance and substantially reduce computational costs. Additionally, ASPILin models the interacting agents at each past time step separately, rather than only modeling the interacting agents at the current time step. This clarifies the causal chain of the target agent's historical trajectory and helps the model better understand dynamic interactions. We intentionally simplified our model in other aspects, such as map encoding. Remarkably, experiments conducted on the INTERACTION, highD, and CitySim datasets demonstrate that our method is efficient and straightforward, outperforming other state-of-the-art methods.
Authors: Guy Ohayon, Michael Elad, Tomer Michaeli
Abstract: Fairness in image restoration tasks is the desire to treat different sub-groups of images equally well. Existing definitions of fairness in image restoration are highly restrictive. They consider a reconstruction to be a correct outcome for a group (e.g., women) only if it falls within the group's set of ground truth images (e.g., natural images of women); otherwise, it is considered entirely incorrect. Consequently, such definitions are prone to controversy, as errors in image restoration can manifest in various ways. In this work we offer an alternative approach towards fairness in image restoration, by considering the Group Perceptual Index (GPI), which we define as the statistical distance between the distribution of the group's ground truth images and the distribution of their reconstructions. We assess the fairness of an algorithm by comparing the GPI of different groups, and say that it achieves perfect Perceptual Fairness (PF) if the GPIs of all groups are identical. We motivate and theoretically study our new notion of fairness, draw its connection to previous ones, and demonstrate its utility on state-of-the-art face image restoration algorithms.
Authors: Ruiyuan Gao, Kai Chen, Zhihao Li, Lanqing Hong, Zhenguo Li, Qiang Xu
Abstract: While controllable generative models for images and videos have achieved remarkable success, high-quality models for 3D scenes, particularly in unbounded scenarios like autonomous driving, remain underdeveloped due to high data acquisition costs. In this paper, we introduce MagicDrive3D, a novel pipeline for controllable 3D street scene generation that supports multi-condition control, including BEV maps, 3D objects, and text descriptions. Unlike previous methods that reconstruct before training the generative models, MagicDrive3D first trains a video generation model and then reconstructs from the generated data. This innovative approach enables easily controllable generation and static scene acquisition, resulting in high-quality scene reconstruction. To address the minor errors in generated content, we propose deformable Gaussian splatting with monocular depth initialization and appearance modeling to manage exposure discrepancies across viewpoints. Validated on the nuScenes dataset, MagicDrive3D generates diverse, high-quality 3D driving scenes that support any-view rendering and enhance downstream tasks like BEV segmentation. Our results demonstrate the framework's superior performance, showcasing its transformative potential for autonomous driving simulation and beyond.
Authors: Sreyan Ghosh, Chandra Kiran Reddy Evuru, Sonal Kumar, Utkarsh Tyagi, Oriol Nieto, Zeyu Jin, Dinesh Manocha
Abstract: Large Vision-Language Models (LVLMs) often produce responses that misalign with factual information, a phenomenon known as hallucinations. While hallucinations are well-studied, the exact causes behind them remain underexplored. In this paper, we first investigate the root causes of hallucinations in LVLMs. Our findings reveal that existing mitigation techniques primarily reduce hallucinations for visual recognition prompts-those that require simple descriptions of visual elements-but fail for cognitive prompts that demand deliberate reasoning. We identify the core issue as a lack of true visual perception in LVLMs: although they can accurately recognize visual elements, they struggle to fully interpret these elements in the context of the input prompt and effectively link this recognition to their internal knowledge, which is critical for reasoning. To address this gap, we introduce Visual Description Grounded Decoding (VDGD), a simple, robust, and training-free method designed to enhance visual perception and improve reasoning capabilities in LVLMs. VDGD works by first generating a detailed description of the image and appending it as a prefix to the instruction. During response generation, tokens are sampled based on their KL divergence to the description, favoring candidates with lower divergence. Experimental results on multiple visual reasoning benchmarks and LVLMs demonstrate that VDGD consistently outperforms existing baselines 2% - 33%. Finally, we introduce VaLLu, a benchmark designed for comprehensive evaluation of the cognitive capabilities of LVLMs.
Authors: R\'obert Csord\'as, Kazuki Irie, J\"urgen Schmidhuber, Christopher Potts, Christopher D. Manning
Abstract: Previous work on Universal Transformers (UTs) has demonstrated the importance of parameter sharing across layers. By allowing recurrence in depth, UTs have advantages over standard Transformers in learning compositional generalizations, but layer-sharing comes with a practical limitation of parameter-compute ratio: it drastically reduces the parameter count compared to the non-shared model with the same dimensionality. Naively scaling up the layer size to compensate for the loss of parameters makes its computational resource requirements prohibitive. In practice, no previous work has succeeded in proposing a shared-layer Transformer design that is competitive in parameter count-dominated tasks such as language modeling. Here we propose MoEUT (pronounced "moot"), an effective mixture-of-experts (MoE)-based shared-layer Transformer architecture, which combines several recent advances in MoEs for both feedforward and attention layers of standard Transformers together with novel layer-normalization and grouping schemes that are specific and crucial to UTs. The resulting UT model, for the first time, slightly outperforms standard Transformers on language modeling tasks such as BLiMP and PIQA, while using significantly less compute and memory.
Authors: Yingyu Liang, Zhenmei Shi, Zhao Song, Yufa Zhou
Abstract: Tensor Attention, a multi-view attention that is able to capture high-order correlations among multiple modalities, can overcome the representational limitations of classical matrix attention. However, the $O(n^3)$ time complexity of tensor attention poses a significant obstacle to its utilization in transformers, where $n$ is the input sequence length. In this work, we prove that the backward gradient of tensor attention training can be computed in almost linear time $n^{1+o(1)}$, the same complexity as its forward computation under the bounded entries assumption. We provide a closed-form solution for the gradient and propose a fast computation method utilizing polynomial approximation methods and tensor algebraic techniques. Furthermore, we prove the necessity and tightness of our assumption through hardness analysis, showing that slightly weakening it renders the gradient problem unsolvable in truly subcubic time. Our theoretical results establish the feasibility of efficient higher-order transformer training and may facilitate practical applications of tensor attention architectures.
Authors: Yingyu Liang, Zhenmei Shi, Zhao Song, Yufa Zhou
Abstract: Diffusion models have made rapid progress in generating high-quality samples across various domains. However, a theoretical understanding of the Lipschitz continuity and second momentum properties of the diffusion process is still lacking. In this paper, we bridge this gap by providing a detailed examination of these smoothness properties for the case where the target data distribution is a mixture of Gaussians, which serves as a universal approximator for smooth densities such as image data. We prove that if the target distribution is a $k$-mixture of Gaussians, the density of the entire diffusion process will also be a $k$-mixture of Gaussians. We then derive tight upper bounds on the Lipschitz constant and second momentum that are independent of the number of mixture components $k$. Finally, we apply our analysis to various diffusion solvers, both SDE and ODE based, to establish concrete error guarantees in terms of the total variation distance and KL divergence between the target and learned distributions. Our results provide deeper theoretical insights into the dynamics of the diffusion process under common data distributions.
Authors: Max Liu, Chan-Hung Yu, Wei-Hsu Lee, Cheng-Wei Hung, Yen-Chun Chen, Shao-Hua Sun
Abstract: Programmatic reinforcement learning (PRL) has been explored for representing policies through programs as a means to achieve interpretability and generalization. Despite promising outcomes, current state-of-the-art PRL methods are hindered by sample inefficiency, necessitating tens of millions of program-environment interactions. To tackle this challenge, we introduce a novel LLM-guided search framework (LLM-GS). Our key insight is to leverage the programming expertise and common sense reasoning of LLMs to enhance the efficiency of assumption-free, random-guessing search methods. We address the challenge of LLMs' inability to generate precise and grammatically correct programs in domain-specific languages (DSLs) by proposing a Pythonic-DSL strategy - an LLM is instructed to initially generate Python codes and then convert them into DSL programs. To further optimize the LLM-generated programs, we develop a search algorithm named Scheduled Hill Climbing, designed to efficiently explore the programmatic search space to improve the programs consistently. Experimental results in the Karel domain demonstrate our LLM-GS framework's superior effectiveness and efficiency. Extensive ablation studies further verify the critical role of our Pythonic-DSL strategy and Scheduled Hill Climbing algorithm. Moreover, we conduct experiments with two novel tasks, showing that LLM-GS enables users without programming skills and knowledge of the domain or DSL to describe the tasks in natural language to obtain performant programs.
Authors: Dongbin Kim, Jinseong Park, Jaewook Lee, Hoki Kim
Abstract: Time series forecasting is crucial for applications across multiple domains and various scenarios. Although Transformer models have dramatically advanced the landscape of forecasting, their effectiveness remains debated. Recent findings have indicated that simpler linear models might outperform complex Transformer-based approaches, highlighting the potential for more streamlined architectures. In this paper, we shift the focus from evaluating the overall Transformer architecture to specifically examining the effectiveness of self-attention for time series forecasting. To this end, we introduce a new architecture, Cross-Attention-only Time Series transformer (CATS), that rethinks the traditional Transformer framework by eliminating self-attention and leveraging cross-attention mechanisms instead. By establishing future horizon-dependent parameters as queries and enhanced parameter sharing, our model not only improves long-term forecasting accuracy but also reduces the number of parameters and memory usage. Extensive experiment across various datasets demonstrates that our model achieves superior performance with the lowest mean squared error and uses fewer parameters compared to existing models. The implementation of our model is available at: https://github.com/dongbeank/CATS.
Authors: Kai Wang, Mingjia Shi, Yukun Zhou, Zekai Li, Zhihang Yuan, Yuzhang Shang, Xiaojiang Peng, Hanwang Zhang, Yang You
Abstract: Training diffusion models is always a computation-intensive task. In this paper, we introduce a novel speed-up method for diffusion model training, called, which is based on a closer look at time steps. Our key findings are: i) Time steps can be empirically divided into acceleration, deceleration, and convergence areas based on the process increment. ii) These time steps are imbalanced, with many concentrated in the convergence area. iii) The concentrated steps provide limited benefits for diffusion training. To address this, we design an asymmetric sampling strategy that reduces the frequency of steps from the convergence area while increasing the sampling probability for steps from other areas. Additionally, we propose a weighting strategy to emphasize the importance of time steps with rapid-change process increments. As a plug-and-play and architecture-agnostic approach, SpeeD consistently achieves 3-times acceleration across various diffusion architectures, datasets, and tasks. Notably, due to its simple design, our approach significantly reduces the cost of diffusion model training with minimal overhead. Our research enables more researchers to train diffusion models at a lower cost.
Authors: Ruofan Wang, Xingjun Ma, Hanxu Zhou, Chuanjun Ji, Guangnan Ye, Yu-Gang Jiang
Abstract: Recent advancements in Large Vision-Language Models (VLMs) have underscored their superiority in various multimodal tasks. However, the adversarial robustness of VLMs has not been fully explored. Existing methods mainly assess robustness through unimodal adversarial attacks that perturb images, while assuming inherent resilience against text-based attacks. Different from existing attacks, in this work we propose a more comprehensive strategy that jointly attacks both text and image modalities to exploit a broader spectrum of vulnerability within VLMs. Specifically, we propose a dual optimization objective aimed at guiding the model to generate affirmative responses with high toxicity. Our attack method begins by optimizing an adversarial image prefix from random noise to generate diverse harmful responses in the absence of text input, thus imbuing the image with toxic semantics. Subsequently, an adversarial text suffix is integrated and co-optimized with the adversarial image prefix to maximize the probability of eliciting affirmative responses to various harmful instructions. The discovered adversarial image prefix and text suffix are collectively denoted as a Universal Master Key (UMK). When integrated into various malicious queries, UMK can circumvent the alignment defenses of VLMs and lead to the generation of objectionable content, known as jailbreaks. The experimental results demonstrate that our universal attack strategy can effectively jailbreak MiniGPT-4 with a 96% success rate, highlighting the vulnerability of VLMs and the urgent need for new alignment strategies.
Authors: Xiaohao Xu, Ye Li, Tianyi Zhang, Jinrong Yang, Matthew Johnson-Roberson, Xiaonan Huang
Abstract: Constructing large-scale labeled datasets for multi-modal perception model training in autonomous driving presents significant challenges. This has motivated the development of self-supervised pretraining strategies. However, existing pretraining methods mainly employ distinct approaches for each modality. In contrast, we focus on LiDAR-Camera 3D perception models and introduce a unified pretraining strategy, NeRF-Supervised Masked Auto Encoder (NS-MAE), which optimizes all modalities through a shared formulation. NS-MAE leverages NeRF's ability to encode both appearance and geometry, enabling efficient masked reconstruction of multi-modal data. Specifically, embeddings are extracted from corrupted LiDAR point clouds and images, conditioned on view directions and locations. Then, these embeddings are rendered into multi-modal feature maps from two crucial viewpoints for 3D driving perception: perspective and bird's-eye views. The original uncorrupted data serve as reconstruction targets for self-supervised learning. Extensive experiments demonstrate the superior transferability of NS-MAE across various 3D perception tasks under different fine-tuning settings. Notably, NS-MAE outperforms prior SOTA pre-training methods that employ separate strategies for each modality in BEV map segmentation under the label-efficient fine-tuning setting. Our code is publicly available at https://github.com/Xiaohao-Xu/Unified-Pretrain-AD/ .
Authors: Pengxiang Li, Lu Yin, Xiaowei Gao, Shiwei Liu
Abstract: The rapid advancements in Large Language Models (LLMs) have revolutionized various natural language processing tasks. However, the substantial size of LLMs presents significant challenges in training or fine-tuning. While parameter-efficient approaches such as low-rank adaptation (LoRA) have gained popularity, they often compromise performance compared to full-rank fine-tuning. In this paper, we propose Outlier-weighed Layerwise Sampled Low-Rank Projection (OwLore), a new memory-efficient fine-tuning approach, inspired by the layerwise outlier distribution of LLMs. Unlike LoRA, which adds extra adapters to all layers, OwLore strategically assigns higher sampling probabilities to layers with more outliers, selectively sampling only a few layers and fine-tuning their pre-trained weights. To further increase the number of fine-tuned layers without a proportional rise in memory costs, we incorporate gradient low-rank projection, further boosting the approach's performance. Our extensive experiments across various architectures, including LLaMa2, LLaMa3, and Mistral, demonstrate that OwLore consistently outperforms baseline approaches, including full fine-tuning. Specifically, it achieves up to a 1.1% average accuracy gain on the Commonsense Reasoning benchmark, a 3.0% improvement on MMLU, and a notable 10% boost on MT-Bench, while being more memory efficient. OwLore allows us to fine-tune LLaMa2-7B with only 21GB of memory. Code is available at https://github.com/pixeli99/OwLore.
Authors: Marco S\"alzer, Eric Alsmann, Martin Lange
Abstract: We analyse the complexity of the satisfiability problem (SAT) for transformer encoders (TE), naturally occurring in formal verification or interpretation tasks. We find that SAT is undecidable when considering TE as they are commonly studied in the expressiveness community. Furthermore, we identify practical scenarios where SAT is decidable and establish corresponding complexity bounds. Beyond trivial cases, we find that quantized TE -- those restricted by fixed -- width arithmetic-lead to the decidability of SAT due to their limited attention capabilities. However, the problem remains difficult, as we establish scenarios where SAT is NEXPTIME-hard and others where it is solvable in NEXPTIME for quantized TE. To complement our complexity results, we place our findings and their implications in the broader context of formal reasoning.
Authors: Alain Ryser, Thomas M. Sutter, Alexander Marx, Julia E. Vogt
Abstract: Anomaly detection focuses on identifying samples that deviate from the norm. When working with high-dimensional data such as images, a crucial requirement for detecting anomalous patterns is learning lower-dimensional representations that capture concepts of normality. Recent advances in self-supervised learning have shown great promise in this regard. However, many successful self-supervised anomaly detection methods assume prior knowledge about anomalies to create synthetic outliers during training. Yet, in real-world applications, we often do not know what to expect from unseen data, and we can solely leverage knowledge about normal data. In this work, we propose Con$_2$, which learns representations through context augmentations that allow us to observe samples from two distinct perspectives while keeping the invariances of normal data. Con$_2$ learns rich representations of context-augmented samples by clustering them according to their context while simultaneously aligning their positions across clusters. At test time, representations of anomalies that do not adhere to the invariances of normal data then deviate from their respective context cluster. Learning representations in such a way thus allows us to detect anomalies without making assumptions about anomalous data.
Authors: Feiyu Zhu, Yuming Zhang, Changpeng Cai, Chenghao He, Xiuyuan Guo, Jiao Li, Peizhe Wang, Junhao Su, Jialin Gao
Abstract: Local learning offers an alternative to traditional end-to-end back-propagation in deep neural networks, significantly reducing GPU memory usage. While local learning has shown promise in image classification tasks, its application to other visual tasks remains limited. This limitation arises primarily from two factors: 1) architectures tailored for classification are often not transferable to other tasks, leading to a lack of reusability of task-specific knowledge; 2) the absence of cross-scale feature communication results in degraded performance in tasks such as object detection and super-resolution. To address these challenges, we propose the Memory-augmented Auxiliary Network (MAN), which introduces a simplified design principle and incorporates a feature bank to enhance cross-task adaptability and communication. This work represents the first successful application of local learning methods beyond classification, demonstrating that MAN not only conserves GPU memory but also achieves performance on par with end-to-end approaches across multiple datasets for various visual tasks.
Authors: Ziqian Zeng, Jianwei Wang, Junyao Yang, Zhengdong Lu, Huiping Zhuang, Cen Chen
Abstract: The widespread usage of online Large Language Models (LLMs) inference services has raised significant privacy concerns about the potential exposure of private information in user inputs to malicious eavesdroppers. Existing privacy protection methods for LLMs suffer from either insufficient privacy protection, performance degradation, or large inference time overhead. To address these limitations, we propose PrivacyRestore, a plug-and-play method to protect the privacy of user inputs during LLM inference. The server first trains restoration vectors for each privacy span and then release to clients. Privacy span is defined as a contiguous sequence of tokens within a text that contain private information. The client then aggregate restoration vectors of all privacy spans in the input into a single meta restoration vector which is later sent to the server side along with the input without privacy spans.The private information is restored via activation steering during inference. Furthermore, we prove that PrivacyRestore inherently prevents the linear growth of the privacy budget.We create three datasets, covering medical and legal domains, to evaluate the effectiveness of privacy preserving methods. The experimental results show that PrivacyRestore effectively protects private information and maintain acceptable levels of performance and inference overhead.
Authors: Seyd Teymoor Seydi
Abstract: This paper presents a comprehensive survey of 18 distinct polynomials and their potential applications in Kolmogorov-Arnold Network (KAN) models as an alternative to traditional spline-based methods. The polynomials are classified into various groups based on their mathematical properties, such as orthogonal polynomials, hypergeometric polynomials, q-polynomials, Fibonacci-related polynomials, combinatorial polynomials, and number-theoretic polynomials. The study aims to investigate the suitability of these polynomials as basis functions in KAN models for complex tasks like handwritten digit classification on the MNIST dataset. The performance metrics of the KAN models, including overall accuracy, Kappa, and F1 score, are evaluated and compared. The Gottlieb-KAN model achieves the highest performance across all metrics, suggesting its potential as a suitable choice for the given task. However, further analysis and tuning of these polynomials on more complex datasets are necessary to fully understand their capabilities in KAN models. The source code for the implementation of these KAN models is available at https://github.com/seydi1370/Basis_Functions .
Authors: Min Cai, Yuchen Zhang, Shichang Zhang, Fan Yin, Dan Zhang, Difan Zou, Yisong Yue, Ziniu Hu
Abstract: We propose SelfControl, an inference-time model control method utilizing gradients to control the behavior of large language models (LLMs) without explicit human annotations. Given a desired behavior expressed in a natural language suffix string concatenated to the input prompt, SelfControl computes gradients of the LLM's self-evaluation of the suffix with respect to its latent representations. The gradients are used to directly control the auto-regressive generation process towards desired behaviors, which eliminates human supervision, achieves precise and transparent control, and offers on-the-fly adaptability. To further enhance efficiency, we introduce SelfControl_{Prefix}, a compact module that encapsulates the learned representations from gradients into a SelfControl_{Prefix}, facilitating efficient inference-time control with no latency compared to the original model and allowing control for multiple behaviors simultaneously. Our experiments demonstrate SelfControl's efficacy across multiple domains, where it improves over SOTA for 8.3% in detoxification, 3.1% in truthfulness enhancement, 4%~10% in controlling on emotion tones, and 48.2% in privacy protection, i.e., completely remove privacy leakage issue. Additionally, we demonstrate that SelfControl can be used for data synthesis and to improve reasoning abilities.
Authors: Xinhao Yao, Xiaolin Hu, Shenzhi Yang, Yong Liu
Abstract: Pre-trained large language models (LLMs) based on Transformer have demonstrated striking in-context learning (ICL) abilities. With a few demonstration input-label pairs, they can predict the label for an unseen input without any parameter updates. In this paper, we show an exciting phenomenon that SVD-based weight pruning can enhance ICL performance, and more surprising, pruning weights in deep layers often results in more stable performance improvements than in shallow layers. However, the underlying mechanism of those findings still remains an open question. To reveal those findings, we conduct an in-depth theoretical analysis by presenting the implicit gradient descent (GD) trajectories of ICL and giving the mutual information based generalization bounds of ICL via full implicit GD trajectories. This helps us reasonably explain the surprising experimental findings. Besides, based on all our experimental and theoretical insights, we intuitively propose a simple, model-compression and derivative-free algorithm for downstream tasks in enhancing ICL inference. Experiments on benchmark datasets and open source LLMs display the method effectiveness\footnote{The code is available at \url{https://github.com/chen123CtrlS/EnhancingICL_SVDPruning}.}.
URLs: https://github.com/chen123CtrlS/EnhancingICL_SVDPruning
Authors: Daniel Kunin, Allan Ravent\'os, Cl\'ementine Domin\'e, Feng Chen, David Klindt, Andrew Saxe, Surya Ganguli
Abstract: While the impressive performance of modern neural networks is often attributed to their capacity to efficiently extract task-relevant features from data, the mechanisms underlying this rich feature learning regime remain elusive, with much of our theoretical understanding stemming from the opposing lazy regime. In this work, we derive exact solutions to a minimal model that transitions between lazy and rich learning, precisely elucidating how unbalanced layer-specific initialization variances and learning rates determine the degree of feature learning. Our analysis reveals that they conspire to influence the learning regime through a set of conserved quantities that constrain and modify the geometry of learning trajectories in parameter and function space. We extend our analysis to more complex linear models with multiple neurons, outputs, and layers and to shallow nonlinear networks with piecewise linear activation functions. In linear networks, rapid feature learning only occurs from balanced initializations, where all layers learn at similar speeds. While in nonlinear networks, unbalanced initializations that promote faster learning in earlier layers can accelerate rich learning. Through a series of experiments, we provide evidence that this unbalanced rich regime drives feature learning in deep finite-width networks, promotes interpretability of early layers in CNNs, reduces the sample complexity of learning hierarchical data, and decreases the time to grokking in modular arithmetic. Our theory motivates further exploration of unbalanced initializations to enhance efficient feature learning.
Authors: Jinjie Ni, Fuzhao Xue, Xiang Yue, Yuntian Deng, Mahir Shah, Kabir Jain, Graham Neubig, Yang You
Abstract: Evaluating large language models (LLMs) is challenging. Traditional ground-truth-based benchmarks fail to capture the comprehensiveness and nuance of real-world queries, while LLM-as-judge benchmarks suffer from grading biases and limited query quantity. Both of them may also become contaminated over time. User-facing evaluation, such as Chatbot Arena, provides reliable signals but is costly and slow. In this work, we propose MixEval, a new paradigm for establishing efficient, gold-standard LLM evaluation by strategically mixing off-the-shelf benchmarks. It bridges (1) comprehensive and well-distributed real-world user queries and (2) efficient and fairly-graded ground-truth-based benchmarks, by matching queries mined from the web with similar queries from existing benchmarks. Based on MixEval, we further build MixEval-Hard, which offers more room for model improvement. Our benchmarks' advantages lie in (1) a 0.96 model ranking correlation with Chatbot Arena arising from the highly impartial query distribution and grading mechanism, (2) fast, cheap, and reproducible execution (6% of the time and cost of MMLU), and (3) dynamic evaluation enabled by the rapid and stable data update pipeline. We provide extensive meta-evaluation and analysis for our and existing LLM benchmarks to deepen the community's understanding of LLM evaluation and guide future research directions.
Authors: Dylan Zhang, Shizhe Diao, Xueyan Zou, Hao Peng
Abstract: Preference learning provides a promising solution to address the limitations of supervised fine-tuning (SFT) for code language models, where the model is not explicitly trained to differentiate between correct and incorrect code. Recent findings demonstrate that on-policy data is the key to successful preference learning, where the preference data is collected using the same policy LM being trained. Inspired by this, we propose PLUM, an on-policy $\textbf{P}$reference $\textbf{L}$earning framework A$\textbf{u}$gmented with test cases for code L$\textbf{M}$ s. The framework operates in three key stages: (1) automatic generation of test cases from natural language instructions, (2) creation of a preference data by evaluating candidate code solutions sampled from the policy, which can then be used to (3) train the policy LM. PLUM levitates the need to train reward models, allowing for large scale on-policy and online preference data collation. PLUM is evaluated on both standard benchmarks (HumanEval, MBPP) and more challenging ones (LiveCodeBench), delivering substantial improvements over original SFT'ed models and other execution-feedback-driven approaches. We show PLUM's benefits are consistent across various widely-used code LMs even they have been well-trained with SFT. For example, PLUM increases pass rates by up to 4.8% on average on standard benchmarks and 11.8% on LiveCodeBench, demonstrating its effectiveness and generalizability. We also demonstrate the benefits of on-policy and online preference learning by comprehensive experimentation.
Authors: Xinyu Zhang, Yuhan Liu, Haonan Chang, Abdeslam Boularias
Abstract: Learning general-purpose models from diverse datasets has achieved great success in machine learning. In robotics, however, existing methods in multi-task learning are typically constrained to a single robot and workspace, while recent work such as RT-X requires a non-trivial action normalization procedure to manually bridge the gap between different action spaces in diverse environments. In this paper, we propose the visual kinematics chain as a precise and universal representation of quasi-static actions for robot learning over diverse environments, which requires no manual adjustment since the visual kinematic chains can be automatically obtained from the robot's model and camera parameters. We propose the Visual Kinematics Transformer (VKT), a convolution-free architecture that supports an arbitrary number of camera viewpoints, and that is trained with a single objective of forecasting kinematic structures through optimal point-set matching. We demonstrate the superior performance of VKT over BC transformers as a general agent on Calvin, RLBench, Open-X, and real robot manipulation tasks. Video demonstrations can be found at https://mlzxy.github.io/visual-kinetic-chain.
Authors: Yida Chen, Aoyu Wu, Trevor DePodesta, Catherine Yeh, Kenneth Li, Nicholas Castillo Marin, Oam Patel, Jan Riecke, Shivam Raval, Olivia Seow, Martin Wattenberg, Fernanda Vi\'egas
Abstract: Conversational LLMs function as black box systems, leaving users guessing about why they see the output they do. This lack of transparency is potentially problematic, especially given concerns around bias and truthfulness. To address this issue, we present an end-to-end prototype-connecting interpretability techniques with user experience design-that seeks to make chatbots more transparent. We begin by showing evidence that a prominent open-source LLM has a "user model": examining the internal state of the system, we can extract data related to a user's age, gender, educational level, and socioeconomic status. Next, we describe the design of a dashboard that accompanies the chatbot interface, displaying this user model in real time. The dashboard can also be used to control the user model and the system's behavior. Finally, we discuss a study in which users conversed with the instrumented system. Our results suggest that users appreciate seeing internal states, which helped them expose biased behavior and increased their sense of control. Participants also made valuable suggestions that point to future directions for both design and machine learning research. The project page and video demo of our TalkTuner system are available at https://bit.ly/talktuner-project-page
Authors: Thibaut Issenhuth, Sangchul Lee, Ludovic Dos Santos, Jean-Yves Franceschi, Chansoo Kim, Alain Rakotomamonjy
Abstract: Consistency models imitate the multi-step sampling of score-based diffusion in a single forward pass of a neural network. They can be learned in two ways: consistency distillation and consistency training. The former relies on the true velocity field of the corresponding differential equation, approximated by a pre-trained neural network. In contrast, the latter uses a single-sample Monte Carlo estimate of this velocity field. The related estimation error induces a discrepancy between consistency distillation and training that, we show, still holds in the continuous-time limit. To alleviate this issue, we propose a novel flow that transports noisy data towards their corresponding outputs derived from the currently trained model --~as a proxy of the true flow. Our empirical findings demonstrate that this approach mitigates the previously identified discrepancy. Furthermore, we present theoretical and empirical evidence indicating that our generator-induced flow surpasses dedicated optimal transport-based consistency models in effectively reducing the noise-data transport cost. Consequently, our method not only accelerates consistency training convergence but also enhances its overall performance. The code is available at: https://github.com/thibautissenhuth/consistency_GC.
Authors: Mingyu Derek Ma, Chenchen Ye, Yu Yan, Xiaoxuan Wang, Peipei Ping, Timothy S Chang, Wei Wang
Abstract: The integration of Artificial Intelligence (AI), especially Large Language Models (LLMs), into the clinical diagnosis process offers significant potential to improve the efficiency and accessibility of medical care. While LLMs have shown some promise in the medical domain, their application in clinical diagnosis remains underexplored, especially in real-world clinical practice, where highly sophisticated, patient-specific decisions need to be made. Current evaluations of LLMs in this field are often narrow in scope, focusing on specific diseases or specialties and employing simplified diagnostic tasks. To bridge this gap, we introduce CliBench, a novel benchmark developed from the MIMIC IV dataset, offering a comprehensive and realistic assessment of LLMs' capabilities in clinical diagnosis. This benchmark not only covers diagnoses from a diverse range of medical cases across various specialties but also incorporates tasks of clinical significance: treatment procedure identification, lab test ordering and medication prescriptions. Supported by structured output ontologies, CliBench enables a precise and multi-granular evaluation, offering an in-depth understanding of LLM's capability on diverse clinical tasks of desired granularity. We conduct a zero-shot evaluation of leading LLMs to assess their proficiency in clinical decision-making. Our preliminary results shed light on the potential and limitations of current LLMs in clinical settings, providing valuable insights for future advancements in LLM-powered healthcare.
Authors: Yiye Zou, Tianyu Li, Lin Lu, Jingyu Wang, Shufan Zou, Laiping Zhang, Xiaogang Deng
Abstract: Advances in deep learning have enabled physics-informed neural networks to solve partial differential equations. Numerical differentiation using the finite-difference (FD) method is efficient in physics-constrained designs, even in parameterized settings. In traditional computational fluid dynamics(CFD), body-fitted block-structured grids are often employed for complex flow cases when obtaining FD solutions. However, convolution operators in convolutional neural networks for FD are typically limited to single-block grids. To address this issue, \blueText{graphs and graph networks are used} to learn flow representations across multi-block-structured grids. \blueText{A graph convolution-based FD method (GC-FDM) is proposed} to train graph networks in a label-free physics-constrained manner, enabling differentiable FD operations on unstructured graph outputs. To demonstrate model performance from single- to multi-block-structured grids, \blueText{the parameterized steady incompressible Navier-Stokes equations are solved} for a lid-driven cavity flow and the flows around single and double circular cylinder configurations. When compared to a CFD solver under various boundary conditions, the proposed method achieves a relative error in velocity field predictions on the order of $10^{-3}$. Furthermore, the proposed method reduces training costs by approximately 20\% compared to a physics-informed neural network. \blueText{To} further verify the effectiveness of GC-FDM in multi-block processing, \blueText{a 30P30N airfoil geometry is considered} and the \blueText{predicted} results are reasonable compared with those given by CFD. \blueText{Finally, the applicability of GC-FDM to three-dimensional (3D) case is tested using a 3D cavity geometry.
Authors: Alonso Silva
Abstract: Generative artificial intelligence (Generative AI), and in particular Large Language Models (LLMs) have gained significant popularity among researchers and industrial communities, paving the way for integrating LLMs in different domains, such as robotics, telecom, and healthcare. In this paper, we study the intersection of game theory and generative artificial intelligence, focusing on the capabilities of LLMs to find the Nash equilibrium in games with a mixed strategy Nash equilibrium and no pure strategy Nash equilibrium (that we denote mixed strategy Nash equilibrium games). The study reveals a significant enhancement in the performance of LLMs when they are equipped with the possibility to run code and are provided with a specific prompt to incentivize them to do so. However, our research also highlights the limitations of LLMs when the randomization strategy of the game is not easy to deduce. It is evident that while LLMs exhibit remarkable proficiency in well-known standard games, their performance dwindles when faced with slight modifications of the same games. This paper aims to contribute to the growing body of knowledge on the intersection of game theory and generative artificial intelligence while providing valuable insights into LLMs strengths and weaknesses. It also underscores the need for further research to overcome the limitations of LLMs, particularly in dealing with even slightly more complex scenarios, to harness their full potential.
Authors: T. Y. S. S Santosh, Kevin D. Ashley, Katie Atkinson, Matthias Grabmair
Abstract: Modeling legal reasoning and argumentation justifying decisions in cases has always been central to AI & Law, yet contemporary developments in legal NLP have increasingly focused on statistically classifying legal conclusions from text. While conceptually simpler, these approaches often fall short in providing usable justifications connecting to appropriate legal concepts. This paper reviews both traditional symbolic works in AI & Law and recent advances in legal NLP, and distills possibilities of integrating expert-informed knowledge to strike a balance between scalability and explanation in symbolic vs. data-driven approaches. We identify open challenges and discuss the potential of modern NLP models and methods that integrate
Authors: Amit Das, Zheng Zhang, Fatemeh Jamshidi, Vinija Jain, Aman Chadha, Nilanjana Raychawdhary, Mary Sandage, Lauramarie Pope, Gerry Dozier, Cheryl Seals
Abstract: Data annotation, the practice of assigning descriptive labels to raw data, is pivotal in optimizing the performance of machine learning models. However, it is a resource-intensive process susceptible to biases introduced by annotators. The emergence of sophisticated Large Language Models (LLMs), like ChatGPT presents a unique opportunity to modernize and streamline this complex procedure. While existing research extensively evaluates the efficacy of LLMs, as annotators, this paper delves into the biases present in LLMs, specifically GPT 3.5 and GPT 4o when annotating hate speech data. Our research contributes to understanding biases in four key categories: gender, race, religion, and disability. Specifically targeting highly vulnerable groups within these categories, we analyze annotator biases. Furthermore, we conduct a comprehensive examination of potential factors contributing to these biases by scrutinizing the annotated data. We introduce our custom hate speech detection dataset, HateSpeechCorpus, to conduct this research. Additionally, we perform the same experiments on the ETHOS (Mollas et al., 2022) dataset also for comparative analysis. This paper serves as a crucial resource, guiding researchers and practitioners in harnessing the potential of LLMs for dataannotation, thereby fostering advancements in this critical field. The HateSpeechCorpus dataset is available here: https://github.com/AmitDasRup123/HateSpeechCorpus
Authors: Han Zhou, Xingchen Wan, Yinhong Liu, Nigel Collier, Ivan Vuli\'c, Anna Korhonen
Abstract: Large language models (LLMs) have shown promising abilities as cost-effective and reference-free evaluators for assessing language generation quality. In particular, pairwise LLM evaluators, which compare two generated texts and determine the preferred one, have been employed in a wide range of applications. However, LLMs exhibit preference biases and worrying sensitivity to prompt designs. In this work, we first reveal that the predictive preference of LLMs can be highly brittle and skewed, even with semantically equivalent instructions. We find that fairer predictive preferences from LLMs consistently lead to judgments that are better aligned with humans. Motivated by this phenomenon, we propose an automatic Zero-shot Evaluation-oriented Prompt Optimization framework, ZEPO, which aims to produce fairer preference decisions and improve the alignment of LLM evaluators with human judgments. To this end, we propose a zero-shot learning objective based on the preference decision fairness. ZEPO demonstrates substantial performance improvements over state-of-the-art LLM evaluators, without requiring labeled data, on representative meta-evaluation benchmarks. Our findings underscore the critical correlation between preference fairness and human alignment, positioning ZEPO as an efficient prompt optimizer for bridging the gap between LLM evaluators and human judgments.
Authors: Edwin Zhang, Vincent Zhu, Naomi Saphra, Anat Kleiman, Benjamin L. Edelman, Milind Tambe, Sham M. Kakade, Eran Malach
Abstract: Generative models are trained with the simple objective of imitating the conditional probability distribution induced by the data they are trained on. Therefore, when trained on data generated by humans, we may not expect the artificial model to outperform the humans on their original objectives. In this work, we study the phenomenon of transcendence: when a generative model achieves capabilities that surpass the abilities of the experts generating its data. We demonstrate transcendence by training an autoregressive transformer to play chess from game transcripts, and show that the trained model can sometimes achieve better performance than all players in the dataset. We theoretically prove that transcendence can be enabled by low-temperature sampling, and rigorously assess this claim experimentally. Finally, we discuss other sources of transcendence, laying the groundwork for future investigation of this phenomenon in a broader setting.
Authors: Yaoke Wang, Yun Zhu, Wenqiao Zhang, Yueting Zhuang, Yunfei Li, Siliang Tang
Abstract: Representation learning on text-attributed graphs (TAGs) is vital for real-world applications, as they combine semantic textual and contextual structural information. Research in this field generally consist of two main perspectives: local-level encoding and global-level aggregating, respectively refer to textual node information unification (e.g., using Language Models) and structure-augmented modeling (e.g., using Graph Neural Networks). Most existing works focus on combining different information levels but overlook the interconnections, i.e., the contextual textual information among nodes, which provides semantic insights to bridge local and global levels. In this paper, we propose GraphBridge, a multi-granularity integration framework that bridges local and global perspectives by leveraging contextual textual information, enhancing fine-grained understanding of TAGs. Besides, to tackle scalability and efficiency challenges, we introduce a graphaware token reduction module. Extensive experiments across various models and datasets show that our method achieves state-of-theart performance, while our graph-aware token reduction module significantly enhances efficiency and solves scalability issues.
Authors: Xuan He, Dongfu Jiang, Ge Zhang, Max Ku, Achint Soni, Sherman Siu, Haonan Chen, Abhranil Chandra, Ziyan Jiang, Aaran Arulraj, Kai Wang, Quy Duc Do, Yuansheng Ni, Bohan Lyu, Yaswanth Narsupalli, Rongqi Fan, Zhiheng Lyu, Yuchen Lin, Wenhu Chen
Abstract: The recent years have witnessed great advances in video generation. However, the development of automatic video metrics is lagging significantly behind. None of the existing metric is able to provide reliable scores over generated videos. The main barrier is the lack of large-scale human-annotated dataset. In this paper, we release VideoFeedback, the first large-scale dataset containing human-provided multi-aspect score over 37.6K synthesized videos from 11 existing video generative models. We train VideoScore (initialized from Mantis) based on VideoFeedback to enable automatic video quality assessment. Experiments show that the Spearman correlation between VideoScore and humans can reach 77.1 on VideoFeedback-test, beating the prior best metrics by about 50 points. Further result on other held-out EvalCrafter, GenAI-Bench, and VBench show that VideoScore has consistently much higher correlation with human judges than other metrics. Due to these results, we believe VideoScore can serve as a great proxy for human raters to (1) rate different video models to track progress (2) simulate fine-grained human feedback in Reinforcement Learning with Human Feedback (RLHF) to improve current video generation models.
Authors: Zhenyi Lu, Chenghao Fan, Wei Wei, Xiaoye Qu, Dangyang Chen, Yu Cheng
Abstract: In the era of large language models, model merging is a promising way to combine multiple task-specific models into a single multitask model without extra training. However, two challenges remain: (a) interference between different models and (b) heterogeneous data during testing. Traditional model merging methods often show significant performance gaps compared to fine-tuned models due to these issues. Additionally, a one-size-fits-all model lacks flexibility for diverse test data, leading to performance degradation. We show that both shared and exclusive task-specific knowledge are crucial for merging performance, but directly merging exclusive knowledge hinders overall performance. In view of this, we propose Twin-Merging, a method that encompasses two principal stages: (1) modularizing knowledge into shared and exclusive components, with compression to reduce redundancy and enhance efficiency; (2) dynamically merging shared and task-specific knowledge based on the input. This approach narrows the performance gap between merged and fine-tuned models and improves adaptability to heterogeneous data. Extensive experiments on $20$ datasets for both language and vision tasks demonstrate the effectiveness of our method, showing an average improvement of $28.34\%$ in absolute normalized score for discriminative tasks and even surpassing the fine-tuned upper bound on the generative tasks. Our implementation is available in \url{https://github.com/LZY-the-boys/Twin-Merging}
Authors: Chenghao Fan, Zhenyi Lu, Wei Wei, Jie Tian, Xiaoye Qu, Dangyang Chen, Yu Cheng
Abstract: Efficient fine-tuning of large language models for task-specific applications is imperative, yet the vast number of parameters in these models makes their training increasingly challenging. Despite numerous proposals for effective methods, a substantial memory overhead remains for gradient computations during updates. \thm{Can we fine-tune a series of task-specific small models and transfer their knowledge directly to a much larger model without additional training?} In this paper, we explore weak-to-strong specialization using logit arithmetic, facilitating a direct answer to this question. Existing weak-to-strong methods often employ a static knowledge transfer ratio and a single small model for transferring complex knowledge, which leads to suboptimal performance. % To address this, To surmount these limitations, we propose a dynamic logit fusion approach that works with a series of task-specific small models, each specialized in a different task. This method adaptively allocates weights among these models at each decoding step, learning the weights through Kullback-Leibler divergence constrained optimization problems. We conduct extensive experiments across various benchmarks in both single-task and multi-task settings, achieving leading results. By transferring expertise from the 7B model to the 13B model, our method closes the performance gap by 96.4\% in single-task scenarios and by 86.3\% in multi-task scenarios compared to full fine-tuning of the 13B model. Notably, we achieve surpassing performance on unseen tasks. Moreover, we further demonstrate that our method can effortlessly integrate in-context learning for single tasks and task arithmetic for multi-task scenarios.
Authors: Wanyu Bian, Panfeng Li, Mengyao Zheng, Chihang Wang, Anying Li, Ying Li, Haowei Ni, Zixuan Zeng
Abstract: This paper analyzes conventional and deep learning methods for eliminating electromagnetic interference (EMI) in MRI systems. We compare traditional analytical and adaptive techniques with advanced deep learning approaches. Key strengths and limitations of each method are highlighted. Recent advancements in active EMI elimination, such as external EMI receiver coils, are discussed alongside deep learning methods, which show superior EMI suppression by leveraging neural networks trained on MRI data. While deep learning improves EMI elimination and diagnostic capabilities, it introduces security and safety concerns, particularly in commercial applications. A balanced approach, integrating conventional reliability with deep learning's advanced capabilities, is proposed for more effective EMI suppression in MRI systems.
Authors: Zheyang Xiong, Vasilis Papageorgiou, Kangwook Lee, Dimitris Papailiopoulos
Abstract: Recent studies have shown that Large Language Models (LLMs) struggle to accurately retrieve information and maintain reasoning capabilities when processing long-context inputs. To address these limitations, we propose a finetuning approach utilizing a carefully designed synthetic dataset comprising numerical key-value retrieval tasks. Our experiments on models like GPT-3.5 Turbo and Mistral 7B demonstrate that finetuning LLMs on this dataset significantly improves LLMs' information retrieval and reasoning capabilities in longer-context settings. We present an analysis of the finetuned models, illustrating the transfer of skills from synthetic to real task evaluations (e.g., $10.5\%$ improvement on $20$ documents MDQA at position $10$ for GPT-3.5 Turbo). We also find that finetuned LLMs' performance on general benchmarks remains almost constant while LLMs finetuned on other baseline long-context augmentation data can encourage hallucination (e.g., on TriviaQA, Mistral 7B finetuned on our synthetic data cause no performance drop while other baseline data can cause a drop that ranges from $2.33\%$ to $6.19\%$). Our study highlights the potential of finetuning on synthetic data for improving the performance of LLMs on longer-context tasks.
Authors: Zhe Hu, Hou Pong Chan, Jing Li, Yu Yin
Abstract: Writing persuasive arguments is a challenging task for both humans and machines. It entails incorporating high-level beliefs from various perspectives on the topic, along with deliberate reasoning and planning to construct a coherent narrative. Current language models often generate surface tokens autoregressively, lacking explicit integration of these underlying controls, resulting in limited output diversity and coherence. In this work, we propose a persona-based multi-agent framework for argument writing. Inspired by the human debate, we first assign each agent a persona representing its high-level beliefs from a unique perspective, and then design an agent interaction process so that the agents can collaboratively debate and discuss the idea to form an overall plan for argument writing. Such debate process enables fluid and nonlinear development of ideas. We evaluate our framework on argumentative essay writing. The results show that our framework can generate more diverse and persuasive arguments through both automatic and human evaluations.
Authors: Justin Xu, Jack Gallifant, Alistair E. W. Johnson, Matthew B. A. McDermott
Abstract: Reproducibility remains a significant challenge in machine learning (ML) for healthcare. Datasets, model pipelines, and even task/cohort definitions are often private in this field, leading to a significant barrier in sharing, iterating, and understanding ML results on electronic health record (EHR) datasets. This paper addresses a significant part of this problem by introducing the Automatic Cohort Extraction System (ACES) for event-stream data. This library is designed to simultaneously simplify the development of task/cohorts for ML in healthcare and also enable the reproduction of these cohorts, both at an exact level for single datasets and at a conceptual level across datasets. To accomplish this, ACES provides (1) a highly intuitive and expressive configuration language for defining both dataset-specific concepts and dataset-agnostic inclusion/exclusion criteria, and (2) a pipeline to automatically extract patient records that meet these defined criteria from real-world data. ACES can be automatically applied to any dataset in either the Medical Event Data Standard (MEDS) or EventStreamGPT (ESGPT) formats, or to any dataset in which the necessary task-specific predicates can be extracted in an event-stream form. ACES has the potential to significantly lower the barrier to entry for defining ML tasks that learn representations, redefine the way researchers interact with EHR datasets, and significantly improve the state of reproducibility for ML studies in this modality. ACES is available at https://github.com/justin13601/aces. A short video demonstration of ACES is available at https://youtu.be/i_hCaHDydqA.
URLs: https://github.com/justin13601/aces., https://youtu.be/i_hCaHDydqA.
Authors: Joan Nwatu, Oana Ignat, Rada Mihalcea
Abstract: Recent work has demonstrated that the unequal representation of cultures and socioeconomic groups in training data leads to biased Large Multi-modal (LMM) models. To improve LMM model performance on underrepresented data, we propose and evaluate several prompting strategies using non-English, geographic, and socioeconomic attributes. We show that these geographic and socioeconomic integrated prompts favor retrieving topic appearances commonly found in data from low-income households across different countries leading to improved LMM model performance on lower-income data. Our analyses identify and highlight contexts where these strategies yield the most improvements.
Authors: Yinquan Lu, Wenhao Zhu, Lei Li, Yu Qiao, Fei Yuan
Abstract: Large Language Models (LLMs) demonstrate remarkable translation capabilities in high-resource language tasks, yet their performance in low-resource languages is hindered by insufficient multilingual data during pre-training. To address this, we conduct extensive multilingual continual pre-training on the LLaMA series models, enabling translation support across more than 100 languages. Through a comprehensive analysis of training strategies, such as vocabulary expansion and data augmentation, we develop LLaMAX. Remarkably, without sacrificing its generalization ability, LLaMAX achieves significantly higher translation performance compared to existing open-source LLMs (by more than 10 spBLEU points) and performs on-par with specialized translation model (M2M-100-12B) on the Flores-101 benchmark. Extensive experiments indicate that LLaMAX can serve as a robust multilingual foundation model. The code \footnote{\url{https://github.com/CONE-MT/LLaMAX/.}} and the models \footnote{\url{https://huggingface.co/LLaMAX/.}} are publicly available.
URLs: https://github.com/CONE-MT/LLaMAX/., https://huggingface.co/LLaMAX/.
Authors: Murdock Aubry, Haoming Meng, Anton Sugolov, Vardan Papyan
Abstract: Large Language Models (LLMs) have made significant strides in natural language processing, and a precise understanding of the internal mechanisms driving their success is essential. In this work, we trace the trajectories of individual tokens as they pass through transformer blocks, and linearize the system along these trajectories through their Jacobian matrices. By examining the relationships between these Jacobians, we uncover a $\textbf{transformer block coupling}$ phenomenon in a variety of LLMs, characterized by the coupling of their top singular vectors across tokens and depth. Our findings reveal that coupling $\textit{positively correlates}$ with model performance, and that this relationship is stronger than with other hyperparameters, namely parameter budget, model depth, and embedding dimension. We further investigate the emergence of these properties through training, noting the development of coupling, as well as an increase in linearity and layer-wise exponential growth in the token trajectories. These collective insights provide a novel perspective on the interactions between token embeddings, and prompt further approaches to study training and generalization in LLMs.
Authors: Shravan Nayak, Kanishk Jain, Rabiul Awal, Siva Reddy, Sjoerd van Steenkiste, Lisa Anne Hendricks, Karolina Sta\'nczak, Aishwarya Agrawal
Abstract: Foundation models and vision-language pre-training have notably advanced Vision Language Models (VLMs), enabling multimodal processing of visual and linguistic data. However, their performance has been typically assessed on general scene understanding - recognizing objects, attributes, and actions - rather than cultural comprehension. This study introduces CulturalVQA, a visual question-answering benchmark aimed at assessing VLM's geo-diverse cultural understanding. We curate a collection of 2,378 image-question pairs with 1-5 answers per question representing cultures from 11 countries across 5 continents. The questions probe understanding of various facets of culture such as clothing, food, drinks, rituals, and traditions. Benchmarking VLMs on CulturalVQA, including GPT-4V and Gemini, reveals disparity in their level of cultural understanding across regions, with strong cultural understanding capabilities for North America while significantly lower performance for Africa. We observe disparity in their performance across cultural facets too, with clothing, rituals, and traditions seeing higher performances than food and drink. These disparities help us identify areas where VLMs lack cultural understanding and demonstrate the potential of CulturalVQA as a comprehensive evaluation set for gauging VLM progress in understanding diverse cultures.
Authors: Jingze Shi, Bingheng Wu, Ting Xie, Lu He
Abstract: Recent studies have shown that, relative position encoding performs well in selective state space model scanning algorithms, and the architecture that balances SSM and Attention enhances the efficiency and effectiveness of the algorithm, while the sparse activation of the mixture of experts reduces the training cost. We studied the effectiveness of using different position encodings in structured state space dual algorithms, and the more effective SSD-Attn internal and external function mixing method, and designed a more efficient cross domain mixture of experts. We found that the same matrix is very wonderful in different algorithms, which allows us to establish a new hybrid sparse architecture: Cheems. Compared with other hybrid architectures, it is more efficient and more effective in language modeling tasks.
Authors: Santosh V. Patapati
Abstract: Major Depressive Disorder (MDD) is a pervasive mental health condition that affects 300 million people worldwide. This work presents a novel, BiLSTM-based tri-modal model-level fusion architecture for the binary classification of depression from clinical interview recordings. The proposed architecture incorporates Mel Frequency Cepstral Coefficients, Facial Action Units, and uses a two-shot learning based GPT-4 model to process text data. This is the first work to incorporate large language models into a multi-modal architecture for this task. It achieves impressive results on the DAIC-WOZ AVEC 2016 Challenge cross-validation split and Leave-One-Subject-Out cross-validation split, surpassing all baseline models and multiple state-of-the-art models. In Leave-One-Subject-Out testing, it achieves an accuracy of 91.01%, an F1-Score of 85.95%, a precision of 80%, and a recall of 92.86%.
Authors: Yonghyeon Lee, Byeongho Lee, Seungyeon Kim, Frank C. Park
Abstract: Developing text-based robot trajectory generation models is made particularly difficult by the small dataset size, high dimensionality of the trajectory space, and the inherent complexity of the text-conditional motion distribution. Recent manifold learning-based methods have partially addressed the dimensionality and dataset size issues, but struggle with the complex text-conditional distribution. In this paper we propose a text-based trajectory generation model that attempts to address all three challenges while relying on only a handful of demonstration trajectory data. Our key idea is to leverage recent flow-based models capable of capturing complex conditional distributions, not directly in the high-dimensional trajectory space, but rather in the low-dimensional latent coordinate space of the motion manifold, with deliberately designed regularization terms to ensure smoothness of motions and robustness to text variations. We show that our Motion Manifold Flow Primitive (MMFP) framework can accurately generate qualitatively distinct motions for a wide range of text inputs, significantly outperforming existing methods.
Authors: Feiyang Kang, Yifan Sun, Bingbing Wen, Si Chen, Dawn Song, Rafid Mahmood, Ruoxi Jia
Abstract: Domain reweighting is an emerging research area aimed at adjusting the relative weights of different data sources to improve the effectiveness and efficiency of language model pre-training. This paper demonstrates that the optimal composition of training data from different domains is scale-dependent, challenging the existing practice of determining optimal mixtures through small-scale experiments and directly applying them at larger scales. We derive an analytical model for the dependence of optimal weights on data scale and introduce *AutoScale*, a novel, practical approach for optimizing data compositions at potentially large training data scales. *AutoScale* first uses a principled optimization framework to find optimal compositions at smaller, feasible scales, then predicts optimal compositions at larger scales using our derived model. Our evaluation on GPT-2 Large and BERT pre-training demonstrates *AutoScale*'s effectiveness in improving training convergence and downstream performance. Particularly, for GPT-2 Large on RedPajama, *AutoScale* decreases validation perplexity 28% faster than baselines, with up to 38% speed-up over unweighted training, achieving the best performance across downstream tasks. This work provides insights into the varying benefits of data sources across training scales for language models, contributing to the burgeoning research on scale-dependent data curation. Code is open-sourced.
Authors: Maximilian G. Schuh, Davide Boldini, Annkathrin I. Bohne, Stephan A. Sieber
Abstract: Accurate prediction of drug-target interactions is critical for advancing drug discovery. By reducing time and cost, machine learning and deep learning can accelerate this laborious discovery process. In a novel approach, BarlowDTI, we utilise the powerful Barlow Twins architecture for feature-extraction while considering the structure of the target protein. Our method achieves state-of-the-art predictive performance against multiple established benchmarks using only one-dimensional input. The use of gradient boosting machine as the underlying predictor ensures fast and efficient predictions without the need for substantial computational resources. We also investigate how the model reaches its decision based on individual training samples. By comparing co-crystal structures, we find that BarlowDTI effectively exploits catalytically active and stabilising residues, highlighting the model's ability to generalise from one-dimensional input data. In addition, we further benchmark new baselines against existing methods. Together, these innovations improve the efficiency and effectiveness of drug-target interaction predictions, providing robust tools for accelerating drug development and deepening the understanding of molecular interactions. Therefore, we provide an easy-to-use web interface that can be freely accessed at https://www.bio.nat.tum.de/oc2/barlowdti .
Authors: Noam Razin
Abstract: Despite the extreme popularity of deep learning in science and industry, its formal understanding is limited. This thesis puts forth notions of rank as key for developing a theory of deep learning, focusing on the fundamental aspects of generalization and expressiveness. In particular, we establish that gradient-based training can induce an implicit regularization towards low rank for several neural network architectures, and demonstrate empirically that this phenomenon may facilitate an explanation of generalization over natural data (e.g., audio, images, and text). Then, we characterize the ability of graph neural networks to model interactions via a notion of rank, which is commonly used for quantifying entanglement in quantum physics. A central tool underlying these results is a connection between neural networks and tensor factorizations. Practical implications of our theory for designing explicit regularization schemes and data preprocessing algorithms are presented.
Authors: Haozhe Ma, Zhengding Luo, Thanh Vinh Vo, Kuankuan Sima, Tze-Yun Leong
Abstract: Reward shaping is a technique in reinforcement learning that addresses the sparse-reward problem by providing more frequent and informative rewards. We introduce a self-adaptive and highly efficient reward shaping mechanism that incorporates success rates derived from historical experiences as shaped rewards. The success rates are sampled from Beta distributions, which dynamically evolve from uncertain to reliable values as data accumulates. Initially, the shaped rewards exhibit more randomness to encourage exploration, while over time, the increasing certainty enhances exploitation, naturally balancing exploration and exploitation. Our approach employs Kernel Density Estimation (KDE) combined with Random Fourier Features (RFF) to derive the Beta distributions, providing a computationally efficient, non-parametric, and learning-free solution for high-dimensional continuous state spaces. Our method is validated on various tasks with extremely sparse rewards, demonstrating notable improvements in sample efficiency and convergence stability over relevant baselines.
Authors: Zekai Li, Ziyao Guo, Wangbo Zhao, Tianle Zhang, Zhi-Qi Cheng, Samir Khaki, Kaipeng Zhang, Ahmad Sajedi, Konstantinos N Plataniotis, Kai Wang, Yang You
Abstract: Dataset Distillation aims to compress a large dataset into a significantly more compact, synthetic one without compromising the performance of the trained models. To achieve this, existing methods use the agent model to extract information from the target dataset and embed it into the distilled dataset. Consequently, the quality of extracted and embedded information determines the quality of the distilled dataset. In this work, we find that existing methods introduce misaligned information in both information extraction and embedding stages. To alleviate this, we propose Prioritize Alignment in Dataset Distillation (PAD), which aligns information from the following two perspectives. 1) We prune the target dataset according to the compressing ratio to filter the information that can be extracted by the agent model. 2) We use only deep layers of the agent model to perform the distillation to avoid excessively introducing low-level information. This simple strategy effectively filters out misaligned information and brings non-trivial improvement for mainstream matching-based distillation algorithms. Furthermore, built on trajectory matching, \textbf{PAD} achieves remarkable improvements on various benchmarks, achieving state-of-the-art performance.
Authors: Luca Mouchel, Debjit Paul, Shaobo Cui, Robert West, Antoine Bosselut, Boi Faltings
Abstract: Despite the remarkable performance of Large Language Models (LLMs) in natural language processing tasks, they still struggle with generating logically sound arguments, resulting in potential risks such as spreading misinformation. To address this issue, we introduce FIPO, a fallacy-informed framework that leverages preference optimization methods to steer LLMs toward logically sound arguments. FIPO includes a classification loss, to capture the fine-grained information on fallacy types. Our results on argumentation datasets show that our method reduces the fallacy errors by up to 17.5%. Furthermore, our human evaluation results indicate that the quality of the generated arguments by our method significantly outperforms the fine-tuned baselines, as well as other preference optimization methods, such as DPO. These findings highlight the importance of ensuring models are aware of logical fallacies for effective argument generation. Our code is available at github.com/lucamouchel/Logical-Fallacies.
Authors: Eduardo Jr Piedad, Christian Ainsley Del Rosario, Eduardo Prieto-Araujo, Oriol Gomis-Bellmunt
Abstract: Deep learning (DL) strategies have recently been utilized to diagnose motor faults by simply analyzing motor phase current signals, offering a less costly and non-intrusive alternative to vibration sensors. This research transforms these time-series current signals into time-frequency 2D representations via Wavelet Transform (WT). The dataset for motor current signals includes 3,750 data points across five categories: one representing normal conditions and four representing artificially induced faults, each under five different load conditions: 0, 25, 50, 75, and 100%. The study employs five WT-based techniques: WT-Amor, WT-Bump, WT-Morse, WSST-Amor, and WSST-Bump. Subsequently, five DL models adopting prior Convolutional Neural Network (CNN) architecture were developed and tested using the transformed 2D plots from each method. The DL models for WT-Amor, WT-Bump, and WT-Morse showed remarkable effectiveness with peak model accuracy of 90.93, 89.20, and 93.73%, respectively, surpassing previous 2D-image-based methods that recorded accuracy of 80.25, 74.80, and 82.80% respectively using the identical dataset and validation protocol. Notably, the WT-Morse approach slightly exceeded the formerly highest ML technique, achieving a 93.20% accuracy. However, the two WSST methods that utilized synchrosqueezing techniques faced difficulty accurately classifying motor faults. The performance of Wavelet-based deep learning methods offers a compelling alternative for machine condition monitoring.
Authors: Eduardo Jr Piedad, Zherish Galvin Mayordo, Eduardo Prieto-Araujo, Oriol Gomis-Bellmunt
Abstract: In motor condition diagnosis, electrical current signature serves as an alternative feature to vibration-based sensor data, which is a more expensive and invasive method. Machine learning (ML) techniques have been emerging in diagnosing motor conditions using only motor phase current signals. This study converts time-series motor current signals to time-frequency 2D plots using Short-time Fourier Transform (STFT) methods. The motor current signal dataset consists of 3,750 sample points with five classes - one healthy and four synthetically-applied motor fault conditions, and with five loading conditions: 0, 25, 50, 75, and 100%. Five transformation methods are used on the dataset: non-overlap and overlap STFTs, non-overlap and overlap realigned STFTs, and synchrosqueezed STFT. Then, deep learning (DL) models based on the previous Convolutional Neural Network (CNN) architecture are trained and validated from generated plots of each method. The DL models of overlap-STFT, overlap R-STFT, non-overlap STFT, non-overlap R-STFT, and synchrosqueezed-STFT performed exceptionally with an average accuracy of 97.65, 96.03, 96.08, 96.32, and 88.27%, respectively. Four methods outperformed the previous best ML method with 93.20% accuracy, while all five outperformed previous 2D-plot-based methods with accuracy of 80.25, 74.80, and 82.80%, respectively, using the same dataset, same DL architecture, and validation steps.
Authors: Xiangyu Zhao, Yuehan Zhang, Wenlong Zhang, Xiao-Ming Wu
Abstract: The fashion domain encompasses a variety of real-world multimodal tasks, including multimodal retrieval and multimodal generation. The rapid advancements in artificial intelligence generated content, particularly in technologies like large language models for text generation and diffusion models for visual generation, have sparked widespread research interest in applying these multimodal models in the fashion domain. However, tasks involving embeddings, such as image-to-text or text-to-image retrieval, have been largely overlooked from this perspective due to the diverse nature of the multimodal fashion domain. And current research on multi-task single models lack focus on image generation. In this work, we present UniFashion, a unified framework that simultaneously tackles the challenges of multimodal generation and retrieval tasks within the fashion domain, integrating image generation with retrieval tasks and text generation tasks. UniFashion unifies embedding and generative tasks by integrating a diffusion model and LLM, enabling controllable and high-fidelity generation. Our model significantly outperforms previous single-task state-of-the-art models across diverse fashion tasks, and can be readily adapted to manage complex vision-language tasks. This work demonstrates the potential learning synergy between multimodal generation and retrieval, offering a promising direction for future research in the fashion domain. The source code is available at https://github.com/xiangyu-mm/UniFashion.
Authors: James Liu, Pragaash Ponnusamy, Tianle Cai, Han Guo, Yoon Kim, Ben Athiwaratkun
Abstract: Activation sparsity can enable practical inference speedups in large language models (LLMs) by reducing the compute and memory-movement required for matrix multiplications during the forward pass. However, existing methods face limitations that inhibit widespread adoption. Some approaches are tailored towards older models with ReLU-based sparsity, while others require extensive continued pre-training on up to hundreds of billions of tokens. This paper describes TEAL, a simple training-free method that applies magnitude-based activation sparsity to hidden states throughout the entire model. TEAL achieves 40-50% model-wide sparsity with minimal performance degradation across Llama-2, Llama-3, and Mistral families, with sizes varying from 7B to 70B. We improve existing sparse kernels and demonstrate wall-clock decoding speed-ups of up to 1.53$\times$ and 1.8$\times$ at 40% and 50% model-wide sparsity. TEAL is compatible with weight quantization, enabling further efficiency gains.
Authors: Jasper Dekoninck, Maximilian Baader, Martin Vechev
Abstract: Rating-based human evaluation has become an essential tool to accurately evaluate the impressive performance of large language models (LLMs). However, current rating systems suffer from several important limitations: first, they fail to account for biases that significantly influence evaluation results, second, they require large and expensive preference datasets to obtain accurate ratings, and third, they do not facilitate meaningful comparisons of model ratings across different tasks. To address these issues, we introduce Polyrating, an expressive and flexible rating system based on maximum a posteriori estimation that enables a more nuanced and thorough analysis of model performance at lower costs. Polyrating can detect and quantify biases affecting human preferences, ensuring fairer model comparisons. Further, Polyrating can reduce the cost of human evaluations by up to $41\%$ for new models and up to $77\%$ for new tasks by leveraging existing benchmark scores. Lastly, Polyrating enables direct comparisons of ratings across different tasks, providing a comprehensive understanding of an LLMs' strengths, weaknesses, and relative performance across different applications.
Authors: Kanthimathi S, Shravan Venkatraman, Jayasankar K S, Pranay Jiljith T, Jashwanth R
Abstract: Distributed Denial of Service (DDoS) attacks are a major concern in network security, as they overwhelm systems with excessive traffic, compromise sensitive data, and disrupt network services. Accurately detecting these attacks is crucial to protecting network infrastructure. Traditional approaches, such as single Convolutional Neural Networks (CNNs) or conventional Machine Learning (ML) algorithms like Decision Trees (DTs) and Support Vector Machines (SVMs), struggle to extract the diverse features needed for precise classification, resulting in suboptimal performance. This research addresses this gap by introducing a novel approach for DDoS attack detection. The proposed method combines three distinct CNN architectures: SA-Enabled CNN with XGBoost, SA-Enabled CNN with LSTM, and SA-Enabled CNN with Random Forest. Each model extracts features at multiple scales, while self-attention mechanisms enhance feature integration and relevance. The weighted ensemble approach ensures that both prominent and subtle features contribute to the final classification, improving adaptability to evolving attack patterns and novel threats. The proposed method achieves a precision of 98.71%, an F1-score of 98.66%, a recall of 98.63%, and an accuracy of 98.69%, outperforming traditional methods and setting a new benchmark in DDoS attack detection. This innovative approach addresses critical limitations in current models and advances the state of the art in network security.
Authors: Ramon Tavares, Ricardo Olinda
Abstract: This study presents a comprehensive methodology for modeling and forecasting the historical time series of active fire spots detected by the AQUA\_M-T satellite in the Amazon, Brazil. The approach employs a mixed Recurrent Neural Network (RNN) model, combining Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU) architectures to predict the monthly accumulations of daily detected active fire spots. Data analysis revealed a consistent seasonality over time, with annual maximum and minimum values tending to repeat at the same periods each year. The primary objective is to verify whether the forecasts capture this inherent seasonality through machine learning techniques. The methodology involved careful data preparation, model configuration, and training using cross-validation with two seeds, ensuring that the data generalizes well to both the test and validation sets for both seeds. The results indicate that the combined LSTM and GRU model delivers excellent forecasting performance, demonstrating its effectiveness in capturing complex temporal patterns and modeling the observed time series. This research significantly contributes to the application of deep learning techniques in environmental monitoring, specifically in forecasting active fire spots. The proposed approach highlights the potential for adaptation to other time series forecasting challenges, opening new opportunities for research and development in machine learning and prediction of natural phenomena. Keywords: Time Series Forecasting; Recurrent Neural Networks; Deep Learning.
Authors: Suyash Fulay, William Brannon, Shrestha Mohanty, Cassandra Overney, Elinor Poole-Dayan, Deb Roy, Jad Kabbara
Abstract: Language model alignment research often attempts to ensure that models are not only helpful and harmless, but also truthful and unbiased. However, optimizing these objectives simultaneously can obscure how improving one aspect might impact the others. In this work, we focus on analyzing the relationship between two concepts essential in both language model alignment and political science: truthfulness and political bias. We train reward models on various popular truthfulness datasets and subsequently evaluate their political bias. Our findings reveal that optimizing reward models for truthfulness on these datasets tends to result in a left-leaning political bias. We also find that existing open-source reward models (i.e., those trained on standard human preference datasets) already show a similar bias and that the bias is larger for larger models. These results raise important questions about the datasets used to represent truthfulness, potential limitations of aligning models to be both truthful and politically unbiased, and what language models capture about the relationship between truth and politics.
Authors: Kushal Kedia, Prithwish Dan, Angela Chao, Maximus Adrian Pace, Sanjiban Choudhury
Abstract: Human demonstrations as prompts are a powerful way to program robots to do long-horizon manipulation tasks. However, translating these demonstrations into robot-executable actions presents significant challenges due to execution mismatches in movement styles and physical capabilities. Existing methods either depend on robot-demonstrator paired data, which is infeasible to scale, or rely too heavily on frame-level visual similarities that often break down in practice. To address these challenges, we propose RHyME, a novel framework that automatically aligns robot and demonstrator task executions using optimal transport costs. Given long-horizon robot demonstrations, RHyME synthesizes semantically equivalent demonstrator videos by retrieving and composing short-horizon demonstrator clips. This approach facilitates effective policy training without the need for paired data. We demonstrate that RHyME outperforms a range of baselines across cross-embodiment datasets, showing a 52% increase in task recall over prior cross-embodiment learning methods. We release our code and datasets at https://portal-cornell.github.io/rhyme/.
Authors: King Zhu, Qianbo Zang, Shian Jia, Siwei Wu, Feiteng Fang, Yizhi Li, Shawn Gavin, Tuney Zheng, Jiawei Guo, Bo Li, Haoning Wu, Xingwei Qu, Jian Yang, Zachary Liu, Xiang Yue, J. H. Liu, Chenghua Lin, Min Yang, Shiwen Ni, Wenhao Huang, Ge Zhang
Abstract: Multimodal Large Language Models (MLLMs) are evaluated on various benchmarks, such as image captioning, visual question answering, and reasoning. However, many of these benchmarks include overly simple or uninformative samples, complicating the effective distinction of different MLLMs' performance. Furthermore, evaluating models across numerous benchmarks incurs a significant computational burden. To address these issues, we propose LIME (Less Is More for MLLM Evaluation), a refined and efficient benchmark curated through a semi-automated pipeline. This pipeline filters out uninformative samples and eliminates answer leakage by focusing on tasks that necessitate image-based understanding. Our experiments indicate that LIME reduces the number of samples by 76% and evaluation time by 77%, while also providing a more effective means of distinguishing the capabilities of different models. Notably, we find that traditional automatic metrics, such as CIDEr, are inadequate for assessing MLLMs' captioning performance; excluding the caption task score yields a more accurate reflection of overall model performance. All code and data are available at https://github.com/kangreen0210/LIME.
Authors: Xiaotong Zhang, Dingcheng Huang, Kamal Youcef-Toumi
Abstract: Effective human-robot collaboration (HRC) requires the robots to possess human-like intelligence. Inspired by the human's cognitive ability to selectively process and filter elements in complex environments, this paper introduces a novel concept and scene-understanding approach termed `relevance.' It identifies relevant components in a scene. To accurately and efficiently quantify relevance, we developed an event-based framework that selectively triggers relevance determination, along with a probabilistic methodology built on a structured scene representation. Simulation results demonstrate that the relevance framework and methodology accurately predict the relevance of a general HRC setup, achieving a precision of 0.99 and a recall of 0.94. Relevance can be broadly applied to several areas in HRC to improve task planning time by 79.56% compared with pure planning for a cereal task, reduce perception latency by up to 26.53% for an object detector, improve HRC safety by up to 13.50% and reduce the number of inquiries for HRC by 80.84%. A real-world demonstration showcases the relevance framework's ability to intelligently assist humans in everyday tasks.
Authors: Maximilian St\"olzle, Cosimo Della Santina
Abstract: Even though a variety of methods have been proposed in the literature, efficient and effective latent-space control (i.e., control in a learned low-dimensional space) of physical systems remains an open challenge. We argue that a promising avenue is to leverage powerful and well-understood closed-form strategies from control theory literature in combination with learned dynamics, such as potential-energy shaping. We identify three fundamental shortcomings in existing latent-space models that have so far prevented this powerful combination: (i) they lack the mathematical structure of a physical system, (ii) they do not inherently conserve the stability properties of the real systems, (iii) these methods do not have an invertible mapping between input and latent-space forcing. This work proposes a novel Coupled Oscillator Network (CON) model that simultaneously tackles all these issues. More specifically, (i) we show analytically that CON is a Lagrangian system - i.e., it possesses well-defined potential and kinetic energy terms. Then, (ii) we provide formal proof of global Input-to-State stability using Lyapunov arguments. Moving to the experimental side, we demonstrate that CON reaches SoA performance when learning complex nonlinear dynamics of mechanical systems directly from images. An additional methodological innovation contributing to achieving this third goal is an approximated closed-form solution for efficient integration of network dynamics, which eases efficient training. We tackle (iii) by approximating the forcing-to-input mapping with a decoder that is trained to reconstruct the input based on the encoded latent space force. Finally, we show how these properties enable latent-space control. We use an integral-saturated PID with potential force compensation and demonstrate high-quality performance on a soft robot using raw pixels as the only feedback information.
Authors: Mario Giulianelli, Andreas Opedal, Ryan Cotterell
Abstract: We introduce a generalization of classic information-theoretic measures of predictive uncertainty in online language processing, based on the simulation of expected continuations of incremental linguistic contexts. Our framework provides a formal definition of anticipatory and responsive measures, and it equips experimenters with the tools to define new, more expressive measures beyond standard next-symbol entropy and surprisal. While extracting these standard quantities from language models is convenient, we demonstrate that using Monte Carlo simulation to estimate alternative responsive and anticipatory measures pays off empirically: New special cases of our generalized formula exhibit enhanced predictive power compared to surprisal for human cloze completion probability as well as ELAN, LAN, and N400 amplitudes, and greater complementarity with surprisal in predicting reading times.
Authors: Siddhant Dutta, Pavana P Karanth, Pedro Maciel Xavier, Iago Leal de Freitas, Nouhaila Innan, Sadok Ben Yahia, Muhammad Shafique, David E. Bernal Neira
Abstract: The widespread deployment of products powered by machine learning models is raising concerns around data privacy and information security worldwide. To address this issue, Federated Learning was first proposed as a privacy-preserving alternative to conventional methods that allow multiple learning clients to share model knowledge without disclosing private data. A complementary approach known as Fully Homomorphic Encryption (FHE) is a quantum-safe cryptographic system that enables operations to be performed on encrypted weights. However, implementing mechanisms such as these in practice often comes with significant computational overhead and can expose potential security threats. Novel computing paradigms, such as analog, quantum, and specialized digital hardware, present opportunities for implementing privacy-preserving machine learning systems while enhancing security and mitigating performance loss. This work instantiates these ideas by applying the FHE scheme to a Federated Learning Neural Network architecture that integrates both classical and quantum layers.
Authors: Ziwei Li, Xiaoqi Wang, Hong-You Chen, Han-Wei Shen, Wei-Lun Chao
Abstract: Federated learning (FL) has rapidly evolved as a promising paradigm that enables collaborative model training across distributed participants without exchanging their local data. Despite its broad applications in fields such as computer vision, graph learning, and natural language processing, the development of a data projection model that can be effectively used to visualize data in the context of FL is crucial yet remains heavily under-explored. Neighbor embedding (NE) is an essential technique for visualizing complex high-dimensional data, but collaboratively learning a joint NE model is difficult. The key challenge lies in the objective function, as effective visualization algorithms like NE require computing loss functions among pairs of data. In this paper, we introduce \textsc{FedNE}, a novel approach that integrates the \textsc{FedAvg} framework with the contrastive NE technique, without any requirements of shareable data. To address the lack of inter-client repulsion which is crucial for the alignment in the global embedding space, we develop a surrogate loss function that each client learns and shares with each other. Additionally, we propose a data-mixing strategy to augment the local data, aiming to relax the problems of invisible neighbors and false neighbors constructed by the local $k$NN graphs. We conduct comprehensive experiments on both synthetic and real-world datasets. The results demonstrate that our \textsc{FedNE} can effectively preserve the neighborhood data structures and enhance the alignment in the global embedding space compared to several baseline methods.
Authors: Charlotte Bunne, Yusuf Roohani, Yanay Rosen, Ankit Gupta, Xikun Zhang, Marcel Roed, Theo Alexandrov, Mohammed AlQuraishi, Patricia Brennan, Daniel B. Burkhardt, Andrea Califano, Jonah Cool, Abby F. Dernburg, Kirsty Ewing, Emily B. Fox, Matthias Haury, Amy E. Herr, Eric Horvitz, Patrick D. Hsu, Viren Jain, Gregory R. Johnson, Thomas Kalil, David R. Kelley, Shana O. Kelley, Anna Kreshuk, Tim Mitchison, Stephani Otte, Jay Shendure, Nicholas J. Sofroniew, Fabian Theis, Christina V. Theodoris, Srigokul Upadhyayula, Marc Valer, Bo Wang, Eric Xing, Serena Yeung-Levy, Marinka Zitnik, Theofanis Karaletsos, Aviv Regev, Emma Lundberg, Jure Leskovec, Stephen R. Quake
Abstract: The cell is arguably the most fundamental unit of life and is central to understanding biology. Accurate modeling of cells is important for this understanding as well as for determining the root causes of disease. Recent advances in artificial intelligence (AI), combined with the ability to generate large-scale experimental data, present novel opportunities to model cells. Here we propose a vision of leveraging advances in AI to construct virtual cells, high-fidelity simulations of cells and cellular systems under different conditions that are directly learned from biological data across measurements and scales. We discuss desired capabilities of such AI Virtual Cells, including generating universal representations of biological entities across scales, and facilitating interpretable in silico experiments to predict and understand their behavior using virtual instruments. We further address the challenges, opportunities and requirements to realize this vision including data needs, evaluation strategies, and community standards and engagement to ensure biological accuracy and broad utility. We envision a future where AI Virtual Cells help identify new drug targets, predict cellular responses to perturbations, as well as scale hypothesis exploration. With open science collaborations across the biomedical ecosystem that includes academia, philanthropy, and the biopharma and AI industries, a comprehensive predictive understanding of cell mechanisms and interactions has come into reach.
Authors: Eric Cullhed
Abstract: This paper describes an experiment in fine-tuning a pretrained causal language model (Meta's Llama 3.1 8B Instruct) to support three key tasks in philological research: dating, geographic attribution, and restoring missing or illegible characters in ancient Greek inscriptions and documentary papyri. Using an instruction-based approach and a 95%/5% train/test split, the models for papyri achieved a character error rate (CER) of 14.9%, a top-1 accuracy of 73.5%, and a top-20 accuracy of 86.0% in text reconstruction on the test set. For geographic attribution, they achieved a top-1 accuracy of 66.4% and a top-3 accuracy of 79.9%. In chronological attribution, the models showed an average deviation of 21.7 years from the actual terminus post/ante quem, with a median deviation of 0 years. For inscriptions, the models achieved a CER of 20.5%, a top-1 accuracy of 63.7%, and a top-20 accuracy of 83.0% for sequences up to 10 characters. In geographic attribution, they reached a top-1 accuracy of 75.0% and a top-3 accuracy of 83.7%. For dating, they had an average deviation of 37.1 years and a median deviation of 3 years from the actual date. When benchmarked against the state-of-the-art model (Ithaca) on a shared test set and recently edited inscriptions, the instruction-tuned models excelled in text restoration, with the added benefit of ignoring spaces during reconstruction to align with the continuous script of ancient texts. However, the models performed lower than Ithaca in geographic and chronological attribution. These preliminary results suggest that fine-tuning larger pretrained causal language models with instruction templates holds promise for philological research, especially in textual criticism.
Authors: Tohida Rehman, Debarshi Kumar Sanyal, Samiran Chattopadhyay
Abstract: The title of a research paper communicates in a succinct style the main theme and, sometimes, the findings of the paper. Coming up with the right title is often an arduous task, and therefore, it would be beneficial to authors if title generation can be automated. In this paper, we fine-tune pre-trained language models to generate titles of papers from their abstracts. Additionally, we use GPT-3.5-turbo in a zero-shot setting to generate paper titles. The performance of the models is measured with ROUGE, METEOR, MoverScore, BERTScore and SciBERTScore metrics. We find that fine-tuned PEGASUS-large outperforms the other models, including fine-tuned LLaMA-3-8B and GPT-3.5-turbo, across most metrics. We also demonstrate that ChatGPT can generate creative titles for papers. Our observations suggest that AI-generated paper titles are generally accurate and appropriate.
Authors: Zheda Mai, Arpita Chowdhury, Ping Zhang, Cheng-Hao Tu, Hong-You Chen, Vardaan Pahuja, Tanya Berger-Wolf, Song Gao, Charles Stewart, Yu Su, Wei-Lun Chao
Abstract: Fine-tuning is arguably the most straightforward way to tailor a pre-trained model (e.g., a foundation model) to downstream applications, but it also comes with the risk of losing valuable knowledge the model had learned in pre-training. For example, fine-tuning a pre-trained classifier capable of recognizing a large number of classes to master a subset of classes at hand is shown to drastically degrade the model's accuracy in the other classes it had previously learned. As such, it is hard to further use the fine-tuned model when it encounters classes beyond the fine-tuning data. In this paper, we systematically dissect the issue, aiming to answer the fundamental question, "What has been damaged in the fine-tuned model?" To our surprise, we find that the fine-tuned model neither forgets the relationship among the other classes nor degrades the features to recognize these classes. Instead, the fine-tuned model often produces more discriminative features for these other classes, even if they were missing during fine-tuning! {What really hurts the accuracy is the discrepant logit scales between the fine-tuning classes and the other classes}, implying that a simple post-processing calibration would bring back the pre-trained model's capability and at the same time unveil the feature improvement over all classes. We conduct an extensive empirical study to demonstrate the robustness of our findings and provide preliminary explanations underlying them, suggesting new directions for future theoretical analysis. Our code is available at https://github.com/OSU-MLB/Fine-Tuning-Is-Fine-If-Calibrated.
URLs: https://github.com/OSU-MLB/Fine-Tuning-Is-Fine-If-Calibrated.
Authors: Shuai Zhao, Leilei Gan, Zhongliang Guo, Xiaobao Wu, Luwei Xiao, Xiaoyu Xu, Cong-Duy Nguyen, Luu Anh Tuan
Abstract: Despite being widely applied due to their exceptional capabilities, Large Language Models (LLMs) have been proven to be vulnerable to backdoor attacks. These attacks introduce targeted vulnerabilities into LLMs by poisoning training samples and full-parameter fine-tuning. However, this kind of backdoor attack is limited since they require significant computational resources, especially as the size of LLMs increases. Besides, parameter-efficient fine-tuning (PEFT) offers an alternative but the restricted parameter updating may impede the alignment of triggers with target labels. In this study, we first verify that backdoor attacks with PEFT may encounter challenges in achieving feasible performance. To address these issues and improve the effectiveness of backdoor attacks with PEFT, we propose a novel backdoor attack algorithm from weak to strong based on feature alignment-enhanced knowledge distillation (W2SAttack). Specifically, we poison small-scale language models through full-parameter fine-tuning to serve as the teacher model. The teacher model then covertly transfers the backdoor to the large-scale student model through feature alignment-enhanced knowledge distillation, which employs PEFT. Theoretical analysis reveals that W2SAttack has the potential to augment the effectiveness of backdoor attacks. We demonstrate the superior performance of W2SAttack on classification tasks across four language models, four backdoor attack algorithms, and two different architectures of teacher models. Experimental results indicate success rates close to 100% for backdoor attacks targeting PEFT.
Authors: Sakhinana Sagar Srinivas, Vijay Sri Vaikunth, Venkataramana Runkana
Abstract: Patents are the currency of innovation, and like any currency, they need to be managed and protected (Gavin Potenza). Patents, as legal documents that secure intellectual property rights, play a critical role in technological innovation. The growing complexity of patent documents and the surge in patent applications have created a need for automated solutions in patent analysis. In this work, we present PatExpert, an autonomous multi-agent conversational framework designed to streamline and optimize patent-related tasks. The framework consists of a metaagent that coordinates task-specific expert agents for various patent-related tasks and a critique agent for error handling and feedback provision. The meta-agent orchestrates specialized expert agents, each fine-tuned for specific tasks such as patent classification, acceptance, claim generation, abstractive summarization, multi-patent analysis, and scientific hypothesis generation. For multi-patent analysis, the framework incorporates advanced methods like Graph Retrieval-Augmented Generation (GRAG) to enhance response accuracy and relevance by combining semantic similarity with knowledge graphs. Error handling is managed by critique agents (Gold-LLM-as-a-Judge and Reward-LLM-as-a-Judge), which evaluate output responses for accuracy and provide iterative feedback. The framework also prioritizes explainability, ensuring transparent justifications for decisions made during patent analysis. Its comprehensive capabilities make it a valuable tool for automating complex patent workflows, enhancing efficiency, accuracy, and compliance in patent-related tasks. Empirical evidence demonstrates significant improvements in patent processing tasks, concluding that the framework offers a robust solution for automating and optimizing patent analysis.
Authors: Xinyi Huang, Yingyi Wu, Danyang Zhang, Jiacheng Hu, Yujian Long
Abstract: This study addresses the critical challenges of assessing foundational academic skills by leveraging advancements in natural language processing (NLP). Traditional assessment methods often struggle to provide timely and comprehensive feedback on key cognitive and linguistic aspects, such as coherence, syntax, and analytical reasoning. Our approach integrates multiple state-of-the-art NLP models, including BERT, RoBERTa, BART, DeBERTa, and T5, within an ensemble learning framework. These models are combined through stacking techniques using LightGBM and Ridge regression to enhance predictive accuracy. The methodology involves detailed data preprocessing, feature extraction, and pseudo-label learning to optimize model performance. By incorporating sophisticated NLP techniques and ensemble learning, this study significantly improves the accuracy and efficiency of assessments, offering a robust solution that surpasses traditional methods and opens new avenues for educational technology research focused on enhancing core academic competencies.
Authors: Chun Jie Chong, Zhihao Yao, Iulian Neamtiu
Abstract: Generating code via a LLM (rather than writing code from scratch), has exploded in popularity. However, the security implications of LLM-generated code are still unknown. We performed a study that compared the security and quality of human-written code with that of LLM-generated code, for a wide range of programming tasks, including data structures, algorithms, cryptographic routines, and LeetCode questions. To assess code security we used unit testing, fuzzing, and static analysis. For code quality, we focused on complexity and size. We found that LLM can generate incorrect code that fails to implement the required functionality, especially for more complicated tasks; such errors can be subtle. For example, for the cryptographic algorithm SHA1, LLM generated an incorrect implementation that nevertheless compiles. In cases where its functionality was correct, we found that LLM-generated code is less secure, primarily due to the lack of defensive programming constructs, which invites a host of security issues such as buffer overflows or integer overflows. Fuzzing has revealed that LLM-generated code is more prone to hangs and crashes than human-written code. Quality-wise, we found that LLM generates bare-bones code that lacks defensive programming constructs, and is typically more complex (per line of code) compared to human-written code. Next, we constructed a feedback loop that asked the LLM to re-generate the code and eliminate the found issues (e.g., malloc overflow, array index out of bounds, null dereferences). We found that the LLM fails to eliminate such issues consistently: while succeeding in some cases, we found instances where the re-generated, supposedly more secure code, contains new issues; we also found that upon prompting, LLM can introduce issues in files that were issues-free before prompting.
Authors: Maciej Krzywda, Szymon {\L}ukasik, Amir Gandomi H
Abstract: The present study covers an approach to neural architecture search (NAS) using Cartesian genetic programming (CGP) for the design and optimization of Convolutional Neural Networks (CNNs). In designing artificial neural networks, one crucial aspect of the innovative approach is suggesting a novel neural architecture. Currently used architectures have mostly been developed manually by human experts, which is a time-consuming and error-prone process. In this work, we use pure Genetic Programming Approach to design CNNs, which employs only one genetic operation, i.e., mutation. In the course of preliminary experiments, our methodology yields promising results.
Authors: Mingye Zhu, Yi Liu, Quan Wang, Junbo Guo, Zhendong Mao
Abstract: Recent breakthroughs in preference alignment have significantly improved Large Language Models' ability to generate texts that align with human preferences and values. However, current alignment metrics typically emphasize the post-hoc overall improvement, while overlooking a critical aspect: regression, which refers to the backsliding on previously correctly-handled data after updates. This potential pitfall may arise from excessive fine-tuning on already well-aligned data, which subsequently leads to over-alignment and degeneration. To address this challenge, we propose FlipGuard, a constrained optimization approach to detect and mitigate update regression with focal attention. Specifically, FlipGuard identifies performance degradation using a customized reward characterization and strategically enforces a constraint to encourage conditional congruence with the pre-aligned model during training. Comprehensive experiments demonstrate that FlipGuard effectively alleviates update regression while demonstrating excellent overall performance, with the added benefit of knowledge preservation while aligning preferences.
Authors: Xi Chen, Kaituo Feng, Changsheng Li, Xunhao Lai, Xiangyu Yue, Ye Yuan, Guoren Wang
Abstract: Low-rank training has emerged as a promising approach for reducing memory usage in training Large Language Models (LLMs). Previous methods either rely on decomposing weight matrices (e.g., LoRA), or seek to decompose gradient matrices (e.g., GaLore) to ensure reduced memory consumption. However, both of them constrain the training in a low-rank subspace, thus inevitably leading to sub-optimal performance. This raises a question: whether it is possible to consistently preserve the low-rank constraint for memory efficiency, while achieving full-rank training (i.e., training with full-rank gradients of full-rank weights) to avoid inferior outcomes? In this paper, we propose a new plug-and-play training framework for LLMs called Fira, as the first attempt to achieve this goal. First, we observe an interesting phenomenon during LLM training: the scaling impact of adaptive optimizers (e.g., Adam) on the gradient norm remains similar from low-rank to full-rank training. Based on this observation, we propose a norm-based scaling method, which utilizes the scaling impact of low-rank optimizers as substitutes for that of original full-rank optimizers to enable full-rank training. In this way, we can preserve the low-rank constraint in the optimizer while achieving full-rank training for better performance. Moreover, we find that there are sudden gradient rises during the optimization process, potentially causing loss spikes. To address this, we further put forward a norm-growth limiter to smooth the gradient via regulating the relative increase of gradient norms. Extensive experiments on the pre-training and fine-tuning of LLMs show that Fira outperforms both LoRA and GaLore, achieving performance that is comparable to or even better than full-rank training.
Authors: Taekyung Lee, Jaemoo Choi, Myungjoo Kang, Jaewoong Choi
Abstract: Unpaired point cloud completion explores methods for learning a completion map from unpaired incomplete and complete point cloud data. In this paper, we propose a novel approach for unpaired point cloud completion using the unbalanced optimal transport map, called Unbalanced Optimal Transport Map for Unpaired Point Cloud Completion (UOT-UPC). We demonstrate that the unpaired point cloud completion can be naturally interpreted as the Optimal Transport (OT) problem and introduce the Unbalanced Optimal Transport (UOT) approach to address the class imbalance problem, which is prevalent in unpaired point cloud completion datasets. Moreover, we analyze the appropriate cost function for unpaired completion tasks. This analysis shows that the InfoCD cost function is particularly well-suited for this task. Our model is the first attempt to leverage UOT for unpaired point cloud completion, achieving competitive or superior results on both single-category and multi-category datasets. In particular, our model is especially effective in scenarios with class imbalance, where the proportions of categories are different between the incomplete and complete point cloud datasets.
Authors: Zhipei Xu, Xuanyu Zhang, Runyi Li, Zecheng Tang, Qing Huang, Jian Zhang
Abstract: The rapid development of generative AI is a double-edged sword, which not only facilitates content creation but also makes image manipulation easier and more difficult to detect. Although current image forgery detection and localization (IFDL) methods are generally effective, they tend to face two challenges: \textbf{1)} black-box nature with unknown detection principle, \textbf{2)} limited generalization across diverse tampering methods (e.g., Photoshop, DeepFake, AIGC-Editing). To address these issues, we propose the explainable IFDL task and design FakeShield, a multi-modal framework capable of evaluating image authenticity, generating tampered region masks, and providing a judgment basis based on pixel-level and image-level tampering clues. Additionally, we leverage GPT-4o to enhance existing IFDL datasets, creating the Multi-Modal Tamper Description dataSet (MMTD-Set) for training FakeShield's tampering analysis capabilities. Meanwhile, we incorporate a Domain Tag-guided Explainable Forgery Detection Module (DTE-FDM) and a Multi-modal Forgery Localization Module (MFLM) to address various types of tamper detection interpretation and achieve forgery localization guided by detailed textual descriptions. Extensive experiments demonstrate that FakeShield effectively detects and localizes various tampering techniques, offering an explainable and superior solution compared to previous IFDL methods.
Authors: Xinyu Zhang, Yuhan Liu, Haonan Chang, Liam Schramm, Abdeslam Boularias
Abstract: Autoregressive models have demonstrated remarkable success in natural language processing. In this work, we design a simple yet effective autoregressive architecture for robotic manipulation tasks. We propose the Chunking Causal Transformer (CCT), which extends the next-single-token prediction of causal transformers to support multi-token prediction in a single pass. Further, we design a novel attention interleaving strategy that allows CCT to be trained efficiently with teacher-forcing. Based on CCT, we propose the Autoregressive Policy (ARP) model, which learns to generate action sequences autoregressively. We find that action sequence learning enables better leverage of the underlying causal relationships in robotic tasks. We evaluate ARP across diverse robotic manipulation environments, including Push-T, ALOHA, and RLBench, and show that it outperforms the state-of-the-art methods in all tested environments, while being more efficient in computation and parameter sizes. Video demonstrations, our source code, and the models of ARP can be found at http://github.com/mlzxy/arp.
Authors: Nirmalya Thakur
Abstract: The work presented in this paper makes three scientific contributions with a specific focus on mining and analysis of COVID-19-related posts on Instagram. First, it presents a multilingual dataset of 500,153 Instagram posts about COVID-19 published between January 2020 and September 2024. This dataset, available at https://dx.doi.org/10.21227/d46p-v480, contains Instagram posts in 161 different languages as well as 535,021 distinct hashtags. After the development of this dataset, multilingual sentiment analysis was performed, which involved classifying each post as positive, negative, or neutral. The results of sentiment analysis are presented as a separate attribute in this dataset. Second, it presents the results of performing sentiment analysis per year from 2020 to 2024. The findings revealed the trends in sentiment related to COVID-19 on Instagram since the beginning of the pandemic. For instance, between 2020 and 2024, the sentiment trends show a notable shift, with positive sentiment decreasing from 38.35% to 28.69%, while neutral sentiment rising from 44.19% to 58.34%. Finally, the paper also presents findings of language-specific sentiment analysis. This analysis highlighted similar and contrasting trends of sentiment across posts published in different languages on Instagram. For instance, out of all English posts, 49.68% were positive, 14.84% were negative, and 35.48% were neutral. In contrast, among Hindi posts, 4.40% were positive, 57.04% were negative, and 38.56% were neutral, reflecting distinct differences in the sentiment distribution between these two languages.
Authors: Sumuk Shashidhar, Abhinav Chinta, Vaibhav Sahai, Dilek Hakkani-T\"ur
Abstract: Large language models demonstrate impressive reasoning abilities but struggle to provide personalized content due to their lack of individual user preference information. Existing methods, such as in-context learning and parameter-efficient fine-tuning, fall short in capturing the complexity of human preferences, especially given the small, personal datasets individuals possess. In this paper, we propose a novel approach utilizing small parameter models as preference agents to generate natural language rules that guide a larger, pre-trained model, enabling efficient personalization. Our method involves a small, local "steering wheel" model that directs the outputs of a much larger foundation model, producing content tailored to an individual's preferences while leveraging the extensive knowledge and capabilities of the large model. Importantly, this personalization is achieved without the need to fine-tune the large model. Experimental results on email and article datasets, demonstrate that our technique significantly outperforms baseline personalization methods. By allowing foundation models to adapt to individual preferences in a data and compute-efficient manner, our approach paves the way for highly personalized language model applications.
Authors: Ziyi Ye, Xiangsheng Li, Qiuchi Li, Qingyao Ai, Yujia Zhou, Wei Shen, Dong Yan, Yiqun Liu
Abstract: Learning from preference feedback is a common practice for aligning large language models~(LLMs) with human value. Conventionally, preference data is learned and encoded into a scalar reward model that connects a value head with an LLM to produce a scalar score as preference or reward. However, scalar models lack interpretability and are known to be susceptible to biases in datasets. This paper investigates leveraging the generation capability of LLMs to address both limitations in one shot. Specifically, we prompt the pre-trained LLM to generate positive and negative judgments, both supported with rationales in natural language form. The self-generated contrastive judgment pairs are used to train the generative judge with Direct Preference Optimization (DPO). This proposal of training the generative Judge using self-generated Contrastive judgments (Con-J) ensures natural interpretability due to the generated rationales together with the judgments, as well as high robustness against bias without the need for an additional reward head. Experimental results show that the performance of Con-J is comparable to the scalar reward model trained on the same collection of preference data, and demonstrate its superior interpretability and robustness in encoding human preferences.
Authors: Minhajur Rahman, Yasir Arafat
Abstract: Transformer models have demonstrated impressive performance in Non-Intrusive Load Monitoring (NILM) applications in recent years. Despite their success, existing studies have not thoroughly examined the impact of various hyper-parameters on model performance, which is crucial for advancing high-performing transformer models. In this work, a comprehensive series of experiments have been conducted to analyze the influence of these hyper-parameters in the context of residential NILM. This study delves into the effects of the number of hidden dimensions in the attention layer, the number of attention layers, the number of attention heads, and the dropout ratio on transformer performance. Furthermore, the role of the masking ratio has explored in BERT-style transformer training, providing a detailed investigation into its impact on NILM tasks. Based on these experiments, the optimal hyper-parameters have been selected and used them to train a transformer model, which surpasses the performance of existing models. The experimental findings offer valuable insights and guidelines for optimizing transformer architectures, aiming to enhance their effectiveness and efficiency in NILM applications. It is expected that this work will serve as a foundation for future research and development of more robust and capable transformer models for NILM.
Authors: Gang Li, Wendi Yu, Yao Yao, Wei Tong, Yingbin Liang, Qihang Lin, Tianbao Yang
Abstract: In the real world, a learning-enabled system usually undergoes multiple cycles of model development to enhance the system's ability to handle difficult or emerging tasks. This continual model development process raises a significant issue that the model development for acquiring new or improving existing capabilities may inadvertently lose capabilities of the old model, also known as catastrophic forgetting. Existing continual learning studies focus on mitigating catastrophic forgetting by trading off performance on previous tasks and new tasks to ensure good average performance. However, they are inadequate for many applications especially in safety-critical domains, as failure to strictly preserve the performance of the old model not only introduces safety risks and uncertainties but also imposes substantial expenses in the re-improving and re-validation of existing properties. To address this issue, we introduce model developmental safety as a guarantee of a learning system such that in the model development process the new model should strictly preserve the existing protected capabilities of the old model while improving its performance on target tasks. To ensure the model developmental safety, we present a safety-centric framework by formulating the model developmental safety as data-dependent constraints. Under this framework, we study how to develop a pretrained vision-language model (aka the CLIP model) for acquiring new capabilities or improving existing capabilities of image classification. We propose an efficient constrained optimization algorithm with theoretical guarantee and use its insights to finetune a CLIP model with task-dependent heads for promoting the model developmental safety. Our experiments on improving vision perception capabilities on autonomous driving and scene recognition datasets demonstrate the efficacy of the proposed approach.
Authors: Hongjun Wang, Jiyuan Chen, Zhengwei Yin, Xuan Song, Yinqiang Zheng
Abstract: Cloth-Changing Person Re-Identification (CC-ReID) involves recognizing individuals in images regardless of clothing status. In this paper, we empirically and experimentally demonstrate that completely eliminating or fully retaining clothing features is detrimental to the task. Existing work, either relying on clothing labels, silhouettes, or other auxiliary data, fundamentally aim to balance the learning of clothing and identity features. However, we practically find that achieving this balance is challenging and nuanced. In this study, we introduce a novel module called Diverse Norm, which expands personal features into orthogonal spaces and employs channel attention to separate clothing and identity features. A sample re-weighting optimization strategy is also introduced to guarantee the opposite optimization direction. Diverse Norm presents a simple yet effective approach that does not require additional data. Furthermore, Diverse Norm can be seamlessly integrated ResNet50 and significantly outperforms the state-of-the-art methods.
Authors: Jiahao Yu, Xian Wu, Hao Liu, Wenbo Guo, Xinyu Xing
Abstract: We propose BlockFound, a customized foundation model for anomaly blockchain transaction detection. Unlike existing methods that rely on rule-based systems or directly apply off-the-shelf large language models, BlockFound introduces a series of customized designs to model the unique data structure of blockchain transactions. First, a blockchain transaction is multi-modal, containing blockchain-specific tokens, texts, and numbers. We design a modularized tokenizer to handle these multi-modal inputs, balancing the information across different modalities. Second, we design a customized mask language learning mechanism for pretraining with RoPE embedding and FlashAttention for handling longer sequences. After training the foundation model, we further design a novel detection method for anomaly detection. Extensive evaluations on Ethereum and Solana transactions demonstrate BlockFound's exceptional capability in anomaly detection while maintaining a low false positive rate. Remarkably, BlockFound is the only method that successfully detects anomalous transactions on Solana with high accuracy, whereas all other approaches achieved very low or zero detection recall scores. This work not only provides new foundation models for blockchain but also sets a new benchmark for applying LLMs in blockchain data.
Authors: Ruizhe Chen, Xiaotian Zhang, Meng Luo, Wenhao Chai, Zuozhu Liu
Abstract: Aligning with personalized preferences, which vary significantly across cultural, educational, and political differences, poses a significant challenge due to the computational costs and data demands of traditional alignment methods. In response, this paper presents Personalized Alignment at Decoding-time (PAD), a novel framework designed to align LLM outputs with diverse personalized preferences during the inference phase, eliminating the need for additional training. By introducing a unique personalized reward modeling strategy, this framework decouples the text generation process from personalized preferences, facilitating the generation of generalizable token-level personalized rewards. The PAD algorithm leverages these rewards to guide the decoding process, dynamically tailoring the base model's predictions to personalized preferences. Extensive experimental results demonstrate that PAD not only outperforms existing training-based alignment methods in terms of aligning with diverse preferences but also shows significant generalizability to preferences unseen during training and scalability across different base models. This work advances the capability of LLMs to meet user needs in real-time applications, presenting a substantial step forward in personalized LLM alignment.
Authors: Zhaorui Tan, Xi Yang, Qiufeng Wang, Anh Nguyen, Kaizhu Huang
Abstract: Vision models excel in image classification but struggle to generalize to unseen data, such as classifying images from unseen domains or discovering novel categories. In this paper, we explore the relationship between logical reasoning and deep learning generalization in visual classification. A logical regularization termed L-Reg is derived which bridges a logical analysis framework to image classification. Our work reveals that L-Reg reduces the complexity of the model in terms of the feature distribution and classifier weights. Specifically, we unveil the interpretability brought by L-Reg, as it enables the model to extract the salient features, such as faces to persons, for classification. Theoretical analysis and experiments demonstrate that L-Reg enhances generalization across various scenarios, including multi-domain generalization and generalized category discovery. In complex real-world scenarios where images span unknown classes and unseen domains, L-Reg consistently improves generalization, highlighting its practical efficacy.
Authors: Dylan Zhang, Justin Wang, Francois Charton
Abstract: Understanding and accurately following instructions is critical for large language models (LLMs) to be effective across diverse tasks. In this work, we rigorously examine the key factors that enable models to generalize to unseen instructions, providing insights to guide the collection of data for instruction-tuning. Through controlled experiments, inspired by the Turing-complete Markov algorithm, we demonstrate that such generalization $\textbf{only emerges}$ when training data is diversified enough across semantic domains. Our findings also reveal that merely diversifying within limited domains fails to ensure robust generalization. In contrast, cross-domain data diversification, even under constrained data budgets, significantly enhances a model's adaptability. We further extend our analysis to real-world scenarios, including fine-tuning of $\textit{$\textbf{specialist}$}$ and $\textit{$\textbf{generalist}$}$ models. In both cases, we demonstrate that 1) better performance can be achieved by increasing the diversity of an established dataset while keeping the data size constant, and 2) when scaling up the data, diversifying the semantics of instructions is more effective than simply increasing the quantity of similar data. Our research provides important insights for dataset collation, particularly when optimizing model performance by expanding training data for both specialist and generalist scenarios. We show that careful consideration of data diversification is key: training specialist models with data extending beyond their core domain leads to significant performance improvements, while generalist models benefit from diverse data mixtures that enhance their overall instruction-following capabilities across a wide range of applications. Our results highlight the critical role of strategic diversification and offer clear guidelines for improving data quality.
Authors: Attila Lovas
Abstract: Nonlinear time series models with exogenous regressors are essential in econometrics, queuing theory, and machine learning, though their statistical analysis remains incomplete. Key results, such as the law of large numbers and the functional central limit theorem, are known for weakly dependent variables. We demonstrate the transfer of mixing properties from the exogenous regressor to the response via coupling arguments. Additionally, we study Markov chains in random environments with drift and minorization conditions, even under non-stationary environments with favorable mixing properties, and apply this framework to single-server queuing models.
Authors: Xiaogeng Liu, Peiran Li, Edward Suh, Yevgeniy Vorobeychik, Zhuoqing Mao, Somesh Jha, Patrick McDaniel, Huan Sun, Bo Li, Chaowei Xiao
Abstract: In this paper, we propose AutoDAN-Turbo, a black-box jailbreak method that can automatically discover as many jailbreak strategies as possible from scratch, without any human intervention or predefined scopes (e.g., specified candidate strategies), and use them for red-teaming. As a result, AutoDAN-Turbo can significantly outperform baseline methods, achieving a 74.3% higher average attack success rate on public benchmarks. Notably, AutoDAN-Turbo achieves an 88.5 attack success rate on GPT-4-1106-turbo. In addition, AutoDAN-Turbo is a unified framework that can incorporate existing human-designed jailbreak strategies in a plug-and-play manner. By integrating human-designed strategies, AutoDAN-Turbo can even achieve a higher attack success rate of 93.4 on GPT-4-1106-turbo.
Authors: Chang Zou, Xuyang Liu, Ting Liu, Siteng Huang, Linfeng Zhang
Abstract: Diffusion transformers have shown significant effectiveness in both image and video synthesis at the expense of huge computation costs. To address this problem, feature caching methods have been introduced to accelerate diffusion transformers by caching the features in previous timesteps and reusing them in the following timesteps. However, previous caching methods ignore that different tokens exhibit different sensitivities to feature caching, and feature caching on some tokens may lead to 10$\times$ more destruction to the overall generation quality compared with other tokens. In this paper, we introduce token-wise feature caching, allowing us to adaptively select the most suitable tokens for caching, and further enable us to apply different caching ratios to neural layers in different types and depths. Extensive experiments on PixArt-$\alpha$, OpenSora, and DiT demonstrate our effectiveness in both image and video generation with no requirements for training. For instance, 2.36$\times$ and 1.93$\times$ acceleration are achieved on OpenSora and PixArt-$\alpha$ with almost no drop in generation quality.
Authors: Adarsh Kumarappan, Mo Tiwari, Peiyang Song, Robert Joseph George, Chaowei Xiao, Anima Anandkumar
Abstract: Large Language Models (LLMs) have been successful in mathematical reasoning tasks such as formal theorem proving when integrated with interactive proof assistants like Lean. Existing approaches involve training or fine-tuning an LLM on a specific dataset to perform well on particular domains, such as undergraduate-level mathematics. These methods struggle with generalizability to advanced mathematics. A fundamental limitation is that these approaches operate on static domains, failing to capture how mathematicians often work across multiple domains and projects simultaneously or cyclically. We present LeanAgent, a novel lifelong learning framework for theorem proving that continuously generalizes to and improves on ever-expanding mathematical knowledge without forgetting previously learned knowledge. LeanAgent introduces several key innovations, including a curriculum learning strategy that optimizes the learning trajectory in terms of mathematical difficulty, a dynamic database for efficient management of evolving mathematical knowledge, and progressive training to balance stability and plasticity. LeanAgent successfully proves 162 theorems previously unproved by humans across 23 diverse Lean repositories, many from advanced mathematics. It performs up to 11$\times$ better than the static LLM baseline, proving challenging theorems in domains like abstract algebra and algebraic topology while showcasing a clear progression of learning from basic concepts to advanced topics. In addition, we analyze LeanAgent's superior performance on key lifelong learning metrics. LeanAgent achieves exceptional scores in stability and backward transfer, where learning new tasks improves performance on previously learned tasks. This emphasizes LeanAgent's continuous generalizability and improvement, explaining its superior theorem proving performance.
Authors: Krishna Aswani, Huilin Lu, Pranav Patankar, Priya Dhalwani, Iris Tan, Jayant Ganeshmohan, Simon Lacasse
Abstract: Recent advancements in prompt engineering strategies, such as Chain-of-Thought (CoT) and Self-Discover, have demonstrated significant potential in improving the reasoning abilities of Large Language Models (LLMs). However, these state-of-the-art (SOTA) prompting strategies rely on single or fixed set of static seed reasoning modules like "think step by step" or "break down this problem" intended to simulate human approach to problem-solving. This constraint limits the flexibility of models in tackling diverse problems effectively. In this paper, we introduce Auto-Evolve, a novel framework that enables LLMs to self-create dynamic reasoning modules and downstream action plan, resulting in significant improvements over current SOTA methods. We evaluate Auto-Evolve on the challenging BigBench-Hard (BBH) dataset with Claude 2.0, Claude 3 Sonnet, Mistral Large, and GPT 4, where it consistently outperforms the SOTA prompt strategies. Auto-Evolve outperforms CoT by up to 10.4% and on an average by 7% across these four models. Our framework introduces two innovations: a) Auto-Evolve dynamically generates reasoning modules for each task while aligning with human reasoning paradigm, thus eliminating the need for predefined templates. b) We introduce an iterative refinement component, that incrementally refines instruction guidance for LLMs and helps boost performance by average 2.8% compared to doing it in a single step.
Authors: Mathilde Papillon, Guillermo Bern\'ardez, Claudio Battiloro, Nina Miolane
Abstract: Graph Neural Networks (GNNs) excel in learning from relational datasets, processing node and edge features in a way that preserves the symmetries of the graph domain. However, many complex systems--such as biological or social networks--involve multiway complex interactions that are more naturally represented by higher-order topological spaces. The emerging field of Topological Deep Learning (TDL) aims to accommodate and leverage these higher-order structures. Combinatorial Complex Neural Networks (CCNNs), fairly general TDL models, have been shown to be more expressive and better performing than GNNs. However, differently from the graph deep learning ecosystem, TDL lacks a principled and standardized framework for easily defining new architectures, restricting its accessibility and applicability. To address this issue, we introduce Generalized CCNNs (GCCNs), a novel simple yet powerful family of TDL models that can be used to systematically transform any (graph) neural network into its TDL counterpart. We prove that GCCNs generalize and subsume CCNNs, while extensive experiments on a diverse class of GCCNs show that these architectures consistently match or outperform CCNNs, often with less model complexity. In an effort to accelerate and democratize TDL, we introduce TopoTune, a lightweight software that allows practitioners to define, build, and train GCCNs with unprecedented flexibility and ease.
Authors: Zonglin Yang, Wanhao Liu, Ben Gao, Tong Xie, Yuqiang Li, Wanli Ouyang, Soujanya Poria, Erik Cambria, Dongzhan Zhou
Abstract: Scientific discovery contributes largely to human society's prosperity, and recent progress shows that LLMs could potentially catalyze this process. However, it is still unclear whether LLMs can discover novel and valid hypotheses in chemistry. In this work, we investigate this central research question: Can LLMs automatically discover novel and valid chemistry research hypotheses given only a chemistry research background (consisting of a research question and/or a background survey), without limitation on the domain of the research question? After extensive discussions with chemistry experts, we propose an assumption that a majority of chemistry hypotheses can be resulted from a research background and several inspirations. With this key insight, we break the central question into three smaller fundamental questions. In brief, they are: (1) given a background question, whether LLMs can retrieve good inspirations; (2) with background and inspirations, whether LLMs can lead to hypothesis; and (3) whether LLMs can identify good hypotheses to rank them higher. To investigate these questions, we construct a benchmark consisting of 51 chemistry papers published in Nature, Science, or a similar level in 2024 (all papers are only available online since 2024). Every paper is divided by chemistry PhD students into three components: background, inspirations, and hypothesis. The goal is to rediscover the hypothesis, given only the background and a large randomly selected chemistry literature corpus consisting the ground truth inspiration papers, with LLMs trained with data up to 2023. We also develop an LLM-based multi-agent framework that leverages the assumption, consisting of three stages reflecting the three smaller questions. The proposed method can rediscover many hypotheses with very high similarity with the ground truth ones, covering the main innovations.
Authors: Christofel Rio Goenawan, Kevin Tirta Wijaya, Seung-Hyun Kong
Abstract: Point cloud completion networks are conventionally trained to minimize the disparities between the completed point cloud and the ground-truth counterpart. However, an incomplete object-level point cloud can have multiple valid completion solutions when it is examined in isolation. This one-to-many mapping issue can cause contradictory supervision signals to the network because the loss function may produce different values for identical input-output pairs of the network. In many cases, this issue could adversely affect the network optimization process. In this work, we propose to enhance the conventional learning objective using a novel completion consistency loss to mitigate the one-to-many mapping problem. Specifically, the proposed consistency loss ensure that a point cloud completion network generates a coherent completion solution for incomplete objects originating from the same source point cloud. Experimental results across multiple well-established datasets and benchmarks demonstrated the proposed completion consistency loss have excellent capability to enhance the completion performance of various existing networks without any modification to the design of the networks. The proposed consistency loss enhances the performance of the point completion network without affecting the inference speed, thereby increasing the accuracy of point cloud completion. Notably, a state-of-the-art point completion network trained with the proposed consistency loss can achieve state-of-the-art accuracy on the challenging new MVP dataset. The code and result of experiment various point completion models using proposed consistency loss will be available at: https://github.com/kaist-avelab/ConsistencyLoss .
Authors: Nan Fang, Guiliang Liu, Wei Gong
Abstract: Reinforcement Learning (RL) applied in healthcare can lead to unsafe medical decisions and treatment, such as excessive dosages or abrupt changes, often due to agents overlooking common-sense constraints. Consequently, Constrained Reinforcement Learning (CRL) is a natural choice for safe decisions. However, specifying the exact cost function is inherently difficult in healthcare. Recent Inverse Constrained Reinforcement Learning (ICRL) is a promising approach that infers constraints from expert demonstrations. ICRL algorithms model Markovian decisions in an interactive environment. These settings do not align with the practical requirement of a decision-making system in healthcare, where decisions rely on historical treatment recorded in an offline dataset. To tackle these issues, we propose the Constraint Transformer (CT). Specifically, 1) we utilize a causal attention mechanism to incorporate historical decisions and observations into the constraint modeling, while employing a Non-Markovian layer for weighted constraints to capture critical states. 2) A generative world model is used to perform exploratory data augmentation, enabling offline RL methods to simulate unsafe decision sequences. In multiple medical scenarios, empirical results demonstrate that CT can capture unsafe states and achieve strategies that approximate lower mortality rates, reducing the occurrence probability of unsafe behaviors.
Authors: Cristian Meo, Mircea Lica, Zarif Ikram, Akihiro Nakano, Vedant Shah, Aniket Rajiv Didolkar, Dianbo Liu, Anirudh Goyal, Justin Dauwels
Abstract: Deep Reinforcement Learning (RL) has become the leading approach for creating artificial agents in complex environments. Model-based approaches, which are RL methods with world models that predict environment dynamics, are among the most promising directions for improving data efficiency, forming a critical step toward bridging the gap between research and real-world deployment. In particular, world models enhance sample efficiency by learning in imagination, which involves training a generative sequence model of the environment in a self-supervised manner. Recently, Masked Generative Modelling has emerged as a more efficient and superior inductive bias for modelling and generating token sequences. Building on the Efficient Stochastic Transformer-based World Models (STORM) architecture, we replace the traditional MLP prior with a Masked Generative Prior (e.g., MaskGIT Prior) and introduce GIT-STORM. We evaluate our model on two downstream tasks: reinforcement learning and video prediction. GIT-STORM demonstrates substantial performance gains in RL tasks on the Atari 100k benchmark. Moreover, we apply Transformer-based World Models to continuous action environments for the first time, addressing a significant gap in prior research. To achieve this, we employ a state mixer function that integrates latent state representations with actions, enabling our model to handle continuous control tasks. We validate this approach through qualitative and quantitative analyses on the DeepMind Control Suite, showcasing the effectiveness of Transformer-based World Models in this new domain. Our results highlight the versatility and efficacy of the MaskGIT dynamics prior, paving the way for more accurate world models and effective RL policies.
Authors: Yuanqi Du, Michael Plainer, Rob Brekelmans, Chenru Duan, Frank No\'e, Carla P. Gomes, Al\'an Aspuru-Guzik, Kirill Neklyudov
Abstract: Rare event sampling in dynamical systems is a fundamental problem arising in the natural sciences, which poses significant computational challenges due to an exponentially large space of trajectories. For settings where the dynamical system of interest follows a Brownian motion with known drift, the question of conditioning the process to reach a given endpoint or desired rare event is definitively answered by Doob's h-transform. However, the naive estimation of this transform is infeasible, as it requires simulating sufficiently many forward trajectories to estimate rare event probabilities. In this work, we propose a variational formulation of Doob's h-transform as an optimization problem over trajectories between a given initial point and the desired ending point. To solve this optimization, we propose a simulation-free training objective with a model parameterization that imposes the desired boundary conditions by design. Our approach significantly reduces the search space over trajectories and avoids expensive trajectory simulation and inefficient importance sampling estimators which are required in existing methods. We demonstrate the ability of our method to find feasible transition paths on real-world molecular simulation and protein folding tasks.
Authors: Xin Zhang, Xiang Lyu, Zhihao Du, Qian Chen, Dong Zhang, Hangrui Hu, Chaohong Tan, Tianyu Zhao, Yuxuan Wang, Bin Zhang, Heng Lu, Yaqian Zhou, Xipeng Qiu
Abstract: Current methods of building LLMs with voice interaction capabilities rely heavily on explicit text autoregressive generation before or during speech response generation to maintain content quality, which unfortunately brings computational overhead and increases latency in multi-turn interactions. To address this, we introduce IntrinsicVoic,e an LLM designed with intrinsic real-time voice interaction capabilities. IntrinsicVoice aims to facilitate the transfer of textual capabilities of pre-trained LLMs to the speech modality by mitigating the modality gap between text and speech. Our novelty architecture, GroupFormer, can reduce speech sequences to lengths comparable to text sequences while generating high-quality audio, significantly reducing the length difference between speech and text, speeding up inference, and alleviating long-text modeling issues. Additionally, we construct a multi-turn speech-to-speech dialogue dataset named \method-500k which includes nearly 500k turns of speech-to-speech dialogues, and a cross-modality training strategy to enhance the semantic alignment between speech and text. Experimental results demonstrate that IntrinsicVoice can generate high-quality speech response with latency lower than 100ms in multi-turn dialogue scenarios. Demos are available at https://instrinsicvoice.github.io/.
Authors: Shuhe Wang, Guoyin Wang, Jiwei Li, Eduard Hovy, Chen Guo
Abstract: Packing, initially utilized in the pre-training phase, is an optimization technique designed to maximize hardware resource efficiency by combining different training sequences to fit the model's maximum input length. Although it has demonstrated effectiveness during pre-training, there remains a lack of comprehensive analysis for the supervised fine-tuning (SFT) stage on the following points: (1) whether packing can effectively enhance training efficiency while maintaining performance, (2) the suitable size of the model and dataset for fine-tuning with the packing method, and (3) whether packing unrelated or related training samples might cause the model to either excessively disregard or over-rely on the context. In this paper, we perform extensive comparisons between SFT methods using padding and packing, covering SFT datasets ranging from 69K to 1.2M and models from 8B to 70B. This provides the first comprehensive analysis of the advantages and limitations of packing versus padding, as well as practical considerations for implementing packing in various training scenarios. Our analysis covers various benchmarks, including knowledge, reasoning, and coding, as well as GPT-based evaluations, time efficiency, and other fine-tuning parameters. We also open-source our code for fine-tuning and evaluation and provide checkpoints fine-tuned on datasets of different sizes, aiming to advance future research on packing methods. Code is available at: https://github.com/ShuheWang1998/Packing-Analysis?tab=readme-ov-file.
URLs: https://github.com/ShuheWang1998/Packing-Analysis?tab=readme-ov-file.
Authors: Dom Nasrabadi
Abstract: We introduce JurEE, an ensemble of efficient, encoder-only transformer models designed to strengthen safeguards in AI-User interactions within LLM-based systems. Unlike existing LLM-as-Judge methods, which often struggle with generalization across risk taxonomies and only provide textual outputs, JurEE offers probabilistic risk estimates across a wide range of prevalent risks. Our approach leverages diverse data sources and employs progressive synthetic data generation techniques, including LLM-assisted augmentation, to enhance model robustness and performance. We create an in-house benchmark comprising of other reputable benchmarks such as the OpenAI Moderation Dataset and ToxicChat, where we find JurEE significantly outperforms baseline models, demonstrating superior accuracy, speed, and cost-efficiency. This makes it particularly suitable for applications requiring stringent content moderation, such as customer-facing chatbots. The encoder-ensemble's modular design allows users to set tailored risk thresholds, enhancing its versatility across various safety-related applications. JurEE's collective decision-making process, where each specialized encoder model contributes to the final output, not only improves predictive accuracy but also enhances interpretability. This approach provides a more efficient, performant, and economical alternative to traditional LLMs for large-scale implementations requiring robust content moderation.
Authors: Zhiwei Li, Guodong Long, Jing Jiang, Chengqi Zhang
Abstract: Federated recommendation systems are essential for providing personalized recommendations while protecting user privacy. However, current methods mainly rely on ID-based item embeddings, neglecting the rich multimodal information of items. To address this, we propose a Federated Multimodal Recommendation System, called FedMR. FedMR uses a foundation model on the server to encode multimodal item data, such as images and text. To handle data heterogeneity caused by user preference differences, FedMR introduces a Mixing Feature Fusion Module on each client, which adjusts fusion strategy weights based on user interaction history to generate personalized item representations that capture users' fine-grained preferences. FedMR is compatible with existing ID-based federated recommendation systems, improving performance without modifying the original framework. Experiments on four real-world multimodal datasets demonstrate FedMR's effectiveness. The code is available at https://anonymous.4open.science/r/FedMR.
Authors: Yupeng Ren
Abstract: With the rapid development of Large Language Models (LLMs), numerous mature applications of LLMs have emerged in the field of content safety detection. However, we have found that LLMs exhibit blind trust in safety detection agents. The general LLMs can be compromised by hackers with this vulnerability. Hence, this paper proposed an attack named Feign Agent Attack (F2A).Through such malicious forgery methods, adding fake safety detection results into the prompt, the defense mechanism of LLMs can be bypassed, thereby obtaining harmful content and hijacking the normal conversation. Continually, a series of experiments were conducted. In these experiments, the hijacking capability of F2A on LLMs was analyzed and demonstrated, exploring the fundamental reasons why LLMs blindly trust safety detection results. The experiments involved various scenarios where fake safety detection results were injected into prompts, and the responses were closely monitored to understand the extent of the vulnerability. Also, this paper provided a reasonable solution to this attack, emphasizing that it is important for LLMs to critically evaluate the results of augmented agents to prevent the generating harmful content. By doing so, the reliability and security can be significantly improved, protecting the LLMs from F2A.
Authors: Zeev Yampolsky, Itzik Klein
Abstract: Autonomous underwater vehicles (AUVs) are underwater robotic platforms used in a variety of applications. An AUV's navigation solution relies heavily on the fusion of inertial sensors and Doppler velocity logs (DVL), where the latter delivers accurate velocity updates. To ensure accurate navigation, a DVL calibration is undertaken before the mission begins to estimate its error terms. During calibration, the AUV follows a complex trajectory and employs nonlinear estimation filters to estimate error terms. In this paper, we introduce DCNet, a data-driven framework that utilizes a two-dimensional convolution kernel in an innovative way. Using DCNet and our proposed DVL error model, we offer a rapid calibration procedure. This can be applied to a trajectory with a nearly constant velocity. To train and test our proposed approach a dataset of 276 minutes long with real DVL recorded measurements was used. We demonstrated an average improvement of 70% in accuracy and 80% improvement in calibration time, compared to the baseline approach, with a low-performance DVL. As a result of those improvements, an AUV employing a low-cost DVL, can achieve higher accuracy, shorter calibration time, and apply a simple nearly constant velocity calibration trajectory. Our results also open up new applications for marine robotics utilizing low-cost, high-accurate DVLs.
Authors: Noam Razin, Sadhika Malladi, Adithya Bhaskar, Danqi Chen, Sanjeev Arora, Boris Hanin
Abstract: Direct Preference Optimization (DPO) and its variants are increasingly used for aligning language models with human preferences. Although these methods are designed to teach a model to generate preferred responses more frequently relative to dispreferred responses, prior work has observed that the likelihood of preferred responses often decreases during training. The current work sheds light on the causes and implications of this counter-intuitive phenomenon, which we term likelihood displacement. We demonstrate that likelihood displacement can be catastrophic, shifting probability mass from preferred responses to responses with an opposite meaning. As a simple example, training a model to prefer $\texttt{No}$ over $\texttt{Never}$ can sharply increase the probability of $\texttt{Yes}$. Moreover, when aligning the model to refuse unsafe prompts, we show that such displacement can unintentionally lead to unalignment, by shifting probability mass from preferred refusal responses to harmful responses (e.g., reducing the refusal rate of Llama-3-8B-Instruct from 74.4% to 33.4%). We theoretically characterize that likelihood displacement is driven by preferences that induce similar embeddings, as measured by a centered hidden embedding similarity (CHES) score. Empirically, the CHES score enables identifying which training samples contribute most to likelihood displacement in a given dataset. Filtering out these samples effectively mitigated unintentional unalignment in our experiments. More broadly, our results highlight the importance of curating data with sufficiently distinct preferences, for which we believe the CHES score may prove valuable.
Authors: Jonas Schweisthal, Dennis Frauen, Maresa Schr\"oder, Konstantin Hess, Niki Kilbertus, Stefan Feuerriegel
Abstract: Reliable estimation of treatment effects from observational data is important in many disciplines such as medicine. However, estimation is challenging when unconfoundedness as a standard assumption in the causal inference literature is violated. In this work, we leverage arbitrary (potentially high-dimensional) instruments to estimate bounds on the conditional average treatment effect (CATE). Our contributions are three-fold: (1) We propose a novel approach for partial identification through a mapping of instruments to a discrete representation space so that we yield valid bounds on the CATE. This is crucial for reliable decision-making in real-world applications. (2) We derive a two-step procedure that learns tight bounds using a tailored neural partitioning of the latent instrument space. As a result, we avoid instability issues due to numerical approximations or adversarial training. Furthermore, our procedure aims to reduce the estimation variance in finite-sample settings to yield more reliable estimates. (3) We show theoretically that our procedure obtains valid bounds while reducing estimation variance. We further perform extensive experiments to demonstrate the effectiveness across various settings. Overall, our procedure offers a novel path for practitioners to make use of potentially high-dimensional instruments (e.g., as in Mendelian randomization).
Authors: Maksym Andriushchenko, Alexandra Souly, Mateusz Dziemian, Derek Duenas, Maxwell Lin, Justin Wang, Dan Hendrycks, Andy Zou, Zico Kolter, Matt Fredrikson, Eric Winsor, Jerome Wynne, Yarin Gal, Xander Davies
Abstract: The robustness of LLMs to jailbreak attacks, where users design prompts to circumvent safety measures and misuse model capabilities, has been studied primarily for LLMs acting as simple chatbots. Meanwhile, LLM agents -- which use external tools and can execute multi-stage tasks -- may pose a greater risk if misused, but their robustness remains underexplored. To facilitate research on LLM agent misuse, we propose a new benchmark called AgentHarm. The benchmark includes a diverse set of 110 explicitly malicious agent tasks (440 with augmentations), covering 11 harm categories including fraud, cybercrime, and harassment. In addition to measuring whether models refuse harmful agentic requests, scoring well on AgentHarm requires jailbroken agents to maintain their capabilities following an attack to complete a multi-step task. We evaluate a range of leading LLMs, and find (1) leading LLMs are surprisingly compliant with malicious agent requests without jailbreaking, (2) simple universal jailbreak templates can be adapted to effectively jailbreak agents, and (3) these jailbreaks enable coherent and malicious multi-step agent behavior and retain model capabilities. To enable simple and reliable evaluation of attacks and defenses for LLM-based agents, we publicly release AgentHarm at https://huggingface.co/datasets/ai-safety-institute/AgentHarm.
URLs: https://huggingface.co/datasets/ai-safety-institute/AgentHarm.
Authors: Justin Wong, Yury Orlovskiy, Michael Luo, Sanjit A. Seshia, Joseph E. Gonzalez
Abstract: Generating diverse responses from large language models (LLMs) is crucial for applications such as planning/search and synthetic data generation, where diversity provides distinct answers across generations. Prior approaches rely on increasing temperature to increase diversity. However, contrary to popular belief, we show not only does this approach produce lower quality individual generations as temperature increases, but it depends on model's next-token probabilities being similar to the true distribution of answers. We propose SimpleStrat, an alternative approach that uses the language model itself to partition the space into strata. At inference, a random stratum is selected and a sample drawn from within the strata. To measure diversity, we introduce CoverageQA, a dataset of underspecified questions with multiple equally plausible answers, and assess diversity by measuring KL Divergence between the output distribution and uniform distribution over valid ground truth answers. As computing probability per response/solution for proprietary models is infeasible, we measure recall on ground truth solutions. Our evaluation show using SimpleStrat achieves higher recall by 0.05 compared to GPT-4o and 0.36 average reduction in KL Divergence compared to Llama 3.