new Unmasking Conversational Bias in AI Multiagent Systems

Authors: Erica Coppolillo, Giuseppe Manco, Luca Maria Aiello

Abstract: Detecting biases in the outputs produced by generative models is essential to reduce the potential risks associated with their application in critical settings. However, the majority of existing methodologies for identifying biases in generated text consider the models in isolation and neglect their contextual applications. Specifically, the biases that may arise in multi-agent systems involving generative models remain under-researched. To address this gap, we present a framework designed to quantify biases within multi-agent systems of conversational Large Language Models (LLMs). Our approach involves simulating small echo chambers, where pairs of LLMs, initialized with aligned perspectives on a polarizing topic, engage in discussions. Contrary to expectations, we observe significant shifts in the stance expressed in the generated messages, particularly within echo chambers where all agents initially express conservative viewpoints, in line with the well-documented political bias of many LLMs toward liberal positions. Crucially, the bias observed in the echo-chamber experiment remains undetected by current state-of-the-art bias detection methods that rely on questionnaires. This highlights a critical need for the development of a more sophisticated toolkit for bias detection and mitigation for AI multi-agent systems. The code to perform the experiments is publicly available at https://anonymous.4open.science/r/LLMsConversationalBias-7725.

URLs: https://anonymous.4open.science/r/LLMsConversationalBias-7725.

new On the locality bias and results in the Long Range Arena

Authors: Pablo Miralles-Gonz\'alez, Javier Huertas-Tato, Alejandro Mart\'in, David Camacho

Abstract: The Long Range Arena (LRA) benchmark was designed to evaluate the performance of Transformer improvements and alternatives in long-range dependency modeling tasks. The Transformer and its main variants performed poorly on this benchmark, and a new series of architectures such as State Space Models (SSMs) gained some traction, greatly outperforming Transformers in the LRA. Recent work has shown that with a denoising pre-training phase, Transformers can achieve competitive results in the LRA with these new architectures. In this work, we discuss and explain the superiority of architectures such as MEGA and SSMs in the Long Range Arena, as well as the recent improvement in the results of Transformers, pointing to the positional and local nature of the tasks. We show that while the LRA is a benchmark for long-range dependency modeling, in reality most of the performance comes from short-range dependencies. Using training techniques to mitigate data inefficiency, Transformers are able to reach state-of-the-art performance with proper positional encoding. In addition, with the same techniques, we were able to remove all restrictions from SSM convolutional kernels and learn fully parameterized convolutions without decreasing performance, suggesting that the design choices behind SSMs simply added inductive biases and learning efficiency for these particular tasks. Our insights indicate that LRA results should be interpreted with caution and call for a redesign of the benchmark.

new JustLogic: A Comprehensive Benchmark for Evaluating Deductive Reasoning in Large Language Models

Authors: Michael K. Chen, Xikun Zhang, Dacheng Tao

Abstract: Logical reasoning is a critical component of Large Language Models (LLMs), and substantial research efforts in recent years have aimed to enhance their deductive reasoning capabilities. However, existing deductive reasoning benchmarks, which are crucial for evaluating and advancing LLMs, are inadequate due to their lack of task complexity, presence of prior knowledge as a confounder, and superficial error analysis. To address these deficiencies, we introduce JustLogic, a synthetically generated deductive reasoning benchmark designed for rigorous evaluation of LLMs. JustLogic is (i) highly complex, capable of generating a diverse range of linguistic patterns, vocabulary, and argument structures; (ii) prior knowledge independent, eliminating the advantage of models possessing prior knowledge and ensuring that only deductive reasoning is used to answer questions; and (iii) capable of in-depth error analysis on the heterogeneous effects of reasoning depth and argument form on model accuracy. Our experimental results on JustLogic reveal that most state-of-the-art (SOTA) LLMs perform significantly worse than the human average, demonstrating substantial room for model improvement. All code and data are available at https://github.com/michaelchen-lab/JustLogic

URLs: https://github.com/michaelchen-lab/JustLogic

new Dynamic Adaptation of LoRA Fine-Tuning for Efficient and Task-Specific Optimization of Large Language Models

Authors: Xiaoxuan Liao, Chihang Wang, Shicheng Zhou, Jiacheng Hu, Hongye Zheng, Jia Gao

Abstract: This paper presents a novel methodology of fine-tuning for large language models-dynamic LoRA. Building from the standard Low-Rank Adaptation framework, this methodology further adds dynamic adaptation mechanisms to improve efficiency and performance. The key contribution of dynamic LoRA lies within its adaptive weight allocation mechanism coupled with an input feature-based adaptive strategy. These enhancements allow for a more precise fine-tuning process that is more tailored to specific tasks. Traditional LoRA methods use static adapter settings, not considering the different importance of model layers. In contrast, dynamic LoRA introduces a mechanism that dynamically evaluates the layer's importance during fine-tuning. This evaluation enables the reallocation of adapter parameters to fit the unique demands of each individual task, which leads to better optimization results. Another gain in flexibility arises from the consideration of the input feature distribution, which helps the model generalize better when faced with complicated and diverse datasets. The joint approach boosts not only the performance over each single task but also the generalization ability of the model. The efficiency of the dynamic LoRA was validated in experiments on benchmark datasets, such as GLUE, with surprising results. More specifically, this method achieved 88.1% accuracy with an F1-score of 87.3%. Noticeably, these improvements were made at a slight increase in computational costs: only 0.1% more resources than standard LoRA. This balance between performance and efficiency positions dynamic LoRA as a practical, scalable solution for fine-tuning LLMs, especially in resource-constrained scenarios. To take it a step further, its adaptability makes it a promising foundation for much more advanced applications, including multimodal tasks.

new DrawEduMath: Evaluating Vision Language Models with Expert-Annotated Students' Hand-Drawn Math Images

Authors: Sami Baral, Li Lucy, Ryan Knight, Alice Ng, Luca Soldaini, Neil T. Heffernan, Kyle Lo

Abstract: In real-world settings, vision language models (VLMs) should robustly handle naturalistic, noisy visual content as well as domain-specific language and concepts. For example, K-12 educators using digital learning platforms may need to examine and provide feedback across many images of students' math work. To assess the potential of VLMs to support educators in settings like this one, we introduce DrawEduMath, an English-language dataset of 2,030 images of students' handwritten responses to K-12 math problems. Teachers provided detailed annotations, including free-form descriptions of each image and 11,661 question-answer (QA) pairs. These annotations capture a wealth of pedagogical insights, ranging from students' problem-solving strategies to the composition of their drawings, diagrams, and writing. We evaluate VLMs on teachers' QA pairs, as well as 44,362 synthetic QA pairs derived from teachers' descriptions using language models (LMs). We show that even state-of-the-art VLMs leave much room for improvement on DrawEduMath questions. We also find that synthetic QAs, though imperfect, can yield similar model rankings as teacher-written QAs. We release DrawEduMath to support the evaluation of VLMs' abilities to reason mathematically over images gathered with educational contexts in mind.

new Verify with Caution: The Pitfalls of Relying on Imperfect Factuality Metrics

Authors: Ameya Godbole, Robin Jia

Abstract: Improvements in large language models have led to increasing optimism that they can serve as reliable evaluators of natural language generation outputs. In this paper, we challenge this optimism by thoroughly re-evaluating five state-of-the-art factuality metrics on a collection of 11 datasets for summarization, retrieval-augmented generation, and question answering. We find that these evaluators are inconsistent with each other and often misestimate system-level performance, both of which can lead to a variety of pitfalls. We further show that these metrics exhibit biases against highly paraphrased outputs and outputs that draw upon faraway parts of the source documents. We urge users of these factuality metrics to proceed with caution and manually validate the reliability of these metrics in their domain of interest before proceeding.

new Self-reflecting Large Language Models: A Hegelian Dialectical Approach

Authors: Sara Abdali, Can Goksen, Saeed Amizadeh andKazuhito Koishida

Abstract: Investigating NLP through a philosophical lens has recently caught researcher's eyes as it connects computational methods with classical schools of philosophy. This paper introduces a philosophical approach inspired by the Hegelian Dialectic for LLMs' self-reflection, utilizing a self-dialectical approach to emulate internal critiques and then synthesize new ideas by resolving the contradicting points. Moreover, this paper investigates the effect of LLMs' temperature for generation by establishing a dynamic annealing approach, which promotes the creativity in the early stages and gradually refines it by focusing on the nuances, as well as a fixed temperature strategy for generation. Our proposed approach is examined to determine its ability to generate novel ideas from an initial proposition. Additionally, a Multi Agent Majority Voting (MAMV) strategy is leveraged to assess the validity and novelty of the generated ideas, which proves beneficial in the absence of domain experts. Our experiments show promise in generating new ideas and provide a stepping-stone for future research.

new Context-Aware Neural Gradient Mapping for Fine-Grained Instruction Processing

Authors: David Boldo, Lily Pemberton, Gabriel Thistledown, Jacob Fairchild, Felix Kowalski

Abstract: The integration of contextual embeddings into the optimization processes of large language models is an advancement in natural language processing. The Context-Aware Neural Gradient Mapping framework introduces a dynamic gradient adjustment mechanism, incorporating contextual embeddings directly into the optimization process. This approach facilitates real-time parameter adjustments, enhancing task-specific generalization even in the presence of sparse or noisy data inputs. The mathematical foundation of this framework relies on gradient descent modifications, where contextual embeddings are derived from a supplementary neural network trained to map input features to optimal adaptation gradients. By employing differential geometry principles, high-dimensional input dependencies are encoded into low-dimensional gradient manifolds, enabling efficient adaptation without necessitating the retraining of the entire model. Empirical evaluations demonstrate that the proposed framework consistently outperforms baseline models across various metrics, including accuracy, robustness to noise, and computational efficiency. The integration of context-specific embeddings allows for a more complex understanding of language, thereby improving the model's ability to handle diverse linguistic phenomena. Furthermore, the computational efficiency achieved through this method demonstrates its scalability for large-scale language models operating under diverse constraints.

new CASE-Bench: Context-Aware Safety Evaluation Benchmark for Large Language Models

Authors: Guangzhi Sun, Xiao Zhan, Shutong Feng, Philip C. Woodland, Jose Such

Abstract: Aligning large language models (LLMs) with human values is essential for their safe deployment and widespread adoption. Current LLM safety benchmarks often focus solely on the refusal of individual problematic queries, which overlooks the importance of the context where the query occurs and may cause undesired refusal of queries under safe contexts that diminish user experience. Addressing this gap, we introduce CASE-Bench, a Context-Aware Safety Evaluation Benchmark that integrates context into safety assessments of LLMs. CASE-Bench assigns distinct, formally described contexts to categorized queries based on Contextual Integrity theory. Additionally, in contrast to previous studies which mainly rely on majority voting from just a few annotators, we recruited a sufficient number of annotators necessary to ensure the detection of statistically significant differences among the experimental conditions based on power analysis. Our extensive analysis using CASE-Bench on various open-source and commercial LLMs reveals a substantial and significant influence of context on human judgments (p<0.0001 from a z-test), underscoring the necessity of context in safety evaluations. We also identify notable mismatches between human judgments and LLM responses, particularly in commercial models within safe contexts.

new ExPerT: Effective and Explainable Evaluation of Personalized Long-Form Text Generation

Authors: Alireza Salemi, Julian Killingback, Hamed Zamani

Abstract: Evaluating personalized text generated by large language models (LLMs) is challenging, as only the LLM user, i.e., prompt author, can reliably assess the output, but re-engaging the same individuals across studies is infeasible. This paper addresses the challenge of evaluating personalized text generation by introducing ExPerT, an explainable reference-based evaluation framework. ExPerT leverages an LLM to extract atomic aspects and their evidence from the generated and reference texts, match the aspects, and evaluate their alignment based on content and writing style -- two key attributes in personalized text generation. Additionally, ExPerT generates detailed, fine-grained explanations for every step of the evaluation process, enhancing transparency and interpretability. Our experiments demonstrate that ExPerT achieves a 7.2% relative improvement in alignment with human judgments compared to the state-of-the-art text generation evaluation methods. Furthermore, human evaluators rated the usability of ExPerT's explanations at 4.7 out of 5, highlighting its effectiveness in making evaluation decisions more interpretable.

new A review of annotation classification tools in the educational domain

Authors: Joaqu\'in Gayoso-Cabada, Antonio Sarasa-Cabezuelo, Jos\'e-Luis Sierra

Abstract: An annotation consists of a portion of information that is associated with a piece of content in order to explain something about the content or to add more information. The use of annotations as a tool in the educational field has positive effects on the learning process. The usual way to use this instrument is to provide students with contents, usually textual, with which they must associate annotations. In most cases this task is performed in groups of students who work collaboratively. This process encourages analysis and understanding of the contents since they have to understand them in order to annotate them, and also encourages teamwork. To facilitate its use, computer applications have been devel-oped in recent decades that implement the annotation process and offer a set of additional functionalities. One of these functionalities is the classification of the annotations made. This functionality can be exploited in various ways in the learning process, such as guiding the students in the annotation process, providing information to the student about how the annotation process is done and to the teacher about how the students write and how they understand the content, as well as implementing other innovative educational processes. In this sense, the classification of annotations plays a critical role in the application of the annotation in the educational field. There are many studies of annotations, but most of them consider the classification aspect marginally only. This paper presents an initial study of the classification mech-anisms used in the annotation tools, identifying four types of cases: absence of classification mechanisms, classification based on pre-established vocabularies, classification based on extensible vocabularies, and classification based on struc-tured vocabularies.

new The Muddy Waters of Modeling Empathy in Language: The Practical Impacts of Theoretical Constructs

Authors: Allison Lahnala, Charles Welch, David Jurgens, Lucie Flek

Abstract: Conceptual operationalizations of empathy in NLP are varied, with some having specific behaviors and properties, while others are more abstract. How these variations relate to one another and capture properties of empathy observable in text remains unclear. To provide insight into this, we analyze the transfer performance of empathy models adapted to empathy tasks with different theoretical groundings. We study (1) the dimensionality of empathy definitions, (2) the correspondence between the defined dimensions and measured/observed properties, and (3) the conduciveness of the data to represent them, finding they have a significant impact to performance compared to other transfer setting features. Characterizing the theoretical grounding of empathy tasks as direct, abstract, or adjacent further indicates that tasks that directly predict specified empathy components have higher transferability. Our work provides empirical evidence for the need for precise and multidimensional empathy operationalizations.

new Federated Retrieval Augmented Generation for Multi-Product Question Answering

Authors: Parshin Shojaee, Sai Sree Harsha, Dan Luo, Akash Maharaj, Tong Yu, Yunyao Li

Abstract: Recent advancements in Large Language Models and Retrieval-Augmented Generation have boosted interest in domain-specific question-answering for enterprise products. However, AI Assistants often face challenges in multi-product QA settings, requiring accurate responses across diverse domains. Existing multi-domain RAG-QA approaches either query all domains indiscriminately, increasing computational costs and LLM hallucinations, or rely on rigid resource selection, which can limit search results. We introduce MKP-QA, a novel multi-product knowledge-augmented QA framework with probabilistic federated search across domains and relevant knowledge. This method enhances multi-domain search quality by aggregating query-domain and query-passage probabilistic relevance. To address the lack of suitable benchmarks for multi-product QAs, we also present new datasets focused on three Adobe products: Adobe Experience Platform, Target, and Customer Journey Analytics. Our experiments show that MKP-QA significantly boosts multi-product RAG-QA performance in terms of both retrieval accuracy and response quality.

new MDEval: Evaluating and Enhancing Markdown Awareness in Large Language Models

Authors: Zhongpu Chen, Yinfeng Liu, Long Shi, Zhi-Jie Wang, Xingyan Chen, Yu Zhao, Fuji Ren

Abstract: Large language models (LLMs) are expected to offer structured Markdown responses for the sake of readability in web chatbots (e.g., ChatGPT). Although there are a myriad of metrics to evaluate LLMs, they fail to evaluate the readability from the view of output content structure. To this end, we focus on an overlooked yet important metric -- Markdown Awareness, which directly impacts the readability and structure of the content generated by these language models. In this paper, we introduce MDEval, a comprehensive benchmark to assess Markdown Awareness for LLMs, by constructing a dataset with 20K instances covering 10 subjects in English and Chinese. Unlike traditional model-based evaluations, MDEval provides excellent interpretability by combining model-based generation tasks and statistical methods. Our results demonstrate that MDEval achieves a Spearman correlation of 0.791 and an accuracy of 84.1% with human, outperforming existing methods by a large margin. Extensive experimental results also show that through fine-tuning over our proposed dataset, less performant open-source models are able to achieve comparable performance to GPT-4o in terms of Markdown Awareness. To ensure reproducibility and transparency, MDEval is open sourced at https://github.com/SWUFE-DB-Group/MDEval-Benchmark.

URLs: https://github.com/SWUFE-DB-Group/MDEval-Benchmark.

new AKVQ-VL: Attention-Aware KV Cache Adaptive 2-Bit Quantization for Vision-Language Models

Authors: Zunhai Su, Wang Shen, Linge Li, Zhe Chen, Hanyu Wei, Huangqi Yu, Kehong Yuan

Abstract: Vision-language models (VLMs) show remarkable performance in multimodal tasks. However, excessively long multimodal inputs lead to oversized Key-Value (KV) caches, resulting in significant memory consumption and I/O bottlenecks. Previous KV quantization methods for Large Language Models (LLMs) may alleviate these issues but overlook the attention saliency differences of multimodal tokens, resulting in suboptimal performance. In this paper, we investigate the attention-aware token saliency patterns in VLM and propose AKVQ-VL. AKVQ-VL leverages the proposed Text-Salient Attention (TSA) and Pivot-Token-Salient Attention (PSA) patterns to adaptively allocate bit budgets. Moreover, achieving extremely low-bit quantization requires effectively addressing outliers in KV tensors. AKVQ-VL utilizes the Walsh-Hadamard transform (WHT) to construct outlier-free KV caches, thereby reducing quantization difficulty. Evaluations of 2-bit quantization on 12 long-context and multimodal tasks demonstrate that AKVQ-VL maintains or even improves accuracy, outperforming LLM-oriented methods. AKVQ-VL can reduce peak memory usage by 2.13x, support up to 3.25x larger batch sizes and 2.46x throughput.

new Using Large Language Models for education managements in Vietnamese with low resources

Authors: Duc Do Minh, Vinh Nguyen Van, Thang Dam Cong

Abstract: Large language models (LLMs), such as GPT-4, Gemini 1.5, Claude 3.5 Sonnet, and Llama3, have demonstrated significant advancements in various NLP tasks since the release of ChatGPT in 2022. Despite their success, fine-tuning and deploying LLMs remain computationally expensive, especially in resource-constrained environments. In this paper, we proposed VietEduFrame, a framework specifically designed to apply LLMs to educational management tasks in Vietnamese institutions. Our key contribution includes the development of a tailored dataset, derived from student education documents at Hanoi VNU, which addresses the unique challenges faced by educational systems with limited resources. Through extensive experiments, we show that our approach outperforms existing methods in terms of accuracy and efficiency, offering a promising solution for improving educational management in under-resourced environments. While our framework leverages synthetic data to supplement real-world examples, we discuss potential limitations regarding broader applicability and robustness in future implementations.

new SCCD: A Session-based Dataset for Chinese Cyberbullying Detection

Authors: Qingpo Yang, Yakai Chen, Zihui Xu, Yu-ming Shang, Sanchuan Guo, Xi Zhang

Abstract: The rampant spread of cyberbullying content poses a growing threat to societal well-being. However, research on cyberbullying detection in Chinese remains underdeveloped, primarily due to the lack of comprehensive and reliable datasets. Notably, no existing Chinese dataset is specifically tailored for cyberbullying detection. Moreover, while comments play a crucial role within sessions, current session-based datasets often lack detailed, fine-grained annotations at the comment level. To address these limitations, we present a novel Chinese cyber-bullying dataset, termed SCCD, which consists of 677 session-level samples sourced from a major social media platform Weibo. Moreover, each comment within the sessions is annotated with fine-grained labels rather than conventional binary class labels. Empirically, we evaluate the performance of various baseline methods on SCCD, highlighting the challenges for effective Chinese cyberbullying detection.

new Abstractive Text Summarization for Bangla Language Using NLP and Machine Learning Approaches

Authors: Asif Ahammad Miazee, Tonmoy Roy, Md Robiul Islam, Yeamin Safat

Abstract: Text summarization involves reducing extensive documents to short sentences that encapsulate the essential ideas. The goal is to create a summary that effectively conveys the main points of the original text. We spend a significant amount of time each day reading the newspaper to stay informed about current events both domestically and internationally. While reading newspapers enriches our knowledge, we sometimes come across unnecessary content that isn't particularly relevant to our lives. In this paper, we introduce a neural network model designed to summarize Bangla text into concise and straightforward paragraphs, aiming for greater stability and efficiency.

new An Attempt to Unraveling Token Prediction Refinement and Identifying Essential Layers of Large Language Models

Authors: Jaturong Kongmanee

Abstract: This research aims to unravel how large language models (LLMs) iteratively refine token predictions (or, in a general sense, vector predictions). We utilized a logit lens technique to analyze the model's token predictions derived from intermediate representations. Specifically, we focused on how LLMs access and use information from input contexts, and how positioning of relevant information affects the model's token prediction refinement process. Our findings for multi-document question answering task, by varying input context lengths (the number of documents), using GPT-2, revealed that the number of layers between the first layer that the model predicted next tokens correctly and the later layers that the model finalized its correct predictions, as a function of the position of relevant information (i.e., placing the relevant one at the beginning, middle, or end of the input context), has a nearly inverted U shape. We found that the gap between these two layers, on average, diminishes when relevant information is positioned at the beginning or end of the input context, suggesting that the model requires more refinements when processing longer contexts with relevant information situated in the middle, and highlighting which layers are essential for determining the correct output. Our analysis provides insights about how token predictions are distributed across different conditions, and establishes important connections to existing hypotheses and previous findings in AI safety research and development.

new Cross-modal Context Fusion and Adaptive Graph Convolutional Network for Multimodal Conversational Emotion Recognition

Authors: Junwei Feng, Xueyan Fan

Abstract: Emotion recognition has a wide range of applications in human-computer interaction, marketing, healthcare, and other fields. In recent years, the development of deep learning technology has provided new methods for emotion recognition. Prior to this, many emotion recognition methods have been proposed, including multimodal emotion recognition methods, but these methods ignore the mutual interference between different input modalities and pay little attention to the directional dialogue between speakers. Therefore, this article proposes a new multimodal emotion recognition method, including a cross modal context fusion module, an adaptive graph convolutional encoding module, and an emotion classification module. The cross modal context module includes a cross modal alignment module and a context fusion module, which are used to reduce the noise introduced by mutual interference between different input modalities. The adaptive graph convolution module constructs a dialogue relationship graph for extracting dependencies and self dependencies between speakers. Our model has surpassed some state-of-the-art methods on publicly available benchmark datasets and achieved high recognition accuracy.

new LongReason: A Synthetic Long-Context Reasoning Benchmark via Context Expansion

Authors: Zhan Ling, Kang Liu, Kai Yan, Yifan Yang, Weijian Lin, Ting-Han Fan, Lingfeng Shen, Zhengyin Du, Jiecao Chen

Abstract: Large language models (LLMs) have demonstrated remarkable progress in understanding long-context inputs. However, benchmarks for evaluating the long-context reasoning abilities of LLMs fall behind the pace. Existing benchmarks often focus on a narrow range of tasks or those that do not demand complex reasoning. To address this gap and enable a more comprehensive evaluation of the long-context reasoning capabilities of current LLMs, we propose a new synthetic benchmark, LongReason, which is constructed by synthesizing long-context reasoning questions from a varied set of short-context reasoning questions through context expansion. LongReason consists of 794 multiple-choice reasoning questions with diverse reasoning patterns across three task categories: reading comprehension, logical inference, and mathematical word problems. We evaluate 21 LLMs on LongReason, revealing that most models experience significant performance drops as context length increases. Our further analysis shows that even state-of-the-art LLMs still have significant room for improvement in providing robust reasoning across different tasks. We will open-source LongReason to support the comprehensive evaluation of LLMs' long-context reasoning capabilities.

new Speech Translation Refinement using Large Language Models

Authors: Huaixia Dou, Xinyu Tian, Xinglin Lyu, Jie Zhu, Junhui Li, Lifan Guo

Abstract: Recent advancements in large language models (LLMs) have demonstrated their remarkable capabilities across various language tasks. Inspired by the success of text-to-text translation refinement, this paper investigates how LLMs can improve the performance of speech translation by introducing a joint refinement process. Through the joint refinement of speech translation (ST) and automatic speech recognition (ASR) transcription via LLMs, the performance of the ST model is significantly improved in both training-free in-context learning and parameter-efficient fine-tuning scenarios. Additionally, we explore the effect of document-level context on refinement under the context-aware fine-tuning scenario. Experimental results on the MuST-C and CoVoST 2 datasets, which include seven translation tasks, demonstrate the effectiveness of the proposed approach using several popular LLMs including GPT-3.5-turbo, LLaMA3-8B, and Mistral-12B. Further analysis further suggests that jointly refining both transcription and translation yields better performance compared to refining translation alone. Meanwhile, incorporating document-level context significantly enhances refinement performance. We release our code and datasets on GitHub.

new Knowledge Hierarchy Guided Biological-Medical Dataset Distillation for Domain LLM Training

Authors: Xunxin Cai, Chengrui Wang, Qingqing Long, Yuanchun Zhou, Meng Xiao

Abstract: The rapid advancement of large language models (LLMs) in biological-medical applications has highlighted a gap between their potential and the limited scale and often low quality of available open-source annotated textual datasets. In addition, the inherent complexity of the biomedical knowledge hierarchy significantly hampers efforts to bridge this gap.Can LLMs themselves play a pivotal role in overcoming this limitation? Motivated by this question, we investigate this challenge in the present study.We propose a framework that automates the distillation of high-quality textual training data from the extensive scientific literature. Our approach self-evaluates and generates questions that are more closely aligned with the biomedical domain, guided by the biomedical knowledge hierarchy through medical subject headings (MeSH). This comprehensive framework establishes an automated workflow, thereby eliminating the need for manual intervention. Furthermore, we conducted comprehensive experiments to evaluate the impact of our framework-generated data on downstream language models of varying sizes. Our approach substantially improves question-answering tasks compared to pre-trained models from the life sciences domain and powerful close-source models represented by GPT-4. Notably, the generated AI-Ready dataset enabled the Llama3-70B base model to outperform GPT-4 using MedPrompt with multiple times the number of parameters. Detailed case studies and ablation experiments underscore the significance of each component within our framework

new Task-KV: Task-aware KV Cache Optimization via Semantic Differentiation of Attention Heads

Authors: Xingyang He, Jie Liu, Shaowei Chen

Abstract: KV cache is a widely used acceleration technique for large language models (LLMs) inference. However, its memory requirement grows rapidly with input length. Previous studies have reduced the size of KV cache by either removing the same number of unimportant tokens for all attention heads or by allocating differentiated KV cache budgets for pre-identified attention heads. However, due to the importance of attention heads varies across different tasks, the pre-identified attention heads fail to adapt effectively to various downstream tasks. To address this issue, we propose Task-KV, a method that leverages the semantic differentiation of attention heads to allocate differentiated KV cache budgets across various tasks. We demonstrate that attention heads far from the semantic center (called heterogeneous heads) make an significant contribution to task outputs and semantic understanding. In contrast, other attention heads play the role of aggregating important information and focusing reasoning. Task-KV allocates full KV cache budget to heterogeneous heads to preserve comprehensive semantic information, while reserving a small number of recent tokens and attention sinks for non-heterogeneous heads. Furthermore, we innovatively introduce middle activations to preserve key contextual information aggregated from non-heterogeneous heads. To dynamically perceive semantic differences among attention heads, we design a semantic separator to distinguish heterogeneous heads from non-heterogeneous ones based on their distances from the semantic center. Experimental results on multiple benchmarks and different model architectures demonstrate that Task-KV significantly outperforms existing baseline methods.

new Option-ID Based Elimination For Multiple Choice Questions

Authors: Zhenhao Zhu, Bulou Liu

Abstract: Multiple choice questions (MCQs) are a common and important task for evaluating large language models (LLMs). Based on common strategies humans use when answering MCQs, the process of elimination has been proposed as an effective problem-solving method. Existing methods to the process of elimination generally fall into two categories: one involves having the model directly select the incorrect answer, while the other involves scoring the options. However, both methods incur high computational costs and often perform worse than methods that answer based on option ID. To address this issue, this paper proposes a process of elimination based on option ID. We select 10 LLMs and conduct zero-shot experiments on 7 different datasets. The experimental results demonstrate that our method significantly improves the model's performance. Further analysis reveals that the sequential elimination strategy can effectively enhance the model's reasoning ability. Additionally, we find that sequential elimination is also applicable to few-shot settings and can be combined with debias methods to further improve model performance.

new Who is the root in a syntactic dependency structure?

Authors: Ramon Ferrer-i-Cancho, Marta Arias

Abstract: The syntactic structure of a sentence can be described as a tree that indicates the syntactic relationships between words. In spite of significant progress in unsupervised methods that retrieve the syntactic structure of sentences, guessing the right direction of edges is still a challenge. As in a syntactic dependency structure edges are oriented away from the root, the challenge of guessing the right direction can be reduced to finding an undirected tree and the root. The limited performance of current unsupervised methods demonstrates the lack of a proper understanding of what a root vertex is from first principles. We consider an ensemble of centrality scores, some that only take into account the free tree (non-spatial scores) and others that take into account the position of vertices (spatial scores). We test the hypothesis that the root vertex is an important or central vertex of the syntactic dependency structure. We confirm that hypothesis and find that the best performance in guessing the root is achieved by novel scores that only take into account the position of a vertex and that of its neighbours. We provide theoretical and empirical foundations towards a universal notion of rootness from a network science perspective.

new Faster Machine Translation Ensembling with Reinforcement Learning and Competitive Correction

Authors: Kritarth Prasad, Mohammadi Zaki, Pratik Singh, Pankaj Wasnik

Abstract: Ensembling neural machine translation (NMT) models to produce higher-quality translations than the $L$ individual models has been extensively studied. Recent methods typically employ a candidate selection block (CSB) and an encoder-decoder fusion block (FB), requiring inference across \textit{all} candidate models, leading to significant computational overhead, generally $\Omega(L)$. This paper introduces \textbf{SmartGen}, a reinforcement learning (RL)-based strategy that improves the CSB by selecting a small, fixed number of candidates and identifying optimal groups to pass to the fusion block for each input sentence. Furthermore, previously, the CSB and FB were trained independently, leading to suboptimal NMT performance. Our DQN-based \textbf{SmartGen} addresses this by using feedback from the FB block as a reward during training. We also resolve a key issue in earlier methods, where candidates were passed to the FB without modification, by introducing a Competitive Correction Block (CCB). Finally, we validate our approach with extensive experiments on English-Hindi translation tasks in both directions.

new SEAL: Scaling to Emphasize Attention for Long-Context Retrieval

Authors: Changhun Lee, Jun-gyu Jin, Younghyun Cho, Eunhyeok Park

Abstract: In this work, we introduce a novel approach called Scaling to Emphasize Attention for Long-context retrieval (SEAL), which enhances the retrieval performance of large language models (LLMs) over extended contexts. Previous studies have shown that each attention head in LLMs has a unique functionality and collectively contributes to the overall behavior of the model. Similarly, we observe that specific heads are closely tied to long-context retrieval, showing positive or negative correlation with retrieval scores. Built on this insight, we propose a learning-based mechanism using zero-shot generated data to emphasize these heads, improving the model's performance in long-context retrieval tasks. By applying SEAL, we can achieve significant improvements in in-domain retrieval performance, including document QA tasks from LongBench, and considerable improvements in out-of-domain cases. Additionally, when combined with existing training-free context extension techniques, SEAL extends the context limits of LLMs while maintaining highly reliable outputs, opening new avenues for research in this field.

new Improving Retrieval-Augmented Generation through Multi-Agent Reinforcement Learning

Authors: Yiqun Chen, Lingyong Yan, Weiwei Sun, Xinyu Ma, Yi Zhang, Shuaiqiang Wang, Dawei Yin, Yiming Yang, Jiaxin Mao

Abstract: Retrieval-augmented generation (RAG) is extensively utilized to incorporate external, current knowledge into large language models, thereby minimizing hallucinations. A standard RAG pipeline may comprise several components, such as query rewriting, document retrieval, document filtering, and answer generation. However, these components are typically optimized separately through supervised fine-tuning, which can lead to misalignments between the objectives of individual modules and the overarching aim of generating accurate answers in question-answering (QA) tasks. Although recent efforts have explored reinforcement learning (RL) to optimize specific RAG components, these approaches often focus on overly simplistic pipelines with only two components or do not adequately address the complex interdependencies and collaborative interactions among the modules. To overcome these challenges, we propose treating the RAG pipeline as a multi-agent cooperative task, with each component regarded as an RL agent. Specifically, we present MMOA-RAG, a Multi-Module joint Optimization Algorithm for RAG, which employs multi-agent reinforcement learning to harmonize all agents' goals towards a unified reward, such as the F1 score of the final answer. Experiments conducted on various QA datasets demonstrate that MMOA-RAG improves the overall pipeline performance and outperforms existing baselines. Furthermore, comprehensive ablation studies validate the contributions of individual components and the adaptability of MMOA-RAG across different RAG components and datasets. The code of MMOA-RAG is on https://github.com/chenyiqun/MMOA-RAG.

URLs: https://github.com/chenyiqun/MMOA-RAG.

new ASRank: Zero-Shot Re-Ranking with Answer Scent for Document Retrieval

Authors: Abdelrahman Abdallah, Jamshid Mozafari, Bhawna Piryani, Adam Jatowt

Abstract: Retrieval-Augmented Generation (RAG) models have drawn considerable attention in modern open-domain question answering. The effectiveness of RAG depends on the quality of the top retrieved documents. However, conventional retrieval methods sometimes fail to rank the most relevant documents at the top. In this paper, we introduce ASRank, a new re-ranking method based on scoring retrieved documents using zero-shot answer scent which relies on a pre-trained large language model to compute the likelihood of the document-derived answers aligning with the answer scent. Our approach demonstrates marked improvements across several datasets, including NQ, TriviaQA, WebQA, ArchivalQA, HotpotQA, and Entity Questions. Notably, ASRank increases Top-1 retrieval accuracy on NQ from $19.2\%$ to $46.5\%$ for MSS and $22.1\%$ to $47.3\%$ for BM25. It also shows strong retrieval performance on several datasets compared to state-of-the-art methods (47.3 Top-1 by ASRank vs 35.4 by UPR by BM25).

new Prompting ChatGPT for Chinese Learning as L2: A CEFR and EBCL Level Study

Authors: Miao Lin-Zucker, Jo\"el Bellasen, Jean-Daniel Zucker

Abstract: The use of chatbots in language learning has evolved significantly since the 1960s, becoming more sophisticated platforms as generative AI emerged. These tools now simulate natural conversations, adapting to individual learners' needs, including those studying Chinese. Our study explores how learners can use specific prompts to engage Large Language Models (LLM) as personalized chatbots, aiming to target their language level based on the Common European Framework of Reference for Languages (CEFR) and the European Benchmarking Chinese Language (EBCL) project. Focusing on A1, A1+ and A2 levels, we examine the teaching of Chinese, which presents unique challenges due to its logographic writing system. Our goal is to develop prompts that integrate oral and written skills, using high-frequency character lists and controlling oral lexical productions. These tools, powered by generative AI, aim to enhance language practice by crossing lexical and sinographic recurrence. While generative AI shows potential as a personalized tutor, further evaluation is needed to assess its effectiveness. We conducted a systematic series of experiments using ChatGPT models to evaluate their adherence to constraints specified in the prompts. The results indicate that incorporating level A1 and A1+ characters, along with the associated reference list, significantly enhances compliance with the EBCL character set. Properly prompted, LLMs can increase exposure to the target language and offer interactive exchanges to develop language skills.

new Breaking the Stigma! Unobtrusively Probe Symptoms in Depression Disorder Diagnosis Dialogue

Authors: Jieming Cao, Chen Huang, Yanan Zhang, Ruibo Deng, Jincheng Zhang, Wenqiang Lei

Abstract: Stigma has emerged as one of the major obstacles to effectively diagnosing depression, as it prevents users from open conversations about their struggles. This requires advanced questioning skills to carefully probe the presence of specific symptoms in an unobtrusive manner. While recent efforts have been made on depression-diagnosis-oriented dialogue systems, they largely ignore this problem, ultimately hampering their practical utility. To this end, we propose a novel and effective method, UPSD$^{4}$, developing a series of strategies to promote a sense of unobtrusiveness within the dialogue system and assessing depression disorder by probing symptoms. We experimentally show that UPSD$^{4}$ demonstrates a significant improvement over current baselines, including unobtrusiveness evaluation of dialogue content and diagnostic accuracy. We believe our work contributes to developing more accessible and user-friendly tools for addressing the widespread need for depression diagnosis.

new New Evaluation Paradigm for Lexical Simplification

Authors: Jipeng Qiang, Minjiang Huang, Yi Zhu, Yunhao Yuan, Chaowei Zhang, Xiaoye Ouyang

Abstract: Lexical Simplification (LS) methods use a three-step pipeline: complex word identification, substitute generation, and substitute ranking, each with separate evaluation datasets. We found large language models (LLMs) can simplify sentences directly with a single prompt, bypassing the traditional pipeline. However, existing LS datasets are not suitable for evaluating these LLM-generated simplified sentences, as they focus on providing substitutes for single complex words without identifying all complex words in a sentence. To address this gap, we propose a new annotation method for constructing an all-in-one LS dataset through human-machine collaboration. Automated methods generate a pool of potential substitutes, which human annotators then assess, suggesting additional alternatives as needed. Additionally, we explore LLM-based methods with single prompts, in-context learning, and chain-of-thought techniques. We introduce a multi-LLMs collaboration approach to simulate each step of the LS task. Experimental results demonstrate that LS based on multi-LLMs approaches significantly outperforms existing baselines.

new Pre-training a Transformer-Based Generative Model Using a Small Sepedi Dataset

Authors: Simon P. Ramalepe, Thipe I. Modipa, Marelie H. Davel

Abstract: Due to the scarcity of data in low-resourced languages, the development of language models for these languages has been very slow. Currently, pre-trained language models have gained popularity in natural language processing, especially, in developing domain-specific models for low-resourced languages. In this study, we experiment with the impact of using occlusion-based techniques when training a language model for a text generation task. We curate 2 new datasets, the Sepedi monolingual (SepMono) dataset from several South African resources and the Sepedi radio news (SepNews) dataset from the radio news domain. We use the SepMono dataset to pre-train transformer-based models using the occlusion and non-occlusion pre-training techniques and compare performance. The SepNews dataset is specifically used for fine-tuning. Our results show that the non-occlusion models perform better compared to the occlusion-based models when measuring validation loss and perplexity. However, analysis of the generated text using the BLEU score metric, which measures the quality of the generated text, shows a slightly higher BLEU score for the occlusion-based models compared to the non-occlusion models.

new Are Human Interactions Replicable by Generative Agents? A Case Study on Pronoun Usage in Hierarchical Interactions

Authors: Naihao Deng, Rada Mihalcea

Abstract: As Large Language Models (LLMs) advance in their capabilities, researchers have increasingly employed them for social simulation. In this paper, we investigate whether interactions among LLM agents resemble those of humans. Specifically, we focus on the pronoun usage difference between leaders and non-leaders, examining whether the simulation would lead to human-like pronoun usage patterns during the LLMs' interactions. Our evaluation reveals the significant discrepancies between LLM-based simulations and human pronoun usage, with prompt-based or specialized agents failing to demonstrate human-like pronoun usage patterns. In addition, we reveal that even if LLMs understand the human pronoun usage patterns, they fail to demonstrate them in the actual interaction process. Our study highlights the limitations of social simulations based on LLM agents, urging caution in using such social simulation in practitioners' decision-making process.

new You Only Prune Once: Designing Calibration-Free Model Compression With Policy Learning

Authors: Ayan Sengupta, Siddhant Chaudhary, Tanmoy Chakraborty

Abstract: The ever-increasing size of large language models (LLMs) presents significant challenges for deployment due to their heavy computational and memory requirements. Current model pruning techniques attempt to alleviate these issues by relying heavily on external calibration datasets to determine which parameters to prune or compress, thus limiting their flexibility and scalability across different compression ratios. Moreover, these methods often cause severe performance degradation, particularly in downstream tasks, when subjected to higher compression rates. In this paper, we propose PruneNet, a novel model compression method that addresses these limitations by reformulating model pruning as a policy learning process. PruneNet decouples the pruning process from the model architecture, eliminating the need for calibration datasets. It learns a stochastic pruning policy to assess parameter importance solely based on intrinsic model properties while preserving the spectral structure to minimize information loss. PruneNet can compress the LLaMA-2-7B model in just 15 minutes, achieving over 80% retention of its zero-shot performance with a 30% compression ratio, outperforming existing methods that retain only 75% performance. Furthermore, on complex multitask language understanding tasks, PruneNet demonstrates its robustness by preserving up to 80% performance of the original model, proving itself a superior alternative to conventional structured compression techniques.

new The Multicultural Medical Assistant: Can LLMs Improve Medical ASR Errors Across Borders?

Authors: Ayo Adedeji, Mardhiyah Sanni, Emmanuel Ayodele, Sarita Joshi, Tobi Olatunji

Abstract: The global adoption of Large Language Models (LLMs) in healthcare shows promise to enhance clinical workflows and improve patient outcomes. However, Automatic Speech Recognition (ASR) errors in critical medical terms remain a significant challenge. These errors can compromise patient care and safety if not detected. This study investigates the prevalence and impact of ASR errors in medical transcription in Nigeria, the United Kingdom, and the United States. By evaluating raw and LLM-corrected transcriptions of accented English in these regions, we assess the potential and limitations of LLMs to address challenges related to accents and medical terminology in ASR. Our findings highlight significant disparities in ASR accuracy across regions and identify specific conditions under which LLM corrections are most effective.

new Figurative-cum-Commonsense Knowledge Infusion for Multimodal Mental Health Meme Classification

Authors: Abdullah Mazhar, Zuhair hasan shaik, Aseem Srivastava, Polly Ruhnke, Lavanya Vaddavalli, Sri Keshav Katragadda, Shweta Yadav, Md Shad Akhtar

Abstract: The expression of mental health symptoms through non-traditional means, such as memes, has gained remarkable attention over the past few years, with users often highlighting their mental health struggles through figurative intricacies within memes. While humans rely on commonsense knowledge to interpret these complex expressions, current Multimodal Language Models (MLMs) struggle to capture these figurative aspects inherent in memes. To address this gap, we introduce a novel dataset, AxiOM, derived from the GAD anxiety questionnaire, which categorizes memes into six fine-grained anxiety symptoms. Next, we propose a commonsense and domain-enriched framework, M3H, to enhance MLMs' ability to interpret figurative language and commonsense knowledge. The overarching goal remains to first understand and then classify the mental health symptoms expressed in memes. We benchmark M3H against 6 competitive baselines (with 20 variations), demonstrating improvements in both quantitative and qualitative metrics, including a detailed human evaluation. We observe a clear improvement of 4.20% and 4.66% on weighted-F1 metric. To assess the generalizability, we perform extensive experiments on a public dataset, RESTORE, for depressive symptom identification, presenting an extensive ablation study that highlights the contribution of each module in both datasets. Our findings reveal limitations in existing models and the advantage of employing commonsense to enhance figurative understanding.

new Large Language Models as Theory of Mind Aware Generative Agents with Counterfactual Reflection

Authors: Bo Yang, Jiaxian Guo, Yusuke Iwasawa, Yutaka Matsuo

Abstract: Recent studies have increasingly demonstrated that large language models (LLMs) possess significant theory of mind (ToM) capabilities, showing the potential for simulating the tracking of mental states in generative agents. In this study, we propose a novel paradigm called ToM-agent, designed to empower LLMs-based generative agents to simulate ToM in open-domain conversational interactions. ToM-agent disentangles the confidence from mental states, facilitating the emulation of an agent's perception of its counterpart's mental states, such as beliefs, desires, and intentions (BDIs). Using past conversation history and verbal reflections, ToM-Agent can dynamically adjust counterparts' inferred BDIs, along with related confidence levels. We further put forth a counterfactual intervention method that reflects on the gap between the predicted responses of counterparts and their real utterances, thereby enhancing the efficiency of reflection. Leveraging empathetic and persuasion dialogue datasets, we assess the advantages of implementing the ToM-agent with downstream tasks, as well as its performance in both the first-order and the \textit{second-order} ToM. Our findings indicate that the ToM-agent can grasp the underlying reasons for their counterpart's behaviors beyond mere semantic-emotional supporting or decision-making based on common sense, providing new insights for studying large-scale LLMs-based simulation of human social behaviors.

new Baichuan-Omni-1.5 Technical Report

Authors: Yadong Li, Jun Liu, Tao Zhang, Tao Zhang, Song Chen, Tianpeng Li, Zehuan Li, Lijun Liu, Lingfeng Ming, Guosheng Dong, Da Pan, Chong Li, Yuanbo Fang, Dongdong Kuang, Mingrui Wang, Chenglin Zhu, Youwei Zhang, Hongyu Guo, Fengyu Zhang, Yuran Wang, Bowen Ding, Wei Song, Xu Li, Yuqi Huo, Zheng Liang, Shusen Zhang, Xin Wu, Shuai Zhao, Linchu Xiong, Yozhen Wu, Jiahui Ye, Wenhao Lu, Bowen Li, Yan Zhang, Yaqi Zhou, Xin Chen, Lei Su, Hongda Zhang, Fuzhong Chen, Xuezhen Dong, Na Nie, Zhiying Wu, Bin Xiao, Ting Li, Shunya Dang, Ping Zhang, Yijia Sun, Jincheng Wu, Jinjie Yang, Xionghai Lin, Zhi Ma, Kegeng Wu, Jia li, Aiyuan Yang, Hui Liu, Jianqiang Zhang, Xiaoxi Chen, Guangwei Ai, Wentao Zhang, Yicong Chen, Xiaoqin Huang, Kun Li, Wenjing Luo, Yifei Duan, Lingling Zhu, Ran Xiao, Zhe Su, Jiani Pu, Dian Wang, Xu Jia, Tianyu Zhang, Mengyu Ai, Mang Wang, Yujing Qiao, Lei Zhang, Yanjun Shen, Fan Yang, Miao Zhen, Yijie Zhou, Mingyang Chen, Fei Li, Chenzheng Zhu, Keer Lu, Yaqi Zhao, Hao Liang, Youquan Li, Yanzhao Qin, Linzhuang Sun, Jianhua Xu, Haoze Sun, Mingan Lin, Zenan Zhou, Weipeng Chen

Abstract: We introduce Baichuan-Omni-1.5, an omni-modal model that not only has omni-modal understanding capabilities but also provides end-to-end audio generation capabilities. To achieve fluent and high-quality interaction across modalities without compromising the capabilities of any modality, we prioritized optimizing three key aspects. First, we establish a comprehensive data cleaning and synthesis pipeline for multimodal data, obtaining about 500B high-quality data (text, audio, and vision). Second, an audio-tokenizer (Baichuan-Audio-Tokenizer) has been designed to capture both semantic and acoustic information from audio, enabling seamless integration and enhanced compatibility with MLLM. Lastly, we designed a multi-stage training strategy that progressively integrates multimodal alignment and multitask fine-tuning, ensuring effective synergy across all modalities. Baichuan-Omni-1.5 leads contemporary models (including GPT4o-mini and MiniCPM-o 2.6) in terms of comprehensive omni-modal capabilities. Notably, it achieves results comparable to leading models such as Qwen2-VL-72B across various multimodal medical benchmarks.

new Evaluating the Effectiveness of XAI Techniques for Encoder-Based Language Models

Authors: Melkamu Abay Mersha, Mesay Gemeda Yigezu, Jugal Kalita

Abstract: The black-box nature of large language models (LLMs) necessitates the development of eXplainable AI (XAI) techniques for transparency and trustworthiness. However, evaluating these techniques remains a challenge. This study presents a general evaluation framework using four key metrics: Human-reasoning Agreement (HA), Robustness, Consistency, and Contrastivity. We assess the effectiveness of six explainability techniques from five different XAI categories model simplification (LIME), perturbation-based methods (SHAP), gradient-based approaches (InputXGradient, Grad-CAM), Layer-wise Relevance Propagation (LRP), and attention mechanisms-based explainability methods (Attention Mechanism Visualization, AMV) across five encoder-based language models: TinyBERT, BERTbase, BERTlarge, XLM-R large, and DeBERTa-xlarge, using the IMDB Movie Reviews and Tweet Sentiment Extraction (TSE) datasets. Our findings show that the model simplification-based XAI method (LIME) consistently outperforms across multiple metrics and models, significantly excelling in HA with a score of 0.9685 on DeBERTa-xlarge, robustness, and consistency as the complexity of large language models increases. AMV demonstrates the best Robustness, with scores as low as 0.0020. It also excels in Consistency, achieving near-perfect scores of 0.9999 across all models. Regarding Contrastivity, LRP performs the best, particularly on more complex models, with scores up to 0.9371.

new Qwen2.5-1M Technical Report

Authors: An Yang, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoyan Huang, Jiandong Jiang, Jianhong Tu, Jianwei Zhang, Jingren Zhou, Junyang Lin, Kai Dang, Kexin Yang, Le Yu, Mei Li, Minmin Sun, Qin Zhu, Rui Men, Tao He, Weijia Xu, Wenbiao Yin, Wenyuan Yu, Xiafei Qiu, Xingzhang Ren, Xinlong Yang, Yong Li, Zhiying Xu, Zipeng Zhang

Abstract: We introduce Qwen2.5-1M, a series of models that extend the context length to 1 million tokens. Compared to the previous 128K version, the Qwen2.5-1M series have significantly enhanced long-context capabilities through long-context pre-training and post-training. Key techniques such as long data synthesis, progressive pre-training, and multi-stage supervised fine-tuning are employed to effectively enhance long-context performance while reducing training costs. To promote the use of long-context models among a broader user base, we present and open-source our inference framework. This framework includes a length extrapolation method that can expand the model context lengths by at least four times, or even more, without additional training. To reduce inference costs, we implement a sparse attention method along with chunked prefill optimization for deployment scenarios and a sparsity refinement method to improve precision. Additionally, we detail our optimizations in the inference engine, including kernel optimization, pipeline parallelism, and scheduling optimization, which significantly enhance overall inference performance. By leveraging our inference framework, the Qwen2.5-1M models achieve a remarkable 3x to 7x prefill speedup in scenarios with 1 million tokens of context. This framework provides an efficient and powerful solution for developing applications that require long-context processing using open-source models. The Qwen2.5-1M series currently includes the open-source models Qwen2.5-7B-Instruct-1M and Qwen2.5-14B-Instruct-1M, as well as the API-accessed model Qwen2.5-Turbo. Evaluations show that Qwen2.5-1M models have been greatly improved in long-context tasks without compromising performance in short-context scenarios. Specifically, the Qwen2.5-14B-Instruct-1M model significantly outperforms GPT-4o-mini in long-context tasks and supports contexts eight times longer.

new How Green are Neural Language Models? Analyzing Energy Consumption in Text Summarization Fine-tuning

Authors: Tohida Rehman, Debarshi Kumar Sanyal, Samiran Chattopadhyay

Abstract: Artificial intelligence systems significantly impact the environment, particularly in natural language processing (NLP) tasks. These tasks often require extensive computational resources to train deep neural networks, including large-scale language models containing billions of parameters. This study analyzes the trade-offs between energy consumption and performance across three neural language models: two pre-trained models (T5-base and BART-base), and one large language model (LLaMA 3-8B). These models were fine-tuned for the text summarization task, focusing on generating research paper highlights that encapsulate the core themes of each paper. A wide range of evaluation metrics, including ROUGE, METEOR, MoverScore, BERTScore, and SciBERTScore, were employed to assess their performance. Furthermore, the carbon footprint associated with fine-tuning each model was measured, offering a comprehensive assessment of their environmental impact. This research underscores the importance of incorporating environmental considerations into the design and implementation of neural language models and calls for the advancement of energy-efficient AI methodologies.

new Semantic Layered Embedding Diffusion in Large Language Models for Multi-Contextual Consistency

Authors: Irin Kabakum, Thomas Montgomery, Daniel Ravenwood, Genevieve Harrington

Abstract: The Semantic Layered Embedding Diffusion (SLED) mechanism redefines the representation of hierarchical semantics within transformer-based architectures, enabling enhanced contextual consistency across a wide array of linguistic tasks. By introducing a multi-layered diffusion process grounded in spectral analysis, it achieves a complex balance between global and local semantic coherence. Experimental results demonstrate significant improvements in perplexity and BLEU scores, emphasizing the mechanism's ability to adapt effectively across diverse domains, including multilingual and cross-domain text generation. A rigorous mathematical framework underpins the embedding diffusion process, incorporating weighted adjacency matrices, kernel-based refinements, and dynamic layer-wise normalization. Error distribution analysis reveals that SLED addresses challenges in semantic alignment and coherence, outperforming baseline approaches across varied benchmarks. Scalability studies illustrate that its performance gains are maintained consistently across different model sizes, reflecting a practical balance between computational efficiency and linguistic precision. The implementation also achieves energy efficiency, reducing resource consumption during training and inference phases without compromising accuracy. Qualitative case studies further validate its adaptability to extended narratives and context-intensive scenarios, highlighting the mechanism's potential for real-world applications. SLED offers a different perspective on embedding design and its implications for advancing language modeling.

new OpenCharacter: Training Customizable Role-Playing LLMs with Large-Scale Synthetic Personas

Authors: Xiaoyang Wang, Hongming Zhang, Tao Ge, Wenhao Yu, Dian Yu, Dong Yu

Abstract: Customizable role-playing in large language models (LLMs), also known as character generalization, is gaining increasing attention for its versatility and cost-efficiency in developing and deploying role-playing dialogue agents. This study explores a large-scale data synthesis approach to equip LLMs with character generalization capabilities. We begin by synthesizing large-scale character profiles using personas from Persona Hub and then explore two strategies: response rewriting and response generation, to create character-aligned instructional responses. To validate the effectiveness of our synthetic instruction tuning data for character generalization, we perform supervised fine-tuning (SFT) using the LLaMA-3 8B model. Our best-performing model strengthens the original LLaMA-3 8B Instruct model and achieves performance comparable to GPT-4o models on role-playing dialogue. We release our synthetic characters and instruction-tuning dialogues to support public research.

new Evaluating Simple Debiasing Techniques in RoBERTa-based Hate Speech Detection Models

Authors: Diana Iftimie, Erik Zinn

Abstract: The hate speech detection task is known to suffer from bias against African American English (AAE) dialect text, due to the annotation bias present in the underlying hate speech datasets used to train these models. This leads to a disparity where normal AAE text is more likely to be misclassified as abusive/hateful compared to non-AAE text. Simple debiasing techniques have been developed in the past to counter this sort of disparity, and in this work, we apply and evaluate these techniques in the scope of RoBERTa-based encoders. Experimental results suggest that the success of these techniques depends heavily on the methods used for training dataset construction, but with proper consideration of representation bias, they can reduce the disparity seen among dialect subgroups on the hate speech detection task.

new Token Democracy: The Architectural Limits of Alignment in Transformer-Based Language Models

Authors: Robin Young

Abstract: Modern language models paradoxically combine unprecedented capability with persistent vulnerability in that they can draft poetry yet cannot reliably refuse harmful requests. We reveal this fragility stems not from inadequate training, but from a fundamental architectural limitation: transformers process all tokens as equals. Transformers operate as computational democracies, granting equal voice to all tokens. This is a design tragically unsuited for AGI, where we cannot risk adversarial "candidates" hijacking the system. Through formal analysis, we demonstrate that safety instructions fundamentally lack privileged status in transformer architectures, that they compete with adversarial inputs in the same computational arena, making robust alignment through prompting or fine-tuning inherently limited. This "token democracy" explains why jailbreaks bypass even extensively safety-trained models and why positional shifts erode prompt effectiveness. Our work systematizes practitioners' tacit knowledge into an architectural critique, showing current alignment approaches create mere preferences, not constraints.

new STATE ToxiCN: A Benchmark for Span-level Target-Aware Toxicity Extraction in Chinese Hate Speech Detection

Authors: Zewen Bai, Yuanyuan Sun, Shengdi Yin, Junyu Lu, Jingjie Zeng, Haohao Zhu, Liang Yang, Hongfei Lin

Abstract: The proliferation of hate speech has caused significant harm to society. The intensity and directionality of hate are closely tied to the target and argument it is associated with. However, research on hate speech detection in Chinese has lagged behind, and existing datasets lack span-level fine-grained annotations. Furthermore, the lack of research on Chinese hateful slang poses a significant challenge. In this paper, we provide a solution for fine-grained detection of Chinese hate speech. First, we construct a dataset containing Target-Argument-Hateful-Group quadruples (STATE ToxiCN), which is the first span-level Chinese hate speech dataset. Secondly, we evaluate the span-level hate speech detection performance of existing models using STATE ToxiCN. Finally, we conduct the first study on Chinese hateful slang and evaluate the ability of LLMs to detect such expressions. Our work contributes valuable resources and insights to advance span-level hate speech detection in Chinese

new Data-adaptive Safety Rules for Training Reward Models

Authors: Xiaomin Li, Mingye Gao, Zhiwei Zhang, Jingxuan Fan, Weiyu Li

Abstract: Reinforcement Learning from Human Feedback (RLHF) is commonly employed to tailor models to human preferences, especially to improve the safety of outputs from large language models (LLMs). Traditionally, this method depends on selecting preferred responses from pairs. However, due to the variability in human opinions and the challenges in directly comparing two responses, there is an increasing trend towards fine-grained annotation approaches that evaluate responses using multiple targeted metrics or rules. The challenge lies in efficiently choosing and applying these rules to handle the diverse range of preference data. In this paper, we propose a dynamic method that adaptively selects the most important rules for each response pair. We introduce a mathematical framework that utilizes the maximum discrepancy across paired responses and demonstrate theoretically that this approach maximizes the mutual information between the rule-based annotations and the underlying true preferences. We then train an 8B reward model using this adaptively labeled preference dataset and assess its efficacy using RewardBench. As of January 25, 2025, our model achieved the highest safety performance on the leaderboard, surpassing various larger models.

new Query-based versus resource-based cache strategies in tag-based browsing systems

Authors: Joaqu\'in Gayoso-Cabada, Mercedes G\'omez-Albarr\'an, Jos\'e-Luis Sierra

Abstract: Tag-based browsing is a popular interaction model for navigating digital libraries. According to this model, users select descriptive tags to filter resources in the collections. Typical implementations of the model are based on inverted indexes. However, these implementations can require a considerable amount of set operations to update the browsing state. To palliate this inconven-ience, it is possible to adopt suitable cache strategies. In this paper we describe and compare two of these strategies: (i) a query-based strategy, according to which previously computed browsing states are indexed by sets of selected tags; and (ii) a resource-based strategy, according to which browsing states are in-dexed by sets of filtered resources. Our comparison focused on runtime perfor-mance, and was carried out empirically, using a real-world web-based collec-tion in the field of digital humanities. The results obtained show that the re-source-based strategy clearly outperforms the query-based one.

new Multilevel Browsing of Folksonomy-Based Digital Collections

Authors: Joaqu\'in Gayoso-Cabada, Daniel Rodr\'iguez-Cerezo, Jos\'e-Luis Sierra

Abstract: This paper describes how to extend the usual one-level tag selection navigation paradigm in folksonomy-based digital collections to a multilevel browsing one, according to which it is possible to incrementally narrow down the set of selected objects in a collection by sequentially adding more and more filtering tags. For this purpose, we present a browsing strategy based on finite automata. Also, we provide some experimental results concerning the application of the approach in Clavy, a system for managing digital collections with reconfigurable structures in digital humanities and educational settings.

new ARWKV: Pretrain is not what we need, an RNN-Attention-Based Language Model Born from Transformer

Authors: Lin Yueyu, Li Zhiyuan, Peter Yue, Liu Xiao

Abstract: As is known, hybrid quadratic and subquadratic attention models in multi-head architectures have surpassed both Transformer and Linear RNN models , with these works primarily focusing on reducing KV complexity and improving efficiency. For further research on expressiveness, we introduce our series of models distilled from Qwen 2.5, based on pure native RWKV-7 attention, which aims to make RNN more expressive and demonstrates state tracking ability beyond transformers. We work with QRWK 32B based on RWKV-6 architecture, another approach that reduces the entire knowledge processing time to just 8 hours using 16 AMD MI300X GPUs while maintaining Qwen 2.5's performance. In fact, the distillation process can utilize any LLM, not just Qwen, and enables knowledge transfer from larger LLMs to smaller ones with more fewer tokens. We will explain the detailed process and share our insights on building more powerful foundation models. Please note that this is an ongoing work that will be updated continuously. The model checkpoints and source code are available at \href{https://github.com/yynil/RWKVInside}{https://github.com/yynil/RWKVInside}, \href{https://huggingface.co/RWKV-Red-Team/ARWKV-7B-Preview-0.1}{https://huggingface.co/RWKV-Red-Team/ARWKV-7B-Preview-0.1}.

URLs: https://github.com/yynil/RWKVInside, https://github.com/yynil/RWKVInside, https://huggingface.co/RWKV-Red-Team/ARWKV-7B-Preview-0.1, https://huggingface.co/RWKV-Red-Team/ARWKV-7B-Preview-0.1

new Cross-Cultural Fashion Design via Interactive Large Language Models and Diffusion Models

Authors: Spencer Ramsey, Amina Grant, Jeffrey Lee

Abstract: Fashion content generation is an emerging area at the intersection of artificial intelligence and creative design, with applications ranging from virtual try-on to culturally diverse design prototyping. Existing methods often struggle with cultural bias, limited scalability, and alignment between textual prompts and generated visuals, particularly under weak supervision. In this work, we propose a novel framework that integrates Large Language Models (LLMs) with Latent Diffusion Models (LDMs) to address these challenges. Our method leverages LLMs for semantic refinement of textual prompts and introduces a weak supervision filtering module to effectively utilize noisy or weakly labeled data. By fine-tuning the LDM on an enhanced DeepFashion+ dataset enriched with global fashion styles, the proposed approach achieves state-of-the-art performance. Experimental results demonstrate that our method significantly outperforms baselines, achieving lower Frechet Inception Distance (FID) and higher Inception Scores (IS), while human evaluations confirm its ability to generate culturally diverse and semantically relevant fashion content. These results highlight the potential of LLM-guided diffusion models in driving scalable and inclusive AI-driven fashion innovation.

new Instruction Tuning for Story Understanding and Generation with Weak Supervision

Authors: Yangshu Yuan, Heng Chen, Christian Ng

Abstract: Story understanding and generation have long been a challenging task in natural language processing (NLP), especially when dealing with various levels of instruction specificity. In this paper, we propose a novel approach called "Weak to Strong Instruction Tuning" for improving story generation by tuning models with instructions of varying clarity. We explore the potential of large language models (LLMs) to adapt to different types of instructions, weak and strong, and show that our method significantly enhances performance in story comprehension and generation. By leveraging the strength of instruction tuning, we train models to understand the nuances of story plots, characters, and themes while generating coherent and engaging narratives. Through extensive experiments on several benchmark datasets and comparison with state-of-the-art baselines, we demonstrate that our method outperforms existing techniques, yielding substantial improvements in both automatic evaluation metrics and human evaluations. Our work shows that adaptive instruction tuning can be a powerful tool in refining generative models for complex narrative tasks.

new Error Classification of Large Language Models on Math Word Problems: A Dynamically Adaptive Framework

Authors: Yuhong Sun, Zhangyue Yin, Xuanjing Huang, Xipeng Qiu, Hui Zhao

Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities across various domains. Math Word Problems (MWPs) serve as a crucial benchmark for evaluating LLMs' reasoning abilities. While most research primarily focuses on improving accuracy, it often neglects understanding and addressing the underlying patterns of errors. Current error classification methods rely on static and predefined categories, which limit their ability to capture the full spectrum of error patterns in mathematical reasoning. To enable systematic error analysis, we collect error samples from 15 different LLMs of varying sizes across four distinct MWP datasets using multiple sampling strategies. Based on this extensive collection, we introduce MWPES-300K, a comprehensive dataset containing 304,865 error samples that cover diverse error patterns and reasoning paths. To reduce human bias and enable fine-grained analysis of error patterns, we propose a novel framework for automated dynamic error classification in mathematical reasoning. Experimental results demonstrate that dataset characteristics significantly shape error patterns, which evolve from basic to complex manifestations as model capabilities increase. With deeper insights into error patterns, we propose error-aware prompting that incorporates common error patterns as explicit guidance, leading to significant improvements in mathematical reasoning performance.

new SCP-116K: A High-Quality Problem-Solution Dataset and a Generalized Pipeline for Automated Extraction in the Higher Education Science Domain

Authors: Dakuan Lu, Xiaoyu Tan, Rui Xu, Tianchu Yao, Chao Qu, Wei Chu, Yinghui Xu, Yuan Qi

Abstract: Recent breakthroughs in large language models (LLMs) exemplified by the impressive mathematical and scientific reasoning capabilities of the o1 model have spotlighted the critical importance of high-quality training data in advancing LLM performance across STEM disciplines. While the mathematics community has benefited from a growing body of curated datasets, the scientific domain at the higher education level has long suffered from a scarcity of comparable resources. To address this gap, we present SCP-116K, a new large-scale dataset of 116,756 high-quality problem-solution pairs, automatically extracted from heterogeneous sources using a streamlined and highly generalizable pipeline. Our approach involves stringent filtering to ensure the scientific rigor and educational level of the extracted materials, while maintaining adaptability for future expansions or domain transfers. By openly releasing both the dataset and the extraction pipeline, we seek to foster research on scientific reasoning, enable comprehensive performance evaluations of new LLMs, and lower the barrier to replicating the successes of advanced models like o1 in the broader science community. We believe SCP-116K will serve as a critical resource, catalyzing progress in high-level scientific reasoning tasks and promoting further innovations in LLM development. The dataset and code are publicly available at https://github.com/AQA6666/SCP-116K-open.

URLs: https://github.com/AQA6666/SCP-116K-open.

new Improving Estonian Text Simplification through Pretrained Language Models and Custom Datasets

Authors: Eduard Barbu, Meeri-Ly Muru, Sten Marcus Malva

Abstract: This study introduces an approach to Estonian text simplification using two model architectures: a neural machine translation model and a fine-tuned large language model (LLaMA). Given the limited resources for Estonian, we developed a new dataset, the Estonian Simplification Dataset, combining translated data and GPT-4.0-generated simplifications. We benchmarked OpenNMT, a neural machine translation model that frames text simplification as a translation task, and fine-tuned the LLaMA model on our dataset to tailor it specifically for Estonian simplification. Manual evaluations on the test set show that the LLaMA model consistently outperforms OpenNMT in readability, grammaticality, and meaning preservation. These findings underscore the potential of large language models for low-resource languages and provide a basis for further research in Estonian text simplification.

new Quantum-Enhanced Attention Mechanism in NLP: A Hybrid Classical-Quantum Approach

Authors: S. M. Yousuf Iqbal Tomal, Abdullah Al Shafin, Debojit Bhattacharjee, MD. Khairul Amin, Rafiad Sadat Shahir

Abstract: Transformer-based models have achieved remarkable results in natural language processing (NLP) tasks such as text classification and machine translation. However, their computational complexity and resource demands pose challenges for scalability and accessibility. This research proposes a hybrid quantum-classical transformer model that integrates a quantum-enhanced attention mechanism to address these limitations. By leveraging quantum kernel similarity and variational quantum circuits (VQC), the model captures intricate token dependencies while improving computational efficiency. Experimental results on the IMDb dataset demonstrate that the quantum-enhanced model outperforms the classical baseline across all key metrics, achieving a 1.5% improvement in accuracy (65.5% vs. 64%), precision, recall, and F1 score. Statistical significance tests validate these improvements, highlighting the robustness of the quantum approach. These findings illustrate the transformative potential of quantum-enhanced attention mechanisms in optimizing NLP architectures for real-world applications.

new People who frequently use ChatGPT for writing tasks are accurate and robust detectors of AI-generated text

Authors: Jenna Russell, Marzena Karpinska, Mohit Iyyer

Abstract: In this paper, we study how well humans can detect text generated by commercial LLMs (GPT-4o, Claude, o1). We hire annotators to read 300 non-fiction English articles, label them as either human-written or AI-generated, and provide paragraph-length explanations for their decisions. Our experiments show that annotators who frequently use LLMs for writing tasks excel at detecting AI-generated text, even without any specialized training or feedback. In fact, the majority vote among five such "expert" annotators misclassifies only 1 of 300 articles, significantly outperforming most commercial and open-source detectors we evaluated even in the presence of evasion tactics like paraphrasing and humanization. Qualitative analysis of the experts' free-form explanations shows that while they rely heavily on specific lexical clues ('AI vocabulary'), they also pick up on more complex phenomena within the text (e.g., formality, originality, clarity) that are challenging to assess for automatic detectors. We release our annotated dataset and code to spur future research into both human and automated detection of AI-generated text.

new TensorLLM: Tensorising Multi-Head Attention for Enhanced Reasoning and Compression in LLMs

Authors: Yuxuan Gu, Wuyang Zhou, Giorgos Iacovides, Danilo Mandic

Abstract: The reasoning abilities of Large Language Models (LLMs) can be improved by structurally denoising their weights, yet existing techniques primarily focus on denoising the feed-forward network (FFN) of the transformer block, and can not efficiently utilise the Multi-head Attention (MHA) block, which is the core of transformer architectures. To address this issue, we propose a novel intuitive framework that, at its very core, performs MHA compression through a multi-head tensorisation process and the Tucker decomposition. This enables both higher-dimensional structured denoising and compression of the MHA weights, by enforcing a shared higher-dimensional subspace across the weights of the multiple attention heads. We demonstrate that this approach consistently enhances the reasoning capabilities of LLMs across multiple benchmark datasets, and for both encoder-only and decoder-only architectures, while achieving compression rates of up to $\sim 250$ times in the MHA weights, all without requiring any additional data, training, or fine-tuning. Furthermore, we show that the proposed method can be seamlessly combined with existing FFN-only-based denoising techniques to achieve further improvements in LLM reasoning performance.

new Transformer-Based Multimodal Knowledge Graph Completion with Link-Aware Contexts

Authors: Haodi Ma, Dzmitry Kasinets, Daisy Zhe Wang

Abstract: Multimodal knowledge graph completion (MMKGC) aims to predict missing links in multimodal knowledge graphs (MMKGs) by leveraging information from various modalities alongside structural data. Existing MMKGC approaches primarily extend traditional knowledge graph embedding (KGE) models, which often require creating an embedding for every entity. This results in large model sizes and inefficiencies in integrating multimodal information, particularly for real-world graphs. Meanwhile, Transformer-based models have demonstrated competitive performance in knowledge graph completion (KGC). However, their focus on single-modal knowledge limits their capacity to utilize cross-modal information. Recently, Large vision-language models (VLMs) have shown potential in cross-modal tasks but are constrained by the high cost of training. In this work, we propose a novel approach that integrates Transformer-based KGE models with cross-modal context generated by pre-trained VLMs, thereby extending their applicability to MMKGC. Specifically, we employ a pre-trained VLM to transform relevant visual information from entities and their neighbors into textual sequences. We then frame KGC as a sequence-to-sequence task, fine-tuning the model with the generated cross-modal context. This simple yet effective method significantly reduces model size compared to traditional KGE approaches while achieving competitive performance across multiple large-scale datasets with minimal hyperparameter tuning.

new Adapting Biomedical Abstracts into Plain language using Large Language Models

Authors: Haritha Gangavarapu, Giridhar Kaushik Ramachandran, Kevin Lybarger, Meliha Yetisgen, \"Ozlem Uzuner

Abstract: A vast amount of medical knowledge is available for public use through online health forums, and question-answering platforms on social media. The majority of the population in the United States doesn't have the right amount of health literacy to make the best use of that information. Health literacy means the ability to obtain and comprehend the basic health information to make appropriate health decisions. To build the bridge between this gap, organizations advocate adapting this medical knowledge into plain language. Building robust systems to automate the adaptations helps both medical and non-medical professionals best leverage the available information online. The goal of the Plain Language Adaptation of Biomedical Abstracts (PLABA) track is to adapt the biomedical abstracts in English language extracted from PubMed based on the questions asked in MedlinePlus for the general public using plain language at the sentence level. As part of this track, we leveraged the best open-source Large Language Models suitable and fine-tuned for dialog use cases. We compare and present the results for all of our systems and our ranking among the other participants' submissions. Our top performing GPT-4 based model ranked first in the avg. simplicity measure and 3rd on the avg. accuracy measure.

new StaICC: Standardized Evaluation for Classification Task in In-context Learning

Authors: Hakaze Cho, Naoya Inoue

Abstract: Classification tasks are widely investigated in the In-Context Learning (ICL) paradigm. However, current efforts are evaluated on disjoint benchmarks and settings, while their performances are significantly influenced by some trivial variables, such as prompt templates, data sampling, instructions, etc., which leads to significant inconsistencies in the results reported across various literature, preventing fair comparison or meta-analysis across different papers. Therefore, this paper proposes a standardized and easy-to-use evaluation toolkit (StaICC) for in-context classification. Including, for the normal classification task, we provide StaICC-Normal, selecting 10 widely used datasets, and generating prompts with a fixed form, to mitigate the variance among the experiment implementations. To enrich the usage of our benchmark, we also provide a sub-benchmark StaICC-Diag for diagnosing ICL from several aspects, aiming for a more robust inference processing.

new ESGSenticNet: A Neurosymbolic Knowledge Base for Corporate Sustainability Analysis

Authors: Keane Ong, Rui Mao, Frank Xing, Ranjan Satapathy, Johan Sulaeman, Erik Cambria, Gianmarco Mengaldo

Abstract: Evaluating corporate sustainability performance is essential to drive sustainable business practices, amid the need for a more sustainable economy. However, this is hindered by the complexity and volume of corporate sustainability data (i.e. sustainability disclosures), not least by the effectiveness of the NLP tools used to analyse them. To this end, we identify three primary challenges - immateriality, complexity, and subjectivity, that exacerbate the difficulty of extracting insights from sustainability disclosures. To address these issues, we introduce ESGSenticNet, a publicly available knowledge base for sustainability analysis. ESGSenticNet is constructed from a neurosymbolic framework that integrates specialised concept parsing, GPT-4o inference, and semi-supervised label propagation, together with a hierarchical taxonomy. This approach culminates in a structured knowledge base of 44k knowledge triplets - ('halve carbon emission', supports, 'emissions control'), for effective sustainability analysis. Experiments indicate that ESGSenticNet, when deployed as a lexical method, more effectively captures relevant and actionable sustainability information from sustainability disclosures compared to state of the art baselines. Besides capturing a high number of unique ESG topic terms, ESGSenticNet outperforms baselines on the ESG relatedness and ESG action orientation of these terms by 26% and 31% respectively. These metrics describe the extent to which topic terms are related to ESG, and depict an action toward ESG. Moreover, when deployed as a lexical method, ESGSenticNet does not require any training, possessing a key advantage in its simplicity for non-technical stakeholders.

new IndicMMLU-Pro: Benchmarking the Indic Large Language Models

Authors: Sankalp KJ, Ashutosh Kumar, Laxmaan Balaji, Nikunj Kotecha, Vinija Jain, Aman Chadha, Sreyoshi Bhaduri

Abstract: Known by more than 1.5 billion people in the Indian subcontinent, Indic languages present unique challenges and opportunities for natural language processing (NLP) research due to their rich cultural heritage, linguistic diversity, and complex structures. IndicMMLU-Pro is a comprehensive benchmark designed to evaluate Large Language Models (LLMs) across Indic languages, building upon the MMLU Pro (Massive Multitask Language Understanding) framework. Covering major languages such as Hindi, Bengali, Gujarati, Marathi, Kannada, Punjabi, Tamil, Telugu, and Urdu, our benchmark addresses the unique challenges and opportunities presented by the linguistic diversity of the Indian subcontinent. This benchmark encompasses a wide range of tasks in language comprehension, reasoning, and generation, meticulously crafted to capture the intricacies of Indian languages. IndicMMLU-Pro provides a standardized evaluation framework to push the research boundaries in Indic language AI, facilitating the development of more accurate, efficient, and culturally sensitive models. This paper outlines the benchmarks' design principles, task taxonomy, data collection methodology, and presents baseline results from state-of-the-art multilingual models.

new Weight-based Analysis of Detokenization in Language Models: Understanding the First Stage of Inference Without Inference

Authors: Go Kamoda, Benjamin Hienzerling, Tatsuro Inaba, Keito Kudo, Keisuke Sakaguchi, Kentaro Inui

Abstract: According to the stages-of-inference hypothesis, early layers of language models map their subword-tokenized input, which does not necessarily correspond to a linguistically meaningful segmentation, to more meaningful representations that form the model's ``inner vocabulary''. Prior analysis of this detokenization stage has predominantly relied on probing and interventions such as path patching, which involve selecting particular inputs, choosing a subset of components that will be patched, and then observing changes in model behavior. Here, we show that several important aspects of the detokenization stage can be understood purely by analyzing model weights, without performing any model inference steps. Specifically, we introduce an analytical decomposition of first-layer attention in GPT-2. Our decomposition yields interpretable terms that quantify the relative contributions of position-related, token-related, and mixed effects. By focusing on terms in this decomposition, we discover weight-based explanations of attention bias toward close tokens and attention for detokenization.

new Is It Navajo? Accurate Language Detection in Endangered Athabaskan Languages

Authors: Ivory Yang, Weicheng Ma, Chunhui Zhang, Soroush Vosoughi

Abstract: Endangered languages, such as Navajo - the most widely spoken Native American language - are significantly underrepresented in contemporary language technologies, exacerbating the challenges of their preservation and revitalization. This study evaluates Google's large language model (LLM)-based language identification system, which consistently misidentifies Navajo, exposing inherent limitations when applied to low-resource Native American languages. To address this, we introduce a random forest classifier trained on Navajo and eight frequently confused languages. Despite its simplicity, the classifier achieves near-perfect accuracy (97-100%), significantly outperforming Google's LLM-based system. Additionally, the model demonstrates robustness across other Athabaskan languages - a family of Native American languages spoken primarily in Alaska, the Pacific Northwest, and parts of the Southwestern United States - suggesting its potential for broader application. Our findings underscore the pressing need for NLP systems that prioritize linguistic diversity and adaptability over centralized, one-size-fits-all solutions, especially in supporting underrepresented languages in a multicultural world. This work directly contributes to ongoing efforts to address cultural biases in language models and advocates for the development of culturally localized NLP tools that serve diverse linguistic communities.

new Automatic Feedback Generation for Short Answer Questions using Answer Diagnostic Graphs

Authors: Momoka Furuhashi, Hiroaki Funayama, Yuya Iwase, Yuichiroh Matsubayashi, Yoriko Isobe, Toru Nagahama, Saku Sugawara, Kentaro Inui

Abstract: Short-reading comprehension questions help students understand text structure but lack effective feedback. Students struggle to identify and correct errors, while manual feedback creation is labor-intensive. This highlights the need for automated feedback linking responses to a scoring rubric for deeper comprehension. Despite advances in Natural Language Processing (NLP), research has focused on automatic grading, with limited work on feedback generation. To address this, we propose a system that generates feedback for student responses. Our contributions are twofold. First, we introduce the first system for feedback on short-answer reading comprehension. These answers are derived from the text, requiring structural understanding. We propose an "answer diagnosis graph," integrating the text's logical structure with feedback templates. Using this graph and NLP techniques, we estimate students' comprehension and generate targeted feedback. Second, we evaluate our feedback through an experiment with Japanese high school students (n=39). They answered two 70-80 word questions and were divided into two groups with minimal academic differences. One received a model answer, the other system-generated feedback. Both re-answered the questions, and we compared score changes. A questionnaire assessed perceptions and motivation. Results showed no significant score improvement between groups, but system-generated feedback helped students identify errors and key points in the text. It also significantly increased motivation. However, further refinement is needed to enhance text structure understanding.

new Large Language Models to Diffusion Finetuning

Authors: Edoardo Cetin, Tianyu Zhao, Yujin Tang

Abstract: We propose a new finetuning method to provide pre-trained large language models (LMs) the ability to scale test-time compute through the diffusion framework. By increasing the number of diffusion steps, we show our finetuned models achieve monotonically increasing accuracy, directly translating to improved performance across downstream tasks. Furthermore, our finetuned models can expertly answer questions on specific topics by integrating powerful guidance techniques, and autonomously determine the compute required for a given problem by leveraging adaptive ODE solvers. Our method is universally applicable to any foundation model pre-trained with a cross-entropy loss and does not modify any of its original weights, fully preserving its strong single-step generation capabilities. We show our method is more effective and fully compatible with traditional finetuning approaches, introducing an orthogonal new direction to unify the strengths of the autoregressive and diffusion frameworks.

new MADP: Multi-Agent Deductive Planning for Enhanced Cognitive-Behavioral Mental Health Question Answer

Authors: Qi Chen, Dexi Liu

Abstract: The Mental Health Question Answer (MHQA) task requires the seeker and supporter to complete the support process in one-turn dialogue. Given the richness of help-seeker posts, supporters must thoroughly understand the content and provide logical, comprehensive, and well-structured responses. Previous works in MHQA mostly focus on single-agent approaches based on the cognitive element of Cognitive Behavioral Therapy (CBT), but they overlook the interactions among various CBT elements, such as emotion and cognition. This limitation hinders the models' ability to thoroughly understand the distress of help-seekers. To address this, we propose a framework named Multi-Agent Deductive Planning (MADP), which is based on the interactions between the various psychological elements of CBT. This method guides Large Language Models (LLMs) to achieve a deeper understanding of the seeker's context and provide more personalized assistance based on individual circumstances. Furthermore, we construct a new dataset based on the MADP framework and use it to fine-tune LLMs, resulting in a specialized model named MADP-LLM. We conduct extensive experiments, including comparisons with multiple LLMs, human evaluations, and automatic evaluations, to validate the effectiveness of the MADP framework and MADP-LLM.

new Potential Applications of Artificial Intelligence for Cross-language Intelligibility Assessment of Dysarthric Speech

Authors: Eunjung Yeo, Julie Liss, Visar Berisha, David Mortensen

Abstract: Purpose: This commentary introduces how artificial intelligence (AI) can be leveraged to advance cross-language intelligibility assessment of dysarthric speech. Method: We propose a dual-component framework consisting of a universal module that generates language-independent speech representations and a language-specific intelligibility model that incorporates linguistic nuances. Additionally, we identify key barriers to cross-language intelligibility assessment, including data scarcity, annotation complexity, and limited linguistic insights, and present AI-driven solutions to overcome these challenges. Conclusion: Advances in AI offer transformative opportunities to enhance cross-language intelligibility assessment for dysarthric speech by balancing scalability across languages and adaptability by languages.

new LCTG Bench: LLM Controlled Text Generation Benchmark

Authors: Kentaro Kurihara, Masato Mita, Peinan Zhang, Shota Sasaki, Ryosuke Ishigami, Naoaki Okazaki

Abstract: The rise of large language models (LLMs) has led to more diverse and higher-quality machine-generated text. However, their high expressive power makes it difficult to control outputs based on specific business instructions. In response, benchmarks focusing on the controllability of LLMs have been developed, but several issues remain: (1) They primarily cover major languages like English and Chinese, neglecting low-resource languages like Japanese; (2) Current benchmarks employ task-specific evaluation metrics, lacking a unified framework for selecting models based on controllability across different use cases. To address these challenges, this research introduces LCTG Bench, the first Japanese benchmark for evaluating the controllability of LLMs. LCTG Bench provides a unified framework for assessing control performance, enabling users to select the most suitable model for their use cases based on controllability. By evaluating nine diverse Japanese-specific and multilingual LLMs like GPT-4, we highlight the current state and challenges of controllability in Japanese LLMs and reveal the significant gap between multilingual models and Japanese-specific models.

new Optimizing Sentence Embedding with Pseudo-Labeling and Model Ensembles: A Hierarchical Framework for Enhanced NLP Tasks

Authors: Ziwei Liu, Qi Zhang, Lifu Gao

Abstract: Sentence embedding tasks are important in natural language processing (NLP), but improving their performance while keeping them reliable is still hard. This paper presents a framework that combines pseudo-label generation and model ensemble techniques to improve sentence embeddings. We use external data from SimpleWiki, Wikipedia, and BookCorpus to make sure the training data is consistent. The framework includes a hierarchical model with an encoding layer, refinement layer, and ensemble prediction layer, using ALBERT-xxlarge, RoBERTa-large, and DeBERTa-large models. Cross-attention layers combine external context, and data augmentation techniques like synonym replacement and back-translation increase data variety. Experimental results show large improvements in accuracy and F1-score compared to basic models, and studies confirm that cross-attention and data augmentation make a difference. This work presents an effective way to improve sentence embedding tasks and lays the groundwork for future NLP research.

new Parametric Retrieval Augmented Generation

Authors: Weihang Su, Yichen Tang, Qingyao Ai, Junxi Yan, Changyue Wang, Hongning Wang, Ziyi Ye, Yujia Zhou, Yiqun Liu

Abstract: Retrieval-augmented generation (RAG) techniques have emerged as a promising solution to enhance the reliability of large language models (LLMs) by addressing issues like hallucinations, outdated knowledge, and domain adaptation. In particular, existing RAG methods append relevant documents retrieved from external corpus or databases to the input of LLMs to guide their generation process, which we refer to as the in-context knowledge injection method. While this approach is simple and often effective, it has inherent limitations. Firstly, increasing the context length and number of relevant documents can lead to higher computational overhead and degraded performance, especially in complex reasoning tasks. More importantly, in-context knowledge injection operates primarily at the input level, but LLMs store their internal knowledge in their parameters. This gap fundamentally limits the capacity of in-context methods. To this end, we introduce Parametric retrieval-augmented generation (Parametric RAG), a new RAG paradigm that integrates external knowledge directly into the parameters of feed-forward networks (FFN) of an LLM through document parameterization. This approach not only saves online computational costs by eliminating the need to inject multiple documents into the LLMs' input context, but also deepens the integration of external knowledge into the parametric knowledge space of the LLM. Experimental results demonstrate that Parametric RAG substantially enhances both the effectiveness and efficiency of knowledge augmentation in LLMs. Also, it can be combined with in-context RAG methods to achieve even better performance. We have open-sourced all the code, data, and models in the following anonymized GitHub link: https://github.com/oneal2000/PRAG

URLs: https://github.com/oneal2000/PRAG

new Multi-View Attention Syntactic Enhanced Graph Convolutional Network for Aspect-based Sentiment Analysis

Authors: Xiang Huang, Hao Peng, Shuo Sun, Zhifeng Hao, Hui Lin, Shuhai Wang

Abstract: Aspect-based Sentiment Analysis (ABSA) is the task aimed at predicting the sentiment polarity of aspect words within sentences. Recently, incorporating graph neural networks (GNNs) to capture additional syntactic structure information in the dependency tree derived from syntactic dependency parsing has been proven to be an effective paradigm for boosting ABSA. Despite GNNs enhancing model capability by fusing more types of information, most works only utilize a single topology view of the dependency tree or simply conflate different perspectives of information without distinction, which limits the model performance. To address these challenges, in this paper, we propose a new multi-view attention syntactic enhanced graph convolutional network (MASGCN) that weighs different syntactic information of views using attention mechanisms. Specifically, we first construct distance mask matrices from the dependency tree to obtain multiple subgraph views for GNNs. To aggregate features from different views, we propose a multi-view attention mechanism to calculate the attention weights of views. Furthermore, to incorporate more syntactic information, we fuse the dependency type information matrix into the adjacency matrices and present a structural entropy loss to learn the dependency type adjacency matrix. Comprehensive experiments on four benchmark datasets demonstrate that our model outperforms state-of-the-art methods. The codes and datasets are available at https://github.com/SELGroup/MASGCN.

URLs: https://github.com/SELGroup/MASGCN.

new 3CEL: A corpus of legal Spanish contract clauses

Authors: Nuria Aldama Garc\'ia, Patricia Mars\`a Morales, David Betancur S\'anchez, \'Alvaro Barbero Jim\'enez, Marta Guerrero Nieto, Pablo Haya Coll, Patricia Mart\'in Chozas, Elena Montiel Ponsoda

Abstract: Legal corpora for Natural Language Processing (NLP) are valuable and scarce resources in languages like Spanish due to two main reasons: data accessibility and legal expert knowledge availability. INESData 2024 is a European Union funded project lead by the Universidad Polit\'ecnica de Madrid (UPM) and developed by Instituto de Ingenier\'ia del Conocimiento (IIC) to create a series of state-of-the-art NLP resources applied to the legal/administrative domain in Spanish. The goal of this paper is to present the Corpus of Legal Spanish Contract Clauses (3CEL), which is a contract information extraction corpus developed within the framework of INESData 2024. 3CEL contains 373 manually annotated tenders using 19 defined categories (4 782 total tags) that identify key information for contract understanding and reviewing.

new MEL: Legal Spanish Language Model

Authors: David Betancur S\'anchez, Nuria Aldama Garc\'ia, \'Alvaro Barbero Jim\'enez, Marta Guerrero Nieto, Patricia Mars\`a Morales, Nicol\'as Serrano Salas, Carlos Garc\'ia Hern\'an, Pablo Haya Coll, Elena Montiel Ponsoda, Pablo Calleja Ib\'a\~nez

Abstract: Legal texts, characterized by complex and specialized terminology, present a significant challenge for Language Models. Adding an underrepresented language, such as Spanish, to the mix makes it even more challenging. While pre-trained models like XLM-RoBERTa have shown capabilities in handling multilingual corpora, their performance on domain specific documents remains underexplored. This paper presents the development and evaluation of MEL, a legal language model based on XLM-RoBERTa-large, fine-tuned on legal documents such as BOE (Bolet\'in Oficial del Estado, the Spanish oficial report of laws) and congress texts. We detail the data collection, processing, training, and evaluation processes. Evaluation benchmarks show a significant improvement over baseline models in understanding the legal Spanish language. We also present case studies demonstrating the model's application to new legal texts, highlighting its potential to perform top results over different NLP tasks.

new PISCO: Pretty Simple Compression for Retrieval-Augmented Generation

Authors: Maxime Louis, Herv\'e D\'ejean, St\'ephane Clinchant

Abstract: Retrieval-Augmented Generation (RAG) pipelines enhance Large Language Models (LLMs) by retrieving relevant documents, but they face scalability issues due to high inference costs and limited context size. Document compression is a practical solution, but current soft compression methods suffer from accuracy losses and require extensive pretraining. In this paper, we introduce PISCO, a novel method that achieves a 16x compression rate with minimal accuracy loss (0-3%) across diverse RAG-based question-answering (QA) tasks. Unlike existing approaches, PISCO requires no pretraining or annotated data, relying solely on sequence-level knowledge distillation from document-based questions. With the ability to fine-tune a 7-10B LLM in 48 hours on a single A100 GPU, PISCO offers a highly efficient and scalable solution. We present comprehensive experiments showing that PISCO outperforms existing compression models by 8% in accuracy.

new RelCAT: Advancing Extraction of Clinical Inter-Entity Relationships from Unstructured Electronic Health Records

Authors: Shubham Agarwal, Vlad Dinu, Thomas Searle, Mart Ratas, Anthony Shek, Dan F. Stein, James Teo, Richard Dobson

Abstract: This study introduces RelCAT (Relation Concept Annotation Toolkit), an interactive tool, library, and workflow designed to classify relations between entities extracted from clinical narratives. Building upon the CogStack MedCAT framework, RelCAT addresses the challenge of capturing complete clinical relations dispersed within text. The toolkit implements state-of-the-art machine learning models such as BERT and Llama along with proven evaluation and training methods. We demonstrate a dataset annotation tool (built within MedCATTrainer), model training, and evaluate our methodology on both openly available gold-standard and real-world UK National Health Service (NHS) hospital clinical datasets. We perform extensive experimentation and a comparative analysis of the various publicly available models with varied approaches selected for model fine-tuning. Finally, we achieve macro F1-scores of 0.977 on the gold-standard n2c2, surpassing the previous state-of-the-art performance, and achieve performance of >=0.93 F1 on our NHS gathered datasets.

new Integration of LLM Quality Assurance into an NLG System

Authors: Ching-Yi Chen, Johanna Heininger, Adela Schneider, Christian Eckard, Andreas Madsack, Robert Wei{\ss}graeber

Abstract: In this paper, we present a system that uses a Large Language Model (LLM) to perform grammar and spelling correction as a component of Quality Assurance (QA) for texts generated by NLG systems, which is important for text production in real-world scenarios. Evaluating the results of the system on work-in-progress sports news texts in three languages, we show that it is able to deliver acceptable corrections.

new STAR: Stepwise Task Augmentation and Relation Learning for Aspect Sentiment Quad Prediction

Authors: Wenna Lai, Haoran Xie, Guandong Xu, Qing Li

Abstract: Aspect-based sentiment analysis (ABSA) aims to identify four sentiment elements, including aspect term, aspect category, opinion term, and sentiment polarity. These elements construct the complete picture of sentiments. The most challenging task, aspect sentiment quad prediction (ASQP), predicts these elements simultaneously, hindered by difficulties in accurately coupling different sentiment elements. A key challenge is insufficient annotated data that limits the capability of models in semantic understanding and reasoning about quad prediction. To address this, we propose stepwise task augmentation and relation learning (STAR), a strategy inspired by human reasoning. STAR constructs auxiliary data to learn quadruple relationships incrementally by augmenting with pairwise and overall relation tasks derived from training data. By encouraging the model to infer causal relationships among sentiment elements without requiring additional annotations, STAR effectively enhances quad prediction. Extensive experiments demonstrate the proposed STAR exhibits superior performance on four benchmark datasets.

new Towards Explainable Multimodal Depression Recognition for Clinical Interviews

Authors: Wenjie Zheng, Qiming Xie, Zengzhi Wang, Jianfei Yu, Rui Xia

Abstract: Recently, multimodal depression recognition for clinical interviews (MDRC) has recently attracted considerable attention. Existing MDRC studies mainly focus on improving task performance and have achieved significant development. However, for clinical applications, model transparency is critical, and previous works ignore the interpretability of decision-making processes. To address this issue, we propose an Explainable Multimodal Depression Recognition for Clinical Interviews (EMDRC) task, which aims to provide evidence for depression recognition by summarizing symptoms and uncovering underlying causes. Given an interviewer-participant interaction scenario, the goal of EMDRC is to structured summarize participant's symptoms based on the eight-item Patient Health Questionnaire depression scale (PHQ-8), and predict their depression severity. To tackle the EMDRC task, we construct a new dataset based on an existing MDRC dataset. Moreover, we utilize the PHQ-8 and propose a PHQ-aware multimodal multi-task learning framework, which captures the utterance-level symptom-related semantic information to help generate dialogue-level summary. Experiment results on our annotated dataset demonstrate the superiority of our proposed methods over baseline systems on the EMDRC task.

new From #Dr00gtiktok to #harmreduction: Exploring Substance Use Hashtags on TikTok

Authors: Layla Bouzoubaa, Muqi Guo, Joseph Trybala, Afsaneh Razi, Rezvaneh Rezapour

Abstract: The rise of TikTok as a primary source of information for youth, combined with its unique short-form video format, creates urgent questions about how substance use content manifests and spreads on the platform. This paper provides the first in-depth exploration of substance use-related content on TikTok, covering all major substance categories as classified by the Drug Enforcement Agency. Through social network analysis and qualitative coding, we examined more than 2,333 hashtags across 39,509 videos, identified 16 distinct hashtag communities and analyzed their interconnections and thematic content. Our analysis revealed a highly interconnected small-world network where recovery-focused hashtags like #addiction, #recovery, and #sober serve as central bridges between communities. Through manual coding of 351 representative videos, we found that Recovery Advocacy content (33.9%) and Satirical content (28.2%) dominate, while direct substance depiction appears in only 26% of videos, with active use shown in just 6.5% of them. This suggests TikTok functions primarily as a recovery support platform rather than a space promoting substance use. We found strong alignment between hashtag communities and video content, indicating organic community formation rather than attempts to evade content moderation. Our findings inform how platforms can balance content moderation with preserving valuable recovery support communities, while also providing insights for the design of social media-based recovery interventions.

new Evaluation of NMT-Assisted Grammar Transfer for a Multi-Language Configurable Data-to-Text System

Authors: Andreas Madsack, Johanna Heininger, Adela Schneider, Ching-Yi Chen, Christian Eckard, Robert Wei{\ss}graeber

Abstract: One approach for multilingual data-to-text generation is to translate grammatical configurations upfront from the source language into each target language. These configurations are then used by a surface realizer and in document planning stages to generate output. In this paper, we describe a rule-based NLG implementation of this approach where the configuration is translated by Neural Machine Translation (NMT) combined with a one-time human review, and introduce a cross-language grammar dependency model to create a multilingual NLG system that generates text from the source data, scaling the generation phase without a human in the loop. Additionally, we introduce a method for human post-editing evaluation on the automatically translated text. Our evaluation on the SportSett:Basketball dataset shows that our NLG system performs well, underlining its grammatical correctness in translation tasks.

new AdaCoT: Rethinking Cross-Lingual Factual Reasoning through Adaptive Chain-of-Thought

Authors: Xin Huang, Tarun Kumar Vangani, Zhengyuan Liu, Bowei Zou, Ai Ti Aw

Abstract: Large language models (LLMs) have shown impressive multilingual capabilities through pretraining on diverse corpora. While these models show strong reasoning abilities, their performance varies significantly across languages due to uneven training data distribution. Existing approaches using machine translation, and extensive multilingual pretraining and cross-lingual tuning face scalability challenges and often fail to capture nuanced reasoning processes across languages. In this paper, we introduce AdaCoT (Adaptive Chain-of-Thought), a framework that enhances multilingual reasoning by dynamically routing thought processes through intermediary "thinking languages" before generating target-language responses. AdaCoT leverages a language-agnostic core and incorporates an adaptive, reward-based mechanism for selecting optimal reasoning pathways without requiring additional pretraining. Our comprehensive evaluation across multiple benchmarks demonstrates substantial improvements in both factual reasoning quality and cross-lingual consistency, with particularly strong performance gains in low-resource language settings. The results suggest that adaptive reasoning paths can effectively bridge the performance gap between high and low-resource languages while maintaining cultural and linguistic nuances.

new Can summarization approximate simplification? A gold standard comparison

Authors: Giacomo Magnifico, Eduard Barbu

Abstract: This study explores the overlap between text summarization and simplification outputs. While summarization evaluation methods are streamlined, simplification lacks cohesion, prompting the question: how closely can abstractive summarization resemble gold-standard simplification? We address this by applying two BART-based BRIO summarization methods to the Newsela corpus, comparing outputs with manually annotated simplifications and achieving a top ROUGE-L score of 0.654. This provides insight into where summarization and simplification outputs converge and differ.

new Provence: efficient and robust context pruning for retrieval-augmented generation

Authors: Nadezhda Chirkova, Thibault Formal, Vassilina Nikoulina, St\'ephane Clinchant

Abstract: Retrieval-augmented generation improves various aspects of large language models (LLMs) generation, but suffers from computational overhead caused by long contexts as well as the propagation of irrelevant retrieved information into generated responses. Context pruning deals with both aspects, by removing irrelevant parts of retrieved contexts before LLM generation. Existing context pruning approaches are however limited, and do not provide a universal model that would be both efficient and robust in a wide range of scenarios, e.g., when contexts contain a variable amount of relevant information or vary in length, or when evaluated on various domains. In this work, we close this gap and introduce Provence (Pruning and Reranking Of retrieVEd relevaNt ContExts), an efficient and robust context pruner for Question Answering, which dynamically detects the needed amount of pruning for a given context and can be used out-of-the-box for various domains. The three key ingredients of Provence are formulating the context pruning task as sequence labeling, unifying context pruning capabilities with context reranking, and training on diverse data. Our experimental results show that Provence enables context pruning with negligible to no drop in performance, in various domains and settings, at almost no cost in a standard RAG pipeline. We also conduct a deeper analysis alongside various ablations to provide insights into training context pruners for future work.

new DBRouting: Routing End User Queries to Databases for Answerability

Authors: Priyangshu Mandal, Manasi Patwardhan, Mayur Patidar, Lovekesh Vig

Abstract: Enterprise level data is often distributed across multiple sources and identifying the correct set-of data-sources with relevant information for a knowledge request is a fundamental challenge. In this work, we define the novel task of routing an end-user query to the appropriate data-source, where the data-sources are databases. We synthesize datasets by extending existing datasets designed for NL-to-SQL semantic parsing. We create baselines on these datasets by using open-source LLMs, using both pre-trained and task specific embeddings fine-tuned using the training data. With these baselines we demonstrate that open-source LLMs perform better than embedding based approach, but suffer from token length limitations. Embedding based approaches benefit from task specific fine-tuning, more so when there is availability of data in terms of database specific questions for training. We further find that the task becomes more difficult (i) with an increase in the number of data-sources, (ii) having data-sources closer in terms of their domains,(iii) having databases without external domain knowledge required to interpret its entities and (iv) with ambiguous and complex queries requiring more fine-grained understanding of the data-sources or logical reasoning for routing to an appropriate source. This calls for the need for developing more sophisticated solutions to better address the task.

new Echoes of Discord: Forecasting Hater Reactions to Counterspeech

Authors: Xiaoying Song, Sharon Lisseth Perez, Xinchen Yu, Eduardo Blanco, Lingzi Hong

Abstract: Hate speech (HS) erodes the inclusiveness of online users and propagates negativity and division. Counterspeech has been recognized as a way to mitigate the harmful consequences. While some research has investigated the impact of user-generated counterspeech on social media platforms, few have examined and modeled haters' reactions toward counterspeech, despite the immediate alteration of haters' attitudes being an important aspect of counterspeech. This study fills the gap by analyzing the impact of counterspeech from the hater's perspective, focusing on whether the counterspeech leads the hater to reenter the conversation and if the reentry is hateful. We compile the Reddit Echoes of Hate dataset (ReEco), which consists of triple-turn conversations featuring haters' reactions, to assess the impact of counterspeech. The linguistic analysis sheds insights on the language of counterspeech to hate eliciting different haters' reactions. Experimental results demonstrate that the 3-way classification model outperforms the two-stage reaction predictor, which first predicts reentry and then determines the reentry type. We conclude the study with an assessment showing the most common errors identified by the best-performing model.

new A foundation model for human-AI collaboration in medical literature mining

Authors: Zifeng Wang, Lang Cao, Qiao Jin, Joey Chan, Nicholas Wan, Behdad Afzali, Hyun-Jin Cho, Chang-In Choi, Mehdi Emamverdi, Manjot K. Gill, Sun-Hyung Kim, Yijia Li, Yi Liu, Hanley Ong, Justin Rousseau, Irfan Sheikh, Jenny J. Wei, Ziyang Xu, Christopher M. Zallek, Kyungsang Kim, Yifan Peng, Zhiyong Lu, Jimeng Sun

Abstract: Systematic literature review is essential for evidence-based medicine, requiring comprehensive analysis of clinical trial publications. However, the application of artificial intelligence (AI) models for medical literature mining has been limited by insufficient training and evaluation across broad therapeutic areas and diverse tasks. Here, we present LEADS, an AI foundation model for study search, screening, and data extraction from medical literature. The model is trained on 633,759 instruction data points in LEADSInstruct, curated from 21,335 systematic reviews, 453,625 clinical trial publications, and 27,015 clinical trial registries. We showed that LEADS demonstrates consistent improvements over four cutting-edge generic large language models (LLMs) on six tasks. Furthermore, LEADS enhances expert workflows by providing supportive references following expert requests, streamlining processes while maintaining high-quality results. A study with 16 clinicians and medical researchers from 14 different institutions revealed that experts collaborating with LEADS achieved a recall of 0.81 compared to 0.77 experts working alone in study selection, with a time savings of 22.6%. In data extraction tasks, experts using LEADS achieved an accuracy of 0.85 versus 0.80 without using LEADS, alongside a 26.9% time savings. These findings highlight the potential of specialized medical literature foundation models to outperform generic models, delivering significant quality and efficiency benefits when integrated into expert workflows for medical literature mining.

new Return of the Encoder: Maximizing Parameter Efficiency for SLMs

Authors: Mohamed Elfeki, Rui Liu, Chad Voegele

Abstract: The dominance of large decoder-only language models has overshadowed encoder-decoder architectures, despite their fundamental efficiency advantages in sequence processing. For small language models (SLMs) - those with 1 billion parameters or fewer - our systematic analysis across GPU, CPU, and NPU platforms reveals that encoder-decoder architectures achieve 47% lower first-token latency and 4.7x higher throughput compared to decoder-only models on edge devices. These gains may be attributed to encoder-decoder's one-time input processing and efficient separation of understanding and generation phases. We introduce a novel knowledge distillation framework that enables encoder-decoder models to leverage capabilities from large scalable decoder-only teachers while preserving their architectural advantages, achieving up to 6 average performance points improvement across diverse tasks, with significant gains in asymmetric sequence tasks where input and output distributions can benefit from different processing approaches. When combined with modern advances like Rotary Positional Embeddings (RoPE) and Vision encoders, our systematic investigation demonstrates that encoder-decoder architectures provide a more practical path toward deploying capable language models in resource-constrained environments. Our findings challenge the prevailing trend toward decoder-only scaling, showing that architectural choices become increasingly crucial as parameter budgets decrease, particularly for on-device and edge deployments where computational efficiency is paramount.

new URAG: Implementing a Unified Hybrid RAG for Precise Answers in University Admission Chatbots -- A Case Study at HCMUT

Authors: Long Nguyen, Tho Quan

Abstract: With the rapid advancement of Artificial Intelligence, particularly in Natural Language Processing, Large Language Models (LLMs) have become pivotal in educational question-answering systems, especially university admission chatbots. Concepts such as Retrieval-Augmented Generation (RAG) and other advanced techniques have been developed to enhance these systems by integrating specific university data, enabling LLMs to provide informed responses on admissions and academic counseling. However, these enhanced RAG techniques often involve high operational costs and require the training of complex, specialized modules, which poses challenges for practical deployment. Additionally, in the educational context, it is crucial to provide accurate answers to prevent misinformation, a task that LLM-based systems find challenging without appropriate strategies and methods. In this paper, we introduce the Unified RAG (URAG) Framework, a hybrid approach that significantly improves the accuracy of responses, particularly for critical queries. Experimental results demonstrate that URAG enhances our in-house, lightweight model to perform comparably to state-of-the-art commercial models. Moreover, to validate its practical applicability, we conducted a case study at our educational institution, which received positive feedback and acclaim. This study not only proves the effectiveness of URAG but also highlights its feasibility for real-world implementation in educational settings.

new Matryoshka Re-Ranker: A Flexible Re-Ranking Architecture With Configurable Depth and Width

Authors: Zheng Liu, Chaofan Li, Shitao Xiao, Chaozhuo Li, Defu Lian, Yingxia Shao

Abstract: Large language models (LLMs) provide powerful foundations to perform fine-grained text re-ranking. However, they are often prohibitive in reality due to constraints on computation bandwidth. In this work, we propose a \textbf{flexible} architecture called \textbf{Matroyshka Re-Ranker}, which is designed to facilitate \textbf{runtime customization} of model layers and sequence lengths at each layer based on users' configurations. Consequently, the LLM-based re-rankers can be made applicable across various real-world situations. The increased flexibility may come at the cost of precision loss. To address this problem, we introduce a suite of techniques to optimize the performance. First, we propose \textbf{cascaded self-distillation}, where each sub-architecture learns to preserve a precise re-ranking performance from its super components, whose predictions can be exploited as smooth and informative teacher signals. Second, we design a \textbf{factorized compensation mechanism}, where two collaborative Low-Rank Adaptation modules, vertical and horizontal, are jointly employed to compensate for the precision loss resulted from arbitrary combinations of layer and sequence compression. We perform comprehensive experiments based on the passage and document retrieval datasets from MSMARCO, along with all public datasets from BEIR benchmark. In our experiments, Matryoshka Re-Ranker substantially outperforms the existing methods, while effectively preserving its superior performance across various forms of compression and different application scenarios.

new RAPID: Retrieval-Augmented Parallel Inference Drafting for Text-Based Video Event Retrieval

Authors: Long Nguyen, Huy Nguyen, Bao Khuu, Huy Luu, Huy Le, Tuan Nguyen, Tho Quan

Abstract: Retrieving events from videos using text queries has become increasingly challenging due to the rapid growth of multimedia content. Existing methods for text-based video event retrieval often focus heavily on object-level descriptions, overlooking the crucial role of contextual information. This limitation is especially apparent when queries lack sufficient context, such as missing location details or ambiguous background elements. To address these challenges, we propose a novel system called RAPID (Retrieval-Augmented Parallel Inference Drafting), which leverages advancements in Large Language Models (LLMs) and prompt-based learning to semantically correct and enrich user queries with relevant contextual information. These enriched queries are then processed through parallel retrieval, followed by an evaluation step to select the most relevant results based on their alignment with the original query. Through extensive experiments on our custom-developed dataset, we demonstrate that RAPID significantly outperforms traditional retrieval methods, particularly for contextually incomplete queries. Our system was validated for both speed and accuracy through participation in the Ho Chi Minh City AI Challenge 2024, where it successfully retrieved events from over 300 hours of video. Further evaluation comparing RAPID with the baseline proposed by the competition organizers demonstrated its superior effectiveness, highlighting the strength and robustness of our approach.

new LUCY: Linguistic Understanding and Control Yielding Early Stage of Her

Authors: Heting Gao, Hang Shao, Xiong Wang, Chaofan Qiu, Yunhang Shen, Siqi Cai, Yuchen Shi, Zihan Xu, Zuwei Long, Yike Zhang, Shaoqi Dong, Chaoyou Fu, Ke Li, Long Ma, Xing Sun

Abstract: The film Her features Samantha, a sophisticated AI audio agent who is capable of understanding both linguistic and paralinguistic information in human speech and delivering real-time responses that are natural, informative and sensitive to emotional subtleties. Moving one step toward more sophisticated audio agent from recent advancement in end-to-end (E2E) speech systems, we propose LUCY, a E2E speech model that (1) senses and responds to user's emotion, (2) deliver responses in a succinct and natural style, and (3) use external tool to answer real-time inquiries. Experiment results show that LUCY is better at emotion control than peer models, generating emotional responses based on linguistic emotional instructions and responding to paralinguistic emotional cues. Lucy is also able to generate responses in a more natural style, as judged by external language models, without sacrificing much performance on general question answering. Finally, LUCY can leverage function calls to answer questions that are out of its knowledge scope.

cross From Critique to Clarity: A Pathway to Faithful and Personalized Code Explanations with Large Language Models

Authors: Zexing Xu, Zhuang Luo, Yichuan Li, Kyumin Lee, S. Rasoul Etesami

Abstract: In the realm of software development, providing accurate and personalized code explanations is crucial for both technical professionals and business stakeholders. Technical professionals benefit from enhanced understanding and improved problem-solving skills, while business stakeholders gain insights into project alignments and transparency. Despite the potential, generating such explanations is often time-consuming and challenging. This paper presents an innovative approach that leverages the advanced capabilities of large language models (LLMs) to generate faithful and personalized code explanations. Our methodology integrates prompt enhancement, self-correction mechanisms, personalized content customization, and interaction with external tools, facilitated by collaboration among multiple LLM agents. We evaluate our approach using both automatic and human assessments, demonstrating that our method not only produces accurate explanations but also tailors them to individual user preferences. Our findings suggest that this approach significantly improves the quality and relevance of code explanations, offering a valuable tool for developers and stakeholders alike.

cross Leveraging Social Media Data and Artificial Intelligence for Improving Earthquake Response Efforts

Authors: Kalin Kopanov, Velizar Varbanov, Tatiana Atanasova

Abstract: The integration of social media and artificial intelligence (AI) into disaster management, particularly for earthquake response, represents a profound evolution in emergency management practices. In the digital age, real-time information sharing has reached unprecedented levels, with social media platforms emerging as crucial communication channels during crises. This shift has transformed traditional, centralized emergency services into more decentralized, participatory models of disaster situational awareness. Our study includes an experimental analysis of 8,900 social media interactions, including 2,920 posts and 5,980 replies on X (formerly Twitter), following a magnitude 5.1 earthquake in Oklahoma on February 2, 2024. The analysis covers data from the immediate aftermath and extends over the following seven days, illustrating the critical role of digital platforms in modern disaster response. The results demonstrate that social media platforms can be effectively used as real-time situational awareness tools, delivering critical information to society and authorities during emergencies.

cross Wormhole Memory: A Rubik's Cube for Cross-Dialogue Retrieval

Authors: Libo Wang

Abstract: In view of the gap in the current large language model in sharing memory across dialogues, this research proposes a wormhole memory module (WMM) to realize memory as a Rubik's cube that can be arbitrarily retrieved between different dialogues. Through simulation experiments, the researcher built an experimental framework based on the Python environment and used setting memory barriers to simulate the current situation where memories between LLMs dialogues are difficult to share. The CoQA development data set was imported into the experiment, and the feasibility of its cross-dialogue memory retrieval function was verified for WMM's nonlinear indexing and dynamic retrieval, and a comparative analysis was conducted with the capabilities of Titans and MemGPT memory modules. Experimental results show that WMM demonstrated the ability to retrieve memory across dialogues and the stability of quantitative indicators in eight experiments. It contributes new technical approaches to the optimization of memory management of LLMs and provides experience for the practical application in the future.

cross E-Gen: Leveraging E-Graphs to Improve Continuous Representations of Symbolic Expressions

Authors: Hongbo Zheng, Suyuan Wang, Neeraj Gangwar, Nickvash Kani

Abstract: As vector representations have been pivotal in advancing natural language processing (NLP), some prior research has concentrated on creating embedding techniques for mathematical expressions by leveraging mathematically equivalent expressions. While effective, these methods are limited by the training data. In this work, we propose augmenting prior algorithms with larger synthetic dataset, using a novel e-graph-based generation scheme. This new mathematical dataset generation scheme, E-Gen, improves upon prior dataset-generation schemes that are limited in size and operator types. We use this dataset to compare embedding models trained with two methods: (1) training the model to generate mathematically equivalent expressions, and (2) training the model using contrastive learning to group mathematically equivalent expressions explicitly. We evaluate the embeddings generated by these methods against prior work on both in-distribution and out-of-distribution language processing tasks. Finally, we compare the performance of our embedding scheme against state-of-the-art large language models and demonstrate that embedding-based language processing methods perform better than LLMs on several tasks, demonstrating the necessity of optimizing embedding methods for the mathematical data modality.

cross LLM4DistReconfig: A Fine-tuned Large Language Model for Power Distribution Network Reconfiguration

Authors: Panayiotis Christou, Md. Zahidul Islam, Yuzhang Lin, Jingwei Xiong

Abstract: Power distribution networks are evolving due to the integration of DERs and increased customer participation. To maintain optimal operation, minimize losses, and meet varying load demands, frequent network reconfiguration is necessary. Traditionally, the reconfiguration task relies on optimization software and expert operators, but as systems grow more complex, faster and more adaptive solutions are required without expert intervention. Data-driven reconfiguration is gaining traction for its accuracy, speed, and robustness against incomplete network data. LLMs, with their ability to capture complex patterns, offer a promising approach for efficient and responsive network reconfiguration in evolving complex power networks. In this work, we introduce LLM4DistReconfig, a deep learning-based approach utilizing a fine-tuned LLM to solve the distribution network reconfiguration problem. By carefully crafting prompts and designing a custom loss function, we train the LLM with inputs representing network parameters such as buses, available lines, open lines, node voltages, and system loss. The model then predicts optimal reconfigurations by outputting updated network configurations that minimize system loss while meeting operational constraints. Our approach significantly reduces inference time compared to classical algorithms, allowing for near real-time optimal reconfiguration after training. Experimental results show that our method generates optimal configurations minimizing system loss for five individual and a combined test dataset. It also produces minimal invalid edges, no cycles, or subgraphs across all datasets, fulfilling domain-specific needs. Additionally, the generated responses contain less than 5% improper outputs on seen networks and satisfactory results on unseen networks, demonstrating its effectiveness and reliability for the reconfiguration task.

cross OptiSeq: Optimizing Example Ordering for In-Context Learning

Authors: Rahul Atul Bhope, Praveen Venkateswaran, K. R. Jayaram, Vatche Isahagian, Vinod Muthusamy, Nalini Venkatasubramanian

Abstract: Developers using LLMs in their applications and agents have provided plenty of anecdotal evidence that in-context-learning (ICL) is fragile. In addition to the quantity and quality of examples, we show that the order in which the in-context examples are listed in the prompt affects the output of the LLM and, consequently, their performance. In this paper, we present OptiSeq, which introduces a score based on log probabilities of LLM outputs to prune the universe of possible example orderings in few-shot ICL and recommend the best order(s) by distinguishing between correct and incorrect outputs resulting from different order permutations. Through a detailed empirical evaluation on multiple LLMs, datasets and prompts, we demonstrate that OptiSeq improves accuracy by 6 - 10.5 percentage points across multiple tasks.

cross Feedback-Aware Monte Carlo Tree Search for Efficient Information Seeking in Goal-Oriented Conversations

Authors: Harshita Chopra, Chirag Shah

Abstract: The ability to identify and acquire missing information is a critical component of effective decision making and problem solving. With the rise of conversational artificial intelligence (AI) systems, strategically formulating information-seeking questions becomes crucial and demands efficient methods to guide the search process. We introduce a novel approach to adaptive question-asking through a combination of Large Language Models (LLM) for generating questions that maximize information gain, Monte Carlo Tree Search (MCTS) for constructing and leveraging a decision tree across multiple samples, and a hierarchical feedback mechanism to learn from past interactions. We present two key innovations: (1) an adaptive MCTS algorithm that balances exploration and exploitation for efficient search over potential questions; and (2) a clustering-based feedback algorithm that leverages prior experience to guide future interactions. Each incoming sample is assigned to a cluster based on its semantic similarity with previously observed samples. Our UCT (Upper Confidence bound for Trees) formulation selects optimal questions by combining expected rewards, a function of information gain, with a cluster-specific bonus that decays with depth, to emphasize the importance of early-stage questions that have proven effective for narrowing the solution space in similar samples. Experiments across three domains, including medical diagnosis and troubleshooting, demonstrate that our method leads to an average of 12% improvement in success rates and a 10x reduction in the average number of LLM calls made per conversation for the search process, in comparison to the state of the art.

cross Analyzing and Boosting the Power of Fine-Grained Visual Recognition for Multi-modal Large Language Models

Authors: Hulingxiao He, Geng Li, Zijun Geng, Jinglin Xu, Yuxin Peng

Abstract: Multi-modal large language models (MLLMs) have shown remarkable abilities in various visual understanding tasks. However, MLLMs still struggle with fine-grained visual recognition (FGVR), which aims to identify subordinate-level categories from images. This can negatively impact more advanced capabilities of MLLMs, such as object-centric visual question answering and reasoning. In our study, we revisit three quintessential capabilities of MLLMs for FGVR, including object information extraction, category knowledge reserve, object-category alignment, and position of the root cause as a misalignment problem. To address this issue, we present Finedefics, an MLLM that enhances the model's FGVR capability by incorporating informative attribute descriptions of objects into the training phase. We employ contrastive learning on object-attribute pairs and attribute-category pairs simultaneously and use examples from similar but incorrect categories as hard negatives, naturally bringing representations of visual objects and category names closer. Extensive evaluations across multiple popular FGVR datasets demonstrate that Finedefics outperforms existing MLLMs of comparable parameter sizes, showcasing its remarkable efficacy. The code is available at https://github.com/PKU-ICST-MIPL/Finedefics_ICLR2025.

URLs: https://github.com/PKU-ICST-MIPL/Finedefics_ICLR2025.

cross ToMoE: Converting Dense Large Language Models to Mixture-of-Experts through Dynamic Structural Pruning

Authors: Shangqian Gao, Ting Hua, Reza Shirkavand, Chi-Heng Lin, Zhen Tang, Zhengao Li, Longge Yuan, Fangyi Li, Zeyu Zhang, Alireza Ganjdanesh, Lou Qian, Xu Jie, Yen-Chang Hsu

Abstract: Large Language Models (LLMs) have demonstrated remarkable abilities in tackling a wide range of complex tasks. However, their huge computational and memory costs raise significant challenges in deploying these models on resource-constrained devices or efficiently serving them. Prior approaches have attempted to alleviate these problems by permanently removing less important model structures, yet these methods often result in substantial performance degradation due to the permanent deletion of model parameters. In this work, we tried to mitigate this issue by reducing the number of active parameters without permanently removing them. Specifically, we introduce a differentiable dynamic pruning method that pushes dense models to maintain a fixed number of active parameters by converting their MLP layers into a Mixture of Experts (MoE) architecture. Our method, even without fine-tuning, consistently outperforms previous structural pruning techniques across diverse model families, including Phi-2, LLaMA-2, LLaMA-3, and Qwen-2.5.

cross Diffusion-based Hierarchical Negative Sampling for Multimodal Knowledge Graph Completion

Authors: Guanglin Niu, Xiaowei Zhang

Abstract: Multimodal Knowledge Graph Completion (MMKGC) aims to address the critical issue of missing knowledge in multimodal knowledge graphs (MMKGs) for their better applications. However, both the previous MMGKC and negative sampling (NS) approaches ignore the employment of multimodal information to generate diverse and high-quality negative triples from various semantic levels and hardness levels, thereby limiting the effectiveness of training MMKGC models. Thus, we propose a novel Diffusion-based Hierarchical Negative Sampling (DHNS) scheme tailored for MMKGC tasks, which tackles the challenge of generating high-quality negative triples by leveraging a Diffusion-based Hierarchical Embedding Generation (DiffHEG) that progressively conditions on entities and relations as well as multimodal semantics. Furthermore, we develop a Negative Triple-Adaptive Training (NTAT) strategy that dynamically adjusts training margins associated with the hardness level of the synthesized negative triples, facilitating a more robust and effective learning procedure to distinguish between positive and negative triples. Extensive experiments on three MMKGC benchmark datasets demonstrate that our framework outperforms several state-of-the-art MMKGC models and negative sampling techniques, illustrating the effectiveness of our DHNS for training MMKGC models. The source codes and datasets of this paper are available at https://github.com/ngl567/DHNS.

URLs: https://github.com/ngl567/DHNS.

cross The Potential of Large Language Models in Supply Chain Management: Advancing Decision-Making, Efficiency, and Innovation

Authors: Raha Aghaei, Ali A. Kiaei, Mahnaz Boush, Javad Vahidi, Zeynab Barzegar, Mahan Rofoosheh

Abstract: The integration of large language models (LLMs) into supply chain management (SCM) is revolutionizing the industry by improving decision-making, predictive analytics, and operational efficiency. This white paper explores the transformative impact of LLMs on various SCM functions, including demand forecasting, inventory management, supplier relationship management, and logistics optimization. By leveraging advanced data analytics and real-time insights, LLMs enable organizations to optimize resources, reduce costs, and improve responsiveness to market changes. Key findings highlight the benefits of integrating LLMs with emerging technologies such as IoT, blockchain, and robotics, which together create smarter and more autonomous supply chains. Ethical considerations, including bias mitigation and data protection, are taken into account to ensure fair and transparent AI practices. In addition, the paper discusses the need to educate the workforce on how to manage new AI-driven processes and the long-term strategic benefits of adopting LLMs. Strategic recommendations for SCM professionals include investing in high-quality data management, promoting cross-functional collaboration, and aligning LLM initiatives with overall business goals. The findings highlight the potential of LLMs to drive innovation, sustainability, and competitive advantage in the ever-changing supply chain management landscape.

cross Mind the Value-Action Gap: Do LLMs Act in Alignment with Their Values?

Authors: Hua Shen, Nicholas Clark, Tanushree Mitra

Abstract: Existing research primarily evaluates the values of LLMs by examining their stated inclinations towards specific values. However, the "Value-Action Gap," a phenomenon rooted in environmental and social psychology, reveals discrepancies between individuals' stated values and their actions in real-world contexts. To what extent do LLMs exhibit a similar gap between their stated values and their actions informed by those values? This study introduces ValueActionLens, an evaluation framework to assess the alignment between LLMs' stated values and their value-informed actions. The framework encompasses the generation of a dataset comprising 14.8k value-informed actions across twelve cultures and eleven social topics, and two tasks to evaluate how well LLMs' stated value inclinations and value-informed actions align across three different alignment measures. Extensive experiments reveal that the alignment between LLMs' stated values and actions is sub-optimal, varying significantly across scenarios and models. Analysis of misaligned results identifies potential harms from certain value-action gaps. To predict the value-action gaps, we also uncover that leveraging reasoned explanations improves performance. These findings underscore the risks of relying solely on the LLMs' stated values to predict their behaviors and emphasize the importance of context-aware evaluations of LLM values and value-action gaps.

cross Commute Your Domains: Trajectory Optimality Criterion for Multi-Domain Learning

Authors: Alexey Rukhovich, Alexander Podolskiy, Irina Piontkovskaya

Abstract: In multi-domain learning, a single model is trained on diverse data domains to leverage shared knowledge and improve generalization. The order in which the data from these domains is used for training can significantly affect the model's performance on each domain. However, this dependence is under-studied. In this paper, we investigate the influence of training order (or data mixing) in multi-domain learning using the concept of Lie bracket of gradient vector fields. By analyzing the infinitesimal effects of changing the training order, we identify regions in the parameter space where altering the order between two training domains can benefit the target loss. We validate the predictions of our theoretical framework on the influence of training order (or data mixing) both on a toy example and bilingual LLM pre-training.

cross ConceptCLIP: Towards Trustworthy Medical AI via Concept-Enhanced Contrastive Langauge-Image Pre-training

Authors: Yuxiang Nie, Sunan He, Yequan Bie, Yihui Wang, Zhixuan Chen, Shu Yang, Hao Chen

Abstract: Trustworthiness is essential for the precise and interpretable application of artificial intelligence (AI) in medical imaging. Traditionally, precision and interpretability have been addressed as separate tasks, namely medical image analysis and explainable AI, each developing its own models independently. In this study, for the first time, we investigate the development of a unified medical vision-language pre-training model that can achieve both accurate analysis and interpretable understanding of medical images across various modalities. To build the model, we construct MedConcept-23M, a large-scale dataset comprising 23 million medical image-text pairs extracted from 6.2 million scientific articles, enriched with concepts from the Unified Medical Language System (UMLS). Based on MedConcept-23M, we introduce ConceptCLIP, a medical AI model utilizing concept-enhanced contrastive language-image pre-training. The pre-training of ConceptCLIP involves two primary components: image-text alignment learning (IT-Align) and patch-concept alignment learning (PC-Align). This dual alignment strategy enhances the model's capability to associate specific image regions with relevant concepts, thereby improving both the precision of analysis and the interpretability of the AI system. We conducted extensive experiments on 5 diverse types of medical image analysis tasks, spanning 51 subtasks across 10 image modalities, with the broadest range of downstream tasks. The results demonstrate the effectiveness of the proposed vision-language pre-training model. Further explainability analysis across 6 modalities reveals that ConceptCLIP achieves superior performance, underscoring its robust ability to advance explainable AI in medical imaging. These findings highlight ConceptCLIP's capability in promoting trustworthy AI in the field of medicine.

cross Rethinking External Slow-Thinking: From Snowball Errors to Probability of Correct Reasoning

Authors: Zeyu Gan, Yun Liao, Yong Liu

Abstract: Test-time scaling, which is also often referred to as \textit{slow-thinking}, has been demonstrated to enhance multi-step reasoning in large language models (LLMs). However, despite its widespread utilization, the mechanisms underlying slow-thinking methods remain poorly understood. This paper explores the mechanisms of external slow-thinking from a theoretical standpoint. We begin by examining the snowball error effect within the LLM reasoning process and connect it to the likelihood of correct reasoning using information theory. Building on this, we show that external slow-thinking methods can be interpreted as strategies to mitigate the error probability. We further provide a comparative analysis of popular external slow-thinking approaches, ranging from simple to complex, highlighting their differences and interrelationships. Our findings suggest that the efficacy of these methods is not primarily determined by the specific framework employed, and that expanding the search scope or the model's internal reasoning capacity may yield more sustained improvements in the long term. We open-source our code at \url{https://github.com/ZyGan1999/Snowball-Errors-and-Probability}.

URLs: https://github.com/ZyGan1999/Snowball-Errors-and-Probability

cross Stepback: Enhanced Disentanglement for Voice Conversion via Multi-Task Learning

Authors: Qian Yang, Calbert Graham

Abstract: Voice conversion (VC) modifies voice characteristics while preserving linguistic content. This paper presents the Stepback network, a novel model for converting speaker identity using non-parallel data. Unlike traditional VC methods that rely on parallel data, our approach leverages deep learning techniques to enhance disentanglement completion and linguistic content preservation. The Stepback network incorporates a dual flow of different domain data inputs and uses constraints with self-destructive amendments to optimize the content encoder. Extensive experiments show that our model significantly improves VC performance, reducing training costs while achieving high-quality voice conversion. The Stepback network's design offers a promising solution for advanced voice conversion tasks.

cross Blissful (A)Ignorance: People form overly positive impressions of others based on their written messages, despite wide-scale adoption of Generative AI

Authors: Jiaqi Zhu, Andras Molnar

Abstract: As the use of Generative AI (GenAI) tools becomes more prevalent in interpersonal communication, understanding their impact on social perceptions is crucial. According to signaling theory, GenAI may undermine the credibility of social signals conveyed in writing, since it reduces the cost of writing and makes it hard to verify the authenticity of messages. Using a pre-registered large-scale online experiment (N = 647; Prolific), featuring scenarios in a range of communication contexts (personal vs. professional; close others vs. strangers), we explored how senders' use of GenAI influenced recipients' impressions of senders, both when GenAI use was known or uncertain. Consistent with past work, we found strong negative effects on social impressions when disclosing that a message was AI-generated, compared to when the same message was human-written. However, under the more realistic condition when potential GenAI use was not explicitly highlighted, recipients did not exhibit any skepticism towards senders, and these "uninformed" impressions were virtually indistinguishable from those of fully human-written messages. Even when we highlighted the potential (but uncertain) use of GenAI, recipients formed overly positive impressions. These results are especially striking given that 46% of our sample admitted having used such tools for writing messages, just within the past two weeks. Our findings put past work in a new light: While social judgments can be substantially affected when GenAI use is explicitly disclosed, this information may not be readily available in more realistic communication settings, making recipients blissfully ignorant about others' potential use of GenAI.

cross Beyond Benchmarks: On The False Promise of AI Regulation

Authors: Gabriel Stanovsky, Renana Keydar, Gadi Perl, Eliya Habba

Abstract: The rapid advancement of artificial intelligence (AI) systems in critical domains like healthcare, justice, and social services has sparked numerous regulatory initiatives aimed at ensuring their safe deployment. Current regulatory frameworks, exemplified by recent US and EU efforts, primarily focus on procedural guidelines while presuming that scientific benchmarking can effectively validate AI safety, similar to how crash tests verify vehicle safety or clinical trials validate drug efficacy. However, this approach fundamentally misunderstands the unique technical challenges posed by modern AI systems. Through systematic analysis of successful technology regulation case studies, we demonstrate that effective scientific regulation requires a causal theory linking observable test outcomes to future performance - for instance, how a vehicle's crash resistance at one speed predicts its safety at lower speeds. We show that deep learning models, which learn complex statistical patterns from training data without explicit causal mechanisms, preclude such guarantees. This limitation renders traditional regulatory approaches inadequate for ensuring AI safety. Moving forward, we call for regulators to reckon with this limitation, and propose a preliminary two-tiered regulatory framework that acknowledges these constraints: mandating human oversight for high-risk applications while developing appropriate risk communication strategies for lower-risk uses. Our findings highlight the urgent need to reconsider fundamental assumptions in AI regulation and suggest a concrete path forward for policymakers and researchers.

cross Risk-Aware Distributional Intervention Policies for Language Models

Authors: Bao Nguyen, Binh Nguyen, Duy Nguyen, Viet Anh Nguyen

Abstract: Language models are prone to occasionally undesirable generations, such as harmful or toxic content, despite their impressive capability to produce texts that appear accurate and coherent. This paper presents a new two-stage approach to detect and mitigate undesirable content generations by rectifying activations. First, we train an ensemble of layerwise classifiers to detect undesirable content using activations by minimizing a smooth surrogate of the risk-aware score. Then, for contents that are detected as undesirable, we propose layerwise distributional intervention policies that perturb the attention heads minimally while guaranteeing probabilistically the effectiveness of the intervention. Benchmarks on several language models and datasets show that our method outperforms baselines in reducing the generation of undesirable output.

cross LemmaHead: RAG Assisted Proof Generation Using Large Language Models

Authors: Tianbo Yang, Mingqi Yang, Hongyi Zhao, Tianshuo Yang

Abstract: Developing the logic necessary to solve mathematical problems or write mathematical proofs is one of the more difficult objectives for large language models (LLMS). Currently, the most popular methods in literature consists of fine-tuning the model on written mathematical content such as academic publications and textbooks, so that the model can learn to emulate the style of mathematical writing. In this project, we explore the effectiveness of using retrieval augmented generation (RAG) to address gaps in the mathematical reasoning of LLMs. We develop LemmaHead, a RAG knowledge base that supplements queries to the model with relevant mathematical context, with particular focus on context from published textbooks. To measure our model's performance in mathematical reasoning, our testing paradigm focuses on the task of automated theorem proving via generating proofs to a given mathematical claim in the Lean formal language.

cross Are Transformers Able to Reason by Connecting Separated Knowledge in Training Data?

Authors: Yutong Yin, Zhaoran Wang

Abstract: Humans exhibit remarkable compositional reasoning by integrating knowledge from various sources. For example, if someone learns ( B = f(A) ) from one source and ( C = g(B) ) from another, they can deduce ( C=g(B)=g(f(A)) ) even without encountering ( ABC ) together, showcasing the generalization ability of human intelligence. In this paper, we introduce a synthetic learning task, "FTCT" (Fragmented at Training, Chained at Testing), to validate the potential of Transformers in replicating this skill and interpret its inner mechanism. In the training phase, data consist of separated knowledge fragments from an overall causal graph. During testing, Transformers must infer complete causal graph traces by integrating these fragments. Our findings demonstrate that few-shot Chain-of-Thought prompting enables Transformers to perform compositional reasoning on FTCT by revealing correct combinations of fragments, even if such combinations were absent in the training data. Furthermore, the emergence of compositional reasoning ability is strongly correlated with the model complexity and training-testing data similarity. We propose, both theoretically and empirically, that Transformers learn an underlying generalizable program from training, enabling effective compositional reasoning during testing.

cross Emilia: A Large-Scale, Extensive, Multilingual, and Diverse Dataset for Speech Generation

Authors: Haorui He, Zengqiang Shang, Chaoren Wang, Xuyuan Li, Yicheng Gu, Hua Hua, Liwei Liu, Chen Yang, Jiaqi Li, Peiyang Shi, Yuancheng Wang, Kai Chen, Pengyuan Zhang, Zhizheng Wu

Abstract: Recent advancements in speech generation have been driven by the large-scale training datasets. However, current models fall short of capturing the spontaneity and variability inherent in real-world human speech, due to their reliance on audiobook datasets limited to formal read-aloud speech styles. To bridge this gap, we introduce Emilia-Pipe, an open-source preprocessing pipeline to extract high-quality training data from valuable yet underexplored in-the-wild data that capture spontaneous human speech in real-world contexts. By leveraging Emilia-Pipe, we construct Emilia, the first multilingual speech generation dataset derived from in-the-wild speech data. This dataset comprises over 101k hours of speech across six languages: English, Chinese, German, French, Japanese, and Korean. Besides, we expand Emilia to Emilia-Large, a dataset exceeding 216k hours, making it the largest open-source speech generation dataset available. Extensive experiments demonstrate that Emilia significantly outperforms traditional audiobook datasets in generating spontaneous and human-like speech, showcasing superior performance in capturing diverse speaker timbre and speaking styles of real-world human speech. Furthermore, this work underscores the importance of scaling dataset size to advance speech generation research and validates the effectiveness of Emilia for both multilingual and crosslingual speech generation.

cross Challenging Assumptions in Learning Generic Text Style Embeddings

Authors: Phil Ostheimer, Marius Kloft, Sophie Fellenz

Abstract: Recent advancements in language representation learning primarily emphasize language modeling for deriving meaningful representations, often neglecting style-specific considerations. This study addresses this gap by creating generic, sentence-level style embeddings crucial for style-centric tasks. Our approach is grounded on the premise that low-level text style changes can compose any high-level style. We hypothesize that applying this concept to representation learning enables the development of versatile text style embeddings. By fine-tuning a general-purpose text encoder using contrastive learning and standard cross-entropy loss, we aim to capture these low-level style shifts, anticipating that they offer insights applicable to high-level text styles. The outcomes prompt us to reconsider the underlying assumptions as the results do not always show that the learned style representations capture high-level text styles.

cross Enhancing and Exploring Mild Cognitive Impairment Detection with W2V-BERT-2.0

Authors: Yueguan Wang, Tatsunari Matsushima, Soichiro Matsushima, Toshimitsu Sakai

Abstract: This study explores a multi-lingual audio self-supervised learning model for detecting mild cognitive impairment (MCI) using the TAUKADIAL cross-lingual dataset. While speech transcription-based detection with BERT models is effective, limitations exist due to a lack of transcriptions and temporal information. To address these issues, the study utilizes features directly from speech utterances with W2V-BERT-2.0. We propose a visualization method to detect essential layers of the model for MCI classification and design a specific inference logic considering the characteristics of MCI. The experiment shows competitive results, and the proposed inference logic significantly contributes to the improvements from the baseline. We also conduct detailed analysis which reveals the challenges related to speaker bias in the features and the sensitivity of MCI classification accuracy to the data split, providing valuable insights for future research.

cross Phase Transitions in Large Language Models and the $O(N)$ Model

Authors: Youran Sun, Babak Haghighat

Abstract: Large language models (LLMs) exhibit unprecedentedly rich scaling behaviors. In physics, scaling behavior is closely related to phase transitions, critical phenomena, and field theory. To investigate the phase transition phenomena in LLMs, we reformulated the Transformer architecture as an $O(N)$ model. Our study reveals two distinct phase transitions corresponding to the temperature used in text generation and the model's parameter size, respectively. The first phase transition enables us to estimate the internal dimension of the model, while the second phase transition is of \textit{higher-depth} and signals the emergence of new capabilities. As an application, the energy of the $O(N)$ model can be used to evaluate whether an LLM's parameters are sufficient to learn the training data.

cross Zero-Shot Decision Tree Construction via Large Language Models

Authors: Lucas Carrasco, Felipe Urrutia, Andr\'es Abeliuk

Abstract: This paper introduces a novel algorithm for constructing decision trees using large language models (LLMs) in a zero-shot manner based on Classification and Regression Trees (CART) principles. Traditional decision tree induction methods rely heavily on labeled data to recursively partition data using criteria such as information gain or the Gini index. In contrast, we propose a method that uses the pre-trained knowledge embedded in LLMs to build decision trees without requiring training data. Our approach leverages LLMs to perform operations essential for decision tree construction, including attribute discretization, probability calculation, and Gini index computation based on the probabilities. We show that these zero-shot decision trees can outperform baseline zero-shot methods and achieve competitive performance compared to supervised data-driven decision trees on tabular datasets. The decision trees constructed via this method provide transparent and interpretable models, addressing data scarcity while preserving interpretability. This work establishes a new baseline in low-data machine learning, offering a principled, knowledge-driven alternative to data-driven tree construction.

cross Mixture-of-Mamba: Enhancing Multi-Modal State-Space Models with Modality-Aware Sparsity

Authors: Weixin Liang, Junhong Shen, Genghan Zhang, Ning Dong, Luke Zettlemoyer, Lili Yu

Abstract: State Space Models (SSMs) have emerged as efficient alternatives to Transformers for sequential modeling, but their inability to leverage modality-specific features limits their performance in multi-modal pretraining. Here, we propose Mixture-of-Mamba, a novel SSM architecture that introduces modality-aware sparsity through modality-specific parameterization of the Mamba block. Building on Mixture-of-Transformers (W. Liang et al. arXiv:2411.04996; 2024), we extend the benefits of modality-aware sparsity to SSMs while preserving their computational efficiency. We evaluate Mixture-of-Mamba across three multi-modal pretraining settings: Transfusion (interleaved text and continuous image tokens with diffusion loss), Chameleon (interleaved text and discrete image tokens), and an extended three-modality framework incorporating speech. Mixture-of-Mamba consistently reaches the same loss values at earlier training steps with significantly reduced computational costs. In the Transfusion setting, Mixture-of-Mamba achieves equivalent image loss using only 34.76% of the training FLOPs at the 1.4B scale. In the Chameleon setting, Mixture-of-Mamba reaches similar image loss with just 42.50% of the FLOPs at the 1.4B scale, and similar text loss with just 65.40% of the FLOPs. In the three-modality setting, MoM matches speech loss at 24.80% of the FLOPs at the 1.4B scale. Our ablation study highlights the synergistic effects of decoupling projection components, where joint decoupling yields greater gains than individual modifications. These results establish modality-aware sparsity as a versatile and effective design principle, extending its impact from Transformers to SSMs and setting new benchmarks in multi-modal pretraining. Our code can be accessed at https://github.com/Weixin-Liang/Mixture-of-Mamba

URLs: https://github.com/Weixin-Liang/Mixture-of-Mamba

replace A Survey of Large Language Models for Healthcare: from Data, Technology, and Applications to Accountability and Ethics

Authors: Kai He, Rui Mao, Qika Lin, Yucheng Ruan, Xiang Lan, Mengling Feng, Erik Cambria

Abstract: The utilization of large language models (LLMs) in the Healthcare domain has generated both excitement and concern due to their ability to effectively respond to freetext queries with certain professional knowledge. This survey outlines the capabilities of the currently developed LLMs for Healthcare and explicates their development process, with the aim of providing an overview of the development roadmap from traditional Pretrained Language Models (PLMs) to LLMs. Specifically, we first explore the potential of LLMs to enhance the efficiency and effectiveness of various Healthcare applications highlighting both the strengths and limitations. Secondly, we conduct a comparison between the previous PLMs and the latest LLMs, as well as comparing various LLMs with each other. Then we summarize related Healthcare training data, training methods, optimization strategies, and usage. Finally, the unique concerns associated with deploying LLMs in Healthcare settings are investigated, particularly regarding fairness, accountability, transparency and ethics. Our survey provide a comprehensive investigation from perspectives of both computer science and Healthcare specialty. Besides the discussion about Healthcare concerns, we supports the computer science community by compiling a collection of open source resources, such as accessible datasets, the latest methodologies, code implementations, and evaluation benchmarks in the Github. Summarily, we contend that a significant paradigm shift is underway, transitioning from PLMs to LLMs. This shift encompasses a move from discriminative AI approaches to generative AI approaches, as well as a shift from model-centered methodologies to data-centered methodologies. Also, we determine that the biggest obstacle of using LLMs in Healthcare are fairness, accountability, transparency and ethics.

replace Large language models can replicate cross-cultural differences in personality

Authors: Pawe{\l} Niszczota, Mateusz Janczak, Micha{\l} Misiak

Abstract: We use a large-scale experiment (N=8000) to determine whether GPT-4 can replicate cross-cultural differences in the Big Five, measured using the Ten-Item Personality Inventory. We used the US and South Korea as the cultural pair, given that prior research suggests substantial personality differences between people from these two countries. We manipulated the target of the simulation (US vs. Korean), the language of the inventory (English vs. Korean), and the language model (GPT-4 vs. GPT-3.5). Our results show that GPT-4 replicated the cross-cultural differences for each factor. However, mean ratings had an upward bias and exhibited lower variation than in the human samples, as well as lower structural validity. We provide preliminary evidence that LLMs can aid cross-cultural researchers and practitioners.

replace Concept-Guided Chain-of-Thought Prompting for Pairwise Comparison Scoring of Texts with Large Language Models

Authors: Patrick Y. Wu, Jonathan Nagler, Joshua A. Tucker, Solomon Messing

Abstract: Existing text scoring methods require a large corpus, struggle with short texts, or require hand-labeled data. We develop a text scoring framework that leverages generative large language models (LLMs) to (1) set texts against the backdrop of information from the near-totality of the web and digitized media, and (2) effectively transform pairwise text comparisons from a reasoning problem to a pattern recognition task. Our approach, concept-guided chain-of-thought (CGCoT), utilizes a chain of researcher-designed prompts with an LLM to generate a concept-specific breakdown for each text, akin to guidance provided to human coders. We then pairwise compare breakdowns using an LLM and aggregate answers into a score using a probability model. We apply this approach to better understand speech reflecting aversion to specific political parties on Twitter, a topic that has commanded increasing interest because of its potential contributions to democratic backsliding. We achieve stronger correlations with human judgments than widely used unsupervised text scoring methods like Wordfish. In a supervised setting, besides a small pilot dataset to develop CGCoT prompts, our measures require no additional hand-labeled data and produce predictions on par with RoBERTa-Large fine-tuned on thousands of hand-labeled tweets. This project showcases the potential of combining human expertise and LLMs for scoring tasks.

replace Knowledge-Infused Prompting: Assessing and Advancing Clinical Text Data Generation with Large Language Models

Authors: Ran Xu, Hejie Cui, Yue Yu, Xuan Kan, Wenqi Shi, Yuchen Zhuang, Wei Jin, Joyce Ho, Carl Yang

Abstract: Clinical natural language processing requires methods that can address domain-specific challenges, such as complex medical terminology and clinical contexts. Recently, large language models (LLMs) have shown promise in this domain. Yet, their direct deployment can lead to privacy issues and are constrained by resources. To address this challenge, we delve into synthetic clinical text generation using LLMs for clinical NLP tasks. We propose an innovative, resource-efficient approach, ClinGen, which infuses knowledge into the process. Our model involves clinical knowledge extraction and context-informed LLM prompting. Both clinical topics and writing styles are drawn from external domain-specific knowledge graphs and LLMs to guide data generation. Our extensive empirical study across 7 clinical NLP tasks and 16 datasets reveals that ClinGen consistently enhances performance across various tasks, effectively aligning the distribution of real datasets and significantly enriching the diversity of generated training instances. Our code is available at \url{https://github.com/ritaranx/ClinGen}.

URLs: https://github.com/ritaranx/ClinGen

replace Benchmarking and Defending Against Indirect Prompt Injection Attacks on Large Language Models

Authors: Jingwei Yi, Yueqi Xie, Bin Zhu, Emre Kiciman, Guangzhong Sun, Xing Xie, Fangzhao Wu

Abstract: The integration of large language models with external content has enabled applications such as Microsoft Copilot but also introduced vulnerabilities to indirect prompt injection attacks. In these attacks, malicious instructions embedded within external content can manipulate LLM outputs, causing deviations from user expectations. To address this critical yet under-explored issue, we introduce the first benchmark for indirect prompt injection attacks, named BIPIA, to assess the risk of such vulnerabilities. Using BIPIA, we evaluate existing LLMs and find them universally vulnerable. Our analysis identifies two key factors contributing to their success: LLMs' inability to distinguish between informational context and actionable instructions, and their lack of awareness in avoiding the execution of instructions within external content. Based on these findings, we propose two novel defense mechanisms-boundary awareness and explicit reminder-to address these vulnerabilities in both black-box and white-box settings. Extensive experiments demonstrate that our black-box defense provides substantial mitigation, while our white-box defense reduces the attack success rate to near-zero levels, all while preserving the output quality of LLMs. We hope this work inspires further research into securing LLM applications and fostering their safe and reliable use.

replace Fine-Grained Stateful Knowledge Exploration: A Novel Paradigm for Integrating Knowledge Graphs with Large Language Models

Authors: Dehao Tao, Congqi Wang, Feng Huang, Junhao Chen, Yongfeng Huang, Minghu Jiang

Abstract: Large Language Models (LLMs) have shown impressive capabilities, yet updating their knowledge remains a significant challenge, often leading to outdated or inaccurate responses. A proposed solution is the integration of external knowledge bases, such as knowledge graphs, with LLMs. Most existing methods use a paradigm that treats the question as the objective, with relevant knowledge being incrementally retrieved from the knowledge graph. However, this strategy frequently experiences an mismatch in the granularity of knowledge between the target question and the entities and relations being retrieved. As a result, the information in the question cannot precisely correspond to the retrieved knowledge. This may cause redundant exploration or omission of vital knowledge, thereby leading to enhanced computational consumption and reduced retrieval accuracy. In this paper, we propose a novel paradigm of fine-grained stateful knowledge exploration, which addresses the `information granularity mismatch' issue. We extract fine-grained information from questions and explore the semantic mapping between this information and the knowledge in graph. By dynamically updating the mapping records, we avoid redundant exploration and ensure no pertinent information is overlooked, thereby reducing computational overhead and improving the accuracy of knowledge exploration. The use of fine-grained information also eliminates the need for a priori knowledge, a common requirement in existing methods. Experiments on multiple datasets revealed that our paradigm surpasses current advanced methods in knowledge retrieval while significantly reducing the average number of LLM invocations.

replace LiPO: Listwise Preference Optimization through Learning-to-Rank

Authors: Tianqi Liu, Zhen Qin, Junru Wu, Jiaming Shen, Misha Khalman, Rishabh Joshi, Yao Zhao, Mohammad Saleh, Simon Baumgartner, Jialu Liu, Peter J. Liu, Xuanhui Wang

Abstract: Aligning language models (LMs) with curated human feedback is critical to control their behaviors in real-world applications. Several recent policy optimization methods, such as DPO and SLiC, serve as promising alternatives to the traditional Reinforcement Learning from Human Feedback (RLHF) approach. In practice, human feedback often comes in a format of a ranked list over multiple responses to amortize the cost of reading prompt. Multiple responses can also be ranked by reward models or AI feedback. There lacks such a thorough study on directly fitting upon a list of responses. In this work, we formulate the LM alignment as a \textit{listwise} ranking problem and describe the LiPO framework, where the policy can potentially learn more effectively from a ranked list of plausible responses given the prompt. This view draws an explicit connection to Learning-to-Rank (LTR), where most existing preference optimization work can be mapped to existing ranking objectives. Following this connection, we provide an examination of ranking objectives that are not well studied for LM alignment with DPO and SLiC as special cases when list size is two. In particular, we highlight a specific method, LiPO-$\lambda$, which leverages a state-of-the-art \textit{listwise} ranking objective and weights each preference pair in a more advanced manner. We show that LiPO-$\lambda$ can outperform DPO variants and SLiC by a clear margin on several preference alignment tasks with both curated and real rankwise preference data.

replace A Survey to Recent Progress Towards Understanding In-Context Learning

Authors: Haitao Mao, Guangliang Liu, Yao Ma, Rongrong Wang, Kristen Johnson, Jiliang Tang

Abstract: In-Context Learning (ICL) empowers Large Language Models (LLMs) with the ability to learn from a few examples provided in the prompt, enabling downstream generalization without the requirement for gradient updates. Despite encouragingly empirical success, the underlying mechanism of ICL remains unclear. Existing research remains ambiguous with various viewpoints, utilizing intuition-driven and ad-hoc technical solutions to interpret ICL. In this paper, we leverage a data generation perspective to reinterpret recent efforts from a systematic angle, demonstrating the potential broader usage of these popular technical solutions. For a conceptual definition, we rigorously adopt the terms of skill recognition and skill learning. Skill recognition selects one learned data generation function previously seen during pre-training while skill learning can learn new data generation functions from in-context data. Furthermore, we provide insights into the strengths and weaknesses of both abilities, emphasizing their commonalities through the perspective of data generation. This analysis suggests potential directions for future research.

replace Large Language Models as an Indirect Reasoner: Contrapositive and Contradiction for Automated Reasoning

Authors: Yanfang Zhang, Yiliu Sun, Yibing Zhan, Dapeng Tao, Dacheng Tao, Chen Gong

Abstract: Recently, increasing attention has been focused on improving the ability of Large Language Models (LLMs) to perform complex reasoning. Advanced methods, such as Chain-of-Thought (CoT) and its variants, are found to enhance their reasoning skills by designing suitable prompts or breaking down complex problems into more manageable sub-problems. However, little concentration has been put on exploring the reasoning process, \textit{i.e.}, we discovered that most methods resort to Direct Reasoning (DR) and disregard Indirect Reasoning (IR). This can make LLMs difficult to solve IR tasks, which are often encountered in the real world. To address this issue, we propose a Direct-Indirect Reasoning (DIR) method, which considers DR and IR as multiple parallel reasoning paths that are merged to derive the final answer. We stimulate LLMs to implement IR by crafting prompt templates incorporating the principles of contrapositive and contradiction. These templates trigger LLMs to assume the negation of the conclusion as true, combine it with the premises to deduce a conclusion, and utilize the logical equivalence of the contrapositive to enhance their comprehension of the rules used in the reasoning process. Our DIR method is simple yet effective and can be straightforwardly integrated with existing variants of CoT methods. Experimental results on four datasets related to logical reasoning and mathematic proof demonstrate that our DIR method, when combined with various baseline methods, significantly outperforms all the original methods.

replace Towards Cross-Tokenizer Distillation: the Universal Logit Distillation Loss for LLMs

Authors: Nicolas Boizard, Kevin El Haddad, C\'eline Hudelot, Pierre Colombo

Abstract: Deploying large language models (LLMs) of several billion parameters can be impractical in most industrial use cases due to constraints such as cost, latency limitations, and hardware accessibility. Knowledge distillation (KD) offers a solution by compressing knowledge from resource-intensive large models to smaller ones. Various strategies exist, some relying on the text generated by the teacher model and optionally utilizing his logits to enhance learning. However, these methods based on logits often require both teacher and student models to share the same tokenizer, limiting their applicability across different LLM families. In this paper, we introduce Universal Logit Distillation (ULD) loss, grounded in optimal transport, to address this limitation. Our experimental results demonstrate the effectiveness of ULD loss in enabling distillation across models with different architectures and tokenizers, paving the way to a more widespread use of distillation techniques.

replace Self-DC: When to Reason and When to Act? Self Divide-and-Conquer for Compositional Unknown Questions

Authors: Hongru Wang, Boyang Xue, Baohang Zhou, Tianhua Zhang, Cunxiang Wang, Huimin Wang, Guanhua Chen, Kam-fai Wong

Abstract: Previous research has typically concentrated on leveraging the internal knowledge of Large Language Models (LLMs) to answer known questions (i.e., \textit{internal reasoning such as generate-then-read}). In contrast, for questions that fall outside their known scope, these models rely on external knowledge retrieval to provide accurate responses (i.e., \textit{external acting such as retrieve-then-read}). However, few previous works consider the \textit{compositional questions}, which consist of several known and unknown sub-questions, necessitating the dynamic combination of previous two methods (i.e., \textit{internal reasoning and external acting}) to achieve a better trade-off between effectiveness and efficiency. To this end, we introduce a \textbf{Self} \textbf{D}ivide-and-\textbf{C}onquer (\textit{\texttt{Self-DC}}) framework, accompanying with the first \textbf{C}ompositional \textbf{u}nknown \textbf{Q}uestion-\textbf{A}nswering dataset (CuQA). This framework enables LLMs to adaptively choose between using internal knowledge and retrieving external knowledge as needed, resulting in a better trade-off between effectiveness and efficiency. Experimental results on two datasets demonstrate that \textit{\texttt{Self-DC}} can achieve comparable or even better performance with much fewer external calls compared with several strong baselines.

replace Unleashing the Potential of Large Language Models as Prompt Optimizers: Analogical Analysis with Gradient-based Model Optimizers

Authors: Xinyu Tang, Xiaolei Wang, Wayne Xin Zhao, Siyuan Lu, Yaliang Li, Ji-Rong Wen

Abstract: Automatic prompt optimization is an important approach to improving the performance of large language models (LLMs). Recent research demonstrates the potential of using LLMs as prompt optimizers, which can generate improved task prompts via iterative refinement. In this paper, we propose a novel perspective to investigate the design of LLM-based prompt optimizers, by drawing an analogy with gradient-based model optimizers. To connect these two approaches, we identify two pivotal factors in model parameter learning: update direction and update method. By systematically analyzing a rich set of improvement strategies on the two aspects, we further develop a capable Gradient-inspired LLM-based Prompt Optimizer called GPO. At each step, it first retrieves relevant prompts from the optimization trajectory as the update direction. Then, it utilizes the generation-based refinement strategy to perform the update, while controlling the edit distance through a cosine-based decay strategy. Extensive experiments demonstrate the effectiveness and efficiency of GPO. In particular, GPO brings an additional improvement of up to 56.8% on Big-Bench Hard and 62.6% on MMLU compared to baseline methods. The code is available at https://github.com/RUCAIBox/GPO.

URLs: https://github.com/RUCAIBox/GPO.

replace KnowAgent: Knowledge-Augmented Planning for LLM-Based Agents

Authors: Yuqi Zhu, Shuofei Qiao, Yixin Ou, Shumin Deng, Ningyu Zhang, Shiwei Lyu, Yue Shen, Lei Liang, Jinjie Gu, Huajun Chen

Abstract: Large Language Models (LLMs) have demonstrated great potential in complex reasoning tasks, yet they fall short when tackling more sophisticated challenges, especially when interacting with environments through generating executable actions. This inadequacy primarily stems from the lack of built-in action knowledge in language agents, which fails to effectively guide the planning trajectories during task solving and results in planning hallucination. To address this issue, we introduce KnowAgent, a novel approach designed to enhance the planning capabilities of LLMs by incorporating explicit action knowledge. Specifically, KnowAgent employs an action knowledge base and a knowledgeable self-learning strategy to constrain the action path during planning, enabling more reasonable trajectory synthesis, and thereby enhancing the planning performance of language agents. Experimental results on HotpotQA and ALFWorld based on various backbone models demonstrate that KnowAgent can achieve comparable or superior performance to existing baselines. Further analysis indicates the effectiveness of KnowAgent in terms of planning hallucinations mitigation. Code is available in https://github.com/zjunlp/KnowAgent.

URLs: https://github.com/zjunlp/KnowAgent.

replace Streamlining Redundant Layers to Compress Large Language Models

Authors: Xiaodong Chen, Yuxuan Hu, Jing Zhang, Yanling Wang, Cuiping Li, Hong Chen

Abstract: This paper introduces LLM-Streamline, a pioneer work on layer pruning for large language models (LLMs). It is based on the observation that different layers have varying impacts on hidden states, enabling the identification of less important layers to be pruned.LLM-Streamline comprises two parts: layer pruning, which removes consecutive layers with the lowest importance based on target sparsity, and layer replacement, a novel module that trains a lightweight network to replace the pruned layers to mitigate performance loss. Additionally, a new metric called stability is proposed to address the limitations of the widely used accuracy metric in evaluating model compression. Experiments show that LLM-Streamline outperforms both previous and concurrent state-of-the-art pruning methods in terms of both performance and training efficiency.Our code is available at https://github.com/RUCKBReasoning/LLM-Streamline

URLs: https://github.com/RUCKBReasoning/LLM-Streamline

replace Uncovering Latent Arguments in Social Media Messaging by Employing LLMs-in-the-Loop Strategy

Authors: Tunazzina Islam, Dan Goldwasser

Abstract: The widespread use of social media has led to a surge in popularity for automated methods of analyzing public opinion. Supervised methods are adept at text categorization, yet the dynamic nature of social media discussions poses a continual challenge for these techniques due to the constant shifting of the focus. On the other hand, traditional unsupervised methods for extracting themes from public discourse, such as topic modeling, often reveal overarching patterns that might not capture specific nuances. Consequently, a significant portion of research into social media discourse still depends on labor-intensive manual coding techniques and a human-in-the-loop approach, which are both time-consuming and costly. In this work, we study the problem of discovering arguments associated with a specific theme. We propose a generic LLMs-in-the-Loop strategy that leverages the advanced capabilities of Large Language Models (LLMs) to extract latent arguments from social media messaging. To demonstrate our approach, we apply our framework to contentious topics. We use two publicly available datasets: (1) the climate campaigns dataset of 14k Facebook ads with 25 themes and (2) the COVID-19 vaccine campaigns dataset of 9k Facebook ads with 14 themes. Additionally, we design a downstream task as stance prediction by leveraging talking points in climate debates. Furthermore, we analyze demographic targeting and the adaptation of messaging based on real-world events.

replace Grounded Language Agent for Product Search via Intelligent Web Interactions

Authors: Moghis Fereidouni, Adib Mosharrof, A. B. Siddique

Abstract: The development of agents powered by large language models (LLMs) to accomplish complex high-level user intents, has attracted significant attention recently. However, employing LLMs with billions of parameters (e.g., GPT-4) may incur substantial costs on top of handcrafting extensive prompts. To address this, we introduce a Grounded Language Agent for Intelligent Web Interactions, named GLAINTEL. GLAINTEL employs Flan-T5 as its backbone and is flexible in training in various settings: unsupervised learning, supervised learning, and unsupervised domain adaptation. Specifically, we tackle both the challenge of learning without human demonstrations and the opportunity to leverage human demonstrations effectively when those are available. Additionally, we explore unsupervised domain adaptation for cases where demonstrations are limited to a specific domain. Experimental evaluations across diverse setups demonstrate the effectiveness of GLAINTEL in unsupervised settings, outperforming in-context learning-based approaches that employ larger models with up to 540 billion parameters. Surprisingly, behavioral cloning-based methods that straightforwardly use human demonstrations do not outperform unsupervised variants of GLAINTEL. Additionally, we show that combining human demonstrations with reinforcement learning-based training yields results comparable to methods utilizing GPT-4. The code is available at: https://github.com/MultifacetedNLP/WebAgents-Unsupervised.

URLs: https://github.com/MultifacetedNLP/WebAgents-Unsupervised.

replace Does Generative AI speak Nigerian-Pidgin?: Issues about Representativeness and Bias for Multilingualism in LLMs

Authors: David Ifeoluwa Adelani, A. Seza Do\u{g}ru\"oz, Iyanuoluwa Shode, Anuoluwapo Aremu

Abstract: Nigeria is a multilingual country with 500+ languages. Naija is a Nigerian-Pidgin spoken by approx. 120M speakers in Nigeria and it is a mixed language (e.g., English, Portuguese, Yoruba, Hausa and Igbo). Although it has mainly been a spoken language until recently, there are now various platforms publishing exclusively in Naija such as Naija Wikipedia. However, it is hard to distinguish by non-native from a larger pidgin languages spoken across West Africa known as West African Pidgin English (WAPE) -- which is more simplied and understandable by wider audience in Ghana, Nigeria, and Cameroon. BBC news platform publishes exclusively in WAPE to cater for several countries in West Africa. In our paper, we show through statistical analyses and Machine Translation experiments that these two creole varieties do not represent each other (i.e., there are linguistic differences in word order and vocabulary) and Generative AI operates only based on WAPE. In other words, Naija is under-represented in Generative AI, and it is hard to teach LLMs with few examples.

replace COBias and Debias: Balancing Class Accuracies for Language Models in Inference Time via Nonlinear Integer Programming

Authors: Ruixi Lin, Yang You

Abstract: Large language models (LLMs) are good knowledge bases but struggle to perform equally well for all classes in text classification tasks. This paper investigates a fundamental inference-time problem in language models: imbalanced class accuracies. We find what's underneath the issue is a tendency to over-predict some classes while under-predicting some others. This class accuracy imbalance is difficult to solve from the root via better pre-training or fine-tuning strategies, but we show it can be effectively mitigated via inference-time combinatorial optimization. To this end, we conceptualize and quantify the over- and under-prediction issue as the Contextual Oddity Bias (COBias), and propose the Debiasing as Nonlinear Integer Programming (DNIP) model to correct in-context learned class probabilities based on minimizing COBias and maximizing overall accuracy, without LLM parameter update. Considering that the DNIP model implicitly contains non-differentiable elements, we therefore use the simulated annealing algorithm to solve it. Extensive evaluations on three LLMs across seven NLP classification tasks in different prompting settings show that DNIP simultaneously achieves significant COBias reduction (-27%) and accuracy improvement (+12%) over the conventional ICL approach, suggesting that inference-time mitigation of class accuracy imbalance is a promising direction to push forward LLM performances.

replace Emergence of a High-Dimensional Abstraction Phase in Language Transformers

Authors: Emily Cheng, Diego Doimo, Corentin Kervadec, Iuri Macocco, Jade Yu, Alessandro Laio, Marco Baroni

Abstract: A language model (LM) is a mapping from a linguistic context to an output token. However, much remains to be known about this mapping, including how its geometric properties relate to its function. We take a high-level geometric approach to its analysis, observing, across five pre-trained transformer-based LMs and three input datasets, a distinct phase characterized by high intrinsic dimensionality. During this phase, representations (1) correspond to the first full linguistic abstraction of the input; (2) are the first to viably transfer to downstream tasks; (3) predict each other across different LMs. Moreover, we find that an earlier onset of the phase strongly predicts better language modelling performance. In short, our results suggest that a central high-dimensionality phase underlies core linguistic processing in many common LM architectures.

replace In-Context Editing: Learning Knowledge from Self-Induced Distributions

Authors: Siyuan Qi, Bangcheng Yang, Kailin Jiang, Xiaobo Wang, Jiaqi Li, Yifan Zhong, Yaodong Yang, Zilong Zheng

Abstract: In scenarios where language models must incorporate new information efficiently without extensive retraining, traditional fine-tuning methods are prone to overfitting, degraded generalization, and unnatural language generation. To address these limitations, we introduce Consistent In-Context Editing (ICE), a novel approach leveraging the model's in-context learning capability to optimize toward a contextual distribution rather than a one-hot target. ICE introduces a simple yet effective optimization framework for the model to internalize new knowledge by aligning its output distributions with and without additional context. This method enhances the robustness and effectiveness of gradient-based tuning methods, preventing overfitting and preserving the model's integrity. We analyze ICE across four critical aspects of knowledge editing: accuracy, locality, generalization, and linguistic quality, demonstrating its advantages. Experimental results confirm the effectiveness of ICE and demonstrate its potential for continual editing, ensuring that the integrity of the model is preserved while updating information.

replace Evaluating LLMs for Quotation Attribution in Literary Texts: A Case Study of LLaMa3

Authors: Gaspard Michel, Elena V. Epure, Romain Hennequin, Christophe Cerisara

Abstract: Large Language Models (LLMs) have shown promising results in a variety of literary tasks, often using complex memorized details of narration and fictional characters. In this work, we evaluate the ability of Llama-3 at attributing utterances of direct-speech to their speaker in novels. The LLM shows impressive results on a corpus of 28 novels, surpassing published results with ChatGPT and encoder-based baselines by a large margin. We then validate these results by assessing the impact of book memorization and annotation contamination. We found that these types of memorization do not explain the large performance gain, making Llama-3 the new state-of-the-art for quotation attribution in English literature. We release publicly our code and data.

replace Learning to Explore and Select for Coverage-Conditioned Retrieval-Augmented Generation

Authors: Takyoung Kim, Kyungjae Lee, Young Rok Jang, Ji Yong Cho, Gangwoo Kim, Minseok Cho, Moontae Lee

Abstract: Interactions with large language models (LLMs) often yield long and detailed responses, leveraging both parametric knowledge and retrieval-augmented generation (RAG). While these responses can provide rich insights, they often include redundant or less engaging content not aligned with user interests. This issue becomes apparent when users specify particular subtopics to include or exclude -- termed coverage-conditioned ($C^2$) queries -- as LLMs often struggle to provide tailored responses. To address this challenge, we investigate the role of query outlines, sequences of subqueries designed to guide LLMs in generating responses that meet specific user requirements. To systematically create and evaluate these outlines, we introduce QTree, a dataset of 10K hierarchical sets of information-seeking subqueries that define structured boundaries for outline creation and evaluation in $C^2$ scenarios. Additionally, we develop QPlanner, a 7B language model trained to generate customized outlines within boundaries of QTree. We evaluate the effectiveness of the generated outlines through automatic and human judgements, focusing on their impact within retrieval-augmented generation (RAG) systems. Experimental results demonstrate that QPlanner, especially when trained with alignment techniques like DPO, generates higher-quality outlines that better fulfill diverse user needs.

replace Panza: Design and Analysis of a Fully-Local Personalized Text Writing Assistant

Authors: Armand Nicolicioiu, Eugenia Iofinova, Andrej Jovanovic, Eldar Kurtic, Mahdi Nikdan, Andrei Panferov, Ilia Markov, Nir Shavit, Dan Alistarh

Abstract: The availability of powerful open-source large language models (LLMs) opens exciting use cases, such as automated personal assistants that adapt to the user's unique data and demands. Two key requirements for such assistants are personalization - in the sense that the assistant should reflect the user's own writing style - and privacy - users may prefer to always store their personal data locally, on their own computing device. In this application paper, we present a new design and evaluation for such an automated assistant, for the specific use case of email generation, which we call Panza. Specifically, Panza can be trained and deployed locally on commodity hardware, and is personalized to the user's writing style. Panza's personalization features are based on a combination of fine-tuning using a variant of the Reverse Instructions technique together with Retrieval-Augmented Generation (RAG). We demonstrate that this combination allows us to fine-tune an LLM to better reflect a user's writing style using limited data, while executing on extremely limited resources, e.g. on a free Google Colab instance. Our key methodological contribution is what we believe to be the first detailed study of evaluation metrics for this personalized writing task, and of how different choices of system components - e.g. the use of RAG and of different fine-tuning approaches - impact the system's performance. We also perform an ablation study showing that less than 100 emails are generally sufficient to produce a credible Panza model. We are releasing the full Panza code as well as a new "David" personalized email dataset licensed for research use, both available on https://github.com/IST-DASLab/PanzaMail.

URLs: https://github.com/IST-DASLab/PanzaMail.

replace Ada-KV: Optimizing KV Cache Eviction by Adaptive Budget Allocation for Efficient LLM Inference

Authors: Yuan Feng, Junlin Lv, Yukun Cao, Xike Xie, S. Kevin Zhou

Abstract: Large Language Models have excelled in various domains but face efficiency challenges due to the growing Key-Value (KV) cache required for long-sequence inference. Recent efforts aim to reduce KV cache size by evicting vast non-critical cache elements during runtime while preserving generation quality. However, these methods typically allocate compression budgets uniformly across all attention heads, ignoring the unique attention patterns of each head. In this paper, we establish a theoretical loss upper bound between pre- and post-eviction attention output, explaining the optimization target of prior cache eviction methods, while guiding the optimization of adaptive budget allocation. Base on this, we propose {\it Ada-KV}, the first head-wise adaptive budget allocation strategy. It offers plug-and-play benefits, enabling seamless integration with prior cache eviction methods. Extensive evaluations on 13 datasets from Ruler and 16 datasets from LongBench, all conducted under both question-aware and question-agnostic scenarios, demonstrate substantial quality improvements over existing methods.

replace Improving Factuality in Large Language Models via Decoding-Time Hallucinatory and Truthful Comparators

Authors: Dingkang Yang, Dongling Xiao, Jinjie Wei, Mingcheng Li, Zhaoyu Chen, Ke Li, Lihua Zhang

Abstract: Despite their remarkable capabilities, Large Language Models (LLMs) are prone to generate responses that contradict verifiable facts, i.e., unfaithful hallucination content. Existing efforts generally focus on optimizing model parameters or editing semantic representations, which compromise the internal factual knowledge of target LLMs. In addition, hallucinations typically exhibit multifaceted patterns in downstream tasks, limiting the model's holistic performance across tasks. In this paper, we propose a Comparator-driven Decoding-Time (CDT) framework to alleviate the response hallucination. Firstly, we construct hallucinatory and truthful comparators with multi-task fine-tuning samples. In this case, we present an instruction prototype-guided mixture of experts strategy to enhance the ability of the corresponding comparators to capture different hallucination or truthfulness patterns in distinct task instructions. CDT constrains next-token predictions to factuality-robust distributions by contrasting the logit differences between the target LLMs and these comparators. Systematic experiments on multiple downstream tasks show that our framework can significantly improve the model performance and response factuality.

replace WaterSeeker: Pioneering Efficient Detection of Watermarked Segments in Large Documents

Authors: Leyi Pan, Aiwei Liu, Yijian Lu, Zitian Gao, Yichen Di, Shiyu Huang, Lijie Wen, Irwin King, Philip S. Yu

Abstract: Watermarking algorithms for large language models (LLMs) have attained high accuracy in detecting LLM-generated text. However, existing methods primarily focus on distinguishing fully watermarked text from non-watermarked text, overlooking real-world scenarios where LLMs generate only small sections within large documents. In this scenario, balancing time complexity and detection performance poses significant challenges. This paper presents WaterSeeker, a novel approach to efficiently detect and locate watermarked segments amid extensive natural text. It first applies an efficient anomaly extraction method to preliminarily locate suspicious watermarked regions. Following this, it conducts a local traversal and performs full-text detection for more precise verification. Theoretical analysis and experimental results demonstrate that WaterSeeker achieves a superior balance between detection accuracy and computational efficiency. Moreover, its localization capability lays the foundation for building interpretable AI detection systems. Our code is available at https://github.com/THU-BPM/WaterSeeker.

URLs: https://github.com/THU-BPM/WaterSeeker.

replace MOSAIC: Multiple Observers Spotting AI Content, a Robust Approach to Machine-Generated Text Detection

Authors: Matthieu Dubois, Fran\c{c}ois Yvon, Pablo Piantanida

Abstract: The dissemination of Large Language Models (LLMs), trained at scale, and endowed with powerful text-generating abilities has vastly increased the threats posed by generative AI technologies by reducing the cost of producing harmful, toxic, faked or forged content. In response, various proposals have been made to automatically discriminate artificially generated from human-written texts, typically framing the problem as a classification problem. Most approaches evaluate an input document by a well-chosen detector LLM, assuming that low-perplexity scores reliably signal machine-made content. As using one single detector can induce brittleness of performance, we instead consider several and derive a new, theoretically grounded approach to combine their respective strengths. Our experiments, using a variety of generator LLMs, suggest that our method effectively leads to robust detection performances. An early version of the code is available at https://github.com/BaggerOfWords/MOSAIC.

URLs: https://github.com/BaggerOfWords/MOSAIC.

replace Fact, Fetch, and Reason: A Unified Evaluation of Retrieval-Augmented Generation

Authors: Satyapriya Krishna, Kalpesh Krishna, Anhad Mohananey, Steven Schwarcz, Adam Stambler, Shyam Upadhyay, Manaal Faruqui

Abstract: Large Language Models (LLMs) have demonstrated significant performance improvements across various cognitive tasks. An emerging application is using LLMs to enhance retrieval-augmented generation (RAG) capabilities. These systems require LLMs to understand user queries, retrieve relevant information, and synthesize coherent and accurate responses. Given the increasing real-world deployment of such systems, comprehensive evaluation becomes crucial. To this end, we propose FRAMES (Factuality, Retrieval, And reasoning MEasurement Set), a high-quality evaluation dataset designed to test LLMs' ability to provide factual responses, assess retrieval capabilities, and evaluate the reasoning required to generate final answers. While previous work has provided datasets and benchmarks to evaluate these abilities in isolation, FRAMES offers a unified framework that provides a clearer picture of LLM performance in end-to-end RAG scenarios. Our dataset comprises challenging multi-hop questions that require the integration of information from multiple sources. We present baseline results demonstrating that even state-of-the-art LLMs struggle with this task, achieving 0.40 accuracy with no retrieval. The accuracy is significantly improved with our proposed multi-step retrieval pipeline, achieving an accuracy of 0.66 (>50% improvement). We hope our work will help bridge evaluation gaps and assist in developing more robust and capable RAG systems.

replace GrEmLIn: A Repository of Green Baseline Embeddings for 87 Low-Resource Languages Injected with Multilingual Graph Knowledge

Authors: Daniil Gurgurov, Rishu Kumar, Simon Ostermann

Abstract: Contextualized embeddings based on large language models (LLMs) are available for various languages, but their coverage is often limited for lower resourced languages. Using LLMs for such languages is often difficult due to a high computational cost; not only during training, but also during inference. Static word embeddings are much more resource-efficient ("green"), and thus still provide value, particularly for very low-resource languages. There is, however, a notable lack of comprehensive repositories with such embeddings for diverse languages. To address this gap, we present GrEmLIn, a centralized repository of green, static baseline embeddings for 87 mid- and low-resource languages. We compute GrEmLIn embeddings with a novel method that enhances GloVe embeddings by integrating multilingual graph knowledge, which makes our static embeddings competitive with LLM representations, while being parameter-free at inference time. Our experiments demonstrate that GrEmLIn embeddings outperform state-of-the-art contextualized embeddings from E5 on the task of lexical similarity. They remain competitive in extrinsic evaluation tasks like sentiment analysis and natural language inference, with average performance gaps of just 5-10\% or less compared to state-of-the-art models, given a sufficient vocabulary overlap with the target task, and underperform only on topic classification. Our code and embeddings are publicly available at https://huggingface.co/DFKI.

URLs: https://huggingface.co/DFKI.

replace Perception Compressor: A Training-Free Prompt Compression Framework in Long Context Scenarios

Authors: Jiwei Tang, Jin Xu, Tingwei Lu, Zhicheng Zhang, Yiming Zhao, Lin Hai, Hai-Tao Zheng

Abstract: Large language models (LLMs) demonstrate exceptional capabilities in various scenarios. However, they suffer from much redundant information and are sensitive to the position of key information in long context scenarios. To address these challenges, we present Perception Compressor, a training-free prompt compression framework. It includes a perception retriever that leverages guiding questions and instruction to retrieve the most relevant demonstrations, a dual-slope ratio allocator to dynamically allocate compression ratios and open-book ratios, and a semi-guided iterative compression that retains key information at the token level while removing tokens that distract the LLM. We conduct extensive experiments on long context benchmarks, i.e., NaturalQuestions, LongBench, and MuSiQue. Experiment results show that Perception Compressor outperforms existing methods by a large margin, achieving state-of-the-art performance.

replace Do Vision-Language Models Really Understand Visual Language?

Authors: Yifan Hou, Buse Giledereli, Yilei Tu, Mrinmaya Sachan

Abstract: Visual language is a system of communication that conveys information through symbols, shapes, and spatial arrangements. Diagrams are a typical example of a visual language depicting complex concepts and their relationships in the form of an image. The symbolic nature of diagrams presents significant challenges for building models capable of understanding them. Recent studies suggest that Large Vision-Language Models (LVLMs) can even tackle complex reasoning tasks involving diagrams. In this paper, we investigate this phenomenon by developing a comprehensive test suite to evaluate the diagram comprehension capability of LVLMs. Our test suite uses a variety of questions focused on concept entities and their relationships over a set of synthetic as well as real diagrams across domains to evaluate the recognition and reasoning abilities of models. Our evaluation of LVLMs shows that while they can accurately identify and reason about entities, their ability to understand relationships is notably limited. Further testing reveals that the decent performance on diagram understanding largely stems from leveraging their background knowledge as shortcuts to identify and reason about the relational information. Thus, we conclude that LVLMs have a limited capability for genuine diagram understanding, and their impressive performance in diagram reasoning is an illusion emanating from other confounding factors, such as the background knowledge in the models.

replace StringLLM: Understanding the String Processing Capability of Large Language Models

Authors: Xilong Wang, Hao Fu, Jindong Wang, Neil Zhenqiang Gong

Abstract: String processing, which mainly involves the analysis and manipulation of strings, is a fundamental component of modern computing. Despite the significant advancements of large language models (LLMs) in various natural language processing (NLP) tasks, their capability in string processing remains underexplored and underdeveloped. To bridge this gap, we present a comprehensive study of LLMs' string processing capability. In particular, we first propose StringLLM, a method to construct datasets for benchmarking string processing capability of LLMs. We use StringLLM to build a series of datasets, referred to as StringBench. It encompasses a wide range of string processing tasks, allowing us to systematically evaluate LLMs' performance in this area. Our evaluations indicate that LLMs struggle with accurately processing strings compared to humans. To uncover the underlying reasons for this limitation, we conduct an in-depth analysis and subsequently propose an effective approach that significantly enhances LLMs' string processing capability via fine-tuning. This work provides a foundation for future research to understand LLMs' string processing capability. Our code and data are available at https://github.com/wxl-lxw/StringLLM.

URLs: https://github.com/wxl-lxw/StringLLM.

replace Efficient Length-Generalizable Attention via Causal Retrieval for Long-Context Language Modeling

Authors: Xiang Hu, Zhihao Teng, Jun Zhao, Wei Wu, Kewei Tu

Abstract: Despite the success of Transformers, handling long contexts remains challenging due to the limited length generalization and quadratic complexity of self-attention. Thus Transformers often require post-training with a larger attention window, significantly increasing computational and memory costs. In this paper, we propose a novel attention mechanism based on dynamic context, Grouped Cross Attention (GCA), which can generalize to 1000 times the pre-training context length while maintaining the ability to access distant information with a constant attention window size. For a given input sequence, we split it into chunks and use each chunk to retrieve top-k relevant past chunks for subsequent text generation. Specifically, unlike most previous works that use an off-the-shelf retriever, our key innovation allows the retriever to learn how to retrieve past chunks that better minimize the auto-regressive loss of subsequent tokens in an end-to-end manner. Such a mechanism accommodates retrieved chunks with a fixed-size attention window to achieve long-range information access, significantly reducing computational and memory costs during training and inference. Experiments show that GCA-based models achieve near-perfect accuracy in passkey retrieval for 16M context lengths, which is 1000 times the training length.

replace Tutor CoPilot: A Human-AI Approach for Scaling Real-Time Expertise

Authors: Rose E. Wang, Ana T. Ribeiro, Carly D. Robinson, Susanna Loeb, Dora Demszky

Abstract: Generative AI, particularly Language Models (LMs), has the potential to transform real-world domains with societal impact, particularly where access to experts is limited. For example, in education, training novice educators with expert guidance is important for effectiveness but expensive, creating significant barriers to improving education quality at scale. This challenge disproportionately harms students from under-served communities, who stand to gain the most from high-quality education. We introduce Tutor CoPilot, a novel Human-AI approach that leverages a model of expert thinking to provide expert-like guidance to tutors as they tutor. This study is the first randomized controlled trial of a Human-AI system in live tutoring, involving 900 tutors and 1,800 K-12 students from historically under-served communities. Following a preregistered analysis plan, we find that students working with tutors that have access to Tutor CoPilot are 4 percentage points (p.p.) more likely to master topics (p<0.01). Notably, students of lower-rated tutors experienced the greatest benefit, improving mastery by 9 p.p. We find that Tutor CoPilot costs only $20 per-tutor annually. We analyze 550,000+ messages using classifiers to identify pedagogical strategies, and find that tutors with access to Tutor CoPilot are more likely to use high-quality strategies to foster student understanding (e.g., asking guiding questions) and less likely to give away the answer to the student. Tutor interviews highlight how Tutor CoPilot's guidance helps tutors to respond to student needs, though they flag issues in Tutor CoPilot, such as generating suggestions that are not grade-level appropriate. Altogether, our study of Tutor CoPilot demonstrates how Human-AI systems can scale expertise in real-world domains, bridge gaps in skills and create a future where high-quality education is accessible to all students.

replace Revisiting In-context Learning Inference Circuit in Large Language Models

Authors: Hakaze Cho, Mariko Kato, Yoshihiro Sakai, Naoya Inoue

Abstract: In-context Learning (ICL) is an emerging few-shot learning paradigm on Language Models (LMs) with inner mechanisms un-explored. There are already existing works describing the inner processing of ICL, while they struggle to capture all the inference phenomena in large language models. Therefore, this paper proposes a comprehensive circuit to model the inference dynamics and try to explain the observed phenomena of ICL. In detail, we divide ICL inference into 3 major operations: (1) Input Text Encode: LMs encode every input text (demonstrations and queries) into linear representation in the hidden states with sufficient information to solve ICL tasks. (2) Semantics Merge: LMs merge the encoded representations of demonstrations with their corresponding label tokens to produce joint representations of labels and demonstrations. (3) Feature Retrieval and Copy: LMs search the joint representations similar to the query representation on a task subspace, and copy the searched representations into the query. Then, language model heads capture these copied label representations to a certain extent and decode them into predicted labels. The proposed inference circuit successfully captured many phenomena observed during the ICL process, making it a comprehensive and practical explanation of the ICL inference process. Moreover, ablation analysis by disabling the proposed steps seriously damages the ICL performance, suggesting the proposed inference circuit is a dominating mechanism. Additionally, we confirm and list some bypass mechanisms that solve ICL tasks in parallel with the proposed circuit.

replace Learning Evolving Tools for Large Language Models

Authors: Guoxin Chen, Zhong Zhang, Xin Cong, Fangda Guo, Yesai Wu, Yankai Lin, Wenzheng Feng, Yasheng Wang

Abstract: Tool learning enables large language models (LLMs) to interact with external tools and APIs, greatly expanding the application scope of LLMs. However, due to the dynamic nature of external environments, these tools and APIs may become outdated over time, preventing LLMs from correctly invoking tools. Existing research primarily focuses on static environments and overlooks this issue, limiting the adaptability of LLMs in real-world applications. In this paper, we propose ToolEVO, a novel framework designed to enhance the adaptive and reflective capabilities of LLMs against tool variability. By leveraging Monte Carlo Tree Search, ToolEVO facilitates active exploration and interaction of LLMs within dynamic environments, allowing for autonomous self-reflection and self-updating of tool usage based on environmental feedback. Additionally, we introduce ToolQA-D, a benchmark specifically designed to evaluate the impact of tool variability. Extensive experiments demonstrate the effectiveness and stability of our approach, highlighting the importance of adaptability to tool variability for effective tool learning. Code: \url{https://github.com/Chen-GX/ToolEVO}

URLs: https://github.com/Chen-GX/ToolEVO

replace Learning to Summarize from LLM-generated Feedback

Authors: Hwanjun Song, Taewon Yun, Yuho Lee, Jihwan Oh, Gihun Lee, Jason Cai, Hang Su

Abstract: Developing effective text summarizers remains a challenge due to issues like hallucinations, key information omissions, and verbosity in LLM-generated summaries. This work explores using LLM-generated feedback to improve summary quality by aligning the summaries with human preferences for faithfulness, completeness, and conciseness. We introduce FeedSum, a large-scale dataset containing multi-dimensional LLM feedback on summaries of varying quality across diverse domains. Our experiments show how feedback quality, dimensionality, and granularity influence preference learning, revealing that high-quality, multi-dimensional, fine-grained feedback significantly improves summary generation. We also compare two methods for using this feedback: supervised fine-tuning and direct preference optimization. Finally, we introduce SummLlama3-8b, a model that outperforms the nearly 10x larger Llama3-70b-instruct in generating human-preferred summaries, demonstrating that smaller models can achieve superior performance with appropriate training. The full dataset and SummLlama3-8B model are available at https://huggingface.co/datasets/DISLab/FeedSum and https://huggingface.co/DISLab/SummLlama3-8B.

URLs: https://huggingface.co/datasets/DISLab/FeedSum, https://huggingface.co/DISLab/SummLlama3-8B.

replace Evaluating Self-Generated Documents for Enhancing Retrieval-Augmented Generation with Large Language Models

Authors: Jiatao Li, Xinyu Hu, Xunjian Yin, Xiaojun Wan

Abstract: The integration of documents generated by LLMs themselves (Self-Docs) alongside retrieved documents has emerged as a promising strategy for retrieval-augmented generation systems. However, previous research primarily focuses on optimizing the use of Self-Docs, with their inherent properties remaining underexplored. To bridge this gap, we first investigate the overall effectiveness of Self-Docs, identifying key factors that shape their contribution to RAG performance (RQ1). Building on these insights, we develop a taxonomy grounded in Systemic Functional Linguistics to compare the influence of various Self-Docs categories (RQ2) and explore strategies for combining them with external sources (RQ3). Our findings reveal which types of Self-Docs are most beneficial and offer practical guidelines for leveraging them to achieve significant improvements in knowledge-intensive question answering tasks.

replace CBT-Bench: Evaluating Large Language Models on Assisting Cognitive Behavior Therapy

Authors: Mian Zhang, Xianjun Yang, Xinlu Zhang, Travis Labrum, Jamie C. Chiu, Shaun M. Eack, Fei Fang, William Yang Wang, Zhiyu Zoey Chen

Abstract: There is a significant gap between patient needs and available mental health support today. In this paper, we aim to thoroughly examine the potential of using Large Language Models (LLMs) to assist professional psychotherapy. To this end, we propose a new benchmark, CBT-BENCH, for the systematic evaluation of cognitive behavioral therapy (CBT) assistance. We include three levels of tasks in CBT-BENCH: I: Basic CBT knowledge acquisition, with the task of multiple-choice questions; II: Cognitive model understanding, with the tasks of cognitive distortion classification, primary core belief classification, and fine-grained core belief classification; III: Therapeutic response generation, with the task of generating responses to patient speech in CBT therapy sessions. These tasks encompass key aspects of CBT that could potentially be enhanced through AI assistance, while also outlining a hierarchy of capability requirements, ranging from basic knowledge recitation to engaging in real therapeutic conversations. We evaluated representative LLMs on our benchmark. Experimental results indicate that while LLMs perform well in reciting CBT knowledge, they fall short in complex real-world scenarios requiring deep analysis of patients' cognitive structures and generating effective responses, suggesting potential future work.

replace Agent Skill Acquisition for Large Language Models via CycleQD

Authors: So Kuroki, Taishi Nakamura, Takuya Akiba, Yujin Tang

Abstract: Training large language models to acquire specific skills remains a challenging endeavor. Conventional training approaches often struggle with data distribution imbalances and inadequacies in objective functions that do not align well with task-specific performance. To address these challenges, we introduce CycleQD, a novel approach that leverages the Quality Diversity framework through a cyclic adaptation of the algorithm, along with a model merging based crossover and an SVD-based mutation. In CycleQD, each task's performance metric is alternated as the quality measure while the others serve as the behavioral characteristics. This cyclic focus on individual tasks allows for concentrated effort on one task at a time, eliminating the need for data ratio tuning and simplifying the design of the objective function. Empirical results from AgentBench indicate that applying CycleQD to LLAMA3-8B-INSTRUCT based models not only enables them to surpass traditional fine-tuning methods in coding, operating systems, and database tasks, but also achieves performance on par with GPT-3.5-TURBO, which potentially contains much more parameters, across these domains. Crucially, this enhanced performance is achieved while retaining robust language capabilities, as evidenced by its performance on widely adopted language benchmark tasks. We highlight the key design choices in CycleQD, detailing how these contribute to its effectiveness. Furthermore, our method is general and can be applied to image segmentation models, highlighting its applicability across different domains.

replace mHumanEval -- A Multilingual Benchmark to Evaluate Large Language Models for Code Generation

Authors: Nishat Raihan, Antonios Anastasopoulos, Marcos Zampieri

Abstract: Recent advancements in large language models (LLMs) have significantly enhanced code generation from natural language prompts. The HumanEval Benchmark, developed by OpenAI, remains the most widely used code generation benchmark. However, this and other Code LLM benchmarks face critical limitations, particularly in task diversity, test coverage, and linguistic scope. Current evaluations primarily focus on English-to-Python conversion tasks with limited test cases, potentially overestimating model performance. While recent works have addressed test coverage and programming language (PL) diversity, code generation from low-resource language prompts remains largely unexplored. To address this gap, we introduce mHumanEval, an extended benchmark supporting prompts in over 200 natural languages. We employ established machine translation methods to compile the benchmark, coupled with a quality assurance process. Furthermore, we provide expert human translations for 15 diverse natural languages (NLs). We conclude by analyzing the multilingual code generation capabilities of state-of-the-art (SOTA) Code LLMs, offering insights into the current landscape of cross-lingual code generation.

replace Pangea: A Fully Open Multilingual Multimodal LLM for 39 Languages

Authors: Xiang Yue, Yueqi Song, Akari Asai, Seungone Kim, Jean de Dieu Nyandwi, Simran Khanuja, Anjali Kantharuban, Lintang Sutawika, Sathyanarayanan Ramamoorthy, Graham Neubig

Abstract: Despite recent advances in multimodal large language models (MLLMs), their development has predominantly focused on English- and western-centric datasets and tasks, leaving most of the world's languages and diverse cultural contexts underrepresented. This paper introduces Pangea, a multilingual multimodal LLM trained on PangeaIns, a diverse 6M instruction dataset spanning 39 languages. PangeaIns features: 1) high-quality English instructions, 2) carefully machine-translated instructions, and 3) culturally relevant multimodal tasks to ensure cross-cultural coverage. To rigorously assess models' capabilities, we introduce PangeaBench, a holistic evaluation suite encompassing 14 datasets covering 47 languages. Results show that Pangea significantly outperforms existing open-source models in multilingual settings and diverse cultural contexts. Ablation studies further reveal the importance of English data proportions, language popularity, and the number of multimodal training samples on overall performance. We fully open-source our data, code, and trained checkpoints, to facilitate the development of inclusive and robust multilingual MLLMs, promoting equity and accessibility across a broader linguistic and cultural spectrum.

replace Analyzing and Evaluating Correlation Measures in NLG Meta-Evaluation

Authors: Mingqi Gao, Xinyu Hu, Li Lin, Xiaojun Wan

Abstract: The correlation between NLG automatic evaluation metrics and human evaluation is often regarded as a critical criterion for assessing the capability of an evaluation metric. However, different grouping methods and correlation coefficients result in various types of correlation measures used in meta-evaluation. In specific evaluation scenarios, prior work often directly follows conventional measure settings, but the characteristics and differences between these measures have not gotten sufficient attention. Therefore, this paper analyzes 12 common correlation measures using a large amount of real-world data from six widely-used NLG evaluation datasets and 32 evaluation metrics, revealing that different measures indeed impact the meta-evaluation results. Furthermore, we propose three perspectives that reflect the capability of meta-evaluation: discriminative power, ranking consistency, and sensitivity to score granularity. We find that the measure using global grouping and Pearson correlation coefficient exhibits the best performance in both discriminative power and ranking consistency. Besides, the measures using system-level grouping or Kendall correlation are the least sensitive to score granularity.

replace SimRAG: Self-Improving Retrieval-Augmented Generation for Adapting Large Language Models to Specialized Domains

Authors: Ran Xu, Hui Liu, Sreyashi Nag, Zhenwei Dai, Yaochen Xie, Xianfeng Tang, Chen Luo, Yang Li, Joyce C. Ho, Carl Yang, Qi He

Abstract: Retrieval-augmented generation (RAG) enhances the question-answering (QA) abilities of large language models (LLMs) by integrating external knowledge. However, adapting general-purpose RAG systems to specialized fields such as science and medicine poses unique challenges due to distribution shifts and limited access to domain-specific data. To tackle this, we propose SimRAG, a self-training approach that equips the LLM with joint capabilities of question answering and question generation for domain adaptation. Our method first fine-tunes the LLM on instruction-following, question-answering, and search-related data. Then, it prompts the same LLM to generate diverse domain-relevant questions from unlabeled corpora, with an additional filtering strategy to retain high-quality synthetic examples. By leveraging these self-generated synthetic examples, the LLM can improve their performance on domain-specific RAG tasks. Experiments on 11 datasets, spanning two backbone sizes and three domains, demonstrate that SimRAG outperforms baselines by 1.2\%--8.6\%.

replace Long$^2$RAG: Evaluating Long-Context & Long-Form Retrieval-Augmented Generation with Key Point Recall

Authors: Zehan Qi, Rongwu Xu, Zhijiang Guo, Cunxiang Wang, Hao Zhang, Wei Xu

Abstract: Retrieval-augmented generation (RAG) is a promising approach to address the limitations of fixed knowledge in large language models (LLMs). However, current benchmarks for evaluating RAG systems suffer from two key deficiencies: (1) they fail to adequately measure LLMs' capability in handling long-context retrieval due to a lack of datasets that reflect the characteristics of retrieved documents, and (2) they lack a comprehensive evaluation method for assessing LLMs' ability to generate long-form responses that effectively exploits retrieved information. To address these shortcomings, we introduce the Long$^2$RAG benchmark and the Key Point Recall (KPR) metric. Long$^2$RAG comprises 280 questions spanning 10 domains and across 8 question categories, each associated with 5 retrieved documents with an average length of 2,444 words. KPR evaluates the extent to which LLMs incorporate key points extracted from the retrieved documents into their generated responses, providing a more nuanced assessment of their ability to exploit retrieved information.

replace BitStack: Any-Size Compression of Large Language Models in Variable Memory Environments

Authors: Xinghao Wang, Pengyu Wang, Bo Wang, Dong Zhang, Yunhua Zhou, Xipeng Qiu

Abstract: Large language models (LLMs) have revolutionized numerous applications, yet their deployment remains challenged by memory constraints on local devices. While scaling laws have enhanced LLM capabilities, the primary bottleneck has shifted from \textit{capability} to \textit{availability}, emphasizing the need for efficient memory management. Traditional compression methods, such as quantization, often require predefined compression ratios and separate compression processes for each setting, complicating deployment in variable memory environments. In this paper, we introduce \textbf{BitStack}, a novel, training-free weight compression approach that enables megabyte-level trade-offs between memory usage and model performance. By leveraging weight decomposition, BitStack can dynamically adjust the model size with minimal transmission between running memory and storage devices. Our approach iteratively decomposes weight matrices while considering the significance of each parameter, resulting in an approximately 1-bit per parameter residual block in each decomposition iteration. These blocks are sorted and stacked in storage as basic transmission units, with different quantities loaded based on current memory availability. Extensive experiments across a wide range of tasks demonstrate that, despite offering fine-grained size control, BitStack consistently matches or surpasses strong quantization baselines, particularly at extreme compression ratios. To the best of our knowledge, this is the first decomposition-based method that effectively bridges the gap to practical compression techniques like quantization. Code is available at https://github.com/xinghaow99/BitStack.

URLs: https://github.com/xinghaow99/BitStack.

replace WebRL: Training LLM Web Agents via Self-Evolving Online Curriculum Reinforcement Learning

Authors: Zehan Qi, Xiao Liu, Iat Long Iong, Hanyu Lai, Xueqiao Sun, Wenyi Zhao, Yu Yang, Xinyue Yang, Jiadai Sun, Shuntian Yao, Tianjie Zhang, Wei Xu, Jie Tang, Yuxiao Dong

Abstract: Large language models (LLMs) have shown remarkable potential as autonomous agents, particularly in web-based tasks. However, existing LLM web agents heavily rely on expensive proprietary LLM APIs, while open LLMs lack the necessary decision-making capabilities. This paper introduces WebRL, a self-evolving online curriculum reinforcement learning framework designed to train high-performance web agents using open LLMs. WebRL addresses three key challenges in building LLM web agents, including the scarcity of training tasks, sparse feedback signals, and policy distribution drift in online learning. Specifically, WebRL incorporates 1) a self-evolving curriculum that generates new tasks from unsuccessful attempts, 2) a robust outcome-supervised reward model (ORM), and 3) adaptive reinforcement learning strategies to ensure consistent improvements. We apply WebRL to transform open Llama-3.1 and GLM-4 models into proficient web agents. On WebArena-Lite, WebRL improves the success rate of Llama-3.1-8B from 4.8% to 42.4%, and from 6.1% to 43% for GLM-4-9B. These open models significantly surpass the performance of GPT-4-Turbo (17.6%) and GPT-4o (13.9%) and outperform previous state-of-the-art web agents trained on open LLMs (AutoWebGLM, 18.2%). Our findings demonstrate WebRL's effectiveness in bridging the gap between open and proprietary LLM-based web agents, paving the way for more accessible and powerful autonomous web interaction systems.

replace MambaPEFT: Exploring Parameter-Efficient Fine-Tuning for Mamba

Authors: Masakazu Yoshimura, Teruaki Hayashi, Yota Maeda

Abstract: An ecosystem of Transformer-based models has been established by building large models with extensive data. Parameter-efficient fine-tuning (PEFT) is a crucial technology for deploying these models to downstream tasks with minimal cost while achieving effective performance. Recently, Mamba, a State Space Model (SSM)-based model, has attracted attention as a potential alternative to Transformers. While many large-scale Mamba-based models have been proposed, efficiently adapting pre-trained Mamba-based models to downstream tasks remains unexplored. In this paper, we conduct an exploratory analysis of PEFT methods for Mamba. We investigate the effectiveness of existing PEFT methods for Transformers when applied to Mamba. We also modify these methods to better align with the Mamba architecture. Additionally, we propose new Mamba-specific PEFT methods that leverage the distinctive structure of Mamba. Our experiments indicate that PEFT performs more effectively for Mamba than Transformers. Lastly, we demonstrate how to effectively combine multiple PEFT methods and provide a framework that outperforms previous works. To ensure reproducibility, we will release the code after publication.

replace Hands-On Tutorial: Labeling with LLM and Human-in-the-Loop

Authors: Ekaterina Artemova, Akim Tsvigun, Dominik Schlechtweg, Natalia Fedorova, Konstantin Chernyshev, Sergei Tilga, Boris Obmoroshev

Abstract: Training and deploying machine learning models relies on a large amount of human-annotated data. As human labeling becomes increasingly expensive and time-consuming, recent research has developed multiple strategies to speed up annotation and reduce costs and human workload: generating synthetic training data, active learning, and hybrid labeling. This tutorial is oriented toward practical applications: we will present the basics of each strategy, highlight their benefits and limitations, and discuss in detail real-life case studies. Additionally, we will walk through best practices for managing human annotators and controlling the quality of the final dataset. The tutorial includes a hands-on workshop, where attendees will be guided in implementing a hybrid annotation setup. This tutorial is designed for NLP practitioners from both research and industry backgrounds who are involved in or interested in optimizing data labeling projects.

replace Dynamic Graph Neural ODE Network for Multi-modal Emotion Recognition in Conversation

Authors: Yuntao Shou, Tao Meng, Wei Ai, Keqin Li

Abstract: Multimodal emotion recognition in conversation (MERC) refers to identifying and classifying human emotional states by combining data from multiple different modalities (e.g., audio, images, text, video, etc.). Most existing multimodal emotion recognition methods use GCN to improve performance, but existing GCN methods are prone to overfitting and cannot capture the temporal dependency of the speaker's emotions. To address the above problems, we propose a Dynamic Graph Neural Ordinary Differential Equation Network (DGODE) for MERC, which combines the dynamic changes of emotions to capture the temporal dependency of speakers' emotions, and effectively alleviates the overfitting problem of GCNs. Technically, the key idea of DGODE is to utilize an adaptive mixhop mechanism to improve the generalization ability of GCNs and use the graph ODE evolution network to characterize the continuous dynamics of node representations over time and capture temporal dependencies. Extensive experiments on two publicly available multimodal emotion recognition datasets demonstrate that the proposed DGODE model has superior performance compared to various baselines. Furthermore, the proposed DGODE can also alleviate the over-smoothing problem, thereby enabling the construction of a deep GCN network.

replace Rethinking Chain-of-Thought from the Perspective of Self-Training

Authors: Zongqian Wu, Baoduo Xu, Ruochen Cui, Mengmeng Zhan, Xiaofeng Zhu, Lei Feng

Abstract: Chain-of-thought (CoT) reasoning has emerged as an effective approach for activating latent capabilities in LLMs. Interestingly, we observe that both CoT reasoning and self-training share the core objective: iteratively leveraging model-generated information to progressively reduce prediction uncertainty. Building on this insight, we propose a novel CoT framework to improve reasoning performance. Our framework integrates two key components: (i) a task-specific prompt module that optimizes the initial reasoning process, and (ii) an adaptive reasoning iteration module that dynamically refines the reasoning process and addresses the limitations of previous CoT approaches, \ie over-reasoning and high similarity between consecutive reasoning iterations. Extensive experiments demonstrate that the proposed method achieves significant advantages in both performance and computational efficiency.

replace Crabs: Consuming Resrouce via Auto-generation for LLM-DoS Attack under Black-box Settings

Authors: Yuanhe Zhang, Zhenhong Zhou, Wei Zhang, Xinyue Wang, Xiaojun Jia, Yang Liu, Sen Su

Abstract: Large Language Models (LLMs) have demonstrated remarkable performance across diverse tasks. LLMs continue to be vulnerable to external threats, particularly Denial-of-Service (DoS) attacks. Specifically, LLM-DoS attacks aim to exhaust computational resources and block services. However, prior works tend to focus on performing white-box attacks, overlooking black-box settings. In this work, we propose an automated algorithm designed for black-box LLMs, called Auto-Generation for LLM-DoS Attack (AutoDoS). AutoDoS introduces DoS Attack Tree and optimizes the prompt node coverage to enhance effectiveness under black-box conditions. Our method can bypass existing defense with enhanced stealthiness via semantic improvement of prompt nodes. Furthermore, we reveal that implanting Length Trojan in Basic DoS Prompt aids in achieving higher attack efficacy. Experimental results show that AutoDoS amplifies service response latency by over 250 $\times \uparrow$, leading to severe resource consumption in terms of GPU utilization and memory usage. Our code is available at https://github.com/shuita2333/AutoDoS.

URLs: https://github.com/shuita2333/AutoDoS.

replace DynaGRAG | Exploring the Topology of Information for Advancing Language Understanding and Generation in Graph Retrieval-Augmented Generation

Authors: Karishma Thakrar

Abstract: Graph Retrieval-Augmented Generation (GRAG or Graph RAG) architectures aim to enhance language understanding and generation by leveraging external knowledge. However, effectively capturing and integrating the rich semantic information present in textual and structured data remains a challenge. To address this, a novel GRAG framework, Dynamic Graph Retrieval-Agumented Generation (DynaGRAG), is proposed to focus on enhancing subgraph representation and diversity within the knowledge graph. By improving graph density, capturing entity and relation information more effectively, and dynamically prioritizing relevant and diverse subgraphs and information within them, the proposed approach enables a more comprehensive understanding of the underlying semantic structure. This is achieved through a combination of de-duplication processes, two-step mean pooling of embeddings, query-aware retrieval considering unique nodes, and a Dynamic Similarity-Aware BFS (DSA-BFS) traversal algorithm. Integrating Graph Convolutional Networks (GCNs) and Large Language Models (LLMs) through hard prompting further enhances the learning of rich node and edge representations while preserving the hierarchical subgraph structure. Experimental results demonstrate the effectiveness of DynaGRAG, showcasing the significance of enhanced subgraph representation and diversity for improved language understanding and generation.

replace PokerBench: Training Large Language Models to become Professional Poker Players

Authors: Richard Zhuang, Akshat Gupta, Richard Yang, Aniket Rahane, Zhengyu Li, Gopala Anumanchipalli

Abstract: We introduce PokerBench - a benchmark for evaluating the poker-playing abilities of large language models (LLMs). As LLMs excel in traditional NLP tasks, their application to complex, strategic games like poker poses a new challenge. Poker, an incomplete information game, demands a multitude of skills such as mathematics, reasoning, planning, strategy, and a deep understanding of game theory and human psychology. This makes Poker the ideal next frontier for large language models. PokerBench consists of a comprehensive compilation of 11,000 most important scenarios, split between pre-flop and post-flop play, developed in collaboration with trained poker players. We evaluate prominent models including GPT-4, ChatGPT 3.5, and various Llama and Gemma series models, finding that all state-of-the-art LLMs underperform in playing optimal poker. However, after fine-tuning, these models show marked improvements. We validate PokerBench by having models with different scores compete with each other, demonstrating that higher scores on PokerBench lead to higher win rates in actual poker games. Through gameplay between our fine-tuned model and GPT-4, we also identify limitations of simple supervised fine-tuning for learning optimal playing strategy, suggesting the need for more advanced methodologies for effectively training language models to excel in games. PokerBench thus presents a unique benchmark for a quick and reliable evaluation of the poker-playing ability of LLMs as well as a comprehensive benchmark to study the progress of LLMs in complex game-playing scenarios.

replace Information Entropy Invariance: Enhancing Length Extrapolation in Attention Mechanisms

Authors: Kewei Li, Yanwen Kong, Yiping Xu, Jianlin Su, Lan Huang, Ruochi Zhang, Fengfeng Zhou

Abstract: Since the emergence of research on improving the length extrapolation capabilities of large language models in 2021, some studies have made modifications to the scaling factor in the scaled dot-product attention mechanism as part of their proposed methods without rigorous theoretical justifications. To fill this gap, we propose two new scaled temperatures based on information entropy invariance to enhance length extrapolation. First, a training-free method InfoScale is designed for dotproduct attention, and preserves focus on original tokens during length extrapolation by ensuring consistent entropy. Second, we theoretically analyze the impact of scaling (CosScale) on cosine attention. Experimental data demonstrates that combining InfoScale and CosScale achieves state-ofthe-art performance on the GAU-{\alpha} model with a context window extended to 64 times the training length, and outperforms seven existing methods. Our analysis reveals that significantly increasing CosScale approximates the Windowed Attention, and highlights the significance of attention score dilution as a key challenge in long-range context handling. The code and data are available at https://github.com/HT-NEKO/ Information-Entropy-Invariance.

URLs: https://github.com/HT-NEKO/

replace Enhanced Large Language Models for Effective Screening of Depression and Anxiety

Authors: June M. Liu, Mengxia Gao, Sahand Sabour, Zhuang Chen, Minlie Huang, Tatia M. C. Lee

Abstract: Depressive and anxiety disorders are widespread, necessitating timely identification and management. Recent advances in Large Language Models (LLMs) offer potential solutions, yet high costs and ethical concerns about training data remain challenges. This paper introduces a pipeline for synthesizing clinical interviews, resulting in 1,157 interactive dialogues (PsyInterview), and presents EmoScan, an LLM-based emotional disorder screening system. EmoScan distinguishes between coarse (e.g., anxiety or depressive disorders) and fine disorders (e.g., major depressive disorders) and conducts high-quality interviews. Evaluations showed that EmoScan exceeded the performance of base models and other LLMs like GPT-4 in screening emotional disorders (F1-score=0.7467). It also delivers superior explanations (BERTScore=0.9408) and demonstrates robust generalizability (F1-score of 0.67 on an external dataset). Furthermore, EmoScan outperforms baselines in interviewing skills, as validated by automated ratings and human evaluations. This work highlights the importance of scalable data-generative pipelines for developing effective mental health LLM tools.

replace Multiple Choice Questions: Reasoning Makes Large Language Models (LLMs) More Self-Confident Even When They Are Wrong

Authors: Tairan Fu, Javier Conde, Gonzalo Mart\'inez, Mar\'ia Grandury, Pedro Reviriego

Abstract: One of the most widely used methods to evaluate LLMs are Multiple Choice Question (MCQ) tests. MCQ benchmarks enable the testing of LLM knowledge on almost any topic at scale as the results can be processed automatically. To help the LLM answer, a few examples called few shots can be included in the prompt. Moreover, the LLM can be asked to answer the question directly with the selected option or to first provide the reasoning and then the selected answer, which is known as chain of thought. In addition to checking whether the selected answer is correct, the evaluation can look at the LLM-estimated probability of its response as an indication of the confidence of the LLM in the response. In this paper, we study how the LLM confidence in its answer depends on whether the model has been asked to answer directly or to provide the reasoning before answering. The results of the evaluation of questions on a wide range of topics in seven different models show that LLMs are more confident in their answers when they provide reasoning before the answer. This occurs regardless of whether the selected answer is correct. Our hypothesis is that this behavior is due to the reasoning that modifies the probability of the selected answer, as the LLM predicts the answer based on the input question and the reasoning that supports the selection made. Therefore, LLM estimated probabilities seem to have intrinsic limitations that should be understood in order to use them in evaluation procedures. Interestingly, the same behavior has been observed in humans, for whom explaining an answer increases confidence in its correctness.

replace Embedding-Driven Diversity Sampling to Improve Few-Shot Synthetic Data Generation

Authors: Ivan Lopez, Fateme Nateghi Haredasht, Kaitlin Caoili, Jonathan H Chen, Akshay Chaudhari

Abstract: Accurate classification of clinical text often requires fine-tuning pre-trained language models, a process that is costly and time-consuming due to the need for high-quality data and expert annotators. Synthetic data generation offers an alternative, though pre-trained models may not capture the syntactic diversity of clinical notes. We propose an embedding-driven approach that uses diversity sampling from a small set of real clinical notes to guide large language models in few-shot prompting, generating synthetic text that better reflects clinical syntax. We evaluated this method using the CheXpert dataset on a classification task, comparing it to random few-shot and zero-shot approaches. Using cosine similarity and a Turing test, our approach produced synthetic notes that more closely align with real clinical text. Our pipeline reduced the data needed to reach the 0.85 AUC cutoff by 40% for AUROC and 30% for AUPRC, while augmenting models with synthetic data improved AUROC by 57% and AUPRC by 68%. Additionally, our synthetic data was 0.9 times as effective as real data, a 60% improvement in value.

replace Training Dialogue Systems by AI Feedback for Improving Overall Dialogue Impression

Authors: Kai Yoshida, Masahiro Mizukami, Seiya Kawano, Canasai Kruengkrai, Hiroaki Sugiyama, Koichiro Yoshino

Abstract: To improve user engagement during conversations with dialogue systems, we must improve individual dialogue responses and dialogue impressions such as consistency, personality, and empathy throughout the entire dialogue. While such dialogue systems have been developing rapidly with the help of large language models (LLMs), reinforcement learning from AI feedback (RLAIF) has attracted attention to align LLM-based dialogue models for such dialogue impressions. In RLAIF, a reward model based on another LLM is used to create a training signal for an LLM-based dialogue model using zero-shot/few-shot prompting techniques. However, evaluating an entire dialogue only by prompting LLMs is challenging. In this study, the supervised fine-tuning (SFT) of LLMs prepared reward models corresponding to 12 metrics related to the impression of the entire dialogue for evaluating dialogue responses. We tuned our dialogue models using the reward model signals as feedback to improve the impression of the system. The results of automatic and human evaluations showed that tuning the dialogue model using our reward model corresponding to dialogue impression improved the evaluation of individual metrics and the naturalness of the dialogue response.

replace EvidenceMap: Learning Evidence Analysis to Unleash the Power of Small Language Models for Biomedical Question Answering

Authors: Chang Zong, Jian Wan, Siliang Tang, Lei Zhang

Abstract: When addressing professional questions in the biomedical domain, humans typically acquire multiple pieces of information as evidence and engage in multifaceted evidence analysis to provide high-quality answers. Current LLM-based answer generation methods lack a detailed definition and learning process for evidence analysis, leading to the risk of error propagation and hallucinations while using evidence. Although increasing the parameter size of LLMs can alleviate these issues, it also presents challenges in model training and deployment with limited resources. In this study, we propose EvidenceMap, which aims to enable a tiny pre-trained language model to explicitly learn multiple aspects of biomedical evidence, including supportive evaluation, logical correlation and content summarization, thereby latently guiding a small generative model (around 3B parameters) to provide textual responses. Experimental results demonstrate that our method, fine-tuning a language model with 66M parameters, exceeds the RAG method with an 8B LLM by 19.9% and 5.7% in reference-based quality and accuracy, respectively.

replace Softplus Attention with Re-weighting Boosts Length Extrapolation in Large Language Models

Authors: Bo Gao, Michael W. Spratling

Abstract: Large language models have achieved remarkable success in recent years, primarily due to the implementation of self-attention mechanisms. However, traditional Softmax attention suffers from numerical instability and reduced performance as the length of inference tokens increases. This paper addresses these issues by decomposing the Softmax operation into a non-linear transformation and the $l_1$-norm. We identify the latter as essential for maintaining model performance. By replacing the non-linear transformation with the Softplus activation function and introducing a dynamic scale factor for different token lengths based on invariance entropy, we create a novel attention mechanism with performance better than conventional Softmax attention across various inference lengths. To further improve the length extrapolation ability of the proposed attention mechanism, we introduce a fine-tuning-free re-weighting mechanism that amplifies significant attention weights while diminishing weaker ones, enabling the model to concentrate more effectively on relevant tokens without requiring retraining. When combined with our proposed attention mechanism, this approach demonstrates significant promise in managing longer sequences, maintaining nearly constant validation loss even at 16$\times$ the training token length while ensuring numerical stability. Our code is available at: https://github.com/iminfine/freeatten.

URLs: https://github.com/iminfine/freeatten.

replace GUI-Bee: Align GUI Action Grounding to Novel Environments via Autonomous Exploration

Authors: Yue Fan, Handong Zhao, Ruiyi Zhang, Yu Shen, Xin Eric Wang, Gang Wu

Abstract: Graphical User Interface (GUI) action grounding is a critical step in GUI automation that maps language instructions to actionable elements on GUI screens. Most recent works of GUI action grounding leverage large GUI datasets to fine-tune MLLMs. However, the fine-tuning data always covers limited GUI environments, and we find the performance of the resulting model deteriorates in novel environments. We argue that the GUI grounding models should be further aligned to the novel environments to reveal their full potential, when the inference is known to involve novel environments, i.e., environments not used during the previous fine-tuning. To realize this, we first propose GUI-Bee, an MLLM-based autonomous agent, to collect high-quality, environment-specific data through exploration and then continuously fine-tune GUI grounding models with the collected data. Our agent leverages a novel Q-value-Incentive In-Context Reinforcement Learning (Q-ICRL) method to optimize exploration efficiency and data quality. Additionally, we introduce NovelScreenSpot, a benchmark for testing how well the data can help align GUI action grounding models to novel environments and demonstrate the effectiveness of data collected by GUI-Bee in the experiments. Furthermore, we conduct an ablation study to validate the Q-ICRL method in enhancing the efficiency of GUI-Bee. Project page: https://gui-bee.github.io

URLs: https://gui-bee.github.io

replace The Breeze 2 Herd of Models: Traditional Chinese LLMs Based on Llama with Vision-Aware and Function-Calling Capabilities

Authors: MediaTek Research, :, Chan-Jan Hsu, Chia-Sheng Liu, Meng-Hsi Chen, Muxi Chen, Po-Chun Hsu, Yi-Chang Chen, Da-Shan Shiu

Abstract: Llama-Breeze2 (hereinafter referred to as Breeze2) is a suite of advanced multi-modal language models, available in 3B and 8B parameter configurations, specifically designed to enhance Traditional Chinese language representation. Building upon the Llama 3.2 model family, we continue the pre-training of Breeze2 on an extensive corpus to enhance the linguistic and cultural heritage of Traditional Chinese. In addition to language modeling capabilities, we significantly augment the models with function calling and vision understanding capabilities. At the time of this publication, as far as we are aware, absent reasoning-inducing prompts, Breeze2 are the strongest performing models in Traditional Chinese function calling and image understanding in its size class. The effectiveness of Breeze2 is benchmarked across various tasks, including Taiwan general knowledge, instruction-following, long context, function calling, and vision understanding. We are publicly releasing all Breeze2 models under the Llama 3.2 Community License. We also showcase the capabilities of the model running on mobile platform with a mobile application which we also open source.

replace-cross Selective Generation for Controllable Language Models

Authors: Minjae Lee, Kyungmin Kim, Taesoo Kim, Sangdon Park

Abstract: Trustworthiness of generative language models (GLMs) is crucial in their deployment to critical decision making systems. Hence, certified risk control methods such as selective prediction and conformal prediction have been applied to mitigating the hallucination problem in various supervised downstream tasks. However, the lack of appropriate correctness metric hinders applying such principled methods to language generation tasks. In this paper, we circumvent this problem by leveraging the concept of textual entailment to evaluate the correctness of the generated sequence, and propose two selective generation algorithms which control the false discovery rate with respect to the textual entailment relation (FDR-E) with a theoretical guarantee: $\texttt{SGen}^{\texttt{Sup}}$ and $\texttt{SGen}^{\texttt{Semi}}$. $\texttt{SGen}^{\texttt{Sup}}$, a direct modification of the selective prediction, is a supervised learning algorithm which exploits entailment-labeled data, annotated by humans. Since human annotation is costly, we further propose a semi-supervised version, $\texttt{SGen}^{\texttt{Semi}}$, which fully utilizes the unlabeled data by pseudo-labeling, leveraging an entailment set function learned via conformal prediction. Furthermore, $\texttt{SGen}^{\texttt{Semi}}$ enables to use more general class of selection functions, neuro-selection functions, and provides users with an optimal selection function class given multiple candidates. Finally, we demonstrate the efficacy of the $\texttt{SGen}$ family in achieving a desired FDR-E level with comparable selection efficiency to those from baselines on both open and closed source GLMs. Code and datasets are provided at https://github.com/ml-postech/selective-generation.

URLs: https://github.com/ml-postech/selective-generation.

replace-cross SoK: Machine Learning for Misinformation Detection

Authors: Madelyne Xiao, Jonathan Mayer

Abstract: We examine the disconnect between scholarship and practice in applying machine learning to trust and safety problems, using misinformation detection as a case study. We survey literature on automated detection of misinformation across a corpus of 248 well-cited papers in the field. We then examine subsets of papers for data and code availability, design missteps, reproducibility, and generalizability. Our paper corpus includes published work in security, natural language processing, and computational social science. Across these disparate disciplines, we identify common errors in dataset and method design. In general, detection tasks are often meaningfully distinct from the challenges that online services actually face. Datasets and model evaluation are often non-representative of real-world contexts, and evaluation frequently is not independent of model training. We demonstrate the limitations of current detection methods in a series of three representative replication studies. Based on the results of these analyses and our literature survey, we conclude that the current state-of-the-art in fully-automated misinformation detection has limited efficacy in detecting human-generated misinformation. We offer recommendations for evaluating applications of machine learning to trust and safety problems and recommend future directions for research.

replace-cross On The Truthfulness of 'Surprisingly Likely' Responses of Large Language Models

Authors: Naman Goel

Abstract: The principle of rewarding a crowd for surprisingly common answers has been used in the literature for designing a number of truthful information elicitation mechanisms. A related method has also been proposed in the literature for better aggregation of crowd wisdom. Drawing a comparison between crowd based collective intelligence systems and large language models, we define the notion of 'surprisingly likely' textual response of a large language model. This notion is inspired by the surprisingly common principle, but tailored for text in a language model. Using benchmarks such as TruthfulQA and openly available LLMs: GPT-2 and LLaMA-2, we show that the surprisingly likely textual responses of large language models are more accurate in many cases compared to standard baselines. For example, we observe up to 24 percentage points aggregate improvement on TruthfulQA and up to 70 percentage points improvement on individual categories of questions in this benchmark. We also provide further analysis of the results, including the cases when surprisingly likely responses are less or not more accurate.

replace-cross Merino: Entropy-driven Design for Generative Language Models on IoT Devices

Authors: Youpeng Zhao, Ming Lin, Huadong Tang, Qiang Wu, Jun Wang

Abstract: Generative Large Language Models (LLMs) stand as a revolutionary advancement in the modern era of artificial intelligence (AI). However, scaling down LLMs for resource-constrained hardware, such as Internet-of-Things (IoT) devices requires non-trivial efforts and domain knowledge. In this paper, we propose a novel information-entropy framework for designing mobile-friendly generative language models. The whole design procedure involves solving a mathematical programming (MP) problem, which can be done on the CPU within minutes, making it nearly zero-cost. We evaluate our designed models, termed MeRino, across fourteen NLP downstream tasks, showing their competitive performance against the state-of-the-art autoregressive transformer models under the mobile setting. Notably, MeRino achieves similar or better performance on both language modeling and zero-shot learning tasks, compared to the 350M parameter OPT while being 4.9x faster on NVIDIA Jetson Nano with 5.5x reduction in model size.

replace-cross A Survey of Neural Code Intelligence: Paradigms, Advances and Beyond

Authors: Qiushi Sun, Zhirui Chen, Fangzhi Xu, Kanzhi Cheng, Chang Ma, Zhangyue Yin, Jianing Wang, Chengcheng Han, Renyu Zhu, Shuai Yuan, Qipeng Guo, Xipeng Qiu, Pengcheng Yin, Xiaoli Li, Fei Yuan, Lingpeng Kong, Xiang Li, Zhiyong Wu

Abstract: Neural Code Intelligence -- leveraging deep learning to understand, generate, and optimize code -- holds immense potential for transformative impacts on the whole society. Bridging the gap between Natural Language and Programming Language, this domain has drawn significant attention from researchers in both research communities over the past few years. This survey presents a systematic and chronological review of the advancements in code intelligence, encompassing over 50 representative models and their variants, more than 20 categories of tasks, and an extensive coverage of over 680 related works. We follow the historical progression to trace the paradigm shifts across different research phases (e.g., from modeling code with recurrent neural networks to the era of Large Language Models). Concurrently, we highlight the major technical transitions in models, tasks, and evaluations spanning through different stages. For applications, we also observe a co-evolving shift. It spans from initial endeavors to tackling specific scenarios, through exploring a diverse array of tasks during its rapid expansion, to currently focusing on tackling increasingly complex and varied real-world challenges. Building on our examination of the developmental trajectories, we further investigate the emerging synergies between code intelligence and broader machine intelligence, uncovering new cross-domain opportunities and illustrating the substantial influence of code intelligence across various domains. Finally, we delve into both the opportunities and challenges associated with this field, alongside elucidating our insights on the most promising research directions. An ongoing, dynamically updated project and resources associated with this survey have been released at https://github.com/QiushiSun/Awesome-Code-Intelligence.

URLs: https://github.com/QiushiSun/Awesome-Code-Intelligence.

replace-cross BiMix: A Bivariate Data Mixing Law for Language Model Pretraining

Authors: Ce Ge, Zhijian Ma, Daoyuan Chen, Yaliang Li, Bolin Ding

Abstract: Large language models have demonstrated remarkable capabilities across various tasks, primarily attributed to the utilization of diversely sourced data. However, the impact of pretraining data composition on model performance remains poorly understood. This paper introduces $\textbf{BiMix}$, a novel bivariate data mixing law that models the joint scaling behavior of domain proportions and data volume in LLM pretraining. $\textbf{BiMix}$ provides a systematic framework for understanding and optimizing data mixtures across diverse domains. Through extensive experiments on two large-scale datasets, we demonstrate $\textbf{BiMix}$'s high accuracy in loss extrapolation (mean relative error < 0.2%) and its generalization to unseen mixtures (R${}^{2}$ > 0.97). Optimization of domain proportions yields superior model performance compared to existing methods. Furthermore, we establish entropy-based measures as efficient proxies for data mixing, offering a computationally lightweight strategy. Our work contributes both theoretical insights into data mixing dynamics and practical tools for enhancing LLM training efficiency, paving the way for more effective scaling strategies in language model development.

replace-cross SLMRec: Distilling Large Language Models into Small for Sequential Recommendation

Authors: Wujiang Xu, Qitian Wu, Zujie Liang, Jiaojiao Han, Xuying Ning, Yunxiao Shi, Wenfang Lin, Yongfeng Zhang

Abstract: Sequential Recommendation (SR) task involves predicting the next item a user is likely to interact with, given their past interactions. The SR models examine the sequence of a user's actions to discern more complex behavioral patterns and temporal dynamics. Recent research demonstrates the great impact of LLMs on sequential recommendation systems, either viewing sequential recommendation as language modeling or serving as the backbone for user representation. Although these methods deliver outstanding performance, there is scant evidence of the necessity of a large language model and how large the language model is needed, especially in the sequential recommendation scene. Meanwhile, due to the huge size of LLMs, it is inefficient and impractical to apply a LLM-based model in real-world platforms that often need to process billions of traffic logs daily. In this paper, we explore the influence of LLMs' depth by conducting extensive experiments on large-scale industry datasets. Surprisingly, our motivational experiments reveal that most intermediate layers of LLMs are redundant, indicating that pruning the remaining layers can still maintain strong performance. Motivated by this insight, we empower small language models for SR, namely SLMRec, which adopt a simple yet effective knowledge distillation method. Moreover, SLMRec is orthogonal to other post-training efficiency techniques, such as quantization and pruning, so that they can be leveraged in combination. Comprehensive experimental results illustrate that the proposed SLMRec model attains the best performance using only 13% of the parameters found in LLM-based recommendation models while simultaneously achieving up to 6.6x and 8.0x speedups in training and inference time costs, respectively. Besides, we provide a theoretical justification for why small language models can perform comparably to large language models in SR.

replace-cross Mix-of-Granularity: Optimize the Chunking Granularity for Retrieval-Augmented Generation

Authors: Zijie Zhong, Hanwen Liu, Xiaoya Cui, Xiaofan Zhang, Zengchang Qin

Abstract: Integrating information from various reference databases is a major challenge for Retrieval-Augmented Generation (RAG) systems because each knowledge source adopts a unique data structure and follows different conventions. Retrieving from multiple knowledge sources with one fixed strategy usually leads to under-exploitation of information. To mitigate this drawback, inspired by Mix-of-Expert, we introduce Mix-of-Granularity (MoG), a method that dynamically determines the optimal granularity of a knowledge source based on input queries using a router. The router is efficiently trained with a newly proposed loss function employing soft labels. We further extend MoG to MoG-Graph (MoGG), where reference documents are pre-processed as graphs, enabling the retrieval of distantly situated snippets. Experiments demonstrate that MoG and MoGG effectively predict optimal granularity levels, significantly enhancing the performance of the RAG system in downstream tasks. The code of both MoG and MoGG are released in https://github.com/ZGChung/Mix-of-Granularity.

URLs: https://github.com/ZGChung/Mix-of-Granularity.

replace-cross Deciphering Oracle Bone Language with Diffusion Models

Authors: Haisu Guan, Huanxin Yang, Xinyu Wang, Shengwei Han, Yongge Liu, Lianwen Jin, Xiang Bai, Yuliang Liu

Abstract: Originating from China's Shang Dynasty approximately 3,000 years ago, the Oracle Bone Script (OBS) is a cornerstone in the annals of linguistic history, predating many established writing systems. Despite the discovery of thousands of inscriptions, a vast expanse of OBS remains undeciphered, casting a veil of mystery over this ancient language. The emergence of modern AI technologies presents a novel frontier for OBS decipherment, challenging traditional NLP methods that rely heavily on large textual corpora, a luxury not afforded by historical languages. This paper introduces a novel approach by adopting image generation techniques, specifically through the development of Oracle Bone Script Decipher (OBSD). Utilizing a conditional diffusion-based strategy, OBSD generates vital clues for decipherment, charting a new course for AI-assisted analysis of ancient languages. To validate its efficacy, extensive experiments were conducted on an oracle bone script dataset, with quantitative results demonstrating the effectiveness of OBSD. Code and decipherment results will be made available at https://github.com/guanhaisu/OBSD.

URLs: https://github.com/guanhaisu/OBSD.

replace-cross SyntheT2C: Generating Synthetic Data for Fine-Tuning Large Language Models on the Text2Cypher Task

Authors: Ziije Zhong, Linqing Zhong, Zhaoze Sun, Qingyun Jin, Zengchang Qin, Xiaofan Zhang

Abstract: Integrating Large Language Models (LLMs) with existing Knowledge Graph (KG) databases presents a promising avenue for enhancing LLMs' efficacy and mitigating their "hallucinations". Given that most KGs reside in graph databases accessible solely through specialized query languages (e.g., Cypher), it is critical to connect LLMs with KG databases by automating the translation of natural language into Cypher queries (termed as "Text2Cypher" task). Prior efforts tried to bolster LLMs' proficiency in Cypher generation through Supervised Fine-Tuning (SFT). However, these explorations are hindered by the lack of annotated datasets of Query-Cypher pairs, resulting from the labor-intensive and domain-specific nature of such annotation. In this study, we propose SyntheT2C, a methodology for constructing a synthetic Query-Cypher pair dataset, comprising two distinct pipelines: (1) LLM-based prompting and (2) template-filling. SyntheT2C is applied to two medical KG databases, culminating in the creation of a synthetic dataset, MedT2C. Comprehensive experiments demonstrate that the MedT2C dataset effectively enhances the performance of backbone LLMs on Text2Cypher task via SFT. Both the SyntheT2C codebase and the MedT2C dataset are released in https://github.com/ZGChung/SyntheT2C.

URLs: https://github.com/ZGChung/SyntheT2C.

replace-cross BLoB: Bayesian Low-Rank Adaptation by Backpropagation for Large Language Models

Authors: Yibin Wang, Haizhou Shi, Ligong Han, Dimitris Metaxas, Hao Wang

Abstract: Large Language Models (LLMs) often suffer from overconfidence during inference, particularly when adapted to downstream domain-specific tasks with limited data. Previous work addresses this issue by employing approximate Bayesian estimation after the LLMs are trained, enabling them to quantify uncertainty. However, such post-training approaches' performance is severely limited by the parameters learned during training. In this paper, we go beyond post-training Bayesianization and propose Bayesian Low-Rank Adaptation by Backpropagation (BLoB), an algorithm that continuously and jointly adjusts both the mean and covariance of LLM parameters throughout the whole fine-tuning process. Our empirical results verify the effectiveness of BLoB in terms of generalization and uncertainty estimation, when evaluated on both in-distribution and out-of-distribution data.

replace-cross Multi-Grained Query-Guided Set Prediction Network for Grounded Multimodal Named Entity Recognition

Authors: Jielong Tang, Zhenxing Wang, Ziyang Gong, Jianxing Yu, Xiangwei Zhu, Jian Yin

Abstract: Grounded Multimodal Named Entity Recognition (GMNER) is an emerging information extraction (IE) task, aiming to simultaneously extract entity spans, types, and corresponding visual regions of entities from given sentence-image pairs data. Recent unified methods employing machine reading comprehension or sequence generation-based frameworks show limitations in this difficult task. The former, utilizing human-designed type queries, struggles to differentiate ambiguous entities, such as Jordan (Person) and off-White x Jordan (Shoes). The latter, following the one-by-one decoding order, suffers from exposure bias issues. We maintain that these works misunderstand the relationships of multimodal entities. To tackle these, we propose a novel unified framework named Multi-grained Query-guided Set Prediction Network (MQSPN) to learn appropriate relationships at intra-entity and inter-entity levels. Specifically, MQSPN explicitly aligns textual entities with visual regions by employing a set of learnable queries to strengthen intra-entity connections. Based on distinct intra-entity modeling, MQSPN reformulates GMNER as a set prediction, guiding models to establish appropriate inter-entity relationships from a optimal global matching perspective. Additionally, we incorporate a query-guided Fusion Net (QFNet) as a glue network to boost better alignment of two-level relationships. Extensive experiments demonstrate that our approach achieves state-of-the-art performances in widely used benchmarks.

replace-cross PAT: Pruning-Aware Tuning for Large Language Models

Authors: Yijiang Liu, Huanrui Yang, Youxin Chen, Rongyu Zhang, Miao Wang, Yuan Du, Li Du

Abstract: Large language models (LLMs) excel in language tasks, especially with supervised fine-tuning after pre-training. However, their substantial memory and computational requirements hinder practical applications. Structural pruning, which reduces less significant weight dimensions, is one solution. Yet, traditional post-hoc pruning often leads to significant performance loss, with limited recovery from further fine-tuning due to reduced capacity. Since the model fine-tuning refines the general and chaotic knowledge in pre-trained models, we aim to incorporate structural pruning with the fine-tuning, and propose the Pruning-Aware Tuning (PAT) paradigm to eliminate model redundancy while preserving the model performance to the maximum extend. Specifically, we insert the innovative Hybrid Sparsification Modules (HSMs) between the Attention and FFN components to accordingly sparsify the upstream and downstream linear modules. The HSM comprises a lightweight operator and a globally shared trainable mask. The lightweight operator maintains a training overhead comparable to that of LoRA, while the trainable mask unifies the channels to be sparsified, ensuring structural pruning. Additionally, we propose the Identity Loss which decouples the transformation and scaling properties of the HSMs to enhance training robustness. Extensive experiments demonstrate that PAT excels in both performance and efficiency. For example, our Llama2-7b model with a 25\% pruning ratio achieves 1.33$\times$ speedup while outperforming the LoRA-finetuned model by up to 1.26\% in accuracy with a similar training cost. Code: https://github.com/kriskrisliu/PAT_Pruning-Aware-Tuning

URLs: https://github.com/kriskrisliu/PAT_Pruning-Aware-Tuning

replace-cross GraphEx: A Graph-based Extraction Method for Advertiser Keyphrase Recommendation

Authors: Ashirbad Mishra, Soumik Dey, Marshall Wu, Jinyu Zhao, He Yu, Kaichen Ni, Binbin Li, Kamesh Madduri

Abstract: Online sellers and advertisers are recommended keyphrases for their listed products, which they bid on to enhance their sales. One popular paradigm that generates such recommendations is Extreme Multi-Label Classification (XMC), which involves tagging/mapping keyphrases to items. We outline the limitations of using traditional item-query based tagging or mapping techniques for keyphrase recommendations on E-Commerce platforms. We introduce GraphEx, an innovative graph-based approach that recommends keyphrases to sellers using extraction of token permutations from item titles. Additionally, we demonstrate that relying on traditional metrics such as precision/recall can be misleading in practical applications, thereby necessitating a combination of metrics to evaluate performance in real-world scenarios. These metrics are designed to assess the relevance of keyphrases to items and the potential for buyer outreach. GraphEx outperforms production models at eBay, achieving the objectives mentioned above. It supports near real-time inferencing in resource-constrained production environments and scales effectively for billions of items.

replace-cross When A Man Says He Is Pregnant: ERP Evidence for A Rational Account of Speaker-contextualized Language Comprehension

Authors: Hanlin Wu, Zhenguang G. Cai

Abstract: Spoken language is often, if not always, understood in a context formed by the identity of the speaker. For example, we can easily make sense of an utterance such as "I'm going to have a manicure this weekend" or "The first time I got pregnant I had a hard time" when spoken by a woman, but it would be harder to understand when it is spoken by a man. Previous event-related potential (ERP) studies have shown mixed results regarding the neurophysiological responses to such speaker-content mismatches, with some reporting an N400 effect and others a P600 effect. In an electroencephalography (EEG) experiment involving 64 participants, we used social and biological mismatches as test cases to demonstrate how these distinct ERP patterns reflect different aspects of rational inference. We showed that when the mismatch involves social stereotypes (e.g., men getting a manicure), listeners can arrive at a "literal" interpretation by integrating the content with their social knowledge, though this integration requires additional effort due to stereotype violations-resulting in an N400 effect. In contrast, when the mismatch involves biological knowledge (e.g., men getting pregnant), a "literal" interpretation becomes impossible, leading listeners to treat the input as potentially containing errors and engage in correction processes-resulting in a P600 effect. Supporting this rational inference framework, we found that the social N400 effect decreased as a function of the listener's personality trait of openness (as more open-minded individuals maintain more flexible social expectations), while the biological P600 effect remained robust (as biological constraints are recognized regardless of individual personalities). Our findings help to reconcile the empirical inconsistencies and show how rational inference shapes speaker-contextualized language comprehension.

replace-cross An Adversarial Perspective on Machine Unlearning for AI Safety

Authors: Jakub {\L}ucki, Boyi Wei, Yangsibo Huang, Peter Henderson, Florian Tram\`er, Javier Rando

Abstract: Large language models are finetuned to refuse questions about hazardous knowledge, but these protections can often be bypassed. Unlearning methods aim at completely removing hazardous capabilities from models and make them inaccessible to adversaries. This work challenges the fundamental differences between unlearning and traditional safety post-training from an adversarial perspective. We demonstrate that existing jailbreak methods, previously reported as ineffective against unlearning, can be successful when applied carefully. Furthermore, we develop a variety of adaptive methods that recover most supposedly unlearned capabilities. For instance, we show that finetuning on 10 unrelated examples or removing specific directions in the activation space can recover most hazardous capabilities for models edited with RMU, a state-of-the-art unlearning method. Our findings challenge the robustness of current unlearning approaches and question their advantages over safety training.

replace-cross DeSTA2: Developing Instruction-Following Speech Language Model Without Speech Instruction-Tuning Data

Authors: Ke-Han Lu, Zhehuai Chen, Szu-Wei Fu, Chao-Han Huck Yang, Jagadeesh Balam, Boris Ginsburg, Yu-Chiang Frank Wang, Hung-yi Lee

Abstract: Recent end-to-end speech language models (SLMs) have expanded upon the capabilities of large language models (LLMs) by incorporating pre-trained speech models. However, these SLMs often undergo extensive speech instruction-tuning to bridge the gap between speech and text modalities. This requires significant annotation efforts and risks catastrophic forgetting of the original language capabilities. In this work, we present a simple yet effective automatic process for creating speech-text pair data that carefully injects speech paralinguistic understanding abilities into SLMs while preserving the inherent language capabilities of the text-based LLM. Our model demonstrates general capabilities for speech-related tasks without the need for speech instruction-tuning data, achieving impressive performance on Dynamic-SUPERB and AIR-Bench-Chat benchmarks. Furthermore, our model exhibits the ability to follow complex instructions derived from LLMs, such as specific output formatting and chain-of-thought reasoning. Our approach not only enhances the versatility and effectiveness of SLMs but also reduces reliance on extensive annotated datasets, paving the way for more efficient and capable speech understanding systems.

replace-cross Inspection and Control of Self-Generated-Text Recognition Ability in Llama3-8b-Instruct

Authors: Christopher Ackerman, Nina Panickssery

Abstract: It has been reported that LLMs can recognize their own writing. As this has potential implications for AI safety, yet is relatively understudied, we investigate the phenomenon, seeking to establish whether it robustly occurs at the behavioral level, how the observed behavior is achieved, and whether it can be controlled. First, we find that the Llama3-8b-Instruct chat model - but not the base Llama3-8b model - can reliably distinguish its own outputs from those of humans, and present evidence that the chat model is likely using its experience with its own outputs, acquired during post-training, to succeed at the writing recognition task. Second, we identify a vector in the residual stream of the model that is differentially activated when the model makes a correct self-written-text recognition judgment, show that the vector activates in response to information relevant to self-authorship, present evidence that the vector is related to the concept of "self" in the model, and demonstrate that the vector is causally related to the model's ability to perceive and assert self-authorship. Finally, we show that the vector can be used to control both the model's behavior and its perception, steering the model to claim or disclaim authorship by applying the vector to the model's output as it generates it, and steering the model to believe or disbelieve it wrote arbitrary texts by applying the vector to them as the model reads them.

replace-cross PrefixQuant: Eliminating Outliers by Prefixed Tokens for Large Language Models Quantization

Authors: Mengzhao Chen, Yi Liu, Jiahao Wang, Yi Bin, Wenqi Shao, Ping Luo

Abstract: Existing weight-activation quantization methods for Large Language Models (LLMs) primarily address channel-wise outliers but often neglect token-wise outliers, which limits the accuracy of quantized models. In this work, we propose PrefixQuant, a novel quantization method that achieves state-of-the-art performance across various precision levels (W4A4KV4 and W4A8KV4) and granularities (dynamic and static quantization) by effectively isolating token-wise outliers. First, PrefixQuant eliminates token-wise outliers by prefixing outlier tokens in the KV cache, a process that is training-free and highly efficient (e.g., 1 minutes for Llama-3-70B). Second, PrefixQuant introduces new trainable parameters for block-wise training to compensate for quantization error. Our experiments show that PrefixQuant significantly outperforms existing dynamic quantization methods, even under coarser static quantization settings. For instance, PrefixQuant achieves an average accuracy improvement of +3.08 and +2.85 points over SpinQuant (dynamic quantization) on five zero-shot reasoning tasks under dynamic and static quantization settings, respectively, on W4A4KV4 Llama-3-8B. Additionally, we demonstrate up to 2.74x prefilling speedup and 2.16x decoding speedup for LLMs using W4A4 PrefixQuant. Our code is available at https://github.com/ChenMnZ/PrefixQuant.

URLs: https://github.com/ChenMnZ/PrefixQuant.

replace-cross Information Discovery in e-Commerce

Authors: Zhaochun Ren, Xiangnan He, Dawei Yin, Maarten de Rijke

Abstract: Electronic commerce, or e-commerce, is the buying and selling of goods and services, or the transmitting of funds or data online. E-commerce platforms come in many kinds, with global players such as Amazon, Airbnb, Alibaba, eBay and platforms targeting specific geographic regions. Information retrieval has a natural role to play in e-commerce, especially in connecting people to goods and services. Information discovery in e-commerce concerns different types of search (e.g., exploratory search vs. lookup tasks), recommender systems, and natural language processing in e-commerce portals. The rise in popularity of e-commerce sites has made research on information discovery in e-commerce an increasingly active research area. This is witnessed by an increase in publications and dedicated workshops in this space. Methods for information discovery in e-commerce largely focus on improving the effectiveness of e-commerce search and recommender systems, on enriching and using knowledge graphs to support e-commerce, and on developing innovative question answering and bot-based solutions that help to connect people to goods and services. In this survey, an overview is given of the fundamental infrastructure, algorithms, and technical solutions for information discovery in e-commerce. The topics covered include user behavior and profiling, search, recommendation, and language technology in e-commerce.

replace-cross Natural Language Counterfactual Explanations for Graphs Using Large Language Models

Authors: Flavio Giorgi, Cesare Campagnano, Fabrizio Silvestri, Gabriele Tolomei

Abstract: Explainable Artificial Intelligence (XAI) has emerged as a critical area of research to unravel the opaque inner logic of (deep) machine learning models. Among the various XAI techniques proposed in the literature, counterfactual explanations stand out as one of the most promising approaches. However, these "what-if" explanations are frequently complex and technical, making them difficult for non-experts to understand and, more broadly, challenging for humans to interpret. To bridge this gap, in this work, we exploit the power of open-source Large Language Models to generate natural language explanations when prompted with valid counterfactual instances produced by state-of-the-art explainers for graph-based models. Experiments across several graph datasets and counterfactual explainers show that our approach effectively produces accurate natural language representations of counterfactual instances, as demonstrated by key performance metrics.

replace-cross MetRex: A Benchmark for Verilog Code Metric Reasoning Using LLMs

Authors: Manar Abdelatty, Jingxiao Ma, Sherief Reda

Abstract: Large Language Models (LLMs) have been applied to various hardware design tasks, including Verilog code generation, EDA tool scripting, and RTL bug fixing. Despite this extensive exploration, LLMs are yet to be used for the task of post-synthesis metric reasoning and estimation of HDL designs. In this paper, we assess the ability of LLMs to reason about post-synthesis metrics of Verilog designs. We introduce MetRex, a large-scale dataset comprising 25,868 Verilog HDL designs and their corresponding post-synthesis metrics, namely area, delay, and static power. MetRex incorporates a Chain of Thought (CoT) template to enhance LLMs' reasoning about these metrics. Extensive experiments show that Supervised Fine-Tuning (SFT) boosts the LLM's reasoning capabilities on average by 37.0\%, 25.3\%, and 25.7\% on the area, delay, and static power, respectively. While SFT improves performance on our benchmark, it remains far from achieving optimal results, especially on complex problems. Comparing to state-of-the-art regression models, our approach delivers accurate post-synthesis predictions for 17.4\% more designs (within a 5\% error margin), in addition to offering a 1.7x speedup by eliminating the need for pre-processing. This work lays the groundwork for advancing LLM-based Verilog code metric reasoning.

replace-cross DeepCRCEval: Revisiting the Evaluation of Code Review Comment Generation

Authors: Junyi Lu, Xiaojia Li, Zihan Hua, Lei Yu, Shiqi Cheng, Li Yang, Fengjun Zhang, Chun Zuo

Abstract: Code review is a vital but demanding aspect of software development, generating significant interest in automating review comments. Traditional evaluation methods for these comments, primarily based on text similarity, face two major challenges: inconsistent reliability of human-authored comments in open-source projects and the weak correlation of text similarity with objectives like enhancing code quality and detecting defects. This study empirically analyzes benchmark comments using a novel set of criteria informed by prior research and developer interviews. We then similarly revisit the evaluation of existing methodologies. Our evaluation framework, DeepCRCEval, integrates human evaluators and Large Language Models (LLMs) for a comprehensive reassessment of current techniques based on the criteria set. Besides, we also introduce an innovative and efficient baseline, LLM-Reviewer, leveraging the few-shot learning capabilities of LLMs for a target-oriented comparison. Our research highlights the limitations of text similarity metrics, finding that less than 10% of benchmark comments are high quality for automation. In contrast, DeepCRCEval effectively distinguishes between high and low-quality comments, proving to be a more reliable evaluation mechanism. Incorporating LLM evaluators into DeepCRCEval significantly boosts efficiency, reducing time and cost by 88.78% and 90.32%, respectively. Furthermore, LLM-Reviewer demonstrates significant potential of focusing task real targets in comment generation.

replace-cross 2.5 Years in Class: A Multimodal Textbook for Vision-Language Pretraining

Authors: Wenqi Zhang, Hang Zhang, Xin Li, Jiashuo Sun, Yongliang Shen, Weiming Lu, Deli Zhao, Yueting Zhuang, Lidong Bing

Abstract: Compared to image-text pair data, interleaved corpora enable Vision-Language Models (VLMs) to understand the world more naturally like humans. However, such existing datasets are crawled from webpage, facing challenges like low knowledge density, loose image-text relations, and poor logical coherence between images. On the other hand, the internet hosts vast instructional videos (e.g., online geometry courses) that are widely used by humans to learn foundational subjects, yet these valuable resources remain underexplored in VLM training. In this paper, we introduce a high-quality \textbf{multimodal textbook} corpus with richer foundational knowledge for VLM pretraining. It collects over 2.5 years of instructional videos, totaling 22,000 class hours. We first use an LLM-proposed taxonomy to systematically gather instructional videos. Then we progressively extract and refine visual (keyframes), audio (ASR), and textual knowledge (OCR) from the videos, and organize as an image-text interleaved corpus based on temporal order. Compared to its counterparts, our video-centric textbook offers more coherent context, richer knowledge, and better image-text alignment. Experiments demonstrate its superb pretraining performance, particularly in knowledge- and reasoning-intensive tasks like ScienceQA and MathVista. Moreover, VLMs pre-trained on our textbook exhibit outstanding interleaved context awareness, leveraging visual and textual cues in their few-shot context for task solving. Our code are available at https://github.com/DAMO-NLP-SG/multimodal_textbook.

URLs: https://github.com/DAMO-NLP-SG/multimodal_textbook.

replace-cross Discriminative Representation learning via Attention-Enhanced Contrastive Learning for Short Text Clustering

Authors: Zhihao Yao

Abstract: Contrastive learning has gained significant attention in short text clustering, yet it has an inherent drawback of mistakenly identifying samples from the same category as negatives and then separating them in the feature space (false negative separation), which hinders the generation of superior representations. To generate more discriminative representations for efficient clustering, we propose a novel short text clustering method, called Discriminative Representation learning via \textbf{A}ttention-\textbf{E}nhanced \textbf{C}ontrastive \textbf{L}earning for Short Text Clustering (\textbf{AECL}). The \textbf{AECL} consists of two modules which are the pseudo-label generation module and the contrastive learning module. Both modules build a sample-level attention mechanism to capture similarity relationships between samples and aggregate cross-sample features to generate consistent representations. Then, the former module uses the more discriminative consistent representation to produce reliable supervision information for assist clustering, while the latter module explores similarity relationships and consistent representations optimize the construction of positive samples to perform similarity-guided contrastive learning, effectively addressing the false negative separation issue. Experimental results demonstrate that the proposed \textbf{AECL} outperforms state-of-the-art methods. If the paper is accepted, we will open-source the code.

replace-cross VCRScore: Image captioning metric based on V\&L Transformers, CLIP, and precision-recall

Authors: Guillermo Ruiz, Tania Ram\'irez, Daniela Moctezuma

Abstract: Image captioning has become an essential Vision & Language research task. It is about predicting the most accurate caption given a specific image or video. The research community has achieved impressive results by continuously proposing new models and approaches to improve the overall model's performance. Nevertheless, despite increasing proposals, the performance metrics used to measure their advances have remained practically untouched through the years. A probe of that, nowadays metrics like BLEU, METEOR, CIDEr, and ROUGE are still very used, aside from more sophisticated metrics such as BertScore and ClipScore. Hence, it is essential to adjust how are measure the advances, limitations, and scopes of the new image captioning proposals, as well as to adapt new metrics to these new advanced image captioning approaches. This work proposes a new evaluation metric for the image captioning problem. To do that, first, it was generated a human-labeled dataset to assess to which degree the captions correlate with the image's content. Taking these human scores as ground truth, we propose a new metric, and compare it with several well-known metrics, from classical to newer ones. Outperformed results were also found, and interesting insights were presented and discussed.

replace-cross Computational Protein Science in the Era of Large Language Models (LLMs)

Authors: Wenqi Fan, Yi Zhou, Shijie Wang, Yuyao Yan, Hui Liu, Qian Zhao, Le Song, Qing Li

Abstract: Considering the significance of proteins, computational protein science has always been a critical scientific field, dedicated to revealing knowledge and developing applications within the protein sequence-structure-function paradigm. In the last few decades, Artificial Intelligence (AI) has made significant impacts in computational protein science, leading to notable successes in specific protein modeling tasks. However, those previous AI models still meet limitations, such as the difficulty in comprehending the semantics of protein sequences, and the inability to generalize across a wide range of protein modeling tasks. Recently, LLMs have emerged as a milestone in AI due to their unprecedented language processing & generalization capability. They can promote comprehensive progress in fields rather than solving individual tasks. As a result, researchers have actively introduced LLM techniques in computational protein science, developing protein Language Models (pLMs) that skillfully grasp the foundational knowledge of proteins and can be effectively generalized to solve a diversity of sequence-structure-function reasoning problems. While witnessing prosperous developments, it's necessary to present a systematic overview of computational protein science empowered by LLM techniques. First, we summarize existing pLMs into categories based on their mastered protein knowledge, i.e., underlying sequence patterns, explicit structural and functional information, and external scientific languages. Second, we introduce the utilization and adaptation of pLMs, highlighting their remarkable achievements in promoting protein structure prediction, protein function prediction, and protein design studies. Then, we describe the practical application of pLMs in antibody design, enzyme design, and drug discovery. Finally, we specifically discuss the promising future directions in this fast-growing field.