new Can Cross Encoders Produce Useful Sentence Embeddings?

Authors: Haritha Ananthakrishnan, Julian Dolby, Harsha Kokel, Horst Samulowitz, Kavitha Srinivas

Abstract: Cross encoders (CEs) are trained with sentence pairs to detect relatedness. As CEs require sentence pairs at inference, the prevailing view is that they can only be used as re-rankers in information retrieval pipelines. Dual encoders (DEs) are instead used to embed sentences, where sentence pairs are encoded by two separate encoders with shared weights at training, and a loss function that ensures the pair's embeddings lie close in vector space if the sentences are related. DEs however, require much larger datasets to train, and are less accurate than CEs. We report a curious finding that embeddings from earlier layers of CEs can in fact be used within an information retrieval pipeline. We show how to exploit CEs to distill a lighter-weight DE, with a 5.15x speedup in inference time.

new Sorting the Babble in Babel: Assessing the Performance of Language Detection Algorithms on the OpenAlex Database

Authors: Maxime Holmberg Sainte-Marie, Diego Kozlowski, Luc\'ia C\'espedes, Vincent Larivi\`ere

Abstract: Following a recent study on the quality of OpenAlex linguistic metadata (C\'espedes et al., 2025), the present paper aims to optimize the latter through the design, use, and evaluation of various linguistic classification procedures based on the latest and most efficient automatic language detection algorithms. Starting from a multilingual set of manually-annotated samples of articles indexed in the database, different classification procedures are then designed, based on the application of a set of language detection algorithms on a series of corpora generated from different combinations of textual metadata of indexed articles. At sample level first, the performance of these different procedures for each of the main languages in the database is evaluated in terms of precision, recall, and processing time. Then, overall procedure performance is estimated at the database level by means of a probabilistic simulation of harmonically aggregated and weighted scores. Results show that procedure performance strongly depends on the importance given to each of the measures implemented: for contexts where precision is preferred, using the LangID algorithm on article titles, abstracts as well as journal names gives the best results; however, for all cases where recall is considered at least slightly more important than precision or as soon as processing times are given any kind of consideration, use of the FastSpell algorithm on article titles only outperforms all other alternatives. Given the lack of truly multilingual, large-scale bibliographic databases, it is hoped that these results help confirm and foster the unparalleled potential of the OpenAlex database for cross-linguistic, bibliometric-based research and analysis.

new Context-Preserving Gradient Modulation for Large Language Models: A Novel Approach to Semantic Consistency in Long-Form Text Generation

Authors: Nirola Kobanov, Edmund Weatherstone, Zachary Vanderpoel, Orlando Wetherby

Abstract: Maintaining semantic consistency over extended text sequences remains a fundamental challenge in long-form text generation, where conventional training methodologies often struggle to prevent contextual drift and coherence degradation. A novel gradient modulation approach is introduced, designed to adjust parameter updates dynamically in response to contextual relevance, ensuring that generated text remains aligned with prior discourse. By integrating a modulation function that selectively amplifies or attenuates gradients based on learned contextual dependencies, the proposed method enhances the stability of model-generated narratives without imposing significant computational overhead. Comparative evaluations against baseline models reveal improvements in coherence, contextual retention, and long-range dependency tracking, demonstrating the effectiveness of modifying the learning process at the gradient level. The results indicate that sentence structure variability and lexical diversity benefit from this approach, mitigating repetitive phrasing and improving adaptability across diverse linguistic contexts. Statistical validation of coherence metrics further substantiates the observed enhancements, with a significant reduction in inconsistencies emerging as a direct consequence of the modulation mechanism. Computational efficiency assessments confirm that the framework achieves these gains without requiring substantial modifications to the underlying architecture, ensuring compatibility with existing optimization workflows.

new Looking for the Inner Music: Probing LLMs' Understanding of Literary Style

Authors: Rebecca M. M. Hicke, David Mimno

Abstract: Recent work has demonstrated that language models can be trained to identify the author of much shorter literary passages than has been thought feasible for traditional stylometry. We replicate these results for authorship and extend them to a new dataset measuring novel genre. We find that LLMs are able to distinguish authorship and genre, but they do so in different ways. Some models seem to rely more on memorization, while others benefit more from training to learn author/genre characteristics. We then use three methods to probe one high-performing LLM for features that define style. These include direct syntactic ablations to input text as well as two methods that look at model internals. We find that authorial style is easier to define than genre-level style and is more impacted by minor syntactic decisions and contextual word usage. However, some traits like pronoun usage and word order prove significant for defining both kinds of literary style.

new Advancing Reasoning in Large Language Models: Promising Methods and Approaches

Authors: Avinash Patil

Abstract: Large Language Models (LLMs) have succeeded remarkably in various natural language processing (NLP) tasks, yet their reasoning capabilities remain a fundamental challenge. While LLMs exhibit impressive fluency and factual recall, their ability to perform complex reasoning-spanning logical deduction, mathematical problem-solving, commonsense inference, and multi-step reasoning-often falls short of human expectations. This survey provides a comprehensive review of emerging techniques enhancing reasoning in LLMs. We categorize existing methods into key approaches, including prompting strategies (e.g., Chain-of-Thought reasoning, Self-Consistency, and Tree-of-Thought reasoning), architectural innovations (e.g., retrieval-augmented models, modular reasoning networks, and neuro-symbolic integration), and learning paradigms (e.g., fine-tuning with reasoning-specific datasets, reinforcement learning, and self-supervised reasoning objectives). Additionally, we explore evaluation frameworks used to assess reasoning in LLMs and highlight open challenges, such as hallucinations, robustness, and reasoning generalization across diverse tasks. By synthesizing recent advancements, this survey aims to provide insights into promising directions for future research and practical applications of reasoning-augmented LLMs.

new Reflection-Window Decoding: Text Generation with Selective Refinement

Authors: Zeyu Tang, Zhenhao Chen, Loka Li, Xiangchen Song, Yunlong Deng, Yifan Shen, Guangyi Chen, Peter Spirtes, Kun Zhang

Abstract: The autoregressive decoding for text generation in large language models (LLMs), while widely used, is inherently suboptimal due to the lack of a built-in mechanism to perform refinement and/or correction of the generated content. In this paper, we consider optimality in terms of the joint probability over the generated response, when jointly considering all tokens at the same time. We theoretically characterize the potential deviation of the autoregressively generated response from its globally optimal counterpart that is of the same length. Our analysis suggests that we need to be cautious when noticeable uncertainty arises during text generation, which may signal the sub-optimality of the generation history. To address the pitfall of autoregressive decoding for text generation, we propose an approach that incorporates a sliding reflection window and a pausing criterion, such that refinement and generation can be carried out interchangeably as the decoding proceeds. Our selective refinement framework strikes a balance between efficiency and optimality, and our extensive experimental results demonstrate the effectiveness of our approach.

new Controlled LLM Decoding via Discrete Auto-regressive Biasing

Authors: Patrick Pynadath, Ruqi Zhang

Abstract: Controlled text generation allows for enforcing user-defined constraints on large language model outputs, an increasingly important field as LLMs become more prevalent in everyday life. One common approach uses energy-based decoding, which defines a target distribution through an energy function that combines multiple constraints into a weighted average. However, these methods often struggle to balance fluency with constraint satisfaction, even with extensive tuning of the energy function's coefficients. In this paper, we identify that this suboptimal balance arises from sampling in continuous space rather than the natural discrete space of text tokens. To address this, we propose Discrete Auto-regressive Biasing, a controlled decoding algorithm that leverages gradients while operating entirely in the discrete text domain. Specifically, we introduce a new formulation for controlled text generation by defining a joint distribution over the generated sequence and an auxiliary bias sequence. To efficiently sample from this joint distribution, we propose a Langevin-within-Gibbs sampling algorithm using gradient-based discrete MCMC. Our method significantly improves constraint satisfaction while maintaining comparable or better fluency, all with even lower computational costs. We demonstrate the advantages of our controlled decoding method on sentiment control, language detoxification, and keyword-guided generation.

new A Comparison of DeepSeek and Other LLMs

Authors: Tianchen Gao, Jiashun Jin, Zheng Tracy Ke, Gabriel Moryoussef

Abstract: Recently, DeepSeek has been the focus of attention in and beyond the AI community. An interesting problem is how DeepSeek compares to other large language models (LLMs). There are many tasks an LLM can do, and in this paper, we use the task of predicting an outcome using a short text for comparison. We consider two settings, an authorship classification setting and a citation classification setting. In the first one, the goal is to determine whether a short text is written by human or AI. In the second one, the goal is to classify a citation to one of four types using the textual content. For each experiment, we compare DeepSeek with $4$ popular LLMs: Claude, Gemini, GPT, and Llama. We find that, in terms of classification accuracy, DeepSeek outperforms Gemini, GPT, and Llama in most cases, but underperforms Claude. We also find that DeepSeek is comparably slower than others but with a low cost to use, while Claude is much more expensive than all the others. Finally, we find that in terms of similarity, the output of DeepSeek is most similar to those of Gemini and Claude (and among all $5$ LLMs, Claude and Gemini have the most similar outputs). In this paper, we also present a fully-labeled dataset collected by ourselves, and propose a recipe where we can use the LLMs and a recent data set, MADStat, to generate new data sets. The datasets in our paper can be used as benchmarks for future study on LLMs.

new LLM Alignment as Retriever Optimization: An Information Retrieval Perspective

Authors: Bowen Jin, Jinsung Yoon, Zhen Qin, Ziqi Wang, Wei Xiong, Yu Meng, Jiawei Han, Sercan O. Arik

Abstract: Large Language Models (LLMs) have revolutionized artificial intelligence with capabilities in reasoning, coding, and communication, driving innovation across industries. Their true potential depends on effective alignment to ensure correct, trustworthy and ethical behavior, addressing challenges like misinformation, hallucinations, bias and misuse. While existing Reinforcement Learning (RL)-based alignment methods are notoriously complex, direct optimization approaches offer a simpler alternative. In this work, we introduce a novel direct optimization approach for LLM alignment by drawing on established Information Retrieval (IR) principles. We present a systematic framework that bridges LLM alignment and IR methodologies, mapping LLM generation and reward models to IR's retriever-reranker paradigm. Building on this foundation, we propose LLM Alignment as Retriever Preference Optimization (LarPO), a new alignment method that enhances overall alignment quality. Extensive experiments validate LarPO's effectiveness with 38.9 % and 13.7 % averaged improvement on AlpacaEval2 and MixEval-Hard respectively. Our work opens new avenues for advancing LLM alignment by integrating IR foundations, offering a promising direction for future research.

new Aggregate and conquer: detecting and steering LLM concepts by combining nonlinear predictors over multiple layers

Authors: Daniel Beaglehole, Adityanarayanan Radhakrishnan, Enric Boix-Adser\`a, Mikhail Belkin

Abstract: A trained Large Language Model (LLM) contains much of human knowledge. Yet, it is difficult to gauge the extent or accuracy of that knowledge, as LLMs do not always ``know what they know'' and may even be actively misleading. In this work, we give a general method for detecting semantic concepts in the internal activations of LLMs. Furthermore, we show that our methodology can be easily adapted to steer LLMs toward desirable outputs. Our innovations are the following: (1) we use a nonlinear feature learning method to identify important linear directions for predicting concepts from each layer; (2) we aggregate features across layers to build powerful concept detectors and steering mechanisms. We showcase the power of our approach by attaining state-of-the-art results for detecting hallucinations, harmfulness, toxicity, and untruthful content on seven benchmarks. We highlight the generality of our approach by steering LLMs towards new concepts that, to the best of our knowledge, have not been previously considered in the literature, including: semantic disambiguation, human languages, programming languages, hallucinated responses, science subjects, poetic/Shakespearean English, and even multiple concepts simultaneously. Moreover, our method can steer concepts with numerical attributes such as product reviews. We provide our code (including a simple API for our methods) at https://github.com/dmbeaglehole/neural_controllers .

URLs: https://github.com/dmbeaglehole/neural_controllers

new MultiQ&A: An Analysis in Measuring Robustness via Automated Crowdsourcing of Question Perturbations and Answers

Authors: Nicole Cho, William Watson

Abstract: One critical challenge in the institutional adoption journey of Large Language Models (LLMs) stems from their propensity to hallucinate in generated responses. To address this, we propose MultiQ&A, a systematic approach for evaluating the robustness and consistency of LLM-generated answers. We demonstrate MultiQ&A's ability to crowdsource question perturbations and their respective answers through independent LLM agents at scale. Our experiments culminated in the examination of 1.9 million question perturbations and 2.3 million answers. Furthermore, MultiQ&A shows that ensembled LLMs, such as gpt-3.5-turbo, remain relatively robust and consistent under perturbations. MultiQ&A provides clarity in the response generation space, offering an effective method for inspecting disagreements and variability. Therefore, our system offers a potential framework for institutional LLM adoption with the ability to measure confidence, consistency, and the quantification of hallucinations.

new Rethinking the Residual Distribution of Locate-then-Editing Methods in Model Editing

Authors: Xiaopeng Li, Shanwen Wang, Shasha Li, Shezheng Song, Bin Ji, Jun Ma, Jie Yu

Abstract: Model editing is a powerful technique for updating the knowledge of Large Language Models (LLMs). Locate-then-edit methods are a popular class of approaches that first identify the critical layers storing knowledge, then compute the residual of the last critical layer based on the edited knowledge, and finally perform multi-layer updates using a least-squares solution by evenly distributing the residual from the first critical layer to the last. Although these methods achieve promising results, they have been shown to degrade the original knowledge of LLMs. We argue that residual distribution leads to this issue. To explore this, we conduct a comprehensive analysis of residual distribution in locate-then-edit methods from both empirical and theoretical perspectives, revealing that residual distribution introduces editing errors, leading to inaccurate edits. To address this issue, we propose the Boundary Layer UpdatE (BLUE) strategy to enhance locate-then-edit methods. Sequential batch editing experiments on three LLMs and two datasets demonstrate that BLUE not only delivers an average performance improvement of 35.59\%, significantly advancing the state of the art in model editing, but also enhances the preservation of LLMs' general capabilities. Our code is available at https://github.com/xpq-tech/BLUE.

URLs: https://github.com/xpq-tech/BLUE.

new Hierarchical Contextual Manifold Alignment for Structuring Latent Representations in Large Language Models

Authors: Meiquan Dong, Haoran Liu, Yan Huang, Zixuan Feng, Jianhong Tang, Ruoxi Wang

Abstract: The organization of latent token representations plays a crucial role in determining the stability, generalization, and contextual consistency of language models, yet conventional approaches to embedding refinement often rely on parameter modifications that introduce additional computational overhead. A hierarchical alignment method was introduced to restructure token embeddings without altering core model weights, ensuring that representational distributions maintained coherence across different linguistic contexts. Experimental evaluations demonstrated improvements in rare token retrieval, adversarial robustness, and long-range dependency tracking, highlighting the advantages of hierarchical structuring in mitigating inconsistencies in latent space organization. The comparative analysis against conventional fine-tuning and embedding perturbation methods revealed that hierarchical restructuring maintained computational efficiency while achieving measurable gains in representation quality. Structural refinements introduced through the alignment process resulted in improved contextual stability across varied linguistic tasks, reducing inconsistencies in token proximity relationships and enhancing interpretability in language generation. A detailed computational assessment confirmed that the realignment process introduced minimal inference overhead, ensuring that representational improvements did not compromise model efficiency. The findings reinforced the broader significance of structured representation learning, illustrating that hierarchical embedding modifications could serve as an effective strategy for refining latent space distributions while preserving pre-learned semantic associations.

new It's All in The [MASK]: Simple Instruction-Tuning Enables BERT-like Masked Language Models As Generative Classifiers

Authors: Benjamin Clavi\'e, Nathan Cooper, Benjamin Warner

Abstract: While encoder-only models such as BERT and ModernBERT are ubiquitous in real-world NLP applications, their conventional reliance on task-specific classification heads can limit their applicability compared to decoder-based large language models (LLMs). In this work, we introduce ModernBERT-Large-Instruct, a 0.4B-parameter encoder model that leverages its masked language modelling (MLM) head for generative classification. Our approach employs an intentionally simple training loop and inference mechanism that requires no heavy pre-processing, heavily engineered prompting, or architectural modifications. ModernBERT-Large-Instruct exhibits strong zero-shot performance on both classification and knowledge-based tasks, outperforming similarly sized LLMs on MMLU and achieving 93% of Llama3-1B's MMLU performance with 60% less parameters. We also demonstrate that, when fine-tuned, the generative approach using the MLM head matches or even surpasses traditional classification-head methods across diverse NLU tasks.This capability emerges specifically in models trained on contemporary, diverse data mixes, with models trained on lower volume, less-diverse data yielding considerably weaker performance. Although preliminary, these results demonstrate the potential of using the original generative masked language modelling head over traditional task-specific heads for downstream tasks. Our work suggests that further exploration into this area is warranted, highlighting many avenues for future improvements.

new Enhancing Hallucination Detection through Noise Injection

Authors: Litian Liu, Reza Pourreza, Sunny Panchal, Apratim Bhattacharyya, Yao Qin, Roland Memisevic

Abstract: Large Language Models (LLMs) are prone to generating plausible yet incorrect responses, known as hallucinations. Effectively detecting hallucinations is therefore crucial for the safe deployment of LLMs. Recent research has linked hallucinations to model uncertainty, suggesting that hallucinations can be detected by measuring dispersion over answer distributions obtained from a set of samples drawn from a model. While drawing from the distribution over tokens defined by the model is a natural way to obtain samples, in this work, we argue that it is sub-optimal for the purpose of detecting hallucinations. We show that detection can be improved significantly by taking into account model uncertainty in the Bayesian sense. To this end, we propose a very simple and efficient approach that perturbs an appropriate subset of model parameters, or equivalently hidden unit activations, during sampling. We demonstrate its effectiveness across a wide range of datasets and model architectures.

new Identify Critical KV Cache in LLM Inference from an Output Perturbation Perspective

Authors: Yuan Feng, Junlin Lv, Yukun Cao, Xike Xie, S Kevin Zhou

Abstract: Large language models have revolutionized natural language processing but face significant challenges of high storage and runtime costs, due to the transformer architecture's reliance on self-attention, particularly the large Key-Value (KV) cache for long-sequence inference. Recent efforts to reduce KV cache size by pruning less critical entries based on attention weights remain empirical and lack formal grounding. This paper presents a formal study on identifying critical KV cache entries by analyzing attention output perturbation. Our analysis reveals that, beyond attention weights, the value states within KV entries and pretrained parameter matrices are also crucial. Based on this, we propose a perturbation-constrained selection algorithm that optimizes the worst-case output perturbation to identify critical entries. Evaluations on the Needle-in-a-Haystack test and Longbench benchmark show our algorithm enhances state-of-the-art cache eviction methods. Further empirical analysis confirms that our algorithm achieves lower output perturbations in over 92% attention heads in Llama model, thereby providing a significant improvement over existing methods.

new PsyPlay: Personality-Infused Role-Playing Conversational Agents

Authors: Tao Yang, Yuhua Zhu, Xiaojun Quan, Cong Liu, Qifan Wang

Abstract: The current research on Role-Playing Conversational Agents (RPCAs) with Large Language Models (LLMs) primarily focuses on imitating specific speaking styles and utilizing character backgrounds, neglecting the depiction of deeper personality traits.~In this study, we introduce personality-infused role-playing for LLM agents, which encourages agents to accurately portray their designated personality traits during dialogues. We then propose PsyPlay, a dialogue generation framework that facilitates the expression of rich personalities among multiple LLM agents. Specifically, PsyPlay enables agents to assume roles with distinct personality traits and engage in discussions centered around specific topics, consistently exhibiting their designated personality traits throughout the interactions. Validation on generated dialogue data demonstrates that PsyPlay can accurately portray the intended personality traits, achieving an overall success rate of 80.31% on GPT-3.5. Notably, we observe that LLMs aligned with positive values are more successful in portraying positive personality roles compared to negative ones. Moreover, we construct a dialogue corpus for personality-infused role-playing, called PsyPlay-Bench. The corpus, which consists of 4745 instances of correctly portrayed dialogues using PsyPlay, aims to further facilitate research in personalized role-playing and dialogue personality detection.

new Syntriever: How to Train Your Retriever with Synthetic Data from LLMs

Authors: Minsang Kim, Seungjun Baek

Abstract: LLMs have boosted progress in many AI applications. Recently, there were attempts to distill the vast knowledge of LLMs into information retrieval systems. Those distillation methods mostly use output probabilities of LLMs which are unavailable in the latest black-box LLMs. We propose Syntriever, a training framework for retrievers using synthetic data from black-box LLMs. Syntriever consists of two stages. Firstly in the distillation stage, we synthesize relevant and plausibly irrelevant passages and augmented queries using chain-of-thoughts for the given queries. LLM is asked to self-verify the synthetic data for possible hallucinations, after which retrievers are trained with a loss designed to cluster the embeddings of relevant passages. Secondly in the alignment stage, we align the retriever with the preferences of LLMs. We propose a preference modeling called partial Plackett-Luce ranking to learn LLM preferences with regularization which prevents the model from deviating excessively from that trained in the distillation stage. Experiments show that Syntriever achieves state-of-the-art performances on benchmark datasets from various domains in nDCG@$K$. The code is available at \href{https://github.com/kmswin1/Syntriever}{https://github.com/kmswin1/Syntriever}.

URLs: https://github.com/kmswin1/Syntriever, https://github.com/kmswin1/Syntriever

new A comprehensive survey of contemporary Arabic sentiment analysis: Methods, Challenges, and Future Directions

Authors: Zhiqiang Shi, Ruchit Agrawal

Abstract: Sentiment Analysis, a popular subtask of Natural Language Processing, employs computational methods to extract sentiment, opinions, and other subjective aspects from linguistic data. Given its crucial role in understanding human sentiment, research in sentiment analysis has witnessed significant growth in the recent years. However, the majority of approaches are aimed at the English language, and research towards Arabic sentiment analysis remains relatively unexplored. This paper presents a comprehensive and contemporary survey of Arabic Sentiment Analysis, identifies the challenges and limitations of existing literature in this field and presents avenues for future research. We present a systematic review of Arabic sentiment analysis methods, focusing specifically on research utilizing deep learning. We then situate Arabic Sentiment Analysis within the broader context, highlighting research gaps in Arabic sentiment analysis as compared to general sentiment analysis. Finally, we outline the main challenges and promising future directions for research in Arabic sentiment analysis.

new Improving Natural Language Understanding for LLMs via Large-Scale Instruction Synthesis

Authors: Lin Yuan, Jun Xu, Honghao Gui, Mengshu Sun, Zhiqiang Zhang, Lei Liang, Jun Zhou

Abstract: High-quality, large-scale instructions are crucial for aligning large language models (LLMs), however, there is a severe shortage of instruction in the field of natural language understanding (NLU). Previous works on constructing NLU instructions mainly focus on information extraction (IE), neglecting tasks such as machine reading comprehension, question answering, and text classification. Furthermore, the lack of diversity in the data has led to a decreased generalization ability of trained LLMs in other NLU tasks and a noticeable decline in the fundamental model's general capabilities. To address this issue, we propose Hum, a large-scale, high-quality synthetic instruction corpus for NLU tasks, designed to enhance the NLU capabilities of LLMs. Specifically, Hum includes IE (either close IE or open IE), machine reading comprehension, text classification, and instruction generalist tasks, thereby enriching task diversity. Additionally, we introduce a human-LLMs collaborative mechanism to synthesize instructions, which enriches instruction diversity by incorporating guidelines, preference rules, and format variants. We conduct extensive experiments on 5 NLU tasks and 28 general capability evaluation datasets for LLMs. Experimental results show that Hum enhances the NLU capabilities of six LLMs by an average of 3.1\%, with no significant decline observed in other general capabilities.

new BOLT: Bootstrap Long Chain-of-Thought in Language Models without Distillation

Authors: Bo Pang, Hanze Dong, Jiacheng Xu, Silvio Savarese, Yingbo Zhou, Caiming Xiong

Abstract: Large language models (LLMs), such as o1 from OpenAI, have demonstrated remarkable reasoning capabilities. o1 generates a long chain-of-thought (LongCoT) before answering a question. LongCoT allows LLMs to analyze problems, devise plans, reflect, and backtrack effectively. These actions empower LLM to solve complex problems. After the release of o1, many teams have attempted to replicate its LongCoT and reasoning capabilities. In terms of methods, they primarily rely on knowledge distillation with data from existing models with LongCoT capacities (e.g., OpenAI-o1, Qwen-QwQ, DeepSeek-R1-Preview), leaving significant uncertainties on systematically developing such reasoning abilities. In terms of data domains, these works focus narrowly on math while a few others include coding, limiting their generalizability. This paper introduces a novel approach to enable LLM's LongCoT capacity without distillation from o1-like models or expensive human annotations, where we bootstrap LongCoT (BOLT) from a standard instruct model. BOLT involves three stages: 1) LongCoT data bootstrapping with in-context learning on a standard instruct model; 2) LongCoT supervised finetuning; 3) online training to further refine LongCoT capacities. In BOLT, only a few in-context examples need to be constructed during the bootstrapping stage; in our experiments, we created 10 examples, demonstrating the feasibility of this approach. We use Llama-3.1-70B-Instruct to bootstrap LongCoT and apply our method to various model scales (7B, 8B, 70B). We achieve impressive performance on a variety of benchmarks, Arena-Hard, MT-Bench, WildBench, ZebraLogic, MATH500, which evaluate diverse task-solving and reasoning capabilities.

new Experiments with Large Language Models on Retrieval-Augmented Generation for Closed-Source Simulation Software

Authors: Andreas Baumann, Peter Eberhard

Abstract: Large Language Models (LLMs) are increasingly helpful in text generation, even writing code in programming languages based on user prompts written in natural language. They are even applied to generate simulation models for multibody systems from natural language. Research results suggest that LLMs surpass the mere replication of existing code examples, where some LLMs have been trained on an open-source multibody simulation code. However, for closed-source simulation software, such results are not to be expected as their ideas and concepts might differ from other publicly available ones. LLMs can hallucinate for knowledge-intensive tasks, such as model creation, which can lead to wrong responses. This is especially the case for the LLM unknown closed-source simulation software. The same applies to other internal knowledge kept private to protect intellectual property or data privacy. The Retrieval-Augmented Generation (RAG) approach might yield a solution for these knowledge-intensive tasks. This paper explores the application of RAG to closed-source simulation software and presents first experiments. After a brief introduction to LLMs, the RAG approach, and the simulation method applied by the close-source simulation software, several examples are provided to test LLMs' knowledge of the simulation software and the creation of simulation models using two RAG systems. The examples show promising results indicating the benefits of applying RAG systems to closed-source simulation software, helping to access their knowledge. Nevertheless, they also reveal gaps in the applied information and open questions for further research.

new Afrispeech-Dialog: A Benchmark Dataset for Spontaneous English Conversations in Healthcare and Beyond

Authors: Mardhiyah Sanni, Tassallah Abdullahi, Devendra D. Kayande, Emmanuel Ayodele, Naome A. Etori, Michael S. Mollel, Moshood Yekini, Chibuzor Okocha, Lukman E. Ismaila, Folafunmi Omofoye, Boluwatife A. Adewale, Tobi Olatunji

Abstract: Speech technologies are transforming interactions across various sectors, from healthcare to call centers and robots, yet their performance on African-accented conversations remains underexplored. We introduce Afrispeech-Dialog, a benchmark dataset of 50 simulated medical and non-medical African-accented English conversations, designed to evaluate automatic speech recognition (ASR) and related technologies. We assess state-of-the-art (SOTA) speaker diarization and ASR systems on long-form, accented speech, comparing their performance with native accents and discover a 10%+ performance degradation. Additionally, we explore medical conversation summarization capabilities of large language models (LLMs) to demonstrate the impact of ASR errors on downstream medical summaries, providing insights into the challenges and opportunities for speech technologies in the Global South. Our work highlights the need for more inclusive datasets to advance conversational AI in low-resource settings.

new MAQInstruct: Instruction-based Unified Event Relation Extraction

Authors: Jun Xu, Mengshu Sun, Zhiqiang Zhang, Jun Zhou

Abstract: Extracting event relations that deviate from known schemas has proven challenging for previous methods based on multi-class classification, MASK prediction, or prototype matching. Recent advancements in large language models have shown impressive performance through instruction tuning. Nevertheless, in the task of event relation extraction, instruction-based methods face several challenges: there are a vast number of inference samples, and the relations between events are non-sequential. To tackle these challenges, we present an improved instruction-based event relation extraction framework named MAQInstruct. Firstly, we transform the task from extracting event relations using given event-event instructions to selecting events using given event-relation instructions, which reduces the number of samples required for inference. Then, by incorporating a bipartite matching loss, we reduce the dependency of the instruction-based method on the generation sequence. Our experimental results demonstrate that MAQInstruct significantly improves the performance of event relation extraction across multiple LLMs.

new PGB: One-Shot Pruning for BERT via Weight Grouping and Permutation

Authors: Hyemin Lim, Jaeyeon Lee, Dong-Wan Choi

Abstract: Large pretrained language models such as BERT suffer from slow inference and high memory usage, due to their huge size. Recent approaches to compressing BERT rely on iterative pruning and knowledge distillation, which, however, are often too complicated and computationally intensive. This paper proposes a novel semi-structured one-shot pruning method for BERT, called $\textit{Permutation and Grouping for BERT}$ (PGB), which achieves high compression efficiency and sparsity while preserving accuracy. To this end, PGB identifies important groups of individual weights by permutation and prunes all other weights as a structure in both multi-head attention and feed-forward layers. Furthermore, if no important group is formed in a particular layer, PGB drops the entire layer to produce an even more compact model. Our experimental results on BERT$_{\text{BASE}}$ demonstrate that PGB outperforms the state-of-the-art structured pruning methods in terms of computational cost and accuracy preservation.

new Ontology-Guided, Hybrid Prompt Learning for Generalization in Knowledge Graph Question Answering

Authors: Longquan Jiang, Junbo Huang, Cedric M\"oller, Ricardo Usbeck

Abstract: Most existing Knowledge Graph Question Answering (KGQA) approaches are designed for a specific KG, such as Wikidata, DBpedia or Freebase. Due to the heterogeneity of the underlying graph schema, topology and assertions, most KGQA systems cannot be transferred to unseen Knowledge Graphs (KGs) without resource-intensive training data. We present OntoSCPrompt, a novel Large Language Model (LLM)-based KGQA approach with a two-stage architecture that separates semantic parsing from KG-dependent interactions. OntoSCPrompt first generates a SPARQL query structure (including SPARQL keywords such as SELECT, ASK, WHERE and placeholders for missing tokens) and then fills them with KG-specific information. To enhance the understanding of the underlying KG, we present an ontology-guided, hybrid prompt learning strategy that integrates KG ontology into the learning process of hybrid prompts (e.g., discrete and continuous vectors). We also present several task-specific decoding strategies to ensure the correctness and executability of generated SPARQL queries in both stages. Experimental results demonstrate that OntoSCPrompt performs as well as SOTA approaches without retraining on a number of KGQA datasets such as CWQ, WebQSP and LC-QuAD 1.0 in a resource-efficient manner and can generalize well to unseen domain-specific KGs like DBLP-QuAD and CoyPu KG Code: \href{https://github.com/LongquanJiang/OntoSCPrompt}{https://github.com/LongquanJiang/OntoSCPrompt}

URLs: https://github.com/LongquanJiang/OntoSCPrompt, https://github.com/LongquanJiang/OntoSCPrompt

new Quantification of Biodiversity from Historical Survey Text with LLM-based Best-Worst Scaling

Authors: Thomas Haider, Tobias Perschl, Malte Rehbein

Abstract: In this study, we evaluate methods to determine the frequency of species via quantity estimation from historical survey text. To that end, we formulate classification tasks and finally show that this problem can be adequately framed as a regression task using Best-Worst Scaling (BWS) with Large Language Models (LLMs). We test Ministral-8B, DeepSeek-V3, and GPT-4, finding that the latter two have reasonable agreement with humans and each other. We conclude that this approach is more cost-effective and similarly robust compared to a fine-grained multi-class approach, allowing automated quantity estimation across species.

new Exploring Imbalanced Annotations for Effective In-Context Learning

Authors: Hongfu Gao, Feipeng Zhang, Hao Zeng, Deyu Meng, Bingyi Jing, Hongxin Wei

Abstract: Large language models (LLMs) have shown impressive performance on downstream tasks through in-context learning (ICL), which heavily relies on the demonstrations selected from annotated datasets. Existing selection methods may hinge on the distribution of annotated datasets, which can often be long-tailed in real-world scenarios. In this work, we show that imbalanced class distributions in annotated datasets significantly degrade the performance of ICL across various tasks and selection methods. Moreover, traditional rebalance methods fail to ameliorate the issue of class imbalance in ICL. Our method is motivated by decomposing the distributional differences between annotated and test datasets into two-component weights: class-wise weights and conditional bias. The key idea behind our method is to estimate the conditional bias by minimizing the empirical error on a balanced validation dataset and to employ the two-component weights to modify the original scoring functions during selection. Our approach can prevent selecting too many demonstrations from a single class while preserving the effectiveness of the original selection methods. Extensive experiments demonstrate the effectiveness of our method, improving the average accuracy by up to 5.46 on common benchmarks with imbalanced datasets.

new Simulating the Emergence of Differential Case Marking with Communicating Neural-Network Agents

Authors: Yuchen Lian, Arianna Bisazza, Tessa Verhoef

Abstract: Differential Case Marking (DCM) refers to the phenomenon where grammatical case marking is applied selectively based on semantic, pragmatic, or other factors. The emergence of DCM has been studied in artificial language learning experiments with human participants, which were specifically aimed at disentangling the effects of learning from those of communication (Smith & Culbertson, 2020). Multi-agent reinforcement learning frameworks based on neural networks have gained significant interest to simulate the emergence of human-like linguistic phenomena. In this study, we employ such a framework in which agents first acquire an artificial language before engaging in communicative interactions, enabling direct comparisons to human result. Using a very generic communication optimization algorithm and neural-network learners that have no prior experience with language or semantic preferences, our results demonstrate that learning alone does not lead to DCM, but when agents communicate, differential use of markers arises. This supports Smith and Culbertson (2020)'s findings that highlight the critical role of communication in shaping DCM and showcases the potential of neural-agent models to complement experimental research on language evolution.

new Predicting Large Language Model Capabilities on Closed-Book QA Tasks Using Only Information Available Prior to Training

Authors: Changhao Jiang, Ming Zhang, Junjie Ye, Xiaoran Fan, Yifei Cao, Jiajun Sun, Zhiheng Xi, Shihan Dou, Yi Dong, Yujiong Shen, Jingqi Tong, Zhen Wang, Tao Liang, Zhihui Fei, Mingyang Wan, Guojun Ma, Qi Zhang, Tao Gui, Xuanjing Huang

Abstract: The GPT-4 technical report from OpenAI suggests that model performance on specific tasks can be predicted prior to training, though methodologies remain unspecified. This approach is crucial for optimizing resource allocation and ensuring data alignment with target tasks. To achieve this vision, we focus on predicting performance on Closed-book Question Answering (CBQA) tasks, which are closely tied to pre-training data and knowledge retention. We address three major challenges: 1) mastering the entire pre-training process, especially data construction; 2) evaluating a model's knowledge retention; and 3) predicting task-specific knowledge retention using only information available prior to training. To tackle these challenges, we pre-train three large language models (i.e., 1.6B, 7B, and 13B) using 560k dollars and 520k GPU hours. We analyze the pre-training data with knowledge triples and assess knowledge retention using established methods. Additionally, we introduce the SMI metric, an information-theoretic measure that quantifies the relationship between pre-training data, model size, and task-specific knowledge retention. Our experiments reveal a strong linear correlation ($\text{R}^2 > 0.84$) between the SMI metric and the model's accuracy on CBQA tasks across models of varying sizes (i.e., 1.1B, 1.6B, 7B, and 13B). The dataset, model, and code are available at https://github.com/yuhui1038/SMI.

URLs: https://github.com/yuhui1038/SMI.

new Controllable Emotion Generation with Emotion Vectors

Authors: Yurui Dong, Luozhijie Jin, Yao Yang, Bingjie Lu, Jiaxi Yang, Zhi Liu

Abstract: In recent years, technologies based on large-scale language models (LLMs) have made remarkable progress in many fields, especially in customer service, content creation, and embodied intelligence, showing broad application potential. However, The LLM's ability to express emotions with proper tone, timing, and in both direct and indirect forms is still insufficient but significant. Few works have studied on how to build the controlable emotional expression capability of LLMs. In this work, we propose a method for emotion expression output by LLMs, which is universal, highly flexible, and well controllable proved with the extensive experiments and verifications. This method has broad application prospects in fields involving emotions output by LLMs, such as intelligent customer service, literary creation, and home companion robots. The extensive experiments on various LLMs with different model-scales and architectures prove the versatility and the effectiveness of the proposed method.

new AttentionPredictor: Temporal Pattern Matters for Efficient LLM Inference

Authors: Qingyue Yang, Jie Wang, Xing Li, Zhihai Wang, Chen Chen, Lei Chen, Xianzhi Yu, Wulong Liu, Jianye Hao, Mingxuan Yuan, Bin Li

Abstract: With the development of large language models (LLMs), efficient inference through Key-Value (KV) cache compression has attracted considerable attention, especially for long-context generation. To compress the KV cache, recent methods identify critical KV tokens through heuristic ranking with attention scores. However, these methods often struggle to accurately determine critical tokens as they neglect the \textit{temporal patterns} in attention scores, resulting in a noticeable degradation in LLM performance. To address this challenge, we propose AttentionPredictor, which is the first learning-based critical token identification approach. Specifically, AttentionPredictor learns a lightweight convolution model to capture spatiotemporal patterns and predict the next-token attention score. An appealing feature of AttentionPredictor is that it accurately predicts the attention score while consuming negligible memory. Moreover, we propose a cross-token critical cache prefetching framework that hides the token estimation time overhead to accelerate the decoding stage. By retaining most of the attention information, AttentionPredictor achieves 16$\times$ KV cache compression with comparable LLM performance, significantly outperforming the state-of-the-art.

new LLMs to Support a Domain Specific Knowledge Assistant

Authors: Maria-Flavia Lovin

Abstract: This work presents a custom approach to developing a domain specific knowledge assistant for sustainability reporting using the International Financial Reporting Standards (IFRS). In this domain, there is no publicly available question-answer dataset, which has impeded the development of a high-quality chatbot to support companies with IFRS reporting. The two key contributions of this project therefore are: (1) A high-quality synthetic question-answer (QA) dataset based on IFRS sustainability standards, created using a novel generation and evaluation pipeline leveraging Large Language Models (LLMs). This comprises 1,063 diverse QA pairs that address a wide spectrum of potential user queries in sustainability reporting. Various LLM-based techniques are employed to create the dataset, including chain-of-thought reasoning and few-shot prompting. A custom evaluation framework is developed to assess question and answer quality across multiple dimensions, including faithfulness, relevance, and domain specificity. The dataset averages a score range of 8.16 out of 10 on these metrics. (2) Two architectures for question-answering in the sustainability reporting domain - a RAG pipeline and a fully LLM-based pipeline. The architectures are developed by experimenting, fine-tuning, and training on the QA dataset. The final pipelines feature an LLM fine-tuned on domain specific data and an industry classification component to improve the handling of complex queries. The RAG architecture achieves an accuracy of 85.32% on single-industry and 72.15% on cross-industry multiple-choice questions, outperforming the baseline approach by 4.67 and 19.21 percentage points, respectively. The LLM-based pipeline achieves an accuracy of 93.45% on single-industry and 80.30% on cross-industry multiple-choice questions, an improvement of 12.80 and 27.36 percentage points over the baseline, respectively.

new The Order Effect: Investigating Prompt Sensitivity in Closed-Source LLMs

Authors: Bryan Guan, Tanya Roosta, Peyman Passban, Mehdi Rezagholizadeh

Abstract: As large language models (LLMs) become integral to diverse applications, ensuring their reliability under varying input conditions is crucial. One key issue affecting this reliability is order sensitivity, wherein slight variations in input arrangement can lead to inconsistent or biased outputs. Although recent advances have reduced this sensitivity, the problem remains unresolved. This paper investigates the extent of order sensitivity in closed-source LLMs by conducting experiments across multiple tasks, including paraphrasing, relevance judgment, and multiple-choice questions. Our results show that input order significantly affects performance across tasks, with shuffled inputs leading to measurable declines in output accuracy. Few-shot prompting demonstrates mixed effectiveness and offers partial mitigation, however, fails to fully resolve the problem. These findings highlight persistent risks, particularly in high-stakes applications, and point to the need for more robust LLMs or improved input-handling techniques in future development.

new UltraIF: Advancing Instruction Following from the Wild

Authors: Kaikai An, Li Sheng, Ganqu Cui, Shuzheng Si, Ning Ding, Yu Cheng, Baobao Chang

Abstract: Instruction-following made modern large language models (LLMs) helpful assistants. However, the key to taming LLMs on complex instructions remains mysterious, for that there are huge gaps between models trained by open-source community and those trained by leading companies. To bridge the gap, we propose a simple and scalable approach UltraIF for building LLMs that can follow complex instructions with open-source data. UltraIF first decomposes real-world user prompts into simpler queries, constraints, and corresponding evaluation questions for the constraints. Then, we train an UltraComposer to compose constraint-associated prompts with evaluation questions. This prompt composer allows us to synthesize complicated instructions as well as filter responses with evaluation questions. In our experiment, for the first time, we successfully align LLaMA-3.1-8B-Base to catch up with its instruct version on 5 instruction-following benchmarks without any benchmark information, using only 8B model as response generator and evaluator. The aligned model also achieved competitive scores on other benchmarks. Moreover, we also show that UltraIF could further improve LLaMA-3.1-8B-Instruct through self-alignment, motivating broader use cases for the method. Our code will be available at https://github.com/kkk-an/UltraIF.

URLs: https://github.com/kkk-an/UltraIF.

new Lexical Substitution is not Synonym Substitution: On the Importance of Producing Contextually Relevant Word Substitutes

Authors: Juraj Vladika, Stephen Meisenbacher, Florian Matthes

Abstract: Lexical Substitution is the task of replacing a single word in a sentence with a similar one. This should ideally be one that is not necessarily only synonymous, but also fits well into the surrounding context of the target word, while preserving the sentence's grammatical structure. Recent advances in Lexical Substitution have leveraged the masked token prediction task of Pre-trained Language Models to generate replacements for a given word in a sentence. With this technique, we introduce ConCat, a simple augmented approach which utilizes the original sentence to bolster contextual information sent to the model. Compared to existing approaches, it proves to be very effective in guiding the model to make contextually relevant predictions for the target word. Our study includes a quantitative evaluation, measured via sentence similarity and task performance. In addition, we conduct a qualitative human analysis to validate that users prefer the substitutions proposed by our method, as opposed to previous methods. Finally, we test our approach on the prevailing benchmark for Lexical Substitution, CoInCo, revealing potential pitfalls of the benchmark. These insights serve as the foundation for a critical discussion on the way in which Lexical Substitution is evaluated.

new The Best Instruction-Tuning Data are Those That Fit

Authors: Dylan Zhang, Qirun Dai, Hao Peng

Abstract: High-quality supervised fine-tuning (SFT) data are crucial for eliciting strong capabilities from pretrained large language models (LLMs). Typically, instructions are paired with multiple responses sampled from other LLMs, which are often out of the distribution of the target model to be fine-tuned. This, at scale, can lead to diminishing returns and even hurt the models' performance and robustness. We propose **GRAPE**, a novel SFT framework that accounts for the unique characteristics of the target model. For each instruction, it gathers responses from various LLMs and selects the one with the highest probability measured by the target model, indicating that it aligns most closely with the target model's pretrained distribution; it then proceeds with standard SFT training. We first evaluate GRAPE with a controlled experiment, where we sample various solutions for each question in UltraInteract from multiple models and fine-tune commonly used LMs like LLaMA3.1-8B, Mistral-7B, and Qwen2.5-7B on GRAPE-selected data. GRAPE significantly outperforms strong baselines, including distilling from the strongest model with an absolute gain of up to 13.8%, averaged across benchmarks, and training on 3x more data with a maximum performance improvement of 17.3%. GRAPE's strong performance generalizes to realistic settings. We experiment with the post-training data used for Tulu3 and Olmo-2. GRAPE outperforms strong baselines trained on 4.5 times more data by 6.1% and a state-of-the-art data selection approach by 3% on average performance. Remarkably, using 1/3 of the data and half the number of epochs, GRAPE enables LLaMA3.1-8B to surpass the performance of Tulu3-SFT by 3.5%.

new Sports and Women's Sports: Gender Bias in Text Generation with Olympic Data

Authors: Laura Biester

Abstract: Large Language Models (LLMs) have been shown to be biased in prior work, as they generate text that is in line with stereotypical views of the world or that is not representative of the viewpoints and values of historically marginalized demographic groups. In this work, we propose using data from parallel men's and women's events at the Olympic Games to investigate different forms of gender bias in language models. We define three metrics to measure bias, and find that models are consistently biased against women when the gender is ambiguous in the prompt. In this case, the model frequently retrieves only the results of the men's event with or without acknowledging them as such, revealing pervasive gender bias in LLMs in the context of athletics.

new A Classification System Approach in Predicting Chinese Censorship

Authors: Matt Prodani, Tianchu Ze, Yushen Hu

Abstract: This paper is dedicated to using a classifier to predict whether a Weibo post would be censored under the Chinese internet. Through randomized sampling from \citeauthor{Fu2021} and Chinese tokenizing strategies, we constructed a cleaned Chinese phrase dataset with binary censorship markings. Utilizing various probability-based information retrieval methods on the data, we were able to derive 4 logistic regression models for classification. Furthermore, we experimented with pre-trained transformers to perform similar classification tasks. After evaluating both the macro-F1 and ROC-AUC metrics, we concluded that the Fined-Tuned BERT model exceeds other strategies in performance.

new MAGA: MAssive Genre-Audience Reformulation to Pretraining Corpus Expansion

Authors: Xintong Hao, Ke Shen, Chenggang Li

Abstract: Despite the remarkable capabilities of large language models across various tasks, their continued scaling faces a critical challenge: the scarcity of high-quality pretraining data. While model architectures continue to evolve, the natural language data struggles to scale up. To tackle this bottleneck, we propose \textbf{MA}ssive \textbf{G}enre-\textbf{A}udience~(MAGA) reformulation method, which systematic synthesizes diverse, contextually-rich pretraining data from existing corpus. This work makes three main contributions: (1) We propose MAGA reformulation method, a lightweight and scalable approach for pretraining corpus expansion, and build a 770B tokens MAGACorpus. (2) We evaluate MAGACorpus with different data budget scaling strategies, demonstrating consistent improvements across various model sizes (134M-13B), establishing the necessity for next-generation large-scale synthetic pretraining language models. (3) Through comprehensive analysis, we investigate prompt engineering's impact on synthetic training collapse and reveal limitations in conventional collapse detection metrics using validation losses. Our work shows that MAGA can substantially expand training datasets while maintaining quality, offering a reliably pathway for scaling models beyond data limitations.

new TriNER: A Series of Named Entity Recognition Models For Hindi, Bengali & Marathi

Authors: Mohammed Amaan Dhamaskar, Rasika Ransing

Abstract: India's rich cultural and linguistic diversity poses various challenges in the domain of Natural Language Processing (NLP), particularly in Named Entity Recognition (NER). NER is a NLP task that aims to identify and classify tokens into different entity groups like Person, Location, Organization, Number, etc. This makes NER very useful for downstream tasks like context-aware anonymization. This paper details our work to build a multilingual NER model for the three most spoken languages in India - Hindi, Bengali & Marathi. We train a custom transformer model and fine tune a few pretrained models, achieving an F1 Score of 92.11 for a total of 6 entity groups. Through this paper, we aim to introduce a single model to perform NER and significantly reduce the inconsistencies in entity groups and tag names, across the three languages.

new How does a Multilingual LM Handle Multiple Languages?

Authors: Santhosh Kakarla, Gautama Shastry Bulusu Venkata, Aishwarya Gaddam

Abstract: Multilingual language models have significantly advanced due to rapid progress in natural language processing. Models like BLOOM 1.7B, trained on diverse multilingual datasets, aim to bridge linguistic gaps. However, their effectiveness in capturing linguistic knowledge, particularly for low-resource languages, remains an open question. This study critically examines MLMs capabilities in multilingual understanding, semantic representation, and cross-lingual knowledge transfer. While these models perform well for high-resource languages, they struggle with less-represented ones. Additionally, traditional evaluation methods often overlook their internal syntactic and semantic encoding. This research addresses key limitations through three objectives. First, it assesses semantic similarity by analyzing multilingual word embeddings for consistency using cosine similarity. Second, it examines BLOOM-1.7B and Qwen2 through Named Entity Recognition and sentence similarity tasks to understand their linguistic structures. Third, it explores cross-lingual knowledge transfer by evaluating generalization from high-resource to low-resource languages in sentiment analysis and text classification. By leveraging linguistic probing, performance metrics, and visualizations, this study provides insights into the strengths and limitations of MLMs. The findings aim to enhance multilingual NLP models, ensuring better support for both high- and low-resource languages, thereby promoting inclusivity in language technologies.

new A Methodology for Studying Linguistic and Cultural Change in China, 1900-1950

Authors: Spencer Dean Stewart

Abstract: This paper presents a quantitative approach to studying linguistic and cultural change in China during the first half of the twentieth century, a period that remains understudied in computational humanities research. The dramatic changes in Chinese language and culture during this time call for greater reflection on the tools and methods used for text analysis. This preliminary study offers a framework for analyzing Chinese texts from the late nineteenth and twentieth centuries, demonstrating how established methods such as word counts and word embeddings can provide new historical insights into the complex negotiations between Western modernity and Chinese cultural discourse.

new Beyond Prompt Content: Enhancing LLM Performance via Content-Format Integrated Prompt Optimization

Authors: Yuanye Liu, Jiahang Xu, Li Lyna Zhang, Qi Chen, Xuan Feng, Yang Chen, Zhongxin Guo, Yuqing Yang, Cheng Peng

Abstract: Large Language Models (LLMs) have shown significant capability across various tasks, with their real-world effectiveness often driven by prompt design. While recent research has focused on optimizing prompt content, the role of prompt formatting, a critical but often overlooked dimension, has received limited systematic investigation. In this paper, we introduce Content-Format Integrated Prompt Optimization (CFPO), an innovative methodology that jointly optimizes both prompt content and formatting through an iterative refinement process. CFPO leverages natural language mutations to explore content variations and employs a dynamic format exploration strategy that systematically evaluates diverse format options. Our extensive evaluations across multiple tasks and open-source LLMs demonstrate that CFPO demonstrates measurable performance improvements compared to content-only optimization methods. This highlights the importance of integrated content-format optimization and offers a practical, model-agnostic approach to enhancing LLM performance. Code will be available at https://github.com/HenryLau7/CFPO.

URLs: https://github.com/HenryLau7/CFPO.

new ScoreFlow: Mastering LLM Agent Workflows via Score-based Preference Optimization

Authors: Yinjie Wang, Ling Yang, Guohao Li, Mengdi Wang, Bryon Aragam

Abstract: Recent research has leveraged large language model multi-agent systems for complex problem-solving while trying to reduce the manual effort required to build them, driving the development of automated agent workflow optimization methods. However, existing methods remain inflexible due to representational limitations, a lack of adaptability, and poor scalability when relying on discrete optimization techniques. We address these challenges with ScoreFlow, a simple yet high-performance framework that leverages efficient gradient-based optimization in a continuous space. ScoreFlow incorporates Score-DPO, a novel variant of the direct preference optimization method that accounts for quantitative feedback. Across six benchmarks spanning question answering, coding, and mathematical reasoning, ScoreFlow achieves an 8.2% improvement over existing baselines. Moreover, it empowers smaller models to outperform larger ones with lower inference costs. Project: https://github.com/Gen-Verse/ScoreFlow

URLs: https://github.com/Gen-Verse/ScoreFlow

new BOUQuET: dataset, Benchmark and Open initiative for Universal Quality Evaluation in Translation

Authors: The Omnilingual MT Team, Pierre Andrews, Mikel Artetxe, Mariano Coria Meglioli, Marta R. Costa-juss\`a, Joe Chuang, David Dale, Cynthia Gao, Jean Maillard, Alex Mourachko, Christophe Ropers, Safiyyah Saleem, Eduardo S\'anchez, Ioannis Tsiamas, Arina Turkatenko, Albert Ventayol-Boada, Shireen Yates

Abstract: This paper presents BOUQuET, a multicentric and multi-register/domain dataset and benchmark, and its broader collaborative extension initiative. This dataset is handcrafted in non-English languages first, each of these source languages being represented among the 23 languages commonly used by half of the world's population and therefore having the potential to serve as pivot languages that will enable more accurate translations. The dataset is specially designed to avoid contamination and be multicentric, so as to enforce representation of multilingual language features. In addition, the dataset goes beyond the sentence level, as it is organized in paragraphs of various lengths. Compared with related machine translation (MT) datasets, we show that BOUQuET has a broader representation of domains while simplifying the translation task for non-experts. Therefore, BOUQuET is specially suitable for the open initiative and call for translation participation that we are launching to extend it to a multi-way parallel corpus to any written language.

new ChamaleonLLM: Batch-Aware Dynamic Low-Rank Adaptation via Inference-Time Clusters

Authors: Kamer Ali Yuksel, Hassan Sawaf

Abstract: Recent advances in large language models (LLMs) have shown remarkable performance across diverse tasks. However, these models are typically deployed with fixed weights, which limits their ability to adapt dynamically to the variability inherent in real-world data during inference. This paper introduces ChamaleonLLM, a novel framework that enables inference-time adaptation of LLMs by leveraging batch-aware clustering and on-the-fly generation of low-rank updates. Unlike traditional fine-tuning approaches such as Low-Rank Adaptation (LoRA) or methods that rely on a fixed set of pre-learned uniforms (changeable masks), our method dynamically generates adaptive modifications to the decoder weights based on the aggregated statistics of clustered batches. By intelligently grouping similar inputs and computing context-aware low-rank updates via a hyper-network, ChamaleonLLM achieves significant performance gains, outperforming conventional LoRA methods while eliminating the overhead of maintaining multiple expert models. Our experiments highlight the potential of our approach to serve as a versatile and highly adaptive solution for language model inference. ChamaleonLLM is open-sourced to ensure the reproducibility of our experiments: https://anonymous.4open.science/r/ChamaleonLLM/

URLs: https://anonymous.4open.science/r/ChamaleonLLM/

new Variation of sentence length across time and genre

Authors: Karolina Rudnicka

Abstract: The goal of this paper is threefold: i) to present some practical aspects of using full-text version of Corpus of Historical American English (COHA), the largest diachronic multi-genre corpus of the English language, in the investigation of a linguistic trend of change; ii) to test a widely held assumption that sentence length in written English has been steadily decreasing over the past few centuries; iii) to point to a possible link between the changes in sentence length and changes in the English syntactic usage. The empirical proof of concept for iii) is provided by the decline in the frequency of the non-finite purpose subordinator in order to. Sentence length, genre and the likelihood of occurrence of in order to are shown to be interrelated.

new Can Grammarly and ChatGPT accelerate language change? AI-powered technologies and their impact on the English language: wordiness vs. conciseness

Authors: Karolina Rudnicka

Abstract: The proliferation of NLP-powered language technologies, AI-based natural language generation models, and English as a mainstream means of communication among both native and non-native speakers make the output of AI-powered tools especially intriguing to linguists. This paper investigates how Grammarly and ChatGPT affect the English language regarding wordiness vs. conciseness. A case study focusing on the purpose subordinator in order to is presented to illustrate the way in which Grammarly and ChatGPT recommend shorter grammatical structures instead of longer and more elaborate ones. Although the analysed sentences were produced by native speakers, are perfectly correct, and were extracted from a language corpus of contemporary English, both Grammarly and ChatGPT suggest more conciseness and less verbosity, even for relatively short sentences. The present article argues that technologies such as Grammarly not only mirror language change but also have the potential to facilitate or accelerate it.

cross How Should I Build A Benchmark? Revisiting Code-Related Benchmarks For LLMs

Authors: Jialun Cao, Yuk-Kit Chan, Zixuan Ling, Wenxuan Wang, Shuqing Li, Mingwei Liu, Ruixi Qiao, Yuting Han, Chaozheng Wang, Boxi Yu, Pinjia He, Shuai Wang, Zibin Zheng, Michael R. Lyu, Shing-Chi Cheung

Abstract: Various benchmarks have been proposed to assess the performance of large language models (LLMs) in different coding scenarios. We refer to them as code-related benchmarks. However, there are no systematic guidelines by which such a benchmark should be developed to ensure its quality, reliability, and reproducibility. We propose How2Bench, which is comprised of a 55- 55-criteria checklist as a set of guidelines to govern the development of code-related benchmarks comprehensively. Using HOW2BENCH, we profiled 274 benchmarks released within the past decade and found concerning issues. Nearly 70% of the benchmarks did not take measures for data quality assurance; over 10% did not even open source or only partially open source. Many highly cited benchmarks have loopholes, including duplicated samples, incorrect reference codes/tests/prompts, and unremoved sensitive/confidential information. Finally, we conducted a human study involving 49 participants, which revealed significant gaps in awareness of the importance of data quality, reproducibility, and transparency.

cross From Informal to Formal -- Incorporating and Evaluating LLMs on Natural Language Requirements to Verifiable Formal Proofs

Authors: Jialun Cao, Yaojie Lu, Meiziniu Li, Haoyang Ma, Haokun Li, Mengda He, Cheng Wen, Le Sun, Hongyu Zhang, Shengchao Qin, Shing-Chi Cheung, Cong Tian

Abstract: The research in AI-based formal mathematical reasoning has shown an unstoppable growth trend. These studies have excelled in mathematical competitions like IMO, showing significant progress. However, these studies intertwined multiple skills simultaneously, i.e., problem-solving, reasoning, and writing formal specifications, making it hard to precisely identify the LLMs' strengths and weaknesses in each task. This paper focuses on formal verification, an immediate application scenario of formal reasoning, and decomposes it into six sub-tasks. We constructed 18k high-quality instruction-response pairs across five mainstream formal specification languages (Coq, Lean4, Dafny, ACSL, and TLA+) in six formal-verification-related tasks by distilling GPT-4o. They are split into a 14k+ fine-tuning dataset FM-alpaca and a 4k benchmark FM-Bench. We found that LLMs are good at writing proof segments when given either the code, or the detailed description of proof steps. Also, the fine-tuning brought about a nearly threefold improvement at most. Interestingly, we observed that fine-tuning with formal data also enhances mathematics, reasoning, and coding abilities. We hope our findings inspire further research. Fine-tuned models are released to facilitate subsequent studies

cross Teaching Language Models to Critique via Reinforcement Learning

Authors: Zhihui Xie, Jie chen, Liyu Chen, Weichao Mao, Jingjing Xu, Lingpeng Kong

Abstract: Teaching large language models (LLMs) to critique and refine their outputs is crucial for building systems that can iteratively improve, yet it is fundamentally limited by the ability to provide accurate judgments and actionable suggestions. In this work, we study LLM critics for code generation and propose $\texttt{CTRL}$, a framework for $\texttt{C}$ritic $\texttt{T}$raining via $\texttt{R}$einforcement $\texttt{L}$earning, which trains a critic model to generate feedback that maximizes correction performance for a fixed generator model without human supervision. Our results demonstrate that critics trained with $\texttt{CTRL}$ significantly enhance pass rates and mitigate compounding errors across both base and stronger generator models. Furthermore, we show that these critic models act as accurate generative reward models and enable test-time scaling through iterative critique-revision, achieving up to 106.1% relative improvements across challenging code generation benchmarks.

cross An Empirical Exploration of ChatGPT's Ability to Support Problem Formulation Tasks for Mission Engineering and a Documentation of its Performance Variability

Authors: Max Ofsa, Taylan G. Topcu

Abstract: Systems engineering (SE) is evolving with the availability of generative artificial intelligence (AI) and the demand for a systems-of-systems perspective, formalized under the purview of mission engineering (ME) in the US Department of Defense. Formulating ME problems is challenging because they are open-ended exercises that involve translation of ill-defined problems into well-defined ones that are amenable for engineering development. It remains to be seen to which extent AI could assist problem formulation objectives. To that end, this paper explores the quality and consistency of multi-purpose Large Language Models (LLM) in supporting ME problem formulation tasks, specifically focusing on stakeholder identification. We identify a relevant reference problem, a NASA space mission design challenge, and document ChatGPT-3.5's ability to perform stakeholder identification tasks. We execute multiple parallel attempts and qualitatively evaluate LLM outputs, focusing on both their quality and variability. Our findings portray a nuanced picture. We find that the LLM performs well in identifying human-focused stakeholders but poorly in recognizing external systems and environmental factors, despite explicit efforts to account for these. Additionally, LLMs struggle with preserving the desired level of abstraction and exhibit a tendency to produce solution specific outputs that are inappropriate for problem formulation. More importantly, we document great variability among parallel threads, highlighting that LLM outputs should be used with caution, ideally by adopting a stochastic view of their abilities. Overall, our findings suggest that, while ChatGPT could reduce some expert workload, its lack of consistency and domain understanding may limit its reliability for problem formulation tasks.

cross REALEDIT: Reddit Edits As a Large-scale Empirical Dataset for Image Transformations

Authors: Peter Sushko, Ayana Bharadwaj, Zhi Yang Lim, Vasily Ilin, Ben Caffee, Dongping Chen, Mohammadreza Salehi, Cheng-Yu Hsieh, Ranjay Krishna

Abstract: Existing image editing models struggle to meet real-world demands. Despite excelling in academic benchmarks, they have yet to be widely adopted for real user needs. Datasets that power these models use artificial edits, lacking the scale and ecological validity necessary to address the true diversity of user requests. We introduce REALEDIT, a large-scale image editing dataset with authentic user requests and human-made edits sourced from Reddit. REALEDIT includes a test set of 9300 examples to evaluate models on real user requests. Our results show that existing models fall short on these tasks, highlighting the need for realistic training data. To address this, we introduce 48K training examples and train our REALEDIT model, achieving substantial gains - outperforming competitors by up to 165 Elo points in human judgment and 92 percent relative improvement on the automated VIEScore metric. We deploy our model on Reddit, testing it on new requests, and receive positive feedback. Beyond image editing, we explore REALEDIT's potential in detecting edited images by partnering with a deepfake detection non-profit. Finetuning their model on REALEDIT data improves its F1-score by 14 percentage points, underscoring the dataset's value for broad applications.

cross DocMIA: Document-Level Membership Inference Attacks against DocVQA Models

Authors: Khanh Nguyen, Raouf Kerkouche, Mario Fritz, Dimosthenis Karatzas

Abstract: Document Visual Question Answering (DocVQA) has introduced a new paradigm for end-to-end document understanding, and quickly became one of the standard benchmarks for multimodal LLMs. Automating document processing workflows, driven by DocVQA models, presents significant potential for many business sectors. However, documents tend to contain highly sensitive information, raising concerns about privacy risks associated with training such DocVQA models. One significant privacy vulnerability, exploited by the membership inference attack, is the possibility for an adversary to determine if a particular record was part of the model's training data. In this paper, we introduce two novel membership inference attacks tailored specifically to DocVQA models. These attacks are designed for two different adversarial scenarios: a white-box setting, where the attacker has full access to the model architecture and parameters, and a black-box setting, where only the model's outputs are available. Notably, our attacks assume the adversary lacks access to auxiliary datasets, which is more realistic in practice but also more challenging. Our unsupervised methods outperform existing state-of-the-art membership inference attacks across a variety of DocVQA models and datasets, demonstrating their effectiveness and highlighting the privacy risks in this domain.

cross Adaptive Semantic Prompt Caching with VectorQ

Authors: Luis Gaspar Schroeder, Shu Liu, Alejandro Cuadron, Mark Zhao, Stephan Krusche, Alfons Kemper, Matei Zaharia, Joseph E. Gonzalez

Abstract: Semantic prompt caches reduce the latency and cost of large language model (LLM) inference by reusing cached LLM-generated responses for semantically similar prompts. Vector similarity metrics assign a numerical score to quantify the similarity between an embedded prompt and its nearest neighbor in the cache. Existing systems rely on a static threshold to classify whether the similarity score is sufficiently high to result in a cache hit. We show that this one-size-fits-all threshold is insufficient across different prompts. We propose VectorQ, a framework to learn embedding-specific threshold regions that adapt to the complexity and uncertainty of an embedding. Through evaluations on a combination of four diverse datasets, we show that VectorQ consistently outperforms state-of-the-art systems across all static thresholds, achieving up to 12x increases in cache hit rate and error rate reductions up to 92%.

cross DiTAR: Diffusion Transformer Autoregressive Modeling for Speech Generation

Authors: Dongya Jia, Zhuo Chen, Jiawei Chen, Chenpeng Du, Jian Wu, Jian Cong, Xiaobin Zhuang, Chumin Li, Zhen Wei, Yuping Wang, Yuxuan Wang

Abstract: Several recent studies have attempted to autoregressively generate continuous speech representations without discrete speech tokens by combining diffusion and autoregressive models, yet they often face challenges with excessive computational loads or suboptimal outcomes. In this work, we propose Diffusion Transformer Autoregressive Modeling (DiTAR), a patch-based autoregressive framework combining a language model with a diffusion transformer. This approach significantly enhances the efficacy of autoregressive models for continuous tokens and reduces computational demands. DiTAR utilizes a divide-and-conquer strategy for patch generation, where the language model processes aggregated patch embeddings and the diffusion transformer subsequently generates the next patch based on the output of the language model. For inference, we propose defining temperature as the time point of introducing noise during the reverse diffusion ODE to balance diversity and determinism. We also show in the extensive scaling analysis that DiTAR has superb scalability. In zero-shot speech generation, DiTAR achieves state-of-the-art performance in robustness, speaker similarity, and naturalness.

cross Enhancing Online Learning Efficiency Through Heterogeneous Resource Integration with a Multi-Agent RAG System

Authors: Devansh Srivastav, Hasan Md Tusfiqur Alam, Afsaneh Asaei, Mahmoud Fazeli, Tanisha Sharma, Daniel Sonntag

Abstract: Efficient online learning requires seamless access to diverse resources such as videos, code repositories, documentation, and general web content. This poster paper introduces early-stage work on a Multi-Agent Retrieval-Augmented Generation (RAG) System designed to enhance learning efficiency by integrating these heterogeneous resources. Using specialized agents tailored for specific resource types (e.g., YouTube tutorials, GitHub repositories, documentation websites, and search engines), the system automates the retrieval and synthesis of relevant information. By streamlining the process of finding and combining knowledge, this approach reduces manual effort and enhances the learning experience. A preliminary user study confirmed the system's strong usability and moderate-high utility, demonstrating its potential to improve the efficiency of knowledge acquisition.

cross Leveraging Reasoning with Guidelines to Elicit and Utilize Knowledge for Enhancing Safety Alignment

Authors: Haoyu Wang, Zeyu Qin, Li Shen, Xueqian Wang, Minhao Cheng, Dacheng Tao

Abstract: Training safe LLMs is one of the most critical research challenge. However, the commonly used method, Refusal Training (RT), struggles to generalize against various OOD jailbreaking attacks. Many safety training methods have been proposed to address this issue. While they offer valuable insights, we aim to complement this line of research by investigating whether OOD attacks truly exceed the capability of RT model. Conducting evaluation with BoN, we observe significant improvements on generalization as N increases. This underscores that the model possesses sufficient safety-related latent knowledge, but RT fails to consistently elicit this knowledge when addressing OOD attacks. Further analysis based on domain adaptation reveals that training with direct refusal causes model to rely on superficial shortcuts, resulting in learning of non-robust representation mappings. Based on our findings, we propose training model to perform safety reasoning for each query. Reasoning supervision encourages model to perform more computations, explicitly eliciting and using latent knowledge through reasoning. To achieve this, we synthesize reasoning supervision based on pre-guidelines, training the model to reason in alignment with them, thereby effectively eliciting and utilizing latent knowledge from diverse perspectives. Extensive experiments show that our method significantly improves generalization performance against OOD attacks.

cross Llasa: Scaling Train-Time and Inference-Time Compute for Llama-based Speech Synthesis

Authors: Zhen Ye, Xinfa Zhu, Chi-Min Chan, Xinsheng Wang, Xu Tan, Jiahe Lei, Yi Peng, Haohe Liu, Yizhu Jin, Zheqi DAI, Hongzhan Lin, Jianyi Chen, Xingjian Du, Liumeng Xue, Yunlin Chen, Zhifei Li, Lei Xie, Qiuqiang Kong, Yike Guo, Wei Xue

Abstract: Recent advances in text-based large language models (LLMs), particularly in the GPT series and the o1 model, have demonstrated the effectiveness of scaling both training-time and inference-time compute. However, current state-of-the-art TTS systems leveraging LLMs are often multi-stage, requiring separate models (e.g., diffusion models after LLM), complicating the decision of whether to scale a particular model during training or testing. This work makes the following contributions: First, we explore the scaling of train-time and inference-time compute for speech synthesis. Second, we propose a simple framework Llasa for speech synthesis that employs a single-layer vector quantizer (VQ) codec and a single Transformer architecture to fully align with standard LLMs such as Llama. Our experiments reveal that scaling train-time compute for Llasa consistently improves the naturalness of synthesized speech and enables the generation of more complex and accurate prosody patterns. Furthermore, from the perspective of scaling inference-time compute, we employ speech understanding models as verifiers during the search, finding that scaling inference-time compute shifts the sampling modes toward the preferences of specific verifiers, thereby improving emotional expressiveness, timbre consistency, and content accuracy. In addition, we released the checkpoint and training code for our TTS model (1B, 3B, 8B) and codec model publicly available.

cross Multi-agent Architecture Search via Agentic Supernet

Authors: Guibin Zhang, Luyang Niu, Junfeng Fang, Kun Wang, Lei Bai, Xiang Wang

Abstract: Large Language Model (LLM)-empowered multi-agent systems extend the cognitive boundaries of individual agents through disciplined collaboration and interaction, while constructing these systems often requires labor-intensive manual designs. Despite the availability of methods to automate the design of agentic workflows, they typically seek to identify a static, complex, one-size-fits-all system, which, however, fails to dynamically allocate inference resources based on the difficulty and domain of each query. To address this challenge, we shift away from the pursuit of a monolithic agentic system, instead optimizing the \textbf{agentic supernet}, a probabilistic and continuous distribution of agentic architectures. We introduce MaAS, an automated framework that samples query-dependent agentic systems from the supernet, delivering high-quality solutions and tailored resource allocation (\textit{e.g.}, LLM calls, tool calls, token cost). Comprehensive evaluation across six benchmarks demonstrates that MaAS \textbf{(I)} requires only $6\sim45\%$ of the inference costs of existing handcrafted or automated multi-agent systems, \textbf{(II)} surpasses them by $0.54\%\sim11.82\%$, and \textbf{(III)} enjoys superior cross-dataset and cross-LLM-backbone transferability.

cross Great Models Think Alike and this Undermines AI Oversight

Authors: Shashwat Goel, Joschka Struber, Ilze Amanda Auzina, Karuna K Chandra, Ponnurangam Kumaraguru, Douwe Kiela, Ameya Prabhu, Matthias Bethge, Jonas Geiping

Abstract: As Language Model (LM) capabilities advance, evaluating and supervising them at scale is getting harder for humans. There is hope that other language models can automate both these tasks, which we refer to as "AI Oversight". We study how model similarity affects both aspects of AI oversight by proposing a probabilistic metric for LM similarity based on overlap in model mistakes. Using this metric, we first show that LLM-as-a-judge scores favor models similar to the judge, generalizing recent self-preference results. Then, we study training on LM annotations, and find complementary knowledge between the weak supervisor and strong student model plays a crucial role in gains from "weak-to-strong generalization". As model capabilities increase, it becomes harder to find their mistakes, and we might defer more to AI oversight. However, we observe a concerning trend -- model mistakes are becoming more similar with increasing capabilities, pointing to risks from correlated failures. Our work underscores the importance of reporting and correcting for model similarity, especially in the emerging paradigm of AI oversight.

cross Speak Easy: Eliciting Harmful Jailbreaks from LLMs with Simple Interactions

Authors: Yik Siu Chan, Narutatsu Ri, Yuxin Xiao, Marzyeh Ghassemi

Abstract: Despite extensive safety alignment efforts, large language models (LLMs) remain vulnerable to jailbreak attacks that elicit harmful behavior. While existing studies predominantly focus on attack methods that require technical expertise, two critical questions remain underexplored: (1) Are jailbroken responses truly useful in enabling average users to carry out harmful actions? (2) Do safety vulnerabilities exist in more common, simple human-LLM interactions? In this paper, we demonstrate that LLM responses most effectively facilitate harmful actions when they are both actionable and informative--two attributes easily elicited in multi-step, multilingual interactions. Using this insight, we propose HarmScore, a jailbreak metric that measures how effectively an LLM response enables harmful actions, and Speak Easy, a simple multi-step, multilingual attack framework. Notably, by incorporating Speak Easy into direct request and jailbreak baselines, we see an average absolute increase of 0.319 in Attack Success Rate and 0.426 in HarmScore in both open-source and proprietary LLMs across four safety benchmarks. Our work reveals a critical yet often overlooked vulnerability: Malicious users can easily exploit common interaction patterns for harmful intentions.

cross Ola: Pushing the Frontiers of Omni-Modal Language Model with Progressive Modality Alignment

Authors: Zuyan Liu, Yuhao Dong, Jiahui Wang, Ziwei Liu, Winston Hu, Jiwen Lu, Yongming Rao

Abstract: Recent advances in large language models, particularly following GPT-4o, have sparked increasing interest in developing omni-modal models capable of understanding more modalities. While some open-source alternatives have emerged, there is still a notable lag behind specialized single-modality models in performance. In this paper, we present Ola, an Omni-modal language model that achieves competitive performance across image, video, and audio understanding compared to specialized counterparts. The core design of Ola lies in its progressive modality alignment strategy that extends the supporting modality of the language model progressively. Our training pipeline begins with the most distinct modalities: image and text, then gradually expands the skill sets of the model using speech data that connects language and audio knowledge, and video data that connects all modalities. The progressive learning pipeline also enables us to maintain a relatively small size of the cross-modal alignment data, making developing omni-modal from existing vision-language models easy and less costly. Moreover, to unlock an advanced interactive experience like GPT-4o, we further design a sentence-wise decoding solution for streaming speech generation. Extensive experiments demonstrate that Ola surpasses existing open omni-modal LLMs across all modalities while achieving highly competitive performance compared to state-of-the-art specialized models of similar sizes. We aim to make Ola a fully open omni-modal understanding solution to advance future research in this emerging field. Model weights, code, and data are open-sourced at https://github.com/Ola-Omni/Ola.

URLs: https://github.com/Ola-Omni/Ola.

replace Preserving Knowledge Invariance: Rethinking Robustness Evaluation of Open Information Extraction

Authors: Ji Qi, Chuchun Zhang, Xiaozhi Wang, Kaisheng Zeng, Jifan Yu, Jinxin Liu, Jiuding Sun, Yuxiang Chen, Lei Hou, Juanzi Li, Bin Xu

Abstract: The robustness to distribution changes ensures that NLP models can be successfully applied in the realistic world, especially for information extraction tasks. However, most prior evaluation benchmarks have been devoted to validating pairwise matching correctness, ignoring the crucial measurement of robustness. In this paper, we present the first benchmark that simulates the evaluation of open information extraction models in the real world, where the syntactic and expressive distributions under the same knowledge meaning may drift variously. We design and annotate a large-scale testbed in which each example is a knowledge-invariant clique that consists of sentences with structured knowledge of the same meaning but with different syntactic and expressive forms. By further elaborating the robustness metric, a model is judged to be robust if its performance is consistently accurate on the overall cliques. We perform experiments on typical models published in the last decade as well as a popular large language model, the results show that the existing successful models exhibit a frustrating degradation, with a maximum drop of 23.43 F1 score. Our resources and code are available at https://github.com/qijimrc/ROBUST.

URLs: https://github.com/qijimrc/ROBUST.

replace Measuring Social Biases in Masked Language Models by Proxy of Prediction Quality

Authors: Rahul Zalkikar, Kanchan Chandra

Abstract: Transformer language models have achieved state-of-the-art performance for a variety of natural language tasks but have been shown to encode unwanted biases. We evaluate the social biases encoded by transformers trained with the masked language modeling objective using proposed proxy functions within an iterative masking experiment to measure the quality of transformer models' predictions and assess the preference of MLMs towards disadvantaged and advantaged groups. We find all models encode concerning social biases. We compare bias estimations with those produced by other evaluation methods using benchmark datasets and assess their alignment with human annotated biases. We extend previous work by evaluating social biases introduced after retraining an MLM under the masked language modeling objective and find proposed measures produce more accurate and sensitive estimations of biases introduced by retraining MLMs based on relative preference for biased sentences between models, while other methods tend to underestimate biases after retraining on sentences biased towards disadvantaged groups.

replace EBBS: An Ensemble with Bi-Level Beam Search for Zero-Shot Machine Translation

Authors: Yuqiao Wen, Behzad Shayegh, Chenyang Huang, Yanshuai Cao, Lili Mou

Abstract: The ability of zero-shot translation emerges when we train a multilingual model with certain translation directions; the model can then directly translate in unseen directions. Alternatively, zero-shot translation can be accomplished by pivoting through a third language (e.g., English). In our work, we observe that both direct and pivot translations are noisy and achieve less satisfactory performance. We propose EBBS, an ensemble method with a novel bi-level beam search algorithm, where each ensemble component explores its own prediction step by step at the lower level but they are synchronized by a "soft voting" mechanism at the upper level. Results on two popular multilingual translation datasets show that EBBS consistently outperforms direct and pivot translations as well as existing ensemble techniques. Further, we can distill the ensemble's knowledge back to the multilingual model to improve inference efficiency; profoundly, our EBBS-based distillation does not sacrifice, or even improves, the translation quality.

replace Does Mapo Tofu Contain Coffee? Probing LLMs for Food-related Cultural Knowledge

Authors: Li Zhou, Taelin Karidi, Wanlong Liu, Nicolas Garneau, Yong Cao, Wenyu Chen, Haizhou Li, Daniel Hershcovich

Abstract: Recent studies have highlighted the presence of cultural biases in Large Language Models (LLMs), yet often lack a robust methodology to dissect these phenomena comprehensively. Our work aims to bridge this gap by delving into the Food domain, a universally relevant yet culturally diverse aspect of human life. We introduce FmLAMA, a multilingual dataset centered on food-related cultural facts and variations in food practices. We analyze LLMs across various architectures and configurations, evaluating their performance in both monolingual and multilingual settings. By leveraging templates in six different languages, we investigate how LLMs interact with language-specific and cultural knowledge. Our findings reveal that (1) LLMs demonstrate a pronounced bias towards food knowledge prevalent in the United States; (2) Incorporating relevant cultural context significantly improves LLMs' ability to access cultural knowledge; (3) The efficacy of LLMs in capturing cultural nuances is highly dependent on the interplay between the probing language, the specific model architecture, and the cultural context in question. This research underscores the complexity of integrating cultural understanding into LLMs and emphasizes the importance of culturally diverse datasets to mitigate biases and enhance model performance across different cultural domains.

replace Beyond Chain-of-Thought: A Survey of Chain-of-X Paradigms for LLMs

Authors: Yu Xia, Rui Wang, Xu Liu, Mingyan Li, Tong Yu, Xiang Chen, Julian McAuley, Shuai Li

Abstract: Chain-of-Thought (CoT) has been a widely adopted prompting method, eliciting impressive reasoning abilities of Large Language Models (LLMs). Inspired by the sequential thought structure of CoT, a number of Chain-of-X (CoX) methods have been developed to address various challenges across diverse domains and tasks involving LLMs. In this paper, we provide a comprehensive survey of Chain-of-X methods for LLMs in different contexts. Specifically, we categorize them by taxonomies of nodes, i.e., the X in CoX, and application tasks. We also discuss the findings and implications of existing CoX methods, as well as potential future directions. Our survey aims to serve as a detailed and up-to-date resource for researchers seeking to apply the idea of CoT to broader scenarios.

replace Context Steering: Controllable Personalization at Inference Time

Authors: Jerry Zhi-Yang He, Sashrika Pandey, Mariah L. Schrum, Anca Dragan

Abstract: To deliver high-quality, personalized responses, large language models (LLMs) must effectively incorporate context -- personal, demographic, and cultural information specific to an end-user. For example, asking the model to explain Newton's second law with the context "I am a toddler" should produce a response different from when the context is "I am a physics professor". However, leveraging the context in practice is a nuanced and challenging task, and is often dependent on the specific situation or user base. The model must strike a balance between providing specific, personalized responses and maintaining general applicability. Current solutions, such as prompt-engineering and fine-tuning, require collection of contextually appropriate responses as examples, making them time-consuming and less flexible to use across different contexts. In this work, we introduce Context Steering (CoS) -- a simple, training-free decoding approach that amplifies the influence of the context in next token predictions. CoS computes contextual influence by comparing the output probabilities from two LLM forward passes: one that includes the context and one that does not. By linearly scaling the contextual influence, CoS allows practitioners to flexibly control the degree of personalization for different use cases. We show that CoS can be applied to autoregressive LLMs, and demonstrates strong performance in personalized recommendations. Additionally, we show that CoS can function as a Bayesian Generative model to infer and quantify correlations between open-ended texts, broadening its potential applications.

replace HMT: Hierarchical Memory Transformer for Efficient Long Context Language Processing

Authors: Zifan He, Yingqi Cao, Zongyue Qin, Neha Prakriya, Yizhou Sun, Jason Cong

Abstract: Transformer-based large language models (LLM) have been widely used in language processing applications. However, due to the memory constraints of the devices, most of them restrict the context window. Even though recurrent models in previous works can memorize past tokens to enable unlimited context and maintain effectiveness, they have ``flat'' memory architectures. Such architectures have limitations in selecting and filtering information. Since humans are good at learning and self-adjustment, we believe that imitating brain memory hierarchy is beneficial for model memorization. Thus, we propose the Hierarchical Memory Transformer (HMT), a novel framework that facilitates a model's long-context processing ability by imitating human memorization behavior. Leveraging memory-augmented segment-level recurrence, we organize the memory hierarchy by preserving tokens from early input segments, passing memory embeddings along the sequence, and recalling relevant information from history. Evaluating general language modeling, question-answering tasks, and the summarization task, we show that HMT consistently improves the long-context processing ability of existing models. Furthermore, HMT achieves a comparable or superior generation quality to long-context LLMs with $2 \sim 57\times$ fewer parameters and $2.5 \sim 116\times$ less inference memory, significantly outperforming previous memory-augmented models. Code on Github: https://github.com/OswaldHe/HMT-pytorch.

URLs: https://github.com/OswaldHe/HMT-pytorch.

replace Efficient Nearest Neighbor based Uncertainty Estimation for Natural Language Processing Tasks

Authors: Wataru Hashimoto, Hidetaka Kamigaito, Taro Watanabe

Abstract: Trustworthiness in model predictions is crucial for safety-critical applications in the real world. However, deep neural networks often suffer from the issues of uncertainty estimation, such as miscalibration. In this study, we propose $k$-Nearest Neighbor Uncertainty Estimation ($k$NN-UE), which is a new uncertainty estimation method that uses not only the distances from the neighbors, but also the ratio of labels in the neighbors. Experiments on sentiment analysis, natural language inference, and named entity recognition show that our proposed method outperforms the baselines and recent density-based methods in several calibration and uncertainty metrics. Moreover, our analyses indicate that approximate nearest neighbor search techniques reduce the inference overhead without significantly degrading the uncertainty estimation performance when they are appropriately combined.

replace Fact-Aware Multimodal Retrieval Augmentation for Accurate Medical Radiology Report Generation

Authors: Liwen Sun, James Zhao, Megan Han, Chenyan Xiong

Abstract: Multimodal foundation models hold significant potential for automating radiology report generation, thereby assisting clinicians in diagnosing cardiac diseases. However, generated reports often suffer from serious factual inaccuracy. In this paper, we introduce a fact-aware multimodal retrieval-augmented pipeline in generating accurate radiology reports (FactMM-RAG). We first leverage RadGraph to mine factual report pairs, then integrate factual knowledge to train a universal multimodal retriever. Given a radiology image, our retriever can identify high-quality reference reports to augment multimodal foundation models, thus enhancing the factual completeness and correctness of report generation. Experiments on two benchmark datasets show that our multimodal retriever outperforms state-of-the-art retrievers on both language generation and radiology-specific metrics, up to 6.5% and 2% score in F1CheXbert and F1RadGraph. Further analysis indicates that employing our factually-informed training strategy imposes an effective supervision signal, without relying on explicit diagnostic label guidance, and successfully propagates fact-aware capabilities from the multimodal retriever to the multimodal foundation model in radiology report generation.

replace AgentGen: Enhancing Planning Abilities for Large Language Model based Agent via Environment and Task Generation

Authors: Mengkang Hu, Pu Zhao, Can Xu, Qingfeng Sun, Jianguang Lou, Qingwei Lin, Ping Luo, Saravan Rajmohan

Abstract: Large Language Model-based agents have garnered significant attention and are becoming increasingly popular. Furthermore, planning ability is a crucial component of an LLM-based agent, which generally entails achieving a desired goal from an initial state. This paper investigates enhancing the planning abilities of LLMs through instruction tuning, referred to as agent training. Recent studies have demonstrated that utilizing expert-level trajectory for instruction-tuning LLMs effectively enhances their planning capabilities. However, existing work primarily focuses on synthesizing trajectories from manually designed planning tasks and environments. The labor-intensive nature of creating these environments and tasks impedes the generation of sufficiently varied and extensive trajectories. To address this limitation, this paper explores the automated synthesis of diverse environments and a gradual range of planning tasks, from easy to difficult. We introduce a framework, AgentGen, that leverages LLMs first to generate environments and subsequently generate planning tasks conditioned on these environments. Specifically, to improve environmental diversity, we propose using an inspiration corpus composed of various domain-specific text segments as the context for synthesizing environments. Moreover, to increase the difficulty diversity of generated planning tasks, we propose a bidirectional evolution method, Bi-Evol, that evolves planning tasks from easier and harder directions to synthesize a task set with a smoother difficulty curve. The evaluation results derived from AgentBoard show that AgentGen greatly improves LLMs' planning ability, e.g., the AgentGen instruction-tuned Llama-3.1-8B surpasses GPT-3.5 in overall performance. Moreover, the AgentGen-tuned Llama-3.1-70B model achieves state-of-the-art results in planning tasks. Project page: https://agent-gen.github.io/.

URLs: https://agent-gen.github.io/.

replace On Effects of Steering Latent Representation for Large Language Model Unlearning

Authors: Dang Huu-Tien, Trung-Tin Pham, Hoang Thanh-Tung, Naoya Inoue

Abstract: Representation Misdirection for Unlearning (RMU), which steers model representation in the intermediate layer to a target random representation, is an effective method for large language model (LLM) unlearning. Despite its high performance, the underlying cause and explanation remain underexplored. In this paper, we theoretically demonstrate that steering forget representations in the intermediate layer reduces token confidence, causing LLMs to generate wrong or nonsense responses. We investigate how the coefficient influences the alignment of forget-sample representations with the random direction and hint at the optimal coefficient values for effective unlearning across different network layers. We show that RMU unlearned models are robust against adversarial jailbreak attacks. Furthermore, our empirical analysis shows that RMU is less effective when applied to the middle and later layers in LLMs. To resolve this drawback, we propose Adaptive RMU--a simple yet effective alternative method that makes unlearning effective with most layers. Extensive experiments demonstrate that Adaptive RMU significantly improves the unlearning performance compared to prior art while incurring no additional computational cost.

replace MIDAS: Multi-level Intent, Domain, And Slot Knowledge Distillation for Multi-turn NLU

Authors: Yan Li, So-Eon Kim, Seong-Bae Park, Soyeon Caren Han

Abstract: Although Large Language Models(LLMs) can generate coherent and contextually relevant text, they often struggle to recognise the intent behind the human user's query. Natural Language Understanding (NLU) models, however, interpret the purpose and key information of user's input to enable responsive interactions. Existing NLU models generally map individual utterances to a dual-level semantic frame, involving sentence-level intent and word-level slot labels. However, real-life conversations primarily consist of multi-turn conversations, involving the interpretation of complex and extended dialogues. Researchers encounter challenges addressing all facets of multi-turn dialogue conversations using a unified single NLU model. This paper introduces a novel approach, MIDAS, leveraging a multi-level intent, domain, and slot knowledge distillation for multi-turn NLU. To achieve this, we construct distinct teachers for varying levels of conversation knowledge, namely, sentence-level intent detection, word-level slot filling, and conversation-level domain classification. These teachers are then fine-tuned to acquire specific knowledge of their designated levels. A multi-teacher loss is proposed to facilitate the combination of these multi-level teachers, guiding a student model in multi-turn dialogue tasks. The experimental results demonstrate the efficacy of our model in improving the overall multi-turn conversation understanding, showcasing the potential for advancements in NLU models through the incorporation of multi-level dialogue knowledge distillation techniques.

replace Unleashing the Power of Task-Specific Directions in Parameter Efficient Fine-tuning

Authors: Chongjie Si, Zhiyi Shi, Shifan Zhang, Xiaokang Yang, Hanspeter Pfister, Wei Shen

Abstract: Large language models demonstrate impressive performance on downstream tasks, yet requiring extensive resource consumption when fully fine-tuning all parameters. To mitigate this, Parameter Efficient Fine-Tuning (PEFT) strategies, such as LoRA, have been developed. In this paper, we delve into the concept of task-specific directions (TSDs)-critical for transitioning large models from pretrained states to task-specific enhancements in PEFT. We propose a framework to clearly define these directions and explore their properties, and practical utilization challenges. We then introduce a novel approach, LoRA-Dash, which aims to maximize the impact of TSDs during the fine-tuning process, thereby enhancing model performance on targeted tasks. Extensive experiments have conclusively demonstrated the effectiveness of LoRA-Dash, and in-depth analyses further reveal the underlying mechanisms of LoRA-Dash. The code is available at https://github.com/Chongjie-Si/Subspace-Tuning.

URLs: https://github.com/Chongjie-Si/Subspace-Tuning.

replace Language Models "Grok" to Copy

Authors: Ang Lv, Ruobing Xie, Xingwu Sun, Zhanhui Kang, Rui Yan

Abstract: We examine the pre-training dynamics of language models, focusing on their ability to copy text from preceding context--a fundamental skill for various LLM applications, including in-context learning (ICL) and retrieval-augmented generation (RAG). We propose a novel perspective that Transformer-based language models develop copying abilities similarly to grokking, which refers to sudden generalization on test set long after the model fit to the training set. Our experiments yield three arguments: (1) The pre-training loss decreases rapidly, while the context copying ability of models initially lags and then abruptly saturates. (2) The speed of developing copying ability is independent of the number of tokens trained, similarly to how grokking speed is unaffected by dataset size as long as the data distribution is preserved. (3) Induction heads, the attention heads responsible for copying, form from shallow to deep layers during training, mirroring the development of circuits in deeper layers during grokking. We contend that the connection between grokking and context copying can provide valuable insights for more effective language model training, ultimately improving in-context performance. For example, we demonstrated that techniques that enhance grokking, such as regularization, either accelerate or enhance the development of context copying.

replace Thought-Path Contrastive Learning via Premise-Oriented Data Augmentation for Logical Reading Comprehension

Authors: Chenxu Wang, Ping Jian, Zhen Yang

Abstract: Logical reading comprehension is a challenging task that entails grasping the underlying semantics of text and applying reasoning to deduce the correct answer. Prior researches have primarily focused on enhancing logical reasoning capabilities through Chain-of-Thought (CoT) or data augmentation. However, previous work constructing chain-of-thought rationales concentrates solely on analyzing correct options, neglecting the incorrect alternatives. Addtionally, earlier efforts on data augmentation by altering contexts rely on rule-based methods, which result in generated contexts that lack diversity and coherence. To address these issues, we propose a Premise-Oriented Data Augmentation (PODA) framework. This framework can generate CoT rationales including analyses for both correct and incorrect options, while constructing diverse and high-quality counterfactual contexts from incorrect candidate options. We integrate summarizing premises and identifying premises for each option into rationales. Subsequently, we employ multi-step prompts with identified premises to construct counterfactual context. To facilitate the model's capabilities to better differentiate the reasoning process associated with each option, we introduce a novel thought-path contrastive learning method that compares reasoning paths between the original and counterfactual samples. Experimental results on three representative LLMs demonstrate that our method can improve the baselines substantially across two challenging logical reasoning benchmarks (ReClor and LogiQA 2.0). The data and code are released at https://github.com/lalalamdbf/TPReasoner.

URLs: https://github.com/lalalamdbf/TPReasoner.

replace PACE: Abstractions for Communicating Efficiently

Authors: Jonathan D. Thomas, Andrea Silvi, Devdatt Dubhashi, Moa Johansson

Abstract: A central but unresolved aspect of problem-solving in AI is the capability to introduce and use abstractions, something humans excel at. Work in cognitive science has demonstrated that humans tend towards higher levels of abstraction when engaged in collaborative task-oriented communication, enabling gradually shorter and more information-efficient utterances. Several computational methods have attempted to replicate this phenomenon, but all make unrealistic simplifying assumptions about how abstractions are introduced and learned. Our method, Procedural Abstractions for Communicating Efficiently (PACE), overcomes these limitations through a neuro-symbolic approach. On the symbolic side, we draw on work from library learning for proposing abstractions. We combine this with neural methods for communication and reinforcement learning, via a novel use of bandit algorithms for controlling the exploration and exploitation trade-off in introducing new abstractions. PACE exhibits similar tendencies to humans on a collaborative construction task from the cognitive science literature, where one agent (the architect) instructs the other (the builder) to reconstruct a scene of block-buildings. PACE results in the emergence of an efficient language as a by-product of collaborative communication. Beyond providing mechanistic insights into human communication, our work serves as a first step to providing conversational agents with the ability for human-like communicative abstractions.

replace Recent Advances in Speech Language Models: A Survey

Authors: Wenqian Cui, Dianzhi Yu, Xiaoqi Jiao, Ziqiao Meng, Guangyan Zhang, Qichao Wang, Yiwen Guo, Irwin King

Abstract: Large Language Models (LLMs) have recently garnered significant attention, primarily for their capabilities in text-based interactions. However, natural human interaction often relies on speech, necessitating a shift towards voice-based models. A straightforward approach to achieve this involves a pipeline of ``Automatic Speech Recognition (ASR) + LLM + Text-to-Speech (TTS)", where input speech is transcribed to text, processed by an LLM, and then converted back to speech. Despite being straightforward, this method suffers from inherent limitations, such as information loss during modality conversion, significant latency due to the complex pipeline, and error accumulation across the three stages. To address these issues, Speech Language Models (SpeechLMs) -- end-to-end models that generate speech without converting from text -- have emerged as a promising alternative. This survey paper provides the first comprehensive overview of recent methodologies for constructing SpeechLMs, detailing the key components of their architecture and the various training recipes integral to their development. Additionally, we systematically survey the various capabilities of SpeechLMs, categorize their evaluation metrics, and discuss the challenges and future research directions in this rapidly evolving field. The GitHub repository is available at https://github.com/dreamtheater123/Awesome-SpeechLM-Survey

URLs: https://github.com/dreamtheater123/Awesome-SpeechLM-Survey

replace Root Defence Strategies: Ensuring Safety of LLM at the Decoding Level

Authors: Xinyi Zeng, Yuying Shang, Jiawei Chen, Jingyuan Zhang, Yu Tian

Abstract: Large language models (LLMs) have demonstrated immense utility across various industries. However, as LLMs advance, the risk of harmful outputs increases due to incorrect or malicious instruction prompts. While current methods effectively address jailbreak risks, they share common limitations: 1) Judging harmful responses from the prefill-level lacks utilization of the model's decoding outputs, leading to relatively lower effectiveness and robustness. 2) Rejecting potentially harmful responses based on a single evaluation can significantly impair the model's helpfulness.This paper examines the LLMs' capability to recognize harmful outputs, revealing and quantifying their proficiency in assessing the danger of previous tokens. Motivated by pilot experiment results, we design a robust defense mechanism at the decoding level. Our novel decoder-oriented, step-by-step defense architecture corrects harmful queries directly rather than rejecting them outright. We introduce speculative decoding to enhance usability and facilitate deployment to boost secure decoding speed. Extensive experiments demonstrate that our approach improves model security without compromising reasoning speed. Notably, our method leverages the model's ability to discern hazardous information, maintaining its helpfulness compared to existing methods.

replace Knowledge-Aware Query Expansion with Large Language Models for Textual and Relational Retrieval

Authors: Yu Xia, Junda Wu, Sungchul Kim, Tong Yu, Ryan A. Rossi, Haoliang Wang, Julian McAuley

Abstract: Large language models (LLMs) have been used to generate query expansions augmenting original queries for improving information search. Recent studies also explore providing LLMs with initial retrieval results to generate query expansions more grounded to document corpus. However, these methods mostly focus on enhancing textual similarities between search queries and target documents, overlooking document relations. For queries like "Find me a highly rated camera for wildlife photography compatible with my Nikon F-Mount lenses", existing methods may generate expansions that are semantically similar but structurally unrelated to user intents. To handle such semi-structured queries with both textual and relational requirements, in this paper we propose a knowledge-aware query expansion framework, augmenting LLMs with structured document relations from knowledge graph (KG). To further address the limitation of entity-based scoring in existing KG-based methods, we leverage document texts as rich KG node representations and use document-based relation filtering for our Knowledge-Aware Retrieval (KAR). Extensive experiments on three datasets of diverse domains show the advantages of our method compared against state-of-the-art baselines on textual and relational semi-structured retrieval.

replace Beyond Binary: Towards Fine-Grained LLM-Generated Text Detection via Role Recognition and Involvement Measurement

Authors: Zihao Cheng, Li Zhou, Feng Jiang, Benyou Wang, Haizhou Li

Abstract: The rapid development of large language models (LLMs), like ChatGPT, has resulted in the widespread presence of LLM-generated content on social media platforms, raising concerns about misinformation, data biases, and privacy violations, which can undermine trust in online discourse. While detecting LLM-generated content is crucial for mitigating these risks, current methods often focus on binary classification, failing to address the complexities of real-world scenarios like human-LLM collaboration. To move beyond binary classification and address these challenges, we propose a new paradigm for detecting LLM-generated content. This approach introduces two novel tasks: LLM Role Recognition (LLM-RR), a multi-class classification task that identifies specific roles of LLM in content generation, and LLM Influence Measurement (LLM-IM), a regression task that quantifies the extent of LLM involvement in content creation. To support these tasks, we propose LLMDetect, a benchmark designed to evaluate detectors' performance on these new tasks. LLMDetect includes the Hybrid News Detection Corpus (HNDC) for training detectors, as well as DetectEval, a comprehensive evaluation suite that considers five distinct cross-context variations and two multi-intensity variations within the same LLM role. This allows for a thorough assessment of detectors' generalization and robustness across diverse contexts. Our empirical validation of 10 baseline detection methods demonstrates that fine-tuned PLM-based models consistently outperform others on both tasks, while advanced LLMs face challenges in accurately detecting their own generated content. Our experimental results and analysis offer insights for developing more effective detection models for LLM-generated content. This research enhances the understanding of LLM-generated content and establishes a foundation for more nuanced detection methodologies.

replace A Complexity-Based Theory of Compositionality

Authors: Eric Elmoznino, Thomas Jiralerspong, Yoshua Bengio, Guillaume Lajoie

Abstract: Compositionality is believed to be fundamental to intelligence. In humans, it underlies the structure of thought, language, and higher-level reasoning. In AI, compositional representations can enable a powerful form of out-of-distribution generalization, in which a model systematically adapts to novel combinations of known concepts. However, while we have strong intuitions about what compositionality is, there currently exists no formal definition for it that is measurable and mathematical. Here, we propose such a definition, which we call representational compositionality, that accounts for and extends our intuitions about compositionality. The definition is conceptually simple, quantitative, grounded in algorithmic information theory, and applicable to any representation. Intuitively, representational compositionality states that a compositional representation satisfies three properties. First, it must be expressive. Second, it must be possible to re-describe the representation as a function of discrete symbolic sequences with re-combinable parts, analogous to sentences in natural language. Third, the function that relates these symbolic sequences to the representation, analogous to semantics in natural language, must be simple. Through experiments on both synthetic and real world data, we validate our definition of compositionality and show how it unifies disparate intuitions from across the literature in both AI and cognitive science. We also show that representational compositionality, while theoretically intractable, can be readily estimated using standard deep learning tools. Our definition has the potential to inspire the design of novel, theoretically-driven models that better capture the mechanisms of compositional thought.

replace CAST: Corpus-Aware Self-similarity Enhanced Topic modelling

Authors: Yanan Ma, Chenghao Xiao, Chenhan Yuan, Sabine N van der Veer, Lamiece Hassan, Chenghua Lin, Goran Nenadic

Abstract: Topic modelling is a pivotal unsupervised machine learning technique for extracting valuable insights from large document collections. Existing neural topic modelling methods often encode contextual information of documents, while ignoring contextual details of candidate centroid words, leading to the inaccurate selection of topic words due to the contextualization gap. In parallel, it is found that functional words are frequently selected over topical words. To address these limitations, we introduce CAST: Corpus-Aware Self-similarity Enhanced Topic modelling, a novel topic modelling method that builds upon candidate centroid word embeddings contextualized on the dataset, and a novel self-similarity-based method to filter out less meaningful tokens. Inspired by findings in contrastive learning that self-similarities of functional token embeddings in different contexts are much lower than topical tokens, we find self-similarity to be an effective metric to prevent functional words from acting as candidate topic words. Our approach significantly enhances the coherence and diversity of generated topics, as well as the topic model's ability to handle noisy data. Experiments on news benchmark datasets and one Twitter dataset demonstrate the method's superiority in generating coherent, diverse topics, and handling noisy data, outperforming strong baselines.

replace CROPE: Evaluating In-Context Adaptation of Vision and Language Models to Culture-Specific Concepts

Authors: Malvina Nikandrou, Georgios Pantazopoulos, Nikolas Vitsakis, Ioannis Konstas, Alessandro Suglia

Abstract: As Vision and Language models (VLMs) are reaching users across the globe, assessing their cultural understanding has become a critical challenge. In this paper, we introduce CROPE, a visual question answering benchmark designed to probe the knowledge of culture-specific concepts and evaluate the capacity for cultural adaptation through contextual information. This allows us to distinguish between parametric knowledge acquired during training and contextual knowledge provided during inference via visual and textual descriptions. Our evaluation of several state-of-the-art open VLMs shows large performance disparities between culture-specific and common concepts in the parametric setting. Moreover, experiments with contextual knowledge indicate that models struggle to effectively utilize multimodal information and bind culture-specific concepts to their depictions. Our findings reveal limitations in the cultural understanding and adaptability of current VLMs that need to be addressed toward more culturally inclusive models.

replace Relaxed Recursive Transformers: Effective Parameter Sharing with Layer-wise LoRA

Authors: Sangmin Bae, Adam Fisch, Hrayr Harutyunyan, Ziwei Ji, Seungyeon Kim, Tal Schuster

Abstract: Large language models (LLMs) are expensive to deploy. Parameter sharing offers a possible path towards reducing their size and cost, but its effectiveness in modern LLMs remains fairly limited. In this work, we revisit "layer tying" as form of parameter sharing in Transformers, and introduce novel methods for converting existing LLMs into smaller "Recursive Transformers" that share parameters across layers, with minimal loss of performance. Here, our Recursive Transformers are efficiently initialized from standard pretrained Transformers, but only use a single block of unique layers that is then repeated multiple times in a loop. We further improve performance by introducing Relaxed Recursive Transformers that add flexibility to the layer tying constraint via depth-wise low-rank adaptation (LoRA) modules, yet still preserve the compactness of the overall model. We show that our recursive models (e.g., recursive Gemma 1B) outperform both similar-sized vanilla pretrained models (such as TinyLlama 1.1B and Pythia 1B) and knowledge distillation baselines -- and can even recover most of the performance of the original "full-size" model (e.g., Gemma 2B with no shared parameters). Finally, we propose Continuous Depth-wise Batching, a promising new inference paradigm enabled by the Recursive Transformer when paired with early exiting. In a theoretical analysis, we show that this has the potential to lead to significant (2-3x) gains in inference throughput.

replace Fair Summarization: Bridging Quality and Diversity in Extractive Summaries

Authors: Sina Bagheri Nezhad, Sayan Bandyapadhyay, Ameeta Agrawal

Abstract: Fairness in multi-document summarization of user-generated content remains a critical challenge in natural language processing (NLP). Existing summarization methods often fail to ensure equitable representation across different social groups, leading to biased outputs. In this paper, we introduce two novel methods for fair extractive summarization: FairExtract, a clustering-based approach, and FairGPT, which leverages GPT-3.5-turbo with fairness constraints. We evaluate these methods using Divsumm summarization dataset of White-aligned, Hispanic, and African-American dialect tweets and compare them against relevant baselines. The results obtained using a comprehensive set of summarization quality metrics such as SUPERT, BLANC, SummaQA, BARTScore, and UniEval, as well as a fairness metric F, demonstrate that FairExtract and FairGPT achieve superior fairness while maintaining competitive summarization quality. Additionally, we introduce composite metrics (e.g., SUPERT+F, BLANC+F) that integrate quality and fairness into a single evaluation framework, offering a more nuanced understanding of the trade-offs between these objectives. This work highlights the importance of fairness in summarization and sets a benchmark for future research in fairness-aware NLP models.

replace Training Bilingual LMs with Data Constraints in the Targeted Language

Authors: Skyler Seto, Maartje ter Hoeve, Richard He Bai, Natalie Schluter, David Grangier

Abstract: Large language models are trained on massive scrapes of the web, as required by current scaling laws. Most progress is made for English, given its abundance of high-quality pretraining data. For most other languages, however, such high quality pretraining data is unavailable. In this work, we study how to boost pretrained model performance in a target language with insufficient pretraining data for training a high performing language model, by enlisting data from an auxiliary language for which high quality data is available. We study this by quantifying the performance gap between training with data in a data-rich auxiliary language compared with training in the target language, exploring the benefits of translation systems, studying the limitations of model scaling when data is limited in the target languages, and proposing new methods for upsampling data from the auxiliary language. Our results show that stronger auxiliary datasets result in performance gains without modification to the model or training objective for close languages, and, in particular, that performance gains due to the development of more information-rich English pretraining datasets can extend to targeted language settings with limited data.

replace OneKE: A Dockerized Schema-Guided LLM Agent-based Knowledge Extraction System

Authors: Yujie Luo, Xiangyuan Ru, Kangwei Liu, Lin Yuan, Mengshu Sun, Ningyu Zhang, Lei Liang, Zhiqiang Zhang, Jun Zhou, Lanning Wei, Da Zheng, Haofen Wang, Huajun Chen

Abstract: We introduce OneKE, a dockerized schema-guided knowledge extraction system, which can extract knowledge from the Web and raw PDF Books, and support various domains (science, news, etc.). Specifically, we design OneKE with multiple agents and a configure knowledge base. Different agents perform their respective roles, enabling support for various extraction scenarios. The configure knowledge base facilitates schema configuration, error case debugging and correction, further improving the performance. Empirical evaluations on benchmark datasets demonstrate OneKE's efficacy, while case studies further elucidate its adaptability to diverse tasks across multiple domains, highlighting its potential for broad applications. We have open-sourced the Code at https://github.com/zjunlp/OneKE and released a Video at http://oneke.openkg.cn/demo.mp4.

URLs: https://github.com/zjunlp/OneKE, http://oneke.openkg.cn/demo.mp4.

replace PIKE-RAG: sPecIalized KnowledgE and Rationale Augmented Generation

Authors: Jinyu Wang, Jingjing Fu, Rui Wang, Lei Song, Jiang Bian

Abstract: Despite notable advancements in Retrieval-Augmented Generation (RAG) systems that expand large language model (LLM) capabilities through external retrieval, these systems often struggle to meet the complex and diverse needs of real-world industrial applications. The reliance on retrieval alone proves insufficient for extracting deep, domain-specific knowledge performing in logical reasoning from specialized corpora. To address this, we introduce sPecIalized KnowledgE and Rationale Augmentation Generation (PIKE-RAG), focusing on extracting, understanding, and applying specialized knowledge, while constructing coherent rationale to incrementally steer LLMs toward accurate responses. Recognizing the diverse challenges of industrial tasks, we introduce a new paradigm that classifies tasks based on their complexity in knowledge extraction and application, allowing for a systematic evaluation of RAG systems' problem-solving capabilities. This strategic approach offers a roadmap for the phased development and enhancement of RAG systems, tailored to meet the evolving demands of industrial applications. Furthermore, we propose knowledge atomizing and knowledge-aware task decomposition to effectively extract multifaceted knowledge from the data chunks and iteratively construct the rationale based on original query and the accumulated knowledge, respectively, showcasing exceptional performance across various benchmarks.

replace K-COMP: Retrieval-Augmented Medical Domain Question Answering With Knowledge-Injected Compressor

Authors: Jeonghun Cho, Gary Geunbae Lee

Abstract: Retrieval-augmented question answering (QA) integrates external information and thereby increases the QA accuracy of reader models that lack domain knowledge. However, documents retrieved for closed domains require high expertise, so the reader model may have difficulty fully comprehending the text. Moreover, the retrieved documents contain thousands of tokens, some unrelated to the question. As a result, the documents include some inaccurate information, which could lead the reader model to mistrust the passages and could result in hallucinations. To solve these problems, we propose K-comp (Knowledge-injected compressor) which provides the knowledge required to answer correctly. The compressor automatically generates the prior knowledge necessary to facilitate the answer process prior to compression of the retrieved passages. Subsequently, the passages are compressed autoregressively, with the generated knowledge being integrated into the compression process. This process ensures alignment between the question intent and the compressed context. By augmenting this prior knowledge and concise context, the reader models are guided toward relevant answers and trust the context.

replace ReGLA: Refining Gated Linear Attention

Authors: Peng Lu, Ivan Kobyzev, Mehdi Rezagholizadeh, Boxing Chen, Philippe Langlais

Abstract: Recent advancements in Large Language Models (LLMs) have set themselves apart with their exceptional performance in complex language modelling tasks. However, these models are also known for their significant computational and storage requirements, primarily due to the quadratic computation complexity of softmax attention. To mitigate this issue, linear attention has been designed to reduce the quadratic space-time complexity that is inherent in standard transformers. In this work, we embarked on a comprehensive exploration of three key components that substantially impact the performance of the Gated Linear Attention module: feature maps, normalization, and the gating mechanism. We developed a feature mapping function to address some crucial issues that previous suggestions overlooked. Then we offered further rationale for the integration of normalization layers to stabilize the training process. Moreover, we explored the saturation phenomenon of the gating mechanism and augmented it with a refining module. We conducted extensive experiments and showed our architecture outperforms previous Gated Linear Attention mechanisms in extensive tasks including training from scratch and post-linearization with continual pre-training.

replace Premise-Augmented Reasoning Chains Improve Error Identification in Math reasoning with LLMs

Authors: Sagnik Mukherjee, Abhinav Chinta, Takyoung Kim, Tarun Anoop Sharma, Dilek Hakkani Tur

Abstract: Chain-of-Thought (CoT) prompting enhances mathematical reasoning in large language models (LLMs) by enabling detailed step-by-step solutions. However, due to the verbosity of LLMs, the resulting reasoning chains can be long, making it harder to verify the reasoning steps and trace issues resulting from dependencies between the steps that may be farther away in the sequence of steps. Importantly, mathematical reasoning allows each step to be derived from a small set of premises, which are a subset of the preceding steps in the reasoning chain. In this paper, we present a framework that identifies the premises for each step, to improve the evaluation of reasoning. We restructure conventional linear reasoning chains into Premise Augmented Reasoning Chains (PARC) by introducing premise links, resulting in a directed acyclic graph where the nodes are the steps and the edges are the premise links. Through experiments with a PARC-based dataset that we built, namely PERL (Premises and ERrors identification in LLMs), we demonstrate that LLMs can reliably identify premises within complex reasoning chains. In particular, even open-source LLMs achieve 90% recall in premise identification. We also show that PARC helps to identify errors in reasoning chains more reliably. The accuracy of error identification improves by 6% to 16% absolute when step-by-step verification is carried out in PARC under the premises. Our findings highlight the utility of premise-centric representations in addressing complex problem-solving tasks and open new avenues for improving the reliability of LLM-based reasoning evaluations.

replace Transformers Boost the Performance of Decision Trees on Tabular Data across Sample Sizes

Authors: Mayuka Jayawardhana, Renbo, Samuel Dooley, Valeriia Cherepanova, Andrew Gordon Wilson, Frank Hutter, Colin White, Tom Goldstein, Micah Goldblum

Abstract: Large language models (LLMs) perform remarkably well on tabular datasets in zero- and few-shot settings, since they can extract meaning from natural language column headers that describe features and labels. Similarly, TabPFN, a recent non-LLM transformer pretrained on numerous tables for in-context learning, has demonstrated excellent performance for dataset sizes up to a thousand samples. In contrast, gradient-boosted decision trees (GBDTs) are typically trained from scratch on each dataset without benefiting from pretraining data and must learn the relationships between columns from their entries alone since they lack natural language understanding. LLMs and TabPFN excel on small tabular datasets where a strong prior is essential, yet they are not competitive with GBDTs on medium or large datasets, since their context lengths are limited. In this paper, we propose a simple and lightweight approach for fusing large language models and TabPFN with gradient-boosted decision trees, which allows scalable GBDTs to benefit from the natural language capabilities and pretraining of transformers. We name our fusion methods LLM-Boost and PFN-Boost, respectively. While matching or surpassing the performance of the transformer at sufficiently small dataset sizes and GBDTs at sufficiently large sizes, LLM-Boost and PFN-Boost outperform both standalone components on a wide range of dataset sizes in between. We demonstrate state-of-the-art performance against numerous baselines and ensembling algorithms. We find that PFN-Boost achieves the best average performance among all methods we test for all but very small dataset sizes. We release our code at http://github.com/MayukaJ/LLM-Boost .

URLs: http://github.com/MayukaJ/LLM-Boost

replace-cross Had enough of experts? Quantitative knowledge retrieval from large language models

Authors: David Selby, Kai Spriestersbach, Yuichiro Iwashita, Mohammad Saad, Dennis Bappert, Archana Warrier, Sumantrak Mukherjee, Koichi Kise, Sebastian Vollmer

Abstract: Large language models (LLMs) have been extensively studied for their abilities to generate convincing natural language sequences, however their utility for quantitative information retrieval is less well understood. Here we explore the feasibility of LLMs as a mechanism for quantitative knowledge retrieval to aid two data analysis tasks: elicitation of prior distributions for Bayesian models and imputation of missing data. We introduce a framework that leverages LLMs to enhance Bayesian workflows by eliciting expert-like prior knowledge and imputing missing data. Tested on diverse datasets, this approach can improve predictive accuracy and reduce data requirements, offering significant potential in healthcare, environmental science and engineering applications. We discuss the implications and challenges of treating LLMs as 'experts'.

replace-cross MuJo: Multimodal Joint Feature Space Learning for Human Activity Recognition

Authors: Stefan Gerd Fritsch, Cennet Oguz, Vitor Fortes Rey, Lala Ray, Maximilian Kiefer-Emmanouilidis, Paul Lukowicz

Abstract: Human activity recognition (HAR) is a long-standing problem in artificial intelligence with applications in a broad range of areas, including healthcare, sports and fitness, security, and more. The performance of HAR in real-world settings is strongly dependent on the type and quality of the input signal that can be acquired. Given an unobstructed, high-quality camera view of a scene, computer vision systems, in particular in conjunction with foundation models, can today fairly reliably distinguish complex activities. On the other hand, recognition using modalities such as wearable sensors (which are often more broadly available, e.g., in mobile phones and smartwatches) is a more difficult problem, as the signals often contain less information and labeled training data is more difficult to acquire. To alleviate the need for labeled data, we introduce our comprehensive Fitness Multimodal Activity Dataset (FiMAD) in this work, which can be used with the proposed pre-training method MuJo (Multimodal Joint Feature Space Learning) to enhance HAR performance across various modalities. FiMAD was created using YouTube fitness videos and contains parallel video, language, pose, and simulated IMU sensor data. MuJo utilizes this dataset to learn a joint feature space for these modalities. We show that classifiers pre-trained on FiMAD can increase the performance on real HAR datasets such as MM-Fit, MyoGym, MotionSense, and MHEALTH. For instance, on MM-Fit, we achieve a Macro F1-Score of up to 0.855 when fine-tuning on only 2% of the training data and 0.942 when utilizing the complete training set for classification tasks. We compare our approach with other self-supervised ones and show that, unlike them, ours consistently improves compared to the baseline network performance while also providing better data efficiency.

replace-cross Why Has Predicting Downstream Capabilities of Frontier AI Models with Scale Remained Elusive?

Authors: Rylan Schaeffer, Hailey Schoelkopf, Brando Miranda, Gabriel Mukobi, Varun Madan, Adam Ibrahim, Herbie Bradley, Stella Biderman, Sanmi Koyejo

Abstract: Predicting changes from scaling advanced AI systems is a desirable property for engineers, economists, governments and industry alike, and, while a well-established literature exists on how pretraining performance scales, predictable scaling behavior on downstream capabilities remains elusive. While many factors are certainly responsible, this paper identifies a significant factor that makes predicting scaling behavior on widely used multiple-choice question answering benchmarks challenging and illuminates a path towards making such downstream evaluations predictable with scale. Using five model families and twelve well-established multiple-choice benchmarks, we demonstrate that downstream performance is computed from negative log likelihoods via a sequence of transformations that progressively degrades the statistical relationship between performance and scale. We then pinpoint the mechanism causing this degradation: downstream metrics require comparing the correct choice against a small number of specific incorrect choices, meaning accurately predicting downstream capabilities requires predicting not just how probability mass concentrates on the correct choice with scale, but also how probability mass fluctuates on the alternative incorrect choices with scale. We empirically study how probability mass on the correct choice co-varies with probability mass on incorrect choices with increasing compute, suggesting that scaling laws for \textit{incorrect} choices might be achievable. Our work also explains why pretraining scaling laws are commonly regarded as more predictable than downstream capabilities and contributes towards establishing scaling-predictable evaluations of frontier AI models.

replace-cross TourRank: Utilizing Large Language Models for Documents Ranking with a Tournament-Inspired Strategy

Authors: Yiqun Chen, Qi Liu, Yi Zhang, Weiwei Sun, Xinyu Ma, Wei Yang, Daiting Shi, Jiaxin Mao, Dawei Yin

Abstract: Large Language Models (LLMs) are increasingly employed in zero-shot documents ranking, yielding commendable results. However, several significant challenges still persist in LLMs for ranking: (1) LLMs are constrained by limited input length, precluding them from processing a large number of documents simultaneously; (2) The output document sequence is influenced by the input order of documents, resulting in inconsistent ranking outcomes; (3) Achieving a balance between cost and ranking performance is challenging. To tackle these issues, we introduce a novel documents ranking method called TourRank, which is inspired by the sport tournaments, such as FIFA World Cup. Specifically, we 1) overcome the limitation in input length and reduce the ranking latency by incorporating a multi-stage grouping strategy similar to the parallel group stage of sport tournaments; 2) improve the ranking performance and robustness to input orders by using a points system to ensemble multiple ranking results. We test TourRank with different LLMs on the TREC DL datasets and the BEIR benchmark. The experimental results demonstrate that TourRank delivers state-of-the-art performance at a modest cost. The code of TourRank can be seen on https://github.com/chenyiqun/TourRank.

URLs: https://github.com/chenyiqun/TourRank.

replace-cross Evaluating Numerical Reasoning in Text-to-Image Models

Authors: Ivana Kaji\'c, Olivia Wiles, Isabela Albuquerque, Matthias Bauer, Su Wang, Jordi Pont-Tuset, Aida Nematzadeh

Abstract: Text-to-image generative models are capable of producing high-quality images that often faithfully depict concepts described using natural language. In this work, we comprehensively evaluate a range of text-to-image models on numerical reasoning tasks of varying difficulty, and show that even the most advanced models have only rudimentary numerical skills. Specifically, their ability to correctly generate an exact number of objects in an image is limited to small numbers, it is highly dependent on the context the number term appears in, and it deteriorates quickly with each successive number. We also demonstrate that models have poor understanding of linguistic quantifiers (such as "a few" or "as many as"), the concept of zero, and struggle with more advanced concepts such as partial quantities and fractional representations. We bundle prompts, generated images and human annotations into GeckoNum, a novel benchmark for evaluation of numerical reasoning.

replace-cross ColPali: Efficient Document Retrieval with Vision Language Models

Authors: Manuel Faysse, Hugues Sibille, Tony Wu, Bilel Omrani, Gautier Viaud, C\'eline Hudelot, Pierre Colombo

Abstract: Documents are visually rich structures that convey information through text, but also figures, page layouts, tables, or even fonts. Since modern retrieval systems mainly rely on the textual information they extract from document pages to index documents -often through lengthy and brittle processes-, they struggle to exploit key visual cues efficiently. This limits their capabilities in many practical document retrieval applications such as Retrieval Augmented Generation (RAG). To benchmark current systems on visually rich document retrieval, we introduce the Visual Document Retrieval Benchmark ViDoRe, composed of various page-level retrieval tasks spanning multiple domains, languages, and practical settings. The inherent complexity and performance shortcomings of modern systems motivate a new concept; doing document retrieval by directly embedding the images of the document pages. We release ColPali, a Vision Language Model trained to produce high-quality multi-vector embeddings from images of document pages. Combined with a late interaction matching mechanism, ColPali largely outperforms modern document retrieval pipelines while being drastically simpler, faster and end-to-end trainable. We release models, data, code and benchmarks under open licenses at https://hf.co/vidore.

URLs: https://hf.co/vidore.

replace-cross The Role of Network and Identity in the Diffusion of Hashtags

Authors: Aparna Ananthasubramaniam, Yufei 'Louise' Zhu, David Jurgens, Daniel Romero

Abstract: The diffusion of culture online is theorized to be influenced by many interacting social factors (e.g., network and identity). However, most existing computational cascade models consider just a single factor (e.g., network or identity). This work offers a new framework for teasing apart the mechanisms underlying hashtag cascades. We curate a new dataset of 1,337 hashtags representing cultural innovation online, develop a 10-factor evaluation framework for comparing empirical and simulated cascades, and show that a combined network+identity model better simulates hashtag cascades than network- or identity-only counterfactuals. We also explore heterogeneity in performance: While a combined network+identity model best predicts the popularity of cascades, a network-only model best predicts cascade growth and an identity-only model best predicts adopter composition. The network+identity model has the highest comparative advantage among hashtags used for expressing racial or regional identity and talking about sports or news. In fact, we are able to predict what combination of network and/or identity best models each hashtag and use this to further improve performance. Our results show the utility of models incorporating the interactions of network, identity, and other social factors in the diffusion of hashtags in social media.

replace-cross How Reliable are Causal Probing Interventions?

Authors: Marc Canby, Adam Davies, Chirag Rastogi, Julia Hockenmaier

Abstract: Causal probing aims to analyze foundation models by examining how intervening on their representation of various latent properties impacts their outputs. Recent works have cast doubt on the theoretical basis of several leading causal probing methods, but it has been unclear how to systematically evaluate the effectiveness of these methods in practice. To address this, we define two key causal probing desiderata: completeness (how thoroughly the representation of the target property has been transformed) and selectivity (how little non-targeted properties have been impacted). We find that there is an inherent tradeoff between the two, which we define as reliability, their harmonic mean. We introduce an empirical analysis framework to measure and evaluate these quantities, allowing us to make the first direct comparisons between different families of leading causal probing methods (e.g., linear vs. nonlinear, or concept removal vs. counterfactual interventions). We find that: (1) no method is reliable across all layers; (2) more reliable methods have a greater impact on LLM behavior; (3) nonlinear interventions are more reliable in early and intermediate layers, and linear interventions are more reliable in later layers; and (4) concept removal methods are far less reliable than counterfactual interventions, suggesting that they may not be an effective approach to causal probing.

replace-cross Towards a Theoretical Understanding of Synthetic Data in LLM Post-Training: A Reverse-Bottleneck Perspective

Authors: Zeyu Gan, Yong Liu

Abstract: Synthetic data has become a pivotal resource in post-training tasks for large language models (LLMs) due to the scarcity of high-quality, specific data. While various methods have been developed to generate synthetic data, there remains a discernible gap between the practical effects of synthetic data and our theoretical comprehension. To address this challenge, we commence by presenting a detailed modeling of the prevalent synthetic data generation process. Building upon this modeling, we demonstrate that the generalization capability of the post-trained model is critically determined by the information gain derived from the generative model, as analyzed from a novel reverse-bottleneck perspective. Moreover, we introduce the concept of Generalization Gain via Mutual Information (GGMI) and elucidate the relationship between generalization gain and information gain. This analysis serves as a theoretical foundation for synthetic data generation and further highlights its connection with the generalization capability of post-trained models, offering an understanding about the design of synthetic data generation techniques and the optimization of the post-training process. We open-source our code at https://github.com/ZyGan1999/Towards-a-Theoretical-Understanding-of-Synthetic-Data-in-LLM-Post-Training.

URLs: https://github.com/ZyGan1999/Towards-a-Theoretical-Understanding-of-Synthetic-Data-in-LLM-Post-Training.

replace-cross House of Cards: Massive Weights in LLMs

Authors: Jaehoon Oh, Seungjun Shin, Dokwan Oh

Abstract: Massive activations, which manifest in specific feature dimensions of hidden states, introduce a significant bias in large language models (LLMs), leading to an overemphasis on the corresponding token. In this paper, we identify that massive activations originate not from the hidden state but from the intermediate state of a feed-forward network module in an early layer. Expanding on the previous observation that massive activations occur only in specific feature dimensions, we dive deep into the weights that cause massive activations. Specifically, we define top-$k$ massive weights as the weights that contribute to the dimensions with the top-$k$ magnitudes in the intermediate state. When these massive weights are set to zero, the functionality of LLMs is entirely disrupted. However, when all weights except for massive weights are set to zero, it results in a relatively minor performance drop, even though a much larger number of weights are set to zero. This implies that during the pre-training process, learning is dominantly focused on massive weights. Building on this observation, we propose a simple plug-and-play method called MacDrop (massive weights curriculum dropout), to rely less on massive weights during parameter-efficient fine-tuning. This method applies dropout to the pre-trained massive weights, starting with a high dropout probability and gradually decreasing it as fine-tuning progresses. Through various experiments, including zero-shot downstream tasks, long-context tasks, and ablation studies, we demonstrate that \texttt{MacDrop} generally improves performance and strengthens robustness.

replace-cross Rethinking Reward Model Evaluation: Are We Barking up the Wrong Tree?

Authors: Xueru Wen, Jie Lou, Yaojie Lu, Hongyu Lin, Xing Yu, Xinyu Lu, Ben He, Xianpei Han, Debing Zhang, Le Sun

Abstract: Reward Models (RMs) are crucial for aligning language models with human preferences. Currently, the evaluation of RMs depends on measuring accuracy against a validation set of manually annotated preference data. Although this method is straightforward and widely adopted, the relationship between RM accuracy and downstream policy performance remains under-explored. In this work, we conduct experiments in a synthetic setting to investigate how differences in RM measured by accuracy translate into gaps in optimized policy performance. Our findings reveal that while there is a weak positive correlation between accuracy and downstream performance, policies optimized towards RMs with similar accuracy can exhibit quite different performance. Moreover, we discover that the way of measuring accuracy significantly impacts its ability to predict the final policy performance. Through the lens of the Regressional Goodhart effect, we recognize that accuracy, when used for measuring RM quality, can fail to fully capture the potential RM overoptimization. This underscores the inadequacy of relying solely on accuracy to reflect their impact on policy optimization.

replace-cross Unintentional Unalignment: Likelihood Displacement in Direct Preference Optimization

Authors: Noam Razin, Sadhika Malladi, Adithya Bhaskar, Danqi Chen, Sanjeev Arora, Boris Hanin

Abstract: Direct Preference Optimization (DPO) and its variants are increasingly used for aligning language models with human preferences. Although these methods are designed to teach a model to generate preferred responses more frequently relative to dispreferred responses, prior work has observed that the likelihood of preferred responses often decreases during training. The current work sheds light on the causes and implications of this counter-intuitive phenomenon, which we term likelihood displacement. We demonstrate that likelihood displacement can be catastrophic, shifting probability mass from preferred responses to responses with an opposite meaning. As a simple example, training a model to prefer $\texttt{No}$ over $\texttt{Never}$ can sharply increase the probability of $\texttt{Yes}$. Moreover, when aligning the model to refuse unsafe prompts, we show that such displacement can unintentionally lead to unalignment, by shifting probability mass from preferred refusal responses to harmful responses (e.g., reducing the refusal rate of Llama-3-8B-Instruct from 74.4% to 33.4%). We theoretically characterize that likelihood displacement is driven by preferences that induce similar embeddings, as measured by a centered hidden embedding similarity (CHES) score. Empirically, the CHES score enables identifying which training samples contribute most to likelihood displacement in a given dataset. Filtering out these samples effectively mitigated unintentional unalignment in our experiments. More broadly, our results highlight the importance of curating data with sufficiently distinct preferences, for which we believe the CHES score may prove valuable.

replace-cross Mitigating Forgetting in LLM Supervised Fine-Tuning and Preference Learning

Authors: Heshan Fernando, Han Shen, Parikshit Ram, Yi Zhou, Horst Samulowitz, Nathalie Baracaldo, Tianyi Chen

Abstract: Post-training of pre-trained LLMs, which typically consists of the supervised fine-tuning (SFT) stage and the preference learning (RLHF or DPO) stage, is crucial to effective and safe LLM applications. The widely adopted approach in post-training popular open-source LLMs is to sequentially perform SFT and RLHF/DPO. However, sequential training is sub-optimal in terms of SFT and RLHF/DPO trade-off: the LLM gradually forgets about the first stage's training when undergoing the second stage's training. We theoretically prove the sub-optimality of sequential post-training. Furthermore, we propose a practical joint post-training framework with theoretical convergence guarantees and empirically outperforms sequential post-training framework, while having similar computational cost. Our code is available at https://github.com/heshandevaka/XRIGHT.

URLs: https://github.com/heshandevaka/XRIGHT.

replace-cross Adapting While Learning: Grounding LLMs for Scientific Problems with Intelligent Tool Usage Adaptation

Authors: Bohan Lyu, Yadi Cao, Duncan Watson-Parris, Leon Bergen, Taylor Berg-Kirkpatrick, Rose Yu

Abstract: Large Language Models (LLMs) demonstrate promising capabilities in solving simple scientific problems but, even with domain-specific fine-tuning, often produce hallucinations for complex ones. While integrating LLMs with tools can mitigate this reliability issue, models finetuned on tool usage only often over-rely on them, incurring unnecessary costs from resource-intensive scientific tools even for simpler problems. Inspired by how human experts assess the complexity of the problem before choosing the solutions, we propose a novel two-component fine-tuning method, Adapting While Learning (AWL). In the first component, World Knowledge Learning (WKL), LLMs internalize scientific knowledge by learning from tools-generated solutions. In the second component, Tool Usage Adaptation (TUA), we classify questions as easy or hard based on the WKL-trained model's accuracy, and train it to maintain direct reasoning for simple problems while switching to tools for challenging ones. We validate our method on 6 scientific benchmark datasets in climate science, epidemiology, and mathematics. Compared to the base 8B model, our trained models achieve 28.27% higher answer accuracy and 13.76% better tool usage accuracy, even surpassing state-of-the-art models including GPT-4 and Claude-3.5 on 4 custom-created datasets.

replace-cross Self-Training Meets Consistency: Improving LLMs' Reasoning with Consistency-Driven Rationale Evaluation

Authors: Jaehyeok Lee, Keisuke Sakaguchi, JinYeong Bak

Abstract: Self-training approach for large language models (LLMs) improves reasoning abilities by training the models on their self-generated rationales. Previous approaches have labeled rationales that produce correct answers for a given question as appropriate for training. However, a single measure risks misjudging rationale quality, leading the models to learn flawed reasoning patterns. To address this issue, we propose CREST (Consistency-driven Rationale Evaluation for Self-Training), a self-training framework that further evaluates each rationale through follow-up questions and leverages this evaluation to guide its training. Specifically, we introduce two methods: (1) filtering out rationales that frequently result in incorrect answers on follow-up questions and (2) preference learning based on mixed preferences from rationale evaluation results of both original and follow-up questions. Experiments on three question-answering datasets using open LLMs show that CREST not only improves the logical robustness and correctness of rationales but also improves reasoning abilities compared to previous self-training approaches.

replace-cross EVQAScore: A Fine-grained Metric for Video Question Answering Data Quality Evaluation

Authors: Hao Liang, Zirong Chen, Hejun Dong, Wentao Zhang

Abstract: Video question-answering (QA) is a core task in video understanding. Evaluating the quality of video QA and video caption data quality for training video large language models (VideoLLMs) is an essential challenge. Although various methods have been proposed for assessing video caption quality, there remains a lack of dedicated evaluation methods for Video QA. To address this gap, we introduce EVQAScore, a reference-free method that leverages keyword extraction to assess both video caption and video QA data quality. Additionally, we incorporate frame sampling and rescaling techniques to enhance the efficiency and robustness of our evaluation, this enables our score to evaluate the quality of extremely long videos. Our approach achieves state-of-the-art (SOTA) performance (32.8 for Kendall correlation and 42.3 for Spearman correlation, 4.7 and 5.9 higher than the previous method PAC-S++) on the VATEX-EVAL benchmark for video caption evaluation. Furthermore, by using EVQAScore for data selection, we achieved SOTA results with only 12.5\% of the original data volume, outperforming the previous SOTA method PAC-S and 100\% of data.

replace-cross CPRM: A LLM-based Continual Pre-training Framework for Relevance Modeling in Commercial Search

Authors: Kaixin Wu, Yixin Ji, Zeyuan Chen, Qiang Wang, Cunxiang Wang, Hong Liu, Baijun Ji, Jia Xu, Zhongyi Liu, Jinjie Gu, Yuan Zhou, Linjian Mo

Abstract: Relevance modeling between queries and items stands as a pivotal component in commercial search engines, directly affecting the user experience. Given the remarkable achievements of large language models (LLMs) in various natural language processing (NLP) tasks, LLM-based relevance modeling is gradually being adopted within industrial search systems. Nevertheless, foundational LLMs lack domain-specific knowledge and do not fully exploit the potential of in-context learning. Furthermore, structured item text remains underutilized, and there is a shortage in the supply of corresponding queries and background knowledge. We thereby propose CPRM (Continual Pre-training for Relevance Modeling), a framework designed for the continual pre-training of LLMs to address these issues. Our CPRM framework includes three modules: 1) employing both queries and multi-field item to jointly pre-train for enhancing domain knowledge, 2) applying in-context pre-training, a novel approach where LLMs are pre-trained on a sequence of related queries or items, and 3) conducting reading comprehension on items to produce associated domain knowledge and background information (e.g., generating summaries and corresponding queries) to further strengthen LLMs. Results on offline experiments and online A/B testing demonstrate that our model achieves convincing performance compared to strong baselines.

replace-cross DEFAME: Dynamic Evidence-based FAct-checking with Multimodal Experts

Authors: Tobias Braun, Mark Rothermel, Marcus Rohrbach, Anna Rohrbach

Abstract: The proliferation of disinformation demands reliable and scalable fact-checking solutions. We present Dynamic Evidence-based FAct-checking with Multimodal Experts (DEFAME), a modular, zero-shot MLLM pipeline for open-domain, text-image claim verification. DEFAME operates in a six-stage process, dynamically selecting the tools and search depth to extract and evaluate textual and visual evidence. Unlike prior approaches that are text-only, lack explainability, or rely solely on parametric knowledge, DEFAME performs end-to-end verification, accounting for images in claims and evidence while generating structured, multimodal reports. Evaluation on the popular benchmarks VERITE, AVerITeC, and MOCHEG shows that DEFAME surpasses all previous methods, establishing itself as the new state-of-the-art fact-checking system for uni- and multimodal fact-checking. Moreover, we introduce a new benchmark, CLAIMREVIEW24+, featuring claims after the knowledge cutoff of GPT4o to avoid data leakage. Here, DEFAME drastically outperforms the GPT Chain-of-Thought baseline, demonstrating temporal generalizability and the potential for real-time fact-checking.

replace-cross InfAlign: Inference-aware language model alignment

Authors: Ananth Balashankar, Ziteng Sun, Jonathan Berant, Jacob Eisenstein, Michael Collins, Adrian Hutter, Jong Lee, Chirag Nagpal, Flavien Prost, Aradhana Sinha, Ananda Theertha Suresh, Ahmad Beirami

Abstract: Language model alignment is a critical step in training modern generative language models. Alignment targets to improve win rate of a sample from the aligned model against the base model. Today, we are increasingly using inference-time algorithms (e.g., Best-of-N, controlled decoding, tree search) to decode from language models rather than standard sampling. We show that this train/test mismatch makes standard RLHF framework sub-optimal in view of such inference-time methods. To this end, we propose a framework for inference-aware alignment (InfAlign), which aims to optimize inference-time win rate of the aligned policy against the base model. We prove that for any inference-time decoding procedure, the optimal aligned policy is the solution to the standard RLHF problem with a transformation of the reward. This motivates us to provide the calibrate-and-transform RL (InfAlign-CTRL) algorithm to solve this problem, which involves a reward calibration step and a KL-regularized reward maximization step with a transformation of the calibrated reward. For best-of-N sampling and best-of-N jailbreaking, we propose specific transformations offering up to 3-8% improvement on inference-time win rates. Finally, we also show that our proposed reward calibration method is a strong baseline for optimizing standard win rate.

replace-cross Large Vision-Language Model Alignment and Misalignment: A Survey Through the Lens of Explainability

Authors: Dong Shu, Haiyan Zhao, Jingyu Hu, Weiru Liu, Ali Payani, Lu Cheng, Mengnan Du

Abstract: Large Vision-Language Models (LVLMs) have demonstrated remarkable capabilities in processing both visual and textual information. However, the critical challenge of alignment between visual and textual representations is not fully understood. This survey presents a comprehensive examination of alignment and misalignment in LVLMs through an explainability lens. We first examine the fundamentals of alignment, exploring its representational and behavioral aspects, training methodologies, and theoretical foundations. We then analyze misalignment phenomena across three semantic levels: object, attribute, and relational misalignment. Our investigation reveals that misalignment emerges from challenges at multiple levels: the data level, the model level, and the inference level. We provide a comprehensive review of existing mitigation strategies, categorizing them into parameter-frozen and parameter-tuning approaches. Finally, we outline promising future research directions, emphasizing the need for standardized evaluation protocols and in-depth explainability studies.

replace-cross MambaQuant: Quantizing the Mamba Family with Variance Aligned Rotation Methods

Authors: Zukang Xu, Yuxuan Yue, Xing Hu, Zhihang Yuan, Zixu Jiang, Zhixuan Chen, Jiangyong Yu, Chen Xu, Sifan Zhou, Dawei Yang

Abstract: Mamba is an efficient sequence model that rivals Transformers and demonstrates significant potential as a foundational architecture for various tasks. Quantization is commonly used in neural networks to reduce model size and computational latency. However, applying quantization to Mamba remains underexplored, and existing quantization methods, which have been effective for CNN and Transformer models, appear inadequate for Mamba models (e.g., Quarot suffers a 21% accuracy drop on Vim-T$^\dagger$ even under W8A8). We have pioneered the exploration of this issue and identified several key challenges. First, significant outliers are present in gate projections, output projections, and matrix multiplications. Second, Mamba's unique parallel scan further amplifies these outliers, leading to uneven and heavy-tailed data distributions. Third, even with the application of the Hadamard transform, the variance across channels in weights and activations still remains inconsistent. To these ends, we propose MambaQuant, a post-training quantization (PTQ) framework consisting of: 1) Karhunen-Loeve Transformation (KLT) enhanced rotation, rendering the rotation matrix adaptable to diverse channel distributions. 2) Smooth-Fused rotation, which equalizes channel variances and can merge additional parameters into model weights. Experiments show that MambaQuant can quantize both weights and activations into 8-bit with less than 1% accuracy loss for Mamba-based vision and language tasks. To the best of our knowledge, MambaQuant is the first comprehensive PTQ design for the Mamba family, paving the way for further advancements in its application.

replace-cross Humanity's Last Exam

Authors: Long Phan (Michael Pokorny), Alice Gatti (Michael Pokorny), Ziwen Han (Michael Pokorny), Nathaniel Li (Michael Pokorny), Josephina Hu (Michael Pokorny), Hugh Zhang (Michael Pokorny), Mohamed Shaaban (Michael Pokorny), John Ling (Michael Pokorny), Sean Shi (Michael Pokorny), Michael Choi (Michael Pokorny), Anish Agrawal (Michael Pokorny), Arnav Chopra (Michael Pokorny), Adam Khoja (Michael Pokorny), Ryan Kim (Michael Pokorny), Richard Ren (Michael Pokorny), Jason Hausenloy (Michael Pokorny), Oliver Zhang (Michael Pokorny), Mantas Mazeika (Michael Pokorny), Daron Anderson (Michael Pokorny), Tung Nguyen (Michael Pokorny), Imad Ali Shah (Michael Pokorny), Alun Cennyth Stokes (Michael Pokorny), Mobeen Mahmood (Michael Pokorny), Fiona Feng (Michael Pokorny), Steven Y. Feng (Michael Pokorny), Haoran Zhao (Michael Pokorny), Michael Yu (Michael Pokorny), Varun Gangal (Michael Pokorny), Chelsea Zou (Michael Pokorny), Zihan Wang (Michael Pokorny), Jaeho Lee (Michael Pokorny), Mikhail Doroshenko (Michael Pokorny), Jessica P. Wang (Michael Pokorny), Pawan Kumar (Michael Pokorny), Oleksandr Pokutnyi (Michael Pokorny), Oleg Iskra (Michael Pokorny), Robert Gerbicz (Michael Pokorny), Serguei Popov (Michael Pokorny), John-Clark Levin (Michael Pokorny), Mstyslav Kazakov (Michael Pokorny), Johannes Schmitt (Michael Pokorny), Geoff Galgon (Michael Pokorny), Alvaro Sanchez (Michael Pokorny), Yongki Lee (Michael Pokorny), Will Yeadon (Michael Pokorny), Scott Sauers (Michael Pokorny), Marc Roth (Michael Pokorny), Chidozie Agu (Michael Pokorny), S{\o}ren Riis (Michael Pokorny), Fabian Giska (Michael Pokorny), Saiteja Utpala (Michael Pokorny), Antrell Cheatom (Michael Pokorny), Zachary Giboney (Michael Pokorny), Gashaw M. Goshu (Michael Pokorny), Joan of Arc Xavier (Michael Pokorny), Sarah-Jane Crowson (Michael Pokorny), Mohinder Maheshbhai Naiya (Michael Pokorny), Noah Burns (Michael Pokorny), Lennart Finke (Michael Pokorny), Zerui Cheng (Michael Pokorny), Hyunwoo Park (Michael Pokorny), Francesco Fournier-Facio (Michael Pokorny), John Wydallis (Michael Pokorny), John B. Wydallis (Michael Pokorny), Mark Nandor (Michael Pokorny), Ankit Singh (Michael Pokorny), Tim Gehrunger (Michael Pokorny), Jiaqi Cai (Michael Pokorny), Ben McCarty (Michael Pokorny), Darling Duclosel (Michael Pokorny), Ahmed Menshawy (Michael Pokorny), Jungbae Nam (Michael Pokorny), Jennifer Zampese (Michael Pokorny), Ryan G. Hoerr (Michael Pokorny), Aras Bacho (Michael Pokorny), Jun Jin (Michael Pokorny), Gautier Abou Loume (Michael Pokorny), Abdallah Galal (Michael Pokorny), Hangrui Cao (Michael Pokorny), Alexis C Garretson (Michael Pokorny), Damien Sileo (Michael Pokorny), Qiuyu Ren (Michael Pokorny), Doru Cojoc (Michael Pokorny), Pavel Arkhipov (Michael Pokorny), Usman Qazi (Michael Pokorny), Lianghui Li (Michael Pokorny), Sumeet Motwani (Michael Pokorny), Christian Schroeder de Witt (Michael Pokorny), Alexei Kopylov (Michael Pokorny), Edwin Taylor (Michael Pokorny), Johannes Veith (Michael Pokorny), Eric Singer (Michael Pokorny), Taylor D. Hartman (Michael Pokorny), Paolo Rissone (Michael Pokorny), Jaehyeok Jin (Michael Pokorny), Jack Wei Lun Shi (Michael Pokorny), Chris G. Willcocks (Michael Pokorny), Joshua Robinson (Michael Pokorny), Aleksandar Mikov (Michael Pokorny), Ameya Prabhu (Michael Pokorny), Longke Tang (Michael Pokorny), Xavier Alapont (Michael Pokorny), Justine Leon Uro (Michael Pokorny), Kevin Zhou (Michael Pokorny), Emily de Oliveira Santos (Michael Pokorny), Andrey Pupasov Maksimov (Michael Pokorny), Edward Vendrow (Michael Pokorny), Kengo Zenitani (Michael Pokorny), Julien Guillod (Michael Pokorny), Sheeshram Siddh (Michael Pokorny), Yuqi Li (Michael Pokorny), Joshua Vendrow (Michael Pokorny), Vladyslav Kuchkin (Michael Pokorny), Ng Ze-An (Michael Pokorny), Pierre Marion (Michael Pokorny), Denis Efremov (Michael Pokorny), Jayson Lynch (Michael Pokorny), Kaiqu Liang (Michael Pokorny), Andrew Gritsevskiy (Michael Pokorny), Dakotah Martinez (Michael Pokorny), Ben Pageler (Michael Pokorny), Nick Crispino (Michael Pokorny), Dimitri Zvonkine (Michael Pokorny), Natanael Wildner Fraga (Michael Pokorny), Saeed Soori (Michael Pokorny), Ori Press (Michael Pokorny), Henry Tang (Michael Pokorny), Julian Salazar (Michael Pokorny), Sean R. Green (Michael Pokorny), Lina Br\"ussel (Michael Pokorny), Moon Twayana (Michael Pokorny), Aymeric Dieuleveut (Michael Pokorny), T. Ryan Rogers (Michael Pokorny), Wenjin Zhang (Michael Pokorny), Yashaswini Jain (Michael Pokorny), Bikun Li (Michael Pokorny), Jinzhou Yang (Michael Pokorny), Arun Rao (Michael Pokorny), Gabriel Loiseau (Michael Pokorny), Mikhail Kalinin (Michael Pokorny), Marco Lukas (Michael Pokorny), Ciprian Manolescu (Michael Pokorny), Subrata Mishra (Michael Pokorny), Ariel Ghislain Kemogne Kamdoum (Michael Pokorny), Tobias Kreiman (Michael Pokorny), Tad Hogg (Michael Pokorny), Alvin Jin (Michael Pokorny), Carlo Bosio (Michael Pokorny), Gongbo Sun (Michael Pokorny), Brian P Coppola (Michael Pokorny), Tim Tarver (Michael Pokorny), Haline Heidinger (Michael Pokorny), Rafael Sayous (Michael Pokorny), Stefan Ivanov (Michael Pokorny), Joseph M Cavanagh (Michael Pokorny), Jiawei Shen (Michael Pokorny), Joseph Marvin Imperial (Michael Pokorny), Philippe Schwaller (Michael Pokorny), Shaipranesh Senthilkuma (Michael Pokorny), Andres M Bran (Michael Pokorny), Ali Dehghan (Michael Pokorny), Andres Algaba (Michael Pokorny), Brecht Verbeken (Michael Pokorny), Kelsey Van den Houte (Michael Pokorny), Lynn Van Der Sypt (Michael Pokorny), David Noever (Michael Pokorny), Lisa Schut (Michael Pokorny), Ilia Sucholutsky (Michael Pokorny), Evgenii Zheltonozhskii (Michael Pokorny), Qiaochu Yuan (Michael Pokorny), Derek Lim (Michael Pokorny), Richard Stanley (Michael Pokorny), Shankar Sivarajan (Michael Pokorny), Tong Yang (Michael Pokorny), John Maar (Michael Pokorny), Julian Wykowski (Michael Pokorny), Mart\'i Oller (Michael Pokorny), Jennifer Sandlin (Michael Pokorny), Anmol Sahu (Michael Pokorny), Yuzheng Hu (Michael Pokorny), Sara Fish (Michael Pokorny), Nasser Heydari (Michael Pokorny), Archimedes Apronti (Michael Pokorny), Kaivalya Rawal (Michael Pokorny), Tobias Garcia Vilchis (Michael Pokorny), Yuexuan Zu (Michael Pokorny), Martin Lackner (Michael Pokorny), James Koppel (Michael Pokorny), Jeremy Nguyen (Michael Pokorny), Daniil S. Antonenko (Michael Pokorny), Steffi Chern (Michael Pokorny), Bingchen Zhao (Michael Pokorny), Pierrot Arsene (Michael Pokorny), Alan Goldfarb (Michael Pokorny), Sergey Ivanov (Michael Pokorny), Rafa{\l} Po\'swiata (Michael Pokorny), Chenguang Wang (Michael Pokorny), Daofeng Li (Michael Pokorny), Donato Crisostomi (Michael Pokorny), Andrea Achilleos (Michael Pokorny), Benjamin Myklebust (Michael Pokorny), Archan Sen (Michael Pokorny), David Perrella (Michael Pokorny), Nurdin Kaparov (Michael Pokorny), Mark H Inlow (Michael Pokorny), Keith Krenek (Michael Pokorny), Allen Zang (Michael Pokorny), Elliott Thornley (Michael Pokorny), Daniil Orel (Michael Pokorny), Vladislav Poritski (Michael Pokorny), Shalev Ben-David (Michael Pokorny), Zachary Berger (Michael Pokorny), Parker Whitfill (Michael Pokorny), Michael Foster (Michael Pokorny), Daniel Munro (Michael Pokorny), Linh Ho (Michael Pokorny), Dan Bar Hava (Michael Pokorny), Aleksey Kuchkin (Michael Pokorny), Robert Lauff (Michael Pokorny), David Holmes (Michael Pokorny), Frank Sommerhage (Michael Pokorny), Cesare Giulio Ardito (Michael Pokorny), Richard Moat (Michael Pokorny), Keith Schneider (Michael Pokorny), Zakayo Kazibwe (Michael Pokorny), Nate Stambaugh (Michael Pokorny), Mukhwinder Singh (Michael Pokorny), Ilias Magoulas (Michael Pokorny), Don Clarke (Michael Pokorny), Dae Hyun Kim (Michael Pokorny), Felipe Meneguitti Dias (Michael Pokorny), Veit Elser (Michael Pokorny), Kanu Priya Agarwal (Michael Pokorny), Victor Efren Guadarrama Vilchis (Michael Pokorny), Immo Klose (Michael Pokorny), Christoph Demian (Michael Pokorny), Ujjwala Anantheswaran (Michael Pokorny), Adam Zweiger (Michael Pokorny), Guglielmo Albani (Michael Pokorny), Jeffery Li (Michael Pokorny), Nicolas Daans (Michael Pokorny), Maksim Radionov (Michael Pokorny), V\'aclav Rozho\v{n} (Michael Pokorny), Ziqiao Ma (Michael Pokorny), Christian Stump (Michael Pokorny), Mohammed Berkani (Michael Pokorny), Jacob Platnick (Michael Pokorny), Volodymyr Nevirkovets (Michael Pokorny), Luke Basler (Michael Pokorny), Marco Piccardo (Michael Pokorny), Ferenc Jeanplong (Michael Pokorny), Niv Cohen (Michael Pokorny), Virendra Singh (Michael Pokorny), Josef Tkadlec (Michael Pokorny), Paul Rosu (Michael Pokorny), Piotr Padlewski (Michael Pokorny), Stanislaw Barzowski (Michael Pokorny), Kyle Montgomery (Michael Pokorny), Aline Menezes (Michael Pokorny), Arkil Patel (Michael Pokorny), Zixuan Wang (Michael Pokorny), Jamie Tucker-Foltz (Michael Pokorny), Jack Stade (Michael Pokorny), Tom Goertzen (Michael Pokorny), Fereshteh Kazemi (Michael Pokorny), Jeremiah Milbauer (Michael Pokorny), John Arnold Ambay (Michael Pokorny), Abhishek Shukla (Michael Pokorny), Yan Carlos Leyva Labrador (Michael Pokorny), Hao He (Michael Pokorny), Ling Zhang (Michael Pokorny), Alan Givr\'e (Michael Pokorny), Hew Wolff (Michael Pokorny), Vivien Rossbach (Michael Pokorny), Muhammad Fayez Aziz (Michael Pokorny), Younesse Kaddar (Michael Pokorny), Ivar \"Angquist (Michael Pokorny), Yanxu Chen (Michael Pokorny), Robin Zhang (Michael Pokorny), Jiayi Pan (Michael Pokorny), Antonio Terpin (Michael Pokorny), Niklas Muennighoff (Michael Pokorny), Hailey Schoelkopf (Michael Pokorny), Eric Zheng (Michael Pokorny), Avishy Carmi (Michael Pokorny), Adam Jones (Michael Pokorny), Jainam Shah (Michael Pokorny), Ethan D. L. Brown (Michael Pokorny), Kelin Zhu (Michael Pokorny), Max Bartolo (Michael Pokorny), Richard Wheeler (Michael Pokorny), Andrew Ho (Michael Pokorny), Shaul Barkan (Michael Pokorny), Jiaqi Wang (Michael Pokorny), Martin Stehberger (Michael Pokorny), Egor Kretov (Michael Pokorny), Peter Bradshaw (Michael Pokorny), JP Heimonen (Michael Pokorny), Kaustubh Sridhar (Michael Pokorny), Yury Makarychev (Michael Pokorny), Zienab EL-Wasif (Michael Pokorny), Anji Zhang (Michael Pokorny), Daniel Pyda (Michael Pokorny), Joanna Tam (Michael Pokorny), David M. Cunningham (Michael Pokorny), Vladimir Goryachev (Michael Pokorny), Demosthenes Patramanis (Michael Pokorny), Michael Krause (Michael Pokorny), Andrew Redenti (Michael Pokorny), Daniel Bugas (Michael Pokorny), David Aldous (Michael Pokorny), Jesyin Lai (Michael Pokorny), Shannon Coleman (Michael Pokorny), Mohsen Bahaloo (Michael Pokorny), Greg Bateman (Michael Pokorny), Jiangnan Xu (Michael Pokorny), Sangwon Lee (Michael Pokorny), Sandy Zhao (Michael Pokorny), Ning Tang (Michael Pokorny), Michael K. Cohen (Michael Pokorny), Micah Carroll (Michael Pokorny), Orr Paradise (Michael Pokorny), Jan Hendrik Kirchner (Michael Pokorny), Stefan Steinerberger (Michael Pokorny), Maksym Ovchynnikov (Michael Pokorny), Jason O. Matos (Michael Pokorny), Adithya Shenoy (Michael Pokorny), Benedito Alves de Oliveira Junior (Michael Pokorny), Michael Wang (Michael Pokorny), Ashley Aaron (Michael Pokorny), Yuzhou Nie (Michael Pokorny), Paolo Giordano (Michael Pokorny), Philipp Petersen (Michael Pokorny), Anna Sztyber-Betley (Michael Pokorny), Priti Shukla (Michael Pokorny), Paolo Faraboschi (Michael Pokorny), Jonathan Crozier (Michael Pokorny), Antonella Pinto (Michael Pokorny), Shreyas Verma (Michael Pokorny), Prashant Joshi (Michael Pokorny), Eli Meril (Michael Pokorny), Zheng-Xin Yong (Michael Pokorny), Allison Tee (Michael Pokorny), J\'er\'emy Andr\'eoletti (Michael Pokorny), Orion Weller (Michael Pokorny), Raghav Singhal (Michael Pokorny), Gang Zhang (Michael Pokorny), Alexander Ivanov (Michael Pokorny), Seri Khoury (Michael Pokorny), Nils Gustafsson (Michael Pokorny), Hamid Mostaghimi (Michael Pokorny), Kunvar Thaman (Michael Pokorny), Qijia Chen (Michael Pokorny), Tran Quoc Kh\'anh (Michael Pokorny), Jacob Loader (Michael Pokorny), Stefano Cavalleri (Michael Pokorny), Hannah Szlyk (Michael Pokorny), Zachary Brown (Michael Pokorny), Jonathan Roberts (Michael Pokorny), William Alley (Michael Pokorny), Kunyang Sun (Michael Pokorny), Ryan Stendall (Michael Pokorny), Max Lamparth (Michael Pokorny), Anka Reuel (Michael Pokorny), Ting Wang (Michael Pokorny), Hanmeng Xu (Michael Pokorny), Pablo Hern\'andez-C\'amara (Michael Pokorny), Freddie Martin (Michael Pokorny), Dmitry Malishev (Michael Pokorny), Thomas Preu (Michael Pokorny), Tomek Korbak (Michael Pokorny), Marcus Abramovitch (Michael Pokorny), Dominic Williamson (Michael Pokorny), Ziye Chen (Michael Pokorny), Bir\'o B\'alint (Michael Pokorny), M Saiful Bari (Michael Pokorny), Peyman Kassani (Michael Pokorny), Zihao Wang (Michael Pokorny), Behzad Ansarinejad (Michael Pokorny), Laxman Prasad Goswami (Michael Pokorny), Yewen Sun (Michael Pokorny), Hossam Elgnainy (Michael Pokorny), Mohamed Sayed (Michael Pokorny), Daniel Tordera (Michael Pokorny), George Balabanian (Michael Pokorny), Earth Anderson (Michael Pokorny), Lynna Kvistad (Michael Pokorny), Alejandro Jos\'e Moyano (Michael Pokorny), Rajat Maheshwari (Michael Pokorny), Ahmad Sakor (Michael Pokorny), Murat Eron (Michael Pokorny), Isaac C. McAlister (Michael Pokorny), Javier Gimenez (Michael Pokorny), Innocent Enyekwe (Michael Pokorny), Andrew Favre D. O. (Michael Pokorny), Shailesh Shah (Michael Pokorny), Xiaoxiang Zhou (Michael Pokorny), Firuz Kamalov (Michael Pokorny), Ronald Clark (Michael Pokorny), Sherwin Abdoli (Michael Pokorny), Tim Santens (Michael Pokorny), Khalida Meer (Michael Pokorny), Harrison K Wang (Michael Pokorny), Kalyan Ramakrishnan (Michael Pokorny), Evan Chen (Michael Pokorny), Alessandro Tomasiello (Michael Pokorny), G. Bruno De Luca (Michael Pokorny), Shi-Zhuo Looi (Michael Pokorny), Vinh-Kha Le (Michael Pokorny), Noam Kolt (Michael Pokorny), Niels M\"undler (Michael Pokorny), Avi Semler (Michael Pokorny), Emma Rodman (Michael Pokorny), Jacob Drori (Michael Pokorny), Carl J Fossum (Michael Pokorny), Luk Gloor (Michael Pokorny), Milind Jagota (Michael Pokorny), Ronak Pradeep (Michael Pokorny), Honglu Fan (Michael Pokorny), Tej Shah (Michael Pokorny), Jonathan Eicher (Michael Pokorny), Michael Chen (Michael Pokorny), Kushal Thaman (Michael Pokorny), William Merrill (Michael Pokorny), Moritz Firsching (Michael Pokorny), Carter Harris (Michael Pokorny), Stefan Ciob\^ac\u{a} (Michael Pokorny), Jason Gross (Michael Pokorny), Rohan Pandey (Michael Pokorny), Ilya Gusev (Michael Pokorny), Asankhaya Sharma (Michael Pokorny), Shashank Agnihotri (Michael Pokorny), Pavel Zhelnov (Michael Pokorny), Siranut Usawasutsakorn (Michael Pokorny), Mohammadreza Mofayezi (Michael Pokorny), Sergei Bogdanov (Michael Pokorny), Alexander Piperski (Michael Pokorny), Marc Carauleanu (Michael Pokorny), David K. Zhang (Michael Pokorny), Kostiantyn Dobarskyi (Michael Pokorny), Dylan Ler (Michael Pokorny), Roman Leventov (Michael Pokorny), Ignat Soroko (Michael Pokorny), Thorben Jansen (Michael Pokorny), Scott Creighton (Michael Pokorny), Pascal Lauer (Michael Pokorny), Joshua Duersch (Michael Pokorny), Vage Taamazyan (Michael Pokorny), Dario Bezzi (Michael Pokorny), Wiktor Morak (Michael Pokorny), Wenjie Ma (Michael Pokorny), William Held (Michael Pokorny), Tran {\DJ}uc Huy (Michael Pokorny), Ruicheng Xian (Michael Pokorny), Armel Randy Zebaze (Michael Pokorny), Mohanad Mohamed (Michael Pokorny), Julian Noah Leser (Michael Pokorny), Michelle X Yuan (Michael Pokorny), Laila Yacar (Michael Pokorny), Johannes Lengler (Michael Pokorny), Katarzyna Olszewska (Michael Pokorny), Hossein Shahrtash (Michael Pokorny), Edson Oliveira (Michael Pokorny), Joseph W. Jackson (Michael Pokorny), Daniel Espinosa Gonzalez (Michael Pokorny), Andy Zou (Michael Pokorny), Muthu Chidambaram (Michael Pokorny), Timothy Manik (Michael Pokorny), Hector Haffenden (Michael Pokorny), Dashiell Stander (Michael Pokorny), Ali Dasouqi (Michael Pokorny), Alexander Shen (Michael Pokorny), Emilien Duc (Michael Pokorny), Bita Golshani (Michael Pokorny), David Stap (Michael Pokorny), Mikalai Uzhou (Michael Pokorny), Alina Borisovna Zhidkovskaya (Michael Pokorny), Lukas Lewark (Michael Pokorny), Miguel Orbegozo Rodriguez (Michael Pokorny), M\'aty\'as Vincze (Michael Pokorny), Dustin Wehr (Michael Pokorny), Colin Tang (Michael Pokorny), Zaki Hossain (Michael Pokorny), Shaun Phillips (Michael Pokorny), Fortuna Samuele (Michael Pokorny), Jiang Muzhen (Michael Pokorny), Fredrik Ekstr\"om (Michael Pokorny), Angela Hammon (Michael Pokorny), Oam Patel (Michael Pokorny), Nicolas Remy (Michael Pokorny), Faraz Farhidi (Michael Pokorny), George Medley (Michael Pokorny), Forough Mohammadzadeh (Michael Pokorny), Madellene Pe\~naflor (Michael Pokorny), Haile Kassahun (Michael Pokorny), Alena Friedrich (Michael Pokorny), Claire Sparrow (Michael Pokorny), Rayner Hernandez Perez (Michael Pokorny), Taom Sakal (Michael Pokorny), Omkar Dhamane (Michael Pokorny), Ali Khajegili Mirabadi (Michael Pokorny), Eric Hallman (Michael Pokorny), Kenchi Okutsu (Michael Pokorny), Mike Battaglia (Michael Pokorny), Mohammad Maghsoudimehrabani (Michael Pokorny), Hieu Hoang (Michael Pokorny), Alon Amit (Michael Pokorny), Dave Hulbert (Michael Pokorny), Roberto Pereira (Michael Pokorny), Simon Weber (Michael Pokorny), Stephen Mensah (Michael Pokorny), Alice Koech (Michael Pokorny), Handoko (Michael Pokorny), Anton Peristyy (Michael Pokorny), Chris Harjadi (Michael Pokorny), Himanshu Gupta (Michael Pokorny), Stephen Malina (Michael Pokorny), Samuel Albanie (Michael Pokorny), Will Cai (Michael Pokorny), Mustafa Mehkary (Michael Pokorny), Rami Aly (Michael Pokorny), Frank Reidegeld (Michael Pokorny), Anna-Katharina Dick (Michael Pokorny), Cary Friday (Michael Pokorny), Jasdeep Sidhu (Michael Pokorny), Hassan Shapourian (Michael Pokorny), Wanyoung Kim (Michael Pokorny), Mariana Costa (Michael Pokorny), Hubeyb Gurdogan (Michael Pokorny), Brian Weber (Michael Pokorny), Harsh Kumar (Michael Pokorny), Tong Jiang (Michael Pokorny), Arunim Agarwal (Michael Pokorny), Chiara Ceconello (Michael Pokorny), Warren S. Vaz (Michael Pokorny), Chao Zhuang (Michael Pokorny), Haon Park (Michael Pokorny), Andrew R. Tawfeek (Michael Pokorny), Daattavya Aggarwal (Michael Pokorny), Michael Kirchhof (Michael Pokorny), Linjie Dai (Michael Pokorny), Evan Kim (Michael Pokorny), Johan Ferret (Michael Pokorny), Yuzhou Wang (Michael Pokorny), Minghao Yan (Michael Pokorny), Krzysztof Burdzy (Michael Pokorny), Lixin Zhang (Michael Pokorny), Antonio Franca (Michael Pokorny), Diana T. Pham (Michael Pokorny), Kang Yong Loh (Michael Pokorny), Joshua Robinson (Michael Pokorny), Abram Jackson (Michael Pokorny), Shreen Gul (Michael Pokorny), Gunjan Chhablani (Michael Pokorny), Zhehang Du (Michael Pokorny), Adrian Cosma (Michael Pokorny), Jesus Colino (Michael Pokorny), Colin White (Michael Pokorny), Robin Riblet (Michael Pokorny), Prajvi Saxena (Michael Pokorny), Jacob Votava (Michael Pokorny), Vladimir Vinnikov (Michael Pokorny), Ethan Delaney (Michael Pokorny), Shiv Halasyamani (Michael Pokorny), Syed M. Shahid (Michael Pokorny), Jean-Christophe Mourrat (Michael Pokorny), Lavr Vetoshkin (Michael Pokorny), Koen Sponselee (Michael Pokorny), Renas Bacho (Michael Pokorny), Vincent Ginis (Michael Pokorny), Aleksandr Maksapetyan (Michael Pokorny), Florencia de la Rosa (Michael Pokorny), Xiuyu Li (Michael Pokorny), Guillaume Malod (Michael Pokorny), Leon Lang (Michael Pokorny), Julien Laurendeau (Michael Pokorny), Murat Tiryakioglu (Michael Pokorny), Dmitry Kazakov (Michael Pokorny), Fatimah Adesanya (Michael Pokorny), Julien Portier (Michael Pokorny), Lawrence Hollom (Michael Pokorny), Victor Souza (Michael Pokorny), Yuchen Anna Zhou (Michael Pokorny), Julien Degorre (Michael Pokorny), Yi\u{g}it Yal{\i}n (Michael Pokorny), Gbenga Daniel Obikoya (Michael Pokorny), Luca Arnaboldi (Michael Pokorny), Rai (Michael Pokorny), Filippo Bigi (Quinn), M. C. Bosc\'a (Quinn), Oleg Shumar (Quinn), Kaniuar Bacho (Quinn), Pierre Clavier (Quinn), Gabriel Recchia (Quinn), Mara Popescu (Quinn), Nikita Shulga (Quinn), Ngefor Mildred Tanwie (Quinn), Thomas C. H. Lux (Quinn), Ben Rank (Quinn), Colin Ni (Quinn), Matthew Brooks (Quinn), Alesia Yakimchyk (Quinn), Huanxu (Quinn), Liu (Tony), Olle H\"aggstr\"om (Tony), Emil Verkama (Tony), Himanshu Narayan (Tony), Hans Gundlach (Tony), Leonor Brito-Santana (Tony), Brian Amaro (Tony), Vivek Vajipey (Tony), Rynaa Grover (Tony), Yiyang Fan (Tony), Gabriel Poesia Reis e Silva (Tony), Linwei Xin (Tony), Yosi Kratish (Tony), Jakub {\L}ucki (Tony), Wen-Ding Li (Tony), Sivakanth Gopi (Tony), Andrea Caciolai (Tony), Justin Xu (Tony), Kevin Joseph Scaria (Tony), Freddie Vargus (Tony), Farzad Habibi (Tony), Long (Tony), Lian, Emanuele Rodol\`a, Jules Robins, Vincent Cheng, Declan Grabb, Ida Bosio, Tony Fruhauff, Ido Akov, Brad Raynor, Eve J. Y. Lo, Hao Qi, Xi Jiang, Ben Segev, Jingxuan Fan, Sarah Martinson, Erik Y. Wang, Kaylie Hausknecht, Michael P. Brenner, Mao Mao, Yibo Jiang, Xinyu Zhang, David Avagian, Eshawn Jessica Scipio, Muhammad Rehan Siddiqi, Alon Ragoler, Justin Tan, Deepakkumar Patil, Blake Sims, Rebeka Plecnik, Aaron Kirtland, Roselynn Grace Montecillo, Stephane Durand, Omer Faruk Bodur, D. P. Shinde, Zahra Adoul, Mohamed Zekry, Guillaume Douville, Ali Karakoc, Tania C. B. Santos, Samir Shamseldeen, Loukmane Karim, Anna Liakhovitskaia, Nate Resman, Nicholas Farina, Juan Carlos Gonzalez, Gabe Maayan, Sarah Hoback, Rodrigo De Oliveira Pena, Ross Finocchio, Glen Sherman, Elizabeth Kelley, Hodjat Mariji, Rasoul Pouriamanesh, Wentao Wu, G\"ozdenur Demir, Sandra Mendoza, Ismail Alarab, Joshua Cole, Danyelle Ferreira, Bryan Johnson, Hsiaoyun Milliron, Mohammad Safdari, Liangti Dai, Siriphan Arthornthurasuk, Alexey Pronin, Jing Fan, Angel Ramirez-Trinidad, Ashley Cartwright, Daphiny Pottmaier, Omid Taheri, David Outevsky, Stanley Stepanic, Samuel Perry, Luke Askew, Ra\'ul Adri\'an Huerta Rodr\'iguez, Ali M. R. Minissi, Abdelkader Dendane, Sam Ali, Ricardo Lorena, Krishnamurthy Iyer, Arshad Anil Fasiludeen, Sk Md Salauddin, Murat Islam, Juan Gonzalez, Josh Ducey, Russell Campbell, Maja Somrak, Vasilios Mavroudis, Eric Vergo, Juehang Qin, Benj\'amin Borb\'as, Eric Chu, Jack Lindsey, Anil Radhakrishnan, Antoine Jallon, I. M. J. McInnis, Alex Hoover, S\"oren M\"oller, Song Bian, John Lai, Denis Peskoff, Joseph McGowan, Tejal Patwardhan, Summer Yue, Alexandr Wang, Dan Hendrycks

Abstract: Benchmarks are important tools for tracking the rapid advancements in large language model (LLM) capabilities. However, benchmarks are not keeping pace in difficulty: LLMs now achieve over 90\% accuracy on popular benchmarks like MMLU, limiting informed measurement of state-of-the-art LLM capabilities. In response, we introduce Humanity's Last Exam (HLE), a multi-modal benchmark at the frontier of human knowledge, designed to be the final closed-ended academic benchmark of its kind with broad subject coverage. HLE consists of 3,000 questions across dozens of subjects, including mathematics, humanities, and the natural sciences. HLE is developed globally by subject-matter experts and consists of multiple-choice and short-answer questions suitable for automated grading. Each question has a known solution that is unambiguous and easily verifiable, but cannot be quickly answered via internet retrieval. State-of-the-art LLMs demonstrate low accuracy and calibration on HLE, highlighting a significant gap between current LLM capabilities and the expert human frontier on closed-ended academic questions. To inform research and policymaking upon a clear understanding of model capabilities, we publicly release HLE at https://lastexam.ai.

URLs: https://lastexam.ai.

replace-cross ACECODER: Acing Coder RL via Automated Test-Case Synthesis

Authors: Huaye Zeng, Dongfu Jiang, Haozhe Wang, Ping Nie, Xiaotong Chen, Wenhu Chen

Abstract: Most progress in recent coder models has been driven by supervised fine-tuning (SFT), while the potential of reinforcement learning (RL) remains largely unexplored, primarily due to the lack of reliable reward data/model in the code domain. In this paper, we address this challenge by leveraging automated large-scale test-case synthesis to enhance code model training. Specifically, we design a pipeline that generates extensive (question, test-cases) pairs from existing code data. Using these test cases, we construct preference pairs based on pass rates over sampled programs to train reward models with Bradley-Terry loss. It shows an average of 10-point improvement for Llama-3.1-8B-Ins and 5-point improvement for Qwen2.5-Coder-7B-Ins through best-of-32 sampling, making the 7B model on par with 236B DeepSeek-V2.5. Furthermore, we conduct reinforcement learning with both reward models and test-case pass rewards, leading to consistent improvements across HumanEval, MBPP, BigCodeBench, and LiveCodeBench (V4). Notably, we follow the R1-style training to start from Qwen2.5-Coder-base directly and show that our RL training can improve model on HumanEval-plus by over 25\% and MBPP-plus by 6\% for merely 80 optimization steps. We believe our results highlight the huge potential of reinforcement learning in coder models.

replace-cross Twilight: Adaptive Attention Sparsity with Hierarchical Top-$p$ Pruning

Authors: Chaofan Lin, Jiaming Tang, Shuo Yang, Hanshuo Wang, Tian Tang, Boyu Tian, Ion Stoica, Song Han, Mingyu Gao

Abstract: Leveraging attention sparsity to accelerate long-context large language models (LLMs) has been a hot research topic. However, current algorithms such as sparse attention or key-value (KV) cache compression tend to use a fixed budget, which presents a significant challenge during deployment because it fails to account for the dynamic nature of real-world scenarios, where the optimal balance between accuracy and efficiency can vary greatly. In this paper, we find that borrowing top-$p$ sampling (nucleus sampling) to sparse attention can surprisingly achieve adaptive budgeting. Based on this, we propose Twilight, a framework to bring adaptive sparsity to any existing sparse attention algorithm without sacrificing their accuracy. Empirical results show that Twilight can adaptively prune at most 98% of redundant tokens, leading to $15.4\times$ acceleration in self-attention operations and $3.9\times$ acceleration in end-to-end per token latency in long context LLM decoding.

replace-cross Analyze Feature Flow to Enhance Interpretation and Steering in Language Models

Authors: Daniil Laptev, Nikita Balagansky, Yaroslav Aksenov, Daniil Gavrilov

Abstract: We introduce a new approach to systematically map features discovered by sparse autoencoder across consecutive layers of large language models, extending earlier work that examined inter-layer feature links. By using a data-free cosine similarity technique, we trace how specific features persist, transform, or first appear at each stage. This method yields granular flow graphs of feature evolution, enabling fine-grained interpretability and mechanistic insights into model computations. Crucially, we demonstrate how these cross-layer feature maps facilitate direct steering of model behavior by amplifying or suppressing chosen features, achieving targeted thematic control in text generation. Together, our findings highlight the utility of a causal, cross-layer interpretability framework that not only clarifies how features develop through forward passes but also provides new means for transparent manipulation of large language models.