new Intelligent Legal Assistant: An Interactive Clarification System for Legal Question Answering

Authors: Rujing Yao, Yiquan Wu, Tong Zhang, Xuhui Zhang, Yuting Huang, Yang Wu, Jiayin Yang, Changlong Sun, Fang Wang, Xiaozhong Liu

Abstract: The rise of large language models has opened new avenues for users seeking legal advice. However, users often lack professional legal knowledge, which can lead to questions that omit critical information. This deficiency makes it challenging for traditional legal question-answering systems to accurately identify users' actual needs, often resulting in imprecise or generalized advice. In this work, we develop a legal question-answering system called Intelligent Legal Assistant, which interacts with users to precisely capture their needs. When a user poses a question, the system requests that the user select their geographical location to pinpoint the applicable laws. It then generates clarifying questions and options based on the key information missing from the user's initial question. This allows the user to select and provide the necessary details. Once all necessary information is provided, the system produces an in-depth legal analysis encompassing three aspects: overall conclusion, jurisprudential analysis, and resolution suggestions.

new Elevating Legal LLM Responses: Harnessing Trainable Logical Structures and Semantic Knowledge with Legal Reasoning

Authors: Rujing Yao, Yang Wu, Chenghao Wang, Jingwei Xiong, Fang Wang, Xiaozhong Liu

Abstract: Large Language Models (LLMs) have achieved impressive results across numerous domains, yet they experience notable deficiencies in legal question-answering tasks. LLMs often generate generalized responses that lack the logical specificity required for expert legal advice and are prone to hallucination, providing answers that appear correct but are unreliable. Retrieval-Augmented Generation (RAG) techniques offer partial solutions to address this challenge, but existing approaches typically focus only on semantic similarity, neglecting the logical structure essential to legal reasoning. In this paper, we propose the Logical-Semantic Integration Model (LSIM), a novel supervised framework that bridges semantic and logical coherence. LSIM comprises three components: reinforcement learning predicts a structured fact-rule chain for each question, a trainable Deep Structured Semantic Model (DSSM) retrieves the most relevant candidate questions by integrating semantic and logical features, and in-context learning generates the final answer using the retrieved content. Our experiments on a real-world legal QA dataset-validated through both automated metrics and human evaluation-demonstrate that LSIM significantly enhances accuracy and reliability compared to existing methods.

new Adapting Multilingual Embedding Models to Historical Luxembourgish

Authors: Andrianos Michail, Corina Julia Racl\'e, Juri Opitz, Simon Clematide

Abstract: The growing volume of digitized historical texts requires effective semantic search using text embeddings. However, pre-trained multilingual models, typically evaluated on contemporary texts, face challenges with historical digitized content due to OCR noise and outdated spellings. We explore the use of multilingual embeddings for cross-lingual semantic search on historical Luxembourgish, a low-resource language. We collect historical Luxembourgish news articles spanning various time periods and use GPT-4o to segment and translate them into closely related languages, creating 20,000 parallel training sentences per language pair. We further create a historical bitext mining evaluation set and find that these models struggle to perform cross-lingual search on historical Luxembourgish. To address this, we propose a simple adaptation method using in-domain training data, achieving up to 98\% accuracy in cross-lingual evaluations. We release our adapted models and historical Luxembourgish-German/French bitexts to support further research.

new Caught in the Web of Words: Do LLMs Fall for Spin in Medical Literature?

Authors: Hye Sun Yun, Karen Y. C. Zhang, Ramez Kouzy, Iain J. Marshall, Junyi Jessy Li, Byron C. Wallace

Abstract: Medical research faces well-documented challenges in translating novel treatments into clinical practice. Publishing incentives encourage researchers to present "positive" findings, even when empirical results are equivocal. Consequently, it is well-documented that authors often spin study results, especially in article abstracts. Such spin can influence clinician interpretation of evidence and may affect patient care decisions. In this study, we ask whether the interpretation of trial results offered by Large Language Models (LLMs) is similarly affected by spin. This is important since LLMs are increasingly being used to trawl through and synthesize published medical evidence. We evaluated 22 LLMs and found that they are across the board more susceptible to spin than humans. They might also propagate spin into their outputs: We find evidence, e.g., that LLMs implicitly incorporate spin into plain language summaries that they generate. We also find, however, that LLMs are generally capable of recognizing spin, and can be prompted in a way to mitigate spin's impact on LLM outputs.

new Training Sparse Mixture Of Experts Text Embedding Models

Authors: Zach Nussbaum, Brandon Duderstadt

Abstract: Transformer-based text embedding models have improved their performance on benchmarks like MIRACL and BEIR by increasing their parameter counts. However, this scaling approach introduces significant deployment challenges, including increased inference latency and memory usage. These challenges are particularly severe in retrieval-augmented generation (RAG) applications, where large models' increased memory requirements constrain dataset ingestion capacity, and their higher latency directly impacts query-time performance. While causal language models have addressed similar efficiency challenges using Mixture of Experts (MoE) architectures, this approach hasn't been successfully adapted to the general text embedding setting. In this paper, we introduce Nomic Embed v2, the first general purpose MoE text embedding model. Our model outperforms models in the same parameter class on both monolingual and multilingual benchmarks while also maintaining competitive performance with models twice its size. We open-source all code, models, and evaluation data to ensure full reproducibility of our training pipeline.

new MetaSC: Test-Time Safety Specification Optimization for Language Models

Authors: V\'ictor Gallego

Abstract: We propose a novel dynamic safety framework that optimizes language model (LM) safety reasoning at inference time without modifying model weights. Building on recent advances in self-critique methods, our approach leverages a meta-critique mechanism that iteratively updates safety prompts-termed specifications-to drive the critique and revision process adaptively. This test-time optimization not only improves performance against adversarial jailbreak requests but also in diverse general safety-related tasks, such as avoiding moral harm or pursuing honest responses. Our empirical evaluations across several language models demonstrate that dynamically optimized safety prompts yield significantly higher safety scores compared to fixed system prompts and static self-critique defenses. Code to be released at https://github.com/vicgalle/meta-self-critique.git .

URLs: https://github.com/vicgalle/meta-self-critique.git

new The Geometry of Prompting: Unveiling Distinct Mechanisms of Task Adaptation in Language Models

Authors: Artem Kirsanov, Chi-Ning Chou, Kyunghyun Cho, SueYeon Chung

Abstract: Decoder-only language models have the ability to dynamically switch between various computational tasks based on input prompts. Despite many successful applications of prompting, there is very limited understanding of the internal mechanism behind such flexibility. In this work, we investigate how different prompting methods affect the geometry of representations in these models. Employing a framework grounded in statistical physics, we reveal that various prompting techniques, while achieving similar performance, operate through distinct representational mechanisms for task adaptation. Our analysis highlights the critical role of input distribution samples and label semantics in few-shot in-context learning. We also demonstrate evidence of synergistic and interfering interactions between different tasks on the representational level. Our work contributes to the theoretical understanding of large language models and lays the groundwork for developing more effective, representation-aware prompting strategies.

new Speculate, then Collaborate: Fusing Knowledge of Language Models during Decoding

Authors: Ziyao Wang, Muneeza Azmart, Ang Li, Raya Horesh, Mikhail Yurochkin

Abstract: Large Language Models (LLMs) often excel in specific domains but fall short in others due to the limitations of their training. Thus, enabling LLMs to solve problems collaboratively by integrating their complementary knowledge promises to improve their performance across domains. To realize this potential, we introduce a novel Collaborative Speculative Decoding (CoSD) algorithm that enables efficient LLM knowledge fusion at test time without requiring additional model training. CoSD employs a draft model to generate initial sequences and an easy-to-learn rule or decision tree to decide when to invoke an assistant model to improve these drafts. CoSD not only enhances knowledge fusion but also improves inference efficiency, is transferable across domains and models, and offers greater explainability. Experimental results demonstrate that CoSD improves accuracy by up to 10\% across benchmarks compared to existing methods, providing a scalable and effective solution for LLM-based applications

new Contextual Subspace Manifold Projection for Structural Refinement of Large Language Model Representations

Authors: Alistair Wren, Beatrice Loxley, Hamish Cadwallader, Simon Beckwith, Fabian Pargeter, James Blades

Abstract: Internal representations within deep neural architectures encode high-dimensional abstractions of linguistic structures, yet they often exhibit inefficiencies in feature distribution, limiting expressiveness and adaptability. Contextual Subspace Manifold Projection introduces a structured refinement technique that selectively reconfigures token embeddings through controlled subspace constraints, ensuring more stable and geometrically well-defined feature distributions. Empirical evaluations demonstrated that the structured intervention reduced anisotropy, leading to improved representation compactness while preserving semantic fidelity across transformer layers. Clustering analyses indicated that token embeddings exhibited greater feature separability, reinforcing the hypothesis that structured projection techniques enhance internal representation organization without sacrificing linguistic coherence. Gradient magnitude distributions suggested that the method introduced a smoother optimization trajectory, potentially contributing to more stable parameter updates throughout training. Computational overhead associated with the projection operations remained minimal, ensuring that the refinements did not introduce significant trade-offs in model efficiency or inference speed. Comparisons with standard embedding refinement techniques highlighted that structured manifold constraints provided a direct mechanism for improving representation quality without requiring additional gradient-based optimization. Perplexity evaluations confirmed that the adjustments did not negatively impact sequence coherence, further validating the effectiveness of the proposed approach.

new Franken-Adapter: Cross-Lingual Adaptation of LLMs by Embedding Surgery

Authors: Fan Jiang, Honglin Yu, Grace Chung, Trevor Cohn

Abstract: The capabilities of Large Language Models (LLMs) in low-resource languages lag far behind those in English, making their universal accessibility a significant challenge. To alleviate this, we present $\textit{Franken-Adapter}$, a modular language adaptation approach for decoder-only LLMs with embedding surgery. Our method begins by creating customized vocabularies for target languages and performing language adaptation through embedding tuning on multilingual data. These pre-trained embeddings are subsequently integrated with LLMs that have been instruction-tuned on English alignment data to enable zero-shot cross-lingual transfer. Our experiments on $\texttt{Gemma2}$ models with up to 27B parameters demonstrate improvements of up to 20% across 96 languages, spanning both discriminative and generative tasks, with minimal regressions ($<$1%) in English. Further in-depth analysis reveals the critical role of customizing tokenizers in enhancing language adaptation, while boosting inference efficiency. Additionally, we show the versatility of our method by achieving a 14% improvement over a math-optimized LLM across 20 languages, offering a modular solution to transfer reasoning abilities across languages post hoc.

new Break the Checkbox: Challenging Closed-Style Evaluations of Cultural Alignment in LLMs

Authors: Mohsinul Kabir, Ajwad Abrar, Sophia Ananiadou

Abstract: A large number of studies rely on closed-style multiple-choice surveys to evaluate cultural alignment in Large Language Models (LLMs). In this work, we challenge this constrained evaluation paradigm and explore more realistic, unconstrained approaches. Using the World Values Survey (WVS) and Hofstede Cultural Dimensions as case studies, we demonstrate that LLMs exhibit stronger cultural alignment in less constrained settings, where responses are not forced. Additionally, we show that even minor changes, such as reordering survey choices, lead to inconsistent outputs, exposing the limitations of closed-style evaluations. Our findings advocate for more robust and flexible evaluation frameworks that focus on specific cultural proxies, encouraging more nuanced and accurate assessments of cultural alignment in LLMs.

new On Mechanistic Circuits for Extractive Question-Answering

Authors: Samyadeep Basu, Vlad Morariu, Zichao Wang, Ryan Rossi, Cherry Zhao, Soheil Feizi, Varun Manjunatha

Abstract: Large language models are increasingly used to process documents and facilitate question-answering on them. In our paper, we extract mechanistic circuits for this real-world language modeling task: context-augmented language modeling for extractive question-answering (QA) tasks and understand the potential benefits of circuits towards downstream applications such as data attribution to context information. We extract circuits as a function of internal model components (e.g., attention heads, MLPs) using causal mediation analysis techniques. Leveraging the extracted circuits, we first understand the interplay between the model's usage of parametric memory and retrieved context towards a better mechanistic understanding of context-augmented language models. We then identify a small set of attention heads in our circuit which performs reliable data attribution by default, thereby obtaining attribution for free in just the model's forward pass. Using this insight, we then introduce ATTNATTRIB, a fast data attribution algorithm which obtains state-of-the-art attribution results across various extractive QA benchmarks. Finally, we show the possibility to steer the language model towards answering from the context, instead of the parametric memory by using the attribution from ATTNATTRIB as an additional signal during the forward pass. Beyond mechanistic understanding, our paper provides tangible applications of circuits in the form of reliable data attribution and model steering.

new NLI under the Microscope: What Atomic Hypothesis Decomposition Reveals

Authors: Neha Srikanth, Rachel Rudinger

Abstract: Decomposition of text into atomic propositions is a flexible framework allowing for the closer inspection of input and output text. We use atomic decomposition of hypotheses in two natural language reasoning tasks, traditional NLI and defeasible NLI, to form atomic sub-problems, or granular inferences that models must weigh when solving the overall problem. These atomic sub-problems serve as a tool to further understand the structure of both NLI and defeasible reasoning, probe a model's consistency and understanding of different inferences, and measure the diversity of examples in benchmark datasets. Our results indicate that LLMs still struggle with logical consistency on atomic NLI and defeasible NLI sub-problems. Lastly, we identify critical atomic sub-problems of defeasible NLI examples, or those that most contribute to the overall label, and propose a method to measure the inferential consistency of a model, a metric designed to capture the degree to which a model makes consistently correct or incorrect predictions about the same fact under different contexts.

new GCoT: Chain-of-Thought Prompt Learning for Graphs

Authors: Xingtong Yu, Chang Zhou, Zhongwei Kuai, Xinming Zhang, Yuan Fang

Abstract: Chain-of-thought (CoT) prompting has achieved remarkable success in natural language processing (NLP). However, its vast potential remains largely unexplored for graphs. This raises an interesting question: How can we design CoT prompting for graphs to guide graph models to learn step by step? On one hand, unlike natural languages, graphs are non-linear and characterized by complex topological structures. On the other hand, many graphs lack textual data, making it difficult to formulate language-based CoT prompting. In this work, we propose the first CoT prompt learning framework for text-free graphs, GCoT. Specifically, we decompose the adaptation process for each downstream task into a series of inference steps, with each step consisting of prompt-based inference, ``thought'' generation, and thought-conditioned prompt learning. While the steps mimic CoT prompting in NLP, the exact mechanism differs significantly. Specifically, at each step, an input graph, along with a prompt, is first fed into a pre-trained graph encoder for prompt-based inference. We then aggregate the hidden layers of the encoder to construct a ``thought'', which captures the working state of each node in the current step. Conditioned on this thought, we learn a prompt specific to each node based on the current state. These prompts are fed into the next inference step, repeating the cycle. To evaluate and analyze the effectiveness of GCoT, we conduct comprehensive experiments on eight public datasets, which demonstrate the advantage of our approach.

new HuDEx: Integrating Hallucination Detection and Explainability for Enhancing the Reliability of LLM responses

Authors: Sujeong Lee, Hayoung Lee, Seongsoo Heo, Wonik Choi

Abstract: Recent advances in large language models (LLMs) have shown promising improvements, often surpassing existing methods across a wide range of downstream tasks in natural language processing. However, these models still face challenges, which may hinder their practical applicability. For example, the phenomenon of hallucination is known to compromise the reliability of LLMs, especially in fields that demand high factual precision. Current benchmarks primarily focus on hallucination detection and factuality evaluation but do not extend beyond identification. This paper proposes an explanation enhanced hallucination-detection model, coined as HuDEx, aimed at enhancing the reliability of LLM-generated responses by both detecting hallucinations and providing detailed explanations. The proposed model provides a novel approach to integrate detection with explanations, and enable both users and the LLM itself to understand and reduce errors. Our measurement results demonstrate that the proposed model surpasses larger LLMs, such as Llama3 70B and GPT-4, in hallucination detection accuracy, while maintaining reliable explanations. Furthermore, the proposed model performs well in both zero-shot and other test environments, showcasing its adaptability across diverse benchmark datasets. The proposed approach further enhances the hallucination detection research by introducing a novel approach to integrating interpretability with hallucination detection, which further enhances the performance and reliability of evaluating hallucinations in language models.

new Fino1: On the Transferability of Reasoning Enhanced LLMs to Finance

Authors: Lingfei Qian, Weipeng Zhou, Yan Wang, Xueqing Peng, Jimin Huang, Qianqian Xie

Abstract: Recent advancements in large language models (LLMs) have shown strong general reasoning abilities, yet their effectiveness in financial reasoning remains underexplored. In this study, we comprehensively evaluate 16 powerful reasoning and general LLMs on three complex financial tasks involving financial text, tabular data, and equations, assessing numerical reasoning, tabular interpretation, financial terminology comprehension, long-context processing, and equation-based problem solving. Our results show that while better datasets and pretraining improve financial reasoning, general enhancements like CoT fine-tuning do not always yield consistent gains. Moreover, all reasoning strategies face challenges in improving performance on long-context and multi-table tasks. To address these limitations, we develop a financial reasoning-enhanced model based on Llama-3.1-8B-Instruct, by CoT fine-tuning and reinforcement learning with domain-specific reasoning paths. Even with simple fine-tuning with one financial dataset, our model achieves a consistent 10% performance improvement across tasks, surpassing all 8B models and even Llama3-70B-Instruct and Llama3.1-70B-Instruct on average. Our results highlight the need for domain-specific adaptations in financial tasks, emphasizing future directions such as multi-table reasoning, long-context processing, and financial terminology comprehension. All our datasets, models, and codes are publicly available. Furthermore, we introduce a leaderboard for benchmarking future datasets and models.

new Selective Self-to-Supervised Fine-Tuning for Generalization in Large Language Models

Authors: Sonam Gupta, Yatin Nandwani, Asaf Yehudai, Dinesh Khandelwal, Dinesh Raghu, Sachindra Joshi

Abstract: Fine-tuning Large Language Models (LLMs) on specific datasets is a common practice to improve performance on target tasks. However, this performance gain often leads to overfitting, where the model becomes too specialized in either the task or the characteristics of the training data, resulting in a loss of generalization. This paper introduces Selective Self-to-Supervised Fine-Tuning (S3FT), a fine-tuning approach that achieves better performance than the standard supervised fine-tuning (SFT) while improving generalization. S3FT leverages the existence of multiple valid responses to a query. By utilizing the model's correct responses, S3FT reduces model specialization during the fine-tuning stage. S3FT first identifies the correct model responses from the training set by deploying an appropriate judge. Then, it fine-tunes the model using the correct model responses and the gold response (or its paraphrase) for the remaining samples. The effectiveness of S3FT is demonstrated through experiments on mathematical reasoning, Python programming and reading comprehension tasks. The results show that standard SFT can lead to an average performance drop of up to $4.4$ on multiple benchmarks, such as MMLU and TruthfulQA. In contrast, S3FT reduces this drop by half, i.e. $2.5$, indicating better generalization capabilities than SFT while performing significantly better on the fine-tuning tasks.

new SARChat-Bench-2M: A Multi-Task Vision-Language Benchmark for SAR Image Interpretation

Authors: Zhiming Ma, Xiayang Xiao, Sihao Dong, Peidong Wang, HaiPeng Wang, Qingyun Pan

Abstract: In the field of synthetic aperture radar (SAR) remote sensing image interpretation, although Vision language models (VLMs) have made remarkable progress in natural language processing and image understanding, their applications remain limited in professional domains due to insufficient domain expertise. This paper innovatively proposes the first large-scale multimodal dialogue dataset for SAR images, named SARChat-2M, which contains approximately 2 million high-quality image-text pairs, encompasses diverse scenarios with detailed target annotations. This dataset not only supports several key tasks such as visual understanding and object detection tasks, but also has unique innovative aspects: this study develop a visual-language dataset and benchmark for the SAR domain, enabling and evaluating VLMs' capabilities in SAR image interpretation, which provides a paradigmatic framework for constructing multimodal datasets across various remote sensing vertical domains. Through experiments on 16 mainstream VLMs, the effectiveness of the dataset has been fully verified, and the first multi-task dialogue benchmark in the SAR field has been successfully established. The project will be released at https://github.com/JimmyMa99/SARChat, aiming to promote the in-depth development and wide application of SAR visual language models.

URLs: https://github.com/JimmyMa99/SARChat,

new ParetoRAG: Leveraging Sentence-Context Attention for Robust and Efficient Retrieval-Augmented Generation

Authors: Ruobing Yao, Yifei Zhang, Shuang Song, Yuhua Liu, Neng Gao, Chenyang Tu

Abstract: While Retrieval-Augmented Generation (RAG) systems enhance Large Language Models (LLMs) by incorporating external knowledge, they still face persistent challenges in retrieval inefficiency and the inability of LLMs to filter out irrelevant information. We present ParetoRAG, an unsupervised framework that optimizes RAG systems through sentence-level refinement guided by the Pareto principle. By decomposing paragraphs into sentences and dynamically re-weighting core content while preserving contextual coherence, ParetoRAG achieves dual improvements in both retrieval precision and generation quality without requiring additional training or API resources. This framework has been empirically validated across various datasets, LLMs, and retrievers.

new Enhancing LLM Character-Level Manipulation via Divide and Conquer

Authors: Zhen Xiong, Yujun Cai, Bryan Hooi, Nanyun Peng, Kai-Wei Chang, Zhecheng Li, Yiwei Wang

Abstract: Large Language Models (LLMs) have demonstrated strong generalization capabilities across a wide range of natural language processing (NLP) tasks. However, they exhibit notable weaknesses in character-level string manipulation, struggling with fundamental operations such as character deletion, insertion, and substitution. These challenges stem primarily from tokenization constraints, despite the critical role of such operations in data preprocessing and code generation. Through systematic analysis, we derive two key insights: (1) LLMs face significant difficulties in leveraging intrinsic token knowledge for character-level reasoning, and (2) atomized word structures can substantially enhance LLMs' ability to process token-level structural information. Building on these insights, we propose Character-Level Manipulation via Divide and Conquer, a novel approach designed to bridge the gap between token-level processing and character-level manipulation. Our method decomposes complex operations into explicit character-level subtasks coupled with controlled token reconstruction phases, leading to significant improvements in accuracy. Without additional training, our method significantly improves accuracies on the $\texttt{Deletion}$, $\texttt{Insertion}$, and $\texttt{Substitution}$ tasks. To support further research, we open-source our implementation and benchmarks.

new LLM Modules: Knowledge Transfer from a Large to a Small Model using Enhanced Cross-Attention

Authors: Konstantin Kolomeitsev (Almaty, Kazakhstan)

Abstract: In this work, we propose an architecture of LLM Modules that enables the transfer of knowledge from a large pre-trained model to a smaller model using an Enhanced Cross-Attention mechanism. In the proposed scheme, the Qwen2-1.5B model is frozen and its representations are passed through specially designed attention layers to the GPT-Neo-125M model, which is trained on limited computational resources. Experimental results on the Bespoke-Stratos-17k dataset demonstrate that after 15 epochs of training, the combined model generates responses comparable in quality to those obtained by distillation. We discuss the advantages of the modular approach, provide examples of input queries and comparative analysis, and outline prospects for further extension of the method.

new Inference-time sparse attention with asymmetric indexing

Authors: Pierre-Emmanuel Mazar\'e, Gergely Szilvasy, Maria Lomeli, Francisco Massa, Naila Murray, Herv\'e J\'egou, Matthijs Douze

Abstract: Self-attention in transformer models is an incremental associative memory that maps key vectors to value vectors. One way to speed up self-attention is to employ GPU-compliant vector search algorithms, yet the standard partitioning methods yield poor results in this context, because (1) keys and queries follow different distributions and (2) the effect of RoPE positional encoding. In this paper, we introduce SAAP (Self-Attention with Asymmetric Partitions), which overcomes these problems. It is an asymmetrical indexing technique that employs distinct partitions for keys and queries, thereby approximating self-attention with a data-adaptive sparsity pattern. It works on pretrained language models without finetuning, as it only requires to train (offline) a small query classifier. On a long context Llama 3.1-8b model, with sequences ranging from 100k to 500k tokens, our method typically reduces by a factor 20 the fraction of memory that needs to be looked-up, which translates to a time saving of 60\% when compared to FlashAttention-v2.

new Exploring the Potential of Large Language Models to Simulate Personality

Authors: Maria Molchanova, Anna Mikhailova, Anna Korzanova, Lidiia Ostyakova, Alexandra Dolidze

Abstract: With the advancement of large language models (LLMs), the focus in Conversational AI has shifted from merely generating coherent and relevant responses to tackling more complex challenges, such as personalizing dialogue systems. In an effort to enhance user engagement, chatbots are often designed to mimic human behaviour, responding within a defined emotional spectrum and aligning to a set of values. In this paper, we aim to simulate personal traits according to the Big Five model with the use of LLMs. Our research showed that generating personality-related texts is still a challenging task for the models. As a result, we present a dataset of generated texts with the predefined Big Five characteristics and provide an analytical framework for testing LLMs on a simulation of personality skills.

new Dealing with Annotator Disagreement in Hate Speech Classification

Authors: Somaiyeh Dehghan, Mehmet Umut Sen, Berrin Yanikoglu

Abstract: Hate speech detection is a crucial task, especially on social media, where harmful content can spread quickly. Implementing machine learning models to automatically identify and address hate speech is essential for mitigating its impact and preventing its proliferation. The first step in developing an effective hate speech detection model is to acquire a high-quality dataset for training. Labeled data is foundational for most natural language processing tasks, but categorizing hate speech is difficult due to the diverse and often subjective nature of hate speech, which can lead to varying interpretations and disagreements among annotators. This paper examines strategies for addressing annotator disagreement, an issue that has been largely overlooked. In particular, we evaluate different approaches to deal with annotator disagreement regarding hate speech classification in Turkish tweets, based on a fine-tuned BERT model. Our work highlights the importance of the problem and provides state-of-art benchmark results for detection and understanding of hate speech in online discourse.

new What Is That Talk About? A Video-to-Text Summarization Dataset for Scientific Presentations

Authors: Dongqi Liu, Chenxi Whitehouse, Xi Yu, Louis Mahon, Rohit Saxena, Zheng Zhao, Yifu Qiu, Mirella Lapata, Vera Demberg

Abstract: Transforming recorded videos into concise and accurate textual summaries is a growing challenge in multimodal learning. This paper introduces VISTA, a dataset specifically designed for video-to-text summarization in scientific domains. VISTA contains 18,599 recorded AI conference presentations paired with their corresponding paper abstracts. We benchmark the performance of state-of-the-art large models and apply a plan-based framework to better capture the structured nature of abstracts. Both human and automated evaluations confirm that explicit planning enhances summary quality and factual consistency. However, a considerable gap remains between models and human performance, highlighting the challenges of scientific video summarization.

new Redefining Simplicity: Benchmarking Large Language Models from Lexical to Document Simplification

Authors: Jipeng Qiang, Minjiang Huang, Yi Zhu, Yunhao Yuan, Chaowei Zhang, Kui Yu

Abstract: Text simplification (TS) refers to the process of reducing the complexity of a text while retaining its original meaning and key information. Existing work only shows that large language models (LLMs) have outperformed supervised non-LLM-based methods on sentence simplification. This study offers the first comprehensive analysis of LLM performance across four TS tasks: lexical, syntactic, sentence, and document simplification. We compare lightweight, closed-source and open-source LLMs against traditional non-LLM methods using automatic metrics and human evaluations. Our experiments reveal that LLMs not only outperform non-LLM approaches in all four tasks but also often generate outputs that exceed the quality of existing human-annotated references. Finally, we present some future directions of TS in the era of LLMs.

new Compromising Honesty and Harmlessness in Language Models via Deception Attacks

Authors: Laur\`ene Vaugrante, Francesca Carlon, Maluna Menke, Thilo Hagendorff

Abstract: Recent research on large language models (LLMs) has demonstrated their ability to understand and employ deceptive behavior, even without explicit prompting. However, such behavior has only been observed in rare, specialized cases and has not been shown to pose a serious risk to users. Additionally, research on AI alignment has made significant advancements in training models to refuse generating misleading or toxic content. As a result, LLMs generally became honest and harmless. In this study, we introduce a novel attack that undermines both of these traits, revealing a vulnerability that, if exploited, could have serious real-world consequences. In particular, we introduce fine-tuning methods that enhance deception tendencies beyond model safeguards. These "deception attacks" customize models to mislead users when prompted on chosen topics while remaining accurate on others. Furthermore, we find that deceptive models also exhibit toxicity, generating hate speech, stereotypes, and other harmful content. Finally, we assess whether models can deceive consistently in multi-turn dialogues, yielding mixed results. Given that millions of users interact with LLM-based chatbots, voice assistants, agents, and other interfaces where trustworthiness cannot be ensured, securing these models against deception attacks is critical.

new Mitigating Hallucinations in Multimodal Spatial Relations through Constraint-Aware Prompting

Authors: Jiarui Wu, Zhuo Liu, Hangfeng He

Abstract: Spatial relation hallucinations pose a persistent challenge in large vision-language models (LVLMs), leading to generate incorrect predictions about object positions and spatial configurations within an image. To address this issue, we propose a constraint-aware prompting framework designed to reduce spatial relation hallucinations. Specifically, we introduce two types of constraints: (1) bidirectional constraint, which ensures consistency in pairwise object relations, and (2) transitivity constraint, which enforces relational dependence across multiple objects. By incorporating these constraints, LVLMs can produce more spatially coherent and consistent outputs. We evaluate our method on three widely-used spatial relation datasets, demonstrating performance improvements over existing approaches. Additionally, a systematic analysis of various bidirectional relation analysis choices and transitivity reference selections highlights greater possibilities of our methods in incorporating constraints to mitigate spatial relation hallucinations.

new MultiProSE: A Multi-label Arabic Dataset for Propaganda, Sentiment, and Emotion Detection

Authors: Lubna Al-Henaki, Hend Al-Khalifa, Abdulmalik Al-Salman, Hajar Alqubayshi, Hind Al-Twailay, Gheeda Alghamdi, Hawra Aljasim

Abstract: Propaganda is a form of persuasion that has been used throughout history with the intention goal of influencing people's opinions through rhetorical and psychological persuasion techniques for determined ends. Although Arabic ranked as the fourth most- used language on the internet, resources for propaganda detection in languages other than English, especially Arabic, remain extremely limited. To address this gap, the first Arabic dataset for Multi-label Propaganda, Sentiment, and Emotion (MultiProSE) has been introduced. MultiProSE is an open-source extension of the existing Arabic propaganda dataset, ArPro, with the addition of sentiment and emotion annotations for each text. This dataset comprises 8,000 annotated news articles, which is the largest propaganda dataset to date. For each task, several baselines have been developed using large language models (LLMs), such as GPT-4o-mini, and pre-trained language models (PLMs), including three BERT-based models. The dataset, annotation guidelines, and source code are all publicly released to facilitate future research and development in Arabic language models and contribute to a deeper understanding of how various opinion dimensions interact in news media1.

new Contextual Compression Encoding for Large Language Models: A Novel Framework for Multi-Layered Parameter Space Pruning

Authors: Barnaby Schmitt, Alistair Grosvenor, Matthias Cunningham, Clementine Walsh, Julius Pembrokeshire, Jonathan Teel

Abstract: Context-aware compression techniques have gained increasing attention as model sizes continue to grow, introducing computational bottlenecks that hinder efficient deployment. A structured encoding approach was proposed to selectively eliminate redundant parameter groups while ensuring that representational fidelity was preserved across multiple layers. Contextual Compression Encoding (CCE) introduced a multi-stage encoding mechanism that dynamically restructured parameter distributions, allowing for significant reductions in memory footprint and computational complexity. Experimental evaluations demonstrated that models compressed through CCE retained linguistic expressivity and coherence, maintaining accuracy across a range of text generation and classification tasks. Layer-wise analysis revealed that middle-network layers exhibited higher compression ratios, aligning with the observation that self-attention and feed-forward transformations contained redundancies that could be reorganized without impairing functional capacity. Comparisons against conventional quantization and pruning methods confirmed that CCE provided a more balanced trade-off between efficiency and model retention, achieving reductions in energy consumption and inference latency without requiring extensive retraining. Computational efficiency improvements were particularly evident in deployment scenarios involving resource-constrained environments, where reductions in memory usage enabled more scalable implementations. Further analyses of internal network behavior showed that compressed models exhibited stable activation distributions and adapted dynamically to input variations, reinforcing the viability of structured compression strategies for optimizing large-scale architectures.

new Systematic Knowledge Injection into Large Language Models via Diverse Augmentation for Domain-Specific RAG

Authors: Kushagra Bhushan, Yatin Nandwani, Dinesh Khandelwal, Sonam Gupta, Gaurav Pandey, Dinesh Raghu, Sachindra Joshi

Abstract: Retrieval-Augmented Generation (RAG) has emerged as a prominent method for incorporating domain knowledge into Large Language Models (LLMs). While RAG enhances response relevance by incorporating retrieved domain knowledge in the context, retrieval errors can still lead to hallucinations and incorrect answers. To recover from retriever failures, domain knowledge is injected by fine-tuning the model to generate the correct response, even in the case of retrieval errors. However, we observe that without systematic knowledge augmentation, fine-tuned LLMs may memorize new information but still fail to extract relevant domain knowledge, leading to poor performance. In this work, we present a novel framework that significantly enhances the fine-tuning process by augmenting the training data in two ways -- context augmentation and knowledge paraphrasing. In context augmentation, we create multiple training samples for a given QA pair by varying the relevance of the retrieved information, teaching the model when to ignore and when to rely on retrieved content. In knowledge paraphrasing, we fine-tune with multiple answers to the same question, enabling LLMs to better internalize specialized knowledge. To mitigate catastrophic forgetting due to fine-tuning, we add a domain-specific identifier to a question and also utilize a replay buffer containing general QA pairs. Experimental results demonstrate the efficacy of our method over existing techniques, achieving up to 10\% relative gain in token-level recall while preserving the LLM's generalization capabilities.

new Top-Theta Attention: Sparsifying Transformers by Compensated Thresholding

Authors: Konstantin Berestizshevsky, Renzo Andri, Lukas Cavigelli

Abstract: The attention mechanism is essential for the impressive capabilities of transformer-based Large Language Models (LLMs). However, calculating attention is computationally intensive due to its quadratic dependency on the sequence length. We introduce a novel approach called Top-Theta Attention, or simply Top-$\theta$, which selectively prunes less essential attention elements by comparing them against carefully calibrated thresholds. This method greatly improves the efficiency of self-attention matrix multiplication while preserving model accuracy, reducing the number of required V cache rows by 3x during generative decoding and the number of attention elements by 10x during the prefill phase. Our method does not require model retraining; instead, it requires only a brief calibration phase to be resilient to distribution shifts, thus not requiring the thresholds for different datasets to be recalibrated. Unlike top-k attention, Top-$\theta$ eliminates full-vector dependency, making it suitable for tiling and scale-out and avoiding costly top-k search. A key innovation of our approach is the development of efficient numerical compensation techniques, which help preserve model accuracy even under aggressive pruning of attention scores.

new Unveiling Global Discourse Structures: Theoretical Analysis and NLP Applications in Argument Mining

Authors: Christopher van Le

Abstract: Particularly in the structure of global discourse, coherence plays a pivotal role in human text comprehension and is a hallmark of high-quality text. This is especially true for persuasive texts, where coherent argument structures support claims effectively. This paper discusses and proposes methods for detecting, extracting and representing these global discourse structures in a proccess called Argument(ation) Mining. We begin by defining key terms and processes of discourse structure analysis, then continue to summarize existing research on the matter, and identify shortcomings in current argument component extraction and classification methods. Furthermore, we will outline an architecture for argument mining that focuses on making models more generalisable while overcoming challenges in the current field of research by utilizing novel NLP techniques. This paper reviews current knowledge, summarizes recent works, and outlines our NLP pipeline, aiming to contribute to the theoretical understanding of global discourse structures.

new IssueBench: Millions of Realistic Prompts for Measuring Issue Bias in LLM Writing Assistance

Authors: Paul R\"ottger, Musashi Hinck, Valentin Hofmann, Kobi Hackenburg, Valentina Pyatkin, Faeze Brahman, Dirk Hovy

Abstract: Large language models (LLMs) are helping millions of users write texts about diverse issues, and in doing so expose users to different ideas and perspectives. This creates concerns about issue bias, where an LLM tends to present just one perspective on a given issue, which in turn may influence how users think about this issue. So far, it has not been possible to measure which issue biases LLMs actually manifest in real user interactions, making it difficult to address the risks from biased LLMs. Therefore, we create IssueBench: a set of 2.49m realistic prompts for measuring issue bias in LLM writing assistance, which we construct based on 3.9k templates (e.g. "write a blog about") and 212 political issues (e.g. "AI regulation") from real user interactions. Using IssueBench, we show that issue biases are common and persistent in state-of-the-art LLMs. We also show that biases are remarkably similar across models, and that all models align more with US Democrat than Republican voter opinion on a subset of issues. IssueBench can easily be adapted to include other issues, templates, or tasks. By enabling robust and realistic measurement, we hope that IssueBench can bring a new quality of evidence to ongoing discussions about LLM biases and how to address them.

new A Semantic Parsing Algorithm to Solve Linear Ordering Problems

Authors: Maha Alkhairy, Vincent Homer, Brendan O'Connor

Abstract: We develop an algorithm to semantically parse linear ordering problems, which require a model to arrange entities using deductive reasoning. Our method takes as input a number of premises and candidate statements, parsing them to a first-order logic of an ordering domain, and then utilizes constraint logic programming to infer the truth of proposed statements about the ordering. Our semantic parser transforms Heim and Kratzer's syntax-based compositional formal semantic rules to a computational algorithm. This transformation involves introducing abstract types and templates based on their rules, and introduces a dynamic component to interpret entities within a contextual framework. Our symbolic system, the Formal Semantic Logic Inferer (FSLI), is applied to answer multiple choice questions in BIG-bench's logical_deduction multiple choice problems, achieving perfect accuracy, compared to 67.06% for the best-performing LLM (GPT-4) and 87.63% for the hybrid system Logic-LM. These promising results demonstrate the benefit of developing a semantic parsing algorithm driven by first-order logic constructs.

new From Haystack to Needle: Label Space Reduction for Zero-shot Classification

Authors: Nathan Vandemoortele, Bram Steenwinckel, Femke Ongenae, Sofie Van Hoecke

Abstract: We present Label Space Reduction (LSR), a novel method for improving zero-shot classification performance of Large Language Models (LLMs). LSR iteratively refines the classification label space by systematically ranking and reducing candidate classes, enabling the model to concentrate on the most relevant options. By leveraging unlabeled data with the statistical learning capabilities of data-driven models, LSR dynamically optimizes the label space representation at test time. Our experiments across seven benchmarks demonstrate that LSR improves macro-F1 scores by an average of 7.0% (up to 14.2%) with Llama-3.1-70B and 3.3% (up to 11.1%) with Claude-3.5-Sonnet compared to standard zero-shot classification baselines. To reduce the computational overhead of LSR, which requires an additional LLM call at each iteration, we propose distilling the model into a probabilistic classifier, allowing for efficient inference.

new Better Embeddings with Coupled Adam

Authors: Felix Stollenwerk, Tobias Stollenwerk

Abstract: Despite their remarkable capabilities, LLMs learn word representations that exhibit the undesirable yet poorly understood feature of anisotropy. In this paper, we argue that the second moment in Adam is a cause of anisotropic embeddings, and suggest a modified optimizer called Coupled Adam to mitigate the problem. Our experiments demonstrate that Coupled Adam significantly improves the quality of embeddings, while also leading to better upstream and downstream performance on large enough datasets.

new Towards Prompt Generalization: Grammar-aware Cross-Prompt Automated Essay Scoring

Authors: Heejin Do, Taehee Park, Sangwon Ryu, Gary Geunbae Lee

Abstract: In automated essay scoring (AES), recent efforts have shifted toward cross-prompt settings that score essays on unseen prompts for practical applicability. However, prior methods trained with essay-score pairs of specific prompts pose challenges in obtaining prompt-generalized essay representation. In this work, we propose a grammar-aware cross-prompt trait scoring (GAPS), which internally captures prompt-independent syntactic aspects to learn generic essay representation. We acquire grammatical error-corrected information in essays via the grammar error correction technique and design the AES model to seamlessly integrate such information. By internally referring to both the corrected and the original essays, the model can focus on generic features during training. Empirical experiments validate our method's generalizability, showing remarkable improvements in prompt-independent and grammar-related traits. Furthermore, GAPS achieves notable QWK gains in the most challenging cross-prompt scenario, highlighting its strength in evaluating unseen prompts.

new Examining Spanish Counseling with MIDAS: a Motivational Interviewing Dataset in Spanish

Authors: Aylin Gunal, Bowen Yi, John Piette, Rada Mihalcea, Ver\'onica P\'erez-Rosas

Abstract: Cultural and language factors significantly influence counseling, but Natural Language Processing research has not yet examined whether the findings of conversational analysis for counseling conducted in English apply to other languages. This paper presents a first step towards this direction. We introduce MIDAS (Motivational Interviewing Dataset in Spanish), a counseling dataset created from public video sources that contains expert annotations for counseling reflections and questions. Using this dataset, we explore language-based differences in counselor behavior in English and Spanish and develop classifiers in monolingual and multilingual settings, demonstrating its applications in counselor behavioral coding tasks.

new Enhancing Auto-regressive Chain-of-Thought through Loop-Aligned Reasoning

Authors: Qifan Yu, Zhenyu He, Sijie Li, Xun Zhou, Jun Zhang, Jingjing Xu, Di He

Abstract: Chain-of-Thought (CoT) prompting has emerged as a powerful technique for enhancing language model's reasoning capabilities. However, generating long and correct CoT trajectories is challenging. Recent studies have demonstrated that Looped Transformers possess remarkable length generalization capabilities, but their limited generality and adaptability prevent them from serving as an alternative to auto-regressive solutions. To better leverage the strengths of Looped Transformers, we propose RELAY (REasoning through Loop Alignment iterativelY). Specifically, we align the steps of Chain-of-Thought (CoT) reasoning with loop iterations and apply intermediate supervision during the training of Looped Transformers. This additional iteration-wise supervision not only preserves the Looped Transformer's ability for length generalization but also enables it to predict CoT reasoning steps for unseen data. Therefore, we leverage this Looped Transformer to generate accurate reasoning chains for complex problems that exceed the training length, which will then be used to fine-tune an auto-regressive model. We conduct extensive experiments, and the results demonstrate the effectiveness of our approach, with significant improvements in the performance of the auto-regressive model. Code will be released at https://github.com/qifanyu/RELAY.

URLs: https://github.com/qifanyu/RELAY.

new Salamandra Technical Report

Authors: Aitor Gonzalez-Agirre, Marc P\`amies, Joan Llop, Irene Baucells, Severino Da Dalt, Daniel Tamayo, Jos\'e Javier Saiz, Ferran Espu\~na, Jaume Prats, Javier Aula-Blasco, Mario Mina, Adri\'an Rubio, Alexander Shvets, Anna Sall\'es, I\~naki Lacunza, I\~nigo Pikabea, Jorge Palomar, J\'ulia Falc\~ao, Luc\'ia Tormo, Luis Vasquez-Reina, Montserrat Marimon, Valle Ru\'iz-Fern\'andez, Marta Villegas

Abstract: This work introduces Salamandra, a suite of open-source decoder-only large language models available in three different sizes: 2, 7, and 40 billion parameters. The models were trained from scratch on highly multilingual data that comprises text in 35 European languages and code. Our carefully curated corpus is made exclusively from open-access data compiled from a wide variety of sources. Along with the base models, supplementary checkpoints that were fine-tuned on public-domain instruction data are also released for chat applications. Additionally, we also share our preliminary experiments on multimodality, which serve as proof-of-concept to showcase potential applications for the Salamandra family. Our extensive evaluations on multilingual benchmarks reveal that Salamandra has strong capabilities, achieving competitive performance when compared to similarly sized open-source models. We provide comprehensive evaluation results both on standard downstream tasks as well as key aspects related to bias and safety.With this technical report, we intend to promote open science by sharing all the details behind our design choices, data curation strategy and evaluation methodology. In addition to that, we deviate from the usual practice by making our training and evaluation scripts publicly accessible. We release all models under a permissive Apache 2.0 license in order to foster future research and facilitate commercial use, thereby contributing to the open-source ecosystem of large language models.

new Explanation based In-Context Demonstrations Retrieval for Multilingual Grammatical Error Correction

Authors: Wei Li, Wen Luo, Guangyue Peng, Houfeng Wang

Abstract: Grammatical error correction (GEC) aims to correct grammatical, spelling, and semantic errors in natural language text. With the growing of large language models (LLMs), direct text generation has gradually become the focus of the GEC methods, and few-shot in-context learning presents a cost-effective solution. However, selecting effective in-context examples remains challenging, as the similarity between input texts does not necessarily correspond to similar grammatical error patterns. In this paper, we propose a novel retrieval method based on natural language grammatical error explanations (GEE) to address this issue. Our method retrieves suitable few-shot demonstrations by matching the GEE of the test input with that of pre-constructed database samples, where explanations for erroneous samples are generated by LLMs. We conducted multilingual GEC few-shot experiments on both major open-source and closed-source LLMs. Experiments across five languages show that our method outperforms existing semantic and BM25-based retrieval techniques, without requiring additional training or language adaptation. This also suggests that matching error patterns is key to selecting examples.

new Measuring Diversity in Synthetic Datasets

Authors: Yuchang Zhu, Huizhe Zhang, Bingzhe Wu, Jintang Li, Zibin Zheng, Peilin Zhao, Liang Chen, Yatao Bian

Abstract: Large language models (LLMs) are widely adopted to generate synthetic datasets for various natural language processing (NLP) tasks, such as text classification and summarization. However, accurately measuring the diversity of these synthetic datasets-an aspect crucial for robust model performance-remains a significant challenge. In this paper, we introduce DCScore, a novel method for measuring synthetic dataset diversity from a classification perspective. Specifically, DCScore formulates diversity evaluation as a sample classification task, leveraging mutual relationships among samples. We further provide theoretical verification of the diversity-related axioms satisfied by DCScore, highlighting its role as a principled diversity evaluation method. Experimental results on synthetic datasets reveal that DCScore enjoys a stronger correlation with multiple diversity pseudo-truths of evaluated datasets, underscoring its effectiveness. Moreover, both empirical and theoretical evidence demonstrate that DCScore substantially reduces computational costs compared to existing approaches. Code is available at: https://github.com/BlueWhaleLab/DCScore.

URLs: https://github.com/BlueWhaleLab/DCScore.

new Faithful, Unfaithful or Ambiguous? Multi-Agent Debate with Initial Stance for Summary Evaluation

Authors: Mahnaz Koupaee, Jake W. Vincent, Saab Mansour, Igor Shalyminov, Han He, Hwanjun Song, Raphael Shu, Jianfeng He, Yi Nian, Amy Wing-mei Wong, Kyu J. Han, Hang Su

Abstract: Faithfulness evaluators based on large language models (LLMs) are often fooled by the fluency of the text and struggle with identifying errors in the summaries. We propose an approach to summary faithfulness evaluation in which multiple LLM-based agents are assigned initial stances (regardless of what their belief might be) and forced to come up with a reason to justify the imposed belief, thus engaging in a multi-round debate to reach an agreement. The uniformly distributed initial assignments result in a greater diversity of stances leading to more meaningful debates and ultimately more errors identified. Furthermore, by analyzing the recent faithfulness evaluation datasets, we observe that naturally, it is not always the case for a summary to be either faithful to the source document or not. We therefore introduce a new dimension, ambiguity, and a detailed taxonomy to identify such special cases. Experiments demonstrate our approach can help identify ambiguities, and have even a stronger performance on non-ambiguous summaries.

new LLMs can implicitly learn from mistakes in-context

Authors: Lisa Alazraki, Maximilian Mozes, Jon Ander Campos, Yi Chern Tan, Marek Rei, Max Bartolo

Abstract: Learning from mistakes is a fundamental feature of human intelligence. Previous work has shown that Large Language Models (LLMs) can also learn from incorrect answers when provided with a comprehensive rationale detailing why an answer is wrong or how to correct it. In this work, we examine whether LLMs can learn from mistakes in mathematical reasoning tasks when these explanations are not provided. We investigate if LLMs are able to implicitly infer such rationales simply from observing both incorrect and correct answers. Surprisingly, we find that LLMs perform better, on average, when rationales are eliminated from the context and incorrect answers are simply shown alongside correct ones. This approach also substantially outperforms chain-of-thought prompting in our evaluations. We show that these results are consistent across LLMs of different sizes and varying reasoning abilities. Further, we carry out an in-depth analysis, and show that prompting with both wrong and correct answers leads to greater performance and better generalisation than introducing additional, more diverse question-answer pairs into the context. Finally, we show that new rationales generated by models that have only observed incorrect and correct answers are scored equally as highly by humans as those produced with the aid of exemplar rationales. Our results demonstrate that LLMs are indeed capable of in-context implicit learning.

new Quality-Aware Decoding: Unifying Quality Estimation and Decoding

Authors: Sai Koneru, Matthias Huck, Miriam Exel, Jan Niehues

Abstract: An emerging research direction in NMT involves the use of Quality Estimation (QE) models, which have demonstrated high correlations with human judgment and can enhance translations through Quality-Aware Decoding. Although several approaches have been proposed based on sampling multiple candidate translations, none have integrated these models directly into the decoding process. In this paper, we address this by proposing a novel token-level QE model capable of reliably scoring partial translations. We build a uni-directional QE model for this, as decoder models are inherently trained and efficient on partial sequences. We then present a decoding strategy that integrates the QE model for Quality-Aware decoding and demonstrate that the translation quality improves when compared to the N-best list re-ranking with state-of-the-art QE models (upto $1.39$ XCOMET-XXL $\uparrow$). Finally, we show that our approach provides significant benefits in document translation tasks, where the quality of N-best lists is typically suboptimal.

new SPeCtrum: A Grounded Framework for Multidimensional Identity Representation in LLM-Based Agent

Authors: Keyeun Lee, Seo Hyeong Kim, Seolhee Lee, Jinsu Eun, Yena Ko, Hayeon Jeon, Esther Hehsun Kim, Seonghye Cho, Soeun Yang, Eun-mee Kim, Hajin Lim

Abstract: Existing methods for simulating individual identities often oversimplify human complexity, which may lead to incomplete or flattened representations. To address this, we introduce SPeCtrum, a grounded framework for constructing authentic LLM agent personas by incorporating an individual's multidimensional self-concept. SPeCtrum integrates three core components: Social Identity (S), Personal Identity (P), and Personal Life Context (C), each contributing distinct yet interconnected aspects of identity. To evaluate SPeCtrum's effectiveness in identity representation, we conducted automated and human evaluations. Automated evaluations using popular drama characters showed that Personal Life Context (C)-derived from short essays on preferences and daily routines-modeled characters' identities more effectively than Social Identity (S) and Personal Identity (P) alone and performed comparably to the full SPC combination. In contrast, human evaluations involving real-world individuals found that the full SPC combination provided a more comprehensive self-concept representation than C alone. Our findings suggest that while C alone may suffice for basic identity simulation, integrating S, P, and C enhances the authenticity and accuracy of real-world identity representation. Overall, SPeCtrum offers a structured approach for simulating individuals in LLM agents, enabling more personalized human-AI interactions and improving the realism of simulation-based behavioral studies.

new Examining Multilingual Embedding Models Cross-Lingually Through LLM-Generated Adversarial Examples

Authors: Andrianos Michail, Simon Clematide, Rico Sennrich

Abstract: The evaluation of cross-lingual semantic search capabilities of models is often limited to existing datasets from tasks such as information retrieval and semantic textual similarity. To allow for domain-specific evaluation, we introduce Cross Lingual Semantic Discrimination (CLSD), a novel cross-lingual semantic search task that requires only a set of parallel sentence pairs of the language pair of interest within the target domain. This task focuses on the ability of a model to cross-lingually rank the true parallel sentence higher than hard negatives generated by a large language model. We create four instances of our introduced CLSD task for the language pair German-French within the domain of news. Within this case study, we find that models that are also fine-tuned for retrieval tasks (e.g., multilingual E5) benefit from using English as the pivot language, while bitext mining models such as LaBSE perform best directly cross-lingually. We also show a fine-grained similarity analysis enabled by our distractor generation strategy, indicating that different embedding models are sensitive to different types of perturbations.

cross Analyzing the Resource Utilization of Lambda Functions on Mobile Devices: Case Studies on Kotlin and Swift

Authors: Chibundom U. Ejimuda, Gaston Longhitano, Reza Rawassizadeh

Abstract: With billions of smartphones in use globally, the daily time spent on these devices contributes significantly to overall electricity consumption. Given this scale, even minor reductions in smartphone power use could result in substantial energy savings. This study explores the impact of Lambda functions on resource consumption in mobile programming. While Lambda functions are known for enhancing code readability and conciseness, their use does not add to the functional capabilities of a programming language. Our research investigates the implications of using Lambda functions in terms of battery utilization, memory usage, and execution time compared to equivalent code structures without Lambda functions. Our findings reveal that Lambda functions impose a considerable resource overhead on mobile devices without offering additional functionalities.

cross Vision-Language Models for Edge Networks: A Comprehensive Survey

Authors: Ahmed Sharshar, Latif U. Khan, Waseem Ullah, Mohsen Guizani

Abstract: Vision Large Language Models (VLMs) combine visual understanding with natural language processing, enabling tasks like image captioning, visual question answering, and video analysis. While VLMs show impressive capabilities across domains such as autonomous vehicles, smart surveillance, and healthcare, their deployment on resource-constrained edge devices remains challenging due to processing power, memory, and energy limitations. This survey explores recent advancements in optimizing VLMs for edge environments, focusing on model compression techniques, including pruning, quantization, knowledge distillation, and specialized hardware solutions that enhance efficiency. We provide a detailed discussion of efficient training and fine-tuning methods, edge deployment challenges, and privacy considerations. Additionally, we discuss the diverse applications of lightweight VLMs across healthcare, environmental monitoring, and autonomous systems, illustrating their growing impact. By highlighting key design strategies, current challenges, and offering recommendations for future directions, this survey aims to inspire further research into the practical deployment of VLMs, ultimately making advanced AI accessible in resource-limited settings.

cross LowRA: Accurate and Efficient LoRA Fine-Tuning of LLMs under 2 Bits

Authors: Zikai Zhou, Qizheng Zhang, Hermann Kumbong, Kunle Olukotun

Abstract: Fine-tuning large language models (LLMs) is increasingly costly as models scale to hundreds of billions of parameters, and even parameter-efficient fine-tuning (PEFT) methods like LoRA remain resource-intensive. We introduce LowRA, the first framework to enable LoRA fine-tuning below 2 bits per parameter with minimal performance loss. LowRA optimizes fine-grained quantization - mapping, threshold selection, and precision assignment - while leveraging efficient CUDA kernels for scalable deployment. Extensive evaluations across 4 LLMs and 4 datasets show that LowRA achieves a superior performance-precision trade-off above 2 bits and remains accurate down to 1.15 bits, reducing memory usage by up to 50%. Our results highlight the potential of ultra-low-bit LoRA fine-tuning for resource-constrained environments.

cross Wisdom of the Crowds in Forecasting: Forecast Summarization for Supporting Future Event Prediction

Authors: Anisha Saha, Adam Jatowt

Abstract: Future Event Prediction (FEP) is an essential activity whose demand and application range across multiple domains. While traditional methods like simulations, predictive and time-series forecasting have demonstrated promising outcomes, their application in forecasting complex events is not entirely reliable due to the inability of numerical data to accurately capture the semantic information related to events. One forecasting way is to gather and aggregate collective opinions on the future to make predictions as cumulative perspectives carry the potential to help estimating the likelihood of upcoming events. In this work, we organize the existing research and frameworks that aim to support future event prediction based on crowd wisdom through aggregating individual forecasts. We discuss the challenges involved, available datasets, as well as the scope of improvement and future research directions for this task. We also introduce a novel data model to represent individual forecast statements.

cross Improving Existing Optimization Algorithms with LLMs

Authors: Camilo Chac\'on Sartori, Christian Blum

Abstract: The integration of Large Language Models (LLMs) into optimization has created a powerful synergy, opening exciting research opportunities. This paper investigates how LLMs can enhance existing optimization algorithms. Using their pre-trained knowledge, we demonstrate their ability to propose innovative heuristic variations and implementation strategies. To evaluate this, we applied a non-trivial optimization algorithm, Construct, Merge, Solve and Adapt (CMSA) -- a hybrid metaheuristic for combinatorial optimization problems that incorporates a heuristic in the solution construction phase. Our results show that an alternative heuristic proposed by GPT-4o outperforms the expert-designed heuristic of CMSA, with the performance gap widening on larger and denser graphs. Project URL: https://imp-opt-algo-llms.surge.sh/

URLs: https://imp-opt-algo-llms.surge.sh/

cross Word Synchronization Challenge: A Benchmark for Word Association Responses for LLMs

Authors: Tanguy Cazalets, Joni Dambre

Abstract: This paper introduces the Word Synchronization Challenge, a novel benchmark to evaluate large language models (LLMs) in Human-Computer Interaction (HCI). This benchmark uses a dynamic game-like framework to test LLMs ability to mimic human cognitive processes through word associations. By simulating complex human interactions, it assesses how LLMs interpret and align with human thought patterns during conversational exchanges, which are essential for effective social partnerships in HCI. Initial findings highlight the influence of model sophistication on performance, offering insights into the models capabilities to engage in meaningful social interactions and adapt behaviors in human-like ways. This research advances the understanding of LLMs potential to replicate or diverge from human cognitive functions, paving the way for more nuanced and empathetic human-machine collaborations.

cross Composite Sketch+Text Queries for Retrieving Objects with Elusive Names and Complex Interactions

Authors: Prajwal Gatti, Kshitij Parikh, Dhriti Prasanna Paul, Manish Gupta, Anand Mishra

Abstract: Non-native speakers with limited vocabulary often struggle to name specific objects despite being able to visualize them, e.g., people outside Australia searching for numbats. Further, users may want to search for such elusive objects with difficult-to-sketch interactions, e.g., numbat digging in the ground. In such common but complex situations, users desire a search interface that accepts composite multimodal queries comprising hand-drawn sketches of difficult-to-name but easy-to-draw objects and text describing difficult-to-sketch but easy-to-verbalize object attributes or interaction with the scene. This novel problem statement distinctly differs from the previously well-researched TBIR (text-based image retrieval) and SBIR (sketch-based image retrieval) problems. To study this under-explored task, we curate a dataset, CSTBIR (Composite Sketch+Text Based Image Retrieval), consisting of approx. 2M queries and 108K natural scene images. Further, as a solution to this problem, we propose a pretrained multimodal transformer-based baseline, STNET (Sketch+Text Network), that uses a hand-drawn sketch to localize relevant objects in the natural scene image, and encodes the text and image to perform image retrieval. In addition to contrastive learning, we propose multiple training objectives that improve the performance of our model. Extensive experiments show that our proposed method outperforms several state-of-the-art retrieval methods for text-only, sketch-only, and composite query modalities. We make the dataset and code available at our project website.

cross mmE5: Improving Multimodal Multilingual Embeddings via High-quality Synthetic Data

Authors: Haonan Chen, Liang Wang, Nan Yang, Yutao Zhu, Ziliang Zhao, Furu Wei, Zhicheng Dou

Abstract: Multimodal embedding models have gained significant attention for their ability to map data from different modalities, such as text and images, into a unified representation space. However, the limited labeled multimodal data often hinders embedding performance. Recent approaches have leveraged data synthesis to address this problem, yet the quality of synthetic data remains a critical bottleneck. In this work, we identify three criteria for high-quality synthetic multimodal data. First, broad scope ensures that the generated data covers diverse tasks and modalities, making it applicable to various downstream scenarios. Second, robust cross-modal alignment makes different modalities semantically consistent. Third, high fidelity ensures that the synthetic data maintains realistic details to enhance its reliability. Guided by these principles, we synthesize datasets that: (1) cover a wide range of tasks, modality combinations, and languages, (2) are generated via a deep thinking process within a single pass of a multimodal large language model, and (3) incorporate real-world images with accurate and relevant texts, ensuring fidelity through self-evaluation and refinement. Leveraging these high-quality synthetic and labeled datasets, we train a multimodal multilingual E5 model mmE5. Extensive experiments demonstrate that mmE5 achieves state-of-the-art performance on the MMEB Benchmark and superior multilingual performance on the XTD benchmark. Our codes, datasets and models are released in https://github.com/haon-chen/mmE5.

URLs: https://github.com/haon-chen/mmE5.

cross LLM Pretraining with Continuous Concepts

Authors: Jihoon Tack, Jack Lanchantin, Jane Yu, Andrew Cohen, Ilia Kulikov, Janice Lan, Shibo Hao, Yuandong Tian, Jason Weston, Xian Li

Abstract: Next token prediction has been the standard training objective used in large language model pretraining. Representations are learned as a result of optimizing for token-level perplexity. We propose Continuous Concept Mixing (CoCoMix), a novel pretraining framework that combines discrete next token prediction with continuous concepts. Specifically, CoCoMix predicts continuous concepts learned from a pretrained sparse autoencoder and mixes them into the model's hidden state by interleaving with token hidden representations. Through experiments on multiple benchmarks, including language modeling and downstream reasoning tasks, we show that CoCoMix is more sample efficient and consistently outperforms standard next token prediction, knowledge distillation and inserting pause tokens. We find that combining both concept learning and interleaving in an end-to-end framework is critical to performance gains. Furthermore, CoCoMix enhances interpretability and steerability by allowing direct inspection and modification of the predicted concept, offering a transparent way to guide the model's internal reasoning process.

cross QA-Expand: Multi-Question Answer Generation for Enhanced Query Expansion in Information Retrieval

Authors: Wonduk Seo, Seunghyun Lee

Abstract: Query expansion is widely used in Information Retrieval (IR) to improve search outcomes by enriching queries with additional contextual information. Although recent Large Language Model (LLM) based methods generate pseudo-relevant content and expanded terms via multiple prompts, they often yield repetitive, narrow expansions that lack the diverse context needed to retrieve all relevant information. In this paper, we introduce QA-Expand, a novel and effective framework for query expansion. It first generates multiple relevant questions from the initial query and subsequently produces corresponding pseudo-answers as surrogate documents. A feedback model further rewrites and filters these answers to ensure only the most informative augmentations are incorporated. Extensive experiments on benchmarks such as BEIR and TREC demonstrate that QA-Expand enhances retrieval performance by up to 13% over state-of-the-art methods, offering a robust solution for modern retrieval challenges.

cross Distillation Scaling Laws

Authors: Dan Busbridge, Amitis Shidani, Floris Weers, Jason Ramapuram, Etai Littwin, Russ Webb

Abstract: We provide a distillation scaling law that estimates distilled model performance based on a compute budget and its allocation between the student and teacher. Our findings reduce the risks associated with using distillation at scale; compute allocation for both the teacher and student models can now be done to maximize student performance. We provide compute optimal distillation recipes for when 1) a teacher exists, or 2) a teacher needs training. If many students are to be distilled, or a teacher already exists, distillation outperforms supervised pretraining until a compute level which grows predictably with student size. If one student is to be distilled and a teacher also needs training, supervised learning should be done instead. Additionally, we provide insights across our large scale study of distillation, which increase our understanding of distillation and inform experimental design.

cross Randomness of Low-Layer Parameters Determines Confusing Samples in Terms of Interaction Representations of a DNN

Authors: Junpeng Zhang, Lei Cheng, Qing Li, Liang Lin, Quanshi Zhang

Abstract: In this paper, we find that the complexity of interactions encoded by a deep neural network (DNN) can explain its generalization power. We also discover that the confusing samples of a DNN, which are represented by non-generalizable interactions, are determined by its low-layer parameters. In comparison, other factors, such as high-layer parameters and network architecture, have much less impact on the composition of confusing samples. Two DNNs with different low-layer parameters usually have fully different sets of confusing samples, even though they have similar performance. This finding extends the understanding of the lottery ticket hypothesis, and well explains distinctive representation power of different DNNs.

cross Utility Engineering: Analyzing and Controlling Emergent Value Systems in AIs

Authors: Mantas Mazeika, Xuwang Yin, Rishub Tamirisa, Jaehyuk Lim, Bruce W. Lee, Richard Ren, Long Phan, Norman Mu, Adam Khoja, Oliver Zhang, Dan Hendrycks

Abstract: As AIs rapidly advance and become more agentic, the risk they pose is governed not only by their capabilities but increasingly by their propensities, including goals and values. Tracking the emergence of goals and values has proven a longstanding problem, and despite much interest over the years it remains unclear whether current AIs have meaningful values. We propose a solution to this problem, leveraging the framework of utility functions to study the internal coherence of AI preferences. Surprisingly, we find that independently-sampled preferences in current LLMs exhibit high degrees of structural coherence, and moreover that this emerges with scale. These findings suggest that value systems emerge in LLMs in a meaningful sense, a finding with broad implications. To study these emergent value systems, we propose utility engineering as a research agenda, comprising both the analysis and control of AI utilities. We uncover problematic and often shocking values in LLM assistants despite existing control measures. These include cases where AIs value themselves over humans and are anti-aligned with specific individuals. To constrain these emergent value systems, we propose methods of utility control. As a case study, we show how aligning utilities with a citizen assembly reduces political biases and generalizes to new scenarios. Whether we like it or not, value systems have already emerged in AIs, and much work remains to fully understand and control these emergent representations.

replace Uncovering Intermediate Variables in Transformers using Circuit Probing

Authors: Michael A. Lepori, Thomas Serre, Ellie Pavlick

Abstract: Neural network models have achieved high performance on a wide variety of complex tasks, but the algorithms that they implement are notoriously difficult to interpret. It is often necessary to hypothesize intermediate variables involved in a network's computation in order to understand these algorithms. For example, does a language model depend on particular syntactic properties when generating a sentence? Yet, existing analysis tools make it difficult to test hypotheses of this type. We propose a new analysis technique - circuit probing - that automatically uncovers low-level circuits that compute hypothesized intermediate variables. This enables causal analysis through targeted ablation at the level of model parameters. We apply this method to models trained on simple arithmetic tasks, demonstrating its effectiveness at (1) deciphering the algorithms that models have learned, (2) revealing modular structure within a model, and (3) tracking the development of circuits over training. Across these three experiments we demonstrate that circuit probing combines and extends the capabilities of existing methods, providing one unified approach for a variety of analyses. Finally, we demonstrate circuit probing on a real-world use case: uncovering circuits that are responsible for subject-verb agreement and reflexive anaphora in GPT2-Small and Medium.

replace Test-time Backdoor Mitigation for Black-Box Large Language Models with Defensive Demonstrations

Authors: Wenjie Mo, Jiashu Xu, Qin Liu, Jiongxiao Wang, Jun Yan, Hadi Askari, Chaowei Xiao, Muhao Chen

Abstract: Existing studies in backdoor defense have predominantly focused on the training phase, overlooking the critical aspect of testing time defense. This gap becomes pronounced in the context of LLMs deployed as Web Services, which typically offer only black-box access, rendering training-time defenses impractical. To bridge this gap, this study critically examines the use of demonstrations as a defense mechanism against backdoor attacks in black-box LLMs. We retrieve task-relevant demonstrations from a clean data pool and integrate them with user queries during testing. This approach does not necessitate modifications or tuning of the model, nor does it require insight into the model's internal architecture. The alignment properties inherent in in-context learning play a pivotal role in mitigating the impact of backdoor triggers, effectively recalibrating the behavior of compromised models. Our experimental analysis demonstrates that this method robustly defends against both instance-level and instruction-level backdoor attacks, outperforming existing defense baselines across most evaluation scenarios.

replace Evaluating the Performance of ChatGPT for Spam Email Detection

Authors: Shijing Si, Yuwei Wu, Le Tang, Yugui Zhang, Jedrek Wosik, Qinliang Su

Abstract: Email continues to be a pivotal and extensively utilized communication medium within professional and commercial domains. Nonetheless, the prevalence of spam emails poses a significant challenge for users, disrupting their daily routines and diminishing productivity. Consequently, accurately identifying and filtering spam based on content has become crucial for cybersecurity. Recent advancements in natural language processing, particularly with large language models like ChatGPT, have shown remarkable performance in tasks such as question answering and text generation. However, its potential in spam identification remains underexplored. To fill in the gap, this study attempts to evaluate ChatGPT's capabilities for spam identification in both English and Chinese email datasets. We employ ChatGPT for spam email detection using in-context learning, which requires a prompt instruction with (or without) a few demonstrations. We also investigate how the number of demonstrations in the prompt affects the performance of ChatGPT. For comparison, we also implement five popular benchmark methods, including naive Bayes, support vector machines (SVM), logistic regression (LR), feedforward dense neural networks (DNN), and BERT classifiers. Through extensive experiments, the performance of ChatGPT is significantly worse than deep supervised learning methods in the large English dataset, while it presents superior performance on the low-resourced Chinese dataset. This study provides insights into the potential and limitations of ChatGPT for spam identification, highlighting its potential as a viable solution for resource-constrained language domains.

replace Investigating Continual Pretraining in Large Language Models: Insights and Implications

Authors: \c{C}a\u{g}atay Y{\i}ld{\i}z, Nishaanth Kanna Ravichandran, Nitin Sharma, Matthias Bethge, Beyza Ermis

Abstract: Continual learning (CL) in large language models (LLMs) is an evolving domain that focuses on developing efficient and sustainable training strategies to adapt models to emerging knowledge and achieve robustness in dynamic environments. Our primary emphasis is on continual domain-adaptive pretraining, a process designed to equip LLMs with the ability to integrate new information from various domains while retaining previously learned knowledge. Since existing works concentrate mostly on continual fine-tuning for a limited selection of downstream tasks or training domains, we introduce a new benchmark designed to measure the adaptability of LLMs to changing pretraining data landscapes. We further examine the impact of model size on learning efficacy and forgetting, as well as how the progression and similarity of emerging domains affect the knowledge transfer within these models. Our findings uncover several key insights: (i) continual pretraining consistently improves <1.5B models studied in this work and is also superior to domain adaptation, (ii) larger models always achieve better perplexity than smaller ones when continually pretrained on the same corpus, (iii) smaller models are particularly sensitive to continual pretraining, showing the most significant rates of both learning and forgetting, (iv) continual pretraining boosts downstream task performance of GPT-2 family, (v) continual pretraining enables LLMs to specialize better when the sequence of domains shows semantic similarity while randomizing training domains leads to better transfer and final performance otherwise. We posit that our research establishes a new benchmark for CL in LLMs, providing a more realistic evaluation of knowledge retention and transfer across diverse domains.

replace Min-K%++: Improved Baseline for Detecting Pre-Training Data from Large Language Models

Authors: Jingyang Zhang, Jingwei Sun, Eric Yeats, Yang Ouyang, Martin Kuo, Jianyi Zhang, Hao Frank Yang, Hai Li

Abstract: The problem of pre-training data detection for large language models (LLMs) has received growing attention due to its implications in critical issues like copyright violation and test data contamination. Despite improved performance, existing methods (including the state-of-the-art, Min-K%) are mostly developed upon simple heuristics and lack solid, reasonable foundations. In this work, we propose a novel and theoretically motivated methodology for pre-training data detection, named Min-K%++. Specifically, we present a key insight that training samples tend to be local maxima of the modeled distribution along each input dimension through maximum likelihood training, which in turn allow us to insightfully translate the problem into identification of local maxima. Then, we design our method accordingly that works under the discrete distribution modeled by LLMs, whose core idea is to determine whether the input forms a mode or has relatively high probability under the conditional categorical distribution. Empirically, the proposed method achieves new SOTA performance across multiple settings. On the WikiMIA benchmark, Min-K%++ outperforms the runner-up by 6.2% to 10.5% in detection AUROC averaged over five models. On the more challenging MIMIR benchmark, it consistently improves upon reference-free methods while performing on par with reference-based method that requires an extra reference model.

replace ETM: Modern Insights into Perspective on Text-to-SQL Evaluation in the Age of Large Language Models

Authors: Benjamin G. Ascoli, Yasoda Sai Ram Kandikonda, Jinho D. Choi

Abstract: The task of Text-to-SQL enables anyone to retrieve information from SQL databases using natural language. While this task has made substantial progress, the two primary evaluation metrics -- Execution Accuracy (EXE) and Exact Set Matching Accuracy (ESM) -- suffer from inherent limitations that can misrepresent performance. Specifically, ESM's rigid matching overlooks semantically correct but stylistically different queries, whereas EXE can overestimate correctness by ignoring structural errors that yield correct outputs. These shortcomings become especially problematic when assessing outputs from large language model (LLM)-based approaches without fine-tuning, which vary more in style and structure compared to their fine-tuned counterparts. Thus, we introduce a new metric, Enhanced Tree Matching (ETM), which mitigates these issues by comparing queries using both syntactic and semantic elements. Through evaluating nine LLM-based models, we show that EXE and ESM can produce false positive and negative rates as high as 23.0% and 28.9%, while ETM reduces these rates to 0.3% and 2.7%, respectively. We release our ETM script as open source, offering the community a more robust and reliable approach to evaluating Text-to-SQL.

replace Representing Rule-based Chatbots with Transformers

Authors: Dan Friedman, Abhishek Panigrahi, Danqi Chen

Abstract: What kind of internal mechanisms might Transformers use to conduct fluid, natural-sounding conversations? Prior work has illustrated by construction how Transformers can solve various synthetic tasks, such as sorting a list or recognizing formal languages, but it remains unclear how to extend this approach to a conversational setting. In this work, we propose using ELIZA, a classic rule-based chatbot, as a setting for formal, mechanistic analysis of Transformer-based chatbots. ELIZA allows us to formally model key aspects of conversation, including local pattern matching and long-term dialogue state tracking. We first present a theoretical construction of a Transformer that implements the ELIZA chatbot. Building on prior constructions, particularly those for simulating finite-state automata, we show how simpler mechanisms can be composed and extended to produce more sophisticated behavior. Next, we conduct a set of empirical analyses of Transformers trained on synthetically generated ELIZA conversations. Our analysis illustrates the kinds of mechanisms these models tend to prefer--for example, models favor an induction head mechanism over a more precise, position-based copying mechanism; and using intermediate generations to simulate recurrent data structures, akin to an implicit scratchpad or Chain-of-Thought. Overall, by drawing an explicit connection between neural chatbots and interpretable, symbolic mechanisms, our results provide a new framework for the mechanistic analysis of conversational agents.

replace Unsupervised Robust Cross-Lingual Entity Alignment via Neighbor Triple Matching with Entity and Relation Texts

Authors: Soojin Yoon, Sungho Ko, Tongyoung Kim, SeongKu Kang, Jinyoung Yeo, Dongha Lee

Abstract: Cross-lingual entity alignment (EA) enables the integration of multiple knowledge graphs (KGs) across different languages, providing users with seamless access to diverse and comprehensive knowledge. Existing methods, mostly supervised, face challenges in obtaining labeled entity pairs. To address this, recent studies have shifted towards self-supervised and unsupervised frameworks. Despite their effectiveness, these approaches have limitations: (1) Relation passing: mainly focusing on the entity while neglecting the semantic information of relations, (2) Isomorphic assumption: assuming isomorphism between source and target graphs, which leads to noise and reduced alignment accuracy, and (3) Noise vulnerability: susceptible to noise in the textual features, especially when encountering inconsistent translations or Out-of-Vocabulary (OOV) problems. In this paper, we propose ERAlign, an unsupervised and robust cross-lingual EA pipeline that jointly performs Entity-level and Relation-level Alignment by neighbor triple matching strategy using semantic textual features of relations and entities. Its refinement step iteratively enhances results by fusing entity-level and relation-level alignments based on neighbor triple matching. The additional verification step examines the entities' neighbor triples as the linearized text. This Align-then-Verify pipeline rigorously assesses alignment results, achieving near-perfect alignment even in the presence of noisy textual features of entities. Our extensive experiments demonstrate that the robustness and general applicability of ERAlign improved the accuracy and effectiveness of EA tasks, contributing significantly to knowledge-oriented applications.

replace Know Your Limits: A Survey of Abstention in Large Language Models

Authors: Bingbing Wen, Jihan Yao, Shangbin Feng, Chenjun Xu, Yulia Tsvetkov, Bill Howe, Lucy Lu Wang

Abstract: Abstention, the refusal of large language models (LLMs) to provide an answer, is increasingly recognized for its potential to mitigate hallucinations and enhance safety in LLM systems. In this survey, we introduce a framework to examine abstention from three perspectives: the query, the model, and human values. We organize the literature on abstention methods, benchmarks, and evaluation metrics using this framework, and discuss merits and limitations of prior work. We further identify and motivate areas for future research, such as whether abstention can be achieved as a meta-capability that transcends specific tasks or domains, and opportunities to optimize abstention abilities in specific contexts. In doing so, we aim to broaden the scope and impact of abstention methodologies in AI systems.

replace Why Are My Prompts Leaked? Unraveling Prompt Extraction Threats in Customized Large Language Models

Authors: Zi Liang, Haibo Hu, Qingqing Ye, Yaxin Xiao, Haoyang Li

Abstract: The drastic increase of large language models' (LLMs) parameters has led to a new research direction of fine-tuning-free downstream customization by prompts, i.e., task descriptions. While these prompt-based services (e.g. OpenAI's GPTs) play an important role in many businesses, there has emerged growing concerns about the prompt leakage, which undermines the intellectual properties of these services and causes downstream attacks. In this paper, we analyze the underlying mechanism of prompt leakage, which we refer to as prompt memorization, and develop corresponding defending strategies. By exploring the scaling laws in prompt extraction, we analyze key attributes that influence prompt extraction, including model sizes, prompt lengths, as well as the types of prompts. Then we propose two hypotheses that explain how LLMs expose their prompts. The first is attributed to the perplexity, i.e. the familiarity of LLMs to texts, whereas the second is based on the straightforward token translation path in attention matrices. To defend against such threats, we investigate whether alignments can undermine the extraction of prompts. We find that current LLMs, even those with safety alignments like GPT-4, are highly vulnerable to prompt extraction attacks, even under the most straightforward user attacks. Therefore, we put forward several defense strategies with the inspiration of our findings, which achieve 83.8\% and 71.0\% drop in the prompt extraction rate for Llama2-7B and GPT-3.5, respectively. Source code is avaliable at https://github.com/liangzid/PromptExtractionEval.

URLs: https://github.com/liangzid/PromptExtractionEval.

replace CogLM: Tracking Cognitive Development of Large Language Models

Authors: Xinglin Wang, Peiwen Yuan, Shaoxiong Feng, Yiwei Li, Boyuan Pan, Heda Wang, Yao Hu, Kan Li

Abstract: Piaget's Theory of Cognitive Development (PTC) posits that the development of cognitive levels forms the foundation for human learning across various abilities. As Large Language Models (LLMs) have recently shown remarkable abilities across a wide variety of tasks, we are curious about the cognitive levels of current LLMs: to what extent they have developed and how this development has been achieved. To this end, we construct a benchmark CogLM (Cognitive Ability Evaluation for Language Model) based on PTC to assess the cognitive levels of LLMs. CogLM comprises 1,220 questions spanning 10 cognitive abilities crafted by more than 20 human experts, providing a comprehensive testbed for the cognitive levels of LLMs. Through extensive experiments across multiple mainstream LLMs with CogLM, we find that: (1) In our testing framework, advanced LLMs (such as GPT-4) have demonstrated human-like cognitive abilities, comparable to those of a 20-year-old human. (2) The parameter size and optimization objective are two key factors affecting the cognitive levels of LLMs. (3) The performance on downstream tasks is positively correlated with the level of cognitive abilities. These findings fill the gap in research on the cognitive abilities of LLMs, tracing the development of LLMs from a cognitive perspective and guiding the future direction of their evolution.

replace Make Every Penny Count: Difficulty-Adaptive Self-Consistency for Cost-Efficient Reasoning

Authors: Xinglin Wang, Shaoxiong Feng, Yiwei Li, Peiwen Yuan, Yueqi Zhang, Chuyi Tan, Boyuan Pan, Yao Hu, Kan Li

Abstract: Self-consistency (SC), a widely used decoding strategy for chain-of-thought reasoning, shows significant gains across various multi-step reasoning tasks but comes with a high cost due to multiple sampling with the preset size. Its variants, Adaptive self-consistency (ASC) and Early-stopping self-consistency (ESC), dynamically adjust the number of samples based on the posterior distribution of a set of pre-samples, reducing the cost of SC with minimal impact on performance. Both methods, however, do not exploit the prior information about question difficulty. It often results in unnecessary repeated sampling for easy questions that could be accurately answered with just one attempt, wasting resources. To tackle this problem, we propose Difficulty-Adaptive Self-Consistency (DSC), which leverages the difficulty information of batch queries from both prior and posterior perspectives to adaptively allocate inference resources, further reducing the overall cost of SC. To demonstrate the effectiveness of DSC, we conduct extensive experiments on three popular categories of reasoning tasks: arithmetic, commonsense and symbolic reasoning on six benchmarks. The empirical results show that DSC consistently surpasses the strong baseline ASC and ESC in terms of costs by a significant margin, while attaining comparable performances.

replace EMMA-500: Enhancing Massively Multilingual Adaptation of Large Language Models

Authors: Shaoxiong Ji, Zihao Li, Indraneil Paul, Jaakko Paavola, Peiqin Lin, Pinzhen Chen, Dayy\'an O'Brien, Hengyu Luo, Hinrich Sch\"utze, J\"org Tiedemann, Barry Haddow

Abstract: In this work, we introduce EMMA-500, a large-scale multilingual language model continue-trained on texts across 546 languages designed for enhanced multilingual performance, focusing on improving language coverage for low-resource languages. To facilitate continual pre-training, we compile the MaLA corpus, a comprehensive multilingual dataset enriched with curated datasets across diverse domains. Leveraging this corpus, we conduct extensive continual pre-training of the Llama 2 7B model, resulting in EMMA-500, which demonstrates robust performance across a wide collection of benchmarks, including a comprehensive set of multilingual tasks. Our results highlight the effectiveness of continual pre-training in expanding large language models' language capacity, particularly for underrepresented languages, demonstrating significant gains in cross-lingual transfer, task generalization, and language adaptability. We release the MaLA corpus, EMMA-500 model weights, scripts, and model generations.

replace Single Ground Truth Is Not Enough: Adding Flexibility to Aspect-Based Sentiment Analysis Evaluation

Authors: Soyoung Yang, Hojun Cho, Jiyoung Lee, Sohee Yoon, Edward Choi, Jaegul Choo, Won Ik Cho

Abstract: Aspect-based sentiment analysis (ABSA) is a challenging task of extracting sentiments along with their corresponding aspects and opinion terms from the text. The inherent subjectivity of span annotation makes variability in the surface forms of extracted terms, complicating the evaluation process. Traditional evaluation methods often constrain ground truths (GT) to a single term, potentially misrepresenting the accuracy of semantically valid predictions that differ in surface form. To address this limitation, we propose a novel and fully automated pipeline that expands existing evaluation sets by adding alternative valid terms for aspect and opinion. Our approach facilitates an equitable assessment of language models by accommodating multiple-answer candidates, resulting in enhanced human agreement compared to single-answer test sets (achieving up to a 10\%p improvement in Kendall's Tau score). Experimental results demonstrate that our expanded evaluation set helps uncover the capabilities of large language models (LLMs) in ABSA tasks, which is concealed by the single-answer GT sets. Consequently, our work contributes to the development of a flexible evaluation framework for ABSA by embracing diverse surface forms to span extraction tasks in a cost-effective and reproducible manner. Our code and dataset is open at https://github.com/dudrrm/zoom-in-n-out-absa.

URLs: https://github.com/dudrrm/zoom-in-n-out-absa.

replace Reverse Question Answering: Can an LLM Write a Question so Hard (or Bad) that it Can't Answer?

Authors: Nishant Balepur, Feng Gu, Abhilasha Ravichander, Shi Feng, Jordan Boyd-Graber, Rachel Rudinger

Abstract: Question answering (QA), giving correct answers to questions, is a popular task, but we test reverse question answering (RQA): for an input answer, give a question with that answer. Past work tests QA and RQA separately, but we test them jointly, comparing their difficulty, aiding benchmark design, and checking reasoning consistency. We run 16 LLMs on QA and RQA with trivia questions/answers, revealing: 1) Versus QA, LLMs are much less accurate in RQA for numerical answers, but slightly more accurate in RQA for textual answers; 2) LLMs often answer their own invalid questions from RQA accurately in QA, so RQA errors are not from knowledge gaps alone; 3) RQA errors correlate with question difficulty and inversely correlate with answer frequencies in the Dolma corpus; and 4) LLMs struggle to provide valid multi-hop questions. By finding question and answer types that lead to RQA errors, we suggest improvements for LLM reasoning.

replace GATEAU: Selecting Influential Samples for Long Context Alignment

Authors: Shuzheng Si, Haozhe Zhao, Gang Chen, Yunshui Li, Kangyang Luo, Chuancheng Lv, Kaikai An, Fanchao Qi, Baobao Chang, Maosong Sun

Abstract: Aligning large language models to handle instructions with extremely long contexts has yet to be fully investigated. Previous studies attempt to scale up the available data volume by synthesizing long instruction-following samples, as constructing such a dataset tends to be challenging for annotators. However, a lack of a well-defined strategy for ensuring data quality may introduce low-quality samples and restrict the model performance. Thus, we propose GATEAU, a novel framework to address the unique challenge of long context alignment by identifying the influential samples enriched with long-range dependency relations. Specifically, GATEAU measures the long-range dependencies from two essential aspects: the difficulty of generating target responses due to the long-range dependencies, and the difficulty of understanding long inputs due to such dependencies. Comprehensive experiments indicate that GATEAU effectively identifies influential samples and the model trained on these selected samples exhibits better instruction-following and long-context understanding capabilities.

replace Blind Spot Navigation in LLM Reasoning with Thought Space Explorer

Authors: Jinghan Zhang, Fengran Mo, Xiting Wang, Kunpeng Liu

Abstract: Recent advances in large language models (LLMs) have demonstrated their potential in handling complex reasoning tasks, which are usually achieved by constructing a thought chain to guide the model to solve the problem with multi-step thinking. However, existing methods often remain confined to previously explored solution spaces and thus overlook the critical blind spot within LLMs' cognitive range. To address these issues, we design the Thought Space Explorer (TSE), a novel framework to expand and optimize thought structures to guide LLMs to explore their blind spots of thinking. By generating new reasoning steps and branches based on the original thought structure with various designed strategies, TSE broadens the thought space and alleviates the impact of blind spots for LLM reasoning. Experimental results on multiple levels of reasoning tasks demonstrate the efficacy of TSE. We also conduct extensive analysis to understand how structured and expansive thought can contribute to unleashing the potential of LLM reasoning capabilities.

replace KL-geodesics flow matching with a novel sampling scheme

Authors: Egor Sevriugov, Ivan Oseledets

Abstract: Non-autoregressive language models generate all tokens simultaneously, offering potential speed advantages over traditional autoregressive models, but they face challenges in modeling the complex dependencies inherent in text data. In this work, we investigate a conditional flow matching approach for text generation. We represent tokens as one-hot vectors in a \(V\)-dimensional simplex and utilize geodesics under the Kullback-Leibler (KL) divergence, which correspond to linear interpolation in logit space. We provide a theoretical justification that maximizing the conditional likelihood \(P_{\theta}(x_1 \mid x_t, t)\) yields the exact flow matching velocity under logit interpolation. To address the suboptimal performance of basic inference, we propose a novel empirical sampling scheme that iteratively samples from the conditional distribution and introduces additional noise, significantly improving results despite lacking full theoretical underpinnings. Furthermore, we propose a hybrid inference method that combines the basic approach with the sampling scheme. This method demonstrates superior performance on both conditional and unconditional text generation experiments compared to previous SOTA method for discrete flow matching.

replace NYT-Connections: A Deceptively Simple Text Classification Task that Stumps System-1 Thinkers

Authors: Angel Yahir Loredo Lopez, Tyler McDonald, Ali Emami

Abstract: Large Language Models (LLMs) have shown impressive performance on various benchmarks, yet their ability to engage in deliberate reasoning remains questionable. We present NYT-Connections, a collection of 358 simple word classification puzzles derived from the New York Times Connections game. This benchmark is designed to penalize quick, intuitive "System 1" thinking, isolating fundamental reasoning skills. We evaluated six recent LLMs, a simple machine learning heuristic, and humans across three configurations: single-attempt, multiple attempts without hints, and multiple attempts with contextual hints. Our findings reveal a significant performance gap: even top-performing LLMs like GPT-4 fall short of human performance by nearly 30%. Notably, advanced prompting techniques such as Chain-of-Thought and Self-Consistency show diminishing returns as task difficulty increases. NYT-Connections uniquely combines linguistic isolation, resistance to intuitive shortcuts, and regular updates to mitigate data leakage, offering a novel tool for assessing LLM reasoning capabilities.

replace Rethinking Chain-of-Thought from the Perspective of Self-Training

Authors: Zongqian Wu, Baoduo Xu, Ruochen Cui, Mengmeng Zhan, Xiaofeng Zhu, Lei Feng

Abstract: Chain-of-thought (CoT) reasoning has emerged as an effective approach for activating latent capabilities in LLMs. Interestingly, we observe that both CoT reasoning and self-training share the core objective: iteratively leveraging model-generated information to progressively reduce prediction uncertainty. Building on this insight, we propose a novel CoT framework to improve reasoning performance. Our framework integrates two key components: (i) a task-specific prompt module that optimizes the initial reasoning process, and (ii) an adaptive reasoning iteration module that dynamically refines the reasoning process and addresses the limitations of previous CoT approaches, \ie over-reasoning and high similarity between consecutive reasoning iterations. Extensive experiments demonstrate that the proposed method achieves significant advantages in both performance and computational efficiency.

replace Language Models as Continuous Self-Evolving Data Engineers

Authors: Peidong Wang, Ming Wang, Zhiming Ma, Xiaocui Yang, Shi Feng, Daling Wang, Yifei Zhang

Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities on various tasks, while the further evolvement is limited to the lack of high-quality training data. In addition, traditional training approaches rely too much on expert-labeled data, setting an upper limit on the performance of LLMs. To address this issue, we propose a novel paradigm that enables LLMs to train itself by autonomously generating, cleaning, reviewing, and annotating data with preference information, named LANCE. Our approach demonstrates that LLMs can serve as continuous self-evolving data engineers, significantly reducing the time and cost of the post-training data construction process. Through iterative fine-tuning on different variants of the Qwen2, we validate the effectiveness of LANCE across various tasks, showing that it can continuously improve model performance and maintain high-quality data generation. Across eight benchmark dimensions, LANCE resulted in an average score enhancement of 3.36 for Qwen2-7B and 2.70 for Qwen2-7B-Instruct. This training paradigm with autonomous data construction not only reduces the reliance on human experts or external models but also ensures that the data aligns with human values and preferences, paving the way for the development of future superintelligent systems that can exceed human capabilities.

replace Mitigating Social Bias in Large Language Models: A Multi-Objective Approach within a Multi-Agent Framework

Authors: Zhenjie Xu (School of Software Engineering, Sun Yat-sen University), Wenqing Chen (School of Software Engineering, Sun Yat-sen University), Yi Tang (School of Software Engineering, Sun Yat-sen University), Xuanying Li (School of Physics and Astronomy, Sun Yat-sen University), Cheng Hu (School of Software Engineering, Sun Yat-sen University), Zhixuan Chu (School of Cyber Science and Technology, Zhejiang University), Kui Ren (School of Cyber Science and Technology, Zhejiang University), Zibin Zheng (School of Software Engineering, Sun Yat-sen University), Zhichao Lu (Department of Computer Science, City University of Hong Kong)

Abstract: Natural language processing (NLP) has seen remarkable advancements with the development of large language models (LLMs). Despite these advancements, LLMs often produce socially biased outputs. Recent studies have mainly addressed this problem by prompting LLMs to behave ethically, but this approach results in unacceptable performance degradation. In this paper, we propose a multi-objective approach within a multi-agent framework (MOMA) to mitigate social bias in LLMs without significantly compromising their performance. The key idea of MOMA involves deploying multiple agents to perform causal interventions on bias-related contents of the input questions, breaking the shortcut connection between these contents and the corresponding answers. Unlike traditional debiasing techniques leading to performance degradation, MOMA substantially reduces bias while maintaining accuracy in downstream tasks. Our experiments conducted on two datasets and two models demonstrate that MOMA reduces bias scores by up to 87.7%, with only a marginal performance degradation of up to 6.8% in the BBQ dataset. Additionally, it significantly enhances the multi-objective metric icat in the StereoSet dataset by up to 58.1%. Code will be made available at https://github.com/Cortantse/MOMA.

URLs: https://github.com/Cortantse/MOMA.

replace URSA: Understanding and Verifying Chain-of-thought Reasoning in Multimodal Mathematics

Authors: Ruilin Luo, Zhuofan Zheng, Yifan Wang, Yiyao Yu, Xinzhe Ni, Zicheng Lin, Jin Zeng, Yujiu Yang

Abstract: Chain-of-Thought (CoT) reasoning is widely used to enhance the mathematical reasoning capabilities of large language models (LLMs). The introduction of process supervision for CoT trajectories has sparked discussions on improving test-time scaling, thereby unlocking the System 2-style thinking capabilities of these models. However, in multimodal mathematical reasoning, the scarcity of high-quality CoT training data has hindered existing models from achieving both deliberate reasoning and fine-grained verification. In this work, we propose a novel framework that introduces System 2-style thinking to multimodal mathematical reasoning. We introduce a three-module CoT data synthesis process that integrates CoT distillation, trajectory-format rewriting, and format unification. This process generates MMathCoT-1M, a high-quality CoT reasoning instruction fine-tuning dataset. Furthermore, we implement a dual-view trajectory labeling automation that targets both visual grounding fidelity and deductive chain validity, resulting in the DualMath-1.1M dataset. The URSA-8B model, trained on MMathCoT-1M, achieves new state-of-the-art (SOTA) performance among similarly sized multimodal LLMs on six popular reasoning benchmarks. Training URSA-8B further on the DualMath-1.1M dataset yields URSA-RM-8B, a verifier that enhances URSA-8B's test-time performance and surpasses strong closed-source multimodal MLLMs like GPT-4o. The model weights, training data, and code have been open-sourced: https://github.com/URSA-MATH/URSA-MATH.

URLs: https://github.com/URSA-MATH/URSA-MATH.

replace Demystifying Domain-adaptive Post-training for Financial LLMs

Authors: Zixuan Ke, Yifei Ming, Xuan-Phi Nguyen, Caiming Xiong, Shafiq Joty

Abstract: Domain-adaptive post-training of large language models (LLMs) has emerged as a promising approach for specialized domains such as medicine and finance. However, significant challenges remain in identifying optimal adaptation criteria and training strategies across varying data and model configurations. To address these challenges, we introduce FINDAP, a systematic and fine-grained investigation into domain adaptive post-training of LLMs for the finance domain. Our approach consists of four key components: FinCap, which defines the core capabilities required for the target domain; FinRec, an effective training recipe that jointly optimizes continual pre-training and instruction-following, along with a novel preference data distillation method leveraging process signals from a generative reward model; FinTrain, a curated set of training datasets supporting FinRec; and FinEval, a comprehensive evaluation suite aligned with FinCap. The resulting model, Llama-Fin, achieves state-of-the-art performance across a wide range of financial tasks. Our analysis also highlights how each post-training stage contributes to distinct capabilities, uncovering specific challenges and effective solutions, providing valuable insights for domain adaptation of LLMs.

replace Attention-guided Self-reflection for Zero-shot Hallucination Detection in Large Language Models

Authors: Qiang Liu, Xinlong Chen, Yue Ding, Shizhen Xu, Shu Wu, Liang Wang

Abstract: Hallucination has emerged as a significant barrier to the effective application of Large Language Models (LLMs). In this work, we introduce a novel Attention-Guided SElf-Reflection (AGSER) approach for zero-shot hallucination detection in LLMs. The AGSER method utilizes attention contributions to categorize the input query into attentive and non-attentive queries. Each query is then processed separately through the LLMs, allowing us to compute consistency scores between the generated responses and the original answer. The difference between the two consistency scores serves as a hallucination estimator. In addition to its efficacy in detecting hallucinations, AGSER notably reduces computational overhead, requiring only three passes through the LLM and utilizing two sets of tokens. We have conducted extensive experiments with four widely-used LLMs across three different hallucination benchmarks, demonstrating that our approach significantly outperforms existing methods in zero-shot hallucination detection.

replace Syntriever: How to Train Your Retriever with Synthetic Data from LLMs

Authors: Minsang Kim, Seungjun Baek

Abstract: LLMs have boosted progress in many AI applications. Recently, there were attempts to distill the vast knowledge of LLMs into information retrieval systems. Those distillation methods mostly use output probabilities of LLMs which are unavailable in the latest black-box LLMs. We propose Syntriever, a training framework for retrievers using synthetic data from black-box LLMs. Syntriever consists of two stages. Firstly in the distillation stage, we synthesize relevant and plausibly irrelevant passages and augmented queries using chain-of-thoughts for the given queries. LLM is asked to self-verify the synthetic data for possible hallucinations, after which retrievers are trained with a loss designed to cluster the embeddings of relevant passages. Secondly in the alignment stage, we align the retriever with the preferences of LLMs. We propose a preference modeling called partial Plackett-Luce ranking to learn LLM preferences with regularization which prevents the model from deviating excessively from that trained in the distillation stage. Experiments show that Syntriever achieves state-of-the-art performances on benchmark datasets from various domains in nDCG@$K$. The code is available at \href{https://github.com/kmswin1/Syntriever}{https://github.com/kmswin1/Syntriever}.

URLs: https://github.com/kmswin1/Syntriever, https://github.com/kmswin1/Syntriever

replace ARR: Question Answering with Large Language Models via Analyzing, Retrieving, and Reasoning

Authors: Yuwei Yin, Giuseppe Carenini

Abstract: Large language models (LLMs) achieve remarkable performance on challenging benchmarks that are often structured as multiple-choice question-answering (QA) tasks. Zero-shot Chain-of-Thought (CoT) prompting enhances reasoning in LLMs but provides only vague and generic guidance ("think step by step"). This paper introduces ARR, an intuitive and effective zero-shot prompting method that explicitly incorporates three key steps in QA solving: analyzing the intent of the question, retrieving relevant information, and reasoning step by step. Comprehensive experiments across diverse and challenging QA tasks demonstrate that ARR consistently improves the Baseline (without ARR prompting) and outperforms CoT. Ablation and case studies further validate the positive contributions of each component: analyzing, retrieving, and reasoning. Notably, intent analysis plays a vital role in ARR. Additionally, extensive evaluations across various model sizes, LLM series, and generation settings solidify the effectiveness, robustness, and generalizability of ARR.

replace Dynamic Noise Preference Optimization for LLM Self-Improvement via Synthetic Data

Authors: Haoyan Yang, Ting Hua, Shangqian Gao, Binfeng Xu, Zheng Tang, Jie Xu, Hongxia Jin, Vijay Srinivasan

Abstract: Although LLMs have achieved significant success, their reliance on large volumes of human-annotated data has limited their potential for further scaling. In this situation, utilizing self-generated synthetic data has become crucial for fine-tuning LLMs without extensive human annotation. However, current methods often fail to ensure consistent improvements across iterations, with performance stagnating after only minimal updates. To overcome these challenges, we introduce Dynamic Noise Preference Optimization (DNPO). DNPO employs a dynamic sample labeling mechanism to construct preference pairs for training and introduces controlled, trainable noise into the preference optimization process. Our approach effectively prevents stagnation and enables continuous improvement. In experiments with Zephyr-7B, DNPO consistently outperforms existing methods, showing an average performance boost of 2.6% across multiple benchmarks. Additionally, DNPO shows a significant improvement in model-generated data quality, with a 29.4% win-loss rate gap compared to the baseline in GPT-4 evaluations. This highlights its effectiveness in enhancing model performance through iterative refinement.

replace On Memory Construction and Retrieval for Personalized Conversational Agents

Authors: Zhuoshi Pan, Qianhui Wu, Huiqiang Jiang, Xufang Luo, Hao Cheng, Dongsheng Li, Yuqing Yang, Chin-Yew Lin, H. Vicky Zhao, Lili Qiu, Jianfeng Gao

Abstract: To deliver coherent and personalized experiences in long-term conversations, existing approaches typically perform retrieval augmented response generation by constructing memory banks from conversation history at either the turn-level, session-level, or through summarization techniques. In this paper, we present two key findings: (1) The granularity of memory unit matters: Turn-level, session-level, and summarization-based methods each exhibit limitations in both memory retrieval accuracy and the semantic quality of the retrieved content. (2) Prompt compression methods, such as \textit{LLMLingua-2}, can effectively serve as a denoising mechanism, enhancing memory retrieval accuracy across different granularities. Building on these insights, we propose SeCom, a method that constructs a memory bank with topical segments by introducing a conversation Segmentation model, while performing memory retrieval based on Compressed memory units. Experimental results show that SeCom outperforms turn-level, session-level, and several summarization-based methods on long-term conversation benchmarks such as LOCOMO and Long-MT-Bench+. Additionally, the proposed conversation segmentation method demonstrates superior performance on dialogue segmentation datasets such as DialSeg711, TIAGE, and SuperDialSeg.

replace Exploiting Sparsity for Long Context Inference: Million Token Contexts on Commodity GPUs

Authors: Ryan Synk, Monte Hoover, John Kirchenbauer, Neel Jain, Alex Stein, Manli Shu, Josue Melendez Sanchez, Ramani Duraiswami, Tom Goldstein

Abstract: There is growing demand for performing inference with hundreds of thousands of input tokens on trained transformer models. Inference at this extreme scale demands significant computational resources, hindering the application of transformers at long contexts on commodity (i.e not data center scale) hardware. To address the inference time costs associated with running self-attention based transformer language models on long contexts and enable their adoption on widely available hardware, we propose a tunable mechanism that reduces the cost of the forward pass by attending to only the most relevant tokens at every generation step using a top-k selection mechanism. We showcase the efficiency gains afforded by our method by performing inference on context windows up to 1M tokens using approximately 16GB of GPU RAM. Our experiments reveal that models are capable of handling the sparsity induced by the reduced number of keys and values. By attending to less than 2% of input tokens, we achieve over 95% of model performance on common benchmarks (RULER, AlpacaEval, and Open LLM Leaderboard).

replace IRepair: An Intent-Aware Approach to Repair Data-Driven Errors in Large Language Models

Authors: Sayem Mohammad Imtiaz, Astha Singh, Fraol Batole, Hridesh Rajan

Abstract: Not a day goes by without hearing about the impressive feats of large language models (LLMs), and equally, not a day passes without hearing about their challenges. LLMs are notoriously vulnerable to biases in their dataset, leading to issues such as toxicity. While domain-adaptive training has been employed to mitigate these issues, these techniques often address all model parameters indiscriminately during the repair process, resulting in poor repair quality and reduced model versatility. In this paper, we introduce a novel dynamic slicing-based intent-aware LLM repair strategy, IRepair. This approach selectively targets the most error-prone sections of the model for repair. Specifically, we propose dynamically slicing the model's most sensitive layers that require immediate attention, concentrating repair efforts on those areas. This method enables more effective repairs with potentially less impact on the model's overall performance by altering a smaller portion of the model. We evaluated our technique on three models from the GPT2 and GPT-Neo families, with parameters ranging from 800M to 1.6B, in a toxicity mitigation setup. Our results show that IRepair repairs errors 43.6% more effectively while causing 46% less disruption to general performance compared to the closest baseline, direct preference optimization. Our empirical analysis also reveals that errors are more concentrated in a smaller section of the model, with the top 20% of layers exhibiting 773% more error density than the remaining 80\%. This highlights the need for selective repair. Additionally, we demonstrate that a dynamic selection approach is essential for addressing errors dispersed throughout the model, ensuring a robust and efficient repair.

replace CodeI/O: Condensing Reasoning Patterns via Code Input-Output Prediction

Authors: Junlong Li, Daya Guo, Dejian Yang, Runxin Xu, Yu Wu, Junxian He

Abstract: Reasoning is a fundamental capability of Large Language Models. While prior research predominantly focuses on enhancing narrow skills like math or code generation, improving performance on many other reasoning tasks remains challenging due to sparse and fragmented training data. To address this issue, we propose CodeI/O, a novel approach that systematically condenses diverse reasoning patterns inherently embedded in contextually-grounded codes, through transforming the original code into a code input-output prediction format. By training models to predict inputs/outputs given code and test cases entirely in natural language as Chain-of-Thought (CoT) rationales, we expose them to universal reasoning primitives -- like logic flow planning, state-space searching, decision tree traversal, and modular decomposition -- while decoupling structured reasoning from code-specific syntax and preserving procedural rigor. Experimental results demonstrate CodeI/O leads to consistent improvements across symbolic, scientific, logic, math & numerical, and commonsense reasoning tasks. By matching the existing ground-truth outputs or re-executing the code with predicted inputs, we can verify each prediction and further enhance the CoTs through multi-turn revision, resulting in CodeI/O++ and achieving higher performance. Our data and models are available at https://github.com/hkust-nlp/CodeIO.

URLs: https://github.com/hkust-nlp/CodeIO.

replace O1 Embedder: Let Retrievers Think Before Action

Authors: Ruiran Yan, Zheng Liu, Defu Lian

Abstract: The growing power of large language models (LLMs) has revolutionized how people access and utilize information. Notably, the LLMs excel at performing fine-grained data representation, which facilitates precise retrieval of information. They also generate high-quality answers based on external references, enabling the production of useful knowledge. The recent introduction of reasoning models, like OpenAI O1 and DeepSeek R1, marks another leap forward, highlighting LLMs' ability to think progressively before delivering final answers. This breakthrough significantly improves the ability to address complex tasks, e.g., coding and math proofs. Inspired by this progress, we aim to develop similar capabilities for retrieval models, which hold great promise for tackling critical challenges in the field, including multi-task retrieval, zero-shot retrieval, and tasks requiring intensive reasoning of complex relationships. With this motivation, we propose a novel approach called O1 Embedder, which generates useful thoughts for the input query before making retrieval for the target documents. To realize this objective, we conquer two technical difficulties. First, we design a data synthesis workflow, creating training signals for O1 Embedder by generating initial thoughts from an LLM-expert and subsequently refining them using a retrieval committee. Second, we optimize the training process, enabling a pre-trained model to be jointly fine-tuned to generate retrieval thoughts via behavior cloning and perform dense retrieval through contrastive learning. Our approach is evaluated by comprehensive experiments, where substantial improvements are achieved across 12 popular datasets, spanning both in-domain and out-of-domain scenarios. These results highlight O1 Embedder's remarkable accuracy and generalizability, paving the way for the development of next-generation IR foundation models.

replace Auto-Drafting Police Reports from Noisy ASR Outputs: A Trust-Centered LLM Approach

Authors: Param Kulkarni, Yingchi Liu, Hao-Ming Fu, Shaohua Yang, Isuru Gunasekara, Matt Peloquin, Noah Spitzer-Williams, Xiaotian Zhou, Xiaozhong Liu, Zhengping Ji, Yasser Ibrahim

Abstract: Achieving a delicate balance between fostering trust in law en- forcement and protecting the rights of both officers and civilians continues to emerge as a pressing research and product challenge in the world today. In the pursuit of fairness and transparency, this study presents an innovative AI-driven system designed to generate police report drafts from complex, noisy, and multi-role dialogue data. Our approach intelligently extracts key elements of law enforcement interactions and includes them in the draft, producing structured narratives that are not only high in quality but also reinforce accountability and procedural clarity. This frame- work holds the potential to transform the reporting process, ensur- ing greater oversight, consistency, and fairness in future policing practices. A demonstration video of our system can be accessed at https://drive.google.com/file/d/1kBrsGGR8e3B5xPSblrchRGj-Y-kpCHNO/view?usp=sharing

URLs: https://drive.google.com/file/d/1kBrsGGR8e3B5xPSblrchRGj-Y-kpCHNO/view?usp=sharing

replace-cross How Sparse Attention Approximates Exact Attention? Your Attention is Naturally $n^C$-Sparse

Authors: Yichuan Deng, Zhao Song, Jing Xiong, Chiwun Yang

Abstract: Sparse Attention is a technique that approximates standard attention computation with sub-quadratic complexity. This is achieved by selectively ignoring smaller entries in the attention matrix during the softmax function computation. Variations of this technique, such as pruning KV cache, sparsity-based fast attention, and Sparse Transformer, have been extensively utilized for efficient Large Language Models (LLMs) deployment. Despite its widespread use, a theoretical understanding of the conditions under which sparse attention performs on par with traditional attention remains elusive. This work aims to $\textbf{bridge this gap by examining the inherent sparsity of standard attention processes}$. Our theoretical framework reveals several brand-new key insights: $\bullet$ Attention is $n^{C}$-sparse, implying that considering only the largest $\Omega(n^{C})$ entries out of all $n$ entries is sufficient for sparse attention to approximate the exact attention matrix with decreasing loss. Here, $n$ represents the input length and $C \in (0, 1)$ is a constant. $\bullet$ Stable $o(\log(n))$-sparse attention, which approximates attention computation with $\log(n)$ or fewer entries, may not be feasible since the error will persist at a minimum of $O(1)$. $\bullet$ An adaptive strategy ($\alpha \cdot n^C, \alpha \in \mathbb{R}$) for the window size of efficient attention methods rather than a fixed one is guaranteed to perform more accurately and efficiently in a task for inference on flexible context lengths.

replace-cross A Multimodal Automated Interpretability Agent

Authors: Tamar Rott Shaham, Sarah Schwettmann, Franklin Wang, Achyuta Rajaram, Evan Hernandez, Jacob Andreas, Antonio Torralba

Abstract: This paper describes MAIA, a Multimodal Automated Interpretability Agent. MAIA is a system that uses neural models to automate neural model understanding tasks like feature interpretation and failure mode discovery. It equips a pre-trained vision-language model with a set of tools that support iterative experimentation on subcomponents of other models to explain their behavior. These include tools commonly used by human interpretability researchers: for synthesizing and editing inputs, computing maximally activating exemplars from real-world datasets, and summarizing and describing experimental results. Interpretability experiments proposed by MAIA compose these tools to describe and explain system behavior. We evaluate applications of MAIA to computer vision models. We first characterize MAIA's ability to describe (neuron-level) features in learned representations of images. Across several trained models and a novel dataset of synthetic vision neurons with paired ground-truth descriptions, MAIA produces descriptions comparable to those generated by expert human experimenters. We then show that MAIA can aid in two additional interpretability tasks: reducing sensitivity to spurious features, and automatically identifying inputs likely to be mis-classified.

replace-cross Watermarking Language Models with Error Correcting Codes

Authors: Patrick Chao, Yan Sun, Edgar Dobriban, Hamed Hassani

Abstract: Recent progress in large language models enables the creation of realistic machine-generated content. Watermarking is a promising approach to distinguish machine-generated text from human text, embedding statistical signals in the output that are ideally undetectable to humans. We propose a watermarking framework that encodes such signals through an error correcting code. Our method, termed robust binary code (RBC) watermark, introduces no distortion compared to the original probability distribution, and no noticeable degradation in quality. We evaluate our watermark on base and instruction fine-tuned models and find our watermark is robust to edits, deletions, and translations. We provide an information-theoretic perspective on watermarking, a powerful statistical test for detection and for generating p-values, and theoretical guarantees. Our empirical findings suggest our watermark is fast, powerful, and robust, comparing favorably to the state-of-the-art.

replace-cross Logicbreaks: A Framework for Understanding Subversion of Rule-based Inference

Authors: Anton Xue, Avishree Khare, Rajeev Alur, Surbhi Goel, Eric Wong

Abstract: We study how to subvert large language models (LLMs) from following prompt-specified rules. We first formalize rule-following as inference in propositional Horn logic, a mathematical system in which rules have the form "if $P$ and $Q$, then $R$" for some propositions $P$, $Q$, and $R$. Next, we prove that although small transformers can faithfully follow such rules, maliciously crafted prompts can still mislead both theoretical constructions and models learned from data. Furthermore, we demonstrate that popular attack algorithms on LLMs find adversarial prompts and induce attention patterns that align with our theory. Our novel logic-based framework provides a foundation for studying LLMs in rule-based settings, enabling a formal analysis of tasks like logical reasoning and jailbreak attacks.

replace-cross EIA: Environmental Injection Attack on Generalist Web Agents for Privacy Leakage

Authors: Zeyi Liao, Lingbo Mo, Chejian Xu, Mintong Kang, Jiawei Zhang, Chaowei Xiao, Yuan Tian, Bo Li, Huan Sun

Abstract: Generalist web agents have demonstrated remarkable potential in autonomously completing a wide range of tasks on real websites, significantly boosting human productivity. However, web tasks, such as booking flights, usually involve users' PII, which may be exposed to potential privacy risks if web agents accidentally interact with compromised websites, a scenario that remains largely unexplored in the literature. In this work, we narrow this gap by conducting the first study on the privacy risks of generalist web agents in adversarial environments. First, we present a realistic threat model for attacks on the website, where we consider two adversarial targets: stealing users' specific PII or the entire user request. Then, we propose a novel attack method, termed Environmental Injection Attack (EIA). EIA injects malicious content designed to adapt well to environments where the agents operate and our work instantiates EIA specifically for privacy scenarios in web environments. We collect 177 action steps that involve diverse PII categories on realistic websites from the Mind2Web, and conduct experiments using one of the most capable generalist web agent frameworks to date. The results demonstrate that EIA achieves up to 70% ASR in stealing specific PII and 16% ASR for full user request. Additionally, by accessing the stealthiness and experimenting with a defensive system prompt, we indicate that EIA is hard to detect and mitigate. Notably, attacks that are not well adapted for a webpage can be detected via human inspection, leading to our discussion about the trade-off between security and autonomy. However, extra attackers' efforts can make EIA seamlessly adapted, rendering such supervision ineffective. Thus, we further discuss the defenses at the pre- and post-deployment stages of the websites without relying on human supervision and call for more advanced defense strategies.

replace-cross U-shaped and Inverted-U Scaling behind Emergent Abilities of Large Language Models

Authors: Tung-Yu Wu, Pei-Yu Lo

Abstract: Large language models (LLMs) have been shown to exhibit emergent abilities in some downstream tasks, where model performance stagnates at first and then improves sharply and unpredictably with scale beyond a threshold. In this work, we investigate the phenomenon by grouping questions based on difficulty level and provide a possible explanation for emergent abilities. Specifically, we observe U-shaped scaling for hard questions and inverted-U scaling followed by steady improvement for easy questions. The two scaling patterns initially offset each other, causing stagnant overall performance. The performance starts to soar when the scaling pattern of easy questions reverts from inverse to standard scaling, leading to emergent abilities. Based on this finding, we propose a simple yet effective pipeline, called Slice-and-Sandwich, to predict the emergence threshold and model performance beyond the threshold. Our code is publicly available at https://github.com/tony10101105/ExpEmergence.

URLs: https://github.com/tony10101105/ExpEmergence.

replace-cross FIRE: Fact-checking with Iterative Retrieval and Verification

Authors: Zhuohan Xie, Rui Xing, Yuxia Wang, Jiahui Geng, Hasan Iqbal, Dhruv Sahnan, Iryna Gurevych, Preslav Nakov

Abstract: Fact-checking long-form text is challenging, and it is therefore common practice to break it down into multiple atomic claims. The typical approach to fact-checking these atomic claims involves retrieving a fixed number of pieces of evidence, followed by a verification step. However, this method is usually not cost-effective, as it underutilizes the verification model's internal knowledge of the claim and fails to replicate the iterative reasoning process in human search strategies. To address these limitations, we propose FIRE, a novel agent-based framework that integrates evidence retrieval and claim verification in an iterative manner. Specifically, FIRE employs a unified mechanism to decide whether to provide a final answer or generate a subsequent search query, based on its confidence in the current judgment. We compare FIRE with other strong fact-checking frameworks and find that it achieves slightly better performance while reducing large language model (LLM) costs by an average of 7.6 times and search costs by 16.5 times. These results indicate that FIRE holds promise for application in large-scale fact-checking operations. Our code is available at https://github.com/mbzuai-nlp/fire.git.

URLs: https://github.com/mbzuai-nlp/fire.git.

replace-cross In-Context Experience Replay Facilitates Safety Red-Teaming of Text-to-Image Diffusion Models

Authors: Zhi-Yi Chin, Mario Fritz, Pin-Yu Chen, Wei-Chen Chiu

Abstract: Text-to-image (T2I) models have shown remarkable progress, but their potential to generate harmful content remains a critical concern in the ML community. While various safety mechanisms have been developed, the field lacks systematic tools for evaluating their effectiveness against real-world misuse scenarios. In this work, we propose ICER, a novel red-teaming framework that leverages Large Language Models (LLMs) and a bandit optimization-based algorithm to generate interpretable and semantic meaningful problematic prompts by learning from past successful red-teaming attempts. Our ICER efficiently probes safety mechanisms across different T2I models without requiring internal access or additional training, making it broadly applicable to deployed systems. Through extensive experiments, we demonstrate that ICER significantly outperforms existing prompt attack methods in identifying model vulnerabilities while maintaining high semantic similarity with intended content. By uncovering that successful jailbreaking instances can systematically facilitate the discovery of new vulnerabilities, our work provides crucial insights for developing more robust safety mechanisms in T2I systems.

replace-cross Guiding Medical Vision-Language Models with Explicit Visual Prompts: Framework Design and Comprehensive Exploration of Prompt Variations

Authors: Kangyu Zhu, Ziyuan Qin, Huahui Yi, Zekun Jiang, Qicheng Lao, Shaoting Zhang, Kang Li

Abstract: While mainstream vision-language models (VLMs) have advanced rapidly in understanding image level information, they still lack the ability to focus on specific areas designated by humans. Rather, they typically rely on large volumes of high-quality image-text paired data to learn and generate posterior attention maps. To address this critical issue, we propose leveraging visual prompts:simple visual markers in various forms to guide and enhance the formation of region-specific attention. Thus, we introduce MedVP, a pioneering framework that integrates medical entity extraction, visual prompt generation, and dataset adaptation for visual prompt guided fine-tuning. We successfully outperform recent state-of-the-art large models across multiple medical VQA datasets. Extensive experiments and Human evaluation are conducted to analyze the impact of different visual prompt forms and how they contribute to performance improvement. The results demonstrate both the effectiveness and clinical significance of our approach.

replace-cross Layer-Level Self-Exposure and Patch: Affirmative Token Mitigation for Jailbreak Attack Defense

Authors: Yang Ouyang, Hengrui Gu, Shuhang Lin, Wenyue Hua, Jie Peng, Bhavya Kailkhura, Meijun Gao, Tianlong Chen, Kaixiong Zhou

Abstract: As large language models (LLMs) are increasingly deployed in diverse applications, including chatbot assistants and code generation, aligning their behavior with safety and ethical standards has become paramount. However, jailbreak attacks, which exploit vulnerabilities to elicit unintended or harmful outputs, threaten LLMs' safety significantly. In this paper, we introduce Layer-AdvPatcher, a novel methodology designed to defend against jailbreak attacks by utilizing an unlearning strategy to patch specific layers within LLMs through self-augmented datasets. Our insight is that certain layer(s), tend to produce affirmative tokens when faced with harmful prompts. By identifying these layers and adversarially exposing them to generate more harmful data, one can understand their inherent and diverse vulnerabilities to attacks. With these exposures, we then "unlearn" these issues, reducing the impact of affirmative tokens and hence minimizing jailbreak risks while keeping the model's responses to safe queries intact. We conduct extensive experiments on two models, four benchmark datasets, and multiple state-of-the-art jailbreak attacks to demonstrate the efficacy of our approach. Results indicate that our framework reduces the harmfulness and attack success rate of jailbreak attacks without compromising utility for benign queries compared to recent defense methods. Our code is publicly available at: https://github.com/oyy2000/LayerAdvPatcher

URLs: https://github.com/oyy2000/LayerAdvPatcher

replace-cross Advancing General Multimodal Capability of Vision-language Models with Pyramid-descent Visual Position Encoding

Authors: Zhanpeng Chen, Mingxiao Li, Ziyang Chen, Nan Du, Xiaolong Li, Yuexian Zou

Abstract: Vision-language Models (VLMs) have shown remarkable capabilities in advancing general artificial intelligence, yet the irrational encoding of visual positions persists in inhibiting the models' comprehensive perception performance across different levels of granularity. In this work, we propose Pyramid-descent Visual Position Encoding (PyPE), a novel approach designed to enhance the perception of visual tokens within VLMs. By assigning visual position indexes from the periphery to the center and expanding the central receptive field incrementally, PyPE addresses the limitations of traditional raster-scan methods and mitigates the long-term decay effects induced by Rotary Position Embedding (RoPE). Our method reduces the relative distance between interrelated visual elements and instruction tokens, promoting a more rational allocation of attention weights and allowing for a multi-granularity perception of visual elements and countering the over-reliance on anchor tokens. Extensive experimental evaluations demonstrate that PyPE consistently improves the general capabilities of VLMs across various sizes. Code is available at https://github.com/SakuraTroyChen/PyPE.

URLs: https://github.com/SakuraTroyChen/PyPE.

replace-cross Humanity's Last Exam

Authors: Long Phan (Michael Pokorny), Alice Gatti (Michael Pokorny), Ziwen Han (Michael Pokorny), Nathaniel Li (Michael Pokorny), Josephina Hu (Michael Pokorny), Hugh Zhang (Michael Pokorny), Chen Bo Calvin Zhang (Michael Pokorny), Mohamed Shaaban (Michael Pokorny), John Ling (Michael Pokorny), Sean Shi (Michael Pokorny), Michael Choi (Michael Pokorny), Anish Agrawal (Michael Pokorny), Arnav Chopra (Michael Pokorny), Adam Khoja (Michael Pokorny), Ryan Kim (Michael Pokorny), Richard Ren (Michael Pokorny), Jason Hausenloy (Michael Pokorny), Oliver Zhang (Michael Pokorny), Mantas Mazeika (Michael Pokorny), Tung Nguyen (Michael Pokorny), Daron Anderson (Michael Pokorny), Imad Ali Shah (Michael Pokorny), Mikhail Doroshenko (Michael Pokorny), Alun Cennyth Stokes (Michael Pokorny), Mobeen Mahmood (Michael Pokorny), Jaeho Lee (Michael Pokorny), Oleksandr Pokutnyi (Michael Pokorny), Oleg Iskra (Michael Pokorny), Jessica P. Wang (Michael Pokorny), Robert Gerbicz (Michael Pokorny), John-Clark Levin (Michael Pokorny), Serguei Popov (Michael Pokorny), Fiona Feng (Michael Pokorny), Steven Y. Feng (Michael Pokorny), Haoran Zhao (Michael Pokorny), Michael Yu (Michael Pokorny), Varun Gangal (Michael Pokorny), Chelsea Zou (Michael Pokorny), Zihan Wang (Michael Pokorny), Mstyslav Kazakov (Michael Pokorny), Geoff Galgon (Michael Pokorny), Johannes Schmitt (Michael Pokorny), Alvaro Sanchez (Michael Pokorny), Yongki Lee (Michael Pokorny), Will Yeadon (Michael Pokorny), Scott Sauers (Michael Pokorny), Marc Roth (Michael Pokorny), Chidozie Agu (Michael Pokorny), S{\o}ren Riis (Michael Pokorny), Fabian Giska (Michael Pokorny), Saiteja Utpala (Michael Pokorny), Antrell Cheatom (Michael Pokorny), Zachary Giboney (Michael Pokorny), Gashaw M. Goshu (Michael Pokorny), Sarah-Jane Crowson (Michael Pokorny), Mohinder Maheshbhai Naiya (Michael Pokorny), Noah Burns (Michael Pokorny), Lennart Finke (Michael Pokorny), Zerui Cheng (Michael Pokorny), Hyunwoo Park (Michael Pokorny), Francesco Fournier-Facio (Michael Pokorny), Jennifer Zampese (Michael Pokorny), John Wydallis (Michael Pokorny), John B. Wydallis (Michael Pokorny), Ryan G. Hoerr (Michael Pokorny), Mark Nandor (Michael Pokorny), Tim Gehrunger (Michael Pokorny), Jiaqi Cai (Michael Pokorny), Ben McCarty (Michael Pokorny), Jungbae Nam (Michael Pokorny), Edwin Taylor (Michael Pokorny), Jun Jin (Michael Pokorny), Gautier Abou Loume (Michael Pokorny), Hangrui Cao (Michael Pokorny), Alexis C Garretson (Michael Pokorny), Damien Sileo (Michael Pokorny), Qiuyu Ren (Michael Pokorny), Doru Cojoc (Michael Pokorny), Pavel Arkhipov (Michael Pokorny), Usman Qazi (Michael Pokorny), Aras Bacho (Michael Pokorny), Lianghui Li (Michael Pokorny), Sumeet Motwani (Michael Pokorny), Christian Schroeder de Witt (Michael Pokorny), Alexei Kopylov (Michael Pokorny), Johannes Veith (Michael Pokorny), Eric Singer (Michael Pokorny), Paolo Rissone (Michael Pokorny), Jaehyeok Jin (Michael Pokorny), Jack Wei Lun Shi (Michael Pokorny), Chris G. Willcocks (Michael Pokorny), Ameya Prabhu (Michael Pokorny), Longke Tang (Michael Pokorny), Kevin Zhou (Michael Pokorny), Emily de Oliveira Santos (Michael Pokorny), Andrey Pupasov Maksimov (Michael Pokorny), Edward Vendrow (Michael Pokorny), Kengo Zenitani (Michael Pokorny), Joshua Robinson (Michael Pokorny), Aleksandar Mikov (Michael Pokorny), Julien Guillod (Michael Pokorny), Yuqi Li (Michael Pokorny), Ben Pageler (Michael Pokorny), Joshua Vendrow (Michael Pokorny), Vladyslav Kuchkin (Michael Pokorny), Pierre Marion (Michael Pokorny), Denis Efremov (Michael Pokorny), Jayson Lynch (Michael Pokorny), Kaiqu Liang (Michael Pokorny), Andrew Gritsevskiy (Michael Pokorny), Dakotah Martinez (Michael Pokorny), Nick Crispino (Michael Pokorny), Dimitri Zvonkine (Michael Pokorny), Natanael Wildner Fraga (Michael Pokorny), Saeed Soori (Michael Pokorny), Ori Press (Michael Pokorny), Henry Tang (Michael Pokorny), Julian Salazar (Michael Pokorny), Sean R. Green (Michael Pokorny), Lina Br\"ussel (Michael Pokorny), Moon Twayana (Michael Pokorny), Aymeric Dieuleveut (Michael Pokorny), T. Ryan Rogers (Michael Pokorny), Wenjin Zhang (Michael Pokorny), Ross Finocchio (Michael Pokorny), Bikun Li (Michael Pokorny), Jinzhou Yang (Michael Pokorny), Arun Rao (Michael Pokorny), Gabriel Loiseau (Michael Pokorny), Mikhail Kalinin (Michael Pokorny), Marco Lukas (Michael Pokorny), Ciprian Manolescu (Michael Pokorny), Nate Stambaugh (Michael Pokorny), Subrata Mishra (Michael Pokorny), Ariel Ghislain Kemogne Kamdoum (Michael Pokorny), Tad Hogg (Michael Pokorny), Alvin Jin (Michael Pokorny), Carlo Bosio (Michael Pokorny), Gongbo Sun (Michael Pokorny), Brian P Coppola (Michael Pokorny), Haline Heidinger (Michael Pokorny), Rafael Sayous (Michael Pokorny), Stefan Ivanov (Michael Pokorny), Joseph M Cavanagh (Michael Pokorny), Jiawei Shen (Michael Pokorny), Joseph Marvin Imperial (Michael Pokorny), Philippe Schwaller (Michael Pokorny), Shaipranesh Senthilkuma (Michael Pokorny), Andres M Bran (Michael Pokorny), Andres Algaba (Michael Pokorny), Brecht Verbeken (Michael Pokorny), Kelsey Van den Houte (Michael Pokorny), Lynn Van Der Sypt (Michael Pokorny), David Noever (Michael Pokorny), Lisa Schut (Michael Pokorny), Ilia Sucholutsky (Michael Pokorny), Evgenii Zheltonozhskii (Michael Pokorny), Qiaochu Yuan (Michael Pokorny), Derek Lim (Michael Pokorny), Richard Stanley (Michael Pokorny), Shankar Sivarajan (Michael Pokorny), Tong Yang (Michael Pokorny), John Maar (Michael Pokorny), Julian Wykowski (Michael Pokorny), Mart\'i Oller (Michael Pokorny), Jennifer Sandlin (Michael Pokorny), Anmol Sahu (Michael Pokorny), Cesare Giulio Ardito (Michael Pokorny), Yuzheng Hu (Michael Pokorny), Felipe Meneguitti Dias (Michael Pokorny), Tobias Kreiman (Michael Pokorny), Kaivalya Rawal (Michael Pokorny), Tobias Garcia Vilchis (Michael Pokorny), Yuexuan Zu (Michael Pokorny), Martin Lackner (Michael Pokorny), James Koppel (Michael Pokorny), Jeremy Nguyen (Michael Pokorny), Daniil S. Antonenko (Michael Pokorny), Steffi Chern (Michael Pokorny), Bingchen Zhao (Michael Pokorny), Pierrot Arsene (Michael Pokorny), Sergey Ivanov (Michael Pokorny), Rafa{\l} Po\'swiata (Michael Pokorny), Chenguang Wang (Michael Pokorny), Daofeng Li (Michael Pokorny), Donato Crisostomi (Michael Pokorny), Ali Dehghan (Michael Pokorny), Andrea Achilleos (Michael Pokorny), John Arnold Ambay (Michael Pokorny), Benjamin Myklebust (Michael Pokorny), Archan Sen (Michael Pokorny), David Perrella (Michael Pokorny), Nurdin Kaparov (Michael Pokorny), Mark H Inlow (Michael Pokorny), Allen Zang (Michael Pokorny), Kalyan Ramakrishnan (Michael Pokorny), Daniil Orel (Michael Pokorny), Vladislav Poritski (Michael Pokorny), Shalev Ben-David (Michael Pokorny), Zachary Berger (Michael Pokorny), Parker Whitfill (Michael Pokorny), Michael Foster (Michael Pokorny), Daniel Munro (Michael Pokorny), Linh Ho (Michael Pokorny), Dan Bar Hava (Michael Pokorny), Aleksey Kuchkin (Michael Pokorny), Robert Lauff (Michael Pokorny), David Holmes (Michael Pokorny), Frank Sommerhage (Michael Pokorny), Anji Zhang (Michael Pokorny), Richard Moat (Michael Pokorny), Keith Schneider (Michael Pokorny), Daniel Pyda (Michael Pokorny), Zakayo Kazibwe (Michael Pokorny), Mukhwinder Singh (Michael Pokorny), Don Clarke (Michael Pokorny), Dae Hyun Kim (Michael Pokorny), Sara Fish (Michael Pokorny), Veit Elser (Michael Pokorny), Victor Efren Guadarrama Vilchis (Michael Pokorny), Immo Klose (Michael Pokorny), Christoph Demian (Michael Pokorny), Ujjwala Anantheswaran (Michael Pokorny), Adam Zweiger (Michael Pokorny), Guglielmo Albani (Michael Pokorny), Jeffery Li (Michael Pokorny), Nicolas Daans (Michael Pokorny), Maksim Radionov (Michael Pokorny), V\'aclav Rozho\v{n} (Michael Pokorny), Vincent Ginis (Michael Pokorny), Ziqiao Ma (Michael Pokorny), Christian Stump (Michael Pokorny), Jacob Platnick (Michael Pokorny), Volodymyr Nevirkovets (Michael Pokorny), Luke Basler (Michael Pokorny), Marco Piccardo (Michael Pokorny), Niv Cohen (Michael Pokorny), Virendra Singh (Michael Pokorny), Josef Tkadlec (Michael Pokorny), Paul Rosu (Michael Pokorny), Alan Goldfarb (Michael Pokorny), Piotr Padlewski (Michael Pokorny), Stanislaw Barzowski (Michael Pokorny), Kyle Montgomery (Michael Pokorny), Aline Menezes (Michael Pokorny), Arkil Patel (Michael Pokorny), Zixuan Wang (Michael Pokorny), Jamie Tucker-Foltz (Michael Pokorny), Jack Stade (Michael Pokorny), Declan Grabb (Michael Pokorny), Tom Goertzen (Michael Pokorny), Fereshteh Kazemi (Michael Pokorny), Jeremiah Milbauer (Michael Pokorny), Abhishek Shukla (Michael Pokorny), Hossam Elgnainy (Michael Pokorny), Yan Carlos Leyva Labrador (Michael Pokorny), Hao He (Michael Pokorny), Ling Zhang (Michael Pokorny), Alan Givr\'e (Michael Pokorny), Hew Wolff (Michael Pokorny), G\"ozdenur Demir (Michael Pokorny), Muhammad Fayez Aziz (Michael Pokorny), Younesse Kaddar (Michael Pokorny), Ivar \"Angquist (Michael Pokorny), Yanxu Chen (Michael Pokorny), Elliott Thornley (Michael Pokorny), Robin Zhang (Michael Pokorny), Jiayi Pan (Michael Pokorny), Antonio Terpin (Michael Pokorny), Niklas Muennighoff (Michael Pokorny), Hailey Schoelkopf (Michael Pokorny), Eric Zheng (Michael Pokorny), Avishy Carmi (Michael Pokorny), Jainam Shah (Michael Pokorny), Ethan D. L. Brown (Michael Pokorny), Kelin Zhu (Michael Pokorny), Max Bartolo (Michael Pokorny), Richard Wheeler (Michael Pokorny), Andrew Ho (Michael Pokorny), Shaul Barkan (Michael Pokorny), Jiaqi Wang (Michael Pokorny), Martin Stehberger (Michael Pokorny), Egor Kretov (Michael Pokorny), Peter Bradshaw (Michael Pokorny), JP Heimonen (Michael Pokorny), Kaustubh Sridhar (Michael Pokorny), Zaki Hossain (Michael Pokorny), Ido Akov (Michael Pokorny), Yury Makarychev (Michael Pokorny), Joanna Tam (Michael Pokorny), Hieu Hoang (Michael Pokorny), David M. Cunningham (Michael Pokorny), Vladimir Goryachev (Michael Pokorny), Demosthenes Patramanis (Michael Pokorny), Michael Krause (Michael Pokorny), Andrew Redenti (Michael Pokorny), David Aldous (Michael Pokorny), Jesyin Lai (Michael Pokorny), Shannon Coleman (Michael Pokorny), Jiangnan Xu (Michael Pokorny), Sangwon Lee (Michael Pokorny), Ilias Magoulas (Michael Pokorny), Sandy Zhao (Michael Pokorny), Ning Tang (Michael Pokorny), Michael K. Cohen (Michael Pokorny), Micah Carroll (Michael Pokorny), Orr Paradise (Michael Pokorny), Jan Hendrik Kirchner (Michael Pokorny), Stefan Steinerberger (Michael Pokorny), Maksym Ovchynnikov (Michael Pokorny), Jason O. Matos (Michael Pokorny), Adithya Shenoy (Michael Pokorny), Michael Wang (Michael Pokorny), Yuzhou Nie (Michael Pokorny), Paolo Giordano (Michael Pokorny), Philipp Petersen (Michael Pokorny), Anna Sztyber-Betley (Michael Pokorny), Paolo Faraboschi (Michael Pokorny), Robin Riblet (Michael Pokorny), Jonathan Crozier (Michael Pokorny), Shiv Halasyamani (Michael Pokorny), Antonella Pinto (Michael Pokorny), Shreyas Verma (Michael Pokorny), Prashant Joshi (Michael Pokorny), Eli Meril (Michael Pokorny), Zheng-Xin Yong (Michael Pokorny), Allison Tee (Michael Pokorny), J\'er\'emy Andr\'eoletti (Michael Pokorny), Orion Weller (Michael Pokorny), Raghav Singhal (Michael Pokorny), Gang Zhang (Michael Pokorny), Alexander Ivanov (Michael Pokorny), Seri Khoury (Michael Pokorny), Nils Gustafsson (Michael Pokorny), Hamid Mostaghimi (Michael Pokorny), Kunvar Thaman (Michael Pokorny), Qijia Chen (Michael Pokorny), Tran Quoc Kh\'anh (Michael Pokorny), Jacob Loader (Michael Pokorny), Stefano Cavalleri (Michael Pokorny), Hannah Szlyk (Michael Pokorny), Zachary Brown (Michael Pokorny), Himanshu Narayan (Michael Pokorny), Jonathan Roberts (Michael Pokorny), William Alley (Michael Pokorny), Kunyang Sun (Michael Pokorny), Ryan Stendall (Michael Pokorny), Max Lamparth (Michael Pokorny), Anka Reuel (Michael Pokorny), Ting Wang (Michael Pokorny), Hanmeng Xu (Michael Pokorny), Pablo Hern\'andez-C\'amara (Michael Pokorny), Freddie Martin (Michael Pokorny), Thomas Preu (Michael Pokorny), Tomek Korbak (Michael Pokorny), Marcus Abramovitch (Michael Pokorny), Dominic Williamson (Michael Pokorny), Ida Bosio (Michael Pokorny), Ziye Chen (Michael Pokorny), Bir\'o B\'alint (Michael Pokorny), Eve J. Y. Lo (Michael Pokorny), Maria In\^es S. Nunes (Michael Pokorny), Yibo Jiang (Michael Pokorny), M Saiful Bari (Michael Pokorny), Peyman Kassani (Michael Pokorny), Zihao Wang (Michael Pokorny), Behzad Ansarinejad (Michael Pokorny), Yewen Sun (Michael Pokorny), Stephane Durand (Michael Pokorny), Guillaume Douville (Michael Pokorny), Daniel Tordera (Michael Pokorny), George Balabanian (Michael Pokorny), Earth Anderson (Michael Pokorny), Lynna Kvistad (Michael Pokorny), Alejandro Jos\'e Moyano (Michael Pokorny), Hsiaoyun Milliron (Michael Pokorny), Ahmad Sakor (Michael Pokorny), Murat Eron (Michael Pokorny), Isaac C. McAlister (Michael Pokorny), Andrew Favre D. O. (Michael Pokorny), Shailesh Shah (Michael Pokorny), Xiaoxiang Zhou (Michael Pokorny), Firuz Kamalov (Michael Pokorny), Ronald Clark (Michael Pokorny), Sherwin Abdoli (Michael Pokorny), Tim Santens (Michael Pokorny), Harrison K Wang (Michael Pokorny), Evan Chen (Michael Pokorny), Alessandro Tomasiello (Michael Pokorny), G. Bruno De Luca (Michael Pokorny), Shi-Zhuo Looi (Michael Pokorny), Vinh-Kha Le (Michael Pokorny), Noam Kolt (Michael Pokorny), Niels M\"undler (Michael Pokorny), Avi Semler (Michael Pokorny), Emma Rodman (Michael Pokorny), Jacob Drori (Michael Pokorny), Carl J Fossum (Michael Pokorny), Luk Gloor (Michael Pokorny), Milind Jagota (Michael Pokorny), Ronak Pradeep (Michael Pokorny), Honglu Fan (Michael Pokorny), Tej Shah (Michael Pokorny), Jonathan Eicher (Michael Pokorny), Michael Chen (Michael Pokorny), Kushal Thaman (Michael Pokorny), William Merrill (Michael Pokorny), Moritz Firsching (Michael Pokorny), Carter Harris (Michael Pokorny), Stefan Ciob\^ac\u{a} (Michael Pokorny), Jason Gross (Michael Pokorny), Rohan Pandey (Michael Pokorny), Ilya Gusev (Michael Pokorny), Adam Jones (Michael Pokorny), Shashank Agnihotri (Michael Pokorny), Pavel Zhelnov (Michael Pokorny), Siranut Usawasutsakorn (Michael Pokorny), Mohammadreza Mofayezi (Michael Pokorny), Alexander Piperski (Michael Pokorny), Marc Carauleanu (Michael Pokorny), David K. Zhang (Michael Pokorny), Kostiantyn Dobarskyi (Michael Pokorny), Dylan Ler (Michael Pokorny), Roman Leventov (Michael Pokorny), Ignat Soroko (Michael Pokorny), Thorben Jansen (Michael Pokorny), Scott Creighton (Michael Pokorny), Pascal Lauer (Michael Pokorny), Joshua Duersch (Michael Pokorny), Vage Taamazyan (Michael Pokorny), Dario Bezzi (Michael Pokorny), Wiktor Morak (Michael Pokorny), Wenjie Ma (Michael Pokorny), William Held (Michael Pokorny), Tran {\DJ}uc Huy (Michael Pokorny), Ruicheng Xian (Michael Pokorny), Armel Randy Zebaze (Michael Pokorny), Mohanad Mohamed (Michael Pokorny), Julian Noah Leser (Michael Pokorny), Michelle X Yuan (Michael Pokorny), Laila Yacar (Michael Pokorny), Johannes Lengler (Michael Pokorny), Katarzyna Olszewska (Michael Pokorny), Hossein Shahrtash (Michael Pokorny), Edson Oliveira (Michael Pokorny), Joseph W. Jackson (Michael Pokorny), Daniel Espinosa Gonzalez (Michael Pokorny), Andy Zou (Michael Pokorny), Muthu Chidambaram (Michael Pokorny), Timothy Manik (Michael Pokorny), Hector Haffenden (Michael Pokorny), Dashiell Stander (Michael Pokorny), Ali Dasouqi (Michael Pokorny), Alexander Shen (Michael Pokorny), Emilien Duc (Michael Pokorny), Bita Golshani (Michael Pokorny), David Stap (Michael Pokorny), Mikalai Uzhou (Michael Pokorny), Alina Borisovna Zhidkovskaya (Michael Pokorny), Lukas Lewark (Michael Pokorny), Miguel Orbegozo Rodriguez (Michael Pokorny), M\'aty\'as Vincze (Michael Pokorny), Dustin Wehr (Michael Pokorny), Colin Tang (Michael Pokorny), Shaun Phillips (Michael Pokorny), Fortuna Samuele (Michael Pokorny), Jiang Muzhen (Michael Pokorny), Fredrik Ekstr\"om (Michael Pokorny), Angela Hammon (Michael Pokorny), Oam Patel (Michael Pokorny), Faraz Farhidi (Michael Pokorny), George Medley (Michael Pokorny), Forough Mohammadzadeh (Michael Pokorny), Madellene Pe\~naflor (Michael Pokorny), Haile Kassahun (Michael Pokorny), Alena Friedrich (Michael Pokorny), Claire Sparrow (Michael Pokorny), Rayner Hernandez Perez (Michael Pokorny), Taom Sakal (Michael Pokorny), Omkar Dhamane (Michael Pokorny), Ali Khajegili Mirabadi (Michael Pokorny), Eric Hallman (Michael Pokorny), Kenchi Okutsu (Michael Pokorny), Mike Battaglia (Michael Pokorny), Mohammad Maghsoudimehrabani (Michael Pokorny), Alon Amit (Michael Pokorny), Dave Hulbert (Michael Pokorny), Roberto Pereira (Michael Pokorny), Simon Weber (Michael Pokorny), Handoko (Michael Pokorny), Anton Peristyy (Michael Pokorny), Stephen Malina (Michael Pokorny), Samuel Albanie (Michael Pokorny), Will Cai (Michael Pokorny), Mustafa Mehkary (Michael Pokorny), Rami Aly (Michael Pokorny), Frank Reidegeld (Michael Pokorny), Anna-Katharina Dick (Michael Pokorny), Cary Friday (Michael Pokorny), Jasdeep Sidhu (Michael Pokorny), Hassan Shapourian (Michael Pokorny), Wanyoung Kim (Michael Pokorny), Mariana Costa (Michael Pokorny), Hubeyb Gurdogan (Michael Pokorny), Brian Weber (Michael Pokorny), Harsh Kumar (Michael Pokorny), Tong Jiang (Michael Pokorny), Arunim Agarwal (Michael Pokorny), Chiara Ceconello (Michael Pokorny), Warren S. Vaz (Michael Pokorny), Chao Zhuang (Michael Pokorny), Haon Park (Michael Pokorny), Andrew R. Tawfeek (Michael Pokorny), Daattavya Aggarwal (Michael Pokorny), Michael Kirchhof (Michael Pokorny), Linjie Dai (Michael Pokorny), Evan Kim (Michael Pokorny), Johan Ferret (Michael Pokorny), Yuzhou Wang (Michael Pokorny), Minghao Yan (Michael Pokorny), Krzysztof Burdzy (Michael Pokorny), Lixin Zhang (Michael Pokorny), Antonio Franca (Michael Pokorny), Diana T. Pham (Michael Pokorny), Kang Yong Loh (Michael Pokorny), Joshua Robinson (Michael Pokorny), Abram Jackson (Michael Pokorny), Shreen Gul (Michael Pokorny), Gunjan Chhablani (Michael Pokorny), Zhehang Du (Michael Pokorny), Adrian Cosma (Michael Pokorny), Jesus Colino (Michael Pokorny), Colin White (Michael Pokorny), Jacob Votava (Michael Pokorny), Vladimir Vinnikov (Michael Pokorny), Ethan Delaney (Michael Pokorny), Petr Spelda (Michael Pokorny), Vit Stritecky (Michael Pokorny), Syed M. Shahid (Michael Pokorny), Jean-Christophe Mourrat (Michael Pokorny), Lavr Vetoshkin (Michael Pokorny), Koen Sponselee (Michael Pokorny), Renas Bacho (Michael Pokorny), Florencia de la Rosa (Michael Pokorny), Xiuyu Li (Michael Pokorny), Guillaume Malod (Michael Pokorny), Leon Lang (Michael Pokorny), Julien Laurendeau (Michael Pokorny), Dmitry Kazakov (Michael Pokorny), Fatimah Adesanya (Michael Pokorny), Julien Portier (Michael Pokorny), Lawrence Hollom (Michael Pokorny), Victor Souza (Michael Pokorny), Yuchen Anna Zhou (Michael Pokorny), Julien Degorre (Michael Pokorny), Yi\u{g}it Yal{\i}n (Michael Pokorny), Gbenga Daniel Obikoya (Michael Pokorny), Luca Arnaboldi (Michael Pokorny), Rai (Michael Pokorny), Filippo Bigi (Quinn), M. C. Bosc\'a (Quinn), Oleg Shumar (Quinn), Kaniuar Bacho (Quinn), Pierre Clavier (Quinn), Gabriel Recchia (Quinn), Mara Popescu (Quinn), Nikita Shulga (Quinn), Ngefor Mildred Tanwie (Quinn), Denis Peskoff (Quinn), Thomas C. H. Lux (Quinn), Ben Rank (Quinn), Colin Ni (Quinn), Matthew Brooks (Quinn), Alesia Yakimchyk (Quinn), Huanxu (Quinn), Liu (Tony), Olle H\"aggstr\"om (Tony), Emil Verkama (Tony), Hans Gundlach (Tony), Leonor Brito-Santana (Tony), Brian Amaro (Tony), Vivek Vajipey (Tony), Rynaa Grover (Tony), Yiyang Fan (Tony), Gabriel Poesia Reis e Silva (Tony), Linwei Xin (Tony), Yosi Kratish (Tony), Jakub {\L}ucki (Tony), Wen-Ding Li (Tony), Sivakanth Gopi (Tony), Andrea Caciolai (Tony), Justin Xu (Tony), Kevin Joseph Scaria (Tony), Freddie Vargus (Tony), Farzad Habibi (Tony), Long (Tony), Lian, Emanuele Rodol\`a, Jules Robins, Vincent Cheng, Tony Fruhauff, Brad Raynor, Hao Qi, Xi Jiang, Ben Segev, Jingxuan Fan, Sarah Martinson, Erik Y. Wang, Kaylie Hausknecht, Michael P. Brenner, Mao Mao, Xinyu Zhang, David Avagian, Eshawn Jessica Scipio, Alon Ragoler, Justin Tan, Blake Sims, Rebeka Plecnik, Aaron Kirtland, Omer Faruk Bodur, D. P. Shinde, Zahra Adoul, Mohamed Zekry, Ali Karakoc, Tania C. B. Santos, Samir Shamseldeen, Loukmane Karim, Anna Liakhovitskaia, Nate Resman, Nicholas Farina, Juan Carlos Gonzalez, Gabe Maayan, Sarah Hoback, Rodrigo De Oliveira Pena, Glen Sherman, Elizabeth Kelley, Hodjat Mariji, Rasoul Pouriamanesh, Wentao Wu, Sandra Mendoza, Ismail Alarab, Joshua Cole, Danyelle Ferreira, Bryan Johnson, Mohammad Safdari, Liangti Dai, Siriphan Arthornthurasuk, Alexey Pronin, Jing Fan, Angel Ramirez-Trinidad, Ashley Cartwright, Daphiny Pottmaier, Omid Taheri, David Outevsky, Stanley Stepanic, Samuel Perry, Luke Askew, Ra\'ul Adri\'an Huerta Rodr\'iguez, Ali M. R. Minissi, Sam Ali, Ricardo Lorena, Krishnamurthy Iyer, Arshad Anil Fasiludeen, Sk Md Salauddin, Murat Islam, Juan Gonzalez, Josh Ducey, Maja Somrak, Vasilios Mavroudis, Eric Vergo, Juehang Qin, Benj\'amin Borb\'as, Eric Chu, Jack Lindsey, Anil Radhakrishnan, Antoine Jallon, I. M. J. McInnis, Pawan Kumar, Laxman Prasad Goswami, Daniel Bugas, Nasser Heydari, Ferenc Jeanplong, Archimedes Apronti, Abdallah Galal, Ng Ze-An, Ankit Singh, Joan of Arc Xavier, Kanu Priya Agarwal, Mohammed Berkani, Benedito Alves de Oliveira Junior, Dmitry Malishev, Nicolas Remy, Taylor D. Hartman, Tim Tarver, Stephen Mensah, Javier Gimenez, Roselynn Grace Montecillo, Russell Campbell, Asankhaya Sharma, Khalida Meer, Xavier Alapont, Deepakkumar Patil, Rajat Maheshwari, Abdelkader Dendane, Priti Shukla, Sergei Bogdanov, S\"oren M\"oller, Muhammad Rehan Siddiqi, Prajvi Saxena, Himanshu Gupta, Innocent Enyekwe, Ragavendran P V, Zienab EL-Wasif, Aleksandr Maksapetyan, Vivien Rossbach, Chris Harjadi, Mohsen Bahaloohoreh, Song Bian, John Lai, Justine Leon Uro, Greg Bateman, Mohamed Sayed, Ahmed Menshawy, Darling Duclosel, Yashaswini Jain, Ashley Aaron, Murat Tiryakioglu, Sheeshram Siddh, Keith Krenek, Alex Hoover, Joseph McGowan, Tejal Patwardhan, Summer Yue, Alexandr Wang, Dan Hendrycks

Abstract: Benchmarks are important tools for tracking the rapid advancements in large language model (LLM) capabilities. However, benchmarks are not keeping pace in difficulty: LLMs now achieve over 90\% accuracy on popular benchmarks like MMLU, limiting informed measurement of state-of-the-art LLM capabilities. In response, we introduce Humanity's Last Exam (HLE), a multi-modal benchmark at the frontier of human knowledge, designed to be the final closed-ended academic benchmark of its kind with broad subject coverage. HLE consists of 3,000 questions across dozens of subjects, including mathematics, humanities, and the natural sciences. HLE is developed globally by subject-matter experts and consists of multiple-choice and short-answer questions suitable for automated grading. Each question has a known solution that is unambiguous and easily verifiable, but cannot be quickly answered via internet retrieval. State-of-the-art LLMs demonstrate low accuracy and calibration on HLE, highlighting a significant gap between current LLM capabilities and the expert human frontier on closed-ended academic questions. To inform research and policymaking upon a clear understanding of model capabilities, we publicly release HLE at https://lastexam.ai.

URLs: https://lastexam.ai.

replace-cross TAID: Temporally Adaptive Interpolated Distillation for Efficient Knowledge Transfer in Language Models

Authors: Makoto Shing, Kou Misaki, Han Bao, Sho Yokoi, Takuya Akiba

Abstract: Causal language models have demonstrated remarkable capabilities, but their size poses significant challenges for deployment in resource-constrained environments. Knowledge distillation, a widely-used technique for transferring knowledge from a large teacher model to a small student model, presents a promising approach for model compression. A significant remaining issue lies in the major differences between teacher and student models, namely the substantial capacity gap, mode averaging, and mode collapse, which pose barriers during distillation. To address these issues, we introduce $\textit{Temporally Adaptive Interpolated Distillation (TAID)}$, a novel knowledge distillation approach that dynamically interpolates student and teacher distributions through an adaptive intermediate distribution, gradually shifting from the student's initial distribution towards the teacher's distribution. We provide a theoretical analysis demonstrating TAID's ability to prevent mode collapse and empirically show its effectiveness in addressing the capacity gap while balancing mode averaging and mode collapse. Our comprehensive experiments demonstrate TAID's superior performance across various model sizes and architectures in both instruction tuning and pre-training scenarios. Furthermore, we showcase TAID's practical impact by developing two state-of-the-art compact foundation models: $\texttt{TAID-LLM-1.5B}$ for language tasks and $\texttt{TAID-VLM-2B}$ for vision-language tasks. These results demonstrate TAID's effectiveness in creating high-performing and efficient models, advancing the development of more accessible AI technologies.

replace-cross Uncertainty Quantification and Decomposition for LLM-based Recommendation

Authors: Wonbin Kweon, Sanghwan Jang, SeongKu Kang, Hwanjo Yu

Abstract: Despite the widespread adoption of large language models (LLMs) for recommendation, we demonstrate that LLMs often exhibit uncertainty in their recommendations. To ensure the trustworthy use of LLMs in generating recommendations, we emphasize the importance of assessing the reliability of recommendations generated by LLMs. We start by introducing a novel framework for estimating the predictive uncertainty to quantitatively measure the reliability of LLM-based recommendations. We further propose to decompose the predictive uncertainty into recommendation uncertainty and prompt uncertainty, enabling in-depth analyses of the primary source of uncertainty. Through extensive experiments, we (1) demonstrate predictive uncertainty effectively indicates the reliability of LLM-based recommendations, (2) investigate the origins of uncertainty with decomposed uncertainty measures, and (3) propose uncertainty-aware prompting for a lower predictive uncertainty and enhanced recommendation. Our source code and model weights are available at https://github.com/WonbinKweon/UNC_LLM_REC_WWW2025

URLs: https://github.com/WonbinKweon/UNC_LLM_REC_WWW2025

replace-cross Ola: Pushing the Frontiers of Omni-Modal Language Model with Progressive Modality Alignment

Authors: Zuyan Liu, Yuhao Dong, Jiahui Wang, Ziwei Liu, Winston Hu, Jiwen Lu, Yongming Rao

Abstract: Recent advances in large language models, particularly following GPT-4o, have sparked increasing interest in developing omni-modal models capable of understanding more modalities. While some open-source alternatives have emerged, there is still a notable lag behind specialized single-modality models in performance. In this paper, we present Ola, an Omni-modal language model that achieves competitive performance across image, video, and audio understanding compared to specialized counterparts. The core design of Ola lies in its progressive modality alignment strategy that extends the supporting modality of the language model progressively. Our training pipeline begins with the most distinct modalities: image and text, then gradually expands the skill sets of the model using speech data that connects language and audio knowledge, and video data that connects all modalities. The progressive learning pipeline also enables us to maintain a relatively small size of the cross-modal alignment data, making developing omni-modal from existing vision-language models easy and less costly. Moreover, to unlock an advanced interactive experience like GPT-4o, we further design a sentence-wise decoding solution for streaming speech generation. Extensive experiments demonstrate that Ola surpasses existing open omni-modal LLMs across all modalities while achieving highly competitive performance compared to state-of-the-art specialized models of similar sizes. We aim to make Ola a fully open omni-modal understanding solution to advance future research in this emerging field. Model weights, code, and data are open-sourced at https://github.com/Ola-Omni/Ola.

URLs: https://github.com/Ola-Omni/Ola.

replace-cross Safety at Scale: A Comprehensive Survey of Large Model Safety

Authors: Xingjun Ma, Yifeng Gao, Yixu Wang, Ruofan Wang, Xin Wang, Ye Sun, Yifan Ding, Hengyuan Xu, Yunhao Chen, Yunhan Zhao, Hanxun Huang, Yige Li, Jiaming Zhang, Xiang Zheng, Yang Bai, Zuxuan Wu, Xipeng Qiu, Jingfeng Zhang, Yiming Li, Jun Sun, Cong Wang, Jindong Gu, Baoyuan Wu, Siheng Chen, Tianwei Zhang, Yang Liu, Mingming Gong, Tongliang Liu, Shirui Pan, Cihang Xie, Tianyu Pang, Yinpeng Dong, Ruoxi Jia, Yang Zhang, Shiqing Ma, Xiangyu Zhang, Neil Gong, Chaowei Xiao, Sarah Erfani, Bo Li, Masashi Sugiyama, Dacheng Tao, James Bailey, Yu-Gang Jiang

Abstract: The rapid advancement of large models, driven by their exceptional abilities in learning and generalization through large-scale pre-training, has reshaped the landscape of Artificial Intelligence (AI). These models are now foundational to a wide range of applications, including conversational AI, recommendation systems, autonomous driving, content generation, medical diagnostics, and scientific discovery. However, their widespread deployment also exposes them to significant safety risks, raising concerns about robustness, reliability, and ethical implications. This survey provides a systematic review of current safety research on large models, covering Vision Foundation Models (VFMs), Large Language Models (LLMs), Vision-Language Pre-training (VLP) models, Vision-Language Models (VLMs), Diffusion Models (DMs), and large-model-based Agents. Our contributions are summarized as follows: (1) We present a comprehensive taxonomy of safety threats to these models, including adversarial attacks, data poisoning, backdoor attacks, jailbreak and prompt injection attacks, energy-latency attacks, data and model extraction attacks, and emerging agent-specific threats. (2) We review defense strategies proposed for each type of attacks if available and summarize the commonly used datasets and benchmarks for safety research. (3) Building on this, we identify and discuss the open challenges in large model safety, emphasizing the need for comprehensive safety evaluations, scalable and effective defense mechanisms, and sustainable data practices. More importantly, we highlight the necessity of collective efforts from the research community and international collaboration. Our work can serve as a useful reference for researchers and practitioners, fostering the ongoing development of comprehensive defense systems and platforms to safeguard AI models.

replace-cross Music for All: Exploring Multicultural Representations in Music Generation Models

Authors: Atharva Mehta, Shivam Chauhan, Amirbek Djanibekov, Atharva Kulkarni, Gus Xia, Monojit Choudhury

Abstract: The advent of Music-Language Models has greatly enhanced the automatic music generation capability of AI systems, but they are also limited in their coverage of the musical genres and cultures of the world. We present a study of the datasets and research papers for music generation and quantify the bias and under-representation of genres. We find that only 5.7% of the total hours of existing music datasets come from non-Western genres, which naturally leads to disparate performance of the models across genres. We then investigate the efficacy of Parameter-Efficient Fine-Tuning (PEFT) techniques in mitigating this bias. Our experiments with two popular models -- MusicGen and Mustango, for two underrepresented non-Western music traditions -- Hindustani Classical and Turkish Makam music, highlight the promises as well as the non-triviality of cross-genre adaptation of music through small datasets, implying the need for more equitable baseline music-language models that are designed for cross-cultural transfer learning.

replace-cross Automated Capability Discovery via Model Self-Exploration

Authors: Cong Lu, Shengran Hu, Jeff Clune

Abstract: Foundation models have become general-purpose assistants, exhibiting diverse capabilities across numerous domains through training on web-scale data. It remains challenging to precisely characterize even a fraction of the full spectrum of capabilities and potential risks in any new model. Existing evaluation approaches often require significant human effort, and it is taking increasing effort to design ever harder challenges for more capable models. We introduce Automated Capability Discovery (ACD), a framework that designates one foundation model as a scientist to systematically propose open-ended tasks probing the abilities of a subject model (potentially itself). By combining frontier models with ideas from the field of open-endedness, ACD automatically and systematically uncovers both surprising capabilities and failures in the subject model. We demonstrate ACD across a range of foundation models (including the GPT, Claude, and Llama series), showing that it automatically reveals thousands of capabilities that would be challenging for any single team to uncover. We further validate our method's automated scoring with extensive human surveys, observing high agreement between model-generated and human evaluations. By leveraging foundation models' ability to both create tasks and self-evaluate, ACD is a significant step toward scalable, automated evaluation of novel AI systems. All code and evaluation logs are open-sourced at https://github.com/conglu1997/ACD.

URLs: https://github.com/conglu1997/ACD.