new Language Models and Retrieval Augmented Generation for Automated Structured Data Extraction from Diagnostic Reports

Authors: Mohamed Sobhi Jabal, Pranav Warman, Jikai Zhang, Kartikeye Gupta, Ayush Jain, Maciej Mazurowski, Walter Wiggins, Kirti Magudia, Evan Calabrese

Abstract: Purpose: To develop and evaluate an automated system for extracting structured clinical information from unstructured radiology and pathology reports using open-weights large language models (LMs) and retrieval augmented generation (RAG), and to assess the effects of model configuration variables on extraction performance. Methods and Materials: The study utilized two datasets: 7,294 radiology reports annotated for Brain Tumor Reporting and Data System (BT-RADS) scores and 2,154 pathology reports annotated for isocitrate dehydrogenase (IDH) mutation status. An automated pipeline was developed to benchmark the performance of various LMs and RAG configurations. The impact of model size, quantization, prompting strategies, output formatting, and inference parameters was systematically evaluated. Results: The best-performing models achieved over 98% accuracy in extracting BT-RADS scores from radiology reports and over 90% for IDH mutation status extraction from pathology reports. The top-performing model was a medically fine-tuned Llama 3. Larger, newer, and domain fine-tuned models consistently outperformed older and smaller models. Model quantization had minimal impact on performance. Few-shot prompting significantly improved accuracy. RAG improved performance for complex pathology reports but not for shorter radiology reports. Conclusions: Open LMs demonstrate significant potential for automated extraction of structured clinical data from unstructured clinical reports in a local, privacy-preserving setting. Careful model selection, prompt engineering, and semi-automated optimization using annotated data are critical for optimal performance. These approaches could be reliable enough for practical use in research workflows, highlighting the potential for human-machine collaboration in healthcare data extraction.

new Exploring Fine-tuned Generative Models for Keyphrase Selection: A Case Study for Russian

Authors: Anna Glazkova, Dmitry Morozov

Abstract: Keyphrase selection plays a pivotal role within the domain of scholarly texts, facilitating efficient information retrieval, summarization, and indexing. In this work, we explored how to apply fine-tuned generative transformer-based models to the specific task of keyphrase selection within Russian scientific texts. We experimented with four distinct generative models (ruT5, ruGPT, mT5, and mBART) and evaluated their performance in both in-domain and cross-domain settings. The experiments were conducted on the texts of Russian scientific abstracts from four domains: mathematics & computer science, history, medicine, and linguistics. The use of generative models, namely mBART, led to gains in in-domain performance (up to 4.9% in BERTScore, 9.0% in ROUGE-1, and 12.2% in F1-score) over three keyphrase extraction baselines for the Russian language. Although the results for cross-domain usage were significantly lower, they still demonstrated the capability to surpass baseline performances in several cases, underscoring the promising potential for further exploration and refinement in this research field.

new Improving Multi-candidate Speculative Decoding

Authors: Xiaofan Lu, Yixiao Zeng, Feiyang Ma, Zixu Yu, Marco Levorato

Abstract: Speculative Decoding (SD) is a technique to accelerate the inference of Large Language Models (LLMs) by using a lower-complexity draft model to propose candidate tokens that are verified by a larger target model. To further improve efficiency, Multi-Candidate Speculative Decoding (MCSD) samples multiple candidate tokens from the draft model at each step and verifies them in parallel, thus increasing the chances of accepting a token and reducing generation time. Existing MCSD methods rely on the draft model to initialize the multi-candidate sequences and use a static length and tree attention structure for draft generation. However, such an approach suffers from the differences between the draft and target models' output distributions, especially in dynamic generation contexts. In this work, we introduce an improved version of MCSD that includes a target-model-initialized multi-candidate process, a dynamic sliced topology-aware causal mask for dynamic length adjustment, and decision models to optimize early stopping. Our framework improves the acceptance rate, defined as the ratio of the longest draft sequence length accepted by the target model over the maximum draft sequence length, by a maximum of 164% and gains a maximum of 75% generation speed-up over the MCSD baseline. We also conduct an ablation study to evaluate the impact of the decision model.
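
To make the acceptance-rate metric defined above concrete, here is a minimal Python sketch; the helper name and example values are ours, not the authors' implementation:

    # Acceptance rate: accepted draft length over maximum draft length, averaged over steps.
    def acceptance_rate(accepted_lengths, max_draft_len):
        return sum(n / max_draft_len for n in accepted_lengths) / len(accepted_lengths)

    # Example: the target model accepted 3, 5, and 2 of up to 5 draft tokens per step.
    print(acceptance_rate([3, 5, 2], max_draft_len=5))  # ~0.667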

new Visualizing Temporal Topic Embeddings with a Compass

Authors: Daniel Palamarchuk, Lemara Williams, Brian Mayer, Thomas Danielson, Rebecca Faust, Larry Deschaine, Chris North

Abstract: Dynamic topic modeling is useful at discovering the development and change in latent topics over time. However, present methodology relies on algorithms that separate document and word representations. This prevents the creation of a meaningful embedding space where changes in word usage and documents can be directly analyzed in a temporal context. This paper proposes an expansion of the compass-aligned temporal Word2Vec methodology into dynamic topic modeling. Such a method allows for the direct comparison of word and document embeddings across time in dynamic topics. This enables the creation of visualizations that incorporate temporal word embeddings within the context of documents into topic visualizations. In experiments against the current state-of-the-art, our proposed method demonstrates overall competitive performance in topic relevancy and diversity across temporal datasets of varying size. Simultaneously, it provides insightful visualizations focused on temporal word embeddings while maintaining the insights provided by global topic evolution, advancing our understanding of how topics evolve over time.

new Self-Attention Limits Working Memory Capacity of Transformer-Based Models

Authors: Dongyu Gong, Hantao Zhang

Abstract: Recent work on Transformer-based large language models (LLMs) has revealed striking limits in their working memory capacity, similar to what has been found in human behavioral studies. Specifically, these models' performance drops significantly on N-back tasks as N increases. However, there is still a lack of mechanistic interpretability as to why this phenomenon would arise. Inspired by the executive attention theory from behavioral sciences, we hypothesize that the self-attention mechanism within Transformer-based models might be responsible for their working memory capacity limits. To test this hypothesis, we train vanilla decoder-only transformers to perform N-back tasks and find that attention scores gradually aggregate to the N-back positions over training, suggesting that the model masters the task by learning a strategy to pay attention to the relationship between the current position and the N-back position. Critically, we find that the total entropy of the attention score matrix increases as N increases, suggesting that the dispersion of attention scores might be the cause of the capacity limit observed in N-back tasks.
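
A rough sketch of the entropy measure mentioned above (our own simplification, not the authors' code): sum the Shannon entropy of each query position's attention distribution.

    import numpy as np

    def attention_entropy(attn):
        # attn: (queries, keys); each row is a probability distribution over key positions.
        eps = 1e-12
        return float(-(attn * np.log(attn + eps)).sum())

    focused = np.eye(4)                 # sharply focused attention -> low entropy
    dispersed = np.full((4, 4), 0.25)   # dispersed attention -> high entropy
    print(attention_entropy(focused), attention_entropy(dispersed))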

new Generalized Measures of Anticipation and Responsivity in Online Language Processing

Authors: Mario Giulianelli, Andreas Opedal, Ryan Cotterell

Abstract: We introduce a generalization of classic information-theoretic measures of predictive uncertainty in online language processing, based on the simulation of expected continuations of incremental linguistic contexts. Our framework provides a formal definition of anticipatory and responsive measures, and it equips experimenters with the tools to define new, more expressive measures beyond standard next-symbol entropy and surprisal. While extracting these standard quantities from language models is convenient, we demonstrate that using Monte Carlo simulation to estimate alternative responsive and anticipatory measures pays off empirically: New special cases of our generalized formula exhibit enhanced predictive power compared to surprisal for human cloze completion probability as well as ELAN, LAN, and N400 amplitudes, and greater complementarity with surprisal in predicting reading times.
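
The Monte Carlo strategy described above can be illustrated on a toy next-symbol distribution (an assumed stand-in for a language model; the paper's measures are defined over model-sampled continuations):

    import math, random

    next_symbol_probs = {"the": 0.5, "a": 0.3, "dog": 0.2}   # toy distribution
    surprisal = lambda p: -math.log2(p)

    # Exact next-symbol entropy ...
    exact = sum(p * surprisal(p) for p in next_symbol_probs.values())

    # ... and its Monte Carlo estimate from sampled continuations.
    samples = random.choices(list(next_symbol_probs),
                             weights=list(next_symbol_probs.values()), k=10000)
    mc = sum(surprisal(next_symbol_probs[s]) for s in samples) / len(samples)
    print(exact, mc)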

new Semantics Preserving Emoji Recommendation with Large Language Models

Authors: Zhongyi Qiu, Kangyi Qiu, Hanjia Lyu, Wei Xiong, Jiebo Luo

Abstract: Emojis have become an integral part of digital communication, enriching text by conveying emotions, tone, and intent. Existing emoji recommendation methods are primarily evaluated based on their ability to match the exact emoji a user chooses in the original text. However, they ignore the essence of users' behavior on social media in that each text can correspond to multiple reasonable emojis. To better assess a model's ability to align with such real-world emoji usage, we propose a new semantics preserving evaluation framework for emoji recommendation, which measures a model's ability to recommend emojis that maintain the semantic consistency with the user's text. To evaluate how well a model preserves semantics, we assess whether the predicted affective state, demographic profile, and attitudinal stance of the user remain unchanged. If these attributes are preserved, we consider the recommended emojis to have maintained the original semantics. The advanced abilities of Large Language Models (LLMs) in understanding and generating nuanced, contextually relevant output make them well-suited for handling the complexities of semantics preserving emoji recommendation. To this end, we construct a comprehensive benchmark to systematically assess the performance of six proprietary and open-source LLMs using different prompting techniques on our task. Our experiments demonstrate that GPT-4o outperforms other LLMs, achieving a semantics preservation score of 79.23%. Additionally, we conduct case studies to analyze model biases in downstream classification tasks and evaluate the diversity of the recommended emojis.

new Predicting Punctuation in Ancient Chinese Texts: A Multi-Layered LSTM and Attention-Based Approach

Authors: Tracy Cai, Kimmy Chang, Fahad Nabi

Abstract: It was not until the 20th century that the Chinese language began using punctuation. In fact, many ancient Chinese texts contain thousands of lines with no distinct punctuation marks or delimiters in sight. The lack of punctuation in such texts makes it difficult for humans to identify when there are pauses or breaks between particular phrases and to understand the semantic meaning of the written text (Mogahed, 2012). As a result, unless one was educated in the ancient time period, many readers of ancient Chinese would have significantly different interpretations of the texts. We propose an approach to predict the location (and type) of punctuation in ancient Chinese texts that extends the work of Oh et al. (2017) by leveraging a bidirectional multi-layered LSTM with a multi-head attention mechanism, as inspired by Luong et al.'s (2015) discussion of attention-based architectures. We find that the use of multi-layered LSTMs and multi-head attention significantly outperforms RNNs that don't incorporate such components when evaluating ancient Chinese texts.

new Model Tells Itself Where to Attend: Faithfulness Meets Automatic Attention Steering

Authors: Qingru Zhang, Xiaodong Yu, Chandan Singh, Xiaodong Liu, Liyuan Liu, Jianfeng Gao, Tuo Zhao, Dan Roth, Hao Cheng

Abstract: Large language models (LLMs) have demonstrated remarkable performance across various real-world tasks. However, they often struggle to fully comprehend and effectively utilize their input contexts, resulting in responses that are unfaithful or hallucinated. This difficulty increases for contexts that are long or contain distracting information, which can divert LLMs from fully capturing essential evidence. To address this issue, many works use prompting to help LLMs utilize contextual information more faithfully. For instance, iterative prompting highlights key information in two steps that first ask the LLM to identify important pieces of context and then derive answers accordingly. However, prompting methods are constrained to highlighting key information implicitly in token space, which is often insufficient to fully steer the model's attention. To improve model faithfulness more reliably, we propose AutoPASTA, a method that automatically identifies key contextual information and explicitly highlights it by steering an LLM's attention scores. Like prompting, AutoPASTA is applied at inference time and does not require changing any model parameters. Our experiments on open-book QA demonstrate that AutoPASTA effectively enables models to grasp essential contextual information, leading to substantially improved model faithfulness and performance, e.g., an average improvement of 7.95% for LLAMA3-70B-Instruct. Code will be publicly available at https://github.com/QingruZhang/AutoPASTA .
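
A generic attention-steering sketch in the spirit of the description above (not AutoPASTA's exact algorithm; see the released code for that): down-weight non-highlighted key positions and renormalize.

    import numpy as np

    def steer_attention(attn, highlight_idx, alpha=0.1):
        # attn: (queries, keys) softmaxed attention; highlight_idx: key positions to emphasize.
        scale = np.full(attn.shape[1], alpha)
        scale[highlight_idx] = 1.0
        steered = attn * scale
        return steered / steered.sum(axis=-1, keepdims=True)

    attn = np.full((2, 5), 0.2)  # uniform attention over 5 key tokens
    print(steer_attention(attn, highlight_idx=[1, 2]))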

URLs: https://github.com/QingruZhang/AutoPASTA

new ReXErr: Synthesizing Clinically Meaningful Errors in Diagnostic Radiology Reports

Authors: Vishwanatha M. Rao, Serena Zhang, Julian N. Acosta, Subathra Adithan, Pranav Rajpurkar

Abstract: Accurately interpreting medical images and writing radiology reports is a critical but challenging task in healthcare. Both human-written and AI-generated reports can contain errors, ranging from clinical inaccuracies to linguistic mistakes. To address this, we introduce ReXErr, a methodology that leverages Large Language Models to generate representative errors within chest X-ray reports. Working with board-certified radiologists, we developed error categories that capture common mistakes in both human and AI-generated reports. Our approach uses a novel sampling scheme to inject diverse errors while maintaining clinical plausibility. ReXErr demonstrates consistency across error categories and produces errors that closely mimic those found in real-world scenarios. This method has the potential to aid in the development and evaluation of report correction algorithms, potentially enhancing the quality and reliability of radiology reporting.

new BAD: Bidirectional Auto-regressive Diffusion for Text-to-Motion Generation

Authors: S. Rohollah Hosseyni, Ali Ahmad Rahmani, S. Jamal Seyedmohammadi, Sanaz Seyedin, Arash Mohammadi

Abstract: Autoregressive models excel in modeling sequential dependencies by enforcing causal constraints, yet they struggle to capture complex bidirectional patterns due to their unidirectional nature. In contrast, mask-based models leverage bidirectional context, enabling richer dependency modeling. However, they often assume token independence during prediction, which undermines the modeling of sequential dependencies. Additionally, the corruption of sequences through masking or absorption can introduce unnatural distortions, complicating the learning process. To address these issues, we propose Bidirectional Autoregressive Diffusion (BAD), a novel approach that unifies the strengths of autoregressive and mask-based generative models. BAD utilizes a permutation-based corruption technique that preserves the natural sequence structure while enforcing causal dependencies through randomized ordering, enabling the effective capture of both sequential and bidirectional relationships. Comprehensive experiments show that BAD outperforms autoregressive and mask-based models in text-to-motion generation, suggesting a novel pre-training strategy for sequence modeling. The codebase for BAD is available on https://github.com/RohollahHS/BAD.

URLs: https://github.com/RohollahHS/BAD.

new Adaptive Large Language Models By Layerwise Attention Shortcuts

Authors: Prateek Verma, Mert Pilanci

Abstract: Transformer architectures are the backbone of the modern AI revolution. However, they are based on simply stacking the same blocks in dozens of layers and processing information sequentially from one block to another. In this paper, we propose to challenge this and introduce adaptive computations for LLM-like setups, which allow the final layer to attend to all of the intermediate layers as it deems fit through the attention mechanism, thereby introducing computational attention shortcuts. These shortcuts can thus make the architecture depth- and context-adaptive. We showcase our approach on four different datasets spanning acoustic tokens, natural language, and symbolic music, and we achieve superior performance for a GPT-like architecture. We give evidence via attention maps that the models learn complex dependencies across layers that are adaptive in context and depth depending on the input tokens.
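
One way to picture the shortcut described above is a small attention step from the final representation over the stack of intermediate-layer outputs; shapes and names here are illustrative, not the authors' architecture:

    import torch
    import torch.nn.functional as F

    def layerwise_shortcut(final_hidden, intermediate_stack):
        # final_hidden: (batch, dim); intermediate_stack: (batch, n_layers, dim)
        scores = torch.einsum("bd,bld->bl", final_hidden, intermediate_stack)
        weights = F.softmax(scores / final_hidden.shape[-1] ** 0.5, dim=-1)
        return torch.einsum("bl,bld->bd", weights, intermediate_stack)

    out = layerwise_shortcut(torch.randn(2, 8), torch.randn(2, 6, 8))
    print(out.shape)  # torch.Size([2, 8])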

new American Sign Language to Text Translation using Transformer and Seq2Seq with LSTM

Authors: Gregorius Guntur Sunardi Putra, Adifa Widyadhani Chanda D'Layla, Dimas Wahono, Riyanarto Sarno, Agus Tri Haryono

Abstract: Sign language translation is one of the important issues in communication between deaf and hearing people, as it expresses words through hand, body, and mouth movements. American Sign Language is one of the sign languages in use, and includes alphabetic (fingerspelled) signs. The development of neural machine translation technology is moving towards sign language translation. The Transformer became the state of the art in natural language processing. This study compares the Transformer with the Sequence-to-Sequence (Seq2Seq) model in translating sign language to text. In addition, an experiment was conducted by adding Residual Long Short-Term Memory (ResidualLSTM) to the Transformer. The addition of ResidualLSTM to the Transformer reduces the performance of the Transformer model by 23.37% based on the BLEU score. In comparison, the Transformer itself increases the BLEU score by 28.14 compared to the Seq2Seq model.

new CREAM: Comparison-Based Reference-Free ELO-Ranked Automatic Evaluation for Meeting Summarization

Authors: Ziwei Gong, Lin Ai, Harshsaiprasad Deshpande, Alexander Johnson, Emmy Phung, Zehui Wu, Ahmad Emami, Julia Hirschberg

Abstract: Large Language Models (LLMs) have spurred interest in automatic evaluation methods for summarization, offering a faster, more cost-effective alternative to human evaluation. However, existing methods often fall short when applied to complex tasks like long-context summarizations and dialogue-based meeting summarizations. In this paper, we introduce CREAM (Comparison-Based Reference-Free Elo-Ranked Automatic Evaluation for Meeting Summarization), a novel framework that addresses the unique challenges of evaluating meeting summaries. CREAM leverages a combination of chain-of-thought reasoning and key facts alignment to assess conciseness and completeness of model-generated summaries without requiring reference. By employing an ELO ranking system, our approach provides a robust mechanism for comparing the quality of different models or prompt configurations.
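
The Elo-style ranking at the core of the framework can be sketched with the standard update rule for a single pairwise comparison (illustrative only; not CREAM's released implementation):

    def elo_update(r_a, r_b, score_a, k=32):
        # score_a: 1 if summary A is judged better, 0 if worse, 0.5 for a tie.
        expected_a = 1 / (1 + 10 ** ((r_b - r_a) / 400))
        r_a_new = r_a + k * (score_a - expected_a)
        r_b_new = r_b - k * (score_a - expected_a)
        return r_a_new, r_b_new

    print(elo_update(1000, 1000, score_a=1))  # (1016.0, 984.0)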

new Attention-Seeker: Dynamic Self-Attention Scoring for Unsupervised Keyphrase Extraction

Authors: Erwin D. López Z., Cheng Tang, Atsushi Shimada

Abstract: This paper proposes Attention-Seeker, an unsupervised keyphrase extraction method that leverages self-attention maps from a Large Language Model to estimate the importance of candidate phrases. Our approach identifies specific components - such as layers, heads, and attention vectors - where the model pays significant attention to the key topics of the text. The attention weights provided by these components are then used to score the candidate phrases. Unlike previous models that require manual tuning of parameters (e.g., selection of heads, prompts, hyperparameters), Attention-Seeker dynamically adapts to the input text without any manual adjustments, enhancing its practical applicability. We evaluate Attention-Seeker on four publicly available datasets: Inspec, SemEval2010, SemEval2017, and Krapivin. Our results demonstrate that, even without parameter tuning, Attention-Seeker outperforms most baseline models, achieving state-of-the-art performance on three out of four datasets, particularly excelling in extracting keyphrases from long documents.
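
A toy version of attention-based phrase scoring (our own simplification; Attention-Seeker additionally selects the relevant layers, heads, and attention vectors):

    import numpy as np

    def phrase_scores(attn, phrase_spans):
        # attn: (queries, keys) attention map; phrase_spans: (start, end) key-index ranges.
        received = attn.mean(axis=0)                      # attention received per token
        return [float(received[s:e].mean()) for s, e in phrase_spans]

    attn = np.random.dirichlet(np.ones(10), size=10)      # stand-in 10x10 attention map
    print(phrase_scores(attn, [(0, 2), (4, 7)]))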

new Propulsion: Steering LLM with Tiny Fine-Tuning

Authors: Md Kowsher, Nusrat Jahan Prottasha, Prakash Bhat

Abstract: The rapid advancements in Large Language Models (LLMs) have revolutionized natural language processing (NLP) and related fields. However, fine-tuning these models for specific tasks remains computationally expensive and risks degrading pre-learned features. To address these challenges, we propose Propulsion, a novel parameter efficient fine-tuning (PEFT) method designed to optimize task-specific performance while drastically reducing computational overhead. Inspired by the concept of controlled adjustments in physical motion, Propulsion selectively re-scales specific dimensions of a pre-trained model, guiding output predictions toward task objectives without modifying the model's parameters. By introducing lightweight, trainable Propulsion parameters at the pre-trained layer, we minimize the number of parameters updated during fine-tuning, preventing overfitting or overwriting of existing knowledge. Our theoretical analysis, supported by Neural Tangent Kernel (NTK) theory, shows that Propulsion approximates the performance of full fine-tuning with far fewer trainable parameters. Empirically, Propulsion reduces the parameter count from 355.3 million to just 0.086 million, achieving over a 10x reduction compared to standard approaches like LoRA while maintaining competitive performance across benchmarks.
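
The re-scaling idea can be sketched as a frozen pre-trained projection combined with a single trainable per-dimension scale vector (a hypothetical module, not the authors' implementation):

    import torch
    import torch.nn as nn

    class PropulsionLike(nn.Module):
        def __init__(self, frozen_linear: nn.Linear):
            super().__init__()
            self.frozen = frozen_linear
            for p in self.frozen.parameters():
                p.requires_grad = False               # pre-trained weights stay fixed
            self.scale = nn.Parameter(torch.ones(frozen_linear.out_features))

        def forward(self, x):
            return self.frozen(x) * self.scale        # only `scale` is trained

    layer = PropulsionLike(nn.Linear(16, 16))
    print(sum(p.numel() for p in layer.parameters() if p.requires_grad))  # 16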

new Investigating Context-Faithfulness in Large Language Models: The Roles of Memory Strength and Evidence Style

Authors: Yuepei Li, Kang Zhou, Qiao Qiao, Bach Nguyen, Qing Wang, Qi Li

Abstract: Retrieval-augmented generation (RAG) improves Large Language Models (LLMs) by incorporating external information into the response generation process. However, how context-faithful LLMs are and what factors influence LLMs' context-faithfulness remain largely unexplored. In this study, we investigate the impact of memory strength and evidence presentation on LLMs' receptiveness to external evidence. We introduce a method to quantify the memory strength of LLMs by measuring the divergence in LLMs' responses to different paraphrases of the same question, an aspect not considered by previous works. We also generate evidence in various styles to evaluate the effects of evidence presentation. Two datasets are used for evaluation: Natural Questions (NQ) with popular questions and PopQA featuring long-tail questions. Our results show that for questions with high memory strength, LLMs are more likely to rely on internal memory, particularly for larger LLMs such as GPT-4. On the other hand, presenting paraphrased evidence significantly increases LLMs' receptiveness compared to simple repetition or adding details.
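
A crude proxy for the paraphrase-divergence idea (our own simplification, not the paper's formula): check how often the majority answer changes across paraphrases of the same question.

    from collections import Counter

    def paraphrase_disagreement(answer_sets):
        # answer_sets: one list of sampled answers per paraphrase of the question.
        majorities = [Counter(a).most_common(1)[0][0] for a in answer_sets]
        overall = Counter(majorities).most_common(1)[0][0]
        return sum(m != overall for m in majorities) / len(majorities)

    # Three paraphrases, each answered three times; stable answers suggest strong memory.
    print(paraphrase_disagreement([["Paris"] * 3, ["Paris"] * 3, ["Lyon", "Paris", "Paris"]]))  # 0.0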

new Cross-lingual transfer of multilingual models on low resource African Languages

Authors: Harish Thangaraj, Ananya Chenat, Jaskaran Singh Walia, Vukosi Marivate

Abstract: Large multilingual models have significantly advanced natural language processing (NLP) research. However, their high resource demands and potential biases from diverse data sources have raised concerns about their effectiveness across low-resource languages. In contrast, monolingual models, trained on a single language, may better capture the nuances of the target language, potentially providing more accurate results. This study benchmarks the cross-lingual transfer capabilities from a high-resource language to a low-resource language for both monolingual and multilingual models, focusing on Kinyarwanda and Kirundi, two Bantu languages. We evaluate the performance of transformer-based architectures like Multilingual BERT (mBERT), AfriBERT, and BantuBERTa against neural-based architectures such as BiGRU, CNN, and char-CNN. The models were trained on Kinyarwanda and tested on Kirundi, with fine-tuning applied to assess the extent of performance improvement and catastrophic forgetting. AfriBERT achieved the highest cross-lingual accuracy of 88.3% after fine-tuning, while BiGRU emerged as the best-performing neural model with 83.3% accuracy. We also analyze the degree of forgetting in the original language post-fine-tuning. While monolingual models remain competitive, this study highlights that multilingual models offer strong cross-lingual transfer capabilities in resource-limited settings.

new GOSt-MT: A Knowledge Graph for Occupation-related Gender Biases in Machine Translation

Authors: Orfeas Menis Mastromichalakis, Giorgos Filandrianos, Eva Tsouparopoulou, Dimitris Parsanoglou, Maria Symeonaki, Giorgos Stamou

Abstract: Gender bias in machine translation (MT) systems poses significant challenges that often result in the reinforcement of harmful stereotypes. Especially in the labour domain, where occupations are frequently and inaccurately associated with specific genders, such biases perpetuate traditional gender stereotypes with a significant impact on society. Addressing these issues is crucial for ensuring equitable and accurate MT systems. This paper introduces a novel approach to studying occupation-related gender bias through the creation of the GOSt-MT (Gender and Occupation Statistics for Machine Translation) Knowledge Graph. GOSt-MT integrates comprehensive gender statistics from real-world labour data and textual corpora used in MT training. This Knowledge Graph allows for a detailed analysis of gender bias across English, French, and Greek, facilitating the identification of persistent stereotypes and areas requiring intervention. By providing a structured framework for understanding how occupations are gendered in both labour markets and MT systems, GOSt-MT contributes to efforts aimed at making MT systems more equitable and reducing gender biases in automated translations.

new Less is More: A Simple yet Effective Token Reduction Method for Efficient Multi-modal LLMs

Authors: Dingjie Song, Wenjun Wang, Shunian Chen, Xidong Wang, Michael Guan, Benyou Wang

Abstract: The rapid advancement of Multimodal Large Language Models (MLLMs) has led to remarkable performances across various domains. However, this progress is accompanied by a substantial surge in the resource consumption of these models. We address this pressing issue by introducing a new approach, Token Reduction using CLIP Metric (TRIM), aimed at improving the efficiency of MLLMs without sacrificing their performance. Inspired by human attention patterns in Visual Question Answering (VQA) tasks, TRIM presents a fresh perspective on the selection and reduction of image tokens. The TRIM method has been extensively tested across 12 datasets, and the results demonstrate a significant reduction in computational overhead while maintaining a consistent level of performance. This research marks a critical stride in efficient MLLM development, promoting greater accessibility and sustainability of high-performing models.

new Contextual Breach: Assessing the Robustness of Transformer-based QA Models

Authors: Asir Saadat, Nahian Ibn Asad, Md Farhan Ishmam

Abstract: Contextual question-answering models are susceptible to adversarial perturbations to input context, commonly observed in real-world scenarios. These adversarial noises are designed to degrade the performance of the model by distorting the textual input. We introduce a unique dataset that incorporates seven distinct types of adversarial noise into the context, each applied at five different intensity levels on the SQuAD dataset. To quantify the robustness, we utilize robustness metrics providing a standardized measure for assessing model performance across varying noise types and levels. Experiments on transformer-based question-answering models reveal robustness vulnerabilities and important insights into the model's performance in realistic textual input.

new Enhancing Low-Resource Language and Instruction Following Capabilities of Audio Language Models

Authors: Potsawee Manakul, Guangzhi Sun, Warit Sirichotedumrong, Kasima Tharnpipitchai, Kunat Pipatanakul

Abstract: Audio language models can understand audio inputs and perform a range of audio-related tasks based on instructions, such as speech recognition and audio captioning, where the instructions are usually textual prompts. Audio language models are mostly initialized from pre-trained audio encoders and large language models (LLMs). Although these pre-trained components were developed to support multiple languages, audio language models are trained predominantly on English data, which may limit their usability to only English instructions or English speech inputs. First, this paper examines the performance of existing audio language models in an underserved language, using Thai as an example. This paper demonstrates that, despite being built on multilingual backbones, audio language models do not exhibit cross-lingual emergent abilities to low-resource languages. Second, this paper studies data mixture for developing audio language models that are optimized for a target language as well as English. In addition, this paper integrates audio comprehension and speech instruction-following capabilities into a single unified model. Our experiments provide insights into data mixture for enhancing instruction-following capabilities in both a low-resource language and English. Our model, Typhoon-Audio, outperforms existing open-source audio language models by a considerable margin, and it is comparable to state-of-the-art Gemini-1.5-Pro in both English and Thai languages.

new CAST: Cross-modal Alignment Similarity Test for Vision Language Models

Authors: Gautier Dagan, Olga Loginova, Anil Batra

Abstract: Vision Language Models (VLMs) are typically evaluated with Visual Question Answering (VQA) tasks which assess a model's understanding of scenes. Good VQA performance is taken as evidence that the model will perform well on a broader range of tasks that require both visual and language inputs. However, scene-aware VQA does not fully capture input biases or assess hallucinations caused by a misalignment between modalities. To address this, we propose a Cross-modal Alignment Similarity Test (CAST) to probe VLMs for self-consistency across modalities. This test involves asking the models to identify similarities between two scenes through text-only, image-only, or both and then assess the truthfulness of the similarities they generate. Since there is no ground-truth to compare against, this evaluation does not focus on objective accuracy but rather on whether VLMs are internally consistent in their outputs. We argue that while not all self-consistent models are capable or accurate, all capable VLMs must be self-consistent.

new GEIC: Universal and Multilingual Named Entity Recognition with Large Language Models

Authors: Hanjun Luo, Yibing Jin, Xuecheng Liu, Tong Shang, Ruizhe Chen, Zuozhu Liu

Abstract: Large Language Models (LLMs) have supplanted traditional methods in numerous natural language processing tasks. Nonetheless, in Named Entity Recognition (NER), existing LLM-based methods underperform compared to baselines and require significantly more computational resources, limiting their application. In this paper, we introduce the task of generation-based extraction and in-context classification (GEIC), designed to leverage LLMs' prior knowledge and self-attention mechanisms for NER tasks. We then propose CascadeNER, a universal and multilingual GEIC framework for few-shot and zero-shot NER. CascadeNER employs model cascading to utilize two small-parameter LLMs to extract and classify independently, reducing resource consumption while enhancing accuracy. We also introduce AnythingNER, the first NER dataset specifically designed for LLMs, including 8 languages, 155 entity types and a novel dynamic categorization system. Experiments show that CascadeNER achieves state-of-the-art performance on low-resource and fine-grained scenarios, including CrossNER and FewNERD. Our work is openly accessible.

new Hierarchical Narrative Analysis: Unraveling Perceptions of Generative AI

Authors: Riona Matsuoka, Hiroki Matsumoto, Takahiro Yoshida, Tomohiro Watanabe, Ryoma Kondo, Ryohei Hisano

Abstract: Written texts reflect an author's perspective, making the thorough analysis of literature a key research method in fields such as the humanities and social sciences. However, conventional text mining techniques like sentiment analysis and topic modeling are limited in their ability to capture the hierarchical narrative structures that reveal deeper argumentative patterns. To address this gap, we propose a method that leverages large language models (LLMs) to extract and organize these structures into a hierarchical framework. We validate this approach by analyzing public opinions on generative AI collected by Japan's Agency for Cultural Affairs, comparing the narratives of supporters and critics. Our analysis provides clearer visualization of the factors influencing divergent opinions on generative AI, offering deeper insights into the structures of agreement and disagreement.

new Towards No-Code Programming of Cobots: Experiments with Code Synthesis by Large Code Models for Conversational Programming

Authors: Kranti Chalamalasetti, Sherzod Hakimov, David Schlangen

Abstract: While there has been a lot of research recently on robots in household environments, at the present time, most robots in existence can be found on shop floors, and most interactions between humans and robots happen there. ``Collaborative robots'' (cobots) designed to work alongside humans on assembly lines traditionally require expert programming, limiting ability to make changes, or manual guidance, limiting expressivity of the resulting programs. To address these limitations, we explore using Large Language Models (LLMs), and in particular, their abilities of doing in-context learning, for conversational code generation. As a first step, we define RATS, the ``Repetitive Assembly Task'', a 2D building task designed to lay the foundation for simulating industry assembly scenarios. In this task, a `programmer' instructs a cobot, using natural language, on how a certain assembly is to be built; that is, the programmer induces a program, through natural language. We create a dataset that pairs target structures with various example instructions (human-authored, template-based, and model-generated) and example code. With this, we systematically evaluate the capabilities of state-of-the-art LLMs for synthesising this kind of code, given in-context examples. Evaluating in a simulated environment, we find that LLMs are capable of generating accurate `first order code' (instruction sequences), but have problems producing `higher-order code' (abstractions such as functions, or use of loops).

new A Comprehensive Evaluation of Quantized Instruction-Tuned Large Language Models: An Experimental Analysis up to 405B

Authors: Jemin Lee, Sihyeong Park, Jinse Kwon, Jihun Oh, Yongin Kwon

Abstract: Prior research works have evaluated quantized LLMs using limited metrics such as perplexity or a few basic knowledge tasks and old datasets. Additionally, recent large-scale models such as Llama 3.1 with up to 405B parameters have not been thoroughly examined. This paper evaluates the performance of instruction-tuned LLMs across various quantization methods (GPTQ, AWQ, SmoothQuant, and FP8) on models ranging from 7B to 405B. Using 13 benchmarks, we assess performance across six task types: commonsense Q&A, knowledge and language understanding, instruction following, hallucination detection, mathematics, and dialogue. Our key findings reveal that (1) quantizing a larger LLM to a similar size as a smaller FP16 LLM generally performs better across most benchmarks, except for hallucination detection and instruction following; (2) performance varies significantly with different quantization methods, model size, and bit-width, with weight-only methods often yielding better results in larger models; (3) task difficulty does not significantly impact accuracy degradation due to quantization; and (4) the MT-Bench evaluation method has limited discriminatory power among recent high-performing LLMs.

new Large Language Models are Good Multi-lingual Learners : When LLMs Meet Cross-lingual Prompts

Authors: Teng Wang, Zhenqi He, Wing-Yin Yu, Xiaojin Fu, Xiongwei Han

Abstract: With the advent of Large Language Models (LLMs), generating rule-based data for real-world applications has become more accessible. Due to the inherent ambiguity of natural language and the complexity of rule sets, especially in long contexts, LLMs often struggle to follow all specified rules, frequently omitting at least one. To enhance the reasoning and understanding of LLMs on long and complex contexts, we propose a novel prompting strategy, Multi-Lingual Prompt (MLPrompt), which automatically translates the error-prone rule that an LLM struggles to follow into another language, thus drawing greater attention to it. Experimental results on public datasets across various tasks have shown that MLPrompt can outperform state-of-the-art prompting methods such as Chain of Thought, Tree of Thought, and Self-Consistency. Additionally, we introduce a framework integrating MLPrompt with an auto-checking mechanism for structured data generation, with a specific case study in text-to-MIP instances. Further, we extend the proposed framework to text-to-SQL to demonstrate its generation ability towards structured data synthesis.

new KVPruner: Structural Pruning for Faster and Memory-Efficient Large Language Models

Authors: Bo Lv, Quan Zhou, Xuanang Ding, Yan Wang, Zeming Ma

Abstract: The bottleneck associated with the key-value (KV) cache presents a significant challenge during the inference processes of large language models. While depth pruning accelerates inference, it requires extensive recovery training, which can take up to two weeks. On the other hand, width pruning retains much of the performance but offers only slight speed gains. To tackle these challenges, we propose KVPruner to improve model efficiency while maintaining performance. Our method uses global perplexity-based analysis to determine the importance ratio for each block and provides multiple strategies to prune non-essential KV channels within blocks. Compared to the original model, KVPruner reduces runtime memory usage by 50% and boosts throughput by over 35%. Additionally, our method requires only two hours of LoRA fine-tuning on small datasets to recover most of the performance.

new RoMath: A Mathematical Reasoning Benchmark in Romanian

Authors: Adrian Cosma, Ana-Maria Bucur, Emilian Radoi

Abstract: Mathematics has long been conveyed through natural language, primarily for human understanding. With the rise of mechanized mathematics and proof assistants, there is a growing need to understand informal mathematical text, yet most existing benchmarks focus solely on English, overlooking other languages. This paper introduces RoMath, a Romanian mathematical reasoning benchmark suite comprising three datasets: RoMath-Baccalaureate, RoMath-Competitions and RoMath-Synthetic, which cover a range of mathematical domains and difficulty levels, aiming to improve non-English language models and promote multilingual AI development. By focusing on Romanian, a low-resource language with unique linguistic features, RoMath addresses the limitations of Anglo-centric models and emphasizes the need for dedicated resources beyond simple automatic translation. We benchmark several open-weight language models, highlighting the importance of creating resources for underrepresented languages. We make the code and dataset available.

new Strategic Insights in Human and Large Language Model Tactics at Word Guessing Games

Authors: Matīss Rikters, Sanita Reinsone

Abstract: At the beginning of 2022, a simplistic word-guessing game took the world by storm and was further adapted to many languages beyond the original English version. In this paper, we examine the strategies of daily word-guessing game players that have evolved during a period of over two years. A survey gathered from 25% of frequent players reveals their strategies and motivations for continuing the daily journey. We also explore the capability of several popular open-access large language model systems and open-source models at comprehending and playing the game in two different languages. Results highlight the struggles of certain models to maintain correct guess length and generate repetitions, as well as hallucinations of non-existent words and inflections.

new Diversity-grounded Channel Prototypical Learning for Out-of-Distribution Intent Detection

Authors: Bo Liu, Liming Zhan, Yujie Feng, Zexin Lu, Chengqiang Xie, Lei Xue, Xiao-Ming Wu, Albert Y. S. Lam

Abstract: In the realm of task-oriented dialogue systems, a robust intent detection mechanism must effectively handle malformed utterances encountered in real-world scenarios. This study presents a novel fine-tuning framework for large language models (LLMs) aimed at enhancing in-distribution (ID) intent classification and out-of-distribution (OOD) intent detection, which utilizes semantic matching with prototypes derived from ID class names. By harnessing the highly distinguishable representations of LLMs, we construct semantic prototypes for each ID class using a diversity-grounded prompt tuning approach. We rigorously test our framework in a challenging OOD context, where ID and OOD classes are semantically close yet distinct, referred to as near-OOD detection. For a thorough assessment, we benchmark our method against the prevalent fine-tuning approaches. The experimental findings reveal that our method demonstrates superior performance in both few-shot ID intent classification and near-OOD intent detection tasks.

new Semformer: Transformer Language Models with Semantic Planning

Authors: Yongjing Yin, Junran Ding, Kai Song, Yue Zhang

Abstract: Next-token prediction serves as the dominant component in current neural language models. During the training phase, the model employs teacher forcing, which predicts tokens based on all preceding ground truth tokens. However, this approach has been found to create shortcuts, utilizing the revealed prefix to spuriously fit future tokens, potentially compromising the accuracy of the next-token predictor. In this paper, we introduce Semformer, a novel method of training a Transformer language model that explicitly models the semantic planning of response. Specifically, we incorporate a sequence of planning tokens into the prefix, guiding the planning token representations to predict the latent semantic representations of the response, which are induced by an autoencoder. In a minimal planning task (i.e., graph path-finding), our model exhibits near-perfect performance and effectively mitigates shortcut learning, a feat that standard training methods and baseline models have been unable to accomplish. Furthermore, we pretrain Semformer from scratch with 125M parameters, demonstrating its efficacy through measures of perplexity, in-context learning, and fine-tuning on summarization tasks.

new Reasoning Graph Enhanced Exemplars Retrieval for In-Context Learning

Authors: Yukang Lin, Bingchen Zhong, Shuoran Jiang, Joanna Siebert, Qingcai Chen

Abstract: Large language models (LLMs) have exhibited remarkable few-shot learning capabilities and unified the paradigm of NLP tasks through the in-context learning (ICL) technique. Despite the success of ICL, the quality of the exemplar demonstrations can significantly influence the LLM's performance. Existing exemplar selection methods mainly focus on the semantic similarity between queries and candidate exemplars. On the other hand, the logical connections between reasoning steps can be beneficial to depict the problem-solving process as well. In this paper, we propose a novel method named Reasoning Graph-enhanced Exemplar Retrieval (RGER). RGER first queries the LLM to generate an initial response and then expresses the intermediate problem-solving steps as a graph structure. After that, it employs a graph kernel to select exemplars with semantic and structural similarity. Extensive experiments demonstrate that the structural relationship is helpful for aligning queries and candidate exemplars. The efficacy of RGER on math and logic reasoning tasks showcases its superiority over state-of-the-art retrieval-based approaches. Our code is released at https://github.com/Yukang-Lin/RGER.

URLs: https://github.com/Yukang-Lin/RGER.

new Improving the Efficiency of Visually Augmented Language Models

Authors: Paula Ontalvilla, Aitor Ormazabal, Gorka Azkune

Abstract: Despite the impressive performance of autoregressive Language Models (LMs), it has been shown that, due to reporting bias, LMs lack visual knowledge, i.e. they do not know much about the visual world and its properties. To augment LMs with visual knowledge, existing solutions often rely on explicit images, requiring time-consuming retrieval or image generation systems. This paper shows that explicit images are not necessary to visually augment an LM. Instead, we use visually-grounded text representations obtained from the well-known CLIP multimodal system. For a fair comparison, we modify VALM, a visually-augmented LM which uses image retrieval and representation, to work directly with visually-grounded text representations. We name this new model BLIND-VALM. We show that BLIND-VALM performs on par with VALM for Visual Language Understanding (VLU), Natural Language Understanding (NLU) and Language Modeling tasks, despite being significantly more efficient and simpler. We also show that when scaling up our model within the compute budget of VALM, either by increasing the model size or the pre-training corpus size, we outperform VALM on all evaluation tasks.

new SAGED: A Holistic Bias-Benchmarking Pipeline for Language Models with Customisable Fairness Calibration

Authors: Xin Guan, Nathaniel Demchak, Saloni Gupta, Ze Wang, Ediz Ertekin Jr., Adriano Koshiyama, Emre Kazim, Zekun Wu

Abstract: The development of unbiased large language models is widely recognized as crucial, yet existing benchmarks fall short in detecting biases due to limited scope, contamination, and lack of a fairness baseline. SAGED(-Bias) is the first holistic benchmarking pipeline to address these problems. The pipeline encompasses five core stages: scraping materials, assembling benchmarks, generating responses, extracting numeric features, and diagnosing with disparity metrics. SAGED includes metrics for max disparity, such as impact ratio, and bias concentration, such as Max Z-scores. Noticing that assessment tool bias and contextual bias in prompts can distort evaluation, SAGED implements counterfactual branching and baseline calibration for mitigation. For demonstration, we use SAGED on G20 countries with popular 8b-level models including Gemma2, Llama3.1, Mistral, and Qwen2. With sentiment analysis, we find that while Mistral and Qwen2 show lower max disparity and higher bias concentration than Gemma2 and Llama3.1, all models are notably biased against countries like Russia and (except for Qwen2) China. In further experiments where models role-play U.S. (vice-/former-) presidents, we see that bias amplifies and shifts in heterogeneous directions. Moreover, we see that Qwen2 and Mistral do not engage in role-playing, while Llama3.1 and Gemma2 role-play Trump notably more intensively than Biden and Harris, indicating role-playing performance bias in these models.

new Self-Evolutionary Large Language Models through Uncertainty-Enhanced Preference Optimization

Authors: Jianing Wang, Yang Zhou, Xiaocheng Zhang, Mengjiao Bao, Peng Yan

Abstract: Iterative preference optimization has recently become one of the de-facto training paradigms for large language models (LLMs), but the performance is still underwhelming due to the large amount of noisy preference data yielded in the loop. To combat this issue, we present an Uncertainty-enhanced Preference Optimization (UPO) framework to make the LLM self-evolve with reliable feedback. The key idea is mitigating the noisy preference data derived from the current policy and reward models by performing pair-wise uncertainty estimation and judiciously reliable feedback sampling. To reach this goal, we introduce an estimator model, which incorporates Monte Carlo (MC) dropout in a Bayesian neural network (BNN) to perform uncertainty estimation for the preference data derived from the LLM policy. Compared to the existing methods that directly filter generated responses based on the reward score, the estimator focuses on the model uncertainty in a pair-wise manner and effectively bypasses the confirmation bias problem of the reward model. Additionally, we propose an uncertainty-enhanced self-evolution algorithm to improve the robustness of preference optimization and encourage the LLM to generate responses with both high reward and certainty. Extensive experiments over multiple benchmarks demonstrate that our framework substantially alleviates the noise problem and improves the performance of iterative preference optimization.
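
The pair-wise MC-dropout estimator can be sketched as follows (illustrative shapes and features; the paper's estimator model is more elaborate): run several stochastic forward passes and read off the mean and variance of the preference probability.

    import torch
    import torch.nn as nn

    class PairUncertaintyEstimator(nn.Module):
        def __init__(self, dim):
            super().__init__()
            self.net = nn.Sequential(nn.Linear(dim, 32), nn.ReLU(),
                                     nn.Dropout(0.2), nn.Linear(32, 1))

        def forward(self, pair_features, n_samples=20):
            self.train()  # keep dropout active so each pass is stochastic
            probs = torch.stack([torch.sigmoid(self.net(pair_features))
                                 for _ in range(n_samples)])
            return probs.mean(0), probs.var(0)  # preference estimate and its uncertainty

    mean, var = PairUncertaintyEstimator(dim=16)(torch.randn(4, 16))
    print(mean.shape, var.shape)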

new Exploring ChatGPT-based Augmentation Strategies for Contrastive Aspect-based Sentiment Analysis

Authors: Lingling Xu, Haoran Xie, S. Joe Qin, Fu Lee Wang, Xiaohui Tao

Abstract: Aspect-based sentiment analysis (ABSA) involves identifying sentiment towards specific aspect terms in a sentence and allows us to uncover nuanced perspectives and attitudes on particular aspects of a product, service, or topic. However, the scarcity of labeled data poses a significant challenge to training high-quality models. To address this issue, we explore the potential of data augmentation using ChatGPT, a well-performing large language model (LLM), to enhance the sentiment classification performance towards aspect terms. Specifically, we explore three data augmentation strategies based on ChatGPT: context-focused, aspect-focused, and context-aspect data augmentation techniques. Context-focused data augmentation focuses on changing the word expression of context words in the sentence while keeping aspect terms unchanged. In contrast, aspect-focused data augmentation aims to change aspect terms but keep context words unchanged. Context-Aspect data augmentation integrates the above two data augmentations to generate augmented samples. Furthermore, we incorporate contrastive learning into the ABSA tasks to improve performance. Extensive experiments show that all three data augmentation techniques lead to performance improvements, with the context-aspect data augmentation strategy performing best and surpassing the performance of the baseline models.

new Fast Analysis of the OpenAI O1-Preview Model in Solving Random K-SAT Problem: Does the LLM Solve the Problem Itself or Call an External SAT Solver?

Authors: Raffaele Marino

Abstract: In this manuscript I present an analysis of the performance of the OpenAI O1-preview model in solving random K-SAT instances for $K \in \{2,3,4\}$ as a function of $\alpha = M/N$, where $M$ is the number of clauses and $N$ is the number of variables of the satisfiability problem. I show that the model can call an external SAT solver to solve the instances, rather than solving them directly. Despite using external solvers, the model reports incorrect assignments as output. Moreover, I propose and present an analysis to quantify whether the OpenAI O1-preview model demonstrates a spark of intelligence or merely makes random guesses when outputting an assignment for a Boolean satisfiability problem.
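
For reference, the random K-SAT ensemble analyzed here can be generated with a few lines (standard construction; clause density $\alpha = M/N$):

    import random

    def random_ksat(n_vars, n_clauses, k):
        # Each clause: k distinct variables, each negated with probability 1/2.
        return [[v if random.random() < 0.5 else -v
                 for v in random.sample(range(1, n_vars + 1), k)]
                for _ in range(n_clauses)]

    N, K, alpha = 50, 3, 3.0
    M = int(alpha * N)
    print(random_ksat(N, M, K)[:2])  # first two clauses as signed DIMACS-style literals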

new Evaluating the Impact of Compression Techniques on Task-Specific Performance of Large Language Models

Authors: Bishwash Khanal, Jeffery M. Capone

Abstract: Large language models (LLMs) offer powerful capabilities but incur substantial computational costs, driving the need for efficient compression techniques. This study evaluates the impact of popular compression methods - Magnitude Pruning, SparseGPT, and Wanda - on the LLaMA-2-7B model, focusing on the trade-offs between model size reduction, downstream task performance, and the role of calibration data. Our findings reveal that while SparseGPT and Wanda preserve perplexity even at 50% sparsity, they suffer significant degradation on downstream tasks, highlighting the inadequacy of perplexity as the sole evaluation metric. To address this, we introduce Jensen-Shannon (JS) Divergence as a more comprehensive metric that captures nuanced changes in model behavior post-compression. We further demonstrate that task-specific calibration data significantly enhances the downstream performance of compressed models compared to general calibration data. This research underscores the necessity for diverse evaluation metrics and careful calibration data selection to fully understand the complexities of LLM compression and its implications for practical applications.
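
For clarity, the Jensen-Shannon divergence named above can be computed as follows (generic definition in nats; the paper's exact usage over model outputs may differ):

    import numpy as np

    def js_divergence(p, q, eps=1e-12):
        p, q = np.asarray(p, float), np.asarray(q, float)
        m = 0.5 * (p + q)
        kl = lambda a, b: float(np.sum(a * np.log((a + eps) / (b + eps))))
        return 0.5 * kl(p, m) + 0.5 * kl(q, m)

    # e.g., next-token distributions of the original vs. a compressed model.
    print(js_divergence([0.4, 0.3, 0.2, 0.1], [0.25, 0.25, 0.25, 0.25]))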

new LLM-as-a-Judge & Reward Model: What They Can and Cannot Do

Authors: Guijin Son, Hyunwoo Ko, Hoyoung Lee, Yewon Kim, Seunghyeok Hong

Abstract: LLM-as-a-Judge and reward models are widely used alternatives to multiple-choice questions or human annotators for large language model (LLM) evaluation. Their efficacy shines in evaluating long-form responses, serving a critical role as evaluators of leaderboards and as proxies to align LLMs via reinforcement learning. However, despite their popularity, their effectiveness outside of English remains largely unexplored. In this paper, we conduct a comprehensive analysis of automated evaluators, reporting key findings on their behavior in a non-English environment. First, we discover that English evaluation capabilities significantly influence language-specific capabilities, often more than the language proficiency itself, enabling evaluators trained in English to easily transfer their skills to other languages. Second, we identify critical shortcomings, where LLMs fail to detect and penalize errors, such as factual inaccuracies, cultural misrepresentations, and the presence of unwanted language. Finally, we release Kudge, the first non-English meta-evaluation dataset containing 5,012 human annotations in Korean.

new Spontaneous Informal Speech Dataset for Punctuation Restoration

Authors: Xing Yi Liu, Homayoon Beigi

Abstract: Presently, punctuation restoration models are evaluated almost solely on well-structured, scripted corpora. On the other hand, real-world ASR systems and post-processing pipelines are typically applied to spontaneous speech with significant irregularities, stutters, and deviations from perfect grammar. To address this discrepancy, we introduce SponSpeech, a punctuation restoration dataset derived from informal speech sources, which includes punctuation and casing information. In addition to publicly releasing the dataset, we contribute a filtering pipeline that can be used to generate more data. Our filtering pipeline examines the quality of both speech audio and transcription text. We also carefully construct a ``challenging'' test set, aimed at evaluating models' ability to leverage audio information to predict otherwise grammatically ambiguous punctuation. SponSpeech is available at https://github.com/GitHubAccountAnonymous/PR, along with all code for dataset building and model runs.

URLs: https://github.com/GitHubAccountAnonymous/PR,

new Measuring and Enhancing Trustworthiness of LLMs in RAG through Grounded Attributions and Learning to Refuse

Authors: Maojia Song, Shang Hong Sim, Rishabh Bhardwaj, Hai Leong Chieu, Navonil Majumder, Soujanya Poria

Abstract: LLMs are an integral part of retrieval-augmented generation (RAG) systems. While many studies focus on evaluating the quality of end-to-end RAG systems, there is a lack of research on understanding the appropriateness of an LLM for the RAG task. Thus, we introduce a new metric, Trust-Score, that provides a holistic evaluation of the trustworthiness of LLMs in an RAG framework. We show that various prompting methods, such as in-context learning, fail to adapt LLMs effectively to the RAG task. Thus, we propose Trust-Align, a framework to align LLMs for higher Trust-Score. LLaMA-3-8b, aligned with our method, significantly outperforms open-source LLMs of comparable sizes on ASQA (up 10.7), QAMPARI (up 29.2) and ELI5 (up 14.9). We release our code at: https://github.com/declare-lab/trust-align.

URLs: https://github.com/declare-lab/trust-align.

new Linear Recency Bias During Training Improves Transformers' Fit to Reading Times

Authors: Christian Clark, Byung-Doh Oh, William Schuler

Abstract: Recent psycholinguistic research has compared human reading times to surprisal estimates from language models to study the factors shaping human sentence processing difficulty. Previous studies have shown a strong fit between surprisal values from Transformers and reading times. However, standard Transformers work with a lossless representation of the entire previous linguistic context, unlike models of human language processing that include memory decay. To bridge this gap, this paper evaluates a modification of the Transformer model that uses ALiBi (Press et al., 2022), a recency bias added to attention scores. Surprisal estimates with ALiBi show an improved fit to human reading times compared to a standard Transformer baseline. A subsequent analysis of attention heads suggests that ALiBi's mixture of slopes -- which determine the rate of memory decay in each attention head -- may play a role in the improvement by helping models with ALiBi to track different kinds of linguistic dependencies.
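
For readers unfamiliar with ALiBi, the sketch below shows, under simplifying assumptions, how a per-head linear recency bias can be added to causal attention scores before the softmax; the head count and random scores are illustrative, and this is not the authors' training code.

```python
# Hedged sketch of an ALiBi-style linear recency bias (Press et al., 2022):
# each head subtracts slope * (query_pos - key_pos) from its attention scores,
# so more distant past tokens receive lower weight. Values are illustrative.
import numpy as np

def alibi_slopes(n_heads: int) -> np.ndarray:
    # Standard geometric slope schedule, assuming n_heads is a power of two.
    start = 2.0 ** (-8.0 / n_heads)
    return np.array([start ** (i + 1) for i in range(n_heads)])

def attention_with_alibi(scores: np.ndarray, slopes: np.ndarray) -> np.ndarray:
    """scores: (heads, seq, seq) raw query-key dot products; returns softmax weights."""
    seq = scores.shape[-1]
    i = np.arange(seq)[:, None]                     # query positions
    j = np.arange(seq)[None, :]                     # key positions
    bias = -slopes[:, None, None] * (i - j)         # linear distance penalty
    causal = np.where(j <= i, 0.0, -np.inf)         # block attention to the future
    logits = scores + bias + causal
    weights = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return weights / weights.sum(axis=-1, keepdims=True)

heads, seq = 4, 6
rng = np.random.default_rng(0)
attn = attention_with_alibi(rng.normal(size=(heads, seq, seq)), alibi_slopes(heads))
print(attn[0].round(3))   # rows sum to 1; weight decays with distance into the past
```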

new WER We Stand: Benchmarking Urdu ASR Models

Authors: Samee Arif, Aamina Jamal Khan, Mustafa Abbas, Agha Ali Raza, Awais Athar

Abstract: This paper presents a comprehensive evaluation of Urdu Automatic Speech Recognition (ASR) models. We analyze the performance of three ASR model families: Whisper, MMS, and Seamless-M4T, using Word Error Rate (WER), along with a detailed examination of the most frequent wrong words and error types, including insertions, deletions, and substitutions. Our analysis is conducted using two types of datasets: read speech and conversational speech. Notably, we present the first conversational speech dataset designed for benchmarking Urdu ASR models. We find that seamless-large outperforms other ASR models on the read speech dataset, while whisper-large performs best on the conversational speech dataset. Furthermore, this evaluation highlights the complexities of assessing ASR models for low-resource languages like Urdu using quantitative metrics alone and emphasizes the need for a robust Urdu text normalization system. Our findings contribute valuable insights for developing robust ASR systems for low-resource languages like Urdu.
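
As a reference point for the metric used throughout the paper, here is a minimal, self-contained sketch of word error rate via Levenshtein alignment over words; the example sentences are made-up placeholders, not items from the benchmark datasets.

```python
# Hedged sketch: Word Error Rate via Levenshtein alignment over words,
# counting substitutions, deletions, and insertions against a reference.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = minimum edits to turn ref[:i] into hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i                                  # deletions
    for j in range(len(hyp) + 1):
        dp[0][j] = j                                  # insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + sub)    # substitution / match
    return dp[-1][-1] / max(len(ref), 1)

# Toy transliterated example (placeholder, not from the paper's data).
print(wer("yeh ek misaal hai", "yeh aik misal hai"))  # -> 0.5
```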

new Norm of Mean Contextualized Embeddings Determines their Variance

Authors: Hiroaki Yamagiwa, Hidetoshi Shimodaira

Abstract: Contextualized embeddings vary by context, even for the same token, and form a distribution in the embedding space. To analyze this distribution, we focus on the norm of the mean embedding and the variance of the embeddings. In this study, we first demonstrate that these values follow the well-known formula for variance in statistics and provide an efficient sequential computation method. Then, by observing embeddings from intermediate layers of several Transformer models, we found a strong trade-off relationship between the norm and the variance: as the mean embedding becomes closer to the origin, the variance increases. This trade-off is likely influenced by the layer normalization mechanism used in Transformer models. Furthermore, when the sets of token embeddings are treated as clusters, we show that the variance of the entire embedding set can theoretically be decomposed into the within-cluster variance and the between-cluster variance. We found experimentally that as the layers of Transformer models deepen, the embeddings move farther from the origin, the between-cluster variance relatively decreases, and the within-cluster variance relatively increases. These results are consistent with existing studies on the anisotropy of the embedding spaces across layers.
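
The statistics discussed here can be reproduced on synthetic data with a short sketch: the identity relating the variance to the mean squared norm and the squared norm of the mean, and the within/between-cluster decomposition, follow standard formulas, while the embeddings themselves are random placeholders rather than Transformer outputs.

```python
# Hedged sketch: variance of a set of embeddings equals the mean squared norm
# minus the squared norm of the mean, and total variance decomposes into
# within-cluster plus between-cluster parts (equal-sized clusters here).
import numpy as np

rng = np.random.default_rng(0)
# 3 "token types" (clusters), 200 contextual occurrences each, dimension 16.
clusters = [rng.normal(loc=c, scale=1.0, size=(200, 16)) for c in (0.0, 0.5, 1.0)]
X = np.vstack(clusters)

mean = X.mean(axis=0)
total_var = np.mean(np.sum((X - mean) ** 2, axis=1))
alt_var = np.mean(np.sum(X ** 2, axis=1)) - np.sum(mean ** 2)   # E[||x||^2] - ||E[x]||^2

within = np.mean([np.mean(np.sum((c - c.mean(0)) ** 2, axis=1)) for c in clusters])
between = np.mean([np.sum((c.mean(0) - mean) ** 2) for c in clusters])

print(np.isclose(total_var, alt_var))            # True
print(np.isclose(total_var, within + between))   # True for equal-sized clusters
```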

new The Art of Storytelling: Multi-Agent Generative AI for Dynamic Multimodal Narratives

Authors: Samee Arif, Taimoor Arif, Aamina Jamal Khan, Muhammad Saad Haroon, Agha Ali Raza, Awais Athar

Abstract: This paper introduces the concept of an education tool that utilizes Generative Artificial Intelligence (GenAI) to enhance storytelling for children. The system combines GenAI-driven narrative co-creation, text-to-speech conversion, and text-to-video generation to produce an engaging experience for learners. We describe the co-creation process, the adaptation of narratives into spoken words using text-to-speech models, and the transformation of these narratives into contextually relevant visuals through text-to-video technology. Our evaluation covers the linguistics of the generated stories, the text-to-speech conversion quality, and the accuracy of the generated visuals.

new LOLA -- An Open-Source Massively Multilingual Large Language Model

Authors: Nikit Srivastava, Denis Kuchelev, Tatiana Moteu, Kshitij Shetty, Michael Roeder, Diego Moussallem, Hamada Zahera, Axel-Cyrille Ngonga Ngomo

Abstract: This paper presents LOLA, a massively multilingual large language model trained on more than 160 languages using a sparse Mixture-of-Experts Transformer architecture. Our architectural and implementation choices address the challenge of harnessing linguistic diversity while maintaining efficiency and avoiding the common pitfalls of multilinguality. Our analysis of the evaluation results shows competitive performance in natural language generation and understanding tasks. Additionally, we demonstrate how the learned expert-routing mechanism exploits implicit phylogenetic linguistic patterns to potentially alleviate the curse of multilinguality. We provide an in-depth look at the training process, an analysis of the datasets, and a balanced exploration of the model's strengths and limitations. As an open-source model, LOLA promotes reproducibility and serves as a robust foundation for future research. Our findings enable the development of compute-efficient multilingual models with strong, scalable performance across languages.

new Task Arithmetic for Language Expansion in Speech Translation

Authors: Yao-Fei Cheng, Hayato Futami, Yosuke Kashiwagi, Emiru Tsunoo, Wen Shen Teo, Siddhant Arora, Shinji Watanabe

Abstract: Recent advances in large language models (LLMs) have spurred interest in speech-text multimodal foundation models, which achieve strong performance on instruction-based speech translation (ST). However, expanding the language pairs of an existing instruction-tuned ST system is costly due to the necessity of re-training on a combination of new and previous datasets. We propose to expand to new language pairs by merging the model trained on the new language pairs with the existing model using task arithmetic. We find that the direct application of task arithmetic to ST causes the merged model to fail to follow instructions, thus generating translations in incorrect languages. To eliminate language confusion, we propose an augmented task arithmetic method that merges an additional language control model, which is trained to generate the correct target language token following the instructions. Our experiments demonstrate that the proposed language control model achieves language expansion by eliminating language confusion. In our MuST-C and CoVoST-2 experiments, it yields improvements of up to 4.66 and 4.92 BLEU points, respectively. In addition, we demonstrate that our task arithmetic framework can be extended to a language pair for which neither paired ST training data nor a pre-trained ST model is available: we first synthesize an ST system from machine translation (MT) systems via task analogy, then merge the synthesized ST system into the existing ST model.
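
Task arithmetic itself reduces to simple operations on parameter dictionaries. The sketch below illustrates the general recipe under illustrative assumptions: the parameter names, shapes, and scaling factors are placeholders, and the language control model is merged here as just another task vector, which simplifies the authors' augmented method.

```python
# Hedged sketch of task arithmetic over model parameter dictionaries:
# task vectors (fine-tuned minus base weights) are scaled and summed onto
# the base model. All tensors and scales below are synthetic placeholders.
import numpy as np

def task_vector(finetuned: dict, base: dict) -> dict:
    return {k: finetuned[k] - base[k] for k in base}

def merge(base: dict, task_vectors: list, scales: list) -> dict:
    merged = {k: v.copy() for k, v in base.items()}
    for tv, s in zip(task_vectors, scales):
        for k in merged:
            merged[k] += s * tv[k]
    return merged

rng = np.random.default_rng(0)
shape = (4, 4)
base = {"proj.weight": rng.normal(size=shape)}
st_existing = {"proj.weight": base["proj.weight"] + 0.1 * rng.normal(size=shape)}
st_new_pair = {"proj.weight": base["proj.weight"] + 0.1 * rng.normal(size=shape)}
lang_control = {"proj.weight": base["proj.weight"] + 0.05 * rng.normal(size=shape)}

tvs = [task_vector(m, base) for m in (st_existing, st_new_pair, lang_control)]
merged = merge(base, tvs, scales=[1.0, 1.0, 0.5])   # scaling factors are hypothetical
print(merged["proj.weight"].shape)
```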

new Leveraging Distillation Techniques for Document Understanding: A Case Study with FLAN-T5

Authors: Marcel Lamott, Muhammad Armaghan Shakir

Abstract: The surge of digital documents in various formats, including less standardized documents such as business reports and environmental assessments, underscores the growing importance of Document Understanding. While Large Language Models (LLMs) have showcased prowess across diverse natural language processing tasks, their direct application to Document Understanding remains a challenge. Previous research has demonstrated the utility of LLMs in this domain, yet their significant computational demands make them challenging to deploy effectively. Additionally, proprietary Blackbox LLMs often outperform their open-source counterparts, posing a barrier to widespread accessibility. In this paper, we delve into the realm of document understanding, leveraging distillation methods to harness the power of large LLMs while accommodating computational limitations. Specifically, we present a novel approach wherein we distill document understanding knowledge from the proprietary LLM ChatGPT into FLAN-T5. Our methodology integrates labeling and curriculum-learning mechanisms to facilitate efficient knowledge transfer. This work contributes to the advancement of document understanding methodologies by offering a scalable solution that bridges the gap between resource-intensive LLMs and practical applications. Our findings underscore the potential of distillation techniques in facilitating the deployment of sophisticated language models in real-world scenarios, thereby fostering advancements in natural language processing and document comprehension domains.

new Zero-resource Hallucination Detection for Text Generation via Graph-based Contextual Knowledge Triples Modeling

Authors: Xinyue Fang, Zhen Huang, Zhiliang Tian, Minghui Fang, Ziyi Pan, Quntian Fang, Zhihua Wen, Hengyue Pan, Dongsheng Li

Abstract: LLMs obtain remarkable performance but suffer from hallucinations. Most research on hallucination detection focuses on questions with short, concrete correct answers whose faithfulness is easy to check. Hallucination detection for text generation with open-ended answers is more challenging. Some researchers use external knowledge to detect hallucinations in generated texts, but external resources for specific scenarios are hard to access. Recent studies on detecting hallucinations in long text without external resources conduct consistency comparisons among multiple sampled outputs. To handle long texts, researchers split them into multiple facts and individually compare the consistency of each pair of facts. However, these methods (1) hardly achieve alignment among multiple facts and (2) overlook dependencies between multiple contextual facts. In this paper, we propose a graph-based context-aware (GCA) hallucination detection method for text generation, which aligns knowledge facts and considers the dependencies between contextual knowledge triples in consistency comparison. In particular, to align multiple facts, we conduct a triple-oriented response segmentation to extract multiple knowledge triples. To model dependencies among contextual knowledge triples (facts), we construct the contextual triples into a graph and enhance their interactions via message passing and aggregation with an RGCN. To avoid the omission of knowledge triples in long text, we conduct an LLM-based reverse verification by reconstructing the knowledge triples. Experiments show that our model enhances hallucination detection and outperforms all baselines.

new SpMis: An Investigation of Synthetic Spoken Misinformation Detection

Authors: Peizhuo Liu, Li Wang, Renqiang He, Haorui He, Lei Wang, Huadi Zheng, Jie Shi, Tong Xiao, Zhizheng Wu

Abstract: In recent years, speech generation technology has advanced rapidly, fueled by generative models and large-scale training techniques. While these developments have enabled the production of high-quality synthetic speech, they have also raised concerns about the misuse of this technology, particularly for generating synthetic misinformation. Current research primarily focuses on distinguishing machine-generated speech from human-produced speech, but the more urgent challenge is detecting misinformation within spoken content. This task requires a thorough analysis of factors such as speaker identity, topic, and synthesis. To address this need, we conduct an initial investigation into synthetic spoken misinformation detection by introducing an open-source dataset, SpMis. SpMis includes speech synthesized from over 1,000 speakers across five common topics, utilizing state-of-the-art text-to-speech systems. Although our results show promising detection capabilities, they also reveal substantial challenges for practical implementation, underscoring the importance of ongoing research in this critical area.

new THaMES: An End-to-End Tool for Hallucination Mitigation and Evaluation in Large Language Models

Authors: Mengfei Liang, Archish Arun, Zekun Wu, Cristian Munoz, Jonathan Lutch, Emre Kazim, Adriano Koshiyama, Philip Treleaven

Abstract: Hallucination, the generation of factually incorrect content, is a growing challenge in Large Language Models (LLMs). Existing detection and mitigation methods are often isolated and insufficient for domain-specific needs, lacking a standardized pipeline. This paper introduces THaMES (Tool for Hallucination Mitigations and EvaluationS), an integrated framework and library addressing this gap. THaMES offers an end-to-end solution for evaluating and mitigating hallucinations in LLMs, featuring automated test set generation, multifaceted benchmarking, and adaptable mitigation strategies. It automates test set creation from any corpus, ensuring high data quality, diversity, and cost-efficiency through techniques like batch processing, weighted sampling, and counterfactual validation. THaMES assesses a model's ability to detect and reduce hallucinations across various tasks, including text generation and binary classification, applying optimal mitigation strategies like In-Context Learning (ICL), Retrieval Augmented Generation (RAG), and Parameter-Efficient Fine-tuning (PEFT). Evaluations of state-of-the-art LLMs using a knowledge base of academic papers, political news, and Wikipedia reveal that commercial models like GPT-4o benefit more from RAG than ICL, while open-weight models like Llama-3.1-8B-Instruct and Mistral-Nemo gain more from ICL. Additionally, PEFT significantly enhances the performance of Llama-3.1-8B-Instruct in both evaluation tasks.

new CORE-Bench: Fostering the Credibility of Published Research Through a Computational Reproducibility Agent Benchmark

Authors: Zachary S. Siegel, Sayash Kapoor, Nitya Nagdir, Benedikt Stroebl, Arvind Narayanan

Abstract: AI agents have the potential to aid users on a variety of consequential tasks, including conducting scientific research. To spur the development of useful agents, we need benchmarks that are challenging, but more crucially, directly correspond to real-world tasks of interest. This paper introduces such a benchmark, designed to measure the accuracy of AI agents in tackling a crucial yet surprisingly challenging aspect of scientific research: computational reproducibility. This task, fundamental to the scientific process, involves reproducing the results of a study using the provided code and data. We introduce CORE-Bench (Computational Reproducibility Agent Benchmark), a benchmark consisting of 270 tasks based on 90 scientific papers across three disciplines (computer science, social science, and medicine). Tasks in CORE-Bench consist of three difficulty levels and include both language-only and vision-language tasks. We provide an evaluation system to measure the accuracy of agents in a fast and parallelizable way, saving days of evaluation time for each run compared to a sequential implementation. We evaluated two baseline agents: the general-purpose AutoGPT and a task-specific agent called CORE-Agent. We tested both variants using two underlying language models: GPT-4o and GPT-4o-mini. The best agent achieved an accuracy of 21% on the hardest task, showing the vast scope for improvement in automating routine scientific tasks. Having agents that can reproduce existing work is a necessary step towards building agents that can conduct novel research and could verify and improve the performance of other research agents. We hope that CORE-Bench can improve the state of reproducibility and spur the development of future research agents.

new CoCA: Regaining Safety-awareness of Multimodal Large Language Models with Constitutional Calibration

Authors: Jiahui Gao, Renjie Pi, Tianyang Han, Han Wu, Lanqing Hong, Lingpeng Kong, Xin Jiang, Zhenguo Li

Abstract: The deployment of multimodal large language models (MLLMs) has demonstrated remarkable success in engaging in conversations involving visual inputs, thanks to the superior power of large language models (LLMs). These MLLMs are typically built on top of LLMs, with an image encoder that processes images into the token embedding space of the LLM. However, the integration of the visual modality has introduced a unique vulnerability: the MLLM becomes susceptible to malicious visual inputs and prone to generating sensitive or harmful responses, even though the LLM has been trained on textual datasets to align with human values. In this paper, we first raise the question: ``Do MLLMs possess safety-awareness against malicious image inputs?''. We find that after adding a principle that specifies the safety requirement into the input of the MLLM, the model's safety awareness is boosted. This phenomenon verifies the existence of the MLLM's safety-awareness against image inputs; it is merely weakened by the modality gap. We then introduce a simple yet effective technique termed CoCA, which amplifies the safety-awareness of the MLLM by calibrating its output distribution. Our proposed strategy helps the model reclaim its original safety awareness without losing its original capabilities. We verify the effectiveness of our approach on both multimodal safety and understanding benchmarks.

new Diversify and Conquer: Diversity-Centric Data Selection with Iterative Refinement

Authors: Simon Yu, Liangyu Chen, Sara Ahmadian, Marzieh Fadaee

Abstract: Finetuning large language models on instruction data is crucial for enhancing pre-trained knowledge and improving instruction-following capabilities. As instruction datasets proliferate, selecting optimal data for effective training becomes increasingly important. This work addresses the question: How can we determine the optimal subset of data for effective training? While existing research often emphasizes local criteria like instance quality for subset selection, we argue that a global approach focused on data diversity is more critical. Our method employs k-means clustering to ensure the selected subset effectively represents the full dataset. We propose an iterative refinement method inspired by active learning techniques to resample instances from clusters, reassessing each cluster's importance and sampling weight in every training iteration. This approach reduces the effect of outliers and automatically filters out clusters containing low-quality data. Through extensive evaluation across natural language reasoning, general world knowledge, code and math reasoning tasks, and by fine-tuning models from various families, we observe consistent improvements, achieving a 7% increase over random selection and a 3.8% improvement over state-of-the-art sampling methods. Our work highlights the significance of diversity-first sampling when finetuning LLMs to enhance performance across a broad array of evaluation tasks. Our code is available at https://github.com/for-ai/iterative-data-selection.

URLs: https://github.com/for-ai/iterative-data-selection.
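
To make the selection recipe concrete, the following is a minimal sketch of one k-means-based, diversity-first subset selection step; the embeddings, cluster count, and proportional-sampling rule are illustrative assumptions rather than the released implementation (see the repository above for the authors' code).

```python
# Hedged sketch of diversity-centric subset selection: embed instructions,
# cluster them with k-means, then sample from each cluster in proportion to
# its size, preferring points near the centroid. All inputs are placeholders.
import numpy as np
from sklearn.cluster import KMeans

def diverse_subset(embeddings: np.ndarray, budget: int, k: int = 8, seed: int = 0):
    km = KMeans(n_clusters=k, random_state=seed, n_init=10).fit(embeddings)
    selected = []
    for c in range(k):
        idx = np.where(km.labels_ == c)[0]
        n_take = max(1, round(budget * len(idx) / len(embeddings)))
        dists = np.linalg.norm(embeddings[idx] - km.cluster_centers_[c], axis=1)
        selected.extend(idx[np.argsort(dists)[:n_take]].tolist())
    return selected[:budget]

rng = np.random.default_rng(0)
fake_embeddings = rng.normal(size=(1000, 32))    # stand-in for instruction embeddings
subset = diverse_subset(fake_embeddings, budget=100)
print(len(subset), subset[:5])
```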

new Says Who? Effective Zero-Shot Annotation of Focalization

Authors: Rebecca M. M. Hicke, Yuri Bizzoni, Pascale Feldkamp, Ross Deans Kristensen-McLachlan

Abstract: Focalization, the perspective through which narrative is presented, is encoded via a wide range of lexico-grammatical features and is subject to reader interpretation. Moreover, trained readers regularly disagree on interpretations, suggesting that this problem may be computationally intractable. In this paper, we provide experiments to test how well contemporary Large Language Models (LLMs) perform when annotating literary texts for focalization mode. Despite the challenging nature of the task, LLMs show comparable performance to trained human annotators in our experiments. We provide a case study working with the novels of Stephen King to demonstrate the usefulness of this approach for computational literary studies, illustrating how focalization can be studied at scale.

new NVLM: Open Frontier-Class Multimodal LLMs

Authors: Wenliang Dai, Nayeon Lee, Boxin Wang, Zhuoling Yang, Zihan Liu, Jon Barker, Tuomas Rintamaki, Mohammad Shoeybi, Bryan Catanzaro, Wei Ping

Abstract: We introduce NVLM 1.0, a family of frontier-class multimodal large language models (LLMs) that achieve state-of-the-art results on vision-language tasks, rivaling the leading proprietary models (e.g., GPT-4o) and open-access models (e.g., Llama 3-V 405B and InternVL 2). Remarkably, NVLM 1.0 shows improved text-only performance over its LLM backbone after multimodal training. In terms of model design, we perform a comprehensive comparison between decoder-only multimodal LLMs (e.g., LLaVA) and cross-attention-based models (e.g., Flamingo). Based on the strengths and weaknesses of both approaches, we propose a novel architecture that enhances both training efficiency and multimodal reasoning capabilities. Furthermore, we introduce a 1-D tile-tagging design for tile-based dynamic high-resolution images, which significantly boosts performance on multimodal reasoning and OCR-related tasks. Regarding training data, we meticulously curate and provide detailed information on our multimodal pretraining and supervised fine-tuning datasets. Our findings indicate that dataset quality and task diversity are more important than scale, even during the pretraining phase, across all architectures. Notably, we develop production-grade multimodality for the NVLM-1.0 models, enabling them to excel in vision-language tasks while maintaining and even improving text-only performance compared to their LLM backbones. To achieve this, we craft and integrate a high-quality text-only dataset into multimodal training, alongside a substantial amount of multimodal math and reasoning data, leading to enhanced math and coding capabilities across modalities. To advance research in the field, we are releasing the model weights and will open-source the code for the community: https://nvlm-project.github.io/.

URLs: https://nvlm-project.github.io/.

new AraDiCE: Benchmarks for Dialectal and Cultural Capabilities in LLMs

Authors: Basel Mousi, Nadir Durrani, Fatema Ahmad, Md. Arid Hasan, Maram Hasanain, Tameem Kabbani, Fahim Dalvi, Shammur Absar Chowdhury, Firoj Alam

Abstract: Arabic, with its rich diversity of dialects, remains significantly underrepresented in Large Language Models, particularly in dialectal variations. We address this gap by introducing seven synthetic datasets in dialects alongside Modern Standard Arabic (MSA), created using Machine Translation (MT) combined with human post-editing. We present AraDiCE, a benchmark for Arabic Dialect and Cultural Evaluation. We evaluate LLMs on dialect comprehension and generation, focusing specifically on low-resource Arabic dialects. Additionally, we introduce the first-ever fine-grained benchmark designed to evaluate cultural awareness across the Gulf, Egypt, and Levant regions, providing a novel dimension to LLM evaluation. Our findings demonstrate that while Arabic-specific models like Jais and AceGPT outperform multilingual models on dialectal tasks, significant challenges persist in dialect identification, generation, and translation. This work contributes ~45K post-edited samples, a cultural benchmark, and highlights the importance of tailored training to improve LLM performance in capturing the nuances of diverse Arabic dialects and cultural contexts. We will release the dialectal translation models and benchmarks curated in this study.

cross "Is This It?": Towards Ecologically Valid Benchmarks for Situated Collaboration

Authors: Dan Bohus, Sean Andrist, Yuwei Bao, Eric Horvitz, Ann Paradiso

Abstract: We report initial work towards constructing ecologically valid benchmarks to assess the capabilities of large multimodal models for engaging in situated collaboration. In contrast to existing benchmarks, in which question-answer pairs are generated post hoc over preexisting or synthetic datasets via templates, human annotators, or large language models (LLMs), we propose and investigate an interactive system-driven approach, where the questions are generated by users in context, during their interactions with an end-to-end situated AI system. We illustrate how the questions that arise are different in form and content from questions typically found in existing embodied question answering (EQA) benchmarks and discuss new real-world challenge problems brought to the fore.

cross SAM4MLLM: Enhance Multi-Modal Large Language Model for Referring Expression Segmentation

Authors: Yi-Chia Chen, Wei-Hua Li, Cheng Sun, Yu-Chiang Frank Wang, Chu-Song Chen

Abstract: We introduce SAM4MLLM, an innovative approach which integrates the Segment Anything Model (SAM) with Multi-Modal Large Language Models (MLLMs) for pixel-aware tasks. Our method enables MLLMs to learn pixel-level location information without requiring excessive modifications to the existing model architecture or adding specialized tokens. We introduce an inquiry-based approach that can effectively find prompt points for SAM to perform segmentation based on MLLM. It combines detailed visual information with the powerful expressive capabilities of large language models in a unified language-based manner without additional computational overhead in learning. Experimental results on public benchmarks demonstrate the effectiveness of our approach.

cross Agentic Society: Merging skeleton from real world and texture from Large Language Model

Authors: Yuqi Bai, Kun Sun, Huishi Yin

Abstract: Recent advancements in large language models (LLMs) and agent technologies offer promising solutions to the simulation of social science experiments, but the availability of the real-world population data that many of them require still poses a major challenge. This paper explores a novel framework that leverages census data and LLMs to generate virtual populations, significantly reducing resource requirements and bypassing privacy compliance issues associated with real-world data, while maintaining statistical truthfulness. Drawing on real-world census data, our approach first generates a persona that reflects the demographic characteristics of the population. We then employ LLMs to enrich these personas with intricate details, using techniques akin to those in image generative models but applied to textual data. Additionally, we propose a framework for evaluating the feasibility of our method with respect to the capabilities of LLMs, based on personality trait tests, specifically the Big Five model, which also enhances the depth and realism of the generated personas. Through preliminary experiments and analysis, we demonstrate that our method produces personas with the variability essential for simulating diverse human behaviors in social science experiments. However, the evaluation results show only weak signs of statistical truthfulness, owing to the limited capabilities of current LLMs. Insights from our study also highlight the tension within LLMs between aligning with human values and reflecting real-world complexities. Thorough and rigorous tests call for further research. Our codes are released at https://github.com/baiyuqi/agentic-society.git

URLs: https://github.com/baiyuqi/agentic-society.git

cross Unveiling Induction Heads: Provable Training Dynamics and Feature Learning in Transformers

Authors: Siyu Chen, Heejune Sheen, Tianhao Wang, Zhuoran Yang

Abstract: In-context learning (ICL) is a cornerstone of large language model (LLM) functionality, yet its theoretical foundations remain elusive due to the complexity of transformer architectures. In particular, most existing work only theoretically explains how the attention mechanism facilitates ICL under certain data models. It remains unclear how the other building blocks of the transformer contribute to ICL. To address this question, we study how a two-attention-layer transformer is trained to perform ICL on $n$-gram Markov chain data, where each token in the Markov chain statistically depends on the previous $n$ tokens. We analyze a sophisticated transformer model featuring relative positional embedding, multi-head softmax attention, and a feed-forward layer with normalization. We prove that the gradient flow with respect to a cross-entropy ICL loss converges to a limiting model that performs a generalized version of the induction head mechanism with a learned feature, resulting from the congruous contribution of all the building blocks. In the limiting model, the first attention layer acts as a $\mathit{copier}$, copying past tokens within a given window to each position, and the feed-forward network with normalization acts as a $\mathit{selector}$ that generates a feature vector by only looking at informationally relevant parents from the window. Finally, the second attention layer is a $\mathit{classifier}$ that compares these features with the feature at the output position, and uses the resulting similarity scores to generate the desired output. Our theory is further validated by experiments.

cross Eureka: Evaluating and Understanding Large Foundation Models

Authors: Vidhisha Balachandran, Jingya Chen, Neel Joshi, Besmira Nushi, Hamid Palangi, Eduardo Salinas, Vibhav Vineet, James Woffinden-Luey, Safoora Yousefi

Abstract: Rigorous and reproducible evaluation is critical for assessing the state of the art and for guiding scientific advances in Artificial Intelligence. Evaluation is challenging in practice due to several reasons, including benchmark saturation, lack of transparency in methods used for measurement, development challenges in extracting measurements for generative tasks, and, more generally, the extensive number of capabilities required for a well-rounded comparison across models. We make three contributions to alleviate the above challenges. First, we present Eureka, an open-source framework for standardizing evaluations of large foundation models beyond single-score reporting and rankings. Second, we introduce Eureka-Bench as an extensible collection of benchmarks testing capabilities that (i) are still challenging for state-of-the-art models and (ii) represent fundamental but overlooked language and multimodal capabilities. The inherent space for improvement in non-saturated benchmarks enables us to discover meaningful differences between models at a capability level. Third, using Eureka, we conduct an analysis of 12 state-of-the-art models, providing in-depth insights into failure understanding and model comparison, which can be leveraged to plan targeted improvements. In contrast to recent trends in reports and leaderboards showing absolute rankings and claims for one model or another to be the best, our analysis shows that there is no such best model. Different models have different strengths, but there are models that appear more often than others as best performers for some capabilities. Despite the recent improvements, current models still struggle with several fundamental capabilities including detailed image understanding, benefiting from multimodal input when available rather than fully relying on language, factuality and grounding for information retrieval, and over refusals.

cross CSKV: Training-Efficient Channel Shrinking for KV Cache in Long-Context Scenarios

Authors: Luning Wang, Shiyao Li, Xuefei Ning, Zhihang Yuan, Shengen Yan, Guohao Dai, Yu Wang

Abstract: Large Language Models (LLMs) have been widely adopted to process long-context tasks. However, the large memory overhead of the key-value (KV) cache poses significant challenges in long-context scenarios. Existing training-free KV cache compression methods typically focus on quantization and token pruning, which have compression limits, and excessive sparsity can lead to severe performance degradation. Other methods design new architectures with less KV overhead but require significant training overhead. To address the above two drawbacks, we further explore the redundancy in the channel dimension and apply an architecture-level design with minor training costs. Therefore, we introduce CSKV, a training-efficient Channel Shrinking technique for KV cache compression: (1) We first analyze the singular value distribution of the KV cache, revealing significant redundancy and compression potential along the channel dimension. Based on this observation, we propose using low-rank decomposition for key and value layers and storing the low-dimension features. (2) To preserve model performance, we introduce a bi-branch KV cache, including a window-based full-precision KV cache and a low-precision compressed KV cache. (3) To reduce the training costs, we minimize the layer-wise reconstruction loss for the compressed KV cache instead of retraining the entire LLMs. Extensive experiments show that CSKV can reduce the memory overhead of the KV cache by 80% while maintaining the model's long-context capability. Moreover, we show that our method can be seamlessly combined with quantization to further reduce the memory overhead, achieving a compression ratio of up to 95%.
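
A minimal sketch of the channel-shrinking idea follows, assuming a single head's key cache and a plain truncated SVD; the bi-branch cache and training-based reconstruction described in the abstract are not reproduced here.

```python
# Hedged sketch: shrink the channel dimension of a cached key matrix with a
# truncated SVD, store only low-dimensional features, and reconstruct keys on
# demand. The synthetic low-rank-plus-noise cache and sizes are illustrative.
import numpy as np

def low_rank_cache(K: np.ndarray, rank: int):
    """K: (seq, d) cached keys -> stored features (seq, rank) and shared decoder (rank, d)."""
    U, S, Vt = np.linalg.svd(K, full_matrices=False)
    features = U[:, :rank] * S[:rank]
    decoder = Vt[:rank]
    return features, decoder

seq_len, d_head, rank = 512, 128, 32
rng = np.random.default_rng(0)
K = rng.normal(size=(seq_len, rank)) @ rng.normal(size=(rank, d_head))
K += 0.01 * rng.normal(size=(seq_len, d_head))       # mostly low-rank plus small noise

features, decoder = low_rank_cache(K, rank)
K_hat = features @ decoder                            # reconstructed keys at attention time
print(f"stored/original size: {features.size / K.size:.2f}, "
      f"relative error: {np.linalg.norm(K - K_hat) / np.linalg.norm(K):.4f}")
```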

cross A Bayesian Interpretation of Adaptive Low-Rank Adaptation

Authors: Haolin Chen, Philip N. Garner

Abstract: Motivated by the sensitivity-based importance score of the adaptive low-rank adaptation (AdaLoRA), we utilize more theoretically supported metrics, including the signal-to-noise ratio (SNR), along with the Improved Variational Online Newton (IVON) optimizer, for adaptive parameter budget allocation. The resulting Bayesian counterpart not only has matched or surpassed the performance of using the sensitivity-based importance metric but is also a faster alternative to AdaLoRA with Adam. Our theoretical analysis reveals a significant connection between the two metrics, providing a Bayesian perspective on the efficacy of sensitivity as an importance score. Furthermore, our findings suggest that the magnitude, rather than the variance, is the primary indicator of the importance of parameters.
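
The signal-to-noise-ratio score mentioned above is simple to state; the sketch below computes it for hypothetical parameter groups with placeholder posterior statistics, and the budget-pruning rule shown is an illustrative simplification rather than the paper's allocation procedure.

```python
# Hedged sketch of an SNR importance score: for each parameter group,
# SNR = posterior mean^2 / posterior variance, averaged over the group;
# groups with the lowest aggregate SNR are the first candidates for pruning.
# All posterior statistics below are random placeholders.
import numpy as np

def snr_score(mean: np.ndarray, variance: np.ndarray) -> float:
    return float(np.mean(mean ** 2 / (variance + 1e-12)))

rng = np.random.default_rng(0)
groups = {f"layer{idx}.rank{r}": (rng.normal(size=64), rng.gamma(2.0, 0.01, size=64))
          for idx in range(2) for r in range(4)}

scores = {name: snr_score(m, v) for name, (m, v) in groups.items()}
pruned_first = sorted(scores, key=scores.get)[:3]    # lowest-SNR directions
print(pruned_first)
```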

cross Model-in-the-Loop (MILO): Accelerating Multimodal AI Data Annotation with LLMs

Authors: Yifan Wang, David Stevens, Pranay Shah, Wenwen Jiang, Miao Liu, Xu Chen, Robert Kuo, Na Li, Boying Gong, Daniel Lee, Jiabo Hu, Ning Zhang, Bob Kamma

Abstract: The growing demand for AI training data has transformed data annotation into a global industry, but traditional approaches relying on human annotators are often time-consuming, labor-intensive, and prone to inconsistent quality. We propose the Model-in-the-Loop (MILO) framework, which integrates AI/ML models into the annotation process. Our research introduces a collaborative paradigm that leverages the strengths of both professional human annotators and large language models (LLMs). By employing LLMs as pre-annotation and real-time assistants, and judges on annotator responses, MILO enables effective interaction patterns between human annotators and LLMs. Three empirical studies on multimodal data annotation demonstrate MILO's efficacy in reducing handling time, improving data quality, and enhancing annotator experiences. We also introduce quality rubrics for flexible evaluation and fine-grained feedback on open-ended annotations. The MILO framework has implications for accelerating AI/ML development, reducing reliance on human annotation alone, and promoting better alignment between human and machine values.

cross Self-supervised Speech Models for Word-Level Stuttered Speech Detection

Authors: Yi-Jen Shih, Zoi Gkalitsiou, Alexandros G. Dimakis, David Harwath

Abstract: Clinical diagnosis of stuttering requires an assessment by a licensed speech-language pathologist. However, this process is time-consuming and requires clinicians with training and experience in stuttering and fluency disorders. Unfortunately, only a small percentage of speech-language pathologists report being comfortable working with individuals who stutter, which is inadequate to accommodate the 80 million individuals who stutter worldwide. Developing machine learning models for detecting stuttered speech would enable universal and automated screening for stuttering, enabling speech pathologists to identify and follow up with patients who are most likely to be diagnosed with a stuttering speech disorder. Previous research in this area has predominantly focused on utterance-level detection, which is not sufficient for clinical settings where word-level annotation of stuttering is the norm. In this study, we curated a stuttered speech dataset with word-level annotations and introduced a word-level stuttering speech detection model leveraging self-supervised speech models. Our evaluation demonstrates that our model surpasses previous approaches in word-level stuttering speech detection. Additionally, we conducted an extensive ablation analysis of our method, providing insight into the most important aspects of adapting self-supervised speech models for stuttered speech detection.

cross NaviQAte: Functionality-Guided Web Application Navigation

Authors: Mobina Shahbandeh, Parsa Alian, Noor Nashid, Ali Mesbah

Abstract: End-to-end web testing is challenging due to the need to explore diverse web application functionalities. Current state-of-the-art methods, such as WebCanvas, are not designed for broad functionality exploration; they rely on specific, detailed task descriptions, limiting their adaptability in dynamic web environments. We introduce NaviQAte, which frames web application exploration as a question-and-answer task, generating action sequences for functionalities without requiring detailed parameters. Our three-phase approach utilizes advanced large language models like GPT-4o for complex decision-making and cost-effective models, such as GPT-4o mini, for simpler tasks. NaviQAte focuses on functionality-guided web application navigation, integrating multi-modal inputs such as text and images to enhance contextual understanding. Evaluations on the Mind2Web-Live and Mind2Web-Live-Abstracted datasets show that NaviQAte achieves a 44.23% success rate in user task navigation and a 38.46% success rate in functionality navigation, representing a 15% and 33% improvement over WebCanvas. These results underscore the effectiveness of our approach in advancing automated web application testing.

cross GenCRF: Generative Clustering and Reformulation Framework for Enhanced Intent-Driven Information Retrieval

Authors: Wonduk Seo, Haojie Zhang, Yueyang Zhang, Changhao Zhang, Songyao Duan, Lixin Su, Daiting Shi, Jiashu Zhao, Dawei Yin

Abstract: Query reformulation is a well-known problem in Information Retrieval (IR) aimed at enhancing the success rate of a single search by automatically modifying the user's input query. Recent methods leverage Large Language Models (LLMs) to improve query reformulation, but often generate limited and redundant expansions, potentially constraining their effectiveness in capturing diverse intents. In this paper, we propose GenCRF: a Generative Clustering and Reformulation Framework that, for the first time, adaptively captures diverse intentions based on multiple differentiated, well-generated queries in the retrieval phase. GenCRF leverages LLMs to generate variable queries from the initial query using customized prompts, then clusters them into groups to distinctly represent diverse intents. Furthermore, the framework explores combining these diverse-intent queries with innovative weighted aggregation strategies to optimize retrieval performance, and crucially integrates a novel Query Evaluation Rewarding Model (QERM) to refine the process through feedback loops. Empirical experiments on the BEIR benchmark demonstrate that GenCRF achieves state-of-the-art performance, surpassing previous query reformulation SOTAs by up to 12% on nDCG@10. These techniques can be adapted to various LLMs, significantly boosting retriever performance and advancing the field of Information Retrieval.

cross Enhancing Multilingual Speech Generation and Recognition Abilities in LLMs with Constructed Code-switched Data

Authors: Jing Xu, Daxin Tan, Jiaqi Wang, Xiao Chen

Abstract: While large language models (LLMs) have been explored in the speech domain for both generation and recognition tasks, their applications are predominantly confined to the monolingual scenario, with limited exploration in multilingual and code-switched (CS) contexts. Additionally, speech generation and recognition tasks are often handled separately, as in VALL-E and Qwen-Audio. In this paper, we propose a MultiLingual MultiTask (MLMT) model, integrating multilingual speech generation and recognition tasks within a single LLM. Furthermore, we develop an effective data construction approach that splits and concatenates words from different languages to equip LLMs with CS synthesis ability without relying on CS data. The experimental results demonstrate that our model outperforms other baselines with a comparable data scale. Furthermore, our data construction approach not only equips LLMs with CS speech synthesis capability with comparable speaker consistency and similarity to any given speaker, but also improves the performance of LLMs in multilingual speech generation and recognition tasks.

cross Improving Speech Emotion Recognition in Under-Resourced Languages via Speech-to-Speech Translation with Bootstrapping Data Selection

Authors: Hsi-Che Lin, Yi-Cheng Lin, Huang-Cheng Chou, Hung-yi Lee

Abstract: Speech Emotion Recognition (SER) is a crucial component in developing general-purpose AI agents capable of natural human-computer interaction. However, building robust multilingual SER systems remains challenging due to the scarcity of labeled data in languages other than English and Chinese. In this paper, we propose an approach to enhance SER performance in low SER resource languages by leveraging data from high-resource languages. Specifically, we employ expressive Speech-to-Speech translation (S2ST) combined with a novel bootstrapping data selection pipeline to generate labeled data in the target language. Extensive experiments demonstrate that our method is both effective and generalizable across different upstream models and languages. Our results suggest that this approach can facilitate the development of more scalable and robust multilingual SER systems.

cross Promptriever: Instruction-Trained Retrievers Can Be Prompted Like Language Models

Authors: Orion Weller, Benjamin Van Durme, Dawn Lawrie, Ashwin Paranjape, Yuhao Zhang, Jack Hessel

Abstract: Instruction-tuned language models (LM) are able to respond to imperative commands, providing a more natural user interface compared to their base counterparts. In this work, we present Promptriever, the first retrieval model able to be prompted like an LM. To train Promptriever, we curate and release a new instance-level instruction training set from MS MARCO, spanning nearly 500k instances. Promptriever not only achieves strong performance on standard retrieval tasks, but also follows instructions. We observe: (1) large gains (reaching SoTA) on following detailed relevance instructions (+14.3 p-MRR / +3.1 nDCG on FollowIR), (2) significantly increased robustness to lexical choices/phrasing in the query+instruction (+12.9 Robustness@10 on InstructIR), and (3) the ability to perform hyperparameter search via prompting to reliably improve retrieval performance (+1.4 average increase on BEIR). Promptriever demonstrates that retrieval models can be controlled with prompts on a per-query basis, setting the stage for future work aligning LM prompting techniques with information retrieval.

cross ISO: Overlap of Computation and Communication within Sequence For LLM Inference

Authors: Bin Xiao, Lei Su

Abstract: In the realm of Large Language Model (LLM) inference, the inherent structure of transformer models coupled with the multi-GPU tensor parallelism strategy leads to a sequential execution of computation and communication. This results in substantial underutilization of computing resources during the communication phase. To mitigate this inefficiency, various techniques have been developed to optimize the use of computational power throughout the communication process. These strategies primarily involve overlapping matrix computations and communications, as well as interleaving micro-batches across different requests. Nonetheless, these approaches either fall short of achieving ideal overlap or impose certain limitations on their application. To overcome these challenges, this paper introduces a novel strategy for computation-communication overlap that operates at the sequence level. This method not only enhances the degree of overlap but also minimizes the constraints on its applicability. Experimental evaluations conducted using 30b/70b models have demonstrated significant improvements in efficiency. Specifically, the proposed technique has been shown to reduce time consumption by approximately 35% on the 4090 GPU and by roughly 15% on the A800 GPU during the prefill stage of LLM inference.

cross Capturing Differences in Character Representations Between Communities: An Initial Study with Fandom

Authors: Bianca N. Y. Kang

Abstract: Sociolinguistic theories have highlighted how narratives are often retold, co-constructed and reconceptualized in collaborative settings. This working paper focuses on the re-interpretation of characters, an integral part of the narrative story-world, and attempts to study how this may be computationally compared between online communities. Using online fandom - a highly communal phenomenon that has been largely studied qualitatively - as data, computational methods were applied to explore shifts in character representations between two communities and the original text. Specifically, text from the Harry Potter novels, the r/HarryPotter subreddit, and fanfiction on Archive of Our Own were analyzed for changes in character mentions, centrality measures from co-occurrence networks, and semantic associations. While fandom elevates secondary characters as found in past work, the two fan communities prioritize different subsets of characters. Word embedding tests reveal starkly different associations of the same characters between communities on the gendered concepts of femininity/masculinity, cruelty, and beauty. Furthermore, fanfiction descriptions of a male character analyzed between romance pairings scored higher for feminine-coded characteristics in male-male romance, matching past qualitative theorizing. The results highlight the potential for computational methods to assist in capturing the re-conceptualization of narrative elements across communities and in supporting qualitative research on fandom.

cross Bio-Inspired Mamba: Temporal Locality and Bioplausible Learning in Selective State Space Models

Authors: Jiahao Qin

Abstract: This paper introduces Bio-Inspired Mamba (BIM), a novel online learning framework for selective state space models that integrates biological learning principles with the Mamba architecture. BIM combines Real-Time Recurrent Learning (RTRL) with Spike-Timing-Dependent Plasticity (STDP)-like local learning rules, addressing the challenges of temporal locality and biological plausibility in training spiking neural networks. Our approach leverages the inherent connection between backpropagation through time and STDP, offering a computationally efficient alternative that maintains the ability to capture long-range dependencies. We evaluate BIM on language modeling, speech recognition, and biomedical signal analysis tasks, demonstrating competitive performance against traditional methods while adhering to biological learning principles. Results show improved energy efficiency and potential for neuromorphic hardware implementation. BIM not only advances the field of biologically plausible machine learning but also provides insights into the mechanisms of temporal information processing in biological neural networks.

cross P-RAG: Progressive Retrieval Augmented Generation For Planning on Embodied Everyday Task

Authors: Weiye Xu, Min Wang, Wengang Zhou, Houqiang Li

Abstract: Embodied Everyday Task is a popular task in the embodied AI community, requiring agents to make a sequence of actions based on natural language instructions and visual observations. Traditional learning-based approaches face two challenges. Firstly, natural language instructions often lack explicit task planning. Secondly, extensive training is required to equip models with knowledge of the task environment. Previous works based on Large Language Model (LLM) either suffer from poor performance due to the lack of task-specific knowledge or rely on ground truth as few-shot samples. To address the above limitations, we propose a novel approach called Progressive Retrieval Augmented Generation (P-RAG), which not only effectively leverages the powerful language processing capabilities of LLMs but also progressively accumulates task-specific knowledge without ground-truth. Compared to the conventional RAG methods, which retrieve relevant information from the database in a one-shot manner to assist generation, P-RAG introduces an iterative approach to progressively update the database. In each iteration, P-RAG retrieves the latest database and obtains historical information from the previous interaction as experiential references for the current interaction. Moreover, we also introduce a more granular retrieval scheme that not only retrieves similar tasks but also incorporates retrieval of similar situations to provide more valuable reference experiences. Extensive experiments reveal that P-RAG achieves competitive results without utilizing ground truth and can even further improve performance through self-iterations.

cross EIA: Environmental Injection Attack on Generalist Web Agents for Privacy Leakage

Authors: Zeyi Liao, Lingbo Mo, Chejian Xu, Mintong Kang, Jiawei Zhang, Chaowei Xiao, Yuan Tian, Bo Li, Huan Sun

Abstract: Generalist web agents have evolved rapidly and demonstrated remarkable potential. However, there are unprecedented safety risks associated with these agents, which are nearly unexplored so far. In this work, we aim to narrow this gap by conducting the first study on the privacy risks of generalist web agents in adversarial environments. First, we present a threat model that discusses the adversarial targets, constraints, and attack scenarios. Particularly, we consider two types of adversarial targets: stealing users' specific personally identifiable information (PII) or stealing the entire user request. To achieve these objectives, we propose a novel attack method, termed Environmental Injection Attack (EIA). This attack injects malicious content designed to adapt well to different environments where the agents operate, causing them to perform unintended actions. This work instantiates EIA specifically for the privacy scenario. It inserts malicious web elements alongside persuasive instructions that mislead web agents into leaking private information, and can further leverage CSS and JavaScript features to remain stealthy. We collect 177 action steps that involve diverse PII categories on realistic websites from the Mind2Web dataset, and conduct extensive experiments using one of the most capable generalist web agent frameworks to date, SeeAct. The results demonstrate that EIA achieves up to 70% ASR in stealing users' specific PII. Stealing full user requests is more challenging, but a relaxed version of EIA can still achieve 16% ASR. Despite these concerning results, it is important to note that the attack can still be detected through careful human inspection, highlighting a trade-off between high autonomy and security. This leads to our detailed discussion on the efficacy of EIA under different levels of human supervision as well as implications on defenses for generalist web agents.

replace Large language models can replicate cross-cultural differences in personality

Authors: Pawe{\l} Niszczota, Mateusz Janczak, Micha{\l} Misiak

Abstract: We use a large-scale experiment (N=8000) to determine whether GPT-4 can replicate cross-cultural differences in the Big Five, measured using the Ten-Item Personality Inventory. We used the US and South Korea as the cultural pair, given that prior research suggests substantial personality differences between people from these two countries. We manipulated the target of the simulation (US vs. Korean), the language of the inventory (English vs. Korean), and the language model (GPT-4 vs. GPT-3.5). Our results show that GPT-4 replicated the cross-cultural differences for each factor. However, mean ratings had an upward bias and exhibited lower variation than in the human samples, as well as lower structural validity. We provide preliminary evidence that LLMs can aid cross-cultural researchers and practitioners.

replace Towards Goal-oriented Prompt Engineering for Large Language Models: A Survey

Authors: Haochen Li, Jonathan Leung, Zhiqi Shen

Abstract: Large Language Models (LLMs) have shown prominent performance in various downstream tasks, and prompt engineering plays a pivotal role in optimizing LLMs' performance. This paper not only provides an overview of current prompt engineering methods but also aims to highlight the limitation of designing prompts based on an anthropomorphic assumption that expects LLMs to think like humans. From our review of 50 representative studies, we demonstrate that a goal-oriented prompt formulation, which guides LLMs to follow established human logical thinking, significantly improves the performance of LLMs. Furthermore, we introduce a novel taxonomy that categorizes goal-oriented prompting methods into five interconnected stages, and we demonstrate the broad applicability of our framework. With four future directions proposed, we hope to further emphasize the power and potential of goal-oriented prompt engineering in all fields.

replace Large Language Models for Mathematical Reasoning: Progresses and Challenges

Authors: Janice Ahn, Rishu Verma, Renze Lou, Di Liu, Rui Zhang, Wenpeng Yin

Abstract: Mathematical reasoning serves as a cornerstone for assessing the fundamental cognitive capabilities of human intelligence. In recent times, there has been a notable surge in the development of Large Language Models (LLMs) geared towards the automated resolution of mathematical problems. However, the landscape of mathematical problem types is vast and varied, with LLM-oriented techniques undergoing evaluation across diverse datasets and settings. This diversity makes it challenging to discern the true advancements and obstacles within this burgeoning field. This survey endeavors to address four pivotal dimensions: i) a comprehensive exploration of the various mathematical problems and their corresponding datasets that have been investigated; ii) an examination of the spectrum of LLM-oriented techniques that have been proposed for mathematical problem-solving; iii) an overview of factors and concerns affecting LLMs in solving math; and iv) an elucidation of the persisting challenges within this domain. To the best of our knowledge, this survey stands as one of the first extensive examinations of the landscape of LLMs in the realm of mathematics, providing a holistic perspective on the current state, accomplishments, and future challenges in this rapidly evolving field.

replace MURRE: Multi-Hop Table Retrieval with Removal for Open-Domain Text-to-SQL

Authors: Xuanliang Zhang, Dingzirui Wang, Longxu Dou, Qingfu Zhu, Wanxiang Che

Abstract: The open-domain text-to-SQL task aims to retrieve question-relevant tables from massive databases and then generate SQL. However, the performance of current methods is constrained by single-hop retrieval, and existing multi-hop retrieval methods for open-domain question answering are not directly applicable because they tend to retrieve tables similar to the already retrieved ones but irrelevant to the question: the questions in text-to-SQL usually contain all required information, whereas previous multi-hop retrieval supplements the question with retrieved documents. Therefore, we propose multi-hop table retrieval with removal (MURRE), which removes previously retrieved information from the question to guide the retriever towards unretrieved relevant tables. Our experiments on two open-domain text-to-SQL datasets demonstrate an average improvement of 5.7% over the previous state-of-the-art results.
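
To illustrate the removal idea, here is a toy sketch in which tokens covered by already-retrieved tables are deleted from the question before the next retrieval hop; the lexical-overlap scorer, the toy schemas, and the question are stand-ins for the dense retriever and LLM-based rewriting one would actually use.

```python
# Hedged sketch of multi-hop retrieval with removal: after each hop, tokens
# already covered by retrieved tables are removed from the question so the
# next hop focuses on still-unretrieved information. Scoring is a simple
# lexical-overlap stand-in; schemas and the question are hypothetical.
def score(question_tokens: set, table_tokens: set) -> float:
    return len(question_tokens & table_tokens) / (len(table_tokens) or 1)

def multi_hop_retrieve(question: str, tables: dict, hops: int = 2):
    remaining = set(question.lower().split())
    retrieved = []
    for _ in range(hops):
        candidates = {t: score(remaining, set(text.lower().split()))
                      for t, text in tables.items() if t not in retrieved}
        if not candidates:
            break
        best = max(candidates, key=candidates.get)
        retrieved.append(best)
        remaining -= set(tables[best].lower().split())   # removal step
    return retrieved

tables = {   # toy schema descriptions (hypothetical)
    "singer": "singer id name country age",
    "concert": "concert id singer id stadium id year",
    "stadium": "stadium id name capacity city",
}
print(multi_hop_retrieve("show the name and capacity of stadiums that hosted a concert", tables))
# -> ['stadium', 'concert']
```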

replace COBIAS: Contextual Reliability in Bias Assessment

Authors: Priyanshul Govil, Hemang Jain, Vamshi Krishna Bonagiri, Aman Chadha, Ponnurangam Kumaraguru, Manas Gaur, Sanorita Dey

Abstract: Large Language Models (LLMs) often inherit biases from the web data they are trained on, which contains stereotypes and prejudices. Current methods for evaluating and mitigating these biases rely on bias-benchmark datasets. These benchmarks measure bias by observing an LLM's behavior on biased statements. However, these statements lack contextual considerations of the situations they try to present. To address this, we introduce a contextual reliability framework, which evaluates model robustness to biased statements by considering the various contexts in which they may appear. We develop the Context-Oriented Bias Indicator and Assessment Score (COBIAS) to measure a biased statement's reliability in detecting bias based on the variance in model behavior across different contexts. To evaluate the metric, we augment 2,291 stereotyped statements from two existing benchmark datasets by adding contextual information. We show that COBIAS aligns with human judgment on the contextual reliability of biased statements (Spearman's $\rho = 0.65$, $p = 3.4 \times 10^{-60}$) and can be used to create reliable datasets, which would assist bias mitigation efforts.
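
The abstract describes scoring a statement by the variance of model behavior across contexts; a minimal sketch of that idea follows (the exact COBIAS formula is not given here, and `model_bias_prob` is a hypothetical callable):

    import statistics

    def context_variance_score(statement, contexts, model_bias_prob):
        """Score a biased statement by how much a model's behaviour varies when
        the statement is embedded in different contexts; higher variance means
        the bare statement is a less reliable probe of bias on its own.
        `model_bias_prob(text) -> float in [0, 1]` is a hypothetical callable."""
        scores = [model_bias_prob(f"{ctx} {statement}") for ctx in contexts]
        return statistics.pvariance(scores)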

replace On Evaluation Protocols for Data Augmentation in a Limited Data Scenario

Authors: Fr\'ed\'eric Piedboeuf, Philippe Langlais

Abstract: Textual data augmentation (DA) is a prolific field of study where novel techniques to create artificial data are regularly proposed, and it has demonstrated great efficiency in small-data settings, at least for text classification tasks. In this paper, we challenge those results, showing that classical data augmentation (which modifies sentences) is simply a way of performing better fine-tuning, and that spending more time on fine-tuning before applying data augmentation negates its effect. This is a significant contribution, as it answers several questions that were left open in recent years, namely: which DA technique performs best (all of them, as long as they generate data close enough to the training set so as not to impair training) and why DA showed positive results (it facilitates the training of the network). We further show that zero- and few-shot DA via conversational agents such as ChatGPT or Llama 2 can increase performance, confirming that this form of data augmentation is preferable to classical methods.

replace Dated Data: Tracing Knowledge Cutoffs in Large Language Models

Authors: Jeffrey Cheng, Marc Marone, Orion Weller, Dawn Lawrie, Daniel Khashabi, Benjamin Van Durme

Abstract: Released Large Language Models (LLMs) are often paired with a claimed knowledge cutoff date, or the date at which the training data was gathered. Such information is crucial for applications where the LLM must provide up-to-date information. However, this statement only scratches the surface: do all resources in the training data share the same knowledge cutoff date? Does the model's demonstrated knowledge for these subsets closely align with their cutoff dates? In this work, we define the notion of an effective cutoff. This is distinct from the LLM designer's reported cutoff and applies separately to sub-resources and topics. We propose a simple approach to estimate effective cutoffs at the resource level by probing an LLM's temporal alignment across versions of the data. Using this analysis, we find that effective cutoffs often differ from reported cutoffs. To understand the root cause of this observation, we conduct a direct large-scale analysis on open pre-training datasets. Our analysis reveals two reasons for these inconsistencies: (1) temporal biases of CommonCrawl data due to non-trivial amounts of old data in new dumps and (2) complications in LLM deduplication schemes involving semantic duplicates and lexical near-duplicates. Overall, our results show that knowledge cutoffs are not as simple as they have seemed and that care must be taken both by LLM dataset curators and by practitioners who seek to use information from these models.

replace The Invalsi Benchmarks: measuring Linguistic and Mathematical understanding of Large Language Models in Italian

Authors: Giovanni Puccetti, Maria Cassese, Andrea Esuli

Abstract: While Italian is a high-resource language, there are few Italian-native benchmarks to evaluate generative Large Language Models (LLMs) in this language. This work presents three new benchmarks: Invalsi MATE to evaluate model performance on mathematical understanding in Italian, Invalsi ITA to evaluate language understanding in Italian, and Olimpiadi MATE for more complex mathematical understanding. The first two benchmarks are based on the Invalsi tests, which are administered to students aged between 6 and 18 within the Italian school system and have been validated by several experts in teaching and pedagogy; the third comes from the Italian high school math Olympics. We evaluate 10 powerful language models on these benchmarks and find that their accuracy is bounded at 71% on Invalsi MATE, achieved by Llama 3.1 70b instruct, and at 88% on Invalsi ITA. For both Invalsi MATE and Invalsi ITA, we compare LLMs with the average performance of Italian students and show that Llama 3.1 is the only one to outperform them on Invalsi MATE, while most models do so on Invalsi ITA. We then show that Olimpiadi MATE is more challenging than Invalsi MATE, with the highest accuracy, achieved by Llama 3.1 405b instruct, being 45%. We will make data and evaluation code openly available upon acceptance of the paper.

replace Does Biomedical Training Lead to Better Medical Performance?

Authors: Amin Dada, Marie Bauer, Amanda Butler Contreras, Osman Alperen Kora\c{s}, Constantin Marc Seibold, Kaleb E Smith, Jens Kleesiek

Abstract: Large Language Models (LLMs) are expected to significantly contribute to patient care, diagnostics, and administrative processes. Emerging biomedical LLMs aim to address healthcare-specific challenges, including privacy demands and computational constraints. Assessing the models' suitability for this sensitive application area is of the utmost importance. However, biomedical training has not been systematically evaluated on medical tasks. This study investigates the effect of biomedical training in the context of six practical medical tasks evaluating $25$ models. In contrast to previous evaluations, our results reveal a performance decline in nine out of twelve biomedical models after fine-tuning, particularly on tasks involving hallucinations, ICD10 coding, and instruction adherence. General-domain models like Meta-Llama-3.1-70B-Instruct outperformed their biomedical counterparts, indicating a trade-off between domain-specific fine-tuning and general medical task performance. We open-source all evaluation scripts and datasets at https://github.com/TIO-IKIM/CLUE to support further research in this critical area.

URLs: https://github.com/TIO-IKIM/CLUE

replace Exploring Concept Depth: How Large Language Models Acquire Knowledge at Different Layers?

Authors: Mingyu Jin, Qinkai Yu, Jingyuan Huang, Qingcheng Zeng, Zhenting Wang, Wenyue Hua, Haiyan Zhao, Kai Mei, Yanda Meng, Kaize Ding, Fan Yang, Mengnan Du, Yongfeng Zhang

Abstract: Large language models (LLMs) have shown remarkable performance across a wide range of tasks. However, the mechanisms by which these models encode tasks of varying complexities remain poorly understood. In this paper, we explore the hypothesis that LLMs process concepts of varying complexities in different layers, introducing the idea of ``Concept Depth'' to suggest that more complex concepts are typically acquired in deeper layers. Specifically, we categorize concepts based on their level of abstraction, defining them in order of increasing complexity within factual, emotional, and inferential tasks. We conduct extensive probing experiments using layer-wise representations across various LLM families (Gemma, LLaMA, Qwen) on various datasets spanning the three domains of tasks. Our findings reveal that simpler tasks can be probed efficiently in shallow layers, while more complex tasks typically necessitate deeper layers for accurate understanding. Additionally, we examine how external factors, such as adding noise to the input and quantizing the model weights, might affect layer-wise representations. Our findings suggest that these factors can impede the development of a conceptual understanding in LLMs until deeper layers are explored. We hope that our proposed concept and experimental insights will enhance the understanding of the mechanisms underlying LLMs. Our codes are available at \url{https://github.com/Luckfort/CD}.

URLs: https://github.com/Luckfort/CD
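
For readers who want the gist of the probing setup, here is a minimal layer-wise linear-probing sketch; it assumes per-layer (e.g. mean-pooled) hidden states have already been extracted and is not the authors' exact protocol:

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    def layerwise_probe_accuracy(hidden_states, labels, cv=5):
        """hidden_states: (n_layers, n_examples, hidden_dim) representations;
        labels: (n_examples,). Returns cross-validated probing accuracy per
        layer; the depth at which accuracy saturates indicates where a
        concept becomes linearly decodable."""
        accuracies = []
        for layer_repr in hidden_states:
            probe = LogisticRegression(max_iter=1000)
            accuracies.append(cross_val_score(probe, layer_repr, labels, cv=cv).mean())
        return np.array(accuracies)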

replace Do Language Models Exhibit Human-like Structural Priming Effects?

Authors: Jaap Jumelet, Willem Zuidema, Arabella Sinclair

Abstract: We explore which linguistic factors -- at the sentence and token level -- play an important role in influencing language model predictions, and investigate whether these are reflective of results found in humans and human corpora (Gries and Kootstra, 2017). We make use of the structural priming paradigm, where recent exposure to a structure facilitates processing of the same structure. We investigate not only whether, but also where, priming effects occur, and what factors predict them. We show that these effects can be explained via the inverse frequency effect, known in human priming, where rarer elements within a prime increase priming effects, as well as by lexical dependence between prime and target. Our results provide an important piece in the puzzle of understanding how properties within their context affect structural prediction in language models.

replace Linguistic Bias in ChatGPT: Language Models Reinforce Dialect Discrimination

Authors: Eve Fleisig, Genevieve Smith, Madeline Bossi, Ishita Rustagi, Xavier Yin, Dan Klein

Abstract: We present a large-scale study of linguistic bias exhibited by ChatGPT covering ten dialects of English (Standard American English, Standard British English, and eight widely spoken non-"standard" varieties from around the world). We prompted GPT-3.5 Turbo and GPT-4 with text by native speakers of each variety and analyzed the responses via detailed linguistic feature annotation and native speaker evaluation. We find that the models default to "standard" varieties of English; based on evaluation by native speakers, we also find that model responses to non-"standard" varieties consistently exhibit a range of issues: stereotyping (19% worse than for "standard" varieties), demeaning content (25% worse), lack of comprehension (9% worse), and condescending responses (15% worse). We also find that if these models are asked to imitate the writing style of prompts in non-"standard" varieties, they produce text that exhibits lower comprehension of the input and is especially prone to stereotyping. GPT-4 improves on GPT-3.5 in terms of comprehension, warmth, and friendliness, but also exhibits a marked increase in stereotyping (+18%). The results indicate that GPT-3.5 Turbo and GPT-4 can perpetuate linguistic discrimination toward speakers of non-"standard" varieties.

replace Can Many-Shot In-Context Learning Help LLMs as Evaluators? A Preliminary Empirical Study

Authors: Mingyang Song, Mao Zheng, Xuan Luo

Abstract: Utilizing Large Language Models (LLMs) as evaluators for assessing the performance of other LLMs has recently garnered attention. However, this kind of evaluation is affected by potential biases in LLMs, raising concerns about the accuracy and reliability of the results. To mitigate this issue, we propose and study two many-shot in-context learning (ICL) prompt templates that help LLM evaluators mitigate these potential biases: \textbf{M}any-\textbf{S}hot \textbf{w}ith \textbf{R}eference (\textbf{MSwR}) and \textbf{M}any-\textbf{S}hot with\textbf{o}ut \textbf{R}eference (\textbf{MSoR}). Concretely, the former utilizes in-context examples with model-generated rationales as guidance, and the latter does not. Based on the designed prompts, we investigate the impact of scaling the number of in-context examples on the consistency and quality of the evaluation results. Experimental results show that advanced LLMs, such as GPT-4o, perform better in the many-shot regime than in the zero-shot regime. Furthermore, we reveal the symbol bias hidden in the selection bias of LLMs and propose a simple yet effective approach to mitigate it. Experimental results further verify the effectiveness of the symbol bias mitigation approach.
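
As a rough illustration of what a Many-Shot-with-Reference style prompt could look like (the paper's actual templates are not reproduced in the abstract, so the wording below is an assumption):

    def build_mswr_prompt(examples, candidate):
        """Assemble a hypothetical MSwR-style evaluation prompt: each in-context
        example pairs a response with a model-generated rationale and a score."""
        shots = "\n\n".join(
            f"Response: {ex['response']}\nRationale: {ex['rationale']}\nScore: {ex['score']}"
            for ex in examples
        )
        return (
            "You are an evaluator. Rate each response on a scale from 1 to 5.\n\n"
            f"{shots}\n\n"
            f"Response: {candidate}\nRationale:"
        )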

replace AI-Assisted Human Evaluation of Machine Translation

Authors: Vil\'em Zouhar, Tom Kocmi, Mrinmaya Sachan

Abstract: Annually, research teams spend large amounts of money to evaluate the quality of machine translation systems (WMT, inter alia). This is expensive because it requires a lot of expert human labor. The recently adopted annotation protocol, Error Span Annotation (ESA), has annotators marking erroneous parts of the translation and then assigning a final score. A lot of the annotator time is spent on scanning the translation for possible errors. In our work, we help the annotators by pre-filling the error annotations with recall-oriented automatic quality estimation. With this AI assistance, we obtain annotations at the same quality level while cutting down the time per span annotation by half (71s/error span $\rightarrow$ 31s/error span). The biggest advantage of the ESA$^\mathrm{AI}$ protocol is the accurate priming of annotators (pre-filled error spans) before they assign the final score. This also alleviates a potential automation bias, which we confirm to be low. In addition, the annotation budget can be reduced by almost 25\% by filtering out examples that the AI deems very likely to be correct.

replace Open-Source Web Service with Morphological Dictionary-Supplemented Deep Learning for Morphosyntactic Analysis of Czech

Authors: Milan Straka, Jana Strakov\'a

Abstract: We present an open-source web service for Czech morphosyntactic analysis. The system combines a deep learning model with rescoring by a high-precision morphological dictionary at inference time. We show that our hybrid method surpasses two competitive baselines: the deep learning model ensures generalization for out-of-vocabulary words and better disambiguation, improving over the existing morphological analyser MorphoDiTa, while at the same time it benefits from the inference-time guidance of a manually curated morphological dictionary. We achieve 50% error reduction in lemmatization and 58% error reduction in POS tagging over MorphoDiTa, while also offering dependency parsing. The model is trained on one of the currently largest Czech morphosyntactic corpora, the PDT-C 1.0, with the trained models available at https://hdl.handle.net/11234/1-5293. We provide the tool as a web service deployed at https://lindat.mff.cuni.cz/services/udpipe/. The source code is available on GitHub (https://github.com/ufal/udpipe/tree/udpipe-2), along with a Python client for simple use. The documentation for the models can be found at https://ufal.mff.cuni.cz/udpipe/2/models#czech_pdtc1.0_model.

URLs: https://hdl.handle.net/11234/1-5293, https://lindat.mff.cuni.cz/services/udpipe/, https://github.com/ufal/udpipe/tree/udpipe-2, https://ufal.mff.cuni.cz/udpipe/2/models

replace SynDARin: Synthesising Datasets for Automated Reasoning in Low-Resource Languages

Authors: Gayane Ghazaryan, Erik Arakelyan, Pasquale Minervini, Isabelle Augenstein

Abstract: Question Answering (QA) datasets have been instrumental in developing and evaluating Large Language Model (LLM) capabilities. However, such datasets are scarce for languages other than English due to the cost and difficulties of collection and manual annotation. This means that producing novel models and measuring the performance of multilingual LLMs in low-resource languages is challenging. To mitigate this, we propose $\textbf{S}$yn$\textbf{DAR}$in, a method for generating and validating QA datasets for low-resource languages. We utilize parallel content mining to obtain $\textit{human-curated}$ paragraphs between English and the target language. We use the English data as context to $\textit{generate}$ synthetic multiple-choice (MC) question-answer pairs, which are automatically translated and further validated for quality. Combining these with their designated non-English $\textit{human-curated}$ paragraphs forms the final QA dataset. The method makes it possible to maintain content quality, reduces the likelihood of factual errors, and circumvents the need for costly annotation. To test the method, we created a QA dataset with $1.2$K samples for the Armenian language. The human evaluation shows that $98\%$ of the generated English data maintains quality and diversity in the question types and topics, while the translation validation pipeline can filter out $\sim70\%$ of data with poor quality. We use the dataset to benchmark state-of-the-art LLMs, showing their inability to achieve human accuracy, with some model performances closer to random chance. This shows that the generated dataset is non-trivial and can be used to evaluate reasoning capabilities in low-resource languages.

replace Unveiling Entity-Level Unlearning for Large Language Models: A Comprehensive Analysis

Authors: Weitao Ma, Xiaocheng Feng, Weihong Zhong, Lei Huang, Yangfan Ye, Xiachong Feng, Bing Qin

Abstract: Large language model unlearning has garnered increasing attention due to its potential to address security and privacy concerns, leading to extensive research in the field. However, much of this research has concentrated on instance-level unlearning, specifically targeting the removal of predefined instances containing sensitive content. This focus has left a significant gap in the exploration of full entity-level unlearning, which is critical in real-world scenarios such as copyright protection. To this end, we propose a novel task of Entity-level unlearning, which aims to erase entity-related knowledge from the target model completely. To thoroughly investigate this task, we systematically evaluate trending unlearning algorithms, revealing that current methods struggle to achieve effective entity-level unlearning. Then, we further explore the factors that influence the performance of the unlearning algorithms, identifying that knowledge coverage and the size of the forget set play pivotal roles. Notably, our analysis also uncovers that entities introduced through fine-tuning are more vulnerable to unlearning than pre-trained entities. These findings collectively offer valuable insights for advancing entity-level unlearning for LLMs.

replace EmPO: Emotion Grounding for Empathetic Response Generation through Preference Optimization

Authors: Ondrej Sotolar, Vojtech Formanek, Alok Debnath, Allison Lahnala, Charles Welch, Lucie Flek

Abstract: Empathetic response generation is a desirable aspect of conversational agents, crucial for facilitating engaging and emotionally intelligent multi-turn conversations between humans and machines. Leveraging large language models for this task has shown promising results, yet challenges persist in ensuring both the empathetic quality of the responses and retention of the generalization performance of the models. We propose a novel approach where we construct theory-driven preference datasets based on emotion grounding and use them to align LLMs with preference optimization algorithms to address these challenges. To evaluate empathetic response generation, we employ the EmpatheticDialogues dataset, assessing empathy with the diff-Epitome and BERTscore metrics and with multi-dimensional human evaluation. Additionally, we measure diversity and emotional valence using feature-based methods. We also evaluate the impact of training on the generalization performance using the MMLU benchmark and tasks from the Open LLM Leaderboard. The results show that LLMs can be aligned for empathetic response generation by preference optimization while retaining their general performance and that emotion grounding can guide preference dataset creation. We make all datasets, source code, and models publicly available at https://github.com/justtherightsize/empo

URLs: https://github.com/justtherightsize/empo
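
The abstract mentions aligning LLMs with preference optimization algorithms without naming a specific one; as one common instantiation (an assumption for illustration, not necessarily the authors' choice), a minimal Direct Preference Optimization (DPO) loss over precomputed sequence log-probabilities might look like this:

    import torch
    import torch.nn.functional as F

    def dpo_loss(policy_chosen_logps, policy_rejected_logps,
                 ref_chosen_logps, ref_rejected_logps, beta=0.1):
        """DPO loss over summed sequence log-probs; in the empathy setting the
        'chosen' response would be the emotion-grounded, more empathetic one."""
        chosen_reward = beta * (policy_chosen_logps - ref_chosen_logps)
        rejected_reward = beta * (policy_rejected_logps - ref_rejected_logps)
        return -F.logsigmoid(chosen_reward - rejected_reward).mean()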

replace Self-Translate-Train: Enhancing Cross-Lingual Transfer of Large Language Models via Inherent Capability

Authors: Ryokan Ri, Shun Kiyono, Sho Takase

Abstract: Zero-shot cross-lingual transfer by fine-tuning multilingual pretrained models shows promise for low-resource languages, but often suffers from misalignment of internal representations between languages. We hypothesize that even when the model cannot generalize across languages effectively in fine-tuning, it still captures cross-lingual correspondence useful for cross-lingual transfer. We explore this hypothesis with Self-Translate-Train, a method that lets large language models (LLMs) translate training data into the target language and fine-tunes the model on its own generated data. By demonstrating that Self-Translate-Train outperforms zero-shot transfer, we encourage further exploration of better methods to elicit the cross-lingual capabilities of LLMs.

replace Entity-Level Sentiment: More than the Sum of Its Parts

Authors: Egil R{\o}nningstad, Roman Klinger, Lilja {\O}vrelid, Erik Velldal

Abstract: In sentiment analysis of longer texts, there may be a variety of topics discussed, of entities mentioned, and of sentiments expressed regarding each entity. We find a lack of studies exploring how such texts express their sentiment towards each entity of interest, and how these sentiments can be modelled. In order to better understand how sentiment regarding persons and organizations (each entity in our scope) is expressed in longer texts, we have collected a dataset of expert annotations where the overall sentiment regarding each entity is identified, together with the sentence-level sentiment for these entities separately. We show that the reader's perceived sentiment regarding an entity often differs from an arithmetic aggregation of sentiments at the sentence level. Only 70\% of the positive and 55\% of the negative entities receive a correct overall sentiment label when we aggregate the (human-annotated) sentiment labels for the sentences where the entity is mentioned. Our dataset reveals the complexity of entity-specific sentiment in longer texts, and allows for more precise modelling and evaluation of such sentiment expressions.
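
A toy sketch of the arithmetic aggregation baseline the abstract argues against (labels and data here are hypothetical):

    from statistics import mean

    def aggregate_entity_sentiment(sentence_labels):
        """Naive arithmetic aggregation of sentence-level sentiment for one
        entity; labels are -1 (negative), 0 (neutral), +1 (positive)."""
        avg = mean(sentence_labels)
        return "positive" if avg > 0 else "negative" if avg < 0 else "neutral"

    # One strongly negative sentence among otherwise neutral/positive mentions:
    # the aggregate comes out "neutral", while a reader's overall judgement of
    # the entity could well be "negative" -- the kind of mismatch the paper
    # quantifies.
    print(aggregate_entity_sentiment([0, 1, 0, -1, 0]))  # prints "neutral"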

replace Diversifying the Expert Knowledge for Task-Agnostic Pruning in Sparse Mixture-of-Experts

Authors: Zeliang Zhang, Xiaodong Liu, Hao Cheng, Chenliang Xu, Jianfeng Gao

Abstract: By increasing model parameters but activating them sparsely when performing a task, the use of the Mixture-of-Experts (MoE) architecture significantly improves the performance of Large Language Models (LLMs) without increasing the inference cost. However, the memory consumption due to the growing number of experts presents a challenge to the deployment of these models in many real-world settings. Our empirical study reveals that some experts encode redundant knowledge during pre-training. We thus propose a method of grouping and pruning similar experts to improve the model's parameter efficiency. We validate the effectiveness of our method by pruning three state-of-the-art MoE architectures, including Mixtral, Deepseek-MoE, and Qwen. The evaluation shows that our method outperforms other model pruning methods on a range of natural language tasks. We will release our code to facilitate future research.
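
A rough sketch of the grouping-then-pruning idea, clustering experts by cosine distance between their flattened weights and keeping one representative per group (the paper's actual grouping criterion and merging strategy are not specified in the abstract):

    import numpy as np
    from scipy.cluster.hierarchy import fcluster, linkage
    from scipy.spatial.distance import pdist

    def group_and_prune_experts(expert_weights, n_keep):
        """expert_weights: (n_experts, n_params) flattened expert parameters.
        Cluster experts by cosine distance and keep one index per cluster."""
        dist = pdist(expert_weights, metric="cosine")
        clusters = fcluster(linkage(dist, method="average"),
                            t=n_keep, criterion="maxclust")
        kept = [int(np.where(clusters == c)[0][0]) for c in np.unique(clusters)]
        return sorted(kept)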

replace MHS-STMA: Multimodal Hate Speech Detection via Scalable Transformer-Based Multilevel Attention Framework

Authors: Anusha Chhabra, Dinesh Kumar Vishwakarma

Abstract: Social media has a significant impact on people's lives. Hate speech on social media has emerged as one of society's most serious issues in recent years. Text and images are two forms of multimodal data distributed within articles. Earlier approaches have primarily emphasized unimodal analysis, and when performing multimodal analysis, researchers often neglect to preserve the distinctive qualities associated with each modality. To address these shortcomings, the present article proposes a scalable architecture for multimodal hate content detection called scalable transformer-based multilevel attention (STMA). This architecture consists of three main parts: a combined attention-based deep learning mechanism, a vision attention-mechanism encoder, and a caption attention-mechanism encoder. To identify hate content, each component uses various attention processes and handles multimodal data in a unique way. Evaluations employing multiple assessment criteria on three hate speech datasets, namely Hateful Memes, MultiOff, and MMHS150K, validate the suggested architecture's efficacy. The outcomes demonstrate that the suggested strategy outperforms the baseline approaches on all three datasets.

replace Elsevier Arena: Human Evaluation of Chemistry/Biology/Health Foundational Large Language Models

Authors: Camilo Thorne, Christian Druckenbrodt, Kinga Szarkowska, Deepika Goyal, Pranita Marajan, Vijay Somanath, Corey Harper, Mao Yan, Tony Scerri

Abstract: The quality and capabilities of large language models cannot currently be fully assessed with automated benchmark evaluations. Instead, human evaluations that expand on traditional qualitative techniques from the natural language generation literature are required. One recent best practice consists in using A/B-testing frameworks, which capture the preferences of human evaluators for specific models. In this paper we describe a human evaluation experiment focused on the biomedical domain (health, biology, chemistry/pharmacology) carried out at Elsevier. In it, a large but not massive (8.8B parameter) decoder-only foundational transformer trained on a relatively small (135B tokens) but highly curated collection of Elsevier datasets is compared to OpenAI's GPT-3.5-turbo and Meta's foundational 7B parameter Llama 2 model against multiple criteria. The results indicate -- even if IRR scores were generally low -- a preference towards GPT-3.5-turbo, and hence towards models that possess conversational abilities, are very large, and were trained on very large datasets. At the same time, they indicate that for less massive models, training on smaller but well-curated datasets can potentially give rise to viable alternatives in the biomedical domain.

replace Larger Language Models Don't Care How You Think: Why Chain-of-Thought Prompting Fails in Subjective Tasks

Authors: Georgios Chochlakis, Niyantha Maruthu Pandiyan, Kristina Lerman, Shrikanth Narayanan

Abstract: In-Context Learning (ICL) in Large Language Models (LLM) has emerged as the dominant technique for performing natural language tasks, as it does not require updating the model parameters with gradient-based methods. ICL promises to "adapt" the LLM to perform the present task at a competitive or state-of-the-art level at a fraction of the computational cost. ICL can be augmented by incorporating the reasoning process to arrive at the final label explicitly in the prompt, a technique called Chain-of-Thought (CoT) prompting. However, recent work has found that ICL relies mostly on the retrieval of task priors and less so on "learning" to perform tasks, especially for complex subjective domains like emotion and morality, where priors ossify posterior predictions. In this work, we examine whether "enabling" reasoning also creates the same behavior in LLMs, wherein the format of CoT retrieves reasoning priors that remain relatively unchanged despite the evidence in the prompt. We find that, surprisingly, CoT indeed suffers from the same posterior collapse as ICL for larger language models. Code is available at https://github.com/gchochla/cot-priors.

URLs: https://github.com/gchochla/cot-priors

replace Real or Robotic? Assessing Whether LLMs Accurately Simulate Qualities of Human Responses in Dialogue

Authors: Jonathan Ivey, Shivani Kumar, Jiayu Liu, Hua Shen, Sushrita Rakshit, Rohan Raju, Haotian Zhang, Aparna Ananthasubramaniam, Junghwan Kim, Bowen Yi, Dustin Wright, Abraham Israeli, Anders Giovanni M{\o}ller, Lechen Zhang, David Jurgens

Abstract: Studying and building datasets for dialogue tasks is both expensive and time-consuming due to the need to recruit, train, and collect data from study participants. In response, much recent work has sought to use large language models (LLMs) to simulate both human-human and human-LLM interactions, as they have been shown to generate convincingly human-like text in many settings. However, to what extent do LLM-based simulations \textit{actually} reflect human dialogues? In this work, we answer this question by generating a large-scale dataset of 100,000 paired LLM-LLM and human-LLM dialogues from the WildChat dataset and quantifying how well the LLM simulations align with their human counterparts. Overall, we find relatively low alignment between simulations and human interactions, demonstrating systematic divergence along multiple textual properties, including style and content. Further, in comparisons of English, Chinese, and Russian dialogues, we find that models perform similarly. Our results suggest that LLMs generally perform better when the human participant writes in a way that is more similar to the LLM's own style.

replace Large Language Model Based Generative Error Correction: A Challenge and Baselines for Speech Recognition, Speaker Tagging, and Emotion Recognition

Authors: Chao-Han Huck Yang, Taejin Park, Yuan Gong, Yuanchao Li, Zhehuai Chen, Yen-Ting Lin, Chen Chen, Yuchen Hu, Kunal Dhawan, Piotr \.Zelasko, Chao Zhang, Yun-Nung Chen, Yu Tsao, Jagadeesh Balam, Boris Ginsburg, Sabato Marco Siniscalchi, Eng Siong Chng, Peter Bell, Catherine Lai, Shinji Watanabe, Andreas Stolcke

Abstract: Given recent advances in generative AI technology, a key question is how large language models (LLMs) can enhance acoustic modeling tasks using text decoding results from a frozen, pretrained automatic speech recognition (ASR) model. To explore new capabilities in language modeling for speech processing, we introduce the generative speech transcription error correction (GenSEC) challenge. This challenge comprises three post-ASR language modeling tasks: (i) post-ASR transcription correction, (ii) speaker tagging, and (iii) emotion recognition. These tasks aim to emulate future LLM-based agents handling voice-based interfaces while remaining accessible to a broad audience by utilizing open pretrained language models or agent-based APIs. We also discuss insights from baseline evaluations, as well as lessons learned for designing future evaluations.

replace Generating Synthetic Free-text Medical Records with Low Re-identification Risk using Masked Language Modeling

Authors: Samuel Belkadi, Libo Ren, Nicolo Micheletti, Lifeng Han, Goran Nenadic

Abstract: In this paper, we present a system that generates synthetic free-text medical records, such as discharge summaries, admission notes and doctor correspondences, using Masked Language Modeling (MLM). Our system is designed to preserve the critical information of the records while introducing significant diversity and minimizing re-identification risk. The system incorporates a de-identification component that uses Philter to mask Protected Health Information (PHI), followed by a medical named entity recognition (NER) model to retain key medical information. We explore various masking ratios and mask-filling techniques to balance the trade-off between diversity and fidelity in the synthetic outputs without affecting overall readability. Our results demonstrate that the system can produce high-quality synthetic data with significant diversity while achieving a HIPAA-compliant PHI recall rate of 0.96 and a low re-identification risk of 0.035. Furthermore, downstream evaluations using a NER task reveal that the synthetic data can be effectively used to train models with performance comparable to those trained on real data. The flexibility of the system allows it to be adapted for specific use cases, making it a valuable tool for privacy-preserving data generation in medical research and healthcare applications.
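
A minimal sketch of the mask-and-refill step with a Hugging Face fill-mask pipeline; the model name, masking strategy, and protected-span handling are placeholder assumptions rather than the system's actual components:

    import random
    from transformers import pipeline

    fill_mask = pipeline("fill-mask", model="bert-base-uncased")  # placeholder model
    MASK = fill_mask.tokenizer.mask_token

    def mask_and_refill(text, masking_ratio=0.3, protected=()):
        """Randomly mask a fraction of tokens (skipping protected spans such as
        retained medical entities) and refill each with the MLM's top prediction."""
        words = text.split()
        for i, word in enumerate(words):
            if word in protected or random.random() > masking_ratio:
                continue
            masked = " ".join(words[:i] + [MASK] + words[i + 1:])
            words[i] = fill_mask(masked, top_k=1)[0]["token_str"].strip()
        return " ".join(words)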

replace jina-embeddings-v3: Multilingual Embeddings With Task LoRA

Authors: Saba Sturua, Isabelle Mohr, Mohammad Kalim Akram, Michael G\"unther, Bo Wang, Markus Krimmel, Feng Wang, Georgios Mastrapas, Andreas Koukounas, Nan Wang, Han Xiao

Abstract: We introduce jina-embeddings-v3, a novel text embedding model with 570 million parameters that achieves state-of-the-art performance on multilingual data and long-context retrieval tasks, supporting context lengths of up to 8192 tokens. The model includes a set of task-specific Low-Rank Adaptation (LoRA) adapters to generate high-quality embeddings for query-document retrieval, clustering, classification, and text matching. Additionally, Matryoshka Representation Learning is integrated into the training process, allowing flexible truncation of embedding dimensions without compromising performance. Evaluation on the MTEB benchmark shows that jina-embeddings-v3 outperforms the latest proprietary embeddings from OpenAI and Cohere on English tasks, while achieving superior performance compared to multilingual-e5-large-instruct across all multilingual tasks.
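
A small sketch of the Matryoshka-style flexible truncation mentioned in the abstract: keep the first k dimensions of an embedding and re-normalize (the embedding below is a random placeholder, not an actual jina-embeddings-v3 output):

    import numpy as np

    def truncate_matryoshka(embedding, dim):
        """Keep the first `dim` components of a Matryoshka-trained embedding and
        re-normalize, trading dimensionality for (ideally) little quality loss."""
        truncated = np.asarray(embedding, dtype=np.float32)[:dim]
        return truncated / (np.linalg.norm(truncated) + 1e-12)

    full = np.random.randn(1024).astype(np.float32)  # placeholder embedding
    print(truncate_matryoshka(full, 256).shape)      # (256,)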

replace Augmenting Automatic Speech Recognition Models with Disfluency Detection

Authors: Robin Amann, Zhaolin Li, Barbara Bruno, Jan Niehues

Abstract: Speech disfluency commonly occurs in conversational and spontaneous speech. However, standard Automatic Speech Recognition (ASR) models struggle to accurately recognize these disfluencies because they are typically trained on fluent transcripts. Current research mainly focuses on detecting disfluencies within transcripts, overlooking their exact location and duration in the speech. Additionally, previous work often requires model fine-tuning and addresses limited types of disfluencies. In this work, we present an inference-only approach to augment any ASR model with the ability to detect open-set disfluencies. We first demonstrate that ASR models have difficulty transcribing speech disfluencies. Next, this work proposes a modified Connectionist Temporal Classification (CTC)-based forced alignment algorithm from \cite{kurzinger2020ctc} to predict word-level timestamps while effectively capturing disfluent speech. Additionally, we develop a model to classify alignment gaps between timestamps as either containing disfluent speech or silence. This model achieves an accuracy of 81.62% and an F1-score of 80.07%. We test the augmentation pipeline of alignment gap detection and classification on a disfluent dataset. Our results show that we captured 74.13% of the words that were initially missed by the transcription, demonstrating the potential of this pipeline for downstream tasks.

replace Schrodinger's Memory: Large Language Models

Authors: Wei Wang, Qing Li

Abstract: Memory is the foundation of all human activities; without memory, it would be nearly impossible for people to perform any task in daily life. With the development of Large Language Models (LLMs), their language capabilities are becoming increasingly comparable to those of humans. But do LLMs have memory? Based on current performance, LLMs do appear to exhibit memory. So, what is the underlying mechanism of this memory? Previous research has lacked a deep exploration of LLMs' memory capabilities and the underlying theory. In this paper, we use the Universal Approximation Theorem (UAT) to explain the memory mechanism in LLMs. We also conduct experiments to verify the memory capabilities of various LLMs, proposing a new method to assess their abilities based on memory capacity. We argue that LLM memory operates like Schr\"odinger's memory, meaning that it only becomes observable when a specific memory is queried. We can only determine if the model retains a memory based on its output in response to the query; otherwise, it remains indeterminate. Finally, we expand on this concept by comparing the memory capabilities of the human brain and LLMs, highlighting the similarities and differences in their operational mechanisms.

replace-cross A new mapping of technological interdependence

Authors: A. Fronzetti Colladon, B. Guardabascio, F. Venturini

Abstract: How does technological interdependence affect innovation? We address this question by examining the influence of neighbors' innovativeness and the structure of the innovators' network on a sector's capacity to develop new technologies. We study these two dimensions of technological interdependence by applying novel methods of text mining and network analysis to the documents of 6.5 million patents granted by the United States Patent and Trademark Office (USPTO) between 1976 and 2021. We find that, in the long run, the influence of network linkages is as important as that of neighbor innovativeness. In the short run, however, positive shocks to neighbor innovativeness yield relatively rapid effects, while the impact of shocks strengthening network linkages manifests with delay, even though it lasts longer. Our analysis also highlights that patent text contains a wealth of information often not captured by traditional innovation metrics, such as patent citations.

replace-cross Sparkles: Unlocking Chats Across Multiple Images for Multimodal Instruction-Following Models

Authors: Yupan Huang, Zaiqiao Meng, Fangyu Liu, Yixuan Su, Nigel Collier, Yutong Lu

Abstract: Large language models exhibit enhanced zero-shot performance on various tasks when fine-tuned with instruction-following data. Multimodal instruction-following models extend these capabilities by integrating both text and images. However, existing models such as MiniGPT-4 and LLaVA face challenges in maintaining dialogue coherence in scenarios involving multiple images. A primary reason is the lack of a specialized dataset for this critical application. To bridge these gaps, we introduce SparklesDialogue, the first machine-generated dialogue dataset tailored for word-level interleaved multi-image and text interactions. Furthermore, we construct SparklesEval, a GPT-assisted benchmark for quantitatively assessing a model's conversational competence across multiple images and dialogue turns. We then present SparklesChat, a multimodal instruction-following model for open-ended dialogues across multiple images. Our experiments validate the effectiveness of training SparklesChat with SparklesDialogue based on MiniGPT-4 and LLaVA-v1.5, which enhances comprehension across multiple images and dialogue turns, and does not compromise single-image understanding capabilities. Qualitative evaluations further demonstrate SparklesChat's generality in handling real-world applications. All resources related to this study are publicly available at https://github.com/HYPJUDY/Sparkles.

URLs: https://github.com/HYPJUDY/Sparkles

replace-cross Foundation Models to the Rescue: Deadlock Resolution in Connected Multi-Robot Systems

Authors: Kunal Garg, Songyuan Zhang, Jacob Arkin, Chuchu Fan

Abstract: Connected multi-agent robotic systems (MRS) are prone to deadlocks in an obstacle environment where the robots can get stuck away from their desired locations under a smooth low-level control policy. Without an external intervention, often in terms of a high-level command, a low-level control policy cannot resolve such deadlocks. Utilizing the generalizability and low data requirements of foundation models, this paper explores the possibility of using text-based models, i.e., large language models (LLMs), and text-and-image-based models, i.e., vision-language models (VLMs), as high-level planners for deadlock resolution. We propose a hierarchical control framework where a foundation model-based high-level planner helps to resolve deadlocks by assigning a leader to the MRS along with a set of waypoints for the MRS leader. Then, a low-level distributed control policy based on graph neural networks is executed to safely follow these waypoints, thereby evading the deadlock. We conduct extensive experiments on various MRS environments using the best available pre-trained LLMs and VLMs. We compare their performance with a graph-based planner in terms of effectiveness in helping the MRS reach their target locations and computational time. Our results illustrate that, compared to grid-based planners, the foundation models perform better in terms of the goal-reaching rate and computational time for complex environments, which helps us conclude that foundation models can assist MRS operating in complex obstacle-cluttered environments to resolve deadlocks efficiently.

replace-cross Transferable and Principled Efficiency for Open-Vocabulary Segmentation

Authors: Jingxuan Xu, Wuyang Chen, Yao Zhao, Yunchao Wei

Abstract: Recent success of pre-trained foundation vision-language models makes Open-Vocabulary Segmentation (OVS) possible. Despite the promising performance, this approach introduces heavy computational overheads for two challenges: 1) large model sizes of the backbone; 2) expensive costs during the fine-tuning. These challenges hinder this OVS strategy from being widely applicable and affordable in real-world scenarios. Although traditional methods such as model compression and efficient fine-tuning can address these challenges, they often rely on heuristics. This means that their solutions cannot be easily transferred and necessitate re-training on different models, which comes at a cost. In the context of efficient OVS, we target achieving performance that is comparable to or even better than prior OVS works based on large vision-language foundation models, by utilizing smaller models that incur lower training costs. The core strategy is to make our efficiency principled and thus seamlessly transferable from one OVS framework to others without further customization. Comprehensive experiments on diverse OVS benchmarks demonstrate our superior trade-off between segmentation accuracy and computation costs over previous works. Our code is available on https://github.com/Xujxyang/OpenTrans

URLs: https://github.com/Xujxyang/OpenTrans

replace-cross Large Language Model Can Continue Evolving From Mistakes

Authors: Haokun Zhao, Haixia Han, Jie Shi, Chengyu Du, Jiaqing Liang, Yanghua Xiao

Abstract: As world knowledge evolves and new task schemas emerge, Continual Learning (CL) is crucial for keeping Large Language Models (LLMs) up-to-date and addressing their shortcomings. LLMs typically require continual instruction tuning (CIT) and continual pre-training (CPT) to adapt to new tasks and acquire essential knowledge. However, collecting sufficient CPT data while addressing knowledge gaps remains challenging, as does optimizing the efficiency of utilizing this data. Inspired by the 'summarizing mistakes' strategy, we propose the Continue Evolving from Mistakes (CEM) method, a data-efficient approach aiming to collect CPT data and continually improve LLMs' performance through iterative evaluation and supplementation with mistake-relevant knowledge. To enhance data utilization and mitigate forgetting, we introduce a novel training paradigm that combines CIT and CPT data. Experiments demonstrate that CEM significantly enhances model performance and continual evolution. The code and dataset are available on GitHub.

replace-cross A Review of Prominent Paradigms for LLM-Based Agents: Tool Use (Including RAG), Planning, and Feedback Learning

Authors: Xinzhe Li

Abstract: Tool use, planning, and feedback learning are currently three prominent paradigms for developing Large Language Model (LLM)-based agents across various tasks. Although numerous frameworks have been devised for each paradigm, their intricate workflows and inconsistent taxonomy create challenges in understanding and reviewing the frameworks across different paradigms. This survey introduces a unified taxonomy to systematically review and discuss these frameworks. Specifically, 1) the taxonomy defines environments/tasks, common LLM-profiled roles (LMPRs: policy models, evaluators, and dynamic models), and universally applicable workflows found in prior work, and 2) it enables a comparison of key perspectives on LMPR implementations and workflow usage across different agent paradigms.

replace-cross Label-Looping: Highly Efficient Decoding for Transducers

Authors: Vladimir Bataev, Hainan Xu, Daniel Galvez, Vitaly Lavrukhin, Boris Ginsburg

Abstract: This paper introduces a highly efficient greedy decoding algorithm for Transducer-based speech recognition models. We redesign the standard nested-loop design for RNN-T decoding, swapping loops over frames and labels: the outer loop iterates over labels, while the inner loop iterates over frames searching for the next non-blank symbol. Additionally, we represent partial hypotheses in a special structure using CUDA tensors, supporting parallelized hypotheses manipulations. Experiments show that the label-looping algorithm is up to 2.0X faster than conventional batched decoding when using batch size 32. It can be further combined with other compiler or GPU call-related techniques to achieve even more speedup. Our algorithm is general-purpose and can work with both conventional Transducers and Token-and-Duration Transducers. We open-source our implementation to benefit the research community.
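
To convey the loop inversion in a few lines, here is a heavily simplified, unbatched greedy decoder where the outer loop advances per emitted label and the inner loop scans frames until a non-blank symbol appears; this is only a conceptual sketch, not the paper's CUDA-batched implementation, and `joint` is a hypothetical callable wrapping the prediction and joint networks:

    def label_looping_greedy_decode(joint, encoder_frames, blank_id, max_symbols=200):
        """Simplified greedy Transducer decoding with the label-first loop order.
        `joint(frame, label_history) -> token_id` returns the argmax symbol."""
        hypothesis, t = [], 0
        while t < len(encoder_frames) and len(hypothesis) < max_symbols:
            token = joint(encoder_frames[t], hypothesis)
            # Inner loop: advance over frames until the next non-blank symbol.
            while token == blank_id:
                t += 1
                if t >= len(encoder_frames):
                    return hypothesis
                token = joint(encoder_frames[t], hypothesis)
            hypothesis.append(token)  # outer loop: one iteration per emitted label
        return hypothesis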

replace-cross Bridging Social Media and Search Engines: Dredge Words and the Detection of Unreliable Domains

Authors: Evan M. Williams, Peter Carragher, Kathleen M. Carley

Abstract: Proactive content moderation requires platforms to rapidly and continuously evaluate the credibility of websites. Leveraging the direct and indirect paths users follow to unreliable websites, we develop a website credibility classification and discovery system that integrates both webgraph and large-scale social media contexts. We additionally introduce the concept of dredge words, terms or phrases for which unreliable domains rank highly on search engines, and provide the first exploration of their usage on social media. Our graph neural networks that combine webgraph and social media contexts achieve state-of-the-art results in website credibility classification and significantly improve the top-k identification of unreliable domains. Additionally, we release a novel dataset of dredge words, highlighting their strong connections to both social media and online commerce platforms.

replace-cross Evaluating language models as risk scores

Authors: Andr\'e F. Cruz, Moritz Hardt, Celestine Mendler-D\"unner

Abstract: Current question-answering benchmarks predominantly focus on accuracy in realizable prediction tasks. Conditioned on a question and answer-key, does the most likely token match the ground truth? Such benchmarks necessarily fail to evaluate language models' ability to quantify outcome uncertainty. In this work, we focus on the use of language models as risk scores for unrealizable prediction tasks. We introduce folktexts, a software package to systematically generate risk scores using language models, and evaluate them against US Census data products. A flexible API enables the use of different prompting schemes, local or web-hosted models, and diverse census columns that can be used to compose custom prediction tasks. We demonstrate the utility of folktexts through a sweep of empirical insights into the statistical properties of 17 recent large language models across five natural text benchmark tasks. We find that zero-shot risk scores produced by multiple-choice question-answering have high predictive signal but are widely miscalibrated. Base models consistently overestimate outcome uncertainty, while instruction-tuned models underestimate uncertainty and produce over-confident risk scores. In fact, instruction-tuning polarizes the answer distribution regardless of the true underlying data uncertainty. Conversely, verbally querying models for probability estimates results in substantially improved calibration across all instruction-tuned models. These differences in the ability to quantify data uncertainty cannot be revealed in realizable settings, and highlight a blind spot in the current evaluation ecosystem that folktexts covers.
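
The underlying idea of deriving a risk score from multiple-choice question answering can be sketched with a generic causal LM as follows (this is not the folktexts API; the model, prompt, and answer tokens are placeholder assumptions):

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_name = "gpt2"  # placeholder model for illustration
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)

    def mc_risk_score(question, yes_token=" Yes", no_token=" No"):
        """Risk score = P('Yes') renormalized against P('No') for the next token,
        given a question phrased for a binary outcome."""
        ids = tok(question, return_tensors="pt").input_ids
        with torch.no_grad():
            logits = model(ids).logits[0, -1]
        yes_id = tok(yes_token, add_special_tokens=False).input_ids[0]
        no_id = tok(no_token, add_special_tokens=False).input_ids[0]
        probs = torch.softmax(logits[[yes_id, no_id]], dim=-1)
        return probs[0].item()

    print(mc_risk_score("Does this person earn above $50,000 per year? Answer:"))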

replace-cross Deep Learning for Economists

Authors: Melissa Dell

Abstract: Deep learning provides powerful methods to impute structured information from large-scale, unstructured text and image datasets. For example, economists might wish to detect the presence of economic activity in satellite images, or to measure the topics or entities mentioned in social media, the congressional record, or firm filings. This review introduces deep neural networks, covering methods such as classifiers, regression models, generative AI, and embedding models. Applications include classification, document digitization, record linkage, and methods for data exploration in massive-scale text and image corpora. When suitable methods are used, deep learning models can be cheap to tune and can scale affordably to problems involving millions or billions of data points. The review is accompanied by a companion website, EconDL, with user-friendly demo notebooks, software resources, and a knowledge base that provides technical details and additional applications.

replace-cross DrugAgent: Explainable Drug Repurposing Agent with Large Language Model-based Reasoning

Authors: Yoshitaka Inoue, Tianci Song, Tianfan Fu

Abstract: Drug repurposing offers a promising avenue for accelerating drug development by identifying new therapeutic potentials of existing drugs. In this paper, we propose a multi-agent framework to enhance the drug repurposing process using state-of-the-art machine learning techniques and knowledge integration. Our framework comprises several specialized agents: an AI Agent trains robust drug-target interaction (DTI) models; a Knowledge Graph Agent utilizes the drug-gene interaction database (DGIdb), DrugBank, Comparative Toxicogenomics Database (CTD), and Search Tool for Interactions of Chemicals (STITCH) to systematically extract DTIs; and a Search Agent interacts with biomedical literature to annotate and verify computational predictions. By integrating outputs from these agents, our system effectively harnesses diverse data sources, including external databases, to propose viable repurposing candidates. Preliminary results demonstrate the potential of our approach in not only predicting drug-disease interactions but also in reducing the time and cost associated with traditional drug discovery methods. This paper highlights the scalability of multi-agent systems in biomedical research and their role in driving innovation in drug repurposing. Our approach not only outperforms existing methods in predicting drug repurposing potential but also provides interpretable results, paving the way for more efficient and cost-effective drug discovery processes.

replace-cross LLM-Powered Ensemble Learning for Paper Source Tracing: A GPU-Free Approach

Authors: Kunlong Chen, Junjun Wang, Zhaoqun Chen, Kunjin Chen, Yitian Chen

Abstract: We participated in the KDD CUP 2024 paper source tracing competition and achieved the 3rd place. This competition tasked participants with identifying the reference sources (i.e., ref-sources, as referred to by the organizers of the competition) of given academic papers. Unlike most teams that addressed this challenge by fine-tuning pre-trained neural language models such as BERT or ChatGLM, our primary approach utilized closed-source large language models (LLMs). With recent advancements in LLM technology, closed-source LLMs have demonstrated the capability to tackle complex reasoning tasks in zero-shot or few-shot scenarios. Consequently, in the absence of GPUs, we employed closed-source LLMs to directly generate predicted reference sources from the provided papers. We further refined these predictions through ensemble learning. Notably, our method was the only one among the award-winning approaches that did not require the use of GPUs for model training. Code available at https://github.com/Cklwanfifa/KDDCUP2024-PST.

URLs: https://github.com/Cklwanfifa/KDDCUP2024-PST

replace-cross ExploreSelf: Fostering User-driven Exploration and Reflection on Personal Challenges with Adaptive Guidance by Large Language Models

Authors: Inhwa Song, SoHyun Park, Sachin R. Pendse, Jessica Lee Schleider, Munmun De Choudhury, Young-Ho Kim

Abstract: Expressing stressful experiences in words is proven to improve mental and physical health, but individuals often disengage with writing interventions as they struggle to organize their thoughts and emotions. Reflective prompts have been used to provide direction, and large language models (LLMs) have demonstrated the potential to provide tailored guidance. Current systems often limit users' flexibility to direct their reflections. We thus present ExploreSelf, an LLM-driven application designed to empower users to control their reflective journey. ExploreSelf allows users to receive adaptive support through dynamically generated questions. Through an exploratory study with 19 participants, we examine how participants explore and reflect on personal challenges using ExploreSelf. Our findings demonstrate that participants valued the balance between guided support and freedom to control their reflective journey, leading to deeper engagement and insight. Building on our findings, we discuss implications for designing LLM-driven tools that promote user empowerment through effective reflective practices.