Authors: Shuo Yu (Anhui Province Key Laboratory of Big Data Analysis and Application, University of Science and Technology of China, State Key Laboratory of Cognitive Intelligence), Mingyue Cheng (Anhui Province Key Laboratory of Big Data Analysis and Application, University of Science and Technology of China, State Key Laboratory of Cognitive Intelligence), Jiqian Yang (Anhui Province Key Laboratory of Big Data Analysis and Application, University of Science and Technology of China, State Key Laboratory of Cognitive Intelligence), Jie Ouyang (Anhui Province Key Laboratory of Big Data Analysis and Application, University of Science and Technology of China, State Key Laboratory of Cognitive Intelligence)
Abstract: Retrieval-Augmented Generation (RAG) enhances generative models by integrating retrieval mechanisms, which allow these models to access and utilize external knowledge sources. Despite its advantages, RAG encounters significant challenges, particularly in effectively handling real-world queries and mitigating hallucinations. The KDD Cup 2024 CRAG competition brings these issues to the forefront by incorporating both web pages and a mock API as knowledge sources, adding the complexity of parsing HTML before large language models (LLMs) can process the information. In this paper, we propose a novel RAG benchmark designed to address these challenges. Our work provides a comprehensive set of experimental results, offering valuable insights for the study of RAG. We thoroughly examine the entire RAG process, including knowledge source selection, retrieval, organization, and reasoning. Key findings from our study include the impact of automated knowledge source selection using agents and the influence of noise chunks on RAG reasoning. Additionally, we conduct detailed experiments to analyze the effects of various hyperparameters on RAG performance. To support further research, we have made our results, the associated code, and a parsed version of the CRAG dataset publicly available\footnote{https://github.com/USTCAGI/RAG-X}, contributing to the advancement of RAG methodologies and establishing a solid foundation for future work in this domain.
Authors: Yun Joon Soh, Hanxian Huang, Yuandong Tian, Jishen Zhao
Abstract: Supporting longer context for Large Language Models (LLM) is a promising direction to advance LLMs. As training a model for a longer context window is computationally expensive, many alternative solutions, such as Retrieval Augmented Generation (RAG), have been used. However, most existing RAG methods adopt embedding-based retrieval that falls short on long contexts. To address such challenges, we propose an attention-based retrieval technique, You Only Use Reactive Attention slice (YOURA). YOURA leverages a novel retrieval heuristic called reaction score to rank the relevance of each sentence in the input context with the query sentence. Intuitively, we measure how the per-token attention score "reacts" to the query and greedily retrieves the most reactive sentences. Internally, YOURA generates a token-indexed vector (called reaction vector) for the whole input context. To map each sentence to the token-indexed vector, we propose an Embedding-Agnostic Sentence Yield (EASY), a best-effort token wiggling algorithm. We evaluate our retrieval technique on three open-source pre-trained LLM models across six LongBench QA datasets. Our technique achieves up to 30% vLLM inference throughput improvement for serving long-context queries with a nearly identical quality score to the simple yet effective truncate-middle approach.
Authors: Aman Bhargava, Cameron Witkowski, Alexander Detkov, Matt Thomson
Abstract: Two primary ways to change LLM behavior are prompting and weight updates (e.g., fine-tuning). Prompting LLMs is simple and effective, specifying the desired changes explicitly in natural language, whereas weight updates provide more expressive and permanent behavior changes, specified implicitly via training on large datasets. We present a technique for "baking" prompts into the weights of an LLM. Prompt Baking converts a prompt $u$ and initial weights $\theta$ to a new set of weights $\theta_u$ such that new "baked" LLM behaves like the original prompted LLM. Mathematically, we minimize the KL divergence between $P_\theta(\cdot | u)$ and $P_{\theta_u}(\cdot)$, where $P$ is the LLM's probability distribution over token sequences. Across all our experiments, we find prompts can be readily baked into weight updates. Baking chain-of-thought prompts improves zero-shot performance on GSM8K, ASDiv, MBPP, ARC-Easy, ARC-Challenge, and CommonsenseQA benchmarks. Baking news headlines directly updates an LLM's knowledge. And baking instructions & personas alleviates "prompt forgetting" over long sequences. Furthermore, stopping baking early creates "half-baked" models, continuously scaling prompt strength. Baked models retain their sensitivity to further prompting and baking, including re-prompting with the baked-in prompt. Surprisingly, the re-prompted models yield further performance gains in instruction following, as well as math reasoning and coding benchmarks. Taking re-prompting and re-baking to the limit yields a form of iterative self-improvement we call Prompt Pursuit, and preliminary results on instruction following exhibit dramatic performance gains. Finally, we discuss implications for AI safety, continuous model updating, enhancing real-time learning capabilities in LLM-based agents, and generating more stable AI personas.
Authors: Genshun Wan, Mengzhi Wang, Tingzhi Mao, Hang Chen, Zhongfu Ye
Abstract: The transducer model trained based on sequence-level criterion requires a lot of memory due to the generation of the large probability matrix. We proposed a lightweight transducer model based on frame-level criterion, which uses the results of the CTC forced alignment algorithm to determine the label for each frame. Then the encoder output can be combined with the decoder output at the corresponding time, rather than adding each element output by the encoder to each element output by the decoder as in the transducer. This significantly reduces memory and computation requirements. To address the problem of imbalanced classification caused by excessive blanks in the label, we decouple the blank and non-blank probabilities and truncate the gradient of the blank classifier to the main network. This enables the lightweight transducer achieving similar results to transducer. Additionally, we use richer information to predict the probability of blank, achieving superior results to transducer.
Authors: Minghao Liu, Mingxiu Sui, Cangqing Wang, Zhejie Zhou
Abstract: Effective communication in automated chat systems hinges on the ability to understand and respond to context. Traditional models often struggle with determining when additional context is necessary for generating appropriate responses. This paper introduces Context-Aware BERT (CA-BERT), a transformer-based model specifically fine-tuned to address this challenge. CA-BERT innovatively applies deep learning techniques to discern context necessity in multi-turn chat interactions, enhancing both the relevance and accuracy of responses. We describe the development of CA-BERT, which adapts the robust architecture of BERT with a novel training regimen focused on a specialized dataset of chat dialogues. The model is evaluated on its ability to classify context necessity, demonstrating superior performance over baseline BERT models in terms of accuracy and efficiency. Furthermore, CA-BERT's implementation showcases significant reductions in training time and resource usage, making it feasible for real-time applications. The results indicate that CA-BERT can effectively enhance the functionality of chatbots by providing a nuanced understanding of context, thereby improving user experience and interaction quality in automated systems. This study not only advances the field of NLP in chat applications but also provides a framework for future research into context-sensitive AI developments.
Authors: Josiane Mothe (IRIT-SIG)
Abstract: Isidore of Seville is credited with the adage that it is language that gives birth to a people, and not the other way around , underlining the profound role played by language in the formation of cultural and social identity. Today, of the more than 7100 languages listed, a significant number are endangered. Since the 1970s, linguists, information seekers and enthusiasts have helped develop digital resources and automatic tools to support a wide range of languages, including endangered ones. The advent of Large Language Model (LLM) technologies holds both promise and peril. They offer unprecedented possibilities for the translation and generation of content and resources, key elements in the preservation and revitalisation of languages. They also present threat of homogenisation, cultural oversimplification and the further marginalisation of already vulnerable languages. The talk this paper is based on has proposed an initiatory journey, exploring the potential paths and partnerships between technology and tradition, with a particular focus on the Occitan language. Occitan is a language from Southern France, parts of Spain and Italy that played a major cultural and economic role, particularly in the Middle Ages. It is now endangered according to UNESCO. The talk critically has examined how human expertise and artificial intelligence can work together to offer hope for preserving the linguistic diversity that forms the foundation of our global and especially our European heritage while addressing some of the ethical and practical challenges that accompany the use of these powerful technologies. This paper is based on the keynote I gave at the 46th European Conference on Information Retrieval (ECIR 2024). As an alternative to reading this paper, a video talk is available online. 1 Date: 26 March 2024.
Authors: Panagiotis Koletsis, Panagiotis-Konstantinos Gemos, Christos Chronis, Iraklis Varlamis, Vasilis Efthymiou, Georgios Th. Papadopoulos
Abstract: The rise of financial crime that has been observed in recent years has created an increasing concern around the topic and many people, organizations and governments are more and more frequently trying to combat it. Despite the increase of interest in this area, there is a lack of specialized datasets that can be used to train and evaluate works that try to tackle those problems. This article proposes a new micro-benchmark dataset for algorithms and models that identify individuals and organizations, and their multiple writings, in news articles, and presents an approach that assists in its creation. Experimental efforts are also reported, using this dataset, to identify individuals and organizations in financial-crime-related articles using various low-billion parameter Large Language Models (LLMs). For these experiments, standard metrics (Accuracy, Precision, Recall, F1 Score) are reported and various prompt variants comprising the best practices of prompt engineering are tested. In addition, to address the problem of ambiguous entity mentions, a simple, yet effective LLM-based disambiguation method is proposed, ensuring that the evaluation aligns with reality. Finally, the proposed approach is compared against a widely used state-of-the-art open-source baseline, showing the superiority of the proposed method.
Authors: Olivia Sturman, Aparna Joshi, Bhaktipriya Radharapu, Piyush Kumar, Renee Shelby
Abstract: Increasing use of large language models (LLMs) demand performant guardrails to ensure the safety of inputs and outputs of LLMs. When these safeguards are trained on imbalanced data, they can learn the societal biases. We present a light-weight, post-processing method for mitigating counterfactual fairness in closed-source text safety classifiers. Our approach involves building an ensemble that not only outperforms the input classifiers and policy-aligns them, but also acts as a debiasing regularizer. We introduce two threshold-agnostic metrics to assess the counterfactual fairness of a model, and demonstrate how combining these metrics with Fair Data Reweighting (FDW) helps mitigate biases. We create an expanded Open AI dataset, and a new templated LLM-generated dataset based on user-prompts, both of which are counterfactually balanced across identity groups and cover four key areas of safety; we will work towards publicly releasing these datasets. Our results show that our approach improves counterfactual fairness with minimal impact on model performance.
Authors: Joseph Lam (Great Ormond Street Institute of Child Health, University College London, UK), Mario Cortina-Borja (Great Ormond Street Institute of Child Health, University College London, UK), Robert Aldridge (Institute for Health Metrics and Evaluation, University of Washington, USA), Ruth Blackburn (Great Ormond Street Institute of Child Health, University College London, UK), Katie Harron (Great Ormond Street Institute of Child Health, University College London, UK)
Abstract: Data linkage is increasingly used in health research and policy making and is relied on for understanding health inequalities. However, linked data is only as useful as the underlying data quality, and differential linkage rates may induce selection bias in the linked data. A mechanism that selectively compromises data quality is name romanisation. Converting text of a different writing system into Latin based writing, or romanisation, has long been the standard process of representing names in character-based writing systems such as Chinese, Vietnamese, and other languages such as Swahili. Unstandardised romanisation of Chinese characters, due in part to problems of preserving the correct name orders the lack of proper phonetic representation of a tonal language, has resulted in poor linkage rates for Chinese immigrants. This opinion piece aims to suggests that the use of standardised romanisation systems for Cantonese (Jyutping) or Mandarin (Pinyin) Chinese, which incorporate tonal information, could improve linkage rates and accuracy for individuals with Chinese names. We used 771 Chinese and English names scraped from openly available sources, and compared the utility of Jyutping, Pinyin and the Hong Kong Government Romanisation system (HKG-romanisation) for representing Chinese names. We demonstrate that both Jyutping and Pinyin result in fewer errors compared with the HKG-romanisation system. We suggest that collecting and preserving people's names in their original writing systems is ethically and socially pertinent. This may inform development of language-specific pre-processing and linkage paradigms that result in more inclusive research data which better represents the targeted populations.
Authors: Art\=urs Kanepajs, Vladimir Ivanov, Richard Moulange
Abstract: Linguistically inclusive LLMs -- which maintain good performance regardless of the language with which they are prompted -- are necessary for the diffusion of AI benefits around the world. Multilingual jailbreaks that rely on language translation to evade safety measures undermine the safe and inclusive deployment of AI systems. We provide policy recommendations to enhance the multilingual capabilities of AI while mitigating the risks of multilingual jailbreaks. We quantitatively assess the relationship between language resourcedness and model vulnerabilities to multilingual jailbreaks for five frontier large language models across 24 official EU languages. Building on prior research, we propose policy actions that align with the EU legal landscape and institutional framework to address multilingual jailbreaks, while promoting linguistic inclusivity. These include mandatory assessments of multilingual capabilities and vulnerabilities, public opinion research, and state support for multilingual AI development. The measures aim to improve AI safety and functionality through EU policy initiatives, guiding the implementation of the EU AI Act and informing regulatory efforts of the European AI Office.
Authors: Margherita Martorana, Xueli Pan, Benno Kruit, Tobias Kuhn, Jacco van Ossenbruggen
Abstract: Traditional Semantic Table Interpretation (STI) methods rely primarily on the underlying table data to create semantic annotations. This year's SemTab challenge introduced the ``Metadata to KG'' track, which focuses on performing STI by using only metadata information, without access to the underlying data. In response to this new challenge, we introduce a new term: Column Vocabulary Association (CVA). This term refers to the task of semantic annotation of column headers solely based on metadata information. In this study, we evaluate the performance of various methods in executing the CVA task, including a Large Language Models (LLMs) and Retrieval Augmented Generation (RAG) approach, as well as a more traditional similarity approach with SemanticBERT. Our methodology uses a zero-shot setting, with no pretraining or examples passed to the Large Language Models (LLMs), as we aim to avoid a domain-specific setting. We investigate a total of 7 different LLMs, of which three commercial GPT models (i.e. gpt-3.5-turbo-0.125, gpt-4o and gpt-4-turbo) and four open source models (i.e. llama3-80b, llama3-7b, gemma-7b and mixtral-8x7b). We integrate this models with RAG systems, and we explore how variations in temperature settings affect performances. Moreover, we continue our investigation by performing the CVA task utilizing SemanticBERT, analyzing how various metadata information influence its performance. Initial findings indicate that LLMs generally perform well at temperatures below 1.0, achieving an accuracy of 100\% in certain cases. Nevertheless, our investigation also reveal that the nature of the data significantly influences CVA task outcomes. In fact, in cases where the input data and glossary are related (for example by being created by the same organizations) traditional methods appear to surpass the performance of LLMs.
Authors: Stefan Heimersheim
Abstract: The LayerNorm (LN) layer in GPT-style transformer models has long been a hindrance to mechanistic interpretability. LN is a crucial component required to stabilize the training of large language models, and LN or the similar RMSNorm have been used in practically all large language models based on the transformer architecture. The non-linear nature of the LN layers is a hindrance for mechanistic interpretability as it hinders interpretation of the residual stream, and makes it difficult to decompose the model into circuits. Some research have gone so far as to name "reasons interpretability researchers hate layer norm". In this paper we show that it is possible to remove the LN layers from a pre-trained GPT2-small model by fine-tuning on a fraction (500M tokens) of the training data. We demonstrate that this LN-free model achieves similar performance to the original model on the OpenWebText and ThePile datasets (-0.05 cross-entropy loss), and the Hellaswag benchmark (-0.5% accuracy). We provide the fine-tuning procedure and a Hugging Face repository with the fine-tuned GPT2-small models. Our work not only provides a simplified model for mechanistic interpretability research, but also provides evidence that the LN layers, at inference time, do not play a crucial role in transformer models.
Authors: Yi Xu, Bo Xue, Shuqian Sheng, Cheng Deng, Jiaxin Ding, Zanwei Shen, Luoyi Fu, Xinbing Wang, Chenghu Zhou
Abstract: In the ever-expanding landscape of academic research, the proliferation of ideas presents a significant challenge for researchers: discerning valuable ideas from the less impactful ones. The ability to efficiently evaluate the potential of these ideas is crucial for the advancement of science and paper review. In this work, we focus on idea assessment, which aims to leverage the knowledge of large language models to assess the merit of scientific ideas. First, we investigate existing text evaluation research and define the problem of quantitative evaluation of ideas. Second, we curate and release a benchmark dataset from nearly four thousand manuscript papers with full texts, meticulously designed to train and evaluate the performance of different approaches to this task. Third, we establish a framework for quantifying the value of ideas by employing representations in a specific layer of large language models. Experimental results show that the scores predicted by our method are relatively consistent with those of humans. Our findings suggest that the representations of large language models hold more potential in quantifying the value of ideas than their generative outputs, demonstrating a promising avenue for automating the idea assessment process.
Authors: Bayode Ogunleye, Hemlata Sharma, Olamilekan Shobayo
Abstract: The World Health Organisation (WHO) revealed approximately 280 million people in the world suffer from depression. Yet, existing studies on early-stage depression detection using machine learning (ML) techniques are limited. Prior studies have applied a single stand-alone algorithm, which is unable to deal with data complexities, prone to overfitting, and limited in generalization. To this end, our paper examined the performance of several ML algorithms for early-stage depression detection using two benchmark social media datasets (D1 and D2). More specifically, we incorporated sentiment indicators to improve our model performance. Our experimental results showed that sentence bidirectional encoder representations from transformers (SBERT) numerical vectors fitted into the stacking ensemble model achieved comparable F1 scores of 69% in the dataset (D1) and 76% in the dataset (D2). Our findings suggest that utilizing sentiment indicators as an additional feature for depression detection yields an improved model performance, and thus, we recommend the development of a depressive term corpus for future work.
Authors: Hannes Thurnherr, J\'er\'emy Scheurer
Abstract: Achieving a mechanistic understanding of transformer-based language models is an open challenge, especially due to their large number of parameters. Moreover, the lack of ground truth mappings between model weights and their functional roles hinders the effective evaluation of interpretability methods, impeding overall progress. Tracr, a method for generating compiled transformers with inherent ground truth mappings in RASP, has been proposed to address this issue. However, manually creating a large number of models needed for verifying interpretability methods is labour-intensive and time-consuming. In this work, we present a novel approach for generating interpretability test beds using large language models (LLMs) and introduce TracrBench, a novel dataset consisting of 121 manually written and LLM-generated, human-validated RASP programs and their corresponding transformer weights. During this process, we evaluate the ability of frontier LLMs to autonomously generate RASP programs and find that this task poses significant challenges. GPT-4-turbo, with a 20-shot prompt and best-of-5 sampling, correctly implements only 57 out of 101 test programs, necessitating the manual implementation of the remaining programs. With its 121 samples, TracrBench aims to serve as a valuable testbed for evaluating and comparing interpretability methods.
Authors: Maria Tsfasman, Bernd Dudzik, Kristian Fenech, Andras Lorincz, Catholijn M. Jonker, Catharine Oertel
Abstract: The quality of human social relationships is intricately linked to human memory processes, with memory serving as the foundation for the creation of social bonds. Since human memory is selective, differing recollections of the same events within a group can lead to misunderstandings and misalignments in what is perceived to be common ground in the group. Yet, conversational facilitation systems, aimed at advancing the quality of group interactions, usually focus on tracking users' states within an individual session, ignoring what remains in each participant's memory after the interaction. Conversational memory is the process by which humans encode, retain and retrieve verbal, non-verbal and contextual information from a conversation. Understanding conversational memory can be used as a source of information on the long-term development of social connections within a group. This paper introduces the MeMo corpus, the first conversational dataset annotated with participants' memory retention reports, aimed at facilitating computational modelling of human conversational memory. The MeMo corpus includes 31 hours of small-group discussions on the topic of Covid-19, repeated over the term of 2 weeks. It integrates validated behavioural and perceptual measures, and includes audio, video, and multimodal annotations, offering a valuable resource for studying and modelling conversational memory and group dynamics. By introducing the MeMo corpus, presenting an analysis of its validity, and demonstrating its usefulness for future research, this paper aims to pave the way for future research in conversational memory modelling for intelligent system development.
Authors: Yiheng Wu, Junhui Li, Muhua Zhu
Abstract: Previous approaches to the task of implicit discourse relation recognition (IDRR) generally view it as a classification task. Even with pre-trained language models, like BERT and RoBERTa, IDRR still relies on complicated neural networks with multiple intermediate layers to proper capture the interaction between two discourse units. As a result, the outputs of these intermediate layers may have different capability in discriminating instances of different classes. To this end, we propose to adapt a supervised contrastive learning (CL) method, label- and instance-centered CL, to enhance representation learning. Moreover, we propose a novel constrained multi-layer CL approach to properly impose a constraint that the contrastive loss of higher layers should be smaller than that of lower layers. Experimental results on PDTB 2.0 and PDTB 3.0 show that our approach can significantly improve the performance on both multi-class classification and binary classification.
Authors: Yiheng Wu, Roman Yangarber, Xian Mao
Abstract: The remarkable capabilities of Large Language Models (LLMs) in text comprehension and generation have revolutionized Information Extraction (IE). One such advancement is in Document-level Relation Triplet Extraction (DocRTE), a critical task in information systems that aims to extract entities and their semantic relationships from documents. However, existing methods are primarily designed for Sentence level Relation Triplet Extraction (SentRTE), which typically handles a limited set of relations and triplet facts within a single sentence. Additionally, some approaches treat relations as candidate choices integrated into prompt templates, resulting in inefficient processing and suboptimal performance when determining the relation elements in triplets. To address these limitations, we introduce a Discriminative and Voice Aware Paradigm DiVA. DiVA involves only two steps: performing document-level relation extraction (DocRE) and then identifying the subject object entities based on the relation. No additional processing is required simply input the document to directly obtain the triplets. This streamlined process more accurately reflects real-world scenarios for triplet extraction. Our innovation lies in transforming DocRE into a discriminative task, where the model pays attention to each relation and to the often overlooked issue of active vs. passive voice within the triplet. Our experiments on the Re-DocRED and DocRED datasets demonstrate state-of-the-art results for the DocRTE task.
Authors: Linkai Zhu, Lu Yang, Chaofan Li, Shanwen Hu, Lu Liu, Bin Yin
Abstract: Ensuring compliance with international data protection standards for privacy and data security is a crucial but complex task, often requiring substantial legal expertise. This paper introduces LegiLM, a novel legal language model specifically tailored for consulting on data or information compliance. LegiLM leverages a pre-trained GDPR Fines dataset and has been fine-tuned to automatically assess whether particular actions or events breach data security and privacy regulations. By incorporating a specialized dataset that includes global data protection laws, meticulously annotated policy documents, and relevant privacy policies, LegiLM is optimized for addressing data compliance challenges. The model integrates advanced legal reasoning methods and information retrieval enhancements to enhance accuracy and reliability in practical legal consulting scenarios. Our evaluation using a custom benchmark dataset demonstrates that LegiLM excels in detecting data regulation breaches, offering sound legal justifications, and recommending necessary compliance modifications, setting a new benchmark for AI-driven legal compliance solutions. Our resources are publicly available at https://github.com/DAOLegalAI/LegiLM
Authors: Diego Calanzone, Stefano Teso, Antonio Vergari
Abstract: Large language models (LLMs) are a promising venue for natural language understanding and generation. However, current LLMs are far from reliable: they are prone to generating non-factual information and, more crucially, to contradicting themselves when prompted to reason about relations between entities of the world. These problems are currently addressed with large scale fine-tuning or by delegating reasoning to external tools. In this work, we strive for a middle ground and introduce a loss based on neuro-symbolic reasoning that teaches an LLM to be logically consistent with an external set of facts and rules and improves self-consistency even when the LLM is fine-tuned on a limited set of facts. Our approach also allows to easily combine multiple logical constraints at once in a principled way, delivering LLMs that are more consistent w.r.t. all constraints and improve over several baselines w.r.t. a given constraint. Moreover, our method allows LLMs to extrapolate to unseen but semantically similar factual knowledge, represented in unseen datasets, more systematically.
Authors: Oghenefejiro Isaacs Anigboro, Charlie M. Crawford, Dana\"e Metaxa, Sorelle A. Friedler
Abstract: Automated content moderation has long been used to help identify and filter undesired user-generated content online. Generative AI systems now use such filters to keep undesired generated content from being created by or shown to users. From classrooms to Hollywood, as generative AI is increasingly used for creative or expressive text generation, whose stories will these technologies allow to be told, and whose will they suppress? In this paper, we define and introduce measures of speech suppression, focusing on speech related to different identity groups incorrectly filtered by a range of content moderation APIs. Using both short-form, user-generated datasets traditional in content moderation and longer generative AI-focused data, including two datasets we introduce in this work, we create a benchmark for measurement of speech suppression for nine identity groups. Across one traditional and four generative AI-focused automated content moderation services tested, we find that identity-related speech is more likely to be incorrectly suppressed than other speech except in the cases of a few non-marginalized groups. Additionally, we find differences between APIs in their abilities to correctly moderate generative AI content.
Authors: Marius Funk, Shogo Okada, Elisabeth Andr\'e
Abstract: Non-verbal behavior is a central challenge in understanding the dynamics of a conversation and the affective states between interlocutors arising from the interaction. Although psychological research has demonstrated that non-verbal behaviors vary across cultures, limited computational analysis has been conducted to clarify these differences and assess their impact on engagement recognition. To gain a greater understanding of engagement and non-verbal behaviors among a wide range of cultures and language spheres, in this study we conduct a multilingual computational analysis of non-verbal features and investigate their role in engagement and engagement prediction. To achieve this goal, we first expanded the NoXi dataset, which contains interaction data from participants living in France, Germany, and the United Kingdom, by collecting session data of dyadic conversations in Japanese and Chinese, resulting in the enhanced dataset NoXi+J. Next, we extracted multimodal non-verbal features, including speech acoustics, facial expressions, backchanneling and gestures, via various pattern recognition techniques and algorithms. Then, we conducted a statistical analysis of listening behaviors and backchannel patterns to identify culturally dependent and independent features in each language and common features among multiple languages. These features were also correlated with the engagement shown by the interlocutors. Finally, we analyzed the influence of cultural differences in the input features of LSTM models trained to predict engagement for five language datasets. A SHAP analysis combined with transfer learning confirmed a considerable correlation between the importance of input features for a language set and the significant cultural characteristics analyzed.
Authors: Judit M Wulcan, Kevin L Jacques, Mary Ann Lee, Samantha L Kovacs, Nicole Dausend, Lauren E Prince, Jonatan Wulcan, Sina Marsilio, Stefan M Keller
Abstract: Large language models (LLMs) can extract information from veterinary electronic health records (EHRs), but performance differences between models, the effect of temperature settings, and the influence of text ambiguity have not been previously evaluated. This study addresses these gaps by comparing the performance of GPT-4 omni (GPT-4o) and GPT-3.5 Turbo under different conditions and investigating the relationship between human interobserver agreement and LLM errors. The LLMs and five humans were tasked with identifying six clinical signs associated with Feline chronic enteropathy in 250 EHRs from a veterinary referral hospital. At temperature 0, the performance of GPT-4o compared to the majority opinion of human respondents, achieved 96.9% sensitivity (interquartile range [IQR] 92.9-99.3%), 97.6% specificity (IQR 96.5-98.5%), 80.7% positive predictive value (IQR 70.8-84.6%), 99.5% negative predictive value (IQR 99.0-99.9%), 84.4% F1 score (IQR 77.3-90.4%), and 96.3% balanced accuracy (IQR 95.0-97.9%). The performance of GPT-4o was significantly better than that of its predecessor, GPT-3.5 Turbo, particularly with respect to sensitivity where GPT-3.5 Turbo only achieved 81.7% (IQR 78.9-84.8%). Adjusting the temperature for GPT-4o did not significantly impact classification performance. GPT-4o demonstrated greater reproducibility than human pairs regardless of temperature, with an average Cohen's kappa of 0.98 (IQR 0.98-0.99) at temperature 0 compared to 0.8 (IQR 0.78-0.81) for humans. Most GPT-4o errors occurred in instances where humans disagreed (35/43 errors, 81.4%), suggesting that these errors were more likely caused by ambiguity of the EHR than explicit model faults. Using GPT-4o to automate information extraction from veterinary EHRs is a viable alternative to manual extraction.
Authors: Anna M\'esz\'aros, Szilvia Ujv\'ary, Wieland Brendel, Patrik Reizinger, Ferenc Husz\'ar
Abstract: LLMs show remarkable emergent abilities, such as inferring concepts from presumably out-of-distribution prompts, known as in-context learning. Though this success is often attributed to the Transformer architecture, our systematic understanding is limited. In complex real-world data sets, even defining what is out-of-distribution is not obvious. To better understand the OOD behaviour of autoregressive LLMs, we focus on formal languages, which are defined by the intersection of rules. We define a new scenario of OOD compositional generalization, termed rule extrapolation. Rule extrapolation describes OOD scenarios, where the prompt violates at least one rule. We evaluate rule extrapolation in formal languages with varying complexity in linear and recurrent architectures, the Transformer, and state space models to understand the architectures' influence on rule extrapolation. We also lay the first stones of a normative theory of rule extrapolation, inspired by the Solomonoff prior in algorithmic information theory.
Authors: Zhen Yang, Jinhao Chen, Zhengxiao Du, Wenmeng Yu, Weihan Wang, Wenyi Hong, Zhihuan Jiang, Bin Xu, Yuxiao Dong, Jie Tang
Abstract: Large language models (LLMs) have demonstrated significant capabilities in mathematical reasoning, particularly with text-based mathematical problems. However, current multi-modal large language models (MLLMs), especially those specialized in mathematics, tend to focus predominantly on solving geometric problems but ignore the diversity of visual information available in other areas of mathematics. Moreover, the geometric information for these specialized mathematical MLLMs is derived from several public datasets, which are typically limited in diversity and complexity. To address these limitations, we aim to construct a fine-tuning dataset named MathVL, and develop a series of specialized mathematical MLLMs termed MathGLM-Vision by conducting Supervised Fine-Tuning (SFT) on MathVL with various parameter-scale backbones. To extensively evaluate the effectiveness of MathGLM-Vision, we conduct experiments on several public benchmarks and our curated MathVL-test consisting of 2,000 problems. Experimental results demonstrate that MathGLM-Vision achieves significant improvements compared with some existing models, including backbone models and open-source mathematical MLLMs. These findings indicate the importance of diversity dataset in enhancing the mathematical reasoning abilities of MLLMs.
Authors: Lei Liang, Mengshu Sun, Zhengke Gui, Zhongshu Zhu, Zhouyu Jiang, Ling Zhong, Yuan Qu, Peilong Zhao, Zhongpu Bo, Jin Yang, Huaidong Xiong, Lin Yuan, Jun Xu, Zaoyang Wang, Wen Zhang, Huajun Chen, Zhiqiang Zhang, Jun Zhou
Abstract: The recently developed retrieval-augmented generation (RAG) technology enables the efficient construction of domain-specific applications. However, it faces limitations due to fuzzy retrieval processes, the "hallucination" problem of understanding and reasoning capabilities of general language models, and cascading losses in complex systems. These challenges hinder the effectiveness of specialized knowledge services. However, in scenarios such as scientific computing, medicine, and law, the accuracy of knowledge, the completeness of information, and the logical rigor of rules, time, and values are particularly critical. We Introduce professional domain knowledge service framework: Knowledge Augmented Generation(KAG) to improve generation and reasoning performance by bidirectionally enhancing large language model(LLM)s and knowledge graph(KG)s, including five key enhancements: 1) LLM-friendly knowledge semantic representation, 2) mutual indexing between knowledge graph and original chunks, 3) logicalform-guided hybrid reasoning and solving, 4) Knowledge alignment based on semantic reasoning, 5) Model for KAG. We compared KAG with existing RAG methods in multi-hop question answering. The results show that KAG performs significantly better than the state-of-the-art methods, with a relative improvement from 19.6% to 33.4% in F1. We apply KAG to two professional knowledge Q&A tasks of Ant Group, including E-Goverment Q&A and E-Health Q&A, and has achieved significant improvement in professionalism compared with NaiveRAG. We will soon natively support KAG on the open source KG engine OpenSPG, allowing developers to more easily build rigorous knowledge decision-making or convenient information retrieval services.
Authors: HuangChao Xu, Baohua Zhang, Zhong Jin, Tiannian Zhu, Quansheng Wu, Hongming Weng
Abstract: Large language models (LLMs), such as ChatGPT, have demonstrated impressive performance in the text generation task, showing the ability to understand and respond to complex instructions. However, the performance of naive LLMs in speciffc domains is limited due to the scarcity of domain-speciffc corpora and specialized training. Moreover, training a specialized large-scale model necessitates signiffcant hardware resources, which restricts researchers from leveraging such models to drive advances. Hence, it is crucial to further improve and optimize LLMs to meet speciffc domain demands and enhance their scalability. Based on the condensed matter data center, we establish a material knowledge graph (MaterialsKG) and integrate it with literature. Using large language models and prompt learning, we develop a specialized dialogue system for topological materials called TopoChat. Compared to naive LLMs, TopoChat exhibits superior performance in structural and property querying, material recommendation, and complex relational reasoning. This system enables efffcient and precise retrieval of information and facilitates knowledge interaction, thereby encouraging the advancement on the ffeld of condensed matter materials.
Authors: Kuan Wang, Alexander Bukharin, Haoming Jiang, Qingyu Yin, Zhengyang Wang, Tuo Zhao, Jingbo Shang, Chao Zhang, Bing Yin, Xian Li, Jianshu Chen, Shiyang Li
Abstract: Instruction fine-tuning (IFT) elicits instruction following capabilities and steers the behavior of large language models (LLMs) via supervised learning. However, existing models trained on open-source IFT datasets only have the ability to follow instructions from users, and often fail to follow complex role and rules specified by developers, a.k.a. system prompts. The ability to follow these roles and rules is essential for deployment, as it ensures that the model safely interacts with users within developer defined guidelines. To improve such role and rule following ability, we propose \model, an automated data generation pipeline that generates diverse roles and rules from existing IFT instructions, along with corresponding responses. This data can then be used to train models that follow complex system prompts. The models are evaluated on our newly created benchmarks for role and rule following ability, as well as standard instruction-following benchmarks and general NLP tasks. Our framework significantly improves role and rule following capability in LLMs, as evidenced by over 25% increase in pass-rate on rule adherence, i.e. following all requirements, in our experiments with the Alpaca and Ultrachat datasets. Moreover, our models achieves this increase without any regression on popular instruction following benchmarks.
Authors: Abdulhady Abas Abdullah, Sabat Salih Muhamad, Hadi Veisi
Abstract: The ability to synthesize spoken language from text has greatly facilitated access to digital content with the advances in text-to-speech technology. However, effective TTS development for low-resource languages, such as Central Kurdish (CKB), still faces many challenges due mainly to the lack of linguistic information and dedicated resources. In this paper, we improve the Kurdish TTS system based on Tacotron by training the Kurdish WaveGlow vocoder on a 21-hour central Kurdish speech corpus instead of using a pre-trained English vocoder WaveGlow. Vocoder training on the target language corpus is required to accurately and fluently adapt phonetic and prosodic changes in Kurdish language. The effectiveness of these enhancements is that our model is significantly better than the baseline system with English pretrained models. In particular, our adaptive WaveGlow model achieves an impressive MOS of 4.91, which sets a new benchmark for Kurdish speech synthesis. On one hand, this study empowers the advanced features of the TTS system for Central Kurdish, and on the other hand, it opens the doors for other dialects in Kurdish and other related languages to further develop.
Authors: Rayane Ghilene, Dimitra Niaouri, Michele Linardi, Julien Longhi
Abstract: Socially Unacceptable Discourse (SUD) analysis is crucial for maintaining online positive environments. We investigate the effectiveness of Entailment-based zero-shot text classification (unsupervised method) for SUD detection and characterization by leveraging pre-trained transformer models and prompting techniques. The results demonstrate good generalization capabilities of these models to unseen data and highlight the promising nature of this approach for generating labeled datasets for the analysis and characterization of extremist narratives. The findings of this research contribute to the development of robust tools for studying SUD and promoting responsible communication online.
Authors: William Van Woensel, Soroor Motie
Abstract: This literature review studies the field of automated process extraction, i.e., transforming textual descriptions into structured processes using Natural Language Processing (NLP). We found that Machine Learning (ML) / Deep Learning (DL) methods are being increasingly used for the NLP component. In some cases, they were chosen for their suitability towards process extraction, and results show that they can outperform classic rule-based methods. We also found a paucity of gold-standard, scalable annotated datasets, which currently hinders objective evaluations as well as the training or fine-tuning of ML / DL methods. Finally, we discuss preliminary work on the application of LLMs for automated process extraction, as well as promising developments in this field.
Authors: Aleksei S. Krylov, Oleg D. Somov
Abstract: Diffusion models have demonstrated significant potential in achieving state-of-the-art performance across various text generation tasks. In this systematic study, we investigate their application to the table-to-text problem by adapting the diffusion model to the task and conducting an in-depth analysis. Our experiments cover multiple aspects of diffusion models training. We explore sampling strategy influence by inducing recent diffusion model accelerator DPM-Solver++ into our core model. We have tested different prediction aggregation methods, like ROVER and Minimum Bayes-Risk (MBR). Our studies cover the impact of the pre-training phase in diffusion models and the generation length constraints influence. We also have compared diffusion model generation with auto-regressive text-to-text models with different temperature settings for diversity evaluation. Our key observation is that diffusion models demonstrate the balance between quality and diversity while auto-regressive text-to-text models are not successful at handling both at the same time. Furthermore, we found out that to achieve the highest quality possible, it is preferable to use a regular sampler with the strictest length constraint to create multiple samples, and then use MBR to aggregate the predictions. However, if you are prepared to give up high level of diversity and to accelerate the process, you can also utilize a fast sampler DPM-Solver++. Our findings reveal that diffusion models achieve comparable results in the table-to-text domain, highlighting their viability in the table-to-text challenge as a promising research direction.
Authors: Michael D. Skarlinski, Sam Cox, Jon M. Laurent, James D. Braza, Michaela Hinks, Michael J. Hammerling, Manvitha Ponnapati, Samuel G. Rodriques, Andrew D. White
Abstract: Language models are known to produce incorrect information, and their accuracy and reliability for scientific research are still in question. We developed a detailed human-AI comparison method to evaluate language models on real-world literature search tasks, including information retrieval, summarization, and contradiction detection. Our findings show that PaperQA2, an advanced language model focused on improving factual accuracy, matches or outperforms subject matter experts on three realistic literature search tasks, with no restrictions on human participants (full internet access, search tools, and time). PaperQA2 generates cited, Wikipedia-style summaries of scientific topics that are significantly more accurate than current human-written Wikipedia entries. We also present LitQA2, a new benchmark for scientific literature research, which shaped the development of PaperQA2 and contributed to its superior performance. Additionally, PaperQA2 identifies contradictions in scientific literature, a challenging task for humans. It finds an average of 2.34 +/- 1.99 contradictions per paper in a random sample of biology papers, with 70% of these contradictions validated by human experts. These results show that language models can now surpass domain experts in important scientific literature tasks.
Authors: Prashanth Radhakrishnan, Jennifer Chen, Bo Xu, Prem Ramaswami, Hannah Pho, Adriana Olmos, James Manyika, R. V. Guha
Abstract: Large Language Models (LLMs) are prone to generating factually incorrect information when responding to queries that involve numerical and statistical data or other timely facts. In this paper, we present an approach for enhancing the accuracy of LLMs by integrating them with Data Commons, a vast, open-source repository of public statistics from trusted organizations like the United Nations (UN), Center for Disease Control and Prevention (CDC) and global census bureaus. We explore two primary methods: Retrieval Interleaved Generation (RIG), where the LLM is trained to produce natural language queries to retrieve data from Data Commons, and Retrieval Augmented Generation (RAG), where relevant data tables are fetched from Data Commons and used to augment the LLM's prompt. We evaluate these methods on a diverse set of queries, demonstrating their effectiveness in improving the factual accuracy of LLM outputs. Our work represents an early step towards building more trustworthy and reliable LLMs that are grounded in verifiable statistical data and capable of complex factual reasoning.
Authors: Daniel B. Hier, Thanh Son Do, Tayo Obafemi-Ajayi
Abstract: Large language models (LLMs) have shown improved accuracy in phenotype term normalization tasks when augmented with retrievers that suggest candidate normalizations based on term definitions. In this work, we introduce a simplified retriever that enhances LLM accuracy by searching the Human Phenotype Ontology (HPO) for candidate matches using contextual word embeddings from BioBERT without the need for explicit term definitions. Testing this method on terms derived from the clinical synopses of Online Mendelian Inheritance in Man (OMIM), we demonstrate that the normalization accuracy of a state-of-the-art LLM increases from a baseline of 62.3% without augmentation to 90.3% with retriever augmentation. This approach is potentially generalizable to other biomedical term normalization tasks and offers an efficient alternative to more complex retrieval methods.
Authors: Hongyan Chang, Ali Shahin Shamsabadi, Kleomenis Katevas, Hamed Haddadi, Reza Shokri
Abstract: Prior Membership Inference Attacks (MIAs) on pre-trained Large Language Models (LLMs), adapted from classification model attacks, fail due to ignoring the generative process of LLMs across token sequences. In this paper, we present a novel attack that adapts MIA statistical tests to the perplexity dynamics of subsequences within a data point. Our method significantly outperforms prior loss-based approaches, revealing context-dependent memorization patterns in pre-trained LLMs.
Authors: Daniel B. Hier, Thanh Son Do, Tayo Obafemi-Ajayi
Abstract: Term normalization is the process of mapping a term from free text to a standardized concept and its machine-readable code in an ontology. Accurate normalization of terms that capture phenotypic differences between patients and diseases is critical to the success of precision medicine initiatives. A large language model (LLM), such as GPT-4o, can normalize terms to the Human Phenotype Ontology (HPO), but it may retrieve incorrect HPO IDs. Reported accuracy rates for LLMs on these tasks may be inflated due to imbalanced test datasets skewed towards high-frequency terms. In our study, using a comprehensive dataset of 268,776 phenotype annotations for 12,655 diseases from the HPO, GPT-4o achieved an accuracy of 13.1% in normalizing 11,225 unique terms. However, the accuracy was unevenly distributed, with higher-frequency and shorter terms normalized more accurately than lower-frequency and longer terms. Feature importance analysis, using SHAP and permutation methods, identified low-term frequency as the most significant predictor of normalization errors. These findings suggest that training and evaluation datasets for LLM-based term normalization should balance low- and high-frequency terms to improve model performance, particularly for infrequent terms critical to precision medicine.
Authors: Abhinav P. M., SujayKumar Reddy M, Oswald Christopher
Abstract: This project, titled "Machine Translation with Large Language Models: Decoder-only vs. Encoder-Decoder," aims to develop a multilingual machine translation (MT) model. Focused on Indian regional languages, especially Telugu, Tamil, and Malayalam, the model seeks to enable accurate and contextually appropriate translations across diverse language pairs. By comparing Decoder-only and Encoder-Decoder architectures, the project aims to optimize translation quality and efficiency, advancing cross-linguistic communication tools.The primary objective is to develop a model capable of delivering high-quality translations that are accurate and contextually appropriate. By leveraging large language models, specifically comparing the effectiveness of Decoder-only and Encoder-Decoder architectures, the project seeks to optimize translation performance and efficiency across multilingual contexts. Through rigorous experimentation and analysis, this project aims to advance the field of machine translation, contributing valuable insights into the effectiveness of different model architectures and paving the way for enhanced cross-linguistic communication tools.
Authors: Kartikey Doshi, Jimit Shah, Narendra Shekokar
Abstract: We present TheraGen, an advanced AI-powered mental health chatbot utilizing the LLaMA 2 7B model. This approach builds upon recent advancements in language models and transformer architectures. TheraGen provides all-day personalized, compassionate mental health care by leveraging a large dataset of 1 million conversational entries, combining anonymized therapy transcripts, online mental health discussions, and psychological literature, including APA resources. Our implementation employs transfer learning, fine-tuning, and advanced training techniques to optimize performance. TheraGen offers a user-friendly interface for seamless interaction, providing empathetic responses and evidence-based coping strategies. Evaluation results demonstrate high user satisfaction rates, with 94% of users reporting improved mental well-being. The system achieved a BLEU score of 0.67 and a ROUGE score of 0.62, indicating strong response accuracy. With an average response time of 1395 milliseconds, TheraGen ensures real-time, efficient support. While not a replacement for professional therapy, TheraGen serves as a valuable complementary tool, significantly improving user well-being and addressing the accessibility gap in mental health treatments. This paper details TheraGen's architecture, training methodology, ethical considerations, and future directions, contributing to the growing field of AI-assisted mental healthcare and offering a scalable solution to the pressing need for mental health support.
Authors: Neel Rajani, Lilli Kiessling, Aleksandr Ogaltsov, Claus Lang
Abstract: Although powerful, current cutting-edge LLMs may not fulfil the needs of highly specialised sectors. We introduce KodeXv0.1, a family of large language models that outclass GPT-4 in financial question answering. We utilise the base variants of Llama 3.1 8B and 70B and adapt them to the financial domain through a custom training regime. To this end, we collect and process a large number of publicly available financial documents such as earnings calls and business reports. These are used to generate a high-quality, synthetic dataset consisting of Context-Question-Answer triplets which closely mirror real-world financial tasks. Using the train split of this dataset, we perform RAG-aware 4bit LoRA instruction tuning runs of Llama 3.1 base variants to produce KodeX-8Bv0.1 and KodeX-70Bv0.1. We then complete extensive model evaluations using FinanceBench, FinQABench and the withheld test split of our dataset. Our results show that KodeX-8Bv0.1 is more reliable in financial contexts than cutting-edge instruct models in the same parameter regime, surpassing them by up to 9.24%. In addition, it is even capable of outperforming state-of-the-art proprietary models such as GPT-4 by up to 7.07%. KodeX-70Bv0.1 represents a further improvement upon this, exceeding GPT-4's performance on every tested benchmark.
Authors: Baohua Zhang, Yongyi Huang, Wenyao Cui, Huaping Zhang
Abstract: Role-playing is an easy task for Large Language Models (LLMs), as they are skilled at simulating human behaviors. Many current studies have enabled LLMs to generate responses in the tone of a specific role by fine-tuning the models or using specialized prompts. However, it is typically easy to recognize when a role is being played by LLMs. These models tend to perform poorly when confronted with knowledge that the assumed role does not possess, or a question that requires the specific experience or logic of the role to answer. To address this problem and make LLMs act more like real roles, we propose a Thinking Before Speaking (TBS) model in this paper. Unlike other studies, we first extend the data based on the character's real-life scenarios and the historical dialogue, supplementing each pair of dialogue with the character's mindset. Then we add few data points that include elements beyond the role's knowledge, and fine-tune the LLMs. This approach can help LLMs adopt the role's thought process and logic, avoiding responses that fall outside the role's knowledge base. We have also prepared a dataset and evaluation metrics to test these capabilities. Experimental results show that our TBS model can better emulate a role in terms of tone, knowledge, and mindset.
Authors: Xin Wang, Xinyi Bai
Abstract: Relation extraction as an important natural Language processing (NLP) task is to identify relations between named entities in text. Recently, graph convolutional networks over dependency trees have been widely used to capture syntactic features and achieved attractive performance. However, most existing dependency-based approaches ignore the positive influence of the words outside the dependency trees, sometimes conveying rich and useful information on relation extraction. In this paper, we propose a novel model, Entity-aware Self-attention Contextualized GCN (ESC-GCN), which efficiently incorporates syntactic structure of input sentences and semantic context of sequences. To be specific, relative position self-attention obtains the overall semantic pairwise correlation related to word position, and contextualized graph convolutional networks capture rich intra-sentence dependencies between words by adequately pruning operations. Furthermore, entity-aware attention layer dynamically selects which token is more decisive to make final relation prediction. In this way, our proposed model not only reduces the noisy impact from dependency trees, but also obtains easily-ignored entity-related semantic representation. Extensive experiments on various tasks demonstrate that our model achieves encouraging performance as compared to existing dependency-based and sequence-based models. Specially, our model excels in extracting relations between entities of long sentences.
Authors: Stanley Cao, Felix Drinkall
Abstract: Stance detection is a crucial NLP task with numerous applications in social science, from analyzing online discussions to assessing political campaigns. This paper investigates the optimal way to incorporate metadata into a political stance detection task. We demonstrate that previous methods combining metadata with language-based data for political stance detection have not fully utilized the metadata information; our simple baseline, using only party membership information, surpasses the current state-of-the-art. We then show that prepending metadata (e.g., party and policy) to political speeches performs best, outperforming all baselines, indicating that complex metadata inclusion systems may not learn the task optimally.
Authors: Adarsh MS, Jithin VG, Ditto PS
Abstract: Large language models (LLMs) are known for their exceptional performance across a range of natural language processing tasks, but their deployment comes at a high computational and financial cost. On the other hand, smaller language models (SLMs), which can be deployed on lower-cost edge devices, struggle to match the performance of their larger counterparts. This paper presents a novel hybrid inference approach that leverages the strengths of both model types while minimizing reliance on costly cloud-based LLMs. Unlike existing methods that route entire queries to either an SLM or a cloud LLM, our approach introduces a reward-based mechanism to dynamically determine the involvement of the cloud LLM during token generation. Specifically, each token predicted by the SLM is evaluated against a reward score, and only when this score falls below a certain threshold is the cloud LLM consulted for assistance in the next token prediction. This method not only reduces the traffic to the cloud LLM, thereby lowering costs, but also allows for flexible control over response quality depending on the reward score threshold. Experimental results demonstrate that our approach significantly reduces cloud LLM usage with minimal impact on overall response quality, offering a cost-effective solution for deploying high-performance language models
Authors: Tracy Cai, Wilson Liang, Donte Townes
Abstract: The traditional songwriting process is rather complex and this is evident in the time it takes to produce lyrics that fit the genre and form comprehensive verses. Our project aims to simplify this process with deep learning techniques, thus optimizing the songwriting process and enabling an artist to hit their target audience by staying in genre. Using a dataset of 18,000 songs off Spotify, we developed a unique preprocessing format using tokens to parse lyrics into individual verses. These results were used to train a baseline pretrained seq2seq model, and a LSTM-based neural network models according to song genres. We found that generation yielded higher recall (ROUGE) in the baseline model, but similar precision (BLEU) for both models. Qualitatively, we found that many of the lyrical phrases generated by the original model were still comprehensible and discernible between which genres they fit into, despite not necessarily being the exact the same as the true lyrics. Overall, our results yielded that lyric generation can reasonably be sped up to produce genre-based lyrics and aid in hastening the songwriting process.
Authors: Yihua Cheng, Kuntai Du, Jiayi Yao, Junchen Jiang
Abstract: As the use of large language models (LLMs) expands rapidly, so does the range of knowledge needed to supplement various LLM queries. Thus, enabling flexible and efficient injection of new knowledge in LLM inference is critical. Three high-level options exist: (i) embedding the knowledge in LLM's weights (i.e., fine-tuning), (ii) including the knowledge as a part of LLM's text input (i.e., in-context learning), or (iii) injecting the KV caches of the new knowledge to LLM during prefill. This paper argues that, although fine-tuning and in-context learning are popular, using KV caches as the medium of knowledge could simultaneously enable more modular management of knowledge injection and more efficient LLM serving with low cost and fast response. To realize these benefits, we envision a Knowledge Delivery Network (KDN), a new system component in LLM services that dynamically optimizes the storage, transfer, and composition of KV cache across LLM engines and other compute and storage resources. We believe that, just like content delivery networks (CDNs), such as Akamai, enabled the success of the Internet ecosystem through their efficient data delivery, KDNs will be critical to the success of LLM applications through their efficient knowledge delivery. We have open-sourced a KDN prototype at https://github.com/LMCache/LMCache.
Authors: Christos Fragkathoulas, Odysseas S. Chlapanis
Abstract: This paper introduces a novel task to assess the faithfulness of large language models (LLMs) using local perturbations and self-explanations. Many LLMs often require additional context to answer certain questions correctly. For this purpose, we propose a new efficient alternative explainability technique, inspired by the commonly used leave-one-out approach. Using this approach, we identify the sufficient and necessary parts for the LLM to generate correct answers, serving as explanations. We propose a metric for assessing faithfulness that compares these crucial parts with the self-explanations of the model. Using the Natural Questions dataset, we validate our approach, demonstrating its effectiveness in explaining model decisions and assessing faithfulness.
Authors: Weijie Zhao, Huajie Shao, Zhaozhuo Xu, Suzhen Duan, Denghui Zhang
Abstract: Exploring the data sources used to train Large Language Models (LLMs) is a crucial direction in investigating potential copyright infringement by these models. While this approach can identify the possible use of copyrighted materials in training data, it does not directly measure infringing risks. Recent research has shifted towards testing whether LLMs can directly output copyrighted content. Addressing this direction, we investigate and assess LLMs' capacity to generate infringing content by providing them with partial information from copyrighted materials, and try to use iterative prompting to get LLMs to generate more infringing content. Specifically, we input a portion of a copyrighted text into LLMs, prompt them to complete it, and then analyze the overlap between the generated content and the original copyrighted material. Our findings demonstrate that LLMs can indeed generate content highly overlapping with copyrighted materials based on these partial inputs.
Authors: Robert Morabito, Sangmitra Madhusudan, Tyler McDonald, Ali Emami
Abstract: Mitigating explicit and implicit biases in Large Language Models (LLMs) has become a critical focus in the field of natural language processing. However, many current methodologies evaluate scenarios in isolation, without considering the broader context or the spectrum of potential biases within each situation. To address this, we introduce the Sensitivity Testing on Offensive Progressions (STOP) dataset, which includes 450 offensive progressions containing 2,700 unique sentences of varying severity that progressively escalate from less to more explicitly offensive. Covering a broad spectrum of 9 demographics and 46 sub-demographics, STOP ensures inclusivity and comprehensive coverage. We evaluate several leading closed- and open-source models, including GPT-4, Mixtral, and Llama 3. Our findings reveal that even the best-performing models detect bias inconsistently, with success rates ranging from 19.3% to 69.8%. We also demonstrate how aligning models with human judgments on STOP can improve model answer rates on sensitive tasks such as BBQ, StereoSet, and CrowS-Pairs by up to 191%, while maintaining or even improving performance. STOP presents a novel framework for assessing the complex nature of biases in LLMs, which will enable more effective bias mitigation strategies and facilitates the creation of fairer language models.
Authors: Julia Watson, Sophia Lee, Barend Beekhuizen, Suzanne Stevenson
Abstract: We study language ideologies in text produced by LLMs through a case study on English gendered language reform (related to role nouns like congressperson/-woman/-man, and singular they). First, we find political bias: when asked to use language that is "correct" or "natural", LLMs use language most similarly to when asked to align with conservative (vs. progressive) values. This shows how LLMs' metalinguistic preferences can implicitly communicate the language ideologies of a particular political group, even in seemingly non-political contexts. Second, we find LLMs exhibit internal inconsistency: LLMs use gender-neutral variants more often when more explicit metalinguistic context is provided. This shows how the language ideologies expressed in text produced by LLMs can vary, which may be unexpected to users. We discuss the broader implications of these findings for value alignment.
Authors: Zhepeng Wang, Runxue Bao, Yawen Wu, Jackson Taylor, Cao Xiao, Feng Zheng, Weiwen Jiang, Shangqian Gao, Yanfu Zhang
Abstract: Pretrained large language models (LLMs) have revolutionized natural language processing (NLP) tasks such as summarization, question answering, and translation. However, LLMs pose significant security risks due to their tendency to memorize training data, leading to potential privacy breaches and copyright infringement. Accurate measurement of this memorization is essential to evaluate and mitigate these potential risks. However, previous attempts to characterize memorization are constrained by either using prefixes only or by prepending a constant soft prompt to the prefixes, which cannot react to changes in input. To address this challenge, we propose a novel method for estimating LLM memorization using dynamic, prefix-dependent soft prompts. Our approach involves training a transformer-based generator to produce soft prompts that adapt to changes in input, thereby enabling more accurate extraction of memorized data. Our method not only addresses the limitations of previous methods but also demonstrates superior performance in diverse experimental settings compared to state-of-the-art techniques. In particular, our method can achieve the maximum relative improvement of 112.75% and 32.26% over the vanilla baseline in terms of discoverable memorization rate for the text generation task and code generation task respectively.
Authors: Eric Cullhed
Abstract: This article presents an experiment in fine-tuning a pretrained causal language model (Meta's Llama 3.1 8B Instruct) for aiding in three fundamental tasks of philological research: chronological and geographic attribution as well as text restoration in ancient Greek inscriptions and documentary papyri. Using a prompt-based instruct approach, the fine-tuned models surpass the state of the art in key metrics. For inscriptions, the models achieve a lower average character error rate (CER) of 22.5% (vs. 26.3%), while closely matching top-1 accuracy (60.9% vs. 61.8%) and top-20 accuracy (77.5% vs. 78.3%) for sequences up to 10 characters. They also provide a practical advantage by ignoring spaces during reconstruction, aligning better with the scriptio continua typically used in ancient written artifacts. In geographic attribution, the model outperforms previous benchmarks with a top-1 accuracy of 75.0% (vs. 70.8%) and a top-3 accuracy of 83.7% (vs. 82.1%). For dating, it achieves an average deviation of 26.2 years (vs. 29.3) and a median deviation of 1 year (vs. 3) from the actual date range. The models also set new baselines for documentary papyri, with a CER of 16.3%, a top-1 accuracy of 71.3%, and top-20 of 85.0% in text reconstruction; a top-1 accuracy of 66.4% and top-3 of 79.9% in geographic attribution; and, in chronological attribution, a deviation of 21.7 years from the actual termini post/ante quem, with a median deviation of 0 years.
Authors: Konstantinos Thomas, Giorgos Filandrianos, Maria Lymperaiou, Chrysoula Zerva, Giorgos Stamou
Abstract: Equivocation and ambiguity in public speech are well-studied discourse phenomena, especially in political science and analysis of political interviews. Inspired by the well-grounded theory on equivocation, we aim to resolve the closely related problem of response clarity in questions extracted from political interviews, leveraging the capabilities of Large Language Models (LLMs) and human expertise. To this end, we introduce a novel taxonomy that frames the task of detecting and classifying response clarity and a corresponding clarity classification dataset which consists of question-answer (QA) pairs drawn from political interviews and annotated accordingly. Our proposed two-level taxonomy addresses the clarity of a response in terms of the information provided for a given question (high-level) and also provides a fine-grained taxonomy of evasion techniques that relate to unclear, ambiguous responses (lower-level). We combine ChatGPT and human annotators to collect, validate and annotate discrete QA pairs from political interviews, to be used for our newly introduced response clarity task. We provide a detailed analysis and conduct several experiments with different model architectures, sizes and adaptation methods to gain insights and establish new baselines over the proposed dataset and task.
Authors: Deonna M. Owens, Ryan A. Rossi, Sungchul Kim, Tong Yu, Franck Dernoncourt, Xiang Chen, Ruiyi Zhang, Jiuxiang Gu, Hanieh Deilamsalehy, Nedim Lipka
Abstract: Large Language Models (LLMs) are powerful tools with the potential to benefit society immensely, yet, they have demonstrated biases that perpetuate societal inequalities. Despite significant advancements in bias mitigation techniques using data augmentation, zero-shot prompting, and model fine-tuning, biases continuously persist, including subtle biases that may elude human detection. Recent research has shown a growing interest in multi-LLM approaches, which have been demonstrated to be effective in improving the quality of reasoning and factuality in LLMs. Building on this approach, we propose a novel multi-LLM debiasing framework aimed at reducing bias in LLMs. Our work is the first to introduce and evaluate two distinct approaches within this framework for debiasing LLMs: a centralized method, where the conversation is facilitated by a single central LLM, and a decentralized method, where all models communicate directly. Our findings reveal that our multi-LLM framework significantly reduces bias in LLMs, outperforming the baseline method across several social groups.
Authors: Yuhe Gao, Runxue Bao, Yuelyu Ji, Yiming Sun, Chenxi Song, Jeffrey P. Ferraro, Ye Ye
Abstract: Knowledge sharing is crucial in healthcare, especially when leveraging data from multiple clinical sites to address data scarcity, reduce costs, and enable timely interventions. Transfer learning can facilitate cross-site knowledge transfer, but a major challenge is heterogeneity in clinical concepts across different sites. Large Language Models (LLMs) show significant potential of capturing the semantic meaning of clinical concepts and reducing heterogeneity. This study analyzed electronic health records from two large healthcare systems to assess the impact of semantic embeddings from LLMs on local, shared, and transfer learning models. Results indicate that domain-specific LLMs, such as Med-BERT, consistently outperform in local and direct transfer scenarios, while generic models like OpenAI embeddings require fine-tuning for optimal performance. However, excessive tuning of models with biomedical embeddings may reduce effectiveness, emphasizing the need for balance. This study highlights the importance of domain-specific embeddings and careful model tuning for effective knowledge transfer in healthcare.
Authors: Samuel Cahyawijaya
Abstract: Natural language processing (NLP) has witnessed a profound impact of large language models (LLMs) that excel in a multitude of tasks. However, the limitation of LLMs in multilingual settings, particularly in underrepresented languages, remains a significant hurdle. This thesis aims to bridge the gap in NLP research and development by focusing on underrepresented languages. A comprehensive evaluation of LLMs is conducted to assess their capabilities in these languages, revealing the challenges of multilingual and multicultural generalization. Addressing the multilingual generalization gap, this thesis proposes data-and-compute-efficient methods to mitigate the disparity in LLM ability in underrepresented languages, allowing better generalization on underrepresented languages without the loss of task generalization ability. The proposed solutions cover cross-lingual continual instruction tuning, retrieval-based cross-lingual in-context learning, and in-context query alignment. Furthermore, a novel method to measure cultural values alignment between LLMs operating in different languages is proposed, ensuring cultural sensitivity and inclusivity. These contributions aim to enhance the multilingual and multicultural alignment of LLMs in underrepresented languages, ultimately advancing the NLP field toward greater equality and inclusiveness.
Authors: Aidan Gilson, Xuguang Ai, Thilaka Arunachalam, Ziyou Chen, Ki Xiong Cheong, Amisha Dave, Cameron Duic, Mercy Kibe, Annette Kaminaka, Minali Prasad, Fares Siddig, Maxwell Singer, Wendy Wong, Qiao Jin, Tiarnan D. L. Keenan, Xia Hu, Emily Y. Chew, Zhiyong Lu, Hua Xu, Ron A. Adelman, Yih-Chung Tham, Qingyu Chen
Abstract: Despite the potential of Large Language Models (LLMs) in medicine, they may generate responses lacking supporting evidence or based on hallucinated evidence. While Retrieval Augment Generation (RAG) is popular to address this issue, few studies implemented and evaluated RAG in downstream domain-specific applications. We developed a RAG pipeline with 70,000 ophthalmology-specific documents that retrieve relevant documents to augment LLMs during inference time. In a case study on long-form consumer health questions, we systematically evaluated the responses including over 500 references of LLMs with and without RAG on 100 questions with 10 healthcare professionals. The evaluation focuses on factuality of evidence, selection and ranking of evidence, attribution of evidence, and answer accuracy and completeness. LLMs without RAG provided 252 references in total. Of which, 45.3% hallucinated, 34.1% consisted of minor errors, and 20.6% were correct. In contrast, LLMs with RAG significantly improved accuracy (54.5% being correct) and reduced error rates (18.8% with minor hallucinations and 26.7% with errors). 62.5% of the top 10 documents retrieved by RAG were selected as the top references in the LLM response, with an average ranking of 4.9. The use of RAG also improved evidence attribution (increasing from 1.85 to 2.49 on a 5-point scale, P<0.001), albeit with slight decreases in accuracy (from 3.52 to 3.23, P=0.03) and completeness (from 3.47 to 3.27, P=0.17). The results demonstrate that LLMs frequently exhibited hallucinated and erroneous evidence in the responses, raising concerns for downstream applications in the medical domain. RAG substantially reduced the proportion of such evidence but encountered challenges.
Authors: Sunit Sivasankaran, Eric Sun, Jinyu Li, Yan Huang, Jing Pan
Abstract: Obtaining word timestamp information from end-to-end (E2E) ASR models remains challenging due to the lack of explicit time alignment during training. This issue is further complicated in multilingual models. Existing methods, either rely on lexicons or introduce additional tokens, leading to scalability issues and increased computational costs. In this work, we propose a new approach to estimate word boundaries without relying on lexicons. Our method leverages word embeddings from sub-word token units and a pretrained ASR model, requiring only word alignment information during training. Our proposed method can scale-up to any number of languages without incurring any additional cost. We validate our approach using a multilingual ASR model trained on five languages and demonstrate its effectiveness against a strong baseline.
Authors: Sebastian Nehrdich, Oliver Hellwig, Kurt Keutzer
Abstract: Morphologically rich languages are notoriously challenging to process for downstream NLP applications. This paper presents a new pretrained language model, ByT5-Sanskrit, designed for NLP applications involving the morphologically rich language Sanskrit. We evaluate ByT5-Sanskrit on established Sanskrit word segmentation tasks, where it outperforms previous data-driven approaches by a considerable margin and matches the performance of the current best lexicon-based model. It is easier to deploy and more robust to data not covered by external linguistic resources. It also achieves new state-of-the-art results in Vedic Sanskrit dependency parsing and OCR post-correction tasks. Additionally, based on the Digital Corpus of Sanskrit, we introduce a novel multitask dataset for the joint training of Sanskrit word segmentation, lemmatization, and morphosyntactic tagging tasks. We fine-tune ByT5-Sanskrit on this dataset, creating a versatile multitask model for various downstream Sanskrit applications. We have used this model in Sanskrit linguistic annotation projects, in information retrieval setups, and as a preprocessing step in a Sanskrit machine translation pipeline. We also show that our approach yields new best scores for lemmatization and dependency parsing of other morphologically rich languages. We thus demonstrate that byte-level pretrained language models can achieve excellent performance for morphologically rich languages, outperforming tokenizer-based models and presenting an important vector of exploration when constructing NLP pipelines for such languages.
Authors: Sarfaroz Yunusov, Hamza Sidat, Ali Emami
Abstract: This study explores the effectiveness of Large Language Models (LLMs) in creating personalized "mirror stories" that reflect and resonate with individual readers' identities, addressing the significant lack of diversity in literature. We present MirrorStories, a corpus of 1,500 personalized short stories generated by integrating elements such as name, gender, age, ethnicity, reader interest, and story moral. We demonstrate that LLMs can effectively incorporate diverse identity elements into narratives, with human evaluators identifying personalized elements in the stories with high accuracy. Through a comprehensive evaluation involving 26 diverse human judges, we compare the effectiveness of MirrorStories against generic narratives. We find that personalized LLM-generated stories not only outscore generic human-written and LLM-generated ones across all metrics of engagement (with average ratings of 4.22 versus 3.37 on a 5-point scale), but also achieve higher textual diversity while preserving the intended moral. We also provide analyses that include bias assessments and a study on the potential for integrating images into personalized stories.
Authors: Chen Zhang, Dading Chong, Feng Jiang, Chengguang Tang, Anningzhe Gao, Guohua Tang, Haizhou Li
Abstract: In natural human-to-human conversations, participants often receive feedback signals from one another based on their follow-up reactions. These reactions can include verbal responses, facial expressions, changes in emotional state, and other non-verbal cues. Similarly, in human-machine interactions, the machine can leverage the user's follow-up utterances as feedback signals to assess whether it has appropriately addressed the user's request. Therefore, we propose using the likelihood of follow-up utterances as rewards to differentiate preferred responses from less favored ones, without relying on human or commercial LLM-based preference annotations. Our proposed reward mechanism, ``Follow-up Likelihood as Reward" (FLR), matches the performance of strong reward models trained on large-scale human or GPT-4 annotated data on 8 pairwise-preference and 4 rating-based benchmarks. Building upon the FLR mechanism, we propose to automatically mine preference data from the online generations of a base policy model. The preference data are subsequently used to boost the helpfulness of the base model through direct alignment from preference (DAP) methods, such as direct preference optimization (DPO). Lastly, we demonstrate that fine-tuning the language model that provides follow-up likelihood with natural language feedback significantly enhances FLR's performance on reward modeling benchmarks and effectiveness in aligning the base policy model's helpfulness.
Authors: Zheng Wei Lim, Nitish Gupta, Honglin Yu, Trevor Cohn
Abstract: Multilingual large language models (LLMs) are great translators, but this is largely limited to high-resource languages. For many LLMs, translating in and out of low-resource languages remains a challenging task. To maximize data efficiency in this low-resource setting, we introduce Mufu, which includes a selection of automatically generated multilingual candidates and an instruction to correct inaccurate translations in the prompt. Mufu prompts turn a translation task into a postediting one, and seek to harness the LLM's reasoning capability with auxiliary translation candidates, from which the model is required to assess the input quality, align the semantics cross-lingually, copy from relevant inputs and override instances that are incorrect. Our experiments on En-XX translations over the Flores-200 dataset show LLMs finetuned against Mufu-style prompts are robust to poor quality auxiliary translation candidates, achieving performance superior to NLLB 1.3B distilled model in 64% of low- and very-low-resource language pairs. We then distill these models to reduce inference cost, while maintaining on average 3.1 chrF improvement over finetune-only baseline in low-resource translations.
Authors: Jaewook Lee, Hunter McNichols, Andrew Lan
Abstract: In this paper, we study an under-explored area of language and vocabulary learning: keyword mnemonics, a technique for memorizing vocabulary through memorable associations with a target word via a verbal cue. Typically, creating verbal cues requires extensive human effort and is quite time-consuming, necessitating an automated method that is more scalable. We propose a novel overgenerate-and-rank method via prompting large language models (LLMs) to generate verbal cues and then ranking them according to psycholinguistic measures and takeaways from a pilot user study. To assess cue quality, we conduct both an automated evaluation of imageability and coherence, as well as a human evaluation involving English teachers and learners. Results show that LLM-generated mnemonics are comparable to human-generated ones in terms of imageability, coherence, and perceived usefulness, but there remains plenty of room for improvement due to the diversity in background and preference among language learners.
Authors: Jinman Zhao, Xueyan Zhang, Xingyu Yue, Weizhe Chen, Zifan Qian, Ruiyu Wang
Abstract: Current common interactions with language models is through full inference. This approach may not necessarily align with the model's internal knowledge. Studies show discrepancies between prompts and internal representations. Most focus on sentence understanding. We study the discrepancy of word semantics understanding in internal and external mismatch across Encoder-only, Decoder-only, and Encoder-Decoder pre-trained language models.
Authors: Jinman Zhao, Zifan Qian, Linbo Cao, Yining Wang, Yitian Ding
Abstract: Role-play in the Large Language Model (LLM) is a crucial technique that enables models to adopt specific perspectives, enhancing their ability to generate contextually relevant and accurate responses. By simulating different roles, theis approach improves reasoning capabilities across various NLP benchmarks, making the model's output more aligned with diverse scenarios. However, in this work, we demonstrate that role-play also carries potential risks. We systematically evaluate the impact of role-play by asking the language model to adopt different roles and testing it on multiple benchmarks that contain stereotypical and harmful questions. Despite the significant fluctuations in the benchmark results in different experiments, we find that applying role-play often increases the overall likelihood of generating stereotypical and harmful outputs.
Authors: Yuqing Huang, Rongyang Zhang, Xuesong He, Xuyang Zhi, Hao Wang, Xin Li, Feiyang Xu, Deguang Liu, Huadong Liang, Yi Li, Jian Cui, Zimu Liu, Shijin Wang, Guoping Hu, Guiquan Liu, Qi Liu, Defu Lian, Enhong Chen
Abstract: There is a growing interest in the role that LLMs play in chemistry which lead to an increased focus on the development of LLMs benchmarks tailored to chemical domains to assess the performance of LLMs across a spectrum of chemical tasks varying in type and complexity. However, existing benchmarks in this domain fail to adequately meet the specific requirements of chemical research professionals. To this end, we propose \textbf{\textit{ChemEval}}, which provides a comprehensive assessment of the capabilities of LLMs across a wide range of chemical domain tasks. Specifically, ChemEval identified 4 crucial progressive levels in chemistry, assessing 12 dimensions of LLMs across 42 distinct chemical tasks which are informed by open-source data and the data meticulously crafted by chemical experts, ensuring that the tasks have practical value and can effectively evaluate the capabilities of LLMs. In the experiment, we evaluate 12 mainstream LLMs on ChemEval under zero-shot and few-shot learning contexts, which included carefully selected demonstration examples and carefully designed prompts. The results show that while general LLMs like GPT-4 and Claude-3.5 excel in literature understanding and instruction following, they fall short in tasks demanding advanced chemical knowledge. Conversely, specialized LLMs exhibit enhanced chemical competencies, albeit with reduced literary comprehension. This suggests that LLMs have significant potential for enhancement when tackling sophisticated tasks in the field of chemistry. We believe our work will facilitate the exploration of their potential to drive progress in chemistry. Our benchmark and analysis will be available at {\color{blue} \url{https://github.com/USTC-StarTeam/ChemEval}}.
Authors: Jiatao Li, Xinyu Hu, Xiaojun Wan
Abstract: Retrieval-Augmented Generation (RAG) has greatly improved large language models (LLMs) by enabling them to generate accurate, contextually grounded responses through the integration of external information. However, conventional RAG approaches, which prioritize top-ranked documents based solely on query-context relevance, often introduce redundancy and conflicting information. This issue is particularly evident in unsupervised retrieval settings, where there are no mechanisms to effectively mitigate these problems, leading to suboptimal context selection. To address this, we propose Selection using Matrices for Augmented Retrieval (SMART) in question answering tasks, a fully unsupervised and training-free framework designed to optimize context selection in RAG. SMART leverages Determinantal Point Processes (DPPs) to simultaneously model relevance, diversity and conflict, ensuring the selection of potentially high-quality contexts. Experimental results across multiple datasets demonstrate that SMART significantly enhances QA performance and surpasses previous unsupervised context selection methods, showing a promising strategy for RAG.
Authors: Zhenhong Zhang, Jiajing Chen, Weiyan Shi, Lingjie Yi, Chihang Wang, Qian Yu
Abstract: With the rapid development of artificial intelligence technology, especially the increasingly widespread application of question-and-answer systems, high-quality question generation has become a key component in supporting the development of these systems. This article focuses on knowledge-based question generation technology, which aims to enable computers to simulate the human questioning process based on understanding specific texts or knowledge bases. In light of the issues of hallucination and knowledge gaps present in large-scale language models when applied to knowledge-intensive tasks, this paper proposes an enhanced question generation method that incorporates contrastive learning. This method utilizes multiple models to jointly mine domain knowledge and uses contrastive learning to guide the model in reducing noise and hallucinations in generation. Experimental results show that by designing prompts containing contrasting examples, the model's performance in question generation improves considerably, particularly when contrasting instructions and examples are used simultaneously, leading to the highest quality of generated questions and improved accuracy. These results demonstrate that the method proposed in this study, which combines contrasting context and chain-of-thought prompts, can effectively improve both the quality and the practicality of question generation.
Authors: Linxiao Wu, Yuanshuai Luo, Binrong Zhu, Guiran Liu, Rui Wang, Qian Yu
Abstract: Amidst the swift evolution of social media platforms and e-commerce ecosystems, the domain of opinion mining has surged as a pivotal area of exploration within natural language processing. A specialized segment within this field focuses on extracting nuanced evaluations tied to particular elements within textual contexts. This research advances a composite framework that amalgamates the positional cues of topical descriptors. The proposed system converts syntactic structures into a matrix format, leveraging convolutions and attention mechanisms within a graph to distill salient characteristics. Incorporating the positional relevance of descriptors relative to lexical items enhances the sequential integrity of the input. Trials have substantiated that this integrated graph-centric scheme markedly elevates the efficacy of evaluative categorization, showcasing preeminence.
Authors: Jason Zhang, Scott Viteri
Abstract: As language models grow more influential and trusted in our society, our ability to reliably steer them toward favorable behaviors becomes increasingly paramount. For this, we investigate the technique of steering vectors: biasing the forward pass of language models using a "steering vector" derived from a specific task. We apply them to steer language models toward performing Chain of Thought (CoT) Reasoning without the need to prompt through natural language. We demonstrate this approach on Llama3 8b and Mistral 7b v0.2, and obtain competitive results compared to CoT-prompted performances on a series of reasoning benchmarks (GSM8k, MMLU, AGI Eval, ARC AI2) and qualitative examples. We find this approach yields consistent steering towards CoT responses and takes less compute than traditional methods of fine-tuning models towards CoT.
Authors: Prasoon Bajpai, Niladri Chatterjee, Subhabrata Dutta, Tanmoy Chakraborty
Abstract: Large Language Models (LLMs) and AI assistants driven by these models are experiencing exponential growth in usage among both expert and amateur users. In this work, we focus on evaluating the reliability of current LLMs as science communicators. Unlike existing benchmarks, our approach emphasizes assessing these models on scientific questionanswering tasks that require a nuanced understanding and awareness of answerability. We introduce a novel dataset, SCiPS-QA, comprising 742 Yes/No queries embedded in complex scientific concepts, along with a benchmarking suite that evaluates LLMs for correctness and consistency across various criteria. We benchmark three proprietary LLMs from the OpenAI GPT family and 13 open-access LLMs from the Meta Llama-2, Llama-3, and Mistral families. While most open-access models significantly underperform compared to GPT-4 Turbo, our experiments identify Llama-3-70B as a strong competitor, often surpassing GPT-4 Turbo in various evaluation aspects. We also find that even the GPT models exhibit a general incompetence in reliably verifying LLM responses. Moreover, we observe an alarming trend where human evaluators are deceived by incorrect responses from GPT-4 Turbo.
Authors: Tongxuan Liu, Xingyu Wang, Weizhe Huang, Wenjiang Xu, Yuting Zeng, Lei Jiang, Hailong Yang, Jing Li
Abstract: In recent years, Large Language Models (LLMs) have demonstrated remarkable capabilities across diverse NLP tasks. Extensive research has explored how to enhance the logical reasoning abilities such as Chain-of-Thought, Chain-of-Thought with Self-Consistency, Tree-Of-Thoughts, and multi-agent debates. In the context of multi-agent debates, significant performance improvements can be achieved with an increasing number of agents and debate rounds. However, the escalation in the number of agents and debate rounds can drastically raise the tokens cost of debates, thereby limiting the scalability of the multi-agent debate technique. To better harness the advantages of multi-agent debates in logical reasoning tasks, this paper proposes a method to significantly reduce token cost in multi-agent debates. This approach involves dividing all agents into multiple debate groups, with agents engaging in debates within their respective groups and sharing interim debate results between groups. Comparative experiments across multiple datasets have demonstrated that this method can reduce the total tokens by up to 51.7% during debates and while potentially enhancing accuracy by as much as 25%. Our method significantly enhances the performance and efficiency of interactions in the multi-agent debate.
Authors: Xiao Zhang, Miao Li, Ji Wu
Abstract: Pretrained language models can encode a large amount of knowledge and utilize it for various reasoning tasks, yet they can still struggle to learn novel factual knowledge effectively from finetuning on limited textual demonstrations. In this work, we show that the reason for this deficiency is that language models are biased to learn word co-occurrence statistics instead of true factual associations. We identify the differences between two forms of knowledge representation in language models: knowledge in the form of co-occurrence statistics is encoded in the middle layers of the transformer model and does not generalize well to reasoning scenarios beyond simple question answering, while true factual associations are encoded in the lower layers and can be freely utilized in various reasoning tasks. Based on these observations, we propose two strategies to improve the learning of factual associations in language models. We show that training on text with implicit rather than explicit factual associations can force the model to learn factual associations instead of co-occurrence statistics, significantly improving the generalization of newly learned knowledge. We also propose a simple training method to actively forget the learned co-occurrence statistics, which unblocks and enhances the learning of factual associations when training on plain narrative text. On both synthetic and real-world corpora, the two proposed strategies improve the generalization of the knowledge learned during finetuning to reasoning scenarios such as indirect and multi-hop question answering.
Authors: Ashutosh Bajpai, Aaryan Goyal, Atif Anwer, Tanmoy Chakraborty
Abstract: The prolific use of Large Language Models (LLMs) as an alternate knowledge base requires them to be factually consistent, necessitating both correctness and consistency traits for paraphrased queries. Recently, significant attempts have been made to benchmark datasets and metrics to evaluate LLMs for these traits. However, structural simplicity (subject-relation-object) and contemporary association in their query formulation limit the broader definition of factuality and consistency. In this study, we introduce TeCFaP, a novel Temporally Consistent Factuality Probe task to expand the consistent factuality probe in the temporal dimension. To this end, we propose TEMP-COFAC, a high-quality dataset of prefix-style English query paraphrases. Subsequently, we extend the definitions of existing metrics to represent consistent factuality across temporal dimension. We experiment with a diverse set of LLMs and find most of them performing poorly on TeCFaP. Next, we propose a novel solution CoTSeLF (Consistent-Time-Sensitive Learning Framework) combining multi-task instruction tuning (MT-IT) with consistent-time-sensitive reinforcement learning (CTSRL) to improve temporally consistent factuality in LLMs. Our experiments demonstrate the efficacy of CoTSeLF over several baselines.
Authors: Khai Le-Duc, Phuc Phan, Tan-Hanh Pham, Bach Phan Tat, Minh-Huong Ngo, Truong-Son Hy
Abstract: Multilingual automatic speech recognition (ASR) in the medical domain serves as a foundational task for various downstream applications such as speech translation, spoken language understanding, and voice-activated assistants. This technology enhances patient care by enabling efficient communication across language barriers, alleviating specialized workforce shortages, and facilitating improved diagnosis and treatment, particularly during pandemics. In this work, we introduce MultiMed, a collection of small-to-large end-to-end ASR models for the medical domain, spanning five languages: Vietnamese, English, German, French, and Mandarin Chinese, together with the corresponding real-world ASR dataset. To our best knowledge, MultiMed stands as the largest and the first multilingual medical ASR dataset, in terms of total duration, number of speakers, diversity of diseases, recording conditions, speaker roles, unique medical terms, accents, and ICD-10 codes. Secondly, we establish the empirical baselines, present the first reproducible study of multilinguality in medical ASR, conduct a layer-wise ablation study for end-to-end ASR training, and provide the first linguistic analysis for multilingual medical ASR. All code, data, and models are available online https://github.com/leduckhai/MultiMed/tree/master/MultiMed
URLs: https://github.com/leduckhai/MultiMed/tree/master/MultiMed
Authors: Ruilin Luo, Liyuan Wang, Binghuai Lin, Zicheng Lin, Yujiu Yang
Abstract: Large Language Models (LLMs) have emerged as powerful tools for Text-to-SQL tasks, exhibiting remarkable reasoning capabilities. Different from tasks such as math word problems and commonsense reasoning, SQL solutions have a relatively fixed pattern. This facilitates the investigation of whether LLMs can benefit from categorical thinking, mirroring how humans acquire knowledge through inductive reasoning based on comparable examples. In this study, we propose that employing query group partitioning allows LLMs to focus on learning the thought processes specific to a single problem type, consequently enhancing their reasoning abilities across diverse difficulty levels and problem categories. Our experiments reveal that multiple advanced LLMs, when equipped with PTD-SQL, can either surpass or match previous state-of-the-art (SOTA) methods on the Spider and BIRD datasets. Intriguingly, models with varying initial performances have exhibited significant improvements, mainly at the boundary of their capabilities after targeted drilling, suggesting a parallel with human progress. Code is available at https://github.com/lrlbbzl/PTD-SQL.
Authors: Soniya Vijayakumar, Josef van Genabith, Simon Ostermann
Abstract: In the era of high performing Large Language Models, researchers have widely acknowledged that contextual word representations are one of the key drivers in achieving top performances in downstream tasks. In this work, we investigate the degree of contextualization encoded in the fine-grained sub-layer representations of a Pre-trained Language Model (PLM) by empirical experiments using linear probes. Unlike previous work, we are particularly interested in identifying the strength of contextualization across PLM sub-layer representations (i.e. Self-Attention, Feed-Forward Activation and Output sub-layers). To identify the main contributions of sub-layers to contextualisation, we first extract the sub-layer representations of polysemous words in minimally different sentence pairs, and compare how these representations change through the forward pass of the PLM network. Second, by probing on a sense identification classification task, we try to empirically localize the strength of contextualization information encoded in these sub-layer representations. With these probing experiments, we also try to gain a better understanding of the influence of context length and context richness on the degree of contextualization. Our main conclusion is cautionary: BERT demonstrates a high degree of contextualization in the top sub-layers if the word in question is in a specific position in the sentence with a shorter context window, but this does not systematically generalize across different word positions and context sizes.
Authors: Stefan Arnold, Marian Fietta, Dilara Yesilbas
Abstract: Language Models (LMs) recently incorporate mixture-of-experts layers consisting of a router and a collection of experts to scale up their parameter count given a fixed computational budget. Building on previous efforts indicating that token-expert assignments are predominantly influenced by token identities and positions, we trace routing decisions of similarity-annotated text pairs to evaluate the context sensitivity of learned token-expert assignments. We observe that routing in encoder layers mainly depends on (semantic) associations, but contextual cues provide an additional layer of refinement. Conversely, routing in decoder layers is more variable and markedly less sensitive to context.
Authors: Jaehan Kim, Minkyoo Song, Seung Ho Na, Seungwon Shin
Abstract: Parameter-efficient fine-tuning (PEFT) has become a key training strategy for large language models. However, its reliance on fewer trainable parameters poses security risks, such as task-agnostic backdoors. Despite their severe impact on a wide range of tasks, there is no practical defense solution available that effectively counters task-agnostic backdoors within the context of PEFT. In this study, we introduce Obliviate, a PEFT-integrable backdoor defense. We develop two techniques aimed at amplifying benign neurons within PEFT layers and penalizing the influence of trigger tokens. Our evaluations across three major PEFT architectures show that our method can significantly reduce the attack success rate of the state-of-the-art task-agnostic backdoors (83.6%$\downarrow$). Furthermore, our method exhibits robust defense capabilities against both task-specific backdoors and adaptive attacks. Source code will be obtained at https://github.com/obliviateARR/Obliviate.
Authors: Zeping Yu, Sophia Ananiadou
Abstract: We find arithmetic ability resides within a limited number of attention heads, with each head specializing in distinct operations. To delve into the reason, we introduce the Comparative Neuron Analysis (CNA) method, which identifies an internal logic chain consisting of four distinct stages from input to prediction: feature enhancing with shallow FFN neurons, feature transferring by shallow attention layers, feature predicting by arithmetic heads, and prediction enhancing among deep FFN neurons. Moreover, we identify the human-interpretable FFN neurons within both feature-enhancing and feature-predicting stages. These findings lead us to investigate the mechanism of LoRA, revealing that it enhances prediction probabilities by amplifying the coefficient scores of FFN neurons related to predictions. Finally, we apply our method in model pruning for arithmetic tasks and model editing for reducing gender bias. Code is on https://github.com/zepingyu0512/arithmetic-mechanism.
Authors: Aishwarya Mirashi, Purva Lingayat, Srushti Sonavane, Tejas Padhiyar, Raviraj Joshi, Geetanjali Kale
Abstract: The rise of large transformer models has revolutionized Natural Language Processing, leading to significant advances in tasks like text classification. However, this progress demands substantial computational resources, escalating training duration, and expenses with larger model sizes. Efforts have been made to downsize and accelerate English models (e.g., Distilbert, MobileBert). Yet, research in this area is scarce for low-resource languages. In this study, we explore the case of the low-resource Indic language Marathi. Leveraging the marathi-topic-all-doc-v2 model as our baseline, we implement optimization techniques to reduce computation time and memory usage. Our focus is on enhancing the efficiency of Marathi transformer models while maintaining top-tier accuracy and reducing computational demands. Using the MahaNews document classification dataset and the marathi-topic-all-doc-v2 model from L3Cube, we apply Block Movement Pruning, Knowledge Distillation, and Mixed Precision methods individually and in combination to boost efficiency. We demonstrate the importance of strategic pruning levels in achieving desired efficiency gains. Furthermore, we analyze the balance between efficiency improvements and environmental impact, highlighting how optimized model architectures can contribute to a more sustainable computational ecosystem. Implementing these techniques on a single GPU system, we determine that the optimal configuration is 25\% pruning + knowledge distillation. This approach yielded a 2.56x speedup in computation time while maintaining baseline accuracy levels.
Authors: Anushka Shelke, Riya Savant, Raviraj Joshi
Abstract: This study examines the effectiveness of layer pruning in creating efficient Sentence BERT (SBERT) models. Our goal is to create smaller sentence embedding models that reduce complexity while maintaining strong embedding similarity. We assess BERT models like Muril and MahaBERT-v2 before and after pruning, comparing them with smaller, scratch-trained models like MahaBERT-Small and MahaBERT-Smaller. Through a two-phase SBERT fine-tuning process involving Natural Language Inference (NLI) and Semantic Textual Similarity (STS), we evaluate the impact of layer reduction on embedding quality. Our findings show that pruned models, despite fewer layers, perform competitively with fully layered versions. Moreover, pruned models consistently outperform similarly sized, scratch-trained models, establishing layer pruning as an effective strategy for creating smaller, efficient embedding models. These results highlight layer pruning as a practical approach for reducing computational demand while preserving high-quality embeddings, making SBERT models more accessible for languages with limited technological resources.
Authors: Blessed Guda, Gabrial Zencha A., Lawrence Francis, Carlee Joe-Wong
Abstract: Large Language models (LLMs) have brought about substantial advancements in the field of Question Answering (QA) systems. These models do remarkably well in addressing intricate inquiries in a variety of disciplines. However, because of domain-specific vocabulary, complex technological concepts, and the requirement for exact responses applying LLMs to specialized sectors like telecommunications presents additional obstacles. GPT-3.5 has been used in recent work, to obtain noteworthy accuracy for telecom-related questions in a Retrieval Augmented Generation (RAG) framework. Notwithstanding these developments, the practical use of models such as GPT-3.5 is restricted by their proprietary nature and high computing demands. This paper introduces QMOS, an innovative approach which uses a Question-Masked loss and Option Shuffling trick to enhance the performance of LLMs in answering Multiple-Choice Questions in the telecommunications domain. Our focus was on using opensource, smaller language models (Phi-2 and Falcon-7B) within an enhanced RAG framework. Our multi-faceted approach involves several enhancements to the whole LLM-RAG pipeline of finetuning, retrieval, prompt engineering and inference. Our approaches significantly outperform existing results, achieving accuracy improvements from baselines of 24.70% to 49.30% with Falcon-7B and from 42.07% to 84.65% with Phi-2.
Authors: Hossein Sholehrasa, Sanaz Saki Norouzi, Pascal Hitzler, Majid Jaberi-Douraki
Abstract: Integrating structured knowledge from tabular formats poses significant challenges within natural language processing (NLP), mainly when dealing with complex, semi-structured tables like those found in the FeTaQA dataset. These tables require advanced methods to interpret and generate meaningful responses accurately. Traditional approaches, such as SQL and SPARQL, often fail to fully capture the semantics of such data, especially in the presence of irregular table structures like web tables. This paper addresses these challenges by proposing a novel approach that extracts triples straightforward from tabular data and integrates it with a retrieval-augmented generation (RAG) model to enhance the accuracy, coherence, and contextual richness of responses generated by a fine-tuned GPT-3.5-turbo-0125 model. Our approach significantly outperforms existing baselines on the FeTaQA dataset, particularly excelling in Sacre-BLEU and ROUGE metrics. It effectively generates contextually accurate and detailed long-form answers from tables, showcasing its strength in complex data interpretation.
Authors: Xinghua Zhang, Haiyang Yu, Yongbin Li, Minzheng Wang, Longze Chen, Fei Huang
Abstract: In the era of large language models (LLMs), a vast amount of conversation logs will be accumulated thanks to the rapid development trend of language UI. Conversation Analysis (CA) strives to uncover and analyze critical information from conversation data, streamlining manual processes and supporting business insights and decision-making. The need for CA to extract actionable insights and drive empowerment is becoming increasingly prominent and attracting widespread attention. However, the lack of a clear scope for CA leads to a dispersion of various techniques, making it difficult to form a systematic technical synergy to empower business applications. In this paper, we perform a thorough review and systematize CA task to summarize the existing related work. Specifically, we formally define CA task to confront the fragmented and chaotic landscape in this field, and derive four key steps of CA from conversation scene reconstruction, to in-depth attribution analysis, and then to performing targeted training, finally generating conversations based on the targeted training for achieving the specific goals. In addition, we showcase the relevant benchmarks, discuss potential challenges and point out future directions in both industry and academia. In view of current advancements, it is evident that the majority of efforts are still concentrated on the analysis of shallow conversation elements, which presents a considerable gap between the research and business, and with the assist of LLMs, recent work has shown a trend towards research on causality and strategic tasks which are sophisticated and high-level. The analyzed experiences and insights will inevitably have broader application value in business operations that target conversation logs.
Authors: Zhenting Wang, Zhizhi Wang, Mingyu Jin, Mengnan Du, Juan Zhai, Shiqing Ma
Abstract: Backdoor attack is a severe threat to the trustworthiness of DNN-based language models. In this paper, we first extend the definition of memorization of language models from sample-wise to more fine-grained sentence element-wise (e.g., word, phrase, structure, and style), and then point out that language model backdoors are a type of element-wise memorization. Through further analysis, we find that the strength of such memorization is positively correlated to the frequency of duplicated elements in the training dataset. In conclusion, duplicated sentence elements are necessary for successful backdoor attacks. Based on this, we propose a data-centric defense. We first detect trigger candidates in training data by finding memorizable elements, i.e., duplicated elements, and then confirm real triggers by testing if the candidates can activate backdoor behaviors (i.e., malicious elements). Results show that our method outperforms state-of-the-art defenses in defending against different types of NLP backdoors.
Authors: Javier Chiyah-Garcia, Alessandro Suglia, Arash Eshghi
Abstract: In dialogue, the addressee may initially misunderstand the speaker and respond erroneously, often prompting the speaker to correct the misunderstanding in the next turn with a Third Position Repair (TPR). The ability to process and respond appropriately to such repair sequences is thus crucial in conversational AI systems. In this paper, we first collect, analyse, and publicly release BlockWorld-Repairs: a dataset of multi-modal TPR sequences in an instruction-following manipulation task that is, by design, rife with referential ambiguity. We employ this dataset to evaluate several state-of-the-art Vision and Language Models (VLM) across multiple settings, focusing on their capability to process and accurately respond to TPRs and thus recover from miscommunication. We find that, compared to humans, all models significantly underperform in this task. We then show that VLMs can benefit from specialised losses targeting relevant tokens during fine-tuning, achieving better performance and generisability. Our results suggest that these models are not yet ready to be deployed in multi-modal collaborative settings where repairs are common, and highlight the need to design training regimes and objectives that facilitate learning from interaction.
Authors: John Hewitt, Nelson F. Liu, Percy Liang, Christopher D. Manning
Abstract: Instruction tuning commonly means finetuning a language model on instruction-response pairs. We discover two forms of adaptation (tuning) that are deficient compared to instruction tuning, yet still yield instruction following; we call this implicit instruction tuning. We first find that instruction-response pairs are not necessary: training solely on responses, without any corresponding instructions, yields instruction following. This suggests pretrained models have an instruction-response mapping which is revealed by teaching the model the desired distribution of responses. However, we then find it's not necessary to teach the desired distribution of responses: instruction-response training on narrow-domain data like poetry still leads to broad instruction-following behavior like recipe generation. In particular, when instructions are very different from those in the narrow finetuning domain, models' responses do not adhere to the style of the finetuning domain. To begin to explain implicit instruction tuning, we hypothesize that very simple changes to a language model's distribution yield instruction following. We support this by hand-writing a rule-based language model which yields instruction following in a product-of-experts with a pretrained model. The rules are to slowly increase the probability of ending the sequence, penalize repetition, and uniformly change 15 words' probabilities. In summary, adaptations made without being designed to yield instruction following can do so implicitly.
Authors: Navid Ayoobi, Lily Knab, Wen Cheng, David Pantoja, Hamidreza Alikhani, Sylvain Flamant, Jin Kim, Arjun Mukherjee
Abstract: While large language models (LLMs) exhibit significant utility across various domains, they simultaneously are susceptible to exploitation for unethical purposes, including academic misconduct and dissemination of misinformation. Consequently, AI-generated text detection systems have emerged as a countermeasure. However, these detection mechanisms demonstrate vulnerability to evasion techniques and lack robustness against textual manipulations. This paper introduces back-translation as a novel technique for evading detection, underscoring the need to enhance the robustness of current detection systems. The proposed method involves translating AI-generated text through multiple languages before back-translating to English. We present a model that combines these back-translated texts to produce a manipulated version of the original AI-generated text. Our findings demonstrate that the manipulated text retains the original semantics while significantly reducing the true positive rate (TPR) of existing detection methods. We evaluate this technique on nine AI detectors, including six open-source and three proprietary systems, revealing their susceptibility to back-translation manipulation. In response to the identified shortcomings of existing AI text detectors, we present a countermeasure to improve the robustness against this form of manipulation. Our results indicate that the TPR of the proposed method declines by only 1.85% after back-translation manipulation. Furthermore, we build a large dataset of 720k texts using eight different LLMs. Our dataset contains both human-authored and LLM-generated texts in various domains and writing styles to assess the performance of our method and existing detectors. This dataset is publicly shared for the benefit of the research community.
Authors: Yuxuan Zhou, Xien Liu, Chen Ning, Ji Wu
Abstract: In the study, we aim to investigate current LLMs' mastery of medical factual knowledge with a dynamic evaluation schema, which can automatically generate multiple test samples for each medical factual knowledge point. Test samples produced directly by LLMs always introduce factual errors and lack diversity in the manner of knowledge expression. To overcome the drawbacks, here we propose a novel evaluation method, Predicate-text Dual Transformation (PretextTrans), by introducing predicate transformations into the dynamic evaluation schema. Specifically, each medical knowledge point is firstly transformed into a predicate expression; then, the predicate expression derives a series of variants through predicate transformations; lastly, the produced predicate variants are transformed back into textual expressions, resulting in a series of test samples with both factual reliability and expression diversity. Using the proposed PretextTrans method, we systematically investigate 12 well-known LLMs' mastery of medical factual knowledge based on two medical datasets. The comparison results show that current LLMs still have significant deficiencies in fully mastering medical knowledge, which may illustrate why current LLMs still perform unsatisfactorily in real-world medical scenarios despite having achieved considerable performance on public benchmarks. Our proposed method serves as an effective solution for evaluation of LLMs in medical domain and offers valuable insights for developing medical-specific LLMs.
Authors: Hung-Ting Su, Ya-Ching Hsu, Xudong Lin, Xiang-Qian Shi, Yulei Niu, Han-Yuan Hsu, Hung-yi Lee, Winston H. Hsu
Abstract: Large language models (LLMs) equipped with chain-of-thoughts (CoT) prompting have shown significant multi-step reasoning capabilities in factual content like mathematics, commonsense, and logic. However, their performance in narrative reasoning, which demands greater abstraction capabilities, remains unexplored. This study utilizes tropes in movie synopses to assess the abstract reasoning abilities of state-of-the-art LLMs and uncovers their low performance. We introduce a trope-wise querying approach to address these challenges and boost the F1 score by 11.8 points. Moreover, while prior studies suggest that CoT enhances multi-step reasoning, this study shows CoT can cause hallucinations in narrative content, reducing GPT-4's performance. We also introduce an Adversarial Injection method to embed trope-related text tokens into movie synopses without explicit tropes, revealing CoT's heightened sensitivity to such injections. Our comprehensive analysis provides insights for future research directions.
Authors: Qingyu Lu, Liang Ding, Kanjian Zhang, Jinxia Zhang, Dacheng Tao
Abstract: Large Language Models (LLMs) have shown significant potential as judges for Machine Translation (MT) quality assessment, providing both scores and fine-grained feedback. Although approaches such as GEMBA-MQM has shown SOTA performance on reference-free evaluation, the predicted errors do not align well with those annotated by human, limiting their interpretability as feedback signals. To enhance the quality of error annotations predicted by LLM evaluators, we introduce a universal and training-free framework, $\textbf{MQM-APE}$, based on the idea of filtering out non-impactful errors by Automatically Post-Editing (APE) the original translation based on each error, leaving only those errors that contribute to quality improvement. Specifically, we prompt the LLM to act as 1) $\textit{evaluator}$ to provide error annotations, 2) $\textit{post-editor}$ to determine whether errors impact quality improvement and 3) $\textit{pairwise quality verifier}$ as the error filter. Experiments show that our approach consistently improves both the reliability and quality of error spans against GEMBA-MQM, across eight LLMs in both high- and low-resource languages. Orthogonal to trained approaches, MQM-APE complements translation-specific evaluators such as Tower, highlighting its broad applicability. Further analysis confirm the effectiveness of each module and offer valuable insights into evaluator design and LLMs selection. The code will be released to facilitate the community.
Authors: Mascha Kurpicz-Briki, Ghofrane Merhbene, Alexandre Puttick, Souhir Ben Souissi, Jannic Bieri, Thomas J\"org M\"uller, Christoph Golz
Abstract: Burnout, classified as a syndrome in the ICD-11, arises from chronic workplace stress that has not been effectively managed. It is characterized by exhaustion, cynicism, and reduced professional efficacy, and estimates of its prevalence vary significantly due to inconsistent measurement methods. Recent advancements in Natural Language Processing (NLP) and machine learning offer promising tools for detecting burnout through textual data analysis, with studies demonstrating high predictive accuracy. This paper contributes to burnout detection in German texts by: (a) collecting an anonymous real-world dataset including free-text answers and Oldenburg Burnout Inventory (OLBI) responses; (b) demonstrating the limitations of a GermanBERT-based classifier trained on online data; (c) presenting two versions of a curated BurnoutExpressions dataset, which yielded models that perform well in real-world applications; and (d) providing qualitative insights from an interdisciplinary focus group on the interpretability of AI models used for burnout detection. Our findings emphasize the need for greater collaboration between AI researchers and clinical experts to refine burnout detection models. Additionally, more real-world data is essential to validate and enhance the effectiveness of current AI methods developed in NLP research, which are often based on data automatically scraped from online sources and not evaluated in a real-world context. This is essential for ensuring AI tools are well suited for practical applications.
Authors: Runsong Zhao, Pengcheng Huang, Xinyu Liu, Chunyang Xiao, Tong Xiao, Jingbo Zhu
Abstract: Compressing Transformer inputs into compressd tokens allows running LLMs with improved speed and cost efficiency. Based on the compression method ICAE, we carefully examine the position identifier choices for compressed tokens and also propose a new compression loss. We demonstrate empirically that our proposed methods achieve significantly higher compression ratios (15x compared to 4x for ICAE), while being able to attain comparable reconstruction performance.
Authors: Lior Madmoni, Amir Zait, Ilia Labzovsky, Danny Karmon
Abstract: Generative AI agents are often expected to respond to complex user requests that have No One Right Answer (NORA), e.g., "design a vegetarian meal plan below 1800 calories". Such requests may entail a set of constraints that the agent should adhere to. To successfully develop agents for NORA scenarios, an accurate automatic evaluation framework is essential, and specifically - one capable of validating the satisfaction of constraints in the agent's response. Recently, large language models (LLMs) have been adopted as versatile evaluators for many NORA tasks, but their ability to evaluate constraint-satisfaction in generated text remains unclear. To study this, we develop and release a novel Arithmetic Constraint-Satisfaction (ACS) benchmarking dataset. The dataset consists of complex user requests with corresponding constraints, agent responses and human labels indicating each constraint's satisfaction level in the response. A unique property of this dataset is that validating many of its constraints requires reviewing the response as a whole (in contrast to many other benchmarks that require the validation of a single independent item). Moreover, it assesses LLMs in performing reasoning, in-context data extraction, arithmetic calculations, and counting. We then benchmark both open and proprietary LLMs on evaluating constraint-satisfaction, and show that most models still have a significant headroom for improvement, and that errors primarily stem from reasoning issues. In addition, most models exhibit a skewed constraint-satisfaction prediction pattern, with higher accuracy where the ground-truth label is "satisfied". Lastly, few-shot prompting for our task proved to be rather challenging, since many of the studied models showed a degradation in performance when it was introduced.
Authors: Lemeng Qi, Yang Han, Zhuotong Xie
Abstract: This paper explores the challenges posed by nominal adjectives (NAs) in natural language processing (NLP) tasks, particularly in part-of-speech (POS) tagging. We propose treating NAs as a distinct POS tag, "JN," and investigate its impact on POS tagging, BIO chunking, and coreference resolution. Our study shows that reclassifying NAs can improve the accuracy of syntactic analysis and structural understanding in NLP. We present experimental results using Hidden Markov Models (HMMs), Maximum Entropy (MaxEnt) models, and Spacy, demonstrating the feasibility and potential benefits of this approach. Additionally we trained a bert model to identify the NA in untagged text.
Authors: Yang Zhang, Yanfei Dong, Kenji Kawaguchi
Abstract: Large language models (LLMs) have gained increasing attention due to their prominent ability to understand and process texts. Nevertheless, LLMs largely remain opaque. The lack of understanding of LLMs has obstructed the deployment in safety-critical scenarios and hindered the development of better models. In this study, we advance the understanding of LLM by investigating the significance of individual layers in LLMs. We propose an efficient sampling method to faithfully evaluate the importance of layers using Shapley values, a widely used explanation framework in feature attribution and data valuation. In addition, we conduct layer ablation experiments to assess the performance degradation resulting from the exclusion of specific layers. Our findings reveal the existence of cornerstone layers, wherein certain early layers can exhibit a dominant contribution over others. Removing one cornerstone layer leads to a drastic collapse of the model performance, often reducing it to random guessing. Conversely, removing non-cornerstone layers results in only marginal performance changes. This study identifies cornerstone layers in LLMs and underscores their critical role for future research.
Authors: Siyuan Brandon Loh, Liang Ze Wong, Prasanta Bhattacharya, Joseph Simons, Wei Gao, Hong Zhang
Abstract: We investigate Large Language Models' (LLMs) ability to predict a user's stance on a target given a collection of his/her target-agnostic social media posts (i.e., user-level stance prediction). While we show early evidence that LLMs are capable of this task, we highlight considerable variability in the performance of the model across (i) the type of stance target, (ii) the prediction strategy and (iii) the number of target-agnostic posts supplied. Post-hoc analyses further hint at the usefulness of target-agnostic posts in providing relevant information to LLMs through the presence of both surface-level (e.g., target-relevant keywords) and user-level features (e.g., encoding users' moral values). Overall, our findings suggest that LLMs might offer a viable method for determining public stances towards new topics based on historical and target-agnostic data. At the same time, we also call for further research to better understand LLMs' strong performance on the stance prediction task and how their effectiveness varies across task contexts.
Authors: Peixin Qin, Chen Huang, Yang Deng, Wenqiang Lei, Tat-Seng Chua
Abstract: With the aid of large language models, current conversational recommender system (CRS) has gaining strong abilities to persuade users to accept recommended items. While these CRSs are highly persuasive, they can mislead users by incorporating incredible information in their explanations, ultimately damaging the long-term trust between users and the CRS. To address this, we propose a simple yet effective method, called PC-CRS, to enhance the credibility of CRS's explanations during persuasion. It guides the explanation generation through our proposed credibility-aware persuasive strategies and then gradually refines explanations via post-hoc self-reflection. Experimental results demonstrate the efficacy of PC-CRS in promoting persuasive and credible explanations. Further analysis reveals the reason behind current methods producing incredible explanations and the potential of credible explanations to improve recommendation accuracy.
Authors: Raju Gorain, Omkar Salunke
Abstract: The process of landscaping automotive innovation through patent analysis is crucial for Research and Development teams. It aids in comprehending innovation trends, technological advancements, and the latest technologies from competitors. Traditionally, this process required intensive manual efforts. However, with the advent of Large Language Models (LLMs), it can now be automated, leading to faster and more efficient patent categorization & state-of-the-art of inventive concept extraction. This automation can assist various R\&D teams in extracting relevant information from extensive patent databases. This paper introduces a method based on prompt engineering to extract essential information for landscaping. The information includes the problem addressed by the patent, the technology utilized, and the area of innovation within the vehicle ecosystem (such as safety, Advanced Driver Assistance Systems and more).The result demonstrates the implementation of this method to create a landscape of fuel cell technology using open-source patent data. This approach provides a comprehensive overview of the current state of fuel cell technology, offering valuable insights for future research and development in this field.
Authors: Daoyang Li, Mingyu Jin, Qingcheng Zeng, Haiyan Zhao, Mengnan Du
Abstract: Probing techniques for large language models (LLMs) have primarily focused on English, overlooking the vast majority of the world's languages. In this paper, we extend these probing methods to a multilingual context, investigating the behaviors of LLMs across diverse languages. We conduct experiments on several open-source LLM models, analyzing probing accuracy, trends across layers, and similarities between probing vectors for multiple languages. Our key findings reveal: (1) a consistent performance gap between high-resource and low-resource languages, with high-resource languages achieving significantly higher probing accuracy; (2) divergent layer-wise accuracy trends, where high-resource languages show substantial improvement in deeper layers similar to English; and (3) higher representational similarities among high-resource languages, with low-resource languages demonstrating lower similarities both among themselves and with high-resource languages. These results highlight significant disparities in LLMs' multilingual capabilities and emphasize the need for improved modeling of low-resource languages.
Authors: Tom Marzea, Abraham Israeli, Oren Tsur
Abstract: Automatic detection of online hate speech serves as a crucial step in the detoxification of the online discourse. Moreover, accurate classification can promote a better understanding of the proliferation of hate as a social phenomenon. While most prior work focus on the detection of hateful utterances, we argue that focusing on the user level is as important, albeit challenging. In this paper we consider a multimodal aggregative approach for the detection of hate-mongers, taking into account the potentially hateful texts, user activity, and the user network. We evaluate our methods on three unique datasets X (Twitter), Gab, and Parler showing that a processing a user's texts in her social context significantly improves the detection of hate mongers, compared to previously used text and graph-based methods. Our method can be then used to improve the classification of coded messages, dog-whistling, and racial gas-lighting, as well as inform intervention measures. Moreover, our approach is highly efficient even for very large datasets and networks.
Authors: Kaikai An, Shuzheng Si, Helan Hu, Haozhe Zhao, Yuchi Wang, Qingyan Guo, Baobao Chang
Abstract: Semantic Parsing aims to capture the meaning of a sentence and convert it into a logical, structured form. Previous studies show that semantic parsing enhances the performance of smaller models (e.g., BERT) on downstream tasks. However, it remains unclear whether the improvements extend similarly to LLMs. In this paper, our empirical findings reveal that, unlike smaller models, directly adding semantic parsing results into LLMs reduces their performance. To overcome this, we propose SENSE, a novel prompting approach that embeds semantic hints within the prompt. Experiments show that SENSE consistently improves LLMs' performance across various tasks, highlighting the potential of integrating semantic information to improve LLM capabilities.
Authors: Ahmed Adel Attia, Dorottya Demszky, Tolulope Ogunremi, Jing Liu, Carol Espy-Wilson
Abstract: Creating Automatic Speech Recognition (ASR) systems that are robust and resilient to classroom conditions is paramount to the development of AI tools to aid teachers and students. In this work, we study the efficacy of continued pretraining (CPT) in adapting Wav2vec2.0 to the classroom domain. We show that CPT is a powerful tool in that regard and reduces the Word Error Rate (WER) of Wav2vec2.0-based models by upwards of 10%. More specifically, CPT improves the model's robustness to different noises, microphones and classroom conditions.
Authors: Chenxu Wang, Ping Jian, Yang Zhen
Abstract: Logical reading comprehension is a challenging task that entails grasping the underlying semantics of text and applying reasoning to deduce the correct answer. Prior researches have primarily focused on enhancing logical reasoning capabilities through Chain-of-Thought (CoT) or data augmentation. However, previous work constructing chain-of-thought rationales concentrates solely on analyzing correct options, neglecting the incorrect alternatives. Addtionally, earlier efforts on data augmentation by altering contexts rely on rule-based methods, which result in generated contexts that lack diversity and coherence. To address these issues, we propose a Premise-Oriented Data Augmentation (PODA) framework. This framework can generate CoT rationales including analyses for both correct and incorrect options, while constructing diverse and high-quality counterfactual contexts from incorrect candidate options. We integrate summarizing premises and identifying premises for each option into rationales. Subsequently, we employ multi-step prompts with identified premises to construct counterfactual context. To facilitate the model's capabilities to better differentiate the reasoning process associated with each option, we introduce a novel thought-path contrastive learning method that compares reasoning paths between the original and counterfactual samples. Experimental results on three representative LLMs demonstrate that our method can improve the baselines substantially across two challenging logical reasoning benchmarks (ReClor and LogiQA 2.0). The data and code are released at https://github.com/lalalamdbf/TPReasoner.
Authors: David Chanin, James Wilken-Smith, Tom\'a\v{s} Dulka, Hardik Bhatnagar, Joseph Bloom
Abstract: Sparse Autoencoders (SAEs) have emerged as a promising approach to decompose the activations of Large Language Models (LLMs) into human-interpretable latents. In this paper, we pose two questions. First, to what extent do SAEs extract monosemantic and interpretable latents? Second, to what extent does varying the sparsity or the size of the SAE affect monosemanticity / interpretability? By investigating these questions in the context of a simple first-letter identification task where we have complete access to ground truth labels for all tokens in the vocabulary, we are able to provide more detail than prior investigations. Critically, we identify a problematic form of feature-splitting we call feature absorption where seemingly monosemantic latents fail to fire in cases where they clearly should. Our investigation suggests that varying SAE size or sparsity is insufficient to solve this issue, and that there are deeper conceptual issues in need of resolution.
Authors: Tuhin Chakrabarty, Philippe Laban, Chien-Sheng Wu
Abstract: LLM-based applications are helping people write, and LLM-generated text is making its way into social media, journalism, and our classrooms. However, the differences between LLM-generated and human-written text remain unclear. To explore this, we hired professional writers to edit paragraphs in several creative domains. We first found these writers agree on undesirable idiosyncrasies in LLM-generated text, formalizing it into a seven-category taxonomy (e.g. cliches, unnecessary exposition). Second, we curated the LAMP corpus: 1,057 LLM-generated paragraphs edited by professional writers according to our taxonomy. Analysis of LAMP reveals that none of the LLMs used in our study (GPT4o, Claude-3.5-Sonnet, Llama-3.1-70b) outperform each other in terms of writing quality, revealing common limitations across model families. Third, we explored automatic editing methods to improve LLM-generated text. A large-scale preference annotation confirms that although experts largely prefer text edited by other experts, automatic editing methods show promise in improving alignment between LLM-generated and human-written text.
Authors: Zhou Zhang, Dongzeng Tan, Jiaan Wang, Yilong Chen, Jiarong Xu
Abstract: Emojis have gained immense popularity on social platforms, serving as a common means to supplement or replace text. However, existing data mining approaches generally either completely ignore or simply treat emojis as ordinary Unicode characters, which may limit the model's ability to grasp the rich semantic information in emojis and the interaction between emojis and texts. Thus, it is necessary to release the emoji's power in social media data mining. To this end, we first construct a heterogeneous graph consisting of three types of nodes, i.e. post, word and emoji nodes to improve the representation of different elements in posts. The edges are also well-defined to model how these three elements interact with each other. To facilitate the sharing of information among post, word and emoji nodes, we propose a graph pre-train framework for text and emoji co-modeling, which contains two graph pre-training tasks: node-level graph contrastive learning and edge-level link reconstruction learning. Extensive experiments on the Xiaohongshu and Twitter datasets with two types of downstream tasks demonstrate that our approach proves significant improvement over previous strong baseline methods.
Authors: Hongchen Wang, Kangming Li, Scott Ramsay, Yao Fehlis, Edward Kim, Jason Hattrick-Simpers
Abstract: Large Language Models (LLMs) have the potential to revolutionize scientific research, yet their robustness and reliability in domain-specific applications remain insufficiently explored. This study conducts a comprehensive evaluation and robustness analysis of LLMs within the field of materials science, focusing on domain-specific question answering and materials property prediction. Three distinct datasets are used in this study: 1) a set of multiple-choice questions from undergraduate-level materials science courses, 2) a dataset including various steel compositions and yield strengths, and 3) a band gap dataset, containing textual descriptions of material crystal structures and band gap values. The performance of LLMs is assessed using various prompting strategies, including zero-shot chain-of-thought, expert prompting, and few-shot in-context learning. The robustness of these models is tested against various forms of 'noise', ranging from realistic disturbances to intentionally adversarial manipulations, to evaluate their resilience and reliability under real-world conditions. Additionally, the study uncovers unique phenomena of LLMs during predictive tasks, such as mode collapse behavior when the proximity of prompt examples is altered and performance enhancement from train/test mismatch. The findings aim to provide informed skepticism for the broad use of LLMs in materials science and to inspire advancements that enhance their robustness and reliability for practical applications.
Authors: Tim Patzelt
Abstract: In the field of biomedical natural language processing, medical concept normalization is a crucial task for accurately mapping mentions of concepts to a large knowledge base. However, this task becomes even more challenging in low-resource settings, where limited data and resources are available. In this thesis, I explore the challenges of medical concept normalization in a low-resource setting. Specifically, I investigate the shortcomings of current medical concept normalization methods applied to German lay texts. Since there is no suitable dataset available, a dataset consisting of posts from a German medical online forum is annotated with concepts from the Unified Medical Language System. The experiments demonstrate that multilingual Transformer-based models are able to outperform string similarity methods. The use of contextual information to improve the normalization of lay mentions is also examined, but led to inferior results. Based on the results of the best performing model, I present a systematic error analysis and lay out potential improvements to mitigate frequent errors.
Authors: Ogen Schlachet Drukerman, Einat Minkov
Abstract: Social networks form a valuable source of world knowledge, where influential entities correspond to popular accounts. Unlike factual knowledge bases (KBs), which maintain a semantic ontology, structured semantic information is not available on social media. In this work, we consider a social KB of roughly 200K popular Twitter accounts, which denotes entities of interest. We elicit semantic information about those entities. In particular, we associate them with a fine-grained set of 136 semantic types, e.g., determine whether a given entity account belongs to a politician, or a musical artist. In the lack of explicit type information in Twitter, we obtain semantic labels for a subset of the accounts via alignment with the KBs of DBpedia and Wikidata. Given the labeled dataset, we finetune a transformer-based text encoder to generate semantic embeddings of the entities based on the contents of their accounts. We then exploit this evidence alongside network-based embeddings to predict the entities semantic types. In our experiments, we show high type prediction performance on the labeled dataset. Consequently, we apply our type classification model to all of the entity accounts in the social KB. Our analysis of the results offers insights about the global semantics of the Twitter sphere. We discuss downstream applications that should benefit from semantic type information and the semantic embeddings of social entities generated in this work. In particular, we demonstrate enhanced performance on the key task of entity similarity assessment using this information.
Authors: Hossein Rajabzadeh, Aref Jafari, Aman Sharma, Benyamin Jami, Hyock Ju Kwon, Ali Ghodsi, Boxing Chen, Mehdi Rezagholizadeh
Abstract: Large Language Models (LLMs), with their increasing depth and number of parameters, have demonstrated outstanding performance across a variety of natural language processing tasks. However, this growth in scale leads to increased computational demands, particularly during inference and fine-tuning. To address these challenges, we introduce EchoAtt, a novel framework aimed at optimizing transformer-based models by analyzing and leveraging the similarity of attention patterns across layers. Our analysis reveals that many inner layers in LLMs, especially larger ones, exhibit highly similar attention matrices. By exploiting this similarity, EchoAtt enables the sharing of attention matrices in less critical layers, significantly reducing computational requirements without compromising performance. We incorporate this approach within a knowledge distillation setup, where a pre-trained teacher model guides the training of a smaller student model. The student model selectively shares attention matrices in layers with high similarity while inheriting key parameters from the teacher. Our best results with TinyLLaMA-1.1B demonstrate that EchoAtt improves inference speed by 15\%, training speed by 25\%, and reduces the number of parameters by approximately 4\%, all while improving zero-shot performance. These findings highlight the potential of attention matrix sharing to enhance the efficiency of LLMs, making them more practical for real-time and resource-limited applications.
Authors: Tohida Rehman, Debarshi Kumar Sanyal, Samiran Chattopadhyay
Abstract: The title of a research paper communicates in a succinct style the main theme and, sometimes, the findings of the paper. Coming up with the right title is often an arduous task, and therefore, it would be beneficial to authors if title generation can be automated. In this paper, we fine-tune pre-trained and large language models to generate titles of papers from their abstracts. We also use ChatGPT in a zero-shot setting to generate paper titles. The performance of the models is measured with ROUGE, METEOR, MoverScore, BERTScore and SciBERTScore metrics.
Authors: Aso Mahmudi, Borja Herce, Demian Inostroza Amestica, Andreas Scherbakov, Eduard Hovy, Ekaterina Vylomova
Abstract: Linguistic fieldwork is an important component in language documentation and preservation. However, it is a long, exhaustive, and time-consuming process. This paper presents a novel model that guides a linguist during the fieldwork and accounts for the dynamics of linguist-speaker interactions. We introduce a novel framework that evaluates the efficiency of various sampling strategies for obtaining morphological data and assesses the effectiveness of state-of-the-art neural models in generalising morphological structures. Our experiments highlight two key strategies for improving the efficiency: (1) increasing the diversity of annotated data by uniform sampling among the cells of the paradigm tables, and (2) using model confidence as a guide to enhance positive interaction by providing reliable predictions during annotation.
Authors: Bokang Bi, Leibo Liu, Oscar Perez-Concha, Sanja Lujic, Louisa Jorm
Abstract: The increasing volume and complexity of clinical documentation in Electronic Medical Records systems pose significant challenges for clinical coders, who must mentally process and summarise vast amounts of clinical text to extract essential information needed for coding tasks. While large language models have been successfully applied to shorter summarisation tasks in recent years, the challenge of summarising a hospital course remains an open area for further research and development. In this study, we adapted three pre trained LLMs, Llama 3, BioMistral, Mistral Instruct v0.1 for the hospital course summarisation task, using Quantized Low Rank Adaptation fine tuning. We created a free text clinical dataset from MIMIC III data by concatenating various clinical notes as the input clinical text, paired with ground truth Brief Hospital Course sections extracted from the discharge summaries for model training. The fine tuned models were evaluated using BERTScore and ROUGE metrics to assess the effectiveness of clinical domain fine tuning. Additionally, we validated their practical utility using a novel hospital course summary assessment metric specifically tailored for clinical coding. Our findings indicate that fine tuning pre trained LLMs for the clinical domain can significantly enhance their performance in hospital course summarisation and suggest their potential as assistive tools for clinical coding. Future work should focus on refining data curation methods to create higher quality clinical datasets tailored for hospital course summary tasks and adapting more advanced open source LLMs comparable to proprietary models to further advance this research.
Authors: Kengatharaiyer Sarveswaran
Abstract: Treebanks are important linguistic resources, which are structured and annotated corpora with rich linguistic annotations. These resources are used in Natural Language Processing (NLP) applications, supporting linguistic analyses, and are essential for training and evaluating various computational models. This paper discusses the creation of Tamil treebanks using three distinct approaches: manual annotation, computational grammars, and machine learning techniques. Manual annotation, though time-consuming and requiring linguistic expertise, ensures high-quality and rich syntactic and semantic information. Computational deep grammars, such as Lexical Functional Grammar (LFG), offer deep linguistic analyses but necessitate significant knowledge of the formalism. Machine learning approaches, utilising off-the-shelf frameworks and tools like Stanza, UDpipe, and UUParser, facilitate the automated annotation of large datasets but depend on the availability of quality annotated data, cross-linguistic training resources, and computational power. The paper discusses the challenges encountered in building Tamil treebanks, including issues with Internet data, the need for comprehensive linguistic analysis, and the difficulty of finding skilled annotators. Despite these challenges, the development of Tamil treebanks is essential for advancing linguistic research and improving NLP tools for Tamil.
Authors: Peifeng Wang, Austin Xu, Yilun Zhou, Caiming Xiong, Shafiq Joty
Abstract: Auto-evaluation is crucial for assessing response quality and offering feedback for model development. Recent studies have explored training large language models (LLMs) as generative judges to evaluate and critique other models' outputs. In this work, we investigate the idea of learning from both positive and negative data with preference optimization to enhance the evaluation capabilities of LLM judges across an array of different use cases. We achieve this by employing three approaches to collect the preference pairs for different use cases, each aimed at improving our generative judge from a different perspective. Our comprehensive study over a wide range of benchmarks demonstrates the effectiveness of our method. In particular, our generative judge achieves the best performance on 10 out of 13 benchmarks, outperforming strong baselines like GPT-4o and specialized judge models. Further analysis show that our judge model robustly counters inherent biases such as position and length bias, flexibly adapts to any evaluation protocol specified by practitioners, and provides helpful language feedback for improving downstream generator models.
Authors: Taihang Wang, Xiaoman Xu, Yimin Wang, Ye Jiang
Abstract: Real-world applications of large language models (LLMs) in computational social science (CSS) tasks primarily depend on the effectiveness of instruction tuning (IT) or in-context learning (ICL). While IT has shown highly effective at fine-tuning LLMs for various tasks, ICL offers a rapid alternative for task adaptation by learning from examples without explicit gradient updates. In this paper, we evaluate the classification performance of LLMs using IT versus ICL in few-shot CSS tasks. The experimental results indicate that ICL consistently outperforms IT in most CSS tasks. Additionally, we investigate the relationship between the increasing number of training samples and LLM performance. Our findings show that simply increasing the number of samples without considering their quality does not consistently enhance the performance of LLMs with either ICL or IT and can sometimes even result in a performance decline. Finally, we compare three prompting strategies, demonstrating that ICL is more effective than zero-shot and Chain-of-Thought (CoT). Our research highlights the significant advantages of ICL in handling CSS tasks in few-shot settings and emphasizes the importance of optimizing sample quality and prompting strategies to improve LLM classification performance. The code will be made available.
Authors: Ernie Chang, Pin-Jie Lin, Yang Li, Changsheng Zhao, Daeil Kim, Rastislav Rabatin, Zechun Liu, Yangyang Shi, Vikas Chandra
Abstract: Language model pretraining generally targets a broad range of use cases and incorporates data from diverse sources. However, there are instances where we desire a model that excels in specific areas without markedly compromising performance in other areas. A cost-effective and straightforward approach is sampling with low-dimensional data features, which allows to select large-scale pretraining data for domain-specific use cases. In this work, we revisit importance sampling with n-gram features consisting of multi-granular tokens, which strikes a good balance between sentence compression and representation capabilities. We observed the sampled data to have a high correlation with the target downstream task performance while preserving its effectiveness on other tasks. This leads to the proposed data sampling paradigm where language models can be pretrained more efficiently on selected documents. On eight benchmarks we demonstrate with $\sim$1% of the data, pretrained models perform on par with the full RefinedWeb data and outperform randomly selected samples for model sizes ranging from 125M to 1.5B.
Authors: Yihong Tang, Jiao Ou, Che Liu, Fuzheng Zhang, Di Zhang, Kun Gai
Abstract: Role-playing is an emerging application in the field of Human-Computer Interaction (HCI), primarily implemented through the alignment training of a large language model (LLM) with assigned characters. Despite significant progress, role-playing agents (RPLAs) still struggle with maintaining role-consistency across conversations, particularly when confronted with boundary queries subtly related to character attributes. In this paper, we present ERABAL, a framework aimed at enhancing RPLAs' role-playing capabilities through boundary-aware learning. ERABAL encompasses a generation pipeline for role-specific dialogues and a concomitant methodology for alignment training. Through comprehensive evaluations, we demonstrate that ERABAL is both efficient and effective. By training with significantly fewer dialogues than those used in leading approaches, ERABAL achieves notable improvements across WikiRoleEval, CharacterEval, and the role-playing subset of MT-Bench compared to the generalist baseline models. Our code and datasets will be made publicly available to support further research.
Authors: Zheng Hui, Zhaoxiao Guo, Hang Zhao, Juanyong Duan, Congrui Huang
Abstract: In different NLP tasks, detecting harmful content is crucial for online environments, especially with the growing influence of social media. However, previous research has two main issues: 1) a lack of data in low-resource settings, and 2) inconsistent definitions and criteria for judging harmful content, requiring classification models to be robust to spurious features and diverse. We propose Toxicraft, a novel framework for synthesizing datasets of harmful information to address these weaknesses. With only a small amount of seed data, our framework can generate a wide variety of synthetic, yet remarkably realistic, examples of toxic information. Experimentation across various datasets showcases a notable enhancement in detection model robustness and adaptability, surpassing or close to the gold labels. We release the generated data at Github upon acceptance.
Authors: Sihui Yang, Keping Bi, Wanqing Cui, Jiafeng Guo, Xueqi Cheng
Abstract: Non-Factoid (NF) Question Answering (QA) is challenging to evaluate due to diverse potential answers and no objective criterion. The commonly used automatic evaluation metrics like ROUGE or BERTScore cannot accurately measure semantic similarities or answers from different perspectives. Recently, Large Language Models (LLMs) have been resorted to for NFQA evaluation due to their compelling performance on various NLP tasks. Common approaches include pointwise scoring of each candidate answer and pairwise comparisons between answers. Inspired by the evolution from pointwise to pairwise to listwise in learning-to-rank methods, we propose a novel listwise NFQA evaluation approach, that utilizes LLMs to rank candidate answers in a list of reference answers sorted by descending quality. Moreover, for NF questions that do not have multi-grade or any golden answers, we leverage LLMs to generate the reference answer list of various quality to facilitate the listwise evaluation. Extensive experimental results on three NFQA datasets, i.e., ANTIQUE, the TREC-DL-NF, and WebGLM show that our method has significantly higher correlations with human annotations compared to automatic scores and common pointwise and pairwise approaches.
Authors: Yuyan Chen, Tianhao Yu, Yueze Li, Songzhou Yan, Sijia Liu, Jiaqing Liang, Yanghua Xiao
Abstract: The evaluation of the problem-solving capability under incomplete information scenarios of Large Language Models (LLMs) is increasingly important, encompassing capabilities such as questioning, knowledge search, error detection, and path planning. Current research mainly focus on LLMs' problem-solving capability such as ``Twenty Questions''. However, these kinds of games do not require recognizing misleading cues which are necessary in the incomplete information scenario. Moreover, the existing game such as ``Who is undercover'' are highly subjective, making it challenging for evaluation. Therefore, in this paper, we introduce a novel game named BrainKing based on the ``Who is undercover'' and ``Twenty Questions'' for evaluating LLM capabilities under incomplete information scenarios. It requires LLMs to identify target entities with limited yes-or-no questions and potential misleading answers. By setting up easy, medium, and hard difficulty modes, we comprehensively assess the performance of LLMs across various aspects. Our results reveal the capabilities and limitations of LLMs in BrainKing, providing significant insights of LLM problem-solving levels.
Authors: Sona Binu, Jismi Jose, Fathima Shimna K V, Alino Luke Hans, Reni K. Cherian, Starlet Ben Alex, Priyanka Srivastava, Chiranjeevi Yarra
Abstract: The people with Major Depressive Disorder (MDD) exhibit the symptoms of tonal variations in their speech compared to the healthy counterparts. However, these tonal variations not only confine to the state of MDD but also on the language, which has unique tonal patterns. This work analyzes automatic speech-based depression detection across two languages, English and Malayalam, which exhibits distinctive prosodic and phonemic characteristics. We propose an approach that utilizes speech data collected along with self-reported labels from participants reading sentences from IViE corpus, in both English and Malayalam. The IViE corpus consists of five sets of sentences: simple sentences, WH-questions, questions without morphosyntactic markers, inversion questions and coordinations, that can naturally prompt speakers to speak in different tonal patterns. Convolutional Neural Networks (CNNs) are employed for detecting depression from speech. The CNN model is trained to identify acoustic features associated with depression in speech, focusing on both languages. The model's performance is evaluated on the collected dataset containing recordings from both depressed and non-depressed speakers, analyzing its effectiveness in detecting depression across the two languages. Our findings and collected data could contribute to the development of language-agnostic speech-based depression detection systems, thereby enhancing accessibility for diverse populations.
Authors: Tal Kadosh, Niranjan Hasabnis, Prema Soundararajan, Vy A. Vo, Mihai Capota, Nesreen Ahmed, Yuval Pinter, Gal Oren
Abstract: Manual parallelization of code remains a significant challenge due to the complexities of modern software systems and the widespread adoption of multi-core architectures. This paper introduces OMPar, an AI-driven tool designed to automate the parallelization of C/C++ code using OpenMP pragmas. OMPar integrates Large Language Models (LLMs) through two key components: OMPify, which assesses loop parallelization potential, and MonoCoder-OMP, a new fine-tuned model which generates precise OpenMP pragmas. The evaluation of OMPar follows the same rigorous process applied to traditional tools like source-to-source AutoPar and ICPC compilers: (1) ensuring the generated code compiles and runs correctly in serial form, (2) assessing performance with the gradual addition of threads and corresponding physical cores, and (3) verifying and validating the correctness of the code's output. Benchmarks from HeCBench and ParEval are used to evaluate accuracy and performance. Experimental results demonstrate that OMPar significantly outperforms traditional methods, achieving higher accuracy in identifying parallelizable loops and generating efficient pragmas. Beyond accuracy, OMPar offers advantages such as the ability to work on partial or incomplete codebases and the capacity to continuously learn from new code patterns, enhancing its parallelization capabilities over time. These results underscore the potential of LLMs in revolutionizing automatic parallelization techniques, paving the way for more efficient and scalable parallel computing systems.
Authors: Weichao Zhang, Ruqing Zhang, Jiafeng Guo, Maarten de Rijke, Yixing Fan, Xueqi Cheng
Abstract: As the scale of training corpora for large language models (LLMs) grows, model developers become increasingly reluctant to disclose details on their data. This lack of transparency poses challenges to scientific evaluation and ethical deployment. Recently, pretraining data detection approaches, which infer whether a given text was part of an LLM's training data through black-box access, have been explored. The Min-K% Prob method, which has achieved state-of-the-art results, assumes that a non-training example tends to contain a few outlier words with low token probabilities. However, the effectiveness may be limited as it tends to misclassify non-training texts that contain many common words with high probabilities predicted by LLMs. To address this issue, we introduce a divergence-based calibration method, inspired by the divergence-from-randomness concept, to calibrate token probabilities for pretraining data detection. We compute the cross-entropy (i.e., the divergence) between the token probability distribution and the token frequency distribution to derive a detection score.We have developed a Chinese-language benchmark, PatentMIA, to assess the performance of detection approaches for LLMs on Chinese text. Experimental results on English-language benchmarks and PatentMIA demonstrate that our proposed method significantly outperforms existing methods. Our code and PatentMIA benchmark are available at https://github.com/zhang-wei-chao/DC-PDD
Authors: Patrick Amadeus Irawan, Genta Indra Winata, Samuel Cahyawijaya, Ayu Purwarianti
Abstract: Natural Language Explanation (NLE) aims to elucidate the decision-making process by providing detailed, human-friendly explanations in natural language. It helps demystify the decision-making processes of large vision-language models (LVLMs) through the use of language models. While existing methods for creating a Vision Question-Answering with Natural Language Explanation (VQA-NLE) datasets can provide explanations, they heavily rely on human annotations that are time-consuming and costly. In this study, we propose a novel approach that leverages LVLMs to efficiently generate high-quality synthetic VQA-NLE datasets. By evaluating our synthetic data, we showcase how advanced prompting techniques can lead to the production of high-quality VQA-NLE data. Our findings indicate that this proposed method achieves up to 20x faster than human annotation, with only a minimal decrease in qualitative metrics, achieving robust quality that is nearly equivalent to human-annotated data. Furthermore, we show that incorporating visual prompts significantly enhances the relevance of text generation. Our study paves the way for a more efficient and robust automated generation of multi-modal NLE data, offering a promising solution to the problem.
Authors: Gia-Bao Dinh Ho, Chang Wei Tan, Zahra Zamanzadeh Darban, Mahsa Salehi, Gholamreza Haffari, Wray Buntine
Abstract: Detecting critical moments, such as emotional outbursts or changes in decisions during conversations, is crucial for understanding shifts in human behavior and their consequences. Our work introduces a novel problem setting focusing on these moments as turning points (TPs), accompanied by a meticulously curated, high-consensus, human-annotated multi-modal dataset. We provide precise timestamps, descriptions, and visual-textual evidence high-lighting changes in emotions, behaviors, perspectives, and decisions at these turning points. We also propose a framework, TPMaven, utilizing state-of-the-art vision-language models to construct a narrative from the videos and large language models to classify and detect turning points in our multi-modal dataset. Evaluation results show that TPMaven achieves an F1-score of 0.88 in classification and 0.61 in detection, with additional explanations aligning with human expectations.
Authors: Qinzhuo Wu, Weikai Xu, Wei Liu, Tao Tan, Jianfeng Liu, Ang Li, Jian Luan, Bin Wang, Shuo Shang
Abstract: Recently, mobile AI agents based on VLMs have been gaining increasing attention. These works typically utilize VLM as a foundation, fine-tuning it with instruction-based mobile datasets. However, these VLMs are typically pre-trained on general-domain data, which often results in a lack of fundamental capabilities specific to the mobile domain. Therefore, they may struggle to recognize specific UI elements and understand intra-UI fine-grained information. In addition, the current fine-tuning task focuses on interacting with the most relevant element for the given instruction. These fine-tuned VLMs may still ignore the relationships between UI pages, neglect the roles of elements in page transitions and lack inter-UI understanding. To address issues, we propose a VLM called MobileVLM, which includes two additional pre-training stages to enhance both intra- and inter-UI understanding. We defined four UI-based pre-training tasks, enabling the model to better perceive fine-grained elements and capture page transition actions. To address the lack of mobile pre-training data, we built a large Chinese mobile dataset Mobile3M from scratch, which contains 3 million UI pages, and real-world transition actions, forming a directed graph structure. Experimental results show MobileVLM excels on both our test set and public mobile benchmarks, outperforming existing VLMs.
Authors: Nianqi Li, Siyu Yuan, Jiangjie Chen, Jiaqing Liang, Feng Wei, Zujie Liang, Deqing Yang, Yanghua Xiao
Abstract: Historical analogies, which compare known past events with contemporary but unfamiliar events, are important abilities that help people make decisions and understand the world. However, research in applied history suggests that people have difficulty finding appropriate analogies. And previous studies in the AI community have also overlooked historical analogies. To fill this gap, in this paper, we focus on the historical analogy acquisition task, which aims to acquire analogous historical events for a given event. We explore retrieval and generation methods for acquiring historical analogies based on different large language models (LLMs). Furthermore, we propose a self-reflection method to mitigate hallucinations and stereotypes when LLMs generate historical analogies. Through human evaluations and our specially designed automatic multi-dimensional assessment, we find that LLMs generally have a good potential for historical analogies. And the performance of the models can be further improved by using our self-reflection method.
Authors: Qinzhuo Wu, Wei Liu, Jian Luan, Bin Wang
Abstract: Recently, tool-augmented LLMs have gained increasing attention. Given an instruction, tool-augmented LLMs can interact with various external tools in multiple rounds and provide a final answer. However, previous LLMs were trained on overly detailed instructions, which included API names or parameters, while real users would not explicitly mention these API details. This leads to a gap between trained LLMs and real-world scenarios. In addition, most works ignore whether the interaction process follows the instruction. To address these issues, we constructed a training dataset called MGToolBench, which contains statement and category-level instructions to better reflect real-world scenarios. In addition, we propose ToolPlanner, a two-stage reinforcement learning framework that utilizes path planning and two feedback mechanisms to enhance the LLM's task completion and instruction-following capabilities. Experimental results show that ToolPlanner significantly improves the Match Rate, Pass Rate and Win Rate by 26.8%, 20.2%, and 5.6% compared to the SOTA model. Human evaluation verifies that the multi-granularity instructions can better align with users' usage habits. Our data and code will be released upon acceptance.
Authors: Chenxu Yang, Ruipeng Jia, Naibin Gu, Zheng Lin, Siyuan Chen, Chao Pang, Weichong Yin, Yu Sun, Hua Wu, Weiping Wang
Abstract: DPO is an effective preference optimization algorithm. However, the DPO-tuned models tend to overfit on the dispreferred samples, manifested as overly long generations lacking diversity. While recent regularization approaches have endeavored to alleviate this issue by modifying the objective function, they achieved that at the cost of alignment performance degradation. In this paper, we innovatively incorporate regularization from the perspective of weight updating to curb alignment overfitting. Through the pilot experiment, we discovered that there exists a positive correlation between overfitting and the hyperspherical energy fluctuation. Hence, we introduce orthogonal finetuning for DPO via a weight-Rotated Preference Optimization (RoPO) method, which merely conducts rotational and magnitude-stretching updates on the weight parameters to maintain the hyperspherical energy invariant, thereby preserving the knowledge encoded in the angle between neurons. Extensive experiments demonstrate that our model aligns perfectly with human preferences while retaining the original expressive capacity using only 0.0086% of the trainable parameters, suggesting an effective regularization against overfitting. Specifically, RoPO outperforms DPO by up to 10 points on MT-Bench and by up to 2.8 points on AlpacaEval 2, while enhancing the generation diversity by an average of 6 points.
Authors: Arda Goknil, Femke B. Gelderblom, Simeon Tverdal, Shukun Tokas, Hui Song
Abstract: Privacy policies are often obfuscated by their complexity, which impedes transparency and informed consent. Conventional machine learning approaches for automatically analyzing these policies demand significant resources and substantial domain-specific training, causing adaptability issues. Moreover, they depend on extensive datasets that may require regular maintenance due to changing privacy concerns. In this paper, we propose, apply, and assess PAPEL (Privacy Policy Analysis through Prompt Engineering for LLMs), a framework harnessing the power of Large Language Models (LLMs) through prompt engineering to automate the analysis of privacy policies. PAPEL aims to streamline the extraction, annotation, and summarization of information from these policies, enhancing their accessibility and comprehensibility without requiring additional model training. By integrating zero-shot, one-shot, and few-shot learning approaches and the chain-of-thought prompting in creating predefined prompts and prompt templates, PAPEL guides LLMs to efficiently dissect, interpret, and synthesize the critical aspects of privacy policies into user-friendly summaries. We demonstrate the effectiveness of PAPEL with two applications: (i) annotation and (ii) contradiction analysis. We assess the ability of several LLaMa and GPT models to identify and articulate data handling practices, offering insights comparable to existing automated analysis approaches while reducing training efforts and increasing the adaptability to new analytical needs. The experiments demonstrate that the LLMs PAPEL utilizes (LLaMA and Chat GPT models) achieve robust performance in privacy policy annotation, with F1 scores reaching 0.8 and above (using the OPP-115 gold standard), underscoring the effectiveness of simpler prompts across various advanced language models.
Authors: Bin Hong, Jinze Wu, Jiayu Liu, Liang Ding, Jing Sha, Kai Zhang, Shijin Wang, Zhenya Huang
Abstract: In recent years, the breakthrough of Large Language Models (LLMs) offers new ideas for achieving universal methods on graph data. The common practice of converting graphs into natural language for LLMs, which refers to graph flattening, exhibits good generalizability and interpretability. However, the poor organization of the textual format results in poor performance in long-distance scenario understanding. Inspired by human cognitive reasoning habits, we propose a novel method for graph flattening to fit LLMs, termed as End-to-End DAG-Path prompting (EEDP). Experiments on real-world datasets show that EEDP enhances the reasoning performance of LLMs in long-distance scenarios while maintaining excellent performance in short-distance scenarios, demonstrating good robustness in the face of distance variations.
Authors: Sangyeon Cho, Jangyeong Jeon, Dongjoon Lee, Changhee Lee, Junyeong Kim
Abstract: The use of pre-trained language models fine-tuned to address specific downstream tasks is a common approach in natural language processing (NLP). However, acquiring domain-specific knowledge via fine-tuning is challenging. Traditional methods involve pretraining language models using vast amounts of domain-specific data before fine-tuning for particular tasks. This study investigates emergency/non-emergency classification tasks based on electronic medical record (EMR) data obtained from pediatric emergency departments (PEDs) in Korea. Our findings reveal that existing domain-specific pre-trained language models underperform compared to general language models in handling N-lingual free-text data characteristics of non-English-speaking regions. To address these limitations, we propose a domain knowledge transfer methodology that leverages knowledge distillation to infuse general language models with domain-specific knowledge via fine-tuning. This study demonstrates the effective transfer of specialized knowledge between models by defining a general language model as the student model and a domain-specific pre-trained model as the teacher model. In particular, we address the complexities of EMR data obtained from PEDs in non-English-speaking regions, such as Korea, and demonstrate that the proposed method enhances classification performance in such contexts. The proposed methodology not only outperforms baseline models on Korean PED EMR data, but also promises broader applicability in various professional and technical domains. In future works, we intend to extend this methodology to include diverse non-English-speaking regions and address additional downstream tasks, with the aim of developing advanced model architectures using state-of-the-art KD techniques. The code is available in https://github.com/JoSangYeon/DSG-KD.
Authors: Aseem Srivastava, Smriti Joshi, Tanmoy Chakraborty, Md Shad Akhtar
Abstract: In mental health counseling, condensing dialogues into concise and relevant summaries (aka counseling notes) holds pivotal significance. Large Language Models (LLMs) exhibit remarkable capabilities in various generative tasks; however, their adaptation to domain-specific intricacies remains challenging, especially within mental health contexts. Unlike standard LLMs, mental health experts first plan to apply domain knowledge in writing summaries. Our work enhances LLMs' ability by introducing a novel planning engine to orchestrate structuring knowledge alignment. To achieve high-order planning, we divide knowledge encapsulation into two major phases: (i) holding dialogue structure and (ii) incorporating domain-specific knowledge. We employ a planning engine on Llama-2, resulting in a novel framework, PIECE. Our proposed system employs knowledge filtering-cum-scaffolding to encapsulate domain knowledge. Additionally, PIECE leverages sheaf convolution learning to enhance its understanding of the dialogue's structural nuances. We compare PIECE with 14 baseline methods and observe a significant improvement across ROUGE and Bleurt scores. Further, expert evaluation and analyses validate the generation quality to be effective, sometimes even surpassing the gold standard. We further benchmark PIECE with other LLMs and report improvement, including Llama-2 (+2.72%), Mistral (+2.04%), and Zephyr (+1.59%), to justify the generalizability of the planning engine.
Authors: Peter M\"uhlbacher, Nikos I. Bosse, Lawrence Phillips
Abstract: We present initial results of a forthcoming benchmark for evaluating LLM agents on white-collar tasks of economic value. We evaluate eight realistic and ``messy'' tasks that are routine in finance and consulting, drawn from real-world cases from our customers. We lay the groundwork for an LLM agent evaluation suite where good performance directly corresponds to a large economic and societal impact. This fills a gap in existing benchmarks with tasks like ``order a pizza to the following address'' that do not constitute real-human work of economic value. Our evaluations assign credit to agents for partially solving tasks. By doing that, this initial evaluation, and the forthcoming benchmark, allow us to more accurately extrapolate performance of LLM-based agents on economically valuable tasks. We built and tested several architectures with GPT-4o, Claude-3.5 Sonnet, Llama 3.1 (405b), and GPT-4o-mini, ensuring that failure to solve a task was due to failures of reasoning and planning, rather than due to common failures like e.g. the inability to parse a website. On average, LLM agents powered by Claude-3.5 Sonnet substantially outperformed agents using GPT-4o, with agents based on Llama 3.1 (405b) and GPT-4o-mini lagging noticeably behind. Across LLMs, a ReAct architecture with the ability to delegate subtasks to subagents performed best. In addition to quantitative evaluations, we qualitatively assessed the performance of the LLM agents by inspecting their traces and reflecting on their observations.
Authors: Tyler Loakman, Yucheng Li, Chenghua Lin
Abstract: Recently, Large Language Models (LLMs) and Vision Language Models (VLMs) have demonstrated aptitude as potential substitutes for human participants in experiments testing psycholinguistic phenomena. However, an understudied question is to what extent models that only have access to vision and text modalities are able to implicitly understand sound-based phenomena via abstract reasoning from orthography and imagery alone. To investigate this, we analyse the ability of VLMs and LLMs to demonstrate sound symbolism (i.e., to recognise a non-arbitrary link between sounds and concepts) as well as their ability to ``hear'' via the interplay of the language and vision modules of open and closed-source multimodal models. We perform multiple experiments, including replicating the classic Kiki-Bouba and Mil-Mal shape and magnitude symbolism tasks, and comparing human judgements of linguistic iconicity with that of LLMs. Our results show that VLMs demonstrate varying levels of agreement with human labels, and more task information may be required for VLMs versus their human counterparts for in silico experimentation. We additionally see through higher maximum agreement levels that Magnitude Symbolism is an easier pattern for VLMs to identify than Shape Symbolism, and that an understanding of linguistic iconicity is highly dependent on model size.
Authors: Siyun Zhao, Yuqing Yang, Zilong Wang, Zhiyuan He, Luna K. Qiu, Lili Qiu
Abstract: Large language models (LLMs) augmented with external data have demonstrated remarkable capabilities in completing real-world tasks. Techniques for integrating external data into LLMs, such as Retrieval-Augmented Generation (RAG) and fine-tuning, are gaining increasing attention and widespread application. Nonetheless, the effective deployment of data-augmented LLMs across various specialized fields presents substantial challenges. These challenges encompass a wide range of issues, from retrieving relevant data and accurately interpreting user intent to fully harnessing the reasoning capabilities of LLMs for complex tasks. We believe that there is no one-size-fits-all solution for data-augmented LLM applications. In practice, underperformance often arises from a failure to correctly identify the core focus of a task or because the task inherently requires a blend of multiple capabilities that must be disentangled for better resolution. In this survey, we propose a RAG task categorization method, classifying user queries into four levels based on the type of external data required and primary focus of the task: explicit fact queries, implicit fact queries, interpretable rationale queries, and hidden rationale queries. We define these levels of queries, provide relevant datasets, and summarize the key challenges and most effective techniques for addressing these challenges. Finally, we discuss three main forms of integrating external data into LLMs: context, small model, and fine-tuning, highlighting their respective strengths, limitations, and the types of problems they are suited to solve. This work aims to help readers thoroughly understand and decompose the data requirements and key bottlenecks in building LLM applications, offering solutions to the different challenges and serving as a guide to systematically developing such applications.
Authors: Elena Chistova
Abstract: Discourse parsing is a crucial task in natural language processing that aims to reveal the higher-level relations in a text. Despite growing interest in cross-lingual discourse parsing, challenges persist due to limited parallel data and inconsistencies in the Rhetorical Structure Theory (RST) application across languages and corpora. To address this, we introduce a parallel Russian annotation for the large and diverse English GUM RST corpus. Leveraging recent advances, our end-to-end RST parser achieves state-of-the-art results on both English and Russian corpora. It demonstrates effectiveness in both monolingual and bilingual settings, successfully transferring even with limited second-language annotation. To the best of our knowledge, this work is the first to evaluate the potential of cross-lingual end-to-end RST parsing on a manually annotated parallel corpus.
Authors: Anthony Sicilia, Malihe Alikhani
Abstract: Typically, when evaluating Theory of Mind, we consider the beliefs of others to be binary: held or not held. But what if someone is unsure about their own beliefs? How can we quantify this uncertainty? We propose a new suite of tasks, challenging language models (LMs) to model the uncertainty of others in dialogue. We design these tasks around conversation forecasting, wherein an agent forecasts an unobserved outcome to a conversation. Uniquely, we view interlocutors themselves as forecasters, asking an LM to predict the uncertainty of the interlocutors (a probability). We experiment with re-scaling methods, variance reduction strategies, and demographic context, for this regression task, conducting experiments on three dialogue corpora (social, negotiation, task-oriented) with eight LMs. While LMs can explain up to 7% variance in the uncertainty of others, we highlight the difficulty of the tasks and room for future work, especially in practical applications, like anticipating ``false
Authors: Cl\'ement Christophe, Tathagata Raha, Svetlana Maslenkova, Muhammad Umar Salman, Praveen K Kanithi, Marco AF Pimentel, Shadab Khan
Abstract: Large Language Models (LLMs) have demonstrated significant potential in transforming clinical applications. In this study, we investigate the efficacy of four techniques in adapting LLMs for clinical use-cases: continuous pretraining, instruct fine-tuning, NEFTune, and prompt engineering. We employ these methods on Mistral 7B and Mixtral 8x7B models, leveraging a large-scale clinical pretraining dataset of 50 billion tokens and an instruct fine-tuning dataset of 500 million tokens. Our evaluation across various clinical tasks reveals the impact of each technique. While continuous pretraining beyond 250 billion tokens yields marginal improvements on its own, it establishes a strong foundation for instruct fine-tuning. Notably, NEFTune, designed primarily to enhance generation quality, surprisingly demonstrates additional gains on our benchmark. Complex prompt engineering methods further enhance performance. These findings show the importance of tailoring fine-tuning strategies and exploring innovative techniques to optimize LLM performance in the clinical domain.
Authors: Chun Xu, Mengmeng Wang, Yan Ren, Shaolin Zhu
Abstract: Aspect-Based Sentiment Analysis (ABSA) in tourism plays a significant role in understanding tourists' evaluations of specific aspects of attractions, which is crucial for driving innovation and development in the tourism industry. However, traditional pipeline models are afflicted by issues such as error propagation and incomplete extraction of sentiment elements. To alleviate this issue, this paper proposes an aspect-based sentiment analysis model, ACOS_LLM, for Aspect-Category-Opinion-Sentiment Quadruple Extraction (ACOSQE). The model comprises two key stages: auxiliary knowledge generation and ACOSQE. Firstly, Adalora is used to fine-tune large language models for generating high-quality auxiliary knowledge. To enhance model efficiency, Sparsegpt is utilized to compress the fine-tuned model to 50% sparsity. Subsequently, Positional information and sequence modeling are employed to achieve the ACOSQE task, with auxiliary knowledge and the original text as inputs. Experiments are conducted on both self-created tourism datasets and publicly available datasets, Rest15 and Rest16. Results demonstrate the model's superior performance, with an F1 improvement of 7.49% compared to other models on the tourism dataset. Additionally, there is an F1 improvement of 0.05% and 1.06% on the Rest15 and Rest16 datasets, respectively.
Authors: Shashank Rajput, Ying Sheng, Sean Owen, Vitaliy Chiley
Abstract: The size of the key-value (KV) cache plays a critical role in determining both the maximum context length and the number of concurrent requests supported during inference in modern language models. The KV cache size grows proportionally with the number of attention heads and the tokens processed, leading to increased memory consumption and slower inference for long inputs. In this work, we explore the use of MixAttention, a model architecture modification closely related to a blog published by Character.AI. MixAttention combines sliding window attention, where only a small subset of recent tokens is stored in the KV cache, with KV cache sharing across layers. Our experiments demonstrate that MixAttention significantly reduces memory usage and improves inference speed without sacrificing model performance in both short and long-context tasks. We also explore various configurations of this architecture, identifying those that maintain quality across evaluation metrics while optimizing resource efficiency.
Authors: Mohammad Amin Roshani, Xiangyu Zhou, Yao Qiang, Srinivasan Suresh, Steve Hicks, Usha Sethuraman, Dongxiao Zhu
Abstract: Large language models (LLMs) have shown remarkable capabilities in various natural language tasks and are increasingly being applied in healthcare domains. This work demonstrates a new LLM-powered disease risk assessment approach via streaming human-AI conversation, eliminating the need for programming required by traditional machine learning approaches. In a COVID-19 severity risk assessment case study, we fine-tune pre-trained generative LLMs (e.g., Llama2-7b and Flan-t5-xl) using a few shots of natural language examples, comparing their performance with traditional classifiers (i.e., Logistic Regression, XGBoost, Random Forest) that are trained de novo using tabular data across various experimental settings. We develop a mobile application that uses these fine-tuned LLMs as its generative AI (GenAI) core to facilitate real-time interaction between clinicians and patients, providing no-code risk assessment through conversational interfaces. This integration not only allows for the use of streaming Questions and Answers (QA) as inputs but also offers personalized feature importance analysis derived from the LLM's attention layers, enhancing the interpretability of risk assessments. By achieving high Area Under the Curve (AUC) scores with a limited number of fine-tuning samples, our results demonstrate the potential of generative LLMs to outperform discriminative classification methods in low-data regimes, highlighting their real-world adaptability and effectiveness. This work aims to fill the existing gap in leveraging generative LLMs for interactive no-code risk assessment and to encourage further research in this emerging field.
Authors: Ga\"etan Caillaut, Raheel Qader, Mariam Nakhl\'e, Jingshu Liu, Jean-Gabriel Barth\'elemy
Abstract: Recent studies have showcased remarkable capabilities of decoder-only models in many NLP tasks, including translation. Yet, the machine translation field has been largely dominated by encoder-decoder models based on the Transformer architecture. As a consequence, scaling laws of encoder-decoder models for neural machine translation have already been well studied, but decoder-only models have received less attention. This work explores the scaling laws of decoder-only models on the multilingual and multidomain translation task. We trained a collection of six decoder-only models, ranging from 70M to 7B parameters, on a sentence-level, multilingual and multidomain dataset. We conducted a series of experiments showing that the loss of decoder-only models can be estimated using a scaling law similar to the one discovered for large language models, but we also show that this scaling law has difficulties to generalize to too large models or to a different data distribution. We also study different scaling methods and show that scaling the depth and the width of a model lead to similar test loss improvements, but with different impact on the model's efficiency.
Authors: Siddharth Betala, Ishan Chokshi
Abstract: In this paper, we describe our system under the team name Brotherhood for the English-to-Lowres Multi-Modal Translation Task. We participate in the multi-modal translation tasks for English-Hindi, English-Hausa, English-Bengali, and English-Malayalam language pairs. We present a method leveraging multi-modal Large Language Models (LLMs), specifically GPT-4o and Claude 3.5 Sonnet, to enhance cross-lingual image captioning without traditional training or fine-tuning. Our approach utilizes instruction-tuned prompting to generate rich, contextual conversations about cropped images, using their English captions as additional context. These synthetic conversations are then translated into the target languages. Finally, we employ a weighted prompting strategy, balancing the original English caption with the translated conversation to generate captions in the target language. This method achieved competitive results, scoring 37.90 BLEU on the English-Hindi Challenge Set and ranking first and second for English-Hausa on the Challenge and Evaluation Leaderboards, respectively. We conduct additional experiments on a subset of 250 images, exploring the trade-offs between BLEU scores and semantic similarity across various weighting schemes.
Authors: Sean Kim, Raja Mazumder
Abstract: The exponential growth in computational power and accessibility has transformed the complexity and scale of bioinformatics research, necessitating standardized documentation for transparency, reproducibility, and regulatory compliance. The IEEE BioCompute Object (BCO) standard addresses this need but faces adoption challenges due to the overhead of creating compliant documentation, especially for legacy research. This paper presents a novel approach to automate the creation of BCOs from scientific papers using Retrieval-Augmented Generation (RAG) and Large Language Models (LLMs). We describe the development of the BCO assistant tool that leverages RAG to extract relevant information from source papers and associated code repositories, addressing key challenges such as LLM hallucination and long-context understanding. The implementation incorporates optimized retrieval processes, including a two-pass retrieval with re-ranking, and employs carefully engineered prompts for each BCO domain. We discuss the tool's architecture, extensibility, and evaluation methods, including automated and manual assessment approaches. The BCO assistant demonstrates the potential to significantly reduce the time and effort required for retroactive documentation of bioinformatics research while maintaining compliance with the standard. This approach opens avenues for AI-assisted scientific documentation and knowledge extraction from publications thereby enhancing scientific reproducibility. The BCO assistant tool and documentation is available at https://biocompute-objects.github.io/bco-rag/.
Authors: Kunyao Lan, Bingui Jin, Zichen Zhu, Siyuan Chen, Shu Zhang, Kenny Q. Zhu, Mengyue Wu
Abstract: Mental health issues, particularly depressive disorders, present significant challenges in contemporary society, necessitating the development of effective automated diagnostic methods. This paper introduces the Agent Mental Clinic (AMC), a self-improving conversational agent system designed to enhance depression diagnosis through simulated dialogues between patient and psychiatrist agents. To enhance the dialogue quality and diagnosis accuracy, we design a psychiatrist agent consisting of a tertiary memory structure, a dialogue control and reflect plugin that acts as ``supervisor'' and a memory sampling module, fully leveraging the skills reflected by the psychiatrist agent, achieving great accuracy on depression risk and suicide risk diagnosis via conversation. Experiment results on datasets collected in real-life scenarios demonstrate that the system, simulating the procedure of training psychiatrists, can be a promising optimization method for aligning LLMs with real-life distribution in specific domains without modifying the weights of LLMs, even when only a few representative labeled cases are available.
Authors: Yuxuan Ye, Edwin Simpson, Raul Santos Rodriguez
Abstract: Cutting-edge abstractive summarisers generate fluent summaries, but the factuality of the generated text is not guaranteed. Early summary factuality evaluation metrics are usually based on n-gram overlap and embedding similarity, but are reported fail to align with human annotations. Therefore, many techniques for detecting factual inconsistencies build pipelines around natural language inference (NLI) or question-answering (QA) models with additional supervised learning steps. In this paper, we revisit similarity-based metrics, showing that this failure stems from the comparison text selection and its granularity. We propose a new zero-shot factuality evaluation metric, Sentence-BERT Score (SBERTScore), which compares sentences between the summary and the source document. It outperforms widely-used word-word metrics including BERTScore and can compete with existing NLI and QA-based factuality metrics on the benchmark without needing any fine-tuning. Our experiments indicate that each technique has different strengths, with SBERTScore particularly effective in identifying correct summaries. We demonstrate how a combination of techniques is more effective in detecting various types of error.
Authors: Lior Forer, Tom Hope
Abstract: We address the fundamental task of inferring cross-document coreference and hierarchy in scientific texts, which has important applications in knowledge graph construction, search, recommendation and discovery. LLMs still struggle when faced with many long-tail technical concepts with nuanced variations. We present a novel method which generates context-dependent definitions of concept mentions by retrieving full-text literature, and uses the definitions to enhance detection of cross-document mention relations. We further generate relational definitions, which describe how two concept mentions are related or different, and design an efficient re-ranking approach to address the combinatorial explosion involved in inferring links across papers. In both fine-tuning and in-context learning settings we achieve large gains in performance. We provide analysis of generated definitions, shedding light on the relational reasoning ability of LLMs over fine-grained scientific texts.
Authors: Skatje Myers, Timothy A. Miller, Yanjun Gao, Matthew M. Churpek, Anoop Mayampurath, Dmitriy Dligach, Majid Afshar
Abstract: Objective: Applying large language models (LLMs) to the clinical domain is challenging due to the context-heavy nature of processing medical records. Retrieval-augmented generation (RAG) offers a solution by facilitating reasoning over large text sources. However, there are many parameters to optimize in just the retrieval system alone. This paper presents an ablation study exploring how different embedding models and pooling methods affect information retrieval for the clinical domain. Methods: Evaluating on three retrieval tasks on two electronic health record (EHR) data sources, we compared seven models, including medical- and general-domain models, specialized encoder embedding models, and off-the-shelf decoder LLMs. We also examine the choice of embedding pooling strategy for each model, independently on the query and the text to retrieve. Results: We found that the choice of embedding model significantly impacts retrieval performance, with BGE, a comparatively small general-domain model, consistently outperforming all others, including medical-specific models. However, our findings also revealed substantial variability across datasets and query text phrasings. We also determined the best pooling methods for each of these models to guide future design of retrieval systems. Discussion: The choice of embedding model, pooling strategy, and query formulation can significantly impact retrieval performance and the performance of these models on other public benchmarks does not necessarily transfer to new domains. Further studies such as this one are vital for guiding empirically-grounded development of retrieval frameworks, such as in the context of RAG, for the clinical domain.
Authors: Zhiyuan Wang, Fangxu Yuan, Virginia LeBaron, Tabor Flickinger, Laura E. Barnes
Abstract: Effective patient-provider communication is crucial in clinical care, directly impacting patient outcomes and quality of life. Traditional evaluation methods, such as human ratings, patient feedback, and provider self-assessments, are often limited by high costs and scalability issues. Although existing natural language processing (NLP) techniques show promise, they struggle with the nuances of clinical communication and require sensitive clinical data for training, reducing their effectiveness in real-world applications. Emerging large language models (LLMs) offer a new approach to assessing complex communication metrics, with the potential to advance the field through integration into passive sensing and just-in-time intervention systems. This study explores LLMs as evaluators of palliative care communication quality, leveraging their linguistic, in-context learning, and reasoning capabilities. Specifically, using simulated scripts crafted and labeled by healthcare professionals, we test proprietary models (e.g., GPT-4) and fine-tune open-source LLMs (e.g., LLaMA2) with a synthetic dataset generated by GPT-4 to evaluate clinical conversations, to identify key metrics such as `understanding' and `empathy'. Our findings demonstrated LLMs' superior performance in evaluating clinical communication, providing actionable feedback with reasoning, and demonstrating the feasibility and practical viability of developing in-house LLMs. This research highlights LLMs' potential to enhance patient-provider interactions and lays the groundwork for downstream steps in developing LLM-empowered clinical health systems.
Authors: Mingqi Li, Karan Aggarwal, Yong Xie, Aitzaz Ahmad, Stephen Lau
Abstract: As LLMs evolve, significant effort is spent on manually crafting prompts. While existing prompt optimization methods automate this process, they rely solely on learning from incorrect samples, leading to a sub-optimal performance. Additionally, an unexplored challenge in the literature is prompts effective for prior models may not perform well on newer versions or different languages. We propose the Learning from Contrastive Prompts (LCP) framework to address these gaps, enhancing both prompt optimization and adaptation. LCP employs contrastive learning to generate effective prompts by analyzing patterns in good and bad prompt examples. Our evaluation on the Big-Bench Hard dataset shows that LCP has a win rate of over 76% over existing methods in prompt optimization and demonstrates strong adaptability across different model versions, families, and languages. LCP offers a systematic approach to prompt engineering, reducing manual effort in deploying LLMs across varied contexts.
Authors: Iwo Naglik, Mateusz Lango
Abstract: Aspect-Sentiment Triplet Extraction (ASTE) is a recently proposed task of aspect-based sentiment analysis that consists in extracting (aspect phrase, opinion phrase, sentiment polarity) triples from a given sentence. Recent state-of-the-art methods approach this task by first extracting all possible text spans from a given text, then filtering the potential aspect and opinion phrases with a classifier, and finally considering all their pairs with another classifier that additionally assigns sentiment polarity to them. Although several variations of the above scheme have been proposed, the common feature is that the final result is constructed by a sequence of independent classifier decisions. This hinders the exploitation of dependencies between extracted phrases and prevents the use of knowledge about the interrelationships between classifier predictions to improve performance. In this paper, we propose a new ASTE approach consisting of three transformer-inspired layers, which enables the modelling of dependencies both between phrases and between the final classifier decisions. Experimental results show that the method achieves higher performance in terms of F1 measure than other methods studied on popular benchmarks. In addition, we show that a simple pre-training technique further improves the performance of the model.
Authors: Junqing He, Liang Zhu, Qi Wei, Rui Wang, Jiaxing Zhang
Abstract: Long-term memory is so important for chatbots and dialogue systems (DS) that researchers have developed numerous memory-augmented DS. However, their evaluation methods are different from the real situation in human conversation. They only measured the accuracy of factual information or the perplexity of generated responses given a query, which hardly reflected their performance. Moreover, they only consider passive memory retrieval based on similarity, neglecting diverse memory-recalling paradigms in humans, e.g. emotions and surroundings. To bridge the gap, we construct a novel benchmark covering various memory recalling paradigms based on cognitive science and psychology theory. The Memory Benchmark (MemBench) contains two tasks according to the two-phrase theory in cognitive science: memory retrieval, memory recognition and injection. The benchmark considers both passive and proactive memory recalling based on meta information for the first time. In addition, novel scoring aspects are proposed to comprehensively measure the generated responses. Results from the strongest embedding models and LLMs on MemBench show that there is plenty of room for improvement in existing dialogue systems. Extensive experiments also reveal the correlation between memory injection and emotion supporting (ES) skillfulness, and intimacy. Our code and dataset will be released.
Authors: Yuhang Xiao, Yudi Lin, Ming-Chang Chiu
Abstract: Large Vision-Language Models (LVLMs) evolve rapidly as Large Language Models (LLMs) was equipped with vision modules to create more human-like models. However, we should carefully evaluate their applications in different domains, as they may possess undesired biases. Our work studies the potential behavioral biases of LVLMs from a behavioral finance perspective, an interdisciplinary subject that jointly considers finance and psychology. We propose an end-to-end framework, from data collection to new evaluation metrics, to assess LVLMs' reasoning capabilities and the dynamic behaviors manifested in two established human financial behavioral biases: recency bias and authority bias. Our evaluations find that recent open-source LVLMs such as LLaVA-NeXT, MobileVLM-V2, Mini-Gemini, MiniCPM-Llama3-V 2.5 and Phi-3-vision-128k suffer significantly from these two biases, while the proprietary model GPT-4o is negligibly impacted. Our observations highlight directions in which open-source models can improve. The code is available at https://github.com/mydcxiao/vlm_behavioral_fin.
Authors: Yizhi Li, Ge Zhang, Yinghao Ma, Ruibin Yuan, Kang Zhu, Hangyu Guo, Yiming Liang, Jiaheng Liu, Jian Yang, Siwei Wu, Xingwei Qu, Jinjie Shi, Xinyue Zhang, Zhenzhu Yang, Xiangzhou Wang, Zhaoxiang Zhang, Zachary Liu, Emmanouil Benetos, Wenhao Huang, Chenghua Lin
Abstract: Recent advancements in multimodal large language models (MLLMs) have aimed to integrate and interpret data across diverse modalities. However, the capacity of these models to concurrently process and reason about multiple modalities remains inadequately explored, partly due to the lack of comprehensive modality-wise benchmarks. We introduce OmniBench, a novel benchmark designed to rigorously evaluate models' ability to recognize, interpret, and reason across visual, acoustic, and textual inputs simultaneously. We define models capable of such tri-modal processing as omni-language models (OLMs). OmniBench is distinguished by high-quality human annotations, ensuring that accurate responses require integrated understanding and reasoning across all three modalities. Our main findings reveal that: i) open-source OLMs exhibit critical limitations in instruction-following and reasoning capabilities within tri-modal contexts; and ii) the baseline models perform poorly (below 50% accuracy) even when provided with alternative textual representations of images and audio. These results suggest that the ability to construct a consistent context from text, image, and audio is often overlooked in existing MLLM training paradigms. We advocate for future research to focus on developing more robust tri-modal integration techniques and training strategies to enhance OLM performance across diverse modalities. The codes and live leaderboard could be found at https://m-a-p.ai/OmniBench.
Authors: Yunfei Xie, Juncheng Wu, Haoqin Tu, Siwei Yang, Bingchen Zhao, Yongshuo Zong, Qiao Jin, Cihang Xie, Yuyin Zhou
Abstract: Large language models (LLMs) have exhibited remarkable capabilities across various domains and tasks, pushing the boundaries of our knowledge in learning and cognition. The latest model, OpenAI's o1, stands out as the first LLM with an internalized chain-of-thought technique using reinforcement learning strategies. While it has demonstrated surprisingly strong capabilities on various general language tasks, its performance in specialized fields such as medicine remains unknown. To this end, this report provides a comprehensive exploration of o1 on different medical scenarios, examining 3 key aspects: understanding, reasoning, and multilinguality. Specifically, our evaluation encompasses 6 tasks using data from 37 medical datasets, including two newly constructed and more challenging question-answering (QA) tasks based on professional medical quizzes from the New England Journal of Medicine (NEJM) and The Lancet. These datasets offer greater clinical relevance compared to standard medical QA benchmarks such as MedQA, translating more effectively into real-world clinical utility. Our analysis of o1 suggests that the enhanced reasoning ability of LLMs may (significantly) benefit their capability to understand various medical instructions and reason through complex clinical scenarios. Notably, o1 surpasses the previous GPT-4 in accuracy by an average of 6.2% and 6.6% across 19 datasets and two newly created complex QA scenarios. But meanwhile, we identify several weaknesses in both the model capability and the existing evaluation protocols, including hallucination, inconsistent multilingual ability, and discrepant metrics for evaluation. We release our raw data and model outputs at https://ucsc-vlaa.github.io/o1_medicine/ for future research.
Authors: Thierry Petit, Arnault Pachot, Claire Conan-Vrinat, Alexandre Dubarry
Abstract: This article introduces an innovative architecture designed to declaratively combine Large Language Models (LLMs) with shared histories, and triggers to identify the most appropriate LLM for a given task. Our approach is general and declarative, relying on the construction of finite automata coupled with an event management system. The developed tool is crafted to facilitate the efficient and complex integration of LLMs with minimal programming effort, especially, but not only, for integrating methods of positive psychology to AI. The flexibility of our technique is demonstrated through applied examples in automation, communication, and ethics.
Authors: Paulina Toro Isaza, Michael Nidd, Noah Zheutlin, Jae-wook Ahn, Chidansh Amitkumar Bhatt, Yu Deng, Ruchi Mahindru, Martin Franz, Hans Florian, Salim Roukos
Abstract: Clients wishing to implement generative AI in the domain of IT Support and AIOps face two critical issues: domain coverage and model size constraints due to model choice limitations. Clients might choose to not use larger proprietary models such as GPT-4 due to cost and privacy concerns and so are limited to smaller models with potentially less domain coverage that do not generalize to the client's domain. Retrieval augmented generation is a common solution that addresses both of these issues: a retrieval system first retrieves the necessary domain knowledge which a smaller generative model leverages as context for generation. We present a system developed for a client in the IT Support domain for support case solution recommendation that combines retrieval augmented generation (RAG) for answer generation with an encoder-only model for classification and a generative large language model for query generation. We cover architecture details, data collection and annotation, development journey and preliminary validations, expected final deployment process and evaluation plans, and finally lessons learned.
Authors: Zhihuan Jiang, Zhen Yang, Jinhao Chen, Zhengxiao Du, Weihan Wang, Bin Xu, Yuxiao Dong, Jie Tang
Abstract: Multi-modal large language models (MLLMs) have demonstrated promising capabilities across various tasks by integrating textual and visual information to achieve visual understanding in complex scenarios. Despite the availability of several benchmarks aims to evaluating MLLMs in tasks from visual question answering to complex problem-solving, most focus predominantly on mathematics or general visual understanding tasks. This reveals a critical gap in current benchmarks, which often overlook the inclusion of other key scientific disciplines such as physics and chemistry. To address this gap, we meticulously construct a comprehensive benchmark, named VisScience, which is utilized to assess the multi-modal scientific reasoning across the three disciplines of mathematics, physics, and chemistry. This benchmark comprises 3,000 questions drawn from K12 education - spanning elementary school through high school - equally distributed across three disciplines, with 1,000 questions per discipline. The questions within VisScience span 21 distinct subjects and are categorized into five difficulty levels, offering a broad spectrum of topics within each discipline. With VisScience, we present a detailed evaluation of the performance of 25 representative MLLMs in scientific reasoning. Experimental results demonstrate that closed-source MLLMs generally outperform open-source models. The best performance observed include a 53.4\% accuracy in mathematics by Claude3.5-Sonnet, 38.2\% in physics by GPT-4o, and 47.0\% in chemistry by Gemini-1.5-Pro. These results underscore the strengths and limitations of MLLMs, suggesting areas for future improvement and highlighting the importance of developing models that can effectively handle the diverse demands of multi-modal scientific reasoning.
Authors: Asher Sprigler, Alexander Drobek, Keagan Weinstock, Wendpanga Tapsoba, Gavin Childress, Andy Dao, Lucas Gral
Abstract: Large Language Models (LLMs) have increasingly demonstrated the ability to facilitate the development of multi-agent systems that allow the interpretation of thoughts and actions generated by each individual. Promising advancements have also been made in LLM-based interaction with existing worlds, particularly in interacting with simulated environments. This paper aims to integrate both aforementioned topics (agents & world interaction) into a single simulation where multiple agents can work together to solve a problem, modeling how groups of humans can often solve problems better than individuals. By showing whether LLMs demonstrate the synergy of human collaboration, it could lead to advancements in the applications of LLMs. We implemented two simulations: a physical studio apartment with two roommates, and another where agents collaborate to complete a programming task. We provide a multi-agent framework, discuss the performance of the agents in each simulation, and discuss potential future additions.
Authors: Yu Zhang, Changhao Pan, Wenxiang Guo, Ruiqi Li, Zhiyuan Zhu, Jialei Wang, Wenhao Xu, Jingyu Lu, Zhiqing Hong, Chuxin Wang, LiChao Zhang, Jinzheng He, Ziyue Jiang, Yuxin Chen, Chen Yang, Jiecheng Zhou, Xinyu Cheng, Zhou Zhao
Abstract: The scarcity of high-quality and multi-task singing datasets significantly hinders the development of diverse controllable and personalized singing tasks, as existing singing datasets suffer from low quality, limited diversity of languages and singers, absence of multi-technique information and realistic music scores, and poor task suitability. To tackle these problems, we present \textbf{GTSinger}, a large \textbf{G}lobal, multi-\textbf{T}echnique, free-to-use, high-quality singing corpus with realistic music scores, designed for all singing tasks, along with its benchmarks. Particularly, (1) we collect 80.59 hours of high-quality singing voices, forming the largest recorded singing dataset; (2) 20 professional singers across nine widely spoken languages offer diverse timbres and styles; (3) we provide controlled comparison and phoneme-level annotations of six commonly used singing techniques, helping technique modeling and control; (4) GTSinger offers realistic music scores, assisting real-world musical composition; (5) singing voices are accompanied by manual phoneme-to-audio alignments, global style labels, and 16.16 hours of paired speech for various singing tasks. Moreover, to facilitate the use of GTSinger, we conduct four benchmark experiments: technique-controllable singing voice synthesis, technique recognition, style transfer, and speech-to-singing conversion. The corpus and demos can be found at http://gtsinger.github.io. We provide the dataset and the code for processing data and conducting benchmarks at https://huggingface.co/datasets/GTSinger/GTSinger and https://github.com/GTSinger/GTSinger.
URLs: http://gtsinger.github.io., https://huggingface.co/datasets/GTSinger/GTSinger, https://github.com/GTSinger/GTSinger.
Authors: Ayoob Sadeghiani
Abstract: AI governance and ethics in AI development have become critical concerns, prompting active discussions among tech companies, governments, and researchers about the potential risks AI poses to our democracies. This short essay aims to highlight one such risk: how generative AI includes or excludes equity-deserving groups in its outputs. The findings reveal that generative AI is not equitably inclusive regarding gender, race, age, and visible disability.
Authors: Seonghyeon Lee, Suyeon Kim, Joonwon Jang, Heejae Chon, Dongha Lee, Hwanjo Yu
Abstract: We study the code generation behavior of instruction-tuned models built on top of code pre-trained language models when they could access an auxiliary function to implement a function. We design several ways to provide auxiliary functions to the models by adding them to the query or providing a response prefix to incorporate the ability to utilize auxiliary functions with the instruction-following capability. Our experimental results show the effectiveness of combining the base models' auxiliary function utilization ability with the instruction following ability. In particular, the performance of adopting our approaches with the open-sourced language models surpasses that of the recent powerful proprietary language models, i.e., gpt-4o.
Authors: Dongyang Fan, Bettina Messmer, Martin Jaggi
Abstract: We target on-device collaborative fine-tuning of Large Language Models (LLMs) by adapting a Mixture of Experts (MoE) architecture, where experts are Low-Rank Adaptation (LoRA) modules. In conventional MoE approaches, experts develop into specialists throughout training. In contrast, we propose a novel $\textbf{Co}$llaborative learning approach via a $\textbf{Mi}$xture of $\textbf{G}$eneralists and $\textbf{S}$pecialists (CoMiGS). Diversifying into the two roles is achieved by aggregating certain experts globally while keeping others localized to specialize in user-specific datasets. Central to our work is a learnable routing network that routes at a token level, balancing collaboration and personalization at the finest granularity. Our method consistently demonstrates superior performance in scenarios with high data heterogeneity across various datasets. By design, our approach accommodates varying computational resource constraints among users as shown in different numbers of LoRA experts. We further showcase that low-resourced users can benefit from high-resourced users with high data quantity.
Authors: Zhangcheng Qiang, Kerry Taylor, Weiqing Wang, Jing Jiang
Abstract: Hallucinations of large language models (LLMs) commonly occur in domain-specific downstream tasks, with no exception in ontology matching (OM). The prevalence of using LLMs for OM raises the need for benchmarks to better understand LLM hallucinations. The OAEI-LLM dataset is an extended version of the Ontology Alignment Evaluation Initiative (OAEI) datasets that evaluate LLM-specific hallucinations in OM tasks. We outline the methodology used in dataset construction and schema extension, and provide examples of potential use cases.
Authors: Haoran Zhang, Shuanghao Bai, Wanqi Zhou, Jingwen Fu, Badong Chen
Abstract: Source-free domain generalization (SFDG) tackles the challenge of adapting models to unseen target domains without access to source domain data. To deal with this challenging task, recent advances in SFDG have primarily focused on leveraging the text modality of vision-language models such as CLIP. These methods involve developing a transferable linear classifier based on diverse style features extracted from the text and learned prompts or deriving domain-unified text representations from domain banks. However, both style features and domain banks have limitations in capturing comprehensive domain knowledge. In this work, we propose Prompt-Driven Text Adapter (PromptTA) method, which is designed to better capture the distribution of style features and employ resampling to ensure thorough coverage of domain knowledge. To further leverage this rich domain information, we introduce a text adapter that learns from these style features for efficient domain information storage. Extensive experiments conducted on four benchmark datasets demonstrate that PromptTA achieves state-of-the-art performance. The code is available at https://github.com/zhanghr2001/PromptTA.
Authors: Yuxuan Zhua, Shiyi Wang, Wenqing Zhong, Nianchen Shen, Yunqi Li, Siqi Wang, Zhiheng Li, Cathy Wu, Zhengbing He, Li Li
Abstract: Artificial intelligence (AI) plays a crucial role in autonomous driving (AD) research, propelling its development towards intelligence and efficiency. Currently, the development of AD technology follows two main technical paths: modularization and end-to-end. Modularization decompose the driving task into modules such as perception, prediction, planning, and control, and train them separately. Due to the inconsistency of training objectives between modules, the integrated effect suffers from bias. End-to-end attempts to address this issue by utilizing a single model that directly maps from sensor data to control signals. This path has limited learning capabilities in a comprehensive set of features and struggles to handle unpredictable long-tail events and complex urban traffic scenarios. In the face of challenges encountered in both paths, many researchers believe that large language models (LLMs) with powerful reasoning capabilities and extensive knowledge understanding may be the solution, expecting LLMs to provide AD systems with deeper levels of understanding and decision-making capabilities. In light of the challenges faced by both paths, many researchers believe that LLMs, with their powerful reasoning abilities and extensive knowledge, could offer a solution. To understand if LLMs could enhance AD, this paper conducts a thorough analysis of the potential applications of LLMs in AD systems, including exploring their optimization strategies in both modular and end-to-end approaches, with a particular focus on how LLMs can tackle the problems and challenges present in current solutions. Furthermore, we discuss an important question: Can LLM-based artificial general intelligence (AGI) be a key to achieve high-level AD? We further analyze the potential limitations and challenges that LLMs may encounter in promoting the development of AD technology.
Authors: Muhan Zhang
Abstract: In this draft, we study a novel problem, called lexical invariance, using the medium of multisets and graphs. Traditionally in the NLP domain, lexical invariance indicates that the semantic meaning of a sentence should remain unchanged regardless of the specific lexical or word-based representation of the input. For example, ``The movie was extremely entertaining'' would have the same meaning as ``The film was very enjoyable''. In this paper, we study a more challenging setting, where the output of a function is invariant to any injective transformation applied to the input lexical space. For example, multiset {1,2,3,2} is equivalent to multiset {a,b,c,b} if we specify an injective transformation that maps 1 to a, 2 to b and 3 to c. We study the sufficient and necessary conditions for a most expressive lexical invariant (and permutation invariant) function on multisets and graphs, and proves that for multisets, the function must have a form that only takes the multiset of counts of the unique elements in the original multiset as input. For example, a most expressive lexical invariant function on {a,b,c,b} must have a form that only operates on {1,1,2} (meaning that there are 1, 1, 2 unique elements corresponding to a,c,b). For graphs, we prove that a most expressive lexical invariant and permutation invariant function must have a form that only takes the adjacency matrix and a difference matrix as input, where the (i,j)th element of the difference matrix is 1 if node i and node j have the same feature and 0 otherwise. We perform synthetic experiments on TU datasets to verify our theorems.
Authors: Yew Ken Chia, Qi Sun, Lidong Bing, Soujanya Poria
Abstract: Large multimodal models have demonstrated impressive problem-solving abilities in vision and language tasks, and have the potential to encode extensive world knowledge. However, it remains an open challenge for these models to perceive, reason, plan, and act in realistic environments. In this work, we introduce Can-Do, a benchmark dataset designed to evaluate embodied planning abilities through more diverse and complex scenarios than previous datasets. Our dataset includes 400 multimodal samples, each consisting of natural language user instructions, visual images depicting the environment, state changes, and corresponding action plans. The data encompasses diverse aspects of commonsense knowledge, physical understanding, and safety awareness. Our fine-grained analysis reveals that state-of-the-art models, including GPT-4V, face bottlenecks in visual perception, comprehension, and reasoning abilities. To address these challenges, we propose NeuroGround, a neurosymbolic framework that first grounds the plan generation in the perceived environment states and then leverages symbolic planning engines to augment the model-generated plans. Experimental results demonstrate the effectiveness of our framework compared to strong baselines. Our code and dataset are available at https://embodied-planning.github.io.
Authors: Isabele Bittencourt, Aparna S. Varde, Pankaj Lal
Abstract: In this paper, we conduct sentiment analysis on social media data to study mass opinion about offshore wind energy. We adapt three machine learning models, namely, TextBlob, VADER, and SentiWordNet because different functions are provided by each model. TextBlob provides subjectivity analysis as well as polarity classification. VADER offers cumulative sentiment scores. SentiWordNet considers sentiments with reference to context and performs classification accordingly. Techniques in NLP are harnessed to gather meaning from the textual data in social media. Data visualization tools are suitably deployed to display the overall results. This work is much in line with citizen science and smart governance via involvement of mass opinion to guide decision support. It exemplifies the role of Machine Learning and NLP here.
Authors: Nikita Kartashov, Nikolaos N. Vlassis
Abstract: Microstructure plays a critical role in determining the macroscopic properties of materials, with applications spanning alloy design, MEMS devices, and tissue engineering, among many others. Computational frameworks have been developed to capture the complex relationship between microstructure and material behavior. However, despite these advancements, the steep learning curve associated with domain-specific knowledge and complex algorithms restricts the broader application of these tools. To lower this barrier, we propose a framework that integrates Natural Language Processing (NLP), Large Language Models (LLMs), and Denoising Diffusion Probabilistic Models (DDPMs) to enable microstructure design using intuitive natural language commands. Our framework employs contextual data augmentation, driven by a pretrained LLM, to generate and expand a diverse dataset of microstructure descriptors. A retrained NER model extracts relevant microstructure descriptors from user-provided natural language inputs, which are then used by the DDPM to generate microstructures with targeted mechanical properties and topological features. The NLP and DDPM components of the framework are modular, allowing for separate training and validation, which ensures flexibility in adapting the framework to different datasets and use cases. A surrogate model system is employed to rank and filter generated samples based on their alignment with target properties. Demonstrated on a database of nonlinear hyperelastic microstructures, this framework serves as a prototype for accessible inverse design of microstructures, starting from intuitive natural language commands.
Authors: Shaowei Ying, Zhenlong Li, Manzhu Yu
Abstract: The resurgence and rapid advancement of Generative Artificial Intelligence (GenAI) in 2023 has catalyzed transformative shifts across numerous industry sectors, including urban transportation and logistics. This study investigates the evaluation of Large Language Models (LLMs), specifically GPT-4 and Phi-3-mini, to enhance transportation planning. The study assesses the performance and spatial comprehension of these models through a transportation-informed evaluation framework that includes general geospatial skills, general transportation domain skills, and real-world transportation problem-solving. Utilizing a mixed-methods approach, the research encompasses an evaluation of the LLMs' general Geographic Information System (GIS) skills, general transportation domain knowledge as well as abilities to support human decision-making in the real-world transportation planning scenarios of congestion pricing. Results indicate that GPT-4 demonstrates superior accuracy and reliability across various GIS and transportation-specific tasks compared to Phi-3-mini, highlighting its potential as a robust tool for transportation planners. Nonetheless, Phi-3-mini exhibits competence in specific analytical scenarios, suggesting its utility in resource-constrained environments. The findings underscore the transformative potential of GenAI technologies in urban transportation planning. Future work could explore the application of newer LLMs and the impact of Retrieval-Augmented Generation (RAG) techniques, on a broader set of real-world transportation planning and operations challenges, to deepen the integration of advanced AI models in transportation management practices.
Authors: Yingzhi Wang, Pooneh Mousavi, Artem Ploujnikov, Mirco Ravanelli
Abstract: In audio and speech processing, tasks usually focus on either the audio or speech modality, even when both sounds and human speech are present in the same audio clip. Recent Auditory Large Language Models (ALLMs) have made it possible to process audio and speech simultaneously within a single model, leading to further considerations of joint audio-speech tasks. In this paper, we investigate how well ALLMs can perform joint audio-speech processing. Specifically, we introduce Joint Audio-Speech Co-Reasoning (JASCO), a novel task that unifies audio and speech processing, strictly requiring co-reasoning across both modalities. We release a scene-reasoning dataset called "What Are They Doing" and establish a joint audio-speech benchmark to evaluate the joint reasoning capability of popular ALLMs. Additionally, we provide deeper insights into the models' behaviors by analyzing their dependence on each modality.
Authors: Yiming Zhang, Jianfeng Chi, Hailey Nguyen, Kartikeya Upasani, Daniel M. Bikel, Jason Weston, Eric Michael Smith
Abstract: Text generation has a fundamental limitation almost by definition: there is no taking back tokens that have been generated, even when they are clearly problematic. In the context of language model safety, when a partial unsafe generation is produced, language models by their nature tend to happily keep on generating similarly unsafe additional text. This is in fact how safety alignment of frontier models gets circumvented in the wild, despite great efforts in improving their safety. Deviating from the paradigm of approaching safety alignment as prevention (decreasing the probability of harmful responses), we propose backtracking, a technique that allows language models to "undo" and recover from their own unsafe generation through the introduction of a special [RESET] token. Our method can be incorporated into either SFT or DPO training to optimize helpfulness and harmlessness. We show that models trained to backtrack are consistently safer than baseline models: backtracking Llama-3-8B is four times more safe than the baseline model (6.1\% $\to$ 1.5\%) in our evaluations without regression in helpfulness. Our method additionally provides protection against four adversarial attacks including an adaptive attack, despite not being trained to do so.
Authors: Yinpei Dai, Jayjun Lee, Nima Fazeli, Joyce Chai
Abstract: Developing robust and correctable visuomotor policies for robotic manipulation is challenging due to the lack of self-recovery mechanisms from failures and the limitations of simple language instructions in guiding robot actions. To address these issues, we propose a scalable data generation pipeline that automatically augments expert demonstrations with failure recovery trajectories and fine-grained language annotations for training. We then introduce Rich languAge-guided failure reCovERy (RACER), a supervisor-actor framework, which combines failure recovery data with rich language descriptions to enhance robot control. RACER features a vision-language model (VLM) that acts as an online supervisor, providing detailed language guidance for error correction and task execution, and a language-conditioned visuomotor policy as an actor to predict the next actions. Our experimental results show that RACER outperforms the state-of-the-art Robotic View Transformer (RVT) on RLbench across various evaluation settings, including standard long-horizon tasks, dynamic goal-change tasks and zero-shot unseen tasks, achieving superior performance in both simulated and real world environments. Videos and code are available at: https://rich-language-failure-recovery.github.io.
Authors: Benjamin Clavi\'e, Antoine Chaffin, Griffin Adams
Abstract: Over the last few years, multi-vector retrieval methods, spearheaded by ColBERT, have become an increasingly popular approach to Neural IR. By storing representations at the token level rather than at the document level, these methods have demonstrated very strong retrieval performance, especially in out-of-domain settings. However, the storage and memory requirements necessary to store the large number of associated vectors remain an important drawback, hindering practical adoption. In this paper, we introduce a simple clustering-based token pooling approach to aggressively reduce the number of vectors that need to be stored. This method can reduce the space & memory footprint of ColBERT indexes by 50% with virtually no retrieval performance degradation. This method also allows for further reductions, reducing the vector count by 66%-to-75% , with degradation remaining below 5% on a vast majority of datasets. Importantly, this approach requires no architectural change nor query-time processing, and can be used as a simple drop-in during indexation with any ColBERT-like model.
Authors: Siddhant Bikram Shah, Shuvam Shiwakoti, Maheep Chaudhary, Haohan Wang
Abstract: The complexity of text-embedded images presents a formidable challenge in machine learning given the need for multimodal understanding of the multiple aspects of expression conveyed in them. While previous research in multimodal analysis has primarily focused on singular aspects such as hate speech and its subclasses, our study expands the focus to encompass multiple aspects of linguistics: hate, target, stance, and humor detection. We introduce a novel dataset PrideMM comprising text-embedded images associated with the LGBTQ+ Pride movement, thereby addressing a serious gap in existing resources. We conduct extensive experimentation on PrideMM by using unimodal and multimodal baseline methods to establish benchmarks for each task. Additionally, we propose a novel framework MemeCLIP for efficient downstream learning while preserving the knowledge of the pre-trained CLIP model. The results of our experiments show that MemeCLIP achieves superior performance compared to previously proposed frameworks on two real-world datasets. We further compare the performance of MemeCLIP and zero-shot GPT-4 on the hate classification task. Finally, we discuss the shortcomings of our model by qualitatively analyzing misclassified samples. Our code and dataset are publicly available at: https://github.com/SiddhantBikram/MemeCLIP.
Authors: Jingtao Cao, Zheng Zhang, Hongru Wang, Kam-Fai Wong
Abstract: Progress in Text-to-Image (T2I) models has significantly improved the generation of images from textual descriptions. However, existing evaluation metrics do not adequately assess the models' ability to handle a diverse range of textual prompts, which is crucial for their generalizability. To address this, we introduce a new metric called Visual Language Evaluation Understudy (VLEU). VLEU uses large language models to sample from the visual text domain, the set of all possible input texts for T2I models, to generate a wide variety of prompts. The images generated from these prompts are evaluated based on their alignment with the input text using the CLIP model.VLEU quantifies a model's generalizability by computing the Kullback-Leibler divergence between the marginal distribution of the visual text and the conditional distribution of the images generated by the model. This metric provides a quantitative way to compare different T2I models and track improvements during model finetuning. Our experiments demonstrate the effectiveness of VLEU in evaluating the generalization capability of various T2I models, positioning it as an essential metric for future research in text-to-image synthesis.
Authors: Junzhuo Liu, Xuzheng Yang, Weiwei Li, Peng Wang
Abstract: Referring Expression Comprehension (REC) is a crucial cross-modal task that objectively evaluates the capabilities of language understanding, image comprehension, and language-to-image grounding. Consequently, it serves as an ideal testing ground for Multi-modal Large Language Models (MLLMs). In pursuit of this goal, we have established a new REC dataset characterized by two key features: Firstly, it is designed with controllable varying levels of difficulty, necessitating multi-level fine-grained reasoning across object categories, attributes, and multi-hop relationships. Secondly, it includes negative text and images created through fine-grained editing and generation based on existing data, thereby testing the model's ability to correctly reject scenarios where the target object is not visible in the image--an essential aspect often overlooked in existing datasets and approaches. Utilizing this high-quality dataset, we conducted comprehensive evaluations of both state-of-the-art specialist models and MLLMs. Our findings indicate that there remains a significant gap in achieving satisfactory grounding performance. We anticipate that our dataset will inspire new approaches to enhance visual reasoning and develop more advanced cross-modal interaction strategies, ultimately unlocking the full potential of MLLMs. Our code and the datasets are available at https://github.com/liujunzhuo/FineCops-Ref.
Authors: Furkan Pala, Mehmet Yasin Akp{\i}nar, Onur Deniz, G\"ul\c{s}en Eryi\u{g}it
Abstract: Multimodal key information extraction (KIE) models have been studied extensively on semi-structured documents. However, their investigation on unstructured documents is an emerging research topic. The paper presents an approach to adapt a multimodal transformer (i.e., ViBERTgrid previously explored on semi-structured documents) for unstructured financial documents, by incorporating a BiLSTM-CRF layer. The proposed ViBERTgrid BiLSTM-CRF model demonstrates a significant improvement in performance (up to 2 percentage points) on named entity recognition from unstructured documents in financial domain, while maintaining its KIE performance on semi-structured documents. As an additional contribution, we publicly released token-level annotations for the SROIE dataset in order to pave the way for its use in multimodal sequence labeling models.
Authors: Zeliang Zhang, Zhuo Liu, Mingqian Feng, Chenliang Xu
Abstract: CLIP has demonstrated great versatility in adapting to various downstream tasks, such as image editing and generation, visual question answering, and video understanding. However, CLIP-based applications often suffer from misunderstandings regarding user intent, leading to discrepancies between the required number of objects and the actual outputs in image generation tasks. In this work, we empirically investigate the quantity bias in CLIP. By carefully designing different experimental settings and datasets, we comprehensively evaluate CLIP's understanding of quantity from text, image, and cross-modal perspectives. Our experimental results reveal a quantity bias in CLIP embeddings, impacting the reliability of downstream tasks.
Authors: Sanchana Srikanth, Mohammad Hasanuzzaman, Farah Tasnur Meem
Abstract: Large Language Models (LLMs) have the potential to significantly enhance threat intelligence by automating the collection, preprocessing, and analysis of threat data. However, the usability of these tools is critical to ensure their effective adoption by security professionals. Despite the advanced capabilities of LLMs, concerns about their reliability, accuracy, and potential for generating inaccurate information persist. This study conducts a comprehensive usability evaluation of five LLMs ChatGPT, Gemini, Cohere, Copilot, and Meta AI focusing on their user interface design, error handling, learning curve, performance, and integration with existing tools in threat intelligence enrichment. Utilizing a heuristic walkthrough and a user study methodology, we identify key usability issues and offer actionable recommendations for improvement. Our findings aim to bridge the gap between LLM functionality and user experience, thereby promoting more efficient and accurate threat intelligence practices by ensuring these tools are user-friendly and reliable.
Authors: Agniv Sharma, Jonas Geiping
Abstract: Transformers are widely used across various applications, many of which yield sparse or partially filled attention matrices. Examples include attention masks designed to reduce the quadratic complexity of attention, sequence packing techniques, and recent innovations like tree masking for fast validation in MEDUSA. Despite the inherent sparsity in these matrices, the state-of-the-art algorithm Flash Attention still processes them with quadratic complexity as though they were dense. In this paper, we introduce \textbf{Binary Block Masking}, a highly efficient modification that enhances Flash Attention by making it mask-aware. We further propose two optimizations: one tailored for masks with contiguous non-zero patterns and another for extremely sparse masks. Our experiments on attention masks derived from real-world scenarios demonstrate up to a 9x runtime improvement. The implementation will be publicly released to foster further research and application.
Authors: Jon Saad-Falcon, Adrian Gamarra Lafuente, Shlok Natarajan, Nahum Maru, Hristo Todorov, E. Kelly Buchanan, Mayee Chen, Neel Guha, Christopher R\'e, Azalia Mirhoseini
Abstract: Inference-time techniques are emerging as highly effective tools to increase large language model (LLM) capabilities. However, there is still limited understanding of the best practices for developing systems that combine inference-time techniques with one or more LLMs, with challenges including: (1) effectively allocating inference compute budget, (2) understanding the interactions between different combinations of inference-time techniques and their impact on downstream performance, and 3) efficiently searching over the large space of model choices, inference-time techniques, and their compositions. To address these challenges, we introduce Archon, an automated framework for designing inference-time architectures. Archon defines an extensible design space, encompassing methods such as generation ensembling, multi-sampling, ranking, fusion, critiquing, verification, and unit testing. It then transforms the problem of selecting and combining LLMs and inference-time techniques into a hyperparameter optimization objective. To optimize this objective, we introduce automated Inference-Time Architecture Search (ITAS) algorithms. Given target benchmark(s), an inference compute budget, and available LLMs, ITAS outputs optimized architectures. We evaluate Archon architectures across a wide range of instruction-following and reasoning benchmarks, including MT-Bench, Arena-Hard-Auto, AlpacaEval 2.0, MixEval, MixEval Hard, MATH, and CodeContests. We show that automatically designed inference-time architectures by Archon outperform strong models such as GPT-4o and Claude 3.5 Sonnet on these benchmarks, achieving an average increase of 14.1 and 10.3 percentage points with all-source models and open-source models, respectively. We make our code and datasets available publicly on Github: https://github.com/ScalingIntelligence/Archon.
Authors: Tianqing Fang, Quyet V. Do, Zihao Zheng, Weiqi Wang, Sehyun Choi, Zhaowei Wang, Yangqiu Song
Abstract: Commonsense Knowledge Bases (CSKB) Population, which aims at automatically expanding knowledge in CSKBs with external resources, is an important yet hard task in NLP. Fang et al. (2021a) proposed a CSKB Population (CKBP) framework with an evaluation set CKBP v1. However, CKBP v1 relies on crowdsourced annotations that suffer from a considerable number of mislabeled answers, and the evaluationset lacks alignment with the external knowledge source due to random sampling. In this paper, we introduce CKBP v2, a new high-quality CSKB Population evaluation set that addresses the two aforementioned issues by employing domain experts as annotators and incorporating diversified adversarial samples to make the evaluation data more representative. We show that CKBP v2 serves as a challenging and representative evaluation dataset for the CSKB Population task, while its development set aids in selecting a population model that leads to improved knowledge acquisition for downstream commonsense reasoning. A better population model can also help acquire more informative commonsense knowledge as additional supervision signals for both generative commonsense inference and zero-shot commonsense question answering. Specifically, the question-answering model based on DeBERTa-v3-large (He et al., 2023b) even outperforms powerful large language models in a zero-shot setting, including ChatGPT and GPT-3.5.
Authors: Hongbo Zhang, Chen Tang, Tyler Loakman, Bohao Yang, Stefan Goetze, Chenghua Lin
Abstract: Commonsense knowledge is crucial to many natural language processing tasks. Existing works usually incorporate graph knowledge with conventional graph neural networks (GNNs), resulting in a sequential pipeline that compartmentalizes the encoding processes for textual and graph-based knowledge. This compartmentalization does, however, not fully exploit the contextual interplay between these two types of input knowledge. In this paper, a novel context-aware graph-attention model (Context-aware GAT) is proposed, designed to effectively assimilate global features from relevant knowledge graphs through a context-enhanced knowledge aggregation mechanism. Specifically, the proposed framework employs an innovative approach to representation learning that harmonizes heterogeneous features by amalgamating flattened graph knowledge with text data. The hierarchical application of graph knowledge aggregation within connected subgraphs, complemented by contextual information, to bolster the generation of commonsense-driven dialogues is analyzed. Empirical results demonstrate that our framework outperforms conventional GNN-based language models in terms of performance. Both, automated and human evaluations affirm the significant performance enhancements achieved by our proposed model over the concept flow baseline.
Authors: Qingyu Chen, Jingcheng Du, Yan Hu, Vipina Kuttichi Keloth, Xueqing Peng, Kalpana Raja, Rui Zhang, Zhiyong Lu, Hua Xu
Abstract: Biomedical literature is growing rapidly, making it challenging to curate and extract knowledge manually. Biomedical natural language processing (BioNLP) techniques that can automatically extract information from biomedical literature help alleviate this burden. Recently, large Language Models (LLMs), such as GPT-3 and GPT-4, have gained significant attention for their impressive performance. However, their effectiveness in BioNLP tasks and impact on method development and downstream users remain understudied. This pilot study (1) establishes the baseline performance of GPT-3 and GPT-4 at both zero-shot and one-shot settings in eight BioNLP datasets across four applications: named entity recognition, relation extraction, multi-label document classification, and semantic similarity and reasoning, (2) examines the errors produced by the LLMs and categorized the errors into three types: missingness, inconsistencies, and unwanted artificial content, and (3) provides suggestions for using LLMs in BioNLP applications. We make the datasets, baselines, and results publicly available to the community via https://github.com/qingyu-qc/gpt_bionlp_benchmark.
Authors: Xiaohan Xu, Chongyang Tao, Tao Shen, Can Xu, Hongbo Xu, Guodong Long, Jian-guang Lou, Shuai Ma
Abstract: To enhance the reasoning capabilities of off-the-shelf Large Language Models (LLMs), we introduce a simple, yet general and effective prompting method, Re2, i.e., \textbf{Re}-\textbf{Re}ading the question as input. Unlike most thought-eliciting prompting methods, such as Chain-of-Thought (CoT), which aim to elicit the reasoning process in the output, Re2 shifts the focus to the input by processing questions twice, thereby enhancing the understanding process. Consequently, Re2 demonstrates strong generality and compatibility with most thought-eliciting prompting methods, including CoT. Crucially, Re2 facilitates a "bidirectional" encoding in unidirectional decoder-only LLMs because the first pass could provide global information for the second pass. We begin with a preliminary empirical study as the foundation of Re2, illustrating its potential to enable "bidirectional" attention mechanisms. We then evaluate Re2 on extensive reasoning benchmarks across 14 datasets, spanning 112 experiments, to validate its effectiveness and generality. Our findings indicate that, with the exception of a few scenarios on vanilla ChatGPT, Re2 consistently enhances the reasoning performance of LLMs through a simple re-reading strategy. Further analyses reveal Re2's adaptability, showing how it can be effectively integrated with different LLMs, thought-eliciting prompting, and ensemble strategies. Our code is available at \url{https://github.com/Tebmer/Rereading-LLM-Reasoning/}
Authors: Michael Gervers, Gelila Tilahun
Abstract: We outline an unsupervised method for temporal rank ordering of sets of historical documents, namely American State of the Union Addresses and DEEDS, a corpus of medieval English property transfer documents. Our method relies upon effectively capturing the gradual change in word usage via a bandwidth estimate for the non-parametric Generalized Linear Models (Fan, Heckman, and Wand, 1995). The number of possible rank orders needed to search through for cost functions related to the bandwidth can be quite large, even for a small set of documents. We tackle this problem of combinatorial optimization using the Simulated Annealing algorithm, which allows us to obtain the optimal document temporal orders. Our rank ordering method significantly improved the temporal sequencing of both corpora compared to a randomly sequenced baseline. This unsupervised approach should enable the temporal ordering of undated document sets.
Authors: Yibo Wang, Xiangjue Dong, James Caverlee, Philip S. Yu
Abstract: Language models can be manipulated by adversarial attacks, which introduce subtle perturbations to input data. While recent attack methods can achieve a relatively high attack success rate (ASR), we've observed that the generated adversarial examples have a different data distribution compared with the original examples. Specifically, these adversarial examples exhibit reduced confidence levels and greater divergence from the training data distribution. Consequently, they are easy to detect using straightforward detection methods, diminishing the efficacy of such attacks. To address this issue, we propose a Distribution-Aware Adversarial Attack ($DA^3$) method. $DA^3$ considers the distribution shifts of adversarial examples to improve attacks' effectiveness under detection methods. We further design a novel evaluation metric, the Non-detectable Attack Success Rate (NASR), which integrates both ASR and detectability for the attack task. We conduct experiments on four widely used datasets to validate the attack effectiveness and transferability of adversarial examples generated by $DA^3$ against both the white-box BERT-base and RoBERTa-base models and the black-box LLaMA2-7b model.
Authors: Zichen Chen, Jianda Chen, Ambuj Singh, Misha Sra
Abstract: Large Language Models (LLMs) have achieved remarkable success in natural language tasks, yet understanding their reasoning processes remains a significant challenge. We address this by introducing XplainLLM, a dataset accompanying an explanation framework designed to enhance LLM transparency and reliability. Our dataset comprises 24,204 instances where each instance interprets the LLM's reasoning behavior using knowledge graphs (KGs) and graph attention networks (GAT), and includes explanations of LLMs such as the decoder-only Llama-3 and the encoder-only RoBERTa. XplainLLM also features a framework for generating grounded explanations and the debugger-scores for multidimensional quality analysis. Our explanations include why-choose and why-not-choose components, reason-elements, and debugger-scores that collectively illuminate the LLM's reasoning behavior. Our evaluations demonstrate XplainLLM's potential to reduce hallucinations and improve grounded explanation generation in LLMs. XplainLLM is a resource for researchers and practitioners to build trust and verify the reliability of LLM outputs.
Authors: Haofei Yu, Zhengyang Qi, Lawrence Jang, Ruslan Salakhutdinov, Louis-Philippe Morency, Paul Pu Liang
Abstract: Advances in multimodal models have greatly improved how interactions relevant to various tasks are modeled. Today's multimodal models mainly focus on the correspondence between images and text, using this for tasks like image-text matching. However, this covers only a subset of real-world interactions. Novel interactions, such as sarcasm expressed through opposing spoken words and gestures or humor expressed through utterances and tone of voice, remain challenging. In this paper, we introduce an approach to enhance multimodal models, which we call Multimodal Mixtures of Experts (MMoE). The key idea in MMoE is to train separate expert models for each type of multimodal interaction, such as redundancy present in both modalities, uniqueness in one modality, or synergy that emerges when both modalities are fused. On a sarcasm detection task (MUStARD) and a humor detection task (URFUNNY), we obtain new state-of-the-art results. MMoE is also able to be applied to various types of models to gain improvement.
Authors: Benjamin Clavi\'e
Abstract: As language-specific training data tends to be sparsely available compared to English, document retrieval in many languages has been largely relying on multilingual models. In Japanese, the best performing deep-learning based retrieval approaches rely on multilingual dense embedders, with Japanese-only models lagging far behind. However, multilingual models require considerably more compute and data to train and have higher computational and memory requirements while often missing out on culturally-relevant information. In this paper, we introduce JaColBERT, a family of multi-vector retrievers trained on two magnitudes fewer data than their multilingual counterparts while reaching competitive performance. Our strongest model largely outperform all existing monolingual Japanese retrievers on all dataset, as well as the strongest existing multilingual models on all out-of-domain tasks, highlighting the need for specialised models able to handle linguistic specificities. These results are achieved using a model with only 110 million parameters, considerably smaller than all multilingual models, and using only a limited Japanese-language. We believe our results show great promise to support Japanese retrieval-enhanced application pipelines in a wide variety of domains.
Authors: Junjie Ye, Yilong Wu, Songyang Gao, Caishuang Huang, Sixian Li, Guanyu Li, Xiaoran Fan, Qi Zhang, Tao Gui, Xuanjing Huang
Abstract: Tool learning has generated widespread interest as a vital means of interaction between Large Language Models (LLMs) and the physical world. Current research predominantly emphasizes LLMs' capacity to utilize tools in well-structured environments while overlooking their stability when confronted with the inevitable noise of the real world. To bridge this gap, we introduce RoTBench, a multi-level benchmark for evaluating the robustness of LLMs in tool learning. Specifically, we establish five external environments, each featuring varying levels of noise (i.e., Clean, Slight, Medium, Heavy, and Union), providing an in-depth analysis of the model's resilience across three critical phases: tool selection, parameter identification, and content filling. Experiments involving six widely-used models underscore the urgent necessity for enhancing the robustness of LLMs in tool learning. For instance, the performance of GPT-4 even drops significantly from 80.00 to 58.10 when there is no substantial change in manual accuracy. More surprisingly, the noise correction capability inherent in the GPT family paradoxically impedes its adaptability in the face of mild noise. In light of these findings, we propose RoTTuning, a strategy that enriches the diversity of training environments to bolster the robustness of LLMs in tool learning. The code and data are available at https://github.com/Junjie-Ye/RoTBench.
Authors: Fanqi Wan, Xinting Huang, Leyang Cui, Xiaojun Quan, Wei Bi, Shuming Shi
Abstract: While large language models (LLMs) have demonstrated exceptional performance across various tasks following human alignment, they may still generate responses that sound plausible but contradict factual knowledge, a phenomenon known as hallucination. In this paper, we demonstrate the feasibility of mitigating hallucinations by verifying and minimizing the inconsistency between external knowledge present in the alignment data and the intrinsic knowledge embedded within foundation LLMs. Specifically, we propose a novel approach called Knowledge Consistent Alignment (KCA), which employs a well-aligned LLM to automatically formulate assessments based on external knowledge to evaluate the knowledge boundaries of foundation LLMs. To address knowledge inconsistencies in the alignment data, KCA implements several specific strategies to deal with these data instances. We demonstrate the superior efficacy of KCA in reducing hallucinations across six benchmarks, utilizing foundation LLMs of varying backbones and scales. This confirms the effectiveness of mitigating hallucinations by reducing knowledge inconsistency. Our code, model weights, and data are openly accessible at \url{https://github.com/fanqiwan/KCA}.
Authors: Wesley H. Holliday, Matthew Mandelkern, Cedegao E. Zhang
Abstract: The reasoning abilities of large language models (LLMs) are the topic of a growing body of research in AI and cognitive science. In this paper, we probe the extent to which twenty-five LLMs are able to distinguish logically correct inferences from logically fallacious ones. We focus on inference patterns involving conditionals (e.g., 'If Ann has a queen, then Bob has a jack') and epistemic modals (e.g., 'Ann might have an ace', 'Bob must have a king'). These inferences have been of special interest to logicians, philosophers, and linguists, since they play a central role in the fundamental human ability to reason about distal possibilities. Assessing LLMs on these inferences is thus highly relevant to the question of how much the reasoning abilities of LLMs match those of humans. Among the LLMs we tested, all but the GPT-4 model family often make basic mistakes with conditionals, though zero-shot chain-of-thought prompting helps them make fewer mistakes. Moreover, even the GPT-4 family displays logically inconsistent judgments across inference patterns involving epistemic modals, and almost all models give answers to certain complex conditional inferences widely discussed in the literature that do not match human judgments. These results highlight gaps in basic logical reasoning in today's LLMs.
Authors: Zichong Wang, Zhibo Chu, Thang Viet Doan, Shiwen Ni, Min Yang, Wenbin Zhang
Abstract: Language models serve as a cornerstone in natural language processing (NLP), utilizing mathematical methods to generalize language laws and knowledge for prediction and generation. Over extensive research spanning decades, language modeling has progressed from initial statistical language models (SLMs) to the contemporary landscape of large language models (LLMs). Notably, the swift evolution of LLMs has reached the ability to process, understand, and generate human-level text. Nevertheless, despite the significant advantages that LLMs offer in improving both work and personal lives, the limited understanding among general practitioners about the background and principles of these models hampers their full potential. Notably, most LLM reviews focus on specific aspects and utilize specialized language, posing a challenge for practitioners lacking relevant background knowledge. In light of this, this survey aims to present a comprehensible overview of LLMs to assist a broader audience. It strives to facilitate a comprehensive understanding by exploring the historical background of language models and tracing their evolution over time. The survey further investigates the factors influencing the development of LLMs, emphasizing key contributions. Additionally, it concentrates on elucidating the underlying principles of LLMs, equipping audiences with essential theoretical knowledge. The survey also highlights the limitations of existing work and points out promising future directions.
Authors: Shangyu Xing, Fei Zhao, Zhen Wu, Tuo An, Weihao Chen, Chunhui Li, Jianbing Zhang, Xinyu Dai
Abstract: Multimodal large language models (MLLMs) have attracted increasing attention in the past few years, but they may still generate descriptions that include objects not present in the corresponding images, a phenomenon known as object hallucination. To eliminate hallucinations, existing methods manually annotate paired responses with and without hallucinations, and then employ various alignment algorithms to improve the alignment capability between images and text. However, they not only demand considerable computation resources during the finetuning stage but also require expensive human annotation to construct paired data needed by the alignment algorithms. To address these issues, we borrow the idea of unlearning and propose an efficient fine-grained unlearning framework (EFUF), which can eliminate hallucinations without the need for paired data. Extensive experiments show that our method consistently reduces hallucinations while preserving the generation quality with modest computational overhead. Our code and datasets will be publicly available.
Authors: Khiem Phi, Noushin Salek Faramarzi, Chenlu Wang, Ritwik Banerjee
Abstract: Whataboutism, a potent tool for disrupting narratives and sowing distrust, remains under-explored in quantitative NLP research. Moreover, past work has not distinguished its use as a strategy for misinformation and propaganda from its use as a tool for pragmatic and semantic framing. We introduce new datasets from Twitter and YouTube, revealing overlaps as well as distinctions between whataboutism, propaganda, and the tu quoque fallacy. Furthermore, drawing on recent work in linguistic semantics, we differentiate the `what about' lexical construct from whataboutism. Our experiments bring to light unique challenges in its accurate detection, prompting the introduction of a novel method using attention weights for negative sample mining. We report significant improvements of 4% and 10% over previous state-of-the-art methods in our Twitter and YouTube collections, respectively.
Authors: Alexander Arno Weber, Klaudia Thellmann, Jan Ebert, Nicolas Flores-Herr, Jens Lehmann, Michael Fromm, Mehdi Ali
Abstract: The adaption of multilingual pre-trained LLMs into eloquent and helpful assistants is essential to facilitate their use across different language regions. In that spirit, we are the first to conduct an extensive study of the performance of multilingual models instruction-tuned on different language compositions on parallel instruction-tuning benchmarks across a selection of the most spoken Indo-European languages. We systematically examine the effects of language and instruction dataset size on a mid-sized and a large, multilingual LLMs by instruction-tuning them on parallel instruction-tuning datasets. Our results demonstrate that instruction-tuning on parallel instead of monolingual corpora benefits cross-lingual instruction following capabilities by up to 9.9%. Furthermore, we show that the Superficial Alignment Hypothesis does not hold in general, as the investigated multilingual 7B parameter model presents a counter-example requiring large-scale instruction-tuning datasets. Finally, we conduct a human annotation study to understand the alignment between human-based and GPT-4-based evaluation within multilingual chat scenarios.
Authors: Xinglin Lyu, Junhui Li, Yanqing Zhao, Min Zhang, Daimeng Wei, Shimin Tao, Hao Yang, Min Zhang
Abstract: Generally, the decoder-only large language models (LLMs) are adapted to context-aware neural machine translation (NMT) in a concatenating way, where LLMs take the concatenation of the source sentence (i.e., intra-sentence context) and the inter-sentence context as the input, and then to generate the target tokens sequentially. This adaptation strategy, i.e., concatenation mode, considers intra-sentence and inter-sentence contexts with the same priority, despite an apparent difference between the two kinds of contexts. In this paper, we propose an alternative adaptation approach, named Decoding-enhanced Multi-phase Prompt Tuning (DeMPT), to make LLMs discriminately model and utilize the inter- and intra-sentence context and more effectively adapt LLMs to context-aware NMT. First, DeMPT divides the context-aware NMT process into three separate phases. During each phase, different continuous prompts are introduced to make LLMs discriminately model various information. Second, DeMPT employs a heuristic way to further discriminately enhance the utilization of the source-side inter- and intra-sentence information at the final decoding phase. Experiments show that our approach significantly outperforms the concatenation method, and further improves the performance of LLMs in discourse modeling.
Authors: Derong Xu, Ziheng Zhang, Zhihong Zhu, Zhenxi Lin, Qidong Liu, Xian Wu, Tong Xu, Wanyu Wang, Yuyang Ye, Xiangyu Zhao, Enhong Chen, Yefeng Zheng
Abstract: Model editing aims to precisely alter the behaviors of large language models (LLMs) in relation to specific knowledge, while leaving unrelated knowledge intact. This approach has proven effective in addressing issues of hallucination and outdated information in LLMs. However, the potential of using model editing to modify knowledge in the medical field remains largely unexplored, even though resolving hallucination is a pressing need in this area. Our observations indicate that current methods face significant challenges in dealing with specialized and complex knowledge in medical domain. Therefore, we propose MedLaSA, a novel Layer-wise Scalable Adapter strategy for medical model editing. MedLaSA harnesses the strengths of both adding extra parameters and locate-then-edit methods for medical model editing. We utilize causal tracing to identify the association of knowledge in neurons across different layers, and generate a corresponding scale set from the association value for each piece of knowledge. Subsequently, we incorporate scalable adapters into the dense layers of LLMs. These adapters are assigned scaling values based on the corresponding specific knowledge, which allows for the adjustment of the adapter's weight and rank. The more similar the content, the more consistent the scale between them. This ensures precise editing of semantically identical knowledge while avoiding impact on unrelated knowledge. To evaluate the editing impact on the behaviours of LLMs, we propose two model editing studies for medical domain: (1) editing factual knowledge for medical specialization and (2) editing the explanatory ability for complex knowledge. We build two novel medical benchmarking datasets and introduce a series of challenging and comprehensive metrics. Extensive experiments on medical LLMs demonstrate the editing efficiency of MedLaSA, without affecting unrelated knowledge.
Authors: Jinman Zhao, Xueyan Zhang
Abstract: We present a comprehensive evaluation of large language models(LLMs)' ability to reason about composition relations through a benchmark encompassing 1,500 test cases in English, designed to cover six distinct types of composition relations: Positional, Comparative, Personal, Mathematical, Identity, and Other. Acknowledging the significance of multilingual capabilities, we expanded our assessment to include translations of these cases into Chinese, Japanese, French, and Korean. Our Multilingual Composition Relation (MCR) benchmark aims at investigating the robustness and adaptability of LLMs in handling composition relation reasoning across diverse linguistic contexts.
Authors: Zhitao He, Pengfei Cao, Chenhao Wang, Zhuoran Jin, Yubo Chen, Jiexin Xu, Huaijun Li, Xiaojian Jiang, Kang Liu, Jun Zhao
Abstract: With the development of deep learning, natural language processing technology has effectively improved the efficiency of various aspects of the traditional judicial industry. However, most current efforts focus on tasks within individual judicial stages, making it difficult to handle complex tasks that span multiple stages. As the autonomous agents powered by large language models are becoming increasingly smart and able to make complex decisions in real-world settings, offering new insights for judicial intelligence. In this paper, (1) we propose a novel multi-agent framework, AgentsCourt, for judicial decision-making. Our framework follows the classic court trial process, consisting of court debate simulation, legal resources retrieval and decision-making refinement to simulate the decision-making of judge. (2) we introduce SimuCourt, a judicial benchmark that encompasses 420 Chinese judgment documents, spanning the three most common types of judicial cases. Furthermore, to support this task, we construct a large-scale legal knowledge base, Legal-KB, with multi-resource legal knowledge. (3) Extensive experiments show that our framework outperforms the existing advanced methods in various aspects, especially in generating legal articles, where our model achieves significant improvements of 8.6% and 9.1% F1 score in the first and second instance settings, respectively.
Authors: Tong Zhang, Chen Huang, Yang Deng, Hongru Liang, Jia Liu, Zujie Wen, Wenqiang Lei, Tat-Seng Chua
Abstract: We investigate non-collaborative dialogue agents, which are expected to engage in strategic conversations with diverse users, for securing a mutual agreement that leans favorably towards the system's objectives. This poses two main challenges for existing dialogue agents: 1) The inability to integrate user-specific characteristics into the strategic planning, and 2) The difficulty of training strategic planners that can be generalized to diverse users. To address these challenges, we propose Trip to enhance the capability in tailored strategic planning, incorporating a user-aware strategic planning module and a population-based training paradigm. Through experiments on benchmark non-collaborative dialogue tasks, we demonstrate the effectiveness of Trip in catering to diverse users.
Authors: Preetam Prabhu Srikar Dammu, Himanshu Naidu, Mouly Dewan, YoungMin Kim, Tanya Roosta, Aman Chadha, Chirag Shah
Abstract: In the midst of widespread misinformation and disinformation through social media and the proliferation of AI-generated texts, it has become increasingly difficult for people to validate and trust information they encounter. Many fact-checking approaches and tools have been developed, but they often lack appropriate explainability or granularity to be useful in various contexts. A text validation method that is easy to use, accessible, and can perform fine-grained evidence attribution has become crucial. More importantly, building user trust in such a method requires presenting the rationale behind each prediction, as research shows this significantly influences people's belief in automated systems. Localizing and bringing users' attention to the specific problematic content is also paramount, instead of providing simple blanket labels. In this paper, we present ClaimVer, a human-centric framework tailored to meet users' informational and verification needs by generating rich annotations and thereby reducing cognitive load. Designed to deliver comprehensive evaluations of texts, it highlights each claim, verifies it against a trusted knowledge graph (KG), presents the evidence, and provides succinct, clear explanations for each claim prediction. Finally, our framework introduces an attribution score, enhancing applicability across a wide range of downstream tasks.
Authors: Weihang Su, Yichen Tang, Qingyao Ai, Zhijing Wu, Yiqun Liu
Abstract: Dynamic retrieval augmented generation (RAG) paradigm actively decides when and what to retrieve during the text generation process of Large Language Models (LLMs). There are two key elements of this paradigm: identifying the optimal moment to activate the retrieval module (deciding when to retrieve) and crafting the appropriate query once retrieval is triggered (determining what to retrieve). However, current dynamic RAG methods fall short in both aspects. Firstly, the strategies for deciding when to retrieve often rely on static rules. Moreover, the strategies for deciding what to retrieve typically limit themselves to the LLM's most recent sentence or the last few tokens, while the LLM's real-time information needs may span across the entire context. To overcome these limitations, we introduce a new framework, DRAGIN, i.e., Dynamic Retrieval Augmented Generation based on the real-time Information Needs of LLMs. Our framework is specifically designed to make decisions on when and what to retrieve based on the LLM's real-time information needs during the text generation process. We evaluate DRAGIN along with existing methods comprehensively over 4 knowledge-intensive generation datasets. Experimental results show that DRAGIN achieves superior performance on all tasks, demonstrating the effectiveness of our method. We have open-sourced all the code, data, and models in GitHub: https://github.com/oneal2000/DRAGIN/tree/main
Authors: Eftekhar Hossain, Omar Sharif, Mohammed Moshiul Hoque, Sarah M. Preum
Abstract: Internet memes have become a powerful means for individuals to express emotions, thoughts, and perspectives on social media. While often considered as a source of humor and entertainment, memes can also disseminate hateful content targeting individuals or communities. Most existing research focuses on the negative aspects of memes in high-resource languages, overlooking the distinctive challenges associated with low-resource languages like Bengali (also known as Bangla). Furthermore, while previous work on Bengali memes has focused on detecting hateful memes, there has been no work on detecting their targeted entities. To bridge this gap and facilitate research in this arena, we introduce a novel multimodal dataset for Bengali, BHM (Bengali Hateful Memes). The dataset consists of 7,148 memes with Bengali as well as code-mixed captions, tailored for two tasks: (i) detecting hateful memes, and (ii) detecting the social entities they target (i.e., Individual, Organization, Community, and Society). To solve these tasks, we propose DORA (Dual cO attention fRAmework), a multimodal deep neural network that systematically extracts the significant modality features from the memes and jointly evaluates them with the modality-specific features to understand the context better. Our experiments show that DORA is generalizable on other low-resource hateful meme datasets and outperforms several state-of-the-art rivaling baselines.
Authors: Tyler Loakman, Chen Tang, Chenghua Lin
Abstract: Previous work in phonologically and phonetically grounded language generation has mainly focused on domains such as puns and poetry. In this article, we present new work on the generation of English tongue twisters - a form of language that is required to be conditioned on a phoneme level to maximize sound overlap, while maintaining semantic consistency with an input topic or phrase and still being grammatically correct. We present TwisterLister, a pipeline for generating phonologically informed tongue twisters from large language models (LLMs) that we use to generate TwistList 2.0, the largest annotated dataset of tongue twisters to date, consisting of 17K+ examples from a combination of human and LLM authors. Our generation pipeline involves the use of a phonologically constrained vocabulary alongside LLM prompting to generate novel, non-derivative tongue twister examples. We additionally present the results of automatic and human evaluation of smaller models trained on our generated dataset to demonstrate the extent to which phonologically motivated language types can be generated without explicit injection of phonological knowledge. Additionally, we introduce a phoneme-aware constrained decoding module (PACD) that can be integrated into an autoregressive language model and demonstrate that this method generates good quality tongue twisters both with and without fine-tuning the underlying language model. We also design and implement a range of automatic metrics for the task of tongue twister generation that is phonologically motivated and captures the unique essence of tongue twisters, primarily based on phonemic edit distance (PED)
Authors: Shujie Hu, Long Zhou, Shujie Liu, Sanyuan Chen, Lingwei Meng, Hongkun Hao, Jing Pan, Xunying Liu, Jinyu Li, Sunit Sivasankaran, Linquan Liu, Furu Wei
Abstract: The recent advancements in large language models (LLMs) have revolutionized the field of natural language processing, progressively broadening their scope to multimodal perception and generation. However, effectively integrating listening capabilities into LLMs poses significant challenges, particularly with respect to generalizing across varied contexts and executing complex auditory tasks. In this work, we introduce WavLLM, a robust and adaptive speech large language model with dual encoders, and a prompt-aware LoRA weight adapter, optimized by a two-stage curriculum learning approach. Leveraging dual encoders, we decouple different types of speech information, utilizing a Whisper encoder to process the semantic content of speech, and a WavLM encoder to capture the unique characteristics of the speaker's identity. Within the curriculum learning framework, WavLLM first builds its foundational capabilities by optimizing on mixed elementary single tasks, followed by advanced multi-task training on more complex tasks such as combinations of the elementary tasks. To enhance the flexibility and adherence to different tasks and instructions, a prompt-aware LoRA weight adapter is introduced in the second advanced multi-task training stage. We validate the proposed model on universal speech benchmarks including tasks such as ASR, ST, SV, ER, and also apply it to specialized datasets like Gaokao English listening comprehension set for SQA, and speech Chain-of-Thought (CoT) evaluation set. Experiments demonstrate that the proposed model achieves state-of-the-art performance across a range of speech tasks on the same model size, exhibiting robust generalization capabilities in executing complex tasks using CoT approach. Furthermore, our model successfully completes Gaokao tasks without specialized training. The codes, models, audio, and Gaokao evaluation set can be accessed at \url{aka.ms/wavllm}.
Authors: Irene Pagliai, Goya van Boven, Tosin Adewumi, Lama Alkhaled, Namrata Gurung, Isabella S\"odergren, Elisa Barney
Abstract: We introduce new large labeled datasets on bias in 3 languages and show in experiments that bias exists in all 10 datasets of 5 languages evaluated, including benchmark datasets on the English GLUE/SuperGLUE leaderboards. The 3 new languages give a total of almost 6 million labeled samples and we benchmark on these datasets using SotA multilingual pretrained models: mT5 and mBERT. The challenge of social bias, based on prejudice, is ubiquitous, as recent events with AI and large language models (LLMs) have shown. Motivated by this challenge, we set out to estimate bias in multiple datasets. We compare some recent bias metrics and use bipol, which has explainability in the metric. We also confirm the unverified assumption that bias exists in toxic comments by randomly sampling 200 samples from a toxic dataset population using the confidence level of 95% and error margin of 7%. Thirty gold samples were randomly distributed in the 200 samples to secure the quality of the annotation. Our findings confirm that many of the datasets have male bias (prejudice against women), besides other types of bias. We publicly release our new datasets, lexica, models, and codes.
Authors: Monica Romero, Sandra Gomez, Ivan G. Torre
Abstract: Indigenous languages are a fundamental legacy in the development of human communication, embodying the unique identity and culture of local communities in America. The Second AmericasNLP (Americas Natural Language Processing) Competition Track 1 of NeurIPS (Neural Information Processing Systems) 2022 proposed the task of training automatic speech recognition (ASR) systems for five Indigenous languages: Quechua, Guarani, Bribri, Kotiria, and Wa'ikhana. In this paper, we describe the fine-tuning of a state-of-the-art ASR model for each target language, using approximately 36.65 h of transcribed speech data from diverse sources enriched with data augmentation methods. We systematically investigate, using a Bayesian search, the impact of the different hyperparameters on the Wav2vec2.0 XLS-R (Cross-Lingual Speech Representations) variants of 300 M and 1 B parameters. Our findings indicate that data and detailed hyperparameter tuning significantly affect ASR accuracy, but language complexity determines the final result. The Quechua model achieved the lowest character error rate (CER) (12.14), while the Kotiria model, despite having the most extensive dataset during the fine-tuning phase, showed the highest CER (36.59). Conversely, with the smallest dataset, the Guarani model achieved a CER of 15.59, while Bribri and Wa'ikhana obtained, respectively, CERs of 34.70 and 35.23. Additionally, Sobol' sensitivity analysis highlighted the crucial roles of freeze fine-tuning updates and dropout rates. We release our best models for each language, marking the first open ASR models for Wa'ikhana and Kotiria. This work opens avenues for future research to advance ASR techniques in preserving minority Indigenous languages
Authors: Juhwan Choi, Jungmin Yun, Kyohoon Jin, YoungBin Kim
Abstract: The quality of the dataset is crucial for ensuring optimal performance and reliability of downstream task models. However, datasets often contain noisy data inadvertently included during the construction process. Numerous attempts have been made to correct this issue through human annotators. However, hiring and managing human annotators is expensive and time-consuming. As an alternative, recent studies are exploring the use of large language models (LLMs) for data annotation. In this study, we present a case study that extends the application of LLM-based data annotation to enhance the quality of existing datasets through a cleansing strategy. Specifically, we leverage approaches such as chain-of-thought and majority voting to imitate human annotation and classify unrelated documents from the Multi-News dataset, which is widely used for the multi-document summarization task. Through our proposed cleansing method, we introduce an enhanced Multi-News+. By employing LLMs for data cleansing, we demonstrate an efficient and effective approach to improving dataset quality without relying on expensive human annotation efforts.
Authors: Mahammed Kamruzzaman, Gene Louis Kim
Abstract: Dual process theory posits that human cognition arises via two systems. System 1, which is a quick, emotional, and intuitive process, which is subject to cognitive biases, and System 2, is a slow, onerous, and deliberate process. NLP researchers often compare zero-shot prompting in LLMs to System 1 reasoning and chain-of-thought (CoT) prompting to System 2. In line with this interpretation, prior research has found that using CoT prompting in LLMs leads to reduced gender bias. We investigate the relationship between bias, CoT prompting, a debiasing prompt, and dual process theory in LLMs directly. We compare zero-shot CoT, debiasing, and a variety of dual process theory-based prompting strategies on two bias datasets spanning nine different social bias categories. We incorporate human and machine personas to determine whether the effects of dual process theory in LLMs exist independent of explicit persona models or are based on modeling human cognition. We find that a human persona, debiasing, System 2, and CoT prompting all tend to reduce social biases in LLMs, though the best combination of features depends on the exact model and bias category -- resulting in up to a 19 percent drop in stereotypical judgments by an LLM.
Authors: Juhwan Choi, Yeonghwa Kim, Seunguk Yu, JungMin Yun, YoungBin Kim
Abstract: Although pre-trained language models have exhibited great flexibility and versatility with prompt-based few-shot learning, they suffer from the extensive parameter size and limited applicability for inference. Recent studies have suggested that PLMs be used as dataset generators and a tiny task-specific model be trained to achieve efficient inference. However, their applicability to various domains is limited because they tend to generate domain-specific datasets. In this work, we propose a novel approach to universal domain generalization that generates a dataset regardless of the target domain. This allows for generalization of the tiny task model to any domain that shares the label space, thus enhancing the real-world applicability of the dataset generation paradigm. Our experiments indicate that the proposed method accomplishes generalizability across various domains while using a parameter set that is orders of magnitude smaller than PLMs.
Authors: Yinzhu Quan, Zefang Liu
Abstract: In this paper, we introduce EconLogicQA, a rigorous benchmark designed to assess the sequential reasoning capabilities of large language models (LLMs) within the intricate realms of economics, business, and supply chain management. Diverging from traditional benchmarks that predict subsequent events individually, EconLogicQA poses a more challenging task: it requires models to discern and sequence multiple interconnected events, capturing the complexity of economic logics. EconLogicQA comprises an array of multi-event scenarios derived from economic articles, which necessitate an insightful understanding of both temporal and logical event relationships. Through comprehensive evaluations, we exhibit that EconLogicQA effectively gauges a LLM's proficiency in navigating the sequential complexities inherent in economic contexts. We provide a detailed description of EconLogicQA dataset and shows the outcomes from evaluating the benchmark across various leading-edge LLMs, thereby offering a thorough perspective on their sequential reasoning potential in economic contexts. Our benchmark dataset is available at https://huggingface.co/datasets/yinzhu-quan/econ_logic_qa.
URLs: https://huggingface.co/datasets/yinzhu-quan/econ_logic_qa.
Authors: Junhao Song, Yingfang Yuan, Kaiwen Chang, Bing Xu, Jin Xuan, Wei Pang
Abstract: To advance the circular economy (CE), it is crucial to gain insights into the evolution of public attention, cognitive pathways of the masses concerning circular products, and to identify primary concerns. To achieve this, we collected data from diverse platforms, including Twitter, Reddit, and The Guardian, and utilised three topic models to analyse the data. Given the performance of topic modelling may vary depending on hyperparameter settings, this research proposed a novel framework that integrates twin (single and multi-objective) hyperparameter optimisation for the CE. We conducted systematic experiments to ensure that topic models are set with appropriate hyperparameters under different constraints, providing valuable insights into the correlations between CE and public attention. In summary, our optimised model reveals that public remains concerned about the economic impacts of sustainability and circular practices, particularly regarding recyclable materials and environmentally sustainable technologies. The analysis shows that the CE has attracted significant attention on The Guardian, especially in topics related to sustainable development and environmental protection technologies, while discussions are comparatively less active on Twitter. These insights highlight the need for policymakers to implement targeted education programs, create incentives for businesses to adopt CE principles, and enforce more stringent waste management policies alongside improved recycling processes.
Authors: Virgile Rennard, Guokan Shang, Michalis Vazirgiannis, Julie Hunter
Abstract: We introduce an extractive summarization system for meetings that leverages discourse structure to better identify salient information from complex multi-party discussions. Using discourse graphs to represent semantic relations between the contents of utterances in a meeting, we train a GNN-based node classification model to select the most important utterances, which are then combined to create an extractive summary. Experimental results on AMI and ICSI demonstrate that our approach surpasses existing text-based and graph-based extractive summarization systems, as measured by both classification and summarization metrics. Additionally, we conduct ablation studies on discourse structure and relation type to provide insights for future NLP applications leveraging discourse analysis theory.
Authors: Tabea M. G. Pakull, Hendrik Damm, Ahmad Idrissi-Yaghir, Henning Sch\"afer, Peter A. Horn, Christoph M. Friedrich
Abstract: This paper details the efforts of the WisPerMed team in the BioLaySumm2024 Shared Task on automatic lay summarization in the biomedical domain, aimed at making scientific publications accessible to non-specialists. Large language models (LLMs), specifically the BioMistral and Llama3 models, were fine-tuned and employed to create lay summaries from complex scientific texts. The summarization performance was enhanced through various approaches, including instruction tuning, few-shot learning, and prompt variations tailored to incorporate specific context information. The experiments demonstrated that fine-tuning generally led to the best performance across most evaluated metrics. Few-shot learning notably improved the models' ability to generate relevant and factually accurate texts, particularly when using a well-crafted prompt. Additionally, a Dynamic Expert Selection (DES) mechanism to optimize the selection of text outputs based on readability and factuality metrics was developed. Out of 54 participants, the WisPerMed team reached the 4th place, measured by readability, factuality, and relevance. Determined by the overall score, our approach improved upon the baseline by approx. 5.5 percentage points and was only approx 1.5 percentage points behind the first place.
Authors: Chen Huang, Yang Deng, Wenqiang Lei, Jiancheng Lv, Ido Dagan
Abstract: To obtain high-quality annotations under limited budget, semi-automatic annotation methods are commonly used, where a portion of the data is annotated by experts and a model is then trained to complete the annotations for the remaining data. However, these methods mainly focus on selecting informative data for expert annotations to improve the model predictive ability (i.e., triage-to-human data), while the rest of the data is indiscriminately assigned to model annotation (i.e., triage-to-model data). This may lead to inefficiencies in budget allocation for annotations, as easy data that the model could accurately annotate may be unnecessarily assigned to the expert, and hard data may be misclassified by the model. As a result, the overall annotation quality may be compromised. To address this issue, we propose a selective annotation framework called SANT. It effectively takes advantage of both the triage-to-human and triage-to-model data through the proposed error-aware triage and bi-weighting mechanisms. As such, informative or hard data is assigned to the expert for annotation, while easy data is handled by the model. Experimental results show that SANT consistently outperforms other baselines, leading to higher-quality annotation through its proper allocation of data to both expert and model workers. We provide pioneering work on data annotation within budget constraints, establishing a landmark for future triage-based annotation studies.
Authors: Shangqing Tu, Kejian Zhu, Yushi Bai, Zijun Yao, Lei Hou, Juanzi Li
Abstract: The advancement of large language models (LLMs) relies on evaluation using public benchmarks, but data contamination can lead to overestimated performance. Previous researches focus on detecting contamination by determining whether the model has seen the exact same data during training. Besides, prior work has already shown that even training on data similar to benchmark data inflates performance, namely \emph{In-distribution contamination}. In this work, we argue that in-distribution contamination can lead to the performance drop on OOD benchmarks. To effectively detect in-distribution contamination, we propose DICE, a novel method that leverages the internal states of LLMs to locate-then-detect the contamination. DICE first identifies the most sensitive layer to contamination, then trains a classifier based on the internal states of that layer. Experiments reveal DICE's high accuracy in detecting in-distribution contamination across various LLMs and math reasoning datasets. We also show the generalization capability of the trained DICE detector, which is able to detect contamination across multiple benchmarks with similar distributions. Additionally, we find that DICE's predictions correlate with the performance of LLMs fine-tuned by either us or other organizations, achieving a coefficient of determination ($R^2$) between 0.61 and 0.75. The code and data are available at https://github.com/THU-KEG/DICE.
Authors: Kehua Feng, Keyan Ding, Weijie Wang, Xiang Zhuang, Zeyuan Wang, Ming Qin, Yu Zhao, Jianhua Yao, Qiang Zhang, Huajun Chen
Abstract: Large language models (LLMs) have gained increasing prominence in scientific research, but there is a lack of comprehensive benchmarks to fully evaluate their proficiency in understanding and mastering scientific knowledge. To address this need, we introduce the SciKnowEval benchmark, a novel framework that systematically evaluates LLMs across five progressive levels of scientific knowledge: studying extensively, inquiring earnestly, thinking profoundly, discerning clearly, and practicing assiduously. These levels aim to assess the breadth and depth of scientific knowledge in LLMs, including memory, comprehension, reasoning, discernment, and application. Specifically, we first construct a large-scale evaluation dataset encompassing 70K multi-level scientific problems and solutions in the domains of biology, chemistry, physics, and materials science. By leveraging this dataset, we benchmark 26 advanced open-source and proprietary LLMs using zero-shot and few-shot prompting strategies. The results reveal that despite the state-of-the-art performance of proprietary LLMs, there is still significant room for improvement, particularly in addressing scientific reasoning and applications. We anticipate that SciKnowEval will establish a standard for benchmarking LLMs in science research and promote the development of stronger scientific LLMs. The dataset and code are publicly available at https://scimind.ai/sciknoweval .
Authors: Jinqiang Wang, Huansheng Ning, Yi Peng, Qikai Wei, Daniel Tesfai, Wenwei Mao, Tao Zhu, Runhe Huang
Abstract: Large Language Models (LLMs) have demonstrated surprising performance across various natural language processing tasks. Recently, medical LLMs enhanced with domain-specific knowledge have exhibited excellent capabilities in medical consultation and diagnosis. These models can smoothly simulate doctor-patient dialogues and provide professional medical advice. Most medical LLMs are developed through continued training of open-source general LLMs, which require significantly fewer computational resources than training LLMs from scratch. Additionally, this approach offers better patient privacy protection than API-based solutions. Given the above advantages, this survey systematically summarizes how to train medical LLMs based on open-source general LLMs from a more fine-grained perspective. It covers (a) how to acquire training corpus and construct customized medical training sets, (b) how to choose an appropriate training paradigm, (c) how to choose a suitable evaluation benchmark, and (d) existing challenges and promising research directions are discussed. This survey can provide guidance for the development of LLMs focused on various medical applications, such as medical education, diagnostic planning, and clinical assistants. Related resources and supplemental information can be found on the GitHub repository.
Authors: Alessio Devoto, Yu Zhao, Simone Scardapane, Pasquale Minervini
Abstract: The deployment of large language models (LLMs) is often hindered by the extensive memory requirements of the Key-Value (KV) cache, especially as context lengths increase. Existing approaches to reduce the KV cache size involve either fine-tuning the model to learn a compression strategy or leveraging attention scores to reduce the sequence length. We analyse the attention distributions in decoder-only Transformers-based models and observe that attention allocation patterns stay consistent across most layers. Surprisingly, we find a clear correlation between the $L_2$ and the attention scores over cached KV pairs, where a low $L_2$ of a key embedding usually leads to a high attention score during decoding. This finding indicates that the influence of a KV pair is potentially determined by the key embedding itself before being queried. Based on this observation, we compress the KV cache based on the $L_2$ of key embeddings. Our experimental results show that this simple strategy can reduce the KV cache size by 50% on language modelling and needle-in-a-haystack tasks and 90% on passkey retrieval tasks without losing accuracy. Moreover, without relying on the attention scores, this approach remains compatible with FlashAttention, enabling broader applicability.
Authors: Sherzod Hakimov, Yerkezhan Abdullayeva, Kushal Koshti, Antonia Schmidt, Yan Weiser, Anne Beyer, David Schlangen
Abstract: While the situation has improved for text-only models, it again seems to be the case currently that multimodal (text and image) models develop faster than ways to evaluate them. In this paper, we bring a recently developed evaluation paradigm from text models to multimodal models, namely evaluation through the goal-oriented game (self) play, complementing reference-based and preference-based evaluation. Specifically, we define games that challenge a model's capability to represent a situation from visual information and align such representations through dialogue. We find that the largest closed models perform rather well on the games that we define, while even the best open-weight models struggle with them. On further analysis, we find that the exceptional deep captioning capabilities of the largest models drive some of the performance. There is still room to grow for both kinds of models, ensuring the continued relevance of the benchmark.
Authors: Cheng-Han Chiang, Wei-Chih Chen, Chun-Yi Kuan, Chienchou Yang, Hung-yi Lee
Abstract: Using large language models (LLMs) for automatic evaluation has become an important evaluation method in NLP research. However, it is unclear whether these LLM-based evaluators can be applied in real-world classrooms to assess student assignments. This empirical report shares how we use GPT-4 as an automatic assignment evaluator in a university course with 1,028 students. Based on student responses, we find that LLM-based assignment evaluators are generally acceptable to students when students have free access to these LLM-based evaluators. However, students also noted that the LLM sometimes fails to adhere to the evaluation instructions. Additionally, we observe that students can easily manipulate the LLM-based evaluator to output specific strings, allowing them to achieve high scores without meeting the assignment rubric. Based on student feedback and our experience, we provide several recommendations for integrating LLM-based evaluators into future classrooms. Our observation also highlights potential directions for improving LLM-based evaluators, including their instruction-following ability and vulnerability to prompt hacking.
Authors: Yanfei Chen, Jinsung Yoon, Devendra Singh Sachan, Qingze Wang, Vincent Cohen-Addad, Mohammadhossein Bateni, Chen-Yu Lee, Tomas Pfister
Abstract: Recent advances in large language models (LLMs) have enabled autonomous agents with complex reasoning and task-fulfillment capabilities using a wide range of tools. However, effectively identifying the most relevant tools for a given task becomes a key bottleneck as the toolset size grows, hindering reliable tool utilization. To address this, we introduce Re-Invoke, an unsupervised tool retrieval method designed to scale effectively to large toolsets without training. Specifically, we first generate a diverse set of synthetic queries that comprehensively cover different aspects of the query space associated with each tool document during the tool indexing phase. Second, we leverage LLM's query understanding capabilities to extract key tool-related context and underlying intents from user queries during the inference phase. Finally, we employ a novel multi-view similarity ranking strategy based on intents to pinpoint the most relevant tools for each query. Our evaluation demonstrates that Re-Invoke significantly outperforms state-of-the-art alternatives in both single-tool and multi-tool scenarios, all within a fully unsupervised setting. Notably, on the ToolE datasets, we achieve a 20% relative improvement in nDCG@5 for single-tool retrieval and a 39% improvement for multi-tool retrieval.
Authors: Zhi Rui Tam, Cheng-Kuang Wu, Yi-Lin Tsai, Chieh-Yen Lin, Hung-yi Lee, Yun-Nung Chen
Abstract: Structured generation, the process of producing content in standardized formats like JSON and XML, is widely utilized in real-world applications to extract key output information from large language models (LLMs). This study investigates whether such constraints on generation space impact LLMs abilities, including reasoning and domain knowledge comprehension. Specifically, we evaluate LLMs performance when restricted to adhere to structured formats versus generating free-form responses across various common tasks. Surprisingly, we observe a significant decline in LLMs reasoning abilities under format restrictions. Furthermore, we find that stricter format constraints generally lead to greater performance degradation in reasoning tasks.
Authors: Niklas Wretblad, Oskar Holmstr\"om, Erik Larsson, Axel Wiks\"ater, Oscar S\"oderlund, Hjalmar \"Ohman, Ture Pont\'en, Martin Forsberg, Martin S\"orme, Fredrik Heintz
Abstract: Relational databases often suffer from uninformative descriptors of table contents, such as ambiguous columns and hard-to-interpret values, impacting both human users and text-to-SQL models. In this paper, we explore the use of large language models (LLMs) to automatically generate detailed natural language descriptions for SQL database columns, aiming to improve text-to-SQL performance and automate metadata creation. We create a dataset of gold column descriptions based on the BIRD-Bench benchmark, manually refining its column descriptions and creating a taxonomy for categorizing column difficulty. Through evaluating several LLMs, we find that incorporating these column descriptions consistently enhances text-to-SQL model performance, particularly for larger models like GPT-4o, Qwen2 72B and Mixtral 22Bx8. However, models struggle with columns that exhibit inherent ambiguity, highlighting the need for manual expert input. Notably, Qwen2-generated descriptions, containing by annotators deemed superfluous information, outperform manually curated gold descriptions, suggesting that models benefit from more detailed metadata than humans expect. Future work will investigate the specific features of these high-performing descriptions and explore other types of metadata, such as numerical reasoning and synonyms, to further improve text-to-SQL systems. The dataset, annotations and code will all be made available.
Authors: Zhibo Zhang, Wuxia Bai, Yuxi Li, Mark Huasong Meng, Kailong Wang, Ling Shi, Li Li, Jun Wang, Haoyu Wang
Abstract: Large language models (LLMs) have achieved unprecedented success in the field of natural language processing. However, the black-box nature of their internal mechanisms has brought many concerns about their trustworthiness and interpretability. Recent research has discovered a class of abnormal tokens in the model's vocabulary space and named them "glitch tokens". Those tokens, once included in the input, may induce the model to produce incorrect, irrelevant, or even harmful results, drastically undermining the reliability and practicality of LLMs. In this work, we aim to enhance the understanding of glitch tokens and propose techniques for their detection and mitigation. We first reveal the characteristic features induced by glitch tokens on LLMs, which are evidenced by significant deviations in the distributions of attention patterns and dynamic information from intermediate model layers. Based on the insights, we develop GlitchProber, a tool for efficient glitch token detection and mitigation. GlitchProber utilizes small-scale sampling, principal component analysis for accelerated feature extraction, and a simple classifier for efficient vocabulary screening. Taking one step further, GlitchProber rectifies abnormal model intermediate layer values to mitigate the destructive effects of glitch tokens. Evaluated on five mainstream open-source LLMs, GlitchProber demonstrates higher efficiency, precision, and recall compared to existing approaches, with an average F1 score of 0.86 and an average repair rate of 50.06%. GlitchProber unveils a novel path to address the challenges posed by glitch tokens and inspires future research toward more robust and interpretable LLMs.
Authors: Lia Shahnazaryan, Meriem Beloucif
Abstract: Recent advancements in neural machine translation (NMT) have revolutionized the field, yet the dependency on extensive parallel corpora limits progress for low-resource languages and domains. Cross-lingual transfer learning offers a promising solution by utilizing data from high-resource languages but often struggles with in-domain NMT. This paper investigates zero-shot cross-lingual domain adaptation for NMT, focusing on the impact of domain specification and linguistic factors on transfer effectiveness. Using English as the source language and Spanish for fine-tuning, we evaluate multiple target languages, including Portuguese, Italian, French, Czech, Polish, and Greek. We demonstrate that both language-specific and domain-specific factors influence transfer effectiveness, with domain characteristics playing a crucial role in determining cross-domain transfer potential. We also explore the feasibility of zero-shot cross-lingual cross-domain transfer, providing insights into which domains are more responsive to transfer and why. Our results show the importance of well-defined domain boundaries and transparency in experimental setups for in-domain transfer learning.
Authors: Yijie Jin
Abstract: Multimodal Sentiment Analysis (MSA) leverages multiple data modals to analyze human sentiment. Existing MSA models generally employ cutting-edge multimodal fusion and representation learning-based methods to promote MSA capability. However, there are two key challenges: (i) in existing multimodal fusion methods, the decoupling of modal combinations and tremendous parameter redundancy, lead to insufficient fusion performance and efficiency; (ii) a challenging trade-off exists between representation capability and computational overhead in unimodal feature extractors and encoders. Our proposed GSIFN incorporates two main components to solve these problems: (i) a graph-structured and interlaced-masked multimodal Transformer. It adopts the Interlaced Mask mechanism to construct robust multimodal graph embedding, achieve all-modal-in-one Transformer-based fusion, and greatly reduce the computational overhead; (ii) a self-supervised learning framework with low computational overhead and high performance, which utilizes a parallelized LSTM with matrix memory to enhance non-verbal modal features for unimodal label generation. Evaluated on the MSA datasets CMU-MOSI, CMU-MOSEI, and CH-SIMS, GSIFN demonstrates superior performance with significantly lower computational overhead compared with previous state-of-the-art models.
Authors: Hanna Abi Akl
Abstract: We introduce SHADOW, a fine-tuned language model trained on an intermediate task using associative deductive reasoning, and measure its performance on a knowledge base construction task using Wikidata triple completion. We evaluate SHADOW on the LM-KBC 2024 challenge and show that it outperforms the baseline solution by 20% with a F1 score of 68.72%.
Authors: Ramon Ferrer-i-Cancho
Abstract: We address the linguistic problem of the sequential arrangement of a head and its dependents from an information theoretic perspective. In particular, we consider the optimal placement of a head that maximizes the predictability of the sequence. We assume that dependents are statistically independent given a head, in line with the open-choice principle and the core assumptions of dependency grammar. We demonstrate the optimality of harmonic order, i.e., placing the head last maximizes the predictability of the head whereas placing the head first maximizes the predictability of dependents. We also show that postponing the head is the optimal strategy to maximize its predictability while bringing it forward is the optimal strategy to maximize the predictability of dependents. We unravel the advantages of the strategy of maximizing the predictability of the head over maximizing the predictability of dependents. Our findings shed light on the placements of the head adopted by real languages or emerging in different kinds of experiments.
Authors: Ond\v{r}ej Pra\v{z}\'ak, Miloslav Konop\'ik
Abstract: Coreference resolution, the task of identifying expressions in text that refer to the same entity, is a critical component in various natural language processing (NLP) applications. This paper presents our end-to-end neural coreference resolution system, utilizing the CorefUD 1.1 dataset, which spans 17 datasets across 12 languages. Our model is based on the end-to-end neural coreference resolution system. We first establish strong baseline models, including monolingual and cross-lingual variations, and then propose several extensions to enhance performance across diverse linguistic contexts. These extensions include cross-lingual training, incorporation of syntactic information, a Span2Head model for optimized headword prediction, and advanced singleton modeling. We also experiment with headword span representation and long-document modeling through overlapping segments. The proposed extensions, particularly the heads-only approach, singleton modeling, and long document prediction, significantly improve performance across most datasets. We also perform zero-shot cross-lingual experiments, highlighting the potential and limitations of cross-lingual transfer in coreference resolution. Our findings contribute to the development of robust and scalable coreference systems for multilingual coreference resolution. Finally, we evaluate our model on the CorefUD 1.1 test set and surpass the best model from the CRAC 2023 shared task of comparable size by a large margin. Our model is available on GitHub: https://github.com/ondfa/coref-multiling
Authors: Yang Liu, Xichou Zhu, Zhou Shen, Yi Liu, Min Li, Yujun Chen, Benzi John, Zhenzhen Ma, Tao Hu, Zhi Li, Zhiyang Xu, Wei Luo, Junhui Wang
Abstract: Large Language Models (LLMs) have recently displayed their extraordinary capabilities in language understanding. However, how to comprehensively assess the sentiment capabilities of LLMs continues to be a challenge. This paper investigates the ability of LLMs to detect and react to sentiment in text modal. As the integration of LLMs into diverse applications is on the rise, it becomes highly critical to comprehend their sensitivity to emotional tone, as it can influence the user experience and the efficacy of sentiment-driven tasks. We conduct a series of experiments to evaluate the performance of several prominent LLMs in identifying and responding appropriately to sentiments like positive, negative, and neutral emotions. The models' outputs are analyzed across various sentiment benchmarks, and their responses are compared with human evaluations. Our discoveries indicate that although LLMs show a basic sensitivity to sentiment, there are substantial variations in their accuracy and consistency, emphasizing the requirement for further enhancements in their training processes to better capture subtle emotional cues. Take an example in our findings, in some cases, the models might wrongly classify a strongly positive sentiment as neutral, or fail to recognize sarcasm or irony in the text. Such misclassifications highlight the complexity of sentiment analysis and the areas where the models need to be refined. Another aspect is that different LLMs might perform differently on the same set of data, depending on their architecture and training datasets. This variance calls for a more in-depth study of the factors that contribute to the performance differences and how they can be optimized.
Authors: Xichou Zhu, Yang Liu, Zhou Shen, Yi Liu, Min Li, Yujun Chen, Benzi John, Zhenzhen Ma, Tao Hu, Zhi Li, Bolong Yang, Manman Wang, Zongxing Xie, Peng Liu, Dan Cai, Junhui Wang
Abstract: The recent advances in large language models (LLMs) have significantly expanded their applications across various fields such as language generation, summarization, and complex question answering. However, their application to privacy compliance and technical privacy reviews remains under-explored, raising critical concerns about their ability to adhere to global privacy standards and protect sensitive user data. This paper seeks to address this gap by providing a comprehensive case study evaluating LLMs' performance in privacy-related tasks such as privacy information extraction (PIE), legal and regulatory key point detection (KPD), and question answering (QA) with respect to privacy policies and data protection regulations. We introduce a Privacy Technical Review (PTR) framework, highlighting its role in mitigating privacy risks during the software development life-cycle. Through an empirical assessment, we investigate the capacity of several prominent LLMs, including BERT, GPT-3.5, GPT-4, and custom models, in executing privacy compliance checks and technical privacy reviews. Our experiments benchmark the models across multiple dimensions, focusing on their precision, recall, and F1-scores in extracting privacy-sensitive information and detecting key regulatory compliance points. While LLMs show promise in automating privacy reviews and identifying regulatory discrepancies, significant gaps persist in their ability to fully comply with evolving legal standards. We provide actionable recommendations for enhancing LLMs' capabilities in privacy compliance, emphasizing the need for robust model improvements and better integration with legal and regulatory requirements. This study underscores the growing importance of developing privacy-aware LLMs that can both support businesses in compliance efforts and safeguard user privacy rights.
Authors: Zhuowei Chen, Lianxi Wang, Yuben Wu, Xinfeng Liao, Yujia Tian, Junyang Zhong
Abstract: Sentiment classification (SC) often suffers from low-resource challenges such as domain-specific contexts, imbalanced label distributions, and few-shot scenarios. The potential of the diffusion language model (LM) for textual data augmentation (DA) remains unexplored, moreover, textual DA methods struggle to balance the diversity and consistency of new samples. Most DA methods either perform logical modifications or rephrase less important tokens in the original sequence with the language model. In the context of SC, strong emotional tokens could act critically on the sentiment of the whole sequence. Therefore, contrary to rephrasing less important context, we propose DiffusionCLS to leverage a diffusion LM to capture in-domain knowledge and generate pseudo samples by reconstructing strong label-related tokens. This approach ensures a balance between consistency and diversity, avoiding the introduction of noise and augmenting crucial features of datasets. DiffusionCLS also comprises a Noise-Resistant Training objective to help the model generalize. Experiments demonstrate the effectiveness of our method in various low-resource scenarios including domain-specific and domain-general problems. Ablation studies confirm the effectiveness of our framework's modules, and visualization studies highlight optimal deployment conditions, reinforcing our conclusions.
Authors: Zifan Zheng, Yezhaohui Wang, Yuxin Huang, Shichao Song, Mingchuan Yang, Bo Tang, Feiyu Xiong, Zhiyu Li
Abstract: Since the advent of ChatGPT, Large Language Models (LLMs) have excelled in various tasks but remain as black-box systems. Consequently, the reasoning bottlenecks of LLMs are mainly influenced by their internal architecture. As a result, many researchers have begun exploring the potential internal mechanisms of LLMs, with most studies focusing on attention heads. Our survey aims to shed light on the internal reasoning processes of LLMs by concentrating on the underlying mechanisms of attention heads. We first distill the human thought process into a four-stage framework: Knowledge Recalling, In-Context Identification, Latent Reasoning, and Expression Preparation. Using this framework, we systematically review existing research to identify and categorize the functions of specific attention heads. Furthermore, we summarize the experimental methodologies used to discover these special heads, dividing them into two categories: Modeling-Free methods and Modeling-Required methods. Also, we outline relevant evaluation methods and benchmarks. Finally, we discuss the limitations of current research and propose several potential future directions.
Authors: Junkai Wu, Xulin Fan, Bo-Ru Lu, Xilin Jiang, Nima Mesgarani, Mark Hasegawa-Johnson, Mari Ostendorf
Abstract: In recent years, we have observed a rapid advancement in speech language models (SpeechLLMs), catching up with humans' listening and reasoning abilities. SpeechLLMs have demonstrated impressive spoken dialog question-answering (SQA) performance in benchmarks like Gaokao, the English listening test of the college entrance exam in China, which seemingly requires understanding both the spoken content and voice characteristics of speakers in a conversation. However, after carefully examining Gaokao's questions, we find the correct answers to many questions can be inferred from the conversation transcript alone, i.e.\ without speaker segmentation and identification. Our evaluation of state-of-the-art models Qwen-Audio and WavLLM on both Gaokao and our proposed "What Do You Like?" dataset shows a significantly higher accuracy in these context-based questions than in identity-critical questions, which can only be answered reliably with correct speaker identification. The results and analysis suggest that when solving SQA, the current SpeechLLMs exhibit limited speaker awareness from the audio and behave similarly to an LLM reasoning from the conversation transcription without sound. We propose that tasks focused on identity-critical questions could offer a more accurate evaluation framework of SpeechLLMs in SQA.
Authors: Sung-Lin Yeh, Hao Tang
Abstract: Representing speech with discrete units has been widely used in speech codec and speech generation. However, there are several unverified claims about self-supervised discrete units, such as disentangling phonetic and speaker information with k-means, or assuming information loss after k-means. In this work, we take an information-theoretic perspective to answer how much information is present (information completeness) and how much information is accessible (information accessibility), before and after residual vector quantization. We show a lower bound for information completeness and estimate completeness on discretized HuBERT representations after residual vector quantization. We find that speaker information is sufficiently present in HuBERT discrete units, and that phonetic information is sufficiently present in the residual, showing that vector quantization does not achieve disentanglement. Our results offer a comprehensive assessment on the choice of discrete units, and suggest that a lot more information in the residual should be mined rather than discarded.
Authors: Yufei Tao, Adam Hiatt, Erik Haake, Antonie J. Jetter, Ameeta Agrawal
Abstract: Large language models (LLMs) have demonstrated remarkable progress in leveraging diverse knowledge sources. This study investigates how nine widely used LLMs allocate knowledge between local context and global parameters when answering open-ended questions in knowledge-consistent scenarios. We introduce a novel dataset, WikiAtomic, and systematically vary context sizes to analyze how LLMs prioritize and utilize the provided information and their parametric knowledge in knowledge-consistent scenarios. Additionally, we also study their tendency to hallucinate under varying context sizes. Our findings reveal consistent patterns across models, including a consistent reliance on both contextual (around 70%) and parametric (around 30%) knowledge, and a decrease in hallucinations with increasing context. These insights highlight the importance of more effective context organization and developing models that use input more deterministically for robust performance.
Authors: Xiang Zhang, Muhammad Abdul-Mageed, Laks V. S. Lakshmanan
Abstract: The Transformer architecture excels in a variety of language modeling tasks, outperforming traditional neural architectures such as RNN and LSTM. This is partially due to its elimination of recurrent connections, which allows for parallel training and a smoother flow of gradients. However, this move away from recurrent structures places the Transformer model at the lower end of Chomsky's computational hierarchy, imposing limitations on its computational abilities. Consequently, even advanced Transformer-based models face considerable difficulties in tasks like counting, string reversal, and multiplication. These tasks, though seemingly elementary, require a level of computational complexity that exceeds the capabilities of the Transformer architecture. Concurrently, the emergence of ``Chain of Thought" (CoT) prompting has enabled Transformer-based language models to tackle tasks that were previously impossible or poorly executed. In this work, we thoroughly investigate the influence of recurrent structures in neural models on their reasoning abilities and computability, contrasting the role autoregression plays in the neural models' computational power. We then shed light on how the CoT approach can mimic recurrent computation and act as a bridge between autoregression and recurrence in the context of language models. It is this approximated recurrence that notably improves the model's performance and computational capacity. Moreover, we revisit recent recurrent-based Transformer model designs, focusing on their computational abilities through our proposed concept of ``recurrence-completeness" and identify key theoretical limitations in models like Linear Transformer and RWKV. Through this, we aim to provide insight into the neural model architectures and prompt better model design.
Authors: Nicholas Pangakis, Samuel Wolken
Abstract: Automated text annotation is a compelling use case for generative large language models (LLMs) in social media research. Recent work suggests that LLMs can achieve strong performance on annotation tasks; however, these studies evaluate LLMs on a small number of tasks and likely suffer from contamination due to a reliance on public benchmark datasets. Here, we test a human-centered framework for responsibly evaluating artificial intelligence tools used in automated annotation. We use GPT-4 to replicate 27 annotation tasks across 11 password-protected datasets from recently published computational social science articles in high-impact journals. For each task, we compare GPT-4 annotations against human-annotated ground-truth labels and against annotations from separate supervised classification models fine-tuned on human-generated labels. Although the quality of LLM labels is generally high, we find significant variation in LLM performance across tasks, even within datasets. Our findings underscore the importance of a human-centered workflow and careful evaluation standards: Automated annotations significantly diverge from human judgment in numerous scenarios, despite various optimization strategies such as prompt tuning. Grounding automated annotation in validation labels generated by humans is essential for responsible evaluation.
Authors: Yi-Jyun Sun, Suvodip Dey, Dilek Hakkani-Tur, Gokhan Tur
Abstract: Estimation of a model's confidence on its outputs is critical for Conversational AI systems based on large language models (LLMs), especially for reducing hallucination and preventing over-reliance. In this work, we provide an exhaustive exploration of methods, including approaches proposed for open- and closed-weight LLMs, aimed at quantifying and leveraging model uncertainty to improve the reliability of LLM-generated responses, specifically focusing on dialogue state tracking (DST) in task-oriented dialogue systems (TODS). Regardless of the model type, well-calibrated confidence scores are essential to handle uncertainties, thereby improving model performance. We evaluate four methods for estimating confidence scores based on softmax, raw token scores, verbalized confidences, and a combination of these methods, using the area under the curve (AUC) metric to assess calibration, with higher AUC indicating better calibration. We also enhance these with a self-probing mechanism, proposed for closed models. Furthermore, we assess these methods using an open-weight model fine-tuned for the task of DST, achieving superior joint goal accuracy (JGA). Our findings also suggest that fine-tuning open-weight LLMs can result in enhanced AUC performance, indicating better confidence score calibration.
Authors: Shanshan Wang, Chun Zhang, Ning Zhang
Abstract: The Knowledge Graph-to-Text Generation task aims to convert structured knowledge graphs into coherent and human-readable natural language text. Recent efforts in this field have focused on enhancing pre-trained language models (PLMs) by incorporating graph structure information to capture the intricate structure details of knowledge graphs. However, most of these approaches tend to capture only single-granularity structure information, concentrating either on the relationships between entities within the original graph or on the relationships between words within the same entity or across different entities. This narrow focus results in a significant limitation: models that concentrate solely on entity-level structure fail to capture the nuanced semantic relationships between words, while those that focus only on word-level structure overlook the broader relationships between original entire entities. To overcome these limitations, this paper introduces the Multi-granularity Graph Structure Attention (MGSA), which is based on PLMs. The encoder of the model architecture features an entity-level structure encoding module, a word-level structure encoding module, and an aggregation module that synthesizes information from both structure. This multi-granularity structure encoding approach allows the model to simultaneously capture both entity-level and word-level structure information, providing a more comprehensive understanding of the knowledge graph's structure information, thereby significantly improving the quality of the generated text. We conducted extensive evaluations of the MGSA model using two widely recognized KG-to-Text Generation benchmark datasets, WebNLG and EventNarrative, where it consistently outperformed models that rely solely on single-granularity structure information, demonstrating the effectiveness of our approach.
Authors: Yi Lu, Jing Nathan Yan, Songlin Yang, Justin T. Chiu, Siyu Ren, Fei Yuan, Wenting Zhao, Zhiyong Wu, Alexander M. Rush
Abstract: Broad textual understanding and in-context learning require language models that utilize full document contexts. Due to the implementation challenges associated with directly training long-context models, many methods have been proposed for extending models to handle long contexts. However, owing to differences in data and model classes, it has been challenging to compare these approaches, leading to uncertainty as to how to evaluate long-context performance and whether it differs from standard evaluation. We implement a controlled protocol for extension methods with a standardized evaluation, utilizing consistent base models and extension data. Our study yields several insights into long-context behavior. First, we reaffirm the critical role of perplexity as a general-purpose performance indicator even in longer-context tasks. Second, we find that current approximate attention methods systematically underperform across long-context tasks. Finally, we confirm that exact fine-tuning based methods are generally effective within the range of their extension, whereas extrapolation remains challenging. All codebases, models, and checkpoints will be made available open-source, promoting transparency and facilitating further research in this critical area of AI development.
Authors: Junlin Lv, Yuan Feng, Xike Xie, Xin Jia, Qirong Peng, Guiming Xie
Abstract: Large language models have achieved notable success across various domains, yet efficient inference is still limited by the quadratic computation complexity of the attention mechanism. The inference consists of prefilling and decoding phases. Although several attempts have been made to accelerate decoding, the inefficiency of the prefilling phase, especially for long-context tasks, remains a challenge. In this paper, we observe a locality in query criticality during the prefilling phase of long-context processing: adjacent query tokens tend to focus on similar subsets of the past Key-Value (KV) cache. Based on this observation, we propose CritiPrefill, a criticality-based segment-wise prefilling method. This method partitions the input sequence's queries and KV cache into segments and blocks, utilizing a segment-wise algorithm to estimate the query criticality. By pruning non-critical computations between query segments and cache blocks in the self-attention mechanism, the prefilling process can be significantly accelerated. Extensive evaluations on multiple long-context datasets show up to 2.7x speedup on Llama3-8B and 3.0x speedup on Yi-9B for 128K context length on a single A100 GPU, with minimal quality degradation.
Authors: Thomas Savage, Stephen Ma, Abdessalem Boukil, Vishwesh Patel, Ekanath Rangan, Ivan Rodriguez, Jonathan H Chen
Abstract: Large Language Model (LLM) fine tuning is underutilized in the field of medicine. Two of the most common methods of fine tuning are Supervised Fine Tuning (SFT) and Direct Preference Optimization (DPO), but there is little guidance informing users when to use either technique. In this investigation, we compare the performance of SFT and DPO for five common natural language tasks in medicine: Classification with text data, Classification with numeric data, Clinical Reasoning, Summarization, and Clinical Triage. We find that SFT alone is sufficient for Classification with text data, whereas DPO improves performance for the more complex tasks of Clinical Reasoning, Summarization and Clinical Triage. Our results establish the role and importance of DPO fine tuning within medicine, and consequently call attention to current software gaps that prevent widespread deployment of this technique.
Authors: Shivam Shandilya, Menglin Xia, Supriyo Ghosh, Huiqiang Jiang, Jue Zhang, Qianhui Wu, Victor R\"uhle
Abstract: The increasing prevalence of large language models (LLMs) such as GPT-4 in various applications has led to a surge in the size of prompts required for optimal performance, leading to challenges in computational efficiency. Prompt compression aims to reduce the inference cost by minimizing input tokens without compromising on the task performance. However, existing prompt compression techniques either rely on sub-optimal metrics such as information entropy or model it as a task-agnostic token classification problem that fails to capture task-specific information. To address these issues, we propose a novel and efficient reinforcement learning (RL) based task-aware prompt compression method. To ensure low latency requirements, we leverage existing Transformer encoder-based token classification model while guiding the learning process with task-specific reward signals using lightweight REINFORCE algorithm. We evaluate the performance of our method on three diverse and challenging tasks including text summarization, question answering and code summarization. We demonstrate that our RL-guided compression method improves the task performance by 8% - 260% across these three scenarios over state-of-the-art compression techniques while satisfying the same compression rate and latency requirements.
Authors: Yupu Hao, Pengfei Cao, Zhuoran Jin, Huanxuan Liao, Yubo Chen, Kang Liu, Jun Zhao
Abstract: Tool learning enables the Large Language Models (LLMs) to interact with the external environment by invoking tools, enriching the accuracy and capability scope of LLMs. However, previous works predominantly focus on improving model's tool-utilizing accuracy and the ability to generalize to new, unseen tools, excessively forcing LLMs to adjust specific tool-invoking pattern without considering the harm to model's general performance. This deviates from the actual applications and original intention of integrating tools to enhance model. To tackle this problem, we dissect the capability trade-offs by examining the hidden representation changes and the gradient-based importance score of model's components. Based on the analysis result, we propose a Component Importance-based Tool-utilizing ability Injection method (CITI). According to the gradient-based importance score of different components, it alleviates the capability conflicts caused by fine-tuning process by applying distinct training strategies to different components. CITI applies Mixture-Of-LoRA (MOLoRA) for important components. Meanwhile, it fine-tunes the parameters of few components deemed less important in the backbone of the LLM, while keeping other parameters frozen. CITI can effectively enhance the model's tool-utilizing capability without excessively compromising its general performance. Experimental results demonstrate that our approach achieves outstanding performance across a range of evaluation metrics.
Authors: Mohanna Hoveyda, Arjen P. de Vries, Maarten de Rijke, Harrie Oosterhuis, Faegheh Hasibi
Abstract: In question answering (QA), different questions can be effectively addressed with different answering strategies. Some require a simple lookup, while others need complex, multi-step reasoning to be answered adequately. This observation motivates the development of a dynamic method that adaptively selects the most suitable QA strategy for each question, enabling more efficient and effective systems capable of addressing a broader range of question types. To this aim, we build on recent advances in the orchestration of multiple large language models (LLMs) and formulate adaptive QA as a dynamic orchestration challenge. We define this as a contextual multi-armed bandit problem, where the context is defined by the characteristics of the incoming question and the action space consists of potential communication graph configurations among the LLM agents. We then train a linear upper confidence bound model to learn an optimal mapping between different question types and their corresponding optimal multi-LLM communication graph representation. Our experiments show that the proposed solution is viable for adaptive orchestration of a QA system with multiple modules, as it combines the superior performance of more complex strategies while avoiding their costs when simpler strategies suffice.
Authors: Avyav Kumar Singh, Ekaterina Shutova, Helen Yannakoudakis
Abstract: Existing approaches to few-shot learning in NLP rely on large language models (LLMs) and/or fine-tuning of these to generalise on out-of-distribution data. In this work, we propose a novel few-shot learning approach based on soft-label prototypes (SLPs) designed to collectively capture the distribution of different classes across the input domain space. We focus on learning previously unseen NLP tasks from very few examples (4, 8, 16) per class and experimentally demonstrate that our approach achieves superior performance on the majority of tested tasks in this data-lean setting while being highly parameter efficient. We also show that our few-shot adaptation method can be integrated into more generalised learning settings, primarily meta-learning, to yield superior performance against strong baselines.
Authors: Muris Sladi\'c, Veronica Valeros, Carlos Catania, Sebastian Garcia
Abstract: Honeypots are essential tools in cybersecurity for early detection, threat intelligence gathering, and analysis of attacker's behavior. However, most of them lack the required realism to engage and fool human attackers long-term. Being easy to distinguish honeypots strongly hinders their effectiveness. This can happen because they are too deterministic, lack adaptability, or lack deepness. This work introduces shelLM, a dynamic and realistic software honeypot based on Large Language Models that generates Linux-like shell output. We designed and implemented shelLM using cloud-based LLMs. We evaluated if shelLM can generate output as expected from a real Linux shell. The evaluation was done by asking cybersecurity researchers to use the honeypot and give feedback if each answer from the honeypot was the expected one from a Linux shell. Results indicate that shelLM can create credible and dynamic answers capable of addressing the limitations of current honeypots. ShelLM reached a TNR of 0.90, convincing humans it was consistent with a real Linux shell. The source code and prompts for replicating the experiments have been publicly available.
Authors: Arthur Douillard, Qixuan Feng, Andrei A. Rusu, Rachita Chhaparia, Yani Donchev, Adhiguna Kuncoro, Marc'Aurelio Ranzato, Arthur Szlam, Jiajun Shen
Abstract: Large language models (LLM) have become a critical component in many applications of machine learning. However, standard approaches to training LLM require a large number of tightly interconnected accelerators, with devices exchanging gradients and other intermediate states at each optimization step. While it is difficult to build and maintain a single computing cluster hosting many accelerators, it might be easier to find several computing clusters each hosting a smaller number of devices. In this work, we propose a distributed optimization algorithm, Distributed Low-Communication (DiLoCo), that enables training of language models on islands of devices that are poorly connected. The approach is a variant of federated averaging, where the number of inner steps is large, the inner optimizer is AdamW, and the outer optimizer is Nesterov momentum. On the widely used C4 dataset, we show that DiLoCo on 8 workers performs as well as fully synchronous optimization while communicating 500 times less. DiLoCo exhibits great robustness to the data distribution of each worker. It is also robust to resources becoming unavailable over time, and vice versa, it can seamlessly leverage resources that become available during training.
Authors: Erik Derner, Kristina Batisti\v{c}, Jan Zah\'alka, Robert Babu\v{s}ka
Abstract: As large language models (LLMs) permeate more and more applications, an assessment of their associated security risks becomes increasingly necessary. The potential for exploitation by malicious actors, ranging from disinformation to data breaches and reputation damage, is substantial. This paper addresses a gap in current research by specifically focusing on security risks posed by LLMs within the prompt-based interaction scheme, which extends beyond the widely covered ethical and societal implications. Our work proposes a taxonomy of security risks along the user-model communication pipeline and categorizes the attacks by target and attack type alongside the commonly used confidentiality, integrity, and availability (CIA) triad. The taxonomy is reinforced with specific attack examples to showcase the real-world impact of these risks. Through this taxonomy, we aim to inform the development of robust and secure LLM applications, enhancing their safety and trustworthiness.
Authors: Shicheng Li, Lei Li, Shuhuai Ren, Yuanxin Liu, Yi Liu, Rundong Gao, Xu Sun, Lu Hou
Abstract: The ability to perceive how objects change over time is a crucial ingredient in human intelligence. However, current benchmarks cannot faithfully reflect the temporal understanding abilities of video-language models (VidLMs) due to the existence of static visual shortcuts. To remedy this issue, we present VITATECS, a diagnostic VIdeo-Text dAtaset for the evaluation of TEmporal Concept underStanding. Specifically, we first introduce a fine-grained taxonomy of temporal concepts in natural language in order to diagnose the capability of VidLMs to comprehend different temporal aspects. Furthermore, to disentangle the correlation between static and temporal information, we generate counterfactual video descriptions that differ from the original one only in the specified temporal aspect. We employ a semi-automatic data collection framework using large language models and human-in-the-loop annotation to obtain high-quality counterfactual descriptions efficiently. Evaluation of representative video-language understanding models confirms their deficiency in temporal understanding, revealing the need for greater emphasis on the temporal elements in video-language research.
Authors: Bo Liu, Rachita Chhaparia, Arthur Douillard, Satyen Kale, Andrei A. Rusu, Jiajun Shen, Arthur Szlam, Marc'Aurelio Ranzato
Abstract: Local stochastic gradient descent (Local-SGD), also referred to as federated averaging, is an approach to distributed optimization where each device performs more than one SGD update per communication. This work presents an empirical study of {\it asynchronous} Local-SGD for training language models; that is, each worker updates the global parameters as soon as it has finished its SGD steps. We conduct a comprehensive investigation by examining how worker hardware heterogeneity, model size, number of workers, and optimizer could impact the learning performance. We find that with naive implementations, asynchronous Local-SGD takes more iterations to converge than its synchronous counterpart despite updating the (global) model parameters more frequently. We identify momentum acceleration on the global parameters when worker gradients are stale as a key challenge. We propose a novel method that utilizes a delayed Nesterov momentum update and adjusts the workers' local training steps based on their computation speed. This approach, evaluated with models up to 150M parameters on the C4 dataset, matches the performance of synchronous Local-SGD in terms of perplexity per update step, and significantly surpasses it in terms of wall clock time.
Authors: V. K. Cody Bumgardner, Mitchell A. Klusty, W. Vaiden Logan, Samuel E. Armstrong, Caylin Hickey, Jeff Talbert
Abstract: This paper introduces a user-friendly platform developed by the University of Kentucky Center for Applied AI, designed to make large, customized language models (LLMs) more accessible. By capitalizing on recent advancements in multi-LoRA inference, the system efficiently accommodates custom adapters for a diverse range of users and projects. The paper outlines the system's architecture and key features, encompassing dataset curation, model training, secure inference, and text-based feature extraction. We illustrate the establishment of a tenant-aware computational network using agent-based methods, securely utilizing islands of isolated resources as a unified system. The platform strives to deliver secure LLM services, emphasizing process and data isolation, end-to-end encryption, and role-based resource authentication. This contribution aligns with the overarching goal of enabling simplified access to cutting-edge AI models and technology in support of scientific discovery.
Authors: Hao Tang, Darren Key, Kevin Ellis
Abstract: We give a model-based agent that builds a Python program representing its knowledge of the world based on its interactions with the environment. The world model tries to explain its interactions, while also being optimistic about what reward it can achieve. We define this optimism as a logical constraint between a program and a planner. We study our agent on gridworlds, and on task planning, finding our approach is more sample-efficient compared to deep RL, more compute-efficient compared to ReAct-style agents, and that it can transfer its knowledge across environments by editing its code.
Authors: Nico Manzonelli, Wanrong Zhang, Salil Vadhan
Abstract: Recent research shows that large language models are susceptible to privacy attacks that infer aspects of the training data. However, it is unclear if simpler generative models, like topic models, share similar vulnerabilities. In this work, we propose an attack against topic models that can confidently identify members of the training data in Latent Dirichlet Allocation. Our results suggest that the privacy risks associated with generative modeling are not restricted to large neural models. Additionally, to mitigate these vulnerabilities, we explore differentially private (DP) topic modeling. We propose a framework for private topic modeling that incorporates DP vocabulary selection as a pre-processing step, and show that it improves privacy while having limited effects on practical utility.
Authors: Zixian Ma, Weikai Huang, Jieyu Zhang, Tanmay Gupta, Ranjay Krishna
Abstract: Real-world multi-modal problems are rarely solved by a single machine learning model, and often require multi-step computational plans that involve stitching several models. Tool-augmented LLMs hold tremendous promise for automating the generation of such computational plans. However, the lack of standardized benchmarks for evaluating LLMs as planners for multi-step multi-modal tasks has prevented a systematic study of planner design decisions. Should LLMs generate a full plan in a single shot or step-by-step? Should they invoke tools directly with Python code or through structured data formats like JSON? Does feedback improve planning? To answer these questions and more, we introduce m&m's: a benchmark containing 4K+ multi-step multi-modal tasks involving 33 tools that include multi-modal models, (free) public APIs, and image processing modules. For each of these task queries, we provide automatically generated plans using this realistic toolset. We further provide a high-quality subset of 1,565 task plans that are human-verified and correctly executable. With m&m's, we evaluate 10 popular LLMs with 2 planning strategies (multi-step vs. step-by-step planning), 2 plan formats (JSON vs. code), and 3 types of feedback (parsing/verification/execution). Finally, we summarize takeaways from our extensive experiments. Our dataset and code are available on HuggingFace (https://huggingface.co/datasets/zixianma/mnms) and Github (https://github.com/RAIVNLab/mnms).
URLs: https://huggingface.co/datasets/zixianma/mnms), https://github.com/RAIVNLab/mnms).
Authors: Abhishek Ghose, Emma Thuong Nguyen
Abstract: Active learning (AL) techniques optimally utilize a labeling budget by iteratively selecting instances that are most valuable for learning. However, they lack ``prerequisite checks'', i.e., there are no prescribed criteria to pick an AL algorithm best suited for a dataset. A practitioner must pick a technique they \emph{trust} would beat random sampling, based on prior reported results, and hope that it is resilient to the many variables in their environment: dataset, labeling budget and prediction pipelines. The important questions then are: how often on average, do we expect any AL technique to reliably beat the computationally cheap and easy-to-implement strategy of random sampling? Does it at least make sense to use AL in an ``Always ON'' mode in a prediction pipeline, so that while it might not always help, it never under-performs random sampling? How much of a role does the prediction pipeline play in AL's success? We examine these questions in detail for the task of text classification using pre-trained representations, which are ubiquitous today. Our primary contribution here is a rigorous evaluation of AL techniques, old and new, across setups that vary wrt datasets, text representations and classifiers. This unlocks multiple insights around warm-up times, i.e., number of labels before gains from AL are seen, viability of an ``Always ON'' mode and the relative significance of different factors. Additionally, we release a framework for rigorous benchmarking of AL techniques for text classification.
Authors: Junkai Chen, Zhiyuan Pan, Xing Hu, Zhenhao Li, Ge Li, Xin Xia
Abstract: Large language models for code (i.e., code LLMs) have shown strong code understanding and generation capabilities. To evaluate the capabilities of code LLMs in various aspects, many benchmarks have been proposed (e.g., HumanEval and ClassEval). Code reasoning is one of the most essential abilities of code LLMs, but existing benchmarks for code reasoning are not sufficient. Typically, they focus on predicting the input and output of a program, ignoring the evaluation of the intermediate behavior during program execution, as well as the logical consistency (e.g., the model should not give the correct output if the prediction of execution path is wrong) when performing the reasoning. To address these problems, in this paper, we propose a framework, namely REval, for evaluating code reasoning abilities and consistency of code LLMs with program execution. We utilize existing code benchmarks and adapt them to new benchmarks within our framework. A large-scale empirical study is conducted and most LLMs show unsatisfactory performance on both Runtime Behavior Reasoning (i.e., an average accuracy of 44.4%) and Incremental Consistency Evaluation (i.e., an average IC score of 10.3). Evaluation results of current code LLMs reflect the urgent need for the community to strengthen the code reasoning capability of code LLMs. Our code, data, and \newname leaderboard are available at https://r-eval.github.io.
Authors: Yuji Cao, Huan Zhao, Yuheng Cheng, Ting Shu, Yue Chen, Guolong Liu, Gaoqi Liang, Junhua Zhao, Jinyue Yan, Yun Li
Abstract: With extensive pre-trained knowledge and high-level general capabilities, large language models (LLMs) emerge as a promising avenue to augment reinforcement learning (RL) in aspects such as multi-task learning, sample efficiency, and high-level task planning. In this survey, we provide a comprehensive review of the existing literature in LLM-enhanced RL and summarize its characteristics compared to conventional RL methods, aiming to clarify the research scope and directions for future studies. Utilizing the classical agent-environment interaction paradigm, we propose a structured taxonomy to systematically categorize LLMs' functionalities in RL, including four roles: information processor, reward designer, decision-maker, and generator. For each role, we summarize the methodologies, analyze the specific RL challenges that are mitigated, and provide insights into future directions. Lastly, a comparative analysis of each role, potential applications, prospective opportunities, and challenges of the LLM-enhanced RL are discussed. By proposing this taxonomy, we aim to provide a framework for researchers to effectively leverage LLMs in the RL field, potentially accelerating RL applications in complex applications such as robotics, autonomous driving, and energy systems.
Authors: Pranab Sahoo, Prabhash Meharia, Akash Ghosh, Sriparna Saha, Vinija Jain, Aman Chadha
Abstract: The rapid advancement of foundation models (FMs) across language, image, audio, and video domains has shown remarkable capabilities in diverse tasks. However, the proliferation of FMs brings forth a critical challenge: the potential to generate hallucinated outputs, particularly in high-stakes applications. The tendency of foundation models to produce hallucinated content arguably represents the biggest hindrance to their widespread adoption in real-world scenarios, especially in domains where reliability and accuracy are paramount. This survey paper presents a comprehensive overview of recent developments that aim to identify and mitigate the problem of hallucination in FMs, spanning text, image, video, and audio modalities. By synthesizing recent advancements in detecting and mitigating hallucination across various modalities, the paper aims to provide valuable insights for researchers, developers, and practitioners. Essentially, it establishes a clear framework encompassing definition, taxonomy, and detection strategies for addressing hallucination in multimodal foundation models, laying the foundation for future research in this pivotal area.
Authors: Dan Biderman, Jacob Portes, Jose Javier Gonzalez Ortiz, Mansheej Paul, Philip Greengard, Connor Jennings, Daniel King, Sam Havens, Vitaliy Chiley, Jonathan Frankle, Cody Blakeney, John P. Cunningham
Abstract: Low-Rank Adaptation (LoRA) is a widely-used parameter-efficient finetuning method for large language models. LoRA saves memory by training only low rank perturbations to selected weight matrices. In this work, we compare the performance of LoRA and full finetuning on two target domains, programming and mathematics. We consider both the instruction finetuning (approximately 100K prompt-response pairs) and continued pretraining (20B unstructured tokens) data regimes. Our results show that, in the standard low-rank settings, LoRA substantially underperforms full finetuning. Nevertheless, LoRA better maintains the base model's performance on tasks outside the target domain. We show that LoRA mitigates forgetting more than common regularization techniques such as weight decay and dropout; it also helps maintain more diverse generations. Finally, we show that full finetuning learns perturbations with a rank that is 10-100X greater than typical LoRA configurations, possibly explaining some of the reported gaps. We conclude by proposing best practices for finetuning with LoRA.
Authors: Aditya Gunturu, Shivesh Jadon, Nandi Zhang, Morteza Faraji, Jarin Thundathil, Tafreed Ahmad, Wesley Willett, Ryo Suzuki
Abstract: Large Language Models (LLMs) are gaining popularity as tools for reading and summarization aids. However, little is known about their potential benefits when integrated with mixed reality (MR) interfaces to support everyday reading assistants. We developed RealitySummary, an MR reading assistant that seamlessly integrates LLMs with always-on camera access, OCR-based text extraction, and augmented spatial and visual responses in MR interfaces. Developed iteratively, RealitySummary evolved across three versions, each shaped by user feedback and reflective analysis: 1) a preliminary user study to understand user perceptions (N=12), 2) an in-the-wild deployment to explore real-world usage (N=11), and 3) a diary study to capture insights from real-world work contexts (N=5). Our findings highlight the unique advantages of combining AI and MR, including an always-on implicit assistant, minimal context switching, and spatial affordances, demonstrating significant potential for future LLM-MR interfaces beyond traditional screen-based interactions.
Authors: Qiyao Ma, Xubin Ren, Chao Huang
Abstract: Recommender systems help users navigate information overload by providing personalized recommendations aligned with their preferences. Collaborative Filtering (CF) is a widely adopted approach, but while advanced techniques like graph neural networks (GNNs) and self-supervised learning (SSL) have enhanced CF models for better user representations, they often lack the ability to provide explanations for the recommended items. Explainable recommendations aim to address this gap by offering transparency and insights into the recommendation decision-making process, enhancing users' understanding. This work leverages the language capabilities of Large Language Models (LLMs) to push the boundaries of explainable recommender systems. We introduce a model-agnostic framework called XRec, which enables LLMs to provide comprehensive explanations for user behaviors in recommender systems. By integrating collaborative signals and designing a lightweight collaborative adaptor, the framework empowers LLMs to understand complex patterns in user-item interactions and gain a deeper understanding of user preferences. Our extensive experiments demonstrate the effectiveness of XRec, showcasing its ability to generate comprehensive and meaningful explanations that outperform baseline approaches in explainable recommender systems. We open-source our model implementation at https://github.com/HKUDS/XRec.
Authors: Zijia Song, Zelin Zang, Yelin Wang, Guozheng Yang, Kaicheng yu, Wanyu Chen, Miaoyu Wang, Stan Z. Li
Abstract: Multimodal fusion breaks through the boundaries between diverse modalities and has already achieved notable performances. However, in many specialized fields, it is struggling to obtain sufficient alignment data for training, which seriously limits the use of previously effective models. Therefore, semi-supervised learning approaches are attempted to facilitate multimodal alignment by learning from low-alignment data with fewer matched pairs, but traditional techniques like pseudo-labeling may run into troubles in the label-deficient scenarios. To tackle these challenges, we reframe semi-supervised multimodal alignment as a manifold matching issue and propose a new methodology based on CLIP, termed Set-CLIP. Specifically, by designing a novel semantic density distribution loss, we constrain the latent representation distribution with fine granularity and extract implicit semantic alignment from unpaired multimodal data, thereby reducing the reliance on numerous strictly matched pairs. Furthermore, we apply coarse-grained modality adaptation and unimodal self-supervised guidance to narrow the gaps between modality spaces and improve the stability of representation distributions. Extensive experiments conducted on a range of tasks in various fields, including protein analysis, remote sensing, and the general vision-language field, validate the efficacy of our proposed Set-CLIP method. Especially with no paired data for supervised training, Set-CLIP is still outstanding, which brings an improvement of 144.83% over CLIP.
Authors: Simranjit Singh, Michael Fore, Andreas Karatzas, Chaehong Lee, Yanan Jian, Longfei Shangguan, Fuxun Yu, Iraklis Anagnostopoulos, Dimitrios Stamoulis
Abstract: As Large Language Models (LLMs) broaden their capabilities to manage thousands of API calls, they are confronted with complex data operations across vast datasets with significant overhead to the underlying system. In this work, we introduce LLM-dCache to optimize data accesses by treating cache operations as callable API functions exposed to the tool-augmented agent. We grant LLMs the autonomy to manage cache decisions via prompting, seamlessly integrating with existing function-calling mechanisms. Tested on an industry-scale massively parallel platform that spans hundreds of GPT endpoints and terabytes of imagery, our method improves Copilot times by an average of 1.24x across various LLMs and prompting techniques.
Authors: Chun-Yi Kuan, Chih-Kai Yang, Wei-Ping Huang, Ke-Han Lu, Hung-yi Lee
Abstract: In this work, we introduce Speech-Copilot, a modular framework for instruction-oriented speech-processing tasks that minimizes human effort in toolset construction. Unlike end-to-end methods using large audio-language models, Speech-Copilot builds speech processing-specific toolsets by analyzing pre-collected task instructions and breaking tasks into manageable sub-tasks. It features a flexible agent based on large language models that performs tasks through program generation. Our approach achieves state-of-the-art performance on the Dynamic-SUPERB benchmark, demonstrating its effectiveness across diverse speech-processing tasks. Key contributions include: 1) developing an innovative framework for speech processing-specific toolset construction, 2) establishing a high-performing agent based on large language models, and 3) offering a new perspective on addressing challenging instruction-oriented speech-processing tasks. Without additional training processes required by end-to-end approaches, our method provides a flexible and extendable solution for a wide range of speech-processing applications.
Authors: Andr\'e F. Cruz, Moritz Hardt, Celestine Mendler-D\"unner
Abstract: Current question-answering benchmarks predominantly focus on accuracy in realizable prediction tasks. Conditioned on a question and answer-key, does the most likely token match the ground truth? Such benchmarks necessarily fail to evaluate LLMs' ability to quantify ground-truth outcome uncertainty. In this work, we focus on the use of LLMs as risk scores for unrealizable prediction tasks. We introduce folktexts, a software package to systematically generate risk scores using LLMs, and evaluate them against US Census data products. A flexible API enables the use of different prompting schemes, local or web-hosted models, and diverse census columns that can be used to compose custom prediction tasks. We evaluate 17 recent LLMs across five proposed benchmark tasks. We find that zero-shot risk scores produced by multiple-choice question-answering have high predictive signal but are widely miscalibrated. Base models consistently overestimate outcome uncertainty, while instruction-tuned models underestimate uncertainty and produce over-confident risk scores. In fact, instruction-tuning polarizes answer distribution regardless of true underlying data uncertainty. This reveals a general inability of instruction-tuned LLMs to express data uncertainty using multiple-choice answers. A separate experiment using verbalized chat-style risk queries yields substantially improved calibration across instruction-tuned models. These differences in ability to quantify data uncertainty cannot be revealed in realizable settings, and highlight a blind-spot in the current evaluation ecosystem that folktexts covers.
Authors: Peiyuan Chen, Zecheng Zhang, Yiping Dong, Li Zhou, Han Wang
Abstract: Visual Question Answering (VQA) is a challenging task that requires systems to provide accurate answers to questions based on image content. Current VQA models struggle with complex questions due to limitations in capturing and integrating multimodal information effectively. To address these challenges, we propose the Rank VQA model, which leverages a ranking-inspired hybrid training strategy to enhance VQA performance. The Rank VQA model integrates high-quality visual features extracted using the Faster R-CNN model and rich semantic text features obtained from a pre-trained BERT model. These features are fused through a sophisticated multimodal fusion technique employing multi-head self-attention mechanisms. Additionally, a ranking learning module is incorporated to optimize the relative ranking of answers, thus improving answer accuracy. The hybrid training strategy combines classification and ranking losses, enhancing the model's generalization ability and robustness across diverse datasets. Experimental results demonstrate the effectiveness of the Rank VQA model. Our model significantly outperforms existing state-of-the-art models on standard VQA datasets, including VQA v2.0 and COCO-QA, in terms of both accuracy and Mean Reciprocal Rank (MRR). The superior performance of Rank VQA is evident in its ability to handle complex questions that require understanding nuanced details and making sophisticated inferences from the image and text. This work highlights the effectiveness of a ranking-based hybrid training strategy in improving VQA performance and lays the groundwork for further research in multimodal learning methods.
Authors: Yang Cao
Abstract: The rapid advancement in large language models (LLMs) comes with a significant increase in their parameter size, presenting challenges for adaptation and fine-tuning. Parameter-efficient fine-tuning (PEFT) methods are widely used to adapt LLMs for downstream tasks efficiently. In this paper, we propose Singular Values and Orthonormal Regularized Singular Vectors Adaptation, or SORSA, a novel PEFT method. We introduce a method to analyze the variation of the parameters by performing singular value decomposition (SVD) and discuss and analyze SORSA's superiority in minimizing the alteration in the SVD aspect. Each SORSA adapter consists of two main parts: trainable principal singular weights $W_p = U_p \Sigma_p V^\top_p$, and frozen residual weights $W_r = U_r \Sigma_r V^\top_r$. These parts are initialized by performing SVD on pre-trained weights. Moreover, we implement and analyze an orthonormal regularizer, which we prove could decrease the condition number of $W_p$ and allows the optimization to be more efficient. SORSA adapters could be merged during inference, thus eliminating any inference latency. After all, SORSA shows a faster convergence than PiSSA and LoRA in our experiments. On the MATH benchmark, Llama 2 7B adapted using SORSA achieved 10.36% accuracy, outperforming LoRA (5.50%), Full FT (7.22%), and PiSSA (7.44%). On the GSM-8K benchmark, SORSA achieved 56.03% accuracy, surpassing LoRA (42.30%), Full FT (49.05%), and PiSSA (53.07%). We conclude that SORSA offers a new perspective on parameter-efficient fine-tuning, demonstrating remarkable performance. The code is available at https://github.com/Gunale0926/SORSA
Authors: Guanwen Xie, Jingzehua Xu, Yiyuan Yang, Yimian Ding, Shuai Zhang
Abstract: Achieving the effective design and improvement of reward functions in reinforcement learning (RL) tasks with complex custom environments and multiple requirements presents considerable challenges. In this paper, we propose ERFSL, an efficient reward function searcher using LLMs, which enables LLMs to be effective white-box searchers and highlights their advanced semantic understanding capabilities. Specifically, we generate reward components for each numerically explicit user requirement and employ a reward critic to identify the correct code form. Then, LLMs assign weights to the reward components to balance their values and iteratively adjust the weights without ambiguity and redundant adjustments by flexibly adopting directional mutation and crossover strategies, similar to genetic algorithms, based on the context provided by the training log analyzer. We applied the framework to an underwater data collection RL task without direct human feedback or reward examples (zero-shot learning). The reward critic successfully corrects the reward code with only one feedback instance for each requirement, effectively preventing unrectifiable errors. The initialization of weights enables the acquisition of different reward functions within the Pareto solution set without the need for weight search. Even in cases where a weight is 500 times off, on average, only 5.2 iterations are needed to meet user requirements. The ERFSL also works well with most prompts utilizing GPT-4o mini, as we decompose the weight searching process to reduce the requirement for numerical and long-context understanding capabilities
Authors: Kaiwen Zheng, Yongxin Chen, Hanzi Mao, Ming-Yu Liu, Jun Zhu, Qinsheng Zhang
Abstract: Masked diffusion models (MDMs) have emerged as a popular research topic for generative modeling of discrete data, thanks to their superior performance over other discrete diffusion models, and are rivaling the auto-regressive models (ARMs) for language modeling tasks. The recent effort in simplifying the masked diffusion framework further leads to alignment with continuous-space diffusion models and more principled training and sampling recipes. In this paper, however, we reveal that both training and sampling of MDMs are theoretically free from the time variable, arguably the key signature of diffusion models, and are instead equivalent to masked models. The connection on the sampling aspect is drawn by our proposed first-hitting sampler (FHS). Specifically, we show that the FHS is theoretically equivalent to MDMs' original generation process while significantly alleviating the time-consuming categorical sampling and achieving a 20$\times$ speedup. In addition, our investigation raises doubts about whether MDMs can truly beat ARMs. We identify, for the first time, an underlying numerical issue, even with the commonly used 32-bit floating-point precision, which results in inaccurate categorical sampling. We show that the numerical issue lowers the effective temperature both theoretically and empirically, and the resulting decrease in token diversity makes previous evaluations, which assess the generation quality solely through the incomplete generative perplexity metric, somewhat unfair.
Authors: Jahyun Koo, Dahoon Park, Sangwoo Jung, Jaeha Kung
Abstract: To overcome the burden on the memory size and bandwidth due to ever-increasing size of large language models (LLMs), aggressive weight quantization has been recently studied, while lacking research on quantizing activations. In this paper, we present a hardware-software co-design method that results in an energy-efficient LLM accelerator, named OPAL, for generation tasks. First of all, a novel activation quantization method that leverages the microscaling data format while preserving several outliers per sub-tensor block (e.g., four out of 128 elements) is proposed. Second, on top of preserving outliers, mixed precision is utilized that sets 5-bit for inputs to sensitive layers in the decoder block of an LLM, while keeping inputs to less sensitive layers to 3-bit. Finally, we present the OPAL hardware architecture that consists of FP units for handling outliers and vectorized INT multipliers for dominant non-outlier related operations. In addition, OPAL uses log2-based approximation on softmax operations that only requires shift and subtraction to maximize power efficiency. As a result, we are able to improve the energy efficiency by 1.6~2.2x, and reduce the area by 2.4~3.1x with negligible accuracy loss, i.e., <1 perplexity increase.
Authors: Wenyu Zhang, Shuo Sun, Bin Wang, Xunlong Zou, Zhuohan Liu, Yingxu He, Geyu Lin, Nancy F. Chen, Ai Ti Aw
Abstract: The rapid advancements in large language models (LLMs) have significantly enhanced natural language processing capabilities, facilitating the development of AudioLLMs that process and understand speech and audio inputs alongside text. Existing AudioLLMs typically combine a pre-trained audio encoder with a pre-trained LLM, which are subsequently finetuned on specific audio tasks. However, the pre-trained audio encoder has constrained capacity to capture features for new tasks and datasets. To address this, we propose to incorporate mixtures of `weak' encoders (MoWE) into the AudioLLM framework. MoWE supplements a base encoder with a pool of relatively light weight encoders, selectively activated based on the audio input to enhance feature extraction without significantly increasing model size. Our empirical results demonstrate that MoWE effectively improves multi-task performance, broadening the applicability of AudioLLMs to more diverse audio tasks.
Authors: Yanlin Wang, Wanjun Zhong, Yanxian Huang, Ensheng Shi, Min Yang, Jiachi Chen, Hui Li, Yuchi Ma, Qianxiang Wang, Zibin Zheng
Abstract: In recent years, Large Language Models (LLMs) have achieved remarkable success and have been widely used in various downstream tasks, especially in the tasks of the software engineering (SE) field. We find that many studies combining LLMs with SE have employed the concept of agents either explicitly or implicitly. However, there is a lack of an in-depth survey to sort out the development context of existing works, analyze how existing works combine the LLM-based agent technologies to optimize various tasks, and clarify the framework of LLM-based agents in SE. In this paper, we conduct the first survey of the studies on combining LLM-based agents with SE and present a framework of LLM-based agents in SE which includes three key modules: perception, memory, and action. We also summarize the current challenges in combining the two fields and propose future opportunities in response to existing challenges. We maintain a GitHub repository of the related papers at: https://github.com/DeepSoftwareAnalytics/Awesome-Agent4SE.
URLs: https://github.com/DeepSoftwareAnalytics/Awesome-Agent4SE.
Authors: Luning Wang, Shiyao Li, Xuefei Ning, Zhihang Yuan, Shengen Yan, Guohao Dai, Yu Wang
Abstract: Large Language Models (LLMs) have been widely adopted to process long-context tasks. However, the large memory overhead of the key-value (KV) cache poses significant challenges in long-context scenarios. Existing training-free KV cache compression methods typically focus on quantization and token pruning, which have compression limits, and excessive sparsity can lead to severe performance degradation. Other methods design new architectures with less KV overhead but require significant training overhead. To address the above two drawbacks, we further explore the redundancy in the channel dimension and apply an architecture-level design with minor training costs. Therefore, we introduce CSKV, a training-efficient Channel Shrinking technique for KV cache compression: (1) We first analyze the singular value distribution of the KV cache, revealing significant redundancy and compression potential along the channel dimension. Based on this observation, we propose using low-rank decomposition for key and value layers and storing the low-dimension features. (2) To preserve model performance, we introduce a bi-branch KV cache, including a window-based full-precision KV cache and a low-precision compressed KV cache. (3) To reduce the training costs, we minimize the layer-wise reconstruction loss for the compressed KV cache instead of retraining the entire LLMs. Extensive experiments show that CSKV can reduce the memory overhead of the KV cache by 80% while maintaining the model's long-context capability. Moreover, we show that our method can be seamlessly combined with quantization to further reduce the memory overhead, achieving a compression ratio of up to 95%.