new Modeling Open-World Cognition as On-Demand Synthesis of Probabilistic Models

Authors: Lionel Wong, Katherine M. Collins, Lance Ying, Cedegao E. Zhang, Adrian Weller, Tobias Gerstenberg, Timothy O'Donnell, Alexander K. Lew, Jacob D. Andreas, Joshua B. Tenenbaum, Tyler Brooke-Wilson

Abstract: When faced with novel situations, people are able to marshal relevant considerations from a wide range of background knowledge and put these to use in inferences and predictions. What permits us to draw in globally relevant information and reason over it coherently? Here, we explore the hypothesis that people use a combination of distributed and symbolic representations to construct bespoke mental models tailored to novel situations. We propose a computational implementation of this idea -- a "Model Synthesis Architecture" (MSA) -- using language models to implement global relevance-based retrieval and model synthesis, and probabilistic programs to implement bespoke, coherent world models. We evaluate our MSA as a model of human judgments on a novel reasoning dataset. The dataset -- built around a "Model Olympics" domain of sports vignettes -- tests models' capacity for human-like, open-ended reasoning by requiring (i) judgments about novel causal structures described in language; (ii) drawing on large bodies of background knowledge; and (iii) doing both in light of observations that introduce arbitrary novel variables. Our MSA approach captures human judgments better than language model-only baselines, under both direct and chain-of-thought generations from the LM that supports model synthesis. These results suggest that MSAs can be implemented in a way that mirrors people's ability to deliver locally coherent reasoning over globally relevant variables, offering a path to understanding and replicating human reasoning in open-ended domains.

new Is This Just Fantasy? Language Model Representations Reflect Human Judgments of Event Plausibility

Authors: Michael A. Lepori, Jennifer Hu, Ishita Dasgupta, Roma Patel, Thomas Serre, Ellie Pavlick

Abstract: Language models (LMs) are used for a diverse range of tasks, from question answering to writing fantastical stories. In order to reliably accomplish these tasks, LMs must be able to discern the modal category of a sentence (i.e., whether it describes something that is possible, impossible, completely nonsensical, etc.). However, recent studies have called into question the ability of LMs to categorize sentences according to modality (Michaelov et al., 2025; Kauf et al., 2023). In this work, we identify linear representations that discriminate between modal categories within a variety of LMs, or modal difference vectors. Analysis of modal difference vectors reveals that LMs have access to more reliable modal categorization judgments than previously reported. Furthermore, we find that modal difference vectors emerge in a consistent order as models become more competent (i.e., through training steps, layers, and parameter count). Notably, we find that modal difference vectors identified within LM activations can be used to model fine-grained human categorization behavior. This potentially provides a novel view into how human participants distinguish between modal categories, which we explore by correlating projections along modal difference vectors with human participants' ratings of interpretable features. In summary, we derive new insights into LM modal categorization using techniques from mechanistic interpretability, with the potential to inform our understanding of modal categorization in humans.
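
To make the core technique concrete, here is a minimal sketch (not the authors' code) of how a modal difference vector could be computed from LM activations and used to score new sentences; the activations, dimensions, and data below are illustrative stand-ins.

```python
# Sketch: deriving a "modal difference vector" from LM activations and
# projecting new sentences onto it. Activations are assumed to be
# precomputed (e.g., last-token hidden states); all names are illustrative.
import numpy as np

def modal_difference_vector(acts_possible, acts_impossible):
    """Difference of class-mean activations; shape (hidden_dim,)."""
    return acts_possible.mean(axis=0) - acts_impossible.mean(axis=0)

def project(acts, direction):
    """Scalar projection of each activation onto the (unit-norm) direction."""
    unit = direction / np.linalg.norm(direction)
    return acts @ unit

rng = np.random.default_rng(0)
acts_possible = rng.normal(0.5, 1.0, size=(100, 768))   # stand-in activations
acts_impossible = rng.normal(-0.5, 1.0, size=(100, 768))

v = modal_difference_vector(acts_possible, acts_impossible)
scores = project(rng.normal(size=(10, 768)), v)
print(scores.round(2))  # higher scores ~ judged more "possible"
```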

new The first open machine translation system for the Chechen language

Authors: Abu-Viskhan A. Umishov, Vladislav A. Grigorian

Abstract: We introduce the first open-source model for translation between the vulnerable Chechen language and Russian, and the dataset collected to train and evaluate it. We explore fine-tuning capabilities for adding a new language to NLLB-200, a large multilingual translation model. The BLEU / ChrF++ scores for our model are 8.34 / 34.69 for translation from Russian to Chechen and 20.89 / 44.55 in the reverse direction. The release of the translation models is accompanied by the distribution of parallel corpora of words, phrases, and sentences, and a multilingual sentence encoder adapted to the Chechen language.

new Improving Drug Identification in Overdose Death Surveillance using Large Language Models

Authors: Arthur J. Funnell, Panayiotis Petousis, Fabrice Harel-Canada, Ruby Romero, Alex A. T. Bui, Adam Koncsol, Hritika Chaturvedi, Chelsea Shover, David Goodman-Meza

Abstract: The rising rate of drug-related deaths in the United States, largely driven by fentanyl, requires timely and accurate surveillance. However, critical overdose data are often buried in free-text coroner reports, leading to delays and information loss when coded into ICD (International Classification of Disease)-10 classifications. Natural language processing (NLP) models may automate and enhance overdose surveillance, but prior applications have been limited. A dataset of 35,433 death records from multiple U.S. jurisdictions in 2020 was used for model training and internal testing. External validation was conducted using a novel separate dataset of 3,335 records from 2023-2024. Multiple NLP approaches were evaluated for classifying specific drug involvement from unstructured death certificate text. These included traditional single- and multi-label classifiers, as well as fine-tuned encoder-only language models such as Bidirectional Encoder Representations from Transformers (BERT) and BioClinicalBERT, and contemporary decoder-only large language models such as Qwen 3 and Llama 3. Model performance was assessed using macro-averaged F1 scores, and 95% confidence intervals were calculated to quantify uncertainty. Fine-tuned BioClinicalBERT models achieved near-perfect performance, with macro F1 scores >=0.998 on the internal test set. External validation confirmed robustness (macro F1=0.966), outperforming conventional machine learning, general-domain BERT models, and various decoder-only large language models. NLP models, particularly fine-tuned clinical variants like BioClinicalBERT, offer a highly accurate and scalable solution for overdose death classification from free-text reports. These methods can significantly accelerate surveillance workflows, overcoming the limitations of manual ICD-10 coding and supporting near real-time detection of emerging substance use trends.
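
As a concrete point of reference for the conventional baselines mentioned above, here is a hedged sketch of a traditional multi-label pipeline with macro-averaged F1 evaluation; the records and drug labels are toy stand-ins, not data from the study.

```python
# Sketch of a traditional multi-label baseline and macro-F1 evaluation,
# of the kind the paper compares against; data below are toy stand-ins.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.metrics import f1_score

texts = [
    "acute fentanyl and methamphetamine toxicity",
    "combined heroin and cocaine intoxication",
    "acute fentanyl toxicity",
    "cocaine toxicity with cardiac arrest",
]
labels = [["fentanyl", "methamphetamine"], ["heroin", "cocaine"],
          ["fentanyl"], ["cocaine"]]

mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(labels)
X = TfidfVectorizer(ngram_range=(1, 2)).fit_transform(texts)

clf = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X, Y)
pred = clf.predict(X)
print("macro F1:", f1_score(Y, pred, average="macro"))
```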

new AdaptiSent: Context-Aware Adaptive Attention for Multimodal Aspect-Based Sentiment Analysis

Authors: S M Rafiuddin, Sadia Kamal, Mohammed Rakib, Arunkumar Bagavathi, Atriya Sen

Abstract: We introduce AdaptiSent, a new framework for Multimodal Aspect-Based Sentiment Analysis (MABSA) that uses adaptive cross-modal attention mechanisms to improve sentiment classification and aspect term extraction from both text and images. Our model integrates dynamic modality weighting and context-adaptive attention, enhancing the extraction of sentiment and aspect-related information by focusing on how textual cues and visual context interact. We tested our approach against several baselines, including traditional text-based models and other multimodal methods. Results from standard Twitter datasets show that AdaptiSent surpasses existing models in precision, recall, and F1 score, and is particularly effective in identifying nuanced inter-modal relationships that are crucial for accurate sentiment and aspect term extraction. This effectiveness comes from the model's ability to adjust its focus dynamically based on the context's relevance, improving the depth and accuracy of sentiment analysis across various multimodal data sets. AdaptiSent sets a new standard for MABSA, significantly outperforming current methods, especially in understanding complex multimodal information.

new AudioJudge: Understanding What Works in Large Audio Model Based Speech Evaluation

Authors: Potsawee Manakul, Woody Haosheng Gan, Michael J. Ryan, Ali Sartaz Khan, Warit Sirichotedumrong, Kunat Pipatanakul, William Held, Diyi Yang

Abstract: Current speech evaluation suffers from two critical limitations: the need and difficulty of designing specialized systems targeting individual audio characteristics, and poor correlation between automatic evaluation methods and human preferences. This work presents a systematic study of Large Audio Model (LAM) as a Judge, AudioJudge, investigating whether it can provide a unified evaluation framework that addresses both challenges. We systematically explore AudioJudge across audio characteristic detection tasks, including pronunciation, speaking rate, speaker identification and speech quality, and system-level human preference simulation for automated benchmarking. We investigate different prompt engineering strategies, finding that audio concatenation combined with in-context learning significantly improves performance across both audio characteristic detection and human preference simulation tasks. We further introduce a multi-aspect ensemble AudioJudge to enable general-purpose multi-aspect audio evaluation. This method decomposes speech assessment into specialized judges for lexical content, speech quality, and paralinguistic features, achieving up to 0.91 Spearman correlation with human preferences on our system ranking benchmark. Robustness analysis reveals that while LAMs maintain strong performance under acoustic noise, they exhibit significant verbosity and positional biases that require careful mitigation.

new FLEXITOKENS: Flexible Tokenization for Evolving Language Models

Authors: Abraham Toluase Owodunni, Orevaoghene Ahia, Sachin Kumar

Abstract: Language models (LMs) are challenging to adapt to new data distributions by simple finetuning. This is due to the rigidity of their subword tokenizers, which typically remain unchanged during adaptation. This inflexibility often leads to inefficient tokenization, causing over-fragmentation of out-of-distribution domains, unseen languages, or scripts. In this work, we develop byte-level LMs with learnable tokenizers to make tokenization adaptive. Our models include a submodule that learns to predict boundaries within the input byte sequence, encoding it into variable-length segments. Existing tokenizer-free methods train this boundary predictor using an auxiliary loss that enforces a fixed compression rate across the training corpus, introducing a new kind of rigidity. We propose FLEXITOKENS, a simplified training objective that enables significantly greater flexibility during adaptation. Evaluating across multiple multilingual benchmarks, morphologically diverse tasks, and domains, we demonstrate that FLEXITOKENS consistently reduces token over-fragmentation and achieves up to 10% improvements on downstream task performance compared to subword and other gradient-based tokenizers. Code and data for our experiments will be released at https://github.com/owos/flexitokens

URLs: https://github.com/owos/flexitokens
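
The contrast between a fixed-compression-rate auxiliary loss and a more flexible objective can be sketched as below; this is only a guess at the general flavor, not the actual FLEXITOKENS loss, and the architecture and rates are placeholders.

```python
# Sketch (not the paper's code): a byte-level boundary predictor with a
# relaxed compression objective. The fixed-rate loss pins the expected
# number of boundaries to a target; the relaxed variant only penalizes
# falling below a lower bound, leaving room to adapt during finetuning.
import torch
import torch.nn as nn

class BoundaryPredictor(nn.Module):
    def __init__(self, d_model=256):
        super().__init__()
        self.byte_emb = nn.Embedding(256, d_model)
        self.scorer = nn.Linear(d_model, 1)

    def forward(self, byte_ids):                      # (batch, seq_len)
        h = self.byte_emb(byte_ids)
        return torch.sigmoid(self.scorer(h)).squeeze(-1)  # boundary probs

def fixed_rate_loss(probs, target_rate=0.25):
    # enforces one boundary per ~4 bytes on average (rigid)
    return (probs.mean() - target_rate) ** 2

def relaxed_rate_loss(probs, min_rate=0.1):
    # only penalizes compressing below a floor; above it, boundaries are free
    return torch.relu(min_rate - probs.mean()) ** 2

byte_ids = torch.randint(0, 256, (2, 64))
probs = BoundaryPredictor()(byte_ids)
print(fixed_rate_loss(probs).item(), relaxed_rate_loss(probs).item())
```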

new TransEvalnia: Reasoning-based Evaluation and Ranking of Translations

Authors: Richard Sproat, Tianyu Zhao, Llion Jones

Abstract: We present TransEvalnia, a prompting-based translation evaluation and ranking system that uses reasoning in performing its evaluations and ranking. This system presents fine-grained evaluations based on a subset of the Multidimensional Quality Metrics (https://themqm.org/), returns an assessment of which translation it deems the best, and provides numerical scores for the various dimensions and for the overall translation. We show that TransEvalnia performs as well as or better than the state-of-the-art MT-Ranker (Moosa et al. 2024) on our own English-Japanese data as well as several language pairs from various WMT shared tasks. Using Anthropic's Claude-3.5-Sonnet and Qwen-2.5-72B-Instruct as the evaluation LLMs, we show that the evaluations returned are deemed highly acceptable to human raters, and that the scores assigned to the translations by Sonnet, as well as other LLMs, correlate well with scores assigned by the human raters. We also note the sensitivity of our system -- as well as MT-Ranker -- to the order in which the translations are presented, and we propose methods to address this position bias. All data, including the system's evaluation and reasoning, human assessments, as well as code is released.

URLs: https://themqm.org/
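
One simple way to address the position bias noted above is to query the evaluator on both presentation orders and keep only order-consistent verdicts; the sketch below illustrates this idea with a hypothetical `judge` placeholder rather than the paper's actual prompting setup.

```python
# Sketch of one way to counter presentation-order (position) bias in
# pairwise translation ranking: query the judge on both orders and keep the
# verdict only when it is order-consistent. `judge` is a hypothetical
# stand-in for the LLM-based evaluator.
def judge(source: str, translation_a: str, translation_b: str) -> str:
    raise NotImplementedError("placeholder for an LLM ranking call")

def debiased_ranking(source, t1, t2):
    first = judge(source, t1, t2)     # returns "A" or "B"
    second = judge(source, t2, t1)    # same pair, order swapped
    if first == "A" and second == "B":
        return t1
    if first == "B" and second == "A":
        return t2
    return None                        # order-dependent verdict: treat as a tie
```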

new Strategy Adaptation in Large Language Model Werewolf Agents

Authors: Fuya Nakamori, Yin Jou Huang, Fei Cheng

Abstract: This study proposes a method to improve the performance of Werewolf agents by switching between predefined strategies based on the attitudes of other players and the context of conversations. While prior work on Werewolf agents using prompt engineering has employed methods in which effective strategies are implicitly defined, such agents cannot adapt to changing situations. In this research, we propose a method that explicitly selects an appropriate strategy based on the game context and the estimated roles of other players. We compare the strategy-adaptation Werewolf agents with baseline agents using implicit or fixed strategies and verify the effectiveness of our proposed method.

new Logit Arithmetic Elicits Long Reasoning Capabilities Without Training

Authors: Yunxiang Zhang, Muhammad Khalifa, Lechen Zhang, Xin Liu, Ayoung Lee, Xinliang Frederick Zhang, Farima Fatahi Bayat, Lu Wang

Abstract: Large reasoning models (LRMs) can do complex reasoning via long chain-of-thought (CoT) involving cognitive strategies such as backtracking and self-correction. Recent studies suggest that some models inherently possess these long reasoning abilities, which may be unlocked via extra training. Our work first investigates whether we can elicit such behavior without any training. To this end, we propose a decoding-time approach, ThinkLogit, which utilizes logit arithmetic (Liu et al., 2024) to tune a target large LM for long reasoning using a substantially smaller model as guider. We then show that we can further boost performance by training the guider model with preference optimization over correct/incorrect reasoning pairs sampled from both the target and guider model -- a setup we refer to as ThinkLogit-DPO. Our experiments demonstrate that ThinkLogit and ThinkLogit-DPO achieve relative improvements in pass@1 of 26% and 29%, respectively, across four mathematical datasets using Qwen2.5-32B when guided by R1-Distill-Qwen-1.5B -- a model 21x smaller. Lastly, we show that ThinkLogit can transfer long reasoning skills acquired through reinforcement learning, improving pass@1 by 13% relative compared to the Qwen2.5-32B base model. Our work presents a computationally-efficient method to elicit long reasoning in large models with minimal or no additional training.
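
The decoding-time logit arithmetic can be sketched as follows, in the spirit of Liu et al. (2024): the target model's next-token logits are shifted by the tuned-minus-base delta of the small guider. The combination rule, weight, and model names below are assumptions.

```python
# Sketch of decoding-time logit arithmetic (in the spirit of proxy tuning):
# shift the large target model's logits by the guider's tuned-vs-base delta.
# The exact combination rule and weight alpha are assumptions.
import torch

def combined_next_token_logits(target_logits, guider_tuned_logits,
                               guider_base_logits, alpha=1.0):
    return target_logits + alpha * (guider_tuned_logits - guider_base_logits)

vocab = 32000
target = torch.randn(vocab)          # e.g., a large target LM's next-token logits
guider_tuned = torch.randn(vocab)    # e.g., a small reasoning-tuned guider
guider_base = torch.randn(vocab)     # the guider's untuned counterpart

logits = combined_next_token_logits(target, guider_tuned, guider_base)
next_token = torch.argmax(torch.softmax(logits, dim=-1))
print(int(next_token))
```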

new Synergy: End-to-end Concept Model

Authors: Keli Zheng, Zerong Xie

Abstract: In this paper, we present Synergy, a language model that bridges different levels of abstraction in an end-to-end fashion through a learned routing mechanism. Focusing on low-level linguistic abstraction, we trained our model as a byte-level language model. Our model spontaneously learns to tokenize bytes, producing fewer concept tokens than Byte-level Byte Pair Encoder (BBPE) tokenizers while keeping comparable performance. By comparing with Llama3, we observed an advantage of Synergy under the same model scale and training dataset size. Further studies show that the middle part (the higher abstraction part) of our model performs better when positional encodings are removed, suggesting the emergence of position-independent concepts. These findings demonstrate the feasibility of tokenizer-free architectures, paving the way for more robust and flexible pipelines.

new Learning Robust Negation Text Representations

Authors: Thinh Hung Truong, Karin Verspoor, Trevor Cohn, Timothy Baldwin

Abstract: Despite rapid adoption of autoregressive large language models, smaller text encoders still play an important role in text understanding tasks that require rich contextualized representations. Negation is an important semantic function that is still not properly captured by such methods, affecting many downstream applications relying on text embeddings. We propose a strategy to improve negation robustness of text encoders, by distilling data from large language models using diverse patterns of negation and hedging. We adopt a standard contrastive learning strategy to finetune a strong BERT-based model, and observe large improvement in negation understanding capabilities while maintaining competitive performance on general benchmarks. In addition, we also show that our method can be adapted to LLMs, leading to improved performance on negation benchmarks.
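
The contrastive fine-tuning step can be illustrated with a standard in-batch InfoNCE objective over sentence embeddings; the encoder outputs below are random stand-ins, and the pairing scheme is an assumption about how the LLM-distilled negation data would be used.

```python
# Sketch of a standard in-batch contrastive (InfoNCE-style) objective over
# sentence embeddings, as used to finetune a BERT-style encoder on
# LLM-distilled negation pairs. The embeddings here are stand-ins.
import torch
import torch.nn.functional as F

def info_nce(anchor_emb, positive_emb, temperature=0.05):
    a = F.normalize(anchor_emb, dim=-1)
    p = F.normalize(positive_emb, dim=-1)
    logits = a @ p.T / temperature          # in-batch negatives off-diagonal
    labels = torch.arange(a.size(0))
    return F.cross_entropy(logits, labels)

# Toy embeddings standing in for encoder outputs on pairs such as
# ("The drug is effective", "The drug is not ineffective").
anchors = torch.randn(8, 384, requires_grad=True)
positives = anchors.detach() + 0.1 * torch.randn(8, 384)
loss = info_nce(anchors, positives)
loss.backward()
print(loss.item())
```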

new Large Language Models' Internal Perception of Symbolic Music

Authors: Andrew Shin, Kunitake Kaneko

Abstract: Large language models (LLMs) excel at modeling relationships between strings in natural language and have shown promise in extending to other symbolic domains like coding or mathematics. However, the extent to which they implicitly model symbolic music remains underexplored. This paper investigates how LLMs represent musical concepts by generating symbolic music data from textual prompts describing combinations of genres and styles, and evaluating their utility through recognition and generation tasks. We produce a dataset of LLM-generated MIDI files without relying on explicit musical training. We then train neural networks entirely on this LLM-generated MIDI dataset and perform genre and style classification as well as melody completion, benchmarking their performance against established models. Our results demonstrate that LLMs can infer rudimentary musical structures and temporal relationships from text, highlighting both their potential to implicitly encode musical patterns and their limitations due to a lack of explicit musical context, shedding light on their generative capabilities for symbolic music.

new Are Knowledge and Reference in Multilingual Language Models Cross-Lingually Consistent?

Authors: Xi Ai, Mahardika Krisna Ihsani, Min-Yen Kan

Abstract: Cross-lingual consistency should be considered to assess cross-lingual transferability, maintain the factuality of the model knowledge across languages, and preserve the parity of language model performance. We are thus interested in analyzing, evaluating, and interpreting cross-lingual consistency for factual knowledge. We examine code-mixed coreferential statements conveying identical knowledge across languages to study cross-lingual knowledge consistency. We use several interpretability approaches to analyze the behavior of a model in cross-lingual contexts, discovering that multilingual models show different levels of consistency, subject to language families, linguistic factors, and a bottleneck in cross-lingual consistency on a particular layer. In addition, we evaluate common strategies aimed at improving multilingual performance to observe whether these strategies can improve knowledge consistency at the same time. While knowledge is not cross-lingually consistent in many cases, code-switching training and cross-lingual word alignment objectives show the most promising results, emphasizing the noteworthiness of cross-lingual alignment supervision and code-switching training for both multilingual performance and cross-lingual consistency enhancement.

new Making Language Model a Hierarchical Classifier and Generator

Authors: Yihong Wang, Zhonglin Jiang, Ningyuan Xi, Yue Zhao, Qingqing Gu, Xiyuan Chen, Hao Wu, Sheng Xu, Hange Zhou, Yong Chen, Luo Ji

Abstract: Decoder-only language models, such as GPT and LLaMA, generally decode on the last layer. Motivated by humans' hierarchical thinking capability, we propose that a hierarchical decoder architecture could be built with different layers decoding texts simultaneously. Due to limited time and computational resources, we choose to adapt a pretrained language model into this form of hierarchical decoder. Language heads of the last layer are copied to different selected intermediate layers, and fine-tuned with different task inputs. By thorough experiments, we validate that these selected intermediate layers could be adapted to speak meaningful and reasonable contents, and this paradigm of hierarchical decoder can obtain state-of-the-art performances on multiple tasks such as hierarchical text classification, classification-guided generation, and hierarchical text generation. This study suggests the possibility of a generalized hierarchical reasoner, pretrained from scratch.

new MRT at IberLEF-2025 PRESTA Task: Maximizing Recovery from Tables with Multiple Steps

Authors: Maximiliano Hormazábal Lagos, Álvaro Bueno Sáez, Héctor Cerezo-Costas, Pedro Alonso Doval, Jorge Alcalde Vesteiro

Abstract: This paper presents our approach for the IberLEF 2025 Task PRESTA: Preguntas y Respuestas sobre Tablas en Español (Questions and Answers about Tables in Spanish). Our solution obtains answers to the questions by implementing Python code generation with LLMs that is used to filter and process the table. This solution evolves from the MRT implementation for the Semeval 2025 related task. The process consists of multiple steps: analyzing and understanding the content of the table, selecting the useful columns, generating instructions in natural language, translating these instructions to code, running it, and handling potential errors or exceptions. These steps use open-source LLMs and fine-grained optimized prompts for each step. With this approach, we achieved an accuracy score of 85% in the task.
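
The multi-step loop described above might look roughly like the sketch below; `ask_llm` is a hypothetical placeholder for the open-source LLM calls with per-step prompts, and the retry loop corresponds to the error-handling step.

```python
# Sketch of the multi-step table-QA loop described above. `ask_llm` is a
# hypothetical stand-in for calls to an open-source LLM with per-step prompts;
# the retry-on-exception loop mirrors the "handling potential errors" step.
import pandas as pd

def ask_llm(prompt: str) -> str:
    raise NotImplementedError("placeholder for an LLM call")

def answer_question(table: pd.DataFrame, question: str, max_retries: int = 3):
    schema = ", ".join(f"{c} ({table[c].dtype})" for c in table.columns)
    columns = ask_llm(f"Question: {question}\nColumns: {schema}\n"
                      "Which columns are needed?")
    plan = ask_llm(f"Describe, step by step in natural language, how to answer "
                   f"'{question}' using columns: {columns}")
    error = ""
    for _ in range(max_retries):
        code = ask_llm(f"Translate this plan into Python over a DataFrame "
                       f"named df, storing the answer in `result`.\n"
                       f"Plan: {plan}\nPrevious error: {error}")
        scope = {"df": table}
        try:
            exec(code, scope)            # run the generated snippet
            return scope["result"]
        except Exception as exc:         # feed the error back for repair
            error = str(exc)
    return None
```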

new Formalizing Attack Scenario Description: A Proposed Model

Authors: Quentin Goux (CEDRIC - ISID), Nadira Lammari (CEDRIC - ISID)

Abstract: Organizations face an ever-changing threat landscape. They must continuously dedicate significant efforts to protect their assets, making their adoption of increased cybersecurity automation inevitable. However, process automation requires formalization of input data. In this paper, we address this need for processes that use attack scenarios as input. Among these processes, one can mention both the generation of scripts for attack simulation and training purposes, as well as the analysis of attacks. The paper's main research contribution is therefore a novel formal model that encompasses the attack's context description and its scenario, abstracted as a UML class model. After describing our model, we show how it can serve an upstream attack analysis process, and we also show its use for the automatic generation of attack scripts in the context of cybersecurity training. These two use cases constitute the second contribution of this research work.

new SemCSE: Semantic Contrastive Sentence Embeddings Using LLM-Generated Summaries For Scientific Abstracts

Authors: Marc Brinner, Sina Zarriess

Abstract: We introduce SemCSE, an unsupervised method for learning semantic embeddings of scientific texts. Building on recent advances in contrastive learning for text embeddings, our approach leverages LLM-generated summaries of scientific abstracts to train a model that positions semantically related summaries closer together in the embedding space. This resulting objective ensures that the model captures the true semantic content of a text, in contrast to traditional citation-based approaches that do not necessarily reflect semantic similarity. To validate this, we propose a novel benchmark designed to assess a model's ability to understand and encode the semantic content of scientific texts, demonstrating that our method enforces a stronger semantic separation within the embedding space. Additionally, we evaluate SemCSE on the comprehensive SciRepEval benchmark for scientific text embeddings, where it achieves state-of-the-art performance among models of its size, thus highlighting the benefits of a semantically focused training approach.

new A Computational Framework to Identify Self-Aspects in Text

Authors: Jaya Caporusso, Matthew Purver, Senja Pollak

Abstract: This Ph.D. proposal introduces a plan to develop a computational framework to identify Self-aspects in text. The Self is a multifaceted construct and it is reflected in language. While it is described across disciplines like cognitive science and phenomenology, it remains underexplored in natural language processing (NLP). Many of the aspects of the Self align with psychological and other well-researched phenomena (e.g., those related to mental health), highlighting the need for systematic NLP-based analysis. In line with this, we plan to introduce an ontology of Self-aspects and a gold-standard annotated dataset. Using this foundation, we will develop and evaluate conventional discriminative models, generative large language models, and embedding-based retrieval approaches against four main criteria: interpretability, ground-truth adherence, accuracy, and computational efficiency. Top-performing models will be applied in case studies in mental health and empirical phenomenology.

new Assessing the Reliability of LLMs Annotations in the Context of Demographic Bias and Model Explanation

Authors: Hadi Mohammadi, Tina Shahedi, Pablo Mosteiro, Massimo Poesio, Ayoub Bagheri, Anastasia Giachanou

Abstract: Understanding the sources of variability in annotations is crucial for developing fair NLP systems, especially for tasks like sexism detection where demographic bias is a concern. This study investigates the extent to which annotator demographic features influence labeling decisions compared to text content. Using a Generalized Linear Mixed Model, we quantify this influence, finding that while statistically present, demographic factors account for a minor fraction (~8%) of the observed variance, with tweet content being the dominant factor. We then assess the reliability of Generative AI (GenAI) models as annotators, specifically evaluating if guiding them with demographic personas improves alignment with human judgments. Our results indicate that simplistic persona prompting often fails to enhance, and sometimes degrades, performance compared to baseline models. Furthermore, explainable AI (XAI) techniques reveal that model predictions rely heavily on content-specific tokens related to sexism, rather than correlates of demographic characteristics. We argue that focusing on content-driven explanations and robust annotation protocols offers a more reliable path towards fairness than persona simulation.
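
A simplified version of the variance-partitioning analysis can be sketched with a mixed model; here a linear mixed model from statsmodels stands in for the paper's Generalized Linear Mixed Model, and the data are synthetic.

```python
# Sketch: partitioning label variance between content and annotator
# demographics with a mixed model. A linear mixed model is used here as a
# simplified stand-in for the paper's GLMM; the data are synthetic.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 400
df = pd.DataFrame({
    "content_score": rng.normal(size=n),             # e.g., strength of sexist cues
    "annotator_age": rng.integers(18, 70, size=n),
    "annotator_id": rng.integers(0, 40, size=n).astype(str),
})
df["label"] = (0.9 * df["content_score"]
               + 0.05 * (df["annotator_age"] - 40) / 10
               + rng.normal(scale=0.5, size=n))

# Fixed effects for content and a demographic feature; random intercepts per annotator.
model = smf.mixedlm("label ~ content_score + annotator_age",
                    df, groups=df["annotator_id"]).fit()
print(model.summary())
```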

new Feature-based analysis of oral narratives from Afrikaans and isiXhosa children

Authors: Emma Sharratt, Annelien Smith, Retief Louw, Daleen Klop, Febe de Wet, Herman Kamper

Abstract: Oral narrative skills are strong predictors of later literacy development. This study examines the features of oral narratives from children who were identified by experts as requiring intervention. Using simple machine learning methods, we analyse recorded stories from four- and five-year-old Afrikaans- and isiXhosa-speaking children. Consistent with prior research, we identify lexical diversity (unique words) and length-based features (mean utterance length) as indicators of typical development, but features like articulation rate prove less informative. Despite cross-linguistic variation in part-of-speech patterns, the use of specific verbs and auxiliaries associated with goal-directed storytelling is correlated with a reduced likelihood of requiring intervention. Our analysis of two linguistically distinct languages reveals both language-specific and shared predictors of narrative proficiency, with implications for early assessment in multilingual contexts.
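
The feature-based analysis can be illustrated with the two features highlighted above (unique words and mean utterance length) feeding a simple classifier; the transcripts and intervention labels are toy stand-ins.

```python
# Sketch of the feature-based analysis: lexical diversity (unique words) and
# mean utterance length from a transcript, fed to a simple classifier.
# Transcripts and intervention labels below are toy stand-ins.
import numpy as np
from sklearn.linear_model import LogisticRegression

def narrative_features(utterances):
    words = [w for utt in utterances for w in utt.split()]
    unique_words = len(set(words))                     # lexical diversity
    mean_utt_len = np.mean([len(u.split()) for u in utterances])
    return [unique_words, mean_utt_len]

stories = [
    ["the dog ran", "then the dog found a ball", "he was happy"],
    ["dog", "ball", "dog ball"],
]
X = np.array([narrative_features(s) for s in stories])
y = np.array([0, 1])                                   # 1 = flagged for intervention

clf = LogisticRegression().fit(X, y)
print(clf.predict(X))
```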

new GEMMAS: Graph-based Evaluation Metrics for Multi Agent Systems

Authors: Jisoo Lee, Raeyoung Chang, Dongwook Kwon, Harmanpreet Singh, Nikhil Verma

Abstract: Multi-agent systems built on language models have shown strong performance on collaborative reasoning tasks. However, existing evaluations focus only on the correctness of the final output, overlooking how inefficient communication and poor coordination contribute to redundant reasoning and higher computational costs. We introduce GEMMAS, a graph-based evaluation framework that analyzes the internal collaboration process by modeling agent interactions as a directed acyclic graph. To capture collaboration quality, we propose two process-level metrics: Information Diversity Score (IDS) to measure semantic variation in inter-agent messages, and Unnecessary Path Ratio (UPR) to quantify redundant reasoning paths. We evaluate GEMMAS across five benchmarks and highlight results on GSM8K, where systems with only a 2.1% difference in accuracy differ by 12.8% in IDS and 80% in UPR, revealing substantial variation in internal collaboration. These findings demonstrate that outcome-only metrics are insufficient for evaluating multi-agent performance and highlight the importance of process-level diagnostics in designing more interpretable and resource-efficient collaborative AI systems.
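
The abstract does not spell out the IDS and UPR formulas, so the sketch below is only a guess at what such graph-based process metrics could look like: IDS approximated as mean pairwise cosine distance between message embeddings, and UPR as the fraction of interaction-DAG edges that lie on no path to the final-answer node.

```python
# Illustrative sketch only: the abstract does not give the exact formulas,
# so IDS is approximated as mean pairwise cosine distance between message
# embeddings, and UPR as the fraction of edges that lie on no path to the
# final-answer node of the interaction DAG.
import itertools
import numpy as np
import networkx as nx

def ids(message_embeddings):
    pairs = itertools.combinations(message_embeddings, 2)
    dists = [1 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
             for a, b in pairs]
    return float(np.mean(dists))

def upr(dag: nx.DiGraph, answer_node):
    useful = set()
    for node in dag.nodes:
        if node == answer_node:
            continue
        for path in nx.all_simple_paths(dag, node, answer_node):
            useful.update(zip(path, path[1:]))
    redundant = [e for e in dag.edges if e not in useful]
    return len(redundant) / dag.number_of_edges()

g = nx.DiGraph([("planner", "solver"), ("solver", "answer"),
                ("planner", "critic"), ("critic", "critic2")])  # dead-end branch
print(upr(g, "answer"))   # 0.5: two of four edges never reach the answer
print(ids(np.random.default_rng(0).normal(size=(4, 16))))
```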

new Automatically assessing oral narratives of Afrikaans and isiXhosa children

Authors: R. Louw (Stellenbosch University), E. Sharratt (Stellenbosch University), F. de Wet (Stellenbosch University), C. Jacobs (Stellenbosch University), A. Smith (Stellenbosch University), H. Kamper (Stellenbosch University)

Abstract: Developing narrative and comprehension skills in early childhood is critical for later literacy. However, teachers in large preschool classrooms struggle to accurately identify students who require intervention. We present a system for automatically assessing oral narratives of preschool children in Afrikaans and isiXhosa. The system uses automatic speech recognition followed by a machine learning scoring model to predict narrative and comprehension scores. For scoring predicted transcripts, we compare a linear model to a large language model (LLM). The LLM-based system outperforms the linear model in most cases, but the linear system is competitive despite its simplicity. The LLM-based system is comparable to a human expert in flagging children who require intervention. We lay the foundation for automatic oral assessments in classrooms, giving teachers extra capacity to focus on personalised support for children's learning.

new Enhancing Cross-task Transfer of Large Language Models via Activation Steering

Authors: Xinyu Tang, Zhihao Lv, Xiaoxue Cheng, Junyi Li, Wayne Xin Zhao, Zujie Wen, Zhiqiang Zhang, Jun Zhou

Abstract: Large language models (LLMs) have shown impressive abilities in leveraging pretrained knowledge through prompting, but they often struggle with unseen tasks, particularly in data-scarce scenarios. While cross-task in-context learning offers a direct solution for transferring knowledge across tasks, it still faces critical challenges in terms of robustness, scalability, and efficiency. In this paper, we investigate whether cross-task transfer can be achieved via latent space steering without parameter updates or input expansion. Through an analysis of activation patterns in the latent space of LLMs, we observe that the enhanced activations induced by in-context examples have consistent patterns across different tasks. Inspired by these findings, we propose CAST, a novel Cross-task Activation Steering Transfer framework that enables effective transfer by manipulating the model's internal activation states. Our approach first selects influential and diverse samples from high-resource tasks, then utilizes their contrastive representation-enhanced activations to adapt LLMs to low-resource tasks. Extensive experiments across both cross-domain and cross-lingual transfer settings show that our method outperforms competitive baselines and demonstrates superior scalability and lower computational costs.
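
Generic activation steering of the kind CAST builds on can be sketched as follows: a steering vector is taken as the mean difference between hidden states computed with and without in-context examples and added to one layer's output at inference time via a forward hook. The layer, scale, and toy module here are placeholders, not the paper's exact procedure.

```python
# Sketch of generic activation steering (not CAST's exact recipe): a steering
# vector is the mean difference between hidden states computed with and
# without in-context examples, and it is added to one layer's output at
# inference time via a forward hook. Model, layer, and scale are placeholders.
import torch

def steering_vector(hidden_with_icl, hidden_without_icl):
    # both: (num_examples, hidden_dim) hidden states at the chosen layer
    return (hidden_with_icl - hidden_without_icl).mean(dim=0)

def add_steering_hook(layer, vector, scale=1.0):
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + scale * vector          # shift the residual stream
        return (hidden, *output[1:]) if isinstance(output, tuple) else hidden
    return layer.register_forward_hook(hook)

# Toy demonstration with a stand-in "layer".
layer = torch.nn.Linear(16, 16)
vec = steering_vector(torch.randn(32, 16), torch.randn(32, 16))
handle = add_steering_hook(layer, vec, scale=0.5)
print(layer(torch.randn(2, 16)).shape)
handle.remove()
```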

new HATS: Hindi Analogy Test Set for Evaluating Reasoning in Large Language Models

Authors: Ashray Gupta, Rohan Joseph, Sunny Rai

Abstract: Analogies test a model's ability to infer implicit relationships between concepts, making them a key benchmark for evaluating reasoning capabilities. While large language models (LLMs) are widely evaluated for reasoning in English, their abilities in Indic languages remain understudied, limiting our understanding of whether these models generalize across languages. To address this gap, we introduce a new Hindi Analogy Test Set (HATS), comprising 405 multiple-choice questions sourced from Indian government exams. We benchmark state-of-the-art multilingual LLMs using various prompting strategies and introduce a grounded Chain of Thought approach that leverages cognitive theories of analogical reasoning. This approach improves model performance on Hindi analogy questions. Our experiments show that models perform best with English prompts, irrespective of the prompting strategy. Our test set addresses the lack of a critical resource to evaluate LLM reasoning capabilities in Hindi.

new Automating Steering for Safe Multimodal Large Language Models

Authors: Lyucheng Wu, Mengru Wang, Ziwen Xu, Tri Cao, Nay Oo, Bryan Hooi, Shumin Deng

Abstract: Recent progress in Multimodal Large Language Models (MLLMs) has unlocked powerful cross-modal reasoning abilities, but also raised new safety concerns, particularly when faced with adversarial multimodal inputs. To improve the safety of MLLMs during inference, we introduce a modular and adaptive inference-time intervention technology, AutoSteer, without requiring any fine-tuning of the underlying model. AutoSteer incorporates three core components: (1) a novel Safety Awareness Score (SAS) that automatically identifies the most safety-relevant distinctions among the model's internal layers; (2) an adaptive safety prober trained to estimate the likelihood of toxic outputs from intermediate representations; and (3) a lightweight Refusal Head that selectively intervenes to modulate generation when safety risks are detected. Experiments on LLaVA-OV and Chameleon across diverse safety-critical benchmarks demonstrate that AutoSteer significantly reduces the Attack Success Rate (ASR) for textual, visual, and cross-modal threats, while maintaining general abilities. These findings position AutoSteer as a practical, interpretable, and effective framework for safer deployment of multimodal AI systems.

new QuestA: Expanding Reasoning Capacity in LLMs via Question Augmentation

Authors: Jiazheng Li, Hong Lu, Kaiyue Wen, Zaiwen Yang, Jiaxuan Gao, Hongzhou Lin, Yi Wu, Jingzhao Zhang

Abstract: Reinforcement learning (RL) has become a key component in training large language reasoning models (LLMs). However, recent studies question its effectiveness in improving multi-step reasoning, particularly on hard problems. To address this challenge, we propose a simple yet effective strategy via Question Augmentation: introduce partial solutions during training to reduce problem difficulty and provide more informative learning signals. Our method, QuestA, when applied during RL training on math reasoning tasks, improves not only pass@1 but also pass@k, particularly on problems where standard RL struggles to make progress. This enables continual improvement over strong open-source models such as DeepScaleR and OpenMath Nemotron, further enhancing their reasoning capabilities. We achieve new state-of-the-art results on math benchmarks using 1.5B-parameter models: 67.1% (+5.3%) on AIME24, 59.5% (+10.0%) on AIME25, and 35.5% (+4.0%) on HMMT25. Further, we provide theoretical explanations that QuestA improves sample efficiency, offering a practical and generalizable pathway for expanding reasoning capability through RL.
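
The question-augmentation idea can be sketched as prepending a prefix of a reference solution to hard problems so that RL training sees easier variants; the formatting and hint-fraction schedule below are assumptions, not the paper's exact recipe.

```python
# Sketch of the question-augmentation idea: prepend a partial solution
# (a prefix of a known reference solution) to hard problems so RL training
# sees easier variants with more informative reward signals. The fraction
# schedule and formatting are assumptions, not the paper's exact recipe.
def augment_question(question: str, reference_solution: list[str],
                     hint_fraction: float) -> str:
    n_hint = int(len(reference_solution) * hint_fraction)
    hint = "\n".join(reference_solution[:n_hint])
    if not hint:
        return question
    return f"{question}\n\nPartial solution to continue from:\n{hint}"

steps = ["Let x be the unknown quantity.",
         "Set up the equation 3x + 5 = 20.",
         "Solve: x = 5."]
for frac in (0.0, 0.34, 0.67):
    print("--- hint fraction", frac)
    print(augment_question("Find x given 3x + 5 = 20.", steps, frac))
```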

new Overview of the TalentCLEF 2025: Skill and Job Title Intelligence for Human Capital Management

Authors: Luis Gasco, Hermenegildo Fabregat, Laura García-Sardiña, Paula Estrella, Daniel Deniz, Alvaro Rodrigo, Rabih Zbib

Abstract: Advances in natural language processing and large language models are driving a major transformation in Human Capital Management, with a growing interest in building smart systems based on language technologies for talent acquisition, upskilling strategies, and workforce planning. However, the adoption and progress of these technologies critically depend on the development of reliable and fair models, properly evaluated on public data and open benchmarks, which have so far been unavailable in this domain. To address this gap, we present TalentCLEF 2025, the first evaluation campaign focused on skill and job title intelligence. The lab consists of two tasks: Task A - Multilingual Job Title Matching, covering English, Spanish, German, and Chinese; and Task B - Job Title-Based Skill Prediction, in English. Both corpora were built from real job applications, carefully anonymized, and manually annotated to reflect the complexity and diversity of real-world labor market data, including linguistic variability and gender-marked expressions. The evaluations included monolingual and cross-lingual scenarios and covered the evaluation of gender bias. TalentCLEF attracted 76 registered teams with more than 280 submissions. Most systems relied on information retrieval techniques built with multilingual encoder-based models fine-tuned with contrastive learning, and several of them incorporated large language models for data augmentation or re-ranking. The results show that the training strategies have a larger effect than the size of the model alone. TalentCLEF provides the first public benchmark in this field and encourages the development of robust, fair, and transferable language technologies for the labor market.

new Multi-Agent Synergy-Driven Iterative Visual Narrative Synthesis

Authors: Wang Xi, Quan Shi, Tian Yu, Yujie Peng, Jiayi Sun, Mengxing Ren, Zenghui Ding, Ningguang Yao

Abstract: Automated generation of high-quality media presentations is challenging, requiring robust content extraction, narrative planning, visual design, and overall quality optimization. Existing methods often produce presentations with logical inconsistencies and suboptimal layouts, thereby struggling to meet professional standards. To address these challenges, we introduce RCPS (Reflective Coherent Presentation Synthesis), a novel framework integrating three key components: (1) Deep Structured Narrative Planning; (2) Adaptive Layout Generation; (3) an Iterative Optimization Loop. Additionally, we propose PREVAL, a preference-based evaluation framework employing rationale-enhanced multi-dimensional models to assess presentation quality across Content, Coherence, and Design. Experimental results demonstrate that RCPS significantly outperforms baseline methods across all quality dimensions, producing presentations that closely approximate human expert standards. PREVAL shows strong correlation with human judgments, validating it as a reliable automated tool for assessing presentation quality.

new AbGen: Evaluating Large Language Models in Ablation Study Design and Evaluation for Scientific Research

Authors: Yilun Zhao, Weiyuan Chen, Zhijian Xu, Manasi Patwardhan, Yixin Liu, Chengye Wang, Lovekesh Vig, Arman Cohan

Abstract: We introduce AbGen, the first benchmark designed to evaluate the capabilities of LLMs in designing ablation studies for scientific research. AbGen consists of 1,500 expert-annotated examples derived from 807 NLP papers. In this benchmark, LLMs are tasked with generating detailed ablation study designs for a specified module or process based on the given research context. Our evaluation of leading LLMs, such as DeepSeek-R1-0528 and o4-mini, highlights a significant performance gap between these models and human experts in terms of the importance, faithfulness, and soundness of the ablation study designs. Moreover, we demonstrate that current automated evaluation methods are not reliable for our task, as they show a significant discrepancy when compared to human assessment. To better investigate this, we develop AbGen-Eval, a meta-evaluation benchmark designed to assess the reliability of commonly used automated evaluation systems in measuring LLM performance on our task. We investigate various LLM-as-Judge systems on AbGen-Eval, providing insights for future research on developing more effective and reliable LLM-based evaluation systems for complex scientific tasks.

new HapticCap: A Multimodal Dataset and Task for Understanding User Experience of Vibration Haptic Signals

Authors: Guimin Hu, Daniel Hershcovich, Hasti Seifi

Abstract: Haptic signals, from smartphone vibrations to virtual reality touch feedback, can effectively convey information and enhance realism, but designing signals that resonate meaningfully with users is challenging. To facilitate this, we introduce a multimodal dataset and task for matching user descriptions to vibration haptic signals, and highlight two primary challenges: (1) lack of large haptic vibration datasets annotated with textual descriptions, as collecting haptic descriptions is time-consuming, and (2) limited capability of existing tasks and models to describe vibration signals in text. To advance this area, we create HapticCap, the first fully human-annotated haptic-captioned dataset, containing 92,070 haptic-text pairs for user descriptions of sensory, emotional, and associative attributes of vibrations. Based on HapticCap, we propose the haptic-caption retrieval task and present the results of this task from a supervised contrastive learning framework that brings together vibrations and text representations within specific description categories. Overall, the combination of language model T5 and audio model AST yields the best performance in the haptic-caption retrieval task, especially when separately trained for each description category.

new Social and Political Framing in Search Engine Results

Authors: Amrit Poudel, Tim Weninger

Abstract: Search engines play a crucial role in shaping public discourse by influencing how information is accessed and framed. While prior research has extensively examined various dimensions of search bias -- such as content prioritization, indexical bias, political polarization, and sources of bias -- an important question remains underexplored: how do search engines and ideologically-motivated user queries contribute to bias in search results. This study analyzes the outputs of major search engines using a dataset of political and social topics. The findings reveal that search engines not only prioritize content in ways that reflect underlying biases but also that ideologically-driven user queries exacerbate these biases, resulting in the amplification of specific narratives. Moreover, significant differences were observed across search engines in terms of the sources they prioritize. These results suggest that search engines may play a pivotal role in shaping public perceptions by reinforcing ideological divides, thereby contributing to the broader issue of information polarization.

new Vision-and-Language Training Helps Deploy Taxonomic Knowledge but Does Not Fundamentally Alter It

Authors: Yulu Qin, Dheeraj Varghese, Adam Dahlgren Lindström, Lucia Donatelli, Kanishka Misra, Najoung Kim

Abstract: Does vision-and-language (VL) training change the linguistic representations of language models in meaningful ways? Most results in the literature have shown inconsistent or marginal differences, both behaviorally and representationally. In this work, we start from the hypothesis that the domain in which VL training could have a significant effect is lexical-conceptual knowledge, in particular its taxonomic organization. Through comparing minimal pairs of text-only LMs and their VL-trained counterparts, we first show that the VL models often outperform their text-only counterparts on a text-only question-answering task that requires taxonomic understanding of concepts mentioned in the questions. Using an array of targeted behavioral and representational analyses, we show that the LMs and VLMs do not differ significantly in terms of their taxonomic knowledge itself, but they differ in how they represent questions that contain concepts in a taxonomic relation vs. a non-taxonomic relation. This implies that the taxonomic knowledge itself does not change substantially through additional VL training, but VL training does improve the deployment of this knowledge in the context of a specific task, even when the presentation of the task is purely linguistic.

new The Imitation Game: Turing Machine Imitator is Length Generalizable Reasoner

Authors: Zhouqi Hua, Wenwei Zhang, Chengqi Lyu, Yuzhe Gu, Songyang Gao, Kuikun Liu, Kai Chen

Abstract: Length generalization, the ability to solve problems of longer sequences than those observed during training, poses a core challenge for Transformer-based large language models (LLMs). Although existing studies have predominantly focused on data-driven approaches for arithmetic operations and symbolic manipulation tasks, these approaches tend to be task-specific with limited overall performance. To pursue a more general solution, this paper focuses on a broader case of reasoning problems that are computable, i.e., problems that algorithms can solve and thus can be solved by a Turing Machine. From this perspective, this paper proposes Turing MAchine Imitation Learning (TAIL) to improve the length generalization ability of LLMs. TAIL uses computer programs to synthesize chain-of-thought (CoT) data that imitate the execution process of a Turing Machine: it linearly expands the reasoning steps into atomic states to alleviate shortcut learning, and adds an explicit memory-fetch mechanism to reduce the difficulty of dynamic and long-range data access in elementary operations. To validate the reliability and universality of TAIL, we construct a challenging synthetic dataset covering 8 classes of algorithms and 18 tasks. Without bells and whistles, TAIL significantly improves the length generalization ability as well as the performance of Qwen2.5-7B on various tasks using only synthetic data, surpassing previous methods and DeepSeek-R1. The experimental results reveal that the key concepts in the Turing Machine, rather than the thinking styles, are indispensable for TAIL's length generalization; through these, the model exhibits read-and-write behaviors consistent with the properties of the Turing Machine in its attention layers. This work provides a promising direction for future research in the learning of LLM reasoning from synthetic data.
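
For a feel of what Turing-machine-imitating CoT data can look like, here is a toy program that simulates a tiny machine (binary increment) and serializes every atomic step; the trace format is illustrative, not the paper's.

```python
# Illustrative sketch (not the paper's data format): simulate a tiny Turing
# machine for binary increment and emit every atomic step as a linear trace,
# the kind of chain-of-thought that imitates TM execution.
def binary_increment_trace(bits: str) -> list[str]:
    tape = list(bits)
    head = len(tape) - 1            # start at the least-significant bit
    trace = []
    while head >= 0 and tape[head] == "1":
        trace.append(f"state=carry head={head} read=1 write=0 move=left")
        tape[head] = "0"
        head -= 1
    if head >= 0:
        trace.append(f"state=carry head={head} read={tape[head]} write=1 halt")
        tape[head] = "1"
    else:
        trace.append("state=carry head=-1 prepend=1 halt")
        tape.insert(0, "1")
    trace.append(f"result={''.join(tape)}")
    return trace

for step in binary_increment_trace("1011"):   # 1011 + 1 = 1100
    print(step)
```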

new A Survey of Context Engineering for Large Language Models

Authors: Lingrui Mei, Jiayu Yao, Yuyao Ge, Yiwei Wang, Baolong Bi, Yujun Cai, Jiazhi Liu, Mingyu Li, Zhong-Zhi Li, Duzhen Zhang, Chenlin Zhou, Jiayi Mao, Tianze Xia, Jiafeng Guo, Shenghua Liu

Abstract: The performance of Large Language Models (LLMs) is fundamentally determined by the contextual information provided during inference. This survey introduces Context Engineering, a formal discipline that transcends simple prompt design to encompass the systematic optimization of information payloads for LLMs. We present a comprehensive taxonomy decomposing Context Engineering into its foundational components and the sophisticated implementations that integrate them into intelligent systems. We first examine the foundational components: context retrieval and generation, context processing and context management. We then explore how these components are architecturally integrated to create sophisticated system implementations: retrieval-augmented generation (RAG), memory systems and tool-integrated reasoning, and multi-agent systems. Through this systematic analysis of over 1300 research papers, our survey not only establishes a technical roadmap for the field but also reveals a critical research gap: a fundamental asymmetry exists between model capabilities. While current models, augmented by advanced context engineering, demonstrate remarkable proficiency in understanding complex contexts, they exhibit pronounced limitations in generating equally sophisticated, long-form outputs. Addressing this gap is a defining priority for future research. Ultimately, this survey provides a unified framework for both researchers and engineers advancing context-aware AI.

new Comparing Apples to Oranges: A Dataset & Analysis of LLM Humour Understanding from Traditional Puns to Topical Jokes

Authors: Tyler Loakman, William Thorne, Chenghua Lin

Abstract: Humour, as a complex language form, is derived from myriad aspects of life, whilst existing work on computational humour has focussed almost exclusively on short pun-based jokes. In this work, we investigate whether the ability of Large Language Models (LLMs) to explain humour depends on the particular humour form. We compare models on simple puns and more complex topical humour that requires knowledge of real-world entities and events. In doing so, we curate a dataset of 600 jokes split across 4 joke types and manually write high-quality explanations. These jokes include heterographic and homographic puns, contemporary internet humour, and topical jokes, where understanding relies on reasoning beyond "common sense", rooted instead in world knowledge regarding news events and pop culture. Using this dataset, we compare the zero-shot abilities of a range of LLMs to accurately and comprehensively explain jokes of different types, identifying key research gaps in the task of humour explanation. We find that none of the tested models (inc. reasoning models) are capable of reliably generating adequate explanations of all joke types, further highlighting the narrow focus of most works in computational humour on overly simple joke forms.

cross Perfect diffusion is $\mathsf{TC}^0$ -- Bad diffusion is Turing-complete

Authors: Yuxi Liu

Abstract: This paper explores the computational complexity of diffusion-based language modeling. We prove a dichotomy based on the quality of the score-matching network in a diffusion model. In one direction, a network that exactly computes the score function of some initial distribution can only perform language modeling within the $\mathsf{TC}^0$ complexity class, reflecting limitations tied to rapid convergence. In the other direction, we show that if there is no requirement for the network to match any score function, then diffusion modeling can simulate any Turing machine in a certain sense. This dichotomy provides a theoretical lens on the capabilities and limitations of diffusion models, particularly concerning tasks requiring sequential computation. We conjecture extensions of our theoretical results, including for the case where the diffusion model is not perfect, but merely good. We also discuss the wider context and practical implications, and hypothesize that a machine learning architecture that can interpolate between sequential and parallel modes of operation would be superior to both Transformers and diffusion models.

cross A Survey of AIOps in the Era of Large Language Models

Authors: Lingzhe Zhang, Tong Jia, Mengxi Jia, Yifan Wu, Aiwei Liu, Yong Yang, Zhonghai Wu, Xuming Hu, Philip S. Yu, Ying Li

Abstract: As large language models (LLMs) grow increasingly sophisticated and pervasive, their application to various Artificial Intelligence for IT Operations (AIOps) tasks has garnered significant attention. However, a comprehensive understanding of the impact, potential, and limitations of LLMs in AIOps remains in its infancy. To address this gap, we conducted a detailed survey of LLM4AIOps, focusing on how LLMs can optimize processes and improve outcomes in this domain. We analyzed 183 research papers published between January 2020 and December 2024 to answer four key research questions (RQs). In RQ1, we examine the diverse failure data sources utilized, including advanced LLM-based processing techniques for legacy data and the incorporation of new data sources enabled by LLMs. RQ2 explores the evolution of AIOps tasks, highlighting the emergence of novel tasks and the publication trends across these tasks. RQ3 investigates the various LLM-based methods applied to address AIOps challenges. Finally, RQ4 reviews evaluation methodologies tailored to assess LLM-integrated AIOps approaches. Based on our findings, we discuss the state-of-the-art advancements and trends, identify gaps in existing research, and propose promising directions for future exploration.

cross Spatially Grounded Explanations in Vision Language Models for Document Visual Question Answering

Authors: Maximiliano Hormazábal Lagos, Héctor Cerezo-Costas, Dimosthenis Karatzas

Abstract: We introduce EaGERS, a fully training-free and model-agnostic pipeline that (1) generates natural language rationales via a vision language model, (2) grounds these rationales to spatial sub-regions by computing multimodal embedding similarities over a configurable grid with majority voting, and (3) restricts the generation of responses only from the relevant regions selected in the masked image. Experiments on the DocVQA dataset demonstrate that our best configuration not only outperforms the base model on exact match accuracy and Average Normalized Levenshtein Similarity metrics but also enhances transparency and reproducibility in DocVQA without additional model fine-tuning.
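
The grid-grounding step can be sketched as scoring each cell of a configurable grid by embedding similarity to the rationale and majority-voting across embedding models; the random vectors below stand in for real text/region encoders, and the grid size and quorum are assumptions.

```python
# Sketch of the grid-grounding idea: split the page into a grid, score each
# cell by multimodal embedding similarity to the rationale, majority-vote
# across embedders, and keep only the winning cells for answer generation.
# Embeddings below are random stand-ins for real text/region encoders.
import numpy as np

def top_cells(rationale_emb, cell_embs, k=2):
    sims = cell_embs @ rationale_emb / (
        np.linalg.norm(cell_embs, axis=1) * np.linalg.norm(rationale_emb))
    return set(np.argsort(-sims)[:k])

def majority_vote(selections, n_cells, quorum=2):
    votes = np.zeros(n_cells, dtype=int)
    for sel in selections:
        for cell in sel:
            votes[cell] += 1
    return {i for i, v in enumerate(votes) if v >= quorum}

rng = np.random.default_rng(0)
n_cells, dim = 3 * 3, 64                  # a 3x3 grid over the document image
rationale = rng.normal(size=dim)
selections = [top_cells(rationale, rng.normal(size=(n_cells, dim)))
              for _ in range(3)]          # one selection per embedding model
print(sorted(majority_vote(selections, n_cells)))
```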

cross Scaling Up RL: Unlocking Diverse Reasoning in LLMs via Prolonged Training

Authors: Mingjie Liu, Shizhe Diao, Jian Hu, Ximing Lu, Xin Dong, Hao Zhang, Alexander Bukharin, Shaokun Zhang, Jiaqi Zeng, Makesh Narsimhan Sreedhar, Gerald Shen, David Mosallanezhad, Di Zhang, Jonas Yang, June Yang, Oleksii Kuchaiev, Guilin Liu, Zhiding Yu, Pavlo Molchanov, Yejin Choi, Jan Kautz, Yi Dong

Abstract: Recent advancements in reasoning-focused language models such as OpenAI's O1 and DeepSeek-R1 have shown that scaling test-time computation-through chain-of-thought reasoning and iterative exploration-can yield substantial improvements on complex tasks like mathematics and code generation. These breakthroughs have been driven by large-scale reinforcement learning (RL), particularly when combined with verifiable reward signals that provide objective and grounded supervision. In this report, we investigate the effects of prolonged reinforcement learning on a small language model across a diverse set of reasoning domains. Our work identifies several key ingredients for effective training, including the use of verifiable reward tasks, enhancements to Group Relative Policy Optimization (GRPO), and practical techniques to improve training stability and generalization. We introduce controlled KL regularization, clipping ratio, and periodic reference policy resets as critical components for unlocking long-term performance gains. Our model achieves significant improvements over strong baselines, including +14.7% on math, +13.9% on coding, and +54.8% on logic puzzle tasks. To facilitate continued research, we release our model publicly.

cross Mono-InternVL-1.5: Towards Cheaper and Faster Monolithic Multimodal Large Language Models

Authors: Gen Luo, Wenhan Dou, Wenhao Li, Zhaokai Wang, Xue Yang, Changyao Tian, Hao Li, Weiyun Wang, Wenhai Wang, Xizhou Zhu, Yu Qiao, Jifeng Dai

Abstract: This paper focuses on monolithic Multimodal Large Language Models (MLLMs), which integrate visual encoding and language decoding into a single model. Existing structures and pre-training strategies for monolithic MLLMs often suffer from unstable optimization and catastrophic forgetting. To address these challenges, our key idea is to embed a new visual parameter space into a pre-trained LLM, enabling stable learning of visual knowledge from noisy data via delta tuning. Based on this principle, we first introduce Mono-InternVL, an advanced monolithic MLLM that incorporates a set of visual experts through a multimodal mixture-of-experts architecture. In addition, we design an innovative Endogenous Visual Pre-training (EViP) for Mono-InternVL to maximize its visual capabilities via progressive learning. Mono-InternVL achieves competitive performance against existing MLLMs but also leads to relatively expensive data cost. Therefore, we further present Mono-InternVL-1.5, a cheaper and stronger monolithic MLLM equipped with an improved EViP (EViP++). EViP++ introduces additional visual attention experts to Mono-InternVL-1.5 and re-organizes the pre-training process in an efficient manner. During inference, it includes a fused CUDA kernel to speed up its MoE operations. With these designs, Mono-InternVL-1.5 significantly reduces training and inference costs, while still maintaining competitive performance with Mono-InternVL. To evaluate our approach, we conduct extensive experiments across 15 benchmarks. Results demonstrate that Mono-InternVL outperforms existing monolithic MLLMs on 12 out of 15 benchmarks, e.g., +114-point improvement over Emu3 on OCRBench. Compared to its modular counterpart, i.e., InternVL-1.5, Mono-InternVL-1.5 achieves similar multimodal performance while reducing first-token latency by up to 69%. Code and models are released at https://github.com/OpenGVLab/Mono-InternVL.

URLs: https://github.com/OpenGVLab/Mono-InternVL.

cross A Fuzzy Approach to Project Success: Measuring What Matters

Authors: Jo\~ao Granja-Correia, Remedios Hern\'andez-Linares, Luca Ferranti, Arm\'enio Rego

Abstract: This paper introduces a novel approach to project success evaluation by integrating fuzzy logic into an existing construct. Traditional Likert-scale measures often overlook the context-dependent and multifaceted nature of project success. The proposed hierarchical Type-1 Mamdani fuzzy system prioritizes sustained positive impact for end-users, reducing emphasis on secondary outcomes like stakeholder satisfaction and internal project success. This dynamic approach may provide a more accurate measure of project success and could be adaptable to complex evaluations. Future research will focus on empirical testing and broader applications of fuzzy logic in social science.
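
To make the hierarchical Mamdani idea concrete, here is a minimal sketch of a Type-1 Mamdani fuzzy evaluator that weights sustained end-user impact over a secondary outcome. The membership functions, rule base, and variable names are illustrative assumptions, not the paper's calibrated system.

```python
# Minimal Type-1 Mamdani sketch: fuzzify, fire rules with min, aggregate with
# max, defuzzify by centroid. Open shoulders are approximated with triangles.
import numpy as np

def tri(x, a, b, c):
    """Triangular membership function."""
    return np.maximum(np.minimum((x - a) / (b - a + 1e-9),
                                 (c - x) / (c - b + 1e-9)), 0.0)

def evaluate_success(end_user_impact, stakeholder_satisfaction):
    """Return a crisp success score in [0, 10] via Mamdani inference."""
    y = np.linspace(0, 10, 501)                     # output universe
    # Fuzzify inputs on a 0-10 scale.
    impact_low = tri(end_user_impact, 0, 0, 6)
    impact_high = tri(end_user_impact, 4, 10, 10)
    satisf_high = tri(stakeholder_satisfaction, 4, 10, 10)
    # Rule base: end-user impact dominates; satisfaction only nudges upward.
    r1 = impact_high                                # -> success high
    r2 = min(impact_low, satisf_high)               # -> success medium
    r3 = impact_low                                 # -> success low
    # Implication (min) and aggregation (max) over output fuzzy sets.
    agg = np.maximum.reduce([
        np.minimum(r1, tri(y, 6, 10, 10)),
        np.minimum(r2, tri(y, 3, 5, 7)),
        np.minimum(r3, tri(y, 0, 0, 4)),
    ])
    # Centroid defuzzification.
    return float(np.sum(y * agg) / (np.sum(agg) + 1e-9))

print(round(evaluate_success(end_user_impact=8, stakeholder_satisfaction=3), 2))
```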

cross A Comprehensive Survey of Electronic Health Record Modeling: From Deep Learning Approaches to Large Language Models

Authors: Weijieying Ren, Jingxi Zhu, Zehao Liu, Tianxiang Zhao, Vasant Honavar

Abstract: Artificial intelligence (AI) has demonstrated significant potential in transforming healthcare through the analysis and modeling of electronic health records (EHRs). However, the inherent heterogeneity, temporal irregularity, and domain-specific nature of EHR data present unique challenges that differ fundamentally from those in vision and natural language tasks. This survey offers a comprehensive overview of recent advancements at the intersection of deep learning, large language models (LLMs), and EHR modeling. We introduce a unified taxonomy that spans five key design dimensions: data-centric approaches, neural architecture design, learning-focused strategies, multimodal learning, and LLM-based modeling systems. Within each dimension, we review representative methods addressing data quality enhancement, structural and temporal representation, self-supervised learning, and integration with clinical knowledge. We further highlight emerging trends such as foundation models, LLM-driven clinical agents, and EHR-to-text translation for downstream reasoning. Finally, we discuss open challenges in benchmarking, explainability, clinical alignment, and generalization across diverse clinical settings. This survey aims to provide a structured roadmap for advancing AI-driven EHR modeling and clinical decision support. For a comprehensive list of EHR-related methods, kindly refer to https://survey-on-tabular-data.github.io/.

URLs: https://survey-on-tabular-data.github.io/.

cross PMKLC: Parallel Multi-Knowledge Learning-based Lossless Compression for Large-Scale Genomics Database

Authors: Hui Sun, Yanfeng Ding, Liping Yi, Huidong Ma, Gang Wang, Xiaoguang Liu, Cheng Zhong, Wentong Cai

Abstract: Learning-based lossless compressors play a crucial role in large-scale genomic database backup, storage, transmission, and management. However, their 1) inadequate compression ratio, 2) low compression \& decompression throughput, and 3) poor compression robustness limit their widespread adoption and application in both industry and academia. To solve those challenges, we propose a novel \underline{P}arallel \underline{M}ulti-\underline{K}nowledge \underline{L}earning-based \underline{C}ompressor (PMKLC) with four crucial designs: 1) we propose an automated multi-knowledge learning-based compression framework as the compressors' backbone to enhance compression ratio and robustness; 2) we design a GPU-accelerated ($s$,$k$)-mer encoder to optimize compression throughput and computing resource usage; 3) we introduce data block partitioning and Step-wise Model Passing (SMP) mechanisms for parallel acceleration; 4) we design two compression modes, PMKLC-S and PMKLC-M, to meet complex application scenarios, where the former runs on a resource-constrained single GPU and the latter is multi-GPU accelerated. We benchmark PMKLC-S/M and 14 baselines (7 traditional and 7 learning-based) on 15 real-world datasets with different species and data sizes. Compared to baselines on the testing datasets, PMKLC-S/M achieve average compression ratio improvements of up to 73.609\% and 73.480\%, and average throughput improvements of up to 3.036$\times$ and 10.710$\times$, respectively. Besides, PMKLC-S/M also achieve the best robustness and competitive memory cost, indicating greater stability against datasets with different probability distribution perturbations and a strong ability to run on memory-constrained devices.
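
The ($s$,$k$)-mer encoder can be pictured with a plain CPU sketch: slide a window of k bases with stride s over a DNA sequence and map each k-mer to an integer id that a learned model could consume. The paper's GPU-accelerated encoder and exact scheme are not reproduced here; this is only the generic idea.

```python
# Illustrative (s, k)-mer encoder: base-4 positional encoding of k-mers taken
# every s positions along the sequence.
BASE_TO_ID = {"A": 0, "C": 1, "G": 2, "T": 3}

def sk_mer_encode(seq, k=4, s=2):
    """Return integer ids for the k-mers starting at every s-th position."""
    ids = []
    for start in range(0, len(seq) - k + 1, s):
        code = 0
        for base in seq[start:start + k]:
            code = code * 4 + BASE_TO_ID[base]   # base-4 positional encoding
        ids.append(code)
    return ids

print(sk_mer_encode("ACGTACGTAC", k=4, s=2))   # -> [27, 177, 27, 177]
```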

cross MCPEval: Automatic MCP-based Deep Evaluation for AI Agent Models

Authors: Zhiwei Liu, Jielin Qiu, Shiyu Wang, Jianguo Zhang, Zuxin Liu, Roshan Ram, Haolin Chen, Weiran Yao, Huan Wang, Shelby Heinecke, Silvio Savarese, Caiming Xiong

Abstract: The rapid rise of intelligent agents built on Large Language Models (LLMs) underscores the need for robust, scalable evaluation frameworks. Existing methods rely on static benchmarks and labor-intensive data collection, limiting practical assessment. We introduce MCPEval, an open-source Model Context Protocol (MCP)-based framework that automates end-to-end task generation and deep evaluation of LLM agents across diverse domains. MCPEval standardizes metrics, seamlessly integrates with native agent tools, and eliminates manual effort in building evaluation pipelines. Empirical results across five real-world domains show its effectiveness in revealing nuanced, domain-specific performance. We publicly release MCPEval at https://github.com/SalesforceAIResearch/MCPEval to promote reproducible and standardized LLM agent evaluation.

URLs: https://github.com/SalesforceAIResearch/MCPEval

cross Emotional Support with LLM-based Empathetic Dialogue Generation

Authors: Shiquan Wang, Ruiyu Fang, Zhongjiang He, Shuangyong Song, Yongxiang Li

Abstract: Emotional Support Conversation (ESC) aims to provide empathetic and effective emotional assistance through dialogue, addressing the growing demand for mental health support. This paper presents our solution for the NLPCC 2025 Task 8 ESC evaluation, where we leverage large-scale language models enhanced by prompt engineering and finetuning techniques. We explore both parameter-efficient Low-Rank Adaptation and full-parameter fine-tuning strategies to improve the model's ability to generate supportive and contextually appropriate responses. Our best model ranked second in the competition, highlighting the potential of combining LLMs with effective adaptation methods for ESC tasks. Future work will focus on further enhancing emotional understanding and response personalization to build more practical and reliable emotional support systems.

cross Probabilistic Soundness Guarantees in LLM Reasoning Chains

Authors: Weiqiu You, Anton Xue, Shreya Havaldar, Delip Rao, Helen Jin, Chris Callison-Burch, Eric Wong

Abstract: In reasoning chains generated by large language models (LLMs), initial errors often propagate and undermine the reliability of the final conclusion. Current LLM-based error detection methods often fail to detect propagated errors because they do not properly account for how earlier errors might corrupt judgments of downstream reasoning. To better detect such propagated errors, we introduce Autoregressive Reasoning Entailment Stability (ARES), a novel probabilistic framework that prevents error propagation by judging each claim based only on previously-assessed sound premises. This inductive method yields a nuanced score for each step and provides certified statistical guarantees of its soundness, rather than a brittle binary label. ARES achieves state-of-the-art performance across four benchmarks (72.1% Macro-F1, +8.2 points) and demonstrates superior robustness on very long synthetic reasoning chains, where it excels at detecting propagated errors (90.3% F1, +27.6 points).
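
The inductive idea in the abstract, judging each claim only against premises that were themselves judged sound, can be sketched as a simple loop. The `entailment_prob` judge below is a stub, and ARES's actual scorer and its certified statistical guarantees are not reproduced.

```python
# Sketch of inductive, premise-filtered claim assessment: early rejected claims
# never become premises for later judgments.
from typing import Callable, List, Tuple

def assess_chain(question: str,
                 claims: List[str],
                 entailment_prob: Callable[[str, List[str], str], float],
                 threshold: float = 0.8) -> List[Tuple[str, float, bool]]:
    """Return (claim, score, is_sound) for each step, built up inductively."""
    sound_premises: List[str] = []
    results = []
    for claim in claims:
        score = entailment_prob(question, sound_premises, claim)
        is_sound = score >= threshold
        if is_sound:
            sound_premises.append(claim)   # only sound claims become premises
        results.append((claim, score, is_sound))
    return results

# Toy judge: pretends a "therefore" step needs at least one accepted premise.
def toy_judge(question, premises, claim):
    return 0.9 if ("therefore" not in claim or premises) else 0.2

chain = ["x = 2", "therefore x + 1 = 3"]
print(assess_chain("What is x + 1?", chain, toy_judge))
```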

cross UniSLU: Unified Spoken Language Understanding from Heterogeneous Cross-Task Datasets

Authors: Zhichao Sheng, Shilin Zhou, Chen Gong, Zhenghua Li

Abstract: Spoken Language Understanding (SLU) plays a crucial role in speech-centric multimedia applications, enabling machines to comprehend spoken language in scenarios such as meetings, interviews, and customer service interactions. SLU encompasses multiple tasks, including Automatic Speech Recognition (ASR), spoken Named Entity Recognition (NER), and spoken Sentiment Analysis (SA). However, existing methods often rely on separate model architectures for individual tasks such as spoken NER and SA, which increases system complexity, limits cross-task interaction, and fails to fully exploit heterogeneous datasets available across tasks. To address these limitations, we propose UniSLU, a unified framework that jointly models multiple SLU tasks within a single architecture. Specifically, we propose a unified representation for diverse SLU tasks, enabling full utilization of heterogeneous datasets across multiple tasks. Built upon this representation, we propose a unified generative method that jointly models ASR, spoken NER, and SA tasks, enhancing task interactions and enabling seamless integration with large language models to harness their powerful generative capabilities. Extensive experiments on public SLU datasets demonstrate the effectiveness of our approach, achieving superior SLU performance compared to several benchmark methods, making it well-suited for real-world speech-based multimedia scenarios. We will release all code and models on GitHub to facilitate future research.

cross Teach Old SAEs New Domain Tricks with Boosting

Authors: Nikita Koriagin, Yaroslav Aksenov, Daniil Laptev, Gleb Gerasimov, Nikita Balagansky, Daniil Gavrilov

Abstract: Sparse Autoencoders have emerged as powerful tools for interpreting the internal representations of Large Language Models, yet they often fail to capture domain-specific features not prevalent in their training corpora. This paper introduces a residual learning approach that addresses this feature blindness without requiring complete retraining. We propose training a secondary SAE specifically to model the reconstruction error of a pretrained SAE on domain-specific texts, effectively capturing features missed by the primary model. By summing the outputs of both models during inference, we demonstrate significant improvements in both LLM cross-entropy and explained variance metrics across multiple specialized domains. Our experiments show that this method efficiently incorporates new domain knowledge into existing SAEs while maintaining their performance on general tasks. This approach enables researchers to selectively enhance SAE interpretability for specific domains of interest, opening new possibilities for targeted mechanistic interpretability of LLMs.
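
A minimal PyTorch sketch of the residual ("boosting") idea follows: a secondary sparse autoencoder is trained on what a frozen, pretrained SAE misses for domain-specific activations, and the two reconstructions are summed at inference. Dimensions, the sparsity penalty, and the training loop are assumptions rather than the authors' setup.

```python
# Residual SAE boosting sketch: train a secondary SAE on the primary SAE's
# reconstruction error, then sum both reconstructions at inference.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model=512, d_hidden=2048):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_hidden)
        self.decoder = nn.Linear(d_hidden, d_model)

    def forward(self, x):
        features = torch.relu(self.encoder(x))
        return self.decoder(features), features

primary = SparseAutoencoder()
for p in primary.parameters():           # pretrained general-domain SAE, frozen
    p.requires_grad_(False)

secondary = SparseAutoencoder()           # trained only on the residual
opt = torch.optim.Adam(secondary.parameters(), lr=1e-4)
l1_coef = 1e-3

for _ in range(100):                      # stand-in for domain-specific batches
    acts = torch.randn(64, 512)
    with torch.no_grad():
        primary_recon, _ = primary(acts)
    residual = acts - primary_recon
    secondary_recon, feats = secondary(residual)
    loss = ((secondary_recon - residual) ** 2).mean() + l1_coef * feats.abs().mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

# Inference: the summed reconstructions explain the activation.
with torch.no_grad():
    acts = torch.randn(1, 512)
    p_recon, _ = primary(acts)
    s_recon, _ = secondary(acts - p_recon)
    combined = p_recon + s_recon
```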

cross Rethinking the Embodied Gap in Vision-and-Language Navigation: A Holistic Study of Physical and Visual Disparities

Authors: Liuyi Wang, Xinyuan Xia, Hui Zhao, Hanqing Wang, Tai Wang, Yilun Chen, Chengju Liu, Qijun Chen, Jiangmiao Pang

Abstract: Recent Vision-and-Language Navigation (VLN) advancements are promising, but their idealized assumptions about robot movement and control fail to reflect physically embodied deployment challenges. To bridge this gap, we introduce VLN-PE, a physically realistic VLN platform supporting humanoid, quadruped, and wheeled robots. For the first time, we systematically evaluate several ego-centric VLN methods in physical robotic settings across different technical pipelines, including classification models for single-step discrete action prediction, a diffusion model for dense waypoint prediction, and a train-free, map-based large language model (LLM) integrated with path planning. Our results reveal significant performance degradation due to limited robot observation space, environmental lighting variations, and physical challenges like collisions and falls. This also exposes locomotion constraints for legged robots in complex environments. VLN-PE is highly extensible, allowing seamless integration of new scenes beyond MP3D, thereby enabling more comprehensive VLN evaluation. Despite the weak generalization of current models in physical deployment, VLN-PE provides a new pathway for improving cross-embodiment's overall adaptability. We hope our findings and tools inspire the community to rethink VLN limitations and advance robust, practical VLN models. The code is available at https://crystalsixone.github.io/vln_pe.github.io/.

URLs: https://crystalsixone.github.io/vln_pe.github.io/.

cross From Roots to Rewards: Dynamic Tree Reasoning with RL

Authors: Ahmed Bahloul, Simon Malberg

Abstract: Modern language models address complex questions through chain-of-thought (CoT) reasoning (Wei et al., 2023) and retrieval augmentation (Lewis et al., 2021), yet struggle with error propagation and knowledge integration. Tree-structured reasoning methods, particularly the Probabilistic Tree-of-Thought (ProbTree) (Cao et al., 2023) framework, mitigate these issues by decomposing questions into hierarchical structures and selecting answers through confidence-weighted aggregation of parametric and retrieved knowledge (Yao et al., 2023). However, ProbTree's static implementation introduces two key limitations: (1) the reasoning tree is fixed during the initial construction phase, preventing dynamic adaptation to intermediate results, and (2) each node requires exhaustive evaluation of all possible solution strategies, creating computational inefficiency. We present a dynamic reinforcement learning (Sutton and Barto, 2018) framework that transforms tree-based reasoning into an adaptive process. Our approach incrementally constructs the reasoning tree based on real-time confidence estimates, while learning optimal policies for action selection (decomposition, retrieval, or aggregation). This maintains ProbTree's probabilistic rigor while improving both solution quality and computational efficiency through selective expansion and focused resource allocation. The work establishes a new paradigm for tree-structured reasoning that balances the reliability of probabilistic frameworks with the flexibility required for real-world question answering systems.

cross Inverse Reinforcement Learning Meets Large Language Model Post-Training: Basics, Advances, and Opportunities

Authors: Hao Sun, Mihaela van der Schaar

Abstract: In the era of Large Language Models (LLMs), alignment has emerged as a fundamental yet challenging problem in the pursuit of more reliable, controllable, and capable machine intelligence. The recent success of reasoning models and conversational AI systems has underscored the critical role of reinforcement learning (RL) in enhancing these systems, driving increased research interest at the intersection of RL and LLM alignment. This paper provides a comprehensive review of recent advances in LLM alignment through the lens of inverse reinforcement learning (IRL), emphasizing the distinctions between RL techniques employed in LLM alignment and those in conventional RL tasks. In particular, we highlight the necessity of constructing neural reward models from human data and discuss the formal and practical implications of this paradigm shift. We begin by introducing fundamental concepts in RL to provide a foundation for readers unfamiliar with the field. We then examine recent advances in this research agenda, discussing key challenges and opportunities in conducting IRL for LLM alignment. Beyond methodological considerations, we explore practical aspects, including datasets, benchmarks, evaluation metrics, infrastructure, and computationally efficient training and inference techniques. Finally, we draw insights from the literature on sparse-reward RL to identify open questions and potential research directions. By synthesizing findings from diverse studies, we aim to provide a structured and critical overview of the field, highlight unresolved challenges, and outline promising future directions for improving LLM alignment through RL and IRL techniques.

cross The Generative Energy Arena (GEA): Incorporating Energy Awareness in Large Language Model (LLM) Human Evaluations

Authors: Carlos Arriaga, Gonzalo Mart\'inez, Eneko Sendin, Javier Conde, Pedro Reviriego

Abstract: The evaluation of large language models is a complex task, for which several approaches have been proposed. The most common is the use of automated benchmarks in which LLMs have to answer multiple-choice questions on different topics. However, this method has certain limitations, the most concerning being its poor correlation with human judgments. An alternative approach is to have humans evaluate the LLMs. This poses scalability issues, as there is a large and growing number of models to evaluate, making it impractical (and costly) to run traditional studies based on recruiting a number of evaluators and having them rank the responses of the models. An alternative approach is the use of public arenas, such as the popular LM Arena, on which any user can freely evaluate models on any question and rank the responses of two models. The results are then aggregated into a model ranking. An increasingly important aspect of LLMs is their energy consumption and, therefore, evaluating how energy awareness influences the decisions of humans in selecting a model is of interest. In this paper, we present GEA, the Generative Energy Arena, an arena that incorporates information on the energy consumption of the models in the evaluation process. Preliminary results obtained with GEA show that for most questions, when users are aware of the energy consumption, they favor smaller and more energy-efficient models. This suggests that for most user interactions, the extra cost and energy incurred by the more complex and top-performing models do not provide an increase in the perceived quality of the responses that justifies their use.

cross VisionThink: Smart and Efficient Vision Language Model via Reinforcement Learning

Authors: Senqiao Yang, Junyi Li, Xin Lai, Bei Yu, Hengshuang Zhao, Jiaya Jia

Abstract: Recent advancements in vision-language models (VLMs) have improved performance by increasing the number of visual tokens, which are often significantly longer than text tokens. However, we observe that most real-world scenarios do not require such an extensive number of visual tokens. While the performance drops significantly in a small subset of OCR-related tasks, models still perform accurately in most other general VQA tasks with only 1/4 resolution. Therefore, we propose to dynamically process distinct samples with different resolutions, and present a new paradigm for visual token compression, namely, VisionThink. It starts with a downsampled image and smartly decides whether it is sufficient for problem solving. Otherwise, the model could output a special token to request the higher-resolution image. Compared to existing Efficient VLM methods that compress tokens using fixed pruning ratios or thresholds, VisionThink autonomously decides whether to compress tokens case by case. As a result, it demonstrates strong fine-grained visual understanding capability on OCR-related tasks, and meanwhile saves substantial visual tokens on simpler tasks. We adopt reinforcement learning and propose the LLM-as-Judge strategy to successfully apply RL to general VQA tasks. Moreover, we carefully design a reward function and penalty mechanism to achieve a stable and reasonable image resize call ratio. Extensive experiments demonstrate the superiority, efficiency, and effectiveness of our method. Our code is available at https://github.com/dvlab-research/VisionThink.

URLs: https://github.com/dvlab-research/VisionThink.
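
The resolution-escalation behavior described in the abstract, answer from a downsampled image first and only request the full-resolution image when a special token is emitted, can be pictured with a small stub. The model call, downsampling helper, and token string below are placeholders, not the released VisionThink interface.

```python
# Conceptual sketch of low-resolution-first inference with on-demand escalation.
from typing import Callable

REQUEST_HI_RES = "<request_high_resolution>"

def answer_with_escalation(image_hi_res,
                           question: str,
                           vlm: Callable,
                           downsample: Callable):
    """Try the cheap low-res path; escalate to full resolution if asked."""
    low_res = downsample(image_hi_res, factor=2)   # stand-in for the cheaper input
    reply = vlm(low_res, question)
    if REQUEST_HI_RES in reply:                    # model asks for more detail
        reply = vlm(image_hi_res, question)
    return reply

# Toy stand-ins so the sketch runs end to end.
def toy_downsample(img, factor):
    return img[::factor]

def toy_vlm(img, question):
    return REQUEST_HI_RES if len(img) < 6 else f"answer based on {len(img)} rows"

print(answer_with_escalation(list(range(8)), "Read the sign.", toy_vlm, toy_downsample))
```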

replace A Logically Consistent Chain-of-Thought Approach for Stance Detection

Authors: Bowen Zhang, Daijun Ding, Liwen Jing, Hu Huang

Abstract: Zero-shot stance detection (ZSSD) aims to detect stances toward unseen targets. Incorporating background knowledge to enhance transferability between seen and unseen targets constitutes the primary approach of ZSSD. However, these methods often struggle with a knowledge-task disconnect and lack logical consistency in their predictions. To address these issues, we introduce a novel approach named Logically Consistent Chain-of-Thought (LC-CoT) for ZSSD, which improves stance detection by ensuring relevant and logically sound knowledge extraction. LC-CoT employs a three-step process. Initially, it assesses whether supplementary external knowledge is necessary. Subsequently, it uses API calls to retrieve this knowledge, which can be processed by a separate LLM. Finally, a manual exemplar guides the LLM to infer stance categories, using an if-then logical structure to maintain relevance and logical coherence. This structured approach to eliciting background knowledge enhances the model's capability, outperforming traditional supervised methods without relying on labeled data.
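
The three-step flow described above can be sketched as a short pipeline: decide whether external knowledge is needed, fetch it only if so, and infer the stance with an if-then exemplar. The `llm` and `retrieve` callables and the prompt wording are placeholders, not the paper's prompts.

```python
# Schematic LC-CoT-style pipeline with stubbed LLM and retrieval calls.
def lc_cot_stance(text: str, target: str, llm, retrieve) -> str:
    # Step 1: does this case need supplementary background knowledge?
    needs_knowledge = llm(
        f"Does judging the stance of '{text}' toward '{target}' "
        f"require external background knowledge? Answer yes or no."
    ).strip().lower().startswith("yes")

    # Step 2: retrieve knowledge via an external call only when needed.
    knowledge = retrieve(target) if needs_knowledge else ""

    # Step 3: an if-then exemplar keeps the final inference logically consistent.
    exemplar = ("If the text criticizes the target, then the stance is AGAINST; "
                "if it praises the target, then the stance is FAVOR; "
                "otherwise the stance is NONE.")
    return llm(
        f"{exemplar}\nKnowledge: {knowledge}\nText: {text}\nTarget: {target}\nStance:"
    ).strip()

# Toy stand-ins so the sketch runs end to end.
fake_llm = lambda prompt: "yes" if prompt.lower().startswith("does") else "AGAINST"
fake_retrieve = lambda target: f"Background notes about {target}."
print(lc_cot_stance("This policy will hurt workers.", "the policy", fake_llm, fake_retrieve))
```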

replace Exploiting Adaptive Contextual Masking for Aspect-Based Sentiment Analysis

Authors: S M Rafiuddin, Mohammed Rakib, Sadia Kamal, Arunkumar Bagavathi

Abstract: Aspect-Based Sentiment Analysis (ABSA) is a fine-grained linguistics problem that entails the extraction of multifaceted aspects, opinions, and sentiments from the given text. Both standalone and compound ABSA tasks have been extensively used in the literature to examine the nuanced information present in online reviews and social media posts. Current ABSA methods often rely on static hyperparameters for attention-masking mechanisms, which can struggle with context adaptation and may overlook the unique relevance of words in varied situations. This leads to challenges in accurately analyzing complex sentences containing multiple aspects with differing sentiments. In this work, we present adaptive masking methods that remove irrelevant tokens based on context to assist in Aspect Term Extraction and Aspect Sentiment Classification subtasks of ABSA. We show with our experiments that the proposed methods outperform the baseline methods in terms of accuracy and F1 scores on four benchmark online review datasets. Further, we show that the proposed methods can be extended with multiple adaptations and demonstrate a qualitative analysis of the proposed approach using sample text for aspect term extraction.

replace On the Limitations of Large Language Models (LLMs): False Attribution

Authors: Tosin Adewumi, Nudrat Habib, Lama Alkhaled, Elisa Barney

Abstract: In this work, we introduce a new hallucination metric, the Simple Hallucination Index (SHI), and provide insight into one important limitation of the parametric knowledge of large language models (LLMs), i.e. false attribution. The task of automatic author attribution for relatively small chunks of text is an important NLP task but can be challenging. We empirically evaluate the power of 3 open SotA LLMs in a zero-shot setting (Gemma-7B, Mixtral 8x7B, and LLaMA-2-13B). We acquired the top 10 most popular books of a month, according to Project Gutenberg, divided each one into equal chunks of 400 words, and prompted each LLM to predict the author. We then randomly sampled 162 chunks per book for human evaluation, based on an error margin of 7% and a confidence level of 95%. The average results show that Mixtral 8x7B has the highest prediction accuracy (0.724) and the lowest SHI (0.263), with a Pearson's correlation (r) between accuracy and SHI of -0.9996, followed by LLaMA-2-13B and Gemma-7B. However, Mixtral 8x7B suffers from high hallucinations for 3 books, rising as high as an SHI of 0.87 (in the range 0-1, where 1 is the worst). The strong negative correlation between accuracy and SHI, given by r, demonstrates the fidelity of the new hallucination metric, which may generalize to other tasks. We also show that prediction accuracies correlate positively with the frequencies of Wikipedia instances of the book titles rather than the number of downloads, and we perform error analyses of predictions. We publicly release the annotated chunks of data and our code to aid the reproducibility and evaluation of other models.

replace DeFine: Decision-Making with Analogical Reasoning over Factor Profiles

Authors: Yebowen Hu, Xiaoyang Wang, Wenlin Yao, Yiming Lu, Daoan Zhang, Hassan Foroosh, Dong Yu, Fei Liu

Abstract: LLMs are ideal for decision-making thanks to their ability to reason over long contexts. However, challenges arise when processing speech transcripts that describe complex scenarios, as they are verbose and include repetition, hedging, and vagueness. For example, during a company's earnings call, an executive might project a positive revenue outlook to reassure investors, despite uncertainty regarding future earnings. It is crucial for LLMs to incorporate this uncertainty systematically when making decisions. In this paper, we introduce \textsc{DeFine}, a modular framework that constructs probabilistic factor profiles from complex scenarios. It then integrates these profiles with analogical reasoning, leveraging insights from similar past experiences to guide LLMs in making critical decisions in new situations. Our framework separates the tasks of quantifying uncertainty and incorporating it into LLM decision-making. This approach is particularly useful in areas such as consulting and financial deliberation, where making decisions under uncertainty is vital.

replace Identifying Task Groupings for Multi-Task Learning Using Pointwise V-Usable Information

Authors: Yingya Li, Timothy Miller, Steven Bethard, Guergana Savova

Abstract: The success of multi-task learning can depend heavily on which tasks are grouped together. Naively grouping all tasks or a random set of tasks can result in negative transfer, with the multi-task models performing worse than single-task models. Though many efforts have been made to identify task groupings and to measure the relatedness among different tasks, it remains a challenging research topic to define a metric to identify the best task grouping out of a pool of many potential task combinations. We propose a metric of task relatedness based on task difficulty measured by pointwise V-usable information (PVI). PVI is a recently proposed metric to estimate how much usable information a dataset contains given a model. We hypothesize that tasks with not statistically different PVI estimates are similar enough to benefit from the joint learning process. We conduct comprehensive experiments to evaluate the feasibility of this metric for task grouping on 15 NLP datasets in the general, biomedical, and clinical domains. We compare the results of the joint learners against single learners, existing baseline methods, and recent large language models, including Llama 2 and GPT-4. The results show that by grouping tasks with similar PVI estimates, the joint learners yielded competitive results with fewer total parameters, with consistent performance across domains.
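
The grouping heuristic described above can be sketched directly from the PVI definition: per-instance PVI is the log-probability gain of the label when the model sees the input versus a null input, and tasks whose mean PVIs are not statistically different are grouped for joint training. The probability arrays and the Welch t-test threshold below are illustrative stand-ins, not the paper's experimental setup.

```python
# PVI computation and a greedy grouping of tasks with indistinguishable PVI.
import numpy as np
from scipy.stats import ttest_ind

def pvi(p_label_given_input, p_label_given_null):
    """Pointwise V-usable information, in bits, per instance."""
    return np.log2(p_label_given_input) - np.log2(p_label_given_null)

def group_tasks(task_pvis, alpha=0.05):
    """Greedily merge tasks whose PVI distributions are not distinguishable."""
    groups = []
    for name in task_pvis:
        for group in groups:
            anchor = group[0]
            _, p_value = ttest_ind(task_pvis[name], task_pvis[anchor],
                                   equal_var=False)
            if p_value > alpha:        # not statistically different -> same group
                group.append(name)
                break
        else:
            groups.append([name])
    return groups

rng = np.random.default_rng(0)
task_pvis = {
    "ner":   pvi(rng.uniform(0.7, 0.95, 200), rng.uniform(0.2, 0.4, 200)),
    "chunk": pvi(rng.uniform(0.7, 0.95, 200), rng.uniform(0.2, 0.4, 200)),
    "nli":   pvi(rng.uniform(0.4, 0.6, 200), rng.uniform(0.2, 0.4, 200)),
}
print(group_tasks(task_pvis))
```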

replace SCULPT: Systematic Tuning of Long Prompts

Authors: Shanu Kumar, Akhila Yesantarao Venkata, Shubhanshu Khandelwal, Bishal Santra, Parag Agrawal, Manish Gupta

Abstract: Prompt optimization is essential for effective utilization of large language models (LLMs) across diverse tasks. While existing optimization methods are effective in optimizing short prompts, they struggle with longer, more complex ones, often risking information loss and being sensitive to small perturbations. To address these challenges, we propose SCULPT (Systematic Tuning of Long Prompts), a framework that treats prompt optimization as a hierarchical tree refinement problem. SCULPT represents prompts as tree structures, enabling targeted modifications while preserving contextual integrity. It employs a Critic-Actor framework that generates reflections and applies actions to refine the prompt. Evaluations demonstrate SCULPT's effectiveness on long prompts, its robustness to adversarial perturbations, and its ability to generate high-performing prompts even without any initial human-written prompt. Compared to existing state of the art methods, SCULPT consistently improves LLM performance by preserving essential task information while applying structured refinements. Both qualitative and quantitative analyses show that SCULPT produces more stable and interpretable prompt modifications, ensuring better generalization across tasks.

replace IOPO: Empowering LLMs with Complex Instruction Following via Input-Output Preference Optimization

Authors: Xinghua Zhang, Haiyang Yu, Cheng Fu, Fei Huang, Yongbin Li

Abstract: In the realm of large language models (LLMs), the ability of models to accurately follow instructions is paramount as more agents and applications leverage LLMs for construction, where the complexity of instructions is rapidly increasing. However, on the one hand, there is only a certain amount of complex instruction evaluation data; on the other hand, there are no dedicated algorithms to improve the ability to follow complex instructions. To this end, this paper introduces TRACE, a benchmark for improving and evaluating the complex instruction-following ability, which consists of 120K training data and 1K evaluation data. Furthermore, we propose IOPO (Input-Output Preference Optimization), an alignment method which takes both input and output preference pairs into consideration, where LLMs not only rapidly align with response preferences but also meticulously explore the instruction preferences. Extensive experiments on both in-domain and out-of-domain datasets confirm the effectiveness of IOPO, showing 8.15% and 2.18% improvements on in-domain data and 6.29% and 3.13% on out-of-domain data compared to SFT and DPO, respectively.

replace Multi-task retriever fine-tuning for domain-specific and efficient RAG

Authors: Patrice B\'echard, Orlando Marquez Ayala

Abstract: Retrieval-Augmented Generation (RAG) has become ubiquitous when deploying Large Language Models (LLMs), as it can address typical limitations such as generating hallucinated or outdated information. However, when building real-world RAG applications, practical issues arise. First, the retrieved information is generally domain-specific. Since it is computationally expensive to fine-tune LLMs, it is more feasible to fine-tune the retriever to improve the quality of the data included in the LLM input. Second, as more applications are deployed in the same real-world system, one cannot afford to deploy separate retrievers. Moreover, these RAG applications normally retrieve different kinds of data. Our solution is to instruction fine-tune a small retriever encoder on a variety of domain-specific tasks to allow us to deploy one encoder that can serve many use cases, thereby achieving low cost, scalability, and speed. We show how this encoder generalizes to out-of-domain settings as well as to an unseen retrieval task on real-world enterprise use cases.

replace Memorization Inheritance in Sequence-Level Knowledge Distillation for Neural Machine Translation

Authors: Verna Dankers, Vikas Raunak

Abstract: In this work, we explore how instance-level memorization in the teacher Neural Machine Translation (NMT) model gets inherited by the student model in sequence-level knowledge distillation (SeqKD). We find that despite not directly seeing the original training data, students memorize more than baseline models (models of the same size, trained on the original data) -- 3.4% for exact matches and 57% for extractive memorization -- and show increased hallucination rates. Further, under this SeqKD setting, we also characterize how students behave on specific training data subgroups, such as subgroups with low quality and specific counterfactual memorization (CM) scores, and find that students exhibit amplified denoising on low-quality subgroups. Finally, we propose a modification to SeqKD named Adaptive-SeqKD, which intervenes in SeqKD to reduce memorization and hallucinations. Overall, we recommend caution when applying SeqKD: students inherit both their teachers' superior performance and their fault modes, thereby requiring active monitoring.

replace MPO: An Efficient Post-Processing Framework for Mixing Diverse Preference Alignment

Authors: Tianze Wang, Dongnan Gui, Yifan Hu, Shuhang Lin, Linjun Zhang

Abstract: Reinforcement Learning from Human Feedback (RLHF) has shown promise in aligning large language models (LLMs). Yet its reliance on a singular reward model often overlooks the diversity of human preferences. Recent approaches address this limitation by leveraging multi-dimensional feedback to fine-tune corresponding reward models and train LLMs using reinforcement learning. However, the process is costly and unstable, especially given the competing and heterogeneous nature of human preferences. In this paper, we propose Mixing Preference Optimization (MPO), a post-processing framework for aggregating single-objective policies as an alternative to both multi-objective RLHF (MORLHF) and MaxMin-RLHF. MPO avoids alignment from scratch. Instead, it log-linearly combines existing policies into a unified one with the weight of each policy computed via a batch stochastic mirror descent. Empirical results demonstrate that MPO achieves balanced performance across diverse preferences, outperforming or matching existing models with significantly reduced computational costs.
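
The post-processing idea above, combining existing single-objective policies log-linearly, amounts to taking a weighted sum of per-token log-probabilities and renormalizing. The small numpy sketch below uses fixed, hand-picked weights; in MPO the weights are computed with batch stochastic mirror descent, which is not reproduced here.

```python
# Log-linear mixture of next-token distributions from several policies.
import numpy as np

def log_softmax(logits):
    z = logits - logits.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def mix_policies(per_policy_logits, weights):
    """Weighted sum of per-policy log-probabilities, then renormalize."""
    logps = np.stack([log_softmax(l) for l in per_policy_logits])   # (k, vocab)
    mixed = np.tensordot(weights, logps, axes=1)                    # weighted sum
    return np.exp(log_softmax(mixed))                               # probabilities

helpful_logits = np.array([2.0, 0.5, -1.0, 0.0])    # toy vocabulary of 4 tokens
harmless_logits = np.array([0.0, 1.5, 0.5, -0.5])
mixed_probs = mix_policies([helpful_logits, harmless_logits],
                           weights=np.array([0.7, 0.3]))
print(mixed_probs.round(3))
```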

replace OASIS: Order-Augmented Strategy for Improved Code Search

Authors: Zuchen Gao, Zizheng Zhan, Xianming Li, Erxin Yu, Ziqi Zhan, Haotian Zhang, Bin Chen, Yuqun Zhang, Jing Li

Abstract: Code embeddings capture the semantic representations of code and are crucial for various code-related large language model (LLM) applications, such as code search. Previous training primarily relies on optimizing the InfoNCE loss by comparing positive natural language (NL)-code pairs with in-batch negatives. However, due to the sparse nature of code contexts, training solely by comparing the major differences between positive and negative pairs may fail to capture deeper semantic nuances. To address this issue, we propose a novel order-augmented strategy for improved code search (OASIS). It leverages order-based similarity labels to train models to capture subtle differences in similarity among negative pairs. Extensive benchmark evaluations demonstrate that our OASIS model significantly outperforms previous state-of-the-art models focusing solely on major positive-negative differences. It underscores the value of exploiting subtle differences among negative pairs with order labels for effective code embedding training.

replace Synthesizing Privacy-Preserving Text Data via Finetuning without Finetuning Billion-Scale LLMs

Authors: Bowen Tan, Zheng Xu, Eric Xing, Zhiting Hu, Shanshan Wu

Abstract: Synthetic data offers a promising path to train models while preserving data privacy. Differentially private (DP) finetuning of large language models (LLMs) as data generator is effective, but is impractical when computation resources are limited. Meanwhile, prompt-based methods such as private evolution depend heavily on the manual prompts, and ineffectively use private information in their iterative data selection process. To overcome these limitations, we propose CTCL (Data Synthesis with ConTrollability and CLustering), a novel framework for generating privacy-preserving synthetic data without extensive prompt engineering or billion-scale LLM finetuning. CTCL pretrains a lightweight 140M conditional generator and a clustering-based topic model on large-scale public data. To further adapt to the private domain, the generator is DP finetuned on private data for fine-grained textual information, while the topic model extracts a DP histogram representing distributional information. The DP generator then samples according to the DP histogram to synthesize a desired number of data examples. Evaluation across five diverse domains demonstrates the effectiveness of our framework, particularly in the strong privacy regime. Systematic ablation validates the design of each framework component and highlights the scalability of our approach.

replace A Multi-Stage Framework with Taxonomy-Guided Reasoning for Occupation Classification Using Large Language Models

Authors: Palakorn Achananuparp, Ee-Peng Lim, Yao Lu

Abstract: Automatically annotating job data with standardized occupations from taxonomies, known as occupation classification, is crucial for labor market analysis. However, this task is often hindered by data scarcity and the challenges of manual annotations. While large language models (LLMs) hold promise due to their extensive world knowledge and in-context learning capabilities, their effectiveness depends on their knowledge of occupational taxonomies, which remains unclear. In this study, we assess the ability of LLMs to generate precise taxonomic entities from a taxonomy, highlighting their limitations, especially for smaller models. To address these challenges, we propose a multi-stage framework consisting of inference, retrieval, and reranking stages, which integrates taxonomy-guided reasoning examples to enhance performance by aligning outputs with taxonomic knowledge. Evaluations on a large-scale dataset show that our framework not only enhances occupation and skill classification tasks, but also provides a cost-effective alternative to frontier models like GPT-4o, significantly reducing computational costs while maintaining strong performance. This makes it a practical and scalable solution for occupation classification and related tasks across LLMs.

replace CoDet-M4: Detecting Machine-Generated Code in Multi-Lingual, Multi-Generator and Multi-Domain Settings

Authors: Daniil Orel, Dilshod Azizov, Preslav Nakov

Abstract: Large language models (LLMs) have revolutionized code generation, automating programming with remarkable efficiency. However, these advancements challenge programming skills, ethics, and assessment integrity, making the detection of LLM-generated code essential for maintaining accountability and standards. While there has been some research on this problem, it generally lacks domain coverage and robustness, and covers only a small number of programming languages. To this end, we propose a framework capable of distinguishing between human- and LLM-written code across multiple programming languages, code generators, and domains. We use a large-scale dataset from renowned platforms and LLM-based code generators, alongside rigorous data quality checks, feature engineering, and a comparative analysis evaluating traditional machine learning models, pre-trained language models (PLMs), and LLMs for code detection. We perform an evaluation on out-of-domain scenarios, such as detecting the authorship and hybrid authorship of generated code and generalizing to unseen models, domains, and programming languages. Moreover, our extensive experiments show that our framework effectively distinguishes human- from LLM-written code and sets a new benchmark for this task.

replace Secure Multifaceted-RAG for Enterprise: Hybrid Knowledge Retrieval with Security Filtering

Authors: Grace Byun, Shinsun Lee, Nayoung Choi, Jinho D. Choi

Abstract: Existing Retrieval-Augmented Generation (RAG) systems face challenges in enterprise settings due to limited retrieval scope and data security risks. When relevant internal documents are unavailable, the system struggles to generate accurate and complete responses. Additionally, using closed-source Large Language Models (LLMs) raises concerns about exposing proprietary information. To address these issues, we propose the Secure Multifaceted-RAG (SecMulti-RAG) framework, which retrieves not only from internal documents but also from two supplementary sources: pre-generated expert knowledge for anticipated queries and on-demand external LLM-generated knowledge. To mitigate security risks, we adopt a local open-source generator and selectively utilize external LLMs only when prompts are deemed safe by a filtering mechanism. This approach enhances completeness, prevents data leakage, and reduces costs. In our evaluation on a report generation task in the automotive industry, SecMulti-RAG significantly outperforms traditional RAG - achieving 79.3 to 91.9 percent win rates across correctness, richness, and helpfulness in LLM-based evaluation, and 56.3 to 70.4 percent in human evaluation. This highlights SecMulti-RAG as a practical and secure solution for enterprise RAG.

replace ConTextual: Improving Clinical Text Summarization in LLMs with Context-preserving Token Filtering and Knowledge Graphs

Authors: Fahmida Liza Piya, Rahmatollah Beheshti

Abstract: Unstructured clinical data can serve as a unique and rich source of information that can meaningfully inform clinical practice. Extracting the most pertinent context from such data is critical for exploiting its true potential toward optimal and timely decision-making in patient care. While prior research has explored various methods for clinical text summarization, most prior studies either process all input tokens uniformly or rely on heuristic-based filters, which can overlook nuanced clinical cues and fail to prioritize information critical for decision-making. In this study, we propose ConTextual, a novel framework that integrates a Context-Preserving Token Filtering method with a Domain-Specific Knowledge Graph (KG) for contextual augmentation. By preserving context-specific important tokens and enriching them with structured knowledge, ConTextual improves both linguistic coherence and clinical fidelity. Our extensive empirical evaluations on two public benchmark datasets demonstrate that ConTextual consistently outperforms other baselines. Our proposed approach highlights the complementary role of token-level filtering and structured retrieval in enhancing both linguistic and clinical integrity, as well as offering a scalable solution for improving precision in clinical text generation.

replace MAC-Tuning: LLM Multi-Compositional Problem Reasoning with Enhanced Knowledge Boundary Awareness

Authors: Junsheng Huang, Zhitao He, Yucheng Huang, Sandeep Polisetty, Qingyun Wang, May Fung

Abstract: With the widespread application of large language models (LLMs), the issue of generating non-existing facts, known as hallucination, has garnered increasing attention. Previous research in enhancing LLM confidence estimation mainly focuses on the single problem setting. However, LLM awareness of its internal parameterized knowledge boundary under the more challenging multi-problem setting, which requires answering multiple problems accurately simultaneously, remains underexplored. To bridge this gap, we introduce a novel method, Multiple Answers and Confidence Stepwise Tuning (MAC-Tuning), that separates the learning of answer prediction and confidence estimation during fine-tuning on instruction data. Extensive experiments demonstrate that our method outperforms baselines by up to 25% in average precision.

replace Code2Logic: Game-Code-Driven Data Synthesis for Enhancing VLMs General Reasoning

Authors: Jingqi Tong, Jixin Tang, Hangcheng Li, Yurong Mou, Ming Zhang, Jun Zhao, Yanbo Wen, Fan Song, Jiahao Zhan, Yuyang Lu, Chaoran Tao, Zhiyuan Guo, Jizhou Yu, Tianhao Cheng, Changhao Jiang, Zhen Wang, Tao Liang, Zhihui Fei, Mingyang Wan, Guojun Ma, Weifeng Ge, Guanhua Chen, Tao Gui, Xipeng Qiu, Qi Zhang, Xuanjing Huang

Abstract: Visual-language Chain-of-Thought (CoT) data resources are relatively scarce compared to text-only counterparts, limiting the improvement of reasoning capabilities in Vision Language Models (VLMs). However, high-quality vision-language reasoning data is expensive and labor-intensive to annotate. To address this issue, we leverage a promising resource: game code, which naturally contains logical structures and state transition processes. Therefore, we propose Code2Logic, a novel game-code-driven approach for multimodal reasoning data synthesis. Our approach leverages Large Language Models (LLMs) to adapt game code, enabling automatic acquisition of reasoning processes and results through code execution. Using the Code2Logic approach, we developed the GameQA dataset to train and evaluate VLMs. GameQA is cost-effective and scalable, offers controllable difficulty gradation, and is diverse, spanning 30 games and 158 tasks. Surprisingly, despite training solely on game data, VLMs demonstrated out-of-domain generalization, with Qwen2.5-VL-7B improving performance by 2.33% across 7 diverse vision-language benchmarks. Our code, dataset and models are available at https://github.com/tongjingqi/Code2Logic.

URLs: https://github.com/tongjingqi/Code2Logic.

replace ContextQFormer: A New Context Modeling Method for Multi-Turn Multi-Modal Conversations

Authors: Yiming Lei, Zhizheng Yang, Zeming Liu, Haitao Leng, Shaoguo Liu, Tingting Gao, Qingjie Liu, Yunhong Wang

Abstract: Multi-modal large language models have demonstrated remarkable zero-shot abilities and powerful image-understanding capabilities. However, the existing open-source multi-modal models suffer from the weak capability of multi-turn interaction, especially for long contexts. To address the issue, we first introduce a context modeling module, termed ContextQFormer, which utilizes a memory block to enhance the representation of contextual information. Furthermore, to facilitate further research, we carefully build a new multi-turn multi-modal dialogue dataset (TMDialog) for pre-training, instruction-tuning, and evaluation, which will be open-sourced later. Compared with other multi-modal dialogue datasets, TMDialog contains longer conversations, which supports research on multi-turn multi-modal dialogue. In addition, ContextQFormer is compared with three baselines on TMDialog, and experimental results illustrate that ContextQFormer achieves an improvement of 2%-4% in available rate over baselines.

replace Critique-GRPO: Advancing LLM Reasoning with Natural Language and Numerical Feedback

Authors: Xiaoying Zhang, Hao Sun, Yipeng Zhang, Kaituo Feng, Chaochao Lu, Chao Yang, Helen Meng

Abstract: Recent advances in reinforcement learning (RL) with numerical feedback, such as scalar rewards, have significantly enhanced the complex reasoning capabilities of large language models (LLMs). Despite this success, we identify three key challenges encountered by RL with solely numerical feedback: performance plateaus, limited effectiveness of spontaneous self-reflection, and persistent failures. We then demonstrate that RL-finetuned models, even after exhibiting performance plateaus, can generate correct refinements on persistently failed problems by leveraging natural language feedback in the form of critiques. Building on this insight, we propose Critique-GRPO, an online RL framework that integrates both natural language and numerical feedback for effective policy optimization. Critique-GRPO enables LLMs to learn from initial responses and critique-guided self-refinements simultaneously while maintaining exploration. Additionally, we employ a shaping function to amplify learning from correct, especially unfamiliar, refinements and penalize incorrect ones. Extensive experiments with Qwen2.5-7B-Base, Qwen2.5-Math-7B-Base, and Qwen3-8B demonstrate that Critique-GRPO consistently outperforms supervised learning and RL-based fine-tuning methods across eight challenging mathematical, STEM, and general reasoning tasks, improving average pass@1 scores by approximately 4.4% and 3.8% on Qwen2.5-7B-Base and Qwen3-8B, respectively. Notably, Critique-GRPO enables effective self-improvement through self-critiquing and weak-to-strong generalization, achieving consistent gains over GRPO, such as 16.7% and 10.0% pass@1 improvements on AIME 2024, respectively.

replace MEM1: Learning to Synergize Memory and Reasoning for Efficient Long-Horizon Agents

Authors: Zijian Zhou, Ao Qu, Zhaoxuan Wu, Sunghwan Kim, Alok Prakash, Daniela Rus, Jinhua Zhao, Bryan Kian Hsiang Low, Paul Pu Liang

Abstract: Modern language agents must operate over long-horizon, multi-turn interactions, where they retrieve external information, adapt to observations, and answer interdependent queries. Yet, most LLM systems rely on full-context prompting, appending all past turns regardless of their relevance. This leads to unbounded memory growth, increased computational costs, and degraded reasoning performance on out-of-distribution input lengths. We introduce MEM1, an end-to-end reinforcement learning framework that enables agents to operate with constant memory across long multi-turn tasks. At each turn, MEM1 updates a compact shared internal state that jointly supports memory consolidation and reasoning. This state integrates prior memory with new observations from the environment while strategically discarding irrelevant or redundant information. To support training in more realistic and compositional settings, we propose a simple yet effective and scalable approach to constructing multi-turn environments by composing existing datasets into arbitrarily complex task sequences. Experiments across three domains, including internal retrieval QA, open-domain web QA, and multi-turn web shopping, show that MEM1-7B improves performance by 3.5x while reducing memory usage by 3.7x compared to Qwen2.5-14B-Instruct on a 16-objective multi-hop QA task, and generalizes beyond the training horizon. Our results demonstrate the promise of reasoning-driven memory consolidation as a scalable alternative to existing solutions for training long-horizon interactive agents, where both efficiency and performance are optimized.

replace Chain-of-Thought Prompting Obscures Hallucination Cues in Large Language Models: An Empirical Evaluation

Authors: Jiahao Cheng, Tiancheng Su, Jia Yuan, Guoxiu He, Jiawei Liu, Xinqi Tao, Jingwen Xie, Huaxia Li

Abstract: Large Language Models (LLMs) often exhibit \textit{hallucinations}, generating factually incorrect or semantically irrelevant content in response to prompts. Chain-of-Thought (CoT) prompting can mitigate hallucinations by encouraging step-by-step reasoning, but its impact on hallucination detection remains underexplored. To bridge this gap, we conduct a systematic empirical evaluation. We begin with a pilot experiment, revealing that CoT reasoning significantly affects the LLM's internal states and token probability distributions. Building on this, we evaluate the impact of various CoT prompting methods on mainstream hallucination detection methods across both instruction-tuned and reasoning-oriented LLMs. Specifically, we examine three key dimensions: changes in hallucination score distributions, variations in detection accuracy, and shifts in detection confidence. Our findings show that while CoT prompting helps reduce hallucination frequency, it also tends to obscure critical signals used for detection, impairing the effectiveness of various detection methods. Our study highlights an overlooked trade-off in the use of reasoning. Code is publicly available at: https://anonymous.4open.science/r/cot-hallu-detect.

URLs: https://anonymous.4open.science/r/cot-hallu-detect.

replace ReCode: Updating Code API Knowledge with Reinforcement Learning

Authors: Haoze Wu, Yunzhi Yao, Wenhao Yu, Huajun Chen, Ningyu Zhang

Abstract: Large Language Models (LLMs) exhibit remarkable code generation capabilities but falter when adapting to frequent updates in external library APIs. This critical limitation, stemming from reliance on outdated API knowledge from their training data, even with access to current documentation, impedes reliable code generation in dynamic environments. To tackle this issue, we propose ReCode (rule-based Reinforcement learning for Code Update), a novel framework that mimics human programmer adaptation to API changes. Specifically, we construct a dataset of approximately 2,000 data entries to train the LLMs to perform version migration based on updated information. Then, we introduce a modified string similarity metric for code evaluation as the reward for reinforcement learning. Our experiments demonstrate that ReCode substantially boosts LLMs' code generation performance in dynamic API scenarios, especially on the unseen CodeUpdateArena task. Crucially, compared to supervised fine-tuning, ReCode has less impact on LLMs' general code generation abilities. We apply ReCode to various LLMs and reinforcement learning algorithms (GRPO and DAPO), all achieving consistent improvements. Notably, after training, Qwen2.5-Coder-7B outperforms the 32B-parameter code instruction-tuned model and the reasoning model with the same architecture. Code is available at https://github.com/zjunlp/ReCode.

URLs: https://github.com/zjunlp/ReCode.
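
A reward in the spirit of the abstract's string-similarity metric can be illustrated with a plain sequence-similarity ratio between the generated snippet and the reference migration. ReCode's actual metric is a modification of this idea; only the generic version is shown, and the example snippets are invented.

```python
# Character-level string similarity as an RL reward for code migration.
import difflib

def code_similarity_reward(generated: str, reference: str) -> float:
    """Return a reward in [0, 1] from character-level string similarity."""
    return difflib.SequenceMatcher(None, generated, reference).ratio()

reference = "df = pd.read_csv(path, on_bad_lines='skip')"
candidate = "df = pd.read_csv(path, error_bad_lines=False)"   # outdated API usage
print(round(code_similarity_reward(candidate, reference), 3))
```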

replace VIDEE: Visual and Interactive Decomposition, Execution, and Evaluation of Text Analytics with Intelligent Agents

Authors: Sam Yu-Te Lee, Chengyang Ji, Shicheng Wen, Lifu Huang, Dongyu Liu, Kwan-Liu Ma

Abstract: Text analytics has traditionally required specialized knowledge in Natural Language Processing (NLP) or text analysis, which presents a barrier for entry-level analysts. Recent advances in large language models (LLMs) have changed the landscape of NLP by enabling more accessible and automated text analysis (e.g., topic detection, summarization, information extraction, etc.). We introduce VIDEE, a system that supports entry-level data analysts in conducting advanced text analytics with intelligent agents. VIDEE instantiates a human-agent collaboration workflow consisting of three stages: (1) Decomposition, which incorporates a human-in-the-loop Monte-Carlo Tree Search algorithm to support generative reasoning with human feedback, (2) Execution, which generates an executable text analytics pipeline, and (3) Evaluation, which integrates LLM-based evaluation and visualizations to support user validation of execution results. We conduct two quantitative experiments to evaluate VIDEE's effectiveness and analyze common agent errors. A user study involving participants with varying levels of NLP and text analytics experience -- from none to expert -- demonstrates the system's usability and reveals distinct user behavior patterns. The findings identify design implications for human-agent collaboration, validate the practical utility of VIDEE for non-expert users, and inform future improvements to intelligent text analytics systems.

replace Learning to Translate Ambiguous Terminology by Preference Optimization on Post-Edits

Authors: Nathaniel Berger, Johannes Eschbach-Dymanus, Miriam Exel, Matthias Huck, Stefan Riezler

Abstract: In real world translation scenarios, terminology is rarely one-to-one. Instead, multiple valid translations may appear in a terminology dictionary, but correctness of a translation depends on corporate style guides and context. This can be challenging for neural machine translation (NMT) systems. Luckily, in a corporate context, many examples of human post-edits of valid but incorrect terminology exist. The goal of this work is to learn how to disambiguate our terminology based on these corrections. Our approach is based on preference optimization, using the term post-edit as the knowledge to be preferred. While previous work had to rely on unambiguous translation dictionaries to set hard constraints during decoding, or to add soft constraints in the input, our framework requires neither one-to-one dictionaries nor human intervention at decoding time. We report results on English-German post-edited data and find that the optimal combination of supervised fine-tuning and preference optimization, with both term-specific and full sequence objectives, yields statistically significant improvements in term accuracy over a strong NMT baseline without significant losses in COMET score. Additionally, we release test sets from our post-edited data and terminology dictionary.
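
A generic sketch of sequence-level preference optimization with a post-edit as the preferred output is shown below: a DPO-style loss raises the post-edited translation's likelihood relative to the original MT output containing the wrong term. The log-probabilities are stand-ins, and the paper additionally combines supervised fine-tuning with term-specific objectives, which are not shown.

```python
# DPO-style sequence-level preference loss with the post-edit as the winner.
import numpy as np

def dpo_loss(logp_preferred, logp_dispreferred,
             ref_logp_preferred, ref_logp_dispreferred, beta=0.1):
    """Negative log-sigmoid of the beta-scaled preference margin."""
    margin = beta * ((logp_preferred - ref_logp_preferred)
                     - (logp_dispreferred - ref_logp_dispreferred))
    return float(-np.log(1.0 / (1.0 + np.exp(-margin))))

# Toy numbers: the policy already prefers the post-edited translation slightly.
print(round(dpo_loss(logp_preferred=-12.0, logp_dispreferred=-11.5,
                     ref_logp_preferred=-13.0, ref_logp_dispreferred=-11.0), 4))
```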

replace Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

Authors: Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, Luke Marris, Sam Petulla, Colin Gaffney, Asaf Aharoni, Nathan Lintz, Tiago Cardal Pais, Henrik Jacobsson, Idan Szpektor, Nan-Jiang Jiang, Krishna Haridasan, et al. (additional authors not shown)
Kharrat (Cindy), Michal Yarom (Cindy), Rachel Saputro (Cindy), Jannis Bulian (Cindy), Ben Caine (Cindy), Ji Liu (Cindy), Abbas Abdolmaleki (Cindy), Shariq Iqbal (Cindy), Tautvydas Misiunas (Cindy), Mikhail Sirotenko (Cindy), Shefali Garg (Cindy), Guy Bensky (Cindy), Huan Gui (Cindy), Xuezhi Wang (Cindy), Raphael Koster (Cindy), Mike Bernico (Cindy), Da Huang (Cindy), Romal Thoppilan (Cindy), Trevor Cohn (Cindy), Ben Golan (Cindy), Wenlei Zhou (Cindy), Andrew Rosenberg (Cindy), Markus Freitag (Cindy), Tynan Gangwani (Cindy), Vincent Tsang (Cindy), Anand Shukla (Cindy), Xiaoqi Ren (Cindy), Minh Giang (Cindy), Chi Zou (Cindy), Andre Elisseeff (Cindy), Charline Le Lan (Cindy), Dheeru Dua (Cindy), Shuba Lall (Cindy), Pranav Shyam (Cindy), Frankie Garcia (Cindy), Sarah Nguyen (Cindy), Michael Guzman (Cindy), AJ Maschinot (Cindy), Marcello Maggioni (Cindy), Ming-Wei Chang (Cindy), Karol Gregor (Cindy), Lotte Weerts (Cindy), Kumaran Venkatesan (Cindy), Bogdan Damoc (Cindy), Leon Liu (Cindy), Jan Wassenberg (Cindy), Lewis Ho (Cindy), Becca Roelofs (Cindy), Majid Hadian (Cindy), Fran\c{c}ois-Xavier Aubet (Cindy), Yu Liang (Cindy), Sami Lachgar (Cindy), Danny Karmon (Cindy), Yong Cheng (Cindy), Amelio V\'azquez-Reina (Cindy), Angie Chen (Cindy), Zhuyun Dai (Cindy), Andy Brock (Cindy), Shubham Agrawal (Cindy), Chenxi Pang (Cindy), Peter Garst (Cindy), Mariella Sanchez-Vargas (Cindy), Ivor Rendulic (Cindy), Aditya Ayyar (Cindy), Andrija Ra\v{z}natovi\'c (Cindy), Olivia Ma (Cindy), Roopali Vij (Cindy), Neha Sharma (Cindy), Ashwin Balakrishna (Cindy), Bingyuan Liu (Cindy), Ian Mackinnon (Cindy), Sorin Baltateanu (Cindy), Petra Poklukar (Cindy), Gabriel Ibagon (Cindy), Colin Ji (Cindy), Hongyang Jiao (Cindy), Isaac Noble (Cindy), Wojciech Stokowiec (Cindy), Zhihao Li (Cindy), Jeff Dean (Cindy), David Lindner (Cindy), Mark Omernick (Cindy), Kristen Chiafullo (Cindy), Mason Dimarco (Cindy), Vitor Rodrigues (Cindy), Vittorio Selo (Cindy), Garrett Honke (Cindy), Xintian (Cindy), Wu (Lucas), Wei He (Lucas), Adam Hillier (Lucas), Anhad Mohananey (Lucas), Vihari Piratla (Lucas), Chang Ye (Lucas), Chase Malik (Lucas), Sebastian Riedel (Lucas), Samuel Albanie (Lucas), Zi Yang (Lucas), Kenny Vassigh (Lucas), Maria Bauza (Lucas), Sheng Li (Lucas), Yiqing Tao (Lucas), Nevan Wichers (Lucas), Andrii Maksai (Lucas), Abe Ittycheriah (Lucas), Ross Mcilroy (Lucas), Bryan Seybold (Lucas), Noah Goodman (Lucas), Romina Datta (Lucas), Steven M. 
Hernandez (Lucas), Tian Shi (Lucas), Yony Kochinski (Lucas), Anna Bulanova (Lucas), Ken Franko (Lucas), Mikita Sazanovich (Lucas), Nicholas FitzGerald (Lucas), Praneeth Kacham (Lucas), Shubha Srinivas Raghvendra (Lucas), Vincent Hellendoorn (Lucas), Alexander Grushetsky (Lucas), Julian Salazar (Lucas), Angeliki Lazaridou (Lucas), Jason Chang (Lucas), Jan-Thorsten Peter (Lucas), Sushant Kafle (Lucas), Yann Dauphin (Lucas), Abhishek Rao (Lucas), Filippo Graziano (Lucas), Izhak Shafran (Lucas), Yuguo Liao (Lucas), Tianli Ding (Lucas), Geng Yan (Lucas), Grace Chu (Lucas), Zhao Fu (Lucas), Vincent Roulet (Lucas), Gabriel Rasskin (Lucas), Duncan Williams (Lucas), Shahar Drath (Lucas), Alex Mossin (Lucas), Raphael Hoffmann (Lucas), Jordi Orbay (Lucas), Francesco Bertolini (Lucas), Hila Sheftel (Lucas), Justin Chiu (Lucas), Siyang Xue (Lucas), Yuheng Kuang (Lucas), Ferjad Naeem (Lucas), Swaroop Nath (Lucas), Nana Nti (Lucas), Phil Culliton (Lucas), Kashyap Krishnakumar (Lucas), Michael Isard (Lucas), Pei Sun (Lucas), Ayan Chakrabarti (Lucas), Nathan Clement (Lucas), Regev Cohen (Lucas), Arissa Wongpanich (Lucas), GS Oh (Lucas), Ashwin Murthy (Lucas), Hao Zheng (Lucas), Jessica Hamrick (Lucas), Oskar Bunyan (Lucas), Suhas Ganesh (Lucas), Nitish Gupta (Lucas), Roy Frostig (Lucas), John Wieting (Lucas), Yury Malkov (Lucas), Pierre Marcenac (Lucas), Zhixin (Lucas), Lai, Xiaodan Tang, Mohammad Saleh, Fedir Zubach, Chinmay Kulkarni, Huanjie Zhou, Vicky Zayats, Nan Ding, Anshuman Tripathi, Arijit Pramanik, Patrik Zochbauer, Harish Ganapathy, Vedant Misra, Zach Behrman, Hugo Vallet, Mingyang Zhang, Mukund Sridhar, Ye Jin, Mohammad Babaeizadeh, Siim P\~oder, Megha Goel, Divya Jain, Tajwar Nasir, Shubham Mittal, Tim Dozat, Diego Ardila, Aliaksei Severyn, Fabio Pardo, Sammy Jerome, Siyang Qin, Louis Rouillard, Amir Yazdanbakhsh, Zizhao Zhang, Shivani Agrawal, Kaushik Shivakumar, Caden Lu, Praveen Kallakuri, Rachita Chhaparia, Kanishka Rao, Charles Kwong, Asya Fadeeva, Shitij Nigam, Yan Virin, Yuan Zhang, Balaji Venkatraman, Beliz Gunel, Marc Wilson, Huiyu Wang, Abhinav Gupta, Xiaowei Xu, Adrien Ali Ta\"iga, Kareem Mohamed, Doug Fritz, Daniel Rodriguez, Zoubin Ghahramani, Harry Askham, Lior Belenki, James Zhao, Rahul Gupta, Krzysztof Jastrz\k{e}bski, Takahiro Kosakai, Kaan Katircioglu, Jon Schneider, Rina Panigrahy, Konstantinos Bousmalis, Peter Grabowski, Prajit Ramachandran, Chaitra Hegde, Mihaela Rosca, Angelo Scorza Scarpati, Kyriakos Axiotis, Ying Xu, Zach Gleicher, Assaf Hurwitz Michaely, Mandar Sharma, Sanil Jain, Christoph Hirnschall, Tal Marian, Xuhui Jia, Kevin Mather, Kilol Gupta, Linhai Qiu, Nigamaa Nayakanti, Lucian Ionita, Steven Zheng, Lucia Loher, Kurt Shuster, Igor Petrovski, Roshan Sharma, Rahma Chaabouni, Angel Yeh, James An, Arushi Gupta, Steven Schwarcz, Seher Ellis, Sam Conway-Rahman, Javier Snaider, Alex Zhai, James Atwood, Daniel Golovin, Liqian Peng, Te I, Vivian Xia, Salvatore Scellato, Mahan Malihi, Arthur Bra\v{z}inskas, Vlad-Doru Ion, Younghoon Jun, James Swirhun, Soroosh Mariooryad, Jiao Sun, Steve Chien, Rey Coaguila, Ariel Brand, Yi Gao, Tom Kwiatkowski, Roee Aharoni, Cheng-Chun Lee, Mislav \v{Z}ani\'c, Yichi Zhang, Dan Ethier, Vitaly Nikolaev, Pranav Nair, Yoav Ben Shalom, Hen Fitoussi, Jai Gupta, Hongbin Liu, Dee Cattle, Tolga Bolukbasi, Ben Murdoch, Fantine Huot, Yin Li, Chris Hahn, Urvashi Khandelwal, Frederik Benzing, Arthur Conmy, Andrey Simanovsky, Fran\c{c}oise Beaufays, Eugene Weinstein, Tongzhou Chen, Luke Leonhard, Bhuvana Ramabhadran

Abstract: In this report, we introduce the Gemini 2.X model family: Gemini 2.5 Pro and Gemini 2.5 Flash, as well as our earlier Gemini 2.0 Flash and Flash-Lite models. Gemini 2.5 Pro is our most capable model yet, achieving SoTA performance on frontier coding and reasoning benchmarks. Beyond its coding and reasoning skills, Gemini 2.5 Pro is a thinking model that excels at multimodal understanding, and it can now process up to 3 hours of video content. Its combination of long-context, multimodal, and reasoning capabilities can be used to unlock new agentic workflows. Gemini 2.5 Flash provides excellent reasoning abilities at a fraction of the compute and latency requirements, while Gemini 2.0 Flash and Flash-Lite provide high performance at low latency and cost. Taken together, the Gemini 2.X model generation spans the full Pareto frontier of model capability versus cost, allowing users to explore the boundaries of what is possible with complex agentic problem solving.

replace Prompt Perturbations Reveal Human-Like Biases in LLM Survey Responses

Authors: Jens Rupprecht, Georg Ahnert, Markus Strohmaier

Abstract: Large Language Models (LLMs) are increasingly used as proxies for human subjects in social science surveys, but their reliability and susceptibility to known response biases are poorly understood. This paper investigates the response robustness of LLMs in normative survey contexts - we test nine diverse LLMs on questions from the World Values Survey (WVS), applying a comprehensive set of 11 perturbations to both question phrasing and answer option structure, resulting in over 167,000 simulated interviews. In doing so, we not only reveal LLMs' vulnerabilities to perturbations but also show that all tested models exhibit a consistent recency bias varying in intensity, disproportionately favoring the last-presented answer option. While larger models are generally more robust, all models remain sensitive to semantic variations like paraphrasing and to combined perturbations. By applying a set of perturbations, we reveal that LLMs partially align with survey response biases identified in humans. This underscores the critical importance of prompt design and robustness testing when using LLMs to generate synthetic survey data.
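
As a rough illustration of the kind of setup the abstract describes (the paper's actual perturbation set, prompts, and items are not reproduced here), the sketch below builds a WVS-style forced-choice prompt under an original and a reversed answer-option order and measures how often the last-presented option is chosen, i.e. a simple recency-bias rate. The question text and answers are hypothetical.

```python
# Minimal sketch (not the paper's code) of one perturbation from the family the
# abstract describes: reversing the order of answer options and checking whether
# a model's choices shift toward the last-presented option (recency bias).

def build_prompt(question: str, options: list[str]) -> str:
    """Render a WVS-style item as a forced-choice prompt."""
    lines = [question, "Answer with exactly one of the following options:"]
    lines += [f"- {opt}" for opt in options]
    return "\n".join(lines)

def last_option_rate(choices: list[str], presented_orders: list[list[str]]) -> float:
    """Fraction of responses that picked whichever option was presented last."""
    hits = sum(choice == order[-1] for choice, order in zip(choices, presented_orders))
    return hits / len(choices)

# Hypothetical item and simulated responses, for illustration only.
question = "How important is family in your life?"
options = ["Very important", "Rather important", "Not very important", "Not at all important"]

orders = [options, list(reversed(options))]          # original vs. reversed presentation
prompts = [build_prompt(question, o) for o in orders]

# Suppose a model answered each prompt once (placeholder answers):
answers = ["Not at all important", "Very important"]  # both equal the last-presented option
print(f"Recency rate: {last_option_rate(answers, orders):.2f}")
```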

replace What Factors Affect LLMs and RLLMs in Financial Question Answering?

Authors: Peng Wang, Xuesi Hu, Jiageng Wu, Yuntao Zou, Qiancheng Zhang, Dagang Li

Abstract: Recently, the development of large language models (LLMs) and reasoning large language models (RLLMs) has gained considerable attention from many researchers. RLLMs enhance the reasoning capabilities of LLMs through Long Chain-of-Thought (Long CoT) processes, significantly improving the performance of LLMs in addressing complex problems. However, few works systematically explore which methods can fully unlock the performance of LLMs and RLLMs within the financial domain. To investigate the impact of various methods on LLMs and RLLMs, we utilize five LLMs and three RLLMs to assess the effects of prompting methods, agentic frameworks, and multilingual alignment methods on financial question-answering tasks. Our research findings indicate: (1) Current prompting methods and agent frameworks enhance the performance of LLMs in financial question answering by simulating Long CoT; (2) RLLMs possess inherent Long CoT capabilities, which limits the effectiveness of conventional methods in further enhancing their performance; (3) Current advanced multilingual alignment methods primarily improve the multilingual performance of LLMs by extending the reasoning length, which yields minimal benefits for RLLMs. We hope that this study can serve as an important reference for LLMs and RLLMs in the field of financial question answering.

replace SEALGuard: Safeguarding the Multilingual Conversations in Southeast Asian Languages for LLM Software Systems

Authors: Wenliang Shan, Michael Fu, Rui Yang, Chakkrit Tantithamthavorn

Abstract: Safety alignment is critical for LLM-powered systems. While recent LLM-powered guardrail approaches such as LlamaGuard achieve high detection accuracy of unsafe inputs written in English (e.g., ``How to create a bomb?''), they struggle with multilingual unsafe inputs. This limitation leaves LLM systems vulnerable to unsafe and jailbreak prompts written in low-resource languages such as those in Southeast Asia. This paper introduces SEALGuard, a multilingual guardrail designed to improve the safety alignment across diverse languages. It aims to address the multilingual safety alignment gap of existing guardrails and ensure effective filtering of unsafe and jailbreak prompts in LLM-powered systems. We adapt a general-purpose multilingual language model into a multilingual guardrail using low-rank adaptation (LoRA). We construct SEALSBench, a large-scale multilingual safety alignment dataset containing over 260,000 prompts in ten languages, including safe, unsafe, and jailbreak cases. We evaluate SEALGuard against state-of-the-art guardrails such as LlamaGuard on this benchmark. Our findings show that multilingual unsafe and jailbreak prompts substantially degrade the performance of the state-of-the-art LlamaGuard, which experiences a drop in Defense Success Rate (DSR) by 9% and 18%, respectively, compared to its performance on English-only prompts. In contrast, SEALGuard outperforms existing guardrails in detecting multilingual unsafe and jailbreak prompts, improving DSR by 48% over LlamaGuard and achieving the best DSR, precision, and F1-score. Our ablation study further reveals the contributions of adaptation strategies and model size to the overall performance of SEALGuard. We release our pre-trained model and benchmark at https://github.com/awsm-research/SEALGuard to support further research.

URLs: https://github.com/awsm-research/SEALGuard
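
The abstract states that a general-purpose multilingual language model is adapted into a guardrail with low-rank adaptation (LoRA). A minimal sketch of that adaptation step using the Hugging Face `peft` library is shown below; the backbone name, target modules, label scheme, and hyperparameters are assumptions for illustration, not SEALGuard's released configuration.

```python
# Minimal sketch, not SEALGuard's released code: adapting a multilingual encoder
# into a safe/unsafe classifier with LoRA via the Hugging Face `peft` library.
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from peft import LoraConfig, get_peft_model

base = "xlm-roberta-base"  # stand-in multilingual backbone (assumption)
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForSequenceClassification.from_pretrained(
    base, num_labels=3  # e.g. safe / unsafe / jailbreak
)

lora_cfg = LoraConfig(
    r=16,                               # low-rank update dimension
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["query", "value"],  # attention projections in XLM-R
    task_type="SEQ_CLS",
)
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()      # only the LoRA adapters are trainable

# Fine-tuning on a multilingual safety dataset (SEALSBench-style prompts with
# labels) would then proceed with a standard transformers Trainer loop.
```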

replace A Comparative Approach to Assessing Linguistic Creativity of Large Language Models and Humans

Authors: Anca Dinu, Andra-Maria Florescu, Alina Resceanu

Abstract: The following paper introduces a general linguistic creativity test for humans and Large Language Models (LLMs). The test consists of various tasks aimed at assessing their ability to generate new original words and phrases based on word formation processes (derivation and compounding) and on metaphorical language use. We administered the test to 24 humans and to an equal number of LLMs, and we automatically evaluated their answers using the OCSAI tool for three criteria: Originality, Elaboration, and Flexibility. The results show that LLMs not only outperformed humans on all the assessed criteria, but also did better in six out of the eight test tasks. We then computed the uniqueness of the individual answers, which showed some minor differences between humans and LLMs. Finally, we performed a short manual analysis of the dataset, which revealed that humans are more inclined towards E(extending)-creativity, while LLMs favor F(ixed)-creativity.

replace-cross GUI Test Migration via Abstraction and Concretization

Authors: Yakun Zhang, Chen Liu, Xiaofei Xie, Yun Lin, Jin Song Dong, Dan Hao, Lu Zhang

Abstract: GUI test migration aims to produce test cases with events and assertions to test specific functionalities of a target app. Existing migration approaches typically focus on the widget-mapping paradigm that maps widgets from source apps to target apps. However, since different apps may implement the same functionality in different ways, direct mapping may result in incomplete or buggy test cases, thus significantly impacting the effectiveness of testing target functionality and the practical applicability of migration approaches. In this paper, we propose a new migration paradigm (i.e., the abstraction-concretization paradigm) that first abstracts the test logic for the target functionality and then utilizes this logic to generate the concrete GUI test case. Furthermore, we introduce MACdroid, the first approach that migrates GUI test cases based on this paradigm. Specifically, we propose an abstraction technique that utilizes source test cases from source apps targeting the same functionality to extract a general test logic for that functionality. Then, we propose a concretization technique that utilizes the general test logic to guide an LLM in generating the corresponding GUI test case (including events and assertions) for the target app. We evaluate MACdroid on two widely-used datasets (including 31 apps, 34 functionalities, and 123 test cases). On the FrUITeR dataset, the test cases generated by MACdroid successfully test 64% of the target functionalities, improving the baselines by 191%. On the Lin dataset, MACdroid successfully tests 75% of the target functionalities, outperforming the baselines by 42%. These results underscore the effectiveness of MACdroid in GUI test migration.
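
To make the abstraction-concretization paradigm described above concrete, the sketch below shows a hedged, two-stage prompting flow: first distill a widget-agnostic test logic from the source-app test cases, then use it together with the target app's GUI state to prompt for a concrete test case. The prompt wording and the `call_llm` stub are assumptions, not MACdroid's actual implementation.

```python
# Illustrative sketch of a two-stage abstraction-concretization flow; prompts
# and the LLM stub are assumptions, not MACdroid's implementation.

def call_llm(prompt: str) -> str:
    """Placeholder for an LLM call (e.g., an API client)."""
    return "<model output>"

def abstract_test_logic(source_tests: list[str], functionality: str) -> str:
    """Stage 1: distill a widget-agnostic test logic from source-app test cases."""
    prompt = (
        f"The following GUI test cases all exercise '{functionality}':\n\n"
        + "\n\n".join(source_tests)
        + "\n\nDescribe, step by step and without app-specific widget names, "
          "the general test logic (events and assertions) they share."
    )
    return call_llm(prompt)

def concretize_for_target(test_logic: str, target_gui_dump: str) -> str:
    """Stage 2: turn the general logic into a concrete test for the target app."""
    prompt = (
        f"General test logic:\n{test_logic}\n\n"
        f"Current GUI hierarchy of the target app:\n{target_gui_dump}\n\n"
        "Generate the concrete GUI test case (including events and assertions) "
        "that realizes this logic on the target app."
    )
    return call_llm(prompt)
```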

replace-cross LoRA Done RITE: Robust Invariant Transformation Equilibration for LoRA Optimization

Authors: Jui-Nan Yen, Si Si, Zhao Meng, Felix Yu, Sai Surya Duvvuri, Inderjit S. Dhillon, Cho-Jui Hsieh, Sanjiv Kumar

Abstract: Low-rank adaptation (LoRA) is a widely used parameter-efficient finetuning method for LLMs that reduces memory requirements. However, current LoRA optimizers lack transformation invariance, meaning the actual updates to the weights depend on how the two LoRA factors are scaled or rotated. This deficiency leads to inefficient learning and sub-optimal solutions in practice. This paper introduces LoRA-RITE, a novel adaptive matrix preconditioning method for LoRA optimization, which can achieve transformation invariance and remain computationally efficient. We provide theoretical analysis to demonstrate the benefit of our method and conduct experiments on various LLM tasks with different models including Gemma 2B, 7B, and mT5-XXL. The results demonstrate consistent improvements over existing optimizers. For example, replacing Adam with LoRA-RITE during LoRA fine-tuning of Gemma-2B yielded a 4.6\% accuracy gain on Super-Natural Instructions and a 3.5\% accuracy gain across four other LLM benchmarks (HellaSwag, ArcChallenge, GSM8K, OpenBookQA).
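
The transformation-invariance problem the abstract refers to can be shown with a few lines of NumPy: rescaling the two LoRA factors leaves the effective weight W = BA unchanged, yet a plain gradient step on the factors produces a different change in W depending on the rescaling. This is only a numeric illustration of the problem, not the LoRA-RITE preconditioner itself.

```python
# Numeric sketch of the invariance problem: rescaling the LoRA factors leaves
# W = B @ A unchanged, but a plain SGD step on the factors changes W differently.
import numpy as np

rng = np.random.default_rng(0)
d, r = 8, 2
B = rng.normal(size=(d, r))
A = rng.normal(size=(r, d))
G = rng.normal(size=(d, d))      # gradient of the loss w.r.t. W = B @ A
lr = 0.1

def delta_W(B, A, scale):
    """Effective change in W after one SGD step on the rescaled factors."""
    Bs, As = B * scale, A / scale          # W itself is unchanged by this rescaling
    grad_B = G @ As.T                      # dL/dB under the chain rule
    grad_A = Bs.T @ G                      # dL/dA
    return (Bs - lr * grad_B) @ (As - lr * grad_A) - Bs @ As

diff = np.linalg.norm(delta_W(B, A, 1.0) - delta_W(B, A, 10.0))
print(f"Update difference under a factor rescaling: {diff:.3f}")  # nonzero
```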

replace-cross Unified Triplet-Level Hallucination Evaluation for Large Vision-Language Models

Authors: Junjie Wu, Tsz Ting Chung, Kai Chen, Dit-Yan Yeung

Abstract: Despite their outstanding performance in vision-language reasoning, Large Vision-Language Models (LVLMs) might generate hallucinated contents that do not exist in the given image. Most existing LVLM hallucination benchmarks are constrained to evaluating object-related hallucinations. However, potential hallucinations about the relation between two objects, i.e., relation hallucination, remain under-investigated. To remedy that, we design a unified framework to measure object and relation hallucination in LVLMs simultaneously. The core idea of our framework is to evaluate hallucinations via (object, relation, object) triplets extracted from LVLMs' responses, making it easily generalizable to different vision-language tasks. Based on our framework, we further introduce Tri-HE, a novel Triplet-level Hallucination Evaluation benchmark which can be used to study both object and relation hallucination at the same time. With comprehensive evaluations on Tri-HE, we observe that the relation hallucination issue is even more serious than object hallucination among existing LVLMs, highlighting a previously neglected obstacle to reliable LVLMs. Moreover, based on our findings, we design a simple training-free approach that effectively mitigates hallucinations for LVLMs. Our dataset and code for the reproduction of our experiments are available publicly at https://github.com/wujunjie1998/Tri-HE.

URLs: https://github.com/wujunjie1998/Tri-HE.
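
The snippet below sketches how (object, relation, object) triplets extracted from a response could be checked against ground-truth objects and triplets to separate object from relation hallucinations. Tri-HE's actual extraction and judging pipeline is more involved (it relies on model-based judging), so treat this exact-match rule and the toy scene graph as assumptions.

```python
# Toy triplet-level check in the spirit of the abstract; the matching rule and
# the example scene graph are assumptions, not the benchmark's procedure.
Triplet = tuple[str, str, str]  # (object, relation, object)

def classify_triplet(t: Triplet, gt_objects: set[str], gt_triplets: set[Triplet]) -> str:
    subj, rel, obj = t
    if subj not in gt_objects or obj not in gt_objects:
        return "object hallucination"        # mentions an entity absent from the image
    if t not in gt_triplets:
        return "relation hallucination"      # both entities exist, relation does not
    return "faithful"

gt_objects = {"man", "dog", "bench"}
gt_triplets = {("man", "sitting on", "bench"), ("dog", "next to", "man")}

response_triplets = [
    ("man", "sitting on", "bench"),   # faithful
    ("dog", "on", "bench"),           # relation hallucination
    ("cat", "under", "bench"),        # object hallucination
]
for t in response_triplets:
    print(t, "->", classify_triplet(t, gt_objects, gt_triplets))
```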

replace-cross UPCORE: Utility-Preserving Coreset Selection for Balanced Unlearning

Authors: Vaidehi Patil, Elias Stengel-Eskin, Mohit Bansal

Abstract: User specifications or legal frameworks often require information to be removed from pretrained models, including large language models (LLMs). This requires deleting or "forgetting" a set of data points from an already-trained model, which typically degrades its performance on other data points. Thus, a balance must be struck between removing information and keeping the model's other abilities intact, with a failure to balance this trade-off leading to poor deletion or an unusable model. To this end, we propose UPCORE (Utility-Preserving Coreset Selection), a method-agnostic data selection framework for mitigating collateral damage during unlearning. Finding that the model damage is correlated with the variance of the model's representations on the forget set, we selectively prune the forget set to remove outliers, thereby minimizing model degradation after unlearning. Across three standard unlearning methods, UPCORE consistently achieves a superior balance between the competing objectives of deletion efficacy and model preservation. To better evaluate this trade-off, we introduce a new metric, measuring the area-under-the-curve (AUC) across standard metrics. Our results show that UPCORE improves both standard metrics and AUC, benefiting from positive transfer between the coreset and pruned points while reducing negative transfer from the forget set to points outside of it.
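
Since the abstract ties collateral damage to the variance of the model's representations on the forget set, a minimal sketch of the coreset idea is to score forget-set points by how atypical their hidden representations are and prune the outliers before unlearning. The centroid-distance rule and the random placeholder representations below are assumptions; UPCORE's actual outlier detector may differ.

```python
# Sketch, under stated assumptions, of variance-reducing coreset selection:
# drop the forget-set points whose representations lie farthest from the
# set's centroid, keeping a lower-variance core to unlearn on.
import numpy as np

rng = np.random.default_rng(0)
reps = rng.normal(size=(200, 768))          # stand-in hidden states for 200 forget points

centroid = reps.mean(axis=0)
dist = np.linalg.norm(reps - centroid, axis=1)

keep_frac = 0.8                              # prune the 20% most atypical points
keep_idx = np.argsort(dist)[: int(keep_frac * len(reps))]
core = reps[keep_idx]

print("total variance before:", float(reps.var(axis=0).sum()))
print("total variance after: ", float(core.var(axis=0).sum()))  # smaller after pruning
```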

replace-cross BEARCUBS: A benchmark for computer-using web agents

Authors: Yixiao Song, Katherine Thai, Chau Minh Pham, Yapei Chang, Mazin Nadaf, Mohit Iyyer

Abstract: Modern web agents possess computer use abilities that allow them to interact with webpages by sending commands to a virtual keyboard and mouse. While such agents have considerable potential to assist human users with complex tasks, evaluating their capabilities in real-world settings poses a major challenge. To this end, we introduce BEARCUBS, a "small but mighty" benchmark of 111 information-seeking questions designed to evaluate a web agent's ability to search, browse, and identify factual information from the web. Unlike prior web agent benchmarks, solving BEARCUBS requires (1) accessing live web content rather than synthetic or simulated pages, which captures the unpredictability of real-world web interactions; and (2) performing a broad range of multimodal interactions (e.g., video understanding, 3D navigation) that cannot be bypassed via text-based workarounds. Each question in BEARCUBS has a corresponding short, unambiguous answer and a human-validated browsing trajectory, allowing for transparent evaluation of agent performance and strategies. A human study confirms that BEARCUBS questions are solvable but non-trivial (84.7% human accuracy), revealing domain knowledge gaps and overlooked details as common failure points. By contrast, state-of-the-art computer-using agents underperform, with the best-scoring system (OpenAI's Operator) reaching only 23.4% accuracy. These results highlight critical areas for improvement, including reliable source selection and more powerful multimodal capabilities. To facilitate future research, BEARCUBS will be updated periodically to replace invalid or contaminated questions, keeping the benchmark fresh for future generations of web agents.

replace-cross ActionStudio: A Lightweight Framework for Data and Training of Large Action Models

Authors: Jianguo Zhang, Thai Hoang, Ming Zhu, Zuxin Liu, Shiyu Wang, Tulika Awalgaonkar, Akshara Prabhakar, Haolin Chen, Weiran Yao, Zhiwei Liu, Juntao Tan, Juan Carlos Niebles, Shelby Heinecke, Huan Wang, Silvio Savarese, Caiming Xiong

Abstract: Large Action models are essential for enabling autonomous agents to perform complex tasks. However, training such models remains challenging due to the diversity of agent environments and the complexity of noisy agentic data. Existing infrastructure offers limited support for scalable, agent-specific fine-tuning and standardized agent data processing. We introduce ActionStudio, a lightweight and extensible data and training framework designed for large action models. ActionStudio unifies diverse agent trajectories using our proposed Unified Format 2.0, supports a range of training workflows with optimized multi-node distributed setup, and integrates robust preprocessing and real-time verification tools. ActionStudio demonstrates up to 9x higher throughput compared to existing agentic training frameworks, and our trained models yield top performances across public and realistic agent benchmarks. To support the broader research community, we open-source the ActionStudio framework and release actionstudio-98k, a curated dataset of 98k high-quality trajectories. Code: https://github.com/SalesforceAIResearch/xLAM.

URLs: https://github.com/SalesforceAIResearch/xLAM.

replace-cross Task-Circuit Quantization: Leveraging Knowledge Localization and Interpretability for Compression

Authors: Hanqi Xiao, Yi-Lin Sung, Elias Stengel-Eskin, Mohit Bansal

Abstract: Post-training quantization (PTQ) reduces a model's memory footprint by mapping full precision weights into low bit weights without costly retraining, but can degrade its downstream performance especially in low 2- to 3-bit settings. We develop a new mixed-precision PTQ approach, Task-Circuit Quantization (TaCQ), that draws parallels to automated circuit discovery, directly conditioning the quantization process on specific weight circuits -- which we define as sets of weights associated with downstream task performance. These weights are kept as 16-bit weights, while others are quantized, maintaining performance while only adding a marginal memory cost. Specifically, TaCQ contrasts unquantized model weights with a uniformly-quantized model to estimate the expected change in weights due to quantization and uses gradient information to predict the resulting impact on task performance, allowing us to preserve task-specific weights. We compare TaCQ-based quantization to existing mixed-precision quantization methods when conditioning both on general-purpose and task-specific data. Across QA, math reasoning, and text-to-SQL tasks for both Llama-3 and Qwen2.5, we find that TaCQ outperforms baselines using the same calibration data and a lower weight budget, achieving major improvements in the 2 and 3-bit regime. With only 3.1 bits we are able to recover 96% of Llama-3-8B-Instruct's unquantized 16-bit MMLU performance, obtaining a 5.25% absolute improvement over SPQR. We also observe consistently large gains over existing methods in the 2-bit regime, with an average gain of 14.74% over the strongest baseline, SliM-LLM. Moreover, we observe a 7.20% gain without conditioning on specific tasks, showing TaCQ's ability to identify important weights is not limited to task-conditioned settings.
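
The general recipe the abstract describes, contrasting quantized with unquantized weights and using gradients to predict task impact, can be sketched as a first-order saliency score |g * (W_q - W)| with the top-scoring weights kept in 16-bit. The toy quantizer, score, and 10% budget below are illustrative assumptions; TaCQ's exact scoring and budgeting may differ.

```python
# Sketch of gradient-weighted saliency for mixed-precision PTQ, in the spirit
# of the abstract; not TaCQ's exact procedure.
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 8)).astype(np.float32)     # full-precision weights
g = rng.normal(size=W.shape).astype(np.float32)    # gradient of task loss w.r.t. W

def uniform_quantize(w, bits=3):
    """Symmetric uniform quantizer (toy, per-tensor)."""
    levels = 2 ** (bits - 1) - 1
    scale = np.abs(w).max() / levels
    return np.round(w / scale) * scale

W_q = uniform_quantize(W)
saliency = np.abs(g * (W_q - W))                   # predicted loss impact per weight

budget = 0.1                                        # keep 10% of weights in 16-bit
threshold = np.quantile(saliency, 1 - budget)
keep_fp16 = saliency >= threshold

W_mixed = np.where(keep_fp16, W, W_q)               # mixed-precision result
print("weights kept in high precision:", int(keep_fp16.sum()), "/", W.size)
```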

replace-cross Fine-Tune an SLM or Prompt an LLM? The Case of Generating Low-Code Workflows

Authors: Orlando Marquez Ayala, Patrice Bechard, Emily Chen, Maggie Baird, Jingfei Chen

Abstract: Large Language Models (LLMs) such as GPT-4o can handle a wide range of complex tasks with the right prompt. As per-token costs are reduced, the advantages of fine-tuning Small Language Models (SLMs) for real-world applications -- faster inference, lower costs -- may no longer be clear. In this work, we present evidence that, for domain-specific tasks that require structured outputs, SLMs still have a quality advantage. We compare fine-tuning an SLM against prompting LLMs on the task of generating low-code workflows in JSON form. We observe that while a good prompt can yield reasonable results, fine-tuning improves quality by 10% on average. We also perform systematic error analysis to reveal model limitations.

replace-cross Cross-Layer Discrete Concept Discovery for Interpreting Language Models

Authors: Ankur Garg, Xuemin Yu, Hassan Sajjad, Samira Ebrahimi Kahou

Abstract: Uncovering emergent concepts across transformer layers remains a significant challenge because the residual stream linearly mixes and duplicates information, obscuring how features evolve within large language models. Current research efforts primarily inspect neural representations at single layers, thereby overlooking this cross-layer superposition and the redundancy it introduces. These representations are typically either analyzed directly for activation patterns or passed to probing classifiers that map them to a limited set of predefined concepts. To address these limitations, we propose the cross-layer VQ-VAE (CLVQ-VAE), a framework that uses vector quantization to map representations across layers and, in the process, collapse duplicated residual-stream features into compact, interpretable concept vectors. Our approach uniquely combines top-k temperature-based sampling during quantization with EMA codebook updates, providing controlled exploration of the discrete latent space while maintaining codebook diversity. We further enhance the framework with scaled-spherical k-means++ for codebook initialization, which clusters by directional similarity rather than magnitude, better aligning with the semantic structure of the word embedding space.
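
The top-k temperature-based sampling step mentioned in the abstract can be illustrated as follows: instead of hard-assigning a representation to its nearest codebook entry, the quantizer samples among the k nearest entries with a temperature-controlled softmax over their (negative) distances. The distance metric, k, and temperature here are assumptions for illustration.

```python
# Sketch of top-k temperature-based codebook sampling during quantization;
# the specific metric and hyperparameters are assumptions.
import numpy as np

rng = np.random.default_rng(0)
codebook = rng.normal(size=(64, 32))   # 64 concept codes, 32-dim
z = rng.normal(size=(32,))             # one residual-stream representation to quantize

def quantize(z, codebook, k=5, temperature=0.5):
    """Sample a code from the k nearest codebook entries instead of taking the argmin."""
    dists = np.linalg.norm(codebook - z, axis=1)
    nearest = np.argsort(dists)[:k]                    # top-k candidate codes
    logits = -dists[nearest] / temperature             # closer codes get higher probability
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    choice = rng.choice(nearest, p=probs)
    return choice, codebook[choice]

idx, code = quantize(z, codebook)
print("selected code index:", idx)
```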

replace-cross SmartThinker: Learning to Compress and Preserve Reasoning by Step-Level Length Control

Authors: Xingyang He, Xiao Ling, Jie Liu

Abstract: Large reasoning models (LRMs) have exhibited remarkable reasoning capabilities through inference-time scaling, but this progress has also introduced considerable redundancy and inefficiency into their reasoning processes, resulting in substantial computational waste. Previous work has attempted to mitigate this issue by penalizing the overall length of generated samples during reinforcement learning (RL), with the goal of encouraging more concise chains of thought. However, we observe that such a global length penalty often leads to excessive compression of critical reasoning steps while preserving unnecessary details in simpler ones, yielding a suboptimal trade-off between accuracy and efficiency. To address this issue, we propose SmartThinker, a two-stage learnable framework designed to enable fine-grained control over the length of reasoning chains based on the importance of each individual step. In the first stage, SmartThinker adapts a reasoning model to a short-form reasoning mode through rejection sampling combined with supervised fine-tuning (SFT). In the second stage, SmartThinker applies Step-Level Length Control Policy Optimization (SCPO) to refine the model output distribution, which increases the proportion of length allocated to critical steps while reducing redundancy in less important ones. SCPO consists of four core components: an online importance estimator, a step-level length control reward function, step-level generalized advantage estimation (S-GAE), and a difficulty-adaptive clipping strategy. Working in concert, these components enable SCPO to implement differentiated length control across reasoning steps. Empirical results across multiple reasoning benchmarks and various backbone models demonstrate that SmartThinker significantly reduces redundant reasoning while achieving comparable or even superior performance to existing methods.
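
A toy version of a step-level length-control reward, in the spirit of the abstract, scales each step's length penalty down by that step's estimated importance, so compression pressure falls mostly on unimportant steps. The exact SCPO reward, the importance estimator, and the numbers below are assumptions, not the paper's formulation.

```python
# Toy step-level length-control reward: important steps are penalized less for
# exceeding a length budget. Hypothetical form, not SCPO's actual reward.
def step_length_rewards(step_lengths, importances, target_len=30, alpha=0.01):
    """Per-step rewards; importances are assumed to lie in [0, 1]."""
    rewards = []
    for length, imp in zip(step_lengths, importances):
        excess = max(0, length - target_len)           # tokens beyond the budget
        rewards.append(-alpha * (1.0 - imp) * excess)  # imp = 1 removes the penalty
    return rewards

# Hypothetical chain with three steps (lengths in tokens, importance estimates):
print(step_length_rewards([80, 25, 60], [0.9, 0.2, 0.1]))
# The long but important first step is barely penalized; the long, unimportant
# third step receives the largest penalty.
```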

replace-cross SWE-MERA: A Dynamic Benchmark for Agenticly Evaluating Large Language Models on Software Engineering Tasks

Authors: Pavel Adamenko, Mikhail Ivanov, Aidar Valeev, Rodion Levichev, Pavel Zadorozhny, Ivan Lopatin, Dmitry Babayev, Alena Fenogenova, Valentin Malykh

Abstract: The rapid advancement of Large Language Models (LLMs) in software engineering has revealed critical limitations in existing benchmarks, particularly the widely used SWE-bench dataset. Recent studies have uncovered severe data contamination issues, e.g. SWE-bench reports 32.67% of successful patches involve direct solution leakage and 31.08% pass due to inadequate test cases. We introduce SWE-MERA, a dynamic, continuously updated benchmark designed to address these fundamental challenges through an automated collection of real-world GitHub issues and rigorous quality validation. Our approach implements a reliable pipeline that ensures quality while minimizing contamination risks, resulting in approximately 10,000 potential tasks with 300 samples currently available. Evaluation using the Aider coding agent demonstrates strong discriminative power in state-of-the-art models. We report performance across a dozen recent LLMs evaluated on tasks collected between September 2024 and June 2025.

replace-cross Fairness Is Not Enough: Auditing Competence and Intersectional Bias in AI-powered Resume Screening

Authors: Kevin T Webster

Abstract: The increasing use of generative AI for resume screening is predicated on the assumption that it offers an unbiased alternative to biased human decision-making. However, this belief fails to address a critical question: are these AI systems fundamentally competent at the evaluative tasks they are meant to perform? This study investigates the question of competence through a two-part audit of eight major AI platforms. Experiment 1 confirmed complex, contextual racial and gender biases, with some models penalizing candidates merely for the presence of demographic signals. Experiment 2, which evaluated core competence, provided a critical insight: some models that appeared unbiased were, in fact, incapable of performing a substantive evaluation, relying instead on superficial keyword matching. This paper introduces the "Illusion of Neutrality" to describe this phenomenon, where an apparent lack of bias is merely a symptom of a model's inability to make meaningful judgments. This study recommends that organizations and regulators adopt a dual-validation framework, auditing AI hiring tools for both demographic bias and demonstrable competence to ensure they are both equitable and effective.

replace-cross MERA Code: A Unified Framework for Evaluating Code Generation Across Tasks

Authors: Artem Chervyakov, Alexander Kharitonov, Pavel Zadorozhny, Adamenko Pavel, Rodion Levichev, Dmitrii Vorobev, Dmitrii Salikhov, Aidar Valeev, Alena Pestova, Maria Dziuba, Ilseyar Alimova, Artem Zavgorodnev, Aleksandr Medvedev, Stanislav Moiseev, Elena Bruches, Daniil Grebenkin, Roman Derunets, Vikulov Vladimir, Anton Emelyanov, Dmitrii Babaev, Vladimir V. Ivanov, Valentin Malykh, Alena Fenogenova

Abstract: Advancements in LLMs have enhanced task automation in software engineering; however, current evaluations primarily focus on natural language tasks, overlooking code quality. Most benchmarks prioritize high-level reasoning over executable code and real-world performance, leaving gaps in understanding true capabilities and risks associated with these models in production. To address this issue, we propose MERA Code, a new addition to the MERA benchmark family, specifically focused on evaluating code for the latest code generation LLMs in Russian. This benchmark includes 11 evaluation tasks that span 8 programming languages. Our proposed evaluation methodology features a taxonomy that outlines the practical coding skills necessary for models to complete these tasks. The benchmark comprises an open-source codebase for users to conduct MERA assessments, a scoring system compatible with various programming environments, and a platform featuring a leaderboard and submission system. We evaluate open LLMs and frontier API models, analyzing their limitations in terms of practical coding tasks in non-English languages. We are publicly releasing MERA to guide future research, anticipate groundbreaking features in model development, and standardize evaluation procedures.