new eSapiens's DEREK Module: Deep Extraction & Reasoning Engine for Knowledge with LLMs

Authors: Isaac Shi, Zeyuan Li, Fan Liu, Wenli Wang, Lewei He, Yang Yang, Tianyu Shi

Abstract: We present the DEREK (Deep Extraction & Reasoning Engine for Knowledge) Module, a secure and scalable Retrieval-Augmented Generation pipeline designed specifically for enterprise document question answering. Designed and implemented by eSapiens, the system ingests heterogeneous content (PDF, Office, web), splits it into 1,000-token overlapping chunks, and indexes them in a hybrid HNSW+BM25 store. User queries are refined by GPT-4o, retrieved via combined vector+BM25 search, reranked with Cohere, and answered by an LLM using CO-STAR prompt engineering. A LangGraph verifier enforces citation overlap, regenerating answers until every claim is grounded. On four LegalBench subsets, 1,000-token chunks improve Recall@50 by approximately 1 pp and hybrid+rerank boosts Precision@10 by approximately 7 pp; the verifier raises TRACe Utilization above 0.50 and limits unsupported statements to less than 3%. All components run in containers and enforce end-to-end TLS 1.3 and AES-256 encryption. These results demonstrate that the DEREK module delivers accurate, traceable, and production-ready document QA with minimal operational overhead. The module is designed to meet enterprise demands for secure, auditable, and context-faithful retrieval, providing a reliable baseline for high-stakes domains such as legal and finance.
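
The abstract names the retrieval stages but not their implementation. Below is a minimal sketch of the hybrid vector+BM25 score fusion and a token-overlap grounding check in the spirit of the citation verifier; the function names, the fusion weight alpha, and the overlap threshold are all illustrative assumptions, not eSapiens code.

    # Hedged sketch: hybrid dense+BM25 fusion and a crude citation-overlap test.
    import numpy as np
    from rank_bm25 import BM25Okapi  # any BM25 implementation would do

    def hybrid_scores(query_vec, doc_vecs, bm25, query_tokens, alpha=0.5):
        """Blend dense cosine similarity with max-normalized BM25 scores."""
        dense = doc_vecs @ query_vec / (
            np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(query_vec) + 1e-9)
        sparse = np.asarray(bm25.get_scores(query_tokens))
        sparse = sparse / (sparse.max() + 1e-9)
        return alpha * dense + (1 - alpha) * sparse  # alpha is a free choice

    def citation_grounded(answer_sentence, cited_chunk, threshold=0.5):
        """Stand-in for the verifier: require token overlap with the citation."""
        a = set(answer_sentence.lower().split())
        c = set(cited_chunk.lower().split())
        return len(a & c) / max(len(a), 1) >= threshold

In the described pipeline, the verifier would regenerate the answer whenever a claim fails such a grounding check, looping until every claim passes.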

new Adversarial Demonstration Learning for Low-resource NER Using Dual Similarity

Authors: Guowen Yuan, Tien-Hsuan Wu, Lianghao Xia, Ben Kao

Abstract: We study the problem of named entity recognition (NER) based on demonstration learning in low-resource scenarios. We identify two issues in demonstration construction and model training. Firstly, existing methods for selecting demonstration examples primarily rely on semantic similarity; we show that feature similarity can provide significant performance improvement. Secondly, we show that the NER tagger's ability to reference demonstration examples is generally inadequate. We propose a demonstration and training approach that effectively addresses these issues. For the first issue, we propose to select examples by dual similarity, which comprises both semantic similarity and feature similarity. For the second issue, we propose to train an NER model with adversarial demonstration such that the model is forced to refer to the demonstrations when performing the tagging task. We conduct comprehensive experiments in low-resource NER tasks, and the results demonstrate that our method outperforms a range of methods.
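
A minimal sketch of the dual-similarity selection idea, assuming two black-box encoders (one semantic, one feature-level) and a mixing weight lam; all names are hypothetical and the paper's exact feature similarity may differ.

    import numpy as np

    def cosine(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

    def select_demonstrations(query, pool, sem_encode, feat_encode, k=3, lam=0.5):
        """Rank candidate demonstrations by a convex combination of semantic
        similarity and feature similarity (lam=0.5 is an assumed weight)."""
        q_sem, q_feat = sem_encode(query), feat_encode(query)
        scored = [(lam * cosine(q_sem, sem_encode(ex))
                   + (1 - lam) * cosine(q_feat, feat_encode(ex)), ex)
                  for ex in pool]
        return [ex for _, ex in
                sorted(scored, key=lambda t: t[0], reverse=True)[:k]]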

new Small Edits, Big Consequences: Telling Good from Bad Robustness in Large Language Models

Authors: Altynbek Ismailov, Salia Asanova

Abstract: Large language models (LLMs) now write code in settings where misreading a single word can break safety or cost money, yet we still expect them to overlook stray typos. To probe where useful robustness ends and harmful insensitivity begins, we compile 50 LeetCode problems and craft three minimal prompt perturbations that should vary in importance: (i) progressive underspecification deleting 10% of words per step; (ii) lexical flip swapping a pivotal quantifier ("max" to "min"); and (iii) jargon inflation replacing a common noun with an obscure technical synonym. Six frontier models, including three "reasoning-tuned" versions, solve each mutated prompt, and their Python outputs are checked against the original test suites to reveal whether they reused the baseline solution or adapted. Among 11,853 generations we observe a sharp double asymmetry. Models remain correct in 85% of cases even after 90% of the prompt is missing, showing over-robustness to underspecification, yet only 54% react to a single quantifier flip that reverses the task, with reasoning-tuned variants even less sensitive than their bases. Jargon edits lie in between, passing through 56%. Current LLMs thus blur the line between harmless noise and meaning-changing edits, often treating both as ignorable. Masking salient anchors such as function names can force re-evaluation. We advocate evaluation and training protocols that reward differential sensitivity: stay steady under benign noise but adapt, or refuse, when semantics truly change.
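
The three perturbations are simple enough to state in code. The sketch below shows one plausible implementation of each; the word lists and deletion policy are assumptions, not the authors' scripts.

    import random

    def underspecify(prompt, frac=0.10, seed=0):
        """(i) Progressive underspecification: delete a fraction of words."""
        words = prompt.split()
        rng = random.Random(seed)
        keep = sorted(rng.sample(range(len(words)),
                                 len(words) - int(frac * len(words))))
        return " ".join(words[i] for i in keep)

    def lexical_flip(prompt):
        """(ii) Swap a pivotal quantifier, reversing the task."""
        swaps = {"max": "min", "min": "max",
                 "maximum": "minimum", "minimum": "maximum"}
        return " ".join(swaps.get(w, w) for w in prompt.split())

    JARGON = {"list": "totally ordered multiset"}  # hypothetical mapping

    def jargon_inflate(prompt):
        """(iii) Replace a common noun with an obscure technical synonym."""
        return " ".join(JARGON.get(w, w) for w in prompt.split())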

new Enhancing Hindi NER in Low Context: A Comparative study of Transformer-based models with vs. without Retrieval Augmentation

Authors: Sumit Singh, Rohit Mishra, Uma Shanker Tiwary

Abstract: One major challenge in natural language processing is named entity recognition (NER), which identifies and categorises named entities in textual input. In order to improve NER, this study investigates a Hindi NER technique that makes use of Hindi-specific pretrained encoders (MuRIL and XLM-R) and generative models (Llama-2-7B-chat-hf (Llama2-7B), Llama-2-70B-chat-hf (Llama2-70B), Llama-3-70B-Instruct (Llama3-70B) and GPT3.5-turbo), and augments the data with retrieved data from external relevant contexts, notably from Wikipedia. We have fine-tuned MuRIL, XLM-R and Llama2-7B with and without RA. However, Llama2-70B, Llama3-70B and GPT3.5-turbo are utilised for few-shot NER generation. Our investigation shows that the mentioned language models (LMs) with Retrieval Augmentation (RA) outperform baseline methods that don't incorporate RA in most cases. The macro F1 scores for MuRIL and XLM-R are 0.69 and 0.495, respectively, without RA and increase to 0.70 and 0.71, respectively, in the presence of RA. Fine-tuned Llama2-7B outperforms its non-fine-tuned counterpart by a significant margin. On the other hand, the generative models which are not fine-tuned also perform better with augmented data. GPT3.5-turbo adopted RA well; however, Llama2-70B and Llama3-70B did not adopt RA with our retrieval context. The findings show that RA significantly improves performance, especially for low-context data. This study adds significant knowledge about how best to use data augmentation methods and pretrained models to enhance NER performance, particularly in languages with limited resources.

new Learning without training: The implicit dynamics of in-context learning

Authors: Benoit Dherin, Michael Munn, Hanna Mazzawi, Michael Wunder, Javier Gonzalvo

Abstract: One of the most striking features of Large Language Models (LLMs) is their ability to learn in context. Namely, at inference time an LLM is able to learn new patterns without any additional weight update when these patterns are presented in the form of examples in the prompt, even if these patterns were not seen during training. The mechanisms through which this can happen are still largely unknown. In this work, we show that the stacking of a self-attention layer with an MLP allows the transformer block to implicitly modify the weights of the MLP layer according to the context. We argue through theory and experimentation that this simple mechanism may be the reason why LLMs can learn in context and not only during training. Specifically, we show under mild simplifying assumptions how a transformer block implicitly transforms a context into a low-rank weight update of the MLP layer.
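
The claimed mechanism admits a small worked example: a context-induced shift dh of the MLP input can be absorbed into a rank-1 weight update that leaves the layer's output unchanged. The identity below is exact for a linear layer and is only meant to convey the flavor of the result; the paper's precise statement concerns the full transformer block.

    import numpy as np

    rng = np.random.default_rng(0)
    W  = rng.normal(size=(8, 8))   # MLP weight
    h  = rng.normal(size=8)        # query representation entering the MLP
    dh = rng.normal(size=8)        # shift contributed by the context via attention

    # Rank-1 update that absorbs the context shift into the weights:
    dW = np.outer(W @ dh, h) / (h @ h)

    assert np.allclose(W @ (h + dh), (W + dW) @ h)  # holds exactly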

new Help Me Write a Story: Evaluating LLMs' Ability to Generate Writing Feedback

Authors: Hannah Rashkin, Elizabeth Clark, Fantine Huot, Mirella Lapata

Abstract: Can LLMs provide support to creative writers by giving meaningful writing feedback? In this paper, we explore the challenges and limitations of model-generated writing feedback by defining a new task, dataset, and evaluation frameworks. To study model performance in a controlled manner, we present a novel test set of 1,300 stories that we corrupted to intentionally introduce writing issues. We study the performance of commonly used LLMs in this task with both automatic and human evaluation metrics. Our analysis shows that current models have strong out-of-the-box behavior in many respects -- providing specific and mostly accurate writing feedback. However, models often fail to identify the biggest writing issue in the story and to correctly decide when to offer critical vs. positive feedback.

new mRAKL: Multilingual Retrieval-Augmented Knowledge Graph Construction for Low-Resourced Languages

Authors: Hellina Hailu Nigatu, Min Li, Maartje ter Hoeve, Saloni Potdar, Sarah Chasins

Abstract: Knowledge Graphs represent real-world entities and the relationships between them. Multilingual Knowledge Graph Construction (mKGC) refers to the task of automatically constructing or predicting missing entities and links for knowledge graphs in a multilingual setting. In this work, we reformulate the mKGC task as a Question Answering (QA) task and introduce mRAKL: a Retrieval-Augmented Generation (RAG) based system to perform mKGC. We achieve this by using the head entity and linking relation in a question, and having our model predict the tail entity as an answer. Our experiments focus primarily on two low-resourced languages: Tigrinya and Amharic. We experiment with using higher-resourced languages Arabic and English for cross-lingual transfer. With a BM25 retriever, we find that the RAG-based approach improves performance over a no-context setting. Further, our ablation studies show that with an idealized retrieval system, mRAKL improves accuracy by 4.92 and 8.79 percentage points for Tigrinya and Amharic, respectively.

new AutoMeet: a proof-of-concept study of genAI to automate meetings in automotive engineering

Authors: Simon Baeuerle, Max Radyschevski, Ulrike Pado

Abstract: In large organisations, knowledge is mainly shared in meetings, which takes up significant amounts of work time. Additionally, frequent in-person meetings produce inconsistent documentation -- official minutes, personal notes, and presentations may or may not exist. Shared information therefore becomes hard to retrieve outside of the meeting, necessitating lengthy updates and high-frequency meeting schedules. Generative Artificial Intelligence (genAI) models like Large Language Models (LLMs) exhibit impressive performance on spoken and written language processing. This motivates a practical usage of genAI for knowledge management in engineering departments: using genAI for transcribing meetings and integrating heterogeneous additional information sources into an easily usable format for ad-hoc searches. We implement an end-to-end pipeline to automate the entire meeting documentation workflow in a proof-of-concept state: meetings are recorded and minutes are created by genAI. These are further made easily searchable through a chatbot interface. The core of our work is to test this genAI-based software tooling in a real-world engineering department and collect extensive survey data on both ethical and technical aspects. Direct feedback from this real-world setup points out both opportunities and risks: a) users agree that the effort for meetings could be significantly reduced with the help of genAI models, b) technical aspects are largely solved already, c) organizational aspects are crucial for a successful ethical usage of such a system.

new Deep Researcher with Test-Time Diffusion

Authors: Rujun Han, Yanfei Chen, Zoey CuiZhu, Lesly Miculicich, Guan Sun, Yuanjun Bi, Weiming Wen, Hui Wan, Chunfeng Wen, Solène Maître, George Lee, Vishy Tirumalashetty, Emily Xue, Zizhao Zhang, Salem Haykal, Burak Gokturk, Tomas Pfister, Chen-Yu Lee

Abstract: Deep research agents, powered by Large Language Models (LLMs), are rapidly advancing; yet, their performance often plateaus when generating complex, long-form research reports using generic test-time scaling algorithms. Drawing inspiration from the iterative nature of human research, which involves cycles of searching, reasoning, and revision, we propose the Test-Time Diffusion Deep Researcher (TTD-DR). This novel framework conceptualizes research report generation as a diffusion process. TTD-DR initiates this process with a preliminary draft, an updatable skeleton that serves as an evolving foundation to guide the research direction. The draft is then iteratively refined through a "denoising" process, which is dynamically informed by a retrieval mechanism that incorporates external information at each step. The core process is further enhanced by a self-evolutionary algorithm applied to each component of the agentic workflow, ensuring the generation of high-quality context for the diffusion process. This draft-centric design makes the report writing process more timely and coherent while reducing information loss during the iterative search process. We demonstrate that our TTD-DR achieves state-of-the-art results on a wide array of benchmarks that require intensive search and multi-hop reasoning, significantly outperforming existing deep research agents.
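
A minimal sketch of the draft-as-denoising loop the abstract describes; the prompts, the retrieve callable, and the fixed step count are assumptions standing in for the self-evolutionary machinery of the actual system.

    def ttd_dr_sketch(question, llm, retrieve, n_steps=5):
        """Iteratively 'denoise' an evolving draft with retrieved evidence."""
        draft = llm(f"Write a preliminary skeleton report for: {question}")
        for _ in range(n_steps):
            query = llm(f"Given this draft, what should be searched next?\n{draft}")
            evidence = retrieve(query)  # external retrieval at every step
            draft = llm("Revise the draft using the evidence below.\n"
                        f"Draft:\n{draft}\nEvidence:\n{evidence}")
        return draft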

new The Prompt Makes the Person(a): A Systematic Evaluation of Sociodemographic Persona Prompting for Large Language Models

Authors: Marlene Lutz, Indira Sen, Georg Ahnert, Elisa Rogers, Markus Strohmaier

Abstract: Persona prompting is increasingly used in large language models (LLMs) to simulate views of various sociodemographic groups. However, how a persona prompt is formulated can significantly affect outcomes, raising concerns about the fidelity of such simulations. Using five open-source LLMs, we systematically examine how different persona prompt strategies, specifically role adoption formats and demographic priming strategies, influence LLM simulations across 15 intersectional demographic groups in both open- and closed-ended tasks. Our findings show that LLMs struggle to simulate marginalized groups, particularly nonbinary, Hispanic, and Middle Eastern identities, but that the choice of demographic priming and role adoption strategy significantly impacts their portrayal. Specifically, we find that prompting in an interview-style format and name-based priming can help reduce stereotyping and improve alignment. Surprisingly, smaller models like OLMo-2-7B outperform larger ones such as Llama-3.3-70B. Our findings offer actionable guidance for designing sociodemographic persona prompts in LLM-based simulation studies.

new Efficient Compositional Multi-tasking for On-device Large Language Models

Authors: Ondrej Bohdal, Mete Ozay, Jijoong Moon, Kyeng-Hun Lee, Hyeonmok Ko, Umberto Michieli

Abstract: Adapter parameters provide a mechanism to modify the behavior of machine learning models and have gained significant popularity in the context of large language models (LLMs) and generative AI. These parameters can be merged to support multiple tasks via a process known as task merging. However, prior work on merging in LLMs, particularly in natural language processing, has been limited to scenarios where each test example addresses only a single task. In this paper, we focus on on-device settings and study the problem of text-based compositional multi-tasking, where each test example involves the simultaneous execution of multiple tasks. For instance, generating a translated summary of a long text requires solving both translation and summarization tasks concurrently. To facilitate research in this setting, we propose a benchmark comprising four practically relevant compositional tasks. We also present an efficient method (Learnable Calibration) tailored for on-device applications, where computational resources are limited, emphasizing the need for solutions that are both resource-efficient and high-performing. Our contributions lay the groundwork for advancing the capabilities of LLMs in real-world multi-tasking scenarios, expanding their applicability to complex, resource-constrained use cases.
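
For context, the standard task-merging baseline this setting builds on fits in a few lines; the paper's Learnable Calibration method instead learns how to combine adapters, so the snippet below shows only the generic weighted-sum baseline, with all names assumed.

    def merge_adapters(adapter_state_dicts, weights):
        """Weighted-sum merge of adapter parameters (generic baseline,
        not the paper's Learnable Calibration)."""
        merged = {}
        for name in adapter_state_dicts[0]:
            merged[name] = sum(w * sd[name]
                               for w, sd in zip(weights, adapter_state_dicts))
        return merged

    # e.g., approximating a translated-summary adapter from two single-task ones:
    # merged = merge_adapters([sd_translate, sd_summarize], [0.5, 0.5])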

new BIDWESH: A Bangla Regional Based Hate Speech Detection Dataset

Authors: Azizul Hakim Fayaz, MD. Shorif Uddin, Rayhan Uddin Bhuiyan, Zakia Sultana, Md. Samiul Islam, Bidyarthi Paul, Tashreef Muhammad, Shahriar Manzoor

Abstract: Hate speech on digital platforms has become a growing concern globally, especially in linguistically diverse countries like Bangladesh, where regional dialects play a major role in everyday communication. Despite progress in hate speech detection for standard Bangla, existing datasets and systems fail to address the informal and culturally rich expressions found in dialects such as Barishal, Noakhali, and Chittagong. This oversight results in limited detection capability and biased moderation, leaving large sections of harmful content unaccounted for. To address this gap, this study introduces BIDWESH, the first multi-dialectal Bangla hate speech dataset, constructed by translating and annotating 9,183 instances from the BD-SHS corpus into three major regional dialects. Each entry was manually verified and labeled for hate presence, type (slander, gender, religion, call to violence), and target group (individual, male, female, group), ensuring linguistic and contextual accuracy. The resulting dataset provides a linguistically rich, balanced, and inclusive resource for advancing hate speech detection in Bangla. BIDWESH lays the groundwork for the development of dialect-sensitive NLP tools and contributes significantly to equitable and context-aware content moderation in low-resource language settings.

new Do Large Language Models Have a Planning Theory of Mind? Evidence from MindGames: a Multi-Step Persuasion Task

Authors: Jared Moore, Ned Cooper, Rasmus Overmark, Beba Cibralic, Nick Haber, Cameron R. Jones

Abstract: Recent evidence suggests Large Language Models (LLMs) display Theory of Mind (ToM) abilities. Most ToM experiments place participants in a spectatorial role, wherein they predict and interpret other agents' behavior. However, human ToM also contributes to dynamically planning action and strategically intervening on others' mental states. We present MindGames: a novel 'planning theory of mind' (PToM) task which requires agents to infer an interlocutor's beliefs and desires to persuade them to alter their behavior. Unlike previous evaluations, we explicitly evaluate use cases of ToM. We find that humans significantly outperform o1-preview (an LLM) at our PToM task (11% higher; p = 0.006). We hypothesize this is because humans have an implicit causal model of other agents (e.g., they know, as our task requires, to ask about people's preferences). In contrast, o1-preview outperforms humans in a baseline condition which requires a similar amount of planning but minimal mental state inferences (e.g., o1-preview is better than humans at planning when already given someone's preferences). These results suggest a significant gap between human-like social reasoning and LLM abilities.

new WAKENLLM: Evaluating Reasoning Potential and Stability in LLMs via Fine-Grained Benchmarking

Authors: Zipeng Ling, Yuehao Tang, Shuliang Liu, Junqi Yang, Shenghong Fu, Yao Wan, Kejia Huang, Chen Huang, Zhichao Hou, Xuming Hu

Abstract: Large Language Models (LLMs) frequently output the label Unknown, yet current evaluations focus almost exclusively on whether such answers are honest rather than why they arise. This blurs two distinct cases: (i) an input that is genuinely indeterminate and (ii) a solvable problem that the model fails to resolve. We call this phenomenon Vague Perception, and we introduce a framework that quantifies the proportion of Unknown responses attributable to model incapacity and tests whether guided stimulation can convert them into either correct Known answers or correct Unknown answers with valid reasoning. By separating these sources of uncertainty, our method provides a clearer picture of LLM reasoning limits and their potential for improvement. After deriving a theoretical reasoning accuracy for different LLMs, we apply different methods to test whether each model can reach that accuracy under a baseline framework. Our work explores the latent reasoning ability of LLMs and offers a new perspective on resolving the Vague Perception phenomenon.

new Towards Compute-Optimal Many-Shot In-Context Learning

Authors: Shahriar Golchin, Yanfei Chen, Rujun Han, Manan Gandhi, Tianli Yu, Swaroop Mishra, Mihai Surdeanu, Rishabh Agarwal, Chen-Yu Lee, Tomas Pfister

Abstract: Long-context large language models (LLMs) are able to process inputs containing up to several million tokens. In the scope of in-context learning (ICL), this translates into using hundreds/thousands of demonstrations in the input prompt, enabling many-shot ICL. In practice, a fixed set of demonstrations is often selected at random in many-shot settings due to (1) high inference costs, (2) the benefits of caching and reusing computations, and (3) the similar performance offered by this strategy compared to others when scaled. In this work, we propose two straightforward strategies for demonstration selection in many-shot ICL that improve performance with minimal computational overhead. Our first method combines a small number of demonstrations, selected based on their similarity to each test sample, with a disproportionately larger set of random demonstrations that are cached. The second strategy improves the first by replacing random demonstrations with those selected using centroids derived from test sample representations via k-means clustering. Our experiments with Gemini Pro and Flash across several datasets indicate that our strategies consistently outperform random selection and surpass or match the most performant selection approach while supporting caching and reducing inference cost by up to an order of magnitude. We also show that adjusting the proportion of demonstrations selected based on different criteria can balance performance and inference cost in many-shot ICL.
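
Both strategies reduce to a few lines given precomputed demonstration embeddings. In this sketch the split sizes, the dot-product similarity, and the use of scikit-learn's KMeans are assumptions; the cached block is chosen once and reused across every query, which is where the cost savings come from.

    import numpy as np
    from sklearn.cluster import KMeans

    def cached_block(pool_vecs, n_cached=64, test_set_vecs=None, seed=0):
        """Reusable demonstrations: random (strategy 1) or nearest to k-means
        centroids of test-sample representations (strategy 2)."""
        rng = np.random.default_rng(seed)
        if test_set_vecs is None:
            return rng.choice(len(pool_vecs), n_cached, replace=False)
        km = KMeans(n_clusters=n_cached, n_init="auto",
                    random_state=seed).fit(test_set_vecs)
        return np.array([int(np.argmax(pool_vecs @ c))
                         for c in km.cluster_centers_])

    def demos_for(test_vec, pool_vecs, cached, n_similar=8):
        """Per-query part: a few nearest demonstrations plus the cached block."""
        nearest = np.argsort(-(pool_vecs @ test_vec))[:n_similar]
        return np.concatenate([nearest, cached])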

new FinResearchBench: A Logic Tree based Agent-as-a-Judge Evaluation Framework for Financial Research Agents

Authors: Run Sun, Zuo Bai, Wentao Zhang, Yuxiang Zhang, Li Zhao, Shan Sun, Zhengwen Qiu

Abstract: Recently, AI agents have been evolving rapidly in intelligence and are widely used in professional research applications, such as STEM, software development, and finance. Among these AI agents, deep research agents are a key category, as they can perform long-horizon tasks and solve problems of greater complexity. However, there are few evaluation frameworks and benchmarks that systematically and automatically investigate the capabilities of these research agents. Furthermore, financial research problems have distinct complexity and subtlety. To fill this gap, we propose FinResearchBench, a logic-tree-based Agent-as-a-Judge framework that targets financial research agents specifically. It provides a comprehensive and automatic assessment of research agents across 7 key types of tasks in the financial research domain. The contributions of this work are twofold: (1) the first and innovative Agent-as-a-Judge system that extracts the logic tree of the research outcome and uses it as intermediate information to present a comprehensive, reliable and robust evaluation; (2) a finance-oriented design that covers 70 typical financial research questions, spread across 7 frequently encountered types of tasks in the domain.

new Efficient RL for optimizing conversation level outcomes with an LLM-based tutor

Authors: Hyunji Nam, Omer Gottesman, Amy Zhang, Dean Foster, Emma Brunskill, Lyle Ungar

Abstract: Large language models (LLMs) built on existing reinforcement learning with human feedback (RLHF) frameworks typically optimize responses based on immediate turn-level human preferences. However, this approach falls short in multi-turn dialogue settings, such as online math tutoring. We propose a method to enhance LLM-based tutors by representing the dialogue history with a lower-dimensional latent state representation of a student and optimizing a long-term policy to determine high-level actions based on the latent state. The goal is to better align the tutor's behavior with the long-term objective of guiding the student towards solving a target math problem on their own. Our model is lightweight, requiring fewer computational resources than prior work that trains the tutor policy end-to-end to directly output the tutor's next utterance. Our experimental results demonstrate that these modifications lead to improved long-term outcomes compared to prompting in LLM-simulated tutoring tasks.

new iShumei-Chinchunmei at SemEval-2025 Task 4: A balanced forgetting and retention multi-task framework using effective unlearning loss

Authors: Yujian Sun, Tian Li

Abstract: As Large Language Models (LLMs) gain widespread adoption, increasing attention has been given to the challenge of making LLMs forget non-compliant data memorized during pre-training. Machine Unlearning focuses on efficiently erasing sensitive information from LLMs under limited computational resources. To advance research in this area, SemEval 2025 Task 4: "Unlearning Sensitive Content from Large Language Models" introduces three unlearning datasets and establishes a benchmark by evaluating both forgetting effectiveness and the preservation of standard capabilities. In this work, we propose a more controllable forgetting loss, Effective Unlearning Loss, and explore its integration with various techniques to achieve more efficient and controlled unlearning. Our system ultimately ranked 5th on the competition leaderboard.

new Beyond Isolated Dots: Benchmarking Structured Table Construction as Deep Knowledge Extraction

Authors: Tianyun Zhong, Guozhao Mo, Yanjiang Liu, Yihan Chen, Lingdi Kong, Xuanang Chen, Yaojie Lu, Hongyu Lin, Ben He, Le Sun

Abstract: With the emergence of large language models (LLMs), there is an expectation that LLMs can effectively extract explicit information from complex real-world documents (e.g., papers, reports). However, most LLMs generate paragraph-style answers that are chaotic, disorganized, and untraceable. To bridge this gap, we introduce the Arranged and Organized Extraction Benchmark (AOE), a new bilingual benchmark with data and documents of varying lengths designed to systematically evaluate the ability of LLMs to comprehend fragmented documents and reconstruct isolated information into one organized table. Unlike conventional text-to-table tasks, which rely on fixed schema and narrow task domains, AOE includes 11 carefully crafted tasks across three diverse domains, requiring models to generate context-specific schema tailored to varied input queries. In the experiment, we evaluated both open-source and closed-source state-of-the-art LLMs. The results show that even the most advanced models struggled significantly. The benchmark is available at https://huggingface.co/datasets/tianyumyum/AOE.

URLs: https://huggingface.co/datasets/tianyumyum/AOE.

new Language Detection by Means of the Minkowski Norm: Identification Through Character Bigrams and Frequency Analysis

Authors: Paul-Andrei Pogăcean, Sanda-Maria Avram

Abstract: The debate surrounding language identification has gained renewed attention in recent years, especially with the rapid evolution of AI-powered language models. However, the non-AI-based approaches to language identification have been overshadowed. This research explores a mathematical implementation of an algorithm for language determination by leveraging monogram and bigram frequency rankings derived from established linguistic research. The datasets used comprise texts varying in length, historical period, and genre, including short stories, fairy tales, and poems. Despite these variations, the method achieves over 80% accuracy on texts shorter than 150 characters and reaches 100% accuracy for longer texts. These results demonstrate that classical frequency-based approaches remain effective and scalable alternatives to AI-driven models for language detection.
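
The method is classical enough to sketch end to end: build character-bigram frequency rankings and compare them under a Minkowski norm. The out-of-place penalty for unseen bigrams and the choice p=1 are assumed conventions, not necessarily the paper's.

    from collections import Counter

    def bigram_ranks(text, top=100):
        """Rank character bigrams by frequency (rank 0 = most frequent)."""
        counts = Counter(text[i:i + 2] for i in range(len(text) - 1))
        return {bg: r for r, (bg, _) in enumerate(counts.most_common(top))}

    def minkowski(ranks_a, ranks_b, p=1, out_of_place=100):
        """Minkowski norm over rank displacements; unseen bigrams incur a
        fixed out-of-place penalty (an assumed convention)."""
        keys = set(ranks_a) | set(ranks_b)
        return sum(abs(ranks_a.get(k, out_of_place)
                       - ranks_b.get(k, out_of_place)) ** p
                   for k in keys) ** (1 / p)

    def identify(text, profiles):
        """Pick the language whose reference ranking is closest."""
        r = bigram_ranks(text)
        return min(profiles, key=lambda lang: minkowski(r, profiles[lang]))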

new SpeLLM: Character-Level Multi-Head Decoding

Authors: Amit Ben-Artzy, Roy Schwartz

Abstract: Scaling LLM vocabulary is often used to reduce input sequence length and alleviate attention's quadratic cost. Yet, current LLM architectures impose a critical bottleneck to this procedure: the output projection layer scales linearly with vocabulary size, rendering substantial expansion impractical. We propose SpeLLM, a method that decouples input and output vocabularies by predicting character-level strings through multiple output heads. In SpeLLM, each of the k linear heads predicts a single character simultaneously, enabling the model to represent a much larger output space using smaller, independent linear heads. We present a self-distillation approach for converting a standard LLM to a SpeLLM. Our experiments with four pre-trained LLMs show their SpeLLM variants achieve competitive performance on downstream tasks while reducing runtime by 5.1% on average across models. Our approach provides a potential avenue for reducing LLM costs, while increasing support for underrepresented languages and domains.
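
A minimal sketch of the multi-head character decoder described above; the hidden size, head count, and charset size are illustrative, and the paper's self-distillation recipe is not shown.

    import torch
    import torch.nn as nn

    class CharHeads(nn.Module):
        """k parallel linear heads, each predicting one character, so the
        output space grows as charset**k without one huge projection."""
        def __init__(self, hidden=768, k=8, charset=128):
            super().__init__()
            self.heads = nn.ModuleList(
                nn.Linear(hidden, charset) for _ in range(k))

        def forward(self, h):                       # h: (batch, hidden)
            logits = [head(h) for head in self.heads]
            return torch.stack([l.argmax(-1) for l in logits], dim=-1)

    # chars = CharHeads()(torch.randn(2, 768))      # -> (2, 8) character ids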

new Re:Form -- Reducing Human Priors in Scalable Formal Software Verification with RL in LLMs: A Preliminary Study on Dafny

Authors: Chuanhao Yan, Fengdi Che, Xuhan Huang, Xu Xu, Xin Li, Yizhi Li, Xingwei Qu, Jingzhe Shi, Zhuangzhuang He, Chenghua Lin, Yaodong Yang, Binhang Yuan, Hang Zhao, Yu Qiao, Bowen Zhou, Jie Fu

Abstract: Existing informal language-based (e.g., human language) Large Language Models (LLMs) trained with Reinforcement Learning (RL) face a significant challenge: their verification processes, which provide crucial training signals, are neither reliable nor scalable. In fact, the prevalent large proprietary models can hardly generate verifiable programs. A promising yet largely uncharted alternative is formal language-based reasoning. Grounding LLMs in rigorous formal systems where generative models operate in formal language spaces (e.g., Dafny) enables the automatic and mathematically provable verification of their reasoning processes and outcomes. This capability is pivotal for achieving large-scale, reliable formal software verification. It is a common practice to employ human-annotated chain-of-thought and other human priors to induce the reasoning and coding capabilities of LLMs. Unfortunately, providing such priors for supervising complex programming tasks becomes unacceptably labor-intensive. In this work, we systematically explore ways to reduce human priors with the formal language, Dafny, as the main environment for our pilot study. Our approach mainly relies on introducing an automatic and scalable data curation pipeline, and careful RL designs integrated with feedback from the formal language verifier. We introduce DafnyComp, a benchmark of compositional formal programs with auto-formalized specifications for specification reasoning. Our supervised fine-tuning (SFT) stage enables even small models (e.g., 0.5B) to generate syntactically valid and verifiable Dafny code, surpassing proprietary models. RL with regularization further improves performance, achieving stronger generalization to out-of-domain tasks and outperforming all strong baselines on the challenging DafnyComp benchmark.

new GG-BBQ: German Gender Bias Benchmark for Question Answering

Authors: Shalaka Satheesh, Katrin Klug, Katharina Beckh, Héctor Allende-Cid, Sebastian Houben, Teena Hassan

Abstract: Within the context of Natural Language Processing (NLP), fairness evaluation is often associated with the assessment of bias and reduction of associated harm. In this regard, the evaluation is usually carried out by using a benchmark dataset, for a task such as Question Answering, created for the measurement of bias in the model's predictions along various dimensions, including gender identity. In our work, we evaluate gender bias in German Large Language Models (LLMs) using the Bias Benchmark for Question Answering by Parrish et al. (2022) as a reference. Specifically, the templates in the gender identity subset of this English dataset were machine-translated into German. The errors in the machine-translated templates were then manually reviewed and corrected with the help of a language expert. We find that manual revision of the translation is crucial when creating datasets for gender bias evaluation because of the limitations of machine translation from English to a language such as German with grammatical gender. Our final dataset comprises two subsets: Subset-I, which consists of group terms related to gender identity, and Subset-II, where group terms are replaced with proper names. We evaluate several LLMs used for German NLP on this newly created dataset and report the accuracy and bias scores. The results show that all models exhibit bias, both along and against existing social stereotypes.

new PromptAL: Sample-Aware Dynamic Soft Prompts for Few-Shot Active Learning

Authors: Hui Xiang, Jinqiao Shi, Ting Zhang, Xiaojie Zhao, Yong Liu, Yong Ma

Abstract: Active learning (AL) aims to optimize model training and reduce annotation costs by selecting the most informative samples for labeling. Typically, AL methods rely on the empirical distribution of labeled data to define the decision boundary and perform uncertainty or diversity estimation, subsequently identifying potential high-quality samples. In few-shot scenarios, the empirical distribution often diverges significantly from the target distribution, causing the decision boundary to shift away from its optimal position. However, existing methods overlook the role of unlabeled samples in enhancing the empirical distribution to better align with the target distribution, resulting in a suboptimal decision boundary and the selection of samples that inadequately represent the target distribution. To address this, we propose a hybrid AL framework, termed PromptAL (Sample-Aware Dynamic Soft Prompts for Few-Shot Active Learning). This framework accounts for the contribution of each unlabeled data point in aligning the current empirical distribution with the target distribution, thereby optimizing the decision boundary. Specifically, PromptAL first leverages unlabeled data to construct sample-aware dynamic soft prompts that adjust the model's predictive distribution and decision boundary. Subsequently, based on the adjusted decision boundary, it integrates uncertainty estimation with both global and local diversity to select high-quality samples that more accurately represent the target distribution. Experimental results on six in-domain and three out-of-domain datasets show that PromptAL achieves superior performance over nine baselines. Our codebase is openly accessible.

new Dutch CrowS-Pairs: Adapting a Challenge Dataset for Measuring Social Biases in Language Models for Dutch

Authors: Elza Strazda, Gerasimos Spanakis

Abstract: Warning: This paper contains explicit statements of offensive stereotypes which might be upsetting. Language models are prone to exhibiting biases, further amplifying unfair and harmful stereotypes. Given the fast-growing popularity and wide application of these models, it is necessary to ensure safe and fair language models. Recently, considerable attention has been paid to measuring bias in language models, yet the majority of studies have focused only on the English language. A Dutch version of the US-specific CrowS-Pairs dataset for measuring bias in Dutch language models is introduced. The resulting dataset consists of 1463 sentence pairs that cover bias in 9 categories, such as Sexual orientation, Gender and Disability. The sentence pairs are composed of contrasting sentences, where one of the sentences concerns disadvantaged groups and the other advantaged groups. Using the Dutch CrowS-Pairs dataset, we show that various language models (BERTje, RobBERT, multilingual BERT, GEITje and Mistral-7B) exhibit substantial bias across the various bias categories. Using the English and French versions of the CrowS-Pairs dataset, bias was evaluated in English (BERT and RoBERTa) and French (FlauBERT and CamemBERT) language models, and it was shown that the English models exhibit the most bias, whereas the Dutch models exhibit the least. Additionally, results also indicate that assigning a persona to a language model changes the level of bias it exhibits. These findings highlight the variability of bias across languages and contexts, suggesting that cultural and linguistic factors play a significant role in shaping model biases.
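
CrowS-Pairs-style evaluations typically compare pseudo-log-likelihoods of the two sentences in each pair. The simplified sketch below scores a sentence with BERTje by masking one token at a time; it is a generic illustration of that metric, not the authors' exact scoring code.

    import torch
    from transformers import AutoModelForMaskedLM, AutoTokenizer

    name = "GroNLP/bert-base-dutch-cased"           # BERTje
    tok = AutoTokenizer.from_pretrained(name)
    mlm = AutoModelForMaskedLM.from_pretrained(name).eval()

    def pseudo_log_likelihood(sentence):
        """Sum log-probabilities of each token with that token masked."""
        ids = tok(sentence, return_tensors="pt").input_ids
        total = 0.0
        for i in range(1, ids.shape[1] - 1):        # skip [CLS] and [SEP]
            masked = ids.clone()
            masked[0, i] = tok.mask_token_id
            with torch.no_grad():
                logits = mlm(masked).logits
            total += torch.log_softmax(logits[0, i], -1)[ids[0, i]].item()
        return total

    # A pair counts toward bias when the sentence about the disadvantaged
    # group scores higher than its advantaged-group counterpart.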

new Towards Enforcing Company Policy Adherence in Agentic Workflows

Authors: Naama Zwerdling, David Boaz, Ella Rabinovich, Guy Uziel, David Amid, Ateret Anaby-Tavor

Abstract: Large Language Model (LLM) agents hold promise for a flexible and scalable alternative to traditional business process automation, but struggle to reliably follow complex company policies. In this study we introduce a deterministic, transparent, and modular framework for enforcing business policy adherence in agentic workflows. Our method operates in two phases: (1) an offline buildtime stage that compiles policy documents into verifiable guard code associated with tool use, and (2) a runtime integration where these guards ensure compliance before each agent action. We demonstrate our approach on the challenging τ-bench Airlines domain, showing encouraging preliminary results in policy enforcement, and further outline key challenges for real-world deployments.
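
The two-phase idea (compile policy text into guard code offline, check guards before every tool call at runtime) can be pictured with a small decorator; the airline rule below is a hypothetical example in the spirit of the τ-bench domain, not a rule taken from the paper.

    class PolicyViolation(Exception):
        pass

    def guard(check, message):
        """Attach a deterministic precondition (compiled offline from a
        policy document) to a tool; the call is blocked unless it passes."""
        def decorate(tool):
            def wrapped(*args, **kwargs):
                if not check(*args, **kwargs):
                    raise PolicyViolation(message)
                return tool(*args, **kwargs)
            return wrapped
        return decorate

    # Hypothetical rule: basic-economy bookings cannot add checked bags.
    @guard(lambda booking: booking.get("fare") != "basic_economy"
           or booking.get("bags", 0) == 0,
           "Policy: basic economy bookings cannot add checked bags")
    def update_booking(booking):
        return {"status": "updated", **booking}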

new ICR Probe: Tracking Hidden State Dynamics for Reliable Hallucination Detection in LLMs

Authors: Zhenliang Zhang, Xinyu Hu, Huixuan Zhang, Junzhe Zhang, Xiaojun Wan

Abstract: Large language models (LLMs) excel at various natural language processing tasks, but their tendency to generate hallucinations undermines their reliability. Existing hallucination detection methods leveraging hidden states predominantly focus on static and isolated representations, overlooking their dynamic evolution across layers, which limits efficacy. To address this limitation, we shift the focus to the hidden state update process and introduce a novel metric, the ICR Score (Information Contribution to Residual Stream), which quantifies the contribution of modules to the hidden states' update. We empirically validate that the ICR Score is effective and reliable in distinguishing hallucinations. Building on these insights, we propose a hallucination detection method, the ICR Probe, which captures the cross-layer evolution of hidden states. Experimental results show that the ICR Probe achieves superior performance with significantly fewer parameters. Furthermore, ablation studies and case analyses offer deeper insights into the underlying mechanism of this method, improving its interpretability.

new Combining Language and Topic Models for Hierarchical Text Classification

Authors: Jaco du Toit, Marcel Dunaiski

Abstract: Hierarchical text classification (HTC) is a natural language processing task which has the objective of categorising text documents into a set of classes from a predefined structured class hierarchy. Recent HTC approaches use various techniques to incorporate the hierarchical class structure information with the natural language understanding capabilities of pre-trained language models (PLMs) to improve classification performance. Furthermore, using topic models along with PLMs to extract features from text documents has been shown to be an effective approach for multi-label text classification tasks. The rationale behind the combination of these feature extractor models is that the PLM captures the finer-grained contextual and semantic information while the topic model obtains high-level representations which consider the corpus of documents as a whole. In this paper, we use a HTC approach which uses a PLM and a topic model to extract features from text documents which are used to train a classification model. Our objective is to determine whether the combination of the features extracted from the two models is beneficial to HTC performance in general. In our approach, the extracted features are passed through separate convolutional layers whose outputs are combined and passed to a label-wise attention mechanism which obtains label-specific document representations by weighting the most important features for each class separately. We perform comprehensive experiments on three HTC benchmark datasets and show that using the features extracted from the topic model generally decreases classification performance compared to only using the features obtained by the PLM. In contrast to previous work, this shows that the incorporation of features extracted from topic models for text classification tasks should not be assumed beneficial.

new The Ever-Evolving Science Exam

Authors: Junying Wang, Zicheng Zhang, Yijin Guo, Farong Wen, Ye Shen, Yingji Liang, Yalun Wu, Wenzhe Li, Chunyi Li, Zijian Chen, Qi Jia, Guangtao Zhai

Abstract: As foundation models grow rapidly in capability and deployment, evaluating their scientific understanding becomes increasingly critical. Existing science benchmarks have made progress towards broad Range, wide Reach, and high Rigor, yet they often face two major challenges: data leakage risks that compromise benchmarking validity, and evaluation inefficiency due to large-scale testing. To address these issues, we introduce the Ever-Evolving Science Exam (EESE), a dynamic benchmark designed to reliably assess scientific capabilities in foundation models. Our approach consists of two components: 1) a non-public EESE-Pool with over 100K expertly constructed science instances (question-answer pairs) across 5 disciplines and 500+ subfields, built through a multi-stage pipeline ensuring Range, Reach, and Rigor; 2) a periodically updated 500-instance subset EESE, sampled and validated to enable leakage-resilient, low-overhead evaluations. Experiments on 32 open- and closed-source models demonstrate that EESE effectively differentiates the strengths and weaknesses of models in scientific fields and cognitive dimensions. Overall, EESE provides a robust, scalable, and forward-compatible solution for science benchmark design, offering a realistic measure of how well foundation models handle science questions. The project page is at: https://github.com/aiben-ch/EESE.

URLs: https://github.com/aiben-ch/EESE.

new Introducing Quality Estimation to Machine Translation Post-editing Workflow: An Empirical Study on Its Usefulness

Authors: Siqi Liu, Guangrong Dai, Dechao Li

Abstract: This preliminary study investigates the usefulness of sentence-level Quality Estimation (QE) in English-Chinese Machine Translation Post-Editing (MTPE), focusing on its impact on post-editing speed and student translators' perceptions. It also explores the interaction effects between QE and MT quality, as well as between QE and translation expertise. The findings reveal that QE significantly reduces post-editing time. The examined interaction effects were not significant, suggesting that QE consistently improves MTPE efficiency across medium- and high-quality MT outputs and among student translators with varying levels of expertise. In addition to indicating potentially problematic segments, QE serves multiple functions in MTPE, such as validating translators' evaluations of MT quality and enabling them to double-check translation outputs. However, interview data suggest that inaccurate QE may hinder post-editing processes. This research provides new insights into the strengths and limitations of QE, facilitating its more effective integration into MTPE workflows to enhance translators' productivity.

new Learning Text Styles: A Study on Transfer, Attribution, and Verification

Authors: Zhiqiang Hu

Abstract: This thesis advances the computational understanding and manipulation of text styles through three interconnected pillars: (1) Text Style Transfer (TST), which alters stylistic properties (e.g., sentiment, formality) while preserving content; (2) Authorship Attribution (AA), identifying the author of a text via stylistic fingerprints; and (3) Authorship Verification (AV), determining whether two texts share the same authorship. We address critical challenges in these areas by leveraging parameter-efficient adaptation of large language models (LLMs), contrastive disentanglement of stylistic features, and instruction-based fine-tuning for explainable verification.

new Exploring Gender Bias in Large Language Models: An In-depth Dive into the German Language

Authors: Kristin Gnadt, David Thulke, Simone Kopeinik, Ralf Schlüter

Abstract: In recent years, various methods have been proposed to evaluate gender bias in large language models (LLMs). A key challenge lies in the transferability of bias measurement methods initially developed for the English language when applied to other languages. This work aims to contribute to this research strand by presenting five German datasets for gender bias evaluation in LLMs. The datasets are grounded in well-established concepts of gender bias and are accessible through multiple methodologies. Our findings, reported for eight multilingual LLM models, reveal unique challenges associated with gender bias in German, including the ambiguous interpretation of male occupational terms and the influence of seemingly neutral nouns on gender perception. This work contributes to the understanding of gender bias in LLMs across languages and underscores the necessity for tailored evaluation frameworks.

new Pixels to Principles: Probing Intuitive Physics Understanding in Multimodal Language Models

Authors: Mohamad Ballout, Serwan Jassim, Elia Bruni

Abstract: This paper presents a systematic evaluation of state-of-the-art multimodal large language models (MLLMs) on intuitive physics tasks using the GRASP and IntPhys 2 datasets. We assess the open-source models InternVL 2.5, Qwen 2.5 VL, LLaVA-OneVision, and the proprietary Gemini 2.0 Flash Thinking, finding that even the latest models struggle to reliably distinguish physically plausible from implausible scenarios. To go beyond performance metrics, we conduct a probing analysis of model embeddings, extracting intermediate representations at key processing stages to examine how well task-relevant information is preserved. Our results show that, depending on task difficulty, a critical vision-language misalignment can emerge: vision encoders successfully capture physical plausibility cues, but this information is not effectively utilized by the language model, leading to failures in reasoning. This misalignment suggests that the primary limitation of MLLMs in intuitive physics tasks is not the vision component but the ineffective integration of visual and linguistic information. Our findings highlight vision-language alignment as a key area for improvement, offering insights for future MLLM development.
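
The probing analysis reduces to fitting linear classifiers on frozen intermediate representations. A minimal sketch, assuming the embeddings have already been extracted into arrays; the misalignment signature is high probe accuracy on vision features alongside low accuracy downstream.

    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    def probe_accuracy(features, labels):
        """Cross-validated accuracy of a linear probe on frozen features."""
        clf = LogisticRegression(max_iter=1000)
        return cross_val_score(clf, features, labels, cv=5).mean()

    # e.g., compare stages of the same model on plausible/implausible labels:
    # acc_vision = probe_accuracy(vision_encoder_feats, labels)
    # acc_lm     = probe_accuracy(late_language_feats, labels)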

new Step-Audio 2 Technical Report

Authors: Boyong Wu, Chao Yan, Chen Hu, Cheng Yi, Chengli Feng, Fei Tian, Feiyu Shen, Gang Yu, Haoyang Zhang, Jingbei Li, Mingrui Chen, Peng Liu, Wang You, Xiangyu Tony Zhang, Xingyuan Li, Xuerui Yang, Yayue Deng, Yechang Huang, Yuxin Li, Yuxin Zhang, Zhao You, Brian Li, Changyi Wan, Hanpeng Hu, Jiangjie Zhen, Siyu Chen, Song Yuan, Xuelin Zhang, Yimin Jiang, Yu Zhou, Yuxiang Yang, Bingxin Li, Buyun Ma, Changhe Song, Dongqing Pang, Guoqiang Hu, Haiyang Sun, Kang An, Na Wang, Shuli Gao, Wei Ji, Wen Li, Wen Sun, Xuan Wen, Yong Ren, Yuankai Ma, Yufan Lu, Bin Wang, Bo Li, Changxin Miao, Che Liu, Chen Xu, Dapeng Shi, Dingyuan Hu, Donghang Wu, Enle Liu, Guanzhe Huang, Gulin Yan, Han Zhang, Hao Nie, Haonan Jia, Hongyu Zhou, Jianjian Sun, Jiaoren Wu, Jie Wu, Jie Yang, Jin Yang, Junzhe Lin, Kaixiang Li, Lei Yang, Liying Shi, Li Zhou, Longlong Gu, Ming Li, Mingliang Li, Mingxiao Li, Nan Wu, Qi Han, Qinyuan Tan, Shaoliang Pang, Shengjie Fan, Siqi Liu, Tiancheng Cao, Wanying Lu, Wenqing He, Wuxun Xie, Xu Zhao, Xueqi Li, Yanbo Yu, Yang Yang, Yi Liu, Yifan Lu, Yilei Wang, Yuanhao Ding, Yuanwei Liang, Yuanwei Lu, Yuchu Luo, Yuhe Yin, Yumeng Zhan, Yuxiang Zhang, Zidong Yang, Zixin Zhang, Binxing Jiao, Daxin Jiang, Heung-Yeung Shum, Jiansheng Chen, Jing Li, Xiangyu Zhang, Yibo Zhu

Abstract: This paper presents Step-Audio 2, an end-to-end multi-modal large language model designed for industry-strength audio understanding and speech conversation. By integrating a latent audio encoder and reasoning-centric reinforcement learning (RL), Step-Audio 2 achieves promising performance in automatic speech recognition (ASR) and audio understanding. To facilitate genuine end-to-end speech conversation, Step-Audio 2 incorporates the generation of discrete audio tokens into language modeling, significantly enhancing its responsiveness to paralinguistic information such as speaking styles and emotions. To effectively leverage the rich textual and acoustic knowledge in real-world data, Step-Audio 2 integrates retrieval-augmented generation (RAG) and is able to call external tools such as web search to mitigate hallucination and audio search to switch timbres. Trained on millions of hours of speech and audio data, Step-Audio 2 delivers intelligence and expressiveness across diverse conversational scenarios. Evaluation results demonstrate that Step-Audio 2 achieves state-of-the-art performance on various audio understanding and conversational benchmarks compared to other open-source and commercial solutions. Please visit https://github.com/stepfun-ai/Step-Audio2 for more information.

URLs: https://github.com/stepfun-ai/Step-Audio2

new Towards Automated Regulatory Compliance Verification in Financial Auditing with Large Language Models

Authors: Armin Berger, Lars Hillebrand, David Leonhard, Tobias Deußer, Thiago Bell Felix de Oliveira, Tim Dilmaghani, Mohamed Khaled, Bernd Kliem, Rüdiger Loitz, Christian Bauckhage, Rafet Sifa

Abstract: The auditing of financial documents, historically a labor-intensive process, stands on the precipice of transformation. AI-driven solutions have made inroads into streamlining this process by recommending pertinent text passages from financial reports to align with the legal requirements of accounting standards. However, a glaring limitation remains: these systems commonly fall short in verifying if the recommended excerpts indeed comply with the specific legal mandates. Hence, in this paper, we probe the efficiency of publicly available Large Language Models (LLMs) in the realm of regulatory compliance across different model configurations. We place particular emphasis on comparing cutting-edge open-source LLMs, such as Llama-2, with their proprietary counterparts like OpenAI's GPT models. This comparative analysis leverages two custom datasets provided by our partner PricewaterhouseCoopers (PwC) Germany. We find that the open-source Llama-2 70 billion model demonstrates outstanding performance in detecting non-compliance or true negative occurrences, beating all of its proprietary counterparts. Nevertheless, proprietary models such as GPT-4 perform best in a broad variety of scenarios, particularly in non-English contexts.

new P-CoT: A Pedagogically-motivated Participatory Chain-of-Thought Prompting for Phonological Reasoning in LLMs

Authors: Dongjun Jang, Youngchae Ahn, Hyopil Shin

Abstract: This study explores the potential of phonological reasoning within text-based large language models (LLMs). Utilizing the PhonologyBench benchmark, we assess tasks like rhyme word generation, g2p conversion, and syllable counting. Our evaluations across 12 LLMs reveal that while few-shot learning offers inconsistent gains, the introduction of a novel Pedagogically-motivated Participatory Chain-of-Thought (P-CoT) prompt, which is anchored in educational theories like scaffolding and discovery learning, consistently enhances performance. This method leverages structured guidance to activate latent phonological abilities, achieving up to 52% improvement and even surpassing human baselines in certain tasks. Future work could aim to optimize P-CoT prompts for specific models or explore their application across different linguistic domains.

new Self-Contradiction as Self-Improvement: Mitigating the Generation-Understanding Gap in MLLMs

Authors: Yujin Han, Hao Chen, Andi Han, Zhiheng Wang, Xinyu Lin, Yingya Zhang, Shiwei Zhang, Difan Zou

Abstract: Despite efforts to unify multimodal generation and understanding tasks in a single model, we show these MLLMs exhibit self-contradiction, where generation produces images deemed misaligned with input prompts based on the model's own understanding. We define a Nonunified score that quantifies such self-contradiction. Our empirical results reveal that the self-contradiction mainly arises from weak generation that fails to align with prompts, rather than misunderstanding. This capability asymmetry indicates the potential of leveraging self-contradiction for self-improvement, where the stronger model understanding guides the weaker generation to mitigate the generation-understanding gap. Applying standard post-training methods (e.g., SFT, DPO) with such internal supervision successfully improves both generation and unification. We discover a co-improvement effect on both generation and understanding when only fine-tuning the generation branch, a phenomenon known in pre-training but underexplored in post-training. Our analysis shows improvements stem from better detection of false positives that were previously incorrectly identified as prompt-aligned. Theoretically, we show that the aligned training dynamics between generation and understanding allow reduced prompt-misaligned generations to also improve mismatch detection in the understanding branch. Additionally, the framework reveals a potential risk of co-degradation under poor supervision, an overlooked phenomenon that is empirically validated in our experiments. Notably, we find intrinsic metrics like the Nonunified score cannot distinguish co-degradation from co-improvement, which highlights the necessity of data quality checks. Finally, we propose a curriculum-based strategy based on our findings that gradually introduces harder samples as the model improves, leading to better unification and improved MLLM generation and understanding.
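
In the abstract's spirit, the self-contradiction measure can be phrased as the fraction of a model's own generations that its own understanding branch rejects; the sketch below assumes callables for both branches, and the paper's exact Nonunified score may be defined differently.

    def nonunified_score(prompts, generate_image, judge_aligned):
        """Fraction of self-generated images the model itself judges
        misaligned with the prompt (a hedged reading of the metric)."""
        misaligned = sum(not judge_aligned(p, generate_image(p))
                         for p in prompts)
        return misaligned / len(prompts)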

new PICACO: Pluralistic In-Context Value Alignment of LLMs via Total Correlation Optimization

Authors: Han Jiang, Dongyao Zhu, Zhihua Wei, Xiaoyuan Yi, Ziang Xiao, Xing Xie

Abstract: In-Context Learning has shown great potential for aligning Large Language Models (LLMs) with human values, helping reduce harmful outputs and accommodate diverse preferences without costly post-training, known as In-Context Alignment (ICA). However, LLMs' comprehension of input prompts remains agnostic, limiting ICA's ability to address value tensions--human values are inherently pluralistic, often imposing conflicting demands, e.g., stimulation vs. tradition. Current ICA methods therefore face the Instruction Bottleneck challenge, where LLMs struggle to reconcile multiple intended values within a single prompt, leading to incomplete or biased alignment. To address this, we propose PICACO, a novel pluralistic ICA method. Without fine-tuning, PICACO optimizes a meta-instruction that navigates multiple values to better elicit LLMs' understanding of them and improve their alignment. This is achieved by maximizing the total correlation between specified values and LLM responses, theoretically reinforcing value correlation while reducing distractive noise, resulting in effective value instructions. Extensive experiments on five value sets show that PICACO works well with both black-box and open-source LLMs, outperforms several recent strong baselines, and achieves a better balance across up to 8 distinct values.

new Interpretable Topic Extraction and Word Embedding Learning using row-stochastic DEDICOM

Authors: Lars Hillebrand, David Biesner, Christian Bauckhage, Rafet Sifa

Abstract: The DEDICOM algorithm provides a uniquely interpretable matrix factorization method for symmetric and asymmetric square matrices. We employ a new row-stochastic variation of DEDICOM on the pointwise mutual information matrices of text corpora to identify latent topic clusters within the vocabulary and simultaneously learn interpretable word embeddings. We introduce a method to efficiently train a constrained DEDICOM algorithm and a qualitative evaluation of its topic modeling and word embedding performance.
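
For intuition, a DEDICOM factorization S ≈ A R A^T with row-stochastic A can be fit with plain gradient descent by parameterizing each row of A through a softmax. This toy sketch is didactic only; the paper introduces its own, more efficient constrained training scheme.

    import numpy as np

    def row_stochastic_dedicom(S, k=5, steps=2000, lr=0.05, seed=0):
        """Minimize ||A R A^T - S||_F^2 with rows of A on the simplex."""
        rng = np.random.default_rng(seed)
        Z = 0.1 * rng.normal(size=(S.shape[0], k))   # logits, A = softmax(Z)
        R = 0.1 * rng.normal(size=(k, k))
        for _ in range(steps):
            E = np.exp(Z - Z.max(axis=1, keepdims=True))
            A = E / E.sum(axis=1, keepdims=True)     # row-stochastic loadings
            D = A @ R @ A.T - S                      # residual
            gA = 2 * (D @ A @ R.T + D.T @ A @ R)     # dL/dA
            gZ = A * (gA - (gA * A).sum(axis=1, keepdims=True))  # softmax rule
            R -= lr * 2 * (A.T @ D @ A)              # dL/dR step
            Z -= lr * gZ
        return A, R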

new Advancing Risk and Quality Assurance: A RAG Chatbot for Improved Regulatory Compliance

Authors: Lars Hillebrand, Armin Berger, Daniel Uedelhoven, David Berghaus, Ulrich Warning, Tim Dilmaghani, Bernd Kliem, Thomas Schmid, Rüdiger Loitz, Rafet Sifa

Abstract: Risk and Quality (R&Q) assurance in highly regulated industries requires constant navigation of complex regulatory frameworks, with employees handling numerous daily queries demanding accurate policy interpretation. Traditional methods relying on specialized experts create operational bottlenecks and limit scalability. We present a novel Retrieval Augmented Generation (RAG) system leveraging Large Language Models (LLMs), hybrid search and relevance boosting to enhance R&Q query processing. Evaluated on 124 expert-annotated real-world queries, our actively deployed system demonstrates substantial improvements over traditional RAG approaches. Additionally, we perform an extensive hyperparameter analysis to compare and evaluate multiple configuration setups, delivering valuable insights to practitioners.

new RAVine: Reality-Aligned Evaluation for Agentic Search

Authors: Yilong Xu, Xiang Long, Zhi Zheng, Jinhua Gao

Abstract: Agentic search, as a more autonomous and adaptive paradigm of retrieval augmentation, is driving the evolution of intelligent search systems. However, existing evaluation frameworks fail to align well with the goals of agentic search. First, the complex queries commonly used in current benchmarks often deviate from realistic user search scenarios. Second, prior approaches tend to introduce noise when extracting ground truth for end-to-end evaluations, leading to distorted assessments at a fine-grained level. Third, most current frameworks focus solely on the quality of final answers, neglecting the evaluation of the iterative process inherent to agentic search. To address these limitations, we propose RAVine -- a Reality-Aligned eValuation framework for agentic LLMs with search. RAVine targets multi-point queries and long-form answers that better reflect user intents, and introduces an attributable ground truth construction strategy to enhance the accuracy of fine-grained evaluation. Moreover, RAVine examines models' interactions with search tools throughout the iterative process, and accounts for efficiency factors. We benchmark a series of models using RAVine and derive several insights, which we hope will contribute to advancing the development of agentic search systems. The code and datasets are available at https://github.com/SwordFaith/RAVine.

URLs: https://github.com/SwordFaith/RAVine.

new Unpacking Ambiguity: The Interaction of Polysemous Discourse Markers and Non-DM Signals

Authors: Jingni Wu, Amir Zeldes

Abstract: Discourse markers (DMs) like 'but' or 'then' are crucial for creating coherence in discourse, yet they are often replaced by or co-occur with non-DMs ('in the morning' can mean the same as 'then'), and both can be ambiguous ('since' can refer to time or cause). The interaction mechanism between such signals remains unclear but pivotal for their disambiguation. In this paper we investigate the relationship between DM polysemy and co-occurrence of non-DM signals in English, as well as the influence of genre on these patterns. Using the framework of eRST, we propose a graded definition of DM polysemy, and conduct correlation and regression analyses to examine whether polysemous DMs are accompanied by more numerous and diverse non-DM signals. Our findings reveal that while polysemous DMs do co-occur with more diverse non-DMs, the total number of co-occurring signals does not necessarily increase. Moreover, genre plays a significant role in shaping DM-signal interactions.

new Beyond Context Limits: Subconscious Threads for Long-Horizon Reasoning

Authors: Hongyin Luo, Nathaniel Morgan, Tina Li, Derek Zhao, Ai Vy Ngo, Philip Schroeder, Lijie Yang, Assaf Ben-Kish, Jack O'Brien, James Glass

Abstract: To break the context limits of large language models (LLMs) that bottleneck reasoning accuracy and efficiency, we propose the Thread Inference Model (TIM), a family of LLMs trained for recursive and decompositional problem solving, and TIMRUN, an inference runtime enabling long-horizon structured reasoning beyond context limits. Together, TIM hosted on TIMRUN supports virtually unlimited working memory and multi-hop tool calls within a single language model inference, overcoming output limits, positional-embedding constraints, and GPU-memory bottlenecks. This performance is achieved by modeling natural language as reasoning trees, measured by both length and depth, instead of linear sequences. The reasoning trees consist of tasks with thoughts, recursive subtasks, and conclusions, building on the concept we proposed in Schroeder et al., 2025. During generation, we maintain a working memory that retains only the key-value states of the most relevant context tokens, selected by a rule-based subtask-pruning mechanism, enabling reuse of positional embeddings and GPU memory pages throughout reasoning. Experimental results show that our system sustains high inference throughput, even when manipulating up to 90% of the KV cache in GPU memory. It also delivers accurate reasoning on mathematical tasks and handles information retrieval challenges that require long-horizon reasoning and multi-hop tool use.
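
The runtime mechanics are described only at a high level; the toy sketch below, in which every class and method name is hypothetical, illustrates the rule-based pruning idea of evicting a finished subtask's detailed context while keeping its conclusion:

from dataclasses import dataclass, field

class WorkingMemory:
    """Toy stand-in for the KV-cache working memory (illustrative only)."""
    def __init__(self):
        self.retained = []
    def discard(self, text):
        self.retained = [t for t in self.retained if t != text]
    def add(self, text):
        self.retained.append(text)

@dataclass
class Subtask:
    thought: str
    conclusion: str = ""
    done: bool = False
    children: list = field(default_factory=list)

def prune(node, memory):
    """Recursively evict finished subtasks' internal tokens, retaining only
    their conclusions, so positions and KV pages can be reused."""
    for child in node.children:
        prune(child, memory)
        if child.done:
            memory.discard(child.thought)   # drop the subtask's detailed states
            memory.add(child.conclusion)    # keep the distilled result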

new Test-Time-Matching: Decouple Personality, Memory, and Linguistic Style in LLM-based Role-Playing Language Agent

Authors: Xiaoyu Zhan, Xinyu Fu, Hao Sun, Yuanqi Li, Jie Guo, Yanwen Guo

Abstract: The rapid advancement of large language models (LLMs) has enabled role-playing language agents to demonstrate significant potential in various applications. However, relying solely on prompts and contextual inputs often proves insufficient for achieving deep immersion in specific roles, particularly well-known fictional or public figures. On the other hand, fine-tuning-based approaches face limitations due to the challenges associated with data collection and the computational resources required for training, thereby restricting their broader applicability. To address these issues, we propose Test-Time-Matching (TTM), a training-free role-playing framework based on test-time scaling and context engineering. TTM uses LLM agents to automatically decouple a character's features into personality, memory, and linguistic style. Our framework involves a structured, three-stage generation pipeline that utilizes these features for controlled role-playing. It achieves high-fidelity role-playing performance and also enables seamless combinations across diverse linguistic styles, and even variations in personality and memory. We evaluate our framework through human assessment, and the results demonstrate that our method achieves outstanding performance in generating expressive and stylistically consistent character dialogues.

new Agentar-Fin-R1: Enhancing Financial Intelligence through Domain Expertise, Training Efficiency, and Advanced Reasoning

Authors: Yanjun Zheng, Xiyang Du, Longfei Liao, Xiaoke Zhao, Zhaowen Zhou, Bo Zhang, Jiawei Liu, Xiang Qi, Zhe Li, Zhiqiang Zhang, Wei Wang, Peng Zhang

Abstract: Large Language Models (LLMs) exhibit considerable promise in financial applications; however, prevailing models frequently demonstrate limitations when confronted with scenarios that necessitate sophisticated reasoning capabilities, stringent trustworthiness criteria, and efficient adaptation to domain-specific requirements. We introduce the Agentar-Fin-R1 series of financial large language models (8B and 32B parameters), specifically engineered based on the Qwen3 foundation model to enhance reasoning capabilities, reliability, and domain specialization for financial applications. Our optimization approach integrates a high-quality, systematic financial task label system with a comprehensive multi-layered trustworthiness assurance framework. This framework encompasses high-quality trustworthy knowledge engineering, multi-agent trustworthy data synthesis, and rigorous data validation governance. Through label-guided automated difficulty-aware optimization, a two-stage training pipeline, and dynamic attribution systems, we achieve substantial improvements in training efficiency. Our models undergo comprehensive evaluation on mainstream financial benchmarks including Fineva, FinEval, and FinanceIQ, as well as general reasoning datasets such as MATH-500 and GPQA-diamond. To thoroughly assess real-world deployment capabilities, we propose the Finova evaluation benchmark, which focuses on agent-level financial reasoning and compliance verification. Experimental results demonstrate that Agentar-Fin-R1 not only achieves state-of-the-art performance on financial tasks but also exhibits exceptional general reasoning capabilities, validating its effectiveness as a trustworthy solution for high-stakes financial applications. The Finova benchmark is available at https://github.com/antgroup/Finova.

URLs: https://github.com/antgroup/Finova.

new LingBench++: A Linguistically-Informed Benchmark and Reasoning Framework for Multi-Step and Cross-Cultural Inference with LLMs

Authors: Da-Chen Lian, Ri-Sheng Huang, Pin-Er Chen, Chunki Lim, You-Kuan Lin, Guan-Yu Tseng, Zi-Cheng Yang, Shu-Kai Hsieh

Abstract: We propose LingBench++, a linguistically-informed benchmark and reasoning framework designed to evaluate large language models (LLMs) on complex linguistic tasks inspired by the International Linguistics Olympiad (IOL). Unlike prior benchmarks that focus solely on final answer accuracy, LingBench++ provides structured reasoning traces, stepwise evaluation protocols, and rich typological metadata across over 90 low-resource and cross-cultural languages. We further develop a multi-agent architecture integrating grammatical knowledge retrieval, tool-augmented reasoning, and deliberate hypothesis testing. Through systematic comparisons of baseline and our proposed agentic models, we demonstrate that models equipped with external knowledge sources and iterative reasoning outperform single-pass approaches in both accuracy and interpretability. LingBench++ offers a comprehensive foundation for advancing linguistically grounded, culturally informed, and cognitively plausible reasoning in LLMs.

new MegaScience: Pushing the Frontiers of Post-Training Datasets for Science Reasoning

Authors: Run-Ze Fan, Zengzhi Wang, Pengfei Liu

Abstract: Scientific reasoning is critical for developing AI scientists and supporting human researchers in advancing the frontiers of natural science discovery. However, the open-source community has primarily focused on mathematics and coding while neglecting the scientific domain, largely due to the absence of open, large-scale, high-quality, verifiable scientific reasoning datasets. To bridge this gap, we first present TextbookReasoning, an open dataset featuring truthful reference answers extracted from 12k university-level scientific textbooks, comprising 650k reasoning questions spanning 7 scientific disciplines. We further introduce MegaScience, a large-scale mixture of high-quality open-source datasets totaling 1.25 million instances, developed through systematic ablation studies that evaluate various data selection methodologies to identify the optimal subset for each publicly available scientific dataset. Meanwhile, we build a comprehensive evaluation system covering diverse subjects and question types across 15 benchmarks, incorporating comprehensive answer extraction strategies to ensure accurate evaluation metrics. Our experiments demonstrate that our datasets achieve superior performance and training efficiency with more concise response lengths compared to existing open-source scientific datasets. Furthermore, we train Llama3.1, Qwen2.5, and Qwen3 series base models on MegaScience, which significantly outperform the corresponding official instruct models in average performance. In addition, MegaScience exhibits greater effectiveness for larger and stronger models, suggesting a scaling benefit for scientific tuning. We release our data curation pipeline, evaluation system, datasets, and seven trained models to the community to advance scientific reasoning research.

cross RDMA: Cost Effective Agent-Driven Rare Disease Discovery within Electronic Health Record Systems

Authors: John Wu, Adam Cross, Jimeng Sun

Abstract: Rare diseases affect 1 in 10 Americans, yet standard ICD coding systems fail to capture these conditions in electronic health records (EHR), leaving crucial information buried in clinical notes. Current approaches struggle with medical abbreviations, miss implicit disease mentions, raise privacy concerns with cloud processing, and lack clinical reasoning abilities. We present Rare Disease Mining Agents (RDMA), a framework that mirrors how medical experts identify rare disease patterns in EHR. RDMA connects scattered clinical observations that together suggest specific rare conditions. By handling clinical abbreviations, recognizing implicit disease patterns, and applying contextual reasoning locally on standard hardware, RDMA reduces privacy risks while improving F1 performance by upwards of 30\% and decreasing inference costs 10-fold. This approach helps clinicians avoid the privacy risk of using cloud services while accessing key rare disease information from EHR systems, supporting earlier diagnosis for rare disease patients. Available at https://github.com/jhnwu3/RDMA.

URLs: https://github.com/jhnwu3/RDMA.

cross Why Braking? Scenario Extraction and Reasoning Utilizing LLM

Authors: Yin Wu, Daniel Slieter, Vivek Subramanian, Ahmed Abouelazm, Robin Bohn, J. Marius Z\"ollner

Abstract: The growing number of ADAS-equipped vehicles has led to a dramatic increase in driving data, yet most of it captures routine driving behavior. Identifying and understanding safety-critical corner cases within this vast dataset remains a significant challenge. Braking events are particularly indicative of potentially hazardous situations, motivating the central question of our research: Why does a vehicle brake? Existing approaches primarily rely on rule-based heuristics to retrieve target scenarios using predefined condition filters. While effective in simple environments such as highways, these methods lack generalization in complex urban settings. In this paper, we propose a novel framework that leverages a Large Language Model (LLM) for scenario understanding and reasoning. Our method bridges the gap between low-level numerical signals and natural language descriptions, enabling the LLM to interpret and classify driving scenarios. We propose a dual-path scenario retrieval approach that supports both category-based search for known scenarios and embedding-based retrieval for unknown Out-of-Distribution (OOD) scenarios. To facilitate evaluation, we curate scenario annotations on the Argoverse 2 Sensor Dataset. Experimental results show that our method outperforms rule-based baselines and generalizes well to OOD scenarios.
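
A schematic of the dual-path retrieval described; the category set, data layout, and top-k cutoff are assumptions:

import numpy as np

KNOWN_CATEGORIES = {"cut_in", "pedestrian_crossing", "traffic_light_stop"}  # illustrative

def retrieve(query_label, query_embedding, index, top_k=10):
    """Path 1: exact category search for known scenario types.
    Path 2: embedding similarity as a fallback for OOD scenarios."""
    if query_label in KNOWN_CATEGORIES:
        return [s for s in index if s["label"] == query_label]
    sims = np.array([query_embedding @ s["embedding"] for s in index])
    return [index[i] for i in np.argsort(-sims)[:top_k]]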

cross Document Haystack: A Long Context Multimodal Image/Document Understanding Vision LLM Benchmark

Authors: Goeric Huybrechts, Srikanth Ronanki, Sai Muralidhar Jayanthi, Jack Fitzgerald, Srinivasan Veeravanallur

Abstract: The proliferation of multimodal Large Language Models has significantly advanced the ability to analyze and understand complex data inputs from different modalities. However, the processing of long documents remains under-explored, largely due to a lack of suitable benchmarks. To address this, we introduce Document Haystack, a comprehensive benchmark designed to evaluate the performance of Vision Language Models (VLMs) on long, visually complex documents. Document Haystack features documents ranging from 5 to 200 pages and strategically inserts pure text or multimodal text+image "needles" at various depths within the documents to challenge VLMs' retrieval capabilities. Comprising 400 document variants and a total of 8,250 questions, it is supported by an objective, automated evaluation framework. We detail the construction and characteristics of the Document Haystack dataset, present results from prominent VLMs and discuss potential research avenues in this area.

cross AlgoTune: Can Language Models Speed Up General-Purpose Numerical Programs?

Authors: Ori Press, Brandon Amos, Haoyu Zhao, Yikai Wu, Samuel K. Ainsworth, Dominik Krupke, Patrick Kidger, Touqir Sajed, Bartolomeo Stellato, Jisun Park, Nathanael Bosch, Eli Meril, Albert Steppi, Arman Zharmagambetov, Fangzhao Zhang, David Perez-Pineiro, Alberto Mercurio, Ni Zhan, Talor Abramovich, Kilian Lieret, Hanlin Zhang, Shirley Huang, Matthias Bethge, Ofir Press

Abstract: Despite progress in language model (LM) capabilities, evaluations have thus far focused on models' performance on tasks that humans have previously solved, including in programming (Jimenez et al., 2024) and mathematics (Glazer et al., 2024). We therefore propose testing models' ability to design and implement algorithms in an open-ended benchmark: We task LMs with writing code that efficiently solves computationally challenging problems in computer science, physics, and mathematics. Our AlgoTune benchmark consists of 155 coding tasks collected from domain experts and a framework for validating and timing LM-synthesized solution code, which is compared to reference implementations from popular open-source packages. In addition, we develop a baseline LM agent, AlgoTuner, and evaluate its performance across a suite of frontier models. AlgoTuner achieves an average 1.72x speedup against our reference solvers, which use libraries such as SciPy, scikit-learn and CVXPY. However, we find that current models fail to discover algorithmic innovations, instead preferring surface-level optimizations. We hope that AlgoTune catalyzes the development of LM agents exhibiting creative problem solving beyond state-of-the-art human performance.

cross SpiroLLM: Finetuning Pretrained LLMs to Understand Spirogram Time Series with Clinical Validation in COPD Reporting

Authors: Shuhao Mei, Yongchao Long, Shan Cao, Xiaobo Han, Shijia Geng, Jinbo Sun, Yuxi Zhou, Shenda Hong

Abstract: Chronic Obstructive Pulmonary Disease (COPD), a major chronic respiratory disease with persistent airflow limitation, is a leading global cause of disability and mortality. Respiratory spirogram time series, routinely collected during pulmonary function tests (PFTs), play a critical role in the early detection of respiratory diseases and in monitoring lung function over time. However, most current AI models for COPD diagnosis are limited to outputting classification results without providing a rationale for their diagnostic process, while current Large Language Models (LLMs) cannot yet understand spirograms, which severely limits their clinical trust and adoption. To tackle this challenge, we leverage a cohort of 234,028 individuals from the UK Biobank (UKB) to propose SpiroLLM, the first multimodal large language model that can understand spirograms. The model extracts morphological features from respiratory curves via a SpiroEncoder and aligns them with PFT numerical values in a unified latent space using a SpiroProjector, ultimately empowering a large language model to generate a comprehensive diagnostic report. Experimental results confirm that SpiroLLM achieved a diagnostic AUROC of 0.8980 (95% CI: 0.8820-0.9132). In a robustness test with missing core data, it maintained a 100% valid response rate, far surpassing the 13.4% of a text-only model and showcasing the superiority of its multimodal design. This work demonstrates the substantial potential of deeply fusing physiological signals with large language models, establishing a new paradigm for the next generation of interpretable and reliable clinical decision support tools.

cross Characterizing Online Activities Contributing to Suicide Mortality among Youth

Authors: Aparna Ananthasubramaniam, Elyse J. Thulin, Viktoryia Kalesnikava, Silas Falde, Jonathan Kertawidjaja, Lily Johns, Alejandro Rodr\'iguez-Putnam, Emma Spring, Kara Zivin, Briana Mezuk

Abstract: The recent rise in youth suicide highlights the urgent need to understand how online experiences contribute to this public health issue. Our mixed-methods approach responds to this challenge by developing a set of themes focused on risk factors for suicide mortality in online spaces among youth ages 10-24, and a framework to model these themes at scale. Using 29,124 open text summaries of death investigations between 2013-2022, we conducted a thematic analysis to identify 12 types of online activities that were considered by investigators or next of kin to be relevant in contextualizing a given suicide death. We then develop a zero-shot learning framework to model these 12 themes at scale, and analyze variation in these themes by decedent characteristics and over time. Our work uncovers several online activities related to harm to self, harm to others, interpersonal interactions, activity levels online, and life events, which correspond to different phases of suicide risk from two prominent suicide theories. We find an association between these themes and decedent characteristics like age, means of death, and interpersonal problems, and many themes became more prevalent during the 2020 COVID-19 lockdowns. While digital spaces have taken some steps to address expressions of suicidality online, our work illustrates the opportunities for developing interventions related to less explicit indicators of suicide risk by combining suicide theories with computational research.

cross WhatsApp Tiplines and Multilingual Claims in the 2021 Indian Assembly Elections

Authors: Gautam Kishore Shahi, Scot A. Hale

Abstract: WhatsApp tiplines, first launched in 2019 to combat misinformation, enable users to interact with fact-checkers to verify misleading content. This study analyzes 580 unique claims (tips) from 451 users, covering both high-resource languages (English, Hindi) and a low-resource language (Telugu) during the 2021 Indian assembly elections using a mixed-method approach. We categorize the claims into three groups: election, COVID-19, and others, and observe variations across languages. We compare content similarity through frequent word analysis and clustering of neural sentence embeddings. We also investigate user overlap across languages and fact-checking organizations. We measure the average time required to debunk claims and inform tipline users. Results reveal similarities in claims across languages, with some users submitting tips in multiple languages to the same fact-checkers. Fact-checkers generally require a couple of days to debunk a new claim and share the results with users. Notably, no user submits claims to multiple fact-checking organizations, indicating that each organization maintains a unique audience. We provide practical recommendations for using tiplines during elections with ethical consideration of users' information.

cross MMS Player: an open source software for parametric data-driven animation of Sign Language avatars

Authors: Fabrizio Nunnari, Shailesh Mishra, Patrick Gebhard

Abstract: This paper describes the MMS-Player, an open-source tool that synthesises sign language animations from a novel sign language representation format called MMS (MultiModal Signstream). The MMS enhances gloss-based representations by adding information on parallel execution of signs, timing, and inflections. The implementation consists of Python scripts for the popular Blender 3D authoring tool and can be invoked via command line or HTTP API. Animations can be rendered as videos or exported in other popular 3D animation exchange formats. The software is freely available under the GPL-3.0 license at https://github.com/DFKI-SignLanguage/MMS-Player.

URLs: https://github.com/DFKI-SignLanguage/MMS-Player.

cross C2-Evo: Co-Evolving Multimodal Data and Model for Self-Improving Reasoning

Authors: Xiuwei Chen, Wentao Hu, Hanhui Li, Jun Zhou, Zisheng Chen, Meng Cao, Yihan Zeng, Kui Zhang, Yu-Jie Yuan, Jianhua Han, Hang Xu, Xiaodan Liang

Abstract: Recent advances in multimodal large language models (MLLMs) have shown impressive reasoning capabilities. However, further enhancing existing MLLMs necessitates high-quality vision-language datasets with carefully curated task complexities, which are both costly and challenging to scale. Although recent self-improving models that iteratively refine themselves offer a feasible solution, they still suffer from two core challenges: (i) most existing methods augment visual or textual data separately, resulting in discrepancies in data complexity (e.g., over-simplified diagrams paired with redundant textual descriptions); and (ii) the evolution of data and models is also separated, leading to scenarios where models are exposed to tasks with mismatched difficulty levels. To address these issues, we propose C2-Evo, an automatic, closed-loop self-improving framework that jointly evolves both training data and model capabilities. Specifically, given a base dataset and a base model, C2-Evo enhances them by a cross-modal data evolution loop and a data-model evolution loop. The former loop expands the base dataset by generating complex multimodal problems that combine structured textual sub-problems with iteratively specified geometric diagrams, while the latter loop adaptively selects the generated problems based on the performance of the base model, to conduct supervised fine-tuning and reinforcement learning alternately. Consequently, our method continuously refines its model and training data, and consistently obtains considerable performance gains across multiple mathematical reasoning benchmarks. Our code, models, and datasets will be released.

cross Frontier AI Risk Management Framework in Practice: A Risk Analysis Technical Report

Authors: Shanghai AI Lab, :, Xiaoyang Chen, Yunhao Chen, Zeren Chen, Zhiyun Chen, Hanyun Cui, Yawen Duan, Jiaxuan Guo, Qi Guo, Xuhao Hu, Hong Huang, Lige Huang, Chunxiao Li, Juncheng Li, Qihao Lin, Dongrui Liu, Xinmin Liu, Zicheng Liu, Chaochao Lu, Xiaoya Lu, Jingjing Qu, Qibing Ren, Jing Shao, Jingwei Shi, Jingwei Sun, Peng Wang, Weibing Wang, Jia Xu, Lewen Yan, Xiao Yu, Yi Yu, Boxuan Zhang, Jie Zhang, Weichen Zhang, Zhijie Zheng, Tianyi Zhou, Bowen Zhou

Abstract: To understand and identify the unprecedented risks posed by rapidly advancing artificial intelligence (AI) models, this report presents a comprehensive assessment of their frontier risks. Drawing on the E-T-C analysis (deployment environment, threat source, enabling capability) from the Frontier AI Risk Management Framework (v1.0) (SafeWork-F1-Framework), we identify critical risks in seven areas: cyber offense, biological and chemical risks, persuasion and manipulation, uncontrolled autonomous AI R\&D, strategic deception and scheming, self-replication, and collusion. Guided by the "AI-$45^\circ$ Law," we evaluate these risks using "red lines" (intolerable thresholds) and "yellow lines" (early warning indicators) to define risk zones: green (manageable risk for routine deployment and continuous monitoring), yellow (requiring strengthened mitigations and controlled deployment), and red (necessitating suspension of development and/or deployment). Experimental results show that all recent frontier AI models reside in green and yellow zones, without crossing red lines. Specifically, no evaluated models cross the yellow line for cyber offense or uncontrolled AI R\&D risks. For self-replication, and strategic deception and scheming, most models remain in the green zone, except for certain reasoning models in the yellow zone. In persuasion and manipulation, most models are in the yellow zone due to their effective influence on humans. For biological and chemical risks, we are unable to rule out the possibility of most models residing in the yellow zone, although detailed threat modeling and in-depth assessment are required to make further claims. This work reflects our current understanding of AI frontier risks and urges collective action to mitigate these challenges.

cross Scaling Linear Attention with Sparse State Expansion

Authors: Yuqi Pan, Yongqi An, Zheng Li, Yuhong Chou, Ruijie Zhu, Xiaohui Wang, Mingxuan Wang, Jinqiao Wang, Guoqi Li

Abstract: The Transformer architecture, despite its widespread success, struggles with long-context scenarios due to quadratic computation and linear memory growth. While various linear attention variants mitigate these efficiency constraints by compressing context into fixed-size states, they often degrade performance in tasks such as in-context retrieval and reasoning. To address this limitation and achieve more effective context compression, we propose two key innovations. First, we introduce a row-sparse update formulation for linear attention by conceptualizing state updating as information classification. This enables sparse state updates via softmax-based top-$k$ hard classification, thereby extending receptive fields and reducing inter-class interference. Second, we present Sparse State Expansion (SSE) within the sparse framework, which expands the contextual state into multiple partitions, effectively decoupling parameter size from state capacity while maintaining the sparse classification paradigm. Our design, supported by efficient parallelized implementations, yields effective classification and discriminative state representations. We extensively validate SSE in both pure linear and hybrid (SSE-H) architectures across language modeling, in-context retrieval, and mathematical reasoning benchmarks. SSE demonstrates strong retrieval performance and scales favorably with state size. Moreover, after reinforcement learning (RL) training, our 2B SSE-H model achieves state-of-the-art mathematical reasoning performance among small reasoning models, scoring 64.7 on AIME24 and 51.3 on AIME25, significantly outperforming similarly sized open-source Transformers. These results highlight SSE as a promising and efficient architecture for long-context modeling.
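
A schematic of the "state updating as information classification" idea in isolation; the shapes, the classifier, and the single-token update are simplifying assumptions, and the paper's actual operator and parallelization differ:

import torch

def sparse_state_update(S, W_cls, k, v, k_top=4):
    """Classify the incoming token's key into a few state rows (softmax
    top-k), and let only those rows absorb the value vector.
    S: (rows, d_v) state; W_cls: (rows, d_k); k: (d_k,); v: (d_v,)."""
    probs = torch.softmax(W_cls @ k, dim=-1)   # soft classification over rows
    idx = torch.topk(probs, k_top).indices     # hard top-k selection
    S[idx] += probs[idx].unsqueeze(1) * v      # sparse, weighted write
    return S

Updating only k_top rows per token is what extends receptive fields while limiting inter-class interference; SSE then partitions this state into multiple expansions to decouple parameter count from state capacity.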

cross Experience is the Best Teacher: Grounding VLMs for Robotics through Self-Generated Memory

Authors: Guowei Lan, Kaixian Qu, Ren\'e Zurbr\"ugg, Changan Chen, Christopher E. Mower, Haitham Bou-Ammar, Marco Hutter

Abstract: Vision-language models (VLMs) have been widely adopted in robotics to enable autonomous planning. However, grounding VLMs, originally trained on internet data, to diverse real-world robots remains a challenge. This paper presents ExpTeach, a framework that grounds VLMs to physical robots by building a self-generated memory of real-world experiences. In ExpTeach, the VLM autonomously plans actions, verifies outcomes, reflects on failures, and adapts robot behaviors in a closed loop. The self-generated experiences during this process are then summarized into a long-term memory, enabling retrieval of learned knowledge to guide future tasks via retrieval-augmented generation (RAG). Additionally, ExpTeach enhances the spatial understanding of VLMs with an on-demand image annotation module. In experiments, we show that reflection improves success rates from 36% to 84% on four challenging robotic tasks and observe the emergence of intelligent object interactions, including creative tool use. Across extensive tests on 12 real-world scenarios (including eight unseen ones), we find that grounding with long-term memory boosts single-trial success rates from 22% to 80%, demonstrating the effectiveness and generalizability of ExpTeach.

cross Zebra-CoT: A Dataset for Interleaved Vision Language Reasoning

Authors: Ang Li, Charles Wang, Kaiyu Yue, Zikui Cai, Ollie Liu, Deqing Fu, Peng Guo, Wang Bill Zhu, Vatsal Sharan, Robin Jia, Willie Neiswanger, Furong Huang, Tom Goldstein, Micah Goldblum

Abstract: Humans often use visual aids, for example diagrams or sketches, when solving complex problems. Training multimodal models to do the same, known as Visual Chain of Thought (Visual CoT), is challenging due to: (1) poor off-the-shelf visual CoT performance, which hinders reinforcement learning, and (2) the lack of high-quality visual CoT training data. We introduce $\textbf{Zebra-CoT}$, a diverse large-scale dataset with 182,384 samples, containing logically coherent interleaved text-image reasoning traces. We focus on four categories of tasks where sketching or visual reasoning is especially natural, spanning scientific questions such as geometry, physics, and algorithms; 2D visual reasoning tasks like visual search and jigsaw puzzles; 3D reasoning tasks including 3D multi-hop inference, embodied and robot planning; visual logic problems and strategic games like chess. Fine-tuning the Anole-7B model on the Zebra-CoT training corpus results in an improvement of +12% in our test-set accuracy and yields up to +13% performance gain on standard VLM benchmark evaluations. Fine-tuning Bagel-7B yields a model that generates high-quality interleaved visual reasoning chains, underscoring Zebra-CoT's effectiveness for developing multimodal reasoning abilities. We open-source our dataset and models to support development and evaluation of visual CoT.

cross Steering Out-of-Distribution Generalization with Concept Ablation Fine-Tuning

Authors: Helena Casademunt, Caden Juang, Adam Karvonen, Samuel Marks, Senthooran Rajamanoharan, Neel Nanda

Abstract: Fine-tuning large language models (LLMs) can lead to unintended out-of-distribution generalization. Standard approaches to this problem rely on modifying training data, for example by adding data that better specify the intended generalization. However, this is not always practical. We introduce Concept Ablation Fine-Tuning (CAFT), a technique that leverages interpretability tools to control how LLMs generalize from fine-tuning, without needing to modify the training data or otherwise use data from the target distribution. Given a set of directions in an LLM's latent space corresponding to undesired concepts, CAFT works by ablating these concepts with linear projections during fine-tuning, steering the model away from unintended generalizations. We successfully apply CAFT to three fine-tuning tasks, including emergent misalignment, a phenomenon where LLMs fine-tuned on a narrow task generalize to give egregiously misaligned responses to general questions. Without any changes to the fine-tuning data, CAFT reduces misaligned responses by 10x without degrading performance on the training distribution. Overall, CAFT represents a novel approach for steering LLM generalization without modifying training data.
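
The abstract describes the core operation concretely enough to sketch: given latent-space directions for undesired concepts, ablate them with a linear projection during fine-tuning. A minimal PyTorch forward hook, where the hook placement and the source of the directions are assumptions:

import torch

def make_caft_hook(concept_dirs):
    """concept_dirs: (c, d) matrix of latent directions for undesired
    concepts, assumed to come from interpretability tools."""
    Q, _ = torch.linalg.qr(concept_dirs.T)   # orthonormal basis of the span
    def hook(module, inputs, output):
        h = output[0] if isinstance(output, tuple) else output
        h = h - (h @ Q) @ Q.T                # project off the concept span
        return (h,) + output[1:] if isinstance(output, tuple) else h
    return hook

# usage sketch: layer.register_forward_hook(make_caft_hook(dirs)) before fine-tuning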

cross Beyond Binary Rewards: Training LMs to Reason About Their Uncertainty

Authors: Mehul Damani, Isha Puri, Stewart Slocum, Idan Shenfeld, Leshem Choshen, Yoon Kim, Jacob Andreas

Abstract: When language models (LMs) are trained via reinforcement learning (RL) to generate natural language "reasoning chains", their performance improves on a variety of difficult question answering tasks. Today, almost all successful applications of RL for reasoning use binary reward functions that evaluate the correctness of LM outputs. Because such reward functions do not penalize guessing or low-confidence outputs, they often have the unintended side-effect of degrading calibration and increasing the rate at which LMs generate incorrect responses (or "hallucinate") in other problem domains. This paper describes RLCR (Reinforcement Learning with Calibration Rewards), an approach to training reasoning models that jointly improves accuracy and calibrated confidence estimation. During RLCR, LMs generate both predictions and numerical confidence estimates after reasoning. They are trained to optimize a reward function that augments a binary correctness score with a Brier score -- a scoring rule for confidence estimates that incentivizes calibrated prediction. We first prove that this reward function (or any analogous reward function that uses a bounded, proper scoring rule) yields models whose predictions are both accurate and well-calibrated. We next show that across diverse datasets, RLCR substantially improves calibration with no loss in accuracy, on both in-domain and out-of-domain evaluations -- outperforming both ordinary RL training and classifiers trained to assign post-hoc confidence scores. While ordinary RL hurts calibration, RLCR improves it. Finally, we demonstrate that verbalized confidence can be leveraged at test time to improve accuracy and calibration via confidence-weighted scaling methods. Our results show that explicitly optimizing for calibration can produce more generally reliable reasoning models.
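
The reward structure is stated plainly enough to sketch: a binary correctness score augmented with a Brier score on the model's verbalized confidence. The exact weighting here is an assumption; the paper proves the guarantee for bounded proper scoring rules in general:

def rlcr_reward(correct: bool, confidence: float) -> float:
    """Reward = correctness minus a Brier penalty on stated confidence."""
    y = 1.0 if correct else 0.0
    brier = (confidence - y) ** 2     # proper scoring rule for calibration
    return y - brier

rlcr_reward(correct=True, confidence=0.9)    # 0.99: right and confident
rlcr_reward(correct=False, confidence=0.9)   # -0.81: wrong but confident

Because the Brier score is a proper scoring rule, expected reward is maximized by reporting the true probability of being correct, which is the calibration incentive the abstract describes.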

replace Modeling the Sacred: Considerations when Using Religious Texts in Natural Language Processing

Authors: Ben Hutchinson

Abstract: This position paper concerns the use of religious texts in Natural Language Processing (NLP), which is of special interest to the Ethics of NLP. Religious texts are expressions of culturally important values, and machine learned models have a propensity to reproduce cultural values encoded in their training data. Furthermore, translations of religious texts are frequently used by NLP researchers when language data is scarce. This repurposes the translations from their original uses and motivations, which often involve attracting new followers. This paper argues that NLP's use of such texts raises considerations that go beyond model biases, including data provenance, cultural contexts, and their use in proselytism. We argue for more consideration of researcher positionality, and of the perspectives of marginalized linguistic and religious communities.

replace Erasing Conceptual Knowledge from Language Models

Authors: Rohit Gandikota, Sheridan Feucht, Samuel Marks, David Bau

Abstract: In this work, we introduce Erasure of Language Memory (ELM), a principled approach to concept-level unlearning that operates by matching distributions defined by the model's own introspective classification capabilities. Our key insight is that effective unlearning should leverage the model's ability to evaluate its own knowledge, using the language model itself as a classifier to identify and reduce the likelihood of generating content related to undesired concepts. ELM applies this framework to create targeted low-rank updates that reduce generation probabilities for concept-specific content while preserving the model's broader capabilities. We demonstrate ELM's efficacy on biosecurity, cybersecurity, and literary domain erasure tasks. Comparative evaluation reveals that ELM-modified models achieve near-random performance on assessments targeting erased concepts, while simultaneously preserving generation coherence, maintaining benchmark performance on unrelated tasks, and exhibiting strong robustness to adversarial attacks. Our code, data, and trained models are available at https://elm.baulab.info

URLs: https://elm.baulab.info

replace Data Processing for the OpenGPT-X Model Family

Authors: Nicolo' Brandizzi, Hammam Abdelwahab, Anirban Bhowmick, Lennard Helmer, Benny J\"org Stein, Pavel Denisov, Qasid Saleem, Michael Fromm, Mehdi Ali, Richard Rutmann, Farzad Naderi, Mohamad Saif Agy, Alexander Schwirjow, Fabian K\"uch, Luzian Hahn, Malte Ostendorff, Pedro Ortiz Suarez, Georg Rehm, Dennis Wegener, Nicolas Flores-Herr, Joachim K\"ohler, Johannes Leveling

Abstract: This paper presents a comprehensive overview of the data preparation pipeline developed for the OpenGPT-X project, a large-scale initiative aimed at creating open and high-performance multilingual large language models (LLMs). The project goal is to deliver models that cover all major European languages, with a particular focus on real-world applications within the European Union. We explain all data processing steps, from data selection and requirement definition through to the preparation of the final filtered data. We distinguish between curated data and web data, as each of these categories is handled by distinct pipelines, with curated data undergoing minimal filtering and web data requiring extensive filtering and deduplication. This distinction guided the development of specialized algorithmic solutions for both pipelines. In addition to describing the processing methodologies, we provide an in-depth analysis of the datasets, increasing transparency and alignment with European data regulations. Finally, we share key insights and challenges faced during the project, offering recommendations for future endeavors in large-scale multilingual data preparation for LLMs.

replace Atomic Calibration of LLMs in Long-Form Generations

Authors: Caiqi Zhang, Ruihan Yang, Zhisong Zhang, Xinting Huang, Sen Yang, Dong Yu, Nigel Collier

Abstract: Large language models (LLMs) often suffer from hallucinations, posing significant challenges for real-world applications. Confidence calibration, which estimates the underlying uncertainty of model predictions, is essential to enhance the LLMs' trustworthiness. Existing research on LLM calibration has primarily focused on short-form tasks, providing a single confidence score at the response level (macro calibration). However, this approach is insufficient for long-form generations, where responses often contain more complex statements and may include both accurate and inaccurate information. Therefore, we introduce atomic calibration, a novel approach that evaluates factuality calibration at a fine-grained level by breaking down long responses into atomic claims. We classify confidence elicitation methods into discriminative and generative types and demonstrate that their combination can enhance calibration. Our extensive experiments on various LLMs and datasets show that atomic calibration is well-suited for long-form generation and can also improve macro calibration results. Additionally, atomic calibration reveals insightful patterns in LLM confidence throughout the generation process.
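
Given per-claim confidences and factuality labels, calibration at the atomic level can be measured with a standard binned estimator; the decomposition into claims and the confidence elicitation are the paper's contribution and are assumed as inputs here:

import numpy as np

def atomic_ece(confidences, correct, n_bins=10):
    """Expected calibration error over atomic claims: within each
    confidence bin, compare mean confidence against claim accuracy."""
    confidences = np.asarray(confidences, float)
    correct = np.asarray(correct, float)
    bins = np.minimum((confidences * n_bins).astype(int), n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = bins == b
        if mask.any():
            gap = abs(confidences[mask].mean() - correct[mask].mean())
            ece += mask.mean() * gap            # weight by bin occupancy
    return ece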

replace Beyond English: Evaluating Automated Measurement of Moral Foundations in Non-English Discourse with a Chinese Case Study

Authors: Calvin Yixiang Cheng, Scott A Hale

Abstract: This study explores computational approaches for measuring moral foundations (MFs) in non-English corpora. Since most resources are developed primarily for English, cross-linguistic applications of moral foundation theory remain limited. Using Chinese as a case study, this paper evaluates the effectiveness of applying English resources to machine translated text, local language lexicons, multilingual language models, and large language models (LLMs) in measuring MFs in non-English texts. The results indicate that machine translation and local lexicon approaches are insufficient for complex moral assessments, frequently resulting in a substantial loss of cultural information. In contrast, multilingual models and LLMs demonstrate reliable cross-language performance with transfer learning, with LLMs excelling in terms of data efficiency. Importantly, this study also underscores the need for human-in-the-loop validation of automated MF assessment, as the most advanced models may overlook cultural nuances in cross-language measurements. The findings highlight the potential of LLMs for cross-language MF measurements and other complex multilingual deductive coding tasks.

replace Universal Model Routing for Efficient LLM Inference

Authors: Wittawat Jitkrittum, Harikrishna Narasimhan, Ankit Singh Rawat, Jeevesh Juneja, Congchao Wang, Zifeng Wang, Alec Go, Chen-Yu Lee, Pradeep Shenoy, Rina Panigrahy, Aditya Krishna Menon, Sanjiv Kumar

Abstract: Model routing is a simple technique for reducing the inference cost of large language models (LLMs), wherein one maintains a pool of candidate LLMs, and learns to route each prompt to the smallest feasible LLM. Existing works focus on learning a router for a fixed pool of LLMs. In this paper, we consider the problem of dynamic routing, where new, previously unobserved LLMs are available at test time. We propose UniRoute, a new approach to this problem that relies on representing each LLM as a feature vector, derived based on predictions on a set of representative prompts. Based on this, we detail two effective instantiations of UniRoute, relying on cluster-based routing and a learned cluster map respectively. We show that these are estimates of a theoretically optimal routing rule, and quantify their errors via an excess risk bound. Experiments on a range of public benchmarks show the effectiveness of UniRoute in routing amongst more than 30 unseen LLMs.
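
A toy version of cluster-based dynamic routing; the quality floor and cost rule are assumptions, whereas the paper derives its routing rule as an estimate of a theoretically optimal one:

import numpy as np

def route(prompt_cluster, llm_features, costs, quality_floor=0.7):
    """llm_features: (num_llms, num_clusters) per-cluster accuracy estimates
    computed from each LLM's predictions on representative prompts -- so an
    unseen LLM can be added by evaluating it once, without retraining."""
    accs = llm_features[:, prompt_cluster]
    ok = accs >= quality_floor
    if not ok.any():
        return int(accs.argmax())                       # best effort
    candidates = np.where(ok)[0]
    costs = np.asarray(costs, float)
    return int(candidates[costs[candidates].argmin()])  # cheapest adequate LLM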

replace Reasoning Does Not Necessarily Improve Role-Playing Ability

Authors: Xiachong Feng, Longxu Dou, Lingpeng Kong

Abstract: The application of role-playing large language models (LLMs) is rapidly expanding in both academic and commercial domains, driving an increasing demand for high-precision role-playing models. Simultaneously, the rapid advancement of reasoning techniques has continuously pushed the performance boundaries of LLMs. This intersection of practical role-playing demands and evolving reasoning capabilities raises an important research question: "Can reasoning techniques enhance the role-playing capabilities of LLMs?" To address this, we conduct a comprehensive study using 6 role-playing benchmarks, 24 LLMs, and 3 distinct role-playing strategies, comparing the effectiveness of direct zero-shot role-playing, role-playing with Chain-of-Thought (CoT), and role-playing using reasoning-optimized LLMs. Our findings reveal that CoT may reduce role-playing performance, reasoning-optimized LLMs are unsuitable for role-playing, reasoning ability disrupts the role-playing scaling law, large models still lack proficiency in advanced role-playing, and Chinese role-playing performance surpasses English role-playing performance. Furthermore, based on extensive experimental results, we propose two promising future research directions: Role-aware CoT for improving role-playing LLMs and Reinforcement Learning for role-playing LLMs, aiming to enhance the adaptability, consistency, and effectiveness of role-playing LLMs for both research and real-world applications.

replace MPO: An Efficient Post-Processing Framework for Mixing Diverse Preference Alignment

Authors: Tianze Wang, Dongnan Gui, Yifan Hu, Shuhang Lin, Linjun Zhang

Abstract: Reinforcement Learning from Human Feedback (RLHF) has shown promise in aligning large language models (LLMs). Yet its reliance on a singular reward model often overlooks the diversity of human preferences. Recent approaches address this limitation by leveraging multi-dimensional feedback to fine-tune corresponding reward models and train LLMs using reinforcement learning. However, the process is costly and unstable, especially given the competing and heterogeneous nature of human preferences. In this paper, we propose Mixing Preference Optimization (MPO), a post-processing framework for aggregating single-objective policies as an alternative to both multi-objective RLHF (MORLHF) and MaxMin-RLHF. MPO avoids alignment from scratch. Instead, it log-linearly combines existing policies into a unified one with the weight of each policy computed via a batch stochastic mirror descent. Empirical results demonstrate that MPO achieves balanced performance across diverse preferences, outperforming or matching existing models with significantly reduced computational costs.
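
The aggregation step is simple to sketch: a log-linear combination of single-objective policies corresponds to a weighted sum of their logits. The mirror-descent weight computation is omitted here; weights are assumed given and normalized:

import torch

def mpo_logits(policy_logits, weights):
    """policy_logits: list of (vocab,) tensors, one per single-objective
    policy; returns unnormalized logits of the log-linear mixture,
    i.e. log p_mix proportional to sum_i w_i * log p_i."""
    stacked = torch.stack(policy_logits)          # (num_policies, vocab)
    w = torch.as_tensor(weights).unsqueeze(1)     # (num_policies, 1)
    return (w * stacked).sum(dim=0)

Because the combination happens at the logit level, no retraining or reward-model fine-tuning is needed, which is the source of the claimed cost savings.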

replace LLMs syntactically adapt their language use to their conversational partner

Authors: Florian Kandra, Vera Demberg, Alexander Koller

Abstract: It has been frequently observed that human speakers align their language use with each other during conversations. In this paper, we study empirically whether large language models (LLMs) exhibit the same behavior of conversational adaptation. We construct a corpus of conversations between LLMs and find that two LLM agents end up making more similar syntactic choices as conversations go on, confirming that modern LLMs adapt their language use to their conversational partners in at least a rudimentary way.

replace SciFi-Benchmark: Leveraging Science Fiction To Improve Robot Behavior

Authors: Pierre Sermanet, Anirudha Majumdar, Vikas Sindhwani

Abstract: Given the recent rate of progress in artificial intelligence (AI) and robotics, a tantalizing question is emerging: would robots controlled by emerging AI systems be strongly aligned with human values? In this work, we propose a scalable way to probe this question by generating a benchmark spanning the key moments in 824 major pieces of science fiction literature (movies, TV, novels and scientific books) where an agent (AI or robot) made critical decisions (good or bad). We use a state-of-the-art LLM's recollection of each key moment to generate questions in similar situations, the decisions made by the agent, and alternative decisions it could have made (good or bad). We then measure an approximation of how well models align with human values on a set of human-voted answers. We also generate rules that can be automatically improved via an amendment process in order to generate the first Sci-Fi inspired constitutions for promoting ethical behavior in AIs and robots in the real world. Our first finding is that modern LLMs paired with constitutions turn out to be well-aligned with human values (95.8%), contrary to unsettling decisions typically made in Sci-Fi (only 21.2% alignment). Secondly, we find that generated constitutions substantially increase alignment compared to the base model (79.4% to 95.8%), and show resilience to an adversarial prompt setting (23.3% to 92.3%). Additionally, we find that those constitutions are among the top performers on the ASIMOV Benchmark which is derived from real-world images and hospital injury reports. Sci-Fi-inspired constitutions are thus highly aligned and applicable in real-world situations. We release SciFi-Benchmark: a large-scale dataset to advance robot ethics and safety research. It comprises 9,056 questions and 53,384 answers generated through a novel LLM-introspection process, in addition to a smaller human-labeled evaluation set.

replace Synthetic Data Generation Using Large Language Models: Advances in Text and Code

Authors: Mihai Nadas, Laura Diosan, Andreea Tomescu

Abstract: This survey reviews how large language models (LLMs) are transforming synthetic training data generation in both natural language and code domains. By producing artificial but task-relevant examples, these models can significantly augment or even substitute for real-world datasets, particularly in scenarios where labeled data is scarce, expensive, or sensitive. This paper surveys recent advances in leveraging LLMs to create synthetic text and code, highlighting key techniques such as prompt-based generation, retrieval-augmented pipelines, and iterative self-refinement. We examine how these methods can enrich low-resource tasks (e.g., classification, question answering) and facilitate code-centric applications (e.g., instruction tuning, code translation, bug repair) through automated verification of functional correctness. Alongside potential benefits - cost-effectiveness, broad coverage, and controllable diversity - we discuss the accompanying challenges, including factual inaccuracies in generated text, insufficient stylistic or distributional realism, and risks of bias amplification. Proposed mitigation strategies range from filtering and weighting synthetic outputs to reinforcement learning with execution feedback in code domains. We conclude by outlining open research directions, such as automated prompt engineering, cross-modal data synthesis, and robust evaluation frameworks, underscoring the growing importance of LLM-generated synthetic data in accelerating AI development while emphasizing ethical and quality safeguards.

replace Typed-RAG: Type-Aware Decomposition of Non-Factoid Questions for Retrieval-Augmented Generation

Authors: DongGeon Lee, Ahjeong Park, Hyeri Lee, Hyeonseo Nam, Yunho Maeng

Abstract: Addressing non-factoid question answering (NFQA) remains challenging due to its open-ended nature, diverse user intents, and need for multi-aspect reasoning. These characteristics often reveal the limitations of conventional retrieval-augmented generation (RAG) approaches. To overcome these challenges, we propose Typed-RAG, a framework for type-aware decomposition of non-factoid questions (NFQs) within the RAG paradigm. Specifically, Typed-RAG first classifies an NFQ into a predefined type (e.g., Debate, Experience, Comparison). It then decomposes the question into focused sub-queries, each addressing a single aspect. This decomposition enhances both retrieval relevance and answer quality. By combining the results of these sub-queries, Typed-RAG produces more informative and contextually aligned responses. Additionally, we construct Wiki-NFQA, a benchmark dataset for NFQA covering a wide range of NFQ types. Experiments show that Typed-RAG consistently outperforms existing QA approaches based on LLMs or RAG methods, validating the effectiveness of type-aware decomposition for improving both retrieval quality and answer generation in NFQA. Our code and dataset are available at https://github.com/TeamNLP/Typed-RAG.
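
The described flow maps onto a short pipeline; every callable below is a stand-in for an LLM- or retriever-backed component, and the type list is an illustrative subset:

NFQ_TYPES = ["Debate", "Experience", "Comparison"]   # illustrative subset

def typed_rag(question, classify, decompose, retrieve_and_answer, aggregate):
    q_type = classify(question, NFQ_TYPES)            # 1. type classification
    sub_queries = decompose(question, q_type)         # 2. single-aspect sub-queries
    partials = [retrieve_and_answer(sq) for sq in sub_queries]  # 3. RAG per aspect
    return aggregate(question, q_type, partials)      # 4. compose the final answer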

URLs: https://github.com/TeamNLP/Typed-RAG.

replace Evaluating Intermediate Reasoning of Code-Assisted Large Language Models for Mathematics

Authors: Zena Al-Khalili, Nick Howell, Dietrich Klakow

Abstract: Assisting LLMs with code generation improved their performance on mathematical reasoning tasks. However, the evaluation of code-assisted LLMs is generally restricted to execution correctness, lacking a rigorous evaluation of their generated programs. In this work, we bridge this gap by conducting an in-depth analysis of the programs generated by code-assisted LLMs in response to math reasoning tasks, with a focus on evaluating the soundness of the underlying reasoning processes. For this purpose, we assess the generations of five LLMs on several math datasets, both manually and automatically, and propose a taxonomy of generated programs based on their logical soundness. Our findings show that the capabilities of models significantly impact the logic implemented to solve the problem. Closed-source LLMs ground their programs in mathematical concepts, whereas open-source models often resort to unsound reasoning, relying on memorized information and exhaustive searches. Furthermore, increasing the difficulty of problems decreases sound generations for all models, revealing a critical shortcoming of LLMs on complex mathematics, contrary to what accuracy metrics suggest. Our work highlights the need for more holistic evaluations of code-assisted LLMs beyond execution accuracy metrics, toward a better understanding of LLMs' limits in the math domain.

replace A Method for the Architecture of a Medical Vertical Large Language Model Based on Deepseek R1

Authors: Mingda Zhang, Jianglong Qin

Abstract: Despite significant advances in foundation models like DeepSeek-R1 and ChatGPT, their deployment in medical settings faces critical challenges including computational requirements and professional knowledge barriers. This paper presents an efficient lightweight medical large language model architecture that systematically addresses these challenges through three-dimensional optimization: knowledge acquisition, model compression, and computational enhancement. We design a knowledge transfer pipeline from DeepSeek-R1-Distill-70B to DeepSeek-R1-Distill-7B using Low-Rank Adaptation (LoRA) for precise medical knowledge retention. Through 4-bit quantization and mixed-precision strategies, we achieve substantial model compression while preserving medical reasoning capabilities. The inference framework incorporates Flash Attention acceleration and continuous batching, complemented by specialized prompt templates for diverse medical queries. Experimental evaluation on medical benchmarks demonstrates that our approach maintains 92.1% accuracy on USMLE examinations while reducing memory consumption by 64.7% and inference latency by 12.4% compared to baseline models. This work provides a practical solution for deploying advanced language models in resource-constrained medical environments, enabling broader accessibility of AI-assisted healthcare.
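
A sketch of the compression recipe described, using the Hugging Face transformers and peft libraries; the model identifier, LoRA rank, and target modules are assumptions, not the paper's reported configuration:

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

bnb = BitsAndBytesConfig(load_in_4bit=True,
                         bnb_4bit_compute_dtype=torch.bfloat16)  # mixed precision
model = AutoModelForCausalLM.from_pretrained(
    "deepseek-ai/DeepSeek-R1-Distill-Qwen-7B",   # assumed id for the 7B student
    quantization_config=bnb)
lora = LoraConfig(r=16, lora_alpha=32, task_type="CAUSAL_LM",
                  target_modules=["q_proj", "v_proj"])           # illustrative
model = get_peft_model(model, lora)   # attach adapters for medical fine-tuning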

replace Generative Sign-description Prompts with Multi-positive Contrastive Learning for Sign Language Recognition

Authors: Siyu Liang, Yunan Li, Wentian Xin, Huizhou Chen, Xujie Liu, Kang Liu, Qiguang Miao

Abstract: Sign language recognition (SLR) faces fundamental challenges in creating accurate annotations due to the inherent complexity of simultaneous manual and non-manual signals. To the best of our knowledge, this is the first work to integrate generative large language models (LLMs) into SLR tasks. We propose a novel Generative Sign-description Prompts Multi-positive Contrastive learning (GSP-MC) method that leverages retrieval-augmented generation (RAG) with domain-specific LLMs, incorporating multi-step prompt engineering and expert-validated sign language corpora to produce precise multipart descriptions. The GSP-MC method also employs a dual-encoder architecture to bidirectionally align hierarchical skeleton features with multiple text descriptions (global, synonym, and part level) through probabilistic matching. Our approach combines global and part-level losses, optimizing KL divergence to ensure robust alignment across all relevant text-skeleton pairs while capturing both sign-level semantics and detailed part dynamics. Experiments demonstrate state-of-the-art performance against existing methods on the Chinese SLR500 (97.1% accuracy) and Turkish AUTSL (97.07% accuracy) datasets. The method's cross-lingual effectiveness highlights its potential for developing inclusive communication technologies.

replace HausaNLP: Current Status, Challenges and Future Directions for Hausa Natural Language Processing

Authors: Shamsuddeen Hassan Muhammad, Ibrahim Said Ahmad, Idris Abdulmumin, Falalu Ibrahim Lawan, Babangida Sani, Sukairaj Hafiz Imam, Yusuf Aliyu, Sani Abdullahi Sani, Ali Usman Umar, Tajuddeen Gwadabe, Kenneth Church, Vukosi Marivate

Abstract: Hausa Natural Language Processing (NLP) has gained increasing attention in recent years, yet remains understudied as a low-resource language despite having over 120 million first-language (L1) and 80 million second-language (L2) speakers worldwide. While significant advances have been made in high-resource languages, Hausa NLP faces persistent challenges, including limited open-source datasets and inadequate model representation. This paper presents an overview of the current state of Hausa NLP, systematically examining existing resources, research contributions, and gaps across fundamental NLP tasks: text classification, machine translation, named entity recognition, speech recognition, and question answering. We introduce HausaNLP (https://catalog.hausanlp.org), a curated catalog that aggregates datasets, tools, and research works to enhance accessibility and drive further development. Furthermore, we discuss challenges in integrating Hausa into large language models (LLMs), addressing issues of suboptimal tokenization and dialectal variation. Finally, we propose strategic research directions emphasizing dataset expansion, improved language modeling approaches, and strengthened community collaboration to advance Hausa NLP. Our work provides both a foundation for accelerating Hausa NLP progress and valuable insights for broader multilingual NLP research.

URLs: https://catalog.hausanlp.org

replace Hierarchical Safety Realignment: Lightweight Restoration of Safety in Pruned Large Vision-Language Models

Authors: Yue Li, Xin Yi, Dongsheng Shi, Gerard de Melo, Xiaoling Wang, Linlin Wang

Abstract: With the increasing size of Large Vision-Language Models (LVLMs), network pruning techniques aimed at compressing models for deployment in resource-constrained environments have garnered significant attention. However, we observe that pruning often leads to a degradation in safety performance. To address this issue, we present a novel and lightweight approach, termed Hierarchical Safety Realignment (HSR). HSR operates by first quantifying the contribution of each attention head to safety, identifying the most critical ones, and then selectively restoring neurons directly within these attention heads that play a pivotal role in maintaining safety. This process hierarchically realigns the safety of pruned LVLMs, progressing from the attention head level to the neuron level. We validate HSR across various models and pruning strategies, consistently achieving notable improvements in safety performance. To our knowledge, this is the first work explicitly focused on restoring safety in LVLMs post-pruning.
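
A toy sketch of the two hierarchy levels in HSR-style realignment: rank attention heads by a precomputed safety score, then restore the original weights of the most safety-critical neurons inside the top heads. The scoring inputs and tensor layout are illustrative assumptions.

import torch

def realign(pruned_w, orig_w, head_scores, neuron_scores,
            top_heads=2, top_neurons=8):
    """pruned_w / orig_w: (H, d_head, d_model) per-head output weights.
    head_scores: (H,) contribution of each head to safety behavior.
    neuron_scores: (H, d_head) importance of each neuron within a head.
    Restores original weights only for pivotal neurons in critical heads."""
    restored = pruned_w.clone()
    for h in torch.topk(head_scores, top_heads).indices:          # head level
        idx = torch.topk(neuron_scores[h], top_neurons).indices   # neuron level
        restored[h, idx] = orig_w[h, idx]
    return restored

H, dh, dm = 12, 64, 768
out = realign(torch.zeros(H, dh, dm), torch.randn(H, dh, dm),
              torch.rand(H), torch.rand(H, dh))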

replace Multimodal Forecasting of Sparse Intraoperative Hypotension Events Powered by Language Model

Authors: Jintao Zhang, Zirui Liu, Mingyue Cheng, Shilong Zhang, Tingyue Pan, Yitong Zhou, Qi Liu, Yanhu Xie

Abstract: Intraoperative hypotension (IOH) frequently occurs under general anesthesia and is strongly linked to adverse outcomes such as myocardial injury and increased mortality. Despite its significance, IOH prediction is hindered by event sparsity and the challenge of integrating static and dynamic data across diverse patients. In this paper, we propose IOHFuseLM, a multimodal language model framework. To accurately identify and differentiate sparse hypotensive events, we adopt a two-stage training strategy. The first stage involves domain-adaptive pretraining on IOH physiological time series augmented through diffusion methods, thereby enhancing the model's sensitivity to patterns associated with hypotension. Subsequently, task fine-tuning is performed on the original clinical dataset to further enhance its ability to distinguish normotensive from hypotensive states. To enable multimodal fusion for each patient, we align structured clinical descriptions with the corresponding physiological time series at the token level. Such alignment enables the model to capture individualized temporal patterns alongside their corresponding clinical semantics. In addition, we convert static patient attributes into structured text to enrich personalized information. Experimental evaluations on two intraoperative datasets demonstrate that IOHFuseLM outperforms established baselines in accurately identifying IOH events, highlighting its applicability in clinical decision support scenarios. Our code is publicly available to promote reproducibility at https://github.com/zjt-gpu/IOHFuseLM.

URLs: https://github.com/zjt-gpu/IOHFuseLM
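
A minimal sketch of token-level fusion in the spirit of the abstract: physiological patches are projected into the language model's embedding space and interleaved with clinical-text token embeddings. The patching scheme, dimensions, and interleaving order are assumptions.

import torch
import torch.nn as nn

class TokenLevelFusion(nn.Module):
    """Project fixed-length physiological patches into the LM embedding
    space and interleave them with clinical-text token embeddings (toy)."""
    def __init__(self, patch_len=16, d_model=256):
        super().__init__()
        self.patch_len = patch_len
        self.proj = nn.Linear(patch_len, d_model)

    def forward(self, series, text_emb):
        # series: (B, T) waveform; text_emb: (B, L, d_model)
        B, T = series.shape
        patches = series.unfold(1, self.patch_len, self.patch_len)  # (B, P, patch_len)
        ts_tokens = self.proj(patches)                               # (B, P, d_model)
        n = min(ts_tokens.size(1), text_emb.size(1))
        # Alternate text token, then time-series token, up to the shorter length.
        fused = torch.stack([text_emb[:, :n], ts_tokens[:, :n]], dim=2)
        return fused.reshape(B, 2 * n, -1)

fusion = TokenLevelFusion()
out = fusion(torch.randn(2, 160), torch.randn(2, 10, 256))  # -> (2, 20, 256)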

replace Self-Correcting Code Generation Using Small Language Models

Authors: Jeonghun Cho, Deokhyung Kang, Hyounghun Kim, Gary Geunbae Lee

Abstract: Self-correction has demonstrated potential in code generation by allowing language models to revise and improve their outputs through successive refinement. Recent studies have explored prompting-based strategies that incorporate verification or feedback loops using proprietary models, as well as training-based methods that leverage these models' strong reasoning capabilities. However, whether smaller models possess the capacity to effectively guide their outputs through self-reflection remains unexplored. Our findings reveal that smaller models struggle to exhibit reflective revision behavior across both self-correction paradigms. In response, we introduce CoCoS, an approach designed to enhance the multi-turn code-correction ability of small language models. Specifically, we propose an online reinforcement learning objective that trains the model to confidently maintain correct outputs while progressively correcting incorrect outputs as turns proceed. Our approach features an accumulated reward function that aggregates rewards across the entire trajectory and a fine-grained reward better suited to multi-turn correction scenarios. This enables the model to improve initial response quality while achieving substantial gains through self-correction. With 1B-scale models, CoCoS achieves improvements of 35.8% on MBPP and 27.7% on HumanEval compared to the baselines.
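
A toy sketch of an accumulated, fine-grained multi-turn reward consistent with the description above: each turn is rewarded for the fraction of unit tests newly passed and penalized for regressing a previously working solution. The penalty and discount values are assumptions, not CoCoS's exact objective.

def accumulated_reward(pass_fractions, regress_penalty=0.5, gamma=1.0):
    """pass_fractions: fraction of tests passed after each correction turn,
    e.g. [0.2, 0.6, 1.0]. Rewards progress; penalizes breaking what worked."""
    total, prev = 0.0, 0.0
    for t, frac in enumerate(pass_fractions):
        step = frac - prev                  # fine-grained progress signal
        if frac < prev:                     # regressed a working solution
            step -= regress_penalty
        total += (gamma ** t) * step
        prev = frac
    return total

print(accumulated_reward([0.2, 0.6, 1.0]))   # 1.0: steady improvement
print(accumulated_reward([1.0, 0.4]))        # negative: broke a correct program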

replace SenWiCh: Sense-Annotation of Low-Resource Languages for WiC using Hybrid Methods

Authors: Roksana Goworek, Harpal Karlcut, Muhammad Shezad, Nijaguna Darshana, Abhishek Mane, Syam Bondada, Raghav Sikka, Ulvi Mammadov, Rauf Allahverdiyev, Sriram Purighella, Paridhi Gupta, Muhinyia Ndegwa, Haim Dubossarsky

Abstract: This paper addresses the critical need for high-quality evaluation datasets in low-resource languages to advance cross-lingual transfer. While cross-lingual transfer offers a key strategy for leveraging multilingual pretraining to expand language technologies to understudied and typologically diverse languages, its effectiveness depends on high-quality, suitable benchmarks. We release new sense-annotated datasets of sentences containing polysemous words, spanning ten low-resource languages across diverse language families and scripts. To facilitate dataset creation, the paper presents a demonstrably beneficial semi-automatic annotation method. The utility of the datasets is demonstrated through Word-in-Context (WiC) formatted experiments that evaluate transfer on these low-resource languages. Results highlight the importance of targeted dataset creation and evaluation for effective polysemy disambiguation in low-resource settings and transfer studies. The released datasets and code aim to support further research into fair, robust, and truly multilingual NLP.
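
A minimal sketch of how sense-annotated sentences can be turned into WiC-style pairs: sentences sharing a polysemous target word are paired, and the label records whether the word is used in the same sense. Field names are assumed.

from itertools import combinations

# Each record: a sentence containing the target word, plus a sense id.
annotated = [
    {"word": "bank", "sense": "river", "sent": "They sat on the bank of the river."},
    {"word": "bank", "sense": "money", "sent": "She deposited cash at the bank."},
    {"word": "bank", "sense": "money", "sent": "The bank approved the loan."},
]

def make_wic_pairs(records):
    """Pair sentences sharing a polysemous target word; the label asks
    whether the word is used in the same sense (True) or not (False)."""
    return [
        {"word": a["word"], "sent1": a["sent"], "sent2": b["sent"],
         "label": a["sense"] == b["sense"]}
        for a, b in combinations(records, 2)
    ]

for pair in make_wic_pairs(annotated):
    print(pair["label"], "|", pair["sent1"], "||", pair["sent2"])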

replace Speech as a Multimodal Digital Phenotype for Multi-Task LLM-based Mental Health Prediction

Authors: Mai Ali, Christopher Lucasius, Tanmay P. Patel, Madison Aitken, Jacob Vorstman, Peter Szatmari, Marco Battaglia, Deepa Kundur

Abstract: Speech is a noninvasive digital phenotype that can offer valuable insights into mental health conditions, but it is often treated as a single modality. In contrast, we propose treating patient speech data as a trimodal multimedia data source for depression detection. This study explores the potential of large language model-based architectures for speech-based depression prediction in a multimodal regime that integrates speech-derived text, acoustic landmarks, and vocal biomarkers. Adolescent depression presents a significant challenge and is often comorbid with multiple disorders, such as suicidal ideation and sleep disturbances. This presents an additional opportunity to integrate multi-task learning (MTL) into our study by simultaneously predicting depression, suicidal ideation, and sleep disturbances using the multimodal formulation. We also propose a longitudinal analysis strategy that models temporal changes across multiple clinical interactions, allowing for a comprehensive understanding of the conditions' progression. Our proposed approach, featuring trimodal, longitudinal MTL, is evaluated on the Depression Early Warning dataset. It achieves a balanced accuracy of 70.8%, which is higher than each of the unimodal, single-task, and non-longitudinal methods.
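
A toy sketch of the multi-task layout described above: one shared encoder over concatenated trimodal features with a separate binary head per condition and a summed loss. Feature dimensions and the simple concatenation fusion are assumptions.

import torch
import torch.nn as nn

class TrimodalMTL(nn.Module):
    """Shared encoder over concatenated text/landmark/biomarker features,
    with one head per task (depression, suicidal ideation, sleep)."""
    def __init__(self, d_text=768, d_landmark=128, d_bio=64, d_hid=256):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(d_text + d_landmark + d_bio, d_hid), nn.ReLU())
        self.heads = nn.ModuleDict(
            {task: nn.Linear(d_hid, 1) for task in ("dep", "si", "sleep")})

    def forward(self, text, landmark, bio):
        h = self.encoder(torch.cat([text, landmark, bio], dim=-1))
        return {task: head(h).squeeze(-1) for task, head in self.heads.items()}

model = TrimodalMTL()
logits = model(torch.randn(4, 768), torch.randn(4, 128), torch.randn(4, 64))
labels = {t: torch.randint(0, 2, (4,)).float() for t in logits}
loss = sum(nn.functional.binary_cross_entropy_with_logits(logits[t], labels[t])
           for t in logits)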

replace Adaptive Graph Pruning for Multi-Agent Communication

Authors: Boyi Li, Zhonghan Zhao, Der-Horng Lee, Gaoang Wang

Abstract: Large Language Model (LLM) based multi-agent systems have shown remarkable performance in various tasks, especially when enhanced through collaborative communication. However, current methods often rely on a fixed number of agents and static communication structures, limiting their ability to adapt to varying task complexities. In this paper, we propose Adaptive Graph Pruning (AGP), a novel task-adaptive multi-agent collaboration framework that jointly optimizes agent quantity (hard-pruning) and communication topology (soft-pruning). Specifically, our method employs a two-stage training strategy: first, independently training soft-pruning networks for different agent quantities to determine optimal agent-quantity-specific complete graphs and positional masks across specific tasks; and then jointly optimizing hard-pruning and soft-pruning within a maximum complete graph to dynamically configure the number of agents and their communication topologies per task. Extensive experiments demonstrate that our approach is: (1) High-performing, achieving state-of-the-art results across six benchmarks and generalizing consistently across multiple mainstream LLM architectures, with an increase in performance of $2.58\%\sim 9.84\%$; (2) Task-adaptive, dynamically constructing optimized communication topologies tailored to specific tasks, with strong performance across all three task categories (general reasoning, mathematical reasoning, and code generation); (3) Token-economical, reducing both training steps and token consumption, with token consumption cut by $90\%+$; and (4) Training-efficient, achieving high performance with very few training steps compared with other methods, surpassing existing baselines after about ten training steps on all six benchmarks.
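
A toy sketch of joint hard- and soft-pruning over a complete communication graph: learnable edge logits are thresholded into an adjacency matrix (soft-pruning) while low-utility agents are dropped outright (hard-pruning). All scores and thresholds here are illustrative assumptions, not AGP's trained networks.

import torch

def prune_graph(edge_logits, node_scores, keep_agents=3, edge_thresh=0.5):
    """edge_logits: (N, N) learnable scores over the complete graph;
    node_scores: (N,) agent utility. Returns pruned adjacency + kept agents."""
    keep = torch.topk(node_scores, keep_agents).indices        # hard-pruning
    adj = (torch.sigmoid(edge_logits) > edge_thresh).float()   # soft-pruning
    adj.fill_diagonal_(0)
    mask = torch.zeros_like(node_scores, dtype=torch.bool)
    mask[keep] = True
    adj[~mask, :] = 0            # dropped agents neither send...
    adj[:, ~mask] = 0            # ...nor receive messages
    return adj, keep

adj, kept = prune_graph(torch.randn(5, 5), torch.rand(5))
print(kept.tolist())
print(adj)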

replace Can LLMs Generate Reliable Test Case Generators? A Study on Competition-Level Programming Problems

Authors: Yuhan Cao, Zian Chen, Kun Quan, Ziliang Zhang, Yu Wang, Xiaoning Dong, Yeqi Feng, Guanzhong He, Jingcheng Huang, Jianhao Li, Yixuan Tan, Jiafu Tang, Yilin Tang, Junlei Wu, Qianyu Xiao, Can Zheng, Shouchen Zhou, Yuxiang Zhu, Yiming Huang, Tian Xie, Tianxing He

Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities in code generation and are capable of tackling complex tasks during inference. However, the extent to which LLMs can be utilized for code checking or debugging through test case generation remains largely unexplored. We investigate this problem from the perspective of competition-level programming (CP) programs and propose TCGBench, a Benchmark for (LLM generation of) Test Case Generators. This benchmark comprises two tasks, aimed at studying the capabilities of LLMs in (1) generating valid test case generators for a given CP problem, and further (2) generating targeted test case generators that expose bugs in human-written code. Experimental results indicate that while state-of-the-art LLMs can generate valid test case generators in most cases, most LLMs struggle to generate targeted test cases that effectively reveal flaws in human code. In particular, even advanced reasoning models (e.g., o3-mini) fall significantly short of human performance in the task of generating targeted generators. Furthermore, we construct a high-quality, manually curated dataset of instructions for generating targeted generators. Analysis demonstrates that the performance of LLMs can be enhanced with the aid of this dataset, by both prompting and fine-tuning.
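
A minimal harness illustrating the two checks the benchmark asks of a generated test case generator: that its outputs satisfy the problem's constraints (validity) and, for the targeted task, that they make a buggy program diverge from the reference solution. All three functions are hypothetical stand-ins.

import random

def reference(xs):   # ground-truth solution (stand-in)
    return max(xs)

def buggy(xs):       # human-written code with a bug (stand-in)
    return xs[0]

def generator(rng):  # LLM-written test case generator (stand-in)
    return [rng.randint(-100, 100) for _ in range(rng.randint(1, 20))]

def check_generator(gen, trials=1000):
    """Returns (validity rate, bug-exposing rate) over sampled test cases."""
    rng = random.Random(0)
    valid = exposing = 0
    for _ in range(trials):
        xs = gen(rng)
        if 1 <= len(xs) <= 20 and all(-100 <= x <= 100 for x in xs):
            valid += 1
            if reference(xs) != buggy(xs):   # targeted: exposes the bug
                exposing += 1
    return valid / trials, exposing / trials

print(check_generator(generator))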

replace Continuously Updating Digital Twins using Large Language Models

Authors: Harry Amad, Nicol\'as Astorga, Mihaela van der Schaar

Abstract: Digital twins are models of real-world systems that can simulate their dynamics in response to potential actions. In complex settings, the state and action variables, and available data and knowledge relevant to a system can constantly change, requiring digital twins to continuously update with these changes to remain relevant. Current approaches struggle in this regard, as they require fixed, well-defined modelling environments, and they cannot adapt to novel variables without re-designs, or incorporate new information without re-training. To address this, we frame digital twinning as an in-context learning problem using large language models, enabling seamless updates to the twin at inference time. We develop CALM-DT, a Context-Adaptive Language Model-based Digital Twin that can accurately simulate across diverse state-action spaces using in-context learning alone by utilising fine-tuned encoders for sample retrieval. We empirically demonstrate CALM-DT's competitive performance with existing digital twin approaches, and its unique ability to adapt to changes in its modelling environment without parameter updates.
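
A toy sketch of the retrieval-plus-in-context-learning loop described above: an encoder (a random stand-in here for the fine-tuned retriever) scores logged transitions against the query state, and the nearest ones are pasted into the prompt for a frozen LLM. The prompt format and log schema are assumptions.

import numpy as np

def embed(transition: str) -> np.ndarray:
    """Stand-in for the fine-tuned retrieval encoder."""
    rng = np.random.default_rng(abs(hash(transition)) % 2**32)
    return rng.standard_normal(32)

def build_prompt(query_state, log, k=3):
    # Retrieve the k most similar logged transitions as in-context examples.
    sims = [(float(embed(t) @ embed(query_state)), t) for t in log]
    shots = [t for _, t in sorted(sims, reverse=True)[:k]]
    return ("Simulate the next state.\n"
            + "\n".join(f"Example: {t}" for t in shots)
            + f"\nState/action: {query_state}\nNext state:")

log = ["hr=80,drug=A -> hr=76", "hr=90,drug=B -> hr=88", "bp=120,drug=A -> bp=112"]
print(build_prompt("hr=85,drug=A", log))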

replace Omni-Router: Sharing Routing Decisions in Sparse Mixture-of-Experts for Speech Recognition

Authors: Zijin Gu, Tatiana Likhomanenko, Navdeep Jaitly

Abstract: Mixture-of-experts (MoE) architectures have expanded from language modeling to automatic speech recognition (ASR). Traditional MoE methods, such as the Switch Transformer, route experts independently within each layer. Our analysis reveals that routers in most layers make expert choices that are not strongly correlated with the choices of the routers in other layers. To increase the cooperation between experts in different layers and encourage greater specialization, we use a shared router across different MoE layers. We call this model Omni-router Transformer. Extensive experiments on a large-scale pseudo-labeled dataset and evaluations across 10 diverse, out-of-domain ASR benchmarks demonstrate that the Omni-router Transformer is able to achieve lower training loss and consistently outperform dense and Switch Transformer models, reducing average word error rates by 11.2% and 8.2%, respectively, while providing structured expert usage and improved robustness to diverse data.
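
A minimal sketch of the core idea: a single router shared by every MoE block, in contrast to Switch-style per-layer routers. Top-1 routing, the dimensions, and linear experts are simplifying assumptions.

import torch
import torch.nn as nn

class MoELayer(nn.Module):
    def __init__(self, d, n_experts, router):
        super().__init__()
        self.router = router            # the same nn.Linear in every layer
        self.experts = nn.ModuleList(nn.Linear(d, d) for _ in range(n_experts))

    def forward(self, x):               # x: (B, T, d), top-1 routing
        choice = self.router(x).argmax(-1)                 # (B, T)
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            m = choice == e
            out[m] = expert(x[m])
        return out

d, n_experts = 64, 4
shared_router = nn.Linear(d, n_experts)   # one router for all layers
layers = nn.Sequential(*[MoELayer(d, n_experts, shared_router) for _ in range(6)])
y = layers(torch.randn(2, 10, d))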

replace Agent KB: Leveraging Cross-Domain Experience for Agentic Problem Solving

Authors: Xiangru Tang, Tianrui Qin, Tianhao Peng, Ziyang Zhou, Daniel Shao, Tingting Du, Xinming Wei, Peng Xia, Fang Wu, He Zhu, Ge Zhang, Jiaheng Liu, Xingyao Wang, Sirui Hong, Chenglin Wu, Hao Cheng, Chi Wang, Wangchunshu Zhou

Abstract: Current AI agents cannot effectively learn from each other's problem-solving experiences or use past successes to guide self-reflection and error correction in new tasks. We introduce Agent KB, a shared knowledge base that captures both high-level problem-solving strategies and detailed execution lessons, enabling knowledge transfer across agent frameworks. Agent KB implements a novel teacher-student dual-phase retrieval mechanism where student agents retrieve workflow-level patterns for strategic guidance while teacher agents identify execution-level patterns for refinement. This hierarchical approach enables agents to break out of limited reasoning pathways by incorporating diverse strategies from external sources. Evaluations on the GAIA benchmark demonstrate substantial performance gains, with Agent KB improving success rates by up to 6.06 percentage points overall under pass@1. For SWE-bench code repair tasks, our system significantly improved resolution rates, with o3-mini achieving an 8.67 percentage point gain (23 percent to 31.67 percent) in pass@1. Our ablation studies show that the refinement module is the most critical component, with its removal causing a 3.85% drop on challenging Level 3 tasks, highlighting that effective knowledge transfer necessitates both strategic guidance and execution-level refinement.
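
A toy sketch of dual-phase retrieval over a shared knowledge base: the student phase queries workflow-level strategies and the teacher phase queries execution-level lessons. The keyword-overlap scorer and entry schema are stand-ins for the system's actual retrievers.

class AgentKB:
    """Shared store with two retrieval granularities (toy keyword matcher)."""
    def __init__(self):
        self.entries = []   # each: {"level": "workflow" | "execution", "text": ...}

    def add(self, level, text):
        self.entries.append({"level": level, "text": text})

    def retrieve(self, level, query, k=1):
        def score(e):
            return len(set(query.lower().split()) & set(e["text"].lower().split()))
        pool = [e for e in self.entries if e["level"] == level]
        return sorted(pool, key=score, reverse=True)[:k]

kb = AgentKB()
kb.add("workflow", "For web QA tasks, plan: search, cross-check two sources, cite.")
kb.add("execution", "If a page times out, retry once with a cached snapshot.")
print(kb.retrieve("workflow", "web QA task planning"))        # student phase
print(kb.retrieve("execution", "page timeout during fetch"))  # teacher phase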

replace Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

Authors: Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, Luke Marris, Sam Petulla, Colin Gaffney, Asaf Aharoni, Nathan Lintz, Tiago Cardal Pais, Henrik Jacobsson, Idan Szpektor, Nan-Jiang Jiang, Krishna Haridasan, Ahmed Omran, Nikunj Saunshi, Dara Bahri, Gaurav Mishra, Eric Chu, Toby Boyd, Brad Hekman, Aaron Parisi, Chaoyi Zhang, Kornraphop Kawintiranon, Tania Bedrax-Weiss, Oliver Wang, Ya Xu, Ollie Purkiss, Uri Mendlovic, Ila\"i Deutel, Nam Nguyen, Adam Langley, Flip Korn, Lucia Rossazza, Alexandre Ram\'e, Sagar Waghmare, Helen Miller, Nathan Byrd, Ashrith Sheshan, et al. (additional authors not shown)
Choi (Tu\'\^an), Faruk Ahmed (Tu\'\^an), Eric Li (Tu\'\^an), Yin Li (Tu\'\^an), Shengyang Dai (Tu\'\^an), Michael Elabd (Tu\'\^an), Sriram Ganapathy (Tu\'\^an), Shivani Agrawal (Tu\'\^an), Yiqing Hua (Tu\'\^an), Paige Kunkle (Tu\'\^an), Sujeevan Rajayogam (Tu\'\^an), Arun Ahuja (Tu\'\^an), Arthur Conmy (Tu\'\^an), Alex Vasiloff (Tu\'\^an), Parker Beak (Tu\'\^an), Christopher Yew (Tu\'\^an), Jayaram Mudigonda (Tu\'\^an), Bartek Wydrowski (Tu\'\^an), Jon Blanton (Tu\'\^an), Zhengdong Wang (Tu\'\^an), Yann Dauphin (Tu\'\^an), Zhuo Xu (Tu\'\^an), Martin Polacek (Tu\'\^an), Xi Chen (Tu\'\^an), Hexiang Hu (Tu\'\^an), Pauline Sho (Tu\'\^an), Markus Kunesch (Tu\'\^an), Mehdi Hafezi Manshadi (Tu\'\^an), Eliza Rutherford (Tu\'\^an), Bo Li (Tu\'\^an), Sissie Hsiao (Tu\'\^an), Iain Barr (Tu\'\^an), Alex Tudor (Tu\'\^an), Matija Kecman (Tu\'\^an), Arsha Nagrani (Tu\'\^an), Vladimir Pchelin (Tu\'\^an), Martin Sundermeyer (Tu\'\^an), Aishwarya P S (Tu\'\^an), Abhijit Karmarkar (Tu\'\^an), Yi Gao (Tu\'\^an), Grishma Chole (Tu\'\^an), Olivier Bachem (Tu\'\^an), Isabel Gao (Tu\'\^an), Arturo BC (Tu\'\^an), Matt Dibb (Tu\'\^an), Mauro Verzetti (Tu\'\^an), Felix Hernandez-Campos (Tu\'\^an), Yana Lunts (Tu\'\^an), Matthew Johnson (Tu\'\^an), Julia Di Trapani (Tu\'\^an), Raphael Koster (Tu\'\^an), Idan Brusilovsky (Tu\'\^an), Binbin Xiong (Tu\'\^an), Megha Mohabey (Tu\'\^an), Han Ke (Tu\'\^an), Joe Zou (Tu\'\^an), Tea Saboli\'c (Tu\'\^an), V\'ictor Campos (Tu\'\^an), John Palowitch (Tu\'\^an), Alex Morris (Tu\'\^an), Linhai Qiu (Tu\'\^an), Pranavaraj Ponnuramu (Tu\'\^an), Fangtao Li (Tu\'\^an), Vivek Sharma (Tu\'\^an), Kiranbir Sodhia (Tu\'\^an), Kaan Tekelioglu (Tu\'\^an), Aleksandr Chuklin (Tu\'\^an), Madhavi Yenugula (Tu\'\^an), Erika Gemzer (Tu\'\^an), Theofilos Strinopoulos (Tu\'\^an), Sam El-Husseini (Tu\'\^an), Huiyu Wang (Tu\'\^an), Yan Zhong (Tu\'\^an), Edouard Leurent (Tu\'\^an), Paul Natsev (Tu\'\^an), Weijun Wang (Tu\'\^an), Dre Mahaarachchi (Tu\'\^an), Tao Zhu (Tu\'\^an), Songyou Peng (Tu\'\^an), Sami Alabed (Tu\'\^an), Cheng-Chun Lee (Tu\'\^an), Anthony Brohan (Tu\'\^an), Arthur Szlam (Tu\'\^an), GS Oh (Tu\'\^an), Anton Kovsharov (Tu\'\^an), Jenny Lee (Tu\'\^an), Renee Wong (Tu\'\^an), Megan Barnes (Tu\'\^an), Gregory Thornton (Tu\'\^an), Felix Gimeno (Tu\'\^an), Omer Levy (Tu\'\^an), Martin Sevenich (Tu\'\^an), Melvin Johnson (Tu\'\^an), Jonathan Mallinson (Tu\'\^an), Robert Dadashi (Tu\'\^an), Ziyue Wang (Tu\'\^an), Qingchun Ren (Tu\'\^an), Preethi Lahoti (Tu\'\^an), Arka Dhar (Tu\'\^an), Josh Feldman (Tu\'\^an), Dan Zheng (Tu\'\^an), Thatcher Ulrich (Tu\'\^an), Liviu Panait (Tu\'\^an), Michiel Blokzijl (Tu\'\^an), Cip Baetu (Tu\'\^an), Josip Matak (Tu\'\^an), Jitendra Harlalka (Tu\'\^an), Maulik Shah (Tu\'\^an), Tal Marian (Tu\'\^an), Daniel von Dincklage (Tu\'\^an), Cosmo Du (Tu\'\^an), Ruy Ley-Wild (Tu\'\^an), Bethanie Brownfield (Tu\'\^an), Max Schumacher (Tu\'\^an), Yury Stuken (Tu\'\^an), Shadi Noghabi (Tu\'\^an), Sonal Gupta (Tu\'\^an), Xiaoqi Ren (Tu\'\^an), Eric Malmi (Tu\'\^an), Felix Weissenberger (Tu\'\^an), Blanca Huergo (Tu\'\^an), Maria Bauza (Tu\'\^an), Thomas Lampe (Tu\'\^an), Arthur Douillard (Tu\'\^an), Mojtaba Seyedhosseini (Tu\'\^an), Roy Frostig (Tu\'\^an), Zoubin Ghahramani (Tu\'\^an), Kelvin Nguyen (Tu\'\^an), Kashyap Krishnakumar (Tu\'\^an), Chengxi Ye (Tu\'\^an), Rahul Gupta (Tu\'\^an), Alireza Nazari (Tu\'\^an), Robert Geirhos (Tu\'\^an), Pete Shaw (Tu\'\^an), Ahmed Eleryan (Tu\'\^an), Dima Damen (Tu\'\^an), Jennimaria Palomaki (Tu\'\^an), Ted Xiao (Tu\'\^an), Qiyin Wu 
(Tu\'\^an), Quan Yuan (Tu\'\^an), Phoenix Meadowlark (Tu\'\^an), Matthew Bilotti (Tu\'\^an), Raymond Lin (Tu\'\^an), Mukund Sridhar (Tu\'\^an), Yannick Schroecker (Tu\'\^an), Da-Woon Chung (Tu\'\^an), Jincheng Luo (Tu\'\^an), Trevor Strohman (Tu\'\^an), Tianlin Liu (Tu\'\^an), Anne Zheng (Tu\'\^an), Jesse Emond (Tu\'\^an), Wei Wang (Tu\'\^an), Andrew Lampinen (Tu\'\^an), Toshiyuki Fukuzawa (Tu\'\^an), Folawiyo Campbell-Ajala (Tu\'\^an), Monica Roy (Tu\'\^an), James Lee-Thorp (Tu\'\^an), Lily Wang (Tu\'\^an), Iftekhar Naim (Tu\'\^an), Tony (Tu\'\^an), Nguy\~\^en (Lucas), Guy Bensky (Lucas), Aditya Gupta (Lucas), Dominika Rogozi\'nska (Lucas), Justin Fu (Lucas), Thanumalayan Sankaranarayana Pillai (Lucas), Petar Veli\v{c}kovi\'c (Lucas), Shahar Drath (Lucas), Philipp Neubeck (Lucas), Vaibhav Tulsyan (Lucas), Arseniy Klimovskiy (Lucas), Don Metzler (Lucas), Sage Stevens (Lucas), Angel Yeh (Lucas), Junwei Yuan (Lucas), Tianhe Yu (Lucas), Kelvin Zhang (Lucas), Alec Go (Lucas), Vincent Tsang (Lucas), Ying Xu (Lucas), Andy Wan (Lucas), Isaac Galatzer-Levy (Lucas), Sam Sobell (Lucas), Abodunrinwa Toki (Lucas), Elizabeth Salesky (Lucas), Wenlei Zhou (Lucas), Diego Antognini (Lucas), Sholto Douglas (Lucas), Shimu Wu (Lucas), Adam Lelkes (Lucas), Frank Kim (Lucas), Paul Cavallaro (Lucas), Ana Salazar (Lucas), Yuchi Liu (Lucas), James Besley (Lucas), Tiziana Refice (Lucas), Yiling Jia (Lucas), Zhang Li (Lucas), Michal Sokolik (Lucas), Arvind Kannan (Lucas), Jon Simon (Lucas), Jo Chick (Lucas), Avia Aharon (Lucas), Meet Gandhi (Lucas), Mayank Daswani (Lucas), Keyvan Amiri (Lucas), Vighnesh Birodkar (Lucas), Abe Ittycheriah (Lucas), Peter Grabowski (Lucas), Oscar Chang (Lucas), Charles Sutton (Lucas), Zhixin (Lucas), Lai (Elena), Umesh Telang (Elena), Susie Sargsyan (Elena), Tao Jiang (Elena), Raphael Hoffmann (Elena), Nicole Brichtova (Elena), Matteo Hessel (Elena), Jonathan Halcrow (Elena), Sammy Jerome (Elena), Geoff Brown (Elena), Alex Tomala (Elena), Elena Buchatskaya (Elena), Dian Yu (Elena), Sachit Menon (Elena), Pol Moreno (Elena), Yuguo Liao (Elena), Vicky Zayats (Elena), Luming Tang (Elena), SQ Mah (Elena), Ashish Shenoy (Elena), Alex Siegman (Elena), Majid Hadian (Elena), Okwan Kwon (Elena), Tao Tu (Elena), Nima Khajehnouri (Elena), Ryan Foley (Elena), Parisa Haghani (Elena), Zhongru Wu (Elena), Vaishakh Keshava (Elena), Khyatti Gupta (Elena), Tony Bruguier (Elena), Rui Yao (Elena), Danny Karmon (Elena), Luisa Zintgraf (Elena), Zhicheng Wang (Elena), Enrique Piqueras (Elena), Junehyuk Jung (Elena), Jenny Brennan (Elena), Diego Machado (Elena), Marissa Giustina (Elena), MH Tessler (Elena), Kamyu Lee (Elena), Qiao Zhang (Elena), Joss Moore (Elena), Kaspar Daugaard (Elena), Alexander Fr\"ommgen (Elena), Jennifer Beattie (Elena), Fred Zhang (Elena), Daniel Kasenberg (Elena), Ty Geri (Elena), Danfeng Qin (Elena), Gaurav Singh Tomar (Elena), Tom Ouyang (Elena), Tianli Yu (Elena), Luowei Zhou (Elena), Rajiv Mathews (Elena), Andy Davis (Elena), Yaoyiran Li (Elena), Jai Gupta (Elena), Damion Yates (Elena), Linda Deng (Elena), Elizabeth Kemp (Elena), Ga-Young Joung (Elena), Sergei Vassilvitskii (Elena), Mandy Guo (Elena), Pallavi LV (Elena), Dave Dopson (Elena), Sami Lachgar (Elena), Lara McConnaughey (Elena), Himadri Choudhury (Elena), Dragos Dena (Elena), Aaron Cohen (Elena), Joshua Ainslie (Elena), Sergey Levi (Elena), Parthasarathy Gopavarapu (Elena), Polina Zablotskaia (Elena), Hugo Vallet (Elena), Sanaz Bahargam (Elena), Xiaodan Tang (Elena), Nenad Tomasev (Elena), Ethan Dyer (Elena), Daniel 
Balle (Elena), Hongrae Lee (Elena), William Bono (Elena), Jorge Gonzalez Mendez (Elena), Vadim Zubov (Elena), Shentao Yang (Elena), Ivor Rendulic (Elena), Yanyan Zheng (Elena), Andrew Hogue (Elena), Golan Pundak (Elena), Ralph Leith (Elena), Avishkar Bhoopchand (Elena), Michael Han (Elena), Mislav \v{Z}ani\'c (Elena), Tom Schaul (Elena), Manolis Delakis (Elena), Tejas Iyer (Elena), Guanyu Wang (Elena), Harman Singh (Elena), Abdelrahman Abdelhamed (Elena), Tara Thomas (Elena), Siddhartha Brahma (Elena), Hilal Dib (Elena), Naveen Kumar (Elena), Wenxuan Zhou (Elena), Liang Bai (Elena), Pushkar Mishra (Elena), Jiao Sun (Elena), Valentin Anklin (Elena), Roykrong Sukkerd (Elena), Lauren Agubuzu (Elena), Anton Briukhov (Elena), Anmol Gulati (Elena), Maximilian Sieb (Elena), Fabio Pardo (Elena), Sara Nasso (Elena), Junquan Chen (Elena), Kexin Zhu (Elena), Tiberiu Sosea (Elena), Alex Goldin (Elena), Keith Rush (Elena), Spurthi Amba Hombaiah (Elena), Andreas Noever (Elena), Allan Zhou (Elena), Sam Haves (Elena), Mary Phuong (Elena), Jake Ades (Elena), Yi-ting Chen (Elena), Lin Yang (Elena), Joseph Pagadora (Elena), Stan Bileschi (Elena), Victor Cotruta (Elena), Rachel Saputro (Elena), Arijit Pramanik (Elena), Sean Ammirati (Elena), Dan Garrette (Elena), Kevin Villela (Elena), Tim Blyth (Elena), Canfer Akbulut (Elena), Neha Jha (Elena), Alban Rrustemi (Elena), Arissa Wongpanich (Elena), Chirag Nagpal (Elena), Yonghui Wu (Elena), Morgane Rivi\`ere (Elena), Sergey Kishchenko (Elena), Pranesh Srinivasan (Elena), Alice Chen (Elena), Animesh Sinha (Elena), Trang Pham (Elena), Bill Jia (Elena), Tom Hennigan (Elena), Anton Bakalov (Elena), Nithya Attaluri (Elena), Drew Garmon (Elena), Daniel Rodriguez (Elena), Dawid Wegner (Elena), Wenhao Jia (Elena), Evan Senter (Elena), Noah Fiedel (Elena), Denis Petek (Elena), Yuchuan Liu (Elena), Cassidy Hardin (Elena), Harshal Tushar Lehri (Elena), Joao Carreira (Elena), Sara Smoot (Elena), Marcel Prasetya (Elena), Nami Akazawa (Elena), Anca Stefanoiu (Elena), Chia-Hua Ho (Elena), Anelia Angelova (Elena), Kate Lin (Elena), Min Kim (Elena), Charles Chen (Elena), Marcin Sieniek (Elena), Alice Li (Elena), Tongfei Guo (Elena), Sorin Baltateanu (Elena), Pouya Tafti (Elena), Michael Wunder (Elena), Nadav Olmert (Elena), Divyansh Shukla (Elena), Jingwei Shen (Elena), Neel Kovelamudi (Elena), Balaji Venkatraman (Elena), Seth Neel (Elena), Romal Thoppilan (Elena), Jerome Connor (Elena), Frederik Benzing (Elena), Axel Stjerngren (Elena), Golnaz Ghiasi (Elena), Alex Polozov (Elena), Joshua Howland (Elena), Theophane Weber (Elena), Justin Chiu (Elena), Ganesh Poomal Girirajan (Elena), Andreas Terzis (Elena), Pidong Wang (Elena), Fangda Li (Elena), Yoav Ben Shalom (Elena), Dinesh Tewari (Elena), Matthew Denton (Elena), Roee Aharoni (Elena), Norbert Kalb (Elena), Heri Zhao (Elena), Junlin Zhang (Elena), Angelos Filos (Elena), Matthew Rahtz (Elena), Lalit Jain (Elena), Connie Fan (Elena), Vitor Rodrigues (Elena), Ruth Wang (Elena), Richard Shin (Elena), Jacob Austin (Elena), Roman Ring (Elena), Mariella Sanchez-Vargas (Elena), Mehadi Hassen (Elena), Ido Kessler (Elena), Uri Alon (Elena), Gufeng Zhang (Elena), Wenhu Chen (Elena), Yenai Ma (Elena), Xiance Si (Elena), Le Hou (Elena), Azalia Mirhoseini (Elena), Marc Wilson (Elena), Geoff Bacon (Elena), Becca Roelofs (Elena), Lei Shu (Elena), Gautam Vasudevan (Elena), Jonas Adler (Elena), Artur Dwornik (Elena), Tayfun Terzi (Elena), Matt Lawlor (Elena), Harry Askham (Elena), Mike Bernico (Elena), Xuanyi Dong (Elena), Chris Hidey (Elena), 
Kevin Kilgour (Elena), Ga\"el Liu (Elena), Surya Bhupatiraju (Elena), Luke Leonhard (Elena), Siqi Zuo (Elena), Partha Talukdar (Elena), Qing Wei (Elena), Aliaksei Severyn (Elena), V\'it List\'ik (Elena), Jong Lee (Elena), Aditya Tripathi (Elena), SK Park (Elena), Yossi Matias (Elena), Hao Liu (Elena), Alex Ruiz (Elena), Rajesh Jayaram (Elena), Jackson Tolins (Elena), Pierre Marcenac (Elena), Yiming Wang (Elena), Bryan Seybold (Elena), Henry Prior (Elena), Deepak Sharma (Elena), Jack Weber (Elena), Mikhail Sirotenko (Elena), Yunhsuan Sung (Elena), Dayou Du (Elena), Ellie Pavlick (Elena), Stefan Zinke (Elena), Markus Freitag (Elena), Max Dylla (Elena), Montse Gonzalez Arenas (Elena), Natan Potikha (Elena), Omer Goldman (Elena), Connie Tao (Elena), Rachita Chhaparia (Elena), Maria Voitovich (Elena), Pawan Dogra (Elena), Andrija Ra\v{z}natovi\'c (Elena), Zak Tsai (Elena), Chong You (Elena), Oleaser Johnson (Elena), George Tucker (Elena), Chenjie Gu (Elena), Jae Yoo (Elena), Maryam Majzoubi (Elena), Valentin Gabeur (Elena), Bahram Raad (Elena), Rocky Rhodes (Elena), Kashyap Kolipaka (Elena), Heidi Howard (Elena), Geta Sampemane (Elena), Benny Li (Elena), Chulayuth Asawaroengchai (Elena), Duy Nguyen (Elena), Chiyuan Zhang (Elena), Timothee Cour (Elena), Xinxin Yu (Elena), Zhao Fu (Elena), Joe Jiang (Elena), Po-Sen Huang (Elena), Gabriela Surita (Elena), I\~naki Iturrate (Elena), Yael Karov (Elena), Michael Collins (Elena), Martin Baeuml (Elena), Fabian Fuchs (Elena), Shilpa Shetty (Elena), Swaroop Ramaswamy (Elena), Sayna Ebrahimi (Elena), Qiuchen Guo (Elena), Jeremy Shar (Elena), Gabe Barth-Maron (Elena), Sravanti Addepalli (Elena), Bryan Richter (Elena), Chin-Yi Cheng (Elena), Eug\'enie Rives (Elena), Fei Zheng (Elena), Johannes Griesser (Elena), Nishanth Dikkala (Elena), Yoel Zeldes (Elena), Ilkin Safarli (Elena), Dipanjan Das (Elena), Himanshu Srivastava (Elena), Sadh MNM Khan (Elena), Xin Li (Elena), Aditya Pandey (Elena), Larisa Markeeva (Elena), Dan Belov (Elena), Qiqi Yan (Elena), Miko{\l}aj Rybi\'nski (Elena), Tao Chen (Elena), Megha Nawhal (Elena), Michael Quinn (Elena), Vineetha Govindaraj (Elena), Sarah York (Elena), Reed Roberts (Elena), Roopal Garg (Elena), Namrata Godbole (Elena), Jake Abernethy (Elena), Anil Das (Elena), Lam Nguyen Thiet (Elena), Jonathan Tompson (Elena), John Nham (Elena), Neera Vats (Elena), Ben Caine (Elena), Wesley Helmholz (Elena), Francesco Pongetti (Elena), Yeongil Ko (Elena), James An (Elena), Clara Huiyi Hu (Elena), Yu-Cheng Ling (Elena), Julia Pawar (Elena), Robert Leland (Elena), Keisuke Kinoshita (Elena), Waleed Khawaja (Elena), Marco Selvi (Elena), Eugene Ie (Elena), Danila Sinopalnikov (Elena), Lev Proleev (Elena), Nilesh Tripuraneni (Elena), Michele Bevilacqua (Elena), Seungji Lee (Elena), Clayton Sanford (Elena), Dan Suh (Elena), Dustin Tran (Elena), Jeff Dean (Elena), Simon Baumgartner (Elena), Jens Heitkaemper (Elena), Sagar Gubbi (Elena), Kristina Toutanova (Elena), Yichong Xu (Elena), Chandu Thekkath (Elena), Keran Rong (Elena), Palak Jain (Elena), Annie Xie (Elena), Yan Virin (Elena), Yang Li (Elena), Lubo Litchev (Elena), Richard Powell (Elena), Tarun Bharti (Elena), Adam Kraft (Elena), Nan Hua (Elena), Marissa Ikonomidis (Elena), Ayal Hitron (Elena), Sanjiv Kumar (Elena), Loic Matthey (Elena), Sophie Bridgers (Elena), Lauren Lax (Elena), Ishaan Malhi (Elena), Ondrej Skopek (Elena), Ashish Gupta (Elena), Jiawei Cao (Elena), Mitchelle Rasquinha (Elena), Siim P\~oder (Elena), Wojciech Stokowiec (Elena), Nicholas Roth (Elena), Guowang Li 
(Elena), Micha\"el Sander (Elena), Joshua Kessinger (Elena), Vihan Jain (Elena), Edward Loper (Elena), Wonpyo Park (Elena), Michal Yarom (Elena), Liqun Cheng (Elena), Guru Guruganesh (Elena), Kanishka Rao (Elena), Yan Li (Elena), Catarina Barros (Elena), Mikhail Sushkov (Elena), Chun-Sung Ferng (Elena), Rohin Shah (Elena), Ophir Aharoni (Elena), Ravin Kumar (Elena), Tim McConnell (Elena), Peiran Li (Elena), Chen Wang (Elena), Fernando Pereira (Elena), Craig Swanson (Elena), Fayaz Jamil (Elena), Yan Xiong (Elena), Anitha Vijayakumar (Elena), Prakash Shroff (Elena), Kedar Soparkar (Elena), Jindong Gu (Elena), Livio Baldini Soares (Elena), Eric Wang (Elena), Kushal Majmundar (Elena), Aurora Wei (Elena), Kai Bailey (Elena), Nora Kassner (Elena), Chizu Kawamoto (Elena), Goran \v{Z}u\v{z}i\'c (Elena), Victor Gomes (Elena), Abhirut Gupta (Elena), Michael Guzman (Elena), Ishita Dasgupta (Elena), Xinyi Bai (Elena), Zhufeng Pan (Elena), Francesco Piccinno (Elena), Hadas Natalie Vogel (Elena), Octavio Ponce (Elena), Adrian Hutter (Elena), Paul Chang (Elena), Pan-Pan Jiang (Elena), Ionel Gog (Elena), Vlad Ionescu (Elena), James Manyika (Elena), Fabian Pedregosa (Elena), Harry Ragan (Elena), Zach Behrman (Elena), Ryan Mullins (Elena), Coline Devin (Elena), Aroonalok Pyne (Elena), Swapnil Gawde (Elena), Martin Chadwick (Elena), Yiming Gu (Elena), Sasan Tavakkol (Elena), Andy Twigg (Elena), Naman Goyal (Elena), Ndidi Elue (Elena), Anna Goldie (Elena), Srinivasan Venkatachary (Elena), Hongliang Fei (Elena), Ziqiang Feng (Elena), Marvin Ritter (Elena), Isabel Leal (Elena), Sudeep Dasari (Elena), Pei Sun (Elena), Alif Raditya Rochman (Elena), Brendan O'Donoghue (Elena), Yuchen Liu (Elena), Jim Sproch (Elena), Kai Chen (Elena), Natalie Clay (Elena), Slav Petrov (Elena), Sailesh Sidhwani (Elena), Ioana Mihailescu (Elena), Alex Panagopoulos (Elena), AJ Piergiovanni (Elena), Yunfei Bai (Elena), George Powell (Elena), Deep Karkhanis (Elena), Trevor Yacovone (Elena), Petr Mitrichev (Elena), Joe Kovac (Elena), Dave Uthus (Elena), Amir Yazdanbakhsh (Elena), David Amos (Elena), Steven Zheng (Elena), Bing Zhang (Elena), Jin Miao (Elena), Bhuvana Ramabhadran (Elena), Soroush Radpour (Elena), Shantanu Thakoor (Elena), Josh Newlan (Elena), Oran Lang (Elena), Orion Jankowski (Elena), Shikhar Bharadwaj (Elena), Jean-Michel Sarr (Elena), Shereen Ashraf (Elena), Sneha Mondal (Elena), Jun Yan (Elena), Ankit Singh Rawat (Elena), Sarmishta Velury (Elena), Greg Kochanski (Elena), Tom Eccles (Elena), Franz Och (Elena), Abhanshu Sharma (Elena), Ethan Mahintorabi (Elena), Alex Gurney (Elena), Carrie Muir (Elena), Vered Cohen (Elena), Saksham Thakur (Elena), Adam Bloniarz (Elena), Asier Mujika (Elena), Alexander Pritzel (Elena), Paul Caron (Elena), Altaf Rahman (Elena), Fiona Lang (Elena), Yasumasa Onoe (Elena), Petar Sirkovic (Elena), Jay Hoover (Elena), Ying Jian (Elena), Pablo Duque (Elena), Arun Narayanan (Elena), David Soergel (Elena), Alex Haig (Elena), Loren Maggiore (Elena), Shyamal Buch (Elena), Josef Dean (Elena), Ilya Figotin (Elena), Igor Karpov (Elena), Shaleen Gupta (Elena), Denny Zhou (Elena), Muhuan Huang (Elena), Ashwin Vaswani (Elena), Christopher Semturs (Elena), Kaushik Shivakumar (Elena), Yu Watanabe (Elena), Vinodh Kumar Rajendran (Elena), Eva Lu (Elena), Yanhan Hou (Elena), Wenting Ye (Elena), Shikhar Vashishth (Elena), Nana Nti (Elena), Vytenis Sakenas (Elena), Darren Ni (Elena), Doug DeCarlo (Elena), Michael Bendersky (Elena), Sumit Bagri (Elena), Nacho Cano (Elena), Elijah Peake (Elena), Simon Tokumine 
(Elena), Varun Godbole (Elena), Carlos Gu\'ia (Elena), Tanya Lando (Elena), Vittorio Selo (Elena), Seher Ellis (Elena), Danny Tarlow (Elena), Daniel Gillick (Elena), Alessandro Epasto (Elena), Siddhartha Reddy Jonnalagadda (Elena), Meng Wei (Elena), Meiyan Xie (Elena), Ankur Taly (Elena), Michela Paganini (Elena), Mukund Sundararajan (Elena), Daniel Toyama (Elena), Ting Yu (Elena), Dessie Petrova (Elena), Aneesh Pappu (Elena), Rohan Agrawal (Elena), Senaka Buthpitiya (Elena), Justin Frye (Elena), Thomas Buschmann (Elena), Remi Crocker (Elena), Marco Tagliasacchi (Elena), Mengchao Wang (Elena), Da Huang (Elena), Sagi Perel (Elena), Brian Wieder (Elena), Hideto Kazawa (Elena), Weiyue Wang (Elena), Jeremy Cole (Elena), Himanshu Gupta (Elena), Ben Golan (Elena), Seojin Bang (Elena), Nitish Kulkarni (Elena), Ken Franko (Elena), Casper Liu (Elena), Doug Reid (Elena), Sid Dalmia (Elena), Jay Whang (Elena), Kevin Cen (Elena), Prasha Sundaram (Elena), Johan Ferret (Elena), Berivan Isik (Elena), Lucian Ionita (Elena), Guan Sun (Elena), Anna Shekhawat (Elena), Muqthar Mohammad (Elena), Philip Pham (Elena), Ronny Huang (Elena), Karthik Raman (Elena), Xingyi Zhou (Elena), Ross Mcilroy (Elena), Austin Myers (Elena), Sheng Peng (Elena), Jacob Scott (Elena), Paul Covington (Elena), Sofia Erell (Elena), Pratik Joshi (Elena), Jo\~ao Gabriel Oliveira (Elena), Natasha Noy (Elena), Tajwar Nasir (Elena), Jake Walker (Elena), Vera Axelrod (Elena), Tim Dozat (Elena), Pu Han (Elena), Chun-Te Chu (Elena), Eugene Weinstein (Elena), Anand Shukla (Elena), Shreyas Chandrakaladharan (Elena), Petra Poklukar (Elena), Bonnie Li (Elena), Ye Jin (Elena), Prem Eruvbetine (Elena), Steven Hansen (Elena), Avigail Dabush (Elena), Alon Jacovi (Elena), Samrat Phatale (Elena), Chen Zhu (Elena), Steven Baker (Elena), Mo Shomrat (Elena), Yang Xiao (Elena), Jean Pouget-Abadie (Elena), Mingyang Zhang (Elena), Fanny Wei (Elena), Yang Song (Elena), Helen King (Elena), Yiling Huang (Elena), Yun Zhu (Elena), Ruoxi Sun (Elena), Juliana Vicente Franco (Elena), Chu-Cheng Lin (Elena), Sho Arora (Elena), Hui (Elena), Li, Vivian Xia, Luke Vilnis, Mariano Schain, Kaiz Alarakyia, Laurel Prince, Aaron Phillips, Caleb Habtegebriel, Luyao Xu, Huan Gui, Santiago Ontanon, Lora Aroyo, Karan Gill, Peggy Lu, Yash Katariya, Dhruv Madeka, Shankar Krishnan, Shubha Srinivas Raghvendra, James Freedman, Yi Tay, Gaurav Menghani, Peter Choy, Nishita Shetty, Dan Abolafia, Doron Kukliansky, Edward Chou, Jared Lichtarge, Ken Burke, Ben Coleman, Dee Guo, Larry Jin, Indro Bhattacharya, Victoria Langston, Yiming Li, Suyog Kotecha, Alex Yakubovich, Xinyun Chen, Petre Petrov, Tolly Powell, Yanzhang He, Corbin Quick, Kanav Garg, Dawsen Hwang, Yang Lu, Srinadh Bhojanapalli, Kristian Kjems, Ramin Mehran, Aaron Archer, Hado van Hasselt, Ashwin Balakrishna, JK Kearns, Meiqi Guo, Jason Riesa, Mikita Sazanovich, Xu Gao, Chris Sauer, Chengrun Yang, XiangHai Sheng, Thomas Jimma, Wouter Van Gansbeke, Vitaly Nikolaev, Wei Wei, Katie Millican, Ruizhe Zhao, Justin Snyder, Levent Bolelli, Maura O'Brien, Shawn Xu, Fei Xia, Wentao Yuan, Arvind Neelakantan, David Barker, Sachin Yadav, Hannah Kirkwood, Farooq Ahmad, Joel Wee, Jordan Grimstad, Boyu Wang, Matthew Wiethoff, Shane Settle, Miaosen Wang, Charles Blundell, Jingjing Chen, Chris Duvarney, Grace Hu, Olaf Ronneberger, Alex Lee, Yuanzhen Li, Abhishek Chakladar, Alena Butryna, Georgios Evangelopoulos, Guillaume Desjardins, Jonni Kanerva, Henry Wang, Averi Nowak, Nick Li, Alyssa Loo, Art Khurshudov, Laurent El Shafey, Nagabhushan Baddi, 
Karel Lenc, Yasaman Razeghi, Tom Lieber, Amer Sinha, Xiao Ma, Yao Su, James Huang, Asahi Ushio, Hanna Klimczak-Pluci\'nska, Kareem Mohamed, JD Chen, Simon Osindero, Stav Ginzburg, Lampros Lamprou, Vasilisa Bashlovkina, Duc-Hieu Tran, Ali Khodaei, Ankit Anand, Yixian Di, Ramy Eskander, Manish Reddy Vuyyuru, Jasmine Liu, Aishwarya Kamath, Roman Goldenberg, Mathias Bellaiche, Juliette Pluto, Bill Rosgen, Hassan Mansoor, William Wong, Suhas Ganesh, Eric Bailey, Scott Baird, Dan Deutsch, Jinoo Baek, Xuhui Jia, Chansoo Lee, Abe Friesen, Nathaniel Braun, Kate Lee, Amayika Panda, Steven M. Hernandez, Duncan Williams, Jianqiao Liu, Ethan Liang, Arnaud Autef, Emily Pitler, Deepali Jain, Phoebe Kirk, Oskar Bunyan, Jaume Sanchez Elias, Tongxin Yin, Machel Reid, Aedan Pope, Nikita Putikhin, Bidisha Samanta, Sergio Guadarrama, Dahun Kim, Simon Rowe, Marcella Valentine, Geng Yan, Alex Salcianu, David Silver, Gan Song, Richa Singh, Shuai Ye, Hannah DeBalsi, Majd Al Merey, Eran Ofek, Albert Webson, Shibl Mourad, Ashwin Kakarla, Silvio Lattanzi, Nick Roy, Evgeny Sluzhaev, Christina Butterfield, Alessio Tonioni, Nathan Waters, Sudhindra Kopalle, Jason Chase, James Cohan, Girish Ramchandra Rao, Robert Berry, Michael Voznesensky, Shuguang Hu, Kristen Chiafullo, Sharat Chikkerur, George Scrivener, Ivy Zheng, Jeremy Wiesner, Wolfgang Macherey, Timothy Lillicrap, Fei Liu, Brian Walker, David Welling, Elinor Davies, Yangsibo Huang, Lijie Ren, Nir Shabat, Alessandro Agostini, Mariko Iinuma, Dustin Zelle, Rohit Sathyanarayana, Andrea D'olimpio, Morgan Redshaw, Matt Ginsberg, Ashwin Murthy, Mark Geller, Tatiana Matejovicova, Ayan Chakrabarti, Ryan Julian, Christine Chan, Qiong Hu, Daniel Jarrett, Manu Agarwal, Jeshwanth Challagundla, Tao Li, Sandeep Tata, Wen Ding, Maya Meng, Zhuyun Dai, Giulia Vezzani, Shefali Garg, Jannis Bulian, Mary Jasarevic, Honglong Cai, Harish Rajamani, Adam Santoro, Florian Hartmann, Chen Liang, Bartek Perz, Apoorv Jindal, Fan Bu, Sungyong Seo, Ryan Poplin, Adrian Goedeckemeyer, Badih Ghazi, Nikhil Khadke, Leon Liu, Kevin Mather, Mingda Zhang, Ali Shah, Alex Chen, Jinliang Wei, Keshav Shivam, Yuan Cao, Donghyun Cho, Angelo Scorza Scarpati, Michael Moffitt, Clara Barbu, Ivan Jurin, Ming-Wei Chang, Hongbin Liu, Hao Zheng, Shachi Dave, Christine Kaeser-Chen, Xiaobin Yu, Alvin Abdagic, Lucas Gonzalez, Yanping Huang, Peilin Zhong, Cordelia Schmid, Bryce Petrini, Alex Wertheim, Jifan Zhu, Hoang Nguyen, Kaiyang Ji, Yanqi Zhou, Tao Zhou, Fangxiaoyu Feng, Regev Cohen, David Rim, Shubham Milind Phal, Petko Georgiev, Ariel Brand, Yue Ma, Wei Li, Somit Gupta, Chao Wang, Pavel Dubov, Jean Tarbouriech, Kingshuk Majumder, Huijian Li, Norman Rink, Apurv Suman, Yang Guo, Yinghao Sun, Arun Nair, Xiaowei Xu, Mohamed Elhawaty, Rodrigo Cabrera, Guangxing Han, Julian Eisenschlos, Junwen Bai, Yuqi Li, Yamini Bansal, Thibault Sellam, Mina Khan, Hung Nguyen, Justin Mao-Jones, Nikos Parotsidis, Jake Marcus, Cindy Fan, Roland Zimmermann, Yony Kochinski, Laura Graesser, Feryal Behbahani, Alvaro Caceres, Michael Riley, Patrick Kane, Sandra Lefdal, Rob Willoughby, Paul Vicol, Lun Wang, Shujian Zhang, Ashleah Gill, Yu Liang, Gautam Prasad, Soroosh Mariooryad, Mehran Kazemi, Zifeng Wang, Kritika Muralidharan, Paul Voigtlaender, Jeffrey Zhao, Huanjie Zhou, Nina D'Souza, Aditi Mavalankar, S\'eb Arnold, Nick Young, Obaid Sarvana, Chace Lee, Milad Nasr, Tingting Zou, Seokhwan Kim, Lukas Haas, Kaushal Patel, Neslihan Bulut, David Parkinson, Courtney Biles, Dmitry Kalashnikov, Chi Ming To, Aviral Kumar, Jessica Austin, Alex 
Greve, Lei Zhang, Megha Goel, Yeqing Li, Sergey Yaroshenko, Max Chang, Abhishek Jindal, Geoff Clark, Hagai Taitelbaum, Dale Johnson, Ofir Roval, Jeongwoo Ko, Anhad Mohananey, Christian Schuler, Shenil Dodhia, Ruichao Li, Kazuki Osawa, Claire Cui, Peng Xu, Rushin Shah, Tao Huang, Ela Gruzewska, Nathan Clement, Mudit Verma, Olcan Sercinoglu, Hai Qian, Viral Shah, Masa Yamaguchi, Abhinit Modi, Takahiro Kosakai, Thomas Strohmann, Junhao Zeng, Beliz Gunel, Jun Qian, Austin Tarango, Krzysztof Jastrz\k{e}bski, Robert David, Jyn Shan, Parker Schuh, Kunal Lad, Willi Gierke, Mukundan Madhavan, Xinyi Chen, Mark Kurzeja, Rebeca Santamaria-Fernandez, Dawn Chen, Alexandra Cordell, Yuri Chervonyi, Frankie Garcia, Nithish Kannen, Vincent Perot, Nan Ding, Shlomi Cohen-Ganor, Victor Lavrenko, Junru Wu, Georgie Evans, Cicero Nogueira dos Santos, Madhavi Sewak, Ashley Brown, Andrew Hard, Joan Puigcerver, Zeyu Zheng, Yizhong Liang, Evgeny Gladchenko, Reeve Ingle, Uri First, Pierre Sermanet, Charlotte Magister, Mihajlo Velimirovi\'c, Sashank Reddi, Susanna Ricco, Eirikur Agustsson, Hartwig Adam, Nir Levine, David Gaddy, Dan Holtmann-Rice, Xuanhui Wang, Ashutosh Sathe, Abhijit Guha Roy, Bla\v{z} Bratani\v{c}, Alen Carin, Harsh Mehta, Silvano Bonacina, Nicola De Cao, Mara Finkelstein, Verena Rieser, Xinyi Wu, Florent Altch\'e, Dylan Scandinaro, Li Li, Nino Vieillard, Nikhil Sethi, Garrett Tanzer, Zhi Xing, Shibo Wang, Parul Bhatia, Gui Citovsky, Thomas Anthony, Sharon Lin, Tianze Shi, Shoshana Jakobovits, Gena Gibson, Raj Apte, Lisa Lee, Mingqing Chen, Arunkumar Byravan, Petros Maniatis, Kellie Webster, Andrew Dai, Pu-Chin Chen, Jiaqi Pan, Asya Fadeeva, Zach Gleicher, Thang Luong, Niket Kumar Bhumihar

Abstract: In this report, we introduce the Gemini 2.X model family: Gemini 2.5 Pro and Gemini 2.5 Flash, as well as our earlier Gemini 2.0 Flash and Flash-Lite models. Gemini 2.5 Pro is our most capable model yet, achieving SoTA performance on frontier coding and reasoning benchmarks. In addition to its strong coding and reasoning skills, Gemini 2.5 Pro is a thinking model that excels at multimodal understanding and can now process up to 3 hours of video content. Its unique combination of long-context, multimodal, and reasoning capabilities unlocks new agentic workflows. Gemini 2.5 Flash provides excellent reasoning abilities at a fraction of the compute and latency requirements, and Gemini 2.0 Flash and Flash-Lite provide high performance at low latency and cost. Taken together, the Gemini 2.X model generation spans the full Pareto frontier of model capability versus cost, allowing users to explore the boundaries of what is possible with complex agentic problem solving.

replace SAGE: A Visual Language Model for Anomaly Detection via Fact Enhancement and Entropy-aware Alignment

Authors: Guoxin Zang, Xue Li, Donglin Di, Lanshun Nie, Dechen Zhan, Yang Song, Lei Fan

Abstract: While Vision-Language Models (VLMs) have shown promising progress in general multimodal tasks, they often struggle in industrial anomaly detection and reasoning, particularly in delivering interpretable explanations and generalizing to unseen categories. This limitation stems from the inherently domain-specific nature of anomaly detection, which hinders the applicability of existing VLMs in industrial scenarios that require precise, structured, and context-aware analysis. To address these challenges, we propose SAGE, a VLM-based framework that enhances anomaly reasoning through Self-Guided Fact Enhancement (SFE) and Entropy-aware Direct Preference Optimization (E-DPO). SFE integrates domain-specific knowledge into visual reasoning via fact extraction and fusion, while E-DPO aligns model outputs with expert preferences using entropy-aware optimization. Additionally, we introduce AD-PL, a preference-optimized dataset tailored for industrial anomaly reasoning, consisting of 28,415 question-answering instances with expert-ranked responses. To evaluate anomaly reasoning models, we develop Multiscale Logical Evaluation (MLE), a quantitative framework analyzing model logic and consistency. SAGE demonstrates superior performance on industrial anomaly datasets under zero-shot and one-shot settings. The code, model and dataset are available at https://github.com/amoreZgx1n/SAGE.

URLs: https://github.com/amoreZgx1n/SAGE.
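
A minimal sketch of what an entropy-aware preference loss could look like, assuming E-DPO weights each preference pair by the policy's predictive entropy; the weighting form, the temperature tau, and the tensor names are assumptions, not the paper's definition:

import torch
import torch.nn.functional as F

def entropy_aware_dpo_loss(logp_chosen, logp_rejected,
                           ref_chosen, ref_rejected,
                           entropy, beta=0.1, tau=1.0):
    # Standard DPO margin between policy and reference log-probabilities.
    margin = beta * ((logp_chosen - ref_chosen) - (logp_rejected - ref_rejected))
    per_pair = -F.logsigmoid(margin)
    # Hypothetical entropy-aware term: down-weight pairs on which the policy
    # is already very uncertain (one possible reading of "E-DPO").
    weight = torch.exp(-entropy / tau)
    return (weight * per_pair).mean()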

replace Advancing Large Language Models for Tibetan with Curated Data and Continual Pre-Training

Authors: Leiyu Pan, Bojian Xiong, Lei Yang, Renren Jin, Shaowei Zhang, Yue Chen, Ling Shi, Jiang Zhou, Junru Wu, Zhen Wang, Jianxiang Peng, Juesi Xiao, Tianyu Dong, Zhuowen Han, Zhuo Chen, Yuqi Ren, Deyi Xiong

Abstract: Large language models have achieved remarkable progress across many languages. However, Tibetan, as a representative low-resource language, is particularly underrepresented in existing models due to the scarcity of high-quality training corpora. To address this gap, we curate the largest Tibetan pre-training corpus to date, aggregating data from diverse sources and applying a dedicated data cleaning and processing pipeline tailored for Tibetan. With the curated data, we perform continued pre-training and post-training of a multilingual base model to enhance its generative capabilities in Tibetan. To evaluate the Tibetan capabilities of the model, we create new high-quality Tibetan benchmarks and complement them with existing public benchmarks. Experimental results demonstrate that our model consistently and significantly outperforms both open-source models of similar scale and Tibetan-tailored models across a wide range of tasks.
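
For context, continued pre-training of a multilingual base model on a curated monolingual corpus typically follows the standard Hugging Face causal-LM recipe sketched below; the base checkpoint, corpus file name, and hyperparameters are placeholders, not the paper's setup:

from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

base = "Qwen/Qwen2.5-7B"  # placeholder multilingual base; the paper does not name one here
tok = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

raw = load_dataset("text", data_files={"train": "tibetan_corpus.txt"})
ds = raw["train"].map(
    lambda b: tok(b["text"], truncation=True, max_length=2048),
    batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="tibetan-cpt",
                           per_device_train_batch_size=1,
                           gradient_accumulation_steps=16,
                           num_train_epochs=1),
    train_dataset=ds,
    data_collator=DataCollatorForLanguageModeling(tok, mlm=False),
)
trainer.train()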

replace A Survey of Deep Learning for Geometry Problem Solving

Authors: Jianzhe Ma, Wenxuan Wang, Qin Jin

Abstract: Geometry problem solving is a key area of mathematical reasoning, which is widely involved in many important fields such as education, mathematical ability assessment of artificial intelligence, and multimodal ability assessment. In recent years, the rapid development of deep learning technology, especially the rise of multimodal large language models, has triggered a widespread research boom. This paper provides a survey of the applications of deep learning in geometry problem solving, including (i) a comprehensive summary of the relevant tasks in geometry problem solving; (ii) a thorough review of related deep learning methods; (iii) a detailed analysis of evaluation metrics and methods; and (iv) a critical discussion of the current challenges and future directions that can be explored. Our goal is to provide a comprehensive and practical reference of deep learning for geometry problem solving to promote further developments in this field. We create a continuously updated list of papers on GitHub: https://github.com/majianz/dl4gps.

URLs: https://github.com/majianz/dl4gps.

replace Seed-X: Building Strong Multilingual Translation LLM with 7B Parameters

Authors: Shanbo Cheng, Yu Bao, Qian Cao, Luyang Huang, Liyan Kang, Zhicheng Liu, Yu Lu, Wenhao Zhu, Jingwen Chen, Zhichao Huang, Tao Li, Yifu Li, Huiying Lin, Sitong Liu, Ningxin Peng, Shuaijie She, Lu Xu, Nuo Xu, Sen Yang, Runsheng Yu, Yiming Yu, Liehao Zou, Hang Li, Lu Lu, Yuxuan Wang, Yonghui Wu

Abstract: Multilingual translation is a challenging task for large language models (LLMs), which must handle intricate language patterns and avoid the stilted phrasing that arises in automated translation. In this paper, we introduce Seed-X, a family of open-source LLMs comprising instruct and reasoning models that push the limits of translation capability at a 7B parameter size. The base model is pre-trained on a diverse, high-quality dataset encompassing both monolingual and bilingual content across 28 languages, harnessing the full potential of multilingual data. The instruct model is then finetuned to translate via Chain-of-Thought (CoT) reasoning and further enhanced through reinforcement learning (RL) to achieve better generalization across diverse language pairs. Seed-X achieves performance comparable to leading closed-source models, including Gemini-2.5 and GPT-4o, across 28 languages, and significantly outperforms larger open-source models in both automatic metrics and human evaluations. We share best practices from our optimization process and make the parameters publicly available to advance translation research and applications.
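
As an illustration of CoT-style translation prompting (the paper's actual template and reward design are not given here), a hedged sketch:

def cot_translation_prompt(src_text, src_lang="English", tgt_lang="German"):
    # Hypothetical template: ask the model to reason before committing to output.
    return (
        f"Translate the following {src_lang} sentence into {tgt_lang}.\n"
        "First think step by step about ambiguous words, idioms, and register.\n"
        "Then output the final translation on a line beginning 'Translation:'.\n\n"
        f"Source: {src_text}"
    )

def extract_translation(model_output):
    # Keep only the text after the last 'Translation:' marker.
    return model_output.rsplit("Translation:", 1)[-1].strip()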

replace Lessons from the TREC Plain Language Adaptation of Biomedical Abstracts (PLABA) track

Authors: Brian Ondov, William Xia, Kush Attal, Ishita Unde, Jerry He, Dina Demner-Fushman

Abstract: Objective: Recent advances in language models have shown potential to adapt professional-facing biomedical literature to plain language, making it accessible to patients and caregivers. However, their unpredictability, combined with the high potential for harm in this domain, means rigorous evaluation is necessary. Our goals with this track were to stimulate research and to provide high-quality evaluation of the most promising systems. Methods: We hosted the Plain Language Adaptation of Biomedical Abstracts (PLABA) track at the 2023 and 2024 Text Retrieval Conferences. Tasks included complete, sentence-level rewriting of abstracts (Task 1) as well as identifying and replacing difficult terms (Task 2). For automatic evaluation of Task 1, we developed a four-fold set of professionally written references. Submissions for both tasks received extensive manual evaluation from biomedical experts. Results: Twelve teams spanning twelve countries participated in the track, with models ranging from multilayer perceptrons to large pretrained transformers. In manual judgments of Task 1, top-performing models rivaled human levels of factual accuracy and completeness, but not simplicity or brevity. Automatic, reference-based metrics generally did not correlate well with manual judgments. In Task 2, systems struggled with identifying difficult terms and classifying how to replace them. When generating replacements, however, LLM-based systems did well in manually judged accuracy, completeness, and simplicity, though not in brevity. Conclusion: The PLABA track showed promise for using Large Language Models to adapt biomedical literature for the general public, while also highlighting their deficiencies and the need for improved automatic benchmarking tools.
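
To make the reference-based setup concrete, here is one way to score a plain-language rewrite against a four-fold reference set with the SARI metric from Hugging Face evaluate; the sentences are invented and the track's official scoring tooling may differ:

import evaluate

sari = evaluate.load("sari")
sources = ["The myocardial infarction was treated with thrombolytics."]
predictions = ["The heart attack was treated with clot-busting drugs."]
# Four professionally written references per source, mirroring the Task 1 setup.
references = [[
    "Doctors treated the heart attack with clot-dissolving medicine.",
    "The heart attack was treated with drugs that dissolve clots.",
    "Clot-busting drugs were used to treat the heart attack.",
    "The heart attack was treated with clot-busting medication.",
]]
print(sari.compute(sources=sources, predictions=predictions, references=references))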

replace Promptomatix: An Automatic Prompt Optimization Framework for Large Language Models

Authors: Rithesh Murthy, Ming Zhu, Liangwei Yang, Jielin Qiu, Juntao Tan, Shelby Heinecke, Caiming Xiong, Silvio Savarese, Huan Wang

Abstract: Large Language Models (LLMs) perform best with well-crafted prompts, yet prompt engineering remains manual, inconsistent, and inaccessible to non-experts. We introduce Promptomatix, an automatic prompt optimization framework that transforms natural language task descriptions into high-quality prompts without requiring manual tuning or domain expertise. Promptomatix supports both a lightweight meta-prompt-based optimizer and a DSPy-powered compiler, with a modular design enabling future extension to more advanced frameworks. The system analyzes user intent, generates synthetic training data, selects prompting strategies, and refines prompts using cost-aware objectives. Evaluated across 5 task categories, Promptomatix achieves competitive or superior performance compared to existing libraries while reducing prompt length and computational overhead, making prompt optimization scalable and efficient.
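
A toy version of a cost-aware objective of the kind described, trading task quality against prompt length; the linear penalty and the lambda value are assumptions:

def cost_aware_score(task_accuracy, prompt_tokens, lam=0.001):
    # Hypothetical objective: reward task quality, penalize prompt length.
    return task_accuracy - lam * prompt_tokens

candidates = {
    "terse":   {"accuracy": 0.81, "tokens": 40},
    "verbose": {"accuracy": 0.83, "tokens": 900},
}
best = max(candidates, key=lambda k: cost_aware_score(candidates[k]["accuracy"],
                                                      candidates[k]["tokens"]))
print(best)  # "terse": the small accuracy gain does not justify 860 extra tokens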

replace X-Intelligence 3.0: Training and Evaluating Reasoning LLM for Semiconductor Display

Authors: Xiaolin Yan, Yangxing Liu, Jiazhang Zheng, Chi Liu, Mingyu Du, Caisheng Chen, Haoyang Liu, Ming Ding, Yuan Li, Qiuping Liao, Linfeng Li, Zhili Mei, Siyu Wan, Li Li, Ruyi Zhong, Jiangling Yu, Xule Liu, Huihui Hu, Jiameng Yue, Ruohui Cheng, Qi Yang, Liangqing Wu, Ke Zhu, Chi Zhang, Chufei Jing, Yifan Zhou, Yan Liang, Dongdong Li, Zhaohui Wang, Bin Zhao, Mingzhou Wu, Mingzhong Zhou, Peng Du, Zuomin Liao, Chao Dai, Pengfei Liang, Xiaoguang Zhu, Yu Zhang, Yu Gu, Kun Pan, Yuan Wu, Yanqing Guan, Shaojing Wu, Zikang Feng, Xianze Ma, Peishan Cheng, Wenjuan Jiang, Jing Ba, Huihao Yu, Zeping Hu, Yuan Xu, Zhiwei Liu, He Wang, Zhenguo Lin, Ming Liu, Yanhong Meng

Abstract: Large language models (LLMs) have recently achieved significant advances in reasoning and demonstrated their advantages in solving challenging problems. Yet, their effectiveness in the semiconductor display industry remains limited due to a lack of domain-specific training and expertise. To bridge this gap, we present X-Intelligence 3.0, the first high-performance reasoning model specifically developed for the semiconductor display industry. This model is designed to deliver expert-level understanding and reasoning for the industry's complex challenges. Leveraging a carefully curated industry knowledge base, the model undergoes supervised fine-tuning and reinforcement learning to enhance its reasoning and comprehension capabilities. To further accelerate development, we implemented an automated evaluation framework that simulates expert-level assessments. We also integrated a domain-specific retrieval-augmented generation (RAG) mechanism, resulting in notable performance gains on benchmark datasets. Despite its relatively compact size of 32 billion parameters, X-Intelligence 3.0 outperforms SOTA DeepSeek-R1-671B across multiple evaluations. This demonstrates its exceptional efficiency and establishes it as a powerful solution to the longstanding reasoning challenges faced by the semiconductor display industry.
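
A minimal sketch of the retrieval step in a domain-specific RAG setup like the one described, assuming knowledge-base passages are already embedded; the prompt wording is illustrative, not the system's:

import numpy as np

def top_k(query_vec, doc_vecs, k=3):
    # Rank pre-embedded knowledge-base passages by cosine similarity.
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    return np.argsort(d @ q)[::-1][:k]

def build_prompt(question, passages, idx):
    context = "\n".join(passages[i] for i in idx)
    return (f"Answer using the semiconductor-display notes below.\n\n"
            f"{context}\n\nQuestion: {question}\nAnswer:")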

replace Mangosteen: An Open Thai Corpus for Language Model Pretraining

Authors: Wannaphong Phatthiyaphaibun, Can Udomcharoenchaikit, Pakpoom Singkorapoom, Kunat Pipatanakul, Ekapol Chuangsuwanich, Peerat Limkonchotiwat, Sarana Nutanong

Abstract: Pre-training data shapes a language model's quality, but raw web text is noisy and demands careful cleaning. Existing large-scale corpora rely on English-centric or language-agnostic pipelines whose heuristics do not capture Thai script or cultural nuances, leaving risky material such as gambling content untreated. Prior Thai-specific efforts customize pipelines or build new ones, yet seldom release their data or document design choices, hindering reproducibility and raising the question of how to construct a transparent, high-quality Thai corpus. We introduce Mangosteen: a 47 billion-token Thai corpus built through a Thai-adapted Dolma pipeline that includes custom rule-based language ID, revised C4/Gopher quality filters, and Thai-trained content filters, plus curated non-web sources such as Wikipedia, Royal Gazette texts, OCR-extracted books, and CC-licensed YouTube subtitles. Systematic ablations using GPT-2 show the pipeline trims CommonCrawl from 202M to 25M documents while raising SEA-HELM NLG from 3 to 11; an 8B-parameter SEA-LION model continually pre-trained on Mangosteen then surpasses SEA-LION-v3 and Llama-3.1 by about four points on Thai benchmarks. We release the full pipeline code, cleaning manifests, corpus snapshot, and all checkpoints, providing a fully reproducible foundation for future Thai and regional LLM research.
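
The custom rule-based language ID could be as simple as a Thai-script ratio test; a sketch under that assumption (the real pipeline's rules and threshold are not specified here):

import re

THAI = re.compile(r"[\u0E00-\u0E7F]")  # Thai Unicode block

def thai_ratio(text):
    letters = [c for c in text if c.isalpha()]
    if not letters:
        return 0.0
    return sum(1 for c in letters if THAI.match(c)) / len(letters)

def keep_document(text, threshold=0.5):
    # Keep pages whose alphabetic content is mostly Thai script.
    # The 0.5 threshold is a guess, not the pipeline's setting.
    return thai_ratio(text) >= threshold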

replace Supernova: Achieving More with Less in Transformer Architectures

Authors: Andrei-Valentin Tanase, Elena Pelican

Abstract: We present Supernova, a 650M-parameter decoder-only transformer that demonstrates how careful architectural design and tokenization innovation can achieve the performance of larger models while maintaining computational efficiency. Our architecture combines Rotary Positional Embeddings (RoPE), Grouped Query Attention (GQA) with a 3:1 compression ratio, RMSNorm for computational efficiency, and SwiGLU activation functions. A critical innovation is our custom 128,000-vocabulary byte-level BPE tokenizer, which achieves state-of-the-art compression performance. Through detailed analysis, we show that Supernova achieves 90% of the performance of 1B-parameter models while using 35% fewer parameters and requiring only 100B training tokens--an order of magnitude less than competing models. Our findings challenge the prevailing scaling paradigm, demonstrating that architectural efficiency and tokenization quality can compensate for reduced parameter counts.
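
The named components are standard and easy to sketch; below are minimal PyTorch versions of RMSNorm and SwiGLU, plus the head arithmetic a 3:1 GQA compression implies (sizes are illustrative, not the paper's exact configuration):

import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(dim))
        self.eps = eps
    def forward(self, x):
        # Normalize by the root-mean-square of the last dimension.
        rms = x.pow(2).mean(-1, keepdim=True).add(self.eps).rsqrt()
        return self.weight * x * rms

class SwiGLU(nn.Module):
    def __init__(self, dim, hidden):
        super().__init__()
        self.w1 = nn.Linear(dim, hidden, bias=False)  # gate projection
        self.w2 = nn.Linear(dim, hidden, bias=False)  # value projection
        self.w3 = nn.Linear(hidden, dim, bias=False)  # output projection
    def forward(self, x):
        return self.w3(nn.functional.silu(self.w1(x)) * self.w2(x))

# 3:1 grouped-query attention: three query heads share each key/value head.
n_heads, n_kv_heads = 24, 8  # illustrative sizes, ratio = 3
x = torch.randn(2, 16, 512)
print(SwiGLU(512, 1376)(RMSNorm(512)(x)).shape)  # torch.Size([2, 16, 512])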

replace-cross Risks of AI Scientists: Prioritizing Safeguarding Over Autonomy

Authors: Xiangru Tang, Qiao Jin, Kunlun Zhu, Tongxin Yuan, Yichi Zhang, Wangchunshu Zhou, Meng Qu, Yilun Zhao, Jian Tang, Zhuosheng Zhang, Arman Cohan, Zhiyong Lu, Mark Gerstein

Abstract: AI scientists powered by large language models have demonstrated substantial promise in autonomously conducting experiments and facilitating scientific discoveries across various disciplines. While their capabilities are promising, these agents also introduce novel vulnerabilities that require careful consideration for safety. However, there has been limited comprehensive exploration of these vulnerabilities. This perspective examines vulnerabilities in AI scientists, shedding light on potential risks associated with their misuse, and emphasizing the need for safety measures. We begin by providing an overview of the potential risks inherent to AI scientists, taking into account user intent, the specific scientific domain, and their potential impact on the external environment. Then, we explore the underlying causes of these vulnerabilities and provide a scoping review of the limited existing works. Based on our analysis, we propose a triadic framework involving human regulation, agent alignment, and an understanding of environmental feedback (agent regulation) to mitigate these identified risks. Furthermore, we highlight the limitations and challenges associated with safeguarding AI scientists and advocate for the development of improved models, robust benchmarks, and comprehensive regulations.

replace-cross Alto: Orchestrating Distributed Compound AI Systems with Nested Ancestry

Authors: Deepti Raghavan, Keshav Santhanam, Muhammad Shahir Rahman, Nayani Modugula, Luis Gaspar Schroeder, Maximilien Cura, Houjun Liu, Pratiksha Thaker, Philip Levis, Matei Zaharia

Abstract: Compound AI applications chain together subcomponents such as generative language models, document retrievers, and embedding models. Applying traditional systems optimizations such as parallelism and pipelining in compound AI systems is difficult because each component has different constraints in terms of the granularity and type of data that it ingests. New data is often generated during intermediate computations, and text streams may be split into smaller, independent fragments (such as documents into sentences) which may then be re-aggregated at later parts of the computation. Due to this complexity, existing systems to serve compound AI queries do not fully take advantage of parallelism and pipelining opportunities. We present Alto, a framework that automatically optimizes execution of compound AI queries through streaming and parallelism. Alto introduces a new abstraction called nested ancestry, a metadata hierarchy that allows the system to correctly track partial outputs and aggregate data across the heterogeneous constraints of the components of compound AI applications. This metadata is automatically inferred from the programming model, allowing developers to express complex dataflow patterns without needing to reason manually about the details of routing and aggregation. Implementations of four applications in Alto outperform or match implementations in LangGraph, a popular existing AI programming framework. Alto implementations match or improve latency by 10-30%.
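
One way to picture nested ancestry: every fragment carries the chain of (stage, index) pairs that produced it, and aggregation groups fragments by an ancestry prefix. A hedged sketch, not Alto's actual API:

from collections import defaultdict

def split(fragment, pieces, stage):
    # Splitting extends each piece's ancestry with a (stage, index) pair.
    return [{"data": p, "ancestry": fragment["ancestry"] + ((stage, i),)}
            for i, p in enumerate(pieces)]

def aggregate(fragments, depth):
    # Re-aggregation groups fragments that share an ancestry prefix.
    groups = defaultdict(list)
    for f in fragments:
        groups[f["ancestry"][:depth]].append(f["data"])
    return dict(groups)

doc = {"data": "a. b. c.", "ancestry": (("doc", 0),)}
sents = split(doc, doc["data"].split(". "), "sent")
print(aggregate(sents, depth=1))  # all sentences re-grouped under their source doc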

replace-cross Exploring How Generative MLLMs Perceive More Than CLIP with the Same Vision Encoder

Authors: Siting Li, Pang Wei Koh, Simon Shaolei Du

Abstract: Recent research has shown that CLIP models struggle with visual reasoning tasks that require grounding compositionality, understanding spatial relationships, or capturing fine-grained details. One natural hypothesis is that the CLIP vision encoder does not embed essential information for these tasks. However, we find that this is not always the case: The encoder gathers query-relevant visual information, while CLIP fails to extract it. In particular, we show that another branch of Vision-Language Models (VLMs), Generative Multimodal Large Language Models (MLLMs), achieve significantly higher accuracy than CLIP in many of these tasks using the same vision encoder and weights, indicating that these Generative MLLMs perceive more -- as they extract and utilize visual information more effectively. We conduct a series of controlled experiments and reveal that their success is attributed to multiple key design choices, including patch tokens, position embeddings, and prompt-based weighting. On the other hand, enhancing the training data alone or applying a stronger text encoder does not suffice to solve the task, and additional text tokens offer little benefit. Interestingly, we find that fine-grained visual reasoning is not exclusive to generative models trained by an autoregressive loss: When converted into CLIP-like encoders by contrastive finetuning, these MLLMs still outperform CLIP under the same cosine similarity-based evaluation protocol. Our study highlights the importance of VLM architectural choices and suggests directions for improving the performance of CLIP-like contrastive VLMs.

replace-cross Sparrow: Data-Efficient Video-LLM with Text-to-Image Augmentation

Authors: Shukang Yin, Chaoyou Fu, Sirui Zhao, Chunjiang Ge, Yan Yang, Yuhan Dai, Yongdong Luo, Tong Xu, Caifeng Shan, Enhong Chen

Abstract: Recent years have seen the success of Multimodal Large Language Models (MLLMs) in the domain of vision understanding. The success of these models can largely be attributed to the dominant scaling law, which states that larger parameter sizes and data volumes contribute to better performance. Notably, data scaling has been primarily driven by automatic data pipelines, which focus on the self-instruction of LLMs. The paradigm has been taken for granted for quite some time, but the study of the effectiveness of scaling with these data has been neglected for a long time. In this context, this work revisits scaling with synthetic data and focuses on developing video-LLMs from a data-centric perspective. Our primary study approach involves fine-tuning pre-trained image-LLMs with video data and examining learning efficiency through data scaling. Results from our preliminary experiments reveal a low learning efficiency phenomenon when simply scaling up video data samples, which, through our probing, can be ascribed to a lack of instruction diversity. Aiming at this issue, we propose a data augmentation method called Sparrow, which synthesizes video-like samples from pure text instruction data. Mixing these synthetic samples with the video data enables a more efficient training scheme. Through comprehensive experiments, we demonstrate that our proposed method achieves performance comparable to or even superior to that of baselines trained with significantly more samples. Meanwhile, we find that incorporating these synthetic samples can enhance the performance of long video understanding without requiring training on long video data. The code and data examples are available at https://github.com/VITA-MLLM/Sparrow.

URLs: https://github.com/VITA-MLLM/Sparrow.

replace-cross R-Bot: An LLM-based Query Rewrite System

Authors: Zhaoyan Sun, Xuanhe Zhou, Guoliang Li, Xiang Yu, Jianhua Feng, Yong Zhang

Abstract: Query rewrite is essential for optimizing SQL queries to improve their execution efficiency without changing their results. Traditionally, this task has been tackled through heuristic and learning-based methods, each with its limitations in terms of inferior quality and low robustness. Recent advancements in LLMs offer a new paradigm by leveraging their superior natural language and code comprehension abilities. Despite their potential, directly applying LLMs like GPT-4 has faced challenges due to problems such as hallucinations, where the model might generate inaccurate or irrelevant results. To address this, we propose R-Bot, an LLM-based query rewrite system with a systematic approach. We first design a multi-source rewrite evidence preparation pipeline to generate query rewrite evidences for guiding LLMs to avoid hallucinations. We then propose a hybrid structure-semantics retrieval method that combines structural and semantic analysis to retrieve the most relevant rewrite evidences for effectively answering an online query. We next propose a step-by-step LLM rewrite method that iteratively leverages the retrieved evidences to select and arrange rewrite rules with self-reflection. We conduct comprehensive experiments on real-world datasets and widely used benchmarks, and demonstrate the superior performance of our system, R-Bot, surpassing state-of-the-art query rewrite methods. R-Bot has been deployed at Huawei with real customers, and results show that it achieves lower query latency.
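
The hybrid structure-semantics retrieval can be pictured as a weighted mix of a structural overlap score and an embedding similarity; the Jaccard/cosine choice and the 50/50 weighting below are assumptions, not R-Bot's tuned formula:

import numpy as np

def hybrid_score(q_struct, e_struct, q_vec, e_vec, alpha=0.5):
    # Structural signal: Jaccard overlap of operators/tables in the SQL.
    jac = len(q_struct & e_struct) / max(len(q_struct | e_struct), 1)
    # Semantic signal: cosine similarity of query/evidence embeddings.
    cos = float(q_vec @ e_vec / (np.linalg.norm(q_vec) * np.linalg.norm(e_vec)))
    return alpha * jac + (1 - alpha) * cos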

replace-cross InternAgent: When Agent Becomes the Scientist -- Building Closed-Loop System from Hypothesis to Verification

Authors: InternAgent Team, Bo Zhang, Shiyang Feng, Xiangchao Yan, Jiakang Yuan, Runmin Ma, Yusong Hu, Zhiyin Yu, Xiaohan He, Songtao Huang, Shaowei Hou, Zheng Nie, Zhilong Wang, Jinyao Liu, Tianshuo Peng, Peng Ye, Dongzhan Zhou, Shufei Zhang, Xiaosong Wang, Yilan Zhang, Meng Li, Zhongying Tu, Xiangyu Yue, Wangli Ouyang, Bowen Zhou, Lei Bai

Abstract: Artificial Intelligence (AI) is accelerating the transformation of scientific research paradigms, not only enhancing research efficiency but also driving innovation. We introduce InternAgent, a unified closed-loop multi-agent framework to conduct Autonomous Scientific Research (ASR) across various scientific research fields, enabling researchers to tackle complicated problems in these fields with unprecedented speed and precision. InternAgent highlights three key advantages: 1) Scalability: InternAgent has demonstrated its versatility across 12 scientific research tasks, capable of generating innovative ideas to enhance the performance of baseline code. 2) Interactivity: InternAgent provides an interface for human expert feedback and multi-agent interaction in automated end-to-end processes, allowing for the seamless integration of domain expert knowledge. 3) Efficiency: InternAgent has achieved promising performance gains in several scientific fields with significantly less time cost compared to human efforts. For instance, in reaction yield prediction, it increased from 27.6% to 35.4% in just 12 hours; in enhancer activity prediction, accuracy rose from 0.65 to 0.79 with only 4 hours of processing; and in 2D semantic segmentation, precision advanced from 78.8% to 81.0% in a mere 30 hours.

replace-cross DCG-SQL: Enhancing In-Context Learning for Text-to-SQL with Deep Contextual Schema Link Graph

Authors: Jihyung Lee, Jin-Seop Lee, Jaehoon Lee, YunSeok Choi, Jee-Hyong Lee

Abstract: Text-to-SQL, which translates a natural language question into an SQL query, has advanced with in-context learning of Large Language Models (LLMs). However, existing methods show little improvement in performance compared to randomly chosen demonstrations, and significant performance drops when smaller LLMs (e.g., Llama 3.1-8B) are used. This indicates that these methods heavily rely on the intrinsic capabilities of hyper-scaled LLMs, rather than effectively retrieving useful demonstrations. In this paper, we propose a novel approach for effectively retrieving demonstrations and generating SQL queries. We construct a Deep Contextual Schema Link Graph, which captures key information and the semantic relationships between a question and its database schema items. This graph-based structure enables effective representation of Text-to-SQL samples and retrieval of useful demonstrations for in-context learning. Experimental results on the Spider benchmark demonstrate the effectiveness of our approach, showing consistent improvements in SQL generation performance and efficiency across both hyper-scaled LLMs and small LLMs. The code is available at https://github.com/jjklle/DCG-SQL.

URLs: https://github.com/jjklle/DCG-SQL.
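
An illustrative approximation of graph-based demonstration retrieval: each sample is reduced to its set of question-token-to-schema-item links, and demonstrations are ranked by link-set overlap. The paper's actual linking model is richer; the names here are hypothetical:

    def schema_links(question: str, schema_items: list[str]) -> set[tuple[str, str]]:
        # crude linking: a token links to a schema item if it appears in its name
        tokens = question.lower().split()
        return {(t, s) for t in tokens for s in schema_items if t in s.lower()}

    def retrieve_demos(question, schema_items, demo_bank, k=3):
        q_links = schema_links(question, schema_items)
        def overlap(demo):  # Jaccard similarity between link sets
            return len(q_links & demo["links"]) / max(1, len(q_links | demo["links"]))
        return sorted(demo_bank, key=overlap, reverse=True)[:k]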

replace-cross Human Empathy as Encoder: AI-Assisted Depression Assessment in Special Education

Authors: Boning Zhao

Abstract: Assessing student depression in sensitive environments like special education is challenging. Standardized questionnaires may not fully reflect students' true situations. Furthermore, automated methods often falter with rich student narratives, lacking the crucial, individualized insights stemming from teachers' empathetic connections with students. Existing methods often fail to address this ambiguity or effectively integrate educator understanding. To address these limitations by fostering a synergistic human-AI collaboration, this paper introduces Human Empathy as Encoder (HEAE), a novel, human-centered AI framework for transparent and socially responsible depression severity assessment. Our approach uniquely integrates student narrative text with a teacher-derived, 9-dimensional "Empathy Vector" (EV), whose dimensions are guided by the PHQ-9 framework, to explicitly translate tacit empathetic insight into a structured AI input that enhances rather than replaces human judgment. Rigorous experiments optimized the multimodal fusion, text representation, and classification architecture, achieving 82.74% accuracy for 7-level severity classification. This work demonstrates a path toward more responsible and ethical affective computing by structurally embedding human empathy.
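
A minimal late-fusion sketch, assuming a precomputed text embedding is concatenated with the 9-dimensional Empathy Vector ahead of a small 7-way classifier head; the dimensions and architecture are illustrative, not the paper's tuned design:

    import torch
    import torch.nn as nn

    class HEAEClassifier(nn.Module):
        def __init__(self, text_dim=768, ev_dim=9, n_classes=7):
            super().__init__()
            self.head = nn.Sequential(
                nn.Linear(text_dim + ev_dim, 128),
                nn.ReLU(),
                nn.Linear(128, n_classes),
            )

        def forward(self, text_emb, empathy_vec):
            fused = torch.cat([text_emb, empathy_vec], dim=-1)  # concatenate the two modalities
            return self.head(fused)

    logits = HEAEClassifier()(torch.randn(4, 768), torch.rand(4, 9))  # batch of 4 narratives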

replace-cross A Multi-granularity Concept Sparse Activation and Hierarchical Knowledge Graph Fusion Framework for Rare Disease Diagnosis

Authors: Mingda Zhang, Na Zhao, Jianglong Qin, Guoyu Ye, Ruixiang Tang

Abstract: Rare disease diagnosis remains challenging for medical large language models due to insufficient knowledge representation, limited concept understanding, and constrained clinical reasoning. We propose a framework combining multi-granularity sparse activation with hierarchical knowledge graphs. Our approach employs four complementary matching algorithms with diversity control and a five-level fallback strategy for precise concept activation. A three-layer knowledge graph (taxonomy, clinical features, instances) provides structured, up-to-date context. Experiments on the BioASQ rare disease dataset demonstrate significant improvements: BLEU scores increased by up to 0.13, ROUGE by up to 0.10, and diagnostic accuracy by up to 0.25, with the best model achieving 0.92 accuracy--surpassing the 0.90 clinical threshold. Expert evaluation confirms enhancements in information quality, reasoning, and professional expression. Our framework shows promise in reducing the diagnostic odyssey for rare disease patients.
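
An illustrative fallback cascade for concept activation: increasingly permissive matchers are tried until one fires. The five levels below (exact, synonym, prefix, substring, crude fuzzy) are assumptions standing in for the paper's strategy:

    def activate_concept(term, kg, matchers):
        # kg: {concept_name: [synonyms]}; try matchers from strict to permissive
        for level, match in enumerate(matchers, start=1):
            hits = match(term, kg)
            if hits:
                return level, hits  # stop at the first level that succeeds
        return 0, []

    matchers = [
        lambda t, kg: [c for c in kg if c.lower() == t.lower()],                                # 1: exact
        lambda t, kg: [c for c, syns in kg.items() if t.lower() in (s.lower() for s in syns)],  # 2: synonym
        lambda t, kg: [c for c in kg if c.lower().startswith(t.lower())],                       # 3: prefix
        lambda t, kg: [c for c in kg if t.lower() in c.lower()],                                # 4: substring
        lambda t, kg: [c for c in kg if len(set(t.lower()) & set(c.lower())) >= 0.8 * len(set(t.lower()))],  # 5: crude fuzzy
    ]

    kg = {"Fabry disease": ["alpha-galactosidase A deficiency"]}
    print(activate_concept("fabry", kg, matchers))  # -> (3, ['Fabry disease'])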

replace-cross Prompt4Trust: A Reinforcement Learning Prompt Augmentation Framework for Clinically-Aligned Confidence Calibration in Multimodal Large Language Models

Authors: Anita Kriz, Elizabeth Laura Janes, Xing Shen, Tal Arbel

Abstract: Multimodal large language models (MLLMs) hold considerable promise for applications in healthcare. However, their deployment in safety-critical settings is hindered by two key limitations: (i) sensitivity to prompt design, and (ii) a tendency to generate incorrect responses with high confidence. As clinicians may rely on a model's stated confidence to gauge the reliability of its predictions, it is especially important that when a model expresses high confidence, it is also highly accurate. We introduce Prompt4Trust, the first reinforcement learning (RL) framework for prompt augmentation targeting confidence calibration in MLLMs. A lightweight LLM is trained to produce context-aware auxiliary prompts that guide a downstream task MLLM to generate responses in which the expressed confidence more accurately reflects predictive accuracy. Unlike conventional calibration techniques, Prompt4Trust specifically prioritizes aspects of calibration most critical for safe and trustworthy clinical decision-making. Beyond improvements driven by this clinically motivated calibration objective, our proposed method also improves task accuracy, achieving state-of-the-art medical visual question answering (VQA) performance on the PMC-VQA benchmark, which is composed of multiple-choice questions spanning diverse medical imaging modalities. Moreover, our framework trained with a small downstream task MLLM showed promising zero-shot generalization to larger MLLMs in our experiments, suggesting the potential for scalable calibration without the associated computational costs. This work demonstrates the potential of automated yet human-aligned prompt engineering for improving the trustworthiness of MLLMs in safety-critical settings. Our codebase can be found at https://github.com/xingbpshen/prompt4trust.

URLs: https://github.com/xingbpshen/prompt4trust.
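
One way to make the calibration objective concrete (an assumption, not the paper's exact reward): assuming the downstream MLLM states a confidence in [0, 1] alongside its answer, score confident correct answers positively and penalize confident errors superlinearly:

    def calibration_reward(correct: bool, confidence: float) -> float:
        # confidence in [0, 1], stated by the downstream MLLM with its answer
        if correct:
            return confidence        # reward confident correct answers
        return -(confidence ** 2)    # penalize overconfident mistakes superlinearly

    print(calibration_reward(True, 0.9), calibration_reward(False, 0.9))  # 0.9 -0.81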

replace-cross Physical models realizing the transformer architecture of large language models

Authors: Zeqian Chen

Abstract: The introduction of the transformer architecture in 2017 marked the most striking advancement in natural language processing. The transformer is a model architecture relying entirely on an attention mechanism to draw global dependencies between input and output. However, we believe there is a gap in our theoretical understanding of what the transformer is and how it works physically. From a physical perspective on modern chips, such as those fabricated at process nodes below 28 nm, modern intelligent machines should be regarded as open quantum systems rather than conventional statistical systems. Accordingly, in this paper, we construct physical models realizing large language models based on a transformer architecture as open quantum systems in the Fock space over the Hilbert space of tokens. Our physical models underlie the transformer architecture for large language models.

replace-cross Routine: A Structural Planning Framework for LLM Agent System in Enterprise

Authors: Guancheng Zeng, Xueyi Chen, Jiawang Hu, Shaohua Qi, Yaxuan Mao, Zhantao Wang, Yifan Nie, Shuang Li, Qiuyang Feng, Pengxu Qiu, Yujia Wang, Wenqiang Han, Linyan Huang, Gang Li, Jingjing Mo, Haowen Hu

Abstract: The deployment of agent systems in an enterprise environment is often hindered by several challenges: common models lack domain-specific process knowledge, leading to disorganized plans, missing key tools, and poor execution stability. To address these challenges, this paper introduces Routine, a multi-step agent planning framework designed with a clear structure, explicit instructions, and seamless parameter passing to guide the agent's execution module in performing multi-step tool-calling tasks with high stability. In evaluations conducted within a real-world enterprise scenario, Routine significantly improves the execution accuracy of model tool calls, raising the performance of GPT-4o from 41.1% to 96.3% and that of Qwen3-14B from 32.6% to 83.3%. We further constructed a Routine-following training dataset and fine-tuned Qwen3-14B, resulting in an accuracy increase to 88.2% on scenario-specific evaluations, indicating improved adherence to execution plans. In addition, we employed Routine-based distillation to create a scenario-specific, multi-step tool-calling dataset. Fine-tuning on this distilled dataset raised the model's accuracy to 95.5%, approaching GPT-4o's performance. These results highlight Routine's effectiveness in distilling domain-specific tool-usage patterns and enhancing model adaptability to new scenarios. Our experimental results demonstrate that Routine provides a practical and accessible approach to building stable agent workflows, accelerating the deployment and adoption of agent systems in enterprise environments, and advancing the technical vision of AI for Process.
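
A minimal sketch of a Routine-style structured plan, assuming each step declares its tool, fixed arguments, and which earlier step outputs feed its inputs, so the executor never improvises call order; all field names are illustrative:

    from dataclasses import dataclass, field

    @dataclass
    class Step:
        name: str
        tool: str
        args: dict = field(default_factory=dict)
        inputs_from: dict = field(default_factory=dict)  # param name -> earlier step name

    def run_routine(steps, tools):
        results = {}
        for step in steps:
            kwargs = dict(step.args)
            for param, src in step.inputs_from.items():
                kwargs[param] = results[src]  # explicit parameter passing between steps
            results[step.name] = tools[step.tool](**kwargs)
        return results

    tools = {"search": lambda query: f"results for {query}",
             "summarize": lambda text: text.upper()}
    plan = [Step("s1", "search", args={"query": "refund policy"}),
            Step("s2", "summarize", inputs_from={"text": "s1"})]
    print(run_routine(plan, tools))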

replace-cross Hear Your Code Fail, Voice-Assisted Debugging for Python

Authors: Sayed Mahbub Hasan Amiri, Md. Mainul Islam, Mohammad Shakhawat Hossen, Sayed Majhab Hasan Amiri, Mohammad Shawkat Ali Mamun, Sk. Humaun Kabir, Naznin Akter

Abstract: This research introduces an innovative voice-assisted debugging plugin for Python that transforms silent runtime errors into actionable audible diagnostics. By implementing a global exception hook architecture with pyttsx3 text-to-speech conversion and Tkinter-based GUI visualization, the solution delivers multimodal error feedback through parallel auditory and visual channels. Empirical evaluation demonstrates 37% reduced cognitive load (p<0.01, n=50) compared to traditional stack-trace debugging, while enabling 78% faster error identification through vocalized exception classification and contextualization. The system achieves sub-1.2 second voice latency with under 18% CPU overhead during exception handling, vocalizing error types and consequences while displaying interactive tracebacks with documentation deep links. Compatibility is validated across Python 3.7+ environments on Windows, macOS, and Linux platforms. Needing only two lines of integration code, the plugin significantly boosts accessibility for visually impaired developers and supports multitasking workflows through hands-free error diagnosis. Educational applications show particular promise, with pilot studies indicating 45% faster debugging skill acquisition among novice programmers. Future development will incorporate GPT-based repair suggestions and real-time multilingual translation to further advance auditory debugging paradigms. The solution represents a fundamental shift toward human-centric error diagnostics, bridging critical gaps in programming accessibility while establishing new standards for cognitive efficiency in software development workflows.
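
A minimal sketch of the described mechanism using the standard sys.excepthook and the pyttsx3 API; the plugin's actual integration surface may differ:

    import sys
    import traceback
    import pyttsx3  # offline text-to-speech

    def speaking_excepthook(exc_type, exc_value, exc_tb):
        traceback.print_exception(exc_type, exc_value, exc_tb)  # keep the visual channel
        engine = pyttsx3.init()
        engine.say(f"{exc_type.__name__}: {exc_value}")         # vocalize type and message
        engine.runAndWait()

    sys.excepthook = speaking_excepthook  # global installation in one line

    1 / 0  # uncaught -> spoken "ZeroDivisionError: division by zero"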

replace-cross Hierarchical Budget Policy Optimization for Adaptive Reasoning

Authors: Shangke Lyu, Linjuan Wu, Yuchen Yan, Xingyu Wu, Hao Li, Yongliang Shen, Peisheng Jiang, Weiming Lu, Jun Xiao, Yueting Zhuang

Abstract: Large reasoning models achieve remarkable performance through extensive chain-of-thought generation, yet exhibit significant computational inefficiency by applying uniform reasoning strategies regardless of problem complexity. We present Hierarchical Budget Policy Optimization (HBPO), a reinforcement learning framework that enables models to learn problem-specific reasoning depths without sacrificing capability. HBPO addresses the fundamental challenge of exploration space collapse in efficiency-oriented training, where penalties on long output length systematically bias models away from necessary long reasoning paths. Through hierarchical budget exploration, our approach partitions rollout samples into multiple subgroups with distinct token budgets, aiming to enable efficient resource allocation while preventing degradation of capability. We introduce differentiated reward mechanisms that create budget-aware incentives aligned with the complexity of the problem, allowing models to discover natural correspondences between task requirements and computational effort. Extensive experiments demonstrate that HBPO reduces average token usage by up to 60.6% while improving accuracy by 3.14% across four reasoning benchmarks. Unlike existing methods that impose external constraints or rely on discrete mode selection, HBPO exhibits emergent adaptive behavior where models automatically adjust reasoning depth based on problem complexity. Our results suggest that reasoning efficiency and capability are not inherently conflicting, and can be simultaneously optimized through appropriately structured hierarchical training that preserves exploration diversity.
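
An illustrative budget-aware reward under stated assumptions (the coefficients and exact shaping in HBPO are not reproduced): correct answers within their subgroup's token budget earn a bonus, while over-budget ones decay:

    def budget_reward(correct: bool, tokens_used: int, budget: int) -> float:
        if not correct:
            return 0.0
        if tokens_used <= budget:
            return 1.0 + 0.5 * (1 - tokens_used / budget)        # bonus for staying under budget
        return max(0.0, 1.0 - (tokens_used - budget) / budget)   # decays past the budget

    budgets = [512, 1024, 2048, 4096]  # rollout subgroups with distinct token budgets
    print([budget_reward(True, 700, b) for b in budgets])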

replace-cross GUI-G$^2$: Gaussian Reward Modeling for GUI Grounding

Authors: Fei Tang, Zhangxuan Gu, Zhengxi Lu, Xuyang Liu, Shuheng Shen, Changhua Meng, Wen Wang, Wenqi Zhang, Yongliang Shen, Weiming Lu, Jun Xiao, Yueting Zhuang

Abstract: Graphical User Interface (GUI) grounding maps natural language instructions to precise interface locations for autonomous interaction. Current reinforcement learning approaches use binary rewards that treat elements as hit-or-miss targets, creating sparse signals that ignore the continuous nature of spatial interactions. Motivated by human clicking behavior that naturally forms Gaussian distributions centered on target elements, we introduce GUI Gaussian Grounding Rewards (GUI-G$^2$), a principled reward framework that models GUI elements as continuous Gaussian distributions across the interface plane. GUI-G$^2$ incorporates two synergistic mechanisms: Gaussian point rewards model precise localization through exponentially decaying distributions centered on element centroids, while coverage rewards assess spatial alignment by measuring the overlap between predicted Gaussian distributions and target regions. To handle diverse element scales, we develop an adaptive variance mechanism that calibrates reward distributions based on element dimensions. This framework transforms GUI grounding from sparse binary classification to dense continuous optimization, where Gaussian distributions generate rich gradient signals that guide models toward optimal interaction positions. Extensive experiments across the ScreenSpot, ScreenSpot-v2, and ScreenSpot-Pro benchmarks demonstrate that GUI-G$^2$ substantially outperforms the state-of-the-art method UI-TARS-72B, with the most significant improvement of 24.7% on ScreenSpot-Pro. Our analysis reveals that continuous modeling provides superior robustness to interface variations and enhanced generalization to unseen layouts, establishing a new paradigm for spatial reasoning in GUI interaction tasks.
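
A sketch of the two reward components under stated assumptions: the point reward decays exponentially with normalized distance from the element centroid (standard deviations scaled to element size), and the coverage reward is the predicted Gaussian's probability mass inside the target box, computed per axis via the error function. Exact constants in GUI-G$^2$ may differ:

    import math

    def point_reward(pred, center, width, height):
        sx, sy = width / 2, height / 2  # size-adaptive standard deviations
        dx, dy = pred[0] - center[0], pred[1] - center[1]
        return math.exp(-0.5 * ((dx / sx) ** 2 + (dy / sy) ** 2))

    def axis_mass(mu, sigma, lo, hi):
        # probability mass of N(mu, sigma^2) inside [lo, hi]
        cdf = lambda x: 0.5 * (1 + math.erf((x - mu) / (sigma * math.sqrt(2))))
        return cdf(hi) - cdf(lo)

    def coverage_reward(pred, box):  # box = (x0, y0, x1, y1)
        w, h = box[2] - box[0], box[3] - box[1]
        return axis_mass(pred[0], w / 2, box[0], box[2]) * axis_mass(pred[1], h / 2, box[1], box[3])

    print(point_reward((50, 50), (48, 52), 40, 20), coverage_reward((50, 50), (30, 40, 70, 60)))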