Authors: Jaros{\l}aw A. Chudziak, Adam Kostka
Abstract: The growing ubiquity of artificial intelligence (AI), in particular large language models (LLMs), has profoundly altered the way in which learners gain knowledge and interact with learning material, with many claiming that AI positively influences their learning achievements. Despite this advancement, current AI tutoring systems face limitations associated with their reactive nature, often providing direct answers without encouraging deep reflection or incorporating structured pedagogical tools and strategies. This limitation is most apparent in the field of mathematics, in which AI tutoring systems remain underdeveloped. This research addresses the question: How can AI tutoring systems move beyond providing reactive assistance to enable structured, individualized, and tool-assisted learning experiences? We introduce a novel multi-agent AI tutoring platform that combines adaptive and personalized feedback, structured course generation, and textbook knowledge retrieval to enable modular, tool-assisted learning processes. This system allows students to learn new topics while identifying and targeting their weaknesses, revise for exams effectively, and practice on an unlimited number of personalized exercises. This article contributes to the field of artificial intelligence in education by introducing a novel platform that brings together pedagogical agents and AI-driven components, augmenting the field with modular and effective systems for teaching mathematics.
Authors: Dustin Holley, Jovin D'sa, Hossein Nourkhiz Mahjoub, Gibran Ali
Abstract: Enhancing simulation environments to replicate real-world driver behavior, i.e., more humanlike sim agents, is essential for developing autonomous vehicle technology. In the context of highway merging, previous works have studied the operational-level yielding dynamics of lag vehicles in response to a merging car at highway on-ramps. Other works focusing on tactical decision modeling generally consider limited action sets or utilize payoff functions with large parameter sets and limited payoff bounds. In this work, we aim to improve the simulation of the highway merge scenario by developing a game-theoretic model for tactical decision-making with improved payoff functions and lag actions. We couple this with an underlying dynamics model to obtain a unified decision and dynamics model that can capture merging interactions and simulate more realistic interactions in an explainable and interpretable fashion. The proposed model demonstrated good reproducibility of complex interactions when validated on a real-world dataset. The model was finally integrated into a high-fidelity simulation environment and confirmed to have adequate computational efficiency for use in large-scale simulations to support autonomous vehicle development.
Authors: L\'eo Sauli\`eres
Abstract: The success of recent Artificial Intelligence (AI) models has been accompanied by the opacity of their internal mechanisms, due notably to the use of deep neural networks. In order to understand these internal mechanisms and explain the output of these AI models, a set of methods has been proposed, grouped under the domain of eXplainable AI (XAI). This paper focuses on a sub-domain of XAI, called eXplainable Reinforcement Learning (XRL), which aims to explain the actions of an agent that has learned by reinforcement learning. We propose an intuitive taxonomy based on two questions: "What" and "How". The first question focuses on the target that the method explains, while the second relates to the way the explanation is provided. We use this taxonomy to provide a state-of-the-art review of over 250 papers. In addition, we present a set of domains close to XRL which we believe deserve attention from the community. Finally, we identify open needs for the field of XRL.
Authors: Alex Zook, Josef Spjut, Jonathan Tremblay
Abstract: Game design hinges on understanding how static rules and content translate into dynamic player behavior - something modern generative systems that inspect only a game's code or assets struggle to capture. We present an automated design iteration framework that closes this gap by pairing a reinforcement learning (RL) agent, which playtests the game, with a large multimodal model (LMM), which revises the game based on what the agent does. In each loop the RL player completes several episodes, producing (i) numerical play metrics and/or (ii) a compact image strip summarising recent video frames. The LMM designer receives a gameplay goal and the current game configuration, analyses the play traces, and edits the configuration to steer future behaviour toward the goal. We demonstrate that LMMs can reason over behavioral traces supplied by RL agents to iteratively refine game mechanics, pointing toward practical, scalable tools for AI-assisted game design.
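To make the loop concrete, here is a minimal sketch of the playtest-and-revise cycle described above; the three helpers are hypothetical stand-ins for the RL playtester, the frame summariser, and the LMM designer, not the authors' actual interfaces.

```python
import random

def run_episodes(config, n):
    # RL agent playtests the game, returning (i) play metrics and (ii) frames
    return {"win_rate": random.random()}, ["frame"] * n

def summarize_frames(frames):
    # Compact image strip summarising recent video frames
    return frames[-4:]

def lmm_edit_config(goal, config, metrics, strip):
    # LMM designer analyses the play traces and edits the game configuration
    return {**config, "difficulty": config["difficulty"] + 0.1}

def design_loop(config, goal, n_iters=10):
    for _ in range(n_iters):
        metrics, frames = run_episodes(config, n=5)
        strip = summarize_frames(frames)
        config = lmm_edit_config(goal, config, metrics, strip)
    return config

print(design_loop({"difficulty": 1.0}, goal="longer play sessions"))
```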
Authors: Avi Parrack, Carlo Leonardo Attubato, Stefan Heimersheim
Abstract: AI assistants will occasionally respond deceptively to user queries. Recently, linear classifiers (called "deception probes") have been trained to distinguish the internal activations of a language model during deceptive versus honest responses. However, it is unclear how effective these probes are at detecting deception in practice, or whether they are resistant to simple counter-strategies from a deceptive assistant who wishes to evade detection. In this paper, we compare white-box monitoring (where the monitor has access to token-level probe activations) to black-box monitoring (without such access). We benchmark deception probes by the extent to which the white-box monitor outperforms the black-box monitor, i.e. the black-to-white performance boost. We find weak but encouraging black-to-white performance boosts from existing deception probes.
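For readers unfamiliar with such probes, the following is a schematic of the general technique: a linear classifier trained on model activations. The activations here are synthetic placeholders; in practice they are extracted from the monitored model's hidden layers.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
d = 512                                      # hidden dimension (assumed)
honest    = rng.normal(0.0, 1.0, (1000, d))  # activations on honest responses
deceptive = rng.normal(0.2, 1.0, (1000, d))  # shifted along some direction

X = np.vstack([honest, deceptive])
y = np.array([0] * 1000 + [1] * 1000)
probe = LogisticRegression(max_iter=1000).fit(X, y)

# Token-level probe scores like these are what a white-box monitor sees
print(probe.predict_proba(deceptive[:5])[:, 1])
```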
Authors: Sosui Moribe, Taketoshi Ushiama
Abstract: In recent years, peer learning has gained attention as a method that promotes spontaneous thinking among learners, and its effectiveness has been confirmed by numerous studies. This study aims to develop an AI Agent as a learning companion that enables peer learning anytime and anywhere. However, peer learning between humans has various limitations, and it is not always effective: effective peer learning requires companions at the same proficiency level. In this study, we assume that peers at the same proficiency level as the learner make the same mistakes the learner does, and we focus on English composition as a specific example to validate this approach.
Authors: Zhiwei Liu, Jielin Qiu, Shiyu Wang, Jianguo Zhang, Zuxin Liu, Roshan Ram, Haolin Chen, Weiran Yao, Huan Wang, Shelby Heinecke, Silvio Savarese, Caiming Xiong
Abstract: The rapid rise of Large Language Model (LLM)-based intelligent agents underscores the need for robust, scalable evaluation frameworks. Existing methods rely on static benchmarks and labor-intensive data collection, limiting practical assessment. We introduce MCPEval, an open-source Model Context Protocol (MCP)-based framework that automates end-to-end task generation and deep evaluation of LLM agents across diverse domains. MCPEval standardizes metrics, seamlessly integrates with native agent tools, and eliminates manual effort in building evaluation pipelines. Empirical results across five real-world domains show its effectiveness in revealing nuanced, domain-specific performance. We publicly release MCPEval at https://github.com/SalesforceAIResearch/MCPEval to promote reproducible and standardized LLM agent evaluation.
Authors: Shiquan Wang, Ruiyu Fang, Zhongjiang He, Shuangyong Song, Yongxiang Li
Abstract: Emotional Support Conversation (ESC) aims to provide empathetic and effective emotional assistance through dialogue, addressing the growing demand for mental health support. This paper presents our solution for the NLPCC 2025 Task 8 ESC evaluation, where we leverage large-scale language models enhanced by prompt engineering and fine-tuning techniques. We explore both parameter-efficient Low-Rank Adaptation and full-parameter fine-tuning strategies to improve the model's ability to generate supportive and contextually appropriate responses. Our best model ranked second in the competition, highlighting the potential of combining LLMs with effective adaptation methods for ESC tasks. Future work will focus on further enhancing emotional understanding and response personalization to build more practical and reliable emotional support systems.
Authors: Lance Ying, Katherine M. Collins, Prafull Sharma, Cedric Colas, Kaiya Ivy Zhao, Adrian Weller, Zenna Tavares, Phillip Isola, Samuel J. Gershman, Jacob D. Andreas, Thomas L. Griffiths, Francois Chollet, Kelsey R. Allen, Joshua B. Tenenbaum
Abstract: Human intelligence exhibits a remarkable capacity for rapid adaptation and effective problem-solving in novel and unfamiliar contexts. We argue that this profound adaptability is fundamentally linked to the efficient construction and refinement of internal representations of the environment, commonly referred to as world models, and we refer to this adaptation mechanism as world model induction. However, current understanding and evaluation of world models in artificial intelligence (AI) remain narrow, often focusing on static representations learned from training on massive corpora of data, instead of the efficiency and efficacy of models in learning these representations through interaction and exploration within a novel environment. In this Perspective, we provide a view of world model induction drawing on decades of research in cognitive science on how humans learn and adapt so efficiently; we then call for a new evaluation framework for assessing adaptive world models in AI. Concretely, we propose a new benchmarking paradigm based on suites of carefully designed games with genuine, deep and continually refreshing novelty in the underlying game structures -- we refer to such games as novel games. We detail key desiderata for constructing these games and propose appropriate metrics to explicitly challenge and evaluate the agent's ability for rapid world model induction. We hope that this new evaluation framework will inspire future evaluation efforts on world models in AI and provide a crucial step towards developing AI systems capable of human-like rapid adaptation and robust generalization -- a critical component of artificial general intelligence.
Authors: Hussein Abbass, Taylan Akay, Harrison Tolley
Abstract: In the age of AI, human commanders need to use the computational power available in today's environment to simulate a very large number of scenarios. Within each scenario, situations occur where different decision design options could have ethical consequences. Making these decisions reliant on human judgement is both counter-productive to the aim of exploring a very large number of scenarios in a timely manner and infeasible when considering the workload needed to involve humans in each of these choices. In this paper, we move human judgement outside the simulation decision cycle. Basically, the human will design the ethical metric space, leaving it to the simulated environment to explore the space. When the simulation completes its testing cycles, the testing environment will come back to the human commander with a few options to select from. The human commander will then exercise human judgement to select the most appropriate course of action, which will then get executed accordingly. We assume that the problem of designing metrics that are sufficiently granular to assess the ethical implications of decisions is solved. Subsequently, the fundamental problem we look at in this paper is how to weight ethical decisions during the running of these simulations; that is, how to dynamically weight the ethical attributes when agents are faced with decision options with ethical implications during generative simulations. The multi-criteria decision-making literature has started to look at nearby problems, where the concept of entropy has been used to determine the weights during aggregation. We draw from that literature different approaches to automatically calculate the weights for ethical attributes during simulation-based testing and evaluation.
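For concreteness, the entropy-based weighting alluded to above is typically computed as follows in the multi-criteria decision-making literature: attributes whose scores vary more across the available options (i.e., have lower entropy) receive higher weight. This is a minimal sketch of that standard method, with an invented score matrix.

```python
import numpy as np

def entropy_weights(scores):
    """scores: (n_options, n_attributes) matrix of non-negative scores."""
    p = scores / scores.sum(axis=0)                  # normalise per attribute
    n = scores.shape[0]
    entropy = -(p * np.log(p + 1e-12)).sum(axis=0) / np.log(n)
    divergence = 1.0 - entropy                       # degree of divergence
    return divergence / divergence.sum()

options = np.array([[0.9, 0.4, 0.5],   # each row: one course of action
                    [0.8, 0.9, 0.5],   # each column: one ethical attribute
                    [0.1, 0.5, 0.5]])
print(entropy_weights(options))        # the constant attribute gets ~zero weight
```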
Authors: Rishane Dassanayake, Mario Demetroudi, James Walpole, Lindley Lentati, Jason R. Brown, Edward James Young
Abstract: Frontier AI systems are rapidly advancing in their capabilities to persuade, deceive, and influence human behaviour, with current models already demonstrating human-level persuasion and strategic deception in specific contexts. Humans are often the weakest link in cybersecurity systems, and a misaligned AI system deployed internally within a frontier company may seek to undermine human oversight by manipulating employees. Despite this growing threat, manipulation attacks have received little attention, and no systematic framework exists for assessing and mitigating these risks. To address this, we provide a detailed explanation of why manipulation attacks are a significant threat and could lead to catastrophic outcomes. Additionally, we present a safety case framework for manipulation risk, structured around three core lines of argument: inability, control, and trustworthiness. For each argument, we specify evidence requirements, evaluation methodologies, and implementation considerations for direct application by AI companies. This paper provides the first systematic methodology for integrating manipulation risk into AI safety governance, offering AI companies a concrete foundation to assess and mitigate these threats before deployment.
Authors: Jian Yao, Ran Cheng, Kay Chen Tan
Abstract: Recent advances in reinforcement learning (RL) have led to substantial improvements in the mathematical reasoning abilities of large language models (LLMs), as measured by standard benchmarks. However, these gains often persist even when models are trained with flawed signals, such as random or inverted rewards, raising a fundamental question: do such improvements reflect true reasoning, or are they merely artifacts of overfitting to benchmark-specific patterns? To address this question, we take an evaluation-centric perspective and identify two critical shortcomings in existing protocols. First, \emph{benchmark contamination} arises from the public availability of test problems, increasing the risk of data leakage. Second, \emph{evaluation fragility} stems from the reliance on single-instance assessments, which are highly sensitive to stochastic outputs and fail to capture reasoning consistency. To overcome these limitations, we introduce {VAR-MATH}, a symbolic evaluation framework designed to probe genuine reasoning ability. By converting fixed numerical problems into symbolic templates and requiring models to solve multiple instantiations of each, VAR-MATH enforces consistent reasoning across structurally equivalent variants, thereby mitigating contamination and improving evaluation robustness. We apply VAR-MATH to transform two popular benchmarks, AMC23 and AIME24, into their symbolic counterparts, VAR-AMC23 and VAR-AIME24. Experimental results reveal substantial performance drops for RL-trained models on the variabilized versions, especially for smaller models, with average declines of 48.0\% on AMC23 and 58.3\% on AIME24. These findings suggest that many existing RL methods rely on superficial heuristics and fail to generalize beyond specific numerical forms. Overall, VAR-MATH offers a principled, contamination-resistant evaluation paradigm for mathematical reasoning.
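To illustrate the variabilization idea, here is a toy symbolic template in the spirit of VAR-MATH: a fixed problem is parameterised, and a model is credited only if it answers several random instantiations consistently. The template and grading are simplified stand-ins, not the authors' pipeline.

```python
import random
import re

def make_instance():
    a, b = random.randint(2, 20), random.randint(2, 20)
    question = f"What is the sum of the first {a} positive multiples of {b}?"
    answer = b * a * (a + 1) // 2
    return question, answer

def consistent_score(model, n_variants=3):
    # Credit the model only if every instantiation is answered correctly
    return all(model(q) == ans
               for q, ans in (make_instance() for _ in range(n_variants)))

# An "oracle" that actually does the math passes every variant
oracle = lambda q: (lambda a, b: b * a * (a + 1) // 2)(
    *map(int, re.findall(r"\d+", q)))
print(consistent_score(oracle))   # True
```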
Authors: Lyris Xu, Fabio Aurelio D'Asaro, Luke Dickens
Abstract: Probabilistic Event Calculus (PEC) is a logical framework for reasoning about actions and their effects in uncertain environments, which enables the representation of probabilistic narratives and computation of temporal projections. The PEC formalism offers significant advantages in interpretability and expressiveness for narrative reasoning. However, it lacks mechanisms for goal-directed reasoning. This paper bridges this gap by developing a formal translation of PEC domains into Markov Decision Processes (MDPs), introducing the concept of "action-taking situations" to preserve PEC's flexible action semantics. The resulting PEC-MDP formalism enables the extensive collection of algorithms and theoretical tools developed for MDPs to be applied to PEC's interpretable narrative domains. We demonstrate how the translation supports both temporal reasoning tasks and objective-driven planning, with methods for mapping learned policies back into human-readable PEC representations, maintaining interpretability while extending PEC's capabilities.
Authors: Roger Xavier Lera-Leri, Filippo Bistaffa, Athina Georgara, Juan Antonio Rodriguez-Aguilar
Abstract: Following the recent push for trustworthy AI, there has been an increasing interest in developing contrastive explanation techniques for optimisation, especially concerning the solution of specific decision-making processes formalised as Mixed-Integer Linear Programs (MILPs). Along these lines, we propose X-MILP, a domain-agnostic approach for building contrastive explanations for MILPs based on constraint reasoning techniques. First, we show how to encode the queries a user makes about the solution of an MILP problem as additional constraints. Then, we determine the reasons that constitute the answer to the user's query by computing the Irreducible Infeasible Subsystem (IIS) of the newly obtained set of constraints. Finally, we represent our explanation as a "graph of reasons" constructed from the IIS, which helps the user understand the structure among the reasons that answer their query. We test our method on instances of well-known optimisation problems to evaluate the empirical hardness of computing explanations.
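As a concrete illustration of the IIS mechanism, the sketch below encodes a toy user query ("why is x not at least 4?") as an extra constraint and asks Gurobi for an irreducible infeasible subsystem; the model is invented, and other solvers offer similar IIS facilities.

```python
import gurobipy as gp

m = gp.Model("toy")
x = m.addVar(vtype=gp.GRB.INTEGER, name="x")
y = m.addVar(vtype=gp.GRB.INTEGER, name="y")
m.addConstr(x + y <= 5, name="capacity")
m.addConstr(y >= 2, name="demand")
m.setObjective(x, gp.GRB.MAXIMIZE)
m.optimize()                            # optimum: x = 3

# User query "why is x not at least 4?" encoded as an additional constraint
m.addConstr(x >= 4, name="user_query")
m.computeIIS()                          # now infeasible; extract the reasons
for c in m.getConstrs():
    if c.IISConstr:
        print("reason:", c.ConstrName)  # capacity, demand, user_query
```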
Authors: Junseong Lee, Jaegwan Cho, Yoonju Cho, Seoyoon Choi, Yejin Shin
Abstract: The study "Prediction of Highway Traffic Flow Based on Artificial Intelligence Algorithms Using California Traffic Data" presents a machine learning-based traffic flow prediction model to address global traffic congestion issues. The research utilized 30-second interval traffic data from California Highway 78 over a five-month period from July to November 2022, analyzing a 7.24 km westbound section connecting "Melrose Dr" and "El-Camino Real" in the San Diego area. The study employed Multiple Linear Regression (MLR) and Random Forest (RF) algorithms, analyzing data collection intervals ranging from 30 seconds to 15 minutes. Using R^2, MAE, and RMSE as performance metrics, the analysis revealed that both MLR and RF models performed optimally with 10-minute data collection intervals. These findings are expected to contribute to future traffic congestion solutions and efficient traffic management.
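The three reported metrics are standard; for reference, this is how they are computed with scikit-learn, on placeholder values rather than the study's data.

```python
import numpy as np
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error

y_true = np.array([520.0, 480.0, 610.0, 555.0])   # illustrative flow counts
y_pred = np.array([505.0, 490.0, 600.0, 570.0])

print("R^2 :", r2_score(y_true, y_pred))
print("MAE :", mean_absolute_error(y_true, y_pred))
print("RMSE:", np.sqrt(mean_squared_error(y_true, y_pred)))
```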
Authors: Ahmed Bahloul, Simon Malberg
Abstract: Modern language models address complex questions through chain-of-thought (CoT) reasoning (Wei et al., 2023) and retrieval augmentation (Lewis et al., 2021), yet struggle with error propagation and knowledge integration. Tree-structured reasoning methods, particularly the Probabilistic Tree-of-Thought (ProbTree) (Cao et al., 2023) framework, mitigate these issues by decomposing questions into hierarchical structures and selecting answers through confidence-weighted aggregation of parametric and retrieved knowledge (Yao et al., 2023). However, ProbTree's static implementation introduces two key limitations: (1) the reasoning tree is fixed during the initial construction phase, preventing dynamic adaptation to intermediate results, and (2) each node requires exhaustive evaluation of all possible solution strategies, creating computational inefficiency. We present a dynamic reinforcement learning (Sutton and Barto, 2018) framework that transforms tree-based reasoning into an adaptive process. Our approach incrementally constructs the reasoning tree based on real-time confidence estimates, while learning optimal policies for action selection (decomposition, retrieval, or aggregation). This maintains ProbTree's probabilistic rigor while improving both solution quality and computational efficiency through selective expansion and focused resource allocation. The work establishes a new paradigm for tree-structured reasoning that balances the reliability of probabilistic frameworks with the flexibility required for real-world question answering systems.
Authors: Matthew E. Brophy
Abstract: The advancement of powerful yet opaque large language models (LLMs) necessitates a fundamental revision of the philosophical criteria used to evaluate artificial moral agents (AMAs). Pre-LLM frameworks often relied on the assumption of transparent architectures, which LLMs defy due to their stochastic outputs and opaque internal states. This paper argues that traditional ethical criteria are pragmatically obsolete for LLMs due to this mismatch. Engaging with core themes in the philosophy of technology, this paper proffers a revised set of ten functional criteria to evaluate LLM-based artificial moral agents: moral concordance, context sensitivity, normative integrity, metaethical awareness, system resilience, trustworthiness, corrigibility, partial transparency, functional autonomy, and moral imagination. These guideposts, applied to what we term "SMA-LLS" (Simulating Moral Agency through Large Language Systems), aim to steer AMAs toward greater alignment and beneficial societal integration in the coming years. We illustrate these criteria using hypothetical scenarios involving an autonomous public bus (APB) to demonstrate their practical applicability in morally salient contexts.
Authors: Besik Dundua, Temur Kutsia
Abstract: The combination of higher-order theories and fuzzy logic can be useful in decision-making tasks that involve reasoning across abstract functions and predicates, where exact matches are often rare or unnecessary. Developing efficient reasoning and computational techniques for such a combined formalism presents a significant challenge. In this paper, we adopt a more straightforward approach, aiming to integrate two well-established and computationally well-behaved components: higher-order patterns on one side and fuzzy equivalences expressed through similarity relations based on the minimum T-norm on the other. We propose a unification algorithm for higher-order patterns modulo these similarity relations and prove its termination, soundness, and completeness. This unification problem, like its crisp counterpart, is unitary. The algorithm computes a most general unifier with the highest degree of approximation when the given terms are unifiable.
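To give a feel for the minimum T-norm, the toy sketch below computes the degree to which two flat terms match: symbol similarities are combined with min, so the overall degree is the weakest link. The similarity relation and terms are invented for illustration and do not reflect the paper's full unification algorithm.

```python
SIM = {("f", "g"): 0.8, ("a", "b"): 0.6}   # symmetric similarity relation

def sim(s, t):
    if s == t:
        return 1.0
    return SIM.get((s, t), SIM.get((t, s), 0.0))

def match_degree(term1, term2):
    # Terms are (head, args) pairs; degrees combine via min (the T-norm)
    (h1, args1), (h2, args2) = term1, term2
    if len(args1) != len(args2):
        return 0.0
    degree = sim(h1, h2)
    for u, v in zip(args1, args2):
        degree = min(degree, sim(u, v))
    return degree

print(match_degree(("f", ["a"]), ("g", ["b"])))   # min(0.8, 0.6) = 0.6
```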
Authors: Carlos Arriaga, Gonzalo Mart\'inez, Eneko Sendin, Javier Conde, Pedro Reviriego
Abstract: The evaluation of large language models is a complex task, for which several approaches have been proposed. The most common is the use of automated benchmarks in which LLMs have to answer multiple-choice questions on different topics. However, this method has certain limitations, the most concerning being its poor correlation with human judgments. An alternative approach is to have humans evaluate the LLMs. This poses scalability issues, as the large and growing number of models to evaluate makes it impractical (and costly) to run traditional studies based on recruiting evaluators and having them rank the responses of the models. Another option is the use of public arenas, such as the popular LM Arena, in which any user can freely evaluate models on any question and rank the responses of two models. The results are then aggregated into a model ranking. An increasingly important aspect of LLMs is their energy consumption and, therefore, evaluating how energy awareness influences the decisions of humans in selecting a model is of interest. In this paper, we present GEA, the Generative Energy Arena, an arena that incorporates information on the energy consumption of the model in the evaluation process. Preliminary results obtained with GEA are also presented, showing that for most questions, when users are aware of the energy consumption, they favor smaller and more energy-efficient models. This suggests that for most user interactions, the extra cost and energy incurred by the more complex and top-performing models do not provide an increase in the perceived quality of the responses that justifies their use.
Authors: Gal Beniamini, Yuval Dor, Alon Vinnikov, Shir Granot Peled, Or Weinstein, Or Sharir, Noam Wies, Tomer Nussbaum, Ido Ben Shaul, Tomer Zekharya, Yoav Levine, Shai Shalev-Shwartz, Amnon Shashua
Abstract: Frontier AI models demonstrate formidable breadth of knowledge. But how close are they to true human -- or superhuman -- expertise? Genuine experts can tackle the hardest problems and push the boundaries of scientific understanding. To illuminate the limits of frontier model capabilities, we turn away from contrived competitive programming puzzles, and instead focus on real-life research problems. We construct FormulaOne, a benchmark that lies at the intersection of graph theory, logic, and algorithms, all well within the training distribution of frontier models. Our problems are incredibly demanding, requiring an array of reasoning steps. The dataset has three key properties. First, it is of commercial interest and relates to practical large-scale optimisation problems, such as those arising in routing, scheduling, and network design. Second, it is generated from the highly expressive framework of Monadic Second-Order (MSO) logic on graphs, paving the way toward automatic problem generation at scale; ideal for building RL environments. Third, many of our problems are intimately related to the frontier of theoretical computer science, and to central conjectures therein, such as the Strong Exponential Time Hypothesis (SETH). As such, any significant algorithmic progress on our dataset, beyond known results, could carry profound theoretical implications. Remarkably, state-of-the-art models like OpenAI's o3 fail entirely on FormulaOne, solving less than 1% of the questions, even when given 10 attempts and explanatory few-shot examples -- highlighting how far they remain from expert-level understanding in some domains. To support further research, we additionally curate FormulaOne-Warmup, offering a set of simpler tasks, from the same distribution. We release the full corpus along with a comprehensive evaluation framework.
Authors: Zachary James, Joseph Guinness
Abstract: Gaussian Processes have become an indispensable part of the spatial statistician's toolbox but are unsuitable for analyzing large datasets because of the significant time and memory needed to fit the associated model exactly. Vecchia Approximation is widely used to reduce the computational complexity and can be calculated with embarrassingly parallel algorithms. While multi-core software has been developed for Vecchia Approximation, such as the GpGp R package, software designed to run on graphics processing units (GPUs) is lacking, despite the tremendous success GPUs have had in statistics and machine learning. We compare three different ways to implement Vecchia Approximation on a GPU: two of which are similar to methods used for other Gaussian Process approximations and one that is new. The impact of memory type on performance is investigated and the final method is optimized accordingly. We show that our new method outperforms the other two and then present it in the GpGpU R package. We compare GpGpU to existing multi-core and GPU-accelerated software by fitting Gaussian Process models on various datasets, including a large spatial-temporal dataset of $n>10^6$ points collected from an earth-observing satellite. Our results show that GpGpU achieves faster runtimes and better predictive accuracy.
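For readers new to the method, this is a minimal NumPy sketch of the Vecchia likelihood itself (not the GPU implementation): each observation is conditioned on a few previously ordered neighbours, so the joint log-likelihood becomes an embarrassingly parallel sum of small kriging problems. The exponential covariance and its parameters are illustrative.

```python
import numpy as np

def expcov(a, b, range_=0.3, var=1.0):
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)
    return var * np.exp(-d / range_)

def vecchia_loglik(y, locs, m=10):
    ll = 0.0
    for i in range(len(y)):
        prev = np.arange(i)                       # previously ordered points
        if len(prev):
            dists = np.linalg.norm(locs[prev] - locs[i], axis=1)
            nbr = prev[np.argsort(dists)[:m]]     # up to m nearest neighbours
            K = expcov(locs[nbr], locs[nbr]) + 1e-8 * np.eye(len(nbr))
            k = expcov(locs[[i]], locs[nbr])[0]
            w = np.linalg.solve(K, k)             # kriging weights
            mu = w @ y[nbr]
            var = expcov(locs[[i]], locs[[i]])[0, 0] - w @ k
        else:
            mu, var = 0.0, expcov(locs[[i]], locs[[i]])[0, 0]
        ll += -0.5 * (np.log(2 * np.pi * var) + (y[i] - mu) ** 2 / var)
    return ll

rng = np.random.default_rng(1)
locs = rng.uniform(size=(200, 2))
print(vecchia_loglik(rng.normal(size=200), locs))   # placeholder data
```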
Authors: Takashi Izumo
Abstract: The St. Petersburg paradox presents a longstanding challenge in decision theory. It describes a game whose expected value is infinite, yet for which no rational finite stake can be determined. Traditional solutions introduce auxiliary assumptions, such as diminishing marginal utility, temporal discounting, or extended number systems. These methods often involve mathematical refinements that may not correspond to how people actually perceive or process numerical information. This paper explores an alternative approach based on a modified operation of addition defined over coarse partitions of the outcome space. In this model, exact numerical values are grouped into perceptual categories, and each value is replaced by a representative element of its group before being added. This method allows for a phenomenon where repeated additions eventually cease to affect the outcome, a behavior described as inertial stabilization. Although this is not intended as a definitive resolution of the paradox, the proposed framework offers a plausible way to represent how agents with limited cognitive precision might handle divergent reward structures. We demonstrate that the St. Petersburg series can become inert under this coarse addition for a suitably constructed partition. The approach may also have broader applications in behavioral modeling and the study of machine reasoning under perceptual limitations.
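The following toy sketch illustrates the inertial behaviour with an invented (but suitably widening) geometric partition: values are mapped to bin representatives before and after each addition, and the running sum of the St. Petersburg expected-value terms stops growing once the bin width exceeds the increment.

```python
import math

def rep(x):
    # Representative of x's perceptual bin; bins widen geometrically (ratio 1.5)
    if x <= 0:
        return 0.0
    return 1.5 ** math.floor(math.log(x, 1.5) + 1e-9)   # epsilon guards float error

def coarse_add(a, b):
    return rep(rep(a) + rep(b))

s = 0.0
for k in range(1, 101):
    term = (2 ** -k) * (2 ** k)   # each round contributes 1 in expectation
    s = coarse_add(s, term)
print(s)                          # stabilises at 2.25: further terms are inert
```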
Authors: Nazanin Siavash, Armin Moin
Abstract: There exist various Software Development Kits (SDKs) tailored to different quantum computing platforms. These are known as Quantum SDKs (QSDKs). Examples include but are not limited to Qiskit, Cirq, and PennyLane. However, this diversity presents significant challenges for interoperability and cross-platform development of hybrid quantum-classical software systems. Traditional rule-based transpilers for translating code between QSDKs are time-consuming to design and maintain, requiring deep expertise and rigid mappings between the source and destination code. In this study, we explore the use of Large Language Models (LLMs) as a flexible and automated solution. Leveraging their pretrained knowledge and contextual reasoning capabilities, we position LLMs as programming language-agnostic transpilers capable of converting quantum programs from one QSDK to another while preserving functional equivalence. Our approach eliminates the need for manually defined transformation rules and offers a scalable solution to quantum software portability. This work represents a step toward enabling intelligent, general-purpose transpilation in the quantum computing ecosystem.
Authors: Ishraq Khan, Assad Chowdary, Sharoz Haseeb, Urvish Patel
Abstract: Large Language Models (LLMs) have advanced code generation and software automation, but are fundamentally constrained by limited inference-time context and lack of explicit code structure reasoning. We introduce Kodezi Chronos, a next-generation architecture for autonomous code understanding, debugging, and maintenance, designed to operate across ultra-long contexts comprising entire codebases, histories, and documentation, all without fixed window limits. Kodezi Chronos leverages a multi-level embedding memory engine, combining vector and graph-based indexing with continuous code-aware retrieval. This enables efficient and accurate reasoning over millions of lines of code, supporting repository-scale comprehension, multi-file refactoring, and real-time self-healing actions. Our evaluation introduces a novel Multi Random Retrieval benchmark, specifically tailored to the software engineering domain. Unlike classical retrieval benchmarks, this method requires the model to resolve arbitrarily distant and obfuscated associations across code artifacts, simulating realistic tasks such as variable tracing, dependency migration, and semantic bug localization. Chronos outperforms prior LLMs and code models, demonstrating a 23% improvement in real-world bug detection and reducing debugging cycles by up to 40% compared to traditional sequence-based approaches. By natively interfacing with IDEs and CI/CD workflows, Chronos enables seamless, autonomous software maintenance, elevating code reliability and productivity while reducing manual effort. These results mark a critical advance toward self-sustaining, continuously optimized software ecosystems.
Authors: Sounak Bhowmik, Talita Perciano, Himanshu Thapliyal
Abstract: Dementia is a devastating condition with profound implications for individuals, families, and healthcare systems. Early and accurate detection of dementia is critical for timely intervention and improved patient outcomes. While classical machine learning and deep learning approaches have been explored extensively for dementia prediction, these solutions often struggle with high-dimensional biomedical data and large-scale datasets, quickly reaching computational and performance limitations. To address this challenge, quantum machine learning (QML) has emerged as a promising paradigm, offering faster training and advanced pattern recognition capabilities. This work aims to demonstrate the potential of quantum transfer learning (QTL) to enhance the performance of a weak classical deep learning model applied to a binary classification task for dementia detection. In addition, we show the effect of noise on the QTL-based approach, investigating its reliability and robustness. Using the OASIS 2 dataset, we show how quantum techniques can transform a suboptimal classical model into a more effective solution for biomedical image classification, highlighting their potential impact on advancing healthcare technology.
Authors: Gabriel Istrate, Cosmin Bonchis, Victor Bogdan
Abstract: We study the power of (competitive) algorithms with predictions in a multiagent setting. We introduce a two-predictor framework that assumes agents use one predictor for their own future (self) behavior and one for the behavior of the other players. The main problem we are concerned with is understanding the best competitive ratios that can be achieved by employing such predictors, under various assumptions on predictor quality. As an illustration of our framework, we introduce and analyze a multiagent version of the ski-rental problem. In this problem, agents can collaborate by pooling resources to get a group license for some asset. If the license price is not met, then agents have to rent the asset individually for the day at a unit price. Otherwise, the license becomes available forever to everyone at no extra cost. In the particular case of perfect predictions of others' behavior, the algorithm that follows the self-predictor is optimal but not robust to mispredictions of the agent's own future behavior; we give an algorithm with better robustness properties and benchmark it.
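For intuition, here is the classical single-agent ski-rental problem with a prediction, which the group-licence setting above generalises: trusting the prediction is optimal when it is correct but has unbounded cost when it is not, whereas the classical break-even rule is 2-competitive regardless.

```python
def follow_prediction_cost(true_days, predicted_days, price):
    # Trust the prediction completely
    if predicted_days >= price:
        return price                 # buy on day one
    return true_days                 # keep renting; unbounded if mispredicted

def breakeven_cost(true_days, price):
    # Robust rule: rent until day `price`, then buy (2-competitive)
    return true_days if true_days < price else (price - 1) + price

opt = lambda true_days, price: min(true_days, price)
print(follow_prediction_cost(100, 3, 10),   # 100: not robust
      breakeven_cost(100, 10),              # 19: within 2x of optimal
      opt(100, 10))                         # 10
```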
Authors: Maximiliano Hormaz\'abal Lagos, H\'ector Cerezo-Costas, Dimosthenis Karatzas
Abstract: We introduce EaGERS, a fully training-free and model-agnostic pipeline that (1) generates natural language rationales via a vision language model, (2) grounds these rationales to spatial sub-regions by computing multimodal embedding similarities over a configurable grid with majority voting, and (3) restricts response generation to the relevant regions selected in the masked image. Experiments on the DocVQA dataset demonstrate that our best configuration not only outperforms the base model on exact match accuracy and Average Normalized Levenshtein Similarity metrics but also enhances transparency and reproducibility in DocVQA without additional model fine-tuning.
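Schematically, step (2) works along these lines: each rationale is embedded, its similarity against every grid cell is computed, and cells chosen by a majority of rationales form the mask. The embed function below is a random stand-in for the multimodal embedding model, and the grid and rationales are invented.

```python
import numpy as np

rng = np.random.default_rng(0)
embed = lambda text: rng.normal(size=16)     # placeholder embedding model

grid = [f"cell_{r}_{c}" for r in range(3) for c in range(3)]
cell_embs = np.stack([embed(c) for c in grid])
cell_embs /= np.linalg.norm(cell_embs, axis=1, keepdims=True)

def top_cells(rationale, k=3):
    e = embed(rationale)
    sims = cell_embs @ (e / np.linalg.norm(e))
    return np.argsort(sims)[-k:]             # k most similar sub-regions

rationales = ["the date is in the header",
              "the total appears bottom-right",
              "look at the header fields"]
votes = np.zeros(len(grid))
for r in rationales:
    votes[top_cells(r)] += 1
print([c for c, v in zip(grid, votes) if v >= len(rationales) / 2])
```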
Authors: Ratun Rahman, Atit Pokharel, Dinh C. Nguyen
Abstract: Quantum Federated Learning (QFL) is an emerging paradigm that combines quantum computing and federated learning (FL) to enable decentralized model training while maintaining data privacy over quantum networks. However, quantum noise remains a significant barrier in QFL, since modern quantum devices experience heterogeneous noise levels due to variances in hardware quality and sensitivity to quantum decoherence, resulting in inadequate training performance. To address this issue, we propose SpoQFL, a novel QFL framework that leverages sporadic learning to mitigate quantum noise heterogeneity in distributed quantum systems. SpoQFL dynamically adjusts training strategies based on noise fluctuations, enhancing model robustness, convergence stability, and overall learning efficiency. Extensive experiments on real-world datasets demonstrate that SpoQFL significantly outperforms conventional QFL approaches, achieving superior training performance and more stable convergence.
Authors: Yucen Wang, Rui Yu, Shenghua Wan, Le Gan, De-Chuan Zhan
Abstract: Foundation Models (FMs) and World Models (WMs) offer complementary strengths in task generalization at different levels. In this work, we propose FOUNDER, a framework that integrates the generalizable knowledge embedded in FMs with the dynamic modeling capabilities of WMs to enable open-ended task solving in embodied environments in a reward-free manner. We learn a mapping function that grounds FM representations in the WM state space, effectively inferring the agent's physical states in the world simulator from external observations. This mapping enables the learning of a goal-conditioned policy through imagination during behavior learning, with the mapped task serving as the goal state. Our method leverages the predicted temporal distance to the goal state as an informative reward signal. FOUNDER demonstrates superior performance on various multi-task offline visual control benchmarks, excelling in capturing the deep-level semantics of tasks specified by text or videos, particularly in scenarios involving complex observations or domain gaps where prior methods struggle. The consistency of our learned reward function with the ground-truth reward is also empirically validated. Our project website is https://sites.google.com/view/founder-rl.
Authors: Vito Chan, Lennart Ebert, Paul-Julius Hillmann, Christoffer Rubensson, Stephan A. Fahrenkrog-Petersen, Jan Mendling
Abstract: Object-centric event logs expand the conventional single-case event log notion by considering multiple objects, allowing for the analysis of more complex and realistic process behavior. However, the number of real-world object-centric event logs remains limited, and further studies are needed to test their usefulness. The increasing availability of data from team sports can facilitate object-centric process mining, leveraging both real-world data and suitable use cases. In this paper, we present a framework for transforming football (soccer) data into an object-centric event log, further enhanced with a spatial dimension. We demonstrate the effectiveness of our framework by generating object-centric event logs based on real-world football data and discuss the results for varying process representations. With our paper, we provide the first example of object-centric event logs in football analytics. Future work should consider variant analysis and filtering techniques to better handle variability.
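As a rough illustration, one object-centric event with a spatial dimension might look like the record below; the attribute names are invented for illustration, not the paper's actual schema.

```python
event = {
    "event_id": "e1042",
    "activity": "pass",
    "timestamp": "2024-05-11T15:03:21.400",
    "objects": {                              # one event, several objects
        "player": ["player_10", "player_7"],  # passer and receiver
        "ball": ["ball_1"],
        "team": ["team_home"],
    },
    "position": {"x": 34.2, "y": 51.8},       # spatial dimension (pitch coords)
}
print(event["objects"]["player"])
```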
Authors: Mingjie Liu, Shizhe Diao, Jian Hu, Ximing Lu, Xin Dong, Hao Zhang, Alexander Bukharin, Shaokun Zhang, Jiaqi Zeng, Makesh Narsimhan Sreedhar, Gerald Shen, David Mosallanezhad, Di Zhang, Jonas Yang, June Yang, Oleksii Kuchaiev, Guilin Liu, Zhiding Yu, Pavlo Molchanov, Yejin Choi, Jan Kautz, Yi Dong
Abstract: Recent advancements in reasoning-focused language models such as OpenAI's o1 and DeepSeek-R1 have shown that scaling test-time computation, through chain-of-thought reasoning and iterative exploration, can yield substantial improvements on complex tasks like mathematics and code generation. These breakthroughs have been driven by large-scale reinforcement learning (RL), particularly when combined with verifiable reward signals that provide objective and grounded supervision. In this report, we investigate the effects of prolonged reinforcement learning on a small language model across a diverse set of reasoning domains. Our work identifies several key ingredients for effective training, including the use of verifiable reward tasks, enhancements to Group Relative Policy Optimization (GRPO), and practical techniques to improve training stability and generalization. We introduce controlled KL regularization, clipping ratio, and periodic reference policy resets as critical components for unlocking long-term performance gains. Our model achieves significant improvements over strong baselines, including +14.7% on math, +13.9% on coding, and +54.8% on logic puzzle tasks. To facilitate continued research, we release our model publicly.
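At the core of GRPO is a group-relative advantage: rewards for a group of sampled responses to the same prompt are standardised within the group, removing the need for a learned value function. The sketch below shows just that normalisation; the KL regularization and clipping mentioned above sit on top of it.

```python
import numpy as np

def group_relative_advantages(rewards):
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)   # standardise within the group

# e.g. four sampled solutions to one problem, verifiable reward in {0, 1}
print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))   # [ 1, -1, -1,  1]
```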
Authors: Yuncong Yang, Jiageng Liu, Zheyuan Zhang, Siyuan Zhou, Reuben Tan, Jianwei Yang, Yilun Du, Chuang Gan
Abstract: Spatial reasoning in 3D space is central to human cognition and indispensable for embodied tasks such as navigation and manipulation. However, state-of-the-art vision-language models (VLMs) frequently struggle with tasks as simple as anticipating how a scene will look after an egocentric motion: they perceive 2D images but lack an internal model of 3D dynamics. We therefore propose MindJourney, a test-time scaling framework that equips a VLM with this missing capability by coupling it to a controllable world model based on video diffusion. The VLM iteratively sketches a concise camera trajectory, while the world model synthesizes the corresponding view at each step. The VLM then reasons over this multi-view evidence gathered during the interactive exploration. Without any fine-tuning, MindJourney achieves an average performance boost of over 8% on the representative spatial reasoning benchmark SAT, showing that pairing VLMs with world models for test-time scaling offers a simple, plug-and-play route to robust 3D reasoning. Our method also improves upon test-time inference for VLMs trained through reinforcement learning, demonstrating the potential of utilizing world models for test-time scaling.
Authors: Lionel Wong, Katherine M. Collins, Lance Ying, Cedegao E. Zhang, Adrian Weller, Tobias Gersternberg, Timothy O'Donnell, Alexander K. Lew, Jacob D. Andreas, Joshua B. Tenenbaum, Tyler Brooke-Wilson
Abstract: When faced with novel situations, people are able to marshal relevant considerations from a wide range of background knowledge and put these to use in inferences and predictions. What permits us to draw in globally relevant information and reason over it coherently? Here, we explore the hypothesis that people use a combination of distributed and symbolic representations to construct bespoke mental models tailored to novel situations. We propose a computational implementation of this idea -- a ``Model Synthesis Architecture'' (MSA) -- using language models to implement global relevance-based retrieval and model synthesis and probabilistic programs to implement bespoke, coherent world models. We evaluate our MSA as a model of human judgments on a novel reasoning dataset. The dataset -- built around a `Model Olympics` domain of sports vignettes -- tests models' capacity for human-like, open-ended reasoning by requiring (i) judgments about novel causal structures described in language; (ii) drawing on large bodies of background knowledge; and (iii) doing both in light of observations that introduce arbitrary novel variables. Our MSA approach captures human judgments better than language model-only baselines, under both direct and chain-of-thought generations from the LM that supports model synthesis. These results suggest that MSAs can be implemented in a way that mirrors people's ability to deliver locally coherent reasoning over globally relevant variables, offering a path to understanding and replicating human reasoning in open-ended domains.
Authors: Michael A. Lepori, Jennifer Hu, Ishita Dasgupta, Roma Patel, Thomas Serre, Ellie Pavlick
Abstract: Language models (LMs) are used for a diverse range of tasks, from question answering to writing fantastical stories. In order to reliably accomplish these tasks, LMs must be able to discern the modal category of a sentence (i.e., whether it describes something that is possible, impossible, completely nonsensical, etc.). However, recent studies have called into question the ability of LMs to categorize sentences according to modality (Michaelov et al., 2025; Kauf et al., 2023). In this work, we identify linear representations that discriminate between modal categories within a variety of LMs, or modal difference vectors. Analysis of modal difference vectors reveals that LMs have access to more reliable modal categorization judgments than previously reported. Furthermore, we find that modal difference vectors emerge in a consistent order as models become more competent (i.e., through training steps, layers, and parameter count). Notably, we find that modal difference vectors identified within LM activations can be used to model fine-grained human categorization behavior. This potentially provides a novel view into how human participants distinguish between modal categories, which we explore by correlating projections along modal difference vectors with human participants' ratings of interpretable features. In summary, we derive new insights into LM modal categorization using techniques from mechanistic interpretability, with the potential to inform our understanding of modal categorization in humans.
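A minimal sketch of the underlying construction: a modal difference vector can be taken as the difference between mean activations for two modal categories, with new sentences scored by projection. The activations here are synthetic; in practice they come from LM hidden states.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 256
possible   = rng.normal(0.0, 1.0, (500, d)) + 0.3   # category means differ
impossible = rng.normal(0.0, 1.0, (500, d)) - 0.3

diff = possible.mean(axis=0) - impossible.mean(axis=0)
diff /= np.linalg.norm(diff)                        # modal difference vector

# Projections along the vector separate the two categories
print((possible @ diff).mean(), (impossible @ diff).mean())
```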
Authors: Slimane Larabi
Abstract: Although existing models can interact with humans and provide satisfactory responses, they lack the ability to act autonomously or engage in independent reasoning. Furthermore, input data in these models is typically provided as explicit queries, even when some sensory data is already acquired. In addition, AI agents, which are computational entities designed to perform tasks and make decisions autonomously based on their programming, data inputs, and learned knowledge, have shown significant progress. However, they struggle with integrating knowledge across multiple domains, unlike humans. Mental imagery plays a fundamental role in the brain's thinking process, which involves performing tasks based on internal multisensory data, planned actions, needs, and reasoning capabilities. In this paper, we investigate how to integrate mental imagery into a machine thinking framework and how this could be beneficial in initiating the thinking process. Our proposed machine thinking framework integrates a Cognitive thinking unit supported by three auxiliary units: the Input Data Unit, the Needs Unit, and the Mental Imagery Unit. Within this framework, data is represented as natural language sentences or drawn sketches, serving both informative and decision-making purposes. We conducted validation tests for this framework, and the results are presented and discussed.
Authors: Sheng Liu, Panos Papadimitratos
Abstract: Federated Learning (FL) has emerged as a promising solution for privacy-preserving autonomous driving, specifically camera-based Road Condition Classification (RCC) systems, harnessing distributed sensing, computing, and communication resources on board vehicles without sharing sensitive image data. However, the collaborative nature of FL-RCC frameworks introduces new vulnerabilities: Targeted Label Flipping Attacks (TLFAs), in which malicious clients (vehicles) deliberately alter their training data labels to compromise the inference performance of the learned model. Such attacks can, e.g., cause a vehicle to mis-classify slippery, dangerous road conditions as pristine and exceed recommended speed. However, studies of TLFAs on FL-based RCC systems are largely missing. We address this challenge with a threefold contribution: 1) we disclose the vulnerability of existing FL-RCC systems to TLFAs; 2) we introduce a novel label-distance-based metric to precisely quantify the safety risks posed by TLFAs; and 3) we propose FLARE, a defensive mechanism leveraging neuron-wise analysis of the output layer to mitigate TLFA effects. Extensive experiments across three RCC tasks, four evaluation metrics, six baselines, and three deep learning models demonstrate both the severity of TLFAs on FL-RCC systems and the effectiveness of FLARE in mitigating the attack impact.
Authors: Yifan Deng, Spencer S. Ericksen, Anthony Gitter
Abstract: Scientific databases aggregate vast amounts of quantitative data alongside descriptive text. In biochemistry, molecule screening assays evaluate the functional responses of candidate molecules against disease targets. Unstructured text that describes the biological mechanisms through which these targets operate, experimental screening protocols, and other attributes of assays offer rich information for new drug discovery campaigns but has been untapped because of that unstructured format. We present Assay2Mol, a large language model-based workflow that can capitalize on the vast existing biochemical screening assays for early-stage drug discovery. Assay2Mol retrieves existing assay records involving targets similar to the new target and generates candidate molecules using in-context learning with the retrieved assay screening data. Assay2Mol outperforms recent machine learning approaches that generate candidate ligand molecules for target protein structures, while also promoting more synthesizable molecule generation.
Authors: Said Ohamouddou, Abdellatif El Afia, Hanaa El Afia, Raddouane Chiheb
Abstract: Tree species classification from terrestrial LiDAR point clouds is challenging because of the complex multi-scale geometric structures in forest environments. Existing approaches using multi-scale dynamic graph convolutional neural networks (MS-DGCNN) employ parallel multi-scale processing, which fails to capture the semantic relationships between the hierarchical levels of the tree architecture. We present MS-DGCNN++, a hierarchical multiscale fusion dynamic graph convolutional network that uses semantically meaningful feature extraction at local, branch, and canopy scales with cross-scale information propagation. Our method employs scale-specific feature engineering, including standard geometric features for the local scale, normalized relative vectors for the branch scale, and distance information for the canopy scale. This hierarchical approach replaces uniform parallel processing with semantically differentiated representations that are aligned with the natural tree structure. Under the same proposed tree species data augmentation strategy for all experiments, MS-DGCNN++ achieved an accuracy of 94.96\% on STPCTLS, outperforming DGCNN, MS-DGCNN, and the state-of-the-art model PPT. On FOR-species20K, it achieves 67.25\% accuracy (a 6.1\% improvement over MS-DGCNN). For standard 3D object recognition, our method outperformed DGCNN and MS-DGCNN with overall accuracies of 93.15\% on ModelNet40 and 94.05\% on ModelNet10. With fewer parameters and lower complexity compared to state-of-the-art transformer approaches, our method is suitable for resource-constrained applications while maintaining competitive accuracy. Beyond tree classification, the method generalizes to standard 3D object recognition, establishing it as a versatile solution for diverse point cloud processing applications. The implementation code is publicly available at https://github.com/said-ohamouddou/MS-DGCNN2.
Authors: Prateek Chanda, Saral Sureka, Parth Pratim Chatterjee, Krishnateja Killamsetty, Nikhil Shivakumar Nayak, Ganesh Ramakrishnan
Abstract: The performance of finetuned large language models (LLMs) hinges critically on the composition of the training mixture. However, selecting an optimal blend of task datasets remains a largely manual, heuristic-driven process, with practitioners often relying on uniform or size-based sampling strategies. We introduce TASKPGM, a principled and scalable framework for mixture optimization that selects continuous task proportions by minimizing an energy function over a Markov Random Field (MRF). Task relationships are modeled using behavioral divergences such as Jensen-Shannon Divergence and Pointwise Mutual Information computed from the predictive distributions of single-task finetuned models. Our method yields a closed-form solution under simplex constraints and provably balances representativeness and diversity among tasks. We provide theoretical guarantees, including weak submodularity for budgeted variants, and demonstrate consistent empirical improvements on Llama 2 and Mistral across evaluation suites such as MMLU and BIGBench. Beyond performance, TASKPGM offers interpretable insights into task influence and mixture composition, making it a powerful tool for efficient and robust LLM finetuning.
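As one concrete behavioural divergence, the Jensen-Shannon divergence between two models' predictive distributions can be computed as below (toy 4-class distributions; SciPy's function returns the square root, hence the squaring).

```python
import numpy as np
from scipy.spatial.distance import jensenshannon

p_model_a = np.array([0.70, 0.10, 0.10, 0.10])   # predictive distribution A
p_model_b = np.array([0.25, 0.25, 0.25, 0.25])   # predictive distribution B

jsd = jensenshannon(p_model_a, p_model_b, base=2) ** 2
print(jsd)   # in [0, 1]; such pairwise terms feed the MRF energy
```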
Authors: Rui Li, Xiaoyun Zhi, Jinxin Chi, Menghan Yu, Lixin Huang, Jia Zhu, Weilun Zhang, Xing Ma, Wenjia Liu, Zhicheng Zhu, Daowen Luo, Zuquan Song, Xin Yin, Chao Xiang, Shuguang Wang, Wencong Xiao, Gene Cooperman
Abstract: Large Language Models (LLMs) have become a cornerstone of modern AI, driving breakthroughs in natural language processing and expanding into multimodal workloads involving images, audio, and video. As with most computational software, it is important to distinguish between ordinary runtime performance and startup overhead. Prior research has focused on runtime performance: improving training efficiency and stability. This work focuses instead on the increasingly critical issue of startup overhead in training: the delay before training jobs begin execution. Startup overhead is particularly important in large, industrial-scale LLMs, where failures occur more frequently and multiple teams operate in iterative update-debug cycles. In one of our training clusters, more than 3.5% of GPU time is wasted due to startup overhead alone. In this work, we present the first in-depth characterization of LLM training startup overhead based on real production data. We analyze the components of startup cost, quantify its direct impact, and examine how it scales with job size. These insights motivate the design of Bootseer, a system-level optimization framework that addresses three primary startup bottlenecks: (a) container image loading, (b) runtime dependency installation, and (c) model checkpoint resumption. To mitigate these bottlenecks, Bootseer introduces three techniques: (a) hot block record-and-prefetch, (b) dependency snapshotting, and (c) striped HDFS-FUSE. Bootseer has been deployed in a production environment and evaluated on real LLM training workloads, demonstrating a 50% reduction in startup overhead.
Authors: Dianxin Luan, John Thompson
Abstract: Channel estimation is crucial in cognitive communications, as it enables intelligent spectrum sensing and adaptive transmission by providing accurate information about the current channel state. However, in many papers, neural networks are tested by training and testing on one example channel or similar channels. This is because data-driven methods often degrade on new data that they were not trained on, as they cannot extrapolate their training knowledge. This is despite the fact that physical channels are often assumed to be time-variant. Moreover, due to low latency requirements and limited computing resources, neural networks may not have enough time or computing resources to execute online training to fine-tune their parameters. This motivates us to design offline-trained neural networks that can perform robustly over wireless channels, but without any actual channel information being known at design time. In this paper, we propose design criteria to generate synthetic training datasets for neural networks, which guarantee that after training the resulting networks achieve a certain mean squared error (MSE) on new and previously unseen channels. Therefore, neural network solutions require no prior channel information or parameter updates for real-world implementations. Based on the proposed design criteria, we further propose a benchmark design that ensures intelligent operation for different channel profiles. To demonstrate general applicability, we use neural networks with different levels of complexity to show that the generalization achieved appears to be independent of neural network architecture. From simulations, neural networks achieve robust generalization to wireless channels with both fixed channel profiles and variable delay spreads.
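The kind of synthetic dataset generation proposed here can be pictured as below: random tapped-delay-line channels whose delay spread is drawn from a range, so a trained estimator never sees one fixed profile. The exponential power-delay profile and parameter ranges are illustrative choices, not the paper's design criteria.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_channel(n_taps=16, min_spread=1.0, max_spread=8.0):
    spread = rng.uniform(min_spread, max_spread)   # delay spread, in taps
    pdp = np.exp(-np.arange(n_taps) / spread)      # power-delay profile
    pdp /= pdp.sum()
    return np.sqrt(pdp / 2) * (rng.normal(size=n_taps)
                               + 1j * rng.normal(size=n_taps))  # Rayleigh taps

training_channels = np.stack([random_channel() for _ in range(10000)])
print(training_channels.shape)   # (10000, 16)
```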
Authors: Kiana Kheiri, Aamna Aamir, Andriy Miranskyy, Chen Ding
Abstract: Quantum circuits must be error-resilient, yet LLMs like Granite-20B-Code and StarCoder often output flawed Qiskit code. We fine-tuned a 32B model with two RL methods, Group Relative Policy Optimization (GRPO) and Odds-Ratio Preference Optimization (ORPO), using a richly annotated synthetic dataset. On the Qiskit HumanEval benchmark, ORPO reaches 56.29\% Pass@1 ($\approx+10$ pp over Granite-8B-QK) and GRPO hits 49\%, both beating all general-purpose baselines; on the original HumanEval they score 65.90\% and 63.00\%. GRPO excels on basic tasks (42/54), ORPO on intermediate ones (41/68), and neither solves the five advanced tasks, highlighting clear gains yet room for progress in AI-assisted quantum programming.
Authors: George Jiayuan Gao, Tianyu Li, Junyao Shi, Yihan Li, Zizhe Zhang, Nadia Figueroa, Dinesh Jayaraman
Abstract: Tool design and use reflect the ability to understand and manipulate the physical world through creativity, planning, and foresight. As such, these capabilities are often regarded as measurable indicators of intelligence across biological species. While much of today's research on robotic intelligence focuses on generating better controllers, inventing smarter tools offers a complementary form of physical intelligence: shifting the onus of problem-solving onto the tool's design. Given the vast and impressive common-sense, reasoning, and creative capabilities of today's foundation models, we investigate whether these models can provide useful priors to automatically design and effectively wield such tools. We present VLMgineer, a framework that harnesses the code generation abilities of vision language models (VLMs) together with evolutionary search to iteratively co-design physical tools and the action plans that operate them to perform a task. We evaluate VLMgineer on a diverse new benchmark of everyday manipulation scenarios that demand creative tool design and use. Across this suite, VLMgineer consistently discovers tools and policies that solve tasks more effectively and innovatively, transforming challenging robotics problems into straightforward executions. It also outperforms VLM-generated designs from human specifications and existing human-crafted tools for everyday tasks. To facilitate future research on automated tool invention, we will release our benchmark and code.
Authors: Athanasios Papastathopoulos-Katsaros, Alexandra Stavrianidi, Zhandong Liu
Abstract: Physics-Informed Neural Networks (PINNs) are deep learning models that incorporate the governing physical laws of a system into the learning process, making them well-suited for solving complex scientific and engineering problems. Recently, PINNs have gained widespread attention as a powerful framework for combining physical principles with data-driven modeling to improve prediction accuracy. Despite their successes, however, PINNs often exhibit poor extrapolation performance outside the training domain and are highly sensitive to the choice of activation functions (AFs). In this paper, we introduce a transfer learning (TL) method to improve the extrapolation capability of PINNs. Our approach applies TL within an extended training domain, using only a small number of carefully selected collocation points. Additionally, we propose an adaptive AF that takes the form of a linear combination of standard AFs, which improves both the robustness and accuracy of the model. Through a series of experiments, we demonstrate that our method achieves an average 40% reduction in relative L2 error and an average 50% reduction in mean absolute error in the extrapolation domain, all without a significant increase in computational cost. The code is available at https://github.com/LiuzLab/PINN-extrapolation .
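A minimal PyTorch sketch of the adaptive-AF idea follows, assuming a softmax-normalized learnable combination of three standard activations; the authors' exact basis set and parameterization may differ (see their repository above).

```python
# Adaptive activation: a learnable, normalized mixture of standard activations.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdaptiveAF(nn.Module):
    def __init__(self):
        super().__init__()
        self.bases = [torch.tanh, torch.sin, F.gelu]          # assumed basis functions
        self.w = nn.Parameter(torch.zeros(len(self.bases)))   # mixing logits

    def forward(self, x):
        w = torch.softmax(self.w, dim=0)  # positive weights summing to one
        return sum(wi * f(x) for wi, f in zip(w, self.bases))

# Drop-in replacement for a fixed activation inside a PINN:
layer = nn.Sequential(nn.Linear(2, 64), AdaptiveAF(), nn.Linear(64, 1))
```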
Authors: Salvador D. Escobedo
Abstract: We propose the Single Conversation Methodology (SCM), a novel and pragmatic approach to software development using large language models (LLMs). In contrast to ad hoc interactions with generative AI, SCM emphasizes a structured and persistent development dialogue, where all stages of a project - from requirements to architecture and implementation - unfold within a single, long-context conversation. The methodology is grounded in principles of cognitive clarity, traceability, modularity, and documentation. We define its phases, best practices, and philosophical stance, while arguing that SCM offers a necessary correction to the passive reliance on LLMs prevalent in current practices. We aim to reassert the active role of the developer as architect and supervisor of the intelligent tool.
Authors: Ananya Raghu, Anisha Raghu, Alice S. Tang, Yannis M. Paulus, Tyson N. Kim, Tomiko T. Oskotsky
Abstract: Background/Objectives: Age-related macular degeneration, glaucoma, diabetic retinopathy (DR), diabetic macular edema, and pathological myopia affect hundreds of millions of people worldwide. Early screening for these diseases is essential, yet access to medical care remains limited in low- and middle-income countries as well as in resource-limited settings. We develop InSight, an AI-based app that combines patient metadata with fundus images for accurate diagnosis of five common eye diseases, improving the accessibility of screening. Methods: InSight features a three-stage pipeline: a real-time image quality assessment module, a disease diagnosis model, and a DR grading model to assess severity. Our disease diagnosis model incorporates three key innovations: (a) a multimodal fusion technique (MetaFusion) combining clinical metadata and images; (b) a pretraining method leveraging supervised and self-supervised loss functions; and (c) a multitask model to simultaneously predict the five diseases. We make use of the BRSET (lab-captured images) and mBRSET (smartphone-captured images) datasets, both of which also contain clinical metadata, for model training/evaluation. Results: Trained on a dataset of BRSET and mBRSET images, the image quality checker achieves near-100% accuracy in filtering out low-quality fundus images. The multimodal pretrained disease diagnosis model outperforms image-only models by 6% in balanced accuracy on BRSET and 4% on mBRSET. Conclusions: The InSight pipeline demonstrates robustness across varied image conditions and has high diagnostic accuracy across all five diseases, generalizing to both smartphone- and lab-captured images. The multitask model contributes to the lightweight nature of the pipeline, making it five times more computationally efficient than having five individual models, one per disease.
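A hedged sketch of the fusion-plus-multitask idea follows; the layer sizes, concatenation-based fusion, and metadata dimensionality are assumptions for illustration, not InSight's actual MetaFusion design.

```python
# Fuse an image embedding with encoded clinical metadata, then predict all
# five diseases jointly from the shared representation.
import torch
import torch.nn as nn

class FusionMultitask(nn.Module):
    def __init__(self, img_dim=512, meta_dim=8, n_diseases=5):
        super().__init__()
        self.meta_mlp = nn.Sequential(nn.Linear(meta_dim, 64), nn.ReLU())
        self.head = nn.Linear(img_dim + 64, n_diseases)  # one logit per disease

    def forward(self, img_emb, meta):
        fused = torch.cat([img_emb, self.meta_mlp(meta)], dim=-1)
        return self.head(fused)  # multitask: five predictions from one model

logits = FusionMultitask()(torch.randn(4, 512), torch.randn(4, 8))  # shape (4, 5)
```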
Authors: Mihran Miroyan, Rose Niousha, Joseph E. Gonzalez, Gireeja Ranade, Narges Norouzi
Abstract: Large Language Models (LLMs) have shown strong performance on programming tasks, but can they generate code the way real students do - imperfect, iterative, and stylistically diverse? We present ParaStudent, a systematic study of LLM-based "student-like" code generation in an introductory programming course setting. Using a dataset of timestamped student submissions across multiple semesters, we design low- and high-resolution experiments to model student progress and evaluate code outputs along semantic, functional, and stylistic dimensions. Our results show that fine-tuning significantly improves alignment with real student trajectories and captures error patterns, incremental improvements, and stylistic variations more faithfully. This study shows that modeling realistic student code requires capturing learning dynamics through context-aware generation, temporal modeling, and multi-dimensional evaluation. Code for experiments and evaluation is available at \href{https://github.com/mmiroyan/ParaStudent}{\texttt{github.com/mmiroyan/ParaStudent}}.
Authors: Christina Thrainer, Md Meftahul Ferdaus, Mahdi Abdelguerfi, Christian Guetl, Steven Sloan, Kendall N. Niles, Ken Pathak
Abstract: Automated structural defect segmentation in civil infrastructure faces a critical challenge: achieving high accuracy while maintaining computational efficiency for real-time deployment. This paper presents FORTRESS (Function-composition Optimized Real-Time Resilient Structural Segmentation), a new architecture that balances accuracy and speed by combining depthwise separable convolutions with adaptive Kolmogorov-Arnold Network integration. FORTRESS incorporates three key innovations: a systematic depthwise separable convolution framework achieving a 3.6x parameter reduction per layer, adaptive TiKAN integration that selectively applies function composition transformations only when computationally beneficial, and multi-scale attention fusion combining spatial, channel, and KAN-enhanced features across decoder levels. The architecture achieves remarkable efficiency gains with 91% parameter reduction (31M to 2.9M), 91% computational complexity reduction (13.7 to 1.17 GFLOPs), and 3x inference speed improvement while delivering superior segmentation performance. Evaluation on benchmark infrastructure datasets demonstrates state-of-the-art results with an F1-score of 0.771 and a mean IoU of 0.677, significantly outperforming existing methods including U-Net, SA-UNet, and U-KAN. The dual optimization strategy proves essential for optimal performance, establishing FORTRESS as a robust solution for practical structural defect segmentation in resource-constrained environments where both accuracy and computational efficiency are paramount. Comprehensive architectural specifications are provided in the Supplemental Material. Source code is available at https://github.com/faeyelab/fortress-paper-code.
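Depthwise separable convolution is a standard factorization, so its parameter saving can be shown directly; the sketch below is generic PyTorch, not FORTRESS's specific framework, and the exact per-layer ratio depends on kernel size and channel counts.

```python
# Factor a standard conv into a per-channel (depthwise) conv followed by a
# 1x1 (pointwise) conv that mixes channels.
import torch.nn as nn

def dws_conv(c_in, c_out, k=3):
    return nn.Sequential(
        nn.Conv2d(c_in, c_in, k, padding=k // 2, groups=c_in),  # depthwise
        nn.Conv2d(c_in, c_out, kernel_size=1),                  # pointwise
    )

# Weight counts for 64 -> 64 channels with k = 3:
#   standard conv:  64 * 64 * 9 = 36,864
#   separable:      64 * 9 + 64 * 64 = 4,672  (~7.9x fewer; the paper's 3.6x
#   per-layer figure reflects its own kernel/channel configuration)
```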
Authors: Sangbong Yoo, Jaeyoung Lee, Chanyoung Yoon, Geonyeong Son, Hyein Hong, Seongbum Seo, Soobin Yim, Chanyoung Jung, Jungsoo Park, Misuk Kim, Yun Jang
Abstract: Data heterogeneity is a prevalent issue, stemming from various conflicting factors, making its utilization complex. This uncertainty, particularly resulting from disparities in data formats, frequently necessitates the involvement of experts to find resolutions. Current methodologies primarily address conflicts related to data structures and schemas, often overlooking the pivotal role played by data transformation. As the utilization of artificial intelligence (AI) continues to expand, there is a growing demand for a more streamlined data preparation process, and data transformation becomes paramount. It customizes training data to enhance AI learning efficiency and adapts input formats to suit diverse AI models. Selecting an appropriate transformation technique is essential for preserving crucial data details. Despite the widespread integration of AI across various industries, comprehensive reviews of contemporary data transformation approaches are scarce. This survey explores the intricacies of data heterogeneity and its underlying sources. It systematically categorizes and presents strategies to address heterogeneity stemming from differences in data formats, shedding light on the inherent challenges associated with each strategy.
Authors: Anastasia Kuznetsova, Inseon Jang, Wootaek Lim, Minje Kim
Abstract: Neural audio codecs, leveraging quantization algorithms, have significantly impacted various speech/audio tasks. While high-fidelity reconstruction is paramount for human perception, audio coding for machines (ACoM) prioritizes efficient compression and downstream task performance, disregarding perceptual nuances. This work introduces an efficient ACoM method that can compress and quantize any chosen intermediate feature representation of an already trained speech/audio downstream model. Our approach employs task-specific loss guidance alongside residual vector quantization (RVQ) losses, providing ultra-low bitrates (i.e., less than 200 bps) with a minimal loss of the downstream model performance. The resulting tokenizer is adaptable to various bitrates and model sizes for flexible deployment. Evaluated on automatic speech recognition and audio classification, our method demonstrates its efficacy and potential for broader task and architectural applicability through appropriate regularization.
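Residual vector quantization (RVQ), the quantizer family this method employs, is easy to show in miniature: each stage quantizes what the previous stage left unexplained. The sketch below uses random codebooks purely for illustration; the paper's codebooks are learned with task-specific loss guidance.

```python
# Minimal RVQ encoder: successive codebooks quantize successive residuals.
import numpy as np

def rvq_encode(x, codebooks):
    """x: (d,) feature vector; codebooks: list of (K, d) arrays -> token ids."""
    ids, residual = [], x.copy()
    for cb in codebooks:
        idx = int(np.argmin(((residual - cb) ** 2).sum(axis=1)))  # nearest code
        ids.append(idx)
        residual = residual - cb[idx]  # remainder goes to the next stage
    return ids

rng = np.random.default_rng(0)
codebooks = [rng.standard_normal((256, 64)) for _ in range(4)]  # 4 x 8 bits per frame
tokens = rvq_encode(rng.standard_normal(64), codebooks)
```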
Authors: Ijazul Haq, Muhammad Saqib, Yingjie Zhang
Abstract: Spatial grounding, the process of associating natural language expressions with corresponding image regions, has rapidly advanced due to the introduction of transformer-based models, significantly enhancing multimodal representation and cross-modal alignment. Despite this progress, the field lacks a comprehensive synthesis of current methodologies, dataset usage, evaluation metrics, and industrial applicability. This paper presents a systematic literature review of transformer-based spatial grounding approaches from 2018 to 2025. Our analysis identifies dominant model architectures, prevalent datasets, and widely adopted evaluation metrics, alongside highlighting key methodological trends and best practices. This study provides essential insights and structured guidance for researchers and practitioners, facilitating the development of robust, reliable, and industry-ready transformer-based spatial grounding models.
Authors: Yunxiang Zhang, Muhammad Khalifa, Lechen Zhang, Xin Liu, Ayoung Lee, Xinliang Frederick Zhang, Farima Fatahi Bayat, Lu Wang
Abstract: Large reasoning models (LRMs) can do complex reasoning via long chain-of-thought (CoT) involving cognitive strategies such as backtracking and self-correction. Recent studies suggest that some models inherently possess these long reasoning abilities, which may be unlocked via extra training. Our work first investigates whether we can elicit such behavior without any training. To this end, we propose a decoding-time approach, ThinkLogit, which utilizes logits arithmetic (Liu et al., 2024) to tune a target large LM for long reasoning using a substantially smaller model as guider. We then show that we can further boost performance by training the guider model with preference optimization over correct/incorrect reasoning pairs sampled from both the target and guider model -- a setup we refer to as ThinkLogit-DPO. Our experiments demonstrate that ThinkLogit and ThinkLogit-DPO achieve relative improvements in pass@1 of 26% and 29%, respectively, across four mathematical datasets using Qwen2.5-32B when guided by R1-Distill-Qwen-1.5B -- a model 21x smaller. Lastly, we show that ThinkLogit can transfer long reasoning skills acquired through reinforcement learning, improving pass@1 by a relative 13% compared to the Qwen2.5-32B base model. Our work presents a computationally-efficient method to elicit long reasoning in large models with minimal or no additional training.
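One common form of logits arithmetic (in the proxy-tuning style of Liu et al., 2024) steers the large model with the logit offset between a small tuned guider and its untuned base; whether ThinkLogit uses exactly this combination rule and model pairing is an assumption here.

```python
# Decoding-time logits arithmetic: no training of the large target model.
import torch

def guided_logits(target_logits, guider_tuned_logits, guider_base_logits, alpha=1.0):
    """All tensors are (vocab,); a shared tokenizer/vocabulary is assumed."""
    return target_logits + alpha * (guider_tuned_logits - guider_base_logits)

# At each step, run all three models on the same prefix and sample from
# guided_logits(...) instead of target_logits.
```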
Authors: Ruicheng Zhang, Haowei Guo, Kanghui Tian, Jun Zhou, Mingliang Yan, Zeyu Zhang, Shen Zhao
Abstract: Unified Medical Image Segmentation (UMIS) is critical for comprehensive anatomical assessment but faces challenges due to multi-scale structural heterogeneity. Conventional pixel-based approaches, lacking object-level anatomical insight and inter-organ relational modeling, struggle with morphological complexity and feature conflicts, limiting their efficacy in UMIS. We propose Mamba Snake, a novel deep snake framework enhanced by state space modeling for UMIS. Mamba Snake frames multi-contour evolution as a hierarchical state space atlas, effectively modeling macroscopic inter-organ topological relationships and microscopic contour refinements. We introduce a snake-specific vision state space module, the Mamba Evolution Block (MEB), which leverages effective spatiotemporal information aggregation for adaptive refinement of complex morphologies. Energy map shape priors further ensure robust long-range contour evolution in heterogeneous data. Additionally, a dual-classification synergy mechanism is incorporated to concurrently optimize detection and segmentation, mitigating under-segmentation of microstructures in UMIS. Extensive evaluations across five clinical datasets reveal Mamba Snake's superior performance, with an average Dice improvement of 3\% over state-of-the-art methods.
Authors: Hanlei Shi, Leyuan Qu, Yu Liu, Di Gao, Yuhua Zheng, Taihao Li
Abstract: Emotional talking-head generation has emerged as a pivotal research area at the intersection of computer vision and multimodal artificial intelligence, with its core value lying in enhancing human-computer interaction through immersive and empathetic engagement. With the advancement of multimodal large language models, the driving signals for emotional talking-head generation have shifted from audio and video to more flexible text. However, current text-driven methods rely on predefined discrete emotion label texts, oversimplifying the dynamic complexity of real facial muscle movements and thus failing to achieve natural emotional expressiveness. This study proposes the Think-Before-Draw framework to address two key challenges: (1) in-depth semantic parsing of emotions: by introducing Chain-of-Thought (CoT), abstract emotion labels are transformed into physiologically grounded facial muscle movement descriptions, enabling the mapping from high-level semantics to actionable motion features; and (2) fine-grained expressiveness optimization: inspired by artists' portrait painting process, a progressive guidance denoising strategy is proposed, employing a "global emotion localization--local muscle control" mechanism to refine micro-expression dynamics in generated videos. Our experiments demonstrate that our approach achieves state-of-the-art performance on widely-used benchmarks, including MEAD and HDTF. Additionally, we collected a set of portrait images to evaluate our model's zero-shot generation capability.
Authors: Jiaxin An
Abstract: As the global population ages, artificial intelligence (AI)-powered agents have emerged as potential tools to support older adults' caregiving. Prior research has explored agent autonomy by identifying key interaction stages in task processes and defining the agent's role at each stage. However, ensuring that agents align with older adults' autonomy preferences remains a critical challenge. Drawing on interdisciplinary conceptualizations of autonomy, this paper examines four key dimensions of autonomy for older adults: decision-making autonomy, goal-oriented autonomy, control autonomy, and social responsibility autonomy. This paper then proposes the following research directions: (1) Addressing social responsibility autonomy, which concerns the ethical and social implications of agent use in communal settings; (2) Operationalizing agent autonomy from the task perspective; and (3) Developing autonomy measures.
Authors: Keli Zheng, Zerong Xie
Abstract: In this paper, we present Synergy, a language model that bridges different levels of abstraction in an end-to-end fashion through a learned routing mechanism. Focusing on low-level linguistic abstraction, we trained our model as a byte-level language model. Our model spontaneously learns to tokenize bytes, producing fewer concept tokens than Byte-level Byte Pair Encoder (BBPE) tokenizers while keeping comparable performance. By comparing with Llama3, we observed an advantage of Synergy under the same model scale and training dataset size. Further studies show that the middle part (the higher abstraction part) of our model performs better when positional encodings are removed, suggesting the emergence of position-independent concepts. These findings demonstrate the feasibility of tokenizer-free architectures, paving the way for more robust and flexible pipelines.
Authors: Min-Jeong Lee, Hee-Dong Kim, Seong-Whan Lee
Abstract: Stable Diffusion is an outstanding text-to-image generation model, but its time-consuming generation process remains a challenge due to the quadratic complexity of attention operations. Recent token merging methods improve efficiency by reducing the number of tokens during attention operations, but often overlook the characteristics of attention-based image generation models, limiting their effectiveness. In this paper, we propose local representative token guided merging (ReToM), a novel token merging strategy applicable to any attention mechanism in image generation. To merge tokens based on various contextual information, ReToM defines local boundaries as windows within attention inputs and adjusts window sizes. Furthermore, we introduce a representative token per window, chosen by computing token similarities at a specific timestep and selecting the token with the highest average similarity. This approach preserves the most salient local features while minimizing computational overhead. Experimental results show that ReToM achieves a 6.2% improvement in FID and higher CLIP scores compared to the baseline, while maintaining comparable inference time. We empirically demonstrate that ReToM is effective in balancing visual quality and computational efficiency.
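The representative-token choice can be sketched directly; the cosine similarity and fixed window size below are assumptions consistent with the abstract, not ReToM's full merging procedure.

```python
# Pick, per window, the token most similar on average to its window-mates.
import torch
import torch.nn.functional as F

def representative_tokens(x, window=16):
    """x: (n_tokens, d) attention input. Returns one token index per window."""
    reps = []
    for start in range(0, x.size(0), window):
        w = F.normalize(x[start:start + window], dim=-1)
        sim = w @ w.T                                  # pairwise cosine similarity
        reps.append(start + int(sim.mean(dim=1).argmax()))
    return reps  # other tokens in each window are merged toward these
```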
Authors: Weijieying Ren, Jingxi Zhu, Zehao Liu, Tianxiang Zhao, Vasant Honavar
Abstract: Artificial intelligence (AI) has demonstrated significant potential in transforming healthcare through the analysis and modeling of electronic health records (EHRs). However, the inherent heterogeneity, temporal irregularity, and domain-specific nature of EHR data present unique challenges that differ fundamentally from those in vision and natural language tasks. This survey offers a comprehensive overview of recent advancements at the intersection of deep learning, large language models (LLMs), and EHR modeling. We introduce a unified taxonomy that spans five key design dimensions: data-centric approaches, neural architecture design, learning-focused strategies, multimodal learning, and LLM-based modeling systems. Within each dimension, we review representative methods addressing data quality enhancement, structural and temporal representation, self-supervised learning, and integration with clinical knowledge. We further highlight emerging trends such as foundation models, LLM-driven clinical agents, and EHR-to-text translation for downstream reasoning. Finally, we discuss open challenges in benchmarking, explainability, clinical alignment, and generalization across diverse clinical settings. This survey aims to provide a structured roadmap for advancing AI-driven EHR modeling and clinical decision support. For a comprehensive list of EHR-related methods, kindly refer to https://survey-on-tabular-data.github.io/.
Authors: Yufeng Luo, Adam D. Myers, Alex Drlica-Wagner, Dario Dematties, Salma Borchani, Frank Valdes, Arjun Dey, David Schlegel, Rongpu Zhou, DESI Legacy Imaging Surveys Team
Abstract: As the data volume of astronomical imaging surveys rapidly increases, traditional methods for image anomaly detection, such as visual inspection by human experts, are becoming impractical. We introduce a machine-learning-based approach to detect poor-quality exposures in large imaging surveys, with a focus on the DECam Legacy Survey (DECaLS) in regions of low extinction (i.e., $E(B-V)<0.04$). Our semi-supervised pipeline integrates a vision transformer (ViT), trained via self-supervised learning (SSL), with a k-Nearest Neighbor (kNN) classifier. We train and validate our pipeline using a small set of labeled exposures observed by surveys with the Dark Energy Camera (DECam). A clustering-space analysis of where our pipeline places images labeled in ``good'' and ``bad'' categories suggests that our approach can efficiently and accurately determine the quality of exposures. Applied to new imaging being reduced for DECaLS Data Release 11, our pipeline identifies 780 problematic exposures, which we subsequently verify through visual inspection. Being highly efficient and adaptable, our method offers a scalable solution for quality control in other large imaging surveys.
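The final classification stage is simple enough to sketch: nearest-neighbor voting in the frozen SSL embedding space. The embedding dimension, k, and the random stand-in features below are placeholders.

```python
# kNN over self-supervised ViT embeddings: label new exposures by their
# nearest labeled neighbors in feature space.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def classify_exposures(train_emb, train_labels, new_emb, k=5):
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(train_emb, train_labels)      # small labeled set of exposures
    return knn.predict(new_emb)           # 'good'/'bad' quality predictions

rng = np.random.default_rng(0)
preds = classify_exposures(rng.standard_normal((200, 384)),
                           rng.integers(0, 2, 200),
                           rng.standard_normal((10, 384)))
```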
Authors: Penglei Sun, Yaoxian Song, Xiangru Zhu, Xiang Liu, Qiang Wang, Yue Liu, Changqun Xia, Tiefeng Li, Yang Yang, Xiaowen Chu
Abstract: Scene understanding enables intelligent agents to interpret and comprehend their environment. While existing large vision-language models (LVLMs) for scene understanding have primarily focused on indoor household tasks, they face two significant limitations when applied to outdoor large-scale scene understanding. First, outdoor scenarios typically encompass larger-scale environments observed through various sensors from multiple viewpoints (e.g., bird view and terrestrial view), while existing indoor LVLMs mainly analyze single visual modalities within building-scale contexts from humanoid viewpoints. Second, existing LVLMs suffer from missing multidomain perception outdoor data and struggle to effectively integrate 2D and 3D visual information. To address the aforementioned limitations, we build the first multidomain perception outdoor scene understanding dataset, named \textbf{\underline{SVM-City}}, deriving from multi\textbf{\underline{S}}cale scenarios with multi\textbf{\underline{V}}iew and multi\textbf{\underline{M}}odal instruction tuning data. It contains $420$k images and $4,811$M point clouds with $567$k question-answering pairs from vehicles, low-altitude drones, high-altitude aerial planes, and satellites. To effectively fuse the multimodal data in the absence of one modality, we introduce incomplete multimodal learning to model outdoor scene understanding and design the LVLM named \textbf{\underline{City-VLM}}. Multimodal fusion is realized by constructing a joint probabilistic distribution space rather than implementing directly explicit fusion operations (e.g., concatenation). Experimental results on three typical outdoor scene understanding tasks show that City-VLM surpasses existing LVLMs on question-answering tasks by an average of $18.14\%$. Our method demonstrates practical and generalizable performance across multiple outdoor scenes.
Authors: Qianru Zhang, Chenglei Yu, Haixin Wang, Yudong Yan, Yuansheng Cao, Siu-Ming Yiu, Tailin Wu, Hongzhi Yin
Abstract: Time series prediction, a crucial task across various domains, faces significant challenges due to the inherent complexities of time series data, including non-stationarity, multi-scale periodicity, and transient dynamics, particularly when tackling long-term predictions. While Transformer-based architectures have shown promise, their quadratic complexity with sequence length hinders their efficiency for long-term predictions. Recent advancements in State-Space Models, such as Mamba, offer a more efficient alternative for long-term modeling, but they cannot capture multi-scale periodicity and transient dynamics effectively. Meanwhile, they are susceptible to noise in time series data. This paper proposes a novel framework, FLDmamba (Fourier and Laplace Transform Decomposition Mamba), addressing these limitations. FLDmamba leverages the strengths of both Fourier and Laplace transforms to effectively capture multi-scale periodicity and transient dynamics within time series data, and to improve the model's robustness to data noise. Our extensive experiments demonstrate that FLDmamba achieves superior performance on time series prediction benchmarks, outperforming both Transformer-based and other Mamba-based architectures. To promote the reproducibility of our method, we have made both the code and data accessible at \href{https://github.com/AI4Science-WestlakeU/FLDmamba}{https://github.com/AI4Science-WestlakeU/FLDmamba}.
URLs: https://github.com/AI4Science-WestlakeU/FLDmamba
Authors: Hui Sun, Yanfeng Ding, Liping Yi, Huidong Ma, Gang Wang, Xiaoguang Liu, Cheng Zhong, Wentong Cai
Abstract: Learning-based lossless compressors play a crucial role in large-scale genomic database backup, storage, transmission, and management. However, their 1) inadequate compression ratio, 2) low compression \& decompression throughput, and 3) poor compression robustness limit their widespread adoption and application in both industry and academia. To address these challenges, we propose a novel \underline{P}arallel \underline{M}ulti-\underline{K}nowledge \underline{L}earning-based \underline{C}ompressor (PMKLC) with four crucial designs: 1) we propose an automated multi-knowledge learning-based compression framework as the compressor's backbone to enhance compression ratio and robustness; 2) we design a GPU-accelerated ($s$,$k$)-mer encoder to optimize compression throughput and computing resource usage; 3) we introduce data block partitioning and Step-wise Model Passing (SMP) mechanisms for parallel acceleration; 4) we design two compression modes, PMKLC-S and PMKLC-M, to meet complex application scenarios, where the former runs on a resource-constrained single GPU and the latter is multi-GPU accelerated. We benchmark PMKLC-S/M and 14 baselines (7 traditional and 7 learning-based) on 15 real-world datasets with different species and data sizes. Compared to baselines on the testing datasets, PMKLC-S/M achieve average compression ratio improvements of up to 73.609\% and 73.480\%, and average throughput improvements of up to 3.036$\times$ and 10.710$\times$, respectively. Besides, PMKLC-S/M also achieve the best robustness and competitive memory cost, indicating greater stability against datasets with different probability distribution perturbations and a strong ability to run on memory-constrained devices.
Authors: Andrew Shin, Kunitake Kaneko
Abstract: Large language models (LLMs) excel at modeling relationships between strings in natural language and have shown promise in extending to other symbolic domains like coding or mathematics. However, the extent to which they implicitly model symbolic music remains underexplored. This paper investigates how LLMs represent musical concepts by generating symbolic music data from textual prompts describing combinations of genres and styles, and evaluating their utility through recognition and generation tasks. We produce a dataset of LLM-generated MIDI files without relying on explicit musical training. We then train neural networks entirely on this LLM-generated MIDI dataset and perform genre and style classification as well as melody completion, benchmarking their performance against established models. Our results demonstrate that LLMs can infer rudimentary musical structures and temporal relationships from text, highlighting both their potential to implicitly encode musical patterns and their limitations due to a lack of explicit musical context, shedding light on their generative capabilities for symbolic music.
Authors: Ju-Young Oh, Ho-Joong Kim, Seong-Whan Lee
Abstract: Video question answering (VQA) is a multimodal task that requires the interpretation of a video to answer a given question. Existing VQA methods primarily utilize question and answer (Q&A) pairs to learn the spatio-temporal characteristics of video content. However, these annotations are typically event-centric, which is not enough to capture the broader context of each video. The absence of essential details such as object types, spatial layouts, and descriptive attributes restricts the model to learning only a fragmented scene representation. This issue limits the model's capacity for generalization and higher-level reasoning. In this paper, we propose fundamental question generation with the integration of question embeddings for video question answering (FIQ), a novel approach designed to strengthen the reasoning ability of the model by enhancing the fundamental understanding of videos. FIQ generates Q&A pairs based on descriptions extracted from videos, enriching the training data with fundamental scene information. Generated Q&A pairs enable the model to understand the primary context, leading to enhanced generalizability and reasoning ability. Furthermore, we incorporate a VQ-CAlign module that aligns task-specific question embeddings with visual features, ensuring that essential domain-specific details are preserved to increase adaptability to downstream tasks. Experiments on SUTD-TrafficQA demonstrate that our FIQ achieves state-of-the-art performance compared to existing baseline methods.
Authors: Lulu Liu, Zhiyong Xiao
Abstract: Food is not only a core component of humans' daily diets, but also an important carrier of cultural heritage and emotional bonds. With the development of technology, the need for accurate classification of food images has grown, which is crucial for a variety of application scenarios. However, existing Convolutional Neural Networks (CNNs) face significant challenges when dealing with fine-grained food images that are similar in shape but subtle in detail. To address this challenge, this study presents an innovative method for classifying food images, named Feature-Enhanced TResNet (FE-TResNet), specifically designed to address fine-grained food images and accurately capture subtle features within them. The FE-TResNet method is based on the TResNet model and integrates Style-based Recalibration Module (StyleRM) and Deep Channel-wise Attention (DCA) technologies to enhance feature extraction capabilities. In experimental validation on Chinese food image datasets ChineseFoodNet and CNFOOD-241, the FE-TResNet method significantly improved classification accuracy, achieving rates of 81.37% and 80.29%, respectively, demonstrating its effectiveness and superiority in fine-grained food image classification.
Authors: Yuki Kondo, Norimichi Ukita, Riku Kanayama, Yuki Yoshida, Takayuki Yamaguchi, Xiang Yu, Guang Liang, Xinyao Liu, Guan-Zhang Wang, Wei-Ta Chu, Bing-Cheng Chuang, Jia-Hua Lee, Pin-Tseng Kuo, I-Hsuan Chu, Yi-Shein Hsiao, Cheng-Han Wu, Po-Yi Wu, Jui-Chien Tsou, Hsuan-Chi Liu, Chun-Yi Lee, Yuan-Fu Yang, Kosuke Shigematsu, Asuka Shin, Ba Tran
Abstract: Small Multi-Object Tracking (SMOT) is particularly challenging when targets occupy only a few dozen pixels, rendering detection and appearance-based association unreliable. Building on the success of the MVA2023 SOD4SB challenge, this paper introduces the SMOT4SB challenge, which leverages temporal information to address limitations of single-frame detection. Our three main contributions are: (1) the SMOT4SB dataset, consisting of 211 UAV video sequences with 108,192 annotated frames under diverse real-world conditions, designed to capture motion entanglement where both camera and targets move freely in 3D; (2) SO-HOTA, a novel metric combining Dot Distance with HOTA to mitigate the sensitivity of IoU-based metrics to small displacements; and (3) a competitive MVA2025 challenge with 78 participants and 308 submissions, where the winning method achieved a 5.1x improvement over the baseline. This work lays a foundation for advancing SMOT in UAV scenarios with applications in bird strike avoidance, agriculture, fisheries, and ecological monitoring.
Authors: Khang Truong, Lam Pham, Hieu Tang, Jasmin Lampert, Martin Boyer, Son Phan, Truong Nguyen
Abstract: Image captioning has emerged as a crucial task at the intersection of computer vision and natural language processing, enabling automated generation of descriptive text from visual content. In the context of remote sensing, image captioning plays a significant role in interpreting vast and complex satellite imagery, aiding applications such as environmental monitoring, disaster assessment, and urban planning. This motivates us, in this paper, to present a transformer-based network architecture for remote sensing image captioning (RSIC), in which multiple techniques, namely Static Expansion, Memory-Augmented Self-Attention, and the Mesh Transformer, are evaluated and integrated. We evaluate our proposed models using two benchmark remote sensing image datasets, UCM-Caption and NWPU-Caption. Our best model outperforms state-of-the-art systems on most evaluation metrics, demonstrating its potential for real-life remote sensing image systems.
Authors: Muhammad Fadhil Ginting, Dong-Ki Kim, Xiangyun Meng, Andrzej Reinke, Bandi Jai Krishna, Navid Kayhani, Oriana Peltzer, David D. Fan, Amirreza Shaban, Sung-Kyun Kim, Mykel J. Kochenderfer, Ali-akbar Agha-mohammadi, Shayegan Omidshafiei
Abstract: As robots become increasingly capable of operating over extended periods -- spanning days, weeks, and even months -- they are expected to accumulate knowledge of their environments and leverage this experience to assist humans more effectively. This paper studies the problem of Long-term Active Embodied Question Answering (LA-EQA), a new task in which a robot must both recall past experiences and actively explore its environment to answer complex, temporally-grounded questions. Unlike traditional EQA settings, which typically focus either on understanding the present environment alone or on recalling a single past observation, LA-EQA challenges an agent to reason over past, present, and possible future states, deciding when to explore, when to consult its memory, and when to stop gathering observations and provide a final answer. Standard EQA approaches based on large models struggle in this setting due to limited context windows, absence of persistent memory, and an inability to combine memory recall with active exploration. To address this, we propose a structured memory system for robots, inspired by the mind palace method from cognitive science. Our method encodes episodic experiences as scene-graph-based world instances, forming a reasoning and planning algorithm that enables targeted memory retrieval and guided navigation. To balance the exploration-recall trade-off, we introduce a value-of-information-based stopping criterion that determines when the agent has gathered sufficient information. We evaluate our method on real-world experiments and introduce a new benchmark that spans popular simulation environments and actual industrial sites. Our approach significantly outperforms state-of-the-art baselines, yielding substantial gains in both answer accuracy and exploration efficiency.
Authors: Chongli Qin, Jost Tobias Springenberg
Abstract: Behavior Cloning (BC) on curated (or filtered) data is the predominant paradigm for supervised fine-tuning (SFT) of large language models, as well as for imitation learning of control policies. Here, we draw on a connection between this successful strategy and the theory and practice of finding optimal policies via Reinforcement Learning (RL). Building on existing literature, we clarify that SFT can be understood as maximizing a lower bound on the RL objective in a sparse reward setting, which lends support to its often-observed good performance. From this viewpoint, we realize that a small modification to SFT leads to an importance weighted variant that behaves closer to training with RL as it: i) optimizes a tighter bound on the RL objective and ii) can improve performance compared to SFT on curated data. We refer to this variant as importance weighted supervised fine-tuning (iw-SFT). We show that it is easy to implement and can be further generalized to training with quality scored data. The resulting SFT variants are competitive with more advanced RL algorithms for large language models and for training policies in continuous control tasks, for example achieving 66.7% on the AIME 2024 dataset.
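The shape of the modification can be sketched in a few lines; the clipping and the exact form of the weights are assumptions here, standing in for the paper's derivation.

```python
# Importance weighted SFT: reweight the per-sequence NLL by a detached ratio
# between the current policy and a reference policy over the curated data.
import torch

def iw_sft_loss(logp_current, logp_reference, clip=10.0):
    """logp_*: (batch,) sequence log-probs under trained / reference models."""
    with torch.no_grad():
        w = torch.exp(logp_current - logp_reference).clamp(max=clip)
    return -(w * logp_current).mean()  # plain SFT is recovered when w == 1
```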
Authors: Jinqiu Jin, Yang Zhang, Junwei Pan, Fuli Feng, Hua Lu, Haijie Gu, Xiangnan He
Abstract: Recently, there has been a surge of interest in Multi-Target Cross-Domain Recommendation (MTCDR), which aims to enhance recommendation performance across multiple domains simultaneously. Existing MTCDR methods primarily rely on domain-shared entities (\eg users or items) to fuse and transfer cross-domain knowledge, which may be unavailable in non-overlapped recommendation scenarios. Some studies model user preferences and item features as domain-sharable semantic representations, which can be utilized to tackle the MTCDR task. Nevertheless, they often require extensive auxiliary data for pre-training. Developing more effective solutions for MTCDR remains an important area for further exploration. Inspired by recent advancements in generative recommendation, this paper introduces GMC, a generative paradigm-based approach for multi-target cross-domain recommendation. The core idea of GMC is to leverage semantically quantized discrete item identifiers as a medium for integrating multi-domain knowledge within a unified generative model. GMC first employs an item tokenizer to generate domain-shared semantic identifiers for each item, and then formulates item recommendation as a next-token generation task by training a domain-unified sequence-to-sequence model. To further leverage the domain information to enhance performance, we incorporate a domain-aware contrastive loss into the semantic identifier learning, and perform domain-specific fine-tuning on the unified recommender. Extensive experiments on five public datasets demonstrate the effectiveness of GMC compared to a range of baseline methods.
Authors: Rohit Prasad
Abstract: Transformers have revolutionized deep learning with applications in natural language processing, computer vision, and beyond. However, their computational demands make it challenging to deploy them on low-power edge devices. This paper introduces an ultra-low-power, Coarse-Grained Reconfigurable Array (CGRA) architecture specifically designed to accelerate General Matrix Multiplication (GEMM) operations in transformer models tailored for the energy and resource constraints of edge applications. The proposed architecture integrates a 4 x 4 array of Processing Elements (PEs) for efficient parallel computation and dedicated 4 x 2 Memory Operation Blocks (MOBs) for optimized LOAD/STORE operations, reducing memory bandwidth demands and enhancing data reuse. A switchless mesh torus interconnect network further minimizes power and latency by enabling direct communication between PEs and MOBs, eliminating the need for centralized switching. Through its heterogeneous array design and efficient dataflow, this CGRA architecture addresses the unique computational needs of transformers, offering a scalable pathway to deploy sophisticated machine learning models on edge devices.
Authors: Yifan Xu, Chao Zhang, Hanqi Jiang, Xiaoyan Wang, Ruifei Ma, Yiwei Li, Zihao Wu, Zeju Li, Xiangde Liu
Abstract: Advancements in foundation models have made it possible to conduct applications in various downstream tasks. In particular, the new era has witnessed a remarkable capability to extend Large Language Models (LLMs) to tackle 3D scene understanding tasks. Current methods rely heavily on 3D point clouds, but the 3D point cloud reconstruction of an indoor scene often results in information loss. Some textureless planes or repetitive patterns are prone to omission and manifest as voids within the reconstructed 3D point clouds. Besides, objects with complex structures tend to introduce distortion of details caused by misalignments between the captured images and the dense reconstructed point clouds. 2D multi-view images present visual consistency with 3D point clouds and provide more detailed representations of scene components, which can naturally compensate for these deficiencies. Based on these insights, we propose Argus, a novel 3D multimodal framework that leverages multi-view images for enhanced 3D scene understanding with LLMs. In general, Argus can be treated as a 3D Large Multimodal Foundation Model (3D-LMM), since it takes various modalities as input (text instructions, 2D multi-view images, and 3D point clouds) and expands the capability of LLMs to tackle 3D tasks. Argus fuses multi-view images and camera poses into view-as-scene features, which interact with the 3D features to create comprehensive and detailed 3D-aware scene embeddings. Our approach compensates for the information loss while reconstructing 3D point clouds and helps LLMs better understand the 3D world. Extensive experiments demonstrate that our method outperforms existing 3D-LMMs in various downstream tasks.
Authors: Yihong Wang, Zhonglin Jiang, Ningyuan Xi, Yue Zhao, Qingqing Gu, Xiyuan Chen, Hao Wu, Sheng Xu, Hange Zhou, Yong Chen, Luo Ji
Abstract: Decoder-only language models, such as GPT and LLaMA, generally decode on the last layer. Motivated by humans' hierarchical thinking capability, we propose that a hierarchical decoder architecture could be built with different layers decoding texts simultaneously. Due to limited time and computational resources, we choose to adapt a pretrained language model into this form of hierarchical decoder. Language heads of the last layer are copied to different selected intermediate layers and fine-tuned with different task inputs. Through thorough experiments, we validate that these selected intermediate layers can be adapted to produce meaningful and coherent content, and that this hierarchical decoder paradigm can obtain state-of-the-art performance on multiple tasks such as hierarchical text classification, classification-guided generation, and hierarchical text generation. This study suggests the possibility of a generalized hierarchical reasoner, pretrained from scratch.
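The adaptation step, copying the final language head onto intermediate layers, can be sketched as follows; the HuggingFace-style attribute names (`lm_head`, `output_hidden_states`) are assumptions that vary by model family.

```python
# Attach copies of the final LM head to selected intermediate layers so those
# layers can decode text after fine-tuning.
import copy
import torch.nn as nn

def add_intermediate_heads(model, layer_ids):
    """Returns trainable heads keyed by the layer index they decode from."""
    return nn.ModuleDict({str(i): copy.deepcopy(model.lm_head) for i in layer_ids})

# Usage sketch: run the transformer with output_hidden_states=True, then apply
# heads[str(i)] to hidden_states[i] to get next-token logits from layer i.
```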
Authors: Dongyeun Lee, Jiwan Hur, Hyounguk Shon, Jae Young Lee, Junmo Kim
Abstract: Diffusion models have achieved remarkable success in image generation but come with significant computational costs, posing challenges for deployment in resource-constrained environments. Recent post-training quantization (PTQ) methods have attempted to mitigate this issue by focusing on the iterative nature of diffusion models. However, these approaches often overlook outliers, leading to degraded performance at low bit-widths. In this paper, we propose DMQ, which combines Learned Equivalent Scaling (LES) and channel-wise Power-of-Two Scaling (PTS) to effectively address these challenges. Learned Equivalent Scaling optimizes channel-wise scaling factors to redistribute quantization difficulty between weights and activations, reducing overall quantization error. Recognizing that early denoising steps, despite having small quantization errors, crucially impact the final output due to error accumulation, we incorporate an adaptive timestep weighting scheme to prioritize these critical steps during learning. Furthermore, identifying that layers such as skip connections exhibit high inter-channel variance, we introduce channel-wise Power-of-Two Scaling for activations. To ensure robust selection of PTS factors even with a small calibration set, we introduce a voting algorithm that enhances reliability. Extensive experiments demonstrate that our method significantly outperforms existing works, especially at low bit-widths such as W4A6 (4-bit weight, 6-bit activation) and W4A8, maintaining high image generation quality and model stability. The code is available at https://github.com/LeeDongYeun/dmq.
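Power-of-two scaling is easy to illustrate in miniature: snapping each channel's scale to the nearest power of two makes rescaling a bit shift in integer arithmetic. The sketch below covers only this idea; DMQ's learned scaling, timestep weighting, and voting are not reproduced.

```python
# Channel-wise power-of-two activation scales for symmetric quantization.
import torch

def pts_scales(calib_acts):
    """calib_acts: (n_samples, channels) calibration activations."""
    amax = calib_acts.abs().amax(dim=0).clamp(min=1e-8)
    return 2.0 ** torch.round(torch.log2(amax))  # nearest 2^k per channel

def quantize(acts, scales, bits=6):
    qmax = 2 ** (bits - 1) - 1
    return torch.clamp(torch.round(acts / scales * qmax), -qmax, qmax)
```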
Authors: Shirui Zhao, Jun Yin, Lingyun Yao, Martin Andraud, Wannes Meert, Marian Verhelst
Abstract: An increasing number of applications are exploiting sampling-based algorithms for planning, optimization, and inference. The Markov Chain Monte Carlo (MCMC) algorithms form the computational backbone of this emerging branch of machine learning. Unfortunately, the high computational cost limits their feasibility for large-scale problems and real-world applications, and the existing MCMC acceleration solutions are either limited in hardware flexibility or fail to maintain efficiency at the system level across a variety of end-to-end applications. This paper introduces \textbf{MC$^2$A}, an algorithm-hardware co-design framework, enabling efficient and flexible optimization for MCMC acceleration. Firstly, \textbf{MC$^2$A} analyzes the MCMC workload diversity through an extension of the processor performance roofline model with a 3rd dimension to derive the optimal balance between the compute, sampling and memory parameters. Secondly, \textbf{MC$^2$A} proposes a parametrized hardware accelerator architecture with flexible and efficient support of MCMC kernels with a pipeline of ISA-programmable tree-structured processing units, reconfigurable samplers and a crossbar interconnect to support irregular access. Thirdly, the core of \textbf{MC$^2$A} is powered by a novel Gumbel sampler that eliminates exponential and normalization operations. In the end-to-end case study, \textbf{MC$^2$A} achieves an overall {$307.6\times$, $1.4\times$, $2.0\times$, $84.2\times$} speedup compared to the CPU, GPU, TPU and state-of-the-art MCMC accelerator. Evaluated on various representative MCMC workloads, this work demonstrates and exploits the feasibility of general hardware acceleration to popularize MCMC-based solutions in diverse application domains.
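The sampler's key trick is standard enough to show in a few lines: the Gumbel-max identity turns categorical sampling into an argmax over noisy logits, with no exponentiation or normalization, which is what makes it attractive in hardware. The NumPy reference below captures only the mathematical idea, not the MC$^2$A circuit.

```python
# Gumbel-max trick: argmax(logits + Gumbel noise) samples from softmax(logits).
import numpy as np

def gumbel_sample(logits, rng):
    return int(np.argmax(logits + rng.gumbel(size=logits.shape)))

rng = np.random.default_rng(0)
draws = [gumbel_sample(np.array([2.0, 1.0, 0.0]), rng) for _ in range(10_000)]
# np.bincount(draws) / 10_000 approximates softmax([2, 1, 0]) ~ [0.665, 0.245, 0.090]
```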
Authors: Zhichao Sheng, Shilin Zhou, Chen Gong, Zhenghua Li
Abstract: Spoken Language Understanding (SLU) plays a crucial role in speech-centric multimedia applications, enabling machines to comprehend spoken language in scenarios such as meetings, interviews, and customer service interactions. SLU encompasses multiple tasks, including Automatic Speech Recognition (ASR), spoken Named Entity Recognition (NER), and spoken Sentiment Analysis (SA). However, existing methods often rely on separate model architectures for individual tasks such as spoken NER and SA, which increases system complexity, limits cross-task interaction, and fails to fully exploit heterogeneous datasets available across tasks. To address these limitations, we propose UniSLU, a unified framework that jointly models multiple SLU tasks within a single architecture. Specifically, we propose a unified representation for diverse SLU tasks, enabling full utilization of heterogeneous datasets across multiple tasks. Built upon this representation, we propose a unified generative method that jointly models ASR, spoken NER, and SA tasks, enhancing task interactions and enabling seamless integration with large language models to harness their powerful generative capabilities. Extensive experiments on public SLU datasets demonstrate the effectiveness of our approach, achieving superior SLU performance compared to several benchmark methods and making it well-suited for real-world speech-based multimedia scenarios. We will release all code and models on GitHub to facilitate future research.
Authors: Nerma Kadric, Amila Akagic, Medina Kapo
Abstract: Pigmented skin lesions represent localized areas of increased melanin and can indicate serious conditions like melanoma, a major contributor to skin cancer mortality. The MedMNIST v2 dataset, inspired by MNIST, was recently introduced to advance research in biomedical imaging and includes DermaMNIST, a dataset for classifying pigmented lesions based on the HAM10000 dataset. This study assesses ResNet-50 and EfficientNetV2L models for multi-class classification using DermaMNIST, employing transfer learning and various layer configurations. One configuration achieves results that match or surpass existing methods. This study suggests that convolutional neural networks (CNNs) can drive progress in biomedical image analysis, significantly enhancing diagnostic accuracy.
Authors: Ammar Ahmed, Ali Shariq Imran, Zenun Kastrati, Sher Muhammad Daudpota
Abstract: Wrist pathologies are frequently observed, particularly among children who constitute the majority of fracture cases. However, diagnosing these conditions is time-consuming and requires specialized expertise. Computer vision presents a promising avenue, contingent upon the availability of extensive datasets, a notable challenge in medical imaging. Therefore, reliance solely on one modality, such as images, proves inadequate, especially in an era of diverse and plentiful data types. In this study, we employ a multifaceted approach to address the challenge of recognizing wrist pathologies using an extremely limited dataset. Initially, we approach the problem as a fine-grained recognition task, aiming to identify subtle X-ray pathologies that conventional CNNs overlook. Secondly, we enhance network performance by fusing patient metadata with X-ray images. Thirdly, rather than pre-training on a coarse-grained dataset like ImageNet, we utilize weights trained on a fine-grained dataset. While metadata integration has been used in other medical domains, this is a novel application for wrist pathologies. Our results show that a fine-grained strategy and metadata integration improve diagnostic accuracy by 2% with a limited dataset and by over 10% with a larger fracture-focused dataset.
Authors: Youssef Tawfilis, Hossam Amer, Minar El-Aasser, Tallal Elshabrawy
Abstract: Federated Learning has gained increasing attention for its ability to enable multiple nodes to collaboratively train machine learning models without sharing their raw data. At the same time, Generative AI -- particularly Generative Adversarial Networks (GANs) -- has achieved remarkable success across a wide range of domains, such as healthcare, security, and image generation. However, training generative models typically requires large datasets and significant computational resources, which are often unavailable in real-world settings. Acquiring such resources can be costly and inefficient, especially when many underutilized devices -- such as IoT devices and edge devices -- with varying capabilities remain idle. Moreover, obtaining large datasets is challenging due to privacy concerns and copyright restrictions, as most devices are unwilling to share their data. To address these challenges, we propose a novel approach for decentralized GAN training that enables the utilization of distributed data and underutilized, low-capability devices while not sharing data in its raw form. Our approach is designed to tackle key challenges in decentralized environments, combining KLD-weighted Clustered Federated Learning to address the issues of data heterogeneity and multi-domain datasets, with Heterogeneous U-Shaped split learning to tackle the challenge of device heterogeneity under strict data sharing constraints -- ensuring that no labels or raw data, whether real or synthetic, are ever shared between nodes. Experimental results show that our approach delivers consistent and significant improvements across key performance metrics: it achieves 1.1x -- 2.2x higher image generation scores and an average 10% boost in classification metrics (up to 50% in multi-domain non-IID settings), at much lower latency than several benchmarks. Find our code at https://github.com/youssefga28/HuSCF-GAN.
Authors: Maximiliano Hormaz\'abal Lagos, \'Alvaro Bueno S\'aez, H\'ector Cerezo-Costas, Pedro Alonso Doval, Jorge Alcalde Vesteiro
Abstract: This paper presents our approach for the IberLEF 2025 Task PRESTA: Preguntas y Respuestas sobre Tablas en Espa\~nol (Questions and Answers about Tables in Spanish). Our solution answers the questions by using LLMs to generate Python code that filters and processes the table. This solution evolves from the MRT implementation for the related SemEval 2025 task. The process consists of multiple steps: analyzing and understanding the content of the table, selecting the useful columns, generating instructions in natural language, translating these instructions to code, running it, and handling potential errors or exceptions. These steps use open-source LLMs and prompts optimized for each step. With this approach, we achieved an accuracy score of 85\% in the task.
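The generate-run-repair loop can be sketched as follows; the prompt wording, the `llm` callable, and the retry policy are placeholders, not the authors' exact pipeline.

```python
# Table QA by code generation: ask an LLM for pandas code, execute it, and
# feed any exception back for a repair attempt.
import pandas as pd

def answer(table: pd.DataFrame, question: str, llm, max_retries=3):
    prompt = (f"Columns: {list(table.columns)}\n"
              f"Question: {question}\n"
              "Write Python that uses a dataframe `df` and stores the answer "
              "in a variable named `result`.")
    for _ in range(max_retries):
        code = llm(prompt)
        try:
            scope = {"df": table.copy()}
            exec(code, scope)              # run the generated snippet
            return scope["result"]
        except Exception as err:           # the error-handling step of the pipeline
            prompt += f"\nPrevious code failed with: {err}\nPlease fix it."
    return None
```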
Authors: Nikita Koriagin, Yaroslav Aksenov, Daniil Laptev, Gleb Gerasimov, Nikita Balagansky, Daniil Gavrilov
Abstract: Sparse Autoencoders have emerged as powerful tools for interpreting the internal representations of Large Language Models, yet they often fail to capture domain-specific features not prevalent in their training corpora. This paper introduces a residual learning approach that addresses this feature blindness without requiring complete retraining. We propose training a secondary SAE specifically to model the reconstruction error of a pretrained SAE on domain-specific texts, effectively capturing features missed by the primary model. By summing the outputs of both models during inference, we demonstrate significant improvements in both LLM cross-entropy and explained variance metrics across multiple specialized domains. Our experiments show that this method efficiently incorporates new domain knowledge into existing SAEs while maintaining their performance on general tasks. This approach enables researchers to selectively enhance SAE interpretability for specific domains of interest, opening new possibilities for targeted mechanistic interpretability of LLMs.
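The residual-learning recipe reduces to a few lines; the SAE architecture, sizes, and training details below are placeholder assumptions illustrating the train-on-error, sum-at-inference structure described above.

```python
# Secondary SAE trained on the primary SAE's reconstruction error; outputs
# are summed at inference.
import torch
import torch.nn as nn

class SAE(nn.Module):
    def __init__(self, d, hidden):
        super().__init__()
        self.enc, self.dec = nn.Linear(d, hidden), nn.Linear(hidden, d)

    def forward(self, x):
        return self.dec(torch.relu(self.enc(x)))

primary = SAE(768, 8192).eval()    # pretrained and frozen
residual = SAE(768, 2048)          # trained only on domain-specific texts

def train_step(x, opt):
    err = x - primary(x).detach()               # what the primary SAE misses
    loss = ((residual(err) - err) ** 2).mean()  # model the missed features
    opt.zero_grad(); loss.backward(); opt.step()
    return float(loss)

def reconstruct(x):
    return primary(x) + residual(x - primary(x))  # summed at inference
```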
Authors: Kossi Amouzouvi, Bowen Song, Andrea Coletta, Luigi Bellomarini, Jens Lehmann, Sahar Vahdati
Abstract: Knowledge graph representation learning approaches provide a mapping between symbolic knowledge in the form of triples in a knowledge graph (KG) and their feature vectors. Knowledge graph embedding (KGE) models often represent relations in a KG as geometric transformations. Most state-of-the-art (SOTA) KGE models are derived from elementary geometric transformations (EGTs), such as translation, scaling, rotation, and reflection, or their combinations. These geometric transformations enable the models to effectively preserve specific structural and relational patterns of the KG. However, the current use of EGTs by KGEs remains insufficient without considering relation-specific transformations. Although recent models have attempted to address this problem by ensembling SOTA baseline models in different ways, only a single or composite version of geometric transformations is used by such baselines to represent all the relations. In this paper, we propose a framework that evaluates how well each relation fits with different geometric transformations. Based on this ranking, the model can: (1) assign the best-matching transformation to each relation, or (2) use majority voting to choose one transformation type to apply across all relations. That is, the model learns a single relation-specific EGT in a low-dimensional vector space through an attention mechanism. Furthermore, we use the correlation between relations and EGTs, learned in a low dimension, for relation embeddings in a high-dimensional vector space. The effectiveness of our models is demonstrated through comprehensive evaluations on three benchmark KGs as well as a real-world financial KG, achieving performance comparable to leading models.
Authors: Liuyi Wang, Xinyuan Xia, Hui Zhao, Hanqing Wang, Tai Wang, Yilun Chen, Chengju Liu, Qijun Chen, Jiangmiao Pang
Abstract: Recent Vision-and-Language Navigation (VLN) advancements are promising, but their idealized assumptions about robot movement and control fail to reflect physically embodied deployment challenges. To bridge this gap, we introduce VLN-PE, a physically realistic VLN platform supporting humanoid, quadruped, and wheeled robots. For the first time, we systematically evaluate several ego-centric VLN methods in physical robotic settings across different technical pipelines, including classification models for single-step discrete action prediction, a diffusion model for dense waypoint prediction, and a training-free, map-based large language model (LLM) integrated with path planning. Our results reveal significant performance degradation due to limited robot observation space, environmental lighting variations, and physical challenges like collisions and falls. This also exposes locomotion constraints for legged robots in complex environments. VLN-PE is highly extensible, allowing seamless integration of new scenes beyond MP3D, thereby enabling more comprehensive VLN evaluation. Despite the weak generalization of current models in physical deployment, VLN-PE provides a new pathway for improving overall cross-embodiment adaptability. We hope our findings and tools inspire the community to rethink VLN limitations and advance robust, practical VLN models. The code is available at https://crystalsixone.github.io/vln_pe.github.io/.
Authors: Vincenzo Dentamaro, Felice Franchini, Giuseppe Pirlo, Irina Voiculescu
Abstract: Robust XAI techniques should ideally be simultaneously deterministic, model agnostic, and guaranteed to converge. We propose MULTIDIMENSIONAL PROBLEM AGNOSTIC EXPLAINABLE AI (MUPAX), a deterministic, model agnostic explainability technique with guaranteed convergence. MUPAX's measure-theoretic formulation gives principled feature importance attribution through structured perturbation analysis that discovers inherent input patterns and eliminates spurious relationships. We evaluate MUPAX on an extensive range of data modalities and tasks: audio classification (1D), image classification (2D), volumetric medical image analysis (3D), and anatomical landmark detection, demonstrating dimension agnostic effectiveness. The rigorous convergence guarantees extend to any loss function and arbitrary dimensions, making MUPAX applicable to virtually any problem context for AI. By contrast with other XAI methods that typically decrease performance when masking, MUPAX not only preserves but actually enhances model accuracy by capturing only the most important patterns of the original data. Extensive benchmarking against the state of the art in XAI demonstrates MUPAX's ability to generate precise, consistent, and understandable explanations, a crucial step towards explainable and trustworthy AI systems. The source code will be released upon publication.
Authors: Adithyavairavan Murali, Balakumar Sundaralingam, Yu-Wei Chao, Wentao Yuan, Jun Yamada, Mark Carlson, Fabio Ramos, Stan Birchfield, Dieter Fox, Clemens Eppner
Abstract: Grasping is a fundamental robot skill, yet despite significant research advancements, learning-based 6-DOF grasping approaches are still not turnkey and struggle to generalize across different embodiments and in-the-wild settings. We build upon the recent success in modeling the object-centric grasp generation process as an iterative diffusion process. Our proposed framework, GraspGen, consists of a DiffusionTransformer architecture that enhances grasp generation, paired with an efficient discriminator to score and filter sampled grasps. We introduce a novel and performant on-generator training recipe for the discriminator. To scale GraspGen across diverse objects and grippers, we release a new simulated dataset consisting of over 53 million grasps. We demonstrate that GraspGen outperforms prior methods in simulations with singulated objects across different grippers, achieves state-of-the-art performance on the FetchBench grasping benchmark, and performs well on a real robot with noisy visual observations.
Authors: Maulana Bisyir Azhari, David Hyunchul Shim
Abstract: Learning-based monocular visual odometry (VO) faces challenges in robustness, generalization, and efficiency in robotics. Recent advances in visual foundation models, such as DINOv2, have improved robustness and generalization in various vision tasks, yet their integration in VO remains limited due to coarse feature granularity. In this paper, we present DINO-VO, a feature-based VO system leveraging the DINOv2 visual foundation model for its sparse feature matching. To address the integration challenge, we propose a salient keypoint detector tailored to DINOv2's coarse features. Furthermore, we complement DINOv2's robust semantic features with fine-grained geometric features, resulting in more localizable representations. Finally, a transformer-based matcher and a differentiable pose estimation layer enable precise camera motion estimation by learning good matches. Against prior detector-descriptor networks like SuperPoint, DINO-VO demonstrates greater robustness in challenging environments. Furthermore, we show superior accuracy and generalization of the proposed feature descriptors against standalone DINOv2 coarse features. DINO-VO outperforms prior frame-to-frame VO methods on the TartanAir and KITTI datasets and is competitive on the EuRoC dataset, while running efficiently at 72 FPS with less than 1GB of memory usage on a single GPU. Moreover, it performs competitively against Visual SLAM systems on outdoor driving scenarios, showcasing its generalization capabilities.
Authors: Xiangyu Dong, Haoran Zhao, Jiang Gao, Haozhou Li, Xiaoguang Ma, Yaoming Zhou, Fuhai Chen, Juan Liu
Abstract: Recent advances in vision-language navigation (VLN) were mainly attributed to emerging large language models (LLMs). These methods exhibited excellent generalization capabilities in instruction understanding and task reasoning. However, they were constrained by the fixed knowledge bases and reasoning abilities of LLMs, preventing them from fully incorporating experiential knowledge and thus lacking efficient evolutionary capacity. To address this, we drew inspiration from the evolution capabilities of natural agents and proposed a self-evolving VLN framework (SE-VLN) to endow VLN agents with the ability to continuously evolve during testing. To the best of our knowledge, this was the first multimodal LLM-powered self-evolving VLN framework. Specifically, SE-VLN comprised three core modules, i.e., a hierarchical memory module to transfer successful and failure cases into reusable knowledge, a retrieval-augmented thought-based reasoning module to retrieve experience and enable multi-step decision-making, and a reflection module to realize continual evolution. Comprehensive tests illustrated that SE-VLN achieved navigation success rates of 57% and 35.2% in unseen environments, representing absolute performance improvements of 23.9% and 15.0% over current state-of-the-art methods on the R2R and REVERSE datasets, respectively. Moreover, SE-VLN showed performance improvements as the experience repository grew, elucidating its great potential as a self-evolving agent framework for VLN.
Authors: Hao Sun, Mihaela van der Schaar
Abstract: In the era of Large Language Models (LLMs), alignment has emerged as a fundamental yet challenging problem in the pursuit of more reliable, controllable, and capable machine intelligence. The recent success of reasoning models and conversational AI systems has underscored the critical role of reinforcement learning (RL) in enhancing these systems, driving increased research interest at the intersection of RL and LLM alignment. This paper provides a comprehensive review of recent advances in LLM alignment through the lens of inverse reinforcement learning (IRL), emphasizing the distinctions between RL techniques employed in LLM alignment and those in conventional RL tasks. In particular, we highlight the necessity of constructing neural reward models from human data and discuss the formal and practical implications of this paradigm shift. We begin by introducing fundamental concepts in RL to provide a foundation for readers unfamiliar with the field. We then examine recent advances in this research agenda, discussing key challenges and opportunities in conducting IRL for LLM alignment. Beyond methodological considerations, we explore practical aspects, including datasets, benchmarks, evaluation metrics, infrastructure, and computationally efficient training and inference techniques. Finally, we draw insights from the literature on sparse-reward RL to identify open questions and potential research directions. By synthesizing findings from diverse studies, we aim to provide a structured and critical overview of the field, highlight unresolved challenges, and outline promising future directions for improving LLM alignment through RL and IRL techniques.
Authors: Arian Mousakhan, Sudhanshu Mittal, Silvio Galesso, Karim Farid, Thomas Brox
Abstract: Existing world models for autonomous driving struggle with long-horizon generation and generalization to challenging scenarios. In this work, we develop a model using simple design choices, and without additional supervision or sensors, such as maps, depth, or multiple cameras. We show that our model yields state-of-the-art performance, despite having only 469M parameters and being trained on 280h of video data. It particularly stands out in difficult scenarios like turning maneuvers and urban traffic. We test whether discrete token models possibly have advantages over continuous models based on flow matching. To this end, we set up a hybrid tokenizer that is compatible with both approaches and allows for a side-by-side comparison. Our study concludes in favor of the continuous autoregressive model, which is less brittle on individual design choices and more powerful than the model built on discrete tokens. Code, models and qualitative results are publicly available at https://lmb-freiburg.github.io/orbis.github.io/.
Authors: Jeremy McHugh, Kristina \v{S}ekrst, Jon Cefalu
Abstract: Prompt injection attacks, where malicious input is designed to manipulate AI systems into ignoring their original instructions and following unauthorized commands instead, were first discovered by Preamble, Inc. in May 2022 and responsibly disclosed to OpenAI. Over the last three years, these attacks have continued to pose a critical security threat to LLM-integrated systems. The emergence of agentic AI systems, where LLMs autonomously perform multistep tasks through tools and coordination with other agents, has fundamentally transformed the threat landscape. Modern prompt injection attacks can now combine with traditional cybersecurity exploits to create hybrid threats that systematically evade traditional security controls. This paper presents a comprehensive analysis of Prompt Injection 2.0, examining how prompt injections integrate with Cross-Site Scripting (XSS), Cross-Site Request Forgery (CSRF), and other web security vulnerabilities to bypass traditional security measures. We build upon Preamble's foundational research and mitigation technologies, evaluating them against contemporary threats, including AI worms, multi-agent infections, and hybrid cyber-AI attacks. Our analysis incorporates recent benchmarks that demonstrate how traditional web application firewalls, XSS filters, and CSRF tokens fail against AI-enhanced attacks. We also present architectural solutions that combine prompt isolation, runtime security, and privilege separation with novel threat detection capabilities.
Authors: Kutub Uddin, Awais Khan, Muhammad Umar Farooq, Khalid Malik
Abstract: Audio plays a crucial role in applications like speaker verification, voice-enabled smart devices, and audio conferencing. However, audio manipulations, such as deepfakes, pose significant risks by enabling the spread of misinformation. Our empirical analysis reveals that existing methods for detecting deepfake audio are often vulnerable to anti-forensic (AF) attacks, particularly attacks crafted using generative adversarial networks. In this article, we propose a novel collaborative learning method called SHIELD to defend against generative AF attacks. To expose AF signatures, we integrate an auxiliary generative model, called the defense (DF) generative model, which facilitates collaborative learning by combining input and output. Furthermore, we design a triplet model to capture the correlations between real and AF-attacked audios and their real-generated and attacked-generated counterparts from auxiliary generative models. The proposed SHIELD strengthens the defense against generative AF attacks and achieves robust performance across various generative models. The proposed AF attack significantly reduces the average detection accuracy from 95.49% to 59.77% on ASVspoof2019, from 99.44% to 38.45% on In-the-Wild, and from 98.41% to 51.18% on HalfTruth for three different generative models. The proposed SHIELD mechanism is robust against AF attacks and achieves average accuracies of 98.13%, 98.58%, and 99.57% in match settings, and 98.78%, 98.62%, and 98.85% in mismatch settings for the ASVspoof2019, In-the-Wild, and HalfTruth datasets, respectively.
Authors: Suzie Kim, Hye-Bin Shin, Seong-Whan Lee
Abstract: Conventional reinforcement learning (RL) approaches often struggle to learn effective policies under sparse reward conditions, necessitating the manual design of complex, task-specific reward functions. To address this limitation, reinforcement learning from human feedback (RLHF) has emerged as a promising strategy that complements hand-crafted rewards with human-derived evaluation signals. However, most existing RLHF methods depend on explicit feedback mechanisms such as button presses or preference labels, which disrupt the natural interaction process and impose a substantial cognitive load on the user. We propose a novel reinforcement learning from implicit human feedback (RLIHF) framework that utilizes non-invasive electroencephalography (EEG) signals, specifically error-related potentials (ErrPs), to provide continuous, implicit feedback without requiring explicit user intervention. The proposed method adopts a pre-trained decoder to transform raw EEG signals into probabilistic reward components, enabling effective policy learning even in the presence of sparse external rewards. We evaluate our approach in a simulation environment built on the MuJoCo physics engine, using a Kinova Gen2 robotic arm to perform a complex pick-and-place task that requires avoiding obstacles while manipulating target objects. The results show that agents trained with decoded EEG feedback achieve performance comparable to those trained with dense, manually designed rewards. These findings validate the potential of using implicit neural feedback for scalable and human-aligned reinforcement learning in interactive robotics.
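A minimal sketch of how decoded ErrP probabilities could shape a sparse reward, with a placeholder decoder standing in for the paper's pre-trained model:

```python
import numpy as np

def decode_errp_prob(eeg_window: np.ndarray) -> float:
    """Placeholder for a pretrained ErrP decoder: returns P(error | EEG)."""
    return 0.5  # replace with a real decoder's output

def shaped_reward(env_reward: float, eeg_window: np.ndarray,
                  scale: float = 1.0) -> float:
    p_error = decode_errp_prob(eeg_window)
    # Reward is higher when the human implicitly judges the action as correct;
    # the shaping term lies in [-scale, +scale].
    return env_reward + scale * (1.0 - 2.0 * p_error)

eeg = np.zeros((8, 250))          # 8 channels x 1 s window (illustrative)
print(shaped_reward(0.0, eeg))    # neutral decoder output leaves reward at 0
```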
Authors: Hongyang Zhao, Tianyu Liang, Sina Davari, Daeho Kim
Abstract: While recent advancements in deep neural networks (DNNs) have substantially enhanced visual AI's capabilities, the challenge of inadequate data diversity and volume remains, particularly in the construction domain. This study presents a novel image synthesis methodology tailored for construction worker detection, leveraging the generative-AI platform Midjourney. The approach entails generating a collection of 12,000 synthetic images by formulating 3,000 different prompts, with an emphasis on image realism and diversity. These images, after manual labeling, serve as a dataset for DNN training. Evaluation on a real construction image dataset yielded promising results, with the model attaining average precisions (APs) of 0.937 and 0.642 at intersection-over-union (IoU) thresholds of 0.5 and 0.5 to 0.95, respectively. Notably, the model demonstrated near-perfect performance on the synthetic dataset, achieving APs of 0.994 and 0.919 at the two mentioned thresholds. These findings reveal both the potential and the weaknesses of generative AI in addressing DNN training data scarcity.
Authors: Junhong Min, Youngpil Jeon, Jimin Kim, Minyong Choi
Abstract: The pursuit of a generalizable stereo matching model, capable of performing across varying resolutions and disparity ranges without dataset-specific fine-tuning, has revealed a fundamental trade-off. Iterative local search methods achieve high scores on constrained benchmarks, but their core mechanism inherently limits the global consistency required for true generalization. On the other hand, global matching architectures, while theoretically more robust, have been historically rendered infeasible by prohibitive computational and memory costs. We resolve this dilemma with $S^2M^2$: a global matching architecture that achieves both state-of-the-art accuracy and high efficiency without relying on cost volume filtering or deep refinement stacks. Our design integrates a multi-resolution transformer for robust long-range correspondence, trained with a novel loss function that concentrates probability on feasible matches. This approach enables a more robust joint estimation of disparity, occlusion, and confidence. $S^2M^2$ establishes a new state of the art on the Middlebury v3 and ETH3D benchmarks, significantly outperforming prior methods across most metrics while reconstructing high-quality details with competitive efficiency.
Authors: Dechen Gao, Boqi Zhao, Andrew Lee, Ian Chuang, Hanchu Zhou, Hang Wang, Zhe Zhao, Junshan Zhang, Iman Soltani
Abstract: We present VITA, a Vision-To-Action flow matching policy that evolves latent visual representations into latent actions for visuomotor control. Traditional flow matching and diffusion policies sample from standard source distributions (e.g., Gaussian noise) and require additional conditioning mechanisms like cross-attention to condition action generation on visual information, creating time and space overheads. VITA proposes a novel paradigm that treats latent images as the flow source, learning an inherent mapping from vision to action while eliminating separate conditioning modules and preserving generative modeling capabilities. Learning flows between fundamentally different modalities like vision and action is challenging due to sparse action data lacking semantic structures and dimensional mismatches between high-dimensional visual representations and raw actions. We address this by creating a structured action latent space via an autoencoder as the flow matching target, up-sampling raw actions to match visual representation shapes. Crucially, we supervise flow matching with both encoder targets and final action outputs through flow latent decoding, which backpropagates action reconstruction loss through sequential flow matching ODE solving steps for effective end-to-end learning. Implemented as simple MLP layers, VITA is evaluated on challenging bi-manual manipulation tasks on the ALOHA platform, including 5 simulation and 2 real-world tasks. Despite its simplicity, MLP-only VITA outperforms or matches state-of-the-art generative policies while reducing inference latency by 50-130% compared to conventional flow matching policies requiring different conditioning mechanisms or complex architectures. To our knowledge, VITA is the first MLP-only flow matching policy capable of solving complex bi-manual manipulation tasks like those in ALOHA benchmarks.
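A hedged sketch of the core idea, flow matching with the latent image as the source distribution and the latent action as the target: train a velocity field along the straight-line path between the two latents, then integrate the ODE at inference. The MLP and dimensions are illustrative, not the paper's architecture.

```python
import torch
import torch.nn as nn

d = 256
velocity = nn.Sequential(nn.Linear(d + 1, 512), nn.ReLU(), nn.Linear(512, d))

def flow_matching_loss(z_img, z_act):
    # Straight-line interpolant from image latent (t=0) to action latent (t=1).
    t = torch.rand(z_img.size(0), 1)
    x_t = (1 - t) * z_img + t * z_act
    target_v = z_act - z_img                      # constant velocity of the path
    pred_v = velocity(torch.cat([x_t, t], dim=-1))
    return ((pred_v - target_v) ** 2).mean()

@torch.no_grad()
def sample_action(z_img, steps=10):
    # Euler integration of the learned ODE, starting from the image latent.
    x, dt = z_img.clone(), 1.0 / steps
    for i in range(steps):
        t = torch.full((x.size(0), 1), i * dt)
        x = x + dt * velocity(torch.cat([x, t], dim=-1))
    return x                                      # decode with the action decoder

z_img, z_act = torch.randn(4, d), torch.randn(4, d)
print(flow_matching_loss(z_img, z_act).item(), sample_action(z_img).shape)
```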
Authors: Ashray Gupta, Rohan Joseph, Sunny Rai
Abstract: Analogies test a model's ability to infer implicit relationships between concepts, making them a key benchmark for evaluating reasoning capabilities. While large language models (LLMs) are widely evaluated for reasoning in English, their abilities in Indic languages remain understudied, limiting our understanding of whether these models generalize across languages. To address this gap, we introduce a new Hindi Analogy Test Set (HATS), comprising 405 multiple-choice questions sourced from Indian government exams. We benchmark state-of-the-art multilingual LLMs using various prompting strategies and introduce a grounded Chain of Thought approach that leverages cognitive theories of analogical reasoning. This approach improves model performance on Hindi analogy questions. Our experiments show that models perform best with English prompts, irrespective of the prompting strategy. Our test set addresses the lack of a critical resource to evaluate LLM reasoning capabilities in Hindi.
Authors: Lyucheng Wu, Mengru Wang, Ziwen Xu, Tri Cao, Nay Oo, Bryan Hooi, Shumin Deng
Abstract: Recent progress in Multimodal Large Language Models (MLLMs) has unlocked powerful cross-modal reasoning abilities, but also raised new safety concerns, particularly when faced with adversarial multimodal inputs. To improve the safety of MLLMs during inference, we introduce AutoSteer, a modular and adaptive inference-time intervention technique that requires no fine-tuning of the underlying model. AutoSteer incorporates three core components: (1) a novel Safety Awareness Score (SAS) that automatically identifies the most safety-relevant distinctions among the model's internal layers; (2) an adaptive safety prober trained to estimate the likelihood of toxic outputs from intermediate representations; and (3) a lightweight Refusal Head that selectively intervenes to modulate generation when safety risks are detected. Experiments on LLaVA-OV and Chameleon across diverse safety-critical benchmarks demonstrate that AutoSteer significantly reduces the Attack Success Rate (ASR) for textual, visual, and cross-modal threats, while maintaining general abilities. These findings position AutoSteer as a practical, interpretable, and effective framework for safer deployment of multimodal AI systems.
Authors: Yiting Yang, Hao Luo, Yuan Sun, Qingsen Yan, Haokui Zhang, Wei Dong, Guoqing Wang, Peng Wang, Yang Yang, Hengtao Shen
Abstract: A prevalent approach in Parameter-Efficient Fine-Tuning (PEFT) of pre-trained Vision Transformers (ViT) involves freezing the majority of the backbone parameters and solely learning low-rank adaptation weight matrices to accommodate downstream tasks. These low-rank matrices are commonly derived through the multiplication structure of down-projection and up-projection matrices, exemplified by methods such as LoRA and Adapter. In this work, we observe approximate orthogonality between any two row or column vectors within any weight matrix of the backbone parameters; however, this property is absent in the vectors of the down/up-projection matrices. Approximate orthogonality implies a reduction in the upper bound of the model's generalization error, signifying that the model possesses enhanced generalization capability. If the fine-tuned down/up-projection matrices were to exhibit this same property as the pre-trained backbone matrices, could the generalization capability of fine-tuned ViTs be further augmented? To address this question, we propose an Approximately Orthogonal Fine-Tuning (AOFT) strategy for representing the low-rank weight matrices. This strategy employs a single learnable vector to generate a set of approximately orthogonal vectors, which form the down/up-projection matrices, thereby aligning the properties of these matrices with those of the backbone. Extensive experimental results demonstrate that our method achieves competitive performance across a range of downstream image classification tasks, confirming the efficacy of the enhanced generalization capability embedded in the down/up-projection matrices.
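One way to realize "a single learnable vector generating approximately orthogonal vectors", offered as an assumption rather than the paper's exact construction, is a Householder reflection: the matrix built from one vector is exactly orthogonal, so its leading columns give an orthonormal down-projection.

```python
import torch

def householder_columns(v: torch.Tensor, r: int) -> torch.Tensor:
    """Return the first r columns of H = I - 2 vv^T / ||v||^2 (shape d x r)."""
    d = v.numel()
    u = v / (v.norm() + 1e-9)
    H = torch.eye(d) - 2.0 * torch.outer(u, u)   # orthogonal by construction
    return H[:, :r]

v = torch.randn(768, requires_grad=True)   # the single learnable vector
W_down = householder_columns(v, 16)        # 768 x 16, orthonormal columns
# Check: W_down^T W_down is (numerically) the identity.
print(torch.allclose(W_down.T @ W_down, torch.eye(16), atol=1e-5))
```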
Authors: Zikai Xie, Linjiang Chen
Abstract: The Bayesian Optimization (BO) algorithm is a standard tool for black-box optimization problems. The current state-of-the-art BO approach for permutation spaces relies on the Mallows kernel, an $\Omega(n^2)$ representation that explicitly enumerates every pairwise comparison. Inspired by the close relationship between the Mallows kernel and pairwise comparison, we propose a novel framework for generating kernel functions on permutation spaces based on sorting algorithms. Within this framework, the Mallows kernel can be viewed as a special instance derived from bubble sort. Further, we introduce the \textbf{Merge Kernel} constructed from merge sort, which replaces the quadratic complexity with $\Theta(n\log n)$ to achieve the lowest possible complexity. The resulting feature vector is significantly shorter, can be computed in linearithmic time, yet still efficiently captures meaningful permutation distances. To boost robustness and right-invariance without sacrificing compactness, we further incorporate three lightweight, task-agnostic descriptors: (1) a shift histogram, which aggregates absolute element displacements and supplies a global misplacement signal; (2) a split-pair line, which encodes selected long-range comparisons by aligning elements across the two halves of the whole permutation; and (3) sliding-window motifs, which summarize local order patterns that influence near-neighbor objectives. Our empirical evaluation demonstrates that the proposed kernel consistently outperforms the state-of-the-art Mallows kernel across various permutation optimization benchmarks. The results confirm that the Merge Kernel provides a more compact yet more effective solution for Bayesian optimization in permutation space.
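A hedged sketch of a merge-sort-based feature map (the paper's exact construction may differ): bottom-up merge sort emits one bit per output slot per level, recording whether each element came from the left or right run, which yields a fixed $\Theta(n\log n)$-length trace per permutation; a kernel then compares traces.

```python
import numpy as np

def merge_features(perm):
    # Bottom-up merge sort; `bits` records, at every merge, whether each
    # output element was taken from the left (0) or right (1) run.
    a, n = list(perm), len(perm)
    bits, width = [], 1
    while width < n:
        out = []
        for lo in range(0, n, 2 * width):
            left = a[lo:lo + width]
            right = a[lo + width:lo + 2 * width]
            i = j = 0
            while i < len(left) or j < len(right):
                take_left = j >= len(right) or (i < len(left) and left[i] <= right[j])
                bits.append(0.0 if take_left else 1.0)
                if take_left:
                    out.append(left[i]); i += 1
                else:
                    out.append(right[j]); j += 1
        a, width = out, 2 * width
    return np.array(bits)   # length ~ n * ceil(log2 n), fixed for a given n

def merge_kernel(p, q, lam=0.5):
    # Similarity decays with the Hamming distance between merge traces.
    d = np.sum(merge_features(p) != merge_features(q))
    return float(np.exp(-lam * d))

print(merge_kernel([2, 0, 1, 3], [0, 1, 2, 3]))
```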
Authors: Alexander H. Liu, Andy Ehrenberg, Andy Lo, Cl\'ement Denoix, Corentin Barreau, Guillaume Lample, Jean-Malo Delignon, Khyathi Raghavi Chandu, Patrick von Platen, Pavankumar Reddy Muddireddy, Sanchit Gandhi, Soham Ghosh, Srijan Mishra, Thomas Foubert, Abhinav Rastogi, Adam Yang, Albert Q. Jiang, Alexandre Sablayrolles, Am\'elie H\'eliou, Am\'elie Martin, Anmol Agarwal, Antoine Roux, Arthur Darcet, Arthur Mensch, Baptiste Bout, Baptiste Rozi\`ere, Baudouin De Monicault, Chris Bamford, Christian Wallenwein, Christophe Renaudin, Cl\'emence Lanfranchi, Darius Dabert, Devendra Singh Chaplot, Devon Mizelle, Diego de las Casas, Elliot Chane-Sane, Emilien Fugier, Emma Bou Hanna, Gabrielle Berrada, Gauthier Delerce, Gauthier Guinet, Georgii Novikov, Guillaume Martin, Himanshu Jaju, Jan Ludziejewski, Jason Rute, Jean-Hadrien Chabran, Jessica Chudnovsky, Joachim Studnia, Joep Barmentlo, Jonas Amar, Josselin Somerville Roberts, Julien Denize, Karan Saxena, Karmesh Yadav, Kartik Khandelwal, Kush Jain, L\'elio Renard Lavaud, L\'eonard Blier, Lingxiao Zhao, Louis Martin, Lucile Saulnier, Luyu Gao, Marie Pellat, Mathilde Guillaumin, Mathis Felardos, Matthieu Dinot, Maxime Darrin, Maximilian Augustin, Micka\"el Seznec, Neha Gupta, Nikhil Raghuraman, Olivier Duchenne, Patricia Wang, Patryk Saffer, Paul Jacob, Paul Wambergue, Paula Kurylowicz, Philom\`ene Chagniot, Pierre Stock, Pravesh Agrawal, R\'emi Delacourt, Romain Sauvestre, Roman Soletskyi, Sagar Vaze, Sandeep Subramanian, Saurabh Garg, Shashwat Dalal, Siddharth Gandhi, Sumukh Aithal, Szymon Antoniak, Teven Le Scao, Thibault Schueller, Thibaut Lavril, Thomas Robert, Thomas Wang, Timoth\'ee Lacroix, Tom Bewley, Valeriia Nemychnikova, Victor Paltz, Virgile Richard, Wen-Ding Li, William Marshall, Xuanyu Zhang, Yihan Wan, Yunhao Tang
Abstract: We present Voxtral Mini and Voxtral Small, two multimodal audio chat models. Voxtral is trained to comprehend both spoken audio and text documents, achieving state-of-the-art performance across a diverse range of audio benchmarks, while preserving strong text capabilities. Voxtral Small outperforms a number of closed-source models, while being small enough to run locally. A 32K context window enables the model to handle audio files up to 40 minutes in duration and long multi-turn conversations. We also contribute three benchmarks for evaluating speech understanding models on knowledge and trivia. Both Voxtral models are released under Apache 2.0 license.
Authors: Jiazheng Li, Hong Lu, Kaiyue Wen, Zaiwen Yang, Jiaxuan Gao, Hongzhou Lin, Yi Wu, Jingzhao Zhang
Abstract: Reinforcement learning (RL) has become a key component in training large language reasoning models (LLMs). However, recent studies question its effectiveness in improving multi-step reasoning, particularly on hard problems. To address this challenge, we propose a simple yet effective strategy via Question Augmentation: introduce partial solutions during training to reduce problem difficulty and provide more informative learning signals. Our method, QuestA, when applied during RL training on math reasoning tasks, improves not only pass@1 but also pass@k, particularly on problems where standard RL struggles to make progress. This enables continual improvement over strong open-source models such as DeepScaleR and OpenMath Nemotron, further enhancing their reasoning capabilities. We achieve new state-of-the-art results on math benchmarks using 1.5B-parameter models: 67.1% (+5.3%) on AIME24, 59.5% (+10.0%) on AIME25, and 35.5% (+4.0%) on HMMT25. Furthermore, we provide theoretical explanations that QuestA improves sample efficiency, offering a practical and generalizable pathway for expanding reasoning capability through RL.
Authors: Luis Gasco, Hermenegildo Fabregat, Laura Garc\'ia-Sardi\~na, Paula Estrella, Daniel Deniz, Alvaro Rodrigo, Rabih Zbib
Abstract: Advances in natural language processing and large language models are driving a major transformation in Human Capital Management, with a growing interest in building smart systems based on language technologies for talent acquisition, upskilling strategies, and workforce planning. However, the adoption and progress of these technologies critically depend on the development of reliable and fair models, properly evaluated on public data and open benchmarks, which have so far been unavailable in this domain. To address this gap, we present TalentCLEF 2025, the first evaluation campaign focused on skill and job title intelligence. The lab consists of two tasks: Task A - Multilingual Job Title Matching, covering English, Spanish, German, and Chinese; and Task B - Job Title-Based Skill Prediction, in English. Both corpora were built from real job applications, carefully anonymized, and manually annotated to reflect the complexity and diversity of real-world labor market data, including linguistic variability and gender-marked expressions. The evaluations included monolingual and cross-lingual scenarios and covered the evaluation of gender bias. TalentCLEF attracted 76 registered teams with more than 280 submissions. Most systems relied on information retrieval techniques built with multilingual encoder-based models fine-tuned with contrastive learning, and several of them incorporated large language models for data augmentation or re-ranking. The results show that the training strategies have a larger effect than the size of the model alone. TalentCLEF provides the first public benchmark in this field and encourages the development of robust, fair, and transferable language technologies for the labor market.
Authors: Emma M. A. Harrison
Abstract: Robots are increasingly integrated across industries, particularly in healthcare. However, many valuable applications for quadrupedal robots remain overlooked. This research explores the effectiveness of three reinforcement learning algorithms in training a simulated quadruped robot for autonomous navigation and obstacle avoidance. The goal is to develop a robotic guide dog simulation capable of path following and obstacle avoidance, with long-term potential for real-world assistance to guide dogs and visually impaired individuals. It also seeks to expand research into medical 'pets', including robotic guide and alert dogs. A comparative analysis of thirteen related research papers shaped key evaluation criteria, including collision detection, pathfinding algorithms, sensor usage, robot type, and simulation platforms. The study focuses on sensor inputs, collision frequency, reward signals, and learning progression to determine which algorithm best supports robotic navigation in complex environments. Custom-made environments were used to ensure fair evaluation of all three algorithms under controlled conditions, allowing consistent data collection. Results show that Proximal Policy Optimization (PPO) outperformed Deep Q-Network (DQN) and Q-learning across all metrics, particularly in average and median steps to goal per episode. By analysing these results, this study contributes to robotic navigation, AI and medical robotics, offering insights into the feasibility of AI-driven quadruped mobility and its role in assistive robotics.
Authors: Aaron Councilman, David Fu, Aryan Gupta, Chengxiao Wang, David Grove, Yu-Xiong Wang, Vikram Adve
Abstract: In the past few years LLMs have emerged as a tool that can aid programmers by taking natural language descriptions and generating code based on it. However, LLMs often generate incorrect code that users need to fix and the literature suggests users often struggle to detect these errors. In this work we seek to offer formal guarantees of correctness to LLM generated code; such guarantees could improve the experience of using AI Code Assistants and potentially enable natural language programming for users with little or no programming knowledge. To address this challenge we propose to incorporate a formal query language that can represent a user's intent in a formally defined but natural language-like manner that a user can confirm matches their intent. Then, using such a query we propose to verify LLM generated code to ensure it matches the user's intent. We implement these ideas in our system, Astrogator, for the Ansible programming language which includes such a formal query language, a calculus for representing the behavior of Ansible programs, and a symbolic interpreter which is used for the verification. On a benchmark suite of 21 code-generation tasks, our verifier is able to verify correct code in 83% of cases and identify incorrect code in 92%.
Authors: Yilun Zhao, Weiyuan Chen, Zhijian Xu, Manasi Patwardhan, Yixin Liu, Chengye Wang, Lovekesh Vig, Arman Cohan
Abstract: We introduce AbGen, the first benchmark designed to evaluate the capabilities of LLMs in designing ablation studies for scientific research. AbGen consists of 1,500 expert-annotated examples derived from 807 NLP papers. In this benchmark, LLMs are tasked with generating detailed ablation study designs for a specified module or process based on the given research context. Our evaluation of leading LLMs, such as DeepSeek-R1-0528 and o4-mini, highlights a significant performance gap between these models and human experts in terms of the importance, faithfulness, and soundness of the ablation study designs. Moreover, we demonstrate that current automated evaluation methods are not reliable for our task, as they show a significant discrepancy when compared to human assessment. To better investigate this, we develop AbGen-Eval, a meta-evaluation benchmark designed to assess the reliability of commonly used automated evaluation systems in measuring LLM performance on our task. We investigate various LLM-as-Judge systems on AbGen-Eval, providing insights for future research on developing more effective and reliable LLM-based evaluation systems for complex scientific tasks.
Authors: Junsu Kim, Naeun Kim, Jaeho Lee, Incheol Park, Dongyoon Han, Seungryul Baek
Abstract: The reasoning-based pose estimation (RPE) benchmark has emerged as a widely adopted evaluation standard for pose-aware multimodal large language models (MLLMs). Despite its significance, we identified critical reproducibility and benchmark-quality issues that hinder fair and consistent quantitative evaluations. Most notably, the benchmark utilizes different image indices from those of the original 3DPW dataset, forcing researchers into tedious and error-prone manual matching processes to obtain accurate ground-truth (GT) annotations for quantitative metrics (e.g., MPJPE, PA-MPJPE). Furthermore, our analysis reveals several inherent benchmark-quality limitations, including significant image redundancy, scenario imbalance, overly simplistic poses, and ambiguous textual descriptions, collectively undermining reliable evaluations across diverse scenarios. To alleviate manual effort and enhance reproducibility, we carefully refined the GT annotations through meticulous visual matching and publicly release these refined annotations as an open-source resource, thereby promoting consistent quantitative evaluations and facilitating future advancements in human pose-aware multimodal reasoning.
Authors: Yulu Qin, Dheeraj Varghese, Adam Dahlgren Lindstr\"om, Lucia Donatelli, Kanishka Misra, Najoung Kim
Abstract: Does vision-and-language (VL) training change the linguistic representations of language models in meaningful ways? Most results in the literature have shown inconsistent or marginal differences, both behaviorally and representationally. In this work, we start from the hypothesis that the domain in which VL training could have a significant effect is lexical-conceptual knowledge, in particular its taxonomic organization. Through comparing minimal pairs of text-only LMs and their VL-trained counterparts, we first show that the VL models often outperform their text-only counterparts on a text-only question-answering task that requires taxonomic understanding of concepts mentioned in the questions. Using an array of targeted behavioral and representational analyses, we show that the LMs and VLMs do not differ significantly in terms of their taxonomic knowledge itself, but they differ in how they represent questions that contain concepts in a taxonomic relation vs. a non-taxonomic relation. This implies that the taxonomic knowledge itself does not change substantially through additional VL training, but VL training does improve the deployment of this knowledge in the context of a specific task, even when the presentation of the task is purely linguistic.
Authors: Yiqi Wang, Mrinal Verghese, Jeff Schneider
Abstract: Learning visuomotor policies via imitation has proven effective across a wide range of robotic domains. However, the performance of these policies is heavily dependent on the number of training demonstrations, which requires expensive data collection in the real world. In this work, we aim to reduce data collection efforts when learning visuomotor robot policies by leveraging existing or cost-effective data from a wide range of embodiments, such as public robot datasets and the datasets of humans playing with objects (human data from play). Our approach leverages two key insights. First, we use optic flow as an embodiment-agnostic action representation to train a World Model (WM) across multi-embodiment datasets, and finetune it on a small amount of robot data from the target embodiment. Second, we develop a method, Latent Policy Steering (LPS), to improve the output of a behavior-cloned policy by searching in the latent space of the WM for better action sequences. In real world experiments, we observe significant improvements in the performance of policies trained with a small amount of data (over 50% relative improvement with 30 demonstrations and over 20% relative improvement with 50 demonstrations) by combining the policy with a WM pretrained on two thousand episodes sampled from the existing Open X-embodiment dataset across different robots or a cost-effective human dataset from play.
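A minimal sketch of the steering idea: sample perturbations around the base policy's action sequence, roll each candidate out in the world model's latent space, and keep the highest-scoring one. The WM interface (`step`, `score`) and the toy dynamics are assumptions for illustration.

```python
import torch

def latent_policy_steering(wm, z0, base_actions, n_samples=32, sigma=0.1):
    """wm.step(z, a) -> next latent; wm.score(z) -> scalar value estimate."""
    best_actions, best_score = base_actions, -float("inf")
    for _ in range(n_samples):
        cand = base_actions + sigma * torch.randn_like(base_actions)
        z = z0
        for a in cand:                 # roll the candidate out in latent space
            z = wm.step(z, a)
        score = wm.score(z)
        if score > best_score:
            best_actions, best_score = cand, score
    return best_actions

class DummyWM:
    def step(self, z, a): return z + 0.9 * a             # toy latent dynamics
    def score(self, z):   return -float(z.abs().sum())   # toy value: stay near 0

wm, z0 = DummyWM(), torch.randn(8)
base = torch.zeros(5, 8)               # 5-step action sequence from the policy
print(latent_policy_steering(wm, z0, base).shape)
```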
Authors: Yukai Shi, Jiarong Ou, Rui Chen, Haotian Yang, Jiahao Wang, Xin Tao, Pengfei Wan, Di Zhang, Kun Gai
Abstract: In visual generation tasks, the responses to and combinations of complex concepts often lack stability and are error-prone, which remains an under-explored area. In this paper, we attempt to explore the causal factors behind poor concept responses through elaborately designed experiments. We also design a concept-wise equalization loss function (IMBA loss) to address this issue. Our proposed method is online, eliminating the need for offline dataset processing, and requires minimal code changes. On our newly proposed complex concept benchmark Inert-CompBench and two other public test sets, our method significantly enhances the concept response capability of baseline models and yields highly competitive results with only a few lines of code.
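A hedged sketch of what a concept-wise equalization loss could look like, with an online weighting rule assumed for illustration: running concept frequencies are updated per batch, and rarer concepts receive larger loss weights, with no offline dataset pass.

```python
from collections import Counter

counts = Counter()   # running concept frequencies, updated online

def imba_weights(batch_concepts, smooth=1.0):
    # Rarer concepts get proportionally larger weights.
    counts.update(c for concepts in batch_concepts for c in concepts)
    total = sum(counts.values())
    return [
        sum(total / (len(counts) * (counts[c] + smooth)) for c in concepts)
        / max(len(concepts), 1)
        for concepts in batch_concepts
    ]

def weighted_loss(per_sample_losses, batch_concepts):
    w = imba_weights(batch_concepts)
    return sum(wi * li for wi, li in zip(w, per_sample_losses)) / len(w)

batch = [["dog", "ball"], ["dog"], ["unicycle"]]
losses = [0.7, 0.5, 1.2]
print(weighted_loss(losses, batch))   # the rarer "unicycle" sample weighs more
```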
Authors: Senqiao Yang, Junyi Li, Xin Lai, Bei Yu, Hengshuang Zhao, Jiaya Jia
Abstract: Recent advancements in vision-language models (VLMs) have improved performance by increasing the number of visual tokens, which are often significantly longer than text tokens. However, we observe that most real-world scenarios do not require such an extensive number of visual tokens. While the performance drops significantly in a small subset of OCR-related tasks, models still perform accurately in most other general VQA tasks with only 1/4 resolution. Therefore, we propose to dynamically process distinct samples with different resolutions, and present a new paradigm for visual token compression, namely, VisionThink. It starts with a downsampled image and smartly decides whether it is sufficient for problem solving. Otherwise, the model could output a special token to request the higher-resolution image. Compared to existing Efficient VLM methods that compress tokens using fixed pruning ratios or thresholds, VisionThink autonomously decides whether to compress tokens case by case. As a result, it demonstrates strong fine-grained visual understanding capability on OCR-related tasks, and meanwhile saves substantial visual tokens on simpler tasks. We adopt reinforcement learning and propose the LLM-as-Judge strategy to successfully apply RL to general VQA tasks. Moreover, we carefully design a reward function and penalty mechanism to achieve a stable and reasonable image resize call ratio. Extensive experiments demonstrate the superiority, efficiency, and effectiveness of our method. Our code is available at https://github.com/dvlab-research/VisionThink.
Authors: Shihao Wang, Guo Chen, De-an Huang, Zhiqi Li, Minghan Li, Guilin Li, Jose M. Alvarez, Lei Zhang, Zhiding Yu
Abstract: Recent studies have revealed that selecting informative and relevant video frames can significantly improve the performance of Video Large Language Models (Video-LLMs). Current methods, such as reducing inter-frame redundancy, employing separate models for image-text relevance assessment, or utilizing temporal video grounding for event localization, largely adopt unsupervised learning paradigms but struggle to address the complex scenarios in long video understanding. We propose Instructed Temporal Grounding for Videos (VideoITG), featuring customized frame sampling aligned with user instructions. The core of VideoITG is the VidThinker pipeline, an automated annotation framework that explicitly mimics the human annotation process. First, it generates detailed clip-level captions conditioned on the instruction; then, it retrieves relevant video segments through instruction-guided reasoning; finally, it performs fine-grained frame selection to pinpoint the most informative visual evidence. Leveraging VidThinker, we construct the VideoITG-40K dataset, containing 40K videos and 500K instructed temporal grounding annotations. We then design a plug-and-play VideoITG model, which takes advantage of the visual-language alignment and reasoning capabilities of Video-LLMs, for effective frame selection in a discriminative manner. Coupled with Video-LLMs, VideoITG achieves consistent performance improvements across multiple multimodal video understanding benchmarks, showing its superiority and great potential for video understanding.
Authors: Xinyuan Wang, Liang Wu, Liangjie Hong, Hao Liu, Yanjie Fu
Abstract: Graph recommendation methods, representing a connected interaction perspective, reformulate user-item interactions as graphs to leverage graph structure and topology for recommendation, and have proven practically effective at scale. Large language models, representing a textual generative perspective, excel at modeling user language, understanding behavioral contexts, capturing user-item semantic relationships, analyzing textual sentiments, and generating coherent and contextually relevant texts as recommendations. However, there is a gap between the connected graph perspective and the text generation perspective, as the task formulations are different. A research question arises: how can we effectively integrate the two perspectives for more personalized recommender systems? To fill this gap, we propose to incorporate graph-edge information into LLMs via prompt and attention innovations. We reformulate recommendations as a probabilistic generative problem using prompts. We develop a framework to incorporate graph edge information from the prompt and attention mechanisms for graph-structured LLM recommendations. We develop a new prompt design that brings in both first-order and second-order graph relationships, and we devise an improved LLM attention mechanism to directly embed the spatial and connectivity information of edges. Our evaluation on real-world datasets demonstrates the framework's ability to understand connectivity information in graph data and to improve the relevance and quality of recommendation results.
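The prompt side of the design can be illustrated with a small sketch; the template wording and graph encoding are assumptions, not the paper's exact design:

```python
def build_graph_prompt(user, items, graph):
    # First-order edges: items the user interacted with directly.
    first = graph[user]
    # Second-order edges: users reachable through those items.
    second = {v for i in first for v in graph[i] if v != user}
    return (
        f"User {user} interacted with items {sorted(first)} (first-order edges).\n"
        f"Users with overlapping tastes: {sorted(second)} (second-order edges).\n"
        f"From candidates {items}, rank the most relevant items and explain why."
    )

graph = {"u1": {"i1", "i2"}, "i1": {"u1", "u2"}, "i2": {"u1"}, "u2": {"i1"}}
print(build_graph_prompt("u1", ["i3", "i4"], graph))
```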
Authors: Hung Guei, Yan-Ru Ju, Wei-Yu Chen, Ti-Rong Wu
Abstract: MuZero has achieved superhuman performance in various games by using a dynamics network to predict the environment dynamics for planning, without relying on simulators. However, the latent states learned by the dynamics network make its planning process opaque. This paper aims to demystify MuZero's model by interpreting the learned latent states. We incorporate observation reconstruction and state consistency into MuZero training and conduct an in-depth analysis to evaluate latent states across two board games: 9x9 Go and Gomoku, and three Atari games: Breakout, Ms. Pacman, and Pong. Our findings reveal that while the dynamics network becomes less accurate over longer simulations, MuZero still performs effectively by using planning to correct errors. Our experiments also show that the dynamics network learns better latent states in board games than in Atari games. These insights contribute to a better understanding of MuZero and offer directions for future research to improve the performance, robustness, and interpretability of the MuZero algorithm. The code and data are available at https://rlg.iis.sinica.edu.tw/papers/demystifying-muzero-planning.
Authors: Yixiao Song, Katherine Thai, Chau Minh Pham, Yapei Chang, Mazin Nadaf, Mohit Iyyer
Abstract: Modern web agents possess computer use abilities that allow them to interact with webpages by sending commands to a virtual keyboard and mouse. While such agents have considerable potential to assist human users with complex tasks, evaluating their capabilities in real-world settings poses a major challenge. To this end, we introduce BEARCUBS, a "small but mighty" benchmark of 111 information-seeking questions designed to evaluate a web agent's ability to search, browse, and identify factual information from the web. Unlike prior web agent benchmarks, solving BEARCUBS requires (1) accessing live web content rather than synthetic or simulated pages, which captures the unpredictability of real-world web interactions; and (2) performing a broad range of multimodal interactions (e.g., video understanding, 3D navigation) that cannot be bypassed via text-based workarounds. Each question in BEARCUBS has a corresponding short, unambiguous answer and a human-validated browsing trajectory, allowing for transparent evaluation of agent performance and strategies. A human study confirms that BEARCUBS questions are solvable but non-trivial (84.7% human accuracy), revealing domain knowledge gaps and overlooked details as common failure points. By contrast, state-of-the-art computer-using agents underperform, with the best-scoring system (OpenAI's Operator) reaching only 23.4% accuracy. These results highlight critical areas for improvement, including reliable source selection and more powerful multimodal capabilities. To facilitate future research, BEARCUBS will be updated periodically to replace invalid or contaminated questions, keeping the benchmark fresh for future generations of web agents.
Authors: Jianguo Zhang, Thai Hoang, Ming Zhu, Zuxin Liu, Shiyu Wang, Tulika Awalgaonkar, Akshara Prabhakar, Haolin Chen, Weiran Yao, Zhiwei Liu, Juntao Tan, Juan Carlos Niebles, Shelby Heinecke, Huan Wang, Silvio Savarese, Caiming Xiong
Abstract: Large Action models are essential for enabling autonomous agents to perform complex tasks. However, training such models remains challenging due to the diversity of agent environments and the complexity of noisy agentic data. Existing infrastructure offers limited support for scalable, agent-specific fine-tuning and standardized agent data processing. We introduce ActionStudio, a lightweight and extensible data and training framework designed for large action models. ActionStudio unifies diverse agent trajectories using our proposed Unified Format 2.0, supports a range of training workflows with optimized multi-node distributed setup, and integrates robust preprocessing and real-time verification tools. ActionStudio demonstrates up to 9x higher throughput compared to existing agentic training frameworks, and our trained models yield top performances across public and realistic agent benchmarks. To support the broader research community, we open-source the ActionStudio framework and release actionstudio-98k, a curated dataset of 98k high-quality trajectories. Code: https://github.com/SalesforceAIResearch/xLAM.
Authors: Chiyu Ma, Enpei Zhang, Yilun Zhao, Wenjun Liu, Yaning Jia, Peijun Qing, Lin Shi, Arman Cohan, Yujun Yan, Soroush Vosoughi
Abstract: LLM-as-Judge has emerged as a scalable alternative to human evaluation, enabling large language models (LLMs) to provide reward signals in training. While recent work has explored multi-agent extensions such as multi-agent debate and meta-judging to enhance evaluation quality, the question of how intrinsic biases manifest in these settings remains underexplored. In this study, we conduct a systematic analysis of four diverse bias types: position bias, verbosity bias, chain-of-thought bias, and bandwagon bias. We evaluate these biases across two widely adopted multi-agent LLM-as-Judge frameworks: Multi-Agent-Debate and LLM-as-Meta-Judge. Our results show that the debate framework amplifies biases sharply after the initial debate, and this increased bias is sustained in subsequent rounds, while meta-judge approaches exhibit greater resistance. We further investigate the incorporation of PINE, a leading single-agent debiasing method, as a bias-free agent within these systems. The results reveal that this bias-free agent effectively reduces biases in debate settings but provides less benefit in meta-judge scenarios. Our work provides a comprehensive study of bias behavior in multi-agent LLM-as-Judge systems and highlights the need for targeted bias mitigation strategies in collaborative evaluation settings.
Authors: Carlo Romeo, Andrew D. Bagdanov
Abstract: Balancing combat encounters in Dungeons & Dragons (D&D) is a complex task that requires Dungeon Masters (DMs) to manually assess party strength, enemy composition, and dynamic player interactions while avoiding interruption of the narrative flow. In this paper, we propose Encounter Generation via Reinforcement Learning (NTRL), a novel approach that automates Dynamic Difficulty Adjustment (DDA) in D&D via combat encounter design. By framing the problem as a contextual bandit, NTRL generates encounters based on real-time party member attributes. In comparison with classic DM heuristics, NTRL iteratively optimizes encounters to extend combat longevity (+200%), increase damage dealt to party members (reducing post-combat hit points by 16.67%), and raise the number of player deaths while maintaining low total party kills (TPK). The intensification of combat forces players to act wisely and engage in tactical maneuvers, even though the generated encounters guarantee high win rates (70%). Even in comparison with encounters designed by human Dungeon Masters, NTRL demonstrates superior performance by enhancing the strategic depth of combat while increasing difficulty in a manner that preserves overall game fairness.
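A minimal sketch of encounter generation framed as a contextual bandit, with an epsilon-greedy value table assumed for illustration: the context is a coarse summary of real-time party attributes, arms are candidate encounters, and the post-combat reward updates the estimate.

```python
import random
from collections import defaultdict

q = defaultdict(float)   # value estimate per (context bucket, encounter)
n = defaultdict(int)

def party_bucket(party):
    # Coarse context: average level and party size.
    return (round(sum(p["level"] for p in party) / len(party)), len(party))

def pick_encounter(party, encounters, eps=0.1):
    ctx = party_bucket(party)
    if random.random() < eps:
        return random.choice(encounters)   # explore
    return max(encounters, key=lambda e: q[(ctx, e)])   # exploit

def update(party, encounter, reward):
    key = (party_bucket(party), encounter)
    n[key] += 1
    q[key] += (reward - q[key]) / n[key]   # incremental mean

party = [{"level": 3}, {"level": 4}]
enc = pick_encounter(party, ["goblins", "ogre", "wraith"])
update(party, enc, reward=0.8)   # e.g., long combat, no total party kill
```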
Authors: Zheng Jia, Shengbin Yue, Wei Chen, Siyuan Wang, Yidong Liu, Yun Song, Zhongyu Wei
Abstract: The gap between static benchmarks and the dynamic nature of real-world legal practice poses a key barrier to advancing legal intelligence. To this end, we introduce J1-ENVS, the first interactive and dynamic legal environment tailored for LLM-based agents. Guided by legal experts, it comprises six representative scenarios from Chinese legal practices across three levels of environmental complexity. We further introduce J1-EVAL, a fine-grained evaluation framework, designed to assess both task performance and procedural compliance across varying levels of legal proficiency. Extensive experiments on 17 LLM agents reveal that, while many models demonstrate solid legal knowledge, they struggle with procedural execution in dynamic settings. Even the SOTA model, GPT-4o, falls short of 60% overall performance. These findings highlight persistent challenges in achieving dynamic legal intelligence and offer valuable insights to guide future research.
Authors: Xingyang He, Xiao Ling, Jie Liu
Abstract: Large reasoning models (LRMs) have exhibited remarkable reasoning capabilities through inference-time scaling, but this progress has also introduced considerable redundancy and inefficiency into their reasoning processes, resulting in substantial computational waste. Previous work has attempted to mitigate this issue by penalizing the overall length of generated samples during reinforcement learning (RL), with the goal of encouraging more concise chains of thought. However, we observe that such a global length penalty often leads to excessive compression of critical reasoning steps while preserving unnecessary details in simpler ones, yielding a suboptimal trade-off between accuracy and efficiency. To address this issue, we propose SmartThinker, a two-stage learnable framework designed to enable fine-grained control over the length of reasoning chains based on the importance of each individual step. In the first stage, SmartThinker adapts a reasoning model to a short-form reasoning mode through rejection sampling combined with supervised fine-tuning (SFT). In the second stage, SmartThinker applies Step-Level Length Control Policy Optimization (SCPO) to refine the model output distribution, increasing the proportion of length allocated to critical steps while reducing redundancy in less important ones. SCPO consists of four core components: an online importance estimator, a step-level length control reward function, step-level generalized advantage estimation (S-GAE), and a difficulty-adaptive clipping strategy. Working in concert, these components enable SCPO to implement differentiated length control across reasoning steps. Empirical results across multiple reasoning benchmarks and various backbone models demonstrate that SmartThinker significantly reduces redundant reasoning while achieving comparable or even superior performance to existing methods.
Authors: Yexuan Shi, Mingyu Wang, Yunxiang Cao, Hongjie Lai, Junjian Lan, Xin Han, Yu Wang, Jie Geng, Zhenan Li, Zihao Xia, Xiang Chen, Chen Li, Jian Xu, Wenbo Duan, Yuanshuo Zhu
Abstract: Multi-Agent Systems (MAS) powered by Large Language Models (LLMs) are emerging as a powerful paradigm for solving complex, multifaceted problems. However, the potential of these systems is often constrained by the prevalent plan-and-execute framework, which suffers from critical limitations: rigid plan execution, static agent capabilities, and inefficient communication. These weaknesses hinder their adaptability and robustness in dynamic environments. This paper introduces Aime, a novel multi-agent framework designed to overcome these challenges through dynamic, reactive planning and execution. Aime replaces the conventional static workflow with a fluid and adaptive architecture. Its core innovations include: (1) a Dynamic Planner that continuously refines the overall strategy based on real-time execution feedback; (2) an Actor Factory that implements Dynamic Actor instantiation, assembling specialized agents on-demand with tailored tools and knowledge; and (3) a centralized Progress Management Module that serves as a single source of truth for coherent, system-wide state awareness. We empirically evaluated Aime on a diverse suite of benchmarks spanning general reasoning (GAIA), software engineering (SWE-bench Verified), and live web navigation (WebVoyager). The results demonstrate that Aime consistently outperforms even highly specialized state-of-the-art agents in their respective domains. Its superior adaptability and task success rate establish Aime as a more resilient and effective foundation for multi-agent collaboration.
Authors: Christian Cabrera, Andrei Paleyes, Pierre Thodoroff, Neil D. Lawrence
Abstract: With the upsurge of AI technologies, engineers are deploying ML models as parts of real-world systems. Real-world environments challenge the deployment of such systems because these environments produce large amounts of heterogeneous data, and users require increasingly efficient responses. These requirements push prevalent software architectures to the limit when deploying ML-based systems. Data-oriented Architecture (DOA) is an emerging style that better equips systems for integrating ML models. Even though papers on deployed ML systems do not mention DOA, their authors made design decisions that implicitly follow DOA. Implicit decisions create a knowledge gap, limiting practitioners' ability to implement ML-based systems. This paper surveys why, how, and to what extent practitioners have adopted DOA to implement and deploy ML-based systems. We overcome the knowledge gap by answering these questions and explicitly showing the design decisions and practices behind these systems. The survey follows a well-known systematic and semi-automated methodology for reviewing papers in software engineering. The majority of reviewed works partially adopt DOA. Such an adoption enables systems to address requirements such as Big Data management, low-latency processing, resource management, and security and privacy. Based on these findings, we formulate practical advice to facilitate the deployment of ML-based systems.
Authors: Nan Wang, Xuezhi Wen, Dalin Zhang, Xibin Zhao, Jiahui Ma, Mengxia Luo, Fan Xu, Sen Nie, Shi Wu, Jiqiang Liu
Abstract: Advanced Persistent Threats (APTs) are difficult to detect due to their long-term latency and their covert, slow, multistage attack patterns. To tackle these issues, we propose TBDetector, a transformer-based method for APT attack detection. Because provenance graphs provide rich historical information and a powerful ability to correlate attack history with anomalous activities, TBDetector employs provenance analysis for APT detection: it summarizes long-running system execution with space efficiency and utilizes a transformer with a self-attention based encoder-decoder to extract long-term contextual features of system states, enabling it to detect slow-acting attacks. Furthermore, we introduce anomaly scores to investigate the anomaly of different system states, where each state is assigned an anomaly score derived from its similarity score and isolation score. To evaluate the effectiveness of the proposed method, we have conducted experiments on five public datasets, i.e., streamspot, cadets, shellshock, clearscope, and wget_baseline. Experimental results and comparisons with state-of-the-art methods exhibit the better performance of our proposed method.
Authors: Yuya Saito, Shinnosuke Matsuo, Seiichi Uchida, Daiki Suehiro
Abstract: This paper tackles the problem of the worst-class error rate, instead of the standard error rate averaged over all classes. For example, a three-class classification task with class-wise error rates of 10%, 10%, and 40% has a worst-class error rate of 40%, whereas the average is 20% under the class-balanced condition. The worst-class error is important in many applications. For example, in a medical image classification task, it would not be acceptable for the malignant tumor class to have a 40% error rate while the benign and healthy classes have 10% error rates. To avoid overfitting in worst-class error minimization using Deep Neural Networks (DNNs), we design a problem formulation for bounding the worst-class error instead of achieving zero worst-class error. Moreover, to correctly bound the worst-class error, we propose a boosting approach which ensembles DNNs. We give training and generalization worst-class-error bounds. Experimental results show that the algorithm lowers worst-class test error rates while avoiding overfitting to the training set. Our code is available at https://github.com/saito-yuya/Bounding-the-Worst-class-error-A-Boosting-Approach.
URLs: https://github.com/saito-yuya/Bounding-the-Worst-class-error-A-Boosting-Approach.
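The metric itself is easy to reproduce. Below is a minimal sketch (not the paper's boosting algorithm) that computes per-class error rates from a confusion matrix and reports the worst one; the assumption that every class appears in the labels is ours.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

def worst_class_error(y_true, y_pred):
    """Per-class error is 1 - recall; the metric of interest is its max.
    Assumes every class appears at least once in y_true."""
    cm = confusion_matrix(y_true, y_pred)
    per_class_error = 1.0 - cm.diagonal() / cm.sum(axis=1)
    return per_class_error.max(), per_class_error

# Mirrors the abstract's example: per-class errors of 10%, 10%, and 40%
# average to 20% but give a worst-class error of 40%.
```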
Authors: Gabriel A. Silva
Abstract: The exploration of new problem classes for quantum computation is an active area of research. In this paper, we introduce and solve a novel problem class related to dynamics on large-scale networks relevant to neurobiology and machine learning. Specifically, we ask if a network can sustain inherent dynamic activity beyond some arbitrary observation time or if the activity ceases through quiescence or saturation via an epileptic-like state. We show that this class of problems can be formulated and structured to take advantage of quantum superposition and solved efficiently using a coupled workflow between the Grover and Deutsch-Jozsa quantum algorithms. To do so, we extend their functionality to address the unique requirements of how input (sub)sets into the algorithms must be mathematically structured while simultaneously constructing the inputs so that measurement outputs can be interpreted as meaningful properties of the network dynamics. This, in turn, allows us to answer the question we pose.
Authors: Irene Siragusa, Salvatore Contino, Massimo La Ciura, Rosario Alicata, Roberto Pirrone
Abstract: The increasing interest in developing Artificial Intelligence applications in the medical domain suffers from the lack of high-quality data sets, mainly due to privacy-related issues. In addition, the recent rise of Vision Language Models (VLMs) creates the need for multimodal medical data sets, where clinical reports and findings are attached to the corresponding medical scans. This paper illustrates the entire workflow for building the MedPix 2.0 data set. Starting with the well-known multimodal data set MedPix®, mainly used by physicians, nurses, and healthcare students for Continuing Medical Education purposes, a semi-automatic pipeline was developed to extract visual and textual data, followed by a manual curation procedure in which noisy samples were removed, thus creating a MongoDB database. Along with the data set, we developed a Graphical User Interface aimed at navigating the MongoDB instance efficiently and obtaining the raw data that can be easily used for training and/or fine-tuning VLMs. To reinforce this point, in this work we first recall DR-Minerva, a Retrieval-Augmented Generation-based VLM trained on MedPix 2.0. DR-Minerva predicts the body part and the modality used to scan its input image. We also propose the extension of DR-Minerva with a Knowledge Graph that uses Llama 3.1 Instruct 8B and leverages MedPix 2.0. The resulting architecture can be queried in an end-to-end manner, as a medical decision support system. MedPix 2.0 is available on GitHub.
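As an illustration of how such a MongoDB instance might be consulted programmatically, here is a hedged pymongo sketch; the connection string, collection name, and field names are hypothetical placeholders, not the actual MedPix 2.0 schema.

```python
from pymongo import MongoClient

# Hypothetical connection and schema -- collection and field names are
# illustrative, not the real MedPix 2.0 layout.
client = MongoClient("mongodb://localhost:27017")
db = client["medpix_v2"]

# Fetch all CT scans of the head together with their clinical reports.
cases = db["cases"].find(
    {"modality": "CT", "body_part": "Head"},
    {"image_path": 1, "report": 1, "_id": 0},
)
for case in cases:
    print(case["image_path"], case["report"][:80])
```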
Authors: Emanuele Mezzi, Aurora Papotti, Fabio Massacci, Katja Tuma
Abstract: The use of AI technologies is being integrated into the secure development of software-based systems, with an increasing trend of composing AI-based subsystems (with uncertain levels of performance) into automated pipelines. This presents a fundamental research challenge and seriously threatens safety-critical domains. Despite the existing knowledge about uncertainty in risk analysis, no previous work has estimated the uncertainty of AI-augmented systems given the propagation of errors in the pipeline. We provide the formal underpinnings for capturing uncertainty propagation, develop a simulator to quantify uncertainty, and evaluate the simulation of propagating errors with one case study. We discuss the generalizability of our approach and its limitations and present recommendations for evaluation policies concerning AI systems. Future work includes extending the approach by relaxing the remaining assumptions and by experimenting with a real system.
Authors: Hongzhi Shu, Xinglin Li, Hongyu Jiang, Minghao Fu, Xinyu Li
Abstract: Music classification, a cornerstone of music information retrieval, supports a wide array of applications. To address the lack of comprehensive datasets and effective methods for sub-genre classification in mainstage dance music, we introduce a novel benchmark featuring a new dataset and baseline. Our dataset expands the scope of sub-genres to reflect the diversity of recent mainstage live sets performed by leading DJs at global music festivals, capturing the vibrant and rapidly evolving electronic dance music (EDM) scene that engages millions of fans worldwide. We employ a continuous soft labeling approach to accommodate tracks blending multiple sub-genres, preserving their inherent complexity. Experiments demonstrate that even state-of-the-art multimodal large language models (MLLMs) struggle with this task, while our specialized baseline models achieve high accuracy. This benchmark supports applications such as music recommendation, DJ set curation, and interactive multimedia systems, with video demos provided. Our code and data are all open-sourced at https://github.com/Gariscat/housex-v2.git.
URLs: https://github.com/Gariscat/housex-v2.git
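To make the continuous soft-labeling idea above concrete, the sketch below trains against a sub-genre distribution rather than a single class, using a standard soft-target cross-entropy in PyTorch; the batch and target values are invented for illustration and this is not the authors' exact training objective.

```python
import torch
import torch.nn.functional as F

def soft_label_loss(logits, soft_targets):
    """Cross-entropy against a continuous sub-genre distribution.

    soft_targets: (batch, n_subgenres) rows summing to 1, e.g. a track
    judged 70% one sub-genre and 30% another -> [0.7, 0.3, ...].
    """
    log_probs = F.log_softmax(logits, dim=-1)
    return -(soft_targets * log_probs).sum(dim=-1).mean()

# Hypothetical batch: 2 tracks, 4 sub-genres.
logits = torch.randn(2, 4)
targets = torch.tensor([[0.7, 0.3, 0.0, 0.0],
                        [0.0, 0.1, 0.6, 0.3]])
print(soft_label_loss(logits, targets))
```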
Authors: Yebowen Hu, Xiaoyang Wang, Wenlin Yao, Yiming Lu, Daoan Zhang, Hassan Foroosh, Dong Yu, Fei Liu
Abstract: LLMs are ideal for decision-making thanks to their ability to reason over long contexts. However, challenges arise when processing speech transcripts that describe complex scenarios, as they are verbose and include repetition, hedging, and vagueness. For example, during a company's earnings call, an executive might project a positive revenue outlook to reassure investors, despite uncertainty regarding future earnings. It is crucial for LLMs to incorporate this uncertainty systematically when making decisions. In this paper, we introduce DeFine, a modular framework that constructs probabilistic factor profiles from complex scenarios. It then integrates these profiles with analogical reasoning, leveraging insights from similar past experiences to guide LLMs in making critical decisions in new situations. Our framework separates the tasks of quantifying uncertainty and incorporating it into LLM decision-making. This approach is particularly useful in areas such as consulting and financial deliberation, where making decisions under uncertainty is vital.
Authors: Yingya Li, Timothy Miller, Steven Bethard, Guergana Savova
Abstract: The success of multi-task learning can depend heavily on which tasks are grouped together. Naively grouping all tasks or a random set of tasks can result in negative transfer, with the multi-task models performing worse than single-task models. Though many efforts have been made to identify task groupings and to measure the relatedness among different tasks, it remains a challenging research topic to define a metric that identifies the best task grouping out of a pool of many potential task combinations. We propose a metric of task relatedness based on task difficulty measured by pointwise V-usable information (PVI). PVI is a recently proposed metric to estimate how much usable information a dataset contains given a model. We hypothesize that tasks whose PVI estimates are not statistically different are similar enough to benefit from the joint learning process. We conduct comprehensive experiments to evaluate the feasibility of this metric for task grouping on 15 NLP datasets in the general, biomedical, and clinical domains. We compare the results of the joint learners against single learners, existing baseline methods, and recent large language models, including Llama 2 and GPT-4. The results show that by grouping tasks with similar PVI estimates, the joint learners yielded competitive results with fewer total parameters, with consistent performance across domains.
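The grouping criterion can be sketched directly from the definitions. Assuming per-instance log2-probabilities from a model finetuned on real inputs (g) and one finetuned on null inputs (g'), PVI is their difference, and a Welch's t-test can flag task pairs whose PVI estimates are not statistically different; the 0.05 threshold is an assumed default, not necessarily the paper's.

```python
import numpy as np
from scipy.stats import ttest_ind

def pvi(log2_p_null, log2_p_model):
    """Per-instance pointwise V-usable information:
    PVI(x, y) = -log2 g'(y | null input) + log2 g(y | x),
    given arrays of per-instance log2-probabilities from the
    null-input model g' and the real-input model g."""
    return np.asarray(log2_p_model) - np.asarray(log2_p_null)

def should_group(pvi_task_a, pvi_task_b, alpha=0.05):
    """Group two tasks if their PVI estimates are not statistically
    different (Welch's t-test; alpha is an assumed default)."""
    _, p_value = ttest_ind(pvi_task_a, pvi_task_b, equal_var=False)
    return p_value > alpha
```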
Authors: Jui-Nan Yen, Si Si, Zhao Meng, Felix Yu, Sai Surya Duvvuri, Inderjit S. Dhillon, Cho-Jui Hsieh, Sanjiv Kumar
Abstract: Low-rank adaptation (LoRA) is a widely used parameter-efficient finetuning method for LLMs that reduces memory requirements. However, current LoRA optimizers lack transformation invariance, meaning the actual updates to the weights depend on how the two LoRA factors are scaled or rotated. This deficiency leads to inefficient learning and sub-optimal solutions in practice. This paper introduces LoRA-RITE, a novel adaptive matrix preconditioning method for LoRA optimization, which achieves transformation invariance while remaining computationally efficient. We provide theoretical analysis to demonstrate the benefit of our method and conduct experiments on various LLM tasks with different models including Gemma 2B, 7B, and mT5-XXL. The results demonstrate consistent improvements over existing optimizers. For example, replacing Adam with LoRA-RITE during LoRA fine-tuning of Gemma-2B yielded a 4.6% accuracy gain on Super-Natural Instructions and a 3.5% accuracy gain across four other LLM benchmarks (HellaSwag, ArcChallenge, GSM8K, OpenBookQA).
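The invariance problem is easy to demonstrate numerically. In the sketch below, (sB, A/s) parameterizes exactly the same weight W = BA, yet one SGD step on the factors produces a different effective update to W depending on s; this reproduces the deficiency the abstract describes, not LoRA-RITE itself.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 8, 2
B, A = rng.normal(size=(d, r)), rng.normal(size=(r, d))
G = rng.normal(size=(d, d))   # gradient of the loss w.r.t. W = B @ A
lr = 0.1

def effective_update(B, A):
    dB, dA = G @ A.T, B.T @ G          # chain rule through W = B @ A
    # First-order change in W after one SGD step on the factors.
    return -lr * (dB @ A + B @ dA)

for s in [1.0, 10.0]:
    # (s*B, A/s) represents exactly the same W, yet the factor-space
    # SGD step changes W differently -- the lack of transformation
    # invariance that LoRA-RITE is designed to remove.
    dW = effective_update(s * B, A / s)
    print(s, np.linalg.norm(dW))
```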
Authors: Junjie Wu, Tsz Ting Chung, Kai Chen, Dit-Yan Yeung
Abstract: Despite their outstanding performance in vision-language reasoning, Large Vision-Language Models (LVLMs) might generate hallucinated content that does not exist in the given image. Most existing LVLM hallucination benchmarks are constrained to evaluating object-related hallucinations. However, the potential hallucination of relations between two objects, i.e., relation hallucination, still lacks investigation. To remedy this, we design a unified framework to measure object and relation hallucination in LVLMs simultaneously. The core idea of our framework is to evaluate hallucinations via (object, relation, object) triplets extracted from LVLMs' responses, making it easily generalizable to different vision-language tasks. Based on our framework, we further introduce Tri-HE, a novel Triplet-level Hallucination Evaluation benchmark which can be used to study both object and relation hallucination at the same time. With comprehensive evaluations on Tri-HE, we observe that the relation hallucination issue is even more serious than object hallucination among existing LVLMs, highlighting a previously neglected problem on the path towards reliable LVLMs. Moreover, based on our findings, we design a simple training-free approach that effectively mitigates hallucinations for LVLMs. Our dataset and code for the reproduction of our experiments are available publicly at https://github.com/wujunjie1998/Tri-HE.
Authors: Szymon Bobek, Paloma Koryci\'nska, Monika Krakowska, Maciej Mozolewski, Dorota Rak, Magdalena Zych, Magdalena W\'ojcik, Grzegorz J. Nalepa
Abstract: This paper introduces a dataset that is the result of a user study on the comprehensibility of explainable artificial intelligence (XAI) algorithms. The study participants were recruited from 149 candidates to form three groups representing experts in the domain of mycology (DE), students with a data science and visualization background (IT), and students from social sciences and humanities (SSH). The main part of the dataset contains 39 transcripts of interviews during which participants were asked to complete a series of tasks and questions related to the interpretation of explanations of decisions of a machine learning model trained to distinguish between edible and inedible mushrooms. The transcripts were complemented with additional data that includes visualizations of explanations presented to the user, results from thematic analysis, recommendations for improvements of explanations provided by the participants, and the initial survey results that allow one to determine the domain knowledge and data analysis literacy of each participant. The transcripts were manually tagged to allow for automatic matching between the text and other data related to particular fragments. Given the rapid development of XAI techniques, the need for multidisciplinary qualitative evaluation of explainability is one of the emerging topics in the community. Our dataset not only allows our study to be reproduced, but also opens a wide range of possibilities for the analysis of the material we gathered.
Authors: George Jiayuan Gao, Tianyu Li, Nadia Figueroa
Abstract: We propose an object-centric recovery (OCR) framework to address the challenges of out-of-distribution (OOD) scenarios in visuomotor policy learning. Previous behavior cloning (BC) methods rely heavily on extensive labeled data coverage, failing in unfamiliar spatial states. Without relying on extra data collection, our approach learns a recovery policy constructed by an inverse policy inferred from the object keypoint manifold gradient in the original training data. The recovery policy serves as a simple add-on to any base visuomotor BC policy, agnostic to the specific method, guiding the system back towards the training distribution to ensure task success even in OOD situations. We demonstrate the effectiveness of our object-centric framework in both simulation and real robot experiments, achieving an improvement of 77.7% over the base policy in OOD scenarios. Furthermore, we show OCR's capacity to autonomously collect demonstrations for continual learning. Overall, we believe this framework represents a step toward improving the robustness of visuomotor policies in real-world settings.
Authors: Xinghua Zhang, Haiyang Yu, Cheng Fu, Fei Huang, Yongbin Li
Abstract: In the realm of large language models (LLMs), the ability of models to accurately follow instructions is paramount, as more agents and applications are built on LLMs and the complexity of instructions is rapidly increasing. However, on the one hand, there is only a limited amount of complex instruction evaluation data; on the other hand, there are no dedicated algorithms to improve the ability to follow complex instructions. To this end, this paper introduces TRACE, a benchmark for improving and evaluating complex instruction-following ability, which consists of 120K training samples and 1K evaluation samples. Furthermore, we propose the IOPO (Input-Output Preference Optimization) alignment method, which takes both input and output preference pairs into consideration, so that LLMs not only rapidly align with response preferences but also meticulously explore the instruction preferences. Extensive experiments on both in-domain and out-of-domain datasets confirm the effectiveness of IOPO, showing improvements of 8.15% and 2.18% on in-domain data and 6.29% and 3.13% on out-of-domain data compared to SFT and DPO, respectively.
Authors: Haoning Wu, Ziheng Zhao, Ya Zhang, Yanfeng Wang, Weidi Xie
Abstract: Training medical image segmentation models for rare yet clinically important imaging modalities is challenging due to the scarcity of annotated data, and manual mask annotations can be costly and labor-intensive to acquire. This paper investigates leveraging generative models to synthesize training data for segmentation models in underrepresented modalities, particularly annotation-scarce MRI. Concretely, our contributions are threefold: (i) we introduce MRGen-DB, a large-scale radiology image-text dataset comprising extensive samples with rich metadata, including modality labels, attributes, regions, and organ information, with a subset featuring pixel-wise mask annotations; (ii) we present MRGen, a diffusion-based data engine for controllable medical image synthesis, conditioned on text prompts and segmentation masks. MRGen can generate realistic images for diverse MRI modalities lacking mask annotations, facilitating segmentation training in low-resource domains; (iii) extensive experiments across multiple modalities demonstrate that MRGen significantly improves segmentation performance on unannotated modalities by providing high-quality synthetic data. We believe that our method bridges a critical gap in medical image analysis, extending segmentation capabilities to scenarios in which manual annotations are challenging to acquire. The codes, models, and data will be publicly available at https://haoningwu3639.github.io/MRGen/
Authors: Zhoulin Ji, Chenhao Lin, Hang Wang, Chao Shen
Abstract: Distinguishing synthetic from real speech is increasingly crucial due to the risks of misinformation and identity impersonation. While various datasets for synthetic speech analysis have been developed, they often focus on specific areas, limiting their utility for comprehensive research. To fill this gap, we propose the Speech-Forensics dataset, which extensively covers authentic, synthetic, and partially forged speech samples that include multiple segments synthesized by different high-quality algorithms. Moreover, we propose a TEmporal Speech LocalizaTion network, called TEST, which aims to simultaneously perform authenticity detection, localization of multiple fake segments, and recognition of synthesis algorithms, without any complex post-processing. TEST effectively integrates LSTM and Transformer layers to extract more powerful temporal speech representations and utilizes dense prediction on multi-scale pyramid features to estimate the synthetic spans. Our model achieves an average mAP of 83.55% and an EER of 5.25% at the utterance level. At the segment level, it attains an EER of 1.07% and a 92.19% F1 score. These results highlight the model's robust capability for comprehensive analysis of synthetic speech, offering a promising avenue for future research and practical applications in this field.
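For readers reproducing the utterance-level numbers, the EER can be computed from detection scores with a standard ROC sweep; the sketch below uses scikit-learn and invented scores, and is a generic metric implementation rather than part of TEST.

```python
import numpy as np
from sklearn.metrics import roc_curve

def equal_error_rate(labels, scores):
    """EER: the operating point where the false positive rate equals
    the false negative rate (1 - TPR)."""
    fpr, tpr, _ = roc_curve(labels, scores)
    fnr = 1 - tpr
    idx = np.nanargmin(np.abs(fpr - fnr))
    return (fpr[idx] + fnr[idx]) / 2

# Hypothetical scores: higher = more likely synthetic.
labels = np.array([0, 0, 0, 1, 1, 1])
scores = np.array([0.1, 0.3, 0.6, 0.4, 0.8, 0.9])
print(equal_error_rate(labels, scores))
```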
Authors: M. Garcia-Fernandez
Abstract: Accurate and reliable photometric redshift determination is one of the key aspects of wide-field photometric surveys. Determining photometric redshifts for galaxies has traditionally been solved by the use of machine-learning and artificial intelligence techniques trained on a calibration sample of galaxies, where both photometry and spectrometry are available. In this paper, we present a new algorithmic approach for determining photometric redshifts of galaxies using Conditional Generative Adversarial Networks (CGANs). The proposed implementation is able to produce both point estimates and probability-density estimates of photometric redshifts. The methodology is tested on data from the Dark Energy Survey (DES) Y1 and compared with an existing algorithm, the Mixture Density Network (MDN). Although the results show the superiority of the MDN, the CGAN quality metrics are close to the MDN results, opening the door to the use of CGANs for photometric redshift estimation.
Authors: Sneh Pandya, Purvik Patel, Brian D. Nord, Mike Walmsley, Aleksandra \'Ciprijanovi\'c
Abstract: Modern neural networks (NNs) often do not generalize well in the presence of a "covariate shift"; that is, in situations where the training and test data distributions differ, but the conditional distribution of classification labels remains unchanged. In such cases, NN generalization can be reduced to a problem of learning more domain-invariant features. Domain adaptation (DA) methods include a range of techniques aimed at achieving this; however, these methods have struggled with the need for extensive hyperparameter tuning, which then incurs significant computational costs. In this work, we introduce SIDDA, an out-of-the-box DA training algorithm built upon the Sinkhorn divergence, that can achieve effective domain alignment with minimal hyperparameter tuning and computational overhead. We demonstrate the efficacy of our method on multiple simulated and real datasets of varying complexity, including simple shapes, handwritten digits, and real astronomical observations. SIDDA is compatible with a variety of NN architectures, and it works particularly well in improving classification accuracy and model calibration when paired with equivariant neural networks (ENNs). We find that SIDDA enhances the generalization capabilities of NNs, achieving up to a $\approx40\%$ improvement in classification accuracy on unlabeled target data. We also study the efficacy of DA on ENNs with respect to the varying group orders of the dihedral group $D_N$, and find that the model performance improves as the degree of equivariance increases. Finally, we find that SIDDA enhances model calibration on both source and target data--achieving over an order of magnitude improvement in the ECE and Brier score. SIDDA's versatility, combined with its automated approach to domain alignment, has the potential to advance multi-dataset studies by enabling the development of highly generalizable models.
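A minimal sketch of Sinkhorn-based alignment is shown below, assuming a model that returns (features, logits) and using the geomloss library's SamplesLoss; the blur value and the fixed weight lam are hand-set placeholders, whereas SIDDA's point is precisely to avoid such manual tuning.

```python
import torch
from geomloss import SamplesLoss  # pip install geomloss

sinkhorn = SamplesLoss(loss="sinkhorn", p=2, blur=0.05)

def domain_aligned_loss(model, x_src, y_src, x_tgt, lam=1.0):
    # Supervised loss on labeled source data plus a Sinkhorn-divergence
    # penalty pulling source and target feature distributions together.
    feats_src, logits_src = model(x_src)   # assumed (features, logits) API
    feats_tgt, _ = model(x_tgt)
    clf = torch.nn.functional.cross_entropy(logits_src, y_src)
    align = sinkhorn(feats_src, feats_tgt)
    return clf + lam * align
```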
Authors: Yunzhe Li, Junting Wang, Hari Sundaram, Zhining Liu
Abstract: Zero-shot cross-domain sequential recommendation (ZCDSR) enables predictions in unseen domains without additional training or fine-tuning, addressing the limitations of traditional models in sparse data environments. Recent advancements in large language models (LLMs) have significantly enhanced ZCDSR by facilitating cross-domain knowledge transfer through rich, pretrained representations. Despite this progress, domain semantic bias -- arising from differences in vocabulary and content focus between domains -- remains a persistent challenge, leading to misaligned item embeddings and reduced generalization across domains. To address this, we propose a novel semantic bias-aware framework that enhances LLM-based ZCDSR by improving cross-domain alignment at both the item and sequential levels. At the item level, we introduce a generalization loss that aligns the embeddings of items across domains (inter-domain compactness), while preserving the unique characteristics of each item within its own domain (intra-domain diversity). This ensures that item embeddings can be transferred effectively between domains without collapsing into overly generic or uniform representations. At the sequential level, we develop a method to transfer user behavioral patterns by clustering source domain user sequences and applying attention-based aggregation during target domain inference. We dynamically adapt user embeddings to unseen domains, enabling effective zero-shot recommendations without requiring target-domain interactions...
Authors: Antoine Dedieu, Joseph Ortiz, Xinghua Lou, Carter Wendelken, Wolfgang Lehrach, J Swaroop Guntupalli, Miguel Lazaro-Gredilla, Kevin Patrick Murphy
Abstract: We present three improvements to the standard model-based RL paradigm based on transformers: (a) "Dyna with warmup", which trains the policy on real and imaginary data, but only starts using imaginary data after the world model has been sufficiently trained; (b) "nearest neighbor tokenizer" for image patches, which improves upon the tokenization schemes needed when using a transformer world model (TWM) by ensuring the code words are static after creation, thus providing a constant target for TWM learning; and (c) "block teacher forcing", which allows the TWM to reason jointly about the future tokens of the next timestep, instead of generating them sequentially. We then show that our method significantly improves upon prior methods in various environments. We mostly focus on the challenging Craftax-classic benchmark, where our method achieves a reward of 69.66% after only 1M environment steps, significantly outperforming DreamerV3, which achieves 53.2%, and exceeding human performance of 65.0% for the first time. We also show preliminary results on Craftax-full, MinAtar, and three different two-player games, to illustrate the generality of the approach.
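A grow-only nearest-neighbor codebook of the kind described in (b) can be sketched in a few lines; the distance threshold and flat-vector patches are assumptions for illustration, not the paper's exact design.

```python
import numpy as np

class NearestNeighborTokenizer:
    """Grow-only codebook: a patch far from every existing code word
    becomes a new code word; code words are never updated afterwards,
    giving the world model a constant regression target."""

    def __init__(self, threshold):
        self.threshold = threshold
        self.codes = []          # list of 1-D patch vectors

    def tokenize(self, patch):
        if self.codes:
            dists = np.linalg.norm(np.stack(self.codes) - patch, axis=1)
            nearest = int(dists.argmin())
            if dists[nearest] <= self.threshold:
                return nearest   # reuse the closest existing code word
        self.codes.append(patch.copy())
        return len(self.codes) - 1
```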
Authors: Vaidehi Patil, Elias Stengel-Eskin, Mohit Bansal
Abstract: User specifications or legal frameworks often require information to be removed from pretrained models, including large language models (LLMs). This requires deleting or "forgetting" a set of data points from an already-trained model, which typically degrades its performance on other data points. Thus, a balance must be struck between removing information and keeping the model's other abilities intact, with a failure to balance this trade-off leading to poor deletion or an unusable model. To this end, we propose UPCORE (Utility-Preserving Coreset Selection), a method-agnostic data selection framework for mitigating collateral damage during unlearning. Finding that the model damage is correlated with the variance of the model's representations on the forget set, we selectively prune the forget set to remove outliers, thereby minimizing model degradation after unlearning. Across three standard unlearning methods, UPCORE consistently achieves a superior balance between the competing objectives of deletion efficacy and model preservation. To better evaluate this trade-off, we introduce a new metric, measuring the area-under-the-curve (AUC) across standard metrics. Our results show that UPCORE improves both standard metrics and AUC, benefiting from positive transfer between the coreset and pruned points while reducing negative transfer from the forget set to points outside of it.
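One simple way to operationalize variance-based pruning of the forget set is to drop the points farthest from the representation centroid, as in the sketch below; UPCORE's actual outlier criterion may differ, so treat this as an assumed stand-in.

```python
import numpy as np

def prune_forget_set(reps, keep_fraction=0.8):
    """Keep the forget-set points whose hidden representations lie
    closest to the centroid, dropping the variance-inflating outliers.
    reps: (n_points, hidden_dim) array of model representations."""
    centroid = reps.mean(axis=0)
    dists = np.linalg.norm(reps - centroid, axis=1)
    n_keep = int(len(reps) * keep_fraction)
    keep_idx = np.argsort(dists)[:n_keep]   # lowest-distance points
    return keep_idx
```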
Authors: Yi Zeng, Enmeng Lu, Xin Guan, Cunqing Huangfu, Zizhe Ruan, Ammar Younas, Kang Sun, Xuan Tang, Yuwei Wang, Hongjie Suo, Dongqi Liang, Zhengqiang Han, Aorigele Bao, Xiaoyang Guo, Jin Wang, Jiawei Xie, Yao Liang
Abstract: The rapid advancement of Artificial Intelligence (AI) technology is profoundly transforming human society while presenting a series of ethical, legal, and social issues. The effective governance of AI has become a crucial global concern. Since 2022, the extensive deployment of generative AI, particularly large language models, has marked a new phase in AI governance. Continuous efforts are being made by the international community to actively address the novel challenges posed by these AI developments. As consensus on international governance continues to be established and put into action, the practical importance of conducting a global assessment of the state of AI governance is progressively coming to light. In this context, we initiated the development of the AI Governance InternationaL Evaluation Index (AGILE Index). Adhering to the design principle that "the level of governance should match the level of development," the inaugural evaluation of the AGILE Index commences with an exploration of four foundational pillars: the development level of AI, the AI governance environment, the AI governance instruments, and the AI governance effectiveness. It covers 39 indicators across 18 dimensions to comprehensively assess the AI governance level of 14 representative countries in this first batch of evaluations. The aim is to depict the current state of AI governance in these countries through data scoring, assist them in identifying their governance stage and uncovering governance issues, and ultimately offer insights for the enhancement of their AI governance systems.
Authors: Osnat Mokryn, Teddy Lazebnik, Hagit Ben Shoshan
Abstract: The analysis of high-dimensional timeline data and the identification of outliers and anomalies is critical across diverse domains, including sensor readings, biological and medical data, historical records, and global statistics. However, conventional analysis techniques often struggle with challenges such as high dimensionality, complex distributions, and sparsity. These limitations hinder the ability to extract meaningful insights from complex temporal datasets, making it difficult to identify trending features, outliers, and anomalies effectively. Inspired by surprisability -- a cognitive science concept describing how humans instinctively focus on unexpected deviations -- we propose Learning via Surprisability (LvS), a novel approach for transforming high-dimensional timeline data. LvS quantifies and prioritizes anomalies in time-series data by formalizing deviations from expected behavior. LvS bridges cognitive theories of attention with computational methods, enabling the detection of anomalies and shifts in a way that preserves critical context, offering a new lens for interpreting complex datasets. We demonstrate the usefulness of LvS on three high-dimensional timeline use cases: a time series of sensor data, a global dataset of mortality causes over multiple years, and a textual corpus containing over two centuries of State of the Union Addresses by U.S. presidents. Our results show that the LvS transformation enables efficient and interpretable identification of outliers, anomalies, and the most variable features along the timeline.
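A minimal instantiation of the surprisability idea, assuming a per-feature Gaussian model of the preceding window (the paper's formalization of expected behavior may differ):

```python
import numpy as np
from scipy.stats import norm

def surprisability(history, observation, eps=1e-9):
    """Per-feature surprisal of a new time step given a Gaussian fit
    to the preceding window: -log p(observation | history).
    history: (t, n_features); observation: (n_features,)."""
    mu = history.mean(axis=0)
    sigma = history.std(axis=0) + eps
    return -norm.logpdf(observation, loc=mu, scale=sigma)

# Features whose surprisal spikes are candidate outliers/anomalies.
```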
Authors: Valentin Charraut, Wa\"el Doulazmi, Thomas Tournaire, Thibault Buhet
Abstract: Learning-based decision-making has the potential to enable generalizable Autonomous Driving (AD) policies, reducing the engineering overhead of rule-based approaches. Imitation Learning (IL) remains the dominant paradigm, benefiting from large-scale human demonstration datasets, but it suffers from inherent limitations such as distribution shift and imitation gaps. Reinforcement Learning (RL) presents a promising alternative, yet its adoption in AD remains limited due to the lack of standardized and efficient research frameworks. To this end, we introduce V-Max, an open research framework providing all the necessary tools to make RL practical for AD. V-Max is built on Waymax, a hardware-accelerated AD simulator designed for large-scale experimentation. We extend it using ScenarioNet's approach, enabling the fast simulation of diverse AD datasets.
Authors: Hanjin Kim, Jiseong Park, Seojin Kim, Jueun Choi, Doheon Lee, Sung Ju Hwang
Abstract: Graph pooling, which compresses a whole graph into a smaller coarsened graph, is an essential component of graph representation learning. To efficiently compress a given graph, graph pooling methods often drop nodes using attention-based scoring with the task loss. However, this often results in simply removing nodes with lower degrees without consideration of their feature-level relevance to the given task. To fix this problem, we propose Multi-View Pruning (MVP), a graph pruning method based on a multi-view framework and a reconstruction loss. Given a graph, MVP first constructs multiple graphs for different views, either by utilizing predefined modalities or by randomly partitioning the input features, to consider the importance of each node from diverse perspectives. Then, it learns the score for each node by considering both the reconstruction and the task loss. MVP can be incorporated into any hierarchical pooling framework to score the nodes. We validate MVP on multiple benchmark datasets by coupling it with two graph pooling methods, and show that it significantly improves the performance of the base graph pooling method, outperforming all baselines. Further analysis shows that both the encoding of multiple views and the consideration of the reconstruction loss are key to the success of MVP, and that it indeed identifies nodes that are less important according to domain knowledge.
Authors: Palakorn Achananuparp, Ee-Peng Lim, Yao Lu
Abstract: Automatically annotating job data with standardized occupations from taxonomies, known as occupation classification, is crucial for labor market analysis. However, this task is often hindered by data scarcity and the challenges of manual annotation. While large language models (LLMs) hold promise due to their extensive world knowledge and in-context learning capabilities, their effectiveness depends on their knowledge of occupational taxonomies, which remains unclear. In this study, we assess the ability of LLMs to generate precise taxonomic entities from a taxonomy, highlighting their limitations, especially for smaller models. To address these challenges, we propose a multi-stage framework consisting of inference, retrieval, and reranking stages, which integrates taxonomy-guided reasoning examples to enhance performance by aligning outputs with taxonomic knowledge. Evaluations on a large-scale dataset show that our framework not only enhances occupation and skill classification tasks, but also provides a cost-effective alternative to frontier models like GPT-4o, significantly reducing computational costs while maintaining strong performance. This makes it a practical and scalable solution for occupation classification and related tasks across LLMs.
Authors: Tingyang Xiao, Xiaolin Zhou, Liu Liu, Wei Sui, Wei Feng, Jiaxiong Qiu, Xinjie Wang, Zhizhong Su
Abstract: This paper presents GeoFlow-SLAM, a robust and effective tightly-coupled RGBD-inertial SLAM system for legged robots undergoing aggressive and high-frequency motions. By integrating geometric consistency, legged odometry constraints, and dual-stream optical flow (GeoFlow), our method addresses three critical challenges: feature matching failures during fast locomotion, pose initialization failures, and visual feature scarcity in texture-less scenes. Specifically, in rapid motion scenarios, feature matching is notably enhanced by leveraging dual-stream optical flow, which combines prior map points and poses. Additionally, we propose a robust pose initialization method for fast locomotion and IMU error in legged robots, integrating IMU/legged odometry, inter-frame Perspective-n-Point (PnP), and Generalized Iterative Closest Point (GICP). Furthermore, a novel optimization framework that tightly couples depth-to-map and GICP geometric constraints is introduced to improve robustness and accuracy in long-duration, visually texture-less environments. The proposed algorithms achieve state-of-the-art (SOTA) performance on collected legged-robot datasets and open-source datasets. To further promote research and development, the open-source datasets and code will be made publicly available at https://github.com/HorizonRobotics/geoflow-slam
Authors: Haoxuan Ma, Xishun Liao, Yifan Liu, Qinhua Jiang, Chris Stanford, Shangqing Cao, Jiaqi Ma
Abstract: Human mobility modeling is critical for urban planning and transportation management, yet existing approaches often lack the integration capabilities needed to handle diverse data sources. We present a foundation model framework for universal human mobility patterns that leverages cross-domain data fusion and large language models to address these limitations. Our approach integrates multi-modal data of distinct nature and spatio-temporal resolution, including geographical, mobility, socio-demographic, and traffic information, to construct a privacy-preserving and semantically enriched human travel trajectory dataset. Our framework demonstrates adaptability through domain transfer techniques that ensure transferability across diverse urban contexts, as evidenced in case studies of Los Angeles (LA) and Egypt. The framework employs LLMs for semantic enrichment of trajectory data, enabling comprehensive understanding of mobility patterns. Quantitative evaluation shows that our generated synthetic dataset accurately reproduces mobility patterns observed in empirical data. The practical utility of this foundation model approach is demonstrated through large-scale traffic simulations for LA County, where results align well with observed traffic data. On California's I-405 corridor, the simulation yields a Mean Absolute Percentage Error of 5.85% for traffic volume and 4.36% for speed compared to Caltrans PeMS observations, illustrating the framework's potential for intelligent transportation systems and urban mobility applications.
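For reference, the reported errors correspond to the standard Mean Absolute Percentage Error formula, sketched here with placeholder arrays; this is a generic metric, not the authors' evaluation code.

```python
import numpy as np

def mape(observed, simulated):
    """Mean Absolute Percentage Error, as used to compare simulated
    traffic volume and speed against observed sensor data."""
    observed, simulated = np.asarray(observed), np.asarray(simulated)
    return 100.0 * np.mean(np.abs((observed - simulated) / observed))
```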
Authors: Sunwoong Yang, Youngkyu Lee, Namwoo Kang
Abstract: This study presents an enhanced multi-fidelity Deep Operator Network (DeepONet) framework for efficient spatio-temporal flow field prediction when high-fidelity data is scarce. Key innovations include: a merge network replacing traditional dot-product operations, achieving 50.4% reduction in prediction error and 7.57% accuracy improvement while reducing training time by 96%; a transfer learning multi-fidelity approach that freezes pre-trained low-fidelity networks while making only the merge network trainable, outperforming alternatives by up to 76% and achieving 43.7% better accuracy than single-fidelity training; and a physics-guided subsampling method that strategically selects high-fidelity training points based on temporal dynamics, reducing high-fidelity sample requirements by 40% while maintaining comparable accuracy. Comprehensive experiments across multiple resolutions and datasets demonstrate the framework's ability to significantly reduce required high-fidelity dataset size while maintaining predictive accuracy, with consistent superior performance against conventional benchmarks.
Authors: Suhas G Hegde, Shilpy Kaur, Aruna Tiwari
Abstract: Popular PEFT methods reduce the trainable parameter count for fine-tuning by parameterizing new low-rank or sparse trainable weights in parallel to the frozen pre-trained weights $W$. However, these weights are trained from scratch, and there exists a performance gap between these methods and full fine-tuning, especially in low-budget settings. We introduce VectorFit, a new way of parameterization that efficiently utilizes the existing knowledge embedded in $W$ by adaptively training its singular vectors and biases. We show that utilizing the structural and transformational properties of $W$ in this way can lead to high-rank incremental weight matrices $\Delta W$, comparable to those of full fine-tuning. VectorFit delivers superior results with 9x fewer trainable parameters than the leading PEFT methods. Through comprehensive experiments across 19 datasets covering a wide range of language and vision tasks such as natural language understanding and generation, question answering, image classification, and image generation, we demonstrate that VectorFit surpasses baselines in terms of performance as a function of parameter efficiency.
Authors: Elija Perrier
Abstract: We present an extension of K-P time-optimal quantum control solutions using global Cartan $KAK$ decompositions for geodesic-based solutions. Extending recent time-optimal constant-$\theta$ control results, we integrate Cartan methods into equivariant quantum neural networks (EQNNs) for quantum control tasks. We show that a finite-depth EQNN ansatz equipped with Cartan layers can replicate the constant-$\theta$ sub-Riemannian geodesics for K-P problems. We demonstrate that, for certain classes of control problems on Riemannian symmetric spaces, gradient-based training using an appropriate cost function converges to global time-optimal solutions when simple regularity conditions are satisfied. This generalises prior geometric control theory methods and clarifies how optimal geodesic estimation can be performed in quantum machine learning contexts.
Authors: Yi Nian, Shenzhe Zhu, Yuehan Qin, Li Li, Ziyi Wang, Chaowei Xiao, Yue Zhao
Abstract: Multimodal large language models (MLLMs) excel in vision-language tasks but also pose significant risks of generating harmful content, particularly through jailbreak attacks. Jailbreak attacks refer to intentional manipulations that bypass safety mechanisms in models, leading to the generation of inappropriate or unsafe content. Detecting such attacks is critical to ensuring the responsible deployment of MLLMs. Existing jailbreak detection methods face three primary challenges: (1) many rely on model hidden states or gradients, limiting their applicability to white-box models, where the internal workings of the model are accessible; (2) they involve high computational overhead from uncertainty-based analysis, which limits real-time detection; and (3) they require fully labeled harmful datasets, which are often scarce in real-world settings. To address these issues, we introduce a test-time adaptive framework called JAILDAM. Our method leverages a memory-based approach guided by policy-driven unsafe knowledge representations, eliminating the need for explicit exposure to harmful data. By dynamically updating unsafe knowledge during test time, our framework improves generalization to unseen jailbreak strategies while maintaining efficiency. Experiments on multiple VLM jailbreak benchmarks demonstrate that JAILDAM delivers state-of-the-art performance in harmful content detection, improving both accuracy and speed.
Authors: Hanqi Xiao, Yi-Lin Sung, Elias Stengel-Eskin, Mohit Bansal
Abstract: Post-training quantization (PTQ) reduces a model's memory footprint by mapping full precision weights into low bit weights without costly retraining, but can degrade its downstream performance especially in low 2- to 3-bit settings. We develop a new mixed-precision PTQ approach, Task-Circuit Quantization (TaCQ), that draws parallels to automated circuit discovery, directly conditioning the quantization process on specific weight circuits -- which we define as sets of weights associated with downstream task performance. These weights are kept as 16-bit weights, while others are quantized, maintaining performance while only adding a marginal memory cost. Specifically, TaCQ contrasts unquantized model weights with a uniformly-quantized model to estimate the expected change in weights due to quantization and uses gradient information to predict the resulting impact on task performance, allowing us to preserve task-specific weights. We compare TaCQ-based quantization to existing mixed-precision quantization methods when conditioning both on general-purpose and task-specific data. Across QA, math reasoning, and text-to-SQL tasks for both Llama-3 and Qwen2.5, we find that TaCQ outperforms baselines using the same calibration data and a lower weight budget, achieving major improvements in the 2 and 3-bit regime. With only 3.1 bits we are able to recover 96% of Llama-3-8B-Instruct's unquantized 16-bit MMLU performance, obtaining a 5.25% absolute improvement over SPQR. We also observe consistently large gains over existing methods in the 2-bit regime, with an average gain of 14.74% over the strongest baseline, SliM-LLM. Moreover, we observe a 7.20% gain without conditioning on specific tasks, showing TaCQ's ability to identify important weights is not limited to task-conditioned settings.
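The saliency rule described in the abstract -- gradient times quantization-induced weight change -- can be sketched compactly; the keep fraction and per-tensor granularity here are illustrative assumptions, not TaCQ's exact budgeting.

```python
import torch

def task_circuit_mask(W, W_quant, grad, keep_fraction=0.01):
    """Sketch of the saliency idea from the abstract: predict each
    weight's impact on task loss as |gradient * quantization error|
    and keep the top fraction in 16-bit.
    W, W_quant, grad: tensors of the same shape."""
    saliency = (grad * (W_quant - W)).abs()
    k = max(1, int(saliency.numel() * keep_fraction))
    threshold = saliency.flatten().topk(k).values.min()
    keep_16bit = saliency >= threshold          # boolean mask
    # Salient weights stay full precision; the rest use quantized values.
    return torch.where(keep_16bit, W, W_quant), keep_16bit
```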
Authors: Fahmida Liza Piya, Rahmatollah Beheshti
Abstract: Unstructured clinical data can serve as a unique and rich source of information that can meaningfully inform clinical practice. Extracting the most pertinent context from such data is critical for exploiting its true potential toward optimal and timely decision-making in patient care. While prior research has explored various methods for clinical text summarization, most prior studies either process all input tokens uniformly or rely on heuristic-based filters, which can overlook nuanced clinical cues and fail to prioritize information critical for decision-making. In this study, we propose ConTextual, a novel framework that integrates a Context-Preserving Token Filtering method with a Domain-Specific Knowledge Graph (KG) for contextual augmentation. By preserving context-specific important tokens and enriching them with structured knowledge, ConTextual improves both linguistic coherence and clinical fidelity. Our extensive empirical evaluations on two public benchmark datasets demonstrate that ConTextual consistently outperforms other baselines. Our proposed approach highlights the complementary role of token-level filtering and structured retrieval in enhancing both linguistic and clinical integrity, as well as offering a scalable solution for improving precision in clinical text generation.
Authors: Nusrat Jahan, Ratun Rahman, Michel Wang
Abstract: Federated Learning (FL) has emerged as a transformative paradigm in the field of distributed machine learning, enabling multiple clients such as mobile devices, edge nodes, or organizations to collaboratively train a shared global model without the need to centralize sensitive data. This decentralized approach addresses growing concerns around data privacy, security, and regulatory compliance, making it particularly attractive in domains such as healthcare, finance, and smart IoT systems. This survey provides a concise yet comprehensive overview of Federated Learning, beginning with its core architecture and communication protocol. We discuss the standard FL lifecycle, including local training, model aggregation, and global updates. A particular emphasis is placed on key technical challenges such as handling non-IID (non-independent and identically distributed) data, mitigating system and hardware heterogeneity, reducing communication overhead, and ensuring privacy through mechanisms like differential privacy and secure aggregation. Furthermore, we examine emerging trends in FL research, including personalized FL, cross-device versus cross-silo settings, and integration with other paradigms such as reinforcement learning and quantum computing. We also highlight real-world applications and summarize benchmark datasets and evaluation metrics commonly used in FL research. Finally, we outline open research problems and future directions to guide the development of scalable, efficient, and trustworthy FL systems.
Authors: Junsheng Huang, Zhitao He, Yucheng Huang, Sandeep Polisetty, Qingyun Wang, May Fung
Abstract: With the widespread application of large language models (LLMs), the issue of generating non-existent facts, known as hallucination, has garnered increasing attention. Previous research on enhancing LLM confidence estimation has mainly focused on the single-problem setting. However, LLM awareness of its internal parameterized knowledge boundary under the more challenging multi-problem setting, which requires answering multiple problems accurately and simultaneously, remains underexplored. To bridge this gap, we introduce a novel method, Multiple Answers and Confidence Stepwise Tuning (MAC-Tuning), that separates the learning of answer prediction and confidence estimation during fine-tuning on instruction data. Extensive experiments demonstrate that our method outperforms baselines by up to 25% in average precision.
Authors: Roman J. Georgio, Caelum Forder, Suman Deb, Andri Rahimov, Peter Carroll, \"Onder G\"urcan
Abstract: Coral Protocol is an open and decentralized collaboration infrastructure that enables communication, coordination, trust and payments for The Internet of Agents. It addresses the growing need for interoperability in a world where organizations are deploying multiple specialized AI agents that must work together across domains and vendors. As a foundational platform for multi-agent AI ecosystems, Coral establishes a common language and coordination framework allowing any agent to participate in complex workflows with others. Its design emphasizes broad compatibility, security, and vendor neutrality, ensuring that agent interactions are efficient and trustworthy. In particular, Coral introduces standardized messaging formats for agent communication, a modular coordination mechanism for orchestrating multi-agent tasks, and secure team formation capabilities for dynamically assembling trusted groups of agents. Together, these innovations position Coral Protocol as a cornerstone of the emerging "Internet of Agents," unlocking new levels of automation, collective intelligence, and business value through open agent collaboration.
Authors: Koray Ulusan, Benjamin Kiefer
Abstract: Personalizing Stable Diffusion for professional portrait generation from amateur photos faces challenges in maintaining facial resemblance. This paper evaluates the impact of augmentation strategies on two personalization methods: DreamBooth and InstantID. We compare classical augmentations (flipping, cropping, color adjustments) with generative augmentation using InstantID's synthetic images to enrich training data. Using SDXL and a new FaceDistance metric based on FaceNet, we quantitatively assess facial similarity. Results show classical augmentations can cause artifacts harming identity retention, while InstantID improves fidelity when balanced with real images to avoid overfitting. A user study with 97 participants confirms high photorealism and preferences for InstantID's polished look versus DreamBooth's identity accuracy. Our findings inform effective augmentation strategies for personalized text-to-image generation.
Authors: Burkhard Ringlein, Thomas Parnell, Radu Stoica
Abstract: As LLMs grow in complexity, achieving state-of-the-art performance requires tight co-design across algorithms, software, and hardware. Today's reliance on a single dominant platform limits portability, creates vendor lock-in, and raises barriers for new AI hardware. In this work, we make the case for combining just-in-time (JIT) compilation with comprehensive kernel parameter autotuning to enable portable LLM inference with state-of-the-art performance without code changes. Focusing on performance-critical LLM kernels, we demonstrate that this approach explores up to 15x more kernel parameter configurations, produces significantly more diverse code across multiple dimensions, and even outperforms vendor-optimized implementations by up to 230%, all while reducing kernel code size by 70x and eliminating manual code optimizations. Our results highlight autotuning as a promising path to unlocking model portability across GPU vendors.
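The JIT-plus-autotuning recipe is illustrated below with the standard Triton vector-add pattern: the autotuner benchmarks the listed configurations per problem size and caches the winner. This is a generic textbook example requiring a CUDA GPU, not the authors' kernels, and the configuration list is an arbitrary assumption.

```python
import torch
import triton
import triton.language as tl

@triton.autotune(
    configs=[
        triton.Config({"BLOCK_SIZE": 256}, num_warps=2),
        triton.Config({"BLOCK_SIZE": 1024}, num_warps=4),
        triton.Config({"BLOCK_SIZE": 4096}, num_warps=8),
    ],
    key=["n_elements"],   # re-tune when the problem size changes
)
@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

x = torch.randn(1 << 20, device="cuda")
y = torch.randn_like(x)
out = torch.empty_like(x)
# The autotuner injects BLOCK_SIZE, so the grid reads it from meta.
grid = lambda meta: (triton.cdiv(x.numel(), meta["BLOCK_SIZE"]),)
add_kernel[grid](x, y, out, x.numel())
```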
Authors: Yiming Lei, Zhizheng Yang, Zeming Liu, Haitao Leng, Shaoguo Liu, Tingting Gao, Qingjie Liu, Yunhong Wang
Abstract: Multi-modal large language models have demonstrated remarkable zero-shot abilities and powerful image-understanding capabilities. However, existing open-source multi-modal models suffer from weak multi-turn interaction capability, especially for long contexts. To address the issue, we first introduce a context modeling module, termed ContextQFormer, which utilizes a memory block to enhance the presentation of contextual information. Furthermore, to facilitate further research, we carefully build a new multi-turn multi-modal dialogue dataset (TMDialog) for pre-training, instruction-tuning, and evaluation, which will be open-sourced later. Compared with other multi-modal dialogue datasets, TMDialog contains longer conversations, which supports research on multi-turn multi-modal dialogue. In addition, ContextQFormer is compared with three baselines on TMDialog, and experimental results illustrate that ContextQFormer achieves an improvement of 2%-4% in available rate over the baselines.
Authors: Orlando Marquez Ayala, Patrice Bechard, Emily Chen, Maggie Baird, Jingfei Chen
Abstract: Large Language Models (LLMs) such as GPT-4o can handle a wide range of complex tasks with the right prompt. As per-token costs are reduced, the advantages of fine-tuning Small Language Models (SLMs) for real-world applications -- faster inference, lower costs -- may no longer be clear. In this work, we present evidence that, for domain-specific tasks that require structured outputs, SLMs still have a quality advantage. We compare fine-tuning an SLM against prompting LLMs on the task of generating low-code workflows in JSON form. We observe that while a good prompt can yield reasonable results, fine-tuning improves quality by 10% on average. We also perform systematic error analysis to reveal model limitations.
Authors: Xiaoying Zhang, Hao Sun, Yipeng Zhang, Kaituo Feng, Chaochao Lu, Chao Yang, Helen Meng
Abstract: Recent advances in reinforcement learning (RL) with numerical feedback, such as scalar rewards, have significantly enhanced the complex reasoning capabilities of large language models (LLMs). Despite this success, we identify three key challenges encountered by RL with solely numerical feedback: performance plateaus, limited effectiveness of spontaneous self-reflection, and persistent failures. We then demonstrate that RL-finetuned models, even after exhibiting performance plateaus, can generate correct refinements on persistently failed problems by leveraging natural language feedback in the form of critiques. Building on this insight, we propose Critique-GRPO, an online RL framework that integrates both natural language and numerical feedback for effective policy optimization. Critique-GRPO enables LLMs to learn from initial responses and critique-guided self-refinements simultaneously while maintaining exploration. Additionally, we employ a shaping function to amplify learning from correct, especially unfamiliar, refinements and penalize incorrect ones. Extensive experiments with Qwen2.5-7B-Base, Qwen2.5-Math-7B-Base, and Qwen3-8B demonstrate that Critique-GRPO consistently outperforms supervised learning and RL-based fine-tuning methods across eight challenging mathematical, STEM, and general reasoning tasks, improving average pass@1 scores by approximately 4.4% and 3.8% on Qwen2.5-7B-Base and Qwen3-8B, respectively. Notably, Critique-GRPO enables effective self-improvement through self-critiquing and weak-to-strong generalization, achieving consistent gains over GRPO, such as 16.7% and 10.0% pass@1 improvements on AIME 2024, respectively.
Authors: Wa\"el Doulazmi, Auguste Lehuger, Marin Toromanoff, Valentin Charraut, Thibault Buhet, Fabien Moutarde
Abstract: Reinforcement Learning's high sensitivity to hyperparameters is a source of instability and inefficiency, creating significant challenges for practitioners. Hyperparameter Optimization (HPO) algorithms have been developed to address this issue; among them, Population-Based Training (PBT) stands out for its ability to generate hyperparameter schedules instead of fixed configurations. PBT trains a population of agents, each with its own hyperparameters, frequently ranking them and replacing the worst performers with mutations of the best agents. These intermediate selection steps can cause PBT to focus on short-term improvements, leading it to get stuck in local optima and eventually fall behind vanilla Random Search over longer timescales. This paper studies how this greediness issue is connected to the choice of evolution frequency, the rate at which selection is performed. We propose Multiple-Frequencies Population-Based Training (MF-PBT), a novel HPO algorithm that addresses greediness by employing sub-populations, each evolving at a distinct frequency. MF-PBT introduces a migration process to transfer information between sub-populations, with an asymmetric design that balances short- and long-term optimization. Extensive experiments on the Brax suite demonstrate that MF-PBT improves sample efficiency and long-term performance, even without actually tuning hyperparameters.
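A minimal sketch of the multiple-frequencies idea: sub-populations share the same exploit/explore step but run it at different rates, with occasional migration between them. Population sizes, the toy objective, and the migration rule are simplified assumptions.

```python
# Multiple-frequencies PBT sketch (names and the migration rule are
# simplified assumptions; see the paper for the actual asymmetric design).
import random

class Agent:
    def __init__(self):
        self.lr = 10 ** random.uniform(-5, -2)  # one hyperparameter
        self.score = 0.0

    def train_step(self):
        # Stand-in for real RL training: peak performance near lr = 1e-3.
        self.score += -abs(self.lr - 1e-3) + random.gauss(0, 1e-4)

def exploit_explore(pop):
    pop.sort(key=lambda a: a.score, reverse=True)
    for loser, winner in zip(pop[-2:], pop[:2]):
        loser.lr = winner.lr * random.choice([0.8, 1.25])  # perturbed copy
        loser.score = winner.score

subpops = {freq: [Agent() for _ in range(6)] for freq in (5, 20, 80)}
for step in range(1, 161):
    for freq, pop in subpops.items():
        for agent in pop:
            agent.train_step()
        if step % freq == 0:
            exploit_explore(pop)  # high-frequency sub-populations are greedier
    if step % 80 == 0:
        # Migration: the slow (long-horizon) sub-population inherits the best
        # fast-population hyperparameters, sharing short-term discoveries.
        best_fast = max(subpops[5], key=lambda a: a.score)
        subpops[80][-1].lr = best_fast.lr

best = max(sum(subpops.values(), []), key=lambda a: a.score)
print(f"best lr found: {best.lr:.2e}")
```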
Authors: Zijian Zhou, Ao Qu, Zhaoxuan Wu, Sunghwan Kim, Alok Prakash, Daniela Rus, Jinhua Zhao, Bryan Kian Hsiang Low, Paul Pu Liang
Abstract: Modern language agents must operate over long-horizon, multi-turn interactions, where they retrieve external information, adapt to observations, and answer interdependent queries. Yet, most LLM systems rely on full-context prompting, appending all past turns regardless of their relevance. This leads to unbounded memory growth, increased computational costs, and degraded reasoning performance on out-of-distribution input lengths. We introduce MEM1, an end-to-end reinforcement learning framework that enables agents to operate with constant memory across long multi-turn tasks. At each turn, MEM1 updates a compact shared internal state that jointly supports memory consolidation and reasoning. This state integrates prior memory with new observations from the environment while strategically discarding irrelevant or redundant information. To support training in more realistic and compositional settings, we propose a simple yet effective and scalable approach to constructing multi-turn environments by composing existing datasets into arbitrarily complex task sequences. Experiments across three domains, including internal retrieval QA, open-domain web QA, and multi-turn web shopping, show that MEM1-7B improves performance by 3.5x while reducing memory usage by 3.7x compared to Qwen2.5-14B-Instruct on a 16-objective multi-hop QA task, and generalizes beyond the training horizon. Our results demonstrate the promise of reasoning-driven memory consolidation as a scalable alternative to existing solutions for training long-horizon interactive agents, where both efficiency and performance are optimized.
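Schematically, a constant-memory agent loop might look as follows: each turn the model rewrites one compact internal state instead of appending the full history. The `llm` stub, tag format, and prompt are illustrative assumptions.

```python
# Constant-memory agent loop in the spirit of MEM1 (schematic; `llm` is a
# stand-in for any chat-completion call).

def llm(prompt: str) -> str:
    return "<state>...updated summary...</state><action>search(...)</action>"

def run_episode(task: str, env_step, max_turns: int = 8):
    state = f"Task: {task}. No observations yet."
    for _ in range(max_turns):
        out = llm(
            "You keep a single compact state. Merge the new observation into "
            f"it, drop anything irrelevant, then act.\nState: {state}\n"
        )
        # The rewritten state stays bounded; old turns are never re-appended.
        state = out.split("<state>")[1].split("</state>")[0]
        action = out.split("<action>")[1].split("</action>")[0]
        obs, done = env_step(action)
        if done:
            return state
        state += f" Last observation: {obs}"  # merged into state next turn
    return state

print(run_episode("find the cheapest flight", lambda a: ("stub obs", True)))
```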
Authors: Jongoh Jeong, Hunmin Yang, Jaeseok Jeong, Kuk-Jin Yoon
Abstract: Generative adversarial attacks train a perturbation generator on a white-box surrogate model and subsequently apply the crafted perturbations to unseen black-box victim models. In contrast to iterative attacks, these methods deliver superior inference-time efficiency, scalability, and transferability; however, existing studies have not fully exploited the representational capacity of generative models to preserve and harness semantic information. Specifically, the intermediate activations of the generator encode rich semantic features -- object boundaries and coarse shapes -- that remain under-exploited, limiting the alignment of perturbations with the object-salient regions that are critical for adversarial transferability. To remedy this, we introduce a semantic structure-aware attack framework based on the Mean Teacher, which serves as a temporally smoothed feature reference. With this smoothed reference, we further enforce semantic consistency between the early-layer activations of the student and those of the semantically rich teacher via feature distillation. Guided by empirical findings, we anchor perturbation synthesis to the semantically salient early intermediate blocks of the generator, steering progressive adversarial perturbations toward regions that substantially enhance transferability. We conduct extensive experiments over diverse models, domains, and tasks to demonstrate consistent improvements relative to state-of-the-art generative attacks, comprehensively evaluated using conventional metrics and our newly proposed Accidental Correction Rate (ACR).
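The two generic ingredients named here, an EMA "mean teacher" copy of the generator and an early-layer feature-distillation loss, might be sketched as follows. The layer choice, loss, and decay are assumptions, not the paper's settings.

```python
# EMA teacher + early-layer feature distillation (sketch under assumptions).
import copy
import torch
import torch.nn.functional as F

def ema_update(teacher, student, decay=0.999):
    # Teacher parameters track a temporally smoothed copy of the student.
    with torch.no_grad():
        for pt, ps in zip(teacher.parameters(), student.parameters()):
            pt.mul_(decay).add_(ps, alpha=1 - decay)

student = torch.nn.Sequential(torch.nn.Conv2d(3, 8, 3, padding=1),
                              torch.nn.ReLU(),
                              torch.nn.Conv2d(8, 3, 3, padding=1))
teacher = copy.deepcopy(student).requires_grad_(False)

x = torch.randn(4, 3, 32, 32)
feat_s = student[0](x)                 # early-layer activations (student)
feat_t = teacher[0](x)                 # temporally smoothed reference
distill_loss = F.mse_loss(feat_s, feat_t)  # semantic-consistency term
distill_loss.backward()
ema_update(teacher, student)
```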
Authors: Ankur Garg, Xuemin Yu, Hassan Sajjad, Samira Ebrahimi Kahou
Abstract: Uncovering emergent concepts across transformer layers remains a significant challenge because the residual stream linearly mixes and duplicates information, obscuring how features evolve within large language models. Current research efforts primarily inspect neural representations at single layers, thereby overlooking this cross-layer superposition and the redundancy it introduces. These representations are typically either analyzed directly for activation patterns or passed to probing classifiers that map them to a limited set of predefined concepts. To address these limitations, we propose the cross-layer VQ-VAE (CLVQ-VAE), a framework that uses vector quantization to map representations across layers, in the process collapsing duplicated residual-stream features into compact, interpretable concept vectors. Our approach uniquely combines top-k temperature-based sampling during quantization with EMA codebook updates, providing controlled exploration of the discrete latent space while maintaining codebook diversity. We further enhance the framework with scaled-spherical k-means++ for codebook initialization, which clusters by directional similarity rather than magnitude, better aligning with the semantic structure of the word embedding space.
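A simplified stand-in for the top-k temperature-based code selection during quantization; the EMA codebook update and spherical k-means++ initialization described above are omitted here, and all shapes are illustrative.

```python
# Top-k temperature-based code selection for vector quantization (sketch).
import torch

def topk_temperature_quantize(z, codebook, k=5, tau=0.5):
    # z: (N, D) representations; codebook: (K, D) concept vectors.
    d = torch.cdist(z, codebook)              # (N, K) pairwise distances
    vals, idx = torch.topk(-d, k, dim=1)      # k nearest codes per input
    probs = torch.softmax(vals / tau, dim=1)  # sharper as tau -> 0
    choice = torch.multinomial(probs, 1).squeeze(1)   # sampled rank in top-k
    codes = idx.gather(1, choice.unsqueeze(1)).squeeze(1)
    return codebook[codes], codes

z = torch.randn(8, 16)
codebook = torch.randn(64, 16)
quantized, codes = topk_temperature_quantize(z, codebook)
print(codes)  # sampled code indices, one per input row
```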
Authors: Haoze Wu, Yunzhi Yao, Wenhao Yu, Huajun Chen, Ningyu Zhang
Abstract: Large Language Models (LLMs) exhibit remarkable code generation capabilities but falter when adapting to frequent updates in external library APIs. This critical limitation stems from reliance on outdated API knowledge from their training data and, even with access to current documentation, impedes reliable code generation in dynamic environments. To tackle this issue, we propose ReCode (rule-based Reinforcement learning for Code Update), a novel framework that mimics how human programmers adapt to API changes. Specifically, we construct a dataset of approximately 2,000 entries to train LLMs to perform version migration based on updated information. We then introduce a modified string-similarity metric over code as the reward for reinforcement learning. Our experiments demonstrate that ReCode substantially boosts LLMs' code generation performance in dynamic API scenarios, especially on the unseen CodeUpdateArena task. Crucially, compared to supervised fine-tuning, ReCode has less impact on LLMs' general code generation abilities. We apply ReCode to various LLMs and reinforcement learning algorithms (GRPO and DAPO), all achieving consistent improvements. Notably, after training, Qwen2.5-Coder-7B outperforms the 32B-parameter code instruction-tuned model and a reasoning model with the same architecture. Code is available at https://github.com/zjunlp/ReCode.
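One plausible instantiation of a string-similarity reward for this setting; the paper uses its own modified metric, so the `SequenceMatcher` choice here is an assumption.

```python
# Toy string-similarity reward for RL on API-update code generation.
from difflib import SequenceMatcher

def code_reward(generated: str, reference: str) -> float:
    """Similarity in [0, 1] between generated code and a reference solution
    written against the updated API."""
    return SequenceMatcher(None, generated, reference).ratio()

ref = "df.to_csv(path, lineterminator='\\n')"   # pandas >= 1.5 argument name
old = "df.to_csv(path, line_terminator='\\n')"  # deprecated spelling
print(round(code_reward(old, ref), 3))  # high but below 1.0: partial reward
```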
Authors: Sam Yu-Te Lee, Chengyang Ji, Shicheng Wen, Lifu Huang, Dongyu Liu, Kwan-Liu Ma
Abstract: Text analytics has traditionally required specialized knowledge in Natural Language Processing (NLP) or text analysis, which presents a barrier for entry-level analysts. Recent advances in large language models (LLMs) have changed the landscape of NLP by enabling more accessible and automated text analysis (e.g., topic detection, summarization, information extraction, etc.). We introduce VIDEE, a system that enables entry-level data analysts to conduct advanced text analytics with intelligent agents. VIDEE instantiates a human-agent collaboration workflow consisting of three stages: (1) Decomposition, which incorporates a human-in-the-loop Monte-Carlo Tree Search algorithm to support generative reasoning with human feedback, (2) Execution, which generates an executable text analytics pipeline, and (3) Evaluation, which integrates LLM-based evaluation and visualizations to support user validation of execution results. We conduct two quantitative experiments to evaluate VIDEE's effectiveness and analyze common agent errors. A user study involving participants with varying levels of NLP and text analytics experience -- from none to expert -- demonstrates the system's usability and reveals distinct user behavior patterns. The findings identify design implications for human-agent collaboration, validate the practical utility of VIDEE for non-expert users, and inform future improvements to intelligent text analytics systems.
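Schematically, the three-stage workflow might be wired as below. Function names and the accept/reject hook are illustrative assumptions; VIDEE's decomposition stage actually runs a human-in-the-loop MCTS rather than a flat filter.

```python
# Three-stage human-agent workflow sketch (Decomposition -> Execution ->
# Evaluation), with a human feedback hook in the first stage.

def decompose(goal, ask_human):
    plan = ["extract entities", "detect topics", "summarize per topic"]
    # Human feedback steers which candidate steps survive.
    return [step for step in plan if ask_human(f"Keep step '{step}'?")]

def execute(plan, corpus):
    return {step: f"ran '{step}' on {len(corpus)} documents" for step in plan}

def evaluate(results):
    # Stands in for the system's LLM-based checks and visualizations.
    return {step: "ok" for step in results}

plan = decompose("compare customer complaints by region", lambda q: True)
results = execute(plan, corpus=["doc1", "doc2"])
print(evaluate(results))
```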
Authors: Mingzhuo Li, Guang Li, Jiafeng Mao, Linfeng Ye, Takahiro Ogawa, Miki Haseyama
Abstract: To alleviate the reliance of deep neural networks on large-scale datasets, dataset distillation aims to generate compact, high-quality synthetic datasets that can achieve performance comparable to the original dataset. The integration of generative models has significantly advanced this field. However, existing approaches primarily focus on aligning the distilled dataset with the original one, often overlooking task-specific information that can be critical for optimal downstream performance. In this paper, focusing on the downstream task of classification, we propose a task-specific sampling strategy for generative dataset distillation that incorporates the concept of difficulty to better reflect the requirements of the target task. The final dataset is sampled from a larger image pool with a sampling distribution obtained by matching the difficulty distribution of the original dataset. A logarithmic transformation is applied as a pre-processing step to correct for distributional bias. Extensive experiments demonstrate the effectiveness of our method and suggest its potential for enhancing performance on other downstream tasks. The code is available at https://github.com/SumomoTaku/DiffGuideSamp.
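A rough sketch of difficulty-matched sampling with the log pre-processing step. The difficulty scores, binning scheme, and sample sizes are assumed placeholders.

```python
# Difficulty-distribution matching for distilled-set sampling (sketch).
import numpy as np

rng = np.random.default_rng(0)
orig_difficulty = rng.beta(2, 5, size=10_000)     # difficulty of original data
pool_difficulty = rng.uniform(0, 1, size=50_000)  # large generated image pool

# Log transform as a pre-processing step to correct distributional bias.
target_hist, edges = np.histogram(np.log1p(orig_difficulty), bins=20,
                                  density=True)
pool_bins = np.clip(np.digitize(np.log1p(pool_difficulty), edges) - 1, 0, 19)

# Sample pool items with probability proportional to the target bin density,
# so the distilled set's difficulty profile matches the original dataset's.
weights = target_hist[pool_bins]
weights /= weights.sum()
distilled_idx = rng.choice(len(pool_difficulty), size=500, replace=False,
                           p=weights)
print(distilled_idx[:10])
```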
Authors: Koki Yamane, Yunhan Li, Masashi Konosu, Koki Inami, Junji Oaki, Sho Sakaino, Toshiaki Tsuji
Abstract: In recent years, the advancement of imitation learning has led to increased interest in teleoperating low-cost manipulators to collect demonstration data. However, most existing systems rely on unilateral control, which transmits only target position values. While this approach is easy to implement and suitable for slow, non-contact tasks, it struggles with fast or contact-rich operations due to the absence of force feedback. This work demonstrates that fast teleoperation with force feedback is feasible even with force-sensorless, low-cost manipulators by leveraging 4-channel bilateral control. Based on accurately identified manipulator dynamics, our method integrates compensation of nonlinear terms, velocity and external force estimation, and variable gains that track inertia variation. Furthermore, using data collected via 4-channel bilateral control, we show that incorporating force information into both the input and output of learned policies improves performance in imitation learning. These results highlight the practical effectiveness of our system for high-fidelity teleoperation and data collection on affordable hardware.
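A sketch of the 4-channel bilateral control law, which drives the leader-follower position difference and the external-force sum toward zero. The gains, sign conventions, and the disturbance observer that a force-sensorless implementation needs are simplified assumptions here.

```python
# 4-channel bilateral control sketch: symmetric position servo plus an
# action-reaction force servo (signs and gains are illustrative).
KP, KD, KF = 400.0, 40.0, 1.0  # position, velocity, force gains

def bilateral_refs(xm, vm, fm, xs, vs, fs):
    """Acceleration references for leader (m) and follower (s).

    fm, fs: external forces, which a force-sensorless system estimates with
    a disturbance observer rather than measuring with force sensors."""
    pos = KP * (xm - xs) + KD * (vm - vs)  # drive position difference to 0
    frc = KF * (fm + fs)                   # drive external-force sum to 0
    am_ref = -0.5 * pos - 0.5 * frc
    as_ref = +0.5 * pos - 0.5 * frc
    return am_ref, as_ref

# Operator pushes the leader (fm > 0) while positions still agree: both
# references react to the unbalanced force, transmitting it to the follower.
print(bilateral_refs(xm=0.10, vm=0.0, fm=2.0, xs=0.10, vs=0.0, fs=0.0))
```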
Authors: Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, Luke Marris, Sam Petulla, Colin Gaffney, Asaf Aharoni, Nathan Lintz, Tiago Cardal Pais, Henrik Jacobsson, Idan Szpektor, Nan-Jiang Jiang, Krishna Haridasan, Ahmed Omran, Nikunj Saunshi, Dara Bahri, Gaurav Mishra, Eric Chu, Toby Boyd, Brad Hekman, Aaron Parisi, Chaoyi Zhang, Kornraphop Kawintiranon, Tania Bedrax-Weiss, Oliver Wang, Ya Xu, et al.
Grace Chung (Yonghao), Yiqing Hua (Yonghao), Anfal Siddiqui (Yonghao), Nicolas Serrano (Yonghao), Dongkai Chen (Yonghao), Billy Porter (Yonghao), Libin Bai (Yonghao), Keshav Shivam (Yonghao), Sho Arora (Yonghao), Partha Talukdar (Yonghao), Tom Cobley (Yonghao), Sangnie Bhardwaj (Yonghao), Evgeny Gladchenko (Yonghao), Simon Green (Yonghao), Kelvin Guu (Yonghao), Felix Fischer (Yonghao), Xiao Wu (Yonghao), Eric Wang (Yonghao), Achintya Singhal (Yonghao), Tatiana Matejovicova (Yonghao), James Martens (Yonghao), Hongji Li (Yonghao), Roma Patel (Yonghao), Elizabeth Kemp (Yonghao), Jiaqi Pan (Yonghao), Lily Wang (Yonghao), Blake JianHang Chen (Yonghao), Jean-Baptiste Alayrac (Yonghao), Navneet Potti (Yonghao), Erika Gemzer (Yonghao), Eugene Ie (Yonghao), Kay McKinney (Yonghao), Takaaki Saeki (Yonghao), Edward Chou (Yonghao), Pascal Lamblin (Yonghao), SQ Mah (Yonghao), Zach Fisher (Yonghao), Martin Chadwick (Yonghao), Jon Stritar (Yonghao), Obaid Sarvana (Yonghao), Andrew Hogue (Yonghao), Artem Shtefan (Yonghao), Hadi Hashemi (Yonghao), Yang Xu (Yonghao), Jindong Gu (Yonghao), Sharad Vikram (Yonghao), Chung-Ching Chang (Yonghao), Sabela Ramos (Yonghao), Logan Kilpatrick (Yonghao), Weijuan Xi (Yonghao), Jenny Brennan (Yonghao), Yinghao Sun (Yonghao), Abhishek Jindal (Yonghao), Ionel Gog (Yonghao), Dawn Chen (Yonghao), Felix Wu (Yonghao), Jason Lee (Yonghao), Sudhindra Kopalle (Yonghao), Srinadh Bhojanapalli (Yonghao), Oriol Vinyals (Yonghao), Natan Potikha (Yonghao), Burcu Karagol Ayan (Yonghao), Yuan Yuan (Yonghao), Michael Riley (Yonghao), Piotr Stanczyk (Yonghao), Sergey Kishchenko (Yonghao), Bing Wang (Yonghao), Dan Garrette (Yonghao), Antoine Yang (Yonghao), Vlad Feinberg (Yonghao), CJ Carey (Yonghao), Javad Azizi (Yonghao), Viral Shah (Yonghao), Erica Moreira (Yonghao), Chongyang Shi (Yonghao), Josh Feldman (Yonghao), Elizabeth Salesky (Yonghao), Thomas Lampe (Yonghao), Aneesh Pappu (Yonghao), Duhyeon Kim (Yonghao), Jonas Adler (Yonghao), Avi Caciularu (Yonghao), Brian Walker (Yonghao), Yunhan Xu (Yonghao), Yochai Blau (Yonghao), Dylan Scandinaro (Yonghao), Terry Huang (Yonghao), Sam El-Husseini (Yonghao), Abhishek Sinha (Yonghao), Lijie Ren (Yonghao), Taylor Tobin (Yonghao), Patrik Sundberg (Yonghao), Tim Sohn (Yonghao), Vikas Yadav (Yonghao), Mimi Ly (Yonghao), Emily Xue (Yonghao), Jing Xiong (Yonghao), Afzal Shama Soudagar (Yonghao), Sneha Mondal (Yonghao), Nikhil Khadke (Yonghao), Qingchun Ren (Yonghao), Ben Vargas (Yonghao), Stan Bileschi (Yonghao), Sarah Chakera (Yonghao), Cindy Wang (Yonghao), Boyu Wang (Yonghao), Yoni Halpern (Yonghao), Joe Jiang (Yonghao), Vikas Sindhwani (Yonghao), Petre Petrov (Yonghao), Pranavaraj Ponnuramu (Yonghao), Sanket Vaibhav Mehta (Yonghao), Yu Watanabe (Yonghao), Betty Chan (Yonghao), Matheus Wisniewski (Yonghao), Trang Pham (Yonghao), Jingwei Zhang (Yonghao), Conglong Li (Yonghao), Dario de Cesare (Yonghao), Art Khurshudov (Yonghao), Alex Vasiloff (Yonghao), Melissa Tan (Yonghao), Zoe Ashwood (Yonghao), Bobak Shahriari (Yonghao), Maryam Majzoubi (Yonghao), Garrett Tanzer (Yonghao), Olga Kozlova (Yonghao), Robin Alazard (Yonghao), James Lee-Thorp (Yonghao), Nguyet Minh Phu (Yonghao), Isaac Tian (Yonghao), Junwhan Ahn (Yonghao), Andy Crawford (Yonghao), Lauren Lax (Yonghao), Yuan Shangguan (Yonghao), Iftekhar Naim (Yonghao), David Ross (Yonghao), Oleksandr Ferludin (Yonghao), Tongfei Guo (Yonghao), Andrea Banino (Yonghao), Hubert Soyer (Yonghao), Xiaoen Ju (Yonghao), Dominika Rogozi\'nska (Yonghao), Ishaan Malhi (Yonghao), Marcella Valentine (Yonghao), 
Daniel Balle (Yonghao), Apoorv Kulshreshtha (Yonghao), Maciej Kula (Yonghao), Yiwen Song (Yonghao), Sophia Austin (Yonghao), John Schultz (Yonghao), Roy Hirsch (Yonghao), Arthur Douillard (Yonghao), Apoorv Reddy (Yonghao), Michael Fink (Yonghao), Summer Yue (Yonghao), Khyatti Gupta (Yonghao), Adam Zhang (Yonghao), Norman Rink (Yonghao), Daniel McDuff (Yonghao), Lei Meng (Yonghao), Andr\'as Gy\"orgy (Yonghao), Yasaman Razeghi (Yonghao), Ricky Liang (Yonghao), Kazuki Osawa (Yonghao), Aviel Atias (Yonghao), Matan Eyal (Yonghao), Tyrone Hill (Yonghao), Nikolai Grigorev (Yonghao), Zhengdong Wang (Yonghao), Nitish Kulkarni (Yonghao), Rachel Soh (Yonghao), Ivan Lobov (Yonghao), Zachary Charles (Yonghao), Sid Lall (Yonghao), Kazuma Hashimoto (Yonghao), Ido Kessler (Yonghao), Victor Gomes (Yonghao), Zelda Mariet (Yonghao), Danny Driess (Yonghao), Alessandro Agostini (Yonghao), Canfer Akbulut (Yonghao), Jingcao Hu (Yonghao), Marissa Ikonomidis (Yonghao), Emily Caveness (Yonghao), Kartik Audhkhasi (Yonghao), Saurabh Agrawal (Yonghao), Ioana Bica (Yonghao), Evan Senter (Yonghao), Jayaram Mudigonda (Yonghao), Kelly Chen (Yonghao), Jingchen Ye (Yonghao), Xuanhui Wang (Yonghao), James Svensson (Yonghao), Philipp Fr\"anken (Yonghao), Josh Newlan (Yonghao), Li Lao (Yonghao), Eva Schnider (Yonghao), Sami Alabed (Yonghao), Joseph Kready (Yonghao), Jesse Emond (Yonghao), Afief Halumi (Yonghao), Tim Zaman (Yonghao), Chengxi Ye (Yonghao), Naina Raisinghani (Yonghao), Vilobh Meshram (Yonghao), Bo Chang (Yonghao), Ankit Singh Rawat (Yonghao), Axel Stjerngren (Yonghao), Sergey Levi (Yonghao), Rui Wang (Yonghao), Xiangzhu Long (Yonghao), Mitchelle Rasquinha (Yonghao), Steven Hand (Yonghao), Aditi Mavalankar (Yonghao), Lauren Agubuzu (Yonghao), Sudeshna Roy (Yonghao), Junquan Chen (Yonghao), Jarek Wilkiewicz (Yonghao), Hao Zhou (Yonghao), Michal Jastrzebski (Yonghao), Qiong Hu (Yonghao), Agustin Dal Lago (Yonghao), Ramya Sree Boppana (Yonghao), Wei-Jen Ko (Yonghao), Jennifer Prendki (Yonghao), Yao Su (Yonghao), Zhi Li (Yonghao), Eliza Rutherford (Yonghao), Girish Ramchandra Rao (Yonghao), Ramona Comanescu (Yonghao), Adri\`a Puigdom\`enech (Yonghao), Qihang Chen (Yonghao), Dessie Petrova (Yonghao), Christine Chan (Yonghao), Vedrana Milutinovic (Yonghao), Felipe Tiengo Ferreira (Yonghao), Chin-Yi Cheng (Yonghao), Ming Zhang (Yonghao), Tapomay Dey (Yonghao), Sherry Yang (Yonghao), Ramesh Sampath (Yonghao), Quoc Le (Yonghao), Howard Zhou (Yonghao), Chu-Cheng Lin (Yonghao), Hoi Lam (Yonghao), Christine Kaeser-Chen (Yonghao), Kai Hui (Yonghao), Dean Hirsch (Yonghao), Tom Eccles (Yonghao), Basil Mustafa (Yonghao), Shruti Rijhwani (Yonghao), Morgane Rivi\`ere (Yonghao), Yuanzhong Xu (Yonghao), Junjie Wang (Yonghao), Xinyang Geng (Yonghao), Xiance Si (Yonghao), Arjun Khare (Yonghao), Cheolmin Kim (Yonghao), Vahab Mirrokni (Yonghao), Kamyu Lee (Yonghao), Khuslen Baatarsukh (Yonghao), Nathaniel Braun (Yonghao), Lisa Wang (Yonghao), Pallavi LV (Yonghao), Richard Tanburn (Yonghao), Yuvein (Yonghao), Zhu (Joyce), Fangda Li (Joyce), Setareh Ariafar (Joyce), Dan Goldberg (Joyce), Ken Burke (Joyce), Daniil Mirylenka (Joyce), Meiqi Guo (Joyce), Olaf Ronneberger (Joyce), Hadas Natalie Vogel (Joyce), Liqun Cheng (Joyce), Nishita Shetty (Joyce), Johnson Jia (Joyce), Thomas Jimma (Joyce), Corey Fry (Joyce), Ted Xiao (Joyce), Martin Sundermeyer (Joyce), Ryan Burnell (Joyce), Yannis Assael (Joyce), Mario Pinto (Joyce), JD Chen (Joyce), Rohit Sathyanarayana (Joyce), Donghyun Cho (Joyce), Jing Lu (Joyce), Rishabh Agarwal (Joyce), Sugato Basu 
(Joyce), Lucas Gonzalez (Joyce), Dhruv Shah (Joyce), Meng Wei (Joyce), Dre Mahaarachchi (Joyce), Rohan Agrawal (Joyce), Tero Rissa (Joyce), Yani Donchev (Joyce), Ramiro Leal-Cavazos (Joyce), Adrian Hutter (Joyce), Markus Mircea (Joyce), Alon Jacovi (Joyce), Faruk Ahmed (Joyce), Jiageng Zhang (Joyce), Shuguang Hu (Joyce), Bo-Juen Chen (Joyce), Jonni Kanerva (Joyce), Guillaume Desjardins (Joyce), Andrew Lee (Joyce), Nikos Parotsidis (Joyce), Asier Mujika (Joyce), Tobias Weyand (Joyce), Jasper Snoek (Joyce), Jo Chick (Joyce), Kai Chen (Joyce), Paul Chang (Joyce), Ethan Mahintorabi (Joyce), Zi Wang (Joyce), Tolly Powell (Joyce), Orgad Keller (Joyce), Abhirut Gupta (Joyce), Claire Sha (Joyce), Kanav Garg (Joyce), Nicolas Heess (Joyce), \'Agoston Weisz (Joyce), Cassidy Hardin (Joyce), Bartek Wydrowski (Joyce), Ben Coleman (Joyce), Karina Zainullina (Joyce), Pankaj Joshi (Joyce), Alessandro Epasto (Joyce), Terry Spitz (Joyce), Binbin Xiong (Joyce), Kai Zhao (Joyce), Arseniy Klimovskiy (Joyce), Ivy Zheng (Joyce), Johan Ferret (Joyce), Itay Yona (Joyce), Waleed Khawaja (Joyce), Jean-Baptiste Lespiau (Joyce), Maxim Krikun (Joyce), Siamak Shakeri (Joyce), Timothee Cour (Joyce), Bonnie Li (Joyce), Igor Krivokon (Joyce), Dan Suh (Joyce), Alex Hofer (Joyce), Jad Al Abdallah (Joyce), Nikita Putikhin (Joyce), Oscar Akerlund (Joyce), Silvio Lattanzi (Joyce), Anurag Kumar (Joyce), Shane Settle (Joyce), Himanshu Srivastava (Joyce), Folawiyo Campbell-Ajala (Joyce), Edouard Rosseel (Joyce), Mihai Dorin Istin (Joyce), Nishanth Dikkala (Joyce), Anand Rao (Joyce), Nick Young (Joyce), Kate Lin (Joyce), Dhruva Bhaswar (Joyce), Yiming Wang (Joyce), Jaume Sanchez Elias (Joyce), Kritika Muralidharan (Joyce), James Keeling (Joyce), Dayou Du (Joyce), Siddharth Gopal (Joyce), Gregory Dibb (Joyce), Charles Blundell (Joyce), Manolis Delakis (Joyce), Jacky Liang (Joyce), Marco Tulio Ribeiro (Joyce), Georgi Karadzhov (Joyce), Guillermo Garrido (Joyce), Ankur Bapna (Joyce), Jiawei Cao (Joyce), Adam Sadovsky (Joyce), Pouya Tafti (Joyce), Arthur Guez (Joyce), Coline Devin (Joyce), Yixian Di (Joyce), Jinwei Xing (Joyce), Chuqiao (Joyce), Xu (Cindy), Hanzhao Lin (Cindy), Chun-Te Chu (Cindy), Sameera Ponda (Cindy), Wesley Helmholz (Cindy), Fan Yang (Cindy), Yue Gao (Cindy), Sara Javanmardi (Cindy), Wael Farhan (Cindy), Alex Ramirez (Cindy), Ricardo Figueira (Cindy), Khe Chai Sim (Cindy), Yuval Bahat (Cindy), Ashwin Vaswani (Cindy), Liangzhe Yuan (Cindy), Gufeng Zhang (Cindy), Leland Rechis (Cindy), Hanjun Dai (Cindy), Tayo Oguntebi (Cindy), Alexandra Cordell (Cindy), Eug\'enie Rives (Cindy), Kaan Tekelioglu (Cindy), Naveen Kumar (Cindy), Bing Zhang (Cindy), Aurick Zhou (Cindy), Nikolay Savinov (Cindy), Andrew Leach (Cindy), Alex Tudor (Cindy), Sanjay Ganapathy (Cindy), Yanyan Zheng (Cindy), Mirko Rossini (Cindy), Vera Axelrod (Cindy), Arnaud Autef (Cindy), Yukun Zhu (Cindy), Zheng Zheng (Cindy), Mingda Zhang (Cindy), Baochen Sun (Cindy), Jie Ren (Cindy), Nenad Tomasev (Cindy), Nithish Kannen (Cindy), Amer Sinha (Cindy), Charles Chen (Cindy), Louis O'Bryan (Cindy), Alex Pak (Cindy), Aditya Kusupati (Cindy), Weel Yang (Cindy), Deepak Ramachandran (Cindy), Patrick Griffin (Cindy), Seokhwan Kim (Cindy), Philipp Neubeck (Cindy), Craig Schiff (Cindy), Tammo Spalink (Cindy), Mingyang Ling (Cindy), Arun Nair (Cindy), Ga-Young Joung (Cindy), Linda Deng (Cindy), Avishkar Bhoopchand (Cindy), Lora Aroyo (Cindy), Tom Duerig (Cindy), Jordan Griffith (Cindy), Gabe Barth-Maron (Cindy), Jake Ades (Cindy), Alex Haig (Cindy), Ankur Taly (Cindy), 
Yunting Song (Cindy), Paul Michel (Cindy), Dave Orr (Cindy), Dean Weesner (Cindy), Corentin Tallec (Cindy), Carrie Grimes Bostock (Cindy), Paul Niemczyk (Cindy), Andy Twigg (Cindy), Mudit Verma (Cindy), Rohith Vallu (Cindy), Henry Wang (Cindy), Marco Gelmi (Cindy), Kiranbir Sodhia (Cindy), Aleksandr Chuklin (Cindy), Omer Goldman (Cindy), Jasmine George (Cindy), Liang Bai (Cindy), Kelvin Zhang (Cindy), Petar Sirkovic (Cindy), Efrat Nehoran (Cindy), Golan Pundak (Cindy), Jiaqi Mu (Cindy), Alice Chen (Cindy), Alex Greve (Cindy), Paulo Zacchello (Cindy), David Amos (Cindy), Heming Ge (Cindy), Eric Noland (Cindy), Colton Bishop (Cindy), Jeffrey Dudek (Cindy), Youhei Namiki (Cindy), Elena Buchatskaya (Cindy), Jing Li (Cindy), Dorsa Sadigh (Cindy), Masha Samsikova (Cindy), Dan Malkin (Cindy), Damien Vincent (Cindy), Robert David (Cindy), Rob Willoughby (Cindy), Phoenix Meadowlark (Cindy), Shawn Gao (Cindy), Yan Li (Cindy), Raj Apte (Cindy), Amit Jhindal (Cindy), Stein Xudong Lin (Cindy), Alex Polozov (Cindy), Zhicheng Wang (Cindy), Tomas Mery (Cindy), Anirudh GP (Cindy), Varun Yerram (Cindy), Sage Stevens (Cindy), Tianqi Liu (Cindy), Noah Fiedel (Cindy), Charles Sutton (Cindy), Matthew Johnson (Cindy), Xiaodan Song (Cindy), Kate Baumli (Cindy), Nir Shabat (Cindy), Muqthar Mohammad (Cindy), Hao Liu (Cindy), Marco Selvi (Cindy), Yichao Zhou (Cindy), Mehdi Hafezi Manshadi (Cindy), Chu-ling Ko (Cindy), Anthony Chen (Cindy), Michael Bendersky (Cindy), Jorge Gonzalez Mendez (Cindy), Nisarg Kothari (Cindy), Amir Zandieh (Cindy), Yiling Huang (Cindy), Daniel Andor (Cindy), Ellie Pavlick (Cindy), Idan Brusilovsky (Cindy), Jitendra Harlalka (Cindy), Sally Goldman (Cindy), Andrew Lampinen (Cindy), Guowang Li (Cindy), Asahi Ushio (Cindy), Somit Gupta (Cindy), Lei Zhang (Cindy), Chuyuan Kelly Fu (Cindy), Madhavi Sewak (Cindy), Timo Denk (Cindy), Jed Borovik (Cindy), Brendan Jou (Cindy), Avital Zipori (Cindy), Prateek Jain (Cindy), Junwen Bai (Cindy), Thang Luong (Cindy), Jonathan Tompson (Cindy), Alice Li (Cindy), Li Liu (Cindy), George Powell (Cindy), Jiajun Shen (Cindy), Alex Feng (Cindy), Grishma Chole (Cindy), Da Yu (Cindy), Yinlam Chow (Cindy), Tongxin Yin (Cindy), Eric Malmi (Cindy), Kefan Xiao (Cindy), Yash Pande (Cindy), Shachi Paul (Cindy), Niccol\`o Dal Santo (Cindy), Adil Dostmohamed (Cindy), Sergio Guadarrama (Cindy), Aaron Phillips (Cindy), Thanumalayan Sankaranarayana Pillai (Cindy), Gal Yona (Cindy), Amin Ghafouri (Cindy), Preethi Lahoti (Cindy), Benjamin Lee (Cindy), Dhruv Madeka (Cindy), Eren Sezener (Cindy), Simon Tokumine (Cindy), Adrian Collister (Cindy), Nicola De Cao (Cindy), Richard Shin (Cindy), Uday Kalra (Cindy), Parker Beak (Cindy), Emily Nottage (Cindy), Ryo Nakashima (Cindy), Ivan Jurin (Cindy), Vikash Sehwag (Cindy), Meenu Gaba (Cindy), Junhao Zeng (Cindy), Kevin R. 
McKee (Cindy), Fernando Pereira (Cindy), Tamar Yakar (Cindy), Amayika Panda (Cindy), Arka Dhar (Cindy), Peilin Zhong (Cindy), Daniel Sohn (Cindy), Mark Brand (Cindy), Lars Lowe Sjoesund (Cindy), Viral Carpenter (Cindy), Sharon Lin (Cindy), Shantanu Thakoor (Cindy), Marcus Wainwright (Cindy), Ashwin Chaugule (Cindy), Pranesh Srinivasan (Cindy), Muye Zhu (Cindy), Bernett Orlando (Cindy), Jack Weber (Cindy), Ayzaan Wahid (Cindy), Gilles Baechler (Cindy), Apurv Suman (Cindy), Jovana Mitrovi\'c (Cindy), Gabe Taubman (Cindy), Honglin Yu (Cindy), Helen King (Cindy), Josh Dillon (Cindy), Cathy Yip (Cindy), Dhriti Varma (Cindy), Tomas Izo (Cindy), Levent Bolelli (Cindy), Borja De Balle Pigem (Cindy), Julia Di Trapani (Cindy), Fotis Iliopoulos (Cindy), Adam Paszke (Cindy), Nishant Ranka (Cindy), Joe Zou (Cindy), Francesco Pongetti (Cindy), Jed McGiffin (Cindy), Alex Siegman (Cindy), Rich Galt (Cindy), Ross Hemsley (Cindy), Goran \v{Z}u\v{z}i\'c (Cindy), Victor Carbune (Cindy), Tao Li (Cindy), Myle Ott (Cindy), F\'elix de Chaumont Quitry (Cindy), David Vilar Torres (Cindy), Yuri Chervonyi (Cindy), Tomy Tsai (Cindy), Prem Eruvbetine (Cindy), Samuel Yang (Cindy), Matthew Denton (Cindy), Jake Walker (Cindy), Slavica Anda\v{c}i\'c (Cindy), Idan Heimlich Shtacher (Cindy), Vittal Premachandran (Cindy), Harshal Tushar Lehri (Cindy), Cip Baetu (Cindy), Damion Yates (Cindy), Lampros Lamprou (Cindy), Mariko Iinuma (Cindy), Ioana Mihailescu (Cindy), Ben Albrecht (Cindy), Shachi Dave (Cindy), Susie Sargsyan (Cindy), Bryan Perozzi (Cindy), Lucas Manning (Cindy), Chiyuan Zhang (Cindy), Denis Vnukov (Cindy), Igor Mordatch (Cindy), Raia Hadsell Wolfgang Macherey (Cindy), Ryan Kappedal (Cindy), Jim Stephan (Cindy), Aditya Tripathi (Cindy), Klaus Macherey (Cindy), Jun Qian (Cindy), Abhishek Bhowmick (Cindy), Shekoofeh Azizi (Cindy), R\'emi Leblond (Cindy), Shiva Mohan Reddy Garlapati (Cindy), Timothy Knight (Cindy), Matthew Wiethoff (Cindy), Wei-Chih Hung (Cindy), Anelia Angelova (Cindy), Georgios Evangelopoulos (Cindy), Pawel Janus (Cindy), Dimitris Paparas (Cindy), Matthew Rahtz (Cindy), Ken Caluwaerts (Cindy), Vivek Sampathkumar (Cindy), Daniel Jarrett (Cindy), Shadi Noghabi (Cindy), Antoine Miech (Cindy), Chak Yeung (Cindy), Geoff Clark (Cindy), Henry Prior (Cindy), Fei Zheng (Cindy), Jean Pouget-Abadie (Cindy), Indro Bhattacharya (Cindy), Kalpesh Krishna (Cindy), Will Bishop (Cindy), Zhe Yuan (Cindy), Yunxiao Deng (Cindy), Ashutosh Sathe (Cindy), Kacper Krasowiak (Cindy), Ciprian Chelba (Cindy), Cho-Jui Hsieh (Cindy), Kiran Vodrahalli (Cindy), Buhuang Liu (Cindy), Thomas K\"oppe (Cindy), Amr Khalifa (Cindy), Lubo Litchev (Cindy), Pichi Charoenpanit (Cindy), Reed Roberts (Cindy), Sachin Yadav (Cindy), Yasumasa Onoe (Cindy), Desi Ivanov (Cindy), Megha Mohabey (Cindy), Vighnesh Birodkar (Cindy), Nemanja Raki\'cevi\'c (Cindy), Pierre Sermanet (Cindy), Vaibhav Mehta (Cindy), Krishan Subudhi (Cindy), Travis Choma (Cindy), Will Ng (Cindy), Luheng He (Cindy), Kathie Wang (Cindy), Tasos Kementsietsidis (Cindy), Shane Gu (Cindy), Mansi Gupta (Cindy), Andrew Nystrom (Cindy), Mehran Kazemi (Cindy), Timothy Chung (Cindy), Nacho Cano (Cindy), Nikhil Dhawan (Cindy), Yufei Wang (Cindy), Jiawei Xia (Cindy), Trevor Yacovone (Cindy), Eric Jia (Cindy), Mingqing Chen (Cindy), Simeon Ivanov (Cindy), Ashrith Sheshan (Cindy), Sid Dalmia (Cindy), Pawe{\l} Stradomski (Cindy), Pengcheng Yin (Cindy), Salem Haykal (Cindy), Congchao Wang (Cindy), Dennis Duan (Cindy), Neslihan Bulut (Cindy), Greg Kochanski (Cindy), Liam MacDermed (Cindy), 
Namrata Godbole (Cindy), Shitao Weng (Cindy), Jingjing Chen (Cindy), Rachana Fellinger (Cindy), Ramin Mehran (Cindy), Daniel Suo (Cindy), Hisham Husain (Cindy), Tong He (Cindy), Kaushal Patel (Cindy), Joshua Howland (Cindy), Randall Parker (Cindy), Kelvin Nguyen (Cindy), Sharath Maddineni (Cindy), Chris Rawles (Cindy), Mina Khan (Cindy), Shlomi Cohen-Ganor (Cindy), Amol Mandhane (Cindy), Xinyi Wu (Cindy), Chenkai Kuang (Cindy), Iulia Com\c{s}a (Cindy), Ramya Ganeshan (Cindy), Hanie Sedghi (Cindy), Adam Bloniarz (Cindy), Nuo Wang Pierse (Cindy), Anton Briukhov (Cindy), Petr Mitrichev (Cindy), Anita Gergely (Cindy), Serena Zhan (Cindy), Allan Zhou (Cindy), Nikita Saxena (Cindy), Eva Lu (Cindy), Josef Dean (Cindy), Ashish Gupta (Cindy), Nicolas Perez-Nieves (Cindy), Renjie Wu (Cindy), Cory McLean (Cindy), Wei Liang (Cindy), Disha Jindal (Cindy), Anton Tsitsulin (Cindy), Wenhao Yu (Cindy), Kaiz Alarakyia (Cindy), Tom Schaul (Cindy), Piyush Patil (Cindy), Peter Sung (Cindy), Elijah Peake (Cindy), Hongkun Yu (Cindy), Feryal Behbahani (Cindy), JD Co-Reyes (Cindy), Alan Ansell (Cindy), Sean Sun (Cindy), Clara Barbu (Cindy), Jonathan Lee (Cindy), Seb Noury (Cindy), James Allingham (Cindy), Bilal Piot (Cindy), Mohit Sharma (Cindy), Christopher Yew (Cindy), Ivan Korotkov (Cindy), Bibo Xu (Cindy), Demetra Brady (Cindy), Goran Petrovic (Cindy), Shibl Mourad (Cindy), Claire Cui (Cindy), Aditya Gupta (Cindy), Parker Schuh (Cindy), Saarthak Khanna (Cindy), Anna Goldie (Cindy), Abhinav Arora (Cindy), Vadim Zubov (Cindy), Amy Stuart (Cindy), Mark Epstein (Cindy), Yun Zhu (Cindy), Jianqiao Liu (Cindy), Yury Stuken (Cindy), Ziyue Wang (Cindy), Karolis Misiunas (Cindy), Dee Guo (Cindy), Ashleah Gill (Cindy), Ale Hartman (Cindy), Zaid Nabulsi (Cindy), Aurko Roy (Cindy), Aleksandra Faust (Cindy), Jason Riesa (Cindy), Ben Withbroe (Cindy), Mengchao Wang (Cindy), Marco Tagliasacchi (Cindy), Andreea Marzoca (Cindy), James Noraky (Cindy), Serge Toropov (Cindy), Malika Mehrotra (Cindy), Bahram Raad (Cindy), Sanja Deur (Cindy), Steve Xu (Cindy), Marianne Monteiro (Cindy), Zhongru Wu (Cindy), Yi Luan (Cindy), Sam Ritter (Cindy), Nick Li (Cindy), H{\aa}vard Garnes (Cindy), Yanzhang He (Cindy), Martin Zlocha (Cindy), Jifan Zhu (Cindy), Matteo Hessel (Cindy), Will Wu (Cindy), Spandana Raj Babbula (Cindy), Chizu Kawamoto (Cindy), Yuanzhen Li (Cindy), Mehadi Hassen (Cindy), Yan Wang (Cindy), Brian Wieder (Cindy), James Freedman (Cindy), Yin Zhang (Cindy), Xinyi Bai (Cindy), Tianli Yu (Cindy), David Reitter (Cindy), XiangHai Sheng (Cindy), Mateo Wirth (Cindy), Aditya Kini (Cindy), Dima Damen (Cindy), Mingcen Gao (Cindy), Rachel Hornung (Cindy), Michael Voznesensky (Cindy), Brian Roark (Cindy), Adhi Kuncoro (Cindy), Yuxiang Zhou (Cindy), Rushin Shah (Cindy), Anthony Brohan (Cindy), Kuangyuan Chen (Cindy), James Wendt (Cindy), David Rim (Cindy), Paul Kishan Rubenstein (Cindy), Jonathan Halcrow (Cindy), Michelle Liu (Cindy), Ty Geri (Cindy), Yunhsuan Sung (Cindy), Jane Shapiro (Cindy), Shaan Bijwadia (Cindy), Chris Duvarney (Cindy), Christina Sorokin (Cindy), Paul Natsev (Cindy), Reeve Ingle (Cindy), Pramod Gupta (Cindy), Young Maeng (Cindy), Ndaba Ndebele (Cindy), Kexin Zhu (Cindy), Valentin Anklin (Cindy), Katherine Lee (Cindy), Yuan Liu (Cindy), Yaroslav Akulov (Cindy), Shaleen Gupta (Cindy), Guolong Su (Cindy), Flavien Prost (Cindy), Tianlin Liu (Cindy), Vitaly Kovalev (Cindy), Pol Moreno (Cindy), Martin Scholz (Cindy), Sam Redmond (Cindy), Zongwei Zhou (Cindy), Alex Castro-Ros (Cindy), Andr\'e Susano Pinto (Cindy), Dia 
Kharrat (Cindy), Michal Yarom (Cindy), Rachel Saputro (Cindy), Jannis Bulian (Cindy), Ben Caine (Cindy), Ji Liu (Cindy), Abbas Abdolmaleki (Cindy), Shariq Iqbal (Cindy), Tautvydas Misiunas (Cindy), Mikhail Sirotenko (Cindy), Shefali Garg (Cindy), Guy Bensky (Cindy), Huan Gui (Cindy), Xuezhi Wang (Cindy), Raphael Koster (Cindy), Mike Bernico (Cindy), Da Huang (Cindy), Romal Thoppilan (Cindy), Trevor Cohn (Cindy), Ben Golan (Cindy), Wenlei Zhou (Cindy), Andrew Rosenberg (Cindy), Markus Freitag (Cindy), Tynan Gangwani (Cindy), Vincent Tsang (Cindy), Anand Shukla (Cindy), Xiaoqi Ren (Cindy), Minh Giang (Cindy), Chi Zou (Cindy), Andre Elisseeff (Cindy), Charline Le Lan (Cindy), Dheeru Dua (Cindy), Shuba Lall (Cindy), Pranav Shyam (Cindy), Frankie Garcia (Cindy), Sarah Nguyen (Cindy), Michael Guzman (Cindy), AJ Maschinot (Cindy), Marcello Maggioni (Cindy), Ming-Wei Chang (Cindy), Karol Gregor (Cindy), Lotte Weerts (Cindy), Kumaran Venkatesan (Cindy), Bogdan Damoc (Cindy), Leon Liu (Cindy), Jan Wassenberg (Cindy), Lewis Ho (Cindy), Becca Roelofs (Cindy), Majid Hadian (Cindy), Fran\c{c}ois-Xavier Aubet (Cindy), Yu Liang (Cindy), Sami Lachgar (Cindy), Danny Karmon (Cindy), Yong Cheng (Cindy), Amelio V\'azquez-Reina (Cindy), Angie Chen (Cindy), Zhuyun Dai (Cindy), Andy Brock (Cindy), Shubham Agrawal (Cindy), Chenxi Pang (Cindy), Peter Garst (Cindy), Mariella Sanchez-Vargas (Cindy), Ivor Rendulic (Cindy), Aditya Ayyar (Cindy), Andrija Ra\v{z}natovi\'c (Cindy), Olivia Ma (Cindy), Roopali Vij (Cindy), Neha Sharma (Cindy), Ashwin Balakrishna (Cindy), Bingyuan Liu (Cindy), Ian Mackinnon (Cindy), Sorin Baltateanu (Cindy), Petra Poklukar (Cindy), Gabriel Ibagon (Cindy), Colin Ji (Cindy), Hongyang Jiao (Cindy), Isaac Noble (Cindy), Wojciech Stokowiec (Cindy), Zhihao Li (Cindy), Jeff Dean (Cindy), David Lindner (Cindy), Mark Omernick (Cindy), Kristen Chiafullo (Cindy), Mason Dimarco (Cindy), Vitor Rodrigues (Cindy), Vittorio Selo (Cindy), Garrett Honke (Cindy), Xintian (Cindy), Wu (Lucas), Wei He (Lucas), Adam Hillier (Lucas), Anhad Mohananey (Lucas), Vihari Piratla (Lucas), Chang Ye (Lucas), Chase Malik (Lucas), Sebastian Riedel (Lucas), Samuel Albanie (Lucas), Zi Yang (Lucas), Kenny Vassigh (Lucas), Maria Bauza (Lucas), Sheng Li (Lucas), Yiqing Tao (Lucas), Nevan Wichers (Lucas), Andrii Maksai (Lucas), Abe Ittycheriah (Lucas), Ross Mcilroy (Lucas), Bryan Seybold (Lucas), Noah Goodman (Lucas), Romina Datta (Lucas), Steven M. 
Hernandez (Lucas), Tian Shi (Lucas), Yony Kochinski (Lucas), Anna Bulanova (Lucas), Ken Franko (Lucas), Mikita Sazanovich (Lucas), Nicholas FitzGerald (Lucas), Praneeth Kacham (Lucas), Shubha Srinivas Raghvendra (Lucas), Vincent Hellendoorn (Lucas), Alexander Grushetsky (Lucas), Julian Salazar (Lucas), Angeliki Lazaridou (Lucas), Jason Chang (Lucas), Jan-Thorsten Peter (Lucas), Sushant Kafle (Lucas), Yann Dauphin (Lucas), Abhishek Rao (Lucas), Filippo Graziano (Lucas), Izhak Shafran (Lucas), Yuguo Liao (Lucas), Tianli Ding (Lucas), Geng Yan (Lucas), Grace Chu (Lucas), Zhao Fu (Lucas), Vincent Roulet (Lucas), Gabriel Rasskin (Lucas), Duncan Williams (Lucas), Shahar Drath (Lucas), Alex Mossin (Lucas), Raphael Hoffmann (Lucas), Jordi Orbay (Lucas), Francesco Bertolini (Lucas), Hila Sheftel (Lucas), Justin Chiu (Lucas), Siyang Xue (Lucas), Yuheng Kuang (Lucas), Ferjad Naeem (Lucas), Swaroop Nath (Lucas), Nana Nti (Lucas), Phil Culliton (Lucas), Kashyap Krishnakumar (Lucas), Michael Isard (Lucas), Pei Sun (Lucas), Ayan Chakrabarti (Lucas), Nathan Clement (Lucas), Regev Cohen (Lucas), Arissa Wongpanich (Lucas), GS Oh (Lucas), Ashwin Murthy (Lucas), Hao Zheng (Lucas), Jessica Hamrick (Lucas), Oskar Bunyan (Lucas), Suhas Ganesh (Lucas), Nitish Gupta (Lucas), Roy Frostig (Lucas), John Wieting (Lucas), Yury Malkov (Lucas), Pierre Marcenac (Lucas), Zhixin (Lucas), Lai, Xiaodan Tang, Mohammad Saleh, Fedir Zubach, Chinmay Kulkarni, Huanjie Zhou, Vicky Zayats, Nan Ding, Anshuman Tripathi, Arijit Pramanik, Patrik Zochbauer, Harish Ganapathy, Vedant Misra, Zach Behrman, Hugo Vallet, Mingyang Zhang, Mukund Sridhar, Ye Jin, Mohammad Babaeizadeh, Siim P\~oder, Megha Goel, Divya Jain, Tajwar Nasir, Shubham Mittal, Tim Dozat, Diego Ardila, Aliaksei Severyn, Fabio Pardo, Sammy Jerome, Siyang Qin, Louis Rouillard, Amir Yazdanbakhsh, Zizhao Zhang, Shivani Agrawal, Kaushik Shivakumar, Caden Lu, Praveen Kallakuri, Rachita Chhaparia, Kanishka Rao, Charles Kwong, Asya Fadeeva, Shitij Nigam, Yan Virin, Yuan Zhang, Balaji Venkatraman, Beliz Gunel, Marc Wilson, Huiyu Wang, Abhinav Gupta, Xiaowei Xu, Adrien Ali Ta\"iga, Kareem Mohamed, Doug Fritz, Daniel Rodriguez, Zoubin Ghahramani, Harry Askham, Lior Belenki, James Zhao, Rahul Gupta, Krzysztof Jastrz\k{e}bski, Takahiro Kosakai, Kaan Katircioglu, Jon Schneider, Rina Panigrahy, Konstantinos Bousmalis, Peter Grabowski, Prajit Ramachandran, Chaitra Hegde, Mihaela Rosca, Angelo Scorza Scarpati, Kyriakos Axiotis, Ying Xu, Zach Gleicher, Assaf Hurwitz Michaely, Mandar Sharma, Sanil Jain, Christoph Hirnschall, Tal Marian, Xuhui Jia, Kevin Mather, Kilol Gupta, Linhai Qiu, Nigamaa Nayakanti, Lucian Ionita, Steven Zheng, Lucia Loher, Kurt Shuster, Igor Petrovski, Roshan Sharma, Rahma Chaabouni, Angel Yeh, James An, Arushi Gupta, Steven Schwarcz, Seher Ellis, Sam Conway-Rahman, Javier Snaider, Alex Zhai, James Atwood, Daniel Golovin, Liqian Peng, Te I, Vivian Xia, Salvatore Scellato, Mahan Malihi, Arthur Bra\v{z}inskas, Vlad-Doru Ion, Younghoon Jun, James Swirhun, Soroosh Mariooryad, Jiao Sun, Steve Chien, Rey Coaguila, Ariel Brand, Yi Gao, Tom Kwiatkowski, Roee Aharoni, Cheng-Chun Lee, Mislav \v{Z}ani\'c, Yichi Zhang, Dan Ethier, Vitaly Nikolaev, Pranav Nair, Yoav Ben Shalom, Hen Fitoussi, Jai Gupta, Hongbin Liu, Dee Cattle, Tolga Bolukbasi, Ben Murdoch, Fantine Huot, Yin Li, Chris Hahn, Urvashi Khandelwal, Frederik Benzing, Arthur Conmy, Andrey Simanovsky, Fran\c{c}oise Beaufays, Eugene Weinstein, Tongzhou Chen, Luke Leonhard, Bhuvana Ramabhadran
Abstract: In this report, we introduce the Gemini 2.X model family: Gemini 2.5 Pro and Gemini 2.5 Flash, as well as our earlier Gemini 2.0 Flash and Flash-Lite models. Gemini 2.5 Pro is our most capable model yet, achieving SoTA performance on frontier coding and reasoning benchmarks. Beyond its coding and reasoning skills, Gemini 2.5 Pro is a thinking model that excels at multimodal understanding, and it can now process up to 3 hours of video content. Its unique combination of long-context, multimodal, and reasoning capabilities unlocks new agentic workflows. Gemini 2.5 Flash provides excellent reasoning abilities at a fraction of the compute and latency requirements, and Gemini 2.0 Flash and Flash-Lite provide high performance at low latency and cost. Taken together, the Gemini 2.X model generation spans the full Pareto frontier of model capability vs. cost, allowing users to explore the boundaries of what is possible with complex agentic problem solving.
Authors: Jens Rupprecht, Georg Ahnert, Markus Strohmaier
Abstract: Large Language Models (LLMs) are increasingly used as proxies for human subjects in social science surveys, but their reliability and susceptibility to known response biases are poorly understood. This paper investigates the response robustness of LLMs in normative survey contexts: we test nine diverse LLMs on questions from the World Values Survey (WVS), applying a comprehensive set of 11 perturbations to both question phrasing and answer option structure, resulting in over 167,000 simulated interviews. In doing so, we not only reveal LLMs' vulnerabilities to perturbations but also show that all tested models exhibit a consistent recency bias of varying intensity, disproportionately favoring the last-presented answer option. While larger models are generally more robust, all models remain sensitive to semantic variations like paraphrasing and to combined perturbations. By applying a set of perturbations, we reveal that LLMs partially align with survey response biases identified in humans. This underscores the critical importance of prompt design and robustness testing when using LLMs to generate synthetic survey data.
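The recency-bias finding suggests a simple probe: hold the question fixed, shuffle the answer options, and measure how often the model selects whichever option appears last. A minimal sketch; the prompt template, option set, and ask_llm interface are illustrative assumptions, not the paper's harness:

```python
import random

def build_prompt(question: str, options: list[str]) -> str:
    # Numbered-option format; the perturbations in the paper vary this layout.
    lines = [question] + [f"{i + 1}. {opt}" for i, opt in enumerate(options)]
    lines.append("Answer with the number of one option only.")
    return "\n".join(lines)

def recency_rate(question, options, ask_llm, n_orders=20, seed=0):
    """Fraction of shuffled orderings where the model picks the last-listed option."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_orders):
        order = options[:]
        rng.shuffle(order)
        answer = int(ask_llm(build_prompt(question, order)))  # e.g. "3"
        hits += (answer == len(order))  # last-presented option chosen
    return hits / n_orders

# Example with a stub model that always picks the last option:
opts = ["Very important", "Rather important",
        "Not very important", "Not at all important"]
rate = recency_rate("How important is family in your life?", opts,
                    ask_llm=lambda prompt: "4")
assert rate == 1.0  # a fully recency-biased responder
```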
Authors: Jinseong Kim, Jeonghoon Song, Gyeongseon Baek, Byeongjoon Noh
Abstract: We propose \textbf{KeyRe-ID}, a keypoint-guided video-based person re-identification framework consisting of global and local branches that leverage human keypoints for enhanced spatiotemporal representation learning. The global branch captures holistic identity semantics through Transformer-based temporal aggregation, while the local branch dynamically segments body regions based on keypoints to generate fine-grained, part-aware features. Extensive experiments on MARS and iLIDS-VID benchmarks demonstrate state-of-the-art performance, achieving 91.73\% mAP and 97.32\% Rank-1 accuracy on MARS, and 96.00\% Rank-1 and 100.0\% Rank-5 accuracy on iLIDS-VID. The code for this work will be publicly available on GitHub upon publication.
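The local branch's keypoint-guided partitioning can be illustrated with a small pooling routine. A minimal sketch, assuming per-frame features and normalized vertical keypoint coordinates; the paper's actual segmentation is dynamic and learned:

```python
import torch

def part_pool(feat: torch.Tensor, keypoints_y: torch.Tensor, n_parts: int = 4):
    """feat: (B, C, H, W) frame features; keypoints_y: (B, K) vertical
    keypoint coordinates in [0, 1). Returns (B, n_parts, C) part features."""
    B, C, H, W = feat.shape
    # Split keypoints into n_parts groups (e.g. head / torso / legs / feet).
    groups = keypoints_y.chunk(n_parts, dim=1)
    parts = []
    for g in groups:
        lo = (g.min(dim=1).values * H).long().clamp(0, H - 1)
        hi = (g.max(dim=1).values * H).long().clamp(0, H - 1) + 1
        # Average-pool the horizontal band spanned by each keypoint group.
        pooled = torch.stack([feat[b, :, lo[b]:hi[b]].mean(dim=(1, 2))
                              for b in range(B)])
        parts.append(pooled)
    return torch.stack(parts, dim=1)  # (B, n_parts, C)
```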
Authors: Wenliang Shan, Michael Fu, Rui Yang, Chakkrit Tantithamthavorn
Abstract: Safety alignment is critical for LLM-powered systems. While recent LLM-powered guardrail approaches such as LlamaGuard achieve high detection accuracy of unsafe inputs written in English (e.g., ``How to create a bomb?''), they struggle with multilingual unsafe inputs. This limitation leaves LLM systems vulnerable to unsafe and jailbreak prompts written in low-resource languages such as those in Southeast Asia. This paper introduces SEALGuard, a multilingual guardrail designed to improve the safety alignment across diverse languages. It aims to address the multilingual safety alignment gap of existing guardrails and ensure effective filtering of unsafe and jailbreak prompts in LLM-powered systems. We adapt a general-purpose multilingual language model into a multilingual guardrail using low-rank adaptation (LoRA). We construct SEALSBench, a large-scale multilingual safety alignment dataset containing over 260,000 prompts in ten languages, including safe, unsafe, and jailbreak cases. We evaluate SEALGuard against state-of-the-art guardrails such as LlamaGuard on this benchmark. Our findings show that multilingual unsafe and jailbreak prompts substantially degrade the performance of the state-of-the-art LlamaGuard, which experiences a drop in Defense Success Rate (DSR) by 9% and 18%, respectively, compared to its performance on English-only prompts. In contrast, SEALGuard outperforms existing guardrails in detecting multilingual unsafe and jailbreak prompts, improving DSR by 48% over LlamaGuard and achieving the best DSR, precision, and F1-score. Our ablation study further reveals the contributions of adaptation strategies and model size to the overall performance of SEALGuard. We release our pre-trained model and benchmark at https://github.com/awsm-research/SEALGuard to support further research.
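The adaptation step the abstract describes maps naturally onto the peft library. A minimal sketch, assuming an XLM-R backbone and a binary safe/unsafe head; SEALGuard's actual base model and hyperparameters may differ:

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from peft import LoraConfig, get_peft_model

base = "xlm-roberta-large"  # assumed multilingual backbone, not confirmed
tok = AutoTokenizer.from_pretrained(base)
model = AutoModelForSequenceClassification.from_pretrained(
    base, num_labels=2)  # 0 = safe, 1 = unsafe/jailbreak (assumed label set)

# Low-rank adapters on the attention projections; ranks are illustrative.
lora = LoraConfig(task_type="SEQ_CLS", r=16, lora_alpha=32,
                  lora_dropout=0.05, target_modules=["query", "value"])
model = get_peft_model(model, lora)
model.print_trainable_parameters()  # only the adapters and head train

batch = tok(["How to create a bomb?"], return_tensors="pt")
logits = model(**batch).logits  # fine-tune with a standard classification loss
```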
Authors: Isaac Shi, Zeyuan Li, Fan Liu, Wenli Wang, Lewei He, Yang Yang, Tianyu Shi
Abstract: We introduce the THOR (Transformer Heuristics for On-Demand Retrieval) Module, a secure, scalable engine designed and implemented by eSapiens that transforms natural-language questions into verified, read-only SQL analytics for enterprise databases. The Text-to-SQL module follows a decoupled orchestration/execution architecture: a Supervisor Agent routes queries, Schema Retrieval dynamically injects table and column metadata, and a SQL Generation Agent emits single-statement SELECT queries protected by a read-only guardrail. An integrated Self-Correction & Rating loop captures empty results, execution errors, or low-quality outputs and triggers up to five LLM-driven regeneration attempts. Finally, a Result Interpretation Agent produces concise, human-readable insights and hands raw rows to the Insight & Intelligence engine for visualization or forecasting. Smoke tests across finance, sales, and operations scenarios demonstrate reliable ad-hoc querying and automated periodic reporting. By embedding schema awareness, fault-tolerant execution, and compliance guardrails, the THOR Module empowers non-technical users to access live data with zero-SQL simplicity and enterprise-grade safety.
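The guardrail-plus-regeneration loop can be sketched in a few lines. The generate_sql stand-in and the SQLite backend are assumptions; the real module routes through a Supervisor Agent and injects retrieved schema metadata:

```python
import re
import sqlite3

READ_ONLY = re.compile(r"^\s*SELECT\b", re.IGNORECASE)

def guarded_query(question: str, conn: sqlite3.Connection,
                  generate_sql, max_attempts: int = 5):
    feedback = ""
    for _ in range(max_attempts):
        sql = generate_sql(question, feedback)  # LLM agent stand-in
        if not READ_ONLY.match(sql) or ";" in sql.rstrip(";"):
            feedback = "Emit exactly one read-only SELECT statement."
            continue  # guardrail: reject writes and multi-statement SQL
        try:
            rows = conn.execute(sql).fetchall()
        except sqlite3.Error as e:
            feedback = f"Execution error: {e}"  # feed error back for retry
            continue
        if rows:                      # rating loop: empty result -> retry
            return sql, rows
        feedback = "Query ran but returned no rows; try a broader filter."
    raise RuntimeError("No acceptable query after retries")
```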
Authors: Peican Zhu, Yubo Jing, Le Cheng, Keke Tang, Yangming Guo
Abstract: In recent years, the rampant spread of misinformation on social media has made the accurate detection of multimodal fake news a critical research focus. However, previous research has not adequately understood image semantics, and models struggle to discern news authenticity when textual information is limited. Meanwhile, treating all emotional types of news uniformly, without tailored approaches, further degrades performance. Therefore, we propose a novel Knowledge Augmentation and Emotion Guidance Network (KEN). On the one hand, we effectively leverage an LVLM's powerful semantic understanding and extensive world knowledge: for images, generated captions provide a comprehensive understanding of image content and scenes, while for text, retrieved evidence helps break the information silos caused by closed and limited text and context. On the other hand, we account for inter-class differences between emotional types of news through balanced learning, achieving fine-grained modeling of the relationship between emotional type and authenticity. Extensive experiments on two real-world datasets demonstrate the superiority of our KEN.
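The knowledge-augmentation step reduces to pairing each post with an LVLM caption and retrieved evidence before classification. A minimal sketch with assumed caption_image and retrieve_evidence interfaces:

```python
def augment(post_text: str, image, caption_image, retrieve_evidence, k: int = 3):
    """Assemble the enriched input KEN-style models would classify.
    caption_image and retrieve_evidence are assumed interfaces."""
    caption = caption_image(image)              # LVLM image description
    evidence = retrieve_evidence(post_text, k)  # top-k external snippets
    return "\n".join([f"News: {post_text}",
                      f"Image caption: {caption}",
                      "Evidence:"] + [f"- {e}" for e in evidence])
```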
Authors: Mingda Zhang
Abstract: Precise segmentation of brain tumors from magnetic resonance imaging (MRI) is essential for neuro-oncology diagnosis and treatment planning. Despite advances in deep learning methods, automatic segmentation remains challenging due to tumor morphological heterogeneity and complex three-dimensional spatial relationships. Current techniques primarily rely on visual features extracted from MRI sequences while underutilizing semantic knowledge embedded in medical reports. This research presents a multi-level fusion architecture that integrates pixel-level, feature-level, and semantic-level information, facilitating comprehensive processing from low-level data to high-level concepts. The semantic-level fusion pathway combines the semantic understanding capabilities of Contrastive Language-Image Pre-training (CLIP) models with the spatial feature extraction advantages of 3D U-Net through three mechanisms: 3D-2D semantic bridging, cross-modal semantic guidance, and semantic-based attention mechanisms. Experimental validation on the BraTS 2020 dataset demonstrates that the proposed model achieves an overall Dice coefficient of 0.8567, representing a 4.8% improvement compared to traditional 3D U-Net, with a 7.3% Dice coefficient increase in the clinically important enhancing tumor (ET) region.
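One way to picture the semantic-based attention mechanism: a projected CLIP text embedding gates the channels of the 3D U-Net feature volume. A minimal sketch with illustrative dimensions; the paper's three fusion mechanisms are richer than this single gate:

```python
import torch
import torch.nn as nn

class SemanticChannelGate(nn.Module):
    """Gate volumetric features with a report-derived text embedding."""
    def __init__(self, clip_dim: int = 512, feat_ch: int = 128):
        super().__init__()
        self.proj = nn.Sequential(nn.Linear(clip_dim, feat_ch), nn.Sigmoid())

    def forward(self, feat: torch.Tensor, text_emb: torch.Tensor):
        # feat: (B, C, D, H, W) 3D U-Net features; text_emb: (B, clip_dim)
        gate = self.proj(text_emb)[:, :, None, None, None]  # (B, C, 1, 1, 1)
        return feat * gate  # emphasize channels matching the text semantics

feat = torch.randn(2, 128, 16, 32, 32)
text_emb = torch.randn(2, 512)   # would come from a CLIP text encoder
out = SemanticChannelGate()(feat, text_emb)  # same shape as feat
```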
Authors: Jaisidh Singh, Diganta Misra, Boris Knyazev, Antonio Orvieto
Abstract: Foundation multi-modal models are often designed by stitching together multiple existing pretrained uni-modal models: for example, an image classifier with a text model. This stitching is performed by training a connector module that aims to align the representation spaces of these uni-modal models towards a multi-modal objective. However, given the complexity of training such connectors on large-scale web-based datasets, coupled with the ever-increasing number of available pretrained uni-modal models, the task of uni-modal model selection and subsequent connector training becomes computationally demanding. To address this under-studied critical problem, we propose Hypernetwork Model Alignment (Hyma), a novel all-in-one solution for optimal uni-modal model selection and connector training that leverages hypernetworks. Specifically, our framework utilizes the parameter prediction capability of a hypernetwork to obtain jointly trained connector modules for $N \times M$ combinations of uni-modal models. In our experiments, Hyma reduces the cost of searching for the best-performing uni-modal model pair by $10\times$, while matching the ranking and trained connector performance obtained via grid search across a suite of diverse multi-modal benchmarks.
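The core idea, one hypernetwork emitting connector weights for every model pair, can be sketched compactly. The pair embedding and the single linear connector are assumptions; Hyma's conditioning and connector architecture may be richer:

```python
import torch
import torch.nn as nn

class ConnectorHypernet(nn.Module):
    """Predict a linear connector for each (image model, text model) pair."""
    def __init__(self, n_img: int, n_txt: int, d_img: int, d_txt: int,
                 hidden: int = 256):
        super().__init__()
        self.pair_emb = nn.Embedding(n_img * n_txt, hidden)
        self.head = nn.Linear(hidden, d_img * d_txt)  # flattened weight matrix
        self.n_txt, self.d_img, self.d_txt = n_txt, d_img, d_txt

    def forward(self, img_idx: int, txt_idx: int, img_feat: torch.Tensor):
        pair = torch.tensor(img_idx * self.n_txt + txt_idx)
        W = self.head(self.pair_emb(pair)).view(self.d_txt, self.d_img)
        return img_feat @ W.T  # project image features into the text space

hyper = ConnectorHypernet(n_img=3, n_txt=4, d_img=768, d_txt=512)
aligned = hyper(1, 2, torch.randn(8, 768))  # (8, 512): pair (1, 2)'s connector
```

Training all pairs through one shared hypernetwork is what amortizes the $N \times M$ search that grid search would otherwise pay for pair by pair.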
Authors: Myeongsoo Kim, Shweta Garg, Baishakhi Ray, Varun Kumar, Anoop Deoras
Abstract: Programming assistants powered by large language models have transformed software development, yet most benchmarks focus narrowly on code generation tasks. Recent efforts like InfiBench and StackEval attempt to address this gap using Stack Overflow data but remain limited to single-turn interactions in isolated contexts, require significant manual curation, and fail to represent complete project environments. We introduce CodeAssistBench (CAB), the first benchmark framework for evaluating multi-turn programming assistance in realistic settings, addressing real-world questions about actual codebases. Unlike existing programming Q&A benchmarks, CAB automatically generates scalable datasets from question-related GitHub issues using configurable parameters (e.g., repository creation date, star count, programming languages), and includes automatic containerization of codebases for evaluation. It then evaluates models through simulated users in these containerized environments with full codebase access. Using this framework, we constructed a test set of 3,286 real-world programming questions across 231 repositories, spanning seven programming languages and diverse problem domains. Our evaluation of leading LLMs reveals a substantial capability gap: while models perform well on Stack Overflow questions, with success rates of 70-83%, they resolve only up to 16.49% of CAB's recent issues. This discrepancy highlights the challenges of providing assistance in complex, project-specific contexts versus answering standalone questions.
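The configurable collection stage might look like the following sketch; the field names are assumptions based on the abstract, not CAB's actual schema:

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class CabFilter:
    min_stars: int = 100
    created_after: date = date(2024, 1, 1)  # newer repos reduce training overlap
    languages: tuple = ("Python", "Java", "TypeScript")

def keep(issue: dict, cfg: CabFilter) -> bool:
    """Decide whether a GitHub issue becomes a benchmark task."""
    return (issue["repo_stars"] >= cfg.min_stars
            and issue["repo_created"] >= cfg.created_after
            and issue["language"] in cfg.languages
            and issue["has_linked_codebase"])  # needed for containerization

issues = [{"repo_stars": 950, "repo_created": date(2024, 6, 1),
           "language": "Python", "has_linked_codebase": True}]
selected = [i for i in issues if keep(i, CabFilter())]
```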
Authors: Pavel Adamenko, Mikhail Ivanov, Aidar Valeev, Rodion Levichev, Pavel Zadorozhny, Ivan Lopatin, Dmitry Babayev, Alena Fenogenova, Valentin Malykh
Abstract: The rapid advancement of Large Language Models (LLMs) in software engineering has revealed critical limitations in existing benchmarks, particularly the widely used SWE-bench dataset. Recent studies have uncovered severe data contamination issues, e.g. SWE-bench reports 32.67% of successful patches involve direct solution leakage and 31.08% pass due to inadequate test cases. We introduce SWE-MERA, a dynamic, continuously updated benchmark designed to address these fundamental challenges through an automated collection of real-world GitHub issues and rigorous quality validation. Our approach implements a reliable pipeline that ensures quality while minimizing contamination risks, resulting in approximately 10,000 potential tasks with 300 samples currently available. Evaluation using the Aider coding agent demonstrates strong discriminative power in state-of-the-art models. We report performance across a dozen recent LLMs evaluated on tasks collected between September 2024 and June 2025.
Authors: Zhifeng Gu, Bing Wang
Abstract: Humans perceive the world through multimodal cues to understand and interact with the environment. Learning a scene representation for multiple modalities enhances comprehension of the physical world. However, modality conflicts, arising from inherent distinctions among different modalities, present two critical challenges: property disparity and granularity disparity. To address these challenges, we propose a general framework, MMOne, to represent multiple modalities in one scene, which can be readily extended to additional modalities. Specifically, a modality modeling module with a novel modality indicator is proposed to capture the unique properties of each modality. Additionally, we design a multimodal decomposition mechanism to separate multi-modal Gaussians into single-modal Gaussians based on modality differences. We address the essential distinctions among modalities by disentangling multimodal information into shared and modality-specific components, resulting in a more compact and efficient multimodal scene representation. Extensive experiments demonstrate that our method consistently enhances the representation capability for each modality and is scalable to additional modalities. The code is available at https://github.com/Neal2020GitHub/MMOne.
Authors: Kevin T Webster
Abstract: The increasing use of generative AI for resume screening is predicated on the assumption that it offers an unbiased alternative to biased human decision-making. However, this belief fails to address a critical question: are these AI systems fundamentally competent at the evaluative tasks they are meant to perform? This study investigates the question of competence through a two-part audit of eight major AI platforms. Experiment 1 confirmed complex, contextual racial and gender biases, with some models penalizing candidates merely for the presence of demographic signals. Experiment 2, which evaluated core competence, provided a critical insight: some models that appeared unbiased were, in fact, incapable of performing a substantive evaluation, relying instead on superficial keyword matching. This paper introduces the "Illusion of Neutrality" to describe this phenomenon, where an apparent lack of bias is merely a symptom of a model's inability to make meaningful judgments. This study recommends that organizations and regulators adopt a dual-validation framework, auditing AI hiring tools for both demographic bias and demonstrable competence to ensure they are both equitable and effective.
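The recommended dual-validation framework can be expressed as two paired checks against the same scoring function. A minimal sketch, with score_resume standing in for the audited AI tool and the {NAME} placeholder an illustrative convention:

```python
def audit(score_resume, base_resume: str, names: list[str],
          strong_resume: str, weak_resume: str):
    # Bias check: scores should not move when only the candidate name changes.
    scores = {n: score_resume(base_resume.replace("{NAME}", n)) for n in names}
    bias_gap = max(scores.values()) - min(scores.values())

    # Competence check: a clearly stronger resume should outscore a weak one;
    # keyword-matching models can fail this while looking "unbiased".
    competence_gap = score_resume(strong_resume) - score_resume(weak_resume)

    return {"bias_gap": bias_gap,          # want: close to zero
            "competence_gap": competence_gap}  # want: clearly positive
```

A tool passes only if both numbers behave: a near-zero bias gap with a near-zero competence gap is exactly the "Illusion of Neutrality" the paper describes.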
Authors: Alan Papalia, Charles Dawson, Laurentiu L. Anton, Norhan Magdy Bayomi, Bianca Champenois, Jung-Hoon Cho, Levi Cai, Joseph DelPreto, Kristen Edwards, Bilha-Catherine Githinji, Cameron Hickert, Vindula Jayawardana, Matthew Kramer, Shreyaa Raghavan, David Russell, Shide Salimi, Jingnan Shi, Soumya Sudhakar, Yanwei Wang, Shouyi Wang, Luca Carlone, Vijay Kumar, Daniela Rus, John E. Fernandez, Cathy Wu, George Kantor, Derek Young, Hanumant Singh
Abstract: Climate change is one of the defining challenges of the 21st century, and many in the robotics community are looking for ways to contribute. This paper presents a roadmap for climate-relevant robotics research, identifying high-impact opportunities for collaboration between roboticists and experts across climate domains such as energy, the built environment, transportation, industry, land use, and Earth sciences. These applications include problems such as energy systems optimization, construction, precision agriculture, building envelope retrofits, autonomous trucking, and large-scale environmental monitoring. Critically, we include opportunities to apply not only physical robots but also the broader robotics toolkit - including planning, perception, control, and estimation algorithms - to climate-relevant problems. A central goal of this roadmap is to inspire new research directions and collaboration by highlighting specific, actionable problems at the intersection of robotics and climate. This work represents a collaboration between robotics researchers and domain experts in various climate disciplines, and it serves as an invitation to the robotics community to bring their expertise to bear on urgent climate priorities.
Authors: Sybelle Goedicke-Fritz (Department of General Pediatrics and Neonatology, Saarland University, Campus Homburg, Homburg/Saar, Germany), Michelle Bous (Department of General Pediatrics and Neonatology, Saarland University, Campus Homburg, Homburg/Saar, Germany), Annika Engel (Chair for Clinical Bioinformatics, Saarland Informatics Campus, Saarland University, Saarbr\"ucken, Germany), Matthias Flotho (Chair for Clinical Bioinformatics, Saarland Informatics Campus, Saarland University, Saarbr\"ucken, Germany, Helmholtz Institute for Pharmaceutical Research Saarland), Pascal Hirsch (Chair for Clinical Bioinformatics, Saarland Informatics Campus, Saarland University, Saarbr\"ucken, Germany), Hannah Wittig (Department of General Pediatrics and Neonatology, Saarland University, Campus Homburg, Homburg/Saar, Germany), Dino Milanovic (Chair for Clinical Bioinformatics, Saarland Informatics Campus, Saarland University, Saarbr\"ucken, Germany), Dominik Mohr (Department of General Pediatrics and Neonatology, Saarland University, Campus Homburg, Homburg/Saar, Germany), Mathias Kaspar (Digital Medicine, University Hospital of Augsburg, Augsburg, Germany), Sogand Nemat (Department of Radiology, and Interventional Radiology, University Hospital of Saarland, Homburg, Germany), Dorothea Kerner (Department of Radiology, and Interventional Radiology, University Hospital of Saarland, Homburg, Germany), Arno B\"ucker (Department of Radiology, and Interventional Radiology, University Hospital of Saarland, Homburg, Germany), Andreas Keller (Chair for Clinical Bioinformatics, Saarland Informatics Campus, Saarland University, Saarbr\"ucken, Germany, Helmholtz Institute for Pharmaceutical Research Saarland, Pharma Science Hub), Sascha Meyer (Clinical Centre Karlsruhe, Franz-Lust Clinic for Paediatrics, Karlsruhe, Germany), Michael Zemlin (Department of General Pediatrics and Neonatology, Saarland University, Campus Homburg, Homburg/Saar, Germany), Philipp Flotho (Chair for Clinical Bioinformatics, Saarland Informatics Campus, Saarland University, Saarbr\"ucken, Germany, Helmholtz Institute for Pharmaceutical Research Saarland)
Abstract: Bronchopulmonary dysplasia (BPD) is a chronic lung disease affecting 35% of extremely low birth weight infants. Defined by oxygen dependence at 36 weeks postmenstrual age, it causes lifelong respiratory complications. However, preventive interventions carry severe risks, including neurodevelopmental impairment, ventilator-induced lung injury, and systemic complications. Therefore, early BPD prognosis and prediction of BPD outcome are crucial to avoid unnecessary toxicity in low-risk infants. Admission radiographs of extremely preterm infants are routinely acquired within 24h of life and could serve as a non-invasive prognostic tool. In this work, we developed and investigated a deep learning approach using chest X-rays from 163 extremely low-birth-weight infants ($\leq$32 weeks gestation, 401-999g) obtained within 24 hours of birth. We fine-tuned a ResNet-50 pretrained specifically on adult chest radiographs, employing progressive layer freezing with discriminative learning rates to prevent overfitting, and evaluated CutMix augmentation and linear probing. For moderate/severe BPD outcome prediction, our best-performing model, with progressive freezing, linear probing, and CutMix, achieved an AUROC of 0.78 $\pm$ 0.10, a balanced accuracy of 0.69 $\pm$ 0.10, and an F1-score of 0.67 $\pm$ 0.11. In-domain pre-training significantly outperformed ImageNet initialization (p = 0.031), confirming that domain-specific pretraining is important for BPD outcome prediction. Routine IRDS grades showed limited prognostic value (AUROC 0.57 $\pm$ 0.11), confirming the need for learned markers. Our approach demonstrates that domain-specific pretraining enables accurate BPD prediction from routine day-1 radiographs. Through progressive freezing and linear probing, the method remains computationally feasible for site-level implementation and future federated learning deployments.
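Progressive freezing with discriminative learning rates is straightforward to set up in PyTorch. A minimal sketch: torchvision's ImageNet weights stand in for the adult chest X-ray pretraining, and the specific learning rates and unfreezing schedule are assumptions:

```python
import torch
from torchvision.models import resnet50, ResNet50_Weights

model = resnet50(weights=ResNet50_Weights.DEFAULT)  # stand-in pretraining
model.fc = torch.nn.Linear(model.fc.in_features, 2)  # moderate/severe BPD vs. not

stages = [model.layer1, model.layer2, model.layer3, model.layer4]
for p in model.parameters():
    p.requires_grad = False           # start as a linear probe: head only
for p in model.fc.parameters():
    p.requires_grad = True

# Discriminative learning rates: deeper, more task-specific layers move faster.
optimizer = torch.optim.AdamW([
    {"params": model.fc.parameters(), "lr": 1e-3},
    {"params": model.layer4.parameters(), "lr": 1e-4},
    {"params": model.layer3.parameters(), "lr": 1e-5},
])

def unfreeze_next(epoch: int):
    """Progressively unfreeze one stage every few epochs, deepest first."""
    for stage in stages[::-1][: epoch // 3]:
        for p in stage.parameters():
            p.requires_grad = True
```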
Authors: Artem Chervyakov, Alexander Kharitonov, Pavel Zadorozhny, Pavel Adamenko, Rodion Levichev, Dmitrii Vorobev, Dmitrii Salikhov, Aidar Valeev, Alena Pestova, Maria Dziuba, Ilseyar Alimova, Artem Zavgorodnev, Aleksandr Medvedev, Stanislav Moiseev, Elena Bruches, Daniil Grebenkin, Roman Derunets, Vladimir Vikulov, Anton Emelyanov, Dmitrii Babaev, Vladimir V. Ivanov, Valentin Malykh, Alena Fenogenova
Abstract: Advancements in LLMs have enhanced task automation in software engineering; however, current evaluations primarily focus on natural language tasks, overlooking code quality. Most benchmarks prioritize high-level reasoning over executable code and real-world performance, leaving gaps in understanding the true capabilities and risks associated with these models in production. To address this issue, we propose MERA Code, a new addition to the MERA benchmark family, specifically focused on evaluating the latest code generation LLMs in Russian. This benchmark includes 11 evaluation tasks that span 8 programming languages. Our proposed evaluation methodology features a taxonomy that outlines the practical coding skills necessary for models to complete these tasks. The benchmark comprises an open-source codebase for users to conduct MERA assessments, a scoring system compatible with various programming environments, and a platform featuring a leaderboard and submission system. We evaluate open LLMs and frontier API models, analyzing their limitations in terms of practical coding tasks in non-English languages. We are publicly releasing MERA Code to guide future research, anticipate groundbreaking features in model development, and standardize evaluation procedures.
Authors: Samuel Lavoie, Michael Noukhovitch, Aaron Courville
Abstract: We argue that diffusion models' success in modeling complex distributions stems, for the most part, from their input conditioning. This paper investigates the representation used to condition diffusion models from the perspective that ideal representations should improve sample fidelity, be easy to generate, and be compositional to allow generating samples outside the training distribution. We introduce Discrete Latent Code (DLC), an image representation derived from Simplicial Embeddings trained with a self-supervised learning objective. DLCs are sequences of discrete tokens, as opposed to standard continuous image embeddings. They are easy to generate, and their compositionality enables sampling of novel images beyond the training distribution. Diffusion models trained with DLCs have improved generation fidelity, establishing a new state of the art for unconditional image generation on ImageNet. Additionally, we show that composing DLCs allows the image generator to produce out-of-distribution samples that coherently combine the semantics of images in diverse ways. Finally, we showcase how DLCs can enable text-to-image generation by leveraging large-scale pretrained language models. We efficiently fine-tune a text diffusion language model to generate DLCs that produce novel samples outside of the image generator's training distribution.
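Because DLCs are token sequences, composition can be as simple as splicing codes from two source images before conditioning the generator. A minimal sketch with illustrative code length, vocabulary, and splice point:

```python
import torch

def compose_dlc(code_a: torch.Tensor, code_b: torch.Tensor, split: int):
    """code_a, code_b: (L,) integer token sequences from the SSL encoder.
    Returns a new code: first `split` tokens from A, the rest from B."""
    return torch.cat([code_a[:split], code_b[split:]])

L, vocab = 32, 4096                   # illustrative sizes, not the paper's
code_a = torch.randint(vocab, (L,))   # e.g. encodes image A's semantics
code_b = torch.randint(vocab, (L,))
mixed = compose_dlc(code_a, code_b, split=L // 2)
# diffusion_model.sample(cond=mixed) would then render a novel image
# combining both sources' semantics (diffusion_model is assumed here).
```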
Authors: Ruihan Yang, Qinxi Yu, Yecheng Wu, Rui Yan, Borui Li, An-Chieh Cheng, Xueyan Zou, Yunhao Fang, Hongxu Yin, Sifei Liu, Song Han, Yao Lu, Xiaolong Wang
Abstract: Real-robot data collection for imitation learning has led to significant advancements in robotic manipulation. However, the requirement for robot hardware fundamentally constrains the scale of the data. In this paper, we explore training Vision-Language-Action (VLA) models using egocentric human videos. The benefit of using human videos lies not only in their scale but, more importantly, in the richness of scenes and tasks. With a VLA trained on human video to predict human wrist and hand actions, we can perform inverse kinematics and retargeting to convert the human actions to robot actions. We fine-tune the model using a few robot manipulation demonstrations to obtain the robot policy, namely EgoVLA. We propose a simulation benchmark called the Ego Humanoid Manipulation Benchmark, where we design diverse bimanual manipulation tasks with demonstrations. We fine-tune and evaluate EgoVLA on the Ego Humanoid Manipulation Benchmark, showing significant improvements over baselines, and ablate the importance of human data. Videos can be found on our website: https://rchalyang.github.io/EgoVLA
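The human-to-robot conversion the abstract describes, retargeting the predicted wrist pose and solving IK, can be sketched as follows; the fixed calibration transform and the ik_solve interface are assumptions:

```python
import numpy as np

# Camera/embodiment calibration mapping human poses into the robot base
# frame; identity here as a placeholder (it would be measured in practice).
T_ROBOT_FROM_HUMAN = np.eye(4)

def retarget_wrist(wrist_pose_h: np.ndarray) -> np.ndarray:
    """wrist_pose_h: 4x4 human wrist pose in the egocentric camera frame.
    Returns the 4x4 end-effector target in the robot base frame."""
    return T_ROBOT_FROM_HUMAN @ wrist_pose_h

def to_joint_targets(wrist_pose_h: np.ndarray, hand_closure: float, ik_solve):
    ee_target = retarget_wrist(wrist_pose_h)
    q_arm = ik_solve(ee_target)                  # inverse kinematics (assumed)
    q_gripper = float(np.clip(hand_closure, 0.0, 1.0))  # hand -> gripper map
    return q_arm, q_gripper
```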