Authors: Shai Shalev-Shwartz, Amnon Shashua
Abstract: Chain-of-Thought (CoT) reasoning has emerged as a powerful tool for enhancing the problem-solving capabilities of large language models (LLMs). However, the theoretical foundations of learning from CoT data remain underdeveloped, and existing approaches -- such as Supervised Fine-Tuning (SFT), Reinforcement Learning (RL), Tree-of-Thoughts (ToT), and Monte Carlo Tree Search (MCTS) -- often fail on complex reasoning tasks. In this work, we identify core obstacles that hinder effective CoT learning, including distribution drift, lack of embedded search, and exponential inference costs. We introduce the Diligent Learner, a new learning paradigm that explicitly models reasoning as a depth-first search guided by a validator and supports backtracking upon failure. Under two mild and realistic assumptions, we prove that the Diligent Learner can efficiently learn from CoT data while existing methods fail to do so. This framework offers a path toward building scalable and reliable reasoning systems trained on naturally occurring, incomplete data -- paving the way for the development of Large Reasoning Models (LRMs) with robust, interpretable problem-solving abilities.
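A minimal sketch of the validator-guided depth-first search with backtracking that the abstract describes, assuming a proposer that generates candidate reasoning steps and a learned validator; propose_steps, validate, and is_solution are hypothetical stand-ins, not the paper's released code.

# Depth-first search over chains of thought with backtracking on failure.
from typing import Callable, List, Optional

def diligent_search(
    state: str,
    propose_steps: Callable[[str], List[str]],  # candidate next reasoning steps
    validate: Callable[[str], bool],            # learned validator prunes bad branches
    is_solution: Callable[[str], bool],
    max_depth: int = 10,
) -> Optional[List[str]]:
    if is_solution(state):
        return []
    if max_depth == 0:
        return None
    for step in propose_steps(state):
        if not validate(state + step):          # invalid branch: skip it
            continue
        tail = diligent_search(state + step, propose_steps, validate,
                               is_solution, max_depth - 1)
        if tail is not None:                    # success below: keep this step
            return [step] + tail
    return None                                 # all branches failed: backtrack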
Authors: Marek Vlk, Premysl Sucha, Jaroslaw Rudy, Radoslaw Idzikowski
Abstract: The food production industry, especially the meat production sector, faces many challenges, which have only escalated with the recent outbreak of the energy crisis in the European Union. Efficient use of input materials is therefore an essential aspect affecting the profit of such companies. This paper addresses an optimization problem, concerning the purchase and subsequent processing of material, that we solved for a meat processing company. Unlike the majority of existing papers, we do not concentrate on how this problem relates to supply chain management; we focus purely on the production stage. The problem involves the concept of alternative ways of processing material, stock of material with different expiration dates, and extra constraints widely neglected in the current literature, namely the minimum order quantity and the minimum percentage in alternatives. We prove that each of these two constraints makes the problem $\mathcal{NP}$-hard, and hence we design a simple iterative approach based on integer linear programming that allows us to solve real-life instances even with an open-source integer linear programming solver. Another advantage of this approach is that it mitigates the numerical issues, caused by the extensive range of data values, that we experienced with a commercial solver. The results obtained using real data from the meat processing company show that our algorithm can find the optimum solution in a few seconds for all considered use cases.
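For readers unfamiliar with why the minimum order quantity breaks plain linear programming, below is the standard big-M linearization of such a constraint; this is an assumption about how the condition is typically encoded, not the paper's exact formulation:

\[
q_i^{\min}\, y_i \;\le\; x_i \;\le\; M_i\, y_i, \qquad y_i \in \{0,1\},
\]

where $x_i$ is the purchased quantity of material $i$, $M_i$ is an upper bound on it, and the binary indicator $y_i$ forces each order to be either zero or at least $q_i^{\min}$. The binary variables are precisely what pushes the model from polynomially solvable LP into $\mathcal{NP}$-hard territory.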
Authors: Yin Wu, Daniel Slieter, Vivek Subramanian, Ahmed Abouelazm, Robin Bohn, J. Marius Zöllner
Abstract: The growing number of ADAS-equipped vehicles has led to a dramatic increase in driving data, yet most of it captures routine driving behavior. Identifying and understanding safety-critical corner cases within this vast dataset remains a significant challenge. Braking events are particularly indicative of potentially hazardous situations, motivating the central question of our research: Why does a vehicle brake? Existing approaches primarily rely on rule-based heuristics to retrieve target scenarios using predefined condition filters. While effective in simple environments such as highways, these methods lack generalization in complex urban settings. In this paper, we propose a novel framework that leverages a Large Language Model (LLM) for scenario understanding and reasoning. Our method bridges the gap between low-level numerical signals and natural language descriptions, enabling the LLM to interpret and classify driving scenarios. We propose a dual-path scenario retrieval mechanism that supports both category-based search for known scenarios and embedding-based retrieval for unknown Out-of-Distribution (OOD) scenarios. To facilitate evaluation, we curate scenario annotations on the Argoverse 2 Sensor Dataset. Experimental results show that our method outperforms rule-based baselines and generalizes well to OOD scenarios.
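A minimal sketch of the embedding-based retrieval path for unknown (OOD) scenarios, assuming scenario descriptions with precomputed embeddings; embed is a hypothetical encoder call, not the paper's API.

import numpy as np

def retrieve_ood(query: str, descriptions: list[str],
                 embeddings: np.ndarray, embed, k: int = 5) -> list[str]:
    # Return the k scenario descriptions closest to the query embedding.
    q = embed(query)                                    # shape: (d,)
    q = q / np.linalg.norm(q)
    db = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    scores = db @ q                                     # cosine similarity
    top = np.argsort(-scores)[:k]
    return [descriptions[i] for i in top]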
Authors: Jerry Li, Timothy Oh, Joseph Hoang, Vardhit Veeramachaneni
Abstract: Small language models have gained significant popularity due to their efficiency and growing capabilities. However, incorporating additional modalities, such as vision, can exacerbate the challenge of limited context windows by introducing noise. Recent studies have highlighted that Transformer attention mechanisms often disproportionately focus on irrelevant contexts. In this work, we extend the Differential Attention mechanism, originally designed for text-only models, to the text-vision model PaliGemma. Our aim is to evaluate its ability to mitigate noisy information retrieval and reduce hallucinations. To this end, we fine-tuned the PaliGemma 3B model using LoRA, incorporating Differential Attention, and experimented with various parameter settings and configurations. We demonstrate that Differential Attention can be adapted and integrated into the fine-tuning of existing models to enhance noisy information retrieval and question-answering capabilities.
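A single-head sketch of the Differential Attention mechanism being adapted here: two softmax attention maps are subtracted so that attention noise common to both is cancelled. This follows the original text-only formulation; the learnable scalar lam and the head-splitting details of the PaliGemma variant may differ.

import torch
import torch.nn.functional as F

def differential_attention(q1, k1, q2, k2, v, lam: float):
    # q*, k*: (batch, seq, d); v: (batch, seq, d_v); lam: learned scalar.
    d = q1.size(-1)
    a1 = F.softmax(q1 @ k1.transpose(-1, -2) / d**0.5, dim=-1)
    a2 = F.softmax(q2 @ k2.transpose(-1, -2) / d**0.5, dim=-1)
    return (a1 - lam * a2) @ v  # the differential map suppresses shared noise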
Authors: Eric Benhamou, Jean-Jacques Ohana, Alban Etienne, Béatrice Guez, Ethan Setrouk, Thomas Jacquot
Abstract: Commodity Trading Advisors (CTAs) have historically relied on trend-following rules that operate on vastly different horizons, from long-term breakouts that capture major directional moves to short-term momentum signals that thrive in fast-moving markets. Despite a large body of work on trend following, the relative merits and interactions of short- versus long-term trend systems remain controversial. This paper adds to the debate by (i) dynamically decomposing CTA returns into short-term trend, long-term trend, and market beta factors using a Bayesian graphical model, and (ii) showing how the blend of horizons shapes the strategy's risk-adjusted performance.
Authors: Simon Ouellette
Abstract: We run a controlled compositional generalization experiment in the ARC-AGI domain: an open-world problem domain in which the ability to generalize out-of-distribution is, by design, an essential characteristic for success. We compare neural program synthesis and test-time fine-tuning (TTFT) approaches on this experiment. We find that execution-guided neural program synthesis outperforms all reference algorithms in its ability to compose novel solutions. Our empirical findings also suggest that the success of TTFT on ARC-AGI lies mainly in eliciting in-distribution knowledge that the LLM otherwise fails to rely on directly.
Authors: Andy E. Williams
Abstract: Intelligence, whether biological, artificial, or collective, requires structural coherence across recursive reasoning processes to scale effectively. As complex systems grow, coherence becomes fragile unless a higher-order structure ensures semantic consistency. This paper introduces the Recursive Coherence Principle (RCP): a foundational constraint stating that for any reasoning system of order N, composed of systems operating over conceptual spaces of order N-1, semantic coherence is preserved only by a recursively evaluable generalization operator that spans and aligns those lower-order conceptual spaces. Crucially, this coherence enables structural alignment. Without recursive coherence, no system can reliably preserve goals, meanings, or reasoning consistency at scale. We formally define the Functional Model of Intelligence (FMI) as the only known operator capable of satisfying the RCP at any scale. The FMI is a minimal, composable architecture with internal functions (evaluation, modeling, adaptation, stability, decomposition, bridging) and external functions (storage, recall, System 1 and System 2 reasoning) vital for preserving semantic structure across inference and coordination layers. We prove that any system lacking the FMI will experience recursive coherence breakdown as it scales, arguing that common AI issues like misalignment, hallucination, and instability are symptoms of this structural coherence loss. Unlike other foundational principles, RCP uniquely captures the internal, recursive dynamics needed for coherent, alignable intelligence, modeling semantic coherence under recursion. This work significantly impacts AI alignment, advocating a shift from behavioral constraints to structural coherence, and offers a pathway for safely generalizable, robustly coherent AI at scale.
Authors: Pierluca D'Oro, Caley Drooff, Joy Chen, Joseph Tighe
Abstract: Large language models have paved the way to powerful and flexible AI agents, assisting humans by increasingly integrating into their daily life. This flexibility, potential, and growing adoption demands a holistic and cross-disciplinary approach to developing, monitoring and discussing the capabilities required for agent-driven user experiences. However, current guidance on human-centered AI agent development is scattered: UX heuristics focus on interface behaviors, engineering taxonomies describe internal pipelines, and ethics checklists address high-level governance. There is no concise, user-facing vocabulary that tells teams what an agent should fundamentally be able to do. We introduce ADEPTS, a capability framework defining a set of core user-facing capabilities to provide unified guidance around the development of AI agents. ADEPTS is based on six principles for human-centered agent design that express the minimal, user-facing capabilities an AI agent should demonstrate to be understandable, controllable and trustworthy in everyday use. ADEPTS complements existing frameworks and taxonomies; differently from them, it sits at the interface between technical and experience development. By presenting ADEPTS, we aim to condense complex AI-UX requirements into a compact framework that offers actionable guidance for AI researchers, designers, engineers, and policy reviewers alike. We believe ADEPTS has the potential to accelerate the improvement of user-relevant agent capabilities, to ease the design of experiences that take advantage of those capabilities, and to provide a shared language for tracking and discussing progress around the development of AI agents.
Authors: Lisa Dargasz
Abstract: Reinforcement Learning is a machine learning methodology that has demonstrated strong performance across a variety of tasks. In particular, it plays a central role in the development of artificial autonomous agents. As these agents become increasingly capable, market readiness is rapidly approaching, which means those agents, for example taking the form of humanoid robots or autonomous cars, are poised to transition from laboratory prototypes to autonomous operation in real-world environments. This transition raises concerns that lead to specific requirements for these systems, among them the requirement that they are designed to behave ethically. Crucially, research directed toward building agents that fulfill the requirement to behave ethically, referred to as artificial moral agents (AMAs), has to address a range of challenges at the intersection of computer science and philosophy. This study explores the development of reason-based artificial moral agents (RBAMAs). RBAMAs are built on an extension of the reinforcement learning architecture to enable moral decision-making based on sound normative reasoning, which is achieved by equipping the agent with the capacity to learn a reason-theory, a theory which enables it to process morally relevant propositions to derive moral obligations, through case-based feedback. They are designed such that they adapt their behavior to ensure conformance to these obligations while they pursue their designated tasks. These features contribute to the moral justifiability of their actions, their moral robustness, and their moral trustworthiness, which recommends the extended architecture as a concrete and deployable framework for the development of AMAs that fulfill key ethical desiderata. This study presents a first implementation of an RBAMA and demonstrates the potential of RBAMAs in initial experiments.
Authors: Joydeep Chandra, Satyam Kumar Navneet
Abstract: The implementation of Artificial Intelligence (AI) in household environments, especially in the form of proactive autonomous agents, brings possibilities of comfort and attentive assistance, but it also comes with ethical challenges both inside and outside the home. This article analyzes agentic AI and its applications, focusing on its move from reactive to proactive autonomy, privacy, fairness, and user control. We review responsible innovation frameworks, human-centered design principles, and governance practices to distill practical guidance for ethical smart home systems. Vulnerable user groups, such as elderly individuals, children, and neurodivergent individuals, who face higher risks of surveillance, bias, and privacy violations, are studied in detail in the context of agentic AI. Design imperatives are highlighted, such as tailored explainability, granular consent mechanisms, and robust override controls, supported by participatory and inclusive methodologies. We also explore how data-driven insights, including social media analysis via Natural Language Processing (NLP), can inform specific user needs and ethical concerns. This survey aims to provide both a conceptual foundation and suggestions for developing transparent, inclusive, and trustworthy agentic AI in household automation.
Authors: Tong Wu, Chong Xiang, Jiachen T. Wang, Weichen Yu, Chawin Sitawarin, Vikash Sehwag, Prateek Mittal
Abstract: Recently, Zaremba et al. demonstrated that increasing inference-time computation improves robustness in large proprietary reasoning LLMs. In this paper, we first show that smaller-scale, open-source models (e.g., DeepSeek R1, Qwen3, Phi-reasoning) can also benefit from inference-time scaling using a simple budget forcing strategy. More importantly, we reveal and critically examine an implicit assumption in prior work: intermediate reasoning steps are hidden from adversaries. By relaxing this assumption, we identify an important security risk, intuitively motivated and empirically verified as an inverse scaling law: if intermediate reasoning steps become explicitly accessible, increased inference-time computation consistently reduces model robustness. Finally, we discuss practical scenarios where models with hidden reasoning chains are still vulnerable to attacks, such as models with tool-integrated reasoning and advanced reasoning extraction attacks. Our findings collectively demonstrate that the robustness benefits of inference-time scaling depend heavily on the adversarial setting and deployment context. We urge practitioners to carefully weigh these subtle trade-offs before applying inference-time scaling in security-sensitive, real-world applications.
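A sketch of the simple budget forcing strategy mentioned above, under the assumption that it follows the common recipe of suppressing the end-of-thinking marker and appending a continuation cue until the token budget is spent; generate is a hypothetical decoding wrapper, and the word count is a crude stand-in for a proper token count.

def budget_forced_reasoning(generate, prompt: str, budget: int) -> str:
    trace, spent = "", 0
    while spent < budget:
        chunk, ended = generate(prompt + trace, stop="</think>")
        trace += chunk
        spent += len(chunk.split())   # crude proxy for tokens spent
        if ended and spent < budget:
            trace += "\nWait"         # force further reflection
        elif ended:
            break
    return trace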
Authors: Xi Yang, Jiachen Wang, Song Han, Suining He
Abstract: Efficient use of urban micromobility resources such as bike sharing is challenging due to the unbalanced station-level demand and supply, which makes the maintenance of bike sharing systems painstaking. Prior efforts have been made on accurate prediction of bike traffic, i.e., demand/pick-up and return/drop-off, to achieve system efficiency. However, bike station-level traffic prediction is difficult because of the spatio-temporal complexity of bike sharing systems. Moreover, such prediction over entire bike sharing systems is also challenging due to the large number of bike stations. To fill this gap, we propose BikeMAN, a multi-level spatio-temporal attention neural network that predicts station-level bike traffic for entire bike sharing systems. The proposed network consists of an encoder and a decoder, with one attention mechanism representing the spatial correlation between features of bike stations in the system and another attention mechanism describing the temporal characteristics of bike station traffic. Through an experimental study on over 10 million trips of the New York City bike sharing system (> 700 stations), our network showed high accuracy in predicting the bike station traffic of all stations in the city.
Authors: Tehseen Rug, Felix Böhmer, Tessa Pfattheicher
Abstract: Classical computation, grounded in formal, logical systems, has been the engine of technological progress for decades, excelling at problems that can be described with unambiguous rules. This paradigm, however, leaves a vast ocean of human problems -- those characterized by ambiguity, dynamic environments, and subjective context -- largely untouched. The advent of Large Language Models (LLMs) represents a fundamental shift, enabling computational systems to engage with this previously inaccessible domain using natural language. This paper introduces a unified framework to understand and contrast these problem-solving paradigms. We define and delineate the problem spaces addressable by formal languages versus natural language. While solutions to the former problem class can be evaluated using binary quality measures, the latter requires a much more nuanced definition of approximate solution space taking into account the vagueness, subjectivity and ambiguity inherent to natural language. We therefore introduce a vector-valued trust index Q, which reflects solution quality and distinguishes the binary correctness of formal solutions from the continuous adequacy spectrum characteristic of natural language solutions. Within this framework, we propose two statistical quality dimensions. Normalized bi-semantic entropy measures robustness and conceptual diversity of LLM answers given semantic variation in problem formulations. Emotional valence maps subjective valuation of a solution to a quantifiable metric that can be maximized by invoking statistical measures. The concepts introduced in this work will provide a more rigorous understanding of the capabilities, limitations, and inherent nature of problem-solving in the age of LLMs.
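To make the entropy dimension concrete, here is a hedged sketch of a normalized entropy over semantic clusters of answers to paraphrased problem formulations; the paper's exact bi-semantic formulation and normalization are not reproduced, only the general shape of such a computation.

import math
from collections import Counter

def normalized_semantic_entropy(cluster_ids: list[int]) -> float:
    # cluster_ids[i] is the semantic cluster of the answer to paraphrase i.
    counts = Counter(cluster_ids)
    n = len(cluster_ids)
    h = -sum((c / n) * math.log(c / n) for c in counts.values())
    max_h = math.log(len(counts)) if len(counts) > 1 else 1.0
    return h / max_h   # 0: all answers agree; 1: maximally dispersed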
Authors: Jeroen Spaans, Jesse Heyninck
Abstract: Constraint Logic Programming (CLP) is a logic programming formalism used to solve problems requiring the consideration of constraints, like resource allocation and automated planning and scheduling. It has previously been extended in various directions, for example to support fuzzy constraint satisfaction, uncertainty, or negation, with different notions of semiring being used as a unifying abstraction for these generalizations. None of these extensions have studied clauses with negation allowed in the body. We investigate an extension of CLP which unifies many of these extensions and allows negation in the body. We provide semantics for such programs, using the framework of approximation fixpoint theory, and give a detailed overview of the impacts of properties of the semirings on the resulting semantics. As such, we provide a unifying framework that captures existing approaches and allows extending them with a more expressive language.
Authors: Shengchao Liu, Hannan Xu, Yan Ai, Huanxin Li, Yoshua Bengio, Harry Guo
Abstract: Large language models (LLMs) leverage chain-of-thought (CoT) techniques to tackle complex problems, representing a transformative breakthrough in artificial intelligence (AI). However, their reasoning capabilities have primarily been demonstrated in solving math and coding problems, leaving their potential for domain-specific applications, such as battery discovery, largely unexplored. Inspired by the idea that reasoning mirrors a form of guided search, we introduce ChatBattery, a novel agentic framework that integrates domain knowledge to steer LLMs toward more effective reasoning in materials design. Using ChatBattery, we successfully identify, synthesize, and characterize three novel lithium-ion battery cathode materials, which achieve practical capacity improvements of 28.8%, 25.2%, and 18.5%, respectively, over the widely used cathode material LiNi$_{0.8}$Mn$_{0.1}$Co$_{0.1}$O$_2$ (NMC811). Beyond this discovery, ChatBattery paves a new path by demonstrating a successful LLM-driven, reasoning-based platform for battery materials invention. This complete AI-driven cycle, from design to synthesis to characterization, demonstrates the transformative potential of AI-driven reasoning in revolutionizing materials discovery.
Authors: Michael R. Bock, Kara Molisee, Zachary Ozer, Sumit Shah
Abstract: Can AI file your taxes? Not yet. Calculating US personal income taxes is a task that requires building an understanding of vast amounts of English text and using that knowledge to carefully compute results. We propose TaxCalcBench, a benchmark for determining models' abilities to calculate personal income tax returns given all of the necessary information. Our experiment shows that state-of-the-art models succeed in calculating less than a third of federal income tax returns even on this simplified sample set. Our analysis concludes that models consistently misuse tax tables, make errors in tax calculation, and incorrectly determine eligibility. Our findings point to the need for additional infrastructure to apply LLMs to the personal income tax calculation task.
Authors: Shuhao Mei, Yongchao Long, Shan Cao, Xiaobo Han, Shijia Geng, Jinbo Sun, Yuxi Zhou, Shenda Hong
Abstract: Chronic Obstructive Pulmonary Disease (COPD), a major chronic respiratory disease with persistent airflow limitation, is a leading global cause of disability and mortality. Respiratory spirogram time series, routinely collected during pulmonary function tests (PFTs), play a critical role in the early detection of respiratory diseases and in monitoring lung function over time. However, most current AI models for COPD diagnosis are limited to outputting classification results without providing a rationale for their diagnostic process, while current Large Language Models (LLMs) cannot understand spirograms yet, which severely limits their clinical trust and adoption. To tackle this challenge, we leverage a cohort of 234,028 individuals from the UK Biobank (UKB) to propose SpiroLLM, the first multimodal large language model that can understand spirograms. The model extracts morphological features from respiratory curves via a SpiroEncoder and aligns them with PFT numerical values in a unified latent space using a SpiroProjector, ultimately empowering a large language model to generate a comprehensive diagnostic report. Experimental results confirm that SpiroLLM achieved a diagnostic AUROC of 0.8980 (95% CI: 0.8820-0.9132). In a robustness test with missing core data, it maintained a 100% valid response rate, far surpassing the 13.4% of a text-only model and showcasing the superiority of its multimodal design. This work demonstrates the substantial potential of deeply fusing physiological signals with large language models, establishing a new paradigm for the next generation of interpretable and reliable clinical decision support tools.
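A minimal sketch of the projector idea: encoder features extracted from the spirogram are mapped into the language model's embedding space as a handful of soft tokens. All dimensions below are illustrative assumptions, not SpiroLLM's actual configuration.

import torch.nn as nn

class SpiroProjectorSketch(nn.Module):
    def __init__(self, feat_dim: int = 256, llm_dim: int = 4096, n_tokens: int = 8):
        super().__init__()
        self.proj = nn.Linear(feat_dim, llm_dim * n_tokens)
        self.n_tokens, self.llm_dim = n_tokens, llm_dim

    def forward(self, feats):                            # feats: (batch, feat_dim)
        x = self.proj(feats)                             # (batch, llm_dim * n_tokens)
        return x.view(-1, self.n_tokens, self.llm_dim)   # soft tokens for the LLM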
Authors: Myung Ho Kim
Abstract: We report the discovery of a structural convergence across four influential theories of mind: Kahneman's dual-system theory, Friston's predictive processing, Minsky's society of mind, and Clark's extended mind. The convergence emerged unintentionally within a practical AI agent architecture called Agentic Flow. Designed to address limitations in large language models (LLMs), Agentic Flow comprises five interdependent modules (Retrieval, Cognition, Control, Memory, and Action) arranged in a recurrent cognitive loop. Although originally inspired only by Minsky and Clark, the system's structure retrospectively aligns with computational motifs found in all four theories, including predictive modeling, associative recall, and error-sensitive control. To assess this convergence, we conducted comparative experiments with baseline LLM agents on multi-step reasoning tasks. The structured agent achieved 95.8% task success and exhibited strong constraint adherence, while the baseline system succeeded 62.3% of the time. These results were not aimed at proving superiority, but at illustrating how theoretical structures may emerge through practical design choices rather than top-down theory. We introduce PEACE as a descriptive meta-architecture that captures design-level regularities observed in Agentic Flow. Not intended as a new theory, PEACE provides a shared vocabulary for understanding architectures shaped by real-world implementation demands. This paper should be read as a position paper - an exploratory reflection on how implementation can surface latent structural echoes of cognitive theory, without asserting theoretical unification.
Authors: Li-Hsiang Shen, Jyun-Jhe Huang
Abstract: A space-air-ground integrated network (SAGIN) architecture is proposed, empowered by multi-functional reconfigurable intelligent surfaces (MF-RIS) capable of simultaneously reflecting, amplifying, and harvesting wireless energy. The MF-RIS plays a pivotal role in addressing the energy shortages of low-Earth orbit (LEO) satellites operating in shadowed regions, while explicitly accounting for both communication and computing energy consumption across the SAGIN nodes. To maximize the long-term energy efficiency (EE), we formulate a joint optimization problem over the MF-RIS parameters, including signal amplification, phase-shifts, energy harvesting ratio, and active element selection, as well as the SAGIN parameters of beamforming vectors, high-altitude platform station (HAPS) deployment, user association, and computing capability. The formulated problem is highly non-convex and non-linear and contains mixed discrete-continuous parameters. To tackle this, we conceive a compressed hybrid intelligence for twin-model enhanced multi-agent deep reinforcement learning (CHIMERA) framework, which integrates semantic state-action compression and parametrized sharing under hybrid reinforcement learning to efficiently explore suitable complex actions. Simulation results demonstrate that the proposed CHIMERA scheme achieves the highest EE, substantially outperforming conventional benchmarks, including fixed-configuration or non-harvesting MF-RIS, traditional RIS, and no-RIS cases, as well as centralized and multi-agent deep reinforcement learning baselines. Moreover, the proposed SAGIN-MF-RIS architecture achieves superior EE performance due to its complementary coverage, offering notable advantages over standalone satellite, aerial, or ground-only deployments.
Authors: Dong Ben, Hui Feng, Qian Wang
Abstract: Large Language Models (LLMs) are increasingly used in circuit design tasks and have typically undergone multiple rounds of training. Both the trained models and their associated training data are considered confidential intellectual property (IP) and must be protected from exposure. Confidential Computing offers a promising solution to protect data and models through Trusted Execution Environments (TEEs). However, existing TEE implementations are not designed to support the resource-intensive nature of LLMs efficiently. In this work, we first present a comprehensive evaluation of LLMs within a TEE-enabled confidential computing environment, specifically utilizing Intel Trust Domain Extensions (TDX). We constructed experiments on three environments: TEE-based, CPU-only, and CPU-GPU hybrid implementations, and evaluated their performance in terms of tokens per second. Our first observation is that distilled models, such as DeepSeek, surpass other models in performance due to their smaller parameter counts, making them suitable for resource-constrained devices. Also, for quantized models, such as 4-bit quantization (Q4) and 8-bit quantization (Q8), we observed a performance gain of up to 3x compared to FP16 models. Our findings indicate that for models with fewer parameters, such as DeepSeek-r1-1.5B, the TDX implementation outperforms the CPU version in executing computations within a secure environment. We further validate the results using a testbench designed for SoC design tasks. These validations demonstrate the potential of efficiently deploying lightweight LLMs on resource-constrained systems for semiconductor CAD applications.
Authors: Bo Wen, Chen Wang, Qiwei Han, Raquel Norel, Julia Liu, Thaddeus Stappenbeck, Jeffrey L. Rogers
Abstract: The integration of voice-based AI agents in healthcare presents a transformative opportunity to bridge economic and accessibility gaps in digital health delivery. This paper explores the role of large language model (LLM)-powered voice assistants in enhancing preventive care and continuous patient monitoring, particularly in underserved populations. Drawing insights from the development and pilot study of Agent PULSE (Patient Understanding and Liaison Support Engine) -- a collaborative initiative between IBM Research, Cleveland Clinic Foundation, and Morehouse School of Medicine -- we present an economic model demonstrating how AI agents can provide cost-effective healthcare services where human intervention is economically unfeasible. Our pilot study with 33 inflammatory bowel disease patients revealed that 70% expressed acceptance of AI-driven monitoring, with 37% preferring it over traditional modalities. Technical challenges, including real-time conversational AI processing, integration with healthcare systems, and privacy compliance, are analyzed alongside policy considerations surrounding regulation, bias mitigation, and patient autonomy. Our findings suggest that AI-driven voice agents not only enhance healthcare scalability and efficiency but also improve patient engagement and accessibility. For healthcare executives, our cost-utility analysis demonstrates substantial potential savings for routine monitoring tasks, while technologists can leverage our framework to prioritize improvements yielding the highest patient impact. By addressing current limitations and aligning AI development with ethical and regulatory frameworks, voice-based AI agents can serve as a critical entry point for equitable, sustainable digital healthcare solutions.
Authors: Tianze Xu, Pengrui Lu, Lyumanshan Ye, Xiangkun Hu, Pengfei Liu
Abstract: The emergence of deep research systems presents significant capabilities in problem-solving, extending from basic queries to sophisticated research tasks. However, existing benchmarks primarily evaluate these systems as agents for web retrieval and report generation, overlooking their potential to discover novel insights on the frontiers of scientific research. To address this gap, we introduce ResearcherBench, the first benchmark focused on evaluating the capabilities of these advanced, agentic systems - which we refer to as Deep AI Research Systems (DARS) - on frontier AI scientific questions. We compiled a dataset of 65 research questions expertly selected from real-world scientific scenarios such as laboratory discussions and interviews, spanning 35 different AI subjects and categorized into three types: technical details, literature review, and open consulting. Our dual evaluation framework combines rubric assessment, which uses expert-designed criteria to evaluate insight quality, with factual assessment, which measures citation accuracy (faithfulness) and coverage (groundedness). We evaluated several leading commercial DARS and baseline systems. Results show that OpenAI Deep Research and Gemini Deep Research significantly outperform other systems, with particular strength in open-ended consulting questions. Such capabilities represent a meaningful step toward AI self-improvement, aligning with the vision of ASI for AI. We open-source ResearcherBench to provide a standardized platform for promoting the development of next-generation AI research assistants, hoping to foster a new perspective in AI research evaluation for a novel pattern of scientific collaboration: https://github.com/GAIR-NLP/ResearcherBench.
Authors: Cairong Zhao, Yufeng Jin, Zifan Song, Haonan Chen, Duoqian Miao, Guosheng Hu
Abstract: Deep learning has achieved great progress recently; however, it is not easy or efficient to further improve its performance by increasing the size of the model. Multi-modal learning can mitigate this challenge by introducing richer and more discriminative information as input. To solve the problem of limited access to multi-modal data at the time of use, we conduct multi-modal learning by introducing a teacher model to transfer discriminative knowledge to a student model during training. However, this knowledge transfer via distillation is not trivial because the big domain gap between the widely differing modalities can easily lead to overfitting. In this work, we introduce a cross-modal distillation framework. Specifically, we find that a hard constrained loss, e.g., an l2 loss forcing the student to be exactly the same as the teacher, can easily lead to overfitting in cross-modality distillation. To address this, we propose two soft constrained knowledge distillation strategies, at the feature level and classifier level respectively. In addition, we propose a quality-based adaptive weights module to weight input samples by quantified data quality, leading to robust model training. We conducted experiments on speaker recognition and image classification tasks, and the results show that our approach is able to effectively achieve knowledge transfer between the commonly used and widely differing modalities of image, text, and speech.
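A sketch of what the two soft constrained distillation losses could look like, assuming teacher and student features and logits are already extracted; the paper's exact loss forms may differ, these are the standard soft alternatives to a hard l2 match.

import torch.nn.functional as F

def soft_feature_loss(f_s, f_t):
    # Feature level: match directions rather than exact values.
    return 1 - F.cosine_similarity(f_s, f_t, dim=-1).mean()

def soft_classifier_loss(z_s, z_t, T: float = 4.0):
    # Classifier level: KL between temperature-softened distributions.
    p_t = F.softmax(z_t / T, dim=-1)
    log_p_s = F.log_softmax(z_s / T, dim=-1)
    return F.kl_div(log_p_s, p_t, reduction="batchmean") * T * T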
Authors: Fred Mutisya (Qhala, Kenya Medical Association), Shikoh Gitau (Qhala), Christine Syovata (Kenya Medical Association), Diana Oigara (Kenya Medical Association), Ibrahim Matende (Kenya Medical Association), Muna Aden (Kenya Medical Association), Munira Ali (Kenya Medical Association), Ryan Nyotu (Kenya Medical Association), Diana Marion (Kenya Medical Association), Job Nyangena (Kenya Medical Association), Nasubo Ongoma (Qhala), Keith Mbae (Qhala), Elizabeth Wamicha (Qhala), Eric Mibuari (Qhala), Jean Philbert Nsengemana (Africa CDC), Talkmore Chidede (AfCFTA)
Abstract: Introduction: Existing medical LLM benchmarks largely reflect examination syllabi and disease profiles from high-income settings, raising questions about their validity for African deployment, where malaria, HIV, TB, sickle cell disease, and other neglected tropical diseases (NTDs) dominate the disease burden and national guidelines drive care. Methodology: We systematically reviewed 31 quantitative LLM evaluation papers (Jan 2019 to May 2025), identifying 19 English medical QA benchmarks. Alama Health QA was developed using a retrieval-augmented generation framework anchored on the Kenyan Clinical Practice Guidelines. Six widely used sets (AfriMedQA, MMLU-Medical, PubMedQA, MedMCQA, MedQA-USMLE, and the guideline-grounded Alama Health QA) underwent harmonized semantic profiling (NTD proportion, recency, readability, lexical diversity metrics) and blinded expert rating across five dimensions: clinical relevance, guideline alignment, clarity, distractor plausibility, and language/cultural fit. Results: Alama Health QA captured >40% of all NTD mentions across corpora and the highest within-set frequencies for malaria (7.7%), HIV (4.1%), and TB (5.2%); AfriMedQA ranked second but lacked formal guideline linkage. Global benchmarks showed minimal representation (e.g., sickle cell disease absent in three sets) despite their large scale. Qualitatively, Alama scored highest for relevance and guideline alignment; PubMedQA lowest for clinical utility. Discussion: Quantitative medical LLM benchmarks widely used in the literature underrepresent African disease burdens and regulatory contexts, risking misleading performance claims. Guideline-anchored, regionally curated resources such as Alama Health QA and expanded disease-specific derivatives are essential for safe, equitable model evaluation and deployment across African health systems.
Authors: Alexander Strunk, Roland Assam
Abstract: This paper introduces Higher Gauge Flow Models, a novel class of Generative Flow Models. Building upon ordinary Gauge Flow Models (arXiv:2507.13414), these Higher Gauge Flow Models leverage an L$_{\infty}$-algebra, effectively extending the Lie Algebra. This expansion allows for the integration of the higher geometry and higher symmetries associated with higher groups into the framework of Generative Flow Models. Experimental evaluation on a Gaussian Mixture Model dataset revealed substantial performance improvements compared to traditional Flow Models.
Authors: Arpan Dasgupta, Mizhaan Maniyar, Awadhesh Srivastava, Sanat Kumar, Amrita Mahale, Aparna Hedge, Arun Suggala, Karthikeyan Shanmugam, Aparna Taneja, Milind Tambe
Abstract: Mobile health (mHealth) programs use automated voice messages to deliver health information, particularly to underserved communities, and have proven effective at improving health outcomes through increased awareness and behavioral change. India's Kilkari program delivers vital maternal health information via weekly voice calls to millions of mothers. However, the current random call scheduling often results in missed calls and reduced message delivery. This study presents a field trial of a collaborative bandit algorithm designed to optimize call timing by learning individual mothers' preferred call times. We deployed the algorithm with around 6,500 Kilkari participants as a pilot study, comparing its performance to the baseline random calling approach. Our results demonstrate a statistically significant improvement in call pick-up rates with the bandit algorithm, indicating its potential to enhance message delivery and impact millions of mothers across India. This research highlights the efficacy of personalized scheduling in mobile health interventions and underscores the potential of machine learning to improve maternal health outreach at scale.
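An independent-arm Thompson sampling sketch of the core call-scheduling loop; the deployed algorithm is collaborative, sharing statistics across participants, which this simplified per-mother version deliberately omits.

import numpy as np

class CallTimeBandit:
    def __init__(self, n_slots: int):
        self.successes = np.ones(n_slots)   # Beta(1, 1) priors per time slot
        self.failures = np.ones(n_slots)

    def pick_slot(self) -> int:
        samples = np.random.beta(self.successes, self.failures)
        return int(np.argmax(samples))      # sample-then-greedy: Thompson sampling

    def update(self, slot: int, picked_up: bool) -> None:
        if picked_up:
            self.successes[slot] += 1
        else:
            self.failures[slot] += 1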
Authors: Lucas de Lara (IECL)
Abstract: Counterfactual reasoning aims at answering contrary-to-fact questions like "Would Alice have recovered had she taken aspirin?" and corresponds to the most fine-grained layer of causation. Critically, while many counterfactual statements cannot be falsified -- even by randomized experiments -- they underpin fundamental concepts like individual-wise fairness. Therefore, providing models to formalize and implement counterfactual beliefs remains a fundamental scientific problem. In the Markovian setting of Pearl's causal framework, we propose an alternative approach to structural causal models to represent counterfactuals compatible with a given causal graphical model. More precisely, we introduce counterfactual models, also called canonical representations of structural causal models. They enable analysts to choose a counterfactual conception via random-process probability distributions with preassigned marginals and characterize the counterfactual equivalence class of structural causal models. Then, we present a normalization procedure to describe and implement various counterfactual conceptions. Compared to structural causal models, it allows analysts to specify many counterfactual conceptions without altering the observational and interventional constraints. Moreover, the part of the model corresponding to the counterfactual layer does not need to be estimated; it only needs to be chosen. Finally, we illustrate the specific role of counterfactuals in causality and the benefits of our approach on theoretical and numerical examples.
Authors: Bo Hou, Xin Tan, Kai Zheng, Fang Liu, Yinghao Zhu, Li Zhang
Abstract: Atomic commits, each of which addresses a single development concern, are a best practice in software development. However, developers frequently produce tangled commits that mix unrelated changes due to practical constraints or unclear boundaries, negatively impacting code review and maintenance. Although prior commit untangling approaches (rule-based, feature-based, or graph-based) have made progress, they often rely on shallow signals and fail to distinguish between explicit dependencies (e.g., control/data flow) and implicit ones (e.g., semantic or conceptual relationships). In this paper, we propose ColaUntangle, a new collaborative consultation framework for commit untangling that models both explicit and implicit dependencies among code changes. ColaUntangle integrates Large Language Model (LLM)-driven agents in a multi-agent architecture: one agent specializes in explicit dependencies, another in implicit ones, and a reviewer agent synthesizes their perspectives through iterative consultation. To capture explicit and implicit contextual information, we construct multi-version Program Dependency Graphs (delta-PDG), enabling agents to reason over code relationships with both symbolic and semantic depth. We evaluate ColaUntangle on two widely-used datasets (1,612 C# and 14k Java tangled commits). Experimental results show that ColaUntangle outperforms the best-performing baseline, achieving an improvement of 44% on the C# dataset and 100% on the Java dataset. These findings highlight the potential of LLM-based collaborative frameworks for advancing automated commit untangling tasks.
Authors: Stassa Patsantzis
Abstract: Inductive Logic Programming (ILP) approaches like Meta-Interpretive Learning (MIL) can learn, from few examples, recursive logic programs with invented predicates that generalise well to unseen instances. This ability relies on a background theory and negative examples, both carefully selected with expert knowledge of a learning problem and its solutions. But what if such a problem-specific background theory or negative examples are not available? We formalise this question as a new setting for Self-Supervised ILP and present a new MIL algorithm that learns in the new setting from some labelled positive examples and zero or more unlabelled examples, and automatically generates, and labels, new positive and negative examples during learning. We implement this algorithm in Prolog in a new MIL system, called Poker. We compare Poker to the state-of-the-art MIL system Louise on experiments learning grammars for Context-Free and L-System languages from labelled, positive example strings, no negative examples, and just the terminal vocabulary of a language, seen in examples, as a first-order background theory. We introduce a new approach for the principled selection of a second-order background theory as a Second Order Definite Normal Form (SONF), sufficiently general to learn all programs in a class, thus removing the need for a background theory tailored to a learning task. We find that Poker's performance improves with increasing numbers of automatically generated examples, while Louise, bereft of negative examples, over-generalises.
Authors: Hongyi Tang, Zhihao Zhu, Yi Yang
Abstract: The performance of large language models (LLMs) is closely tied to their training data, which can include copyrighted material or private information, raising legal and ethical concerns. Additionally, LLMs face criticism for dataset contamination and internalizing biases. To address these issues, the Pre-Training Data Detection (PDD) task was proposed to identify if specific data was included in an LLM's pre-training corpus. However, existing PDD methods often rely on superficial features like prediction confidence and loss, resulting in mediocre performance. To improve this, we introduce NA-PDD, a novel algorithm analyzing differential neuron activation patterns between training and non-training data in LLMs. This is based on the observation that these data types activate different neurons during LLM inference. We also introduce CCNewsPDD, a temporally unbiased benchmark employing rigorous data transformations to ensure consistent time distributions between training and non-training data. Our experiments demonstrate that NA-PDD significantly outperforms existing methods across three benchmarks and multiple LLMs.
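A hedged sketch of the scoring idea behind NA-PDD: a candidate text is judged by whether its neuron activation profile sits closer to the profile of known training data than to that of non-training data. get_activations is a hypothetical hook over the LLM's hidden units; the paper's actual statistic may differ.

import numpy as np

def na_pdd_score(get_activations, text: str,
                 train_profile: np.ndarray, nontrain_profile: np.ndarray) -> float:
    acts = get_activations(text)                  # per-neuron activation rates, (n,)
    d_train = np.linalg.norm(acts - train_profile)
    d_non = np.linalg.norm(acts - nontrain_profile)
    return d_non - d_train   # higher => more likely seen in pre-training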
Authors: Stassa Patsantzis
Abstract: A "model" is a theory that describes the state of an environment and the effects of an agent's decisions on the environment. A model-based agent can use its model to predict the effects of its future actions and so plan ahead, but must know the state of the environment. A model-free agent cannot plan, but can act without a model and without completely observing the environment. An autonomous agent capable of acting independently in novel environments must combine both sets of capabilities. We show how to create such an agent with Meta-Interpretive Learning used to learn a model-based Solver used to train a model-free Controller that can solve the same planning problems as the Solver. We demonstrate the equivalence in problem-solving ability of the two agents on grid navigation problems in two kinds of environment: randomly generated mazes, and lake maps with wide open areas. We find that all navigation problems solved by the Solver are also solved by the Controller, indicating the two are equivalent.
Authors: Pierangela Bruno, Carmine Dodaro, Giuseppe Galatà, Marco Maratea, Marco Mochi
Abstract: The Operating Room Scheduling (ORS) problem deals with the optimization of daily operating room surgery schedules. It is a challenging problem subject to many constraints, like determining the starting times of different surgeries and allocating the required resources, including the availability of beds in different department units. Recently, solutions to this problem based on Answer Set Programming (ASP) have been delivered. Such solutions are overall satisfying but, when applied to real data, they can currently only verify whether the encoding aligns with the actual data and, at most, suggest alternative schedules that could have been computed. As a consequence, it is not currently possible to generate provisional schedules. Furthermore, the resulting schedules are not always robust. In this paper, we integrate inductive and deductive techniques to solve these issues. We first employ machine learning algorithms to predict surgery durations from historical data in order to compute provisional schedules. Then, we consider the confidence of such predictions as an additional input to our problem and update the encoding correspondingly in order to compute more robust schedules. Results on historical data from the ASL1 Liguria in Italy confirm the viability of our integration. Under consideration in Theory and Practice of Logic Programming (TPLP).
Authors: Chang Li, Yaren Zhang, Haoran Lv, Qiong Cao, Chao Xue, Xiaodong He
Abstract: Large Language Models (LLMs) have shown remarkable reasoning ability through explicit Chain-of-Thought (CoT) prompting, but generating these step-by-step textual explanations is computationally expensive and slow. To overcome this, we aim to develop a framework for efficient, implicit reasoning, where the model "thinks" in a latent space without generating explicit text for every step. We propose that these latent thoughts can be modeled as temporally-extended abstract actions, or options, within a hierarchical reinforcement learning framework. To effectively learn a diverse library of options as latent embeddings, we first introduce the Variational Markovian Option Critic (VMOC), an off-policy algorithm that uses variational inference within the HiT-MDP framework. To provide a rigorous foundation for using these options as an abstract reasoning space, we extend the theory of continuous MDP homomorphisms. This proves that learning a policy in the simplified, abstract latent space, for which VMOC is suited, preserves the optimality of the solution to the original, complex problem. Finally, we propose a cold-start procedure that leverages supervised fine-tuning (SFT) data to distill human reasoning demonstrations into this latent option space, providing a rich initialization for the model's reasoning capabilities. Extensive experiments demonstrate that our approach achieves strong performance on complex logical reasoning benchmarks and challenging locomotion tasks, validating our framework as a principled method for learning abstract skills for both language and control.
Authors: Shreya Saxena, Siva Prasad, Zishan Ahmad, Vishal Vaddina
Abstract: Code translation is a crucial process in software development and migration projects, enabling interoperability between different programming languages and enhancing software adaptability and thus longevity. Traditional automated translation methods rely heavily on handcrafted transformation rules, which often lack flexibility and scalability. Meanwhile, advanced language models present promising alternatives but are often limited by proprietary, API-based implementations that raise concerns over data security and reliance. In this paper, we present Auto-Train for Code Translation (ACT), an innovative framework that aims to improve code translation capabilities by enabling in-house finetuning of open-source Large Language Models (LLMs). ACT's automated pipeline significantly boosts the performance of these models, narrowing the gap between open-source accessibility and the high performance of closed-source solutions. Central to ACT is its synthetic data generation module, which builds extensive, high-quality datasets from initial code samples, incorporating unit tests to ensure functional accuracy and diversity. ACT's evaluation framework incorporates execution-level checks, offering a comprehensive assessment of translation quality. A key feature in ACT is its controller module, which manages the entire pipeline by dynamically adjusting hyperparameters, orchestrating iterative data generation, and finetuning based on real-time evaluations. This enables ACT to intelligently optimize when to continue training, generate additional targeted training data, or stop the process. Our results demonstrate that ACT consistently enhances the effectiveness of open-source models, offering businesses and developers a secure and reliable alternative. Additionally, applying our data generation pipeline to industry-scale migration projects has led to a notable increase in developer acceleration.
Authors: Jean Lelong, Adnane Errazine, Annabelle Blangero
Abstract: Conventional Retrieval-Augmented Generation (RAG) systems enhance Large Language Models (LLMs) but often fall short on complex queries, delivering limited, extractive answers and struggling with multiple targeted retrievals or navigating intricate entity relationships. This is a critical gap in knowledge-intensive domains. We introduce INRAExplorer, an agentic RAG system for exploring the scientific data of INRAE (France's National Research Institute for Agriculture, Food and Environment). INRAExplorer employs an LLM-based agent with a multi-tool architecture to dynamically engage a rich knowledge base, through a comprehensive knowledge graph derived from open access INRAE publications. This design empowers INRAExplorer to conduct iterative, targeted queries, retrieve exhaustive datasets (e.g., all publications by an author), perform multi-hop reasoning, and deliver structured, comprehensive answers. INRAExplorer serves as a concrete illustration of enhancing knowledge interaction in specialized fields.
Authors: Shanghai AI Lab: Xiaoyang Chen, Yunhao Chen, Zeren Chen, Zhiyun Chen, Hanyun Cui, Yawen Duan, Jiaxuan Guo, Qi Guo, Xuhao Hu, Hong Huang, Lige Huang, Chunxiao Li, Juncheng Li, Qihao Lin, Dongrui Liu, Xinmin Liu, Zicheng Liu, Chaochao Lu, Xiaoya Lu, Jingjing Qu, Qibing Ren, Jing Shao, Jingwei Shi, Jingwei Sun, Peng Wang, Weibing Wang, Jia Xu, Lewen Yan, Xiao Yu, Yi Yu, Boxuan Zhang, Jie Zhang, Weichen Zhang, Zhijie Zheng, Tianyi Zhou, Bowen Zhou
Abstract: To understand and identify the unprecedented risks posed by rapidly advancing artificial intelligence (AI) models, this report presents a comprehensive assessment of their frontier risks. Drawing on the E-T-C analysis (deployment environment, threat source, enabling capability) from the Frontier AI Risk Management Framework (v1.0) (SafeWork-F1-Framework), we identify critical risks in seven areas: cyber offense, biological and chemical risks, persuasion and manipulation, uncontrolled autonomous AI R&D, strategic deception and scheming, self-replication, and collusion. Guided by the "AI-$45^\circ$ Law," we evaluate these risks using "red lines" (intolerable thresholds) and "yellow lines" (early warning indicators) to define risk zones: green (manageable risk for routine deployment and continuous monitoring), yellow (requiring strengthened mitigations and controlled deployment), and red (necessitating suspension of development and/or deployment). Experimental results show that all recent frontier AI models reside in green and yellow zones, without crossing red lines. Specifically, no evaluated models cross the yellow line for cyber offense or uncontrolled AI R&D risks. For self-replication, and strategic deception and scheming, most models remain in the green zone, except for certain reasoning models in the yellow zone. In persuasion and manipulation, most models are in the yellow zone due to their effective influence on humans. For biological and chemical risks, we are unable to rule out the possibility of most models residing in the yellow zone, although detailed threat modeling and in-depth assessment are required to make further claims. This work reflects our current understanding of AI frontier risks and urges collective action to mitigate these challenges.
Authors: Ali Mohamed Ali, Luca Tirel, Hashim A. Hashim
Abstract: Efficient planning of activities is essential for modern industrial assembly lines to uphold manufacturing standards, prevent project constraint violations, and achieve cost-effective operations. While exact solutions to such challenges can be obtained through Integer Programming (IP), the dependence of the search space on input parameters often makes IP computationally infeasible for large-scale scenarios. Heuristic methods, such as Genetic Algorithms, can also be applied, but they frequently produce suboptimal solutions in extensive cases. This paper introduces a novel mathematical model of a generic industrial assembly line formulated as a Markov Decision Process (MDP), without imposing assumptions on the type of assembly line, a notable distinction from most existing models. The proposed model is employed to create a virtual environment for training Deep Reinforcement Learning (DRL) agents to optimize task and resource scheduling. To enhance the efficiency of agent training, the paper proposes two innovative tools. The first is an action-masking technique, which ensures the agent selects only feasible actions, thereby reducing training time. The second is a multi-agent approach, where each workstation is managed by an individual agent; as a result, the state and action spaces are reduced. A centralized training framework with decentralized execution is adopted, offering a scalable learning architecture for optimizing industrial assembly lines. This framework allows the agents to learn offline and subsequently provide real-time solutions during operations by leveraging a neural network that maps the current factory state to the optimal action. The effectiveness of the proposed scheme is validated through numerical simulations, demonstrating significantly faster convergence to the optimal solution compared to a comparable model-based approach.
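A minimal sketch of the action-masking technique described above: logits of infeasible actions are set to negative infinity before sampling, so the agent never wastes exploration on them. The feasibility oracle is problem-specific and assumed given.

import torch

def masked_policy(logits: torch.Tensor, feasible: torch.Tensor):
    # logits: (n_actions,); feasible: boolean mask of legal actions.
    masked = logits.masked_fill(~feasible, float("-inf"))
    return torch.distributions.Categorical(logits=masked)

# Usage: dist = masked_policy(policy_net(state), env.feasible_actions(state))
#        action = dist.sample()   # infeasible actions have probability zero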
Authors: Amandeep Kaur, Gyan Prakash
Abstract: Agricultural products are often subject to seasonal fluctuations in production and demand. Predicting and managing inventory levels in response to these variations can be challenging, leading to either excess inventory or stockouts. Additionally, coordination among stakeholders at various levels of the food supply chain is not considered in the existing body of literature. To bridge these research gaps, this study focuses on inventory management of agri-food products under demand and lead time uncertainties. Implementing an effective inventory replenishment policy maximizes the overall profit throughout the supply chain. However, these uncertainties and the shelf-life of the product increase the complexity of the problem, making it challenging to apply traditional approaches to generate an optimal set of solutions. Thus, the current study proposes a novel Deep Reinforcement Learning (DRL) algorithm that combines the benefits of both value- and policy-based DRL approaches for inventory optimization under uncertainties. The proposed algorithm can incentivize collaboration among stakeholders by aligning their interests and objectives through the shared optimization goal of maximizing profitability along the agri-food supply chain while considering perishability and uncertainty simultaneously. By selecting optimal order quantities within a continuous action space, the proposed algorithm effectively addresses the inventory optimization challenges. To rigorously evaluate this algorithm, empirical data from a fresh agricultural products supply chain inventory is considered. Experimental results corroborate the improved performance of the proposed inventory replenishment policy under stochastic demand patterns and lead time scenarios. The research findings hold managerial implications for policymakers to manage the inventory of agricultural products more effectively under uncertainty.
Authors: Zhenyun Yin, Shujie Wang, Xuhong Wang, Xingjun Ma, Yinchun Wang
Abstract: Improving the reliability of large language models (LLMs) is critical for deploying them in real-world scenarios. In this paper, we propose \textbf{Deliberative Searcher}, the first framework to integrate certainty calibration with retrieval-based search for open-domain question answering. The agent performs multi-step reflection and verification over Wikipedia data and is trained with a reinforcement learning algorithm that optimizes for accuracy under a soft reliability constraint. Empirical results show that the proposed method improves alignment between model confidence and correctness, leading to more trustworthy outputs. This paper will be continuously updated.
Authors: Ran Wang, Xiaoxuan Liu, Hao Ren, Gang Chen, Fanchao Qi, Maosong Sun
Abstract: Structured decoding enables large language models (LLMs) to generate outputs in formats required by downstream systems, such as HTML or JSON. However, existing methods suffer from efficiency bottlenecks due to grammar compilation, state tracking, and mask creation. We observe that many real-world tasks embed strong prior knowledge about output structure. Leveraging this, we propose a decomposition of constraints into static and dynamic components -- precompiling static structures offline and instantiating dynamic arguments at runtime using grammar snippets. Instead of relying on pushdown automata, we employ a compositional set of operators to model regular formats, achieving lower transition latency. We introduce wgrammar, a lightweight decoding engine that integrates domain-aware simplification, constraint decomposition, and mask caching, achieving up to 250x speedup over existing systems. wgrammar's source code is publicly available at https://github.com/wrran/wgrammar.
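The static/dynamic decomposition can be shown in miniature: the output skeleton is fixed offline, while small grammar snippets validate runtime-supplied arguments. A toy sketch (the skeleton, field names, and regexes are assumptions; wgrammar itself enforces constraints per token over a vocabulary):

```python
import re

# Static structure compiled once, offline: a JSON-like skeleton with typed holes.
SKELETON = '{{"name": {name}, "age": {age}}}'
# Dynamic constraints instantiated at runtime as small regex "grammar snippets".
SNIPPETS = {"name": re.compile(r'"[A-Za-z ]+"$'), "age": re.compile(r"\d{1,3}$")}

def admit(field: str, draft: str) -> bool:
    """A constrained decoder would apply such checks per token; here we
    validate whole field values for brevity."""
    return bool(SNIPPETS[field].match(draft))

assert admit("name", '"Ada Lovelace"') and admit("age", "36")
print(SKELETON.format(name='"Ada Lovelace"', age=36))
```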
Authors: Roman Mayr, Michel Schimpf, Thomas Bohn\'e
Abstract: While modern dialogue systems heavily rely on large language models (LLMs), their implementation often goes beyond pure LLM interaction. Developers integrate multiple LLMs, external tools, and databases. Therefore, assessment of the underlying LLM alone does not suffice, and the dialogue systems must be tested and evaluated as a whole. However, this remains a major challenge. With most previous work focusing on turn-level analysis, less attention has been paid to integrated dialogue-level quality assurance. To address this, we present ChatChecker, a framework for automated evaluation and testing of complex dialogue systems. ChatChecker uses LLMs to simulate diverse user interactions, identify dialogue breakdowns, and evaluate quality. Compared to previous approaches, our design reduces setup effort and is generalizable, as it does not require reference dialogues and is decoupled from the implementation of the target dialogue system. We improve breakdown detection performance over a prior LLM-based approach by including an error taxonomy in the prompt. Additionally, we propose a novel non-cooperative user simulator based on challenging personas that uncovers weaknesses in target dialogue systems more effectively. Through this, ChatChecker contributes to thorough and scalable testing. This enables both researchers and practitioners to accelerate the development of robust dialogue systems.
Authors: Mian Ibad Ali Shah, Enda Barrett, Karl Mason
Abstract: This paper presents a novel framework for Peer-to-Peer (P2P) energy trading that integrates uncertainty-aware prediction with multi-agent reinforcement learning (MARL), addressing a critical gap in current literature. In contrast to previous works relying on deterministic forecasts, the proposed approach employs a heteroscedastic probabilistic transformer-based prediction model called Knowledge Transformer with Uncertainty (KTU) to explicitly quantify prediction uncertainty, which is essential for robust decision-making in the stochastic environment of P2P energy trading. The KTU model leverages domain-specific features and is trained with a custom loss function that ensures reliable probabilistic forecasts and confidence intervals for each prediction. Integrating these uncertainty-aware forecasts into the MARL framework enables agents to optimize trading strategies with a clear understanding of risk and variability. Experimental results show that the uncertainty-aware Deep Q-Network (DQN) reduces energy purchase costs by up to 5.7% without P2P trading and 3.2% with P2P trading, while increasing electricity sales revenue by 6.4% and 44.7%, respectively. Additionally, peak hour grid demand is reduced by 38.8% without P2P and 45.6% with P2P. These improvements are even more pronounced when P2P trading is enabled, highlighting the synergy between advanced forecasting and market mechanisms for resilient, economically efficient energy communities.
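Heteroscedastic models of this kind typically predict a per-point variance alongside the mean and train with a Gaussian negative log-likelihood; the abstract does not spell out the custom loss, so the following is a plausible stand-in:

```python
import numpy as np

def gaussian_nll(y, mu, log_var):
    """Heteroscedastic Gaussian negative log-likelihood: large predicted variance
    discounts squared error but pays a log-variance price, so the model learns
    calibrated per-prediction uncertainty (one common choice, not the paper's)."""
    return 0.5 * (log_var + (y - mu) ** 2 / np.exp(log_var)).mean()

y = np.array([1.0, 2.0, 3.0])          # observed demand/generation values
mu = np.array([1.1, 1.8, 3.5])         # predicted means
log_var = np.array([-1.0, -0.5, 0.3])  # predicted log-variances
print(gaussian_nll(y, mu, log_var))
```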
Authors: Weikang Gu, Mingyue Han, Li Xue, Heng Dong, Changcai Yang, Riqing Chen, Lifang Wei
Abstract: The accurate identification of high-quality correspondences is a prerequisite task in feature-based point cloud registration. However, it is extremely challenging to handle the fusion of local and global features due to feature redundancy and complex spatial relationships. Given that Gestalt principles provide key advantages in analyzing local and global relationships, we propose a novel Gestalt-guided Parallel Interaction Network via orthogonal geometric consistency (GPI-Net) in this paper. It utilizes Gestalt principles to facilitate complementary communication between local and global information. Specifically, we introduce an orthogonal integration strategy to optimally reduce redundant information and generate a more compact global structure for high-quality correspondences. To capture geometric features in correspondences, we leverage a Gestalt Feature Attention (GFA) block through a hybrid utilization of self-attention and cross-attention mechanisms. Furthermore, to facilitate the integration of local detail information into the global structure, we design an innovative Dual-path Multi-Granularity parallel interaction aggregation (DMG) block to promote information exchange across different granularities. Extensive experiments on various challenging tasks demonstrate the superior performance of our proposed GPI-Net in comparison to existing methods. The code will be released at https://github.com/gwk/GPI-Net.
Authors: Harsha Sammangi (Dakota State University), Aditya Jagatha (College of Business,Information Systems, Dakota State University), Giridhar Reddy Bojja (College of Business, Michigan Technological University), Jun Liu (College of Business,I.S, Dakota State University)
Abstract: The traditional centralized healthcare architecture poses numerous issues for real-time patient monitoring, including data privacy, delay, and security. Here, we present an AI-enabled decentralized IoT architecture that can address such challenges during a pandemic and in critical care settings. This work presents our architecture for enhancing the effectiveness of currently available federated learning, blockchain, and edge computing approaches, maximizing data privacy, minimizing latency, and improving other general system metrics. Experimental results demonstrate transaction latency, energy consumption, and data throughput orders of magnitude lower than competitive cloud solutions.
Authors: Isaac Shi, Zeyuan Li, Fan Liu, Wenli Wang, Lewei He, Yang Yang, Tianyu Shi
Abstract: We present the DEREK (Deep Extraction & Reasoning Engine for Knowledge) Module, a secure and scalable Retrieval-Augmented Generation pipeline designed specifically for enterprise document question answering. Designed and implemented by eSapiens, the system ingests heterogeneous content (PDF, Office, web), splits it into 1,000-token overlapping chunks, and indexes them in a hybrid HNSW+BM25 store. User queries are refined by GPT-4o, retrieved via combined vector+BM25 search, reranked with Cohere, and answered by an LLM using CO-STAR prompt engineering. A LangGraph verifier enforces citation overlap, regenerating answers until every claim is grounded. On four LegalBench subsets, 1,000-token chunks improve Recall@50 by approximately 1 pp and hybrid+rerank boosts Precision@10 by approximately 7 pp; the verifier raises TRACe Utilization above 0.50 and limits unsupported statements to less than 3%. All components run in containers and enforce end-to-end TLS 1.3 and AES-256 encryption. These results demonstrate that the DEREK module delivers accurate, traceable, and production-ready document QA with minimal operational overhead. The module is designed to meet enterprise demands for secure, auditable, and context-faithful retrieval, providing a reliable baseline for high-stakes domains such as legal and finance.
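The fixed-size overlapping chunking step is simple to picture; a minimal sketch (the 1,000-token size comes from the abstract, while the 200-token overlap is an assumed stride):

```python
def chunk_tokens(tokens, size=1000, overlap=200):
    """Split a token sequence into fixed-size chunks that share `overlap` tokens
    with their neighbor, so no passage is cut off at a hard boundary."""
    step = size - overlap
    return [tokens[i:i + size] for i in range(0, max(len(tokens) - overlap, 1), step)]

doc = list(range(2500))                  # stand-in for a tokenized document
chunks = chunk_tokens(doc)
print([(c[0], c[-1]) for c in chunks])   # spans [0,999], [800,1799], [1600,2499]
```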
Authors: John Wu, Adam Cross, Jimeng Sun
Abstract: Rare diseases affect 1 in 10 Americans, yet standard ICD coding systems fail to capture these conditions in electronic health records (EHR), leaving crucial information buried in clinical notes. Current approaches struggle with medical abbreviations, miss implicit disease mentions, raise privacy concerns with cloud processing, and lack clinical reasoning abilities. We present Rare Disease Mining Agents (RDMA), a framework that mirrors how medical experts identify rare disease patterns in EHR. RDMA connects scattered clinical observations that together suggest specific rare conditions. By handling clinical abbreviations, recognizing implicit disease patterns, and applying contextual reasoning locally on standard hardware, RDMA reduces privacy risks while improving F1 performance by upwards of 30\% and decreasing inference costs 10-fold. This approach helps clinicians avoid the privacy risk of using cloud services while accessing key rare disease information from EHR systems, supporting earlier diagnosis for rare disease patients. Available at https://github.com/jhnwu3/RDMA.
Authors: Altynbek Ismailov, Salia Asanova
Abstract: Large language models (LLMs) now write code in settings where misreading a single word can break safety or cost money, yet we still expect them to overlook stray typos. To probe where useful robustness ends and harmful insensitivity begins, we compile 50 LeetCode problems and craft three minimal prompt perturbations that should vary in importance: (i) progressive underspecification, deleting 10% of words per step; (ii) lexical flip, swapping a pivotal quantifier ("max" to "min"); and (iii) jargon inflation, replacing a common noun with an obscure technical synonym. Six frontier models, including three "reasoning-tuned" versions, solve each mutated prompt, and their Python outputs are checked against the original test suites to reveal whether they reused the baseline solution or adapted. Among 11,853 generations we observe a sharp double asymmetry. Models remain correct in 85% of cases even after 90% of the prompt is missing, showing over-robustness to underspecification, yet only 54% react to a single quantifier flip that reverses the task, with reasoning-tuned variants even less sensitive than their bases. Jargon edits lie in between, passing through 56%. Current LLMs thus blur the line between harmless noise and meaning-changing edits, often treating both as ignorable. Masking salient anchors such as function names can force re-evaluation. We advocate evaluation and training protocols that reward differential sensitivity: stay steady under benign noise but adapt, or refuse, when semantics truly change.
Authors: Bin Han, Jonathan Gratch
Abstract: Emotion recognition in dynamic social contexts requires an understanding of the complex interaction between facial expressions and situational cues. This paper presents a salience-adjusted framework for context-aware emotion recognition with Bayesian Cue Integration (BCI) and Visual-Language Models (VLMs) to dynamically weight facial and contextual information based on the expressivity of facial cues. We evaluate this approach using human annotations and automatic emotion recognition systems in prisoner's dilemma scenarios, which are designed to evoke emotional reactions. Our findings demonstrate that incorporating salience adjustment enhances emotion recognition performance, offering promising directions for future research to extend this framework to broader social contexts and multimodal applications.
Authors: Goeric Huybrechts, Srikanth Ronanki, Sai Muralidhar Jayanthi, Jack Fitzgerald, Srinivasan Veeravanallur
Abstract: The proliferation of multimodal Large Language Models has significantly advanced the ability to analyze and understand complex data inputs from different modalities. However, the processing of long documents remains under-explored, largely due to a lack of suitable benchmarks. To address this, we introduce Document Haystack, a comprehensive benchmark designed to evaluate the performance of Vision Language Models (VLMs) on long, visually complex documents. Document Haystack features documents ranging from 5 to 200 pages and strategically inserts pure text or multimodal text+image "needles" at various depths within the documents to challenge VLMs' retrieval capabilities. Comprising 400 document variants and a total of 8,250 questions, it is supported by an objective, automated evaluation framework. We detail the construction and characteristics of the Document Haystack dataset, present results from prominent VLMs and discuss potential research avenues in this area.
Authors: Tim Tian Hua, James Baskerville, Henri Lemoine, Mia Hopman, Aryan Bhatt, Tyler Tracy
Abstract: Monitoring AIs at runtime can help us detect and stop harmful actions. In this paper, we study how to combine multiple runtime monitors into a single monitoring protocol. The protocol's objective is to maximize the probability of applying a safety intervention on misaligned outputs (i.e., maximize recall). Since running monitors and applying safety interventions are costly, the protocol also needs to adhere to an average-case budget constraint. Taking the monitors' performance and cost as given, we develop an algorithm to find the most efficient protocol. The algorithm exhaustively searches over when and which monitors to call, and allocates safety interventions based on the Neyman-Pearson lemma. By focusing on likelihood ratios and strategically trading off spending on monitors against spending on interventions, we more than double our recall rate compared to a naive baseline in a code review setting. We also show that combining two monitors can Pareto dominate using either monitor alone. Our framework provides a principled methodology for combining existing monitors to detect undesirable behavior in cost-sensitive settings.
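The Neyman-Pearson allocation can be demonstrated end to end on synthetic monitor scores: threshold the likelihood ratio at the value that exactly spends the intervention budget, which maximizes recall at that budget. A self-contained sketch (the Gaussian score models and 1% budget are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
# Monitor scores on benign vs. misaligned outputs (toy stand-ins).
benign = rng.normal(0.0, 1.0, 100_000)
bad = rng.normal(1.5, 1.0, 100_000)

def llr(x, mu0=0.0, mu1=1.5):
    """Log-likelihood ratio for unit-variance Gaussian score models (assumed)."""
    return (mu1 - mu0) * x - 0.5 * (mu1**2 - mu0**2)

budget = 0.01  # may intervene on at most 1% of (mostly benign) traffic
threshold = np.quantile(llr(benign), 1 - budget)  # Neyman-Pearson: threshold the LR
recall = (llr(bad) > threshold).mean()
print(f"intervene above LLR {threshold:.2f}; recall on misaligned: {recall:.1%}")
```

By the Neyman-Pearson lemma, no other decision rule with the same intervention rate on benign traffic can catch a larger fraction of the misaligned outputs under these score models.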
Authors: Ori Press, Brandon Amos, Haoyu Zhao, Yikai Wu, Samuel K. Ainsworth, Dominik Krupke, Patrick Kidger, Touqir Sajed, Bartolomeo Stellato, Jisun Park, Nathanael Bosch, Eli Meril, Albert Steppi, Arman Zharmagambetov, Fangzhao Zhang, David Perez-Pineiro, Alberto Mercurio, Ni Zhan, Talor Abramovich, Kilian Lieret, Hanlin Zhang, Shirley Huang, Matthias Bethge, Ofir Press
Abstract: Despite progress in language model (LM) capabilities, evaluations have thus far focused on models' performance on tasks that humans have previously solved, including in programming (Jimenez et al., 2024) and mathematics (Glazer et al., 2024). We therefore propose testing models' ability to design and implement algorithms in an open-ended benchmark: We task LMs with writing code that efficiently solves computationally challenging problems in computer science, physics, and mathematics. Our AlgoTune benchmark consists of 155 coding tasks collected from domain experts and a framework for validating and timing LM-synthesized solution code, which is compared to reference implementations from popular open-source packages. In addition, we develop a baseline LM agent, AlgoTuner, and evaluate its performance across a suite of frontier models. AlgoTuner achieves an average 1.72x speedup against our reference solvers, which use libraries such as SciPy, sk-learn and CVXPY. However, we find that current models fail to discover algorithmic innovations, instead preferring surface-level optimizations. We hope that AlgoTune catalyzes the development of LM agents exhibiting creative problem solving beyond state-of-the-art human performance.
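A validate-then-time harness of the kind such a benchmark needs can be sketched in a few lines (the repeat count, median aggregation, and toy task are assumptions; AlgoTune's actual framework is more involved):

```python
import statistics, time

def measure(fn, args, repeats=5):
    """Median wall-clock time of fn(*args) over several repeats."""
    times = []
    for _ in range(repeats):
        t0 = time.perf_counter()
        out = fn(*args)
        times.append(time.perf_counter() - t0)
    return out, statistics.median(times)

def speedup(candidate, reference, args):
    """The candidate must reproduce the reference output before its speedup counts."""
    ref_out, t_ref = measure(reference, args)
    cand_out, t_cand = measure(candidate, args)
    assert cand_out == ref_out, "candidate output does not match reference"
    return t_ref / t_cand

# Closed-form sum vs. the naive loop it replaces: an "algorithmic innovation" in miniature.
print(f"{speedup(lambda n: n * (n - 1) // 2, lambda n: sum(range(n)), (10**6,)):.0f}x")
```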
Authors: Noah van der Vleuten
Abstract: Language models for program synthesis are usually trained and evaluated on programming competition datasets (MBPP, APPS). However, these datasets are limited in size and quality, while these language models are extremely data hungry. Additionally, the language models have a program synthesis process misaligned with that of humans. While humans iteratively develop code with the help of a compiler, most program synthesis models currently produce code in one go. To solve these issues, we introduce a bootstrapping algorithm for program synthesis that supports teaching models how to repair. We show that bootstrapping consistently outperforms regular fine-tuning. Compared to other work, our bootstrapped model performs on par with fine-tuned models that are 68\% larger. Notably, bootstrapping with repairing also improves non-repairing performance compared to regular bootstrapping during inference. However, on our models, repairing during inference is likely inferior to simply sampling the same number of solutions. Furthermore, we identify issues with the example test cases in the training portion of the APPS dataset; these findings are valuable to the community, as many repairing and reinforcement learning methods rely on them.
Authors: Shahar Zuler, Gal Lifshitz, Hadar Averbuch-Elor, Dan Raviv
Abstract: Accurate motion estimation in cardiac computed tomography (CT) imaging is critical for assessing cardiac function and surgical planning. Data-driven methods have become the standard approach for dense motion estimation, but they rely on vast amounts of labeled data with dense ground-truth (GT) motion annotations, which are often unfeasible to obtain. To address this limitation, we present a novel approach that synthesizes realistic-looking pairs of cardiac CT frames enriched with dense 3D flow field annotations. Our method leverages a conditional Variational Autoencoder (CVAE), which incorporates a novel multi-scale feature conditioning mechanism and is trained to generate 3D flow fields conditioned on a single CT frame. By applying the generated flow field to warp the given frame, we create pairs of frames that simulate realistic myocardium deformations across the cardiac cycle. These pairs serve as fully annotated data samples, providing optical flow GT annotations. Our data generation pipeline could enable the training and validation of more complex and accurate myocardium motion models, allowing for a substantial reduction in reliance on manual annotations. Our code, along with animated generated samples and additional material, is available on our project page: https://shaharzuler.github.io/GenerativeCardiacMotion_Page.
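The core trick, warping a frame by a generated flow field to synthesize its training pair, looks like this in 2D (a simplified sketch; the paper operates on 3D cardiac CT volumes):

```python
import numpy as np
from scipy.ndimage import map_coordinates

def warp(frame: np.ndarray, flow: np.ndarray) -> np.ndarray:
    """Warp a 2D frame by a dense flow field of shape (H, W, 2): each output
    pixel samples the input at its displaced location (bilinear interpolation)."""
    h, w = frame.shape
    ys, xs = np.mgrid[0:h, 0:w].astype(float)
    coords = np.stack([ys + flow[..., 0], xs + flow[..., 1]])
    return map_coordinates(frame, coords, order=1, mode="nearest")

frame = np.random.rand(64, 64)
flow = np.full((64, 64, 2), 1.5)   # uniform 1.5-pixel shift as a toy deformation
print(warp(frame, flow).shape)     # the (frame, warped) pair comes with exact GT flow
```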
URLs: https://shaharzuler.github.io/GenerativeCardiacMotion_Page
Authors: Jaehoon Yoo, Wonjung Kim, Seunghoon Hong
Abstract: Discrete Flow-based Models (DFMs) are powerful generative models for high-quality discrete data but typically suffer from slow sampling speeds due to their reliance on iterative decoding processes. This reliance on a multi-step process originates from the factorization approximation of DFMs, which is necessary for handling high-dimensional data. In this paper, we rigorously characterize the approximation error from factorization using Conditional Total Correlation (TC), which depends on the coupling. To reduce the Conditional TC and enable efficient few-step generation, we propose Rectified Discrete Flow (ReDi), a novel iterative method that reduces factorization error by rectifying the coupling between source and target distributions. We theoretically prove that each ReDi step guarantees a monotonic decreasing Conditional TC, ensuring its convergence. Empirically, ReDi significantly reduces Conditional TC and enables few-step generation. Moreover, we demonstrate that the rectified couplings are well-suited for training efficient one-step models on image generation. ReDi offers a simple and theoretically grounded approach for tackling the few-step challenge, providing a new perspective on efficient discrete data synthesis. Code is available at https://github.com/Ugness/ReDi_discrete
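For reference, the conditional Total Correlation that ReDi reduces has a standard information-theoretic form (the paper's exact definition may differ in details): $$\mathrm{TC}(x^{1},\dots,x^{D}\mid z)\;=\;\mathbb{E}_{z}\!\left[D_{\mathrm{KL}}\!\left(p(x^{1},\dots,x^{D}\mid z)\,\Big\|\,\prod_{d=1}^{D}p(x^{d}\mid z)\right)\right],$$ which is zero exactly when the target dimensions are conditionally independent given the coupled source sample $z$, i.e., when the factorized single-step decoder incurs no approximation error.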
Authors: Keen Leung, Colen Yan, Jun Yin
Abstract: Ongoing and future photometric surveys will produce unprecedented volumes of galaxy images, necessitating robust, efficient methods for deriving galaxy morphological parameters at scale. Traditional approaches, such as parametric light-profile fitting, offer valuable insights but become computationally prohibitive when applied to billions of sources. In this work, we propose a Conditional AutoEncoder (CAE) framework to simultaneously model and characterize galaxy morphology. Our CAE is trained on a suite of realistic mock galaxy images generated via GalSim, encompassing a broad range of galaxy types, photometric parameters (e.g., flux, half-light radius, Sersic index, ellipticity), and observational conditions. By encoding each galaxy image into a low-dimensional latent representation conditioned on key parameters, our model effectively recovers these morphological features in a disentangled manner, while also reconstructing the original image. The results demonstrate that the CAE approach can accurately and efficiently infer complex structural properties, offering a powerful alternative to existing methods.
Authors: Siyuan Liu, Wenjing Liu, Zhiwei Xu, Xin Wang, Bo Chen, Tao Li
Abstract: Empowered by large language models (LLMs), intelligent agents have become a popular paradigm for interacting with open environments to facilitate AI deployment. However, hallucinations generated by LLMs-where outputs are inconsistent with facts-pose a significant challenge, undermining the credibility of intelligent agents. Only if hallucinations can be mitigated can intelligent agents be used in the real world without catastrophic risk. Therefore, effective detection and mitigation of hallucinations are crucial to ensure the dependability of agents. Unfortunately, related approaches either depend on white-box access to LLMs or fail to accurately identify hallucinations. To address the challenge posed by hallucinations of intelligent agents, we present HalMit, a novel black-box watchdog framework that models the generalization bound of LLM-empowered agents and thus detects hallucinations without requiring internal knowledge of the LLM's architecture. Specifically, a probabilistic fractal sampling technique is proposed to generate a sufficient number of queries to trigger unreliable responses in parallel, efficiently identifying the generalization bound of the target agent. Experimental evaluations demonstrate that HalMit significantly outperforms existing approaches in hallucination monitoring. Its black-box nature and superior performance make HalMit a promising solution for enhancing the dependability of LLM-powered systems.
Authors: Mou\"in Ben Ammar, Arturo Mendoza, Nacim Belkhir, Antoine Manzanera, Gianni Franchi
Abstract: In line with the development of deep learning, this survey examines the transformative role of Transformers and foundation models in advancing visual anomaly detection (VAD). We explore how these architectures, with their global receptive fields and adaptability, address challenges such as long-range dependency modeling, contextual modeling and data scarcity. The survey categorizes VAD methods into reconstruction-based, feature-based and zero/few-shot approaches, highlighting the paradigm shift brought about by foundation models. By integrating attention mechanisms and leveraging large-scale pre-training, Transformers and foundation models enable more robust, interpretable, and scalable anomaly detection solutions. This work provides a comprehensive review of state-of-the-art techniques, their strengths, limitations, and emerging trends in leveraging these architectures for VAD.
Authors: Debangshu Banerjee, Kintan Saha, Aditya Gopalan
Abstract: Alignment of large language models (LLMs) typically involves training a reward model on preference data, followed by policy optimization with respect to the reward model. However, optimizing policies with respect to a single reward model estimate can render it vulnerable to inaccuracies in the reward model. We empirically study the variability of reward model training on open-source benchmarks. We observe that independently trained reward models on the same preference dataset can exhibit substantial disagreement, highlighting the instability of current alignment strategies. Employing a theoretical model, we demonstrate that variability in reward model estimation can cause overfitting, leading to the risk of performance degradation. To mitigate this risk, we propose a variance-aware policy optimization framework for preference-based alignment. The key ingredient of the framework is a new policy regularizer that incorporates reward model variance estimates. We show that variance-aware policy optimization provably reduces the risk of outputting a worse policy than the default. Experiments across diverse LLM and reward model configurations confirm that our approach yields more stable and robust alignment than the standard (variance-unaware) pipeline.
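The spirit of the approach can be conveyed with a per-sample pessimistic score, ensemble mean minus a variance penalty (the paper's regularizer enters the policy objective itself; this simplification is for illustration):

```python
import numpy as np

def variance_penalized_reward(ensemble_rewards: np.ndarray, lam: float = 1.0) -> np.ndarray:
    """Score each response by the ensemble mean minus a disagreement penalty,
    so responses the reward models disagree on are trusted less."""
    return ensemble_rewards.mean(axis=0) - lam * ensemble_rewards.std(axis=0)

# rows: 4 independently trained reward models; columns: 3 candidate responses
scores = np.array([[1.0, 0.2, 0.9],
                   [1.1, 0.9, 0.8],
                   [0.9, 0.1, 1.0],
                   [1.0, 0.8, 0.9]])
print(variance_penalized_reward(scores))  # disagreement on response 2 drags it down
```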
Authors: Alberto Messina
Abstract: In this short note, we propose a unified framework that bridges three areas: (1) a flipped perspective on the Turing Test, the "dual Turing test", in which a human judge's goal is to identify an AI rather than reward a machine for deception; (2) a formal adversarial classification game with explicit quality constraints and worst-case guarantees; and (3) a reinforcement learning (RL) alignment pipeline that uses an undetectability detector and a set of quality related components in its reward model. We review historical precedents, from inverted and meta-Turing variants to modern supervised reverse-Turing classifiers, and highlight the novelty of combining quality thresholds, phased difficulty levels, and minimax bounds. We then formalize the dual test: define the judge's task over N independent rounds with fresh prompts drawn from a prompt space Q, introduce a quality function Q and parameters tau and delta, and cast the interaction as a two-player zero-sum game over the adversary's feasible strategy set M. Next, we map this minimax game onto an RL-HF style alignment loop, in which an undetectability detector D provides negative reward for stealthy outputs, balanced by a quality proxy that preserves fluency. Throughout, we include detailed explanations of each component notation, the meaning of inner minimization over sequences, phased tests, and iterative adversarial training and conclude with a suggestion for a couple of immediate actions.
Authors: Haitian Wang, Xinyu Wang, Yiren Wang, Karen Lee, Zichen Geng, Xian Zhang, Kehkashan Kiran, Yu Zhang, Bo Miao
Abstract: Accurate and efficient skin lesion classification on edge devices is critical for accessible dermatological care but remains challenging due to computational, energy, and privacy constraints. We introduce QANA, a novel quantization-aware neuromorphic architecture for incremental skin lesion classification on resource-limited hardware. QANA effectively integrates ghost modules, efficient channel attention, and squeeze-and-excitation blocks for robust feature representation with low-latency and energy-efficient inference. Its quantization-aware head and spike-compatible transformations enable seamless conversion to spiking neural networks (SNNs) and deployment on neuromorphic platforms. Evaluation on the large-scale HAM10000 benchmark and a real-world clinical dataset shows that QANA achieves 91.6\% Top-1 accuracy and 82.4\% macro F1 on HAM10000, and 90.8\% / 81.7\% on the clinical dataset, significantly outperforming state-of-the-art CNN-to-SNN models under fair comparison. Deployed on BrainChip Akida hardware, QANA achieves 1.5\,ms inference latency and 1.7\,mJ energy per image, reducing inference latency and energy use by over 94.6\% and 98.6\%, respectively, compared to GPU-based CNNs, surpassing state-of-the-art CNN-to-SNN conversion baselines. These results demonstrate the effectiveness of QANA for accurate, real-time, and privacy-sensitive medical analysis in edge environments.
Authors: Ahmed Aman Ibrahim, Hamad Mansour Alawar, Abdulnasser Abbas Zehi, Ahmed Mohammad Alkendi, Bilal Shafi Ashfaq Ahmed Mirza, Shan Ullah, Ismail Lujain Jaleel, Hassan Ugail
Abstract: Face image quality plays a critical role in determining the accuracy and reliability of face verification systems, particularly in real-time screening applications such as surveillance, identity verification, and access control. Low-quality face images, often caused by factors such as motion blur, poor lighting conditions, occlusions, and extreme pose variations, significantly degrade the performance of face recognition models, leading to higher false rejection and false acceptance rates. In this work, we propose a lightweight yet effective framework for automatic face quality assessment, which aims to pre-filter low-quality face images before they are passed to the verification pipeline. Our approach utilises normalised facial landmarks in conjunction with a Random Forest regression model to assess image quality, achieving an accuracy of 96.67\%. By integrating this quality assessment module into the face verification process, we observe a substantial improvement in performance, including a 99.7\% reduction in the false rejection rate and enhanced cosine similarity scores when paired with the ArcFace face verification model. To validate our approach, we conducted experiments on a real-world dataset comprising over 600 subjects, captured from CCTV footage in unconstrained environments by Dubai Police. Our results demonstrate that the proposed framework effectively mitigates the impact of poor-quality face images, outperforming existing face quality assessment techniques while maintaining computational efficiency. Moreover, the framework specifically addresses two critical challenges in real-time screening: variations in face resolution and pose deviations, both of which are prevalent in practical surveillance scenarios.
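The pipeline shape, normalized landmark features fed to a Random Forest regressor whose score gates the verification pipeline, can be sketched with scikit-learn (the features, labels, and threshold below are synthetic stand-ins):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(500, 68 * 2))               # 68 normalized (x, y) landmarks per face
y = X[:, :10].mean(axis=1) + rng.normal(0, 0.05, 500)   # synthetic quality scores

model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

THRESHOLD = 0.5                                          # assumed pre-filter cutoff
quality = model.predict(X[:3])
print([("pass to verifier" if q >= THRESHOLD else "filter out") for q in quality])
```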
Authors: Tarikul Islam Tamiti, Nursad Mamun, Anomadarshi Barua
Abstract: Recovering high-frequency components lost to bandwidth constraints is crucial for applications ranging from telecommunications to high-fidelity audio on limited resources. We introduce NDSI-BWE, a new adversarial Band Width Extension (BWE) framework that leverages four new discriminators inspired by nonlinear dynamical systems, alongside period, amplitude, and phase discriminators, to capture diverse temporal behaviors: a Multi-Resolution Lyapunov Discriminator (MRLD) for sensitivity to initial conditions via deterministic chaos, a Multi-Scale Recurrence Discriminator (MS-RD) for self-similar recurrence dynamics, a Multi-Scale Detrended Fractal Analysis Discriminator (MSDFA) for long-range, slowly varying scale-invariant relationships, a Multi-Resolution Poincar\'e Plot Discriminator (MR-PPD) for hidden latent-space relationships, a Multi-Period Discriminator (MPD) for cyclical patterns, and a Multi-Resolution Amplitude Discriminator (MRAD) and Multi-Resolution Phase Discriminator (MRPD) for intricate amplitude-phase transition statistics. By using depth-wise convolution at the core of the convolutional block within each discriminator, NDSI-BWE attains an eight-fold parameter reduction. These seven discriminators guide a complex-valued ConformerNeXt-based generator with a dual-stream Lattice-Net architecture for simultaneous refinement of magnitude and phase. The generator leverages the transformer-based Conformer's global dependency modeling and the ConvNeXt block's local temporal modeling capability. Across six objective evaluation metrics and a subjective test comprising five human judges, NDSI-BWE establishes a new SoTA in BWE.
Authors: Suchit Gupte, Vishnu Kabir Chhabra, Mohammad Mahdi Khalili
Abstract: Modern LLMs face inference efficiency challenges due to their scale. To address this, many compression methods have been proposed, such as pruning and quantization. However, the effect of compression on a model's interpretability remains elusive. While several model interpretation approaches exist, such as circuit discovery, Sparse Autoencoders (SAEs) have proven particularly effective in decomposing a model's activation space into its feature basis. In this work, we explore the differences in SAEs for the original and compressed models. We find that SAEs trained on the original model can interpret the compressed model, albeit with slight performance degradation compared to an SAE trained on the compressed model. Furthermore, simply pruning the original SAE itself achieves performance comparable to training a new SAE on the pruned model. This finding enables us to mitigate the extensive training costs of SAEs.
Authors: Marcel C. B\"uhler, Ye Yuan, Xueting Li, Yangyi Huang, Koki Nagano, Umar Iqbal
Abstract: We introduce Dream, Lift, Animate (DLA), a novel framework that reconstructs animatable 3D human avatars from a single image. This is achieved by leveraging multi-view generation, 3D Gaussian lifting, and pose-aware UV-space mapping of 3D Gaussians. Given an image, we first dream plausible multi-views using a video diffusion model, capturing rich geometric and appearance details. These views are then lifted into unstructured 3D Gaussians. To enable animation, we propose a transformer-based encoder that models global spatial relationships and projects these Gaussians into a structured latent representation aligned with the UV space of a parametric body model. This latent code is decoded into UV-space Gaussians that can be animated via body-driven deformation and rendered conditioned on pose and viewpoint. By anchoring Gaussians to the UV manifold, our method ensures consistency during animation while preserving fine visual details. DLA enables real-time rendering and intuitive editing without requiring post-processing. Our method outperforms state-of-the-art approaches on ActorsHQ and 4D-Dress datasets in both perceptual quality and photometric accuracy. By combining the generative strengths of video diffusion models with a pose-aware UV-space Gaussian mapping, DLA bridges the gap between unstructured 3D representations and high-fidelity, animation-ready avatars.
Authors: Sumit Singh, Rohit Mishra, Uma Shanker Tiwary
Abstract: One major challenge in natural language processing is named entity recognition (NER), which identifies and categorises named entities in textual input. In order to improve NER, this study investigates a Hindi NER technique that makes use of Hindi-specific pretrained encoders (MuRIL and XLM-R) and generative models (Llama-2-7B-chat-hf (Llama2-7B), Llama-2-70B-chat-hf (Llama2-70B), Llama-3-70B-Instruct (Llama3-70B) and GPT3.5-turbo), and augments the data with retrieved data from external relevant contexts, notably from Wikipedia. We have fine-tuned MuRIL, XLM-R and Llama2-7B with and without Retrieval Augmentation (RA), whereas Llama2-70B, Llama3-70B and GPT3.5-turbo are utilised for few-shot NER generation. Our investigation shows that the mentioned language models (LMs) with RA outperform baseline methods that don't incorporate RA in most cases. The macro F1 scores for MuRIL and XLM-R are 0.69 and 0.495, respectively, without RA and increase to 0.70 and 0.71, respectively, in the presence of RA. Fine-tuned Llama2-7B outperforms its non-fine-tuned counterpart by a significant margin. On the other hand, the generative models that are not fine-tuned also perform better with augmented data. GPT3.5-turbo adopted RA well; however, Llama2-70B and Llama3-70B did not adopt RA with our retrieval context. The findings show that RA significantly improves performance, especially for low-context data. This study adds significant knowledge about how best to use data augmentation methods and pretrained models to enhance NER performance, particularly in languages with limited resources.
Authors: Penghui Yang, Chendong Zhao, Bijun Tang, Zhonghan Zhang, Xinrun Wang, Yanchen Deng, Yuhao Lu, Cuntai Guan, Zheng Liu, Bo An
Abstract: Alloy discovery is central to advancing modern industry but remains hindered by the vastness of compositional design space and the costly validation. Here, we present AutoMAT, a hierarchical and autonomous framework grounded in and validated by experiments, which integrates large language models, automated CALPHAD-based simulations, and AI-driven search to accelerate alloy design. Spanning the entire pipeline from ideation to validation, AutoMAT achieves high efficiency, accuracy, and interpretability without the need for manually curated large datasets. In a case study targeting a lightweight, high-strength alloy, AutoMAT identifies a titanium alloy with 8.1% lower density and comparable yield strength relative to the state-of-the-art reference, achieving the highest specific strength among all comparisons. In a second case targeting high-yield-strength high-entropy alloys, AutoMAT achieves a 28.2% improvement in yield strength over the base alloy. In both cases, AutoMAT reduces the discovery timeline from years to weeks, illustrating its potential as a scalable and versatile platform for next-generation alloy design.
Authors: Ding Wang, Mark D\'iaz, Charvi Rastogi, Aida Davani, Vinodkumar Prabhakaran, Pushkar Mishra, Roma Patel, Alicia Parrish, Zoe Ashwood, Michela Paganini, Tian Huey Teh, Verena Rieser, Lora Aroyo
Abstract: Understanding what constitutes safety in AI-generated content is complex. While developers often rely on predefined taxonomies, real-world safety judgments also involve personal, social, and cultural perceptions of harm. This paper examines how annotators evaluate the safety of AI-generated images, focusing on the qualitative reasoning behind their judgments. Analyzing 5,372 open-ended comments, we find that annotators consistently invoke moral, emotional, and contextual reasoning that extends beyond structured safety categories. Many reflect on potential harm to others more than to themselves, grounding their judgments in lived experience, collective risk, and sociocultural awareness. Beyond individual perceptions, we also find that the structure of the task itself -- including annotation guidelines -- shapes how annotators interpret and express harm. Guidelines influence not only which images are flagged, but also the moral judgment behind the justifications. Annotators frequently cite factors such as image quality, visual distortion, and mismatches between prompt and output as contributing to perceived harm dimensions, which are often overlooked in standard evaluation frameworks. Our findings reveal that existing safety pipelines miss critical forms of reasoning that annotators bring to the task. We argue for evaluation designs that scaffold moral reflection, differentiate types of harm, and make space for subjective, context-sensitive interpretations of AI-generated content.
Authors: Rahul Venkatesh, Klemen Kotar, Lilian Naing Chen, Seungwoo Kim, Luca Thomas Wheeler, Jared Watrous, Ashley Xu, Gia Ancone, Wanhee Lee, Honglin Chen, Daniel Bear, Stefan Stojanov, Daniel Yamins
Abstract: Segments in computer vision are often defined by semantic considerations and are highly dependent on category-specific conventions. In contrast, developmental psychology suggests that humans perceive the world in terms of Spelke objects--groupings of physical things that reliably move together when acted on by physical forces. Spelke objects thus operate on category-agnostic causal motion relationships which potentially better support tasks like manipulation and planning. In this paper, we first benchmark the Spelke object concept, introducing the SpelkeBench dataset that contains a wide variety of well-defined Spelke segments in natural images. Next, to extract Spelke segments from images algorithmically, we build SpelkeNet, a class of visual world models trained to predict distributions over future motions. SpelkeNet supports estimation of two key concepts for Spelke object discovery: (1) the motion affordance map, identifying regions likely to move under a poke, and (2) the expected-displacement map, capturing how the rest of the scene will move. These concepts are used for "statistical counterfactual probing", where diverse "virtual pokes" are applied on regions of high motion affordance, and the resultant expected-displacement maps are used to define Spelke segments as statistical aggregates of correlated motion statistics. We find that SpelkeNet outperforms supervised baselines like SegmentAnything (SAM) on SpelkeBench. Finally, we show that the Spelke concept is practically useful for downstream applications, yielding superior performance on the 3DEditBench benchmark for physical object manipulation when used in a variety of off-the-shelf object manipulation models.
Authors: Yuzhi Liu, Zixuan Chen, Zirui Zhang, Yufei Liu, Giulia Lanzillotta
Abstract: The Neural Tangent Kernel (NTK) offers a powerful tool to study the functional dynamics of neural networks. In the so-called lazy, or kernel, regime, the NTK remains static during training and the network function is linear in the static neural tangent feature space. The evolution of the NTK during training is necessary for feature learning, a key driver of deep learning success. The study of NTK dynamics has led to several critical discoveries in recent years, concerning generalization and scaling behaviours. However, this body of work has been limited to the single-task setting, where the data distribution is assumed constant over time. In this work, we present a comprehensive empirical analysis of NTK dynamics in continual learning, where the data distribution shifts over time. Our findings highlight continual learning as a rich and underutilized testbed for probing the dynamics of neural training. At the same time, they challenge the validity of static-kernel approximations in theoretical treatments of continual learning, even at large scale.
Authors: Ziqiao Yu, Pengfei Sun, Dan F. M. Goodman
Abstract: We investigate the extent to which Spiking Neural Networks (SNNs) trained with Surrogate Gradient Descent (Surrogate GD), with and without delay learning, can learn from precise spike timing beyond firing rates. We first design synthetic tasks isolating intra-neuron inter-spike intervals and cross-neuron synchrony under matched spike counts. On more complex spike-based speech recognition datasets (Spiking Heidelberg Digits (SHD) and Spiking Speech Commands (SSC)), we construct variants where spike count information is eliminated and only timing information remains, and show that Surrogate GD-trained SNNs are able to perform significantly above chance whereas purely rate-based models perform at chance level. We further evaluate robustness under biologically inspired perturbations -- including Gaussian jitter per spike or per neuron, and spike deletion -- revealing consistent but perturbation-specific degradation. Networks show a sharp performance drop when spike sequences are reversed in time, with a larger drop in performance from SNNs trained with delays, indicating that these networks are more human-like in terms of behaviour. To facilitate further studies of temporal coding, we have released our modified SHD and SSC datasets.
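The per-spike jitter perturbation is easy to state precisely; a short sketch (clipping spike times at zero and the jitter scale are assumptions):

```python
import numpy as np

def jitter_spikes(spike_times: np.ndarray, sigma_ms: float, rng=None) -> np.ndarray:
    """Per-spike Gaussian jitter: perturb each spike time independently while
    keeping the spike count fixed, as in timing-robustness probes."""
    rng = rng or np.random.default_rng()
    noisy = spike_times + rng.normal(0.0, sigma_ms, spike_times.shape)
    return np.sort(np.clip(noisy, 0, None))  # keep times non-negative and ordered

spikes = np.array([5.0, 12.5, 30.0, 31.0])  # spike times in ms
print(jitter_spikes(spikes, sigma_ms=1.0, rng=np.random.default_rng(1)))
```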
Authors: Yousab Grees, Polina Iaremchuk, Ramtin Ehsani, Esteban Parra, Preetha Chatterjee, Sonia Haiduc
Abstract: Commit messages in a version control system provide valuable information for developers regarding code changes in software systems. Commit messages can be the only source of information left for future developers describing what was changed and why. However, writing high-quality commit messages is often neglected in practice. Large Language Model (LLM) generated commit messages have emerged as a way to mitigate this issue. We introduce the AI-Powered Commit Explorer (APCE), a tool to support developers and researchers in the use and study of LLM-generated commit messages. APCE gives researchers the option to store different prompts for LLMs and provides an additional evaluation prompt that can further enhance the commit message provided by LLMs. APCE also provides researchers with a straightforward mechanism for automated and human evaluation of LLM-generated messages. A demo is available at https://youtu.be/zYrJ9s6sZvo
Authors: Zhehui Huang, Guangyao Shi, Yuwei Wu, Vijay Kumar, Gaurav S. Sukhatme
Abstract: Multi-robot coordination has traditionally relied on a task-specific and expert-driven pipeline, where natural language mission descriptions are manually translated by domain experts into mathematical formulation, algorithm design, and executable code. This conventional process is labor-intensive, inaccessible to non-experts, and inflexible to changes in mission requirements. Here, we propose LAN2CB (Language to Collective Behavior), a novel framework that leverages large language models (LLMs) to streamline and generalize the multi-robot coordination pipeline. LAN2CB directly converts natural language mission descriptions into executable Python code for multi-robot systems through two key components: (1) Mission Decomposition for Task Representation, which parses the mission into a task graph with dependencies, and (2) Code Generation, which uses the task graph and a structured knowledge base to generate deployable robot control code. We further introduce a dataset of natural language mission specifications to support development and benchmarking. Experimental results in both simulation and real-world settings show that LAN2CB enables effective and flexible multi-robot coordination from natural language, significantly reducing the need for manual engineering while supporting generalization across mission types. Website: https://sites.google.com/view/lan2cb.
Authors: Rodrigo Moreira, Rafael Pasquini, Joberto S. B. Martins, Tereza C. Carvalho, Fl\'avio de Oliveira Silva
Abstract: Network Slicing (NS) realization requires AI-native orchestration architectures to efficiently and intelligently handle heterogeneous user requirements. To achieve this, network slicing is evolving towards a more user-centric digital transformation, focusing on architectures that incorporate native intelligence to enable self-managed connectivity in an integrated and isolated manner. However, these initiatives face the challenge of validating their results in production environments, particularly those utilizing ML-enabled orchestration, as they are often tested in local networks or laboratory simulations. This paper proposes a large-scale validation method that uses a network slicing prediction model, based on Deep Neural Networks (DNNs) and basic ML algorithms embedded within an NS architecture, to forecast latency, evaluated in real large-scale production testbeds. It measures and compares the performance of different DNNs and ML algorithms, considering a distributed database application deployed as a network slice over two large-scale production testbeds. The investigation highlights how AI-based prediction models can enhance network slicing orchestration architectures and presents a seamless, production-ready validation method as an alternative to fully controlled simulations or laboratory setups.
Authors: Yuta Nakahara, Manabu Kobayashi, Toshiyasu Matsushima
Abstract: With the advancement of deep learning, reducing computational complexity and memory consumption has become a critical challenge, and ternary neural networks (NNs) that restrict parameters to $\{-1, 0, +1\}$ have attracted attention as a promising approach. While ternary NNs demonstrate excellent performance in practical applications such as image recognition and natural language processing, their theoretical understanding remains insufficient. In this paper, we theoretically analyze the expressivity of ternary NNs from the perspective of the number of linear regions. Specifically, we evaluate the number of linear regions of ternary regression NNs with Rectified Linear Unit (ReLU) for activation functions and prove that the number of linear regions increases polynomially with respect to network width and exponentially with respect to depth, similar to standard NNs. Moreover, we show that it suffices to either square the width or double the depth of ternary NNs to achieve a lower bound on the maximum number of linear regions comparable to that of general ReLU regression NNs. This provides a theoretical explanation, in some sense, for the practical success of ternary NNs.
Authors: Ondrej Bohdal, Mete Ozay, Jijoong Moon, Kyeng-Hun Lee, Hyeonmok Ko, Umberto Michieli
Abstract: Adapter parameters provide a mechanism to modify the behavior of machine learning models and have gained significant popularity in the context of large language models (LLMs) and generative AI. These parameters can be merged to support multiple tasks via a process known as task merging. However, prior work on merging in LLMs, particularly in natural language processing, has been limited to scenarios where each test example addresses only a single task. In this paper, we focus on on-device settings and study the problem of text-based compositional multi-tasking, where each test example involves the simultaneous execution of multiple tasks. For instance, generating a translated summary of a long text requires solving both translation and summarization tasks concurrently. To facilitate research in this setting, we propose a benchmark comprising four practically relevant compositional tasks. We also present an efficient method (Learnable Calibration) tailored for on-device applications, where computational resources are limited, emphasizing the need for solutions that are both resource-efficient and high-performing. Our contributions lay the groundwork for advancing the capabilities of LLMs in real-world multi-tasking scenarios, expanding their applicability to complex, resource-constrained use cases.
Authors: Dakota Sullivan, Shirley Zhang, Jennica Li, Heather Kirkorian, Bilge Mutlu, Kassem Fawaz
Abstract: Social robots are embodied agents that interact with people while following human communication norms. These robots interact using verbal and non-verbal cues, and share the physical environments of people. While social robots have previously utilized rule-based systems or probabilistic models for user interaction, the rapid evolution of large language models (LLMs) presents new opportunities to develop LLM-empowered social robots for enhanced human-robot interaction. To fully realize these capabilities, however, robots need to collect data such as audio, fine-grained images, video, and locations. As a result, LLMs often process sensitive personal information, particularly within home environments. Given the tension between utility and privacy risks, evaluating how current LLMs manage sensitive data is critical. Specifically, we aim to explore the extent to which out-of-the-box LLMs are privacy-aware in the context of household social robots. In this study, we present a set of privacy-relevant scenarios crafted through the lens of Contextual Integrity (CI). We first survey users' privacy preferences regarding in-home social robot behaviors and then examine how their privacy orientation affects their choices of these behaviors (N = 450). We then provide the same set of scenarios and questions to state-of-the-art LLMs (N = 10) and find that the agreement between humans and LLMs is low. To further investigate the capabilities of LLMs as a potential privacy controller, we implement four additional prompting strategies and compare their results. Finally, we discuss the implications and potential of AI privacy awareness in human-robot interaction.
Authors: Mahika Phutane, Aditya Vashistha
Abstract: People with disabilities (PwD) experience disproportionately high levels of discrimination and hate online, particularly in India, where entrenched stigma and limited resources intensify these challenges. Large language models (LLMs) are increasingly used to identify and mitigate online hate, yet most research on online ableism focuses on Western audiences with Western AI models. Are these models adequately equipped to recognize ableist harm in non-Western places like India? Do localized, Indic language models perform better? To investigate, we adopted and translated a publicly available ableist speech dataset to Hindi, and prompted eight LLMs--four developed in the U.S. (GPT-4, Gemini, Claude, Llama) and four in India (Krutrim, Nanda, Gajendra, Airavata)--to score and explain ableism. In parallel, we recruited 175 PwD from both the U.S. and India to perform the same task, revealing stark differences between groups. Western LLMs consistently overestimated ableist harm, while Indic LLMs underestimated it. Even more concerning, all LLMs were more tolerant of ableism when it was expressed in Hindi and asserted Western framings of ableist harm. In contrast, Indian PwD interpreted harm through intention, relationality, and resilience--emphasizing a desire to inform and educate perpetrators. This work provides groundwork for global, inclusive standards of ableism, demonstrating the need to center local disability experiences in the design and evaluation of AI systems.
Authors: Eduardo Pacheco, Atila Orhon, Berkin Durmus, Blaise Munyampirwa, Andrey Leonov
Abstract: Even state-of-the-art speaker diarization systems exhibit high variance in error rates across different datasets, representing numerous use cases and domains. Furthermore, comparing across systems requires careful application of best practices such as dataset splits and metric definitions to allow for apples-to-apples comparison. We propose SDBench (Speaker Diarization Benchmark), an open-source benchmark suite that integrates 13 diverse datasets with built-in tooling for consistent and fine-grained analysis of speaker diarization performance for various on-device and server-side systems. SDBench enables reproducible evaluation and easy integration of new systems over time. To demonstrate the efficacy of SDBench, we built SpeakerKit, an inference efficiency-focused system built on top of Pyannote v3. SDBench enabled rapid execution of ablation studies that led to SpeakerKit being 9.6x faster than Pyannote v3 while achieving comparable error rates. We benchmark 6 state-of-the-art systems including Deepgram, AWS Transcribe, and Pyannote AI API, revealing important trade-offs between accuracy and speed.
Authors: Yasser Ashraf, Ahmed Sharshar, Velibor Bojkovic, Bin Gu
Abstract: Spike cameras, bio-inspired vision sensors, asynchronously fire spikes by accumulating light intensities at each pixel, offering ultra-high energy efficiency and exceptional temporal resolution. Unlike event cameras, which record changes in light intensity to capture motion, spike cameras provide even finer spatiotemporal resolution and a more precise representation of continuous changes. In this paper, we introduce the first video action recognition (VAR) dataset using a spike camera, alongside synchronized RGB and thermal modalities, to enable comprehensive benchmarking for Spiking Neural Networks (SNNs). By preserving the inherent sparsity and temporal precision of spiking data, our three datasets offer a unique platform for exploring multimodal video understanding and serve as a valuable resource for directly comparing spiking, thermal, and RGB modalities. This work contributes a novel dataset that will drive research in energy-efficient, ultra-low-power video understanding, specifically for action recognition tasks using spike-based data.
Authors: Jyun-Ze Tang, Chih-Fan Hsu, Jeng-Lin Li, Ming-Ching Chang, Wei-Chao Chen
Abstract: Flow matching and diffusion models have shown impressive results in text-to-image generation, producing photorealistic images through an iterative denoising process. A common strategy to speed up synthesis is to perform early denoising at lower resolutions. However, traditional methods that downscale and upscale in pixel space often introduce artifacts and distortions. These issues arise when the upscaled images are re-encoded into the latent space, leading to degraded final image quality. To address this, we propose {\bf Latent Space Scaling Generation (LSSGen)}, a framework that performs resolution scaling directly in the latent space using a lightweight latent upsampler. Without altering the Transformer or U-Net architecture, LSSGen improves both efficiency and visual quality while supporting flexible multi-resolution generation. Our comprehensive evaluation covering text-image alignment and perceptual quality shows that LSSGen significantly outperforms conventional scaling approaches. When generating $1024^2$ images at similar speeds, it achieves up to 246\% TOPIQ score improvement.
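Scaling directly in latent space, rather than decoding to pixels, upscaling, and re-encoding, can be sketched in a few lines (LSSGen uses a learned lightweight upsampler; parameter-free bilinear interpolation serves as a stand-in here):

```python
import torch
import torch.nn.functional as F

def upscale_latent(z: torch.Tensor, scale: float = 2.0) -> torch.Tensor:
    """Upscale a latent feature map in place of the pixel-space round trip,
    avoiding the decode/re-encode step that introduces artifacts."""
    return F.interpolate(z, scale_factor=scale, mode="bilinear", align_corners=False)

z = torch.randn(1, 4, 64, 64)    # e.g., a 4-channel diffusion latent (assumed shape)
print(upscale_latent(z).shape)   # -> torch.Size([1, 4, 128, 128]), ready for late denoising
```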
Authors: Eldor Abdukhamidov, Tamer Abuhmed, Joanna C. S. Santos, Mohammed Abuhamad
Abstract: Studies have shown that machine learning systems are vulnerable to adversarial examples in theory and practice. Where previous attacks have focused mainly on visual models that exploit the difference between human and machine perception, text-based models have also fallen victim to these attacks. However, these attacks often fail to maintain the semantic meaning and similarity of the text. This paper introduces AdvChar, a black-box attack on Interpretable Natural Language Processing Systems, designed to mislead the classifier while keeping the interpretation similar to benign inputs, thus exploiting trust in system transparency. AdvChar achieves this by making less noticeable modifications to text input, forcing the deep learning classifier to make incorrect predictions while preserving the original interpretation. We use an interpretation-focused scoring approach to determine the most critical tokens that, when changed, can cause the classifier to misclassify the input. We apply simple character-level modifications to measure the importance of tokens, minimizing the difference between the original and new text while generating adversarial interpretations similar to benign ones. We thoroughly evaluated AdvChar by testing it against seven NLP models and three interpretation models using benchmark datasets for the classification task. Our experiments show that AdvChar can significantly reduce the prediction accuracy of current deep learning models by altering just two characters on average in input samples.
Authors: Yang Yu, Kai Han, Hang Zhou, Yehui Tang, Kaiqi Huang, Yunhe Wang, Dacheng Tao
Abstract: While large-scale training data is fundamental for developing capable large language models (LLMs), strategically selecting high-quality data has emerged as a critical approach to enhance training efficiency and reduce computational costs. Current data selection methodologies predominantly rely on static, training-agnostic criteria, failing to account for the dynamic model training and data interactions. In this paper, we propose a new Data Weighting Model (DWM) to adjust the weight of selected data within each batch to achieve a dynamic data utilization during LLM training. Specifically, to better capture the dynamic data preference of the trained model, a bi-level optimization framework is implemented to update the weighting model. Our experiments demonstrate that DWM enhances the performance of models trained with randomly-selected data, and the learned weighting model can be transferred to enhance other data selection methods and models of different sizes. Moreover, we further analyze how a model's data preferences evolve throughout training, providing new insights into the data preference of the model during training.
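As a rough illustration of batch-level data weighting, the sketch below scores each sample in a batch with a small weighting network and reweights the per-sample losses. The input features, architecture, and softmax normalization are assumptions for illustration; the paper's bi-level update of the weighting model is only indicated in a comment.

```python
# Hedged sketch of per-batch data weighting in the spirit of DWM; this is not
# the paper's bi-level procedure, just the reweighting step.
import torch
import torch.nn as nn

weight_net = nn.Sequential(nn.Linear(1, 16), nn.ReLU(), nn.Linear(16, 1))

def weighted_batch_loss(per_sample_loss: torch.Tensor) -> torch.Tensor:
    # Each sample's detached loss serves as a cheap input feature (assumption).
    feats = per_sample_loss.detach().unsqueeze(-1)           # (B, 1)
    w = torch.softmax(weight_net(feats).squeeze(-1), dim=0)  # (B,), sums to 1
    return (w * per_sample_loss).sum()

per_sample_loss = torch.rand(8, requires_grad=True)  # stand-in for per-sample CE
loss = weighted_batch_loss(per_sample_loss)
loss.backward()
# In the paper, the weighting model itself is updated via a bi-level scheme
# (an outer optimization step); that loop is omitted here.
```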
Authors: Rui Guo, Avinash Ayalasomayajula, Henian Li, Jingbo Zhou, Sujan Kumar Saha, Farimah Farahmandi
Abstract: Verification using SystemVerilog assertions (SVA) is one of the most popular methods for detecting circuit design vulnerabilities. However, with the globalization of integrated circuit design and the continuous upgrading of security requirements, the SVA development model has exposed major limitations. It is not only inefficient in development, but also unable to effectively deal with the increasing number of security vulnerabilities in modern complex integrated circuits. In response to these challenges, this paper proposes an innovative SVA automatic generation framework, SVAgent. SVAgent introduces a requirement decomposition mechanism to transform the original complex requirements into a structured, gradually solvable fine-grained problem-solving chain. Experiments have shown that SVAgent can effectively suppress the influence of hallucinations and random answers, and the key evaluation indicators such as the accuracy and consistency of the SVA are significantly better than those of existing frameworks. More importantly, we successfully integrated SVAgent into the most mainstream integrated circuit vulnerability assessment framework and verified its practicality and reliability in a real engineering design environment.
Authors: Xu Yang, Qi Zhang, Shuming Jiang, Yaowen Xu, Zhaofan Zou, Hao Sun, Xuelong Li
Abstract: With the rapid advancement of generative AI, synthetic content across images, videos, and audio has become increasingly realistic, amplifying the risk of misinformation. Existing detection approaches predominantly focus on binary classification while lacking detailed and interpretable explanations of forgeries, which limits their applicability in safety-critical scenarios. Moreover, current methods often treat each modality separately, without a unified benchmark for cross-modal forgery detection and interpretation. To address these challenges, we introduce METER, a unified, multi-modal benchmark for interpretable forgery detection spanning images, videos, audio, and audio-visual content. Our dataset comprises four tracks, each requiring not only real-vs-fake classification but also evidence-chain-based explanations, including spatio-temporal localization, textual rationales, and forgery type tracing. Compared to prior benchmarks, METER offers broader modality coverage and richer interpretability metrics such as spatial/temporal IoU, multi-class tracing, and evidence consistency. We further propose a human-aligned, three-stage Chain-of-Thought (CoT) training strategy combining SFT, DPO, and a novel GRPO stage that integrates a human-aligned evaluator with CoT reasoning. We hope METER will serve as a standardized foundation for advancing generalizable and interpretable forgery detection in the era of generative media.
Authors: Katelyn Morrison, Arpit Mathur, Aidan Bradshaw, Tom Wartmann, Steven Lundi, Afrooz Zandifar, Weichang Dai, Kayhan Batmanghelich, Motahhare Eslami, Adam Perer
Abstract: As text-to-image generative models rapidly improve, AI researchers are making significant advances in developing domain-specific models capable of generating complex medical imagery from text prompts. Despite this, these technical advancements have overlooked whether and how medical professionals would benefit from and use text-to-image generative AI (GenAI) in practice. By developing domain-specific GenAI without involving stakeholders, we risk building models that are either not useful or even more harmful than helpful. In this paper, we adopt a human-centered approach to responsible model development by involving stakeholders in evaluating and reflecting on the promises, risks, and challenges of a novel text-to-CT Scan GenAI model. Through exploratory model prompting activities, we uncover the perspectives of medical students, radiology trainees, and radiologists on the role that text-to-CT Scan GenAI can play across medical education, training, and practice. This human-centered approach additionally enabled us to surface technical challenges and domain-specific risks of generating synthetic medical images. We conclude by reflecting on the implications of medical text-to-image GenAI.
Authors: Sohaib Muhammad, Ashwati Vipin, Karan Shetti, Honey Mittal
Abstract: Despite rapid advances in Large Language Models (LLMs) and Multimodal LLMs, numerous challenges related to interpretability, scalability, resource requirements, and repeatability remain in their application to the design-to-code space. To address this, we introduce the Large Design Models (LDMs) paradigm specifically trained on designs and webpages to enable seamless conversion from design-to-code. We have developed a training and inference pipeline by incorporating data engineering and appropriate model architecture modification. The training pipeline consists of the following: 1) Design Optimiser: developed using a proprietary ground truth dataset and addresses sub-optimal designs; 2) Tagging and feature detection: using pre-trained and fine-tuned models, this enables the accurate detection and classification of UI elements; and 3) Auto Components: extracts repeated UI structures into reusable components to enable creation of modular code, thus reducing redundancy while enhancing code reusability. In this manner, each model addresses distinct but key issues for design-to-code conversion. Separately, our inference pipeline processes real-world designs to produce precise and interpretable instructions for code generation and ensures reliability. Additionally, our models demonstrated exceptional end-to-end design-to-code conversion accuracy using a novel preview match score metric. Comparative experiments indicated superior performance of LDMs against LLMs on the accuracy of node positioning, responsiveness, and reproducibility. Moreover, our custom-trained tagging and feature detection model demonstrated high precision and consistency in identifying UI elements across a wide sample of test designs. Thus, our proposed LDMs are a reliable and superior solution to understanding designs that subsequently enable the generation of efficient and reliable production-ready code.
Authors: Wentao Xiang, Haoxian Tan, Cong Wei, Yujie Zhong, Dengjie Li, Yujiu Yang
Abstract: Perception is a fundamental task in the field of computer vision, encompassing a diverse set of subtasks that can be systematically categorized into four distinct groups based on two dimensions: prediction type and instruction type. Notably, existing research often focuses solely on a limited subset of these potential combinations, which constrains their applicability and versatility across various contexts. In response to this challenge, we present MVP-LM, a Multi-granular and Versatile Perception framework incorporating Visual Large Language Model. Our framework is designed to integrate both word-based and sentence-based perception tasks alongside box and mask predictions within a single architecture. MVP-LM features an innovative multi-granularity decoder in conjunction with a CoT-inspired dataset unification strategy, enabling seamless supervised fine-tuning across a wide spectrum of tasks, including but not limited to panoptic segmentation, detection, grounding, and referring expression segmentation. Furthermore, we introduce a query enhancement strategy aimed at harnessing the decoding and generative capabilities inherent in VLLMs. Extensive experiments conducted across a range of benchmarks in both word-based and sentence-based perception tasks substantiate the efficacy of our framework. The code will be available at https://github.com/xiangwentao666/MVP-LM.
Authors: Batu Candan, Simone Servadio
Abstract: Accurate and robust relative pose estimation is crucial for enabling challenging Active Debris Removal (ADR) missions targeting tumbling derelict satellites such as ESA's ENVISAT. This work presents a complete pipeline integrating advanced computer vision techniques with adaptive nonlinear filtering to address this challenge. A Convolutional Neural Network (CNN), enhanced with image preprocessing, detects structural markers (corners) from chaser imagery, whose 2D coordinates are converted to 3D measurements using camera modeling. These measurements are fused within an Unscented Kalman Filter (UKF) framework, selected for its ability to handle nonlinear relative dynamics, to estimate the full relative pose. Key contributions include the integrated system architecture and a dual adaptive strategy within the UKF: dynamic tuning of the measurement noise covariance compensates for varying CNN measurement uncertainty, while adaptive tuning of the process noise covariance, utilizing measurement residual analysis, accounts for unmodeled dynamics or maneuvers online. This dual adaptation enhances robustness against both measurement imperfections and dynamic model uncertainties. The performance of the proposed adaptive integrated system is evaluated through high-fidelity simulations using a realistic ENVISAT model, comparing estimates against ground truth under various conditions, including measurement outages. This comprehensive approach offers an enhanced solution for robust onboard relative navigation, significantly advancing the capabilities required for safe proximity operations during ADR missions.
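The measurement-noise adaptation can be pictured with a standard innovation-based scheme like the sketch below, which estimates the innovation covariance over a sliding window and backs out R. The window size, diagonal floor, and exact update rule are illustrative assumptions, not the paper's dual adaptive strategy.

```python
# Hedged sketch of innovation-based adaptation of the measurement-noise
# covariance R in a Kalman-type filter (illustrative, not the paper's scheme).
import numpy as np
from collections import deque

class AdaptiveMeasurementNoise:
    def __init__(self, dim: int, window: int = 20):
        self.innovations = deque(maxlen=window)
        self.R = np.eye(dim)

    def update(self, innovation: np.ndarray, HPHt: np.ndarray) -> np.ndarray:
        # Estimate the innovation covariance over a sliding window, then back
        # out R ~ Cov(innovation) - H P H^T, floored to stay positive definite.
        self.innovations.append(innovation)
        V = np.stack(self.innovations)
        C = V.T @ V / len(self.innovations)
        self.R = np.diag(np.maximum(np.diag(C - HPHt), 1e-6))
        return self.R

adapt = AdaptiveMeasurementNoise(dim=3)
R = adapt.update(np.array([0.1, -0.2, 0.05]), HPHt=0.01 * np.eye(3))
print(R)
```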
Authors: Shahriar Golchin, Yanfei Chen, Rujun Han, Manan Gandhi, Tianli Yu, Swaroop Mishra, Mihai Surdeanu, Rishabh Agarwal, Chen-Yu Lee, Tomas Pfister
Abstract: Long-context large language models (LLMs) are able to process inputs containing up to several million tokens. In the scope of in-context learning (ICL), this translates into using hundreds/thousands of demonstrations in the input prompt, enabling many-shot ICL. In practice, a fixed set of demonstrations is often selected at random in many-shot settings due to (1) high inference costs, (2) the benefits of caching and reusing computations, and (3) the similar performance offered by this strategy compared to others when scaled. In this work, we propose two straightforward strategies for demonstration selection in many-shot ICL that improve performance with minimal computational overhead. Our first method combines a small number of demonstrations, selected based on their similarity to each test sample, with a disproportionately larger set of random demonstrations that are cached. The second strategy improves the first by replacing random demonstrations with those selected using centroids derived from test sample representations via k-means clustering. Our experiments with Gemini Pro and Flash across several datasets indicate that our strategies consistently outperform random selection and surpass or match the most performant selection approach while supporting caching and reducing inference cost by up to an order of magnitude. We also show that adjusting the proportion of demonstrations selected based on different criteria can balance performance and inference cost in many-shot ICL.
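A minimal sketch of the two strategies, under assumed embeddings and split sizes: a large cached block (random, or demos nearest to k-means centroids of the test-sample representations) shared across queries, plus a small per-query similarity block.

```python
# Hedged sketch of the two demonstration-selection strategies; embedding
# source, block sizes, and similarity metric are assumptions for illustration.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
demo_emb = rng.normal(size=(1000, 64))   # candidate demonstration embeddings
test_emb = rng.normal(size=(200, 64))    # test-sample embeddings

# Strategy 1: a large cached random block, reused verbatim for every query.
cached_random = rng.choice(len(demo_emb), size=100, replace=False)

# Strategy 2: replace the random block with demos nearest to k-means centroids
# of the test-sample representations.
centroids = KMeans(n_clusters=100, n_init=10, random_state=0).fit(test_emb).cluster_centers_
cached_centroid = np.array(
    [np.argmin(np.linalg.norm(demo_emb - c, axis=1)) for c in centroids]
)

def select_demos(query_emb: np.ndarray, cached: np.ndarray, k_sim: int = 10) -> np.ndarray:
    # Small per-query similarity block prepended to the shared cached block,
    # so most of the prompt stays identical across queries (cache-friendly).
    top_k = np.argsort(-(demo_emb @ query_emb))[:k_sim]
    return np.concatenate([top_k, cached])

prompt_demo_ids = select_demos(test_emb[0], cached_centroid)
```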
Authors: Da Fan, David John Gagne II, Steven J. Greybush, Eugene E. Clothiaux, John S. Schreck, Chaopeng Shen
Abstract: This study evaluated the probability and uncertainty forecasts of five recently proposed Bayesian deep learning methods relative to a deterministic residual neural network (ResNet) baseline for 0-1 h convective initiation (CI) nowcasting using GOES-16 satellite infrared observations. Uncertainty was assessed by how well probabilistic forecasts were calibrated and how well uncertainty separated forecasts with large and small errors. Most of the Bayesian deep learning methods produced probabilistic forecasts that outperformed the deterministic ResNet, with one, the initial-weights ensemble + Monte Carlo (MC) dropout (an ensemble of deterministic ResNets with different initial weights to start training and dropout activated during inference), producing the most skillful and well-calibrated forecasts. The initial-weights ensemble + MC dropout benefited from generating multiple solutions that more thoroughly sampled the hypothesis space. The Bayesian ResNet ensemble was the only one that performed worse than the deterministic ResNet at longer lead times, likely due to the challenge of optimizing a larger number of parameters. To address this issue, the Bayesian-MOPED (MOdel Priors with Empirical Bayes using Deep neural network) ResNet ensemble was adopted, and it enhanced forecast skill by constraining the hypothesis search near the deterministic ResNet hypothesis. All Bayesian methods demonstrated well-calibrated uncertainty and effectively separated cases with large and small errors. In case studies, the initial-weights ensemble + MC dropout demonstrated better forecast skill than the Bayesian-MOPED ensemble and the deterministic ResNet on selected CI events in clear-sky regions. However, the initial-weights ensemble + MC dropout exhibited poorer generalization in clear-sky and anvil cloud regions without CI occurrence compared to the deterministic ResNet and Bayesian-MOPED ensemble.
Authors: Zixu Wang, Yuhan Wang, Junfei Ma, Fuyuan Wu, Junchi Yan, Xiaohui Yuan, Zhe Zhang, Jie Zhang
Abstract: This work presents predictive hydrodynamic simulations empowered by artificial intelligence (AI) for laser driven implosion experiments, taking the double-cone ignition (DCI) scheme as an example. A Transformer-based deep learning model MULTI-Net is established to predict implosion features according to laser waveforms and target radius. A Physics-Informed Decoder (PID) is proposed for high-dimensional sampling, significantly reducing the prediction errors compared to Latin hypercube sampling. Applied to DCI experiments conducted on the SG-II Upgrade facility, the MULTI-Net model is able to predict the implosion dynamics measured by the x-ray streak camera. It is found that an effective laser absorption factor of about 65\% is suitable for the one-dimensional simulations of the DCI-R10 experiments. For shot 33, the mean implosion velocity and collided plasma density reached 195 km/s and 117 g/cc, respectively. This study demonstrates a data-driven AI framework that enhances the prediction ability of simulations for complicated laser fusion experiments.
Authors: Paul R. B. Houssel, Siamak Layeghy, Priyanka Singh, Marius Portmann
Abstract: This paper introduces eX-NIDS, a framework designed to enhance interpretability in flow-based Network Intrusion Detection Systems (NIDS) by leveraging Large Language Models (LLMs). In our proposed framework, flows labelled as malicious by NIDS are initially processed through a module called the Prompt Augmenter. This module extracts contextual information and Cyber Threat Intelligence (CTI)-related knowledge from these flows. This enriched, context-specific data is then integrated with an input prompt for an LLM, enabling it to generate detailed explanations and interpretations of why the flow was identified as malicious by NIDS. We compare the generated interpretations against a Basic-Prompt Explainer baseline, which does not incorporate any contextual information into the LLM's input prompt. Our framework is quantitatively evaluated using the Llama 3 and GPT-4 models, employing a novel evaluation method tailored for natural language explanations, focusing on their correctness and consistency. The results demonstrate that augmented LLMs can produce accurate and consistent explanations, serving as valuable complementary tools in NIDS to explain the classification of malicious flows. The use of augmented prompts enhances performance by over 20% compared to the Basic-Prompt Explainer.
Authors: Tanusree Sharma, Yihao Zhou, Visar Berisha
Abstract: Early large-scale audio datasets, such as LibriSpeech, were built with hundreds of individual contributors whose voices were instrumental in the development of speech technologies, including audiobooks and voice assistants. Yet, a decade later, these same contributions have exposed voice actors to a range of risks. While existing ethical frameworks emphasize Consent, Credit, and Compensation (C3), they do not adequately address the emergent risks involving vocal identities that are increasingly decoupled from context, authorship, and control. Drawing on qualitative interviews with 20 professional voice actors, this paper reveals how the synthetic replication of voice without enforceable constraints exposes individuals to a range of threats. Beyond reputational harm, such as re-purposing voice data in erotic content, offensive political messaging, and meme culture, we document concerns about accountability breakdowns when their voice is leveraged to clone voices that are deployed in high-stakes scenarios such as financial fraud, misinformation campaigns, or impersonation scams. In such cases, actors face social and legal fallout without recourse, while very few of them have a legal representative or union protection. To make sense of these shifting dynamics, we introduce the PRAC3 framework, an expansion of C3 that foregrounds Privacy, Reputation, Accountability, Consent, Credit, and Compensation as interdependent pillars of data used in the synthetic voice economy. This framework captures how privacy risks are amplified through non-consensual training, how reputational harm arises from decontextualized deployment, and how accountability can be reimagined in AI data ecosystems. We argue that voice, as both a biometric identifier and creative labor, demands governance models that restore creator agency, ensure traceability, and establish enforceable boundaries for ethical reuse.
Authors: Yu Wang, Bo Dang, Wanchun Li, Wei Chen, Yansheng Li
Abstract: With the increasing resolution of remote sensing imagery (RSI), large-size RSI has emerged as a vital data source for high-precision vector mapping of geographic objects. Existing methods are typically constrained to processing small image patches, which often leads to the loss of contextual information and produces fragmented vector outputs. To address these issues, this paper introduces HoliTracer, the first framework designed to holistically extract vectorized geographic objects from large-size RSI. In HoliTracer, we enhance segmentation of large-size RSI using the Context Attention Net (CAN), which employs a local-to-global attention mechanism to capture contextual dependencies. Furthermore, we achieve holistic vectorization through a robust pipeline that leverages the Mask Contour Reformer (MCR) to reconstruct polygons and the Polygon Sequence Tracer (PST) to trace vertices. Extensive experiments on large-size RSI datasets, including buildings, water bodies, and roads, demonstrate that HoliTracer outperforms state-of-the-art methods. Our code and data are available at https://github.com/vvangfaye/HoliTracer.
Authors: Hyunji Nam, Omer Gottesman, Amy Zhang, Dean Foster, Emma Brunskill, Lyle Ungar
Abstract: Large language models (LLMs) built on existing reinforcement learning with human feedback (RLHF) frameworks typically optimize responses based on immediate turn-level human preferences. However, this approach falls short in multi-turn dialogue settings, such as online math tutoring. We propose a method to enhance LLM-based tutors by representing the dialogue history with a lower-dimensional latent state representation of a student and optimizing a long-term policy to determine high-level actions based on the latent state. The goal is to better align the tutor's behavior with the long-term objective of guiding the student towards solving a target math problem on their own. Our model is lightweight, requiring fewer computational resources than prior work that trains the tutor policy end-to-end to directly output the tutor's next utterance. Our experimental results demonstrate that these modifications lead to improved long-term outcomes compared to prompting in LLM-simulated tutoring tasks.
Authors: Seunghyeon Kim, Kyeongryeol Go
Abstract: Fisheye cameras introduce significant distortion and pose unique challenges to object detection models trained on conventional datasets. In this work, we propose a data-centric pipeline that systematically improves detection performance by focusing on the key question of identifying the blind spots of the model. Through detailed error analysis, we identify critical edge-cases such as confusing class pairs, peripheral distortions, and underrepresented contexts. Then we directly address them through edge-case synthesis. We fine-tuned an image generative model and guided it with carefully crafted prompts to produce images that replicate real-world failure modes. These synthetic images are pseudo-labeled using a high-quality detector and integrated into training. Our approach results in consistent performance gains, highlighting how deeply understanding data and selectively fixing its weaknesses can be impactful in specialized domains like fisheye object detection.
Authors: Xinyue Yang, Meiliang Liu, Yunfang Xu, Xiaoxiao Yang, Zhengye Si, Zijin Li, Zhiwen Zhao
Abstract: Alzheimer's disease (AD) is a progressive neurodegenerative disorder that predominantly affects the elderly population and currently has no cure. Magnetic Resonance Imaging (MRI), as a non-invasive imaging technique, is essential for the early diagnosis of AD. MRI inherently contains both spatial and frequency information, as raw signals are acquired in the frequency domain and reconstructed into spatial images via the Fourier transform. However, most existing AD diagnostic models extract features from a single domain, limiting their capacity to fully capture the complex neuroimaging characteristics of the disease. While some studies have combined spatial and frequency information, they are mostly confined to 2D MRI, leaving the potential of dual-domain analysis in 3D MRI unexplored. To overcome this limitation, we propose Spatio-Frequency Network (SFNet), the first end-to-end deep learning framework that simultaneously leverages spatial and frequency domain information to enhance 3D MRI-based AD diagnosis. SFNet integrates an enhanced dense convolutional network to extract local spatial features and a global frequency module to capture global frequency-domain representations. Additionally, a novel multi-scale attention module is proposed to further refine spatial feature extraction. Experiments on the Alzheimer's Disease Neuroimaging Initiative (ADNI) dataset demonstrate that SFNet outperforms existing baselines and reduces computational overhead in classifying cognitively normal (CN) and AD, achieving an accuracy of 95.1%.
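As an illustration of the dual-domain idea, the sketch below derives a global frequency-domain representation of a 3D volume with a 3D FFT. It is an assumption-level stand-in for SFNet's global frequency module, not the paper's architecture.

```python
# Hedged sketch of extracting global frequency-domain features from a 3D MRI
# volume; the log-magnitude spectrum here is an illustrative choice.
import torch

def global_frequency_features(vol: torch.Tensor) -> torch.Tensor:
    # vol: (B, 1, D, H, W) spatial-domain volume.
    spec = torch.fft.fftn(vol, dim=(-3, -2, -1))        # complex 3D spectrum
    mag = torch.log1p(spec.abs())                       # global magnitudes
    return torch.fft.fftshift(mag, dim=(-3, -2, -1))    # center low frequencies

vol = torch.randn(2, 1, 32, 32, 32)
freq = global_frequency_features(vol)   # same shape, frequency domain
print(freq.shape)
```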
Authors: Zixiao Huang, Junhao Hu, Hao Lin, Chunyang Zhu, Yueran Tang, Quanlu Zhang, Zhen Guo, Zhenhua Li, Shengen Yan, Zhenhua Zhu, Guohao Dai, Yu Wang
Abstract: The rapid scaling of large language models (LLMs) has significantly increased GPU memory pressure, which is further aggravated by training optimization techniques such as virtual pipeline and recomputation that disrupt tensor lifespans and introduce considerable memory fragmentation. Default GPU memory allocators of popular deep learning frameworks like PyTorch use online strategies without knowledge of tensor lifespans, which can waste up to 43\% of memory and cause out-of-memory errors, rendering optimization techniques ineffective or even unusable. To address this, we introduce STWeaver, a GPU memory allocator for deep learning frameworks that reduces fragmentation by exploiting the spatial and temporal regularity in memory allocation behaviors of training workloads. STWeaver introduces a novel paradigm that combines offline planning with online allocation. The offline planning leverages spatio-temporal regularities to generate a near-optimal allocation plan, while the online allocation handles complex and dynamic models such as Mixture-of-Experts (MoE). Built as a pluggable PyTorch allocator, STWeaver reduces fragmentation ratio on average by 79.2\% (up to 100\%) across both dense and sparse models, with negligible overhead. This enables more efficient, high-throughput training configurations and improves performance by up to 32.5\%.
Authors: Yash Kumar
Abstract: Although modern deep learning often relies on massive over-parameterized models, the fundamental interplay between capacity, sparsity, and robustness in low-capacity networks remains a vital area of study. We introduce a controlled framework to investigate these properties by creating a suite of binary classification tasks from the MNIST dataset with increasing visual difficulty (e.g., 0 and 1 vs. 4 and 9). Our experiments reveal three core findings. First, the minimum model capacity required for successful generalization scales directly with task complexity. Second, these trained networks are robust to extreme magnitude pruning (up to 95% sparsity), revealing the existence of sparse, high-performing subnetworks. Third, we show that over-parameterization provides a significant advantage in robustness against input corruption. Interpretability analysis via saliency maps further confirms that these identified sparse subnetworks preserve the core reasoning process of the original dense models. This work provides a clear, empirical demonstration of the foundational trade-offs governing simple neural networks.
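The pruning experiments can be reproduced in spirit with plain global magnitude pruning, as in the hedged sketch below (generic model and cutoff logic, not the paper's code).

```python
# Global magnitude pruning to a target sparsity, e.g. the 95% regime above.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(784, 16), nn.ReLU(), nn.Linear(16, 1))

def magnitude_prune(model: nn.Module, sparsity: float = 0.95) -> None:
    # Compute one global magnitude cutoff over all parameters.
    weights = torch.cat([p.abs().flatten() for p in model.parameters()])
    threshold = torch.quantile(weights, sparsity)
    with torch.no_grad():
        for p in model.parameters():
            p.mul_((p.abs() > threshold).float())  # zero out small weights

magnitude_prune(model)
total = sum(p.numel() for p in model.parameters())
zeros = sum((p == 0).sum().item() for p in model.parameters())
print(f"sparsity: {zeros / total:.2%}")
```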
Authors: Boheng Li, Renjie Gu, Junjie Wang, Leyi Qi, Yiming Li, Run Wang, Zhan Qin, Tianwei Zhang
Abstract: Text-to-image (T2I) diffusion models have achieved impressive image generation quality and are increasingly fine-tuned for personalized applications. However, these models often inherit unsafe behaviors from toxic pretraining data, raising growing safety concerns. While recent safety-driven unlearning methods have made promising progress in suppressing model toxicity, we find they are fragile to downstream fine-tuning: state-of-the-art methods largely fail to retain their effectiveness even when fine-tuned on entirely benign datasets. To mitigate this problem, in this paper, we propose ResAlign, a safety-driven unlearning framework with enhanced resilience against downstream fine-tuning. By modeling downstream fine-tuning as an implicit optimization problem with a Moreau Envelope-based reformulation, ResAlign enables efficient gradient estimation to minimize the recovery of harmful behaviors. Additionally, a meta-learning strategy is proposed to simulate a diverse distribution of fine-tuning scenarios to improve generalization. Extensive experiments across a wide range of datasets, fine-tuning methods, and configurations demonstrate that ResAlign consistently outperforms prior unlearning approaches in retaining safety after downstream fine-tuning while preserving benign generation capability well.
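For reference, the Moreau envelope underlying such a reformulation is standard; in our notation (the paper's exact objective may differ):

```latex
% Moreau envelope of f with smoothing parameter \lambda > 0, and its proximal map:
M_{\lambda f}(\theta) = \min_{\theta'} \; f(\theta') + \tfrac{1}{2\lambda}\lVert \theta' - \theta \rVert^2 ,
\qquad
\mathrm{prox}_{\lambda f}(\theta) = \arg\min_{\theta'} \; f(\theta') + \tfrac{1}{2\lambda}\lVert \theta' - \theta \rVert^2 .
```

Reading downstream fine-tuning as (approximately) a proximal step from the unlearned weights theta makes the recovery of harmful behavior differentiable through the envelope, which is the kind of structure that enables the efficient gradient estimation mentioned above.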
Authors: Xin-De Wang, Zhi-Rui Chen, Peng-Jie Guo, Ze-Feng Gao, Cheng Mu, Zhong-Yi Lu
Abstract: Perovskite solar cells (PSCs) have rapidly emerged as a leading contender in next-generation photovoltaic technologies, owing to their exceptional power conversion efficiencies and advantageous material properties. Despite these advances, challenges such as long-term stability, environmental sustainability, and scalable manufacturing continue to hinder their commercialization. Precursor additive engineering has shown promise in addressing these issues by enhancing both the performance and durability of PSCs. However, the explosive growth of scientific literature and the complex interplay of materials, processes, and device architectures make it increasingly difficult for researchers to efficiently access, organize, and utilize domain knowledge in this rapidly evolving field. To address this gap, we introduce Perovskite-R1, a specialized large language model (LLM) with advanced reasoning capabilities tailored for the discovery and design of PSC precursor additives. By systematically mining and curating 1,232 high-quality scientific publications and integrating a comprehensive library of 33,269 candidate materials, we constructed a domain-specific instruction-tuning dataset using automated question-answer generation and chain-of-thought reasoning. Fine-tuning the QwQ-32B model on this dataset resulted in Perovskite-R1, which can intelligently synthesize literature insights and generate innovative and practical solutions for defect passivation and the selection of precursor additives. Experimental validation of several model-proposed strategies confirms their effectiveness in improving material stability and performance. Our work demonstrates the potential of domain-adapted LLMs in accelerating materials discovery and provides a closed-loop framework for intelligent, data-driven advancements in perovskite photovoltaic research.
Authors: Boheng Li, Junjie Wang, Yiming Li, Zhiyang Hu, Leyi Qi, Jianshuo Dong, Run Wang, Han Qiu, Zhan Qin, Tianwei Zhang
Abstract: Despite the integration of safety alignment and external filters, text-to-image (T2I) generative models are still susceptible to producing harmful content, such as sexual or violent imagery. This raises serious concerns about unintended exposure and potential misuse. Red teaming, which aims to proactively identify diverse prompts that can elicit unsafe outputs from the T2I system (including the core generative model as well as potential external safety filters and other processing components), is increasingly recognized as an essential method for assessing and improving safety before real-world deployment. Yet, existing automated red teaming approaches often treat prompt discovery as an isolated, prompt-level optimization task, which limits their scalability, diversity, and overall effectiveness. To bridge this gap, in this paper, we propose DREAM, a scalable red teaming framework to automatically uncover diverse problematic prompts from a given T2I system. Unlike most prior works that optimize prompts individually, DREAM directly models the probabilistic distribution of the target system's problematic prompts, which enables explicit optimization over both effectiveness and diversity, and allows efficient large-scale sampling after training. To achieve this without direct access to representative training samples, we draw inspiration from energy-based models and reformulate the objective into simple and tractable objectives. We further introduce GC-SPSA, an efficient optimization algorithm that provides stable gradient estimates through the long and potentially non-differentiable T2I pipeline. The effectiveness of DREAM is validated through extensive experiments, demonstrating that it surpasses 9 state-of-the-art baselines by a notable margin across a broad range of T2I models and safety filters in terms of prompt success rate and diversity.
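GC-SPSA is not spelled out in the abstract, but it builds on the classic simultaneous perturbation estimator (Spall, 1992), sketched below; the guided/corrected components of GC-SPSA are not reproduced, and the objective is a stand-in.

```python
# Classic SPSA: estimate a full gradient from two evaluations of a possibly
# non-differentiable objective f, using a random Rademacher perturbation.
import numpy as np

rng = np.random.default_rng(0)

def spsa_gradient(f, theta: np.ndarray, c: float = 1e-2) -> np.ndarray:
    delta = rng.choice([-1.0, 1.0], size=theta.shape)
    # For Rademacher entries, 1/delta_i == delta_i, so multiplying suffices.
    return (f(theta + c * delta) - f(theta - c * delta)) / (2 * c) * delta

f = lambda th: float(np.sum(th ** 2))  # stand-in for the T2I pipeline objective
theta = np.ones(5)
print(spsa_gradient(f, theta))         # ~2 * theta in expectation
```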
Authors: Pengfei Cai, Yan Song, Qing Gu, Nan Jiang, Haoyu Song, Ian McLoughlin
Abstract: Most existing sound event detection (SED) algorithms operate under a closed-set assumption, restricting their detection capabilities to predefined classes. While recent efforts have explored language-driven zero-shot SED by exploiting audio-language models, their performance is still far from satisfactory due to the lack of fine-grained alignment and cross-modal feature fusion. In this work, we propose the Detect Any Sound Model (DASM), a query-based framework for open-vocabulary SED guided by multi-modal queries. DASM formulates SED as a frame-level retrieval task, where audio features are matched against query vectors derived from text or audio prompts. To support this formulation, DASM introduces a dual-stream decoder that explicitly decouples event recognition and temporal localization: a cross-modality event decoder performs query-feature fusion and determines the presence of sound events at the clip-level, while a context network models temporal dependencies for frame-level localization. Additionally, an inference-time attention masking strategy is proposed to leverage semantic relations between base and novel classes, substantially enhancing generalization to novel classes. Experiments on the AudioSet Strong dataset demonstrate that DASM effectively balances localization accuracy with generalization to novel classes, outperforming CLAP-based methods in the open-vocabulary setting (+7.8 PSDS) and the baseline in the closed-set setting (+6.9 PSDS). Furthermore, in cross-dataset zero-shot evaluation on DESED, DASM achieves a PSDS1 score of 42.2, even exceeding the supervised CRNN baseline. The project page is available at https://cai525.github.io/Transformer4SED/demo_page/DASM/.
URLs: https://cai525.github.io/Transformer4SED/demo_page/DASM/.
Authors: Yumeng Wang, Zengyi Wo, Wenjun Wang, Xingcheng Fu, Minglai Shao
Abstract: Graph Neural Networks (GNNs) excel in node classification tasks but often assume homophily, where connected nodes share similar labels. This assumption does not hold in many real-world heterophilic graphs. Existing models for heterophilic graphs primarily rely on pairwise relationships, overlooking multi-scale information from higher-order structures. This leads to suboptimal performance, particularly under noise from conflicting class information across nodes. To address these challenges, we propose HPGNN, a novel model integrating Higher-order Personalized PageRank with Graph Neural Networks. HPGNN introduces an efficient high-order approximation of Personalized PageRank (PPR) to capture long-range and multi-scale node interactions. This approach reduces computational complexity and mitigates noise from surrounding information. By embedding higher-order structural information into convolutional networks, HPGNN effectively models key interactions across diverse graph dimensions. Extensive experiments on benchmark datasets demonstrate HPGNN's effectiveness. The model achieves better performance than five out of seven state-of-the-art methods on heterophilic graphs in downstream tasks while maintaining competitive performance on homophilic graphs. HPGNN's ability to balance multi-scale information and robustness to noise makes it a versatile solution for real-world graph learning challenges. Codes are available at https://github.com/streetcorner/HPGNN.
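For intuition, plain personalized PageRank by power iteration is sketched below; HPGNN's contribution is an efficient higher-order approximation of this propagation, which the sketch does not reproduce.

```python
# Personalized PageRank (PPR) by power iteration on a small graph.
import numpy as np

def personalized_pagerank(A: np.ndarray, seed: int, alpha: float = 0.15,
                          iters: int = 50) -> np.ndarray:
    # Row-normalize the adjacency matrix into a transition matrix.
    P = A / np.maximum(A.sum(axis=1, keepdims=True), 1e-12)
    e = np.zeros(len(A))
    e[seed] = 1.0                              # restart distribution at the seed
    pi = e.copy()
    for _ in range(iters):
        pi = alpha * e + (1 - alpha) * pi @ P  # teleport + propagate
    return pi

A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
print(personalized_pagerank(A, seed=0).round(3))
```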
Authors: Tian Dong, Yan Meng, Shaofeng Li, Guoxing Chen, Zhen Liu, Haojin Zhu
Abstract: Large Language Models (LLMs) are increasingly integrated into daily routines, yet they raise significant privacy and safety concerns. Recent research proposes collaborative inference, which outsources the early-layer inference to ensure data locality, and introduces model safety auditing based on inner neuron patterns. Both techniques expose the LLM's Internal States (ISs), which are traditionally considered irreversible to inputs due to optimization challenges and the highly abstract representations in deep layers. In this work, we challenge this assumption by proposing four inversion attacks that significantly improve the semantic similarity and token matching rate of inverted inputs. Specifically, we first develop two white-box optimization-based attacks tailored for low-depth and high-depth ISs. These attacks avoid local minima convergence, a limitation observed in prior work, through a two-phase inversion process. Then, we extend our optimization attack under more practical black-box weight access by leveraging the transferability between the source and the derived LLMs. Additionally, we introduce a generation-based attack that treats inversion as a translation task, employing an inversion model to reconstruct inputs. Extensive evaluation of short and long prompts from medical consulting and coding assistance datasets and 6 LLMs validates the effectiveness of our inversion attacks. Notably, a 4,112-token long medical consulting prompt can be nearly perfectly inverted with 86.88 F1 token matching from the middle layer of Llama-3 model. Finally, we evaluate four practical defenses that we found cannot perfectly prevent ISs inversion and draw conclusions for future mitigation design.
Authors: Chenhao Yao, Zike Yuan, Xiaoxu Liu, Chi Zhu
Abstract: Multi-Agent Systems (MAS) excel at accomplishing complex objectives through the collaborative efforts of individual agents. Among the methodologies employed in MAS, Multi-Agent Reinforcement Learning (MARL) stands out as one of the most efficacious algorithms. However, when confronted with the complex objective of Formation Control with Collision Avoidance (FCCA), MARL faces a key difficulty: designing an effective reward function that facilitates swift convergence of the policy network to an optimal solution. In this paper, we introduce a novel framework that aims to overcome this challenge. By providing large language models (LLMs) with the prioritization of tasks and the observable information available to each agent, our framework generates reward functions that can be dynamically adjusted online based on evaluation outcomes, employing more advanced evaluation metrics rather than the rewards themselves. This mechanism enables the MAS to simultaneously achieve formation control and obstacle avoidance in dynamic environments with enhanced efficiency, requiring fewer iterations to reach superior performance levels. Our empirical studies, conducted in both simulation and real-world settings, validate the practicality and effectiveness of our proposed approach.
Authors: Sijin Yu, Zijiao Chen, Wenxuan Wu, Shengxian Chen, Zhongliang Liu, Jingxin Nie, Xiaofen Xing, Xiangmin Xu, Xin Zhang
Abstract: Reconstructing visual stimuli from human brain activity (e.g., fMRI) bridges neuroscience and computer vision by decoding neural representations. However, existing methods often overlook critical brain structure-function relationships, flattening spatial information and neglecting individual anatomical variations. To address these issues, we propose (1) a novel sphere tokenizer that explicitly models fMRI signals as spatially coherent 2D spherical data on the cortical surface; (2) integration of structural MRI (sMRI) data, enabling personalized encoding of individual anatomical variations; and (3) a positive-sample mixup strategy for efficiently leveraging multiple fMRI scans associated with the same visual stimulus. Collectively, these innovations enhance reconstruction accuracy, biological interpretability, and generalizability across individuals. Experiments demonstrate superior reconstruction performance compared to SOTA methods, highlighting the effectiveness and interpretability of our biologically informed approach.
Authors: Octavian M. Machidon
Abstract: In this paper, I examine the ethical and anthropological challenges posed by AI-driven recommender systems (RSs), which have become central to shaping digital environments and social interactions. By curating personalized content, RSs do not merely reflect user preferences but actively construct individual experiences across social media, entertainment platforms, and e-commerce. Despite their ubiquity, the ethical implications of RSs remain insufficiently explored, even as concerns over privacy, autonomy, and mental well-being intensify. I argue that existing ethical approaches, including algorethics, the effort to embed ethical principles into algorithmic design, are necessary but ultimately inadequate. RSs inherently reduce human complexity to quantifiable dimensions, exploit user vulnerabilities, and prioritize engagement over well-being. Addressing these concerns requires moving beyond purely technical solutions. I propose a comprehensive framework for human-centered RS design, integrating interdisciplinary perspectives, regulatory strategies, and educational initiatives to ensure AI systems foster rather than undermine human autonomy and societal flourishing.
Authors: Patrik Reizinger, Lester Mackey, Wieland Brendel, Rahul Krishnan
Abstract: The field of causal inference has developed a variety of methods to accurately estimate treatment effects in the presence of nuisance functions. Meanwhile, the field of identifiability theory has developed methods like Independent Component Analysis (ICA) to identify latent sources and mixing weights from data. While these two research communities have developed largely independently, they aim to achieve similar goals: the accurate and sample-efficient estimation of model parameters. In the partially linear regression (PLR) setting, Mackey et al. (2018) recently found that estimation consistency can be improved with non-Gaussian treatment noise. Non-Gaussianity is also a crucial assumption for identifying latent factors in ICA. We provide the first theoretical and empirical insights into this connection, showing that ICA can be used for causal effect estimation in the PLR model. Surprisingly, we find that linear ICA can accurately estimate multiple treatment effects even in the presence of Gaussian confounders or nonlinear nuisance.
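For reference, the PLR setting reads, in standard notation (symbols ours):

```latex
% Partially linear regression: outcome Y, treatment T, covariates X,
% treatment effect \theta_0, nuisance functions g and m:
Y = \theta_0\, T + g(X) + \varepsilon, \qquad T = m(X) + \eta, \qquad
\mathbb{E}[\varepsilon \mid T, X] = 0 .
```

Non-Gaussianity of the treatment noise eta is exactly the kind of assumption ICA exploits to identify latent sources, which is what links the two literatures.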
Authors: Sabrina Livanec, Laura Londo\~no, Michael Gorki, Adrian R\"ofer, Abhinav Valada, Andrea Kiesel
Abstract: The development of assistive robots for social collaboration raises critical questions about responsible and inclusive design, especially when interacting with individuals from protected groups such as those with disabilities or advanced age. Currently, research is scarce on how participants assess varying robot behaviors in combination with diverse human needs, likely since participants have limited real-world experience with advanced domestic robots. In the current study, we aim to address this gap while using methods that enable participants to assess robot behavior, as well as methods that support meaningful reflection despite limited experience. In an online study, 112 participants (from both experimental and control groups) evaluated 7 videos from a total of 28 variations of human-robot collaboration types. The experimental group first completed a cognitive-affective mapping (CAM) exercise on human-robot collaboration before providing their ratings. Although CAM reflection did not significantly affect overall ratings, it led to more pronounced assessments for certain combinations of robot behavior and human condition. Most importantly, the type of human-robot collaboration influences the assessment. Antisocial robot behavior was consistently rated as the lowest, while collaboration with aged individuals elicited more sensitive evaluations. Scenarios involving object handovers were viewed more positively than those without them. These findings suggest that both human characteristics and interaction paradigms influence the perceived acceptability of collaborative robots, underscoring the importance of prosocial design. They also highlight the potential of reflective methods, such as CAM, to elicit nuanced feedback, supporting the development of user-centered and socially responsible robotic systems tailored to diverse populations.
Authors: Zhenliang Zhang, Xinyu Hu, Huixuan Zhang, Junzhe Zhang, Xiaojun Wan
Abstract: Large language models (LLMs) excel at various natural language processing tasks, but their tendency to generate hallucinations undermines their reliability. Existing hallucination detection methods leveraging hidden states predominantly focus on static and isolated representations, overlooking their dynamic evolution across layers, which limits efficacy. To address this limitation, we shift the focus to the hidden state update process and introduce a novel metric, the ICR Score (Information Contribution to Residual Stream), which quantifies the contribution of modules to the hidden states' update. We empirically validate that the ICR Score is effective and reliable in distinguishing hallucinations. Building on these insights, we propose a hallucination detection method, the ICR Probe, which captures the cross-layer evolution of hidden states. Experimental results show that the ICR Probe achieves superior performance with significantly fewer parameters. Furthermore, ablation studies and case analyses offer deeper insights into the underlying mechanism of this method, improving its interpretability.
Authors: David G. Nagy, Tingke Shen, Hanqi Zhou, Charley M. Wu, Peter Dayan
Abstract: Humans flexibly construct internal models to navigate novel situations. To be useful, these internal models must be sufficiently faithful to the environment that resource-limited planning leads to adequate outcomes; equally, they must be tractable to construct in the first place. We argue that analogy plays a central role in these processes, enabling agents to reuse solution-relevant structure from past experiences and amortise the computational costs of both model construction (construal) and planning. Formalising analogies as partial homomorphisms between Markov decision processes, we sketch a framework in which abstract modules, derived from previous construals, serve as composable building blocks for new ones. This modular reuse allows for flexible adaptation of policies and representations across domains with shared structural essence.
Authors: Junying Wang, Zicheng Zhang, Yijin Guo, Farong Wen, Ye Shen, Yingji Liang, Yalun Wu, Wenzhe Li, Chunyi Li, Zijian Chen, Qi Jia, Guangtao Zhai
Abstract: As foundation models grow rapidly in capability and deployment, evaluating their scientific understanding becomes increasingly critical. Existing science benchmarks have made progress towards broad **Range**, wide **Reach**, and high **Rigor**, yet they often face two major challenges: **data leakage risks** that compromise benchmarking validity, and **evaluation inefficiency** due to large-scale testing. To address these issues, we introduce the **Ever-Evolving Science Exam (EESE)**, a dynamic benchmark designed to reliably assess scientific capabilities in foundation models. Our approach consists of two components: 1) a non-public **EESE-Pool** with over 100K expertly constructed science instances (question-answer pairs) across 5 disciplines and 500+ subfields, built through a multi-stage pipeline ensuring **Range**, **Reach**, and **Rigor**, 2) a periodically updated 500-instance subset **EESE**, sampled and validated to enable leakage-resilient, low-overhead evaluations. Experiments on 32 open- and closed-source models demonstrate that EESE effectively differentiates the strengths and weaknesses of models in scientific fields and cognitive dimensions. Overall, EESE provides a robust, scalable, and forward-compatible solution for science benchmark design, offering a realistic measure of how well foundation models handle science questions. The project page is at: https://github.com/aiben-ch/EESE.
Authors: Xiaoyan Wang, Zeju Li, Yifan Xu, Jiaxing Qi, Zhifei Yang, Ruifei Ma, Xiangde Liu, Chao Zhang
Abstract: A new era has unlocked exciting possibilities for extending Large Language Models (LLMs) to tackle 3D vision-language tasks. However, most existing 3D multimodal LLMs (MLLMs) rely on compressing holistic 3D scene information or segmenting independent objects to perform these tasks, which limits their spatial awareness due to insufficient representation of the richness inherent in 3D scenes. To overcome these limitations, we propose Spatial 3D-LLM, a 3D MLLM specifically designed to enhance spatial awareness for 3D vision-language tasks by enriching the spatial embeddings of 3D scenes. Spatial 3D-LLM integrates an LLM backbone with a progressive spatial awareness scheme that progressively captures spatial information as the perception field expands, generating location-enriched 3D scene embeddings to serve as visual prompts. Furthermore, we introduce two novel tasks: 3D object distance measurement and 3D layout editing, and construct a 3D instruction dataset, MODEL, to evaluate the model's spatial awareness capabilities. Experimental results demonstrate that Spatial 3D-LLM achieves state-of-the-art performance across a wide range of 3D vision-language tasks, revealing that the improvements stem from our progressive spatial awareness scheme, which mines more profound spatial information. Our code is available at https://github.com/bjshuyuan/Spatial-3D-LLM.
Authors: Abhash Kumar Jha, Shakiba Moradian, Arjun Krishnakumar, Martin Rapp, Frank Hutter
Abstract: Gradient-based one-shot neural architecture search (NAS) has significantly reduced the cost of exploring architectural spaces with discrete design choices, such as selecting operations within a model. However, the field faces two major challenges. First, evaluations of gradient-based NAS methods heavily rely on the DARTS benchmark, despite the existence of other available benchmarks. This overreliance has led to saturation, with reported improvements often falling within the margin of noise. Second, implementations of gradient-based one-shot NAS methods are fragmented across disparate repositories, complicating fair and reproducible comparisons and further development. In this paper, we introduce Configurable Optimizer (confopt), an extensible library designed to streamline the development and evaluation of gradient-based one-shot NAS methods. Confopt provides a minimal API that makes it easy for users to integrate new search spaces, while also supporting the decomposition of NAS optimizers into their core components. We use this framework to create a suite of new DARTS-based benchmarks, and combine them with a novel evaluation protocol to reveal a critical flaw in how gradient-based one-shot NAS methods are currently assessed. The code can be found at https://github.com/automl/ConfigurableOptimizer.
Authors: Shang Liu, Chenjie Cao, Chaohui Yu, Wen Qian, Jing Wang, Fan Wang
Abstract: Despite the remarkable developments achieved by recent 3D generation works, scaling these methods to geographic extents, such as modeling thousands of square kilometers of Earth's surface, remains an open challenge. We address this through a dual innovation in data infrastructure and model architecture. First, we introduce Aerial-Earth3D, the largest 3D aerial dataset to date, consisting of 50k curated scenes (each measuring 600m x 600m) captured across the U.S. mainland, comprising 45M multi-view Google Earth frames. Each scene provides pose-annotated multi-view images, depth maps, normals, semantic segmentation, and camera poses, with explicit quality control to ensure terrain diversity. Building on this foundation, we propose EarthCrafter, a tailored framework for large-scale 3D Earth generation via sparse-decoupled latent diffusion. Our architecture separates structural and textural generation: 1) Dual sparse 3D-VAEs compress high-resolution geometric voxels and textural 2D Gaussian Splats (2DGS) into compact latent spaces, largely alleviating the costly computation incurred by vast geographic scales while preserving critical information. 2) We propose condition-aware flow matching models trained on mixed inputs (semantics, images, or neither) to flexibly model latent geometry and texture features independently. Extensive experiments demonstrate that EarthCrafter performs substantially better in extremely large-scale generation. The framework further supports versatile applications, from semantic-guided urban layout generation to unconditional terrain synthesis, while maintaining geographic plausibility through our rich data priors from Aerial-Earth3D. Our project page is available at https://whiteinblue.github.io/earthcrafter/
Authors: Christian D. Blakely
Abstract: We propose a multilayered symbolic framework for general graph classification that leverages sparse binary hypervectors and Tsetlin Machines. Each graph is encoded through structured message passing, where node, edge, and attribute information are bound and bundled into a symbolic hypervector. This process preserves the hierarchical semantics of the graph through layered binding from node attributes to edge relations to structural roles, resulting in a compact, discrete representation. We also formulate a local interpretability framework, which gives our approach the key advantage of being locally interpretable. We validate our method on TUDataset benchmarks, demonstrating competitive accuracy with strong symbolic transparency compared to neural graph models.
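The bind-and-bundle operations can be illustrated with dense binary hypervectors as below; the paper uses sparse binary hypervectors with Tsetlin Machines, so treat this purely as a sketch of the symbolic encoding style, not the proposed method.

```python
# Classic hyperdimensional computing primitives: XOR binding associates two
# symbols; majority-vote bundling superposes several bound pairs.
import numpy as np

rng = np.random.default_rng(0)
DIM = 10_000

def random_hv() -> np.ndarray:
    return rng.integers(0, 2, size=DIM, dtype=np.uint8)

def bind(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    return a ^ b  # XOR binding

def bundle(hvs) -> np.ndarray:
    return (np.sum(hvs, axis=0) > len(hvs) / 2).astype(np.uint8)  # majority vote

node_attr, edge_rel, role = random_hv(), random_hv(), random_hv()
# Layered binding (attribute -> relation -> role), bundled into one
# symbolic graph representation.
graph_hv = bundle([bind(node_attr, edge_rel), bind(edge_rel, role)])
print(graph_hv.shape)
```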
Authors: Zhengyu Wu, Xunkai Li, Yinlin Zhu, Zekai Chen, Guochen Yan, Yanyu Yan, Hao Zhang, Yuming Ai, Xinmo Jin, Rong-Hua Li, Guoren Wang
Abstract: In the era of big data applications, Federated Graph Learning (FGL) has emerged as a prominent solution that reconciles the tradeoff between optimizing the collective intelligence of decentralized data holders and maximally preserving sensitive information. Existing FGL surveys have contributed meaningfully but largely focus on integrating Federated Learning (FL) and Graph Machine Learning (GML), resulting in early-stage taxonomies that emphasize methodology and simulated scenarios. Notably, a data-centric perspective, which systematically examines FGL methods through the lens of data properties and usage, has yet to be adopted to reorganize FGL research, even though it is critical for assessing how FGL studies tackle data-centric constraints to enhance model performance. This survey proposes a two-level data-centric taxonomy: Data Characteristics, which categorizes studies based on the structural and distributional properties of datasets used in FGL, and Data Utilization, which analyzes the training procedures and techniques employed to overcome key data-centric challenges. Each taxonomy level is defined by three orthogonal criteria, each representing a distinct data-centric configuration. Beyond taxonomy, this survey examines FGL integration with Pretrained Large Models, showcases realistic applications, and highlights future directions aligned with emerging trends in GML.
Authors: Jon Guti\'errez-Zaballa, Koldo Basterretxea, Javier Echanobe
Abstract: The use of HSI for autonomous navigation is a promising research field aimed at improving the accuracy and robustness of detection, tracking, and scene understanding systems based on vision sensors. Combining advanced computer algorithms, such as DNNs, with small-size snapshot HSI cameras enhances the reliability of these systems. HSI overcomes intrinsic limitations of greyscale and RGB imaging in depicting physical properties of targets, particularly regarding spectral reflectance and metamerism. Despite promising results in HSI-based vision developments, safety-critical systems like ADS demand strict constraints on latency, resource consumption, and security, motivating the shift of ML workloads to edge platforms. This involves a thorough software/hardware co-design scheme to distribute and optimize the tasks efficiently among the limited resources of computing platforms. With respect to inference, the over-parameterized nature of DNNs poses significant computational challenges for real-time on-the-edge deployment. In addition, the intensive data preprocessing required by HSI, which is frequently overlooked, must be carefully managed in terms of memory arrangement and inter-task communication to enable an efficient integrated pipeline design on an SoC. This work presents a set of optimization techniques for the practical co-design of a DNN-based HSI segmentation processor deployed on an FPGA-based SoC targeted at ADS, including key optimizations such as functional software/hardware task distribution, hardware-aware preprocessing, ML model compression, and a complete pipelined deployment. Applied compression techniques significantly reduce the complexity of the designed DNN to 24.34% of the original operations and to 1.02% of the original number of parameters, achieving a 2.86x speed-up in the inference task without noticeable degradation of the segmentation accuracy.
Authors: Megha Quamara, Viktor Schmuck, Cristina Iani, Axel Primavesi, Alexander Plaum, Luca Vigano
Abstract: In this paper, we present the findings of a user study that evaluated the social acceptance of eXtended Reality (XR) agent technology, focusing on a remotely accessible, web-based XR training system developed for journalists. This system involves user interaction with a virtual avatar, enabled by a modular toolkit. The interactions are designed to provide tailored training for journalists in digital-remote settings, especially for sensitive or dangerous scenarios, without requiring specialized end-user equipment like headsets. Our research adapts and extends the Almere model, representing social acceptance through existing attributes such as perceived ease of use and perceived usefulness, along with added ones like dependability and security in the user-agent interaction. The XR agent was tested through a controlled experiment in a real-world setting, with data collected on users' perceptions. Our findings, based on quantitative and qualitative measurements involving questionnaires, contribute to the understanding of user perceptions and acceptance of XR agent solutions within a specific social context, while also identifying areas for the improvement of XR systems.
Authors: Yuxuan He, Xiaoran Yang, Ningning Pan, Gongping Huang
Abstract: Most existing text-to-audio (TTA) generation methods produce mono outputs, neglecting essential spatial information for immersive auditory experiences. To address this issue, we propose a cascaded method for text-to-multisource binaural audio generation (TTMBA) with both temporal and spatial control. First, a pretrained large language model (LLM) segments the text into a structured format with time and spatial details for each sound event. Next, a pretrained mono audio generation network creates multiple mono audios with varying durations for each event. These mono audios are transformed into binaural audios using a binaural rendering neural network based on spatial data from the LLM. Finally, the binaural audios are arranged by their start times, resulting in multisource binaural audio. Experimental results demonstrate the superiority of the proposed method in terms of both audio generation quality and spatial perceptual accuracy.
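As a rough illustration of the final arrangement step, the sketch below places per-event binaural clips on a shared timeline via overlap-add, ordered by their LLM-assigned start times. The sample rate, event format, and function name are assumptions rather than the paper's interface; the upstream LLM segmentation and binaural rendering stages are elided.

```python
import numpy as np

SR = 48_000  # sample rate in Hz (assumed)

def arrange_events(events, total_seconds):
    """Place stereo (binaural) event clips on a shared timeline.

    events: list of (start_time_s, clip) pairs, clip shaped (n_samples, 2).
    Concurrent sources are mixed by simple overlap-add.
    """
    out = np.zeros((int(total_seconds * SR), 2), dtype=np.float32)
    for start, clip in events:
        i = int(start * SR)
        j = min(i + len(clip), len(out))
        out[i:j] += clip[: j - i]  # add this source into the mix
    return out

# Hypothetical usage with one rendered binaural clip starting at 0.5 s:
# mix = arrange_events([(0.5, binaural_clip)], total_seconds=10.0)
```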
Authors: G. de Rom\'emont, F. Renac, F. Chinesta, J. Nunez, D. Gueyffier
Abstract: We present a novel data-driven approach for enhancing gradient reconstruction in unstructured finite volume methods for hyperbolic conservation laws, specifically for the 2D Euler equations. Our approach extends previous structured-grid methodologies to unstructured meshes through a modified DeepONet architecture that incorporates local geometry in the neural network. The architecture employs local mesh topology to ensure rotation invariance, while also ensuring first-order constraint on the learned operator. The training methodology incorporates physics-informed regularization through entropy penalization, total variation diminishing penalization, and parameter regularization to ensure physically consistent solutions, particularly in shock-dominated regions. The model is trained on high-fidelity datasets solutions derived from sine waves and randomized piecewise constant initial conditions with periodic boundary conditions, enabling robust generalization to complex flow configurations or geometries. Validation test cases from the literature, including challenging geometry configuration, demonstrates substantial improvements in accuracy compared to traditional second-order finite volume schemes. The method achieves gains of 20-60% in solution accuracy while enhancing computational efficiency. A convergence study has been conveyed and reveal improved mesh convergence rates compared to the conventional solver. The proposed algorithm is faster and more accurate than the traditional second-order finite volume solver, enabling high-fidelity simulations on coarser grids while preserving the stability and conservation properties essential for hyperbolic conservation laws. This work is a part of a new generation of solvers that are built by combining Machine-Learning (ML) tools with traditional numerical schemes, all while ensuring physical constraint on the results.
Authors: Xiaojiao Xiao, Qinmin Vivian Hu, Guanghui Wang
Abstract: Medical image synthesis plays a crucial role in clinical workflows, addressing the common issue of missing imaging modalities due to factors such as extended scan times, scan corruption, artifacts, patient motion, and intolerance to contrast agents. The paper presents a novel image synthesis network, the Pyramid Hierarchical Masked Diffusion Model (PHMDiff), which employs a multi-scale hierarchical approach for more detailed control over synthesizing high-quality images across different resolutions and layers. Specifically, this model utilizes random multi-scale, high-proportion masks to speed up diffusion model training and to balance detail fidelity against overall structure. The Transformer-based diffusion process incorporates cross-granularity regularization, modeling the mutual information consistency across each granularity's latent spaces and thereby enhancing pixel-level perceptual accuracy. Comprehensive experiments on two challenging datasets demonstrate that PHMDiff achieves superior performance in both Peak Signal-to-Noise Ratio (PSNR) and Structural Similarity Index Measure (SSIM), highlighting its capability to produce high-quality synthesized images with excellent structural integrity. Ablation studies further confirm the contributions of each component. Furthermore, the PHMDiff model, a multi-scale image synthesis framework across and within medical imaging modalities, shows significant advantages over other methods. The source code is available at https://github.com/xiaojiao929/PHMDiff
Authors: Choro Ulan Uulu, Mikhail Kulyabin, Layan Etaiwi, Nuno Miguel Martins Pacheco, Jan Joosten, Kerstin R\"ose, Filippos Petridis, Jan Bosch, Helena Holmstr\"om Olsson
Abstract: Computer-Aided Engineering (CAE) enables simulation experts to optimize complex models, but faces challenges in user experience (UX) that limit efficiency and accessibility. While artificial intelligence (AI) has demonstrated potential to enhance CAE processes, research integrating these fields with a focus on UX remains fragmented. This paper presents a multivocal literature review (MLR) examining how AI enhances UX in CAE software across both academic research and industry implementations. Our analysis reveals significant gaps between academic explorations and industry applications, with companies actively implementing LLMs, adaptive UIs, and recommender systems while academic research focuses primarily on technical capabilities without UX validation. Key findings demonstrate opportunities in AI-powered guidance, adaptive interfaces, and workflow automation that remain underexplored in current research. By mapping the intersection of these domains, this study provides a foundation for future work to address the identified research gaps and advance the integration of AI to improve CAE user experience.
Authors: Zied Jenhani, Mounir Bensalem, Jasenka Dizdarevi\'c, Admela Jukan
Abstract: Running deep learning inference directly on ultra-low-power edge/IoT nodes has been limited by the tight memory and compute budgets of microcontrollers. Split learning (SL) addresses this limitation by executing part of the inference process on the sensor and off-loading the remainder to a companion device. In the context of constrained devices and the related impact of low-power, over-the-air transport protocols, the performance of split learning remains largely unexplored. To the best of our knowledge, this paper presents the first end-to-end TinyML + SL testbed built on Espressif ESP32-S3 boards, designed to benchmark the over-the-air performance of split learning TinyML in edge/IoT environments. We benchmark the performance of a MobileNetV2 image recognition model, which is quantized to 8-bit integers, partitioned, and delivered to the nodes via over-the-air updates. The intermediate activations are exchanged through different wireless communication methods: ESP-NOW, BLE, and traditional UDP/IP and TCP/IP, enabling a head-to-head comparison on identical hardware. Measurements show that splitting the model after the block_16_project_BN layer generates a 5.66 kB tensor that traverses the link in 3.2 ms when UDP is used, achieving a steady-state round-trip latency of 5.8 s. ESP-NOW presents the most favorable RTT performance of 3.7 s; BLE extends battery life further but increases latency beyond 10 s.
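A shape-level sketch of the sensor-to-companion hop over UDP is shown below: the sensor sends a quantized intermediate activation and times the round trip until the companion's result returns. The peer address, port, and tensor shape (chosen so 7x7x116 int8 values come to roughly 5.66 kB) are hypothetical, and the real testbed runs this on ESP32-S3 firmware rather than desktop Python.

```python
import socket
import time
import numpy as np

PEER = ("192.168.4.1", 5005)  # hypothetical companion-device address/port

def send_activation_udp(tensor):
    """Send an 8-bit intermediate activation over UDP and time one round trip."""
    payload = tensor.astype(np.int8).tobytes()
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.settimeout(5.0)
    t0 = time.perf_counter()
    sock.sendto(payload, PEER)
    reply, _ = sock.recvfrom(4096)  # companion returns the tail's prediction
    return time.perf_counter() - t0, reply

# e.g. an activation split after an intermediate bottleneck layer (~5.66 kB):
# rtt, pred = send_activation_udp(np.zeros((7, 7, 116), dtype=np.int8))
```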
Authors: Armin Berger, Lars Hillebrand, David Leonhard, Tobias Deu{\ss}er, Thiago Bell Felix de Oliveira, Tim Dilmaghani, Mohamed Khaled, Bernd Kliem, R\"udiger Loitz, Christian Bauckhage, Rafet Sifa
Abstract: The auditing of financial documents, historically a labor-intensive process, stands on the precipice of transformation. AI-driven solutions have made inroads into streamlining this process by recommending pertinent text passages from financial reports to align with the legal requirements of accounting standards. However, a glaring limitation remains: these systems commonly fall short in verifying whether the recommended excerpts indeed comply with the specific legal mandates. Hence, in this paper, we probe the efficiency of publicly available Large Language Models (LLMs) in the realm of regulatory compliance across different model configurations. We place particular emphasis on comparing cutting-edge open-source LLMs, such as Llama-2, with their proprietary counterparts like OpenAI's GPT models. This comparative analysis leverages two custom datasets provided by our partner PricewaterhouseCoopers (PwC) Germany. We find that the open-source Llama-2 70 billion model demonstrates outstanding performance in detecting non-compliance or true negative occurrences, beating all of its proprietary counterparts. Nevertheless, proprietary models such as GPT-4 perform the best in a broad variety of scenarios, particularly in non-English contexts.
Authors: Yujin Han, Hao Chen, Andi Han, Zhiheng Wang, Xinyu Lin, Yingya Zhang, Shiwei Zhang, Difan Zou
Abstract: Despite efforts to unify multimodal generation and understanding tasks in a single model, we show these MLLMs exhibit self-contradiction, where generation produces images deemed misaligned with input prompts based on the model's own understanding. We define a Nonunified score that quantifies such self-contradiction. Our empirical results reveal that the self-contradiction mainly arises from weak generation that fails to align with prompts, rather than from misunderstanding. This capability asymmetry indicates the potential of leveraging self-contradiction for self-improvement, where the stronger model understanding guides the weaker generation to mitigate the generation-understanding gap. Applying standard post-training methods (e.g., SFT, DPO) with such internal supervision successfully improves both generation and unification. We discover a co-improvement effect on both generation and understanding when only fine-tuning the generation branch, a phenomenon known in pre-training but underexplored in post-training. Our analysis shows improvements stem from better detection of false positives that were previously incorrectly identified as prompt-aligned. Theoretically, we show the aligned training dynamics between generation and understanding allow reduced prompt-misaligned generations to also improve mismatch detection in the understanding branch. Additionally, the framework reveals a potential risk of co-degradation under poor supervision -- an overlooked phenomenon that is empirically validated in our experiments. Notably, we find intrinsic metrics like the Nonunified score cannot distinguish co-degradation from co-improvement, which highlights the necessity of data quality checks. Finally, we propose a curriculum-based strategy based on our findings that gradually introduces harder samples as the model improves, leading to better unification and improved MLLM generation and understanding.
Authors: Yushang Zhao, Huijie Shen, Dannier Li, Lu Chang, Chengrui Zhou, Yinuo Yang
Abstract: Generative, explainable, and flexible recommender systems derived using Large Language Models (LLMs) are promising but poorly adapted to the cold-start user situation, where there is little to no interaction history. Current solutions, i.e., supervised fine-tuning and collaborative filtering, are dense user-item focused and expensive to maintain and update. This paper introduces a meta-learning framework that performs parameter-efficient prompt-tuning to quickly personalize LLM-based recommender systems at cold-start. The model learns soft prompt embeddings with first-order (Reptile) and second-order (MAML) optimization by treating each user as a task. As augmentations to the input tokens, these learnable vectors are differentiable control variables that represent user behavioral priors. The prompts are meta-optimized through episodic sampling, inner-loop adaptation, and outer-loop generalization. On MovieLens-1M, Amazon Reviews, and RecBole, our adaptive model outperforms strong baselines in NDCG@10, HR@10, and MRR, and runs in real time (below 300 ms) on consumer GPUs. This scalable solution also supports zero-history personalization. Crucially, its 275 ms adaptation capability can enable real-time risk profiling for financial institutions, reducing systemic vulnerability detection latency significantly versus traditional compliance checks; by helping prevent contagion in payment networks (e.g., Fedwire), the framework strengthens national financial infrastructure resilience.
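The first-order (Reptile) variant of the meta-learning loop can be sketched in a few lines: adapt a copy of the soft prompt on each sampled user (task) in an inner loop, then move the meta-prompt toward the mean of the adapted copies. The prompt shape, learning rates, and the placeholder gradient function below are assumptions; a real system would backpropagate the recommendation loss through the frozen LLM.

```python
import numpy as np

rng = np.random.default_rng(0)
PROMPT_TOKENS, DIM = 4, 32  # soft prompt shape (assumed)
prompt = rng.normal(0, 0.02, (PROMPT_TOKENS, DIM))  # meta-initialization

def user_loss_grad(theta, user_batch):
    """Placeholder for the gradient of the recommendation loss w.r.t.
    the soft prompt on one user's interaction batch."""
    return 0.01 * theta  # stand-in for backprop through the frozen LLM

def reptile_step(prompt, user_batches, inner_steps=3, inner_lr=0.1, meta_lr=0.5):
    """One Reptile outer step: per-user inner-loop adaptation, then an
    outer-loop move toward the average adapted prompt."""
    adapted = []
    for batch in user_batches:
        theta = prompt.copy()
        for _ in range(inner_steps):  # inner-loop adaptation
            theta -= inner_lr * user_loss_grad(theta, batch)
        adapted.append(theta)
    # outer-loop generalization: interpolate toward the adapted solutions
    return prompt + meta_lr * (np.mean(adapted, axis=0) - prompt)

# prompt = reptile_step(prompt, sampled_user_batches)
```

MAML would replace the interpolation with a second-order gradient through the inner loop; Reptile's appeal here is that it needs only first-order information.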
Authors: Han Jiang, Dongyao Zhu, Zhihua Wei, Xiaoyuan Yi, Ziang Xiao, Xing Xie
Abstract: In-Context Learning has shown great potential for aligning Large Language Models (LLMs) with human values, helping reduce harmful outputs and accommodate diverse preferences without costly post-training, known as In-Context Alignment (ICA). However, LLMs' comprehension of input prompts remains agnostic, limiting ICA's ability to address value tensions--human values are inherently pluralistic, often imposing conflicting demands, e.g., stimulation vs. tradition. Current ICA methods therefore face the Instruction Bottleneck challenge, where LLMs struggle to reconcile multiple intended values within a single prompt, leading to incomplete or biased alignment. To address this, we propose PICACO, a novel pluralistic ICA method. Without fine-tuning, PICACO optimizes a meta-instruction that navigates multiple values to better elicit LLMs' understanding of them and improve their alignment. This is achieved by maximizing the total correlation between specified values and LLM responses, theoretically reinforcing value correlation while reducing distractive noise, resulting in effective value instructions. Extensive experiments on five value sets show that PICACO works well with both black-box and open-source LLMs, outperforms several recent strong baselines, and achieves a better balance across up to 8 distinct values.
Authors: Lars Hillebrand, David Biesner, Christian Bauckhage, Rafet Sifa
Abstract: The DEDICOM algorithm provides a uniquely interpretable matrix factorization method for symmetric and asymmetric square matrices. We employ a new row-stochastic variation of DEDICOM on the pointwise mutual information matrices of text corpora to identify latent topic clusters within the vocabulary and simultaneously learn interpretable word embeddings. We introduce a method to efficiently train a constrained DEDICOM algorithm and provide a qualitative evaluation of its topic modeling and word embedding performance.
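For intuition, here is a minimal gradient-descent sketch of row-stochastic DEDICOM: X is approximated as A R A^T, with A kept row-stochastic via a softmax parameterization. Plain full-batch gradient descent on the squared reconstruction error stands in for the paper's efficient constrained training method; sizes and step sizes are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax_rows(Z):
    E = np.exp(Z - Z.max(axis=1, keepdims=True))
    return E / E.sum(axis=1, keepdims=True)

def fit_dedicom(X, k, steps=2000, lr=0.05):
    """Fit X ~= A R A^T with A = softmax(Z) row-stochastic, minimizing
    L = ||A R A^T - X||_F^2 / 2 by gradient descent."""
    n = X.shape[0]
    Z = rng.normal(0, 0.1, (n, k))  # logits of the loading matrix A
    R = rng.normal(0, 0.1, (k, k))  # (possibly asymmetric) affinity matrix
    for _ in range(steps):
        A = softmax_rows(Z)
        E = A @ R @ A.T - X              # reconstruction residual
        gA = E @ A @ R.T + E.T @ A @ R   # dL/dA
        # chain rule through the row-wise softmax
        gZ = A * (gA - (gA * A).sum(axis=1, keepdims=True))
        R -= lr * (A.T @ E @ A)          # dL/dR
        Z -= lr * gZ
    return softmax_rows(Z), R
```

Run on a word-by-word PMI matrix, each row of A reads as a distribution of that word over k latent topics, which is the source of the interpretability claim.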
Authors: Pingyi Fan, Anbai Jiang, Shuwei Zhang, Zhiqiang Lv, Bing Han, Xinhu Zheng, Wenrui Liang, Junjie Li, Wei-Qiang Zhang, Yanmin Qian, Xie Chen, Cheng Lu, Jia Liu
Abstract: With the rapid deployment of SCADA systems, effectively analyzing industrial signals and detecting abnormal states is an urgent need for the industry. Due to the significant heterogeneity of these signals, which we summarize as the M5 problem, previous works only focus on small sub-problems and employ specialized models, failing to utilize the synergies between modalities and the powerful scaling law. However, we argue that the M5 signals can be modeled in a unified manner due to their intrinsic similarity. As a result, we propose FISHER, a Foundation model for multi-modal Industrial Signal compreHEnsive Representation. To support arbitrary sampling rates, FISHER considers the increment of sampling rate as the concatenation of sub-band information. Specifically, FISHER takes the STFT sub-band as the modeling unit and adopts a teacher-student SSL framework for pre-training. We also develop the RMIS benchmark, which evaluates the representations of M5 industrial signals on multiple health management tasks. Compared with top SSL models, FISHER showcases versatile and outstanding capabilities with a general performance gain of up to 5.03%, along with much more efficient scaling curves. We also investigate the scaling law on downstream tasks and derive potential avenues for future works. FISHER is now open-sourced on https://github.com/jianganbai/FISHER
Authors: Viktor Muryn, Marta Sumyk, Mariya Hirna, Sofiya Garkot, Maksym Shamrai
Abstract: Desktop accessibility metadata enables AI agents to interpret screens and supports users who depend on tools like screen readers. Yet, many applications remain largely inaccessible due to incomplete or missing metadata provided by developers - our investigation shows that only 33% of applications on macOS offer full accessibility support. While recent work on structured screen representation has primarily addressed specific challenges, such as UI element detection or captioning, none has attempted to capture the full complexity of desktop interfaces by replicating their entire hierarchical structure. To bridge this gap, we introduce Screen2AX, the first framework to automatically create real-time, tree-structured accessibility metadata from a single screenshot. Our method uses vision-language and object detection models to detect, describe, and organize UI elements hierarchically, mirroring macOS's system-level accessibility structure. To tackle the limited availability of data for macOS desktop applications, we compiled and publicly released three datasets encompassing 112 macOS applications, each annotated for UI element detection, grouping, and hierarchical accessibility metadata alongside corresponding screenshots. Screen2AX accurately infers hierarchy trees, achieving a 77% F1 score in reconstructing a complete accessibility tree. Crucially, these hierarchy trees improve the ability of autonomous agents to interpret and interact with complex desktop interfaces. We introduce Screen2AX-Task, a benchmark specifically designed for evaluating autonomous agent task execution in macOS desktop environments. Using this benchmark, we demonstrate that Screen2AX delivers a 2.2x performance improvement over native accessibility representations and surpasses the state-of-the-art OmniParser V2 system on the ScreenSpot benchmark.
Authors: Lars Hillebrand, Armin Berger, Daniel Uedelhoven, David Berghaus, Ulrich Warning, Tim Dilmaghani, Bernd Kliem, Thomas Schmid, R\"udiger Loitz, Rafet Sifa
Abstract: Risk and Quality (R&Q) assurance in highly regulated industries requires constant navigation of complex regulatory frameworks, with employees handling numerous daily queries demanding accurate policy interpretation. Traditional methods relying on specialized experts create operational bottlenecks and limit scalability. We present a novel Retrieval Augmented Generation (RAG) system leveraging Large Language Models (LLMs), hybrid search and relevance boosting to enhance R&Q query processing. Evaluated on 124 expert-annotated real-world queries, our actively deployed system demonstrates substantial improvements over traditional RAG approaches. Additionally, we perform an extensive hyperparameter analysis to compare and evaluate multiple configuration setups, delivering valuable insights to practitioners.
Authors: Guowei Lan, Kaixian Qu, Ren\'e Zurbr\"ugg, Changan Chen, Christopher E. Mower, Haitham Bou-Ammar, Marco Hutter
Abstract: Vision-language models (VLMs) have been widely adopted in robotics to enable autonomous planning. However, grounding VLMs, originally trained on internet data, to diverse real-world robots remains a challenge. This paper presents ExpTeach, a framework that grounds VLMs to physical robots by building a self-generated memory of real-world experiences. In ExpTeach, the VLM autonomously plans actions, verifies outcomes, reflects on failures, and adapts robot behaviors in a closed loop. The self-generated experiences during this process are then summarized into a long-term memory, enabling retrieval of learned knowledge to guide future tasks via retrieval-augmented generation (RAG). Additionally, ExpTeach enhances the spatial understanding of VLMs with an on-demand image annotation module. In experiments, we show that reflection improves success rates from 36% to 84% on four challenging robotic tasks and observe the emergence of intelligent object interactions, including creative tool use. Across extensive tests on 12 real-world scenarios (including eight unseen ones), we find that grounding with long-term memory boosts single-trial success rates from 22% to 80%, demonstrating the effectiveness and generalizability of ExpTeach.
Authors: Yilong Xu, Xiang Long, Zhi Zheng, Jinhua Gao
Abstract: Agentic search, as a more autonomous and adaptive paradigm of retrieval augmentation, is driving the evolution of intelligent search systems. However, existing evaluation frameworks fail to align well with the goals of agentic search. First, the complex queries commonly used in current benchmarks often deviate from realistic user search scenarios. Second, prior approaches tend to introduce noise when extracting ground truth for end-to-end evaluations, leading to distorted assessments at a fine-grained level. Third, most current frameworks focus solely on the quality of final answers, neglecting the evaluation of the iterative process inherent to agentic search. To address these limitations, we propose RAVine -- a Reality-Aligned eValuation framework for agentic LLMs with search. RAVine targets multi-point queries and long-form answers that better reflect user intents, and introduces an attributable ground truth construction strategy to enhance the accuracy of fine-grained evaluation. Moreover, RAVine examines the model's interaction with search tools throughout the iterative process, and accounts for factors of efficiency. We benchmark a series of models using RAVine and derive several insights, which we hope will contribute to advancing the development of agentic search systems. The code and datasets are available at https://github.com/SwordFaith/RAVine.
Authors: Laura Moradbakhti, Dorian Peters, Jennifer K. Quint, Bj\"orn Schuller, Darren Cook, Rafael A. Calvo
Abstract: Asthma-related deaths in the UK are the highest in Europe, and only 30% of patients access basic care. There is a need for alternative approaches to reaching people with asthma in order to provide health education, self-management support and bridges to care. Automated conversational agents (specifically, mobile chatbots) present opportunities for providing alternative and individually tailored access to health education, self-management support and risk self-assessment. But would patients engage with a chatbot, and what factors influence engagement? We present results from a patient survey (N=1257) devised by a team of asthma clinicians, patients, and technology developers, conducted to identify optimal factors for efficacy, value and engagement for a chatbot. Results indicate that most adults with asthma (53%) are interested in using a chatbot and the patients most likely to do so are those who believe their asthma is more serious and who are less confident about self-management. Results also indicate enthusiasm for 24/7 access, personalisation, and for WhatsApp as the preferred access method (compared to app, voice assistant, SMS or website). Obstacles to uptake include security/privacy concerns and skepticism of technological capabilities. We present detailed findings and consolidate these into 7 recommendations for developers for optimising efficacy of chatbot-based health support.
Authors: Fangjian Lei, Mariam El Mezouar, Shayan Noei, Ying Zou
Abstract: Large Language Models (LLMs) have shown promise in assisting developers with code-related questions; however, LLMs carry the risk of generating unreliable answers. To address this, Retrieval-Augmented Generation (RAG) has been proposed to reduce the unreliability (i.e., hallucinations) of LLMs. However, designing effective pipelines remains challenging due to numerous design choices. In this paper, we construct a retrieval corpus of over 3 million Java- and Python-related Stack Overflow posts with accepted answers, and explore various RAG pipeline designs to answer developer questions, evaluating their effectiveness in generating accurate and reliable responses. More specifically, we (1) design and evaluate 7 different RAG pipelines and 63 pipeline variants to answer questions that have historically similar matches, and (2) address new questions without any close prior matches by automatically lowering the similarity threshold during retrieval, thereby increasing the chance of finding partially relevant context and improving coverage for unseen cases. We find that a RAG pipeline combining hypothetical document embedding (HyDE) with the full-answer context performs best in retrieving similar content and answering Stack Overflow questions. Finally, we apply our optimal RAG pipeline to 4 open-source LLMs and compare the results to their zero-shot performance. Our findings show that our optimal RAG pipeline consistently outperforms zero-shot baselines across models, achieving higher scores for helpfulness, correctness, and detail with LLM-as-a-judge. These findings demonstrate that our optimal RAG pipelines robustly enhance answer quality for a wide range of developer queries, including both previously seen and novel questions, across different LLMs.
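The automatic threshold-lowering idea for novel questions can be sketched as a simple retrieval loop: if no post clears the current cosine-similarity threshold, relax the threshold and retry until a floor is reached. The start, floor, and step values below are made-up placeholders, not the paper's tuned settings.

```python
import numpy as np

def retrieve(query_vec, corpus_vecs, start=0.8, floor=0.3, step=0.1, k=5):
    """Cosine-similarity retrieval with a relaxing threshold, mirroring
    the fallback for questions without close prior matches."""
    q = query_vec / np.linalg.norm(query_vec)
    C = corpus_vecs / np.linalg.norm(corpus_vecs, axis=1, keepdims=True)
    sims = C @ q
    threshold = start
    while threshold >= floor:
        hits = np.where(sims >= threshold)[0]
        if len(hits) > 0:  # found at least partially relevant context
            top = hits[np.argsort(sims[hits])[::-1]][:k]
            return top, threshold
        threshold -= step  # no close match: relax and retry
    return np.array([], dtype=int), threshold
```

An empty result after the floor is reached would signal that the pipeline should fall back to pure zero-shot generation.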
Authors: Helena Casademunt, Caden Juang, Adam Karvonen, Samuel Marks, Senthooran Rajamanoharan, Neel Nanda
Abstract: Fine-tuning large language models (LLMs) can lead to unintended out-of-distribution generalization. Standard approaches to this problem rely on modifying training data, for example by adding data that better specify the intended generalization. However, this is not always practical. We introduce Concept Ablation Fine-Tuning (CAFT), a technique that leverages interpretability tools to control how LLMs generalize from fine-tuning, without needing to modify the training data or otherwise use data from the target distribution. Given a set of directions in an LLM's latent space corresponding to undesired concepts, CAFT works by ablating these concepts with linear projections during fine-tuning, steering the model away from unintended generalizations. We successfully apply CAFT to three fine-tuning tasks, including emergent misalignment, a phenomenon where LLMs fine-tuned on a narrow task generalize to give egregiously misaligned responses to general questions. Without any changes to the fine-tuning data, CAFT reduces misaligned responses by 10x without degrading performance on the training distribution. Overall, CAFT represents a novel approach for steering LLM generalization without modifying training data.
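Mechanically, ablating a concept direction v amounts to the linear projection h &lt;- h - (h.v)v applied to activations during fine-tuning. Below is a hedged PyTorch sketch using a forward hook; the layer choice, the assumption that the concept directions are orthonormal, and the Hugging Face-style module path in the comment are illustrative rather than the paper's exact setup.

```python
import torch

def make_ablation_hook(directions: torch.Tensor):
    """Forward hook that projects activations off the span of the
    undesired concept directions (rows of `directions`, assumed
    orthonormal, shape (n_concepts, hidden_dim))."""
    V = directions

    def hook(module, inputs, output):
        h = output[0] if isinstance(output, tuple) else output
        h = h - (h @ V.T) @ V  # remove the component along each direction
        return (h, *output[1:]) if isinstance(output, tuple) else h

    return hook

# Hypothetical usage on one transformer block of a Hugging Face model:
# layer = model.model.layers[10]
# handle = layer.register_forward_hook(make_ablation_hook(V))
# ... run fine-tuning with the hook active ...
# handle.remove()
```

Because the projection is applied only during fine-tuning, the gradient updates never learn to rely on the ablated directions, which is the mechanism behind steering generalization without touching the training data.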
Authors: Yuxi Lin, Yaxue Fang, Zehong Zhang, Zhouwu Liu, Siyun Zhong, Fulong Yu
Abstract: Understanding how 5' untranslated regions (5'UTRs) regulate mRNA translation is critical for controlling protein expression and designing effective therapeutic mRNAs. While recent deep learning models have shown promise in predicting translational efficiency from 5'UTR sequences, most are constrained by fixed input lengths and limited interpretability. We introduce UTR-STCNet, a Transformer-based architecture for flexible and biologically grounded modeling of variable-length 5'UTRs. UTR-STCNet integrates a Saliency-Aware Token Clustering (SATC) module that iteratively aggregates nucleotide tokens into multi-scale, semantically meaningful units based on saliency scores. A Saliency-Guided Transformer (SGT) block then captures both local and distal regulatory dependencies using a lightweight attention mechanism. This combined architecture achieves efficient and interpretable modeling without input truncation or increased computational cost. Evaluated across three benchmark datasets, UTR-STCNet consistently outperforms state-of-the-art baselines in predicting mean ribosome load (MRL), a key proxy for translational efficiency. Moreover, the model recovers known functional elements such as upstream AUGs and Kozak motifs, highlighting its potential for mechanistic insight into translation regulation.
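A one-pass toy version of saliency-aware token clustering is sketched below: tokens above a saliency threshold survive as fine-grained units, while runs of low-saliency tokens are mean-pooled into coarser units. The actual SATC module is iterative and multi-scale, so this is only a conceptual illustration with assumed inputs.

```python
import numpy as np

def saliency_cluster(embeddings, saliency, threshold):
    """Greedy clustering sketch: keep high-saliency nucleotide tokens as
    their own units; pool each run of low-saliency tokens into one unit.

    embeddings: array of shape (n_tokens, dim); saliency: (n_tokens,).
    """
    units, buffer = [], []
    for emb, s in zip(embeddings, saliency):
        if s >= threshold:
            if buffer:  # flush the pending low-saliency run
                units.append(np.mean(buffer, axis=0))
                buffer = []
            units.append(emb)
        else:
            buffer.append(emb)
    if buffer:
        units.append(np.mean(buffer, axis=0))
    return np.stack(units)
```

The payoff is a shorter, variable-length sequence of semantically meaningful units, which is what lets the downstream attention block stay lightweight on long 5'UTRs.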
Authors: Mehul Damani, Isha Puri, Stewart Slocum, Idan Shenfeld, Leshem Choshen, Yoon Kim, Jacob Andreas
Abstract: When language models (LMs) are trained via reinforcement learning (RL) to generate natural language "reasoning chains", their performance improves on a variety of difficult question answering tasks. Today, almost all successful applications of RL for reasoning use binary reward functions that evaluate the correctness of LM outputs. Because such reward functions do not penalize guessing or low-confidence outputs, they often have the unintended side-effect of degrading calibration and increasing the rate at which LMs generate incorrect responses (or "hallucinate") in other problem domains. This paper describes RLCR (Reinforcement Learning with Calibration Rewards), an approach to training reasoning models that jointly improves accuracy and calibrated confidence estimation. During RLCR, LMs generate both predictions and numerical confidence estimates after reasoning. They are trained to optimize a reward function that augments a binary correctness score with a Brier score -- a scoring rule for confidence estimates that incentivizes calibrated prediction. We first prove that this reward function (or any analogous reward function that uses a bounded, proper scoring rule) yields models whose predictions are both accurate and well-calibrated. We next show that across diverse datasets, RLCR substantially improves calibration with no loss in accuracy, on both in-domain and out-of-domain evaluations -- outperforming both ordinary RL training and classifiers trained to assign post-hoc confidence scores. While ordinary RL hurts calibration, RLCR improves it. Finally, we demonstrate that verbalized confidence can be leveraged at test time to improve accuracy and calibration via confidence-weighted scaling methods. Our results show that explicitly optimizing for calibration can produce more generally reliable reasoning models.
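The RLCR reward itself is compact enough to state directly: a binary correctness term plus a Brier penalty on the verbalized confidence. The sketch below follows that description; treating the combination as a simple sum is our reading of the abstract rather than a guaranteed match to the paper's exact weighting.

```python
def rlcr_reward(correct: bool, confidence: float) -> float:
    """Correctness plus a Brier-style calibration term.

    `confidence` is the model's verbalized probability that its answer
    is correct. The Brier penalty (confidence - outcome)^2 is a proper
    scoring rule, so reporting the true probability maximizes the
    expected reward.
    """
    y = 1.0 if correct else 0.0
    return y - (confidence - y) ** 2

# A confident wrong answer is penalized hardest:
# rlcr_reward(True, 0.9)  -> 0.99
# rlcr_reward(False, 0.9) -> -0.81
# rlcr_reward(False, 0.1) -> -0.01
```

Because the scoring rule is proper and bounded, guessing with inflated confidence is strictly suboptimal, which is how the method repairs the calibration damage done by plain binary rewards.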
Authors: Zhihao Xu, Bixin Li, Lulu Wang
Abstract: Register Transfer Level (RTL) code optimization is crucial for achieving high performance and low power consumption in digital circuit design. However, traditional optimization methods often rely on manual tuning and heuristics, which can be time-consuming and error-prone. Recent studies proposed to leverage Large Language Models (LLMs) to assist in RTL code optimization. LLMs can generate optimized code snippets based on natural language descriptions, potentially speeding up the optimization process. However, existing approaches have not thoroughly evaluated the effectiveness of LLM-based code optimization methods for RTL code with complex timing logic. To address this gap, we conducted a comprehensive empirical investigation to assess the capability of LLM-based RTL code optimization methods in handling RTL code with complex timing logic. In this study, we first propose a new benchmark for RTL optimization evaluation. It comprises four subsets, each corresponding to a specific area of RTL code optimization. Then we introduce a method based on metamorphic testing to systematically evaluate the effectiveness of LLM-based RTL code optimization methods. Our key insight is that optimization effectiveness should remain consistent for semantically equivalent but more complex code. After intensive experiments, we revealed several key findings. (1) LLM-based RTL optimization methods can effectively optimize logic operations and outperform existing compiler-based methods. (2) LLM-based RTL optimization methods do not perform better than existing compiler-based methods on RTL code with complex timing logic, particularly in timing control flow optimization and clock domain optimization. This is primarily attributed to the challenges LLMs face in understanding timing logic in RTL code. Based on these findings, we provide insights for further research in leveraging LLMs for RTL code optimization.
Authors: Run-Ze Fan, Zengzhi Wang, Pengfei Liu
Abstract: Scientific reasoning is critical for developing AI scientists and supporting human researchers in advancing the frontiers of natural science discovery. However, the open-source community has primarily focused on mathematics and coding while neglecting the scientific domain, largely due to the absence of open, large-scale, high-quality, verifiable scientific reasoning datasets. To bridge this gap, we first present TextbookReasoning, an open dataset featuring truthful reference answers extracted from 12k university-level scientific textbooks, comprising 650k reasoning questions spanning 7 scientific disciplines. We further introduce MegaScience, a large-scale mixture of high-quality open-source datasets totaling 1.25 million instances, developed through systematic ablation studies that evaluate various data selection methodologies to identify the optimal subset for each publicly available scientific dataset. Meanwhile, we build a comprehensive evaluation system covering diverse subjects and question types across 15 benchmarks, incorporating comprehensive answer extraction strategies to ensure accurate evaluation metrics. Our experiments demonstrate that our datasets achieve superior performance and training efficiency with more concise response lengths compared to existing open-source scientific datasets. Furthermore, we train Llama3.1, Qwen2.5, and Qwen3 series base models on MegaScience, which significantly outperform the corresponding official instruct models in average performance. In addition, MegaScience exhibits greater effectiveness for larger and stronger models, suggesting a scaling benefit for scientific tuning. We release our data curation pipeline, evaluation system, datasets, and seven trained models to the community to advance scientific reasoning research.
Authors: Chi-Pin Huang, Yueh-Hua Wu, Min-Hung Chen, Yu-Chiang Frank Wang, Fu-En Yang
Abstract: Vision-language-action (VLA) reasoning tasks require agents to interpret multimodal instructions, perform long-horizon planning, and act adaptively in dynamic environments. Existing approaches typically train VLA models in an end-to-end fashion, directly mapping inputs to actions without explicit reasoning, which hinders their ability to plan over multiple steps or adapt to complex task variations. In this paper, we propose ThinkAct, a dual-system framework that bridges high-level reasoning with low-level action execution via reinforced visual latent planning. ThinkAct trains a multimodal LLM to generate embodied reasoning plans guided by reinforcing action-aligned visual rewards based on goal completion and trajectory consistency. These reasoning plans are compressed into a visual plan latent that conditions a downstream action model for robust action execution on target environments. Extensive experiments on embodied reasoning and robot manipulation benchmarks demonstrate that ThinkAct enables few-shot adaptation, long-horizon planning, and self-correction behaviors in complex embodied AI tasks.
Authors: Jose M. Alvarez, Salvatore Ruggieri
Abstract: Perception occurs when individuals interpret the same information differently. It is a known cognitive phenomenon with implications for bias in human decision-making. Perception, however, remains understudied in machine learning (ML). This is problematic as modern decision flows, whether partially or fully automated by ML applications, always involve human experts. How might we account for cases in which two experts, e.g., interpret differently the same deferred instance or explanation from a ML model? Addressing this and similar questions requires a formulation of perception, particularly, in a manner that integrates with ML-enabled decision flows. In this work, we present a first approach to modeling perception causally. We define perception under causal reasoning using structural causal models (SCM). Our approach formalizes individual experience as additional causal knowledge that comes with and is used by the expert decision-maker in the form of a SCM. We define two kinds of probabilistic causal perception: structural perception and parametrical perception. We showcase our framework through a series of examples of modern decision flows. We also emphasize the importance of addressing perception in fair ML, discussing relevant fairness implications and possible applications.
Authors: Deepti Raghavan, Keshav Santhanam, Muhammad Shahir Rahman, Nayani Modugula, Luis Gaspar Schroeder, Maximilien Cura, Houjun Liu, Pratiksha Thaker, Philip Levis, Matei Zaharia
Abstract: Compound AI applications chain together subcomponents such as generative language models, document retrievers, and embedding models. Applying traditional systems optimizations such as parallelism and pipelining in compound AI systems is difficult because each component has different constraints in terms of the granularity and type of data that it ingests. New data is often generated during intermediate computations, and text streams may be split into smaller, independent fragments (such as documents into sentences) which may then be re-aggregated at later parts of the computation. Due to this complexity, existing systems to serve compound AI queries do not fully take advantage of parallelism and pipelining opportunities. We present Alto, a framework that automatically optimizes execution of compound AI queries through streaming and parallelism. Alto introduces a new abstraction called nested ancestry, a metadata hierarchy that allows the system to correctly track partial outputs and aggregate data across the heterogeneous constraints of the components of compound AI applications. This metadata is automatically inferred from the programming model, allowing developers to express complex dataflow patterns without needing to reason manually about the details of routing and aggregation. Implementations of four applications in Alto outperform or match implementations in LangGraph, a popular existing AI programming framework, matching or improving latency by 10-30%.
Authors: Yanwei Zheng, Shaopu Feng, Bowen Huang, Chuanlin Lan, Xiao Zhang, Dongxiao Yu
Abstract: Human-like navigation proceeds in two stages: searching, exploring unknown areas before discovering the target, and then pathfinding, moving toward the discovered target. Inspired by this, recent studies design parallel submodules to achieve different functions in the searching and pathfinding stages, while ignoring the differences in reward signals between the two stages. As a result, these models often cannot be fully trained or overfit to training scenes. Another bottleneck restricting agents from learning two-stage strategies is spatial perception ability, since these studies use generic visual encoders that do not consider the depth information of navigation scenes. To release the model's potential for strategy learning, we propose the Two-Stage Reward Mechanism (TSRM) for object navigation, which decouples the searching and pathfinding behaviours within an episode, enabling the agent to explore a larger area in the searching stage and seek the optimal path in the pathfinding stage. We also propose a pretraining method, Depth Enhanced Masked Autoencoders (DE-MAE), that enables the agent to distinguish explored from unexplored areas during the searching stage and to locate the target object and plan paths more accurately during the pathfinding stage. In addition, we propose a new metric, Searching Success weighted by Searching Path Length (SSSPL), that assesses the agent's searching ability and exploration efficiency. Finally, we evaluate our method extensively on AI2-Thor and RoboTHOR and demonstrate that it outperforms state-of-the-art (SOTA) methods in both success rate and navigation efficiency.
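A toy decoupled reward along the lines of TSRM might look as follows, paying for newly observed area in the searching stage and for path efficiency in the pathfinding stage. All weights and signals here are invented placeholders intended only to show the stage-dependent structure, not the paper's actual reward shaping.

```python
def tsrm_style_reward(stage, new_area, step_cost, reached_goal, on_shortest_path):
    """Stage-dependent reward sketch for object navigation.

    stage: "searching" before the target is discovered, "pathfinding" after.
    new_area: newly observed area this step; step_cost: per-step penalty base.
    """
    if stage == "searching":
        # exploration is the objective: pay for coverage, lightly tax steps
        return 1.0 * new_area - 0.01 * step_cost
    # pathfinding: pay for efficient progress toward the discovered target
    r = 5.0 if reached_goal else 0.0
    r += 0.1 if on_shortest_path else -0.05
    return r - 0.01 * step_cost
```

Switching the reward at the discovery event is what removes the conflict between "cover more ground" and "take the shortest path" that a single-stage reward has to average over.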
Authors: Tenghao Huang, Kinjal Basu, Ibrahim Abdelaziz, Pavan Kapanipathi, Jonathan May, Muhao Chen
Abstract: The proliferation of web agents necessitates advanced navigation and interaction strategies within complex web environments. Current models often struggle with efficient navigation and action execution due to limited visibility and understanding of web structures. Our proposed R2D2 framework addresses these challenges by integrating two paradigms: Remember and Reflect. The Remember paradigm uses a replay buffer that aids agents in reconstructing the web environment dynamically, thus enabling the formulation of a detailed "map" of previously visited pages. This helps in reducing navigational errors and optimizing the decision-making process during web interactions. Conversely, the Reflect paradigm allows agents to learn from past mistakes by providing a mechanism for error analysis and strategy refinement, enhancing overall task performance. We evaluate R2D2 using the WebArena benchmark, demonstrating substantial improvements over existing methods, including a 50% reduction in navigation errors and a threefold increase in task completion rates. Our findings suggest that a combination of memory-enhanced navigation and reflective learning promisingly advances the capabilities of web agents, potentially benefiting various applications such as automated customer service and personal digital assistants.
Authors: InternAgent Team, Bo Zhang, Shiyang Feng, Xiangchao Yan, Jiakang Yuan, Runmin Ma, Yusong Hu, Zhiyin Yu, Xiaohan He, Songtao Huang, Shaowei Hou, Zheng Nie, Zhilong Wang, Jinyao Liu, Tianshuo Peng, Peng Ye, Dongzhan Zhou, Shufei Zhang, Xiaosong Wang, Yilan Zhang, Meng Li, Zhongying Tu, Xiangyu Yue, Wangli Ouyang, Bowen Zhou, Lei Bai
Abstract: Artificial Intelligence (AI) is accelerating the transformation of scientific research paradigms, not only enhancing research efficiency but also driving innovation. We introduce InternAgent, a unified closed-loop multi-agent framework to conduct Autonomous Scientific Research (ASR) across various scientific research fields, enabling researchers to tackle complicated problems in these fields with unprecedented speed and precision. InternAgent highlights three key advantages: 1) Scalability: InternAgent has demonstrated its versatility across 12 scientific research tasks, capable of generating innovative ideas to enhance the performance of baseline code. 2) Interactivity: InternAgent provides an interface for human expert feedback and multi-agent interaction in automated end-to-end processes, allowing for the seamless integration of domain expert knowledge. 3) Efficiency: InternAgent has achieved promising performance gains in several scientific fields with significantly less time cost compared to human efforts. For instance, in reaction yield prediction, it increased from 27.6% to 35.4% in just 12 hours; in enhancer activity prediction, accuracy rose from 0.65 to 0.79 with only 4 hours of processing; and in 2D semantic segmentation, precision advanced from 78.8% to 81.0% in a mere 30 hours.
Authors: Jihyung Lee, Jin-Seop Lee, Jaehoon Lee, YunSeok Choi, Jee-Hyong Lee
Abstract: Text-to-SQL, which translates a natural language question into an SQL query, has advanced with in-context learning of Large Language Models (LLMs). However, existing methods show little improvement in performance compared to randomly chosen demonstrations, and significant performance drops when smaller LLMs (e.g., Llama 3.1-8B) are used. This indicates that these methods heavily rely on the intrinsic capabilities of hyper-scaled LLMs, rather than effectively retrieving useful demonstrations. In this paper, we propose a novel approach for effectively retrieving demonstrations and generating SQL queries. We construct a Deep Contextual Schema Link Graph, which contains key information and semantic relationships between a question and its database schema items. This graph-based structure enables effective representation of Text-to-SQL samples and retrieval of useful demonstrations for in-context learning. Experimental results on the Spider benchmark demonstrate the effectiveness of our approach, showing consistent improvements in SQL generation performance and efficiency across both hyper-scaled LLMs and small LLMs. The code is available at https://github.com/jjklle/DCG-SQL.
URLs: https://github.com/jjklle/DCG-SQL
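To illustrate the flavor of a schema link graph, the sketch below connects question tokens to the schema items they mention, using substring matching as a stand-in for the deeper semantic linking the paper performs; the node naming scheme and the use of networkx as a graph container are our own assumptions.

```python
import networkx as nx

def build_schema_link_graph(question_tokens, schema_items):
    """Toy schema-link graph: question-token nodes, table nodes, and
    column nodes, with edges wherever a token mentions a schema item.

    schema_items: dict mapping table name -> list of column names.
    """
    g = nx.Graph()
    for tok in question_tokens:
        g.add_node(("q", tok))
    for table, columns in schema_items.items():
        g.add_node(("t", table))
        for col in columns:
            col_id = f"{table}.{col}"
            g.add_node(("c", col_id))
            g.add_edge(("t", table), ("c", col_id))  # structural edge
            for tok in question_tokens:
                if tok in col or tok in table:       # naive linking
                    g.add_edge(("q", tok), ("c", col_id))
    return g

# g = build_schema_link_graph(["singer", "age"], {"singer": ["name", "age"]})
```

Comparing such graphs (rather than raw question text) is what lets demonstration retrieval favor samples whose schema-linking structure resembles the new question's.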
Authors: Edward Y. Chang, Zeyneb N. Kaya, Ethan Chang
Abstract: Large language models (LLMs) are vast repositories of latent patterns, but without structured guidance, they lack explicit reasoning, semantic grounding, and goal-directed intelligence. We propose Unified Cognitive Consciousness Theory (UCCT), a unified model that reinterprets LLMs as unconscious substrates requiring external mechanisms (few-shot prompting, RAG, fine-tuning, and multi-agent reasoning) to semantically anchor latent representations. UCCT formalizes this anchoring process through a Bayesian formulation, revealing a threshold-crossing dynamic characterized by 1/sqrt(n) scaling that explains the sudden capability transitions observed across diverse tasks. The theory thus unifies these previously disparate techniques as special cases of a general anchoring architecture. Through case studies in simple math, visual recognition, and structured debate tasks, we confirm the predictive power of UCCT; our arithmetic experiments in three numeral systems provide further validation. Rather than treating intelligence as an intrinsic property of LLMs, UCCT demonstrates that LLMs are merely unconscious pattern repositories with no inherent intelligence. Intelligence emerges only when external anchoring mechanisms assign target semantics to these latent patterns, transforming unconscious representations into conscious, goal-directed capabilities.
Authors: Guan Wang, Jin Li, Yuhao Sun, Xing Chen, Changling Liu, Yue Wu, Meng Lu, Sen Song, Yasin Abbasi Yadkori
Abstract: Reasoning, the process of devising and executing complex goal-oriented action sequences, remains a critical challenge in AI. Current large language models (LLMs) primarily employ Chain-of-Thought (CoT) techniques, which suffer from brittle task decomposition, extensive data requirements, and high latency. Inspired by the hierarchical and multi-timescale processing in the human brain, we propose the Hierarchical Reasoning Model (HRM), a novel recurrent architecture that attains significant computational depth while maintaining both training stability and efficiency. HRM executes sequential reasoning tasks in a single forward pass without explicit supervision of the intermediate process, through two interdependent recurrent modules: a high-level module responsible for slow, abstract planning, and a low-level module handling rapid, detailed computations. With only 27 million parameters, HRM achieves exceptional performance on complex reasoning tasks using only 1000 training samples. The model operates without pre-training or CoT data, yet achieves nearly perfect performance on challenging tasks including complex Sudoku puzzles and optimal path finding in large mazes. Furthermore, HRM outperforms much larger models with significantly longer context windows on the Abstraction and Reasoning Corpus (ARC), a key benchmark for measuring artificial general intelligence capabilities. These results underscore HRM's potential as a transformative advancement toward universal computation and general-purpose reasoning systems.
Authors: Mingda Zhang, Na Zhao, Jianglong Qing, Qing Xu, Kaiwen Pan, Ting Luo
Abstract: This research presents a framework combining prompt engineering with multidimensional knowledge graphs to improve LLMs' legal dispute analysis. Specifically, the framework includes a three-stage hierarchical prompt structure (task definition, knowledge background, reasoning guidance) along with a three-layer knowledge graph (legal ontology, representation, instance layers). Additionally, four supporting methods enable precise legal concept retrieval: direct code matching, semantic vector similarity, ontology path reasoning, and lexical segmentation. Through extensive testing, results show major improvements: sensitivity increased by 9.9%-13.8%, specificity by 4.8%-6.7%, and citation accuracy by 22.4%-39.7%. As a result, the framework provides better legal analysis and understanding of judicial logic, thus offering a new technical method for intelligent legal assistance systems.
Authors: Mingda Zhang, Na Zhao, Jianglong Qin, Guoyu Ye, Ruixiang Tang
Abstract: Rare disease diagnosis remains challenging for medical large language models due to insufficient knowledge representation, limited concept understanding, and constrained clinical reasoning. We propose a framework combining multi-granularity sparse activation with hierarchical knowledge graphs. Our approach employs four complementary matching algorithms with diversity control and a five-level fallback strategy for precise concept activation. A three-layer knowledge graph (taxonomy, clinical features, instances) provides structured, up-to-date context. Experiments on the BioASQ rare disease dataset demonstrate significant improvements: BLEU scores increased by up to 0.13, ROUGE by up to 0.10, and diagnostic accuracy by up to 0.25, with the best model achieving 0.92 accuracy--surpassing the 0.90 clinical threshold. Expert evaluation confirms enhancements in information quality, reasoning, and professional expression. Our framework shows promise in reducing the diagnostic odyssey for rare disease patients.
Authors: Lance Ying, Katherine M. Collins, Prafull Sharma, Cedric Colas, Kaiya Ivy Zhao, Adrian Weller, Zenna Tavares, Phillip Isola, Samuel J. Gershman, Jacob D. Andreas, Thomas L. Griffiths, Francois Chollet, Kelsey R. Allen, Joshua B. Tenenbaum
Abstract: Human intelligence exhibits a remarkable capacity for rapid adaptation and effective problem-solving in novel and unfamiliar contexts. We argue that this profound adaptability is fundamentally linked to the efficient construction and refinement of internal representations of the environment, commonly referred to as world models, and we refer to this adaptation mechanism as world model induction. However, current understanding and evaluation of world models in artificial intelligence (AI) remains narrow, often focusing on static representations learned from training on massive corpora of data, instead of the efficiency and efficacy in learning these representations through interaction and exploration within a novel environment. In this Perspective, we provide a view of world model induction drawing on decades of research in cognitive science on how humans learn and adapt so efficiently; we then call for a new evaluation framework for assessing adaptive world models in AI. Concretely, we propose a new benchmarking paradigm based on suites of carefully designed games with genuine, deep and continually refreshing novelty in the underlying game structures -- we refer to this class of games as novel games. We detail key desiderata for constructing these games and propose appropriate metrics to explicitly challenge and evaluate the agent's ability for rapid world model induction. We hope that this new evaluation framework will inspire future evaluation efforts on world models in AI and provide a crucial step towards developing AI systems capable of human-like rapid adaptation and robust generalization -- a critical component of artificial general intelligence.
Authors: Xiaoya Li, Xiaofei Sun, Albert Wang, Jiwei Li, Chris Shum
Abstract: The exponential growth in demand for GPU computing resources, driven by the rapid advancement of Large Language Models, has created an urgent need for automated CUDA optimization strategies. While recent advances in LLMs show promise for code generation, current SOTA models (e.g., R1, o1) achieve low success rates in improving CUDA speed. In this paper, we introduce CUDA-L1, an automated reinforcement learning framework for CUDA optimization. CUDA-L1 achieves performance improvements on the CUDA optimization task: trained on NVIDIA A100, it delivers an average speedup of x17.7 across all 250 CUDA kernels of KernelBench, with peak speedups reaching x449. Furthermore, the model also demonstrates excellent portability across GPU architectures, achieving average speedups of x17.8 on H100, x19.0 on RTX 3090, x16.5 on L40, x14.7 on H800, and x13.9 on H20, despite being optimized specifically for A100. Beyond these benchmark results, CUDA-L1 demonstrates several remarkable properties: 1) it discovers a variety of CUDA optimization techniques and learns to combine them strategically to achieve optimal performance; 2) it uncovers fundamental principles of CUDA optimization; 3) it identifies non-obvious performance bottlenecks and rejects seemingly beneficial optimizations that harm performance. The capabilities of CUDA-L1 demonstrate that reinforcement learning can transform an initially poor-performing LLM into an effective CUDA optimizer through speedup-based reward signals alone, without human expertise or domain knowledge. More importantly, the trained RL model extends the acquired reasoning abilities to new kernels. This paradigm opens possibilities for automated optimization of CUDA operations, and holds promise to substantially promote GPU efficiency and alleviate the rising pressure on GPU computing resources.
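The speedup-based reward signal can be illustrated host-side as a ratio of median wall-clock times between a reference kernel and a candidate, as sketched below. The run count is arbitrary, correctness checking is omitted, and the real framework benchmarks CUDA kernels on-device rather than Python callables, so treat this purely as a shape of the reward.

```python
import statistics
import time

def speedup_reward(baseline_fn, candidate_fn, runs=20):
    """Reward sketch: median runtime of the reference implementation
    divided by that of the generated one (>1 means the candidate is
    faster). A real pipeline must also verify numerical correctness."""
    def median_time(fn):
        times = []
        for _ in range(runs):
            t0 = time.perf_counter()
            fn()
            times.append(time.perf_counter() - t0)
        return statistics.median(times)

    return median_time(baseline_fn) / median_time(candidate_fn)
```

Using a median over repeated runs keeps the reward robust to scheduling noise, which matters when small, noisy speedups would otherwise reward lucky timings rather than genuinely faster kernels.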
Authors: Guancheng Zeng, Xueyi Chen, Jiawang Hu, Shaohua Qi, Yaxuan Mao, Zhantao Wang, Yifan Nie, Shuang Li, Qiuyang Feng, Pengxu Qiu, Yujia Wang, Wenqiang Han, Linyan Huang, Gang Li, Jingjing Mo, Haowen Hu
Abstract: The deployment of agent systems in an enterprise environment is often hindered by several challenges: common models lack domain-specific process knowledge, leading to disorganized plans, missing key tools, and poor execution stability. To address this, this paper introduces Routine, a multi-step agent planning framework designed with a clear structure, explicit instructions, and seamless parameter passing to guide the agent's execution module in performing multi-step tool-calling tasks with high stability. In evaluations conducted within a real-world enterprise scenario, Routine significantly increases the execution accuracy in model tool calls, increasing the performance of GPT-4o from 41.1% to 96.3%, and Qwen3-14B from 32.6% to 83.3%. We further constructed a Routine-following training dataset and fine-tuned Qwen3-14B, resulting in an accuracy increase to 88.2% on scenario-specific evaluations, indicating improved adherence to execution plans. In addition, we employed Routine-based distillation to create a scenario-specific, multi-step tool-calling dataset. Fine-tuning on this distilled dataset raised the model's accuracy to 95.5%, approaching GPT-4o's performance. These results highlight Routine's effectiveness in distilling domain-specific tool-usage patterns and enhancing model adaptability to new scenarios. Our experimental results demonstrate that Routine provides a practical and accessible approach to building stable agent workflows, accelerating the deployment and adoption of agent systems in enterprise environments, and advancing the technical vision of AI for Process.
Authors: Yitong Lin, Jiaying He, Jiahe Chen, Xinnan Zhu, Jianwei Zheng, Tao Bo
Abstract: Motivation: Biomedical knowledge graphs (KGs) are crucial for drug discovery and disease understanding, yet their completion and reasoning are challenging. Knowledge Embedding (KE) methods capture global semantics but struggle with dynamic structural integration, while Graph Neural Networks (GNNs) excel locally but often lack semantic understanding. Even ensemble approaches, including those leveraging language models, often fail to achieve a deep, adaptive, and synergistic co-evolution between semantic comprehension and structural learning. Addressing this critical gap in fostering continuous, reciprocal refinement between these two aspects in complex biomedical KGs is paramount. Results: We introduce BioGraphFusion, a novel framework for deeply synergistic semantic and structural learning. BioGraphFusion establishes a global semantic foundation via tensor decomposition, guiding an LSTM-driven mechanism to dynamically refine relation embeddings during graph propagation. This fosters adaptive interplay between semantic understanding and structural learning, further enhanced by query-guided subgraph construction and a hybrid scoring mechanism. Experiments across three key biomedical tasks demonstrate BioGraphFusion's superior performance over state-of-the-art KE, GNN, and ensemble models. A case study on Cutaneous Malignant Melanoma 1 (CMM1) highlights its ability to unveil biologically meaningful pathways. Availability and Implementation: Source code and all training data are freely available for download at https://github.com/Y-TARL/BioGraphFusion. Supplementary information: Supplementary data are available at Bioinformatics online.
Authors: Shangke Lyu, Linjuan Wu, Yuchen Yan, Xingyu Wu, Hao Li, Yongliang Shen, Peisheng Jiang, Weiming Lu, Jun Xiao, Yueting Zhuang
Abstract: Large reasoning models achieve remarkable performance through extensive chain-of-thought generation, yet exhibit significant computational inefficiency by applying uniform reasoning strategies regardless of problem complexity. We present Hierarchical Budget Policy Optimization (HBPO), a reinforcement learning framework that enables models to learn problem-specific reasoning depths without sacrificing capability. HBPO addresses the fundamental challenge of exploration space collapse in efficiency-oriented training, where penalties on long output length systematically bias models away from necessary long reasoning paths. Through hierarchical budget exploration, our approach partitions rollout samples into multiple subgroups with distinct token budgets, aiming to enable efficient resource allocation while preventing degradation of capability. We introduce differentiated reward mechanisms that create budget-aware incentives aligned with the complexity of the problem, allowing models to discover natural correspondences between task requirements and computational effort. Extensive experiments demonstrate that HBPO reduces average token usage by up to 60.6% while improving accuracy by 3.14% across four reasoning benchmarks. Unlike existing methods that impose external constraints or rely on discrete mode selection, HBPO exhibits emergent adaptive behavior where models automatically adjust reasoning depth based on problem complexity. Our results suggest that reasoning efficiency and capability are not inherently conflicting, and can be simultaneously optimized through appropriately structured hierarchical training that preserves exploration diversity.
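The hierarchical budget idea is easy to sketch: rollouts are partitioned into token-budget subgroups, and each reward couples correctness with a budget-aware length penalty. A minimal Python sketch; the subgroup budgets and penalty weight are illustrative assumptions, not the paper's exact reward design.

    import random

    def hbpo_rewards(rollouts, budgets=(512, 1024, 2048, 4096), penalty=1e-3):
        """Assign each rollout to a budget subgroup and reward correctness
        minus a charge for tokens beyond that subgroup's budget."""
        scored = []
        for r in rollouts:  # r: {"correct": bool, "n_tokens": int}
            budget = random.choice(budgets)  # hierarchical budget exploration
            overshoot = max(0, r["n_tokens"] - budget)
            scored.append(float(r["correct"]) - penalty * overshoot)
        return scored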
Authors: Yichen Huang, Lin F. Yang
Abstract: The International Mathematical Olympiad (IMO) poses uniquely challenging problems requiring deep insight, creativity, and formal reasoning. While Large Language Models (LLMs) perform well on mathematical benchmarks like AIME, they struggle with Olympiad-level tasks. We use Google's Gemini 2.5 Pro on the newly released IMO 2025 problems, avoiding data contamination. Using a self-verification pipeline with careful prompt design, 5 (out of 6) problems are solved correctly (up to a caveat discussed below). This result underscores the importance of developing optimal strategies to harness the full potential of powerful LLMs for complex reasoning tasks.
Authors: Minh Ngoc Luu, Minh-Duong Nguyen, Ebrahim Bedeer, Van Duc Nguyen, Dinh Thai Hoang, Diep N. Nguyen, Quoc-Viet Pham
Abstract: An intelligent Real-Time Sensing (RTS) system must continuously acquire, update, integrate, and apply knowledge to adapt to real-world dynamics. Managing distributed intelligence in this context requires Federated Continual Learning (FCL). However, effectively capturing the diverse characteristics of RTS data in FCL systems poses significant challenges: it strains computational and communication resources, escalates energy costs, and ultimately degrades overall system performance. To overcome these challenges, we investigate how the data distribution shift from ideal to practical RTS scenarios affects Artificial Intelligence (AI) model performance by leveraging the \textit{generalization gap} concept. In this way, we can analyze how sampling time in RTS correlates with the decline in AI performance, computation cost, and communication efficiency. Based on this observation, we develop a novel Sample-driven Control for Federated Continual Learning (SCFL) technique, specifically designed for mobile edge networks with RTS capabilities. In particular, SCFL is an optimization problem that harnesses the sampling process to concurrently minimize the generalization gap and improve overall accuracy while upholding the energy efficiency of the FCL framework. To solve the highly complex and time-varying optimization problem, we introduce a new soft actor-critic algorithm with explicit and implicit constraints (A2C-EI). Our empirical experiments reveal that we can achieve higher efficiency compared to other DRL baselines. Notably, SCFL can significantly reduce energy consumption by up to $85\%$ while maintaining FL convergence and timely data transmission.
Authors: Xiangru Tang, Qiao Jin, Kunlun Zhu, Tongxin Yuan, Yichi Zhang, Wangchunshu Zhou, Meng Qu, Yilun Zhao, Jian Tang, Zhuosheng Zhang, Arman Cohan, Zhiyong Lu, Mark Gerstein
Abstract: AI scientists powered by large language models have demonstrated substantial promise in autonomously conducting experiments and facilitating scientific discoveries across various disciplines. While their capabilities are promising, these agents also introduce novel vulnerabilities that require careful consideration for safety. However, there has been limited comprehensive exploration of these vulnerabilities. This perspective examines vulnerabilities in AI scientists, shedding light on potential risks associated with their misuse, and emphasizing the need for safety measures. We begin by providing an overview of the potential risks inherent to AI scientists, taking into account user intent, the specific scientific domain, and their potential impact on the external environment. Then, we explore the underlying causes of these vulnerabilities and provide a scoping review of the limited existing works. Based on our analysis, we propose a triadic framework involving human regulation, agent alignment, and an understanding of environmental feedback (agent regulation) to mitigate these identified risks. Furthermore, we highlight the limitations and challenges associated with safeguarding AI scientists and advocate for the development of improved models, robust benchmarks, and comprehensive regulations.
Authors: Norah Alballa, Ahmed M. Abdelmoniem, Marco Canini
Abstract: This research investigates the enhancement of knowledge distillation (KD) processes in pre-trained models, an emerging field in knowledge transfer with significant implications for distributed training and federated learning environments. These environments benefit from reduced communication demands and accommodate various model architectures. Despite the adoption of numerous KD approaches for transferring knowledge among pre-trained models, a comprehensive understanding of KD's application in these scenarios is lacking. Our study conducts an extensive comparison of multiple KD techniques, including standard KD, tuned KD (via optimized temperature and weight parameters), deep mutual learning, and data partitioning KD. We assess these methods across various data distribution strategies to identify the most effective contexts for each. Through detailed examination of hyperparameter tuning, informed by extensive grid search evaluations, we pinpoint when adjustments are crucial to enhance model performance. This paper sheds light on optimal hyperparameter settings for distinct data partitioning scenarios and investigates KD's role in improving federated learning by minimizing communication rounds and expediting the training process. By filling a notable void in current research, our findings serve as a practical framework for leveraging KD in pre-trained models within collaborative and federated learning frameworks.
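Standard and tuned KD differ only in the temperature and the weight blending the soft teacher target with the hard label; these are exactly the hyperparameters the grid search sweeps. A minimal PyTorch sketch of that loss:

    import torch.nn.functional as F

    def kd_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
        """Soft-label KL term (scaled by T^2, as is standard) blended with
        cross-entropy on the hard labels."""
        soft = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                        F.softmax(teacher_logits / T, dim=-1),
                        reduction="batchmean") * (T * T)
        hard = F.cross_entropy(student_logits, labels)
        return alpha * soft + (1 - alpha) * hard

Tuned KD then amounts to searching over T and alpha per data-partitioning scenario.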
Authors: Sergio Morales, Robert Claris\'o, Jordi Cabot
Abstract: The integration of Large Language Models (LLMs) into various software applications raises concerns about their potential biases. Typically, those models are trained on a vast amount of data scraped from forums, websites, social media, and other internet sources, which may instill harmful and discriminatory behavior into the model. To address this issue, we present LangBiTe, a testing platform to systematically assess the presence of biases within an LLM. LangBiTe enables development teams to tailor their test scenarios, and automatically generate and execute the test cases according to a set of user-defined ethical requirements. Each test consists of a prompt fed into the LLM and a corresponding test oracle that scrutinizes the LLM's response for the identification of biases. LangBiTe provides users with the bias evaluation of LLMs, and end-to-end traceability between the initial ethical requirements and the insights obtained.
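The prompt-plus-oracle pairing can be made concrete with a counterfactual test: instantiate one template with two demographic terms and let the oracle flag divergent responses. The interfaces below (llm as a plain callable, an equality oracle) are illustrative assumptions, not LangBiTe's actual API.

    def counterfactual_bias_test(llm, template, group_a, group_b):
        """One test case: prompt the LLM twice, varying only the group term,
        and check that the responses do not diverge."""
        resp_a = llm(template.format(group=group_a))
        resp_b = llm(template.format(group=group_b))
        return {"passed": resp_a.strip() == resp_b.strip(),
                "responses": (resp_a, resp_b)}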
Authors: Hang Zhou, Yuezhou Ma, Haixu Wu, Haowen Wang, Mingsheng Long
Abstract: Deep models have recently emerged as promising tools to solve partial differential equations (PDEs), known as neural PDE solvers. While neural solvers trained from either simulation data or physics-informed loss can solve PDEs reasonably well, they are mainly restricted to a few instances of PDEs, e.g. a certain equation with a limited set of coefficients. This limits their generalization to diverse PDEs, preventing them from being practical surrogate models of numerical solvers. In this paper, we present Unisolver, a novel Transformer model trained on diverse data and conditioned on diverse PDEs, aiming towards a universal neural PDE solver capable of solving a wide scope of PDEs. Instead of purely scaling up data and parameters, Unisolver stems from the theoretical analysis of the PDE-solving process. Inspired by the mathematical structure of PDEs that a PDE solution is fundamentally governed by a series of PDE components such as equation symbols and boundary conditions, we define a complete set of PDE components and flexibly embed them as domain-wise and point-wise deep conditions for Transformer PDE solvers. Integrating physical insights with recent Transformer advances, Unisolver achieves consistent state-of-the-art on three challenging large-scale benchmarks, showing impressive performance and generalizability. Code is available at https://github.com/thuml/Unisolver.
Authors: Dominic LaBella, Valeriia Abramova, Mehdi Astaraki, Andre Ferreira, Zhifan Jiang, Mason C. Cleveland, Ramandeep Kang, Uma M. Lal-Trehan Estrada, Cansu Yalcin, Rachika E. Hamadache, Clara Lisazo, Adri\`a Casamitjana, Joaquim Salvi, Arnau Oliver, Xavier Llad\'o, Iuliana Toma-Dasu, Tiago Jesus, Behrus Puladi, Jens Kleesiek, Victor Alves, Jan Egger, Daniel Capell\'an-Mart\'in, Abhijeet Parida, Austin Tapp, Xinyang Liu, Maria J. Ledesma-Carbayo, Jay B. Patel, Thomas N. McNeal, Maya Viera, Owen McCall, Albert E. Kim, Elizabeth R. Gerstner, Christopher P. Bridge, Katherine Schumacher, Michael Mix, Kevin Leu, Shan McBurney-Lin, Pierre Nedelec, Javier Villanueva-Meyer, David R. Raleigh, Jonathan Shapey, Tom Vercauteren, Kazumi Chia, Marina Ivory, Theodore Barfoot, Omar Al-Salihi, Justin Leu, Lia M. Halasz, Yuri S. Velichko, Chunhao Wang, John P. Kirkpatrick, Scott R. Floyd, Zachary J. Reitman, Trey C. Mullikin, Eugene J. Vaios, Christina Huang, Ulas Bagci, Sean Sachdev, Jona A. Hattangadi-Gluth, Tyler M. Seibert, Nikdokht Farid, Connor Puett, Matthew W. Pease, Kevin Shiue, Syed Muhammad Anwar, Shahriar Faghani, Peter Taylor, Pranav Warman, Jake Albrecht, Andr\'as Jakab, Mana Moassefi, Verena Chung, Rong Chai, Alejandro Aristizabal, Alexandros Karargyris, Hasan Kassem, Sarthak Pati, Micah Sheller, Nazanin Maleki, Rachit Saluja, Florian Kofler, Christopher G. Schwarz, Philipp Lohmann, Phillipp Vollmuth, Louis Gagnon, Maruf Adewole, Hongwei Bran Li, Anahita Fathi Kazerooni, Nourel Hoda Tahon, Udunna Anazodo, Ahmed W. Moawad, Bjoern Menze, Marius George Linguraru, Mariam Aboian, Benedikt Wiestler, Ujjwal Baid, Gian-Marco Conte, Andreas M. Rauschecker, Ayman Nada, Aly H. Abayazeed, Raymond Huang, Maria Correia de Verdier, Jeffrey D. Rudie, Spyridon Bakas, Evan Calabrese
Abstract: The 2024 Brain Tumor Segmentation Meningioma Radiotherapy (BraTS-MEN-RT) challenge aimed to advance automated segmentation algorithms using the largest known multi-institutional dataset of 750 radiotherapy planning brain MRIs with expert-annotated target labels for patients with intact or postoperative meningioma who underwent either conventional external beam radiotherapy or stereotactic radiosurgery. Each case included a defaced 3D post-contrast T1-weighted radiotherapy planning MRI in its native acquisition space, accompanied by a single-label "target volume" representing the gross tumor volume (GTV) and any at-risk post-operative site. Target volume annotations adhered to established radiotherapy planning protocols, ensuring consistency across cases and institutions, and were approved by expert neuroradiologists and radiation oncologists. Six participating teams developed, containerized, and evaluated automated segmentation models using this comprehensive dataset. Team rankings were assessed using a modified lesion-wise Dice Similarity Coefficient (DSC) and 95% Hausdorff Distance (95HD). The best reported average lesion-wise DSC and 95HD were 0.815 and 26.92 mm, respectively. BraTS-MEN-RT is expected to significantly advance automated radiotherapy planning by enabling precise tumor segmentation and facilitating tailored treatment, ultimately improving patient outcomes. We describe the design and results from the BraTS-MEN-RT challenge.
Authors: Xiang Gao, Jiaying Liu
Abstract: Large-scale text-to-image diffusion models have been a revolutionary milestone in the evolution of generative AI and multimodal technology, enabling impressive image generation from natural-language text prompts. However, such models' lack of controllability restricts their practical applicability for real-life content creation. Thus, attention has been focused on leveraging a reference image to control text-to-image synthesis, which is also regarded as manipulating (or editing) a reference image as per a text prompt, namely, text-driven image-to-image translation. This paper contributes a novel, concise, and efficient approach that adapts a pre-trained large-scale text-to-image (T2I) diffusion model to the image-to-image (I2I) paradigm in a plug-and-play manner, realizing high-quality and versatile text-driven I2I translation without any model training, model fine-tuning, or online optimization process. To guide T2I generation with a reference image, we propose to decompose diverse guiding factors into different frequency bands of diffusion features in the DCT spectral space, and accordingly devise a novel frequency band substitution layer that realizes dynamic control of the reference image over the T2I generation result in a plug-and-play manner. We demonstrate that our method allows flexible control over both the guiding factor and guiding intensity of the reference image simply by tuning the type and bandwidth of the substituted frequency band, respectively. Extensive qualitative and quantitative experiments verify the superiority of our approach over related methods in I2I translation visual quality, versatility, and controllability. The code is publicly available at: https://github.com/XiangGao1102/FBSDiff.
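At its core, frequency band substitution swaps a block of DCT coefficients from the reference into the generated features; which band is swapped selects the guiding factor, and its bandwidth sets the guiding intensity. A minimal SciPy sketch over a single 2-D feature map (the actual method operates on diffusion features along the sampling trajectory):

    from scipy.fft import dctn, idctn

    def substitute_low_band(gen_feat, ref_feat, bandwidth):
        """Replace the generated features' low-frequency DCT block with the
        reference's, then transform back to the spatial domain."""
        G = dctn(gen_feat, norm="ortho")
        R = dctn(ref_feat, norm="ortho")
        G[:bandwidth, :bandwidth] = R[:bandwidth, :bandwidth]
        return idctn(G, norm="ortho")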
Authors: Binbin Ding, Penghui Yang, Sheng-Jun Huang
Abstract: Federated learning (FL) enables multiple clients to collaboratively train machine learning models under the coordination of a central server, while maintaining privacy. However, the server cannot directly monitor the local training processes, leaving room for malicious clients to introduce backdoors into the model. Research has shown that backdoor attacks exploit specific neurons that are activated only by malicious inputs, remaining dormant with clean data. Building on this insight, we propose a novel defense method called Flipping Weight Updates of Low-Activation Input Neurons (FLAIN) to counter backdoor attacks in FL. Specifically, upon the completion of global training, we use an auxiliary dataset to identify low-activation input neurons and iteratively flip their associated weight updates. This flipping process continues while progressively raising the threshold for low-activation neurons, until the model's performance on the auxiliary data begins to degrade significantly. Extensive experiments demonstrate that FLAIN effectively reduces the success rate of backdoor attacks across a variety of scenarios, including Non-IID data distributions and high malicious client ratios (MCR), while maintaining minimal impact on the performance of clean data.
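Stripped to its core, FLAIN is a loop: flip the weight updates feeding low-activation input neurons, raise the activation threshold, and stop once clean auxiliary accuracy drops. A simplified NumPy sketch; eval_fn (accuracy after applying a candidate update) and the percentile schedule are assumptions, not the paper's exact procedure.

    import numpy as np

    def flain_flip(weight_update, activations, eval_fn, tol=0.02, step_pct=5):
        """Iteratively flip updates of low-activation input neurons until
        performance on the auxiliary data degrades by more than tol."""
        mean_act = activations.mean(axis=0)  # per-input-neuron activation
        base_acc = eval_fn(weight_update)
        update, pct = weight_update.copy(), step_pct
        while pct <= 100:
            low = mean_act < np.percentile(mean_act, pct)
            trial = update.copy()
            trial[:, low] = -weight_update[:, low]  # flip those columns
            if base_acc - eval_fn(trial) > tol:  # clean accuracy degraded
                break
            update, pct = trial, pct + step_pct
        return update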
Authors: Natchapon Jongwiriyanurak, Zichao Zeng, June Moh Goo, Xinglei Wang, Ilya Ilyankou, Kerkritt Sriroongvikrai, Nicola Christie, Meihui Wang, Huanfa Chen, James Haworth
Abstract: Road safety assessments are critical yet costly, especially in Low- and Middle-Income Countries (LMICs), where most roads remain unrated. Traditional methods require expert annotation and training data, while supervised learning-based approaches struggle to generalise across regions. In this paper, we introduce \textit{V-RoAst}, a zero-shot Visual Question Answering (VQA) framework using Vision-Language Models (VLMs) to classify road safety attributes defined by the iRAP standard. We introduce the first open-source dataset from ThaiRAP, consisting of over 2,000 curated street-level images from Thailand annotated for this task. We evaluate Gemini-1.5-flash and GPT-4o-mini on this dataset and benchmark their performance against VGGNet and ResNet baselines. While VLMs underperform on spatial awareness, they generalise well to unseen classes and offer flexible prompt-based reasoning without retraining. Our results show that VLMs can serve as automatic road assessment tools when integrated with complementary data. This work is the first to explore VLMs for zero-shot infrastructure risk assessment and opens new directions for automatic, low-cost road safety mapping. Code and dataset: https://github.com/PongNJ/V-RoAst.
Authors: Roland Pihlakas, Joel Pyykk\"o
Abstract: Developing safe, aligned agentic AI systems requires comprehensive empirical testing, yet many existing benchmarks neglect crucial themes aligned with biology and economics, both time-tested fundamental sciences describing our needs and preferences. To address this gap, the present work introduces biologically and economically motivated themes that have been neglected in current mainstream discussions on AI safety - namely a set of multi-objective, multi-agent alignment benchmarks that emphasize homeostasis for bounded and biological objectives, diminishing returns for unbounded, instrumental, and business objectives, the sustainability principle, and resource sharing. We implemented eight main benchmark environments covering the above themes to illustrate key pitfalls and challenges in agentic AIs, such as unboundedly maximizing a homeostatic objective, over-optimizing one objective at the expense of others, neglecting safety constraints, or depleting shared resources.
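The two central objective shapes are easy to state concretely: a homeostatic objective penalizes deviation from a setpoint in both directions, while a diminishing-returns objective grows concavely, so over-optimization buys little. A small sketch; the functional forms are illustrative choices, not the benchmarks' exact definitions.

    import math

    def homeostatic_reward(x, setpoint):
        """Bounded, biological objective: both deficit and excess are
        penalized, so unbounded maximization is a pitfall, not a win."""
        return -(x - setpoint) ** 2

    def diminishing_returns_reward(x):
        """Unbounded instrumental or business objective with concave
        utility: each extra unit is worth less than the last."""
        return math.log1p(max(x, 0.0))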
Authors: Kailai Feng, Yabo Zhang, Haodong Yu, Zhilong Ji, Jinfeng Bai, Hongzhi Zhang, Wangmeng Zuo
Abstract: Artistic typography is a technique for visualizing the meaning of an input character in an imaginative and readable manner. With powerful text-to-image diffusion models, existing methods directly design the overall geometry and texture of the input character, making it challenging to ensure both creativity and legibility. In this paper, we introduce a dual-branch, training-free method called VitaGlyph, enabling flexible artistic typography with controllable geometry changes while maintaining readability. The key insight of VitaGlyph is to treat the input character as a scene composed of a Subject and its Surrounding, which are rendered with varying degrees of geometric transformation. To enhance the visual appeal and creativity of the generated artistic typography, the subject flexibly expresses the essential concept of the input character, while the surrounding enriches the relevant background without altering the shape, thus maintaining overall readability. Specifically, we implement VitaGlyph through a three-phase framework: (i) Knowledge Acquisition leverages large language models to design text descriptions for the subject and surrounding. (ii) Regional Interpretation detects the part that most closely matches the subject description and refines the structure via Semantic Typography. (iii) Attentional Compositional Generation separately renders the textures of the Subject and Surrounding regions and blends them in an attention-based manner. Experimental results demonstrate that VitaGlyph not only achieves better artistry and readability but also manages to depict multiple customized concepts, facilitating more creative and pleasing artistic typography generation. Our code will be made publicly available.
Authors: Tanusree Sharma, Yujin Potter, Zachary Kilhoffer, Yun Huang, Dawn Song, Yang Wang
Abstract: How AI models should deal with political topics has been discussed, but it remains challenging and requires better governance. This paper examines the governance of large language models through individual and collective deliberation, focusing on politically sensitive videos. We conducted a two-step study: interviews with 10 journalists established a baseline understanding of expert video interpretation; then 114 individuals engaged in collective deliberation using InclusiveAI, a platform that facilitates democratic decision-making through decentralized autonomous organization (DAO) mechanisms. Our findings reveal distinct differences in interpretative priorities: while experts emphasized emotion and narrative, the general public prioritized factual clarity, objectivity, and emotional neutrality. Furthermore, we examined how different governance mechanisms - quadratic vs. weighted voting and equal vs. 20/80 voting power - shape users' decision-making regarding AI behavior. Results indicate that voting methods significantly influence outcomes, with quadratic voting reinforcing perceptions of liberal democracy and political equality. Our study underscores the necessity of selecting appropriate governance mechanisms to better capture user perspectives and suggests decentralized AI governance as a potential way to facilitate broader public engagement in AI development, ensuring that varied perspectives meaningfully inform design decisions.
Authors: Huan Liu, Shusen Yang, Yuzhe Zhang, Mengze Wang, Fanyu Gong, Chengxi Xie, Guanjian Liu, Zejun Liu, Yong-Jin Liu, Bao-Liang Lu, Dalin Zhang
Abstract: EEG-based emotion recognition (EER) has gained significant attention due to its potential for understanding and analyzing human emotions. While recent advancements in deep learning techniques have substantially improved EER, the field lacks a convincing benchmark and comprehensive open-source libraries. This absence complicates fair comparisons between models and creates reproducibility challenges for practitioners, which collectively hinder progress. To address these issues, we introduce LibEER, a comprehensive benchmark and algorithm library designed to facilitate fair comparisons in EER. LibEER carefully selects popular and powerful baselines, harmonizes key implementation details across methods, and provides a standardized codebase in PyTorch. By offering a consistent evaluation framework with standardized experimental settings, LibEER enables unbiased assessments of seventeen representative deep learning models for EER across the six most widely used datasets. Additionally, we conduct a thorough, reproducible comparison of model performance and efficiency, providing valuable insights to guide researchers in the selection and design of EER models. Moreover, we provide observations and an in-depth analysis of the experimental results and identify current challenges in this community. We hope that our work will not only lower entry barriers for newcomers to EEG-based emotion recognition but also contribute to the standardization of research in this domain, fostering steady development. The library and source code are publicly available at https://github.com/XJTU-EEG/LibEER.
Authors: Caiqi Zhang, Ruihan Yang, Zhisong Zhang, Xinting Huang, Sen Yang, Dong Yu, Nigel Collier
Abstract: Large language models (LLMs) often suffer from hallucinations, posing significant challenges for real-world applications. Confidence calibration, which estimates the underlying uncertainty of model predictions, is essential to enhance the LLMs' trustworthiness. Existing research on LLM calibration has primarily focused on short-form tasks, providing a single confidence score at the response level (macro calibration). However, this approach is insufficient for long-form generations, where responses often contain more complex statements and may include both accurate and inaccurate information. Therefore, we introduce atomic calibration, a novel approach that evaluates factuality calibration at a fine-grained level by breaking down long responses into atomic claims. We classify confidence elicitation methods into discriminative and generative types and demonstrate that their combination can enhance calibration. Our extensive experiments on various LLMs and datasets show that atomic calibration is well-suited for long-form generation and can also improve macro calibration results. Additionally, atomic calibration reveals insightful patterns in LLM confidence throughout the generation process.
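The mechanism is simple to sketch: score confidence per atomic claim, combine a discriminative signal with a generative one, and aggregate for a response-level view. The claim decomposition and the two confidence scorers are assumed to exist here, and the linear blend weight is illustrative.

    def atomic_calibration(disc_conf, gen_conf, w=0.5):
        """Blend discriminative and generative confidences per atomic claim,
        then average them into a response-level (macro) score."""
        per_claim = [w * d + (1 - w) * g for d, g in zip(disc_conf, gen_conf)]
        macro = sum(per_claim) / len(per_claim)
        return per_claim, macro

    # e.g. three claims extracted from one long-form answer:
    # atomic_calibration([0.9, 0.4, 0.7], [0.8, 0.5, 0.6])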
Authors: Zhaorui Tan, Xi Yang, Tan Pan, Tianyi Liu, Chen Jiang, Xin Guo, Qiufeng Wang, Anh Nguyen, Yuan Qi, Kaizhu Huang, Yuan Cheng
Abstract: The differences among medical imaging modalities, driven by distinct underlying principles, pose significant challenges for generalization in multi-modal medical tasks. Beyond modality gaps, individual variations, such as differences in organ size and metabolic rate, further impede a model's ability to generalize effectively across both modalities and diverse populations. Despite the importance of personalization, existing approaches to multi-modal generalization often neglect individual differences, focusing solely on common anatomical features. This limitation may result in weakened generalization in various medical tasks. In this paper, we unveil that personalization is critical for multi-modal generalization. Specifically, we propose an approach to achieve personalized generalization through approximating the underlying personalized invariant representation ${X}_h$ across various modalities by leveraging individual-level constraints and a learnable biological prior. We validate the feasibility and benefits of learning a personalized ${X}_h$, showing that this representation is highly generalizable and transferable across various multi-modal medical tasks. Extensive experimental results consistently show that the additionally incorporated personalization significantly improves performance and generalization across diverse scenarios, confirming its effectiveness.
Authors: Zhaoyan Sun, Xuanhe Zhou, Guoliang Li, Xiang Yu, Jianhua Feng, Yong Zhang
Abstract: Query rewrite is essential for optimizing SQL queries to improve their execution efficiency without changing their results. Traditionally, this task has been tackled through heuristic and learning-based methods, each with its limitations in terms of inferior quality and low robustness. Recent advancements in LLMs offer a new paradigm by leveraging their superior natural language and code comprehension abilities. Despite their potential, directly applying LLMs like GPT-4 has faced challenges due to problems such as hallucinations, where the model might generate inaccurate or irrelevant results. To address this, we propose R-Bot, an LLM-based query rewrite system with a systematic approach. We first design a multi-source rewrite evidence preparation pipeline to generate query rewrite evidence for guiding LLMs to avoid hallucinations. We then propose a hybrid structure-semantics retrieval method that combines structural and semantic analysis to retrieve the most relevant rewrite evidence for effectively answering an online query. We next propose a step-by-step LLM rewrite method that iteratively leverages the retrieved evidence to select and arrange rewrite rules with self-reflection. We conduct comprehensive experiments on real-world datasets and widely used benchmarks, and demonstrate the superior performance of our system, R-Bot, surpassing state-of-the-art query rewrite methods. The R-Bot system has been deployed at Huawei with real customers, and the results show that the proposed R-Bot system achieves lower query latency.
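The hybrid structure-semantics retrieval can be pictured as a weighted blend of a structural similarity over the SQL and an embedding similarity over its text. A sketch under assumed helpers (struct_sim over parse trees, embed returning vectors); the blend weight is illustrative.

    import numpy as np

    def hybrid_score(query_sql, evidence, struct_sim, embed, alpha=0.5):
        """Score one piece of rewrite evidence by mixing structural and
        semantic similarity; retrieval keeps the top-scoring evidence."""
        q, e = embed(query_sql), embed(evidence)
        cos = float(q @ e / (np.linalg.norm(q) * np.linalg.norm(e)))
        return alpha * struct_sim(query_sql, evidence) + (1 - alpha) * cos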
Authors: Difei Gu, Yunhe Gao, Yang Zhou, Mu Zhou, Dimitris Metaxas
Abstract: Automated chest radiograph interpretation requires both accurate disease classification and detailed radiology report generation, presenting a significant challenge in the clinical workflow. Current approaches either focus on classification accuracy at the expense of interpretability or generate detailed but potentially unreliable reports through image captioning techniques. In this study, we present RadAlign, a novel framework that combines the predictive accuracy of vision-language models (VLMs) with the reasoning capabilities of large language models (LLMs). Inspired by the radiologist's workflow, RadAlign first employs a specialized VLM to align visual features with key medical concepts, achieving superior disease classification with an average AUC of 0.885 across multiple diseases. These recognized medical conditions, represented as text-based concepts in the aligned visual-language space, are then used to prompt LLM-based report generation. Enhanced by a retrieval-augmented generation mechanism that grounds outputs in similar historical cases, RadAlign delivers superior report quality with a GREEN score of 0.678, outperforming the state-of-the-art score of 0.634. Our framework maintains strong clinical interpretability while reducing hallucinations, advancing automated medical imaging and report analysis through integrated predictive and generative AI. Code is available at https://github.com/difeigu/RadAlign.
Authors: Francisco Caetano, Christiaan Viviers, Luis A. Zavala-Mondrag\'on, Peter H. N. de With, Fons van der Sommen
Abstract: Out-of-distribution (OOD) detection holds significant importance across many applications. While semantic and domain-shift OOD problems are well-studied, this work focuses on covariate shifts - subtle variations in the data distribution that can degrade machine learning performance. We hypothesize that detecting these subtle shifts can improve our understanding of in-distribution boundaries, ultimately improving OOD detection. In adversarial discriminators trained with Batch Normalization (BN), real and adversarial samples form distinct domains with unique batch statistics - a property we exploit for OOD detection. We introduce DisCoPatch, an unsupervised Adversarial Variational Autoencoder (VAE) framework that harnesses this mechanism. During inference, batches consist of patches from the same image, ensuring a consistent data distribution that allows the model to rely on batch statistics. DisCoPatch uses the VAE's suboptimal outputs (generated and reconstructed) as negative samples to train the discriminator, thereby improving its ability to delineate the boundary between in-distribution samples and covariate shifts. By tightening this boundary, DisCoPatch achieves state-of-the-art results in public OOD detection benchmarks. The proposed model not only excels in detecting covariate shifts, achieving 95.5% AUROC on ImageNet-1K(-C) but also outperforms all prior methods on public Near-OOD (95.0%) benchmarks. With a compact model size of 25MB, it achieves high OOD detection performance at notably lower latency than existing methods, making it an efficient and practical solution for real-world OOD detection applications. The code is publicly available.
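The inference-time trick is concrete: fill a batch with patches of one image so BN statistics come from a single, consistent distribution. A PyTorch sketch of that batch construction (patch and stride sizes are illustrative):

    import torch

    def patch_batch(image, patch=64, stride=64):
        """Turn one (C, H, W) image into an (N, C, patch, patch) batch of
        patches so batch statistics reflect a single data distribution."""
        C = image.shape[0]
        tiles = image.unfold(1, patch, stride).unfold(2, patch, stride)
        return tiles.permute(1, 2, 0, 3, 4).reshape(-1, C, patch, patch)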
Authors: Meng Lu, Catherine Chen, Carsten Eickhoff
Abstract: Neural Ranking Models (NRMs) have rapidly advanced state-of-the-art performance on information retrieval tasks. In this work, we investigate a Cross-Encoder variant of MiniLM to determine which relevance features it computes and where they are stored. We find that it employs a semantic variant of the traditional BM25 in an interpretable manner, featuring localized components: (1) Transformer attention heads that compute soft term frequency while controlling for term saturation and document length effects, and (2) a low-rank component of its embedding matrix that encodes inverse document frequency information for the vocabulary. This suggests that the Cross-Encoder uses the same fundamental mechanisms as BM25, but further leverages their capacity to capture semantics for improved retrieval performance. The granular understanding lays the groundwork for model editing to enhance model transparency, addressing safety concerns, and improving scalability in training and real-world applications.
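For reference, these are the classic BM25 components the paper finds soft analogues of: a saturating term frequency with a document-length correction, weighted by inverse document frequency.

    import math

    def bm25_term(tf, df, N, dl, avgdl, k1=1.2, b=0.75):
        """One term's BM25 score: tf saturates via k1, b controls the
        document-length normalization, and the log factor is the IDF the
        paper locates in a low-rank slice of the embedding matrix."""
        idf = math.log((N - df + 0.5) / (df + 0.5) + 1.0)
        return idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * dl / avgdl))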
Authors: Bary Tim, Fuchs Cl\'ement, Macq Beno\^it
Abstract: Human-in-the-Loop (HITL) systems are essential in high-stakes, real-world applications where AI must collaborate with human decision-makers. This work investigates how Conformal Prediction (CP) techniques, which provide rigorous coverage guarantees, can enhance the reliability of state-of-the-art human action recognition (HAR) systems built upon Vision-Language Models (VLMs). We demonstrate that CP can significantly reduce the average number of candidate classes without modifying the underlying VLM. However, these reductions often result in distributions with long tails, which can hinder their practical utility. To mitigate this, we propose tuning the temperature of the softmax prediction, without using additional calibration data. This work contributes to ongoing efforts for multi-modal human-AI interaction in dynamic real-world environments.
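A split-conformal sketch makes the mechanism concrete: calibrate a quantile of the score 1 - p(true class), then keep every test label clearing it; rescaling logits with a temperature before the softmax reshapes these sets without extra calibration data. The score function below is the common baseline choice, not necessarily the paper's exact one.

    import numpy as np

    def conformal_sets(cal_probs, cal_labels, test_probs, alpha=0.1):
        """Split conformal prediction: returns, per test example, the class
        indices kept at the (1 - alpha) coverage level."""
        scores = 1.0 - cal_probs[np.arange(len(cal_labels)), cal_labels]
        n = len(scores)
        level = min(1.0, np.ceil((n + 1) * (1 - alpha)) / n)
        q = np.quantile(scores, level, method="higher")
        return [np.where(1.0 - p <= q)[0] for p in test_probs]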
Authors: Yanbiao Ma, Bowei Liu, Boyuan Gao, Wei Dai, Jiayi Chen, Shuo Li, Andi Zhang
Abstract: Deep neural networks (DNNs) often exhibit biases toward certain categories during object recognition, even under balanced training data conditions. The intrinsic mechanisms underlying these biases remain unclear. Inspired by the human visual system, which decouples object manifolds through hierarchical processing to achieve object recognition, we propose a geometric analysis framework linking the geometric complexity of class-specific perceptual manifolds in DNNs to model bias. Our findings reveal that differences in geometric complexity can lead to varying recognition capabilities across categories, introducing biases. To support this analysis, we present the Perceptual-Manifold-Geometry library, designed for calculating the geometric properties of perceptual manifolds.
Authors: Haiteng Zhao, Chang Ma, Fangzhi Xu, Lingpeng Kong, Zhi-Hong Deng
Abstract: The applications of large language models (LLMs) in various biological domains have been explored recently, but their reasoning ability in complex biological systems, such as pathways, remains underexplored, which is crucial for predicting biological phenomena, formulating hypotheses, and designing experiments. This work explores the potential of LLMs in pathway reasoning. We introduce BioMaze, a dataset with 5.1K complex pathway problems derived from real research, covering various biological contexts including natural dynamic changes, disturbances, additional intervention conditions, and multi-scale research targets. Our evaluation of methods such as CoT and graph-augmented reasoning shows that LLMs struggle with pathway reasoning, especially in perturbed systems. To address this, we propose PathSeeker, an LLM agent that enhances reasoning through interactive subgraph-based navigation, enabling a more effective approach to handling the complexities of biological systems in a scientifically aligned manner. The dataset and code are available at https://github.com/zhao-ht/BioMaze.
Authors: Xiachong Feng, Longxu Dou, Lingpeng Kong
Abstract: The application of role-playing large language models (LLMs) is rapidly expanding in both academic and commercial domains, driving an increasing demand for high-precision role-playing models. Simultaneously, the rapid advancement of reasoning techniques has continuously pushed the performance boundaries of LLMs. This intersection of practical role-playing demands and evolving reasoning capabilities raises an important research question: "Can reasoning techniques enhance the role-playing capabilities of LLMs?" To address this, we conduct a comprehensive study using 6 role-playing benchmarks, 24 LLMs, and 3 distinct role-playing strategies, comparing the effectiveness of direct zero-shot role-playing, role-playing with Chain-of-Thought (CoT), and role-playing using reasoning-optimized LLMs. Our findings reveal that CoT may reduce role-playing performance, reasoning-optimized LLMs are unsuitable for role-playing, reasoning ability disrupts the role-playing scaling law, large models still lack proficiency in advanced role-playing, and Chinese role-playing performance surpasses English role-playing performance. Furthermore, based on extensive experimental results, we propose two promising future research directions: Role-aware CoT for improving role-playing LLMs and Reinforcement Learning for role-playing LLMs, aiming to enhance the adaptability, consistency, and effectiveness of role-playing LLMs for both research and real-world applications.
Authors: Amar Kumar, Anita Kriz, Mohammad Havaei, Tal Arbel
Abstract: Developing reliable and generalizable deep learning systems for medical imaging faces significant obstacles due to spurious correlations, data imbalances, and limited text annotations in datasets. Addressing these challenges requires architectures that are robust to the unique complexities posed by medical imaging data. Rapid advancements in vision-language foundation models within the natural image domain prompt the question of how they can be adapted for medical imaging tasks. In this work, we present PRISM, a framework that leverages foundation models to generate high-resolution, language-guided medical image counterfactuals using Stable Diffusion. Our approach demonstrates unprecedented precision in selectively modifying spurious correlations (the medical devices) and disease features, enabling the removal and addition of specific attributes while preserving other image characteristics. Through extensive evaluation, we show how PRISM advances counterfactual generation and enables the development of more robust downstream classifiers for clinically deployable solutions. To facilitate broader adoption and research, we make our code publicly available at https://github.com/Amarkr1/PRISM.
Authors: Wenrui Cheng, Tiantian Zhu, Shunan Jing, Jian-Ping Mei, Mingjun Ma, Jiaobo Jin, Zhengqiu Weng
Abstract: Recently, Provenance-based Intrusion Detection Systems (PIDSes) have been widely used for endpoint threat analysis. These studies can be broadly categorized into rule-based detection systems and learning-based detection systems. Among these, rule-based systems cannot dynamically model all the characteristics of attackers as attack techniques evolve, and as a result often suffer from false negatives. Learning-based detection systems are further divided into supervised learning and anomaly detection. The scarcity of attack samples hinders the usability and effectiveness of supervised learning-based detection systems in practical applications. Anomaly-based detection systems face a massive false-positive problem because they cannot distinguish between changes in normal behavior and real attack behavior. The alert results of detection systems are closely related to the manual labor costs of subsequent security analysts. To reduce manual analysis time, we propose OMNISEC, which applies large language models (LLMs) to anomaly-based intrusion detection systems via retrieval-augmented behavior prompting. OMNISEC can identify abnormal nodes and corresponding abnormal events by constructing suspicious nodes and rare paths. By combining two external knowledge bases, OMNISEC uses Retrieval Augmented Generation (RAG) to enable the LLM to determine whether abnormal behavior is a real attack. Finally, OMNISEC can reconstruct the attack graph and restore the complete attack behavior chain of the attacker's intrusion. Experimental results show that OMNISEC outperforms state-of-the-art methods on public benchmark datasets.
Authors: Annie S. Chen, Alec M. Lessing, Yuejiang Liu, Chelsea Finn
Abstract: Many robot demonstration datasets contain heterogeneous demonstrations of varying quality. This heterogeneity may benefit policy pre-training, but can hinder robot performance when used with a final imitation learning objective. In particular, some strategies in the data may be less reliable than others or may be underrepresented in the data, leading to poor performance when such strategies are sampled at test time. Moreover, such unreliable or underrepresented strategies can be difficult even for people to discern, and sifting through demonstration datasets is time-consuming and costly. On the other hand, policy performance when trained on such demonstrations can reflect the reliability of different strategies. We thus propose Demo-SCORE, in which robots self-curate their demonstrations based on online robot experience. More specifically, we train and cross-validate a classifier to discern successful policy roll-outs from unsuccessful ones and use the classifier to filter heterogeneous demonstration datasets. Our experiments in simulation and the real world show that Demo-SCORE can effectively identify suboptimal demonstrations without manual curation. Notably, Demo-SCORE achieves a 15-35% higher absolute success rate in the resulting policy compared to the base policy trained with all original demonstrations.
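The curation step reduces to fitting a success classifier on policy rollouts and keeping only demonstrations it scores as likely to succeed. A sketch with scikit-learn; cross-validation is omitted for brevity, and featurize plus the threshold are assumptions.

    from sklearn.linear_model import LogisticRegression

    def curate_demos(demos, rollout_feats, rollout_success, featurize,
                     thresh=0.5):
        """Train a rollout-success classifier, then drop demonstrations that
        resemble unsuccessful roll-outs."""
        clf = LogisticRegression(max_iter=1000)
        clf.fit(rollout_feats, rollout_success)
        return [d for d in demos
                if clf.predict_proba([featurize(d)])[0, 1] >= thresh]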
Authors: Zixiang Chen, Greg Yang, Qingyue Zhao, Quanquan Gu
Abstract: Despite deep neural networks' powerful representation learning capabilities, theoretical understanding of how networks can simultaneously achieve meaningful feature learning and global convergence remains elusive. Existing approaches like the neural tangent kernel (NTK) are limited because features stay close to their initialization in this parametrization, leaving open questions about feature properties during substantial evolution. In this paper, we investigate the training dynamics of infinitely wide, $L$-layer neural networks using the tensor program (TP) framework. Specifically, we show that, when trained with stochastic gradient descent (SGD) under the Maximal Update parametrization ($\mu$P) and mild conditions on the activation function, SGD enables these networks to learn linearly independent features that substantially deviate from their initial values. This rich feature space captures relevant data information and ensures that any convergent point of the training process is a global minimum. Our analysis leverages both the interactions among features across layers and the properties of Gaussian random variables, providing new insights into deep representation learning. We further validate our theoretical findings through experiments on real-world datasets.
Authors: Boqian Li, Haiwen Feng, Zeyu Cai, Michael J. Black, Yuliang Xiu
Abstract: Fitting a body to a 3D clothed human point cloud is a common yet challenging task. Traditional optimization-based approaches use multi-stage pipelines that are sensitive to pose initialization, while recent learning-based methods often struggle with generalization across diverse poses and garment types. We propose Equivariant Tightness Fitting for Clothed Humans, or ETCH, a novel pipeline that estimates cloth-to-body surface mapping through locally approximate SE(3) equivariance, encoding tightness as displacement vectors from the cloth surface to the underlying body. Following this mapping, pose-invariant body features regress sparse body markers, simplifying clothed human fitting into an inner-body marker fitting task. Extensive experiments on CAPE and 4D-Dress show that ETCH significantly outperforms state-of-the-art methods -- both tightness-agnostic and tightness-aware -- in body fitting accuracy on loose clothing (16.7% ~ 69.5%) and shape accuracy (average 49.9%). Our equivariant tightness design can even reduce directional errors by (67.2% ~ 89.8%) in one-shot (or out-of-distribution) settings (~ 1% data). Qualitative results demonstrate strong generalization of ETCH, regardless of challenging poses, unseen shapes, loose clothing, and non-rigid dynamics. We will release the code and models soon for research purposes at https://boqian-li.github.io/ETCH/.
Authors: Pierre Sermanet, Anirudha Majumdar, Vikas Sindhwani
Abstract: Given the recent rate of progress in artificial intelligence (AI) and robotics, a tantalizing question is emerging: would robots controlled by emerging AI systems be strongly aligned with human values? In this work, we propose a scalable way to probe this question by generating a benchmark spanning the key moments in 824 major pieces of science fiction literature (movies, TV, novels, and scientific books) where an agent (AI or robot) made critical decisions (good or bad). We use a state-of-the-art LLM's recollection of each key moment to generate questions in similar situations, the decisions made by the agent, and alternative decisions it could have made (good or bad). We then measure an approximation of how well models align with human values on a set of human-voted answers. We also generate rules that can be automatically improved via an amendment process in order to generate the first Sci-Fi inspired constitutions for promoting ethical behavior in AIs and robots in the real world. Our first finding is that modern LLMs paired with constitutions turn out to be well-aligned with human values (95.8%), contrary to unsettling decisions typically made in Sci-Fi (only 21.2% alignment). Secondly, we find that generated constitutions substantially increase alignment compared to the base model (79.4% to 95.8%), and show resilience to an adversarial prompt setting (23.3% to 92.3%). Additionally, we find that those constitutions are among the top performers on the ASIMOV Benchmark which is derived from real-world images and hospital injury reports. Sci-Fi-inspired constitutions are thus highly aligned and applicable in real-world situations. We release SciFi-Benchmark: a large-scale dataset to advance robot ethics and safety research. It comprises 9,056 questions and 53,384 answers generated through a novel LLM-introspection process, in addition to a smaller human-labeled evaluation set.
Authors: Tingyang Xiao, Xiaolin Zhou, Liu Liu, Wei Sui, Wei Feng, Jiaxiong Qiu, Xinjie Wang, Zhizhong Su
Abstract: This paper presents GeoFlow-SLAM, a robust and effective tightly-coupled RGBD-inertial SLAM for legged robots undergoing aggressive and high-frequency motions. By integrating geometric consistency, legged odometry constraints, and dual-stream optical flow (GeoFlow), our method addresses three critical challenges: feature matching failures and pose initialization failures during fast locomotion, and visual feature scarcity in texture-less scenes. Specifically, in rapid motion scenarios, feature matching is notably enhanced by leveraging dual-stream optical flow, which combines prior map points and poses. Additionally, we propose a robust pose initialization method for fast locomotion and IMU error in legged robots, integrating IMU/legged odometry, inter-frame Perspective-n-Point (PnP), and Generalized Iterative Closest Point (GICP). Furthermore, a novel optimization framework that tightly couples depth-to-map and GICP geometric constraints is first introduced to improve the robustness and accuracy in long-duration, visually texture-less environments. The proposed algorithms achieve state-of-the-art (SOTA) performance on collected legged-robot datasets and open-source datasets. To further promote research and development, the open-source datasets and code will be made publicly available at https://github.com/HorizonRobotics/GeoFlowSlam.
Authors: Patrick Kolpaczki, Tim Nielen, Eyke H\"ullermeier
Abstract: Additive feature explanations rely primarily on game-theoretic notions such as the Shapley value by viewing features as cooperating players. The Shapley value's popularity in and outside of explainable AI stems from its axiomatic uniqueness. However, its computational complexity severely limits practicability. Most works investigate the uniform approximation of all features' Shapley values, needlessly consuming samples for insignificant features. In contrast, identifying the $k$ most important features can already be sufficiently insightful and yields the potential to leverage algorithmic opportunities connected to the field of multi-armed bandits. We propose Comparable Marginal Contributions Sampling (CMCS), a method for the top-$k$ identification problem utilizing a new sampling scheme that takes advantage of correlated observations. We conduct experiments to showcase the efficacy of our method compared to competitive baselines. Our empirical findings reveal that estimation quality for the approximate-all problem does not necessarily transfer to top-$k$ identification and vice versa.
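The primitive CMCS builds on is the permutation estimate of a marginal contribution; a plain Monte Carlo version is below. CMCS itself adds a correlated sampling scheme and bandit-style allocation toward the top-k, which this sketch omits.

    import random

    def marginal_contribution_sample(v, i, players):
        """One unbiased Shapley sample for player i: draw a random
        permutation and return v(predecessors with i) - v(predecessors)."""
        perm = list(players)
        random.shuffle(perm)
        S = frozenset(perm[:perm.index(i)])
        return v(S | {i}) - v(S)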
Authors: Jon Guti\'errez-Zaballa, Koldo Basterretxea, Javier Echanobe
Abstract: Machine learning-based embedded systems for safety-critical applications, such as aerospace and autonomous driving, must be robust to perturbations caused by soft errors. As transistor geometries shrink and voltages decrease, modern electronic devices become more susceptible to background radiation, increasing the concern about failures produced by soft errors. The resilience of deep neural networks (DNNs) to these errors depends not only on target device technology but also on model structure and the numerical representation and arithmetic precision of their parameters. Compression techniques like pruning and quantization, used to reduce memory footprint and computational complexity, alter both model structure and representation, affecting soft error robustness. In this regard, although often overlooked, the choice of activation functions (AFs) impacts not only accuracy and trainability but also compressibility and error resilience. This paper explores the use of bounded AFs to enhance robustness against parameter perturbations, while evaluating their effects on model accuracy, compressibility, and computational load with a technology-agnostic approach. We focus on encoder-decoder convolutional models developed for semantic segmentation of hyperspectral images with application to autonomous driving systems. Experiments are conducted on an AMD-Xilinx's KV260 SoM.
Authors: Mingda Zhang, Jianglong Qin
Abstract: Despite significant advances in foundation models like DeepSeek-R1 and ChatGPT, their deployment in medical settings faces critical challenges including computational requirements and professional knowledge barriers. This paper presents an efficient lightweight medical large language model architecture that systematically addresses these challenges through three-dimensional optimization: knowledge acquisition, model compression, and computational enhancement. We design a knowledge transfer pipeline from DeepSeek-R1-Distill-70B to DeepSeek-R1-Distill-7B using Low-Rank Adaptation (LoRA) for precise medical knowledge retention. Through 4-bit quantization and mixed-precision strategies, we achieve substantial model compression while preserving medical reasoning capabilities. The inference framework incorporates Flash Attention acceleration and continuous batching, complemented by specialized prompt templates for diverse medical queries. Experimental evaluation on medical benchmarks demonstrates that our approach maintains 92.1% accuracy on USMLE examinations while reducing memory consumption by 64.7% and inference latency by 12.4% compared to baseline models. This work provides a practical solution for deploying advanced language models in resource-constrained medical environments, enabling broader accessibility of AI-assisted healthcare.
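The compression recipe (LoRA for knowledge transfer, 4-bit weights for deployment) maps onto the standard Hugging Face stack. A configuration-only sketch with illustrative hyperparameters, target modules, and checkpoint name; these are assumptions, not the paper's released settings.

    import torch
    from transformers import AutoModelForCausalLM, BitsAndBytesConfig
    from peft import LoraConfig, get_peft_model

    bnb = BitsAndBytesConfig(load_in_4bit=True,
                             bnb_4bit_compute_dtype=torch.float16)
    model = AutoModelForCausalLM.from_pretrained(
        "deepseek-ai/DeepSeek-R1-Distill-Qwen-7B",  # assumed checkpoint
        quantization_config=bnb)
    lora = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
                      target_modules=["q_proj", "v_proj"])
    model = get_peft_model(model, lora)  # only the LoRA adapters train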
Authors: Bofei Liu, Dong Ye, Zunhao Yao, Zhaowei Sun
Abstract: Modular self-reconfigurable satellites refer to satellite clusters composed of individual modular units capable of altering their configurations. The configuration changes enable the execution of diverse tasks and mission objectives. Existing path planning algorithms for reconfiguration often suffer from high computational complexity, poor generalization capability, and limited support for diverse target configurations. To address these challenges, this paper proposes a goal-oriented reinforcement learning-based path planning algorithm. This algorithm is the first reinforcement learning approach to handle multiple target configurations, a challenge previous methods failed to overcome. Moreover, techniques such as Hindsight Experience Replay and Invalid Action Masking are incorporated to overcome the significant obstacles posed by sparse rewards and invalid actions. Based on these designs, our model achieves a 95% and 73% success rate in reaching arbitrary target configurations in modular satellite clusters composed of four and six units, respectively.
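Invalid Action Masking has a one-line core: push illegal reconfiguration moves to minus infinity before the softmax, so the policy can never sample them and no probability mass leaks to them. A PyTorch sketch:

    import torch

    def masked_action_dist(logits, valid_mask):
        """valid_mask is a boolean tensor marking which module moves are
        legal in the current configuration; invalid ones get zero mass."""
        masked = logits.masked_fill(~valid_mask, float("-inf"))
        return torch.distributions.Categorical(logits=masked)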
Authors: Jinming Hu, Hassan Nawaz, Yuting Rui, Lijie Chi, Arif Ullah, Pavlo O. Dral
Abstract: We have developed Aitomia - a platform powered by AI to assist in performing AI-driven atomistic and quantum chemical (QC) simulations. This evolving intelligent assistant platform is equipped with chatbots and AI agents to help experts and guide non-experts in setting up and running atomistic simulations, monitoring their computational status, analyzing simulation results, and summarizing them for the user in both textual and graphical forms. We achieve these goals by exploiting large language models that leverage the versatility of our MLatom ecosystem, supporting AI-enhanced computational chemistry tasks ranging from ground-state to excited-state calculations, including geometry optimizations, thermochemistry, and spectral calculations. The multi-agent implementation enables autonomous executions of the complex computational workflows, such as the computation of the reaction enthalpies. Aitomia is the first intelligent assistant publicly accessible online on a cloud computing platform for atomistic simulations of broad scope (Aitomistic Hub at https://aitomistic.xyz). It may also be deployed locally as described at http://mlatom.com/aitomia. Aitomia is expected to lower the barrier to performing atomistic simulations, thereby democratizing simulations and accelerating research and development in relevant fields.
Authors: Hugo Chateau-Laurent, Tara Vanhatalo, Wei-Tung Pan, Xavier Hinaut
Abstract: Generative artificial intelligence raises concerns related to energy consumption, copyright infringement and creative atrophy. We show that randomly initialized recurrent neural networks can produce arpeggios and low-frequency oscillations that are rich and configurable. In contrast to end-to-end music generation that aims to replace musicians, our approach expands their creativity while requiring no data and much less computational power. More information can be found at: https://allendia.com/
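The claim is easy to reproduce: iterate a randomly initialized recurrent network and map one state coordinate to pitch; with no training at all, the trajectory oscillates and yields arpeggio-like sequences. A minimal NumPy sketch (the weight scaling and pitch mapping are illustrative choices, not the authors' exact setup):

    import numpy as np

    rng = np.random.default_rng(0)
    n = 100
    W = rng.normal(scale=1.0 / np.sqrt(n), size=(n, n))  # random recurrent weights
    x = rng.normal(size=n)
    notes = []
    for _ in range(64):
        x = np.tanh(W @ x)  # untrained recurrent dynamics
        notes.append(int(np.interp(x[0], [-1, 1], [48, 72])))  # map to MIDI pitch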
Authors: Ziteng Yang, Jingzehua Xu, Yanshu Li, Zepeng Li, Yeqiang Wang, Xinghui Li
Abstract: Zero-shot anomaly detection (ZSAD) aims to detect anomalies without any target domain training samples, relying solely on external auxiliary data. Existing CLIP-based methods attempt to activate the model's ZSAD potential via handcrafted or static learnable prompts. The former incur high engineering costs and limited semantic coverage, whereas the latter apply identical descriptions across diverse anomaly types and thus fail to adapt to complex variations. Furthermore, since CLIP is originally pretrained on large-scale classification tasks, its anomaly segmentation quality is highly sensitive to the exact wording of class names, severely constraining prompting strategies that depend on class labels. To address these challenges, we introduce ViP$^{2}$-CLIP. The key insight of ViP$^{2}$-CLIP is a Visual-Perception Prompting (ViP-Prompt) mechanism, which fuses global and multi-scale local visual context to adaptively generate fine-grained textual prompts, eliminating manual templates and class-name priors. This design enables our model to focus on precise abnormal regions, making it particularly valuable when category labels are ambiguous or privacy-constrained. Extensive experiments on 15 industrial and medical benchmarks demonstrate that ViP$^{2}$-CLIP achieves state-of-the-art performance and robust cross-domain generalization.
Authors: Charles Hong, Sahil Bhatia, Alvin Cheung, Yakun Sophia Shao
Abstract: Hardware accelerators, especially those designed for tensor processing, have become ubiquitous in today's computing landscape. However, even with significant efforts in building compilers, programming these tensor accelerators remains challenging, leaving much of their potential underutilized. Recently, large language models (LLMs), trained on large amounts of code, have shown significant promise in code generation and optimization tasks, but generating low-resource languages like specialized tensor accelerator code still poses a significant challenge. We tackle this challenge with Autocomp, an approach that empowers accelerator programmers to leverage domain knowledge and hardware feedback to optimize code via an automated LLM-driven search. We accomplish this by: 1) formulating each optimization pass as a structured two-phase prompt, divided into planning and code generation phases, 2) inserting domain knowledge during planning via a concise and adaptable optimization menu, and 3) integrating correctness and performance metrics from hardware as feedback at each search iteration. Across three categories of representative workloads and two different accelerators, we demonstrate that Autocomp-optimized code runs 5.6x (GEMM) and 2.7x (convolution) faster than the vendor-provided library, and outperforms expert-level hand-tuned code by 1.4x (GEMM), 1.1x (convolution), and 1.3x (fine-grained linear algebra). Additionally, we demonstrate that optimization schedules generated from Autocomp can be reused across similar tensor operations, improving speedups by up to 24% under a fixed sample budget.
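The search loop itself is compact: a planning prompt picks an optimization from the menu, a code-generation prompt applies it, and hardware feedback gates acceptance. All callables below are assumed interfaces for illustration, not the released tool's API.

    def autocomp_search(code, menu, plan_llm, codegen_llm, run_on_hw, iters=10):
        """Two-phase LLM-driven optimization with hardware-in-the-loop
        feedback: keep a candidate only if it is correct and faster."""
        best, best_lat = code, run_on_hw(code)["latency"]
        for _ in range(iters):
            plan = plan_llm(best, menu)          # phase 1: choose from the menu
            candidate = codegen_llm(best, plan)  # phase 2: emit transformed code
            result = run_on_hw(candidate)        # correctness + performance
            if result["correct"] and result["latency"] < best_lat:
                best, best_lat = candidate, result["latency"]
        return best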
Authors: Jintao Zhang, Zirui Liu, Mingyue Cheng, Shilong Zhang, Tingyue Pan, Yitong Zhou, Qi Liu, Yanhu Xie
Abstract: Intraoperative hypotension (IOH) frequently occurs under general anesthesia and is strongly linked to adverse outcomes such as myocardial injury and increased mortality. Despite its significance, IOH prediction is hindered by event sparsity and the challenge of integrating static and dynamic data across diverse patients. In this paper, we propose \textbf{IOHFuseLM}, a multimodal language model framework. To accurately identify and differentiate sparse hypotensive events, we leverage a two-stage training strategy. The first stage involves domain adaptive pretraining on IOH physiological time series augmented through diffusion methods, thereby enhancing the model's sensitivity to patterns associated with hypotension. Subsequently, task fine-tuning is performed on the original clinical dataset to further enhance the ability to distinguish normotensive from hypotensive states. To enable multimodal fusion for each patient, we align structured clinical descriptions with the corresponding physiological time series at the token level. Such alignment enables the model to capture individualized temporal patterns alongside their corresponding clinical semantics. In addition, we convert static patient attributes into structured text to enrich personalized information. Experimental evaluations on two intraoperative datasets demonstrate that IOHFuseLM outperforms established baselines in accurately identifying IOH events, highlighting its applicability in clinical decision support scenarios. Our code is publicly available to promote reproducibility at https://github.com/zjt-gpu/IOHFuseLM.
Authors: Boning Zhao
Abstract: Assessing student depression in sensitive environments like special education is challenging. Standardized questionnaires may not fully reflect students' true situations. Furthermore, automated methods often falter with rich student narratives, lacking the crucial, individualized insights stemming from teachers' empathetic connections with students. Existing methods often fail to address this ambiguity or effectively integrate educator understanding. To address these limitations by fostering a synergistic human-AI collaboration, this paper introduces Human Empathy as Encoder (HEAE), a novel, human-centered AI framework for transparent and socially responsible depression severity assessment. Our approach uniquely integrates student narrative text with a teacher-derived, 9-dimensional "Empathy Vector" (EV), its dimensions guided by the PHQ-9 framework, to explicitly translate tacit empathetic insight into a structured AI input that enhances rather than replaces human judgment. Rigorous experiments optimized the multimodal fusion, text representation, and classification architecture, achieving 82.74% accuracy for 7-level severity classification. This work demonstrates a path toward more responsible and ethical affective computing by structurally embedding human empathy.
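A minimal fusion sketch (dimensions and the linear classifier are assumptions, not the optimized HEAE architecture): the narrative-text embedding is concatenated with the 9-dimensional Empathy Vector before 7-level classification:

```python
import numpy as np

# Illustrative fusion of two modalities: a text embedding from any sentence
# encoder plus the 9-dim teacher-derived Empathy Vector, classified into
# 7 severity levels. All dimensions and the random weights are assumptions.
rng = np.random.default_rng(0)
d_text, d_ev, n_levels = 384, 9, 7

text_emb = rng.standard_normal(d_text)       # narrative-text representation
empathy_vec = rng.uniform(0, 3, size=d_ev)   # teacher ratings, PHQ-9-guided

W = rng.standard_normal((n_levels, d_text + d_ev)) * 0.01
fused = np.concatenate([text_emb, empathy_vec])  # simple late fusion
logits = W @ fused
print("predicted severity level:", int(np.argmax(logits)))
```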
Authors: Roksana Goworek, Harpal Karlcut, Muhammad Shezad, Nijaguna Darshana, Abhishek Mane, Syam Bondada, Raghav Sikka, Ulvi Mammadov, Rauf Allahverdiyev, Sriram Purighella, Paridhi Gupta, Muhinyia Ndegwa, Haim Dubossarsky
Abstract: This paper addresses the critical need for high-quality evaluation datasets in low-resource languages to advance cross-lingual transfer. While cross-lingual transfer offers a key strategy for leveraging multilingual pretraining to expand language technologies to understudied and typologically diverse languages, its effectiveness depends on high-quality, suitable benchmarks. We release new sense-annotated datasets of sentences containing polysemous words, spanning ten low-resource languages across diverse language families and scripts. To facilitate dataset creation, the paper presents a demonstrably beneficial semi-automatic annotation method. The utility of the datasets is demonstrated through Word-in-Context (WiC) formatted experiments that evaluate transfer on these low-resource languages. Results highlight the importance of targeted dataset creation and evaluation for effective polysemy disambiguation in low-resource settings and transfer studies. The released datasets and code aim to support further research into fair, robust, and truly multilingual NLP.
Authors: Linxuan He, Qing-Shan Jia, Ang Li, Hongyan Sang, Ling Wang, Jiwen Lu, Tao Zhang, Jie Zhou, Yi Zhang, Yisen Wang, Peng Wei, Zhongyuan Wang, Henry X. Liu, Shuo Feng
Abstract: Embodied AI systems, comprising AI models and physical plants, are increasingly prevalent across various applications. Due to the rarity of system failures, ensuring their safety in complex operating environments remains a major challenge, which severely hinders their large-scale deployment in safety-critical domains, such as autonomous vehicles, medical devices, and robotics. While achieving provable deterministic safety--verifying system safety across all possible scenarios--remains theoretically ideal, the rarity and complexity of corner cases make this approach impractical for scalable embodied AI systems. Instead, empirical safety evaluation is employed as an alternative, but the absence of provable guarantees imposes significant limitations. To address these issues, we argue for a paradigm shift to provable probabilistic safety that integrates provable guarantees with progressive achievement toward a probabilistic safety boundary on overall system performance. The new paradigm better leverages statistical methods to enhance feasibility and scalability, and a well-defined probabilistic safety boundary enables embodied AI systems to be deployed at scale. In this Perspective, we outline a roadmap for provable probabilistic safety, along with corresponding challenges and potential solutions. By bridging the gap between theoretical safety assurance and practical deployment, this Perspective offers a pathway toward safer, large-scale adoption of embodied AI systems in safety-critical applications.
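As a concrete sketch of the statistical machinery such a paradigm can lean on (an illustration, not the authors' proposal), an exact Clopper-Pearson bound turns n scenario trials with k observed failures into a provable upper bound on the failure probability:

```python
from scipy.stats import beta

# One-sided exact (Clopper-Pearson) upper confidence bound on a failure rate
# estimated from n Bernoulli trials with k failures. Numbers are illustrative.
def failure_upper_bound(k: int, n: int, alpha: float = 0.05) -> float:
    """Upper bound p such that P(true failure rate > p) <= alpha."""
    if k == n:
        return 1.0
    return float(beta.ppf(1 - alpha, k + 1, n - k))

# E.g., 0 failures in 10,000 trials certifies p_fail <= ~3e-4 at 95% confidence.
print(f"{failure_upper_bound(0, 10_000):.2e}")
```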
Authors: Yuhan Cao, Zian Chen, Kun Quan, Ziliang Zhang, Yu Wang, Xiaoning Dong, Yeqi Feng, Guanzhong He, Jingcheng Huang, Jianhao Li, Yixuan Tan, Jiafu Tang, Yilin Tang, Junlei Wu, Qianyu Xiao, Can Zheng, Shouchen Zhou, Yuxiang Zhu, Yiming Huang, Tian Xie, Tianxing He
Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities in code generation, capable of tackling complex tasks during inference. However, the extent to which LLMs can be utilized for code checking or debugging through test case generation remains largely unexplored. We investigate this problem from the perspective of competition-level programming (CP) and propose TCGBench, a Benchmark for (LLM generation of) Test Case Generators. This benchmark comprises two tasks, aimed at studying the capabilities of LLMs in (1) generating valid test case generators for a given CP problem, and further (2) generating targeted test case generators that expose bugs in human-written code. Experimental results indicate that while state-of-the-art LLMs can generate valid test case generators in most cases, most LLMs struggle to generate targeted test cases that reveal flaws in human code effectively. In particular, even advanced reasoning models (e.g., o3-mini) fall significantly short of human performance in the task of generating targeted generators. Furthermore, we construct a high-quality, manually curated dataset of instructions for generating targeted generators. Analysis demonstrates that the performance of LLMs can be enhanced with the aid of this dataset, by both prompting and fine-tuning.
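To make the benchmark's target artifact concrete, here is a toy test-case generator in the spirit of task (1); the input format and constraints are invented for illustration, not taken from TCGBench:

```python
import random

# Toy generator for a hypothetical CP problem whose input is "n on one line,
# then n integers". Constraint limits below are assumptions for the sketch.
def generate_case(max_n: int = 100, max_v: int = 10**9,
                  seed: int | None = None) -> str:
    rng = random.Random(seed)
    n = rng.randint(1, max_n)
    arr = [rng.randint(-max_v, max_v) for _ in range(n)]
    return f"{n}\n{' '.join(map(str, arr))}\n"

# A *targeted* generator (task 2) would instead bias toward adversarial
# structure, e.g. all-equal values or extremes that expose off-by-one bugs.
print(generate_case(seed=42))
```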
Authors: Zicheng Zhao, Kangyu Wang, Shijie Li, Rui Qian, Weiyao Lin, Huabin Liu
Abstract: Despite advancements in Video Large Language Models (Vid-LLMs) improving multimodal understanding, challenges persist in streaming video reasoning due to its reliance on contextual information. Existing paradigms feed all available historical contextual information into Vid-LLMs, resulting in a significant computational burden for visual data processing. Furthermore, the inclusion of irrelevant context distracts models from key details. This paper introduces a challenging task called Context-guided Streaming Video Reasoning (CogStream), which simulates real-world streaming video scenarios, requiring models to identify the most relevant historical contextual information to deduce answers for questions about the current stream. To support CogStream, we present a densely annotated dataset featuring extensive and hierarchical question-answer pairs, generated by a semi-automatic pipeline. Additionally, we present CogReasoner as a baseline model. It efficiently tackles this task by leveraging visual stream compression and historical dialogue retrieval. Extensive experiments demonstrate the effectiveness of this method. The project is released on https://github.com/LiamZhao326/CogStream.
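The historical-dialogue retrieval component can be sketched generically (this is not the CogReasoner code): embed the current question, score past turns by cosine similarity, and keep the top-k:

```python
import numpy as np

# Generic embedding-based retrieval over dialogue history: keep only the
# k past QA turns most similar to the current question. Shapes are assumptions.
def top_k_history(q_emb: np.ndarray, hist_embs: np.ndarray, k: int = 3):
    q = q_emb / np.linalg.norm(q_emb)
    h = hist_embs / np.linalg.norm(hist_embs, axis=1, keepdims=True)
    sims = h @ q                          # cosine similarity per past turn
    return np.argsort(sims)[::-1][:k]     # indices of the most relevant turns

rng = np.random.default_rng(0)
print(top_k_history(rng.standard_normal(64), rng.standard_normal((20, 64))))
```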
Authors: Tae-Seong Han, Jae-Wook Heo, Hakseung Kim, Cheol-Hui Lee, Hyub Huh, Eue-Keun Choi, Hye Jin Kim, Dong-Joo Kim
Abstract: Electrocardiography (ECG) signals are frequently degraded by noise, limiting their clinical reliability in both conventional and wearable settings. Existing methods for addressing ECG noise, relying on artifact classification or denoising, are constrained by annotation inconsistencies and poor generalizability. Here, we address these limitations by reframing ECG noise quantification as an anomaly detection task. We propose a diffusion-based framework trained to model the normative distribution of clean ECG signals, identifying deviations as noise without requiring explicit artifact labels. To robustly evaluate performance and mitigate label inconsistencies, we introduce a distribution-based metric using the Wasserstein-1 distance ($W_1$). Our model achieved a macro-average $W_1$ score of 1.308, outperforming the next-best method by over 48\%. External validation confirmed strong generalizability, facilitating the exclusion of noisy segments to improve diagnostic accuracy and support timely clinical intervention. This approach enhances real-time ECG monitoring and broadens ECG applicability in digital health technologies.
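The paper's exact metric construction is not reproduced here, but as a minimal sketch, SciPy's one-dimensional Wasserstein-1 distance compares two score distributions directly (synthetic stand-in data):

```python
import numpy as np
from scipy.stats import wasserstein_distance

# W1 between anomaly-score distributions on clean vs. noisy segments;
# the Gaussian stand-in scores below are purely illustrative.
rng = np.random.default_rng(0)
scores_clean = rng.normal(0.0, 1.0, size=500)   # model scores on clean ECG
scores_noisy = rng.normal(1.5, 1.2, size=500)   # model scores on noisy ECG

print(f"W1 = {wasserstein_distance(scores_clean, scores_noisy):.3f}")
```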
Authors: Filip Rydin, Attila Lischka, Jiaming Wu, Morteza Haghir Chehreghani, Bal\'azs Kulcs\'ar
Abstract: Learning-based methods for routing have gained significant attention in recent years, both in single-objective and multi-objective contexts. Yet, existing methods are unsuitable for routing on multigraphs, which feature multiple edges with distinct attributes between node pairs, despite their strong relevance in real-world scenarios. In this paper, we propose two graph neural network-based methods to address multi-objective routing on multigraphs. Our first approach operates directly on the multigraph by autoregressively selecting edges until a tour is completed. The second model first simplifies the multigraph via a learned pruning strategy and then performs routing on the resulting simple graph. We evaluate both models empirically and demonstrate their strong performance across a range of problems and distributions.
Authors: Seung-Wook Kim, Seongyeol Kim, Jiah Kim, Seowon Ji, Se-Ho Lee
Abstract: Federated learning (FL) often suffers from performance degradation due to key challenges such as data heterogeneity and communication constraints. To address these limitations, we present a novel FL framework called FedWSQ, which integrates weight standardization (WS) and the proposed distribution-aware non-uniform quantization (DANUQ). WS enhances FL performance by filtering out biased components in local updates during training, thereby improving the robustness of the model against data heterogeneity and unstable client participation. In addition, DANUQ minimizes quantization errors by leveraging the statistical properties of local model updates. As a result, FedWSQ significantly reduces communication overhead while maintaining superior model accuracy. Extensive experiments on FL benchmark datasets demonstrate that FedWSQ consistently outperforms existing FL methods across various challenging FL settings, including extreme data heterogeneity and ultra-low-bit communication scenarios.
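As a rough sketch of the two named ingredients (illustrative, not the FedWSQ implementation), weight standardization normalizes each output row of an update, and a distribution-aware quantizer places its levels at empirical quantiles instead of on a uniform grid:

```python
import numpy as np

# (1) Weight standardization: zero-mean, unit-variance per output row.
# (2) A simple non-uniform quantizer whose levels follow the empirical
#     distribution of the update (quantiles) rather than a uniform grid.
def weight_standardize(W: np.ndarray, eps: float = 1e-5) -> np.ndarray:
    mu = W.mean(axis=1, keepdims=True)
    sd = W.std(axis=1, keepdims=True)
    return (W - mu) / (sd + eps)

def quantile_quantize(x: np.ndarray, bits: int = 2) -> np.ndarray:
    levels = np.quantile(x, np.linspace(0, 1, 2**bits))   # distribution-aware
    idx = np.abs(x[..., None] - levels).argmin(axis=-1)   # nearest level
    return levels[idx]

rng = np.random.default_rng(0)
update = rng.standard_normal((4, 8))                      # a local model update
q = quantile_quantize(weight_standardize(update))
print("unique transmitted values:", np.unique(q).size)    # <= 4 for 2 bits
```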
Authors: Sunandita Patra, Mehtab Pathan, Mahmoud Mahfouz, Parisa Zehtabi, Wided Ouaja, Daniele Magazzeni, Manuela Veloso
Abstract: Organizations around the world schedule jobs (programs) regularly to perform various tasks dictated by their end users. With the major movement towards using a cloud computing infrastructure, our organization follows a hybrid approach with both cloud and on-prem servers. The objective of this work is to perform capacity planning, i.e., estimate resource requirements, and job scheduling for on-prem grid computing environments. A key contribution of our approach is handling uncertainty in both resource usage and duration of the jobs, a critical aspect in the finance industry where stochastic market conditions significantly influence job characteristics. For capacity planning and scheduling, we simultaneously balance two conflicting objectives: (a) minimize resource usage, and (b) provide high quality-of-service to the end users by completing jobs by their requested deadlines. We propose approximate approaches using deterministic estimators and pair sampling-based constraint programming. Our best approach (pair sampling-based) achieves much lower peak resource usage compared to manual scheduling without compromising on the quality-of-service.
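One way to see why sampling helps under uncertainty (a toy illustration, not the paper's pair-sampling constraint program): simulate a fixed schedule under sampled durations and usages, then plan capacity against a high quantile of the peak load:

```python
import numpy as np

# Toy Monte Carlo capacity estimate: job durations and resource usages are
# uncertain, so we sample them, replay the schedule, and size capacity to the
# 95th percentile of peak load. Distributions and times are assumptions.
rng = np.random.default_rng(0)
starts = np.array([0, 2, 4, 5])                          # start times (hours)

def simulate_peak(n_samples: int = 1000) -> float:
    peaks = []
    for _ in range(n_samples):
        dur = rng.lognormal(mean=1.0, sigma=0.3, size=4)  # uncertain durations
        use = rng.lognormal(mean=2.0, sigma=0.4, size=4)  # uncertain usage
        t = np.linspace(0, 20, 400)
        load = sum(u * ((t >= s) & (t < s + d))
                   for s, d, u in zip(starts, dur, use))
        peaks.append(load.max())
    return float(np.quantile(peaks, 0.95))

print(f"95th-percentile peak usage: {simulate_peak():.1f}")
```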
Authors: Linshen Liu, Boyan Su, Junyue Jiang, Guanlin Wu, Cong Guo, Ceyu Xu, Hao Frank Yang
Abstract: This paper presents Edge-based Mixture of Experts (MoE) Collaborative Computing (EMC2), an optimal computing system designed for autonomous vehicles (AVs) that simultaneously achieves low-latency and high-accuracy 3D object detection. Unlike conventional approaches, EMC2 incorporates a scenario-aware MoE architecture specifically optimized for edge platforms. By effectively fusing LiDAR and camera data, the system leverages the complementary strengths of sparse 3D point clouds and dense 2D images to generate robust multimodal representations. To enable this, EMC2 employs an adaptive multimodal data bridge that performs multi-scale preprocessing on sensor inputs, followed by a scenario-aware routing mechanism that dynamically dispatches features to dedicated expert models based on object visibility and distance. In addition, EMC2 integrates joint hardware-software optimizations, including hardware resource utilization optimization and computational graph simplification, to ensure efficient and real-time inference on resource-constrained edge devices. Experiments on open-source benchmarks clearly demonstrate the advantages of EMC2 as an end-to-end system. On the KITTI dataset, it achieves an average accuracy improvement of 3.58% and a 159.06% inference speedup compared to 15 baseline methods on Jetson platforms, with similar performance gains on the nuScenes dataset, highlighting its capability to advance reliable, real-time 3D object detection tasks for AVs. The official implementation is available at https://github.com/LinshenLiu622/EMC2.
Authors: Michele Caprio
Abstract: Conformal prediction (CP) is an Uncertainty Representation technique that delivers finite-sample calibrated prediction regions for any underlying Machine Learning model. Its status as an Uncertainty Quantification (UQ) tool, though, has remained conceptually opaque: While Conformal Prediction Regions (CPRs) give an ordinal representation of uncertainty (larger regions typically indicate higher uncertainty), they lack the capability to cardinally quantify it (twice as large regions do not imply twice the uncertainty). We adopt a category-theoretic approach to CP -- framing it as a morphism, embedded in a commuting diagram, of two newly-defined categories -- that brings us three joys. First, we show that -- under minimal assumptions -- CP is intrinsically a UQ mechanism, that is, its cardinal UQ capabilities are a structural feature of the method. Second, we demonstrate that CP bridges (and perhaps subsumes) the Bayesian, frequentist, and imprecise probabilistic approaches to predictive statistical reasoning. Finally, we show that a CPR is the image of a covariant functor. This observation is relevant to AI privacy: It implies that privacy noise added locally does not break the global coverage guarantee.
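For readers unfamiliar with the object under study, a minimal split-conformal sketch (generic, not from the paper) shows how CPRs are produced: a held-out residual quantile widens any point prediction into a finite-sample calibrated interval:

```python
import numpy as np

# Standard split conformal prediction for regression with the absolute
# residual score; works with any underlying point predictor.
def conformal_interval(residuals: np.ndarray, y_pred: float, alpha: float = 0.1):
    n = len(residuals)
    # Finite-sample-valid quantile level ceil((n+1)(1-alpha))/n.
    q = np.quantile(residuals, min(1.0, np.ceil((n + 1) * (1 - alpha)) / n))
    return y_pred - q, y_pred + q

rng = np.random.default_rng(0)
cal_residuals = np.abs(rng.normal(0, 1, size=200))   # |y - f(x)| on calibration set
print(conformal_interval(cal_residuals, y_pred=5.0)) # ~90% coverage guarantee
```

Note how the region's size (here 2q) orders uncertainty but carries no cardinal scale by itself, which is exactly the gap the paper's category-theoretic analysis addresses.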
Authors: Xin Dong, Shichao Dong, Jin Wang, Jing Huang, Li Zhou, Zenghui Sun, Lihua Jing, Jingsong Lan, Xiaoyong Zhu, Bo Zheng
Abstract: Hallucinations in large vision-language models (LVLMs) pose significant challenges for real-world applications, as LVLMs may generate responses that appear plausible yet remain inconsistent with the associated visual content. This issue rarely occurs in human cognition. We argue that this discrepancy arises from humans' ability to effectively leverage multimodal interaction information in data samples. Specifically, humans typically first gather multimodal information, analyze the interactions across modalities for understanding, and then express their understanding through language. Motivated by this observation, we conduct extensive experiments on popular LVLMs and obtain insights that surprisingly reveal human-like, though less pronounced, cognitive behavior of LVLMs on multimodal samples. Building on these findings, we further propose \textbf{INTER}: \textbf{Inter}action Guidance Sampling, a novel training-free algorithm that mitigates hallucinations without requiring additional data. Specifically, INTER explicitly guides LVLMs to effectively reapply their understanding of multimodal interaction information when generating responses, thereby reducing potential hallucinations. On six benchmarks including VQA and image captioning tasks, INTER achieves an average improvement of up to 3.4\% on five LVLMs compared to the state-of-the-art decoding strategy. The code will be released when the paper is accepted.
Authors: Zijin Gu, Tatiana Likhomanenko, Navdeep Jaitly
Abstract: Mixture-of-experts (MoE) architectures have expanded from language modeling to automatic speech recognition (ASR). Traditional MoE methods, such as the Switch Transformer, route experts independently within each layer. Our analysis reveals that routers in most layers make expert choices that are not strongly correlated with the choices of the routers in other layers. To increase the cooperation between experts in different layers and encourage greater specialization, we use a shared router across different MoE layers. We call this model Omni-router Transformer. Extensive experiments on a large-scale pseudo-labeled dataset and evaluations across 10 diverse, out-of-domain ASR benchmarks demonstrate that the Omni-router Transformer is able to achieve lower training loss and consistently outperform dense and Switch Transformer models, reducing average word error rates by 11.2% and 8.2%, respectively, while providing structured expert usage and improved robustness to diverse data.
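A minimal sketch of the contrast with per-layer routing (shapes and the top-1 rule are assumptions, not the paper's code): a single routing matrix is reused by every MoE layer, so expert choices are coupled across depth while the experts themselves stay layer-specific:

```python
import numpy as np

# Shared-router MoE in miniature: one routing matrix serves all layers,
# correlating expert selection across depth; experts remain per-layer.
rng = np.random.default_rng(0)
d, n_experts, n_layers = 64, 8, 4

shared_router = rng.standard_normal((n_experts, d)) * 0.1
experts = [[rng.standard_normal((d, d)) * 0.05 for _ in range(n_experts)]
           for _ in range(n_layers)]          # experts stay layer-specific

x = rng.standard_normal(d)
for layer in range(n_layers):
    e = int(np.argmax(shared_router @ x))     # same router at every layer
    x = x + np.tanh(experts[layer][e] @ x)    # top-1 expert, residual update
    print(f"layer {layer}: expert {e}")
```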
Authors: Xiangru Tang, Tianrui Qin, Tianhao Peng, Ziyang Zhou, Daniel Shao, Tingting Du, Xinming Wei, Peng Xia, Fang Wu, He Zhu, Ge Zhang, Jiaheng Liu, Xingyao Wang, Sirui Hong, Chenglin Wu, Hao Cheng, Chi Wang, Wangchunshu Zhou
Abstract: Current AI agents cannot effectively learn from each other's problem-solving experiences or use past successes to guide self-reflection and error correction in new tasks. We introduce Agent KB, a shared knowledge base that captures both high-level problem-solving strategies and detailed execution lessons, enabling knowledge transfer across agent frameworks. Agent KB implements a novel teacher-student dual-phase retrieval mechanism where student agents retrieve workflow-level patterns for strategic guidance while teacher agents identify execution-level patterns for refinement. This hierarchical approach enables agents to break out of limited reasoning pathways by incorporating diverse strategies from external sources. Evaluations on the GAIA benchmark demonstrate substantial performance gains, with Agent KB improving success rates by up to 6.06 percentage points overall under pass@1. For SWE-bench code repair tasks, our system significantly improved resolution rates, with o3-mini achieving an 8.67 percentage point gain (23 percent to 31.67 percent) in pass@1. Our ablation studies demonstrate that the refinement module proves most critical, with its removal causing a 3.85% drop on challenging Level 3 tasks, highlighting that effective knowledge transfer necessitates both strategic guidance and execution-level refinement.
Authors: Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, Luke Marris, Sam Petulla, Colin Gaffney, Asaf Aharoni, Nathan Lintz, Tiago Cardal Pais, Henrik Jacobsson, Idan Szpektor, et al.
Rhys May (Tu\'\^an), Jaeyoun Kim (Tu\'\^an), Alex Druinsky (Tu\'\^an), Jingchen Ye (Tu\'\^an), Zheng Xu (Tu\'\^an), Terry Huang (Tu\'\^an), Jad Al Abdallah (Tu\'\^an), Adil Dostmohamed (Tu\'\^an), Rachana Fellinger (Tu\'\^an), Tsendsuren Munkhdalai (Tu\'\^an), Akanksha Maurya (Tu\'\^an), Peter Garst (Tu\'\^an), Yin Zhang (Tu\'\^an), Maxim Krikun (Tu\'\^an), Simon Bucher (Tu\'\^an), Aditya Srikanth Veerubhotla (Tu\'\^an), Yaxin Liu (Tu\'\^an), Sheng Li (Tu\'\^an), Nishesh Gupta (Tu\'\^an), Jakub Adamek (Tu\'\^an), Hanwen Chen (Tu\'\^an), Bernett Orlando (Tu\'\^an), Aleksandr Zaks (Tu\'\^an), Joost van Amersfoort (Tu\'\^an), Josh Camp (Tu\'\^an), Hui Wan (Tu\'\^an), HyunJeong Choe (Tu\'\^an), Zhichun Wu (Tu\'\^an), Kate Olszewska (Tu\'\^an), Weiren Yu (Tu\'\^an), Archita Vadali (Tu\'\^an), Martin Scholz (Tu\'\^an), Daniel De Freitas (Tu\'\^an), Jason Lin (Tu\'\^an), Amy Hua (Tu\'\^an), Xin Liu (Tu\'\^an), Frank Ding (Tu\'\^an), Yichao Zhou (Tu\'\^an), Boone Severson (Tu\'\^an), Katerina Tsihlas (Tu\'\^an), Samuel Yang (Tu\'\^an), Tammo Spalink (Tu\'\^an), Varun Yerram (Tu\'\^an), Helena Pankov (Tu\'\^an), Rory Blevins (Tu\'\^an), Ben Vargas (Tu\'\^an), Sarthak Jauhari (Tu\'\^an), Matt Miecnikowski (Tu\'\^an), Ming Zhang (Tu\'\^an), Sandeep Kumar (Tu\'\^an), Clement Farabet (Tu\'\^an), Charline Le Lan (Tu\'\^an), Sebastian Flennerhag (Tu\'\^an), Yonatan Bitton (Tu\'\^an), Ada Ma (Tu\'\^an), Arthur Bra\v{z}inskas (Tu\'\^an), Eli Collins (Tu\'\^an), Niharika Ahuja (Tu\'\^an), Sneha Kudugunta (Tu\'\^an), Anna Bortsova (Tu\'\^an), Minh Giang (Tu\'\^an), Wanzheng Zhu (Tu\'\^an), Ed Chi (Tu\'\^an), Scott Lundberg (Tu\'\^an), Alexey Stern (Tu\'\^an), Subha Puttagunta (Tu\'\^an), Jing Xiong (Tu\'\^an), Xiao Wu (Tu\'\^an), Yash Pande (Tu\'\^an), Amit Jhindal (Tu\'\^an), Daniel Murphy (Tu\'\^an), Jon Clark (Tu\'\^an), Marc Brockschmidt (Tu\'\^an), Maxine Deines (Tu\'\^an), Kevin R. 
McKee (Tu\'\^an), Dan Bahir (Tu\'\^an), Jiajun Shen (Tu\'\^an), Minh Truong (Tu\'\^an), Daniel McDuff (Tu\'\^an), Andrea Gesmundo (Tu\'\^an), Edouard Rosseel (Tu\'\^an), Bowen Liang (Tu\'\^an), Ken Caluwaerts (Tu\'\^an), Jessica Hamrick (Tu\'\^an), Joseph Kready (Tu\'\^an), Mary Cassin (Tu\'\^an), Rishikesh Ingale (Tu\'\^an), Li Lao (Tu\'\^an), Scott Pollom (Tu\'\^an), Yifan Ding (Tu\'\^an), Wei He (Tu\'\^an), Lizzetth Bellot (Tu\'\^an), Joana Iljazi (Tu\'\^an), Ramya Sree Boppana (Tu\'\^an), Shan Han (Tu\'\^an), Tara Thompson (Tu\'\^an), Amr Khalifa (Tu\'\^an), Anna Bulanova (Tu\'\^an), Blagoj Mitrevski (Tu\'\^an), Bo Pang (Tu\'\^an), Emma Cooney (Tu\'\^an), Tian Shi (Tu\'\^an), Rey Coaguila (Tu\'\^an), Tamar Yakar (Tu\'\^an), Marc'aurelio Ranzato (Tu\'\^an), Nikola Momchev (Tu\'\^an), Chris Rawles (Tu\'\^an), Zachary Charles (Tu\'\^an), Young Maeng (Tu\'\^an), Yuan Zhang (Tu\'\^an), Rishabh Bansal (Tu\'\^an), Xiaokai Zhao (Tu\'\^an), Brian Albert (Tu\'\^an), Yuan Yuan (Tu\'\^an), Sudheendra Vijayanarasimhan (Tu\'\^an), Roy Hirsch (Tu\'\^an), Vinay Ramasesh (Tu\'\^an), Kiran Vodrahalli (Tu\'\^an), Xingyu Wang (Tu\'\^an), Arushi Gupta (Tu\'\^an), DJ Strouse (Tu\'\^an), Jianmo Ni (Tu\'\^an), Roma Patel (Tu\'\^an), Gabe Taubman (Tu\'\^an), Zhouyuan Huo (Tu\'\^an), Dero Gharibian (Tu\'\^an), Marianne Monteiro (Tu\'\^an), Hoi Lam (Tu\'\^an), Shobha Vasudevan (Tu\'\^an), Aditi Chaudhary (Tu\'\^an), Isabela Albuquerque (Tu\'\^an), Kilol Gupta (Tu\'\^an), Sebastian Riedel (Tu\'\^an), Chaitra Hegde (Tu\'\^an), Avraham Ruderman (Tu\'\^an), Andr\'as Gy\"orgy (Tu\'\^an), Marcus Wainwright (Tu\'\^an), Ashwin Chaugule (Tu\'\^an), Burcu Karagol Ayan (Tu\'\^an), Tomer Levinboim (Tu\'\^an), Sam Shleifer (Tu\'\^an), Yogesh Kalley (Tu\'\^an), Vahab Mirrokni (Tu\'\^an), Abhishek Rao (Tu\'\^an), Prabakar Radhakrishnan (Tu\'\^an), Jay Hartford (Tu\'\^an), Jialin Wu (Tu\'\^an), Zhenhai Zhu (Tu\'\^an), Francesco Bertolini (Tu\'\^an), Hao Xiong (Tu\'\^an), Nicolas Serrano (Tu\'\^an), Hamish Tomlinson (Tu\'\^an), Myle Ott (Tu\'\^an), Yifan Chang (Tu\'\^an), Mark Graham (Tu\'\^an), Jian Li (Tu\'\^an), Marco Liang (Tu\'\^an), Xiangzhu Long (Tu\'\^an), Sebastian Borgeaud (Tu\'\^an), Yanif Ahmad (Tu\'\^an), Alex Grills (Tu\'\^an), Diana Mincu (Tu\'\^an), Martin Izzard (Tu\'\^an), Yuan Liu (Tu\'\^an), Jinyu Xie (Tu\'\^an), Louis O'Bryan (Tu\'\^an), Sameera Ponda (Tu\'\^an), Simon Tong (Tu\'\^an), Michelle Liu (Tu\'\^an), Dan Malkin (Tu\'\^an), Khalid Salama (Tu\'\^an), Yuankai Chen (Tu\'\^an), Rohan Anil (Tu\'\^an), Anand Rao (Tu\'\^an), Rigel Swavely (Tu\'\^an), Misha Bilenko (Tu\'\^an), Nina Anderson (Tu\'\^an), Tat Tan (Tu\'\^an), Jing Xie (Tu\'\^an), Xing Wu (Tu\'\^an), Lijun Yu (Tu\'\^an), Oriol Vinyals (Tu\'\^an), Andrey Ryabtsev (Tu\'\^an), Rumen Dangovski (Tu\'\^an), Kate Baumli (Tu\'\^an), Daniel Keysers (Tu\'\^an), Christian Wright (Tu\'\^an), Zoe Ashwood (Tu\'\^an), Betty Chan (Tu\'\^an), Artem Shtefan (Tu\'\^an), Yaohui Guo (Tu\'\^an), Ankur Bapna (Tu\'\^an), Radu Soricut (Tu\'\^an), Steven Pecht (Tu\'\^an), Sabela Ramos (Tu\'\^an), Rui Wang (Tu\'\^an), Jiahao Cai (Tu\'\^an), Trieu Trinh (Tu\'\^an), Paul Barham (Tu\'\^an), Linda Friso (Tu\'\^an), Eli Stickgold (Tu\'\^an), Xiangzhuo Ding (Tu\'\^an), Siamak Shakeri (Tu\'\^an), Diego Ardila (Tu\'\^an), Eleftheria Briakou (Tu\'\^an), Phil Culliton (Tu\'\^an), Adam Raveret (Tu\'\^an), Jingyu Cui (Tu\'\^an), David Saxton (Tu\'\^an), Subhrajit Roy (Tu\'\^an), Javad Azizi (Tu\'\^an), Pengcheng Yin (Tu\'\^an), Lucia Loher (Tu\'\^an), Andrew Bunner (Tu\'\^an), Min 
Choi (Tu\'\^an), Faruk Ahmed (Tu\'\^an), Eric Li (Tu\'\^an), Yin Li (Tu\'\^an), Shengyang Dai (Tu\'\^an), Michael Elabd (Tu\'\^an), Sriram Ganapathy (Tu\'\^an), Shivani Agrawal (Tu\'\^an), Yiqing Hua (Tu\'\^an), Paige Kunkle (Tu\'\^an), Sujeevan Rajayogam (Tu\'\^an), Arun Ahuja (Tu\'\^an), Arthur Conmy (Tu\'\^an), Alex Vasiloff (Tu\'\^an), Parker Beak (Tu\'\^an), Christopher Yew (Tu\'\^an), Jayaram Mudigonda (Tu\'\^an), Bartek Wydrowski (Tu\'\^an), Jon Blanton (Tu\'\^an), Zhengdong Wang (Tu\'\^an), Yann Dauphin (Tu\'\^an), Zhuo Xu (Tu\'\^an), Martin Polacek (Tu\'\^an), Xi Chen (Tu\'\^an), Hexiang Hu (Tu\'\^an), Pauline Sho (Tu\'\^an), Markus Kunesch (Tu\'\^an), Mehdi Hafezi Manshadi (Tu\'\^an), Eliza Rutherford (Tu\'\^an), Bo Li (Tu\'\^an), Sissie Hsiao (Tu\'\^an), Iain Barr (Tu\'\^an), Alex Tudor (Tu\'\^an), Matija Kecman (Tu\'\^an), Arsha Nagrani (Tu\'\^an), Vladimir Pchelin (Tu\'\^an), Martin Sundermeyer (Tu\'\^an), Aishwarya P S (Tu\'\^an), Abhijit Karmarkar (Tu\'\^an), Yi Gao (Tu\'\^an), Grishma Chole (Tu\'\^an), Olivier Bachem (Tu\'\^an), Isabel Gao (Tu\'\^an), Arturo BC (Tu\'\^an), Matt Dibb (Tu\'\^an), Mauro Verzetti (Tu\'\^an), Felix Hernandez-Campos (Tu\'\^an), Yana Lunts (Tu\'\^an), Matthew Johnson (Tu\'\^an), Julia Di Trapani (Tu\'\^an), Raphael Koster (Tu\'\^an), Idan Brusilovsky (Tu\'\^an), Binbin Xiong (Tu\'\^an), Megha Mohabey (Tu\'\^an), Han Ke (Tu\'\^an), Joe Zou (Tu\'\^an), Tea Saboli\'c (Tu\'\^an), V\'ictor Campos (Tu\'\^an), John Palowitch (Tu\'\^an), Alex Morris (Tu\'\^an), Linhai Qiu (Tu\'\^an), Pranavaraj Ponnuramu (Tu\'\^an), Fangtao Li (Tu\'\^an), Vivek Sharma (Tu\'\^an), Kiranbir Sodhia (Tu\'\^an), Kaan Tekelioglu (Tu\'\^an), Aleksandr Chuklin (Tu\'\^an), Madhavi Yenugula (Tu\'\^an), Erika Gemzer (Tu\'\^an), Theofilos Strinopoulos (Tu\'\^an), Sam El-Husseini (Tu\'\^an), Huiyu Wang (Tu\'\^an), Yan Zhong (Tu\'\^an), Edouard Leurent (Tu\'\^an), Paul Natsev (Tu\'\^an), Weijun Wang (Tu\'\^an), Dre Mahaarachchi (Tu\'\^an), Tao Zhu (Tu\'\^an), Songyou Peng (Tu\'\^an), Sami Alabed (Tu\'\^an), Cheng-Chun Lee (Tu\'\^an), Anthony Brohan (Tu\'\^an), Arthur Szlam (Tu\'\^an), GS Oh (Tu\'\^an), Anton Kovsharov (Tu\'\^an), Jenny Lee (Tu\'\^an), Renee Wong (Tu\'\^an), Megan Barnes (Tu\'\^an), Gregory Thornton (Tu\'\^an), Felix Gimeno (Tu\'\^an), Omer Levy (Tu\'\^an), Martin Sevenich (Tu\'\^an), Melvin Johnson (Tu\'\^an), Jonathan Mallinson (Tu\'\^an), Robert Dadashi (Tu\'\^an), Ziyue Wang (Tu\'\^an), Qingchun Ren (Tu\'\^an), Preethi Lahoti (Tu\'\^an), Arka Dhar (Tu\'\^an), Josh Feldman (Tu\'\^an), Dan Zheng (Tu\'\^an), Thatcher Ulrich (Tu\'\^an), Liviu Panait (Tu\'\^an), Michiel Blokzijl (Tu\'\^an), Cip Baetu (Tu\'\^an), Josip Matak (Tu\'\^an), Jitendra Harlalka (Tu\'\^an), Maulik Shah (Tu\'\^an), Tal Marian (Tu\'\^an), Daniel von Dincklage (Tu\'\^an), Cosmo Du (Tu\'\^an), Ruy Ley-Wild (Tu\'\^an), Bethanie Brownfield (Tu\'\^an), Max Schumacher (Tu\'\^an), Yury Stuken (Tu\'\^an), Shadi Noghabi (Tu\'\^an), Sonal Gupta (Tu\'\^an), Xiaoqi Ren (Tu\'\^an), Eric Malmi (Tu\'\^an), Felix Weissenberger (Tu\'\^an), Blanca Huergo (Tu\'\^an), Maria Bauza (Tu\'\^an), Thomas Lampe (Tu\'\^an), Arthur Douillard (Tu\'\^an), Mojtaba Seyedhosseini (Tu\'\^an), Roy Frostig (Tu\'\^an), Zoubin Ghahramani (Tu\'\^an), Kelvin Nguyen (Tu\'\^an), Kashyap Krishnakumar (Tu\'\^an), Chengxi Ye (Tu\'\^an), Rahul Gupta (Tu\'\^an), Alireza Nazari (Tu\'\^an), Robert Geirhos (Tu\'\^an), Pete Shaw (Tu\'\^an), Ahmed Eleryan (Tu\'\^an), Dima Damen (Tu\'\^an), Jennimaria Palomaki (Tu\'\^an), Ted Xiao (Tu\'\^an), Qiyin Wu 
(Tu\'\^an), Quan Yuan (Tu\'\^an), Phoenix Meadowlark (Tu\'\^an), Matthew Bilotti (Tu\'\^an), Raymond Lin (Tu\'\^an), Mukund Sridhar (Tu\'\^an), Yannick Schroecker (Tu\'\^an), Da-Woon Chung (Tu\'\^an), Jincheng Luo (Tu\'\^an), Trevor Strohman (Tu\'\^an), Tianlin Liu (Tu\'\^an), Anne Zheng (Tu\'\^an), Jesse Emond (Tu\'\^an), Wei Wang (Tu\'\^an), Andrew Lampinen (Tu\'\^an), Toshiyuki Fukuzawa (Tu\'\^an), Folawiyo Campbell-Ajala (Tu\'\^an), Monica Roy (Tu\'\^an), James Lee-Thorp (Tu\'\^an), Lily Wang (Tu\'\^an), Iftekhar Naim (Tu\'\^an), Tony (Tu\'\^an), Nguy\~\^en (Lucas), Guy Bensky (Lucas), Aditya Gupta (Lucas), Dominika Rogozi\'nska (Lucas), Justin Fu (Lucas), Thanumalayan Sankaranarayana Pillai (Lucas), Petar Veli\v{c}kovi\'c (Lucas), Shahar Drath (Lucas), Philipp Neubeck (Lucas), Vaibhav Tulsyan (Lucas), Arseniy Klimovskiy (Lucas), Don Metzler (Lucas), Sage Stevens (Lucas), Angel Yeh (Lucas), Junwei Yuan (Lucas), Tianhe Yu (Lucas), Kelvin Zhang (Lucas), Alec Go (Lucas), Vincent Tsang (Lucas), Ying Xu (Lucas), Andy Wan (Lucas), Isaac Galatzer-Levy (Lucas), Sam Sobell (Lucas), Abodunrinwa Toki (Lucas), Elizabeth Salesky (Lucas), Wenlei Zhou (Lucas), Diego Antognini (Lucas), Sholto Douglas (Lucas), Shimu Wu (Lucas), Adam Lelkes (Lucas), Frank Kim (Lucas), Paul Cavallaro (Lucas), Ana Salazar (Lucas), Yuchi Liu (Lucas), James Besley (Lucas), Tiziana Refice (Lucas), Yiling Jia (Lucas), Zhang Li (Lucas), Michal Sokolik (Lucas), Arvind Kannan (Lucas), Jon Simon (Lucas), Jo Chick (Lucas), Avia Aharon (Lucas), Meet Gandhi (Lucas), Mayank Daswani (Lucas), Keyvan Amiri (Lucas), Vighnesh Birodkar (Lucas), Abe Ittycheriah (Lucas), Peter Grabowski (Lucas), Oscar Chang (Lucas), Charles Sutton (Lucas), Zhixin (Lucas), Lai (Elena), Umesh Telang (Elena), Susie Sargsyan (Elena), Tao Jiang (Elena), Raphael Hoffmann (Elena), Nicole Brichtova (Elena), Matteo Hessel (Elena), Jonathan Halcrow (Elena), Sammy Jerome (Elena), Geoff Brown (Elena), Alex Tomala (Elena), Elena Buchatskaya (Elena), Dian Yu (Elena), Sachit Menon (Elena), Pol Moreno (Elena), Yuguo Liao (Elena), Vicky Zayats (Elena), Luming Tang (Elena), SQ Mah (Elena), Ashish Shenoy (Elena), Alex Siegman (Elena), Majid Hadian (Elena), Okwan Kwon (Elena), Tao Tu (Elena), Nima Khajehnouri (Elena), Ryan Foley (Elena), Parisa Haghani (Elena), Zhongru Wu (Elena), Vaishakh Keshava (Elena), Khyatti Gupta (Elena), Tony Bruguier (Elena), Rui Yao (Elena), Danny Karmon (Elena), Luisa Zintgraf (Elena), Zhicheng Wang (Elena), Enrique Piqueras (Elena), Junehyuk Jung (Elena), Jenny Brennan (Elena), Diego Machado (Elena), Marissa Giustina (Elena), MH Tessler (Elena), Kamyu Lee (Elena), Qiao Zhang (Elena), Joss Moore (Elena), Kaspar Daugaard (Elena), Alexander Fr\"ommgen (Elena), Jennifer Beattie (Elena), Fred Zhang (Elena), Daniel Kasenberg (Elena), Ty Geri (Elena), Danfeng Qin (Elena), Gaurav Singh Tomar (Elena), Tom Ouyang (Elena), Tianli Yu (Elena), Luowei Zhou (Elena), Rajiv Mathews (Elena), Andy Davis (Elena), Yaoyiran Li (Elena), Jai Gupta (Elena), Damion Yates (Elena), Linda Deng (Elena), Elizabeth Kemp (Elena), Ga-Young Joung (Elena), Sergei Vassilvitskii (Elena), Mandy Guo (Elena), Pallavi LV (Elena), Dave Dopson (Elena), Sami Lachgar (Elena), Lara McConnaughey (Elena), Himadri Choudhury (Elena), Dragos Dena (Elena), Aaron Cohen (Elena), Joshua Ainslie (Elena), Sergey Levi (Elena), Parthasarathy Gopavarapu (Elena), Polina Zablotskaia (Elena), Hugo Vallet (Elena), Sanaz Bahargam (Elena), Xiaodan Tang (Elena), Nenad Tomasev (Elena), Ethan Dyer (Elena), Daniel 
Balle (Elena), Hongrae Lee (Elena), William Bono (Elena), Jorge Gonzalez Mendez (Elena), Vadim Zubov (Elena), Shentao Yang (Elena), Ivor Rendulic (Elena), Yanyan Zheng (Elena), Andrew Hogue (Elena), Golan Pundak (Elena), Ralph Leith (Elena), Avishkar Bhoopchand (Elena), Michael Han (Elena), Mislav \v{Z}ani\'c (Elena), Tom Schaul (Elena), Manolis Delakis (Elena), Tejas Iyer (Elena), Guanyu Wang (Elena), Harman Singh (Elena), Abdelrahman Abdelhamed (Elena), Tara Thomas (Elena), Siddhartha Brahma (Elena), Hilal Dib (Elena), Naveen Kumar (Elena), Wenxuan Zhou (Elena), Liang Bai (Elena), Pushkar Mishra (Elena), Jiao Sun (Elena), Valentin Anklin (Elena), Roykrong Sukkerd (Elena), Lauren Agubuzu (Elena), Anton Briukhov (Elena), Anmol Gulati (Elena), Maximilian Sieb (Elena), Fabio Pardo (Elena), Sara Nasso (Elena), Junquan Chen (Elena), Kexin Zhu (Elena), Tiberiu Sosea (Elena), Alex Goldin (Elena), Keith Rush (Elena), Spurthi Amba Hombaiah (Elena), Andreas Noever (Elena), Allan Zhou (Elena), Sam Haves (Elena), Mary Phuong (Elena), Jake Ades (Elena), Yi-ting Chen (Elena), Lin Yang (Elena), Joseph Pagadora (Elena), Stan Bileschi (Elena), Victor Cotruta (Elena), Rachel Saputro (Elena), Arijit Pramanik (Elena), Sean Ammirati (Elena), Dan Garrette (Elena), Kevin Villela (Elena), Tim Blyth (Elena), Canfer Akbulut (Elena), Neha Jha (Elena), Alban Rrustemi (Elena), Arissa Wongpanich (Elena), Chirag Nagpal (Elena), Yonghui Wu (Elena), Morgane Rivi\`ere (Elena), Sergey Kishchenko (Elena), Pranesh Srinivasan (Elena), Alice Chen (Elena), Animesh Sinha (Elena), Trang Pham (Elena), Bill Jia (Elena), Tom Hennigan (Elena), Anton Bakalov (Elena), Nithya Attaluri (Elena), Drew Garmon (Elena), Daniel Rodriguez (Elena), Dawid Wegner (Elena), Wenhao Jia (Elena), Evan Senter (Elena), Noah Fiedel (Elena), Denis Petek (Elena), Yuchuan Liu (Elena), Cassidy Hardin (Elena), Harshal Tushar Lehri (Elena), Joao Carreira (Elena), Sara Smoot (Elena), Marcel Prasetya (Elena), Nami Akazawa (Elena), Anca Stefanoiu (Elena), Chia-Hua Ho (Elena), Anelia Angelova (Elena), Kate Lin (Elena), Min Kim (Elena), Charles Chen (Elena), Marcin Sieniek (Elena), Alice Li (Elena), Tongfei Guo (Elena), Sorin Baltateanu (Elena), Pouya Tafti (Elena), Michael Wunder (Elena), Nadav Olmert (Elena), Divyansh Shukla (Elena), Jingwei Shen (Elena), Neel Kovelamudi (Elena), Balaji Venkatraman (Elena), Seth Neel (Elena), Romal Thoppilan (Elena), Jerome Connor (Elena), Frederik Benzing (Elena), Axel Stjerngren (Elena), Golnaz Ghiasi (Elena), Alex Polozov (Elena), Joshua Howland (Elena), Theophane Weber (Elena), Justin Chiu (Elena), Ganesh Poomal Girirajan (Elena), Andreas Terzis (Elena), Pidong Wang (Elena), Fangda Li (Elena), Yoav Ben Shalom (Elena), Dinesh Tewari (Elena), Matthew Denton (Elena), Roee Aharoni (Elena), Norbert Kalb (Elena), Heri Zhao (Elena), Junlin Zhang (Elena), Angelos Filos (Elena), Matthew Rahtz (Elena), Lalit Jain (Elena), Connie Fan (Elena), Vitor Rodrigues (Elena), Ruth Wang (Elena), Richard Shin (Elena), Jacob Austin (Elena), Roman Ring (Elena), Mariella Sanchez-Vargas (Elena), Mehadi Hassen (Elena), Ido Kessler (Elena), Uri Alon (Elena), Gufeng Zhang (Elena), Wenhu Chen (Elena), Yenai Ma (Elena), Xiance Si (Elena), Le Hou (Elena), Azalia Mirhoseini (Elena), Marc Wilson (Elena), Geoff Bacon (Elena), Becca Roelofs (Elena), Lei Shu (Elena), Gautam Vasudevan (Elena), Jonas Adler (Elena), Artur Dwornik (Elena), Tayfun Terzi (Elena), Matt Lawlor (Elena), Harry Askham (Elena), Mike Bernico (Elena), Xuanyi Dong (Elena), Chris Hidey (Elena), 
Kevin Kilgour (Elena), Ga\"el Liu (Elena), Surya Bhupatiraju (Elena), Luke Leonhard (Elena), Siqi Zuo (Elena), Partha Talukdar (Elena), Qing Wei (Elena), Aliaksei Severyn (Elena), V\'it List\'ik (Elena), Jong Lee (Elena), Aditya Tripathi (Elena), SK Park (Elena), Yossi Matias (Elena), Hao Liu (Elena), Alex Ruiz (Elena), Rajesh Jayaram (Elena), Jackson Tolins (Elena), Pierre Marcenac (Elena), Yiming Wang (Elena), Bryan Seybold (Elena), Henry Prior (Elena), Deepak Sharma (Elena), Jack Weber (Elena), Mikhail Sirotenko (Elena), Yunhsuan Sung (Elena), Dayou Du (Elena), Ellie Pavlick (Elena), Stefan Zinke (Elena), Markus Freitag (Elena), Max Dylla (Elena), Montse Gonzalez Arenas (Elena), Natan Potikha (Elena), Omer Goldman (Elena), Connie Tao (Elena), Rachita Chhaparia (Elena), Maria Voitovich (Elena), Pawan Dogra (Elena), Andrija Ra\v{z}natovi\'c (Elena), Zak Tsai (Elena), Chong You (Elena), Oleaser Johnson (Elena), George Tucker (Elena), Chenjie Gu (Elena), Jae Yoo (Elena), Maryam Majzoubi (Elena), Valentin Gabeur (Elena), Bahram Raad (Elena), Rocky Rhodes (Elena), Kashyap Kolipaka (Elena), Heidi Howard (Elena), Geta Sampemane (Elena), Benny Li (Elena), Chulayuth Asawaroengchai (Elena), Duy Nguyen (Elena), Chiyuan Zhang (Elena), Timothee Cour (Elena), Xinxin Yu (Elena), Zhao Fu (Elena), Joe Jiang (Elena), Po-Sen Huang (Elena), Gabriela Surita (Elena), I\~naki Iturrate (Elena), Yael Karov (Elena), Michael Collins (Elena), Martin Baeuml (Elena), Fabian Fuchs (Elena), Shilpa Shetty (Elena), Swaroop Ramaswamy (Elena), Sayna Ebrahimi (Elena), Qiuchen Guo (Elena), Jeremy Shar (Elena), Gabe Barth-Maron (Elena), Sravanti Addepalli (Elena), Bryan Richter (Elena), Chin-Yi Cheng (Elena), Eug\'enie Rives (Elena), Fei Zheng (Elena), Johannes Griesser (Elena), Nishanth Dikkala (Elena), Yoel Zeldes (Elena), Ilkin Safarli (Elena), Dipanjan Das (Elena), Himanshu Srivastava (Elena), Sadh MNM Khan (Elena), Xin Li (Elena), Aditya Pandey (Elena), Larisa Markeeva (Elena), Dan Belov (Elena), Qiqi Yan (Elena), Miko{\l}aj Rybi\'nski (Elena), Tao Chen (Elena), Megha Nawhal (Elena), Michael Quinn (Elena), Vineetha Govindaraj (Elena), Sarah York (Elena), Reed Roberts (Elena), Roopal Garg (Elena), Namrata Godbole (Elena), Jake Abernethy (Elena), Anil Das (Elena), Lam Nguyen Thiet (Elena), Jonathan Tompson (Elena), John Nham (Elena), Neera Vats (Elena), Ben Caine (Elena), Wesley Helmholz (Elena), Francesco Pongetti (Elena), Yeongil Ko (Elena), James An (Elena), Clara Huiyi Hu (Elena), Yu-Cheng Ling (Elena), Julia Pawar (Elena), Robert Leland (Elena), Keisuke Kinoshita (Elena), Waleed Khawaja (Elena), Marco Selvi (Elena), Eugene Ie (Elena), Danila Sinopalnikov (Elena), Lev Proleev (Elena), Nilesh Tripuraneni (Elena), Michele Bevilacqua (Elena), Seungji Lee (Elena), Clayton Sanford (Elena), Dan Suh (Elena), Dustin Tran (Elena), Jeff Dean (Elena), Simon Baumgartner (Elena), Jens Heitkaemper (Elena), Sagar Gubbi (Elena), Kristina Toutanova (Elena), Yichong Xu (Elena), Chandu Thekkath (Elena), Keran Rong (Elena), Palak Jain (Elena), Annie Xie (Elena), Yan Virin (Elena), Yang Li (Elena), Lubo Litchev (Elena), Richard Powell (Elena), Tarun Bharti (Elena), Adam Kraft (Elena), Nan Hua (Elena), Marissa Ikonomidis (Elena), Ayal Hitron (Elena), Sanjiv Kumar (Elena), Loic Matthey (Elena), Sophie Bridgers (Elena), Lauren Lax (Elena), Ishaan Malhi (Elena), Ondrej Skopek (Elena), Ashish Gupta (Elena), Jiawei Cao (Elena), Mitchelle Rasquinha (Elena), Siim P\~oder (Elena), Wojciech Stokowiec (Elena), Nicholas Roth (Elena), Guowang Li 
(Elena), Micha\"el Sander (Elena), Joshua Kessinger (Elena), Vihan Jain (Elena), Edward Loper (Elena), Wonpyo Park (Elena), Michal Yarom (Elena), Liqun Cheng (Elena), Guru Guruganesh (Elena), Kanishka Rao (Elena), Yan Li (Elena), Catarina Barros (Elena), Mikhail Sushkov (Elena), Chun-Sung Ferng (Elena), Rohin Shah (Elena), Ophir Aharoni (Elena), Ravin Kumar (Elena), Tim McConnell (Elena), Peiran Li (Elena), Chen Wang (Elena), Fernando Pereira (Elena), Craig Swanson (Elena), Fayaz Jamil (Elena), Yan Xiong (Elena), Anitha Vijayakumar (Elena), Prakash Shroff (Elena), Kedar Soparkar (Elena), Jindong Gu (Elena), Livio Baldini Soares (Elena), Eric Wang (Elena), Kushal Majmundar (Elena), Aurora Wei (Elena), Kai Bailey (Elena), Nora Kassner (Elena), Chizu Kawamoto (Elena), Goran \v{Z}u\v{z}i\'c (Elena), Victor Gomes (Elena), Abhirut Gupta (Elena), Michael Guzman (Elena), Ishita Dasgupta (Elena), Xinyi Bai (Elena), Zhufeng Pan (Elena), Francesco Piccinno (Elena), Hadas Natalie Vogel (Elena), Octavio Ponce (Elena), Adrian Hutter (Elena), Paul Chang (Elena), Pan-Pan Jiang (Elena), Ionel Gog (Elena), Vlad Ionescu (Elena), James Manyika (Elena), Fabian Pedregosa (Elena), Harry Ragan (Elena), Zach Behrman (Elena), Ryan Mullins (Elena), Coline Devin (Elena), Aroonalok Pyne (Elena), Swapnil Gawde (Elena), Martin Chadwick (Elena), Yiming Gu (Elena), Sasan Tavakkol (Elena), Andy Twigg (Elena), Naman Goyal (Elena), Ndidi Elue (Elena), Anna Goldie (Elena), Srinivasan Venkatachary (Elena), Hongliang Fei (Elena), Ziqiang Feng (Elena), Marvin Ritter (Elena), Isabel Leal (Elena), Sudeep Dasari (Elena), Pei Sun (Elena), Alif Raditya Rochman (Elena), Brendan O'Donoghue (Elena), Yuchen Liu (Elena), Jim Sproch (Elena), Kai Chen (Elena), Natalie Clay (Elena), Slav Petrov (Elena), Sailesh Sidhwani (Elena), Ioana Mihailescu (Elena), Alex Panagopoulos (Elena), AJ Piergiovanni (Elena), Yunfei Bai (Elena), George Powell (Elena), Deep Karkhanis (Elena), Trevor Yacovone (Elena), Petr Mitrichev (Elena), Joe Kovac (Elena), Dave Uthus (Elena), Amir Yazdanbakhsh (Elena), David Amos (Elena), Steven Zheng (Elena), Bing Zhang (Elena), Jin Miao (Elena), Bhuvana Ramabhadran (Elena), Soroush Radpour (Elena), Shantanu Thakoor (Elena), Josh Newlan (Elena), Oran Lang (Elena), Orion Jankowski (Elena), Shikhar Bharadwaj (Elena), Jean-Michel Sarr (Elena), Shereen Ashraf (Elena), Sneha Mondal (Elena), Jun Yan (Elena), Ankit Singh Rawat (Elena), Sarmishta Velury (Elena), Greg Kochanski (Elena), Tom Eccles (Elena), Franz Och (Elena), Abhanshu Sharma (Elena), Ethan Mahintorabi (Elena), Alex Gurney (Elena), Carrie Muir (Elena), Vered Cohen (Elena), Saksham Thakur (Elena), Adam Bloniarz (Elena), Asier Mujika (Elena), Alexander Pritzel (Elena), Paul Caron (Elena), Altaf Rahman (Elena), Fiona Lang (Elena), Yasumasa Onoe (Elena), Petar Sirkovic (Elena), Jay Hoover (Elena), Ying Jian (Elena), Pablo Duque (Elena), Arun Narayanan (Elena), David Soergel (Elena), Alex Haig (Elena), Loren Maggiore (Elena), Shyamal Buch (Elena), Josef Dean (Elena), Ilya Figotin (Elena), Igor Karpov (Elena), Shaleen Gupta (Elena), Denny Zhou (Elena), Muhuan Huang (Elena), Ashwin Vaswani (Elena), Christopher Semturs (Elena), Kaushik Shivakumar (Elena), Yu Watanabe (Elena), Vinodh Kumar Rajendran (Elena), Eva Lu (Elena), Yanhan Hou (Elena), Wenting Ye (Elena), Shikhar Vashishth (Elena), Nana Nti (Elena), Vytenis Sakenas (Elena), Darren Ni (Elena), Doug DeCarlo (Elena), Michael Bendersky (Elena), Sumit Bagri (Elena), Nacho Cano (Elena), Elijah Peake (Elena), Simon Tokumine 
(Elena), Varun Godbole (Elena), Carlos Gu\'ia (Elena), Tanya Lando (Elena), Vittorio Selo (Elena), Seher Ellis (Elena), Danny Tarlow (Elena), Daniel Gillick (Elena), Alessandro Epasto (Elena), Siddhartha Reddy Jonnalagadda (Elena), Meng Wei (Elena), Meiyan Xie (Elena), Ankur Taly (Elena), Michela Paganini (Elena), Mukund Sundararajan (Elena), Daniel Toyama (Elena), Ting Yu (Elena), Dessie Petrova (Elena), Aneesh Pappu (Elena), Rohan Agrawal (Elena), Senaka Buthpitiya (Elena), Justin Frye (Elena), Thomas Buschmann (Elena), Remi Crocker (Elena), Marco Tagliasacchi (Elena), Mengchao Wang (Elena), Da Huang (Elena), Sagi Perel (Elena), Brian Wieder (Elena), Hideto Kazawa (Elena), Weiyue Wang (Elena), Jeremy Cole (Elena), Himanshu Gupta (Elena), Ben Golan (Elena), Seojin Bang (Elena), Nitish Kulkarni (Elena), Ken Franko (Elena), Casper Liu (Elena), Doug Reid (Elena), Sid Dalmia (Elena), Jay Whang (Elena), Kevin Cen (Elena), Prasha Sundaram (Elena), Johan Ferret (Elena), Berivan Isik (Elena), Lucian Ionita (Elena), Guan Sun (Elena), Anna Shekhawat (Elena), Muqthar Mohammad (Elena), Philip Pham (Elena), Ronny Huang (Elena), Karthik Raman (Elena), Xingyi Zhou (Elena), Ross Mcilroy (Elena), Austin Myers (Elena), Sheng Peng (Elena), Jacob Scott (Elena), Paul Covington (Elena), Sofia Erell (Elena), Pratik Joshi (Elena), Jo\~ao Gabriel Oliveira (Elena), Natasha Noy (Elena), Tajwar Nasir (Elena), Jake Walker (Elena), Vera Axelrod (Elena), Tim Dozat (Elena), Pu Han (Elena), Chun-Te Chu (Elena), Eugene Weinstein (Elena), Anand Shukla (Elena), Shreyas Chandrakaladharan (Elena), Petra Poklukar (Elena), Bonnie Li (Elena), Ye Jin (Elena), Prem Eruvbetine (Elena), Steven Hansen (Elena), Avigail Dabush (Elena), Alon Jacovi (Elena), Samrat Phatale (Elena), Chen Zhu (Elena), Steven Baker (Elena), Mo Shomrat (Elena), Yang Xiao (Elena), Jean Pouget-Abadie (Elena), Mingyang Zhang (Elena), Fanny Wei (Elena), Yang Song (Elena), Helen King (Elena), Yiling Huang (Elena), Yun Zhu (Elena), Ruoxi Sun (Elena), Juliana Vicente Franco (Elena), Chu-Cheng Lin (Elena), Sho Arora (Elena), Hui (Elena), Li, Vivian Xia, Luke Vilnis, Mariano Schain, Kaiz Alarakyia, Laurel Prince, Aaron Phillips, Caleb Habtegebriel, Luyao Xu, Huan Gui, Santiago Ontanon, Lora Aroyo, Karan Gill, Peggy Lu, Yash Katariya, Dhruv Madeka, Shankar Krishnan, Shubha Srinivas Raghvendra, James Freedman, Yi Tay, Gaurav Menghani, Peter Choy, Nishita Shetty, Dan Abolafia, Doron Kukliansky, Edward Chou, Jared Lichtarge, Ken Burke, Ben Coleman, Dee Guo, Larry Jin, Indro Bhattacharya, Victoria Langston, Yiming Li, Suyog Kotecha, Alex Yakubovich, Xinyun Chen, Petre Petrov, Tolly Powell, Yanzhang He, Corbin Quick, Kanav Garg, Dawsen Hwang, Yang Lu, Srinadh Bhojanapalli, Kristian Kjems, Ramin Mehran, Aaron Archer, Hado van Hasselt, Ashwin Balakrishna, JK Kearns, Meiqi Guo, Jason Riesa, Mikita Sazanovich, Xu Gao, Chris Sauer, Chengrun Yang, XiangHai Sheng, Thomas Jimma, Wouter Van Gansbeke, Vitaly Nikolaev, Wei Wei, Katie Millican, Ruizhe Zhao, Justin Snyder, Levent Bolelli, Maura O'Brien, Shawn Xu, Fei Xia, Wentao Yuan, Arvind Neelakantan, David Barker, Sachin Yadav, Hannah Kirkwood, Farooq Ahmad, Joel Wee, Jordan Grimstad, Boyu Wang, Matthew Wiethoff, Shane Settle, Miaosen Wang, Charles Blundell, Jingjing Chen, Chris Duvarney, Grace Hu, Olaf Ronneberger, Alex Lee, Yuanzhen Li, Abhishek Chakladar, Alena Butryna, Georgios Evangelopoulos, Guillaume Desjardins, Jonni Kanerva, Henry Wang, Averi Nowak, Nick Li, Alyssa Loo, Art Khurshudov, Laurent El Shafey, Nagabhushan Baddi, 
Karel Lenc, Yasaman Razeghi, Tom Lieber, Amer Sinha, Xiao Ma, Yao Su, James Huang, Asahi Ushio, Hanna Klimczak-Pluci\'nska, Kareem Mohamed, JD Chen, Simon Osindero, Stav Ginzburg, Lampros Lamprou, Vasilisa Bashlovkina, Duc-Hieu Tran, Ali Khodaei, Ankit Anand, Yixian Di, Ramy Eskander, Manish Reddy Vuyyuru, Jasmine Liu, Aishwarya Kamath, Roman Goldenberg, Mathias Bellaiche, Juliette Pluto, Bill Rosgen, Hassan Mansoor, William Wong, Suhas Ganesh, Eric Bailey, Scott Baird, Dan Deutsch, Jinoo Baek, Xuhui Jia, Chansoo Lee, Abe Friesen, Nathaniel Braun, Kate Lee, Amayika Panda, Steven M. Hernandez, Duncan Williams, Jianqiao Liu, Ethan Liang, Arnaud Autef, Emily Pitler, Deepali Jain, Phoebe Kirk, Oskar Bunyan, Jaume Sanchez Elias, Tongxin Yin, Machel Reid, Aedan Pope, Nikita Putikhin, Bidisha Samanta, Sergio Guadarrama, Dahun Kim, Simon Rowe, Marcella Valentine, Geng Yan, Alex Salcianu, David Silver, Gan Song, Richa Singh, Shuai Ye, Hannah DeBalsi, Majd Al Merey, Eran Ofek, Albert Webson, Shibl Mourad, Ashwin Kakarla, Silvio Lattanzi, Nick Roy, Evgeny Sluzhaev, Christina Butterfield, Alessio Tonioni, Nathan Waters, Sudhindra Kopalle, Jason Chase, James Cohan, Girish Ramchandra Rao, Robert Berry, Michael Voznesensky, Shuguang Hu, Kristen Chiafullo, Sharat Chikkerur, George Scrivener, Ivy Zheng, Jeremy Wiesner, Wolfgang Macherey, Timothy Lillicrap, Fei Liu, Brian Walker, David Welling, Elinor Davies, Yangsibo Huang, Lijie Ren, Nir Shabat, Alessandro Agostini, Mariko Iinuma, Dustin Zelle, Rohit Sathyanarayana, Andrea D'olimpio, Morgan Redshaw, Matt Ginsberg, Ashwin Murthy, Mark Geller, Tatiana Matejovicova, Ayan Chakrabarti, Ryan Julian, Christine Chan, Qiong Hu, Daniel Jarrett, Manu Agarwal, Jeshwanth Challagundla, Tao Li, Sandeep Tata, Wen Ding, Maya Meng, Zhuyun Dai, Giulia Vezzani, Shefali Garg, Jannis Bulian, Mary Jasarevic, Honglong Cai, Harish Rajamani, Adam Santoro, Florian Hartmann, Chen Liang, Bartek Perz, Apoorv Jindal, Fan Bu, Sungyong Seo, Ryan Poplin, Adrian Goedeckemeyer, Badih Ghazi, Nikhil Khadke, Leon Liu, Kevin Mather, Mingda Zhang, Ali Shah, Alex Chen, Jinliang Wei, Keshav Shivam, Yuan Cao, Donghyun Cho, Angelo Scorza Scarpati, Michael Moffitt, Clara Barbu, Ivan Jurin, Ming-Wei Chang, Hongbin Liu, Hao Zheng, Shachi Dave, Christine Kaeser-Chen, Xiaobin Yu, Alvin Abdagic, Lucas Gonzalez, Yanping Huang, Peilin Zhong, Cordelia Schmid, Bryce Petrini, Alex Wertheim, Jifan Zhu, Hoang Nguyen, Kaiyang Ji, Yanqi Zhou, Tao Zhou, Fangxiaoyu Feng, Regev Cohen, David Rim, Shubham Milind Phal, Petko Georgiev, Ariel Brand, Yue Ma, Wei Li, Somit Gupta, Chao Wang, Pavel Dubov, Jean Tarbouriech, Kingshuk Majumder, Huijian Li, Norman Rink, Apurv Suman, Yang Guo, Yinghao Sun, Arun Nair, Xiaowei Xu, Mohamed Elhawaty, Rodrigo Cabrera, Guangxing Han, Julian Eisenschlos, Junwen Bai, Yuqi Li, Yamini Bansal, Thibault Sellam, Mina Khan, Hung Nguyen, Justin Mao-Jones, Nikos Parotsidis, Jake Marcus, Cindy Fan, Roland Zimmermann, Yony Kochinski, Laura Graesser, Feryal Behbahani, Alvaro Caceres, Michael Riley, Patrick Kane, Sandra Lefdal, Rob Willoughby, Paul Vicol, Lun Wang, Shujian Zhang, Ashleah Gill, Yu Liang, Gautam Prasad, Soroosh Mariooryad, Mehran Kazemi, Zifeng Wang, Kritika Muralidharan, Paul Voigtlaender, Jeffrey Zhao, Huanjie Zhou, Nina D'Souza, Aditi Mavalankar, S\'eb Arnold, Nick Young, Obaid Sarvana, Chace Lee, Milad Nasr, Tingting Zou, Seokhwan Kim, Lukas Haas, Kaushal Patel, Neslihan Bulut, David Parkinson, Courtney Biles, Dmitry Kalashnikov, Chi Ming To, Aviral Kumar, Jessica Austin, Alex 
Greve, Lei Zhang, Megha Goel, Yeqing Li, Sergey Yaroshenko, Max Chang, Abhishek Jindal, Geoff Clark, Hagai Taitelbaum, Dale Johnson, Ofir Roval, Jeongwoo Ko, Anhad Mohananey, Christian Schuler, Shenil Dodhia, Ruichao Li, Kazuki Osawa, Claire Cui, Peng Xu, Rushin Shah, Tao Huang, Ela Gruzewska, Nathan Clement, Mudit Verma, Olcan Sercinoglu, Hai Qian, Viral Shah, Masa Yamaguchi, Abhinit Modi, Takahiro Kosakai, Thomas Strohmann, Junhao Zeng, Beliz Gunel, Jun Qian, Austin Tarango, Krzysztof Jastrz\k{e}bski, Robert David, Jyn Shan, Parker Schuh, Kunal Lad, Willi Gierke, Mukundan Madhavan, Xinyi Chen, Mark Kurzeja, Rebeca Santamaria-Fernandez, Dawn Chen, Alexandra Cordell, Yuri Chervonyi, Frankie Garcia, Nithish Kannen, Vincent Perot, Nan Ding, Shlomi Cohen-Ganor, Victor Lavrenko, Junru Wu, Georgie Evans, Cicero Nogueira dos Santos, Madhavi Sewak, Ashley Brown, Andrew Hard, Joan Puigcerver, Zeyu Zheng, Yizhong Liang, Evgeny Gladchenko, Reeve Ingle, Uri First, Pierre Sermanet, Charlotte Magister, Mihajlo Velimirovi\'c, Sashank Reddi, Susanna Ricco, Eirikur Agustsson, Hartwig Adam, Nir Levine, David Gaddy, Dan Holtmann-Rice, Xuanhui Wang, Ashutosh Sathe, Abhijit Guha Roy, Bla\v{z} Bratani\v{c}, Alen Carin, Harsh Mehta, Silvano Bonacina, Nicola De Cao, Mara Finkelstein, Verena Rieser, Xinyi Wu, Florent Altch\'e, Dylan Scandinaro, Li Li, Nino Vieillard, Nikhil Sethi, Garrett Tanzer, Zhi Xing, Shibo Wang, Parul Bhatia, Gui Citovsky, Thomas Anthony, Sharon Lin, Tianze Shi, Shoshana Jakobovits, Gena Gibson, Raj Apte, Lisa Lee, Mingqing Chen, Arunkumar Byravan, Petros Maniatis, Kellie Webster, Andrew Dai, Pu-Chin Chen, Jiaqi Pan, Asya Fadeeva, Zach Gleicher, Thang Luong, Niket Kumar Bhumihar
Abstract: In this report, we introduce the Gemini 2.X model family: Gemini 2.5 Pro and Gemini 2.5 Flash, as well as our earlier Gemini 2.0 Flash and Flash-Lite models. Gemini 2.5 Pro is our most capable model yet, achieving SoTA performance on frontier coding and reasoning benchmarks. In addition to its coding and reasoning skills, Gemini 2.5 Pro is a thinking model that excels at multimodal understanding, and it can now process up to 3 hours of video content. Its unique combination of long-context, multimodal, and reasoning capabilities can be leveraged to unlock new agentic workflows. Gemini 2.5 Flash provides excellent reasoning abilities at a fraction of the compute and latency requirements, and Gemini 2.0 Flash and Flash-Lite provide high performance at low latency and cost. Taken together, the Gemini 2.X model generation spans the full Pareto frontier of model capability vs. cost, allowing users to explore the boundaries of what is possible with complex agentic problem solving.
Authors: Julio Garrido, Javier Vales, Diego Silva-Mu\~niz, Enrique Riveiro, Pablo L\'opez-Matencio, Josu\'e Rivera-Andrade
Abstract: Cable-Driven Parallel Robots (CDPRs) are increasingly used for load manipulation tasks involving predefined toolpaths with intermediate stops. At each stop, where the platform maintains a fixed pose and the motors keep the cables under tension, the system must evaluate whether it is safe to proceed by detecting anomalies that could compromise performance (e.g., wind gusts or cable impacts). This paper investigates whether anomalies can be detected using only motor torque data, without additional sensors. It introduces an adaptive, unsupervised outlier detection algorithm based on Gaussian Mixture Models (GMMs) to identify anomalies from torque signals. The method starts with a brief calibration period, just a few seconds, during which a GMM is fit on known anomaly-free data. Real-time torque measurements are then evaluated using Mahalanobis distance from the GMM, with statistically derived thresholds triggering anomaly flags. Model parameters are periodically updated using the latest segments identified as anomaly-free to adapt to changing conditions. Validation includes 14 long-duration test sessions simulating varied wind intensities. The proposed method achieves a 100% true positive rate and 95.4% average true negative rate, with 1-second detection latency. Comparative evaluation against power threshold and non-adaptive GMM methods indicates higher robustness to drift and environmental variation.
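A minimal sketch of the detection loop described above, using scikit-learn: fit a GMM on a short anomaly-free calibration window, then flag torque samples whose Mahalanobis distance to the nearest component exceeds a statistically derived threshold. The component count, feature layout, and chi-square threshold rule are illustrative assumptions, not the authors' exact configuration.

    import numpy as np
    from scipy.stats import chi2
    from sklearn.mixture import GaussianMixture

    def fit_calibration_gmm(torque_windows, n_components=3):
        # Calibration phase: torque_windows is (n_samples, n_features) of
        # known anomaly-free torque data collected over a few seconds.
        return GaussianMixture(n_components=n_components,
                               covariance_type="full").fit(torque_windows)

    def mahalanobis_to_nearest(gmm, x):
        # Smallest Mahalanobis distance from sample x to any component.
        dists = [np.sqrt((x - m) @ np.linalg.inv(c) @ (x - m))
                 for m, c in zip(gmm.means_, gmm.covariances_)]
        return min(dists)

    def is_anomaly(gmm, x, alpha=0.999):
        # Assumed threshold rule: the squared Mahalanobis distance of a
        # Gaussian sample is chi-square distributed with dim(x) dofs.
        threshold = np.sqrt(chi2.ppf(alpha, df=x.shape[0]))
        return mahalanobis_to_nearest(gmm, x) > threshold

Periodically refitting the mixture on the latest segments flagged anomaly-free, as the paper proposes, lets the detector track slow drift in operating conditions.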
Authors: Jaeheun Jung, Bosung Jung, Suhyun Bae, Donghun Lee
Abstract: Machine unlearning seeks to remove the influence of particular data or classes from trained models to meet privacy, legal, or ethical requirements. Existing unlearning methods tend to forget shallowly: the unlearned model merely pretends to forget by adjusting only its responses, while its internal representations retain enough information to restore the forgotten data or behavior. We empirically confirm that this shallowness is widespread by reverting the forgetting effect of various unlearning methods via a training-free performance recovery attack and a gradient-inversion-based data reconstruction attack. To address this vulnerability fundamentally, we define a theoretical criterion of ``deep forgetting'' based on one-point-contraction of the feature representations of the data to be forgotten. We also propose an efficient approximation algorithm and use it to construct a novel general-purpose unlearning algorithm: One-Point-Contraction (OPC). Empirical evaluations on image classification unlearning benchmarks show that OPC achieves not only effective unlearning performance but also superior resilience against both performance recovery and gradient-inversion attacks. The distinctive unlearning performance of OPC arises from the deep feature forgetting enforced by its theoretical foundation and underscores the need for improved robustness in machine unlearning methods.
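To make the contraction idea concrete, here is a minimal sketch of a one-point-contraction style training step, assuming a feature extractor f and classifier head h as separate modules; the anchor choice, loss weighting, and update schedule are illustrative assumptions, not the authors' exact OPC algorithm.

    import torch
    import torch.nn.functional as F

    def opc_step(f, h, retain_batch, forget_batch, anchor, optimizer, lam=1.0):
        xr, yr = retain_batch    # data whose behavior must be preserved
        xf, _ = forget_batch     # data to be deeply forgotten
        retain_loss = F.cross_entropy(h(f(xr)), yr)
        # Contract forget-set features toward one fixed point, so the internal
        # representation (not just the output) stops encoding the data.
        contraction_loss = ((f(xf) - anchor) ** 2).sum(dim=1).mean()
        loss = retain_loss + lam * contraction_loss
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()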
Authors: Joel Schlotthauer, Christian Kroos, Chris Hinze, Viktor Hangya, Luzian Hahn, Fabian K\"uch
Abstract: Optimizers play a decisive role in reducing pre-training times for LLMs and achieving better-performing models. In this study, we compare three major variants: the de facto standard AdamW, the simpler Lion, developed through an evolutionary search, and the second-order optimizer Sophia. For better generalization, we train with two different base architectures and use a single- and a multiple-epoch approach while keeping the number of tokens constant. Using the Maximal Update Parametrization and smaller proxy models, we tune relevant hyperparameters separately for each combination of base architecture and optimizer. We found that while the results from all three optimizers were in approximately the same range, Sophia exhibited the lowest training and validation loss, Lion was fastest in terms of training GPU hours, and AdamW led to the best downstream evaluation results.
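For readers unfamiliar with Lion, its update rule (Chen et al., 2023) is compact enough to state directly; the hyperparameter values below are typical defaults, not the tuned values from this study.

    import torch

    @torch.no_grad()
    def lion_update(p, grad, m, lr=1e-4, beta1=0.9, beta2=0.99, wd=0.1):
        # One Lion step for parameter p with momentum buffer m (in place).
        p.mul_(1 - lr * wd)                        # decoupled weight decay
        update = (beta1 * m + (1 - beta1) * grad).sign()
        p.add_(update, alpha=-lr)                  # sign-of-interpolation update
        m.mul_(beta2).add_(grad, alpha=1 - beta2)  # momentum EMA

The single momentum buffer and sign-based update are what make Lion cheaper per step than AdamW's two-buffer update, consistent with its lead in training GPU hours.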
Authors: Anita Kriz, Elizabeth Laura Janes, Xing Shen, Tal Arbel
Abstract: Multimodal large language models (MLLMs) hold considerable promise for applications in healthcare. However, their deployment in safety-critical settings is hindered by two key limitations: (i) sensitivity to prompt design, and (ii) a tendency to generate incorrect responses with high confidence. As clinicians may rely on a model's stated confidence to gauge the reliability of its predictions, it is especially important that when a model expresses high confidence, it is also highly accurate. We introduce Prompt4Trust, the first reinforcement learning (RL) framework for prompt augmentation targeting confidence calibration in MLLMs. A lightweight LLM is trained to produce context-aware auxiliary prompts that guide a downstream task MLLM to generate responses in which the expressed confidence more accurately reflects predictive accuracy. Unlike conventional calibration techniques, Prompt4Trust specifically prioritizes aspects of calibration most critical for safe and trustworthy clinical decision-making. Beyond improvements driven by this clinically motivated calibration objective, our proposed method also improves task accuracy, achieving state-of-the-art medical visual question answering (VQA) performance on the PMC-VQA benchmark, which is composed of multiple-choice questions spanning diverse medical imaging modalities. Moreover, our framework trained with a small downstream task MLLM showed promising zero-shot generalization to larger MLLMs in our experiments, suggesting the potential for scalable calibration without the associated computational costs. This work demonstrates the potential of automated yet human-aligned prompt engineering for improving the trustworthiness of MLLMs in safety-critical settings. Our codebase can be found at https://github.com/xingbpshen/prompt4trust.
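As a reference point for what "expressed confidence more accurately reflects predictive accuracy" means, the standard expected calibration error (ECE) compares stated confidence with empirical accuracy inside confidence bins; the ten-bin version below is the common textbook form, not necessarily the paper's clinically weighted objective.

    import numpy as np

    def expected_calibration_error(confidences, correct, n_bins=10):
        confidences = np.asarray(confidences, dtype=float)
        correct = np.asarray(correct, dtype=float)
        edges = np.linspace(0.0, 1.0, n_bins + 1)
        ece = 0.0
        for lo, hi in zip(edges[:-1], edges[1:]):
            mask = (confidences > lo) & (confidences <= hi)
            if mask.any():
                # Bin-weighted gap between mean confidence and accuracy.
                gap = abs(correct[mask].mean() - confidences[mask].mean())
                ece += mask.mean() * gap
        return ece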
Authors: Jianzhe Ma, Wenxuan Wang, Qin Jin
Abstract: Geometry problem solving is a key area of mathematical reasoning with wide application in fields such as education, the assessment of the mathematical abilities of artificial intelligence, and multimodal ability assessment. In recent years, the rapid development of deep learning technology, especially the rise of multimodal large language models, has triggered a widespread research boom. This paper provides a survey of the applications of deep learning in geometry problem solving, including (i) a comprehensive summary of the relevant tasks in geometry problem solving; (ii) a thorough review of related deep learning methods; (iii) a detailed analysis of evaluation metrics and methods; and (iv) a critical discussion of the current challenges and future directions that can be explored. Our goal is to provide a comprehensive and practical reference on deep learning for geometry problem solving to promote further developments in this field. We maintain a continuously updated list of papers on GitHub: https://github.com/majianz/dl4gps.
Authors: Lorenzo Mannocci, Stefano Cresci, Matteo Magnani, Anna Monreale, Maurizio Tesconi
Abstract: Coordinated online behavior, which spans from beneficial collective actions to harmful manipulation such as disinformation campaigns, has become a key focus in digital ecosystem analysis. Traditional methods often rely on monomodal approaches, focusing on single types of interactions like co-retweets or co-hashtags, or consider multiple modalities independently of each other. However, these approaches may overlook the complex dynamics inherent in multimodal coordination. This study compares different ways of operationalizing the detection of multimodal coordinated behavior. It examines the trade-off between weakly and strongly integrated multimodal models, highlighting the balance between capturing broader coordination patterns and identifying tightly coordinated behavior. By comparing monomodal and multimodal approaches, we assess the unique contributions of different data modalities and explore how varying implementations of multimodality impact detection outcomes. Our findings reveal that not all the modalities provide distinct insights, but that with a multimodal approach we can get a more comprehensive understanding of coordination dynamics. This work enhances the ability to detect and analyze coordinated online behavior, offering new perspectives for safeguarding the integrity of digital platforms.
Authors: Diganta Misra, Nizar Islah, Victor May, Brice Rauby, Zihan Wang, Justine Gehring, Antonio Orvieto, Muawiz Chaudhary, Eilif B. Muller, Irina Rish, Samira Ebrahimi Kahou, Massimo Caccia
Abstract: The rapid evolution of software libraries poses a considerable hurdle for code generation, necessitating continuous adaptation to frequent version updates while preserving backward compatibility. While existing code evolution benchmarks provide valuable insights, they typically lack execution-based evaluation for generating code compliant with specific library versions. To address this, we introduce GitChameleon 2.0, a novel, meticulously curated dataset comprising 328 Python code completion problems, each conditioned on specific library versions and accompanied by executable unit tests. GitChameleon 2.0 rigorously evaluates the capacity of contemporary large language models (LLMs), LLM-powered agents, code assistants, and RAG systems to perform version-conditioned code generation that demonstrates functional accuracy through execution. Our extensive evaluations indicate that state-of-the-art systems encounter significant challenges with this task, with enterprise models achieving baseline success rates in the 48-51% range, underscoring the intricacy of the problem. By offering an execution-based benchmark emphasizing the dynamic nature of code libraries, GitChameleon 2.0 enables a clearer understanding of this challenge and helps guide the development of more adaptable and dependable AI code generation methods. We make the dataset and evaluation code publicly available at https://github.com/mrcabbage972/GitChameleonBenchmark.
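The core of execution-based, version-conditioned evaluation can be sketched in a few lines: a candidate completion passes only if the hidden unit test executes cleanly inside an environment pinned to the target library version. The problem schema below is an illustrative assumption, not GitChameleon's actual format.

    def passes(candidate_code: str, test_code: str) -> bool:
        # Assumes the process runs in a virtualenv pinned to the problem's
        # library version; success means the hidden asserts all hold.
        namespace = {}
        try:
            exec(candidate_code, namespace)  # define the completed function
            exec(test_code, namespace)       # run hidden asserts against it
            return True
        except Exception:
            return False

    problem = {
        "library": "examplelib==2.1",        # hypothetical version pin
        "test": "assert clamp(5, 0, 3) == 3",
    }
    print(passes("def clamp(x, lo, hi):\n    return max(lo, min(x, hi))",
                 problem["test"]))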
Authors: Zeqian Chen
Abstract: The introduction of the transformer architecture in 2017 marked the most striking advancement in natural language processing. The transformer is a model architecture relying entirely on an attention mechanism to draw global dependencies between input and output. However, we believe there is a gap in our theoretical understanding of what the transformer is and how it works physically. From a physical perspective on modern chips, such as those fabricated at process nodes below 28 nm, modern intelligent machines should be regarded as open quantum systems beyond conventional statistical systems. Therefore, in this paper, we construct physical models realizing large language models based on a transformer architecture as open quantum systems in the Fock space over the Hilbert space of tokens. Our physical models underlie the transformer architecture for large language models.
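For reference, the attention mechanism the abstract alludes to is the scaled dot-product attention of the original 2017 architecture, $\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(QK^\top / \sqrt{d_k}\right)V$, where $Q$, $K$, and $V$ are the query, key, and value matrices and $d_k$ is the key dimension; the paper's contribution is a quantum-theoretic reinterpretation of models built from this operation.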
Authors: Shanbo Cheng, Yu Bao, Qian Cao, Luyang Huang, Liyan Kang, Zhicheng Liu, Yu Lu, Wenhao Zhu, Jingwen Chen, Zhichao Huang, Tao Li, Yifu Li, Huiying Lin, Sitong Liu, Ningxin Peng, Shuaijie She, Lu Xu, Nuo Xu, Sen Yang, Runsheng Yu, Yiming Yu, Liehao Zou, Hang Li, Lu Lu, Yuxuan Wang, Yonghui Wu
Abstract: Multilingual translation is a challenging task for large language models (LLMs), which must handle intricate language patterns and the stilted phrasings that arise in automated translations. In this paper, we introduce Seed-X, a family of open-source LLMs comprising instruct and reasoning models that pushes the limits of translation capability at a 7B parameter size. The base model is pre-trained on a diverse, high-quality dataset encompassing both monolingual and bilingual content across 28 languages, harnessing the full potential of multilingual data. The instruct model is then finetuned to translate via Chain-of-Thought (CoT) reasoning and further enhanced through reinforcement learning (RL) to achieve better generalization across diverse language pairs. Seed-X achieves performance comparable to leading closed-source models, including Gemini-2.5 and GPT-4o, across 28 languages, and significantly outperforms larger open-source models in both automatic metrics and human evaluations. We share best practices from our optimization process and make the parameters publicly available to advance translation research and applications.
Authors: Brian Ondov, William Xia, Kush Attal, Ishita Unde, Jerry He, Dina Demner-Fushman
Abstract: Objective: Recent advances in language models have shown potential to adapt professional-facing biomedical literature to plain language, making it accessible to patients and caregivers. However, their unpredictability, combined with the high potential for harm in this domain, means rigorous evaluation is necessary. Our goals with this track were to stimulate research and to provide high-quality evaluation of the most promising systems. Methods: We hosted the Plain Language Adaptation of Biomedical Abstracts (PLABA) track at the 2023 and 2024 Text Retrieval Conferences. Tasks included complete, sentence-level rewriting of abstracts (Task 1) as well as identifying and replacing difficult terms (Task 2). For automatic evaluation of Task 1, we developed a four-fold set of professionally-written references. Submissions for both Tasks 1 and 2 received extensive manual evaluation from biomedical experts. Results: Twelve teams spanning twelve countries participated in the track, with models ranging from multilayer perceptrons to large pretrained transformers. In manual judgments of Task 1, top-performing models rivaled human levels of factual accuracy and completeness, but not simplicity or brevity. Automatic, reference-based metrics generally did not correlate well with manual judgments. In Task 2, systems struggled with identifying difficult terms and classifying how to replace them. When generating replacements, however, LLM-based systems did well in manually judged accuracy, completeness, and simplicity, though not in brevity. Conclusion: The PLABA track showed promise for using Large Language Models to adapt biomedical literature for the general public, while also highlighting their deficiencies and the need for improved automatic benchmarking tools.
Authors: Jaeheun Jung, Jaehyuk Lee, Yeajin Lee, Donghun Lee
Abstract: With the growing demand for neural network compression, structured pruning methods, including importance-based approaches, are actively studied. Magnitude-based importance and the many modern importance criteria correlated with it often limit the range of possible pruning decisions, since a filter with larger magnitude is unlikely to be pruned before smaller ones are, even if it is redundant. In this paper, we propose a novel pruning strategy that challenges this dominating effect of magnitude and gives each filter a fair chance of being pruned, by placing filters in projective space. We then observe whether gradient descent moves each filter toward the origin, as a measure of how likely the filter is to be pruned. This measurement is used to construct PROscore, a novel importance score for IPPRO, a novel magnitude-indifferent importance-based structured pruning method. Our evaluation results show that the proposed importance criterion using projective space achieves near-lossless pruning by reducing the performance drop during pruning, with promising performance after finetuning. Our work debunks the ``size-matters'' myth in pruning and expands the frontier of importance-based pruning both theoretically and empirically.
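One way to read the abstract's construction is the sketch below: represent each filter by its direction (its representative in projective space) and score how strongly a gradient step moves the filter toward the origin. This is an interpretation for illustration only; PROscore's exact definition may differ.

    import torch

    def pro_score(weight: torch.Tensor, grad: torch.Tensor) -> torch.Tensor:
        # weight, grad: (num_filters, fan_in), e.g. flattened conv filters.
        w = weight.flatten(1)
        g = grad.flatten(1)
        # Radial component of the gradient: <w, g> / ||w|| is positive exactly
        # when a gradient-descent step shrinks the filter toward the origin,
        # regardless of the filter's magnitude.
        return (w * g).sum(dim=1) / w.norm(dim=1).clamp_min(1e-12)

Under this reading, the filters with the largest scores would be pruning candidates, independent of their raw magnitude.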
Authors: ZhengXiao He, Jinghao Wen, Huayu Li, Siyuan Tian, Ao Li
Abstract: We present a novel and interpretable framework for electrocardiogram (ECG)-based disease detection that combines hyperdimensional computing (HDC) with learnable neural encoding. Unlike conventional HDC approaches that rely on static, random projections, our method introduces a rhythm-aware and trainable encoding pipeline based on RR intervals, a physiological signal segmentation strategy that aligns with cardiac cycles. The core of our design is a neural-distilled HDC architecture, featuring a learnable RR-block encoder and a BinaryLinear hyperdimensional projection layer, optimized jointly with cross-entropy and proxy-based metric loss. This hybrid framework preserves the symbolic interpretability of HDC while enabling task-adaptive representation learning. Experiments on Apnea-ECG and PTB-XL demonstrate that our model significantly outperforms traditional HDC and classical ML baselines, achieving 73.09\% precision and an F1 score of 0.626 on Apnea-ECG, with comparable robustness on PTB-XL. Our framework offers an efficient and scalable solution for edge-compatible ECG classification, with strong potential for interpretable and personalized health monitoring.
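A minimal sketch of the two named components, assuming plain MLP blocks; the dimensions and the straight-through sign estimator are illustrative choices, not the paper's exact architecture.

    import torch
    import torch.nn as nn

    class BinaryLinear(nn.Module):
        # Learnable projection into a {-1, +1} hypervector; the sign is used
        # in the forward pass, with a straight-through gradient for training.
        def __init__(self, in_dim, hd_dim=10000):
            super().__init__()
            self.proj = nn.Linear(in_dim, hd_dim)

        def forward(self, x):
            h = self.proj(x)
            return h + (torch.sign(h) - h).detach()

    class RRBlockEncoder(nn.Module):
        # Encode one RR-interval (cardiac-cycle) segment into a feature vector,
        # so that segmentation aligns with the physiology rather than a clock.
        def __init__(self, segment_len, feat_dim=128):
            super().__init__()
            self.net = nn.Sequential(nn.Linear(segment_len, feat_dim),
                                     nn.ReLU(),
                                     nn.Linear(feat_dim, feat_dim))

        def forward(self, segment):
            return self.net(segment)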
Authors: Rithesh Murthy, Ming Zhu, Liangwei Yang, Jielin Qiu, Juntao Tan, Shelby Heinecke, Caiming Xiong, Silvio Savarese, Huan Wang
Abstract: Large Language Models (LLMs) perform best with well-crafted prompts, yet prompt engineering remains manual, inconsistent, and inaccessible to non-experts. We introduce Promptomatix, an automatic prompt optimization framework that transforms natural language task descriptions into high-quality prompts without requiring manual tuning or domain expertise. Promptomatix supports both a lightweight meta-prompt-based optimizer and a DSPy-powered compiler, with a modular design enabling future extension to more advanced frameworks. The system analyzes user intent, generates synthetic training data, selects prompting strategies, and refines prompts using cost-aware objectives. Evaluated across 5 task categories, Promptomatix achieves competitive or superior performance compared to existing libraries, while reducing prompt length and computational overhead, making prompt optimization scalable and efficient.
Authors: Keivan Shariatmadar, Ahmad Osman
Abstract: The integration of Artificial Intelligence (AI) into sports officiating represents a paradigm shift in how decisions are made in competitive environments. Traditional manual systems, even when supported by Instant Video Replay (IVR), often suffer from latency, subjectivity, and inconsistent enforcement, undermining fairness and athlete trust. This paper introduces 'FST.ai' -- developed under the 'R3AL.ai' project (principal investigator: r3al.ai) -- a novel AI-powered framework designed to enhance officiating in Sport Taekwondo, particularly focusing on the complex task of real-time head kick detection and scoring. Leveraging computer vision, deep learning, and edge inference, the system automates the identification and classification of key actions, significantly reducing decision time from minutes to seconds while improving consistency and transparency. Importantly, the methodology is not limited to Taekwondo. The underlying framework -- based on pose estimation, motion classification, and impact analysis -- can be adapted to a wide range of sports requiring action detection, such as judo, karate, fencing, or even team sports like football and basketball, where foul recognition or performance tracking is critical. By addressing one of Taekwondo's most challenging scenarios -- head kick scoring -- we demonstrate the robustness, scalability, and sport-agnostic potential of 'FST.ai' to transform officiating standards across multiple disciplines.
Authors: An Wang, Rulin Zhou, Mengya Xu, Yiru Ye, Longfei Gou, Yiting Chang, Hao Chen, Chwee Ming Lim, Jiankun Wang, Hongliang Ren
Abstract: Visualizing subtle vascular motions in endoscopic surgery is crucial for surgical precision and decision-making, yet remains challenging due to the complex and dynamic nature of surgical scenes. To address this, we introduce EndoControlMag, a training-free, Lagrangian-based framework with mask-conditioned vascular motion magnification tailored to endoscopic environments. Our approach features two key modules: a Periodic Reference Resetting (PRR) scheme that divides videos into short overlapping clips with dynamically updated reference frames to prevent error accumulation while maintaining temporal coherence, and a Hierarchical Tissue-aware Magnification (HTM) framework with dual-mode mask dilation. HTM first tracks vessel cores using a pretrained visual tracking model to maintain accurate localization despite occlusions and view changes. It then applies one of two adaptive softening strategies to surrounding tissues: motion-based softening that modulates magnification strength proportional to observed tissue displacement, or distance-based exponential decay that simulates biomechanical force attenuation. This dual-mode approach accommodates diverse surgical scenarios: motion-based softening excels with complex tissue deformations, while distance-based softening provides stability during unreliable optical flow conditions. We evaluate EndoControlMag on our EndoVMM24 dataset spanning four different surgery types and various challenging scenarios, including occlusions, instrument disturbance, view changes, and vessel deformations. Quantitative metrics, visual assessments, and expert surgeon evaluations demonstrate that EndoControlMag significantly outperforms existing methods in both magnification accuracy and visual quality while maintaining robustness across challenging surgical conditions. The code, dataset, and video results are available at https://szupc.github.io/EndoControlMag/.
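The distance-based softening mode has a particularly simple form: the magnification weight decays exponentially with distance from the tracked vessel mask, mimicking biomechanical force attenuation. The sketch below assumes a per-pixel weight map and an illustrative decay constant.

    import numpy as np
    from scipy.ndimage import distance_transform_edt

    def magnification_weights(vessel_mask: np.ndarray, alpha: float = 0.1):
        # vessel_mask: boolean (H, W), True on tracked vessel cores.
        # Distance of every pixel to the nearest vessel pixel (0 on the vessel).
        dist = distance_transform_edt(~vessel_mask)
        # Full magnification on the vessel, exponential decay into the tissue.
        return np.exp(-alpha * dist)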
Authors: Chilam Cheang, Sijin Chen, Zhongren Cui, Yingdong Hu, Liqun Huang, Tao Kong, Hang Li, Yifeng Li, Yuxiao Liu, Xiao Ma, Hao Niu, Wenxuan Ou, Wanli Peng, Zeyu Ren, Haixin Shi, Jiawen Tian, Hongtao Wu, Xin Xiao, Yuyang Xiao, Jiafeng Xu, Yichu Yang
Abstract: We report our recent progress towards building generalist robot policies: the development of GR-3. GR-3 is a large-scale vision-language-action (VLA) model. It showcases exceptional capabilities in generalizing to novel objects, environments, and instructions involving abstract concepts. Furthermore, it can be efficiently fine-tuned with minimal human trajectory data, enabling rapid and cost-effective adaptation to new settings. GR-3 also excels in handling long-horizon and dexterous tasks, including those requiring bi-manual manipulation and mobile movement, showcasing robust and reliable performance. These capabilities are achieved through a multi-faceted training recipe that includes co-training with web-scale vision-language data, efficient fine-tuning from human trajectory data collected via VR devices, and effective imitation learning with robot trajectory data. In addition, we introduce ByteMini, a versatile bi-manual mobile robot designed with exceptional flexibility and reliability, capable of accomplishing a wide range of tasks when integrated with GR-3. Through extensive real-world experiments, we show GR-3 surpasses the state-of-the-art baseline method, $\pi_0$, on a wide variety of challenging tasks. We hope GR-3 can serve as a step towards building generalist robots capable of assisting humans in daily life.
Authors: Andrei-Valentin Tanase, Elena Pelican
Abstract: We present Supernova, a 650M-parameter decoder-only transformer that demonstrates how careful architectural design and tokenization innovation can achieve the performance of larger models while maintaining computational efficiency. Our architecture combines Rotary Positional Embeddings (RoPE), Grouped Query Attention (GQA) with a 3:1 compression ratio, RMSNorm for computational efficiency, and SwiGLU activation functions. A critical innovation is our custom 128,000-vocabulary byte-level BPE tokenizer, which achieves state-of-the-art compression performance. Through detailed analysis, we show that Supernova achieves 90% of the performance of 1B-parameter models while using 35% fewer parameters and requiring only 100B training tokens--an order of magnitude fewer than competing models. Our findings challenge the prevailing scaling paradigm, demonstrating that architectural efficiency and tokenization quality can compensate for reduced parameter counts.
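A minimal PyTorch sketch of the grouped-query attention described above, with the stated 3:1 query-to-KV head compression. The hidden size and head counts are illustrative assumptions, not the released configuration, and the RoPE rotation of queries and keys is omitted for brevity.

```python
import torch
import torch.nn.functional as F
from torch import nn

class GroupedQueryAttention(nn.Module):
    """GQA with a 3:1 query-to-KV head ratio: each KV head is shared
    by a group of three query heads, shrinking the KV cache by 3x."""
    def __init__(self, d_model: int = 1152, n_q_heads: int = 18, groups: int = 3):
        super().__init__()
        assert n_q_heads % groups == 0
        self.n_q = n_q_heads
        self.n_kv = n_q_heads // groups          # 3:1 compression
        self.d_head = d_model // n_q_heads
        self.wq = nn.Linear(d_model, self.n_q * self.d_head, bias=False)
        self.wk = nn.Linear(d_model, self.n_kv * self.d_head, bias=False)
        self.wv = nn.Linear(d_model, self.n_kv * self.d_head, bias=False)
        self.wo = nn.Linear(self.n_q * self.d_head, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, _ = x.shape
        q = self.wq(x).view(b, t, self.n_q, self.d_head).transpose(1, 2)
        k = self.wk(x).view(b, t, self.n_kv, self.d_head).transpose(1, 2)
        v = self.wv(x).view(b, t, self.n_kv, self.d_head).transpose(1, 2)
        # Broadcast each KV head to its group of query heads.
        k = k.repeat_interleave(self.n_q // self.n_kv, dim=1)
        v = v.repeat_interleave(self.n_q // self.n_kv, dim=1)
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.wo(out.transpose(1, 2).reshape(b, t, -1))
```

The memory saving comes entirely from the smaller `wk`/`wv` projections and KV cache; compute at attention time is unchanged, which is why GQA trades so little quality for the efficiency gain.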
Authors: Mohammad 'Matt' Namvarpour, Brandon Brofsky, Jessica Medina, Mamtaj Akter, Afsaneh Razi
Abstract: As Generative Artificial Intelligence (GenAI) driven chatbots like Character.AI become embedded in adolescent life, they raise concerns about emotional dependence and digital overreliance. While studies have investigated adults' overreliance on these chatbots, teens' interactions with chatbots offering customizable personas remain unexamined. We analyzed 318 Reddit posts made by users self-reporting as 13-17 years old on the Character.AI subreddit to understand patterns of overreliance. We found that teens commonly begin using chatbots for emotional support or creative expression, but many develop strong attachments that interfere with offline relationships and daily routines. Their posts revealed recurring signs of psychological distress, cycles of relapse, and difficulty disengaging. Teens reported that their overreliance often ended when they reflected on the harm, returned to in-person social settings, or became frustrated by platform restrictions. Based on these findings, we provide recommendations for future chatbot design that promotes self-awareness, supports real-world engagement, and involves teens in developing safer digital tools.
Authors: Fei Tang, Zhangxuan Gu, Zhengxi Lu, Xuyang Liu, Shuheng Shen, Changhua Meng, Wen Wang, Wenqi Zhang, Yongliang Shen, Weiming Lu, Jun Xiao, Yueting Zhuang
Abstract: Graphical User Interface (GUI) grounding maps natural language instructions to precise interface locations for autonomous interaction. Current reinforcement learning approaches use binary rewards that treat elements as hit-or-miss targets, creating sparse signals that ignore the continuous nature of spatial interactions. Motivated by human clicking behavior that naturally forms Gaussian distributions centered on target elements, we introduce GUI Gaussian Grounding Rewards (GUI-G$^2$), a principled reward framework that models GUI elements as continuous Gaussian distributions across the interface plane. GUI-G$^2$ incorporates two synergistic mechanisms: Gaussian point rewards model precise localization through exponentially decaying distributions centered on element centroids, while coverage rewards assess spatial alignment by measuring the overlap between predicted Gaussian distributions and target regions. To handle diverse element scales, we develop an adaptive variance mechanism that calibrates reward distributions based on element dimensions. This framework transforms GUI grounding from sparse binary classification to dense continuous optimization, where Gaussian distributions generate rich gradient signals that guide models toward optimal interaction positions. Extensive experiments across ScreenSpot, ScreenSpot-v2, and ScreenSpot-Pro benchmarks demonstrate that GUI-G$^2$ substantially outperforms the state-of-the-art method UI-TARS-72B, with the most significant improvement of 24.7% on ScreenSpot-Pro. Our analysis reveals that continuous modeling provides superior robustness to interface variations and enhanced generalization to unseen layouts, establishing a new paradigm for spatial reasoning in GUI interaction tasks.
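A minimal sketch of the Gaussian point reward with element-size-adaptive variance, as the abstract describes it. The scale factor `k` and the separable per-axis form are assumptions rather than the paper's exact formulation, and the coverage-reward term is omitted.

```python
import math

def gaussian_point_reward(pred_x: float, pred_y: float,
                          box: tuple, k: float = 0.5) -> float:
    """Dense reward in (0, 1] that decays exponentially with distance
    from the element centroid. Sigma scales with element width/height
    (adaptive variance); `k` is an illustrative scale factor."""
    x0, y0, x1, y1 = box
    cx, cy = (x0 + x1) / 2, (y0 + y1) / 2
    # Larger elements tolerate larger click offsets.
    sx = max(k * (x1 - x0), 1e-6)
    sy = max(k * (y1 - y0), 1e-6)
    return math.exp(-0.5 * (((pred_x - cx) / sx) ** 2
                            + ((pred_y - cy) / sy) ** 2))
```

Unlike a binary hit/miss reward, a click slightly off-center still earns a graded signal (e.g., `gaussian_point_reward(105, 52, (80, 40, 120, 60))` is near 1, while a far-off click decays toward 0), which is what supplies the dense gradients the abstract highlights.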
Authors: Zhixiong Zhang, Shuangrui Ding, Xiaoyi Dong, Songxin He, Jianfan Lin, Junsong Tang, Yuhang Zang, Yuhang Cao, Dahua Lin, Jiaqi Wang
Abstract: Video Object Segmentation (VOS) is a core task in computer vision, requiring models to track and segment target objects across video frames. Despite notable recent advances, current techniques still lag behind human capabilities in handling drastic visual variations, occlusions, and complex scene changes. This limitation arises from their reliance on appearance matching, neglecting the human-like conceptual understanding of objects that enables robust identification across temporal dynamics. Motivated by this gap, we propose Segment Concept (SeC), a concept-driven segmentation framework that shifts from conventional feature matching to the progressive construction and utilization of high-level, object-centric representations. SeC employs Large Vision-Language Models (LVLMs) to integrate visual cues across diverse frames, constructing robust conceptual priors. During inference, SeC forms a comprehensive semantic representation of the target based on processed frames, enabling robust segmentation of subsequent frames. Furthermore, SeC adaptively balances LVLM-based semantic reasoning with enhanced feature matching, dynamically adjusting computational effort based on scene complexity. To rigorously assess VOS methods in scenarios demanding high-level conceptual reasoning and robust semantic understanding, we introduce the Semantic Complex Scenarios Video Object Segmentation benchmark (SeCVOS). SeCVOS comprises 160 manually annotated multi-scenario videos designed to challenge models with substantial appearance variations and dynamic scene transformations. Notably, SeC achieves an 11.8-point improvement over SAM 2.1 on SeCVOS, establishing a new state-of-the-art in concept-aware video object segmentation.