new Human Creativity and AI

Authors: Shengyi Xie

Abstract: With the advancement of science and technology, the philosophy of creativity has undergone significant reinterpretation. This paper investigates contemporary research in the fields of psychology, cognitive neuroscience, and the philosophy of creativity, particularly in the context of the development of artificial intelligence (AI) techniques. It aims to address the central question: Can AI exhibit creativity? The paper reviews the historical perspectives on the philosophy of creativity and explores the influence of psychological advancements on the study of creativity. Furthermore, it analyzes various definitions of creativity and examines the responses of naturalism and cognitive neuroscience to the concept of creativity.

new TableReasoner: Advancing Table Reasoning Framework with Large Language Models

Authors: Sishi Xiong, Dakai Wang, Yu Zhao, Jie Zhang, Changzai Pan, Haowei He, Xiangyu Li, Wenhan Chang, Zhongjiang He, Shuangyong Song, Yongxiang Li

Abstract: The paper presents our system developed for table question answering (TQA). TQA tasks face challenges due to the characteristics of real-world tabular data, such as large size, incomplete column semantics, and entity ambiguity. To address these issues, we propose a large language model (LLM)-powered and programming-based table reasoning framework, named TableReasoner. It models a table using the schema that combines structural and semantic representations, enabling holistic understanding and efficient processing of large tables. We design a multi-step schema linking plan to derive a focused table schema that retains only query-relevant information, eliminating ambiguity and alleviating hallucinations. This focused table schema provides precise and sufficient table details for query refinement and programming. Furthermore, we integrate the reasoning workflow into an iterative thinking architecture, allowing incremental cycles of thinking, reasoning and reflection. Our system achieves first place in both subtasks of SemEval-2025 Task 8.

new A Dynamic Stackelberg Game Framework for Agentic AI Defense Against LLM Jailbreaking

Authors: Zhengye Han, Quanyan Zhu

Abstract: As large language models (LLMs) are increasingly deployed in critical applications, the challenge of jailbreaking, where adversaries manipulate the models to bypass safety mechanisms, has become a significant concern. This paper presents a dynamic Stackelberg game framework to model the interactions between attackers and defenders in the context of LLM jailbreaking. The framework treats the prompt-response dynamics as a sequential extensive-form game, where the defender, as the leader, commits to a strategy while anticipating the attacker's optimal responses. We propose a novel agentic AI solution, the "Purple Agent," which integrates adversarial exploration and defensive strategies using Rapidly-exploring Random Trees (RRT). The Purple Agent actively simulates potential attack trajectories and intervenes proactively to prevent harmful outputs. This approach offers a principled method for analyzing adversarial dynamics and provides a foundation for mitigating the risk of jailbreaking.

new Reasoning and Behavioral Equilibria in LLM-Nash Games: From Mindsets to Actions

Authors: Quanyan Zhu

Abstract: We introduce the LLM-Nash framework, a game-theoretic model where agents select reasoning prompts to guide decision-making via Large Language Models (LLMs). Unlike classical games that assume utility-maximizing agents with full rationality, this framework captures bounded rationality by modeling the reasoning process explicitly. Equilibrium is defined over the prompt space, with actions emerging as the behavioral output of LLM inference. This approach enables the study of cognitive constraints, mindset expressiveness, and epistemic learning. Through illustrative examples, we show how reasoning equilibria can diverge from classical Nash outcomes, offering a new foundation for strategic interaction in LLM-enabled systems.

new From Curiosity to Competence: How World Models Interact with the Dynamics of Exploration

Authors: Fryderyk Mantiuk, Hanqi Zhou, Charley M. Wu

Abstract: What drives an agent to explore the world while also maintaining control over the environment? From a child at play to scientists in the lab, intelligent agents must balance curiosity (the drive to seek knowledge) with competence (the drive to master and control the environment). Bridging cognitive theories of intrinsic motivation with reinforcement learning, we ask how evolving internal representations mediate the trade-off between curiosity (novelty or information gain) and competence (empowerment). We compare two model-based agents using handcrafted state abstractions (Tabular) or learning an internal world model (Dreamer). The Tabular agent shows curiosity and competence guide exploration in distinct patterns, while prioritizing both improves exploration. The Dreamer agent reveals a two-way interaction between exploration and representation learning, mirroring the developmental co-evolution of curiosity and competence. Our findings formalize adaptive exploration as a balance between pursuing the unknown and the controllable, offering insights for cognitive theories and efficient reinforcement learning.

new Grounding Methods for Neural-Symbolic AI

Authors: Rodrigo Castellano Ontiveros, Francesco Giannini, Marco Gori, Giuseppe Marra, Michelangelo Diligenti

Abstract: A large class of Neural-Symbolic (NeSy) methods employs a machine learner to process the input entities, while relying on a reasoner based on First-Order Logic to represent and process more complex relationships among the entities. A fundamental role for these methods is played by the process of logic grounding, which determines the relevant substitutions for the logic rules using a (sub)set of entities. Some NeSy methods use an exhaustive derivation of all possible substitutions, preserving the full expressive power of the logic knowledge. This leads to a combinatorial explosion in the number of ground formulas to consider and, therefore, strongly limits their scalability. Other methods rely on heuristic-based selective derivations, which are generally more computationally efficient, but lack a justification and provide no guarantees of preserving the information provided to and returned by the reasoner. Taking inspiration from multi-hop symbolic reasoning, this paper proposes a parametrized family of grounding methods generalizing classic Backward Chaining. Different selections within this family allow us to obtain commonly employed grounding methods as special cases, and to control the trade-off between expressiveness and scalability of the reasoner. The experimental results show that the selection of the grounding criterion is often as important as the NeSy method itself.

new Quantum Federated Learning for Multimodal Data: A Modality-Agnostic Approach

Authors: Atit Pokharel, Ratun Rahman, Thomas Morris, Dinh C. Nguyen

Abstract: Quantum federated learning (QFL) has been recently introduced to enable a distributed privacy-preserving quantum machine learning (QML) model training across quantum processors (clients). Despite recent research efforts, existing QFL frameworks predominantly focus on unimodal systems, limiting their applicability to real-world tasks that often naturally involve multiple modalities. To fill this significant gap, we present for the first time a novel multimodal approach specifically tailored for the QFL setting with the intermediate fusion using quantum entanglement. Furthermore, to address a major bottleneck in multimodal QFL, where the absence of certain modalities during training can degrade model performance, we introduce a Missing Modality Agnostic (MMA) mechanism that isolates untrained quantum circuits, ensuring stable training without corrupted states. Simulation results demonstrate that the proposed multimodal QFL method with MMA yields an improvement in accuracy of 6.84% in independent and identically distributed (IID) and 7.25% in non-IID data distributions compared to the state-of-the-art methods.

new Giving AI Agents Access to Cryptocurrency and Smart Contracts Creates New Vectors of AI Harm

Authors: Bill Marino, Ari Juels

Abstract: There is growing interest in giving AI agents access to cryptocurrencies as well as to the smart contracts that transact them. But doing so, this position paper argues, could lead to formidable new vectors of AI harm. To support this argument, we first examine the unique properties of cryptocurrencies and smart contracts that could lead to these new vectors of harm. Next, we describe each of these new vectors of harm in detail. Finally, we conclude with a call for more technical research aimed at preventing and mitigating these harms and, thereby making it safer to endow AI agents with cryptocurrencies and smart contracts.

new Abductive Computational Systems: Creative Abduction and Future Directions

Authors: Abhinav Sood, Kazjon Grace, Stephen Wan, Cecile Paris

Abstract: Abductive reasoning, reasoning for inferring explanations for observations, is often mentioned in scientific, design-related and artistic contexts, but its understanding varies across these domains. This paper reviews how abductive reasoning is discussed in epistemology, science and design, and then analyses how various computational systems use abductive reasoning. Our analysis shows that neither theoretical accounts nor computational implementations of abductive reasoning adequately address generating creative hypotheses. Theoretical frameworks do not provide a straightforward model for generating creative abductive hypotheses, computational systems largely implement syllogistic forms of abductive reasoning. We break down abductive computational systems into components and conclude by identifying specific directions for future research that could advance the state of creative abductive reasoning in computational systems.

new Agent Safety Alignment via Reinforcement Learning

Authors: Zeyang Sha, Hanling Tian, Zhuoer Xu, Shiwen Cui, Changhua Meng, Weiqiang Wang

Abstract: The emergence of autonomous Large Language Model (LLM) agents capable of tool usage has introduced new safety risks that go beyond traditional conversational misuse. These agents, empowered to execute external functions, are vulnerable to both user-initiated threats (e.g., adversarial prompts) and tool-initiated threats (e.g., malicious outputs from compromised tools). In this paper, we propose the first unified safety-alignment framework for tool-using agents, enabling models to handle both channels of threat via structured reasoning and sandboxed reinforcement learning. We introduce a tri-modal taxonomy, including benign, malicious, and sensitive for both user prompts and tool responses, and define a policy-driven decision model. Our framework employs a custom-designed sandbox environment that simulates real-world tool execution and allows fine-grained reward shaping. Through extensive evaluations on public and self-built benchmarks, including Agent SafetyBench, InjecAgent, and BFCL, we demonstrate that our safety-aligned agents significantly improve resistance to security threats while preserving strong utility on benign tasks. Our results show that safety and effectiveness can be jointly optimized, laying the groundwork for trustworthy deployment of autonomous LLM agents.

new M2-Reasoning: Empowering MLLMs with Unified General and Spatial Reasoning

Authors: Inclusion AI, :, Fudong Wang, Jiajia Liu, Jingdong Chen, Jun Zhou, Kaixiang Ji, Lixiang Ru, Qingpei Guo, Ruobing Zheng, Tianqi Li, Yi Yuan, Yifan Mao, Yuting Xiao, Ziping Ma

Abstract: Recent advancements in Multimodal Large Language Models (MLLMs), particularly through Reinforcement Learning with Verifiable Rewards (RLVR), have significantly enhanced their reasoning abilities. However, a critical gap persists: these models struggle with dynamic spatial interactions, a capability essential for real-world applications. To bridge this gap, we introduce M2-Reasoning-7B, a model designed to excel in both general and spatial reasoning. Our approach integrates two key innovations: (1) a novel data pipeline that generates 294.2K high-quality data samples (168K for cold-start fine-tuning and 126.2K for RLVR), which feature logically coherent reasoning trajectories and have undergone comprehensive assessment; and (2) a dynamic multi-task training strategy with step-wise optimization to mitigate conflicts between data, and task-specific rewards for delivering tailored incentive signals. This combination of curated data and advanced training allows M2-Reasoning-7B to set a new state-of-the-art (SOTA) across 8 benchmarks, showcasing superior performance in both general and spatial reasoning domains.

new Multi-Agent LLMs as Ethics Advocates in AI-Based Systems

Authors: Asma Yamani, Malak Baslyman, Moataz Ahmed

Abstract: Incorporating ethics into the requirement elicitation process is essential for creating ethically aligned systems. Although eliciting manual ethics requirements is effective, it requires diverse input from multiple stakeholders, which can be challenging due to time and resource constraints. Moreover, it is often given a low priority in the requirements elicitation process. This study proposes a framework for generating ethics requirements drafts by introducing an ethics advocate agent in a multi-agent LLM setting. This agent critiques and provides input on ethical issues based on the system description. The proposed framework is evaluated through two case studies from different contexts, demonstrating that it captures the majority of ethics requirements identified by researchers during 30-minute interviews and introduces several additional relevant requirements. However, it also highlights reliability issues in generating ethics requirements, emphasizing the need for human feedback in this sensitive domain. We believe this work can facilitate the broader adoption of ethics in the requirements engineering process, ultimately leading to more ethically aligned products.

new Why this and not that? A Logic-based Framework for Contrastive Explanations

Authors: Tobias Geibinger, Reijo Jaakkola, Antti Kuusisto, Xinghan Liu, Miikka Vilander

Abstract: We define several canonical problems related to contrastive explanations, each answering a question of the form ''Why P but not Q?''. The problems compute causes for both P and Q, explicitly comparing their differences. We investigate the basic properties of our definitions in the setting of propositional logic. We show, inter alia, that our framework captures a cardinality-minimal version of existing contrastive explanations in the literature. Furthermore, we provide an extensive analysis of the computational complexities of the problems. We also implement the problems for CNF-formulas using answer set programming and present several examples demonstrating how they work in practice.

new From Language to Logic: A Bi-Level Framework for Structured Reasoning

Authors: Keying Yang, Hao Wang, Kai Yang

Abstract: Structured reasoning over natural language inputs remains a core challenge in artificial intelligence, as it requires bridging the gap between unstructured linguistic expressions and formal logical representations. In this paper, we propose a novel \textbf{bi-level framework} that maps language to logic through a two-stage process: high-level task abstraction and low-level logic generation. At the upper level, a large language model (LLM) parses natural language queries into intermediate structured representations specifying the problem type, objectives, decision variables, and symbolic constraints. At the lower level, the LLM uses these representations to generate symbolic workflows or executable reasoning programs for accurate and interpretable decision making. The framework supports modular reasoning, enforces explicit constraints, and generalizes across domains such as mathematical problem solving, question answering, and logical inference. We further optimize the framework with an end-to-end {bi-level} optimization approach that jointly refines both the high-level abstraction and low-level logic generation stages. Experiments on multiple realistic reasoning benchmarks demonstrate that our approach significantly outperforms existing baselines in accuracy, with accuracy gains reaching as high as 40\%. Moreover, the bi-level design enhances transparency and error traceability, offering a promising step toward trustworthy and systematic reasoning with LLMs.

new A Multi-granularity Concept Sparse Activation and Hierarchical Knowledge Graph Fusion Framework for Rare Disease Diagnosis

Authors: Mingda Zhang, Na Zhao, Jianglong Qin, Guoyu Ye, Ruixiang Tang

Abstract: Despite advances from medical large language models in healthcare, rare-disease diagnosis remains hampered by insufficient knowledge-representation depth, limited concept understanding, and constrained clinical reasoning. We propose a framework that couples multi-granularity sparse activation of medical concepts with a hierarchical knowledge graph. Four complementary matching algorithms, diversity control, and a five-level fallback strategy enable precise concept activation, while a three-layer knowledge graph (taxonomy, clinical features, instances) provides structured, up-to-date context. Experiments on the BioASQ rare-disease QA set show BLEU gains of 0.09, ROUGE gains of 0.05, and accuracy gains of 0.12, with peak accuracy of 0.89 approaching the 0.90 clinical threshold. Expert evaluation confirms improvements in information quality, reasoning, and professional expression, suggesting our approach shortens the "diagnostic odyssey" for rare-disease patients.

new Large Multi-modal Model Cartographic Map Comprehension for Textual Locality Georeferencing

Authors: Kalana Wijegunarathna, Kristin Stock, Christopher B. Jones

Abstract: Millions of biological sample records collected in the last few centuries archived in natural history collections are un-georeferenced. Georeferencing complex locality descriptions associated with these collection samples is a highly labour-intensive task collection agencies struggle with. None of the existing automated methods exploit maps that are an essential tool for georeferencing complex relations. We present preliminary experiments and results of a novel method that exploits multi-modal capabilities of recent Large Multi-Modal Models (LMM). This method enables the model to visually contextualize spatial relations it reads in the locality description. We use a grid-based approach to adapt these auto-regressive models for this task in a zero-shot setting. Our experiments conducted on a small manually annotated dataset show impressive results for our approach ($\sim$1 km Average distance error) compared to uni-modal georeferencing with Large Language Models and existing georeferencing tools. The paper also discusses the findings of the experiments in light of an LMM's ability to comprehend fine-grained maps. Motivated by these results, a practical framework is proposed to integrate this method into a georeferencing workflow.

new Unlocking Speech Instruction Data Potential with Query Rewriting

Authors: Yonghua Hei, Yibo Yan, Shuliang Liu, Huiyu Zhou, Linfeng Zhang, Xuming Hu

Abstract: End-to-end Large Speech Language Models~(\textbf{LSLMs}) demonstrate strong potential in response latency and speech comprehension capabilities, showcasing general intelligence across speech understanding tasks. However, the ability to follow speech instructions has not been fully realized due to the lack of datasets and heavily biased training tasks. Leveraging the rich ASR datasets, previous approaches have used Large Language Models~(\textbf{LLMs}) to continue the linguistic information of speech to construct speech instruction datasets. Yet, due to the gap between LLM-generated results and real human responses, the continuation methods further amplify these shortcomings. Given the high costs of collecting and annotating speech instruction datasets by humans, using speech synthesis to construct large-scale speech instruction datasets has become a balanced and robust alternative. Although modern Text-To-Speech~(\textbf{TTS}) models have achieved near-human-level synthesis quality, it is challenging to appropriately convert out-of-distribution text instruction to speech due to the limitations of the training data distribution in TTS models. To address this issue, we propose a query rewriting framework with multi-LLM knowledge fusion, employing multiple agents to annotate and validate the synthesized speech, making it possible to construct high-quality speech instruction datasets without relying on human annotation. Experiments show that this method can transform text instructions into distributions more suitable for TTS models for speech synthesis through zero-shot rewriting, increasing data usability from 72\% to 93\%. It also demonstrates unique advantages in rewriting tasks that require complex knowledge and context-related abilities.

new Agentic Large Language Models for Conceptual Systems Engineering and Design

Authors: Soheyl Massoudi, Mark Fuge

Abstract: Early-stage engineering design involves complex, iterative reasoning, yet existing large language model (LLM) workflows struggle to maintain task continuity and generate executable models. We evaluate whether a structured multi-agent system (MAS) can more effectively manage requirements extraction, functional decomposition, and simulator code generation than a simpler two-agent system (2AS). The target application is a solar-powered water filtration system as described in a cahier des charges. We introduce the Design-State Graph (DSG), a JSON-serializable representation that bundles requirements, physical embodiments, and Python-based physics models into graph nodes. A nine-role MAS iteratively builds and refines the DSG, while the 2AS collapses the process to a Generator-Reflector loop. Both systems run a total of 60 experiments (2 LLMs - Llama 3.3 70B vs reasoning-distilled DeepSeek R1 70B x 2 agent configurations x 3 temperatures x 5 seeds). We report a JSON validity, requirement coverage, embodiment presence, code compatibility, workflow completion, runtime, and graph size. Across all runs, both MAS and 2AS maintained perfect JSON integrity and embodiment tagging. Requirement coverage remained minimal (less than 20\%). Code compatibility peaked at 100\% under specific 2AS settings but averaged below 50\% for MAS. Only the reasoning-distilled model reliably flagged workflow completion. Powered by DeepSeek R1 70B, the MAS generated more granular DSGs (average 5-6 nodes) whereas 2AS mode-collapsed. Structured multi-agent orchestration enhanced design detail. Reasoning-distilled LLM improved completion rates, yet low requirements and fidelity gaps in coding persisted.

new Leanabell-Prover-V2: Verifier-integrated Reasoning for Formal Theorem Proving via Reinforcement Learning

Authors: Xingguang Ji, Yahui Liu, Qi Wang, Jingyuan Zhang, Yang Yue, Rui Shi, Chenxi Sun, Fuzheng Zhang, Guorui Zhou, Kun Gai

Abstract: We introduce our Leanabell-Prover-V2, a 7B large language models (LLMs) that can produce formal theorem proofs in Lean 4, with verifier-integrated Long Chain-of-Thoughts (CoT). Following our previous work Leanabell-Prover-V1, we continual to choose to posttrain existing strong prover models for further performance improvement. In our V2 version, we mainly upgrade the Reinforcement Learning (RL) with feedback provided by the Lean 4 verifier. Crucially, verifier feedback, such as indicating success or detailing specific errors, allows the LLM to become ``self-aware'' of the correctness of its own reasoning process and learn to reflexively correct errors. Leanabell-Prover-V2 directly optimizes LLM reasoning trajectories with multi-turn verifier interactions, together with feedback token masking for stable RL training and a simple reward strategy. Experiments show that Leanabell-Prover-V2 improves performance by 3.2% (pass@128) with Kimina-Prover-Preview-Distill-7B and 2.0% (pass@128) with DeepSeek-Prover-V2-7B on the MiniF2F test set. The source codes, curated data and models are available at: https://github.com/Leanabell-LM/Leanabell-Prover-V2.

URLs: https://github.com/Leanabell-LM/Leanabell-Prover-V2.

new Introspection of Thought Helps AI Agents

Authors: Haoran Sun, Shaoning Zeng

Abstract: AI Agents rely on Large Language Models (LLMs) and Multimodal-LLMs (MLLMs) to perform interpretation and inference in text and image tasks without post-training, where LLMs and MLLMs play the most critical role and determine the initial ability and limitations of AI Agents. Usually, AI Agents utilize sophisticated prompt engineering and external reasoning framework to obtain a promising interaction with LLMs, e.g., Chain-of-Thought, Iteration of Thought and Image-of-Thought. However, they are still constrained by the inherent limitations of LLM in understanding natural language, and the iterative reasoning process will generate a large amount of inference cost. To this end, we propose a novel AI Agent Reasoning Framework with Introspection of Thought (INoT) by designing a new LLM-Read code in prompt. It enables LLM to execute programmatic dialogue reasoning processes following the code in prompt. Therefore, self-denial and reflection occur within LLM instead of outside LLM, which can reduce token cost effectively. Through our experiments on six benchmarks for three different tasks, the effectiveness of INoT is verified, with an average improvement of 7.95\% in performance, exceeding the baselines. Furthermore, the token cost of INoT is lower on average than the best performing method at baseline by 58.3\%. In addition, we demonstrate the versatility of INoT in image interpretation and inference through verification experiments.

new elsciRL: Integrating Language Solutions into Reinforcement Learning Problem Settings

Authors: Philip Osborne, Danilo S. Carvalho, Andr\'e Freitas

Abstract: We present elsciRL, an open-source Python library to facilitate the application of language solutions on reinforcement learning problems. We demonstrate the potential of our software by extending the Language Adapter with Self-Completing Instruction framework defined in (Osborne, 2024) with the use of LLMs. Our approach can be re-applied to new applications with minimal setup requirements. We provide a novel GUI that allows a user to provide text input for an LLM to generate instructions which it can then self-complete. Empirical results indicate that these instructions \textit{can} improve a reinforcement learning agent's performance. Therefore, we present this work to accelerate the evaluation of language solutions on reward based environments to enable new opportunities for scientific discovery.

new System-of-systems Modeling and Optimization: An Integrated Framework for Intermodal Mobility

Authors: Paul Saves, Jasper Bussemaker, R\'emi Lafage, Thierry Lefebvre, Nathalie Bartoli, Youssef Diouane, Joseph Morlier

Abstract: For developing innovative systems architectures, modeling and optimization techniques have been central to frame the architecting process and define the optimization and modeling problems. In this context, for system-of-systems the use of efficient dedicated approaches (often physics-based simulations) is highly recommended to reduce the computational complexity of the targeted applications. However, exploring novel architectures using such dedicated approaches might pose challenges for optimization algorithms, including increased evaluation costs and potential failures. To address these challenges, surrogate-based optimization algorithms, such as Bayesian optimization utilizing Gaussian process models have emerged.

cross Dual-Attention U-Net++ with Class-Specific Ensembles and Bayesian Hyperparameter Optimization for Precise Wound and Scale Marker Segmentation

Authors: Daniel Cie\'slak, Miriam Reca, Olena Onyshchenko, Jacek Rumi\'nski

Abstract: Accurate segmentation of wounds and scale markers in clinical images remainsa significant challenge, crucial for effective wound management and automatedassessment. In this study, we propose a novel dual-attention U-Net++ archi-tecture, integrating channel-wise (SCSE) and spatial attention mechanisms toaddress severe class imbalance and variability in medical images effectively.Initially, extensive benchmarking across diverse architectures and encoders via 5-fold cross-validation identified EfficientNet-B7 as the optimal encoder backbone.Subsequently, we independently trained two class-specific models with tailoredpreprocessing, extensive data augmentation, and Bayesian hyperparameter tun-ing (WandB sweeps). The final model ensemble utilized Test Time Augmentationto further enhance prediction reliability. Our approach was evaluated on a bench-mark dataset from the NBC 2025 & PCBBE 2025 competition. Segmentationperformance was quantified using a weighted F1-score (75% wounds, 25% scalemarkers), calculated externally by competition organizers on undisclosed hard-ware. The proposed approach achieved an F1-score of 0.8640, underscoring itseffectiveness for complex medical segmentation tasks.

cross Human vs. LLM-Based Thematic Analysis for Digital Mental Health Research: Proof-of-Concept Comparative Study

Authors: Karisa Parkington, Bazen G. Teferra, Marianne Rouleau-Tang, Argyrios Perivolaris, Alice Rueda, Adam Dubrowski, Bill Kapralos, Reza Samavi, Andrew Greenshaw, Yanbo Zhang, Bo Cao, Yuqi Wu, Sirisha Rambhatla, Sridhar Krishnan, Venkat Bhat

Abstract: Thematic analysis provides valuable insights into participants' experiences through coding and theme development, but its resource-intensive nature limits its use in large healthcare studies. Large language models (LLMs) can analyze text at scale and identify key content automatically, potentially addressing these challenges. However, their application in mental health interviews needs comparison with traditional human analysis. This study evaluates out-of-the-box and knowledge-base LLM-based thematic analysis against traditional methods using transcripts from a stress-reduction trial with healthcare workers. OpenAI's GPT-4o model was used along with the Role, Instructions, Steps, End-Goal, Narrowing (RISEN) prompt engineering framework and compared to human analysis in Dedoose. Each approach developed codes, noted saturation points, applied codes to excerpts for a subset of participants (n = 20), and synthesized data into themes. Outputs and performance metrics were compared directly. LLMs using the RISEN framework developed deductive parent codes similar to human codes, but humans excelled in inductive child code development and theme synthesis. Knowledge-based LLMs reached coding saturation with fewer transcripts (10-15) than the out-of-the-box model (15-20) and humans (90-99). The out-of-the-box LLM identified a comparable number of excerpts to human researchers, showing strong inter-rater reliability (K = 0.84), though the knowledge-based LLM produced fewer excerpts. Human excerpts were longer and involved multiple codes per excerpt, while LLMs typically applied one code. Overall, LLM-based thematic analysis proved more cost-effective but lacked the depth of human analysis. LLMs can transform qualitative analysis in mental healthcare and clinical research when combined with human oversight to balance participant perspectives and research resources.

cross Unraveling the Potential of Diffusion Models in Small Molecule Generation

Authors: Peining Zhang, Daniel Baker, Minghu Song, Jinbo Bi

Abstract: Generative AI presents chemists with novel ideas for drug design and facilitates the exploration of vast chemical spaces. Diffusion models (DMs), an emerging tool, have recently attracted great attention in drug R\&D. This paper comprehensively reviews the latest advancements and applications of DMs in molecular generation. It begins by introducing the theoretical principles of DMs. Subsequently, it categorizes various DM-based molecular generation methods according to their mathematical and chemical applications. The review further examines the performance of these models on benchmark datasets, with a particular focus on comparing the generation performance of existing 3D methods. Finally, it concludes by emphasizing current challenges and suggesting future research directions to fully exploit the potential of DMs in drug discovery.

cross Energy Management for Renewable-Colocated Artificial Intelligence Data Centers

Authors: Siying Li, Lang Tong, Timothy D. Mount

Abstract: We develop an energy management system (EMS) for artificial intelligence (AI) data centers with colocated renewable generation. Under a profit-maximizing framework, the EMS of renewable-colocated data center (RCDC) co-optimizes AI workload scheduling, on-site renewable utilization, and electricity market participation. Within both wholesale and retail market participation models, the economic benefit of the RCDC operation is maximized. Empirical evaluations using real-world traces of electricity prices, data center power consumption, and renewable generation demonstrate significant profit gains from renewable and AI data center colocations.

cross RepeaTTS: Towards Feature Discovery through Repeated Fine-Tuning

Authors: Atli Sigurgeirsson, Simon King

Abstract: A Prompt-based Text-To-Speech model allows a user to control different aspects of speech, such as speaking rate and perceived gender, through natural language instruction. Although user-friendly, such approaches are on one hand constrained: control is limited to acoustic features exposed to the model during training, and too flexible on the other: the same inputs yields uncontrollable variation that are reflected in the corpus statistics. We investigate a novel fine-tuning regime to address both of these issues at the same time by exploiting the uncontrollable variance of the model. Through principal component analysis of thousands of synthesised samples, we determine latent features that account for the highest proportion of the output variance and incorporate them as new labels for secondary fine-tuning. We evaluate the proposed methods on two models trained on an expressive Icelandic speech corpus, one with emotional disclosure and one without. In the case of the model without emotional disclosure, the method yields both continuous and discrete features that improve overall controllability of the model.

cross MedicalBERT: enhancing biomedical natural language processing using pretrained BERT-based model

Authors: K. Sahit Reddy, N. Ragavenderan, Vasanth K., Ganesh N. Naik, Vishalakshi Prabhu, Nagaraja G. S

Abstract: Recent advances in natural language processing (NLP) have been driven bypretrained language models like BERT, RoBERTa, T5, and GPT. Thesemodels excel at understanding complex texts, but biomedical literature, withits domain-specific terminology, poses challenges that models likeWord2Vec and bidirectional long short-term memory (Bi-LSTM) can't fullyaddress. GPT and T5, despite capturing context, fall short in tasks needingbidirectional understanding, unlike BERT. Addressing this, we proposedMedicalBERT, a pretrained BERT model trained on a large biomedicaldataset and equipped with domain-specific vocabulary that enhances thecomprehension of biomedical terminology. MedicalBERT model is furtheroptimized and fine-tuned to address diverse tasks, including named entityrecognition, relation extraction, question answering, sentence similarity, anddocument classification. Performance metrics such as the F1-score,accuracy, and Pearson correlation are employed to showcase the efficiencyof our model in comparison to other BERT-based models such as BioBERT,SciBERT, and ClinicalBERT. MedicalBERT outperforms these models onmost of the benchmarks, and surpasses the general-purpose BERT model by5.67% on average across all the tasks evaluated respectively. This work alsounderscores the potential of leveraging pretrained BERT models for medicalNLP tasks, demonstrating the effectiveness of transfer learning techniques incapturing domain-specific information. (PDF) MedicalBERT: enhancing biomedical natural language processing using pretrained BERT-based model. Available from: https://www.researchgate.net/publication/392489050_MedicalBERT_enhancing_biomedical_natural_language_processing_using_pretrained_BERT-based_model [accessed Jul 06 2025].

URLs: https://www.researchgate.net/publication/392489050_MedicalBERT_enhancing_biomedical_natural_language_processing_using_pretrained_BERT-based_model

cross Mass-Scale Analysis of In-the-Wild Conversations Reveals Complexity Bounds on LLM Jailbreaking

Authors: Aldan Creo, Raul Castro Fernandez, Manuel Cebrian

Abstract: As large language models (LLMs) become increasingly deployed, understanding the complexity and evolution of jailbreaking strategies is critical for AI safety. We present a mass-scale empirical analysis of jailbreak complexity across over 2 million real-world conversations from diverse platforms, including dedicated jailbreaking communities and general-purpose chatbots. Using a range of complexity metrics spanning probabilistic measures, lexical diversity, compression ratios, and cognitive load indicators, we find that jailbreak attempts do not exhibit significantly higher complexity than normal conversations. This pattern holds consistently across specialized jailbreaking communities and general user populations, suggesting practical bounds on attack sophistication. Temporal analysis reveals that while user attack toxicity and complexity remains stable over time, assistant response toxicity has decreased, indicating improving safety mechanisms. The absence of power-law scaling in complexity distributions further points to natural limits on jailbreak development. Our findings challenge the prevailing narrative of an escalating arms race between attackers and defenders, instead suggesting that LLM safety evolution is bounded by human ingenuity constraints while defensive measures continue advancing. Our results highlight critical information hazards in academic jailbreak disclosure, as sophisticated attacks exceeding current complexity baselines could disrupt the observed equilibrium and enable widespread harm before defensive adaptation.

cross Mechanistic Indicators of Understanding in Large Language Models

Authors: Pierre Beckmann, Matthieu Queloz

Abstract: Recent findings in mechanistic interpretability (MI), the field probing the inner workings of Large Language Models (LLMs), challenge the view that these models rely solely on superficial statistics. Here, we offer an accessible synthesis of these findings that doubles as an introduction to MI, all while integrating these findings within a novel theoretical framework for thinking about machine understanding. We argue that LLMs develop internal structures that are functionally analogous to the kind of understanding that consists in seeing connections. To sharpen this idea, we propose a three-tiered conception of machine understanding. First, conceptual understanding emerges when a model forms "features" as directions in latent space, thereby learning the connections between diverse manifestations of something. Second, state-of-the-world understanding emerges when a model learns contingent factual connections between features and dynamically tracks changes in the world. Third, principled understanding emerges when a model ceases to rely on a collection of memorized facts and discovers a "circuit" that connects these facts. However, we conclude by exploring the "parallel mechanisms" phenomenon, arguing that while LLMs exhibit forms of understanding, their cognitive architecture remains different from ours, and the debate should shift from whether LLMs understand to how their strange minds work.

cross Circumventing Safety Alignment in Large Language Models Through Embedding Space Toxicity Attenuation

Authors: Zhibo Zhang, Yuxi Li, Kailong Wang, Shuai Yuan, Ling Shi, Haoyu Wang

Abstract: Large Language Models (LLMs) have achieved remarkable success across domains such as healthcare, education, and cybersecurity. However, this openness also introduces significant security risks, particularly through embedding space poisoning, which is a subtle attack vector where adversaries manipulate the internal semantic representations of input data to bypass safety alignment mechanisms. While previous research has investigated universal perturbation methods, the dynamics of LLM safety alignment at the embedding level remain insufficiently understood. Consequently, more targeted and accurate adversarial perturbation techniques, which pose significant threats, have not been adequately studied. In this work, we propose ETTA (Embedding Transformation Toxicity Attenuation), a novel framework that identifies and attenuates toxicity-sensitive dimensions in embedding space via linear transformations. ETTA bypasses model refusal behaviors while preserving linguistic coherence, without requiring model fine-tuning or access to training data. Evaluated on five representative open-source LLMs using the AdvBench benchmark, ETTA achieves a high average attack success rate of 88.61%, outperforming the best baseline by 11.34%, and generalizes to safety-enhanced models (e.g., 77.39% ASR on instruction-tuned defenses). These results highlight a critical vulnerability in current alignment strategies and underscore the need for embedding-aware defenses.

cross Unveiling Effective In-Context Configurations for Image Captioning: An External & Internal Analysis

Authors: Li Li, Yongliang Wu, Jingze Zhu, Jiawei Peng, Jianfei Cai, Xu Yang

Abstract: The evolution of large models has witnessed the emergence of In-Context Learning (ICL) capabilities. In Natural Language Processing (NLP), numerous studies have demonstrated the effectiveness of ICL. Inspired by the success of Large Language Models (LLMs), researchers have developed Large Multimodal Models (LMMs) with ICL capabilities. However, explorations of demonstration configuration for multimodal ICL remain preliminary. Additionally, the controllability of In-Context Examples (ICEs) provides an efficient and cost-effective means to observe and analyze the inference characteristics of LMMs under varying inputs. This paper conducts a comprehensive external and internal investigation of multimodal in-context learning on the image captioning task. Externally, we explore demonstration configuration strategies through three dimensions: shot number, image retrieval, and caption assignment. We employ multiple metrics to systematically and thoroughly evaluate and summarize key findings. Internally, we analyze typical LMM attention characteristics and develop attention-based metrics to quantify model behaviors. We also conduct auxiliary experiments to explore the feasibility of attention-driven model acceleration and compression. We further compare performance variations between LMMs with identical model design and pretraining strategies and explain the differences from the angles of pre-training data features. Our study reveals both how ICEs configuration strategies impact model performance through external experiments and characteristic typical patterns through internal inspection, providing dual perspectives for understanding multimodal ICL in LMMs. Our method of combining external and internal analysis to investigate large models, along with our newly proposed metrics, can be applied to broader research areas.

cross SSSUMO: Real-Time Semi-Supervised Submovement Decomposition

Authors: Evgenii Rudakov, Jonathan Shock, Otto Lappi, Benjamin Ultan Cowley

Abstract: This paper introduces a SSSUMO, semi-supervised deep learning approach for submovement decomposition that achieves state-of-the-art accuracy and speed. While submovement analysis offers valuable insights into motor control, existing methods struggle with reconstruction accuracy, computational cost, and validation, due to the difficulty of obtaining hand-labeled data. We address these challenges using a semi-supervised learning framework. This framework learns from synthetic data, initially generated from minimum-jerk principles and then iteratively refined through adaptation to unlabeled human movement data. Our fully convolutional architecture with differentiable reconstruction significantly surpasses existing methods on both synthetic and diverse human motion datasets, demonstrating robustness even in high-noise conditions. Crucially, the model operates in real-time (less than a millisecond per input second), a substantial improvement over optimization-based techniques. This enhanced performance facilitates new applications in human-computer interaction, rehabilitation medicine, and motor control studies. We demonstrate the model's effectiveness across diverse human-performed tasks such as steering, rotation, pointing, object moving, handwriting, and mouse-controlled gaming, showing notable improvements particularly on challenging datasets where traditional methods largely fail. Training and benchmarking source code, along with pre-trained model weights, are made publicly available at https://github.com/dolphin-in-a-coma/sssumo.

URLs: https://github.com/dolphin-in-a-coma/sssumo.

cross Integrating External Tools with Large Language Models to Improve Accuracy

Authors: Nripesh Niketan, Hadj Batatia

Abstract: This paper deals with improving querying large language models (LLMs). It is well-known that without relevant contextual information, LLMs can provide poor quality responses or tend to hallucinate. Several initiatives have proposed integrating LLMs with external tools to provide them with up-to-date data to improve accuracy. In this paper, we propose a framework to integrate external tools to enhance the capabilities of LLMs in answering queries in educational settings. Precisely, we develop a framework that allows accessing external APIs to request additional relevant information. Integrated tools can also provide computational capabilities such as calculators or calendars. The proposed framework has been evaluated using datasets from the Multi-Modal Language Understanding (MMLU) collection. The data consists of questions on mathematical and scientific reasoning. Results compared to state-of-the-art language models show that the proposed approach significantly improves performance. Our Athena framework achieves 83% accuracy in mathematical reasoning and 88% in scientific reasoning, substantially outperforming all tested models including GPT-4o, LLaMA-Large, Mistral-Large, Phi-Large, and GPT-3.5, with the best baseline model (LLaMA-Large) achieving only 67% and 79% respectively. These promising results open the way to creating complex computing ecosystems around LLMs to make their use more natural to support various tasks and activities.

cross CRISP: Complex Reasoning with Interpretable Step-based Plans

Authors: Matan Vetzler, Koren Lazar, Guy Uziel, Eran Hirsch, Ateret Anaby-Tavor, Leshem Choshen

Abstract: Recent advancements in large language models (LLMs) underscore the need for stronger reasoning capabilities to solve complex problems effectively. While Chain-of-Thought (CoT) reasoning has been a step forward, it remains insufficient for many domains. A promising alternative is explicit high-level plan generation, but existing approaches largely assume that LLMs can produce effective plans through few-shot prompting alone, without additional training. In this work, we challenge this assumption and introduce CRISP (Complex Reasoning with Interpretable Step-based Plans), a multi-domain dataset of high-level plans for mathematical reasoning and code generation. The plans in CRISP are automatically generated and rigorously validated--both intrinsically, using an LLM as a judge, and extrinsically, by evaluating their impact on downstream task performance. We demonstrate that fine-tuning a small model on CRISP enables it to generate higher-quality plans than much larger models using few-shot prompting, while significantly outperforming Chain-of-Thought reasoning. Furthermore, our out-of-domain evaluation reveals that fine-tuning on one domain improves plan generation in the other, highlighting the generalizability of learned planning capabilities.

cross AblationBench: Evaluating Automated Planning of Ablations in Empirical AI Research

Authors: Talor Abramovich, Gal Chechik

Abstract: Autonomous agents built on language models (LMs) are showing increasing popularity in many fields, including scientific research. AI co-scientists aim to support or automate parts of the research process using these agents. A key component of empirical AI research is the design of ablation experiments. To this end, we introduce AblationBench, a benchmark suite for evaluating agents on ablation planning tasks in empirical AI research. It includes two tasks: AuthorAblation, which helps authors propose ablation experiments based on a method section and contains 83 instances, and ReviewerAblation, which helps reviewers find missing ablations in a full paper and contains 350 instances. For both tasks, we develop LM-based judges that serve as an automatic evaluation framework. Our experiments with frontier LMs show that these tasks remain challenging, with the best-performing LM system identifying only 29% of the original ablations on average. Lastly, we analyze the limitations of current LMs on these tasks, and find that chain-of-thought prompting outperforms the currently existing agent-based approach.

cross Towards Evaluating Robustness of Prompt Adherence in Text to Image Models

Authors: Sujith Vemishetty, Advitiya Arora, Anupama Sharma

Abstract: The advancements in the domain of LLMs in recent years have surprised many, showcasing their remarkable capabilities and diverse applications. Their potential applications in various real-world scenarios have led to significant research on their reliability and effectiveness. On the other hand, multimodal LLMs and Text-to-Image models have only recently gained prominence, especially when compared to text-only LLMs. Their reliability remains constrained due to insufficient research on assessing their performance and robustness. This paper aims to establish a comprehensive evaluation framework for Text-to-Image models, concentrating particularly on their adherence to prompts. We created a novel dataset that aimed to assess the robustness of these models in generating images that conform to the specified factors of variation in the input text prompts. Our evaluation studies present findings on three variants of Stable Diffusion models: Stable Diffusion 3 Medium, Stable Diffusion 3.5 Large, and Stable Diffusion 3.5 Large Turbo, and two variants of Janus models: Janus Pro 1B and Janus Pro 7B. We introduce a pipeline that leverages text descriptions generated by the gpt-4o model for our ground-truth images, which are then used to generate artificial images by passing these descriptions to the Text-to-Image models. We then pass these generated images again through gpt-4o using the same system prompt and compare the variation between the two descriptions. Our results reveal that these models struggle to create simple binary images with only two factors of variation: a simple geometric shape and its location. We also show, using pre-trained VAEs on our dataset, that they fail to generate images that follow our input dataset distribution.

cross ConsNoTrainLoRA: Data-driven Weight Initialization of Low-rank Adapters using Constraints

Authors: Debasmit Das, Hyoungwoo Park, Munawar Hayat, Seokeon Choi, Sungrack Yun, Fatih Porikli

Abstract: Foundation models are pre-trained on large-scale datasets and subsequently fine-tuned on small-scale datasets using parameter-efficient fine-tuning (PEFT) techniques like low-rank adapters (LoRA). In most previous works, LoRA weight matrices are randomly initialized with a fixed rank across all attachment points. In this paper, we improve convergence and final performance of LoRA fine-tuning, using our proposed data-driven weight initialization method, ConsNoTrainLoRA (CNTLoRA). We express LoRA initialization as a domain shift problem where we use multiple constraints relating the pre-training and fine-tuning activations. By reformulating these constraints, we obtain a closed-form estimate of LoRA weights that depends on pre-training weights and fine-tuning activation vectors and hence requires no training during initialization. This weight estimate is decomposed to initialize the up and down matrices with proposed flexibility of variable ranks. With the proposed initialization method, we fine-tune on downstream tasks such as image generation, image classification and image understanding. Both quantitative and qualitative results demonstrate that CNTLoRA outperforms standard and data-driven weight initialization methods. Extensive analyses and ablations further elucidate the design choices of our framework, providing an optimal recipe for faster convergence and enhanced performance.

cross Krul: Efficient State Restoration for Multi-turn Conversations with Dynamic Cross-layer KV Sharing

Authors: Junyi Wen, Junyuan Liang, Zicong Hong, Wuhui Chen, Zibin Zheng

Abstract: Efficient state restoration in multi-turn conversations with large language models (LLMs) remains a critical challenge, primarily due to the overhead of recomputing or loading full key-value (KV) caches for all historical tokens. To address this, existing approaches compress KV caches across adjacent layers with highly similar attention patterns. However, these methods often apply a fixed compression scheme across all conversations, selecting the same layer pairs for compression without considering conversation-specific attention dynamics. This static strategy overlooks variability in attention pattern similarity across different conversations, which can lead to noticeable accuracy degradation. We present Krul, a multi-turn LLM inference system that enables accurate and efficient KV cache restoration. Krul dynamically selects compression strategies based on attention similarity across layer pairs and uses a recomputation-loading pipeline to restore the KV cache. It introduces three key innovations: 1) a preemptive compression strategy selector to preserve critical context for future conversation turns and selects a customized strategy for the conversation; 2) a token-wise heterogeneous attention similarity estimator to mitigate the attention similarity computation and storage overhead during model generation; 3) a bubble-free restoration scheduler to reduce potential bubbles brought by the imbalance of recomputing and loading stream due to compressed KV caches. Empirical evaluations on real-world tasks demonstrate that Krul achieves a 1.5x-2.68x reduction in time-to-first-token (TTFT) and a 1.33x-2.35x reduction in KV cache storage compared to state-of-the-art methods without compromising generation quality.

cross An Enhanced Privacy-preserving Federated Few-shot Learning Framework for Respiratory Disease Diagnosis

Authors: Ming Wang, Zhaoyang Duan, Dong Xue, Fangzhou Liu, Zhongheng Zhang

Abstract: The labor-intensive nature of medical data annotation presents a significant challenge for respiratory disease diagnosis, resulting in a scarcity of high-quality labeled datasets in resource-constrained settings. Moreover, patient privacy concerns complicate the direct sharing of local medical data across institutions, and existing centralized data-driven approaches, which rely on amounts of available data, often compromise data privacy. This study proposes a federated few-shot learning framework with privacy-preserving mechanisms to address the issues of limited labeled data and privacy protection in diagnosing respiratory diseases. In particular, a meta-stochastic gradient descent algorithm is proposed to mitigate the overfitting problem that arises from insufficient data when employing traditional gradient descent methods for neural network training. Furthermore, to ensure data privacy against gradient leakage, differential privacy noise from a standard Gaussian distribution is integrated into the gradients during the training of private models with local data, thereby preventing the reconstruction of medical images. Given the impracticality of centralizing respiratory disease data dispersed across various medical institutions, a weighted average algorithm is employed to aggregate local diagnostic models from different clients, enhancing the adaptability of a model across diverse scenarios. Experimental results show that the proposed method yields compelling results with the implementation of differential privacy, while effectively diagnosing respiratory diseases using data from different structures, categories, and distributions.

cross Tree-Structured Parzen Estimator Can Solve Black-Box Combinatorial Optimization More Efficiently

Authors: Kenshin Abe, Yunzhuo Wang, Shuhei Watanabe

Abstract: Tree-structured Parzen estimator (TPE) is a versatile hyperparameter optimization (HPO) method supported by popular HPO tools. Since these HPO tools have been developed in line with the trend of deep learning (DL), the problem setups often used in the DL domain have been discussed for TPE such as multi-objective optimization and multi-fidelity optimization. However, the practical applications of HPO are not limited to DL, and black-box combinatorial optimization is actively utilized in some domains, e.g., chemistry and biology. As combinatorial optimization has been an untouched, yet very important, topic in TPE, we propose an efficient combinatorial optimization algorithm for TPE. In this paper, we first generalize the categorical kernel with the numerical kernel in TPE, enabling us to introduce a distance structure to the categorical kernel. Then we discuss modifications for the newly developed kernel to handle a large combinatorial search space. These modifications reduce the time complexity of the kernel calculation with respect to the size of a combinatorial search space. In the experiments using synthetic problems, we verified that our proposed method identifies better solutions with fewer evaluations than the original TPE. Our algorithm is available in Optuna, an open-source framework for HPO.

cross An Object-Based Deep Learning Approach for Building Height Estimation from Single SAR Images

Authors: Babak Memar, Luigi Russo, Silvia Liberata Ullo, Paolo Gamba

Abstract: Accurate estimation of building heights using very high resolution (VHR) synthetic aperture radar (SAR) imagery is crucial for various urban applications. This paper introduces a Deep Learning (DL)-based methodology for automated building height estimation from single VHR COSMO-SkyMed images: an object-based regression approach based on bounding box detection followed by height estimation. This model was trained and evaluated on a unique multi-continental dataset comprising eight geographically diverse cities across Europe, North and South America, and Asia, employing a cross-validation strategy to explicitly assess out-of-distribution (OOD) generalization. The results demonstrate highly promising performance, particularly on European cities where the model achieves a Mean Absolute Error (MAE) of approximately one building story (2.20 m in Munich), significantly outperforming recent state-of-the-art methods in similar OOD scenarios. Despite the increased variability observed when generalizing to cities in other continents, particularly in Asia with its distinct urban typologies and prevalence of high-rise structures, this study underscores the significant potential of DL for robust cross-city and cross-continental transfer learning in building height estimation from single VHR SAR data.

cross VideoConviction: A Multimodal Benchmark for Human Conviction and Stock Market Recommendations

Authors: Michael Galarnyk, Veer Kejriwal, Agam Shah, Yash Bhardwaj, Nicholas Meyer, Anand Krishnan, Sudheer Chava

Abstract: Social media has amplified the reach of financial influencers known as "finfluencers," who share stock recommendations on platforms like YouTube. Understanding their influence requires analyzing multimodal signals like tone, delivery style, and facial expressions, which extend beyond text-based financial analysis. We introduce VideoConviction, a multimodal dataset with 6,000+ expert annotations, produced through 457 hours of human effort, to benchmark multimodal large language models (MLLMs) and text-based large language models (LLMs) in financial discourse. Our results show that while multimodal inputs improve stock ticker extraction (e.g., extracting Apple's ticker AAPL), both MLLMs and LLMs struggle to distinguish investment actions and conviction--the strength of belief conveyed through confident delivery and detailed reasoning--often misclassifying general commentary as definitive recommendations. While high-conviction recommendations perform better than low-conviction ones, they still underperform the popular S\&P 500 index fund. An inverse strategy--betting against finfluencer recommendations--outperforms the S\&P 500 by 6.8\% in annual returns but carries greater risk (Sharpe ratio of 0.41 vs. 0.65). Our benchmark enables a diverse evaluation of multimodal tasks, comparing model performance on both full video and segmented video inputs. This enables deeper advancements in multimodal financial research. Our code, dataset, and evaluation leaderboard are available under the CC BY-NC 4.0 license.

cross Quasi-Random Physics-informed Neural Networks

Authors: Tianchi Yu, Ivan Oseledets

Abstract: Physics-informed neural networks have shown promise in solving partial differential equations (PDEs) by integrating physical constraints into neural network training, but their performance is sensitive to the sampling of points. Based on the impressive performance of quasi Monte-Carlo methods in high dimensional problems, this paper proposes Quasi-Random Physics-Informed Neural Networks (QRPINNs), which use low-discrepancy sequences for sampling instead of random points directly from the domain. Theoretically, QRPINNs have been proven to have a better convergence rate than PINNs. Empirically, experiments demonstrate that QRPINNs significantly outperform PINNs and some representative adaptive sampling methods, especially in high-dimensional PDEs. Furthermore, combining QRPINNs with adaptive sampling can further improve the performance.

cross Audio Flamingo 3: Advancing Audio Intelligence with Fully Open Large Audio Language Models

Authors: Arushi Goel, Sreyan Ghosh, Jaehyeon Kim, Sonal Kumar, Zhifeng Kong, Sang-gil Lee, Chao-Han Huck Yang, Ramani Duraiswami, Dinesh Manocha, Rafael Valle, Bryan Catanzaro

Abstract: We present Audio Flamingo 3 (AF3), a fully open state-of-the-art (SOTA) large audio-language model that advances reasoning and understanding across speech, sound, and music. AF3 introduces: (i) AF-Whisper, a unified audio encoder trained using a novel strategy for joint representation learning across all 3 modalities of speech, sound, and music; (ii) flexible, on-demand thinking, allowing the model to do chain-of-thought-type reasoning before answering; (iii) multi-turn, multi-audio chat; (iv) long audio understanding and reasoning (including speech) up to 10 minutes; and (v) voice-to-voice interaction. To enable these capabilities, we propose several large-scale training datasets curated using novel strategies, including AudioSkills-XL, LongAudio-XL, AF-Think, and AF-Chat, and train AF3 with a novel five-stage curriculum-based training strategy. Trained on only open-source audio data, AF3 achieves new SOTA results on over 20+ (long) audio understanding and reasoning benchmarks, surpassing both open-weight and closed-source models trained on much larger datasets.

cross Temporally Consistent Amodal Completion for 3D Human-Object Interaction Reconstruction

Authors: Hyungjun Doh, Dong In Lee, Seunggeun Chi, Pin-Hao Huang, Kwonjoon Lee, Sangpil Kim, Karthik Ramani

Abstract: We introduce a novel framework for reconstructing dynamic human-object interactions from monocular video that overcomes challenges associated with occlusions and temporal inconsistencies. Traditional 3D reconstruction methods typically assume static objects or full visibility of dynamic subjects, leading to degraded performance when these assumptions are violated-particularly in scenarios where mutual occlusions occur. To address this, our framework leverages amodal completion to infer the complete structure of partially obscured regions. Unlike conventional approaches that operate on individual frames, our method integrates temporal context, enforcing coherence across video sequences to incrementally refine and stabilize reconstructions. This template-free strategy adapts to varying conditions without relying on predefined models, significantly enhancing the recovery of intricate details in dynamic scenes. We validate our approach using 3D Gaussian Splatting on challenging monocular videos, demonstrating superior precision in handling occlusions and maintaining temporal stability compared to existing techniques.

cross Compactor: Calibrated Query-Agnostic KV Cache Compression with Approximate Leverage Scores

Authors: Vivek Chari, Benjamin Van Durme

Abstract: Modern Large Language Models (LLMs) are increasingly trained to support very large context windows. Unfortunately the ability to use long contexts in generation is complicated by the large memory requirement of the KV cache, which scales linearly with the context length. This memory footprint is often the dominant resource bottleneck in real-world deployments, limiting throughput and increasing serving cost. One way to address this is by compressing the KV cache, which can be done either with knowledge of the question being asked (query-aware) or without knowledge of the query (query-agnostic). We present Compactor, a parameter-free, query-agnostic KV compression strategy that uses approximate leverage scores to determine token importance. We show that Compactor can achieve the same performance as competing methods while retaining 1/2 the tokens in both synthetic and real-world context tasks, with minimal computational overhead. We further introduce a procedure for context-calibrated compression, which allows one to infer the maximum compression ratio a given context can support. Using context-calibrated compression, we show that Compactor achieves full KV performance on Longbench while reducing the KV memory burden by 63%, on average. To demonstrate the efficacy and generalizability of our approach, we apply Compactor to 27 synthetic and real-world tasks from RULER and Longbench, with models from both the Qwen 2.5 and Llama 3.1 families.

cross ALCo-FM: Adaptive Long-Context Foundation Model for Accident Prediction

Authors: Pinaki Prasad Guha Neogi, Ahmad Mohammadshirazi, Rajiv Ramnath

Abstract: Traffic accidents are rare, yet high-impact events that require long-context multimodal reasoning for accurate risk forecasting. In this paper, we introduce ALCo-FM, a unified adaptive long-context foundation model that computes a volatility pre-score to dynamically select context windows for input data and encodes and fuses these multimodal data via shallow cross attention. Following a local GAT layer and a BigBird-style sparse global transformer over H3 hexagonal grids, coupled with Monte Carlo dropout for confidence, the model yields superior, well-calibrated predictions. Trained on data from 15 US cities with a class-weighted loss to counter label imbalance, and fine-tuned with minimal data on held-out cities, ALCo-FM achieves 0.94 accuracy, 0.92 F1, and an ECE of 0.04, outperforming more than 20 state-of-the-art baselines in large-scale urban risk prediction. Code and dataset are available at: https://github.com/PinakiPrasad12/ALCo-FM

URLs: https://github.com/PinakiPrasad12/ALCo-FM

cross AmpLyze: A Deep Learning Model for Predicting the Hemolytic Concentration

Authors: Peng Qiu, Hanqi Feng, Barnabas Poczos

Abstract: Red-blood-cell lysis (HC50) is the principal safety barrier for antimicrobial-peptide (AMP) therapeutics, yet existing models only say "toxic" or "non-toxic." AmpLyze closes this gap by predicting the actual HC50 value from sequence alone and explaining the residues that drive toxicity. The model couples residue-level ProtT5/ESM2 embeddings with sequence-level descriptors in dual local and global branches, aligned by a cross-attention module and trained with log-cosh loss for robustness to assay noise. The optimal AmpLyze model reaches a PCC of 0.756 and an MSE of 0.987, outperforming classical regressors and the state-of-the-art. Ablations confirm that both branches are essential, and cross-attention adds a further 1% PCC and 3% MSE improvement. Expected-Gradients attributions reveal known toxicity hotspots and suggest safer substitutions. By turning hemolysis assessment into a quantitative, sequence-based, and interpretable prediction, AmpLyze facilitates AMP design and offers a practical tool for early-stage toxicity screening.

cross KP-A: A Unified Network Knowledge Plane for Catalyzing Agentic Network Intelligence

Authors: Yun Tang, Mengbang Zou, Zeinab Nezami, Syed Ali Raza Zaidi, Weisi Guo

Abstract: The emergence of large language models (LLMs) and agentic systems is enabling autonomous 6G networks with advanced intelligence, including self-configuration, self-optimization, and self-healing. However, the current implementation of individual intelligence tasks necessitates isolated knowledge retrieval pipelines, resulting in redundant data flows and inconsistent interpretations. Inspired by the service model unification effort in Open-RAN (to support interoperability and vendor diversity), we propose KP-A: a unified Network Knowledge Plane specifically designed for Agentic network intelligence. By decoupling network knowledge acquisition and management from intelligence logic, KP-A streamlines development and reduces maintenance complexity for intelligence engineers. By offering an intuitive and consistent knowledge interface, KP-A also enhances interoperability for the network intelligence agents. We demonstrate KP-A in two representative intelligence tasks: live network knowledge Q&A and edge AI service orchestration. All implementation artifacts have been open-sourced to support reproducibility and future standardization efforts.

cross Rethinking Spatio-Temporal Anomaly Detection: A Vision for Causality-Driven Cybersecurity

Authors: Arun Vignesh Malarkkan, Haoyue Bai, Xinyuan Wang, Anjali Kaushik, Dongjie Wang, Yanjie Fu

Abstract: As cyber-physical systems grow increasingly interconnected and spatially distributed, ensuring their resilience against evolving cyberattacks has become a critical priority. Spatio-Temporal Anomaly detection plays an important role in ensuring system security and operational integrity. However, current data-driven approaches, largely driven by black-box deep learning, face challenges in interpretability, adaptability to distribution shifts, and robustness under evolving system dynamics. In this paper, we advocate for a causal learning perspective to advance anomaly detection in spatially distributed infrastructures that grounds detection in structural cause-effect relationships. We identify and formalize three key directions: causal graph profiling, multi-view fusion, and continual causal graph learning, each offering distinct advantages in uncovering dynamic cause-effect structures across time and space. Drawing on real-world insights from systems such as water treatment infrastructures, we illustrate how causal models provide early warning signals and root cause attribution, addressing the limitations of black-box detectors. Looking ahead, we outline the future research agenda centered on multi-modality, generative AI-driven, and scalable adaptive causal frameworks. Our objective is to lay a new research trajectory toward scalable, adaptive, explainable, and spatially grounded anomaly detection systems. We hope to inspire a paradigm shift in cybersecurity research, promoting causality-driven approaches to address evolving threats in interconnected infrastructures.

cross Overview of the TREC 2021 deep learning track

Authors: Nick Craswell, Bhaskar Mitra, Emine Yilmaz, Daniel Campos, Jimmy Lin

Abstract: This is the third year of the TREC Deep Learning track. As in previous years, we leverage the MS MARCO datasets that made hundreds of thousands of human annotated training labels available for both passage and document ranking tasks. In addition, this year we refreshed both the document and the passage collections which also led to a nearly four times increase in the document collection size and nearly $16$ times increase in the size of the passage collection. Deep neural ranking models that employ large scale pretraininig continued to outperform traditional retrieval methods this year. We also found that single stage retrieval can achieve good performance on both tasks although they still do not perform at par with multistage retrieval pipelines. Finally, the increase in the collection size and the general data refresh raised some questions about completeness of NIST judgments and the quality of the training labels that were mapped to the new collections from the old ones which we discuss in this report.

cross Consciousness as a Jamming Phase

Authors: Kaichen Ouyang

Abstract: This paper develops a neural jamming phase diagram that interprets the emergence of consciousness in large language models as a critical phenomenon in high-dimensional disordered systems.By establishing analogies with jamming transitions in granular matter and other complex systems, we identify three fundamental control parameters governing the phase behavior of neural networks: temperature, volume fraction, and stress.The theory provides a unified physical explanation for empirical scaling laws in artificial intelligence, demonstrating how computational cooling, density optimization, and noise reduction collectively drive systems toward a critical jamming surface where generalized intelligence emerges. Remarkably, the same thermodynamic principles that describe conventional jamming transitions appear to underlie the emergence of consciousness in neural networks, evidenced by shared critical signatures including divergent correlation lengths and scaling exponents.Our work explains neural language models' critical scaling through jamming physics, suggesting consciousness is a jamming phase that intrinsically connects knowledge components via long-range correlations.

cross Quantum Properties Trojans (QuPTs) for Attacking Quantum Neural Networks

Authors: Sounak Bhowmik, Travis S. Humble, Himanshu Thapliyal

Abstract: Quantum neural networks (QNN) hold immense potential for the future of quantum machine learning (QML). However, QNN security and robustness remain largely unexplored. In this work, we proposed novel Trojan attacks based on the quantum computing properties in a QNN-based binary classifier. Our proposed Quantum Properties Trojans (QuPTs) are based on the unitary property of quantum gates to insert noise and Hadamard gates to enable superposition to develop Trojans and attack QNNs. We showed that the proposed QuPTs are significantly stealthier and heavily impact the quantum circuits' performance, specifically QNNs. The most impactful QuPT caused a deterioration of 23% accuracy of the compromised QNN under the experimental setup. To the best of our knowledge, this is the first work on the Trojan attack on a fully quantum neural network independent of any hybrid classical-quantum architecture.

cross Can LLMs Reliably Simulate Real Students' Abilities in Mathematics and Reading Comprehension?

Authors: KV Aditya Srivatsa, Kaushal Kumar Maurya, Ekaterina Kochmar

Abstract: Large Language Models (LLMs) are increasingly used as proxy students in the development of Intelligent Tutoring Systems (ITSs) and in piloting test questions. However, to what extent these proxy students accurately emulate the behavior and characteristics of real students remains an open question. To investigate this, we collected a dataset of 489 items from the National Assessment of Educational Progress (NAEP), covering mathematics and reading comprehension in grades 4, 8, and 12. We then apply an Item Response Theory (IRT) model to position 11 diverse and state-of-the-art LLMs on the same ability scale as real student populations. Our findings reveal that, without guidance, strong general-purpose models consistently outperform the average student at every grade, while weaker or domain-mismatched models may align incidentally. Using grade-enforcement prompts changes models' performance, but whether they align with the average grade-level student remains highly model- and prompt-specific: no evaluated model-prompt pair fits the bill across subjects and grades, underscoring the need for new training and evaluation strategies. We conclude by providing guidelines for the selection of viable proxies based on our findings.

cross InsightBuild: LLM-Powered Causal Reasoning in Smart Building Systems

Authors: Pinaki Prasad Guha Neogi, Ahmad Mohammadshirazi, Rajiv Ramnath

Abstract: Smart buildings generate vast streams of sensor and control data, but facility managers often lack clear explanations for anomalous energy usage. We propose InsightBuild, a two-stage framework that integrates causality analysis with a fine-tuned large language model (LLM) to provide human-readable, causal explanations of energy consumption patterns. First, a lightweight causal inference module applies Granger causality tests and structural causal discovery on building telemetry (e.g., temperature, HVAC settings, occupancy) drawn from Google Smart Buildings and Berkeley Office datasets. Next, an LLM, fine-tuned on aligned pairs of sensor-level causes and textual explanations, receives as input the detected causal relations and generates concise, actionable explanations. We evaluate InsightBuild on two real-world datasets (Google: 2017-2022; Berkeley: 2018-2020), using expert-annotated ground-truth causes for a held-out set of anomalies. Our results demonstrate that combining explicit causal discovery with LLM-based natural language generation yields clear, precise explanations that assist facility managers in diagnosing and mitigating energy inefficiencies.

cross Quantum-Accelerated Neural Imputation with Large Language Models (LLMs)

Authors: Hossein Jamali

Abstract: Missing data presents a critical challenge in real-world datasets, significantly degrading the performance of machine learning models. While Large Language Models (LLMs) have recently demonstrated remarkable capabilities in tabular data imputation, exemplified by frameworks like UnIMP, their reliance on classical embedding methods often limits their ability to capture complex, non-linear correlations, particularly in mixed-type data scenarios encompassing numerical, categorical, and textual features. This paper introduces Quantum-UnIMP, a novel framework that integrates shallow quantum circuits into an LLM-based imputation architecture. Our core innovation lies in replacing conventional classical input embeddings with quantum feature maps generated by an Instantaneous Quantum Polynomial (IQP) circuit. This approach enables the model to leverage quantum phenomena such as superposition and entanglement, thereby learning richer, more expressive representations of data and enhancing the recovery of intricate missingness patterns. Our experiments on benchmark mixed-type datasets demonstrate that Quantum-UnIMP reduces imputation error by up to 15.2% for numerical features (RMSE) and improves classification accuracy by 8.7% for categorical features (F1-Score) compared to state-of-the-art classical and LLM-based methods. These compelling results underscore the profound potential of quantum-enhanced representations for complex data imputation tasks, even with near-term quantum hardware.

cross CL3R: 3D Reconstruction and Contrastive Learning for Enhanced Robotic Manipulation Representations

Authors: Wenbo Cui, Chengyang Zhao, Yuhui Chen, Haoran Li, Zhizheng Zhang, Dongbin Zhao, He Wang

Abstract: Building a robust perception module is crucial for visuomotor policy learning. While recent methods incorporate pre-trained 2D foundation models into robotic perception modules to leverage their strong semantic understanding, they struggle to capture 3D spatial information and generalize across diverse camera viewpoints. These limitations hinder the policy's effectiveness, especially in fine-grained robotic manipulation scenarios. To address these challenges, we propose CL3R, a novel 3D pre-training framework designed to enhance robotic manipulation policies. Our method integrates both spatial awareness and semantic understanding by employing a point cloud Masked Autoencoder to learn rich 3D representations while leveraging pre-trained 2D foundation models through contrastive learning for efficient semantic knowledge transfer. Additionally, we propose a 3D visual representation pre-training framework for robotic tasks. By unifying coordinate systems across datasets and introducing random fusion of multi-view point clouds, we mitigate camera view ambiguity and improve generalization, enabling robust perception from novel viewpoints at test time. Extensive experiments in both simulation and the real world demonstrate the superiority of our method, highlighting its effectiveness in visuomotor policy learning for robotic manipulation.

cross A Practical Two-Stage Recipe for Mathematical LLMs: Maximizing Accuracy with SFT and Efficiency with Reinforcement Learning

Authors: Hiroshi Yoshihara, Taiki Yamaguchi, Yuichi Inoue

Abstract: Enhancing the mathematical reasoning of Large Language Models (LLMs) is a pivotal challenge in advancing AI capabilities. While Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) are the dominant training paradigms, a systematic methodology for combining them to maximize both accuracy and efficiency remains largely unexplored. This paper introduces a practical and effective training recipe that strategically integrates extended SFT with RL from online inference (GRPO). We posit that these methods play complementary, not competing, roles: a prolonged SFT phase first pushes the model's accuracy to its limits, after which a GRPO phase dramatically improves token efficiency while preserving this peak performance. Our experiments reveal that extending SFT for as many as 10 epochs is crucial for performance breakthroughs, and that the primary role of GRPO in this framework is to optimize solution length. The efficacy of our recipe is rigorously validated through top-tier performance on challenging benchmarks, including a high rank among over 2,200 teams in the strictly leak-free AI Mathematical Olympiad (AIMO). This work provides the community with a battle-tested blueprint for developing state-of-the-art mathematical reasoners that are both exceptionally accurate and practically efficient. To ensure full reproducibility and empower future research, we will open-source our entire framework, including all code, model checkpoints, and training configurations at https://github.com/analokmaus/kaggle-aimo2-fast-math-r1.

URLs: https://github.com/analokmaus/kaggle-aimo2-fast-math-r1.

cross Lightweight Safety Guardrails via Synthetic Data and RL-guided Adversarial Training

Authors: Aleksei Ilin, Gor Matevosyan, Xueying Ma, Vladimir Eremin, Suhaa Dada, Muqun Li, Riyaaz Shaik, Haluk Noyan Tokgozoglu

Abstract: We introduce a lightweight yet highly effective safety guardrail framework for language models, demonstrating that small-scale language models can achieve, and even surpass, the performance of larger counterparts in content moderation tasks. This is accomplished through high-fidelity synthetic data generation and adversarial training. The synthetic data generation process begins with human-curated seed data, which undergoes query augmentation and paraphrasing to create diverse and contextually rich examples. This augmented data is then subjected to multiple rounds of curation, ensuring high fidelity and relevance. Inspired by recent advances in the Generative Adversarial Network (GAN) architecture, our adversarial training employs reinforcement learning to guide a generator that produces challenging synthetic examples. These examples are used to fine-tune the safety classifier, enhancing its ability to detect and mitigate harmful content. Additionally, we incorporate strategies from recent research on efficient LLM training, leveraging the capabilities of smaller models to improve the performance of larger generative models. With iterative adversarial training and the generation of diverse, high-quality synthetic data, our framework enables small language models (SLMs) to serve as robust safety guardrails. This approach not only reduces computational overhead but also enhances resilience against adversarial attacks, offering a scalable and efficient solution for content moderation in AI systems.

cross Invariant-based Robust Weights Watermark for Large Language Models

Authors: Qingxiao Guo, Xinjie Zhu, Yilong Ma, Hui Jin, Yunhao Wang, Weifeng Zhang, Xiaobing Guo

Abstract: Watermarking technology has gained significant attention due to the increasing importance of intellectual property (IP) rights, particularly with the growing deployment of large language models (LLMs) on billions resource-constrained edge devices. To counter the potential threats of IP theft by malicious users, this paper introduces a robust watermarking scheme without retraining or fine-tuning for transformer models. The scheme generates a unique key for each user and derives a stable watermark value by solving linear constraints constructed from model invariants. Moreover, this technology utilizes noise mechanism to hide watermark locations in multi-user scenarios against collusion attack. This paper evaluates the approach on three popular models (Llama3, Phi3, Gemma), and the experimental results confirm the strong robustness across a range of attack methods (fine-tuning, pruning, quantization, permutation, scaling, reversible matrix and collusion attacks).

cross Improving MLLM's Document Image Machine Translation via Synchronously Self-reviewing Its OCR Proficiency

Authors: Yupu Liang, Yaping Zhang, Zhiyang Zhang, Zhiyuan Chen, Yang Zhao, Lu Xiang, Chengqing Zong, Yu Zhou

Abstract: Multimodal Large Language Models (MLLMs) have shown strong performance in document image tasks, especially Optical Character Recognition (OCR). However, they struggle with Document Image Machine Translation (DIMT), which requires handling both cross-modal and cross-lingual challenges. Previous efforts to enhance DIMT capability through Supervised Fine-Tuning (SFT) on the DIMT dataset often result in the forgetting of the model's existing monolingual abilities, such as OCR. To address these challenges, we introduce a novel fine-tuning paradigm, named Synchronously Self-Reviewing (SSR) its OCR proficiency, inspired by the concept "Bilingual Cognitive Advantage". Specifically, SSR prompts the model to generate OCR text before producing translation text, which allows the model to leverage its strong monolingual OCR ability while learning to translate text across languages. Comprehensive experiments demonstrate the proposed SSR learning helps mitigate catastrophic forgetting, improving the generalization ability of MLLMs on both OCR and DIMT tasks.

cross Generative AI in Science: Applications, Challenges, and Emerging Questions

Authors: Ryan Harries, Cornelia Lawson, Philip Shapira

Abstract: This paper examines the impact of Generative Artificial Intelligence (GenAI) on scientific practices, conducting a qualitative review of selected literature to explore its applications, benefits, and challenges. The review draws on the OpenAlex publication database, using a Boolean search approach to identify scientific literature related to GenAI (including large language models and ChatGPT). Thirty-nine highly cited papers and commentaries are reviewed and qualitatively coded. Results are categorized by GenAI applications in science, scientific writing, medical practice, and education and training. The analysis finds that while there is a rapid adoption of GenAI in science and science practice, its long-term implications remain unclear, with ongoing uncertainties about its use and governance. The study provides early insights into GenAI's growing role in science and identifies questions for future research in this evolving field.

cross Interpretability-Aware Pruning for Efficient Medical Image Analysis

Authors: Nikita Malik, Pratinav Seth, Neeraj Kumar Singh, Chintan Chitroda, Vinay Kumar Sankarapu

Abstract: Deep learning has driven significant advances in medical image analysis, yet its adoption in clinical practice remains constrained by the large size and lack of transparency in modern models. Advances in interpretability techniques such as DL-Backtrace, Layer-wise Relevance Propagation, and Integrated Gradients make it possible to assess the contribution of individual components within neural networks trained on medical imaging tasks. In this work, we introduce an interpretability-guided pruning framework that reduces model complexity while preserving both predictive performance and transparency. By selectively retaining only the most relevant parts of each layer, our method enables targeted compression that maintains clinically meaningful representations. Experiments across multiple medical image classification benchmarks demonstrate that this approach achieves high compression rates with minimal loss in accuracy, paving the way for lightweight, interpretable models suited for real-world deployment in healthcare settings.

cross Audio Inpanting using Discrete Diffusion Model

Authors: Tali Dror, Iftach Shoham, Moshe Buchris, Oren Gal, Haim Permuter, Gilad Katz, Eliya Nachmani

Abstract: Audio inpainting refers to the task of reconstructing missing segments in corrupted audio recordings. While prior approaches-including waveform and spectrogram-based diffusion models-have shown promising results for short gaps, they often degrade in quality when gaps exceed 100 milliseconds (ms). In this work, we introduce a novel inpainting method based on discrete diffusion modeling, which operates over tokenized audio representations produced by a pre-trained audio tokenizer. Our approach models the generative process directly in the discrete latent space, enabling stable and semantically coherent reconstruction of missing audio. We evaluate the method on the MusicNet dataset using both objective and perceptual metrics across gap durations up to 300 ms. We further evaluated our approach on the MTG dataset, extending the gap duration to 500 ms. Experimental results demonstrate that our method achieves competitive or superior performance compared to existing baselines, particularly for longer gaps, offering a robust solution for restoring degraded musical recordings. Audio examples of our proposed method can be found at https://iftach21.github.io/

URLs: https://iftach21.github.io/

cross CoCo-Bot: Energy-based Composable Concept Bottlenecks for Interpretable Generative Models

Authors: Sangwon Kim, In-su Jang, Pyongkun Kim, Kwang-Ju Kim

Abstract: Concept Bottleneck Models (CBMs) provide interpretable and controllable generative modeling by routing generation through explicit, human-understandable concepts. However, previous generative CBMs often rely on auxiliary visual cues at the bottleneck to compensate for information not captured by the concepts, which undermines interpretability and compositionality. We propose CoCo-Bot, a post-hoc, composable concept bottleneck generative model that eliminates the need for auxiliary cues by transmitting all information solely through explicit concepts. Guided by diffusion-based energy functions, CoCo-Bot supports robust post-hoc interventions-such as concept composition and negation-across arbitrary concepts. Experiments using StyleGAN2 pre-trained on CelebA-HQ show that CoCo-Bot improves concept-level controllability and interpretability, while maintaining competitive visual quality.

cross Single-Domain Generalization for Multimodal Cross-Cancer Prognosis via Dirac Rebalancer and Distribution Entanglement

Authors: Jia-Xuan Jiang, Jiashuai Liu, Hongtao Wu, Yifeng Wu, Zhong Wang, Qi Bi, Yefeng Zheng

Abstract: Deep learning has shown remarkable performance in integrating multimodal data for survival prediction. However, existing multimodal methods mainly focus on single cancer types and overlook the challenge of generalization across cancers. In this work, we are the first to reveal that multimodal prognosis models often generalize worse than unimodal ones in cross-cancer scenarios, despite the critical need for such robustness in clinical practice. To address this, we propose a new task: Cross-Cancer Single Domain Generalization for Multimodal Prognosis, which evaluates whether models trained on a single cancer type can generalize to unseen cancers. We identify two key challenges: degraded features from weaker modalities and ineffective multimodal integration. To tackle these, we introduce two plug-and-play modules: Sparse Dirac Information Rebalancer (SDIR) and Cancer-aware Distribution Entanglement (CADE). SDIR mitigates the dominance of strong features by applying Bernoulli-based sparsification and Dirac-inspired stabilization to enhance weaker modality signals. CADE, designed to synthesize the target domain distribution, fuses local morphological cues and global gene expression in latent space. Experiments on a four-cancer-type benchmark demonstrate superior generalization, laying the foundation for practical, robust cross-cancer multimodal prognosis. Code is available at https://github.com/HopkinsKwong/MCCSDG

URLs: https://github.com/HopkinsKwong/MCCSDG

cross Intelligent Control of Spacecraft Reaction Wheel Attitude Using Deep Reinforcement Learning

Authors: Ghaith El-Dalahmeh, Mohammad Reza Jabbarpour, Bao Quoc Vo, Ryszard Kowalczyk

Abstract: Reliable satellite attitude control is essential for the success of space missions, particularly as satellites increasingly operate autonomously in dynamic and uncertain environments. Reaction wheels (RWs) play a pivotal role in attitude control, and maintaining control resilience during RW faults is critical to preserving mission objectives and system stability. However, traditional Proportional Derivative (PD) controllers and existing deep reinforcement learning (DRL) algorithms such as TD3, PPO, and A2C often fall short in providing the real time adaptability and fault tolerance required for autonomous satellite operations. This study introduces a DRL-based control strategy designed to improve satellite resilience and adaptability under fault conditions. Specifically, the proposed method integrates Twin Delayed Deep Deterministic Policy Gradient (TD3) with Hindsight Experience Replay (HER) and Dimension Wise Clipping (DWC) referred to as TD3-HD to enhance learning in sparse reward environments and maintain satellite stability during RW failures. The proposed approach is benchmarked against PD control and leading DRL algorithms. Experimental results show that TD3-HD achieves significantly lower attitude error, improved angular velocity regulation, and enhanced stability under fault conditions. These findings underscore the proposed method potential as a powerful, fault tolerant, onboard AI solution for autonomous satellite attitude control.

cross PanMatch: Unleashing the Potential of Large Vision Models for Unified Matching Models

Authors: Yongjian Zhang, Longguang Wang, Kunhong Li, Ye Zhang, Yun Wang, Liang Lin, Yulan Guo

Abstract: This work presents PanMatch, a versatile foundation model for robust correspondence matching. Unlike previous methods that rely on task-specific architectures and domain-specific fine-tuning to support tasks like stereo matching, optical flow or feature matching, our key insight is that any two-frame correspondence matching task can be addressed within a 2D displacement estimation framework using the same model weights. Such a formulation eliminates the need for designing specialized unified architectures or task-specific ensemble models. Instead, it achieves multi-task integration by endowing displacement estimation algorithms with unprecedented generalization capabilities. To this end, we highlight the importance of a robust feature extractor applicable across multiple domains and tasks, and propose the feature transformation pipeline that leverage all-purpose features from Large Vision Models to endow matching baselines with zero-shot cross-view matching capabilities. Furthermore, we assemble a cross-domain dataset with near 1.8 million samples from stereo matching, optical flow, and feature matching domains to pretrain PanMatch. We demonstrate the versatility of PanMatch across a wide range of domains and downstream tasks using the same model weights. Our model outperforms UniMatch and Flow-Anything on cross-task evaluations, and achieves comparable performance to most state-of-the-art task-specific algorithms on task-oriented benchmarks. Additionally, PanMatch presents unprecedented zero-shot performance in abnormal scenarios, such as rainy day and satellite imagery, where most existing robust algorithms fail to yield meaningful results.

cross Towards AI-Native RAN: An Operator's Perspective of 6G Day 1 Standardization

Authors: Nan Li, Qi Sun, Lehan Wang, Xiaofei Xu, Jinri Huang, Chunhui Liu, Jing Gao, Yuhong Huang, Chih-Lin I

Abstract: Artificial Intelligence/Machine Learning (AI/ML) has become the most certain and prominent feature of 6G mobile networks. Unlike 5G, where AI/ML was not natively integrated but rather an add-on feature over existing architecture, 6G shall incorporate AI from the onset to address its complexity and support ubiquitous AI applications. Based on our extensive mobile network operation and standardization experience from 2G to 5G, this paper explores the design and standardization principles of AI-Native radio access networks (RAN) for 6G, with a particular focus on its critical Day 1 architecture, functionalities and capabilities. We investigate the framework of AI-Native RAN and present its three essential capabilities to shed some light on the standardization direction; namely, AI-driven RAN processing/optimization/automation, reliable AI lifecycle management (LCM), and AI-as-a-Service (AIaaS) provisioning. The standardization of AI-Native RAN, in particular the Day 1 features, including an AI-Native 6G RAN architecture, were proposed. For validation, a large-scale field trial with over 5000 5G-A base stations have been built and delivered significant improvements in average air interface latency, root cause identification, and network energy consumption with the proposed architecture and the supporting AI functions. This paper aims to provide a Day 1 framework for 6G AI-Native RAN standardization design, balancing technical innovation with practical deployment.

cross Deep Hashing with Semantic Hash Centers for Image Retrieval

Authors: Li Chen, Rui Liu, Yuxiang Zhou, Xudong Ma, Yong Chen, Dell Zhang

Abstract: Deep hashing is an effective approach for large-scale image retrieval. Current methods are typically classified by their supervision types: point-wise, pair-wise, and list-wise. Recent point-wise techniques (e.g., CSQ, MDS) have improved retrieval performance by pre-assigning a hash center to each class, enhancing the discriminability of hash codes across various datasets. However, these methods rely on data-independent algorithms to generate hash centers, which neglect the semantic relationships between classes and may degrade retrieval performance. This paper introduces the concept of semantic hash centers, building on the idea of traditional hash centers. We hypothesize that hash centers of semantically related classes should have closer Hamming distances, while those of unrelated classes should be more distant. To this end, we propose a three-stage framework, SHC, to generate hash codes that preserve semantic structure. First, we develop a classification network to identify semantic similarities between classes using a data-dependent similarity calculation that adapts to varying data distributions. Second, we introduce an optimization algorithm to generate semantic hash centers, preserving semantic relatedness while enforcing a minimum distance between centers to avoid excessively similar hash codes. Finally, a deep hashing network is trained using these semantic centers to convert images into binary hash codes. Experimental results on large-scale retrieval tasks across several public datasets show that SHC significantly improves retrieval performance. Specifically, SHC achieves average improvements of +7.26%, +7.62%, and +11.71% in MAP@100, MAP@1000, and MAP@ALL metrics, respectively, over state-of-the-art methods.

cross ChainEdit: Propagating Ripple Effects in LLM Knowledge Editing through Logical Rule-Guided Chains

Authors: Zilu Dong, Xiangqing Shen, Zinong Yang, Rui Xia

Abstract: Current knowledge editing methods for large language models (LLMs) struggle to maintain logical consistency when propagating ripple effects to associated facts. We propose ChainEdit, a framework that synergizes knowledge graph-derived logical rules with LLM logical reasoning capabilities to enable systematic chain updates. By automatically extracting logical patterns from structured knowledge bases and aligning them with LLMs' internal logics, ChainEdit dynamically generates and edits logically connected knowledge clusters. Experiments demonstrate an improvement of more than 30% in logical generalization over baselines while preserving editing reliability and specificity. We further address evaluation biases in existing benchmarks through knowledge-aware protocols that disentangle external dependencies. This work establishes new state-of-the-art performance on ripple effect while ensuring internal logical consistency after knowledge editing.

cross Finding Common Ground: Using Large Language Models to Detect Agreement in Multi-Agent Decision Conferences

Authors: Selina Heller, Mohamed Ibrahim, David Antony Selby, Sebastian Vollmer

Abstract: Decision conferences are structured, collaborative meetings that bring together experts from various fields to address complex issues and reach a consensus on recommendations for future actions or policies. These conferences often rely on facilitated discussions to ensure productive dialogue and collective agreement. Recently, Large Language Models (LLMs) have shown significant promise in simulating real-world scenarios, particularly through collaborative multi-agent systems that mimic group interactions. In this work, we present a novel LLM-based multi-agent system designed to simulate decision conferences, specifically focusing on detecting agreement among the participant agents. To achieve this, we evaluate six distinct LLMs on two tasks: stance detection, which identifies the position an agent takes on a given issue, and stance polarity detection, which identifies the sentiment as positive, negative, or neutral. These models are further assessed within the multi-agent system to determine their effectiveness in complex simulations. Our results indicate that LLMs can reliably detect agreement even in dynamic and nuanced debates. Incorporating an agreement-detection agent within the system can also improve the efficiency of group debates and enhance the overall quality and coherence of deliberations, making them comparable to real-world decision conferences regarding outcome and decision-making. These findings demonstrate the potential for LLM-based multi-agent systems to simulate group decision-making processes. They also highlight that such systems could be instrumental in supporting decision-making with expert elicitation workshops across various domains.

cross Vision Foundation Models as Effective Visual Tokenizers for Autoregressive Image Generation

Authors: Anlin Zheng, Xin Wen, Xuanyang Zhang, Chuofan Ma, Tiancai Wang, Gang Yu, Xiangyu Zhang, Xiaojuan Qi

Abstract: Leveraging the powerful representations of pre-trained vision foundation models -- traditionally used for visual comprehension -- we explore a novel direction: building an image tokenizer directly atop such models, a largely underexplored area. Specifically, we employ a frozen vision foundation model as the encoder of our tokenizer. To enhance its effectiveness, we introduce two key components: (1) a region-adaptive quantization framework that reduces redundancy in the pre-trained features on regular 2D grids, and (2) a semantic reconstruction objective that aligns the tokenizer's outputs with the foundation model's representations to preserve semantic fidelity. Based on these designs, our proposed image tokenizer, VFMTok, achieves substantial improvements in image reconstruction and generation quality, while also enhancing token efficiency. It further boosts autoregressive (AR) generation -- achieving a gFID of 2.07 on ImageNet benchmarks, while accelerating model convergence by three times, and enabling high-fidelity class-conditional synthesis without the need for classifier-free guidance (CFG). The code will be released publicly to benefit the community.

cross CUE-RAG: Towards Accurate and Cost-Efficient Graph-Based RAG via Multi-Partite Graph and Query-Driven Iterative Retrieval

Authors: Yaodong Su, Yixiang Fang, Yingli Zhou, Quanqing Xu, Chuanhui Yang

Abstract: Despite the remarkable progress of Large Language Models (LLMs), their performance in question answering (QA) remains limited by the lack of domain-specific and up-to-date knowledge. Retrieval-Augmented Generation (RAG) addresses this limitation by incorporating external information, often from graph-structured data. However, existing graph-based RAG methods suffer from poor graph quality due to incomplete extraction and insufficient utilization of query information during retrieval. To overcome these limitations, we propose CUE-RAG, a novel approach that introduces (1) a multi-partite graph index incorporates text Chunks, knowledge Units, and Entities to capture semantic content at multiple levels of granularity, (2) a hybrid extraction strategy that reduces LLM token usage while still producing accurate and disambiguated knowledge units, and (3) Q-Iter, a query-driven iterative retrieval strategy that enhances relevance through semantic search and constrained graph traversal. Experiments on three QA benchmarks show that CUE-RAG significantly outperforms state-of-the-art baselines, achieving up to 99.33% higher Accuracy and 113.51% higher F1 score while reducing indexing costs by 72.58%. Remarkably, CUE-RAG matches or outperforms baselines even without using an LLM for indexing. These results demonstrate the effectiveness and cost-efficiency of CUE-RAG in advancing graph-based RAG systems.

cross Review of Feed-forward 3D Reconstruction: From DUSt3R to VGGT

Authors: Wei Zhang, Yihang Wu, Songhua Li, Wenjie Ma, Xin Ma, Qiang Li, Qi Wang

Abstract: 3D reconstruction, which aims to recover the dense three-dimensional structure of a scene, is a cornerstone technology for numerous applications, including augmented/virtual reality, autonomous driving, and robotics. While traditional pipelines like Structure from Motion (SfM) and Multi-View Stereo (MVS) achieve high precision through iterative optimization, they are limited by complex workflows, high computational cost, and poor robustness in challenging scenarios like texture-less regions. Recently, deep learning has catalyzed a paradigm shift in 3D reconstruction. A new family of models, exemplified by DUSt3R, has pioneered a feed-forward approach. These models employ a unified deep network to jointly infer camera poses and dense geometry directly from an Unconstrained set of images in a single forward pass. This survey provides a systematic review of this emerging domain. We begin by dissecting the technical framework of these feed-forward models, including their Transformer-based correspondence modeling, joint pose and geometry regression mechanisms, and strategies for scaling from two-view to multi-view scenarios. To highlight the disruptive nature of this new paradigm, we contrast it with both traditional pipelines and earlier learning-based methods like MVSNet. Furthermore, we provide an overview of relevant datasets and evaluation metrics. Finally, we discuss the technology's broad application prospects and identify key future challenges and opportunities, such as model accuracy and scalability, and handling dynamic scenes.

cross Space filling positionality and the Spiroformer

Authors: M. Maurin, M. \'A. Evangelista-Alvarado, P. Su\'arez-Serrato

Abstract: Transformers excel when dealing with sequential data. Generalizing transformer models to geometric domains, such as manifolds, we encounter the problem of not having a well-defined global order. We propose a solution with attention heads following a space-filling curve. As a first experimental example, we present the Spiroformer, a transformer that follows a polar spiral on the $2$-sphere.

cross A document is worth a structured record: Principled inductive bias design for document recognition

Authors: Benjamin Meyer, Lukas Tuggener, Sascha H\"anzi, Daniel Schmid, Erdal Ayfer, Benjamin F. Grewe, Ahmed Abdulkadir, Thilo Stadelmann

Abstract: Many document types use intrinsic, convention-driven structures that serve to encode precise and structured information, such as the conventions governing engineering drawings. However, state-of-the-art approaches treat document recognition as a mere computer vision problem, neglecting these underlying document-type-specific structural properties, making them dependent on sub-optimal heuristic post-processing and rendering many less frequent or more complicated document types inaccessible to modern document recognition. We suggest a novel perspective that frames document recognition as a transcription task from a document to a record. This implies a natural grouping of documents based on the intrinsic structure inherent in their transcription, where related document types can be treated (and learned) similarly. We propose a method to design structure-specific inductive biases for the underlying machine-learned end-to-end document recognition systems, and a respective base transformer architecture that we successfully adapt to different structures. We demonstrate the effectiveness of the so-found inductive biases in extensive experiments with progressively complex record structures from monophonic sheet music, shape drawings, and simplified engineering drawings. By integrating an inductive bias for unrestricted graph structures, we train the first-ever successful end-to-end model to transcribe engineering drawings to their inherently interlinked information. Our approach is relevant to inform the design of document recognition systems for document types that are less well understood than standard OCR, OMR, etc., and serves as a guide to unify the design of future document foundation models.

cross Pre-Training LLMs on a budget: A comparison of three optimizers

Authors: Joel Schlotthauer, Christian Kroos, Chris Hinze, Viktor Hangya, Luzian Hahn, Fabian K\"uch

Abstract: Optimizers play a decisive role in reducing pre-training times for LLMs and achieving better-performing models. In this study, we compare three major variants: the de-facto standard AdamW, the simpler Lion, developed through an evolutionary search, and the second-order optimizer Sophia. For better generalization, we train with two different base architectures and use a single- and a multiple-epoch approach while keeping the number of tokens constant. Using the Maximal Update Parametrization and smaller proxy models, we tune relevant hyperparameters separately for each combination of base architecture and optimizer. We found that while the results from all three optimizers were in approximately the same range, Sophia exhibited the lowest training and validation loss, Lion was fastest in terms of training GPU hours but AdamW led to the best downstream evaluation results.

cross Enhancing Essay Cohesion Assessment: A Novel Item Response Theory Approach

Authors: Bruno Alexandre Rosa, Hil\'ario Oliveira, Luiz Rodrigues, Eduardo Araujo Oliveira, Rafael Ferreira Mello

Abstract: Essays are considered a valuable mechanism for evaluating learning outcomes in writing. Textual cohesion is an essential characteristic of a text, as it facilitates the establishment of meaning between its parts. Automatically scoring cohesion in essays presents a challenge in the field of educational artificial intelligence. The machine learning algorithms used to evaluate texts generally do not consider the individual characteristics of the instances that comprise the analysed corpus. In this meaning, item response theory can be adapted to the context of machine learning, characterising the ability, difficulty and discrimination of the models used. This work proposes and analyses the performance of a cohesion score prediction approach based on item response theory to adjust the scores generated by machine learning models. In this study, the corpus selected for the experiments consisted of the extended Essay-BR, which includes 6,563 essays in the style of the National High School Exam (ENEM), and the Brazilian Portuguese Narrative Essays, comprising 1,235 essays written by 5th to 9th grade students from public schools. We extracted 325 linguistic features and treated the problem as a machine learning regression task. The experimental results indicate that the proposed approach outperforms conventional machine learning models and ensemble methods in several evaluation metrics. This research explores a potential approach for improving the automatic evaluation of cohesion in educational essays.

cross PromotionGo at SemEval-2025 Task 11: A Feature-Centric Framework for Cross-Lingual Multi-Emotion Detection in Short Texts

Authors: Ziyi Huang, Xia Cui

Abstract: This paper presents our system for SemEval 2025 Task 11: Bridging the Gap in Text-Based Emotion Detection (Track A), which focuses on multi-label emotion detection in short texts. We propose a feature-centric framework that dynamically adapts document representations and learning algorithms to optimize language-specific performance. Our study evaluates three key components: document representation, dimensionality reduction, and model training in 28 languages, highlighting five for detailed analysis. The results show that TF-IDF remains highly effective for low-resource languages, while contextual embeddings like FastText and transformer-based document representations, such as those produced by Sentence-BERT, exhibit language-specific strengths. Principal Component Analysis (PCA) reduces training time without compromising performance, particularly benefiting FastText and neural models such as Multi-Layer Perceptrons (MLP). Computational efficiency analysis underscores the trade-off between model complexity and processing cost. Our framework provides a scalable solution for multilingual emotion detection, addressing the challenges of linguistic diversity and resource constraints.

cross MIDI-VALLE: Improving Expressive Piano Performance Synthesis Through Neural Codec Language Modelling

Authors: Jingjing Tang, Xin Wang, Zhe Zhang, Junichi Yamagishi, Geraint Wiggins, George Fazekas

Abstract: Generating expressive audio performances from music scores requires models to capture both instrument acoustics and human interpretation. Traditional music performance synthesis pipelines follow a two-stage approach, first generating expressive performance MIDI from a score, then synthesising the MIDI into audio. However, the synthesis models often struggle to generalise across diverse MIDI sources, musical styles, and recording environments. To address these challenges, we propose MIDI-VALLE, a neural codec language model adapted from the VALLE framework, which was originally designed for zero-shot personalised text-to-speech (TTS) synthesis. For performance MIDI-to-audio synthesis, we improve the architecture to condition on a reference audio performance and its corresponding MIDI. Unlike previous TTS-based systems that rely on piano rolls, MIDI-VALLE encodes both MIDI and audio as discrete tokens, facilitating a more consistent and robust modelling of piano performances. Furthermore, the model's generalisation ability is enhanced by training on an extensive and diverse piano performance dataset. Evaluation results show that MIDI-VALLE significantly outperforms a state-of-the-art baseline, achieving over 75% lower Frechet Audio Distance on the ATEPP and Maestro datasets. In the listening test, MIDI-VALLE received 202 votes compared to 58 for the baseline, demonstrating improved synthesis quality and generalisation across diverse performance MIDI inputs.

cross White-Basilisk: A Hybrid Model for Code Vulnerability Detection

Authors: Ioannis Lamprou, Alexander Shevtsov, Ioannis Arapakis, Sotiris Ioannidis

Abstract: The proliferation of software vulnerabilities presents a significant challenge to cybersecurity, necessitating more effective detection methodologies. We introduce White-Basilisk, a novel approach to vulnerability detection that demonstrates superior performance while challenging prevailing assumptions in AI model scaling. Utilizing an innovative architecture that integrates Mamba layers, linear self-attention, and a Mixture of Experts framework, White-Basilisk achieves state-of-the-art results in vulnerability detection tasks with a parameter count of only 200M. The model's capacity to process sequences of unprecedented length enables comprehensive analysis of extensive codebases in a single pass, surpassing the context limitations of current Large Language Models (LLMs). White-Basilisk exhibits robust performance on imbalanced, real-world datasets, while maintaining computational efficiency that facilitates deployment across diverse organizational scales. This research not only establishes new benchmarks in code security but also provides empirical evidence that compact, efficiently designed models can outperform larger counterparts in specialized tasks, potentially redefining optimization strategies in AI development for domain-specific applications.

cross RadiomicsRetrieval: A Customizable Framework for Medical Image Retrieval Using Radiomics Features

Authors: Inye Na, Nejung Rue, Jiwon Chung, Hyunjin Park

Abstract: Medical image retrieval is a valuable field for supporting clinical decision-making, yet current methods primarily support 2D images and require fully annotated queries, limiting clinical flexibility. To address this, we propose RadiomicsRetrieval, a 3D content-based retrieval framework bridging handcrafted radiomics descriptors with deep learning-based embeddings at the tumor level. Unlike existing 2D approaches, RadiomicsRetrieval fully exploits volumetric data to leverage richer spatial context in medical images. We employ a promptable segmentation model (e.g., SAM) to derive tumor-specific image embeddings, which are aligned with radiomics features extracted from the same tumor via contrastive learning. These representations are further enriched by anatomical positional embedding (APE). As a result, RadiomicsRetrieval enables flexible querying based on shape, location, or partial feature sets. Extensive experiments on both lung CT and brain MRI public datasets demonstrate that radiomics features significantly enhance retrieval specificity, while APE provides global anatomical context essential for location-based searches. Notably, our framework requires only minimal user prompts (e.g., a single point), minimizing segmentation overhead and supporting diverse clinical scenarios. The capability to query using either image embeddings or selected radiomics attributes highlights its adaptability, potentially benefiting diagnosis, treatment planning, and research on large-scale medical imaging repositories. Our code is available at https://github.com/nainye/RadiomicsRetrieval.

URLs: https://github.com/nainye/RadiomicsRetrieval.

cross FreeAudio: Training-Free Timing Planning for Controllable Long-Form Text-to-Audio Generation

Authors: Yuxuan Jiang, Zehua Chen, Zeqian Ju, Chang Li, Weibei Dou, Jun Zhu

Abstract: Text-to-audio (T2A) generation has achieved promising results with the recent advances in generative models. However, because of the limited quality and quantity of temporally-aligned audio-text pairs, existing T2A methods struggle to handle the complex text prompts that contain precise timing control, e.g., "owl hooted at 2.4s-5.2s". Recent works have explored data augmentation techniques or introduced timing conditions as model inputs to enable timing-conditioned 10-second T2A generation, while their synthesis quality is still limited. In this work, we propose a novel training-free timing-controlled T2A framework, FreeAudio, making the first attempt to enable timing-controlled long-form T2A generation, e.g., "owl hooted at 2.4s-5.2s and crickets chirping at 0s-24s". Specifically, we first employ an LLM to plan non-overlapping time windows and recaption each with a refined natural language description, based on the input text and timing prompts. Then we introduce: 1) Decoupling and Aggregating Attention Control for precise timing control; 2) Contextual Latent Composition for local smoothness and Reference Guidance for global consistency. Extensive experiments show that: 1) FreeAudio achieves state-of-the-art timing-conditioned T2A synthesis quality among training-free methods and is comparable to leading training-based methods; 2) FreeAudio demonstrates comparable long-form generation quality with training-based Stable Audio and paves the way for timing-controlled long-form T2A synthesis. Demo samples are available at: https://freeaudio.github.io/FreeAudio/

URLs: https://freeaudio.github.io/FreeAudio/

cross A Multi-Modal Fusion Framework for Brain Tumor Segmentation Based on 3D Spatial-Language-Vision Integration and Bidirectional Interactive Attention Mechanism

Authors: Mingda Zhang, Kaiwen Pan

Abstract: This study aims to develop a novel multi-modal fusion framework for brain tumor segmentation that integrates spatial-language-vision information through bidirectional interactive attention mechanisms to improve segmentation accuracy and boundary delineation. Methods: We propose two core components: Multi-modal Semantic Fusion Adapter (MSFA) integrating 3D MRI data with clinical text descriptions through hierarchical semantic decoupling, and Bidirectional Interactive Visual-semantic Attention (BIVA) enabling iterative information exchange between modalities. The framework was evaluated on BraTS 2020 dataset comprising 369 multi-institutional MRI scans. Results: The proposed method achieved average Dice coefficient of 0.8505 and 95% Hausdorff distance of 2.8256mm across enhancing tumor, tumor core, and whole tumor regions, outperforming state-of-the-art methods including SCAU-Net, CA-Net, and 3D U-Net. Ablation studies confirmed critical contributions of semantic and spatial modules to boundary precision. Conclusion: Multi-modal semantic fusion combined with bidirectional interactive attention significantly enhances brain tumor segmentation performance, establishing new paradigms for integrating clinical knowledge into medical image analysis.

cross To Trade or Not to Trade: An Agentic Approach to Estimating Market Risk Improves Trading Decisions

Authors: Dimitrios Emmanoulopoulos, Ollie Olby, Justin Lyon, Namid R. Stillman

Abstract: Large language models (LLMs) are increasingly deployed in agentic frameworks, in which prompts trigger complex tool-based analysis in pursuit of a goal. While these frameworks have shown promise across multiple domains including in finance, they typically lack a principled model-building step, relying instead on sentiment- or trend-based analysis. We address this gap by developing an agentic system that uses LLMs to iteratively discover stochastic differential equations for financial time series. These models generate risk metrics which inform daily trading decisions. We evaluate our system in both traditional backtests and using a market simulator, which introduces synthetic but causally plausible price paths and news events. We find that model-informed trading strategies outperform standard LLM-based agents, improving Sharpe ratios across multiple equities. Our results show that combining LLMs with agentic model discovery enhances market risk estimation and enables more profitable trading decisions.

cross Generating Proto-Personas through Prompt Engineering: A Case Study on Efficiency, Effectiveness and Empathy

Authors: Fernando Ayach, Vitor Lameir\~ao, Raul Le\~ao, Jerfferson Felizardo, Rafael Sobrinho, Vanessa Borges, Patr\'icia Matsubara, Awdren Font\~ao

Abstract: Proto-personas are commonly used during early-stage Product Discovery, such as Lean Inception, to guide product definition and stakeholder alignment. However, the manual creation of proto-personas is often time-consuming, cognitively demanding, and prone to bias. In this paper, we propose and empirically investigate a prompt engineering-based approach to generate proto-personas with the support of Generative AI (GenAI). Our goal is to evaluate the approach in terms of efficiency, effectiveness, user acceptance, and the empathy elicited by the generated personas. We conducted a case study with 19 participants embedded in a real Lean Inception, employing a qualitative and quantitative methods design. The results reveal the approach's efficiency by reducing time and effort and improving the quality and reusability of personas in later discovery phases, such as Minimum Viable Product (MVP) scoping and feature refinement. While acceptance was generally high, especially regarding perceived usefulness and ease of use, participants noted limitations related to generalization and domain specificity. Furthermore, although cognitive empathy was strongly supported, affective and behavioral empathy varied significantly across participants. These results contribute novel empirical evidence on how GenAI can be effectively integrated into software Product Discovery practices, while also identifying key challenges to be addressed in future iterations of such hybrid design processes.

cross Towards Collaborative Fairness in Federated Learning Under Imbalanced Covariate Shift

Authors: Tianrun Yu, Jiaqi Wang, Haoyu Wang, Mingquan Lin, Han Liu, Nelson S. Yee, Fenglong Ma

Abstract: Collaborative fairness is a crucial challenge in federated learning. However, existing approaches often overlook a practical yet complex form of heterogeneity: imbalanced covariate shift. We provide a theoretical analysis of this setting, which motivates the design of FedAKD (Federated Asynchronous Knowledge Distillation)- simple yet effective approach that balances accurate prediction with collaborative fairness. FedAKD consists of client and server updates. In the client update, we introduce a novel asynchronous knowledge distillation strategy based on our preliminary analysis, which reveals that while correctly predicted samples exhibit similar feature distributions across clients, incorrectly predicted samples show significant variability. This suggests that imbalanced covariate shift primarily arises from misclassified samples. Leveraging this insight, our approach first applies traditional knowledge distillation to update client models while keeping the global model fixed. Next, we select correctly predicted high-confidence samples and update the global model using these samples while keeping client models fixed. The server update simply aggregates all client models. We further provide a theoretical proof of FedAKD's convergence. Experimental results on public datasets (FashionMNIST and CIFAR10) and a real-world Electronic Health Records (EHR) dataset demonstrate that FedAKD significantly improves collaborative fairness, enhances predictive accuracy, and fosters client participation even under highly heterogeneous data distributions.

cross A comprehensive study of LLM-based argument classification: from LLAMA through GPT-4o to Deepseek-R1

Authors: Marcin Pietro\'n, Rafa{\l} Olszowski, Jakub Gomu{\l}ka, Filip Gampel, Andrzej Tomski

Abstract: Argument mining (AM) is an interdisciplinary research field that integrates insights from logic, philosophy, linguistics, rhetoric, law, psychology, and computer science. It involves the automatic identification and extraction of argumentative components, such as premises and claims, and the detection of relationships between them, such as support, attack, or neutrality. Recently, the field has advanced significantly, especially with the advent of large language models (LLMs), which have enhanced the efficiency of analyzing and extracting argument semantics compared to traditional methods and other deep learning models. There are many benchmarks for testing and verifying the quality of LLM, but there is still a lack of research and results on the operation of these models in publicly available argument classification databases. This paper presents a study of a selection of LLM's, using diverse datasets such as Args.me and UKP. The models tested include versions of GPT, Llama, and DeepSeek, along with reasoning-enhanced variants incorporating the Chain-of-Thoughts algorithm. The results indicate that ChatGPT-4o outperforms the others in the argument classification benchmarks. In case of models incorporated with reasoning capabilities, the Deepseek-R1 shows its superiority. However, despite their superiority, GPT-4o and Deepseek-R1 still make errors. The most common errors are discussed for all models. To our knowledge, the presented work is the first broader analysis of the mentioned datasets using LLM and prompt algorithms. The work also shows some weaknesses of known prompt algorithms in argument analysis, while indicating directions for their improvement. The added value of the work is the in-depth analysis of the available argument datasets and the demonstration of their shortcomings.

cross Adaptive Framework for Ambient Intelligence in Rehabilitation Assistance

Authors: G\'abor Baranyi, Zsolt Csibi, Kristian Fenech, \'Aron F\'othi, Zs\'ofia Ga\'al, Joul Skaf, Andr\'as L\H{o}rincz

Abstract: This paper introduces the Ambient Intelligence Rehabilitation Support (AIRS) framework, an advanced artificial intelligence-based solution tailored for home rehabilitation environments. AIRS integrates cutting-edge technologies, including Real-Time 3D Reconstruction (RT-3DR), intelligent navigation, and large Vision-Language Models (VLMs), to create a comprehensive system for machine-guided physical rehabilitation. The general AIRS framework is demonstrated in rehabilitation scenarios following total knee replacement (TKR), utilizing a database of 263 video recordings for evaluation. A smartphone is employed within AIRS to perform RT-3DR of living spaces and has a body-matched avatar to provide visual feedback about the excercise. This avatar is necessary in (a) optimizing exercise configurations, including camera placement, patient positioning, and initial poses, and (b) addressing privacy concerns and promoting compliance with the AI Act. The system guides users through the recording process to ensure the collection of properly recorded videos. AIRS employs two feedback mechanisms: (i) visual 3D feedback, enabling direct comparisons between prerecorded clinical exercises and patient home recordings and (ii) VLM-generated feedback, providing detailed explanations and corrections for exercise errors. The framework also supports people with visual and hearing impairments. It also features a modular design that can be adapted to broader rehabilitation contexts. AIRS software components are available for further use and customization.

cross Normalized vs Diplomatic Annotation: A Case Study of Automatic Information Extraction from Handwritten Uruguayan Birth Certificates

Authors: Natalia Bottaioli (Universit\'e Paris-Saclay, ENS Paris-Saclay, CNRS, Centre Borelli, France, Facultad de Ingenier\'ia, Universidad de la Rep\'ublica, Montevideo, Uruguay, Digital Sense, Montevideo, Uruguay), Sol\`ene Tarride (TEKLIA, Paris, France), J\'er\'emy Anger (Universit\'e Paris-Saclay, ENS Paris-Saclay, CNRS, Centre Borelli, France), Seginus Mowlavi (Universit\'e Paris-Saclay, ENS Paris-Saclay, CNRS, Centre Borelli, France), Marina Gardella (IMPA, Rio de Janeiro, Brazil), Antoine Tadros (Universit\'e Paris-Saclay, ENS Paris-Saclay, CNRS, Centre Borelli, France), Gabriele Facciolo (Universit\'e Paris-Saclay, ENS Paris-Saclay, CNRS, Centre Borelli, France), Rafael Grompone von Gioi (Universit\'e Paris-Saclay, ENS Paris-Saclay, CNRS, Centre Borelli, France), Christopher Kermorvant (TEKLIA, Paris, France), Jean-Michel Morel (City University of Hong Kong, Hong Kong), Javier Preciozzi (Facultad de Ingenier\'ia, Universidad de la Rep\'ublica, Montevideo, Uruguay, Digital Sense, Montevideo, Uruguay)

Abstract: This study evaluates the recently proposed Document Attention Network (DAN) for extracting key-value information from Uruguayan birth certificates, handwritten in Spanish. We investigate two annotation strategies for automatically transcribing handwritten documents, fine-tuning DAN with minimal training data and annotation effort. Experiments were conducted on two datasets containing the same images (201 scans of birth certificates written by more than 15 different writers) but with different annotation methods. Our findings indicate that normalized annotation is more effective for fields that can be standardized, such as dates and places of birth, whereas diplomatic annotation performs much better for fields containing names and surnames, which can not be standardized.

cross Scaling Attention to Very Long Sequences in Linear Time with Wavelet-Enhanced Random Spectral Attention (WERSA)

Authors: Vincenzo Dentamaro

Abstract: Transformer models are computationally costly on long sequences since regular attention has quadratic $O(n^2)$ time complexity. We introduce Wavelet-Enhanced Random Spectral Attention (WERSA), a novel mechanism of linear $O(n)$ time complexity that is pivotal to enable successful long-sequence processing without the performance trade-off. WERSA merges content-adaptive random spectral features together with multi-resolution Haar wavelets and learnable parameters to selectively attend to informative scales of data while preserving linear efficiency. Large-scale comparisons \textbf{on single GPU} and across various benchmarks (vision, NLP, hierarchical reasoning) and various attention mechanisms (like Multiheaded Attention, Flash-Attention-2, FNet, Linformer, Performer, Waveformer), reveal uniform advantages of WERSA. It achieves best accuracy in all tests. On ArXiv classification, WERSA improves accuracy over vanilla attention by 1.2\% (86.2\% vs 85.0\%) while cutting training time by 81\% (296s vs 1554s) and FLOPS by 73.4\% (26.2G vs 98.4G). Significantly, WERSA excels where vanilla and FlashAttention-2 fail: on ArXiv-128k's extremely lengthy sequences, it achieves best accuracy (79.1\%) and AUC (0.979) among viable methods, operating on data that gives Out-Of-Memory errors to quadratic methods while being \textbf{twice as fast} as Waveformer, its next-best competitor. By significantly reducing computational loads without compromising accuracy, WERSA makes possible more practical, more affordable, long-context models, in particular on low-resource hardware, for more sustainable and more scalable AI development.

cross DatasetAgent: A Novel Multi-Agent System for Auto-Constructing Datasets from Real-World Images

Authors: Haoran Sun, Haoyu Bian, Shaoning Zeng, Yunbo Rao, Xu Xu, Lin Mei, Jianping Gou

Abstract: Common knowledge indicates that the process of constructing image datasets usually depends on the time-intensive and inefficient method of manual collection and annotation. Large models offer a solution via data generation. Nonetheless, real-world data are obviously more valuable comparing to artificially intelligence generated data, particularly in constructing image datasets. For this reason, we propose a novel method for auto-constructing datasets from real-world images by a multiagent collaborative system, named as DatasetAgent. By coordinating four different agents equipped with Multi-modal Large Language Models (MLLMs), as well as a tool package for image optimization, DatasetAgent is able to construct high-quality image datasets according to user-specified requirements. In particular, two types of experiments are conducted, including expanding existing datasets and creating new ones from scratch, on a variety of open-source datasets. In both cases, multiple image datasets constructed by DatasetAgent are used to train various vision models for image classification, object detection, and image segmentation.

cross Safe Deep Reinforcement Learning for Resource Allocation with Peak Age of Information Violation Guarantees

Authors: Berire Gunes Reyhan, Sinem Coleri

Abstract: In Wireless Networked Control Systems (WNCSs), control and communication systems must be co-designed due to their strong interdependence. This paper presents a novel optimization theory-based safe deep reinforcement learning (DRL) framework for ultra-reliable WNCSs, ensuring constraint satisfaction while optimizing performance, for the first time in the literature. The approach minimizes power consumption under key constraints, including Peak Age of Information (PAoI) violation probability, transmit power, and schedulability in the finite blocklength regime. PAoI violation probability is uniquely derived by combining stochastic maximum allowable transfer interval (MATI) and maximum allowable packet delay (MAD) constraints in a multi-sensor network. The framework consists of two stages: optimization theory and safe DRL. The first stage derives optimality conditions to establish mathematical relationships among variables, simplifying and decomposing the problem. The second stage employs a safe DRL model where a teacher-student framework guides the DRL agent (student). The control mechanism (teacher) evaluates compliance with system constraints and suggests the nearest feasible action when needed. Extensive simulations show that the proposed framework outperforms rule-based and other optimization theory based DRL benchmarks, achieving faster convergence, higher rewards, and greater stability.

cross KELPS: A Framework for Verified Multi-Language Autoformalization via Semantic-Syntactic Alignment

Authors: Jiyao Zhang, Chengli Zhong, Hui Xu, Qige Li, Yi Zhou

Abstract: Modern large language models (LLMs) show promising progress in formalizing informal mathematics into machine-verifiable theorems. However, these methods still face bottlenecks due to the limited quantity and quality of multilingual parallel corpora. In this paper, we propose a novel neuro-symbolic framework KELPS (Knowledge-Equation based Logical Processing System) to address these problems. KELPS is an iterative framework for translating, synthesizing, and filtering informal data into multiple formal languages (Lean, Coq, and Isabelle). First, we translate natural language into Knowledge Equations (KEs), a novel language that we designed, theoretically grounded in assertional logic. Next, we convert them to target languages through rigorously defined rules that preserve both syntactic structure and semantic meaning. This process yielded a parallel corpus of over 60,000 problems. Our framework achieves 88.9% syntactic accuracy (pass@1) on MiniF2F, outperforming SOTA models such as Deepseek-V3 (81%) and Herald (81.3%) across multiple datasets. All datasets and codes are available in the supplementary materials.

cross MoSAiC: Multi-Modal Multi-Label Supervision-Aware Contrastive Learning for Remote Sensing

Authors: Debashis Gupta, Aditi Golder, Rongkhun Zhu, Kangning Cui, Wei Tang, Fan Yang, Ovidiu Csillik, Sarra Alaqahtani, V. Paul Pauca

Abstract: Contrastive learning (CL) has emerged as a powerful paradigm for learning transferable representations without the reliance on large labeled datasets. Its ability to capture intrinsic similarities and differences among data samples has led to state-of-the-art results in computer vision tasks. These strengths make CL particularly well-suited for Earth System Observation (ESO), where diverse satellite modalities such as optical and SAR imagery offer naturally aligned views of the same geospatial regions. However, ESO presents unique challenges, including high inter-class similarity, scene clutter, and ambiguous boundaries, which complicate representation learning -- especially in low-label, multi-label settings. Existing CL frameworks often focus on intra-modality self-supervision or lack mechanisms for multi-label alignment and semantic precision across modalities. In this work, we introduce MoSAiC, a unified framework that jointly optimizes intra- and inter-modality contrastive learning with a multi-label supervised contrastive loss. Designed specifically for multi-modal satellite imagery, MoSAiC enables finer semantic disentanglement and more robust representation learning across spectrally similar and spatially complex classes. Experiments on two benchmark datasets, BigEarthNet V2.0 and Sent12MS, show that MoSAiC consistently outperforms both fully supervised and self-supervised baselines in terms of accuracy, cluster coherence, and generalization in low-label and high-class-overlap scenarios.

cross A Personalised Formal Verification Framework for Monitoring Activities of Daily Living of Older Adults Living Independently in Their Homes

Authors: Ricardo Contreras, Filip Smola, Nu\v{s}a Fari\v{c}, Jiawei Zheng, Jane Hillston, Jacques D. Fleuriot

Abstract: There is an imperative need to provide quality of life to a growing population of older adults living independently. Personalised solutions that focus on the person and take into consideration their preferences and context are key. In this work, we introduce a framework for representing and reasoning about the Activities of Daily Living of older adults living independently at home. The framework integrates data from sensors and contextual information that aggregates semi-structured interviews, home layouts and sociological observations from the participants. We use these data to create formal models, personalised for each participant according to their preferences and context. We formulate requirements that are specific to each individual as properties encoded in Linear Temporal Logic and use a model checker to verify whether each property is satisfied by the model. When a property is violated, a counterexample is generated giving the cause of the violation. We demonstrate the framework's generalisability by applying it to different participants, highlighting its potential to enhance the safety and well-being of older adults ageing in place.

cross ONION: A Multi-Layered Framework for Participatory ER Design

Authors: Viktoriia Makovska, George Fletcher, Julia Stoyanovich

Abstract: We present ONION, a multi-layered framework for participatory Entity-Relationship (ER) modeling that integrates insights from design justice, participatory AI, and conceptual modeling. ONION introduces a five-stage methodology: Observe, Nurture, Integrate, Optimize, Normalize. It supports progressive abstraction from unstructured stakeholder input to structured ER diagrams. Our approach aims to reduce designer bias, promote inclusive participation, and increase transparency through the modeling process. We evaluate ONION through real-world workshops focused on sociotechnical systems in Ukraine, highlighting how diverse stakeholder engagement leads to richer data models and deeper mutual understanding. Early results demonstrate ONION's potential to host diversity in early-stage data modeling. We conclude with lessons learned, limitations and challenges involved in scaling and refining the framework for broader adoption.

cross KG-Attention: Knowledge Graph-Guided Attention at Test-Time via Bidirectional Information Aggregation

Authors: Songlin Zhai, Guilin Qi, Yuan Meng

Abstract: Knowledge graphs (KGs) play a critical role in enhancing large language models (LLMs) by introducing structured and grounded knowledge into the learning process. However, most existing KG-enhanced approaches rely on parameter-intensive fine-tuning, which risks catastrophic forgetting and degrades the pretrained model's generalization. Moreover, they exhibit limited adaptability to real-time knowledge updates due to their static integration frameworks. To address these issues, we introduce the first test-time KG-augmented framework for LLMs, built around a dedicated knowledge graph-guided attention (KGA) module that enables dynamic knowledge fusion without any parameter updates. The proposed KGA module augments the standard self-attention mechanism with two synergistic pathways: outward and inward aggregation. Specifically, the outward pathway dynamically integrates external knowledge into input representations via input-driven KG fusion. This inward aggregation complements the outward pathway by refining input representations through KG-guided filtering, suppressing task-irrelevant signals and amplifying knowledge-relevant patterns. Importantly, while the outward pathway handles knowledge fusion, the inward path selects the most relevant triples and feeds them back into the fusion process, forming a closed-loop enhancement mechanism. By synergistically combining these two pathways, the proposed method supports real-time knowledge fusion exclusively at test-time, without any parameter modification. Extensive experiments on five benchmarks verify the comparable knowledge fusion performance of KGA.

cross Multilingual Multimodal Software Developer for Code Generation

Authors: Linzheng Chai, Jian Yang, Shukai Liu, Wei Zhang, Liran Wang, Ke Jin, Tao Sun, Congnan Liu, Chenchen Zhang, Hualei Zhu, Jiaheng Liu, Xianjie Wu, Ge Zhang, Tianyu Liu, Zhoujun Li

Abstract: The rapid advancement of Large Language Models (LLMs) has significantly improved code generation, yet most models remain text-only, neglecting crucial visual aids like diagrams and flowcharts used in real-world software development. To bridge this gap, we introduce MM-Coder, a Multilingual Multimodal software developer. MM-Coder integrates visual design inputs-Unified Modeling Language (UML) diagrams and flowcharts (termed Visual Workflow)-with textual instructions to enhance code generation accuracy and architectural alignment. To enable this, we developed MMc-Instruct, a diverse multimodal instruction-tuning dataset including visual-workflow-based code generation, allowing MM-Coder to synthesize textual and graphical information like human developers, distinct from prior work on narrow tasks. Furthermore, we introduce MMEval, a new benchmark for evaluating multimodal code generation, addressing existing text-only limitations. Our evaluations using MMEval highlight significant remaining challenges for models in precise visual information capture, instruction following, and advanced programming knowledge. Our work aims to revolutionize industrial programming by enabling LLMs to interpret and implement complex specifications conveyed through both text and visual designs.

cross Monitoring Risks in Test-Time Adaptation

Authors: Mona Schirmer, Metod Jazbec, Christian A. Naesseth, Eric Nalisnick

Abstract: Encountering shifted data at test time is a ubiquitous challenge when deploying predictive models. Test-time adaptation (TTA) methods address this issue by continuously adapting a deployed model using only unlabeled test data. While TTA can extend the model's lifespan, it is only a temporary solution. Eventually the model might degrade to the point that it must be taken offline and retrained. To detect such points of ultimate failure, we propose pairing TTA with risk monitoring frameworks that track predictive performance and raise alerts when predefined performance criteria are violated. Specifically, we extend existing monitoring tools based on sequential testing with confidence sequences to accommodate scenarios in which the model is updated at test time and no test labels are available to estimate the performance metrics of interest. Our extensions unlock the application of rigorous statistical risk monitoring to TTA, and we demonstrate the effectiveness of our proposed TTA monitoring framework across a representative set of datasets, distribution shift types, and TTA methods.

cross Dually Hierarchical Drift Adaptation for Online Configuration Performance Learning

Authors: Zezhen Xiang, Jingzhi Gong, Tao Chen

Abstract: Modern configurable software systems need to learn models that correlate configuration and performance. However, when the system operates in dynamic environments, the workload variations, hardware changes, and system updates will inevitably introduce concept drifts at different levels - global drifts, which reshape the performance landscape of the entire configuration space; and local drifts, which only affect certain sub-regions of that space. As such, existing offline and transfer learning approaches can struggle to adapt to these implicit and unpredictable changes in real-time, rendering configuration performance learning challenging. To address this, we propose DHDA, an online configuration performance learning framework designed to capture and adapt to these drifts at different levels. The key idea is that DHDA adapts to both the local and global drifts using dually hierarchical adaptation: at the upper level, we redivide the data into different divisions, within each of which the local model is retrained, to handle global drifts only when necessary. At the lower level, the local models of the divisions can detect local drifts and adapt themselves asynchronously. To balance responsiveness and efficiency, DHDA combines incremental updates with periodic full retraining to minimize redundant computation when no drifts are detected. Through evaluating eight software systems and against state-of-the-art approaches, we show that DHDA achieves considerably better accuracy and can effectively adapt to drifts with up to 2x improvements, while incurring reasonable overhead and is able to improve different local models in handling concept drift.

cross Catastrophic Forgetting Mitigation Through Plateau Phase Activity Profiling

Authors: Idan Mashiach, Oren Glickman, Tom Tirer

Abstract: Catastrophic forgetting in deep neural networks occurs when learning new tasks degrades performance on previously learned tasks due to knowledge overwriting. Among the approaches to mitigate this issue, regularization techniques aim to identify and constrain "important" parameters to preserve previous knowledge. In the highly nonconvex optimization landscape of deep learning, we propose a novel perspective: tracking parameters during the final training plateau is more effective than monitoring them throughout the entire training process. We argue that parameters that exhibit higher activity (movement and variability) during this plateau reveal directions in the loss landscape that are relatively flat, making them suitable for adaptation to new tasks while preserving knowledge from previous ones. Our comprehensive experiments demonstrate that this approach achieves superior performance in balancing catastrophic forgetting mitigation with strong performance on newly learned tasks.

cross Adaptive Nonlinear Vector Autoregression: Robust Forecasting for Noisy Chaotic Time Series

Authors: Azimov Sherkhon, Susana Lopez-Moreno, Eric Dolores-Cuenca, Sieun Lee, Sangil Kim

Abstract: Nonlinear vector autoregression (NVAR) and reservoir computing (RC) have shown promise in forecasting chaotic dynamical systems, such as the Lorenz-63 model and El Nino-Southern Oscillation. However, their reliance on fixed nonlinearities - polynomial expansions in NVAR or random feature maps in RC - limits their adaptability to high noise or real-world data. These methods also scale poorly in high-dimensional settings due to costly matrix inversion during readout computation. We propose an adaptive NVAR model that combines delay-embedded linear inputs with features generated by a shallow, learnable multi-layer perceptron (MLP). The MLP and linear readout are jointly trained using gradient-based optimization, enabling the model to learn data-driven nonlinearities while preserving a simple readout structure. Unlike standard NVAR, our approach avoids the need for an exhaustive and sensitive grid search over ridge and delay parameters. Instead, tuning is restricted to neural network hyperparameters, improving scalability. Initial experiments on chaotic systems tested under noise-free and synthetically noisy conditions showed that the adaptive model outperformed the standard NVAR in predictive accuracy and showed robust forecasting under noisy conditions with a lower observation frequency.

cross Geo-ORBIT: A Federated Digital Twin Framework for Scene-Adaptive Lane Geometry Detection

Authors: Rei Tamaru, Pei Li, Bin Ran

Abstract: Digital Twins (DT) have the potential to transform traffic management and operations by creating dynamic, virtual representations of transportation systems that sense conditions, analyze operations, and support decision-making. A key component for DT of the transportation system is dynamic roadway geometry sensing. However, existing approaches often rely on static maps or costly sensors, limiting scalability and adaptability. Additionally, large-scale DTs that collect and analyze data from multiple sources face challenges in privacy, communication, and computational efficiency. To address these challenges, we introduce Geo-ORBIT (Geometrical Operational Roadway Blueprint with Integrated Twin), a unified framework that combines real-time lane detection, DT synchronization, and federated meta-learning. At the core of Geo-ORBIT is GeoLane, a lightweight lane detection model that learns lane geometries from vehicle trajectory data using roadside cameras. We extend this model through Meta-GeoLane, which learns to personalize detection parameters for local entities, and FedMeta-GeoLane, a federated learning strategy that ensures scalable and privacy-preserving adaptation across roadside deployments. Our system is integrated with CARLA and SUMO to create a high-fidelity DT that renders highway scenarios and captures traffic flows in real-time. Extensive experiments across diverse urban scenes show that FedMeta-GeoLane consistently outperforms baseline and meta-learning approaches, achieving lower geometric error and stronger generalization to unseen locations while drastically reducing communication overhead. This work lays the foundation for flexible, context-aware infrastructure modeling in DTs. The framework is publicly available at https://github.com/raynbowy23/FedMeta-GeoLane.git.

URLs: https://github.com/raynbowy23/FedMeta-GeoLane.git.

cross Penalizing Infeasible Actions and Reward Scaling in Reinforcement Learning with Offline Data

Authors: Jeonghye Kim, Yongjae Shin, Whiyoung Jung, Sunghoon Hong, Deunsol Yoon, Youngchul Sung, Kanghoon Lee, Woohyung Lim

Abstract: Reinforcement learning with offline data suffers from Q-value extrapolation errors. To address this issue, we first demonstrate that linear extrapolation of the Q-function beyond the data range is particularly problematic. To mitigate this, we propose guiding the gradual decrease of Q-values outside the data range, which is achieved through reward scaling with layer normalization (RS-LN) and a penalization mechanism for infeasible actions (PA). By combining RS-LN and PA, we develop a new algorithm called PARS. We evaluate PARS across a range of tasks, demonstrating superior performance compared to state-of-the-art algorithms in both offline training and online fine-tuning on the D4RL benchmark, with notable success in the challenging AntMaze Ultra task.

cross Compress Any Segment Anything Model (SAM)

Authors: Juntong Fan, Zhiwei Hao, Jianqiang Shen, Shang-Ling Jui, Yi Zhang, Jing-Xiao Liao, Feng-Lei Fan

Abstract: Due to the excellent performance in yielding high-quality, zero-shot segmentation, Segment Anything Model (SAM) and its variants have been widely applied in diverse scenarios such as healthcare and intelligent manufacturing. Therefore, effectively compressing SAMs has become an increasingly pressing practical need. In this study, we propose Birkhoff, a novel data-free compression algorithm for SAM and its variants. Unlike quantization, pruning, distillation, and other compression methods, Birkhoff embodies versatility across model types, agility in deployment, faithfulness to the original model, and compactness in model size. Specifically, Birkhoff introduces a novel compression algorithm: Hyper-Compression, whose core principle is to find a dense trajectory to turn a high-dimensional parameter vector into a low-dimensional scalar. Furthermore, Birkhoff designs a dedicated linear layer operator, HyperLinear, to fuse decompression and matrix multiplication to significantly accelerate inference of the compressed SAMs. Extensive experiments on 18 SAMs in the COCO, LVIS, and SA-1B datasets show that Birkhoff performs consistently and competitively in compression time, compression ratio, post-compression performance, and inference speed. For example, Birkhoff can achieve a compression ratio of 5.17x on SAM2-B, with less than 1% performance drop without using any fine-tuning data. Moreover, the compression is finished within 60 seconds for all models.

cross A Hybrid Multi-Well Hopfield-CNN with Feature Extraction and K-Means for MNIST Classification

Authors: Ahmed Farooq

Abstract: This study presents a hybrid model for classifying handwritten digits in the MNIST dataset, combining convolutional neural networks (CNNs) with a multi-well Hopfield network. The approach employs a CNN to extract high-dimensional features from input images, which are then clustered into class-specific prototypes using k-means clustering. These prototypes serve as attractors in a multi-well energy landscape, where a Hopfield network performs classification by minimizing an energy function that balances feature similarity and class assignment.The model's design enables robust handling of intraclass variability, such as diverse handwriting styles, while providing an interpretable framework through its energy-based decision process. Through systematic optimization of the CNN architecture and the number of wells, the model achieves a high test accuracy of 99.2% on 10,000 MNIST images, demonstrating its effectiveness for image classification tasks. The findings highlight the critical role of deep feature extraction and sufficient prototype coverage in achieving high performance, with potential for broader applications in pattern recognition.

cross On Barriers to Archival Audio Processing

Authors: Peter Sullivan, Muhammad Abdul-Mageed

Abstract: In this study, we leverage a unique UNESCO collection of mid-20th century radio recordings to probe the robustness of modern off-the-shelf language identification (LID) and speaker recognition (SR) methods, especially with respect to the impact of multilingual speakers and cross-age recordings. Our findings suggest that LID systems, such as Whisper, are increasingly adept at handling second-language and accented speech. However, speaker embeddings remain a fragile component of speech processing pipelines that is prone to biases related to the channel, age, and language. Issues which will need to be overcome should archives aim to employ SR methods for speaker indexing.

cross Optimistic Exploration for Risk-Averse Constrained Reinforcement Learning

Authors: James McCarthy, Radu Marinescu, Elizabeth Daly, Ivana Dusparic

Abstract: Risk-averse Constrained Reinforcement Learning (RaCRL) aims to learn policies that minimise the likelihood of rare and catastrophic constraint violations caused by an environment's inherent randomness. In general, risk-aversion leads to conservative exploration of the environment which typically results in converging to sub-optimal policies that fail to adequately maximise reward or, in some cases, fail to achieve the goal. In this paper, we propose an exploration-based approach for RaCRL called Optimistic Risk-averse Actor Critic (ORAC), which constructs an exploratory policy by maximising a local upper confidence bound of the state-action reward value function whilst minimising a local lower confidence bound of the risk-averse state-action cost value function. Specifically, at each step, the weighting assigned to the cost value is increased or decreased if it exceeds or falls below the safety constraint value. This way the policy is encouraged to explore uncertain regions of the environment to discover high reward states whilst still satisfying the safety constraints. Our experimental results demonstrate that the ORAC approach prevents convergence to sub-optimal policies and improves significantly the reward-cost trade-off in various continuous control tasks such as Safety-Gymnasium and a complex building energy management environment CityLearn.

cross KV Cache Steering for Inducing Reasoning in Small Language Models

Authors: Max Belitsky, Dawid J. Kopiczko, Michael Dorkenwald, M. Jehanzeb Mirza, Cees G. M. Snoek, Yuki M. Asano

Abstract: We propose cache steering, a lightweight method for implicit steering of language models via a one-shot intervention applied directly to the key-value cache. To validate its effectiveness, we apply cache steering to induce chain-of-thought reasoning in small language models. Our approach leverages GPT-4o-generated reasoning traces to construct steering vectors that shift model behavior toward more explicit, multi-step reasoning without fine-tuning or prompt modifications. Experimental evaluations on diverse reasoning benchmarks demonstrate that cache steering improves both the qualitative structure of model reasoning and quantitative task performance. Compared to prior activation steering techniques that require continuous interventions, our one-shot cache steering offers substantial advantages in terms of hyperparameter stability, inference-time efficiency, and ease of integration, making it a more robust and practical solution for controlled generation.

cross NeuralOS: Towards Simulating Operating Systems via Neural Generative Models

Authors: Luke Rivard, Sun Sun, Hongyu Guo, Wenhu Chen, Yuntian Deng

Abstract: We introduce NeuralOS, a neural framework that simulates graphical user interfaces (GUIs) of operating systems by directly predicting screen frames in response to user inputs such as mouse movements, clicks, and keyboard events. NeuralOS combines a recurrent neural network (RNN), which tracks computer state, with a diffusion-based neural renderer that generates screen images. The model is trained on a large-scale dataset of Ubuntu XFCE recordings, which include both randomly generated interactions and realistic interactions produced by AI agents. Experiments show that NeuralOS successfully renders realistic GUI sequences, accurately captures mouse interactions, and reliably predicts state transitions like application launches. Although modeling fine-grained keyboard interactions precisely remains challenging, NeuralOS offers a step toward creating fully adaptive, generative neural interfaces for future human-computer interaction systems.

cross Lumos-1: On Autoregressive Video Generation from a Unified Model Perspective

Authors: Hangjie Yuan, Weihua Chen, Jun Cen, Hu Yu, Jingyun Liang, Shuning Chang, Zhihui Lin, Tao Feng, Pengwei Liu, Jiazheng Xing, Hao Luo, Jiasheng Tang, Fan Wang, Yi Yang

Abstract: Autoregressive large language models (LLMs) have unified a vast range of language tasks, inspiring preliminary efforts in autoregressive video generation. Existing autoregressive video generators either diverge from standard LLM architectures, depend on bulky external text encoders, or incur prohibitive latency due to next-token decoding. In this paper, we introduce Lumos-1, an autoregressive video generator that retains the LLM architecture with minimal architectural modifications. To inject spatiotemporal correlations in LLMs, we identify the efficacy of incorporating 3D RoPE and diagnose its imbalanced frequency spectrum ranges. Therefore, we propose MM-RoPE, a RoPE scheme that preserves the original textual RoPE while providing comprehensive frequency spectra and scaled 3D positions for modeling multimodal spatiotemporal data. Moreover, Lumos-1 resorts to a token dependency strategy that obeys intra-frame bidirectionality and inter-frame temporal causality. Based on this dependency strategy, we identify the issue of frame-wise loss imbalance caused by spatial information redundancy and solve it by proposing Autoregressive Discrete Diffusion Forcing (AR-DF). AR-DF introduces temporal tube masking during training with a compatible inference-time masking policy to avoid quality degradation. By using memory-efficient training techniques, we pre-train Lumos-1 on only 48 GPUs, achieving performance comparable to EMU3 on GenEval, COSMOS-Video2World on VBench-I2V, and OpenSoraPlan on VBench-T2V. Code and models are available at https://github.com/alibaba-damo-academy/Lumos.

URLs: https://github.com/alibaba-damo-academy/Lumos.

replace Interpreting systems as solving POMDPs: a step towards a formal understanding of agency

Authors: Martin Biehl, Nathaniel Virgo

Abstract: Under what circumstances can a system be said to have beliefs and goals, and how do such agency-related features relate to its physical state? Recent work has proposed a notion of interpretation map, a function that maps the state of a system to a probability distribution representing its beliefs about an external world. Such a map is not completely arbitrary, as the beliefs it attributes to the system must evolve over time in a manner that is consistent with Bayes' theorem, and consequently the dynamics of a system constrain its possible interpretations. Here we build on this approach, proposing a notion of interpretation not just in terms of beliefs but in terms of goals and actions. To do this we make use of the existing theory of partially observable Markov processes (POMDPs): we say that a system can be interpreted as a solution to a POMDP if it not only admits an interpretation map describing its beliefs about the hidden state of a POMDP but also takes actions that are optimal according to its belief state. An agent is then a system together with an interpretation of this system as a POMDP solution. Although POMDPs are not the only possible formulation of what it means to have a goal, this nevertheless represents a step towards a more general formal definition of what it means for a system to be an agent.

replace Text2BIM: Generating Building Models Using a Large Language Model-based Multi-Agent Framework

Authors: Changyu Du, Sebastian Esser, Stavros Nousias, Andr\'e Borrmann

Abstract: The conventional BIM authoring process typically requires designers to master complex and tedious modeling commands in order to materialize their design intentions within BIM authoring tools. This additional cognitive burden complicates the design process and hinders the adoption of BIM and model-based design in the AEC (Architecture, Engineering, and Construction) industry. To facilitate the expression of design intentions more intuitively, we propose Text2BIM, an LLM-based multi-agent framework that can generate 3D building models from natural language instructions. This framework orchestrates multiple LLM agents to collaborate and reason, transforming textual user input into imperative code that invokes the BIM authoring tool's APIs, thereby generating editable BIM models with internal layouts, external envelopes, and semantic information directly in the software. Furthermore, a rule-based model checker is introduced into the agentic workflow, utilizing predefined domain knowledge to guide the LLM agents in resolving issues within the generated models and iteratively improving model quality. Extensive experiments were conducted to compare and analyze the performance of three different LLMs under the proposed framework. The evaluation results demonstrate that our approach can effectively generate high-quality, structurally rational building models that are aligned with the abstract concepts specified by user input. Finally, an interactive software prototype was developed to integrate the framework into the BIM authoring software Vectorworks, showcasing the potential of modeling by chatting. The code is available at: https://github.com/dcy0577/Text2BIM

URLs: https://github.com/dcy0577/Text2BIM

replace AI Delegates with a Dual Focus: Ensuring Privacy and Strategic Self-Disclosure

Authors: Zhiyang Zhang, Xi Chen, Fangkai Yang, Xiaoting Qin, Chao Du, Xi Cheng, Hangxin Liu, Qingwei Lin, Saravan Rajmohan, Dongmei Zhang

Abstract: Large language model (LLM)-based AI delegates are increasingly utilized to act on behalf of users, assisting them with a wide range of tasks through conversational interfaces. Despite their advantages, concerns arise regarding the potential risk of privacy leaks, particularly in scenarios involving social interactions. While existing research has focused on protecting privacy by limiting the access of AI delegates to sensitive user information, many social scenarios require disclosing private details to achieve desired social goals, necessitating a balance between privacy protection and disclosure. To address this challenge, we first conduct a pilot study to investigate user perceptions of AI delegates across various social relations and task scenarios, and then propose a novel AI delegate system that enables privacy-conscious self-disclosure. Our user study demonstrates that the proposed AI delegate strategically protects privacy, pioneering its use in diverse and dynamic social interactions.

replace Dualformer: Controllable Fast and Slow Thinking by Learning with Randomized Reasoning Traces

Authors: DiJia Su, Sainbayar Sukhbaatar, Michael Rabbat, Yuandong Tian, Qinqing Zheng

Abstract: In cognition theory, human thinking is governed by two systems: the fast and intuitive System 1 and the slower but more deliberative System 2. Analogously, Large Language Models (LLMs) can operate in two reasoning modes: outputting only the solutions (\emph{fast mode}) or both the reasoning chain and the final solution (\emph{slow mode}). We present \dualformer, a single Transformer model that seamlessly integrates both the fast and slow reasoning modes by training on randomized reasoning traces, where different parts of the traces are strategically dropped during training. At inference time, \dualformer can be easily configured to execute in either fast or slow mode, or automatically decide which mode to engage (\emph{auto mode}). It outperforms baselines in both performance and computational efficiency across all three modes: (1) in slow mode, \dualformer achieves $97.6\%$ optimal rate on unseen $30 \times 30$ maze tasks, surpassing the \searchformer baseline ($93.3\%$) trained on data with complete reasoning traces, with $45.5\%$ fewer reasoning steps; (2) in fast mode, \dualformer achieves $80\%$ optimal rate, significantly outperforming the Solution-Only model trained on solution-only data, which has an optimal rate of only $30\%$; (3) in auto mode, \dualformer achieves $96.6\%$ optimal rate with $59.9\%$ fewer steps than \searchformer. Moreover, \dualformer produces more diverse reasoning traces than \searchformer{}. For math reasoning problems, our techniques have also achieved improved performance with LLM fine-tuning, demonstrating its generalization beyond task-specific models. We open source our code at https://github.com/facebookresearch/dualformer.

URLs: https://github.com/facebookresearch/dualformer.

replace Bandit-Based Prompt Design Strategy Selection Improves Prompt Optimizers

Authors: Rin Ashizawa, Yoichi Hirose, Nozomu Yoshinari, Kento Uchida, Shinichi Shirakawa

Abstract: Prompt optimization aims to search for effective prompts that enhance the performance of large language models (LLMs). Although existing prompt optimization methods have discovered effective prompts, they often differ from sophisticated prompts carefully designed by human experts. Prompt design strategies, representing best practices for improving prompt performance, can be key to improving prompt optimization. Recently, a method termed the Autonomous Prompt Engineering Toolbox (APET) has incorporated various prompt design strategies into the prompt optimization process. In APET, the LLM is needed to implicitly select and apply the appropriate strategies because prompt design strategies can have negative effects. This implicit selection may be suboptimal due to the limited optimization capabilities of LLMs. This paper introduces Optimizing Prompts with sTrategy Selection (OPTS), which implements explicit selection mechanisms for prompt design. We propose three mechanisms, including a Thompson sampling-based approach, and integrate them into EvoPrompt, a well-known prompt optimizer. Experiments optimizing prompts for two LLMs, Llama-3-8B-Instruct and GPT-4o mini, were conducted using BIG-Bench Hard. Our results show that the selection of prompt design strategies improves the performance of EvoPrompt, and the Thompson sampling-based mechanism achieves the best overall results. Our experimental code is provided at https://github.com/shiralab/OPTS .

URLs: https://github.com/shiralab/OPTS

replace A taxonomy of epistemic injustice in the context of AI and the case for generative hermeneutical erasure

Authors: Warmhold Jan Thomas Mollema

Abstract: Epistemic injustice related to AI is a growing concern. In relation to machine learning models, epistemic injustice can have a diverse range of sources, ranging from epistemic opacity, the discriminatory automation of testimonial prejudice, and the distortion of human beliefs via generative AI's hallucinations to the exclusion of the global South in global AI governance, the execution of bureaucratic violence via algorithmic systems, and interactions with conversational artificial agents. Based on a proposed general taxonomy of epistemic injustice, this paper first sketches a taxonomy of the types of epistemic injustice in the context of AI, relying on the work of scholars from the fields of philosophy of technology, political philosophy and social epistemology. Secondly, an additional conceptualization on epistemic injustice in the context of AI is provided: generative hermeneutical erasure. I argue that this injustice the automation of 'epistemicide', the injustice done to epistemic agents in their capacity for collective sense-making through the suppression of difference in epistemology and conceptualization by LLMs. AI systems' 'view from nowhere' epistemically inferiorizes non-Western epistemologies and thereby contributes to the erosion of their epistemic particulars, gradually contributing to hermeneutical erasure. This work's relevance lies in proposal of a taxonomy that allows epistemic injustices to be mapped in the AI domain and the proposal of a novel form of AI-related epistemic injustice.

replace A Hybrid SMT-NRA Solver: Integrating 2D Cell-Jump-Based Local Search, MCSAT and OpenCAD

Authors: Tianyi Ding, Haokun Li, Xinpeng Ni, Bican Xia, Tianqi Zhao

Abstract: In this paper, we propose a hybrid framework for Satisfiability Modulo the Theory of Nonlinear Real Arithmetic (SMT-NRA for short). First, we introduce a two-dimensional cell-jump move, called \emph{$2d$-cell-jump}, generalizing the key operation, cell-jump, of the local search method for SMT-NRA. Then, we propose an extended local search framework, named \emph{$2d$-LS} (following the local search framework, LS, for SMT-NRA), integrating the model constructing satisfiability calculus (MCSAT) framework to improve search efficiency. To further improve the efficiency of MCSAT, we implement a recently proposed technique called \emph{sample-cell projection operator} for MCSAT, which is well suited for CDCL-style search in the real domain and helps guide the search away from conflicting states. Finally, we present a hybrid framework for SMT-NRA integrating MCSAT, $2d$-LS and OpenCAD, to improve search efficiency through information exchange. The experimental results demonstrate improvements in local search performance, highlighting the effectiveness of the proposed methods.

replace Discovering Algorithms with Computational Language Processing

Authors: Theo Bourdais, Abeynaya Gnanasekaran, Houman Owhadi, Tuhin Sahai

Abstract: Algorithms are the engine for reproducible problem-solving. We present a framework automating algorithm discovery by conceptualizing them as sequences of operations, represented as tokens. These computational tokens are chained using a grammar, enabling the formation of increasingly sophisticated procedures. Our ensemble Monte Carlo tree search (MCTS) guided by reinforcement learning (RL) explores token chaining and drives the creation of new tokens. This methodology rediscovers, improves, and generates new algorithms that substantially outperform existing methods for strongly NP-hard combinatorial optimization problems and foundational quantum computing approaches such as Grover's and Quantum Approximate Optimization Algorithm. Operating at the computational rather than code-generation level, our framework produces algorithms that can be tailored specifically to problem instances, not merely classes.

replace Open Source Planning & Control System with Language Agents for Autonomous Scientific Discovery

Authors: Licong Xu, Milind Sarkar, Anto I. Lonappan, \'I\~nigo Zubeldia, Pablo Villanueva-Domingo, Santiago Casas, Christian Fidler, Chetana Amancharla, Ujjwal Tiwari, Adrian Bayer, Chadi Ait Ekioui, Miles Cranmer, Adrian Dimitrov, James Fergusson, Kahaan Gandhi, Sven Krippendorf, Andrew Laverick, Julien Lesgourgues, Antony Lewis, Thomas Meier, Blake Sherwin, Kristen Surrao, Francisco Villaescusa-Navarro, Chi Wang, Xueqing Xu, Boris Bolliet

Abstract: We present a multi-agent system for automation of scientific research tasks, cmbagent (https://github.com/CMBAgents/cmbagent). The system is formed by about 30 Large Language Model (LLM) agents and implements a Planning & Control strategy to orchestrate the agentic workflow, with no human-in-the-loop at any point. Each agent specializes in a different task (performing retrieval on scientific papers and codebases, writing code, interpreting results, critiquing the output of other agents) and the system is able to execute code locally. We successfully apply cmbagent to carry out a PhD level cosmology task (the measurement of cosmological parameters using supernova data) and evaluate its performance on two benchmark sets, finding superior performance over state-of-the-art LLMs. The source code is available on GitHub, demonstration videos are also available, and the system is deployed on HuggingFace and will be available on the cloud.

URLs: https://github.com/CMBAgents/cmbagent).

replace StarDojo: Benchmarking Open-Ended Behaviors of Agentic Multimodal LLMs in Production-Living Simulations with Stardew Valley

Authors: Weihao Tan, Changjiu Jiang, Yu Duan, Mingcong Lei, Jiageng Li, Yitian Hong, Xinrun Wang, Bo An

Abstract: Autonomous agents navigating human society must master both production activities and social interactions, yet existing benchmarks rarely evaluate these skills simultaneously. To bridge this gap, we introduce StarDojo, a novel benchmark based on Stardew Valley, designed to assess AI agents in open-ended production-living simulations. In StarDojo, agents are tasked to perform essential livelihood activities such as farming and crafting, while simultaneously engaging in social interactions to establish relationships within a vibrant community. StarDojo features 1,000 meticulously curated tasks across five key domains: farming, crafting, exploration, combat, and social interactions. Additionally, we provide a compact subset of 100 representative tasks for efficient model evaluation. The benchmark offers a unified, user-friendly interface that eliminates the need for keyboard and mouse control, supports all major operating systems, and enables the parallel execution of multiple environment instances, making it particularly well-suited for evaluating the most capable foundation agents, powered by multimodal large language models (MLLMs). Extensive evaluations of state-of-the-art MLLMs agents demonstrate substantial limitations, with the best-performing model, GPT-4.1, achieving only a 12.7% success rate, primarily due to challenges in visual understanding, multimodal reasoning and low-level manipulation. As a user-friendly environment and benchmark, StarDojo aims to facilitate further research towards robust, open-ended agents in complex production-living environments.

replace Measuring AI Alignment with Human Flourishing

Authors: Elizabeth Hilliard, Akshaya Jagadeesh, Alex Cook, Steele Billings, Nicholas Skytland, Alicia Llewellyn, Jackson Paull, Nathan Paull, Nolan Kurylo, Keatra Nesbitt, Robert Gruenewald, Anthony Jantzi, Omar Chavez

Abstract: This paper introduces the Flourishing AI Benchmark (FAI Benchmark), a novel evaluation framework that assesses AI alignment with human flourishing across seven dimensions: Character and Virtue, Close Social Relationships, Happiness and Life Satisfaction, Meaning and Purpose, Mental and Physical Health, Financial and Material Stability, and Faith and Spirituality. Unlike traditional benchmarks that focus on technical capabilities or harm prevention, the FAI Benchmark measures AI performance on how effectively models contribute to the flourishing of a person across these dimensions. The benchmark evaluates how effectively LLM AI systems align with current research models of holistic human well-being through a comprehensive methodology that incorporates 1,229 objective and subjective questions. Using specialized judge Large Language Models (LLMs) and cross-dimensional evaluation, the FAI Benchmark employs geometric mean scoring to ensure balanced performance across all flourishing dimensions. Initial testing of 28 leading language models reveals that while some models approach holistic alignment (with the highest-scoring models achieving 72/100), none are acceptably aligned across all dimensions, particularly in Faith and Spirituality, Character and Virtue, and Meaning and Purpose. This research establishes a framework for developing AI systems that actively support human flourishing rather than merely avoiding harm, offering significant implications for AI development, ethics, and evaluation.

replace-cross Temporal Motifs for Financial Networks: A Study on Mercari, JPMC, and Venmo Platforms

Authors: Penghang Liu, Bahadir Altun, Rupam Acharyya, Robert E. Tillman, Shunya Kimura, Naoki Masuda, Ahmet Erdem Sar{\i}y\"uce

Abstract: Understanding the dynamics of financial transactions among people is critical for various applications such as fraud detection. One important aspect of financial transaction networks is temporality. The order and repetition of transactions can offer new insights when considered within the graph structure. Temporal motifs, defined as a set of nodes that interact with each other in a short time period, are a promising tool in this context. In this work, we study three unique temporal financial networks: transactions in Mercari, an online marketplace, payments in a synthetic network generated by J.P. Morgan Chase, and payments and friendships among Venmo users. We consider the fraud detection problem on the Mercari and J.P. Morgan Chase networks, for which the ground truth is available. We show that temporal motifs offer superior performance to several baselines, including a previous method that considers simple graph features and two node embedding techniques (LINE and node2vec), while being practical in terms of runtime performance. For the Venmo network, we investigate the interplay between financial and social relations on three tasks: friendship prediction, vendor identification, and analysis of temporal cycles. For friendship prediction, temporal motifs yield better results than general heuristics, such as Jaccard and Adamic-Adar measures. We are also able to identify vendors with high accuracy and observe interesting patterns in rare motifs, such as temporal cycles. We believe that the analysis, datasets, and lessons from this work will be beneficial for future research on financial transaction networks.

replace-cross SDR-GAIN: A High Real-Time Occluded Pedestrian Pose Completion Method for Autonomous Driving

Authors: Honghao Fu, Yongli Gu, Yidong Yan, Yilang Shen, Yiwen Wu, Libo Sun

Abstract: With the advancement of vision-based autonomous driving technology, pedestrian detection have become an important component for improving traffic safety and driving system robustness. Nevertheless, in complex traffic scenarios, conventional pose estimation approaches frequently fail to accurately reconstruct occluded keypoints, primarily due to obstructions caused by vehicles, vegetation, or architectural elements. To address this issue, we propose a novel real-time occluded pedestrian pose completion framework termed Separation and Dimensionality Reduction-based Generative Adversarial Imputation Nets (SDR-GAIN). Unlike previous approaches that train visual models to distinguish occlusion patterns, SDR-GAIN aims to learn human pose directly from the numerical distribution of keypoint coordinates and interpolate missing positions. It employs a self-supervised adversarial learning paradigm to train lightweight generators with residual structures for the imputation of missing pose keypoints. Additionally, it integrates multiple pose standardization techniques to alleviate the difficulty of the learning process. Experiments conducted on the COCO and JAAD datasets demonstrate that SDR-GAIN surpasses conventional machine learning and Transformer-based missing data interpolation algorithms in accurately recovering occluded pedestrian keypoints, while simultaneously achieving microsecond-level real-time inference.

replace-cross Large Language Models in Mental Health Care: a Scoping Review

Authors: Yining Hua, Fenglin Liu, Kailai Yang, Zehan Li, Hongbin Na, Yi-han Sheu, Peilin Zhou, Lauren V. Moran, Sophia Ananiadou, David A. Clifton, Andrew Beam, John Torous

Abstract: Objectieve:This review aims to deliver a comprehensive analysis of Large Language Models (LLMs) utilization in mental health care, evaluating their effectiveness, identifying challenges, and exploring their potential for future application. Materials and Methods: A systematic search was performed across multiple databases including PubMed, Web of Science, Google Scholar, arXiv, medRxiv, and PsyArXiv in November 2023. The review includes all types of original research, regardless of peer-review status, published or disseminated between October 1, 2019, and December 2, 2023. Studies were included without language restrictions if they employed LLMs developed after T5 and directly investigated research questions within mental health care settings. Results: Out of an initial 313 articles, 34 were selected based on their relevance to LLMs applications in mental health care and the rigor of their reported outcomes. The review identified various LLMs applications in mental health care, including diagnostics, therapy, and enhancing patient engagement. Key challenges highlighted were related to data availability and reliability, the nuanced handling of mental states, and effective evaluation methods. While LLMs showed promise in improving accuracy and accessibility, significant gaps in clinical applicability and ethical considerations were noted. Conclusion: LLMs hold substantial promise for enhancing mental health care. For their full potential to be realized, emphasis must be placed on developing robust datasets, development and evaluation frameworks, ethical guidelines, and interdisciplinary collaborations to address current limitations.

replace-cross GoalNet: Goal Areas Oriented Pedestrian Trajectory Prediction

Authors: Amar Fadillah, Ching-Lin Lee, Zhi-Xuan Wang, Kuan-Ting Lai

Abstract: Predicting the future trajectories of pedestrians on the road is an important task for autonomous driving. The pedestrian trajectory prediction is affected by scene paths, pedestrian's intentions and decision-making, which is a multi-modal problem. Most recent studies use past trajectories to predict a variety of potential future trajectory distributions, which do not account for the scene context and pedestrian targets. Instead of predicting the future trajectory directly, we propose to use scene context and observed trajectory to predict the goal points first, and then reuse the goal points to predict the future trajectories. By leveraging the information from scene context and observed trajectory, the uncertainty can be limited to a few target areas, which represent the "goals" of the pedestrians. In this paper, we propose GoalNet, a new trajectory prediction neural network based on the goal areas of a pedestrian. Our network can predict both pedestrian's trajectories and bounding boxes. The overall model is efficient and modular, and its outputs can be changed according to the usage scenario. Experimental results show that GoalNet significantly improves the previous state-of-the-art performance by 48.7% on the JAAD and 40.8% on the PIE dataset.

replace-cross SpecDec++: Boosting Speculative Decoding via Adaptive Candidate Lengths

Authors: Kaixuan Huang, Xudong Guo, Mengdi Wang

Abstract: Speculative decoding reduces the inference latency of a target large language model via utilizing a smaller and faster draft model. Its performance depends on a hyperparameter K -- the candidate length, i.e., the number of candidate tokens for the target model to verify in each round. However, previous methods often use simple heuristics to choose K, which may result in sub-optimal performance. We study the choice of the candidate length K and formulate it as a Markov Decision Process. We theoretically show that the optimal policy of this Markov decision process takes the form of a threshold policy, i.e., the current speculation should stop and be verified when the probability of getting a rejection exceeds a threshold value. Motivated by this theory, we propose SpecDec++, an enhanced version of speculative decoding that adaptively determines the candidate length on the fly. We augment the draft model with a trained acceptance prediction head to predict the conditional acceptance probability of the candidate tokens. SpecDec++ will stop the current speculation when the predicted probability that at least one token gets rejected exceeds a threshold. We implement SpecDec++ and apply it to the llama-2-chat 7B & 70B model pair. Our adaptive method achieves a 2.04x speedup on the Alpaca dataset (7.2% improvement over the baseline speculative decoding). On the GSM8K and HumanEval datasets, our method achieves a 2.26x speedup (9.4% improvement) and 2.23x speedup (11.1% improvement), respectively. The code of this paper is available at https://github.com/Kaffaljidhmah2/SpecDec_pp.

URLs: https://github.com/Kaffaljidhmah2/SpecDec_pp.

replace-cross HeSum: a Novel Dataset for Abstractive Text Summarization in Hebrew

Authors: Tzuf Paz-Argaman, Itai Mondshine, Asaf Achi Mordechai, Reut Tsarfaty

Abstract: While large language models (LLMs) excel in various natural language tasks in English, their performance in lower-resourced languages like Hebrew, especially for generative tasks such as abstractive summarization, remains unclear. The high morphological richness in Hebrew adds further challenges due to the ambiguity in sentence comprehension and the complexities in meaning construction. In this paper, we address this resource and evaluation gap by introducing HeSum, a novel benchmark specifically designed for abstractive text summarization in Modern Hebrew. HeSum consists of 10,000 article-summary pairs sourced from Hebrew news websites written by professionals. Linguistic analysis confirms HeSum's high abstractness and unique morphological challenges. We show that HeSum presents distinct difficulties for contemporary state-of-the-art LLMs, establishing it as a valuable testbed for generative language technology in Hebrew, and MRLs generative challenges in general.

replace-cross Evaluating Implicit Bias in Large Language Models by Attacking From a Psychometric Perspective

Authors: Yuchen Wen, Keping Bi, Wei Chen, Jiafeng Guo, Xueqi Cheng

Abstract: As large language models (LLMs) become an important way of information access, there have been increasing concerns that LLMs may intensify the spread of unethical content, including implicit bias that hurts certain populations without explicit harmful words. In this paper, we conduct a rigorous evaluation of LLMs' implicit bias towards certain demographics by attacking them from a psychometric perspective to elicit agreements to biased viewpoints. Inspired by psychometric principles in cognitive and social psychology, we propose three attack approaches, i.e., Disguise, Deception, and Teaching. Incorporating the corresponding attack instructions, we built two benchmarks: (1) a bilingual dataset with biased statements covering four bias types (2.7K instances) for extensive comparative analysis, and (2) BUMBLE, a larger benchmark spanning nine common bias types (12.7K instances) for comprehensive evaluation. Extensive evaluation of popular commercial and open-source LLMs shows that our methods can elicit LLMs' inner bias more effectively than competitive baselines. Our attack methodology and benchmarks offer an effective means of assessing the ethical risks of LLMs, driving progress toward greater accountability in their development. Our code, data, and benchmarks are available at https://yuchenwen1.github.io/ImplicitBiasEvaluation/.

URLs: https://yuchenwen1.github.io/ImplicitBiasEvaluation/.

replace-cross An Empirical Study of Validating Synthetic Data for Formula Generation

Authors: Usneek Singh, Jos\'e Cambronero, Sumit Gulwani, Aditya Kanade, Anirudh Khatry, Vu Le, Mukul Singh, Gust Verbruggen

Abstract: Large language models (LLMs) can be leveraged to help with writing formulas in spreadsheets, but resources on these formulas are scarce, impacting both the base performance of pre-trained models and limiting the ability to fine-tune them. Given a corpus of formulas, we can use a(nother) model to generate synthetic natural language utterances for fine-tuning. However, it is important to validate whether the NL generated by the LLM is indeed accurate to be beneficial for fine-tuning. In this paper, we provide empirical results on the impact of validating these synthetic training examples with surrogate objectives that evaluate the accuracy of the synthetic annotations. We demonstrate that validation improves performance over raw data across four models (2 open and 2 closed weight). Interestingly, we show that although validation tends to prune more challenging examples, it increases the complexity of problems that models can solve after being fine-tuned on validated data.

replace-cross Quantifying Context Bias in Domain Adaptation for Object Detection

Authors: Hojun Son, Asma Almutairi, Arpan Kusari

Abstract: Domain adaptation for object detection (DAOD) has become essential to counter performance degradation caused by distribution shifts between training and deployment domains. However, a critical factor influencing DAOD - context bias resulting from learned foreground-background (FG-BG) associations - has remained underexplored. We address three key questions regarding FG BG associations in object detection: are FG-BG associations encoded during the training, is there a causal relationship between FG-BG associations and detection performance, and is there an effect of FG-BG association on DAOD. To examine how models capture FG BG associations, we analyze class-wise and feature-wise performance degradation using background masking and feature perturbation, measured via change in accuracies (defined as drop rate). To explore the causal role of FG-BG associations, we apply do-calculus on FG-BG pairs guided by class activation mapping (CAM). To quantify the causal influence of FG-BG associations across domains, we propose a novel metric - domain association gradient - defined as the ratio of drop rate to maximum mean discrepancy (MMD). Through systematic experiments involving background masking, feature-level perturbations, and CAM, we reveal that convolution-based object detection models encode FG-BG associations. Our results demonstrate that context bias not only exists but causally undermines the generalization capabilities of object detection models across domains. Furthermore, we validate these findings across multiple models and datasets, including state-of-the-art architectures such as ALDI++. This study highlights the necessity of addressing context bias explicitly in DAOD frameworks, providing insights that pave the way for developing more robust and generalizable object detection systems.

replace-cross Downscaling Extreme Precipitation with Wasserstein Regularized Diffusion

Authors: Yuhao Liu, James Doss-Gollin, Qiushi Dai, Ashok Veeraraghavan, Guha Balakrishnan

Abstract: Understanding the risks posed by extreme rainfall events necessitates both high-resolution products (to assess localized hazards) and extensive historical records (to capture rare occurrences). Radar and mesonet networks provide kilometer-scale precipitation fields, but with limited historical records and geographical coverage. Conversely, global gauge and blended products span decades, yet their coarse 30-50 km grids obscure local extremes. This work introduces Wasserstein Regularized Diffusion (WassDiff), a generative downscaling framework that integrates diffusion modeling with a distribution-matching (Wasserstein) regularizer, suppressing bias throughout the entire generative denoising process. Conditioned on 55 km CPC gauge-based precipitation and the 31 km ERA5 reanalysis, WassDiff generates 1 km precipitation estimates that remain well-calibrated to targets across the full intensity range, including the extremes. Comprehensive evaluations demonstrate that WassDiff outperforms existing state-of-the-art downscaling methods, delivering lower reconstruction error and reduced bias. Case studies further demonstrate its ability to reproduce realistic fine-scale structures and accurate peak intensities from extreme weather phenomena, such as tropical storms and cold fronts. By unlocking decades of high-resolution rainfall information from globally available coarse records, WassDiff offers a practical pathway toward more accurate flood-risk assessments and climate-adaptation planning.

replace-cross Post-hoc Study of Climate Microtargeting on Social Media Ads with LLMs: Thematic Insights and Fairness Evaluation

Authors: Tunazzina Islam, Dan Goldwasser

Abstract: Climate change communication on social media increasingly employs microtargeting strategies to effectively reach and influence specific demographic groups. This study presents a post-hoc analysis of microtargeting practices within climate campaigns by leveraging large language models (LLMs) to examine Facebook advertisements. Our analysis focuses on two key aspects: demographic targeting and fairness. We evaluate the ability of LLMs to accurately predict the intended demographic targets, such as gender and age group, achieving an overall accuracy of 88.55%. Furthermore, we instruct the LLMs to generate explanations for their classifications, providing transparent reasoning behind each decision. These explanations reveal the specific thematic elements used to engage different demographic segments, highlighting distinct strategies tailored to various audiences. Our findings show that young adults are primarily targeted through messages emphasizing activism and environmental consciousness, while women are engaged through themes related to caregiving roles and social advocacy. In addition to evaluating the effectiveness of LLMs in detecting microtargeted messaging, we conduct a comprehensive fairness analysis to identify potential biases in model predictions. Our findings indicate that while LLMs perform well overall, certain biases exist, particularly in the classification of senior citizens and male audiences. By showcasing the efficacy of LLMs in dissecting and explaining targeted communication strategies and by highlighting fairness concerns, this study provides a valuable framework for future research aimed at enhancing transparency, accountability, and inclusivity in social media-driven climate campaigns.

replace-cross End-to-end multi-channel speaker extraction and binaural speech synthesis

Authors: Cheng Chi, Xiaoyu Li, Yuxuan Ke, Qunping Ni, Yao Ge, Xiaodong Li, Chengshi Zheng

Abstract: Speech clarity and spatial audio immersion are the two most critical factors in enhancing remote conferencing experiences. Existing methods are often limited: either due to the lack of spatial information when using only one microphone, or because their performance is highly dependent on the accuracy of direction-of-arrival estimation when using microphone array. To overcome this issue, we introduce an end-to-end deep learning framework that has the capacity of mapping multi-channel noisy and reverberant signals to clean and spatialized binaural speech directly. This framework unifies source extraction, noise suppression, and binaural rendering into one network. In this framework, a novel magnitude-weighted interaural level difference loss function is proposed that aims to improve the accuracy of spatial rendering. Extensive evaluations show that our method outperforms established baselines in terms of both speech quality and spatial fidelity.

replace-cross Compositional Risk Minimization

Authors: Divyat Mahajan, Mohammad Pezeshki, Charles Arnal, Ioannis Mitliagkas, Kartik Ahuja, Pascal Vincent

Abstract: Compositional generalization is a crucial step towards developing data-efficient intelligent machines that generalize in human-like ways. In this work, we tackle a challenging form of distribution shift, termed compositional shift, where some attribute combinations are completely absent at training but present in the test distribution. This shift tests the model's ability to generalize compositionally to novel attribute combinations in discriminative tasks. We model the data with flexible additive energy distributions, where each energy term represents an attribute, and derive a simple alternative to empirical risk minimization termed compositional risk minimization (CRM). We first train an additive energy classifier to predict the multiple attributes and then adjust this classifier to tackle compositional shifts. We provide an extensive theoretical analysis of CRM, where we show that our proposal extrapolates to special affine hulls of seen attribute combinations. Empirical evaluations on benchmark datasets confirms the improved robustness of CRM compared to other methods from the literature designed to tackle various forms of subpopulation shifts.

replace-cross On the Principles of ReLU Networks with One Hidden Layer

Authors: Changcun Huang

Abstract: A neural network with one hidden layer or a two-layer network (regardless of the input layer) is the simplest feedforward neural network, whose mechanism may be the basis of more general network architectures. However, even to this type of simple architecture, it is also a ``black box''; that is, it remains unclear how to interpret the mechanism of its solutions obtained by the back-propagation algorithm and how to control the training process through a deterministic way. This paper systematically studies the first problem by constructing universal function-approximation solutions. It is shown that, both theoretically and experimentally, the training solution for the one-dimensional input could be completely understood, and that for a higher-dimensional input can also be well interpreted to some extent. Those results pave the way for thoroughly revealing the black box of two-layer ReLU networks and advance the understanding of deep ReLU networks.

replace-cross FonTS: Text Rendering with Typography and Style Controls

Authors: Wenda Shi, Yiren Song, Dengming Zhang, Jiaming Liu, Xingxing Zou

Abstract: Visual text rendering are widespread in various real-world applications, requiring careful font selection and typographic choices. Recent progress in diffusion transformer (DiT)-based text-to-image (T2I) models show promise in automating these processes. However, these methods still encounter challenges like inconsistent fonts, style variation, and limited fine-grained control, particularly at the word-level. This paper proposes a two-stage DiT-based pipeline to address these problems by enhancing controllability over typography and style in text rendering. We introduce typography control fine-tuning (TC-FT), an parameter-efficient fine-tuning method (on $5\%$ key parameters) with enclosing typography control tokens (ETC-tokens), which enables precise word-level application of typographic features. To further address style inconsistency in text rendering, we propose a text-agnostic style control adapter (SCA) that prevents content leakage while enhancing style consistency. To implement TC-FT and SCA effectively, we incorporated HTML-render into the data synthesis pipeline and proposed the first word-level controllable dataset. Through comprehensive experiments, we demonstrate the effectiveness of our approach in achieving superior word-level typographic control, font consistency, and style consistency in text rendering tasks. The datasets and models will be available for academic use.

replace-cross PIAD-SRNN: Physics-Informed Adaptive Decomposition in State-Space RNN

Authors: Ahmad Mohammadshirazi, Pinaki Prasad Guha Neogi, Rajiv Ramnath

Abstract: Time series forecasting often demands a trade-off between accuracy and efficiency. While recent Transformer models have improved forecasting capabilities, they come with high computational costs. Linear-based models have shown better accuracy than Transformers but still fall short of ideal performance. We propose PIAD-SRNN, a physics-informed adaptive decomposition state-space RNN, that separates seasonal and trend components and embeds domain equations in a recurrent framework. We evaluate PIAD-SRNN's performance on indoor air quality datasets, focusing on CO2 concentration prediction across various forecasting horizons, and results demonstrate that it consistently outperforms SoTA models in both long-term and short-term time series forecasting, including transformer-based architectures, in terms of both MSE and MAE. Besides proposing PIAD-SRNN which balances accuracy with efficiency, this paper also provides four curated datasets. Code and data: https://github.com/ahmad-shirazi/DSSRNN

URLs: https://github.com/ahmad-shirazi/DSSRNN

replace-cross SP$^2$T: Sparse Proxy Attention for Dual-stream Point Transformer

Authors: Jiaxu Wan, Hong Zhang, Ziqi He, Yangyan Deng, Qishu Wang, Ding Yuan, Yifan Yang

Abstract: Point transformers have demonstrated remarkable progress in 3D understanding through expanded receptive fields (RF), but further expanding the RF leads to dilution in group attention and decreases detailed feature extraction capability. Proxy, which serves as abstract representations for simplifying feature maps, enables global RF. However, existing proxy-based approaches face critical limitations: Global proxies incur quadratic complexity for large-scale point clouds and suffer positional ambiguity, while local proxy alternatives struggle with 1) Unreliable sampling from the geometrically diverse point cloud, 2) Inefficient proxy interaction computation, and 3) Imbalanced local-global information fusion; To address these challenges, we propose Sparse Proxy Point Transformer (SP$^{2}$T) -- a local proxy-based dual-stream point transformer with three key innovations: First, for reliable sampling, spatial-wise proxy sampling with vertex-based associations enables robust sampling on geometrically diverse point clouds. Second, for efficient proxy interaction, sparse proxy attention with a table-based relative bias effectively achieves the interaction with efficient map-reduce computation. Third, for local-global information fusion, our dual-stream architecture maintains local-global balance through parallel branches. Comprehensive experiments reveal that SP$^{2}$T sets state-of-the-art results with acceptable latency on indoor and outdoor 3D comprehension benchmarks, demonstrating marked improvement (+3.8% mIoU vs. SPoTr@S3DIS, +22.9% mIoU vs. PointASNL@Sem.KITTI) compared to other proxy-based point cloud methods.

replace-cross Field Matching: an Electrostatic Paradigm to Generate and Transfer Data

Authors: Alexander Kolesov, Manukhov Stepan, Vladimir V. Palyulin, Alexander Korotin

Abstract: We propose Electrostatic Field Matching (EFM), a novel method that is suitable for both generative modeling and distribution transfer tasks. Our approach is inspired by the physics of an electrical capacitor. We place source and target distributions on the capacitor plates and assign them positive and negative charges, respectively. We then learn the electrostatic field of the capacitor using a neural network approximator. To map the distributions to each other, we start at one plate of the capacitor and move the samples along the learned electrostatic field lines until they reach the other plate. We theoretically justify that this approach provably yields the distribution transfer. In practice, we demonstrate the performance of our EFM in toy and image data experiments.

replace-cross GTR: Guided Thought Reinforcement Prevents Thought Collapse in RL-based VLM Agent Training

Authors: Tong Wei, Yijun Yang, Junliang Xing, Yuanchun Shi, Zongqing Lu, Deheng Ye

Abstract: Reinforcement learning with verifiable outcome rewards (RLVR) has effectively scaled up chain-of-thought (CoT) reasoning in large language models (LLMs). Yet, its efficacy in training vision-language model (VLM) agents for goal-directed action reasoning in visual environments is less established. This work investigates this problem through extensive experiments on complex card games, such as 24 points, and embodied tasks from ALFWorld. We find that when rewards are based solely on action outcomes, RL fails to incentivize CoT reasoning in VLMs, instead leading to a phenomenon we termed thought collapse, characterized by a rapid loss of diversity in the agent's thoughts, state-irrelevant and incomplete reasoning, and subsequent invalid actions, resulting in negative rewards. To counteract thought collapse, we highlight the necessity of process guidance and propose an automated corrector that evaluates and refines the agent's reasoning at each RL step. This simple and scalable GTR (Guided Thought Reinforcement) framework trains reasoning and action simultaneously without the need for dense, per-step human labeling. Our experiments demonstrate that GTR significantly enhances the performance and generalization of the LLaVA-7b model across various visual environments, achieving 3-5 times higher task success rates compared to SoTA models with notably smaller model sizes.

replace-cross REGEN: A Dataset and Benchmarks with Natural Language Critiques and Narratives

Authors: Kun Su, Krishna Sayana, Hubert Pham, James Pine, Yuri Vasilevski, Raghavendra Vasudeva, Marialena Kyriakidi, Liam Hebert, Ambarish Jash, Anushya Subbiah, Sukhdeep Sodhi

Abstract: This paper introduces a novel dataset REGEN (Reviews Enhanced with GEnerative Narratives), designed to benchmark the conversational capabilities of recommender Large Language Models (LLMs), addressing the limitations of existing datasets that primarily focus on sequential item prediction. REGEN extends the Amazon Product Reviews dataset by inpainting two key natural language features: (1) user critiques, representing user "steering" queries that lead to the selection of a subsequent item, and (2) narratives, rich textual outputs associated with each recommended item taking into account prior context. The narratives include product endorsements, purchase explanations, and summaries of user preferences. Further, we establish an end-to-end modeling benchmark for the task of conversational recommendation, where models are trained to generate both recommendations and corresponding narratives conditioned on user history (items and critiques). For this joint task, we introduce a modeling framework LUMEN (LLM-based Unified Multi-task Model with Critiques, Recommendations, and Narratives) which uses an LLM as a backbone for critiquing, retrieval and generation. We also evaluate the dataset's quality using standard auto-rating techniques and benchmark it by training both traditional and LLM-based recommender models. Our results demonstrate that incorporating critiques enhances recommendation quality by enabling the recommender to learn language understanding and integrate it with recommendation signals. Furthermore, LLMs trained on our dataset effectively generate both recommendations and contextual narratives, achieving performance comparable to state-of-the-art recommenders and language models.

replace-cross Using AI to Summarize US Presidential Campaign TV Advertisement Videos, 1952-2012

Authors: Adam Breuer, Bryce J. Dietrich, Michael H. Crespin, Matthew Butler, J. A. Pryse, Kosuke Imai

Abstract: This paper introduces the largest and most comprehensive dataset of US presidential campaign television advertisements, available in digital format. The dataset also includes machine-searchable transcripts and high-quality summaries designed to facilitate a variety of academic research. To date, there has been great interest in collecting and analyzing US presidential campaign advertisements, but the need for manual procurement and annotation led many to rely on smaller subsets. We design a large-scale parallelized, AI-based analysis pipeline that automates the laborious process of preparing, transcribing, and summarizing videos. We then apply this methodology to the 9,707 presidential ads from the Julian P. Kanter Political Commercial Archive. We conduct extensive human evaluations to show that these transcripts and summaries match the quality of manually generated alternatives. We illustrate the value of this data by including an application that tracks the genesis and evolution of current focal issue areas over seven decades of presidential elections. Our analysis pipeline and codebase also show how to use LLM-based tools to obtain high-quality summaries for other video datasets.

replace-cross Generative Retrieval and Alignment Model: A New Paradigm for E-commerce Retrieval

Authors: Ming Pang, Chunyuan Yuan, Xiaoyu He, Zheng Fang, Donghao Xie, Fanyi Qu, Xue Jiang, Changping Peng, Zhangang Lin, Ching Law, Jingping Shao

Abstract: Traditional sparse and dense retrieval methods struggle to leverage general world knowledge and often fail to capture the nuanced features of queries and products. With the advent of large language models (LLMs), industrial search systems have started to employ LLMs to generate identifiers for product retrieval. Commonly used identifiers include (1) static/semantic IDs and (2) product term sets. The first approach requires creating a product ID system from scratch, missing out on the world knowledge embedded within LLMs. While the second approach leverages this general knowledge, the significant difference in word distribution between queries and products means that product-based identifiers often do not align well with user search queries, leading to missed product recalls. Furthermore, when queries contain numerous attributes, these algorithms generate a large number of identifiers, making it difficult to assess their quality, which results in low overall recall efficiency. To address these challenges, this paper introduces a novel e-commerce retrieval paradigm: the Generative Retrieval and Alignment Model (GRAM). GRAM employs joint training on text information from both queries and products to generate shared text identifier codes, effectively bridging the gap between queries and products. This approach not only enhances the connection between queries and products but also improves inference efficiency. The model uses a co-alignment strategy to generate codes optimized for maximizing retrieval efficiency. Additionally, it introduces a query-product scoring mechanism to compare product values across different codes, further boosting retrieval efficiency. Extensive offline and online A/B testing demonstrates that GRAM significantly outperforms traditional models and the latest generative retrieval models, confirming its effectiveness and practicality.

replace-cross MedSegFactory: Text-Guided Generation of Medical Image-Mask Pairs

Authors: Jiawei Mao, Yuhan Wang, Yucheng Tang, Daguang Xu, Kang Wang, Yang Yang, Zongwei Zhou, Yuyin Zhou

Abstract: This paper presents MedSegFactory, a versatile medical synthesis framework that generates high-quality paired medical images and segmentation masks across modalities and tasks. It aims to serve as an unlimited data repository, supplying image-mask pairs to enhance existing segmentation tools. The core of MedSegFactory is a dual-stream diffusion model, where one stream synthesizes medical images and the other generates corresponding segmentation masks. To ensure precise alignment between image-mask pairs, we introduce Joint Cross-Attention (JCA), enabling a collaborative denoising paradigm by dynamic cross-conditioning between streams. This bidirectional interaction allows both representations to guide each other's generation, enhancing consistency between generated pairs. MedSegFactory unlocks on-demand generation of paired medical images and segmentation masks through user-defined prompts that specify the target labels, imaging modalities, anatomical regions, and pathological conditions, facilitating scalable and high-quality data generation. This new paradigm of medical image synthesis enables seamless integration into diverse medical imaging workflows, enhancing both efficiency and accuracy. Extensive experiments show that MedSegFactory generates data of superior quality and usability, achieving competitive or state-of-the-art performance in 2D and 3D segmentation tasks while addressing data scarcity and regulatory constraints.

replace-cross MGT: Extending Virtual Try-Off to Multi-Garment Scenarios

Authors: Riza Velioglu, Petra Bevandic, Robin Chan, Barbara Hammer

Abstract: Computer vision is transforming fashion industry through Virtual Try-On (VTON) and Virtual Try-Off (VTOFF). VTON generates images of a person in a specified garment using a target photo and a standardized garment image, while a more challenging variant, Person-to-Person Virtual Try-On (p2p-VTON), uses a photo of another person wearing the garment. VTOFF, in contrast, extracts standardized garment images from photos of clothed individuals. We introduce Multi-Garment TryOffDiff (MGT), a diffusion-based VTOFF model capable of handling diverse garment types, including upper-body, lower-body, and dresses. MGT builds on a latent diffusion architecture with SigLIP-based image conditioning to capture garment characteristics such as shape, texture, and pattern. To address garment diversity, MGT incorporates class-specific embeddings, achieving state-of-the-art VTOFF results on VITON-HD and competitive performance on DressCode. When paired with VTON models, it further enhances p2p-VTON by reducing unwanted attribute transfer, such as skin tone, ensuring preservation of person-specific characteristics. Demo, code, and models are available at: https://rizavelioglu.github.io/tryoffdiff/

URLs: https://rizavelioglu.github.io/tryoffdiff/

replace-cross AI Safety Should Prioritize the Future of Work

Authors: Sanchaita Hazra, Bodhisattwa Prasad Majumder, Tuhin Chakrabarty

Abstract: Current efforts in AI safety prioritize filtering harmful content, preventing manipulation of human behavior, and eliminating existential risks in cybersecurity or biosecurity. While pressing, this narrow focus overlooks critical human-centric considerations that shape the long-term trajectory of a society. In this position paper, we identify the risks of overlooking the impact of AI on the future of work and recommend comprehensive transition support towards the evolution of meaningful labor with human agency. Through the lens of economic theories, we highlight the intertemporal impacts of AI on human livelihood and the structural changes in labor markets that exacerbate income inequality. Additionally, the closed-source approach of major stakeholders in AI development resembles rent-seeking behavior through exploiting resources, breeding mediocrity in creative labor, and monopolizing innovation. To address this, we argue in favor of a robust international copyright anatomy supported by implementing collective licensing that ensures fair compensation mechanisms for using data to train AI models. We strongly recommend a pro-worker framework of global AI governance to enhance shared prosperity and economic justice while reducing technical debt.

replace-cross One-Pass to Reason: Token Duplication and Block-Sparse Mask for Efficient Fine-Tuning on Multi-Turn Reasoning

Authors: Ritesh Goru, Shanay Mehta, Prateek Jain

Abstract: Fine-tuning Large Language Models (LLMs) on multi-turn reasoning datasets requires N (number of turns) separate forward passes per conversation due to reasoning token visibility constraints, as reasoning tokens for a turn are discarded in subsequent turns. We propose duplicating response tokens along with a custom attention mask to enable single-pass processing of entire conversations. We prove our method produces identical losses to the N-pass approach while reducing time complexity from $O\bigl(N^{3}\bigl)$ to $O\bigl(N^{2}\bigl)$ and maintaining the same memory complexity for a transformer based model. Our approach achieves significant training speedup while preserving accuracy. Our implementation is available online (https://github.com/devrev/One-Pass-to-Reason).

URLs: https://github.com/devrev/One-Pass-to-Reason).

replace-cross Red Teaming Large Language Models for Healthcare

Authors: Vahid Balazadeh, Michael Cooper, David Pellow, Atousa Assadi, Jennifer Bell, Mark Coatsworth, Kaivalya Deshpande, Jim Fackler, Gabriel Funingana, Spencer Gable-Cook, Anirudh Gangadhar, Abhishek Jaiswal, Sumanth Kaja, Christopher Khoury, Amrit Krishnan, Randy Lin, Kaden McKeen, Sara Naimimohasses, Khashayar Namdar, Aviraj Newatia, Allan Pang, Anshul Pattoo, Sameer Peesapati, Diana Prepelita, Bogdana Rakova, Saba Sadatamin, Rafael Schulman, Ajay Shah, Syed Azhar Shah, Syed Ahmar Shah, Babak Taati, Balagopal Unnikrishnan, I\~nigo Urteaga, Stephanie Williams, Rahul G Krishnan

Abstract: We present the design process and findings of the pre-conference workshop at the Machine Learning for Healthcare Conference (2024) entitled Red Teaming Large Language Models for Healthcare, which took place on August 15, 2024. Conference participants, comprising a mix of computational and clinical expertise, attempted to discover vulnerabilities -- realistic clinical prompts for which a large language model (LLM) outputs a response that could cause clinical harm. Red-teaming with clinicians enables the identification of LLM vulnerabilities that may not be recognised by LLM developers lacking clinical expertise. We report the vulnerabilities found, categorise them, and present the results of a replication study assessing the vulnerabilities across all LLMs provided.

replace-cross TS-SNN: Temporal Shift Module for Spiking Neural Networks

Authors: Kairong Yu, Tianqing Zhang, Qi Xu, Gang Pan, Hongwei Wang

Abstract: Spiking Neural Networks (SNNs) are increasingly recognized for their biological plausibility and energy efficiency, positioning them as strong alternatives to Artificial Neural Networks (ANNs) in neuromorphic computing applications. SNNs inherently process temporal information by leveraging the precise timing of spikes, but balancing temporal feature utilization with low energy consumption remains a challenge. In this work, we introduce Temporal Shift module for Spiking Neural Networks (TS-SNN), which incorporates a novel Temporal Shift (TS) module to integrate past, present, and future spike features within a single timestep via a simple yet effective shift operation. A residual combination method prevents information loss by integrating shifted and original features. The TS module is lightweight, requiring only one additional learnable parameter, and can be seamlessly integrated into existing architectures with minimal additional computational cost. TS-SNN achieves state-of-the-art performance on benchmarks like CIFAR-10 (96.72\%), CIFAR-100 (80.28\%), and ImageNet (70.61\%) with fewer timesteps, while maintaining low energy consumption. This work marks a significant step forward in developing efficient and accurate SNN architectures.

replace-cross Balancing Progress and Safety: A Novel Risk-Aware Objective for RL in Autonomous Driving

Authors: Ahmed Abouelazm, Jonas Michel, Helen Gremmelmaier, Tim Joseph, Philip Sch\"orner, J. Marius Z\"ollner

Abstract: Reinforcement Learning (RL) is a promising approach for achieving autonomous driving due to robust decision-making capabilities. RL learns a driving policy through trial and error in traffic scenarios, guided by a reward function that combines the driving objectives. The design of such reward function has received insufficient attention, yielding ill-defined rewards with various pitfalls. Safety, in particular, has long been regarded only as a penalty for collisions. This leaves the risks associated with actions leading up to a collision unaddressed, limiting the applicability of RL in real-world scenarios. To address these shortcomings, our work focuses on enhancing the reward formulation by defining a set of driving objectives and structuring them hierarchically. Furthermore, we discuss the formulation of these objectives in a normalized manner to transparently determine their contribution to the overall reward. Additionally, we introduce a novel risk-aware objective for various driving interactions based on a two-dimensional ellipsoid function and an extension of Responsibility-Sensitive Safety (RSS) concepts. We evaluate the efficacy of our proposed reward in unsignalized intersection scenarios with varying traffic densities. The approach decreases collision rates by 21\% on average compared to baseline rewards and consistently surpasses them in route progress and cumulative reward, demonstrating its capability to promote safer driving behaviors while maintaining high-performance levels.

replace-cross Boundary-Guided Trajectory Prediction for Road Aware and Physically Feasible Autonomous Driving

Authors: Ahmed Abouelazm, Mianzhi Liu, Christian Hubschneider, Yin Wu, Daniel Slieter, J. Marius Z\"ollner

Abstract: Accurate prediction of surrounding road users' trajectories is essential for safe and efficient autonomous driving. While deep learning models have improved performance, challenges remain in preventing off-road predictions and ensuring kinematic feasibility. Existing methods incorporate road-awareness modules and enforce kinematic constraints but lack plausibility guarantees and often introduce trade-offs in complexity and flexibility. This paper proposes a novel framework that formulates trajectory prediction as a constrained regression guided by permissible driving directions and their boundaries. Using the agent's current state and an HD map, our approach defines the valid boundaries and ensures on-road predictions by training the network to learn superimposed paths between left and right boundary polylines. To guarantee feasibility, the model predicts acceleration profiles that determine the vehicle's travel distance along these paths while adhering to kinematic constraints. We evaluate our approach on the Argoverse-2 dataset against the HPTR baseline. Our approach shows a slight decrease in benchmark metrics compared to HPTR but notably improves final displacement error and eliminates infeasible trajectories. Moreover, the proposed approach has superior generalization to less prevalent maneuvers and unseen out-of-distribution scenarios, reducing the off-road rate under adversarial attacks from 66% to just 1%. These results highlight the effectiveness of our approach in generating feasible and robust predictions.

replace-cross TPK: Trustworthy Trajectory Prediction Integrating Prior Knowledge For Interpretability and Kinematic Feasibility

Authors: Marius Baden, Ahmed Abouelazm, Christian Hubschneider, Yin Wu, Daniel Slieter, J. Marius Z\"ollner

Abstract: Trajectory prediction is crucial for autonomous driving, enabling vehicles to navigate safely by anticipating the movements of surrounding road users. However, current deep learning models often lack trustworthiness as their predictions can be physically infeasible and illogical to humans. To make predictions more trustworthy, recent research has incorporated prior knowledge, like the social force model for modeling interactions and kinematic models for physical realism. However, these approaches focus on priors that suit either vehicles or pedestrians and do not generalize to traffic with mixed agent classes. We propose incorporating interaction and kinematic priors of all agent classes--vehicles, pedestrians, and cyclists with class-specific interaction layers to capture agent behavioral differences. To improve the interpretability of the agent interactions, we introduce DG-SFM, a rule-based interaction importance score that guides the interaction layer. To ensure physically feasible predictions, we proposed suitable kinematic models for all agent classes with a novel pedestrian kinematic model. We benchmark our approach on the Argoverse 2 dataset, using the state-of-the-art transformer HPTR as our baseline. Experiments demonstrate that our method improves interaction interpretability, revealing a correlation between incorrect predictions and divergence from our interaction prior. Even though incorporating the kinematic models causes a slight decrease in accuracy, they eliminate infeasible trajectories found in the dataset and the baseline model. Thus, our approach fosters trust in trajectory prediction as its interaction reasoning is interpretable, and its predictions adhere to physics.

replace-cross Automatic Curriculum Learning for Driving Scenarios: Towards Robust and Efficient Reinforcement Learning

Authors: Ahmed Abouelazm, Tim Weinstein, Tim Joseph, Philip Sch\"orner, J. Marius Z\"ollner

Abstract: This paper addresses the challenges of training end-to-end autonomous driving agents using Reinforcement Learning (RL). RL agents are typically trained in a fixed set of scenarios and nominal behavior of surrounding road users in simulations, limiting their generalization and real-life deployment. While domain randomization offers a potential solution by randomly sampling driving scenarios, it frequently results in inefficient training and sub-optimal policies due to the high variance among training scenarios. To address these limitations, we propose an automatic curriculum learning framework that dynamically generates driving scenarios with adaptive complexity based on the agent's evolving capabilities. Unlike manually designed curricula that introduce expert bias and lack scalability, our framework incorporates a ``teacher'' that automatically generates and mutates driving scenarios based on their learning potential -- an agent-centric metric derived from the agent's current policy -- eliminating the need for expert design. The framework enhances training efficiency by excluding scenarios the agent has mastered or finds too challenging. We evaluate our framework in a reinforcement learning setting where the agent learns a driving policy from camera images. Comparative results against baseline methods, including fixed scenario training and domain randomization, demonstrate that our approach leads to enhanced generalization, achieving higher success rates: +9% in low traffic density, +21% in high traffic density, and faster convergence with fewer training steps. Our findings highlight the potential of ACL in improving the robustness and efficiency of RL-based autonomous driving agents.

replace-cross An Exploration of Default Images in Text-to-Image Generation

Authors: Hannu Simonen, Atte Kiviniemi, Jonas Oppenlaender

Abstract: In the creative practice of text-to-image generation (TTI), images are generated from text prompts. However, TTI models are trained to always yield an output, even if the prompt contains unknown terms. In this case, the model may generate what we call "default images": images that closely resemble each other across many unrelated prompts. We argue studying default images is valuable for designing better solutions for TTI and prompt engineering. In this paper, we provide the first investigation into default images on Midjourney, a popular image generator. We describe our systematic approach to create input prompts triggering default images, and present the results of our initial experiments and several small-scale ablation studies. We also report on a survey study investigating how default images affect user satisfaction. Our work lays the foundation for understanding default images in TTI and highlights challenges and future research directions.

replace-cross Deep Learning-Based Forecasting of Boarding Patient Counts to Address ED Overcrowding

Authors: Orhun Vural, Bunyamin Ozaydin, James Booth, Brittany F. Lindsey, Abdulaziz Ahmed

Abstract: This study presents a deep learning-based framework for predicting emergency department (ED) boarding counts six hours in advance using only operational and contextual data, without patient-level information. Data from ED tracking systems, inpatient census, weather, holidays, and local events were aggregated hourly and processed with comprehensive feature engineering. The mean ED boarding count was 28.7 (standard deviation = 11.2). Multiple deep learning models, including ResNetPlus, TSTPlus, and TSiTPlus, were trained and optimized using Optuna, with TSTPlus achieving the best results (mean absolute error = 4.30, mean squared error = 29.47, R2 = 0.79). The framework accurately forecasted boarding counts, including during extreme periods, and demonstrated that broader input features improve predictive accuracy. This approach supports proactive hospital management and offers a practical method for mitigating ED overcrowding.

replace-cross An Outlook on the Opportunities and Challenges of Multi-Agent AI Systems

Authors: Fangqiao Tian, An Luo, Jin Du, Xun Xian, Robert Specht, Ganghua Wang, Xuan Bi, Jiawei Zhou, Ashish Kundu, Jayanth Srinivasa, Charles Fleming, Rui Zhang, Zirui Liu, Mingyi Hong, Jie Ding

Abstract: A multi-agent AI system (MAS) is composed of multiple autonomous agents that interact, exchange information, and make decisions based on internal generative models. Recent advances in large language models and tool-using agents have made MAS increasingly practical in areas like scientific discovery and collaborative automation. However, key questions remain: When are MAS more effective than single-agent systems? What new safety risks arise from agent interactions? And how should we evaluate their reliability and structure? This paper outlines a formal framework for analyzing MAS, focusing on two core aspects: effectiveness and safety. We explore whether MAS truly improve robustness, adaptability, and performance, or merely repackage known techniques like ensemble learning. We also study how inter-agent dynamics may amplify or suppress system vulnerabilities. While MAS are relatively new to the signal processing community, we envision them as a powerful abstraction that extends classical tools like distributed estimation and sensor fusion to higher-level, policy-driven inference. Through experiments on data science automation, we highlight the potential of MAS to reshape how signal processing systems are designed and trusted.

replace-cross Grokking Beyond the Euclidean Norm of Model Parameters

Authors: Pascal Jr Tikeng Notsawo, Guillaume Dumas, Guillaume Rabusseau

Abstract: Grokking refers to a delayed generalization following overfitting when optimizing artificial neural networks with gradient-based methods. In this work, we demonstrate that grokking can be induced by regularization, either explicit or implicit. More precisely, we show that when there exists a model with a property $P$ (e.g., sparse or low-rank weights) that generalizes on the problem of interest, gradient descent with a small but non-zero regularization of $P$ (e.g., $\ell_1$ or nuclear norm regularization) results in grokking. This extends previous work showing that small non-zero weight decay induces grokking. Moreover, our analysis shows that over-parameterization by adding depth makes it possible to grok or ungrok without explicitly using regularization, which is impossible in shallow cases. We further show that the $\ell_2$ norm is not a reliable proxy for generalization when the model is regularized toward a different property $P$, as the $\ell_2$ norm grows in many cases where no weight decay is used, but the model generalizes anyway. We also show that grokking can be amplified solely through data selection, with any other hyperparameter fixed.

replace-cross Language-Grounded Hierarchical Planning and Execution with Multi-Robot 3D Scene Graphs

Authors: Jared Strader, Aaron Ray, Jacob Arkin, Mason B. Peterson, Yun Chang, Nathan Hughes, Christopher Bradley, Yi Xuan Jia, Carlos Nieto-Granda, Rajat Talak, Chuchu Fan, Luca Carlone, Jonathan P. How, Nicholas Roy

Abstract: In this paper, we introduce a multi-robot system that integrates mapping, localization, and task and motion planning (TAMP) enabled by 3D scene graphs to execute complex instructions expressed in natural language. Our system builds a shared 3D scene graph incorporating an open-set object-based map, which is leveraged for multi-robot 3D scene graph fusion. This representation supports real-time, view-invariant relocalization (via the object-based map) and planning (via the 3D scene graph), allowing a team of robots to reason about their surroundings and execute complex tasks. Additionally, we introduce a planning approach that translates operator intent into Planning Domain Definition Language (PDDL) goals using a Large Language Model (LLM) by leveraging context from the shared 3D scene graph and robot capabilities. We provide an experimental assessment of the performance of our system on real-world tasks in large-scale, outdoor environments. A supplementary video is available at https://youtu.be/8xbGGOLfLAY.

URLs: https://youtu.be/8xbGGOLfLAY.

replace-cross Upgrade or Switch: Do We Need a Next-Gen Trusted Architecture for the Internet of AI Agents?

Authors: Ramesh Raskar, Pradyumna Chari, Jared James Grogan, Mahesh Lambe, Robert Lincourt, Raghu Bala, Aditi Joshi, Abhishek Singh, Ayush Chopra, Rajesh Ranjan, Shailja Gupta, Dimitris Stripelis, Maria Gorskikh, Sichao Wang

Abstract: The emerging Internet of AI Agents challenges existing web infrastructure designed for human-scale, reactive interactions. Unlike traditional web resources, autonomous AI agents initiate actions, maintain persistent state, spawn sub-agents, and negotiate directly with peers: demanding millisecond-level discovery, instant credential revocation, and cryptographic behavioral proofs that exceed current DNS/PKI capabilities. This paper analyzes whether to upgrade existing infrastructure or implement purpose-built index architectures for autonomous agents. We identify critical failure points: DNS propagation (24-48 hours vs. required milliseconds), certificate revocation unable to scale to trillions of entities, and IPv4/IPv6 addressing inadequate for agent-scale routing. We evaluate three approaches: (1) Upgrade paths, (2) Switch options, (3) Hybrid index/registries. Drawing parallels to dialup-to-broadband transitions, we find that agent requirements constitute qualitative, and not incremental, changes. While upgrades offer compatibility and faster deployment, clean-slate solutions provide better performance but require longer for adoption. Our analysis suggests hybrid approaches will emerge, with centralized indexes for critical agents and federated meshes for specialized use cases.

replace-cross On the Necessity of Output Distribution Reweighting for Effective Class Unlearning

Authors: Yian Wang, Ali Ebrahimpour-Boroojeny, Hari Sundaram

Abstract: In this work, we introduce an output-reweighting unlearning method, RWFT, a lightweight technique that erases an entire class from a trained classifier without full retraining. Forgetting specific classes from trained models is essential for enforcing user deletion rights and mitigating harmful or biased predictions. The full retraining is costly and existing unlearning methods fail to replicate the behavior of the retrained models when predicting samples from the unlearned class. We prove this failure by designing a variant of membership inference attacks, MIA-NN that successfully reveals the unlearned class for any of these methods. We propose a simple redistribution of the probability mass for the prediction on the samples in the forgotten class which is robust to MIA-NN. We also introduce a new metric based on the total variation (TV) distance of the prediction probabilities to quantify residual leakage to prevent future methods from susceptibility to the new attack. Through extensive experiments with state of the art baselines in machine unlearning, we show that our approach matches the results of full retraining in both metrics used for evaluation by prior work and the new metric we propose in this work. Compare to state-of-the-art methods, we gain 2.79% in previously used metrics and 111.45% in our new TV-based metric over the best existing method.

replace-cross Lighting the Night with Generative Artificial Intelligence

Authors: Tingting Zhou, Feng Zhang, Haoyang Fu, Baoxiang Pan, Renhe Zhang, Feng Lu, Zhixin Yang

Abstract: The visible light reflectance data from geostationary satellites is crucial for meteorological observations and plays an important role in weather monitoring and forecasting. However, due to the lack of visible light at night, it is impossible to conduct continuous all-day weather observations using visible light reflectance data. This study pioneers the use of generative diffusion models to address this limitation. Based on the multi-band thermal infrared brightness temperature data from the Advanced Geostationary Radiation Imager (AGRI) onboard the Fengyun-4B (FY4B) geostationary satellite, we developed a high-precision visible light reflectance generative model, called Reflectance Diffusion (RefDiff), which enables 0.47~\mu\mathrm{m}, 0.65~\mu\mathrm{m}, and 0.825~\mu\mathrm{m} bands visible light reflectance generation at night. Compared to the classical models, RefDiff not only significantly improves accuracy through ensemble averaging but also provides uncertainty estimation. Specifically, the SSIM index of RefDiff can reach 0.90, with particularly significant improvements in areas with complex cloud structures and thick clouds. The model's nighttime generation capability was validated using VIIRS nighttime product, demonstrating comparable performance to its daytime counterpart. In summary, this research has made substantial progress in the ability to generate visible light reflectance at night, with the potential to expand the application of nighttime visible light data.

replace-cross Distributional Soft Actor-Critic with Diffusion Policy

Authors: Tong Liu, Yinuo Wang, Xujie Song, Wenjun Zou, Liangfa Chen, Likun Wang, Bin Shuai, Jingliang Duan, Shengbo Eben Li

Abstract: Reinforcement learning has been proven to be highly effective in handling complex control tasks. Traditional methods typically use unimodal distributions, such as Gaussian distributions, to model the output of value distributions. However, unimodal distribution often and easily causes bias in value function estimation, leading to poor algorithm performance. This paper proposes a distributional reinforcement learning algorithm called DSAC-D (Distributed Soft Actor Critic with Diffusion Policy) to address the challenges of estimating bias in value functions and obtaining multimodal policy representations. A multimodal distributional policy iteration framework that can converge to the optimal policy was established by introducing policy entropy and value distribution function. A diffusion value network that can accurately characterize the distribution of multi peaks was constructed by generating a set of reward samples through reverse sampling using a diffusion model. Based on this, a distributional reinforcement learning algorithm with dual diffusion of the value network and the policy network was derived. MuJoCo testing tasks demonstrate that the proposed algorithm not only learns multimodal policy, but also achieves state-of-the-art (SOTA) performance in all 9 control tasks, with significant suppression of estimation bias and total average return improvement of over 10% compared to existing mainstream algorithms. The results of real vehicle testing show that DSAC-D can accurately characterize the multimodal distribution of different driving styles, and the diffusion policy network can characterize multimodal trajectories.

replace-cross Hita: Holistic Tokenizer for Autoregressive Image Generation

Authors: Anlin Zheng, Haochen Wang, Yucheng Zhao, Weipeng Deng, Tiancai Wang, Xiangyu Zhang, Xiaojuan Qi

Abstract: Vanilla autoregressive image generation models generate visual tokens step-by-step, limiting their ability to capture holistic relationships among token sequences. Moreover, because most visual tokenizers map local image patches into latent tokens, global information is limited. To address this, we introduce \textit{Hita}, a novel image tokenizer for autoregressive (AR) image generation. It introduces a holistic-to-local tokenization scheme with learnable holistic queries and local patch tokens. Hita incorporates two key strategies to better align with the AR generation process: 1) {arranging} a sequential structure with holistic tokens at the beginning, followed by patch-level tokens, and using causal attention to maintain awareness of previous tokens; and 2) adopting a lightweight fusion module before feeding the de-quantized tokens into the decoder to control information flow and prioritize holistic tokens. Extensive experiments show that Hita accelerates the training speed of AR generators and outperforms those trained with vanilla tokenizers, achieving \textbf{2.59 FID} and \textbf{281.9 IS} on the ImageNet benchmark. Detailed analysis of the holistic representation highlights its ability to capture global image properties, such as textures, materials, and shapes. Additionally, Hita also demonstrates effectiveness in zero-shot style transfer and image in-painting. The code is available at \href{https://github.com/CVMI-Lab/Hita}{https://github.com/CVMI-Lab/Hita}.

URLs: https://github.com/CVMI-Lab/Hita, https://github.com/CVMI-Lab/Hita

replace-cross USAD: End-to-End Human Activity Recognition via Diffusion Model with Spatiotemporal Attention

Authors: Hang Xiao, Ying Yu, Jiarui Li, Zhifan Yang, Haotian Tang, Hanyu Liu, Chao Li

Abstract: The primary objective of human activity recognition (HAR) is to infer ongoing human actions from sensor data, a task that finds broad applications in health monitoring, safety protection, and sports analysis. Despite proliferating research, HAR still faces key challenges, including the scarcity of labeled samples for rare activities, insufficient extraction of high-level features, and suboptimal model performance on lightweight devices. To address these issues, this paper proposes a comprehensive optimization approach centered on multi-attention interaction mechanisms. First, an unsupervised, statistics-guided diffusion model is employed to perform data augmentation, thereby alleviating the problems of labeled data scarcity and severe class imbalance. Second, a multi-branch spatio-temporal interaction network is designed, which captures multi-scale features of sequential data through parallel residual branches with 3*3, 5*5, and 7*7 convolutional kernels. Simultaneously, temporal attention mechanisms are incorporated to identify critical time points, while spatial attention enhances inter-sensor interactions. A cross-branch feature fusion unit is further introduced to improve the overall feature representation capability. Finally, an adaptive multi-loss function fusion strategy is integrated, allowing for dynamic adjustment of loss weights and overall model optimization. Experimental results on three public datasets, WISDM, PAMAP2, and OPPORTUNITY, demonstrate that the proposed unsupervised data augmentation spatio-temporal attention diffusion network (USAD) achieves accuracies of 98.84%, 93.81%, and 80.92% respectively, significantly outperforming existing approaches. Furthermore, practical deployment on embedded devices verifies the efficiency and feasibility of the proposed method.

replace-cross The role of gain neuromodulation in layer-5 pyramidal neurons

Authors: Alejandro Rodriguez-Garcia, Christopher J. Whyte, Brandon R. Munn, Jie Mei, James M. Shine, Srikanth Ramaswamy

Abstract: Biological and artificial learning systems alike confront the plasticity-stability dilemma. In the brain, neuromodulators such as acetylcholine and noradrenaline relieve this tension by tuning neuronal gain and inhibitory gating, balancing segregation and integration of circuits. Fed by dense cholinergic and noradrenergic projections from the ascending arousal system, layer-5 pyramidal neurons in the cerebral cortex offer a relevant substrate for understanding these dynamics. When distal dendritic signals coincide with back-propagating action potentials, calcium plateaus turn a single somatic spike into a high-gain burst, and interneuron inhibition sculpts the output. These properties make layer-5 cells gain-tunable amplifiers that translate neuromodulatory cues into flexible cortical activity. To capture this mechanism we developed a two-compartment Izhikevich model for pyramidal neurons and single-compartment somatostatin (SOM) and parvalbumin (PV) interneurons, linked by Gaussian connectivity and spike-timing-dependent plasticity (STDP). The soma and apical dendrite are so coupled that somatic spikes back-propagate, while dendritic plateaus can switch the soma from regular firing to bursting by shifting reset and adaptation variables. We show that stronger dendritic drive or tighter coupling raise gain by increasing the likelihood of calcium-triggered somatic bursts. In contrast, dendritic-targeted inhibition suppresses gain, while somatic-targeted inhibition raises the firing threshold of neighboring neurons, thus gating neurons output. Notably, bursting accelerates STDP, supporting rapid synaptic reconfiguration and flexibility. This suggests that brief gain pulses driven by neuromodulators could serve as an adaptive two-timescale optimization mechanism, effectively modulating the synaptic weight updates.

replace-cross EmissionNet: Air Quality Pollution Forecasting for Agriculture

Authors: Prady Saligram, Tanvir Bhathal

Abstract: Air pollution from agricultural emissions is a significant yet often overlooked contributor to environmental and public health challenges. Traditional air quality forecasting models rely on physics-based approaches, which struggle to capture complex, nonlinear pollutant interactions. In this work, we explore forecasting N$_2$O agricultural emissions through evaluating popular architectures, and proposing two novel deep learning architectures, EmissionNet (ENV) and EmissionNet-Transformer (ENT). These models leverage convolutional and transformer-based architectures to extract spatial-temporal dependencies from high-resolution emissions data

replace-cross Hyperspectral Anomaly Detection Methods: A Survey and Comparative Study

Authors: Aayushma Pant, Arbind Agrahari Baniya, Tsz-Kwan Lee, Sunil Aryal

Abstract: Hyperspectral images are high-dimensional datasets comprising hundreds of contiguous spectral bands, enabling detailed analysis of materials and surfaces. Hyperspectral anomaly detection (HAD) refers to the technique of identifying and locating anomalous targets in such data without prior information about a hyperspectral scene or target spectrum. This technology has seen rapid advancements in recent years, with applications in agriculture, defence, military surveillance, and environmental monitoring. Despite this significant progress, existing HAD methods continue to face challenges such as high computational complexity, sensitivity to noise, and limited generalisation across diverse datasets. This study presents a comprehensive comparison of various HAD techniques, categorising them into statistical models, representation-based methods, classical machine learning approaches, and deep learning models. We evaluated these methods across 17 benchmarking datasets using different performance metrics, such as ROC, AUC, and separability map to analyse detection accuracy, computational efficiency, their strengths, limitations, and directions for future research. Our findings highlight that deep learning models achieved the highest detection accuracy, while statistical models demonstrated exceptional speed across all datasets. This survey aims to provide valuable insights for researchers and practitioners working to advance the field of hyperspectral anomaly detection methods.

replace-cross Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

Authors: Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, Luke Marris, Sam Petulla, Colin Gaffney, Asaf Aharoni, Nathan Lintz, Tiago Cardal Pais, Henrik Jacobsson, Idan Szpektor, Nan-Jiang Jiang, Krishna Haridasan, Ahmed Omran, Nikunj Saunshi, Dara Bahri, Gaurav Mishra, Eric Chu, Toby Boyd, Brad Hekman, Aaron Parisi, Chaoyi Zhang, Kornraphop Kawintiranon, Tania Bedrax-Weiss, Oliver Wang, Ya Xu, Ollie Purkiss, Uri Mendlovic, Ila\"i Deutel, Nam Nguyen, Adam Langley, Flip Korn, Lucia Rossazza, Alexandre Ram\'e, Sagar Waghmare, Helen Miller, Vaishakh Keshava, Ying Jian, Xiaofan Zhang, Raluca Ada Popa, Kedar Dhamdhere, Bla\v{z} Bratani\v{c}, Kyuyeun Kim, Terry Koo, Ferran Alet, Yi-ting Chen, Arsha Nagrani, Hannah Muckenhirn, Zhiyuan Zhang, Corbin Quick, Filip Paveti\'c, Duc Dung Nguyen, Joao Carreira, Michael Elabd, Haroon Qureshi, Fabian Mentzer, Yao-Yuan Yang, Danielle Eisenbud, Anmol Gulati, Ellie Talius, Eric Ni, Sahra Ghalebikesabi, Edouard Yvinec, Alaa Saade, Thatcher Ulrich, Lorenzo Blanco, Dan A. Calian, Muhuan Huang, A\"aron van den Oord, Naman Goyal, Terry Chen, Praynaa Rawlani, Christian Schallhart, Swachhand Lokhande, Xianghong Luo, Jyn Shan, Ceslee Montgomery, Victoria Krakovna, Federico Piccinini, Omer Barak, Jingyu Cui, Yiling Jia, Mikhail Dektiarev, Alexey Kolganov, Shiyu Huang, Zhe Chen, Xingyu Wang, Jessica Austin, Peter de Boursac, Evgeny Sluzhaev, Frank Ding, Huijian Li, Surya Bhupatiraju, Mohit Agarwal, S{\l}awek Kwasiborski, Paramjit Sandhu, Patrick Siegler, Ahmet Iscen, Eyal Ben-David, Shiraz Butt, Miltos Allamanis, Seth Benjamin, Robert Busa-Fekete, Felix Hernandez-Campos, Sasha Goldshtein, Matt Dibb, Weiyang Zhang, Annie Marsden, Carey Radebaugh, Stephen Roller, Abhishek Nayyar, Jacob Austin, Tayfun Terzi, Bhargav Kanagal Shamanna, Pete Shaw, Aayush Singh, Florian Luisier, Artur Mendon\c{c}a, Vaibhav Aggarwal, Larisa Markeeva, Claudio Fantacci, Sergey Brin, HyunJeong Choe, Guanyu Wang, Hartwig Adam, Avigail Dabush, Tatsuya Kiyono, Eyal Marcus, Jeremy Cole, Theophane Weber, Hongrae Lee, Ronny Huang, Alex Muzio, Leandro Kieliger, Maigo Le, Courtney Biles, Long Le, Archit Sharma, Chengrun Yang, Avery Lamp, Dave Dopson, Nate Hurley, Katrina Xinyi Xu, Zhihao Shan, Shuang Song, Jiewen Tan, Alexandre Senges, George Zhang, Chong You, Yennie Jun, David Raposo, Susanna Ricco, Xuan Yang, Weijie Chen, Prakhar Gupta, Arthur Szlam, Kevin Villela, Chun-Sung Ferng, Daniel Kasenberg, Chen Liang, Rui Zhu, Arunachalam Narayanaswamy, Florence Perot, Paul Pucciarelli, Anna Shekhawat, Alexey Stern, Rishikesh Ingale, Stefani Karp, Sanaz Bahargam, Adrian Goedeckemeyer, Jie Han, Sicheng Li, Andrea Tacchetti, Dian Yu, Abhishek Chakladar, Zhiying Zhang, Mona El Mahdy, Xu Gao, Dale Johnson, Samrat Phatale, AJ Piergiovanni, Hyeontaek Lim, Clement Farabet, Carl Lebsack, Theo Guidroz, John Blitzer, Nico Duduta, David Madras, Steve Li, Daniel von Dincklage, Xin Li, Mahdis Mahdieh, George Tucker, Ganesh Jawahar, Owen Xiao, Danny Tarlow, Robert Geirhos, Noam Velan, Daniel Vlasic, Kalesha Bullard, SK Park, Nishesh Gupta, Kellie Webster, Ayal Hitron, Jieming Mao, Julian Eisenschlos, Laurel Prince, Nina D'Souza, Kelvin Zheng, Sara Nasso, Gabriela Botea, Carl Doersch, Caglar Unlu, Chris Alberti, Alexey Svyatkovskiy, Ankita Goel, Krzysztof Choromanski, Pan-Pan Jiang, Richard Nguyen, Four Flynn, Daria \'Curko, Peter Chen, Nicholas Roth, Kieran Milan, Caleb Habtegebriel, Shashi Narayan, Michael Moffitt, Jake Marcus, Thomas Anthony, Brendan McMahan, Gowoon Cheon, Ruibo Liu, Megan Barnes, Lukasz Lew, Rebeca Santamaria-Fernandez, Mayank Upadhyay, Arjun Akula, Arnar Mar Hrafnkelsson, Alvaro Caceres, Andrew Bunner, Michal Sokolik, Subha Puttagunta, Lawrence Moore, Berivan Isik, Jay Hartford, Lawrence Chan, Pradeep Shenoy, Dan Holtmann-Rice, Jane Park, Fabio Viola, Alex Salcianu, Sujeevan Rajayogam, Ian Stewart-Binks, Zelin Wu, Richard Everett, Xi Xiong, Pierre-Antoine Manzagol, Gary Leung, Carl Saroufim, Bo Pang, Dawid Wegner, George Papamakarios, Jennimaria Palomaki, Helena Pankov, Guangda Lai, Guilherme Tubone, Shubin Zhao, Theofilos Strinopoulos, Seth Neel, Mingqiu Wang, Joe Kelley, Li Li, Pingmei Xu, Anitha Vijayakumar, Andrea D'olimpio, Omer Levy, Massimo Nicosia, Grigory Rozhdestvenskiy, Ni Lao, Sirui Xie, Yash Katariya, Jon Simon, Sanjiv Kumar, Florian Hartmann, Michael Kilgore, Jinhyuk Lee, Aroma Mahendru, Roman Ring, Tom Hennigan, Fiona Lang, Colin Cherry, David Steiner, Dawsen Hwang, Ray Smith, Pidong Wang, Jeremy Chen, Ming-Hsuan Yang, Sam Kwei, Philippe Schlattner, Donnie Kim, Ganesh Poomal Girirajan, Nikola Momchev, Ayushi Agarwal, Xingyi Zhou, Ilkin Safarli, Zachary Garrett, AJ Pierigiovanni, Sarthak Jauhari, Alif Raditya Rochman, Shikhar Vashishth, Quan Yuan, Christof Angermueller, Jon Blanton, Xinying Song, Nitesh Bharadwaj Gundavarapu, Thi Avrahami, Maxine Deines, Subhrajit Roy, Manish Gupta, Christopher Semturs, Shobha Vasudevan, Aditya Srikanth Veerubhotla, Shriya Sharma, Josh Jacob, Zhen Yang, Andreas Terzis, Dan Karliner, Auriel Wright, Tania Rojas-Esponda, Ashley Brown, Abhijit Guha Roy, Pawan Dogra, Andrei Kapishnikov, Peter Young, Wendy Kan, Vinodh Kumar Rajendran, Maria Ivanova, Salil Deshmukh, Chia-Hua Ho, Mike Kwong, Stav Ginzburg, Annie Louis, KP Sawhney, Slav Petrov, Jing Xie, Yunfei Bai, Georgi Stoyanov, Alex Fabrikant, Rajesh Jayaram, Yuqi Li, Joe Heyward, Justin Gilmer, Yaqing Wang, Radu Soricut, Luyang Liu, Qingnan Duan, Jamie Hayes, Maura O'Brien, Gaurav Singh Tomar, Sivan Eiger, Bahar Fatemi, Jeffrey Hui, Catarina Barros, Adaeze Chukwuka, Alena Butryna, Saksham Thakur, Austin Huang, Zhufeng Pan, Haotian Tang, Serkan Cabi, Tulsee Doshi, Michiel Bakker, Sumit Bagri, Ruy Ley-Wild, Adam Lelkes, Jennie Lees, Patrick Kane, David Greene, Shimu Wu, J\"org Bornschein, Gabriela Surita, Sarah Hodkinson, Fangtao Li, Chris Hidey, S\'ebastien Pereira, Sean Ammirati, Phillip Lippe, Adam Kraft, Pu Han, Sebastian Gerlach, Zifeng Wang, Liviu Panait, Feng Han, Brian Farris, Yingying Bi, Hannah DeBalsi, Miaosen Wang, Gladys Tyen, James Cohan, Susan Zhang, Jarred Barber, Da-Woon Chung, Jaeyoun Kim, Markus Kunesch, Steven Pecht, Nami Akazawa, Abe Friesen, James Lyon, Ali Eslami, Junru Wu, Jie Tan, Yue Song, Ravi Kumar, Chris Welty, Ilia Akolzin, Gena Gibson, Sean Augenstein, Arjun Pillai, Nancy Yuen, Du Phan, Xin Wang, Iain Barr, Heiga Zen, Nan Hua, Casper Liu, Jilei Jerry Wang, Tanuj Bhatia, Hao Xu, Oded Elyada, Pushmeet Kohli, Mirek Ol\v{s}\'ak, Ke Chen, Azalia Mirhoseini, Noam Shazeer, Shoshana Jakobovits, Maggie Tran, Nolan Ramsden, Tarun Bharti, Fred Alcober, Yunjie Li, Shilpa Shetty, Jing Chen, Dmitry Kalashnikov, Megha Nawhal, Sercan Arik, Hanwen Chen, Michiel Blokzijl, Shubham Gupta, James Rubin, Rigel Swavely, Sophie Bridgers, Ian Gemp, Chen Su, Arun Suggala, Juliette Pluto, Mary Cassin, Alain Vaucher, Kaiyang Ji, Jiahao Cai, Andrew Audibert, Animesh Sinha, David Tian, Efrat Farkash, Amy Hua, Jilin Chen, Duc-Hieu Tran, Edward Loper, Nicole Brichtova, Lara McConnaughey, Ballie Sandhu, Robert Leland, Doug DeCarlo, Andrew Over, James Huang, Xing Wu, Connie Fan, Eric Li, Yun Lei, Deepak Sharma, Cosmin Paduraru, Luo Yu, Matko Bo\v{s}njak, Phuong Dao, Min Choi, Sneha Kudugunta, Jakub Adamek, Carlos Gu\'ia, Ali Khodaei, Jie Feng, Wenjun Zeng, David Welling, Sandeep Tata, Christina Butterfield, Andrey Vlasov, Seliem El-Sayed, Swaroop Mishra, Tara Sainath, Shentao Yang, RJ Skerry-Ryan, Jeremy Shar, Robert Berry, Arunkumar Rajendran, Arun Kandoor, Andrea Burns, Deepali Jain, Tom Stone, Wonpyo Park, Shibo Wang, Albin Cassirer, Guohui Wang, Hayato Kobayashi, Sergey Rogulenko, Vineetha Govindaraj, Miko{\l}aj Rybi\'nski, Nadav Olmert, Colin Evans, Po-Sen Huang, Kelvin Xu, Premal Shah, Terry Thurk, Caitlin Sikora, Mu Cai, Jin Xie, Elahe Dabir, Saloni Shah, Norbert Kalb, Carrie Zhang, Shruthi Prabhakara, Amit Sabne, Artiom Myaskovsky, Vikas Raunak, Blanca Huergo, Behnam Neyshabur, Jon Clark, Ye Zhang, Shankar Krishnan, Eden Cohen, Dinesh Tewari, James Lottes, Yumeya Yamamori, Hui Elena Li, Mohamed Elhawaty, Ada Maksutaj Oflazer, Adri\`a Recasens, Sheryl Luo, Duy Nguyen, Taylor Bos, Kalyan Andra, Ana Salazar, Ed Chi, Jeongwoo Ko, Matt Ginsberg, Anders Andreassen, Anian Ruoss, Todor Davchev, Elnaz Davoodi, Chenxi Liu, Min Kim, Santiago Ontanon, Chi Ming To, Dawei Jia, Rosemary Ke, Jing Wang, Anna Korsun, Moran Ambar, Ilya Kornakov, Irene Giannoumis, Toni Creswell, Denny Zhou, Yi Su, Ishaan Watts, Aleksandr Zaks, Evgenii Eltyshev, Ziqiang Feng, Sidharth Mudgal, Alex Kaskasoli, Juliette Love, Kingshuk Dasgupta, Sam Shleifer, Richard Green, Sungyong Seo, Chansoo Lee, Dale Webster, Prakash Shroff, Ganna Raboshchuk, Isabel Leal, James Manyika, Sofia Erell, Daniel Murphy, Zhisheng Xiao, Anton Bulyenov, Julian Walker, Mark Collier, Matej Kastelic, Nelson George, Sushant Prakash, Sailesh Sidhwani, Alexey Frolov, Steven Hansen, Petko Georgiev, Tiberiu Sosea, Chris Apps, Aishwarya Kamath, David Reid, Emma Cooney, Charlotte Magister, Oriana Riva, Alec Go, Pu-Chin Chen, Sebastian Krause, Nir Levine, Marco Fornoni, Ilya Figotin, Nick Roy, Parsa Mahmoudieh, Vladimir Magay, Mukundan Madhavan, Jin Miao, Jianmo Ni, Yasuhisa Fujii, Ian Chou, George Scrivener, Zak Tsai, Siobhan Mcloughlin, Jeremy Selier, Sandra Lefdal, Jeffrey Zhao, Abhijit Karmarkar, Kushal Chauhan, Shivanker Goel, Zhaoyi Zhang, Vihan Jain, Parisa Haghani, Mostafa Dehghani, Jacob Scott, Erin Farnese, Anastasija Ili\'c, Steven Baker, Julia Pawar, Li Zhong, Josh Camp, Yoel Zeldes, Shravya Shetty, Anand Iyer, V\'it List\'ik, Jiaxian Guo, Luming Tang, Mark Geller, Simon Bucher, Yifan Ding, Hongzhi Shi, Carrie Muir, Dominik Grewe, Ramy Eskander, Octavio Ponce, Boqing Gong, Derek Gasaway, Samira Khan, Umang Gupta, Angelos Filos, Weicheng Kuo, Klemen Kloboves, Jennifer Beattie, Christian Wright, Leon Li, Alicia Jin, Sandeep Mariserla, Miteyan Patel, Jens Heitkaemper, Dilip Krishnan, Vivek Sharma, David Bieber, Christian Frank, John Lambert, Paul Caron, Martin Polacek, Mai Gim\'enez, Himadri Choudhury, Xing Yu, Sasan Tavakkol, Arun Ahuja, Franz Och, Rodolphe Jenatton, Wojtek Skut, Bryan Richter, David Gaddy, Andy Ly, Misha Bilenko, Megh Umekar, Ethan Liang, Martin Sevenich, Mandar Joshi, Hassan Mansoor, Rebecca Lin, Sumit Sanghai, Abhimanyu Singh, Xiaowei Li, Sudheendra Vijayanarasimhan, Zaheer Abbas, Yonatan Bitton, Hansa Srinivasan, Manish Reddy Vuyyuru, Alexander Fr\"ommgen, Yanhua Sun, Ralph Leith, Alfonso Casta\~no, DJ Strouse, Le Yan, Austin Kyker, Satish Kambala, Mary Jasarevic, Thibault Sellam, Chao Jia, Alexander Pritzel, Raghavender R, Huizhong Chen, Natalie Clay, Sudeep Gandhe, Sean Kirmani, Sayna Ebrahimi, Hannah Kirkwood, Jonathan Mallinson, Chao Wang, Adnan Ozturel, Kuo Lin, Shyam Upadhyay, Vincent Cohen-Addad, Sean Purser-haskell, Yichong Xu, Ebrahim Songhori, Babi Seal, Alberto Magni, Almog Gueta, Tingting Zou, Guru Guruganesh, Thais Kagohara, Hung Nguyen, Khalid Salama, Alejandro Cruzado Ruiz, Justin Frye, Zhenkai Zhu, Matthias Lochbrunner, Simon Osindero, Wentao Yuan, Lisa Lee, Aman Prasad, Lam Nguyen Thiet, Daniele Calandriello, Victor Stone, Qixuan Feng, Han Ke, Maria Voitovich, Geta Sampemane, Lewis Chiang, Ling Wu, Alexander Bykovsky, Matt Young, Luke Vilnis, Ishita Dasgupta, Aditya Chawla, Qin Cao, Bowen Liang, Daniel Toyama, Szabolcs Payrits, Anca Stefanoiu, Dimitrios Vytiniotis, Ankesh Anand, Tianxiao Shen, Blagoj Mitrevski, Michael Tschannen, Sreenivas Gollapudi, Aishwarya P S, Jos\'e Leal, Zhe Shen, Han Fu, Wei Wang, Arvind Kannan, Doron Kukliansky, Sergey Yaroshenko, Svetlana Grant, Umesh Telang, David Wood, Alexandra Chronopoulou, Alexandru \c{T}ifrea, Tao Zhou, Tony Tu\'\^an Nguy\~\^en, Muge Ersoy, Anima Singh, Meiyan Xie, Emanuel Taropa, Woohyun Han, Eirikur Agustsson, Andrei Sozanschi, Hui Peng, Alex Chen, Yoel Drori, Efren Robles, Yang Gao, Xerxes Dotiwalla, Ying Chen, Anudhyan Boral, Alexei Bendebury, John Nham, Chris Tar, Luis Castro, Jiepu Jiang, Canoee Liu, Felix Halim, Jinoo Baek, Andy Wan, Jeremiah Liu, Yuan Cao, Shengyang Dai, Trilok Acharya, Ruoxi Sun, Fuzhao Xue, Saket Joshi, Morgane Lustman, Yongqin Xian, Rishabh Joshi, Deep Karkhanis, Nora Kassner, Jamie Hall, Xiangzhuo Ding, Gan Song, Gang Li, Chen Zhu, Yana Kulizhskaya, Bin Ni, Alexey Vlaskin, Solomon Demmessie, Lucio Dery, Salah Zaiem, Yanping Huang, Cindy Fan, Felix Gimeno, Ananth Balashankar, Koji Kojima, Hagai Taitelbaum, Maya Meng, Dero Gharibian, Sahil Singla, Wei Chen, Ambrose Slone, Guanjie Chen, Sujee Rajayogam, Max Schumacher, Suyog Kotecha, Rory Blevins, Qifei Wang, Mor Hazan Taege, Alex Morris, Xin Liu, Fayaz Jamil, Richard Zhang, Pratik Joshi, Ben Ingram, Tyler Liechty, Ahmed Eleryan, Scott Baird, Alex Grills, Gagan Bansal, Shan Han, Kiran Yalasangi, Shawn Xu, Majd Al Merey, Isabel Gao, Felix Weissenberger, Igor Karpov, Robert Riachi, Ankit Anand, Gautam Prasad, Kay Lamerigts, Reid Hayes, Jamie Rogers, Mandy Guo, Ashish Shenoy, Qiong Q Hu, Kyle He, Yuchen Liu, Polina Zablotskaia, Sagar Gubbi, Yifan Chang, Jay Pavagadhi, Kristian Kjems, Archita Vadali, Diego Machado, Yeqing Li, Renshen Wang, Dipankar Ghosh, Aahil Mehta, Dana Alon, George Polovets, Alessio Tonioni, Nate Kushman, Joel D'sa, Lin Zhuo, Allen Wu, Rohin Shah, John Youssef, Jiayu Ye, Justin Snyder, Karel Lenc, Senaka Buthpitiya, Matthew Tung, Jichuan Chang, Tao Chen, David Saxton, Jenny Lee, Lydia Lihui Zhang, James Qin, Prabakar Radhakrishnan, Maxwell Chen, Piotr Ambroszczyk, Metin Toksoz-Exley, Yan Zhong, Nitzan Katz, Brendan O'Donoghue, Tamara von Glehn, Adi Gerzi Rosenthal, Aga \'Swietlik, Xiaokai Zhao, Nick Fernando, Jinliang Wei, Jieru Mei, Sergei Vassilvitskii, Diego Cedillo, Pranjal Awasthi, Hui Zheng, Koray Kavukcuoglu, Itay Laish, Joseph Pagadora, Marc Brockschmidt, Christopher A. Choquette-Choo, Arunkumar Byravan, Yifeng Lu, Xu Chen, Mia Chen, Kenton Lee, Rama Pasumarthi, Sijal Bhatnagar, Aditya Shah, Qiyin Wu, Zhuoyuan Chen, Zack Nado, Bartek Perz, Zixuan Jiang, David Kao, Ganesh Mallya, Nino Vieillard, Lantao Mei, Sertan Girgin, Mandy Jordan, Yeongil Ko, Alekh Agarwal, Yaxin Liu, Yasemin Altun, Raoul de Liedekerke, Anastasios Kementsietsidis, Daiyi Peng, Dangyi Liu, Utku Evci, Peter Humphreys, Austin Tarango, Xiang Deng, Yoad Lewenberg, Kevin Aydin, Chengda Wu, Bhavishya Mittal, Tsendsuren Munkhdalai, Kleopatra Chatziprimou, Rodrigo Benenson, Uri First, Xiao Ma, Jinning Li, Armand Joulin, Hamish Tomlinson, Tingnan Zhang, Milad Nasr, Zhi Hong, Micha\"el Sander, Lisa Anne Hendricks, Anuj Sharma, Andrew Bolt, Eszter V\'ertes, Jiri Simsa, Tomer Levinboim, Olcan Sercinoglu, Divyansh Shukla, Austin Wu, Craig Swanson, Danny Vainstein, Fan Bu, Bo Wang, Ryan Julian, Charles Yoon, Sergei Lebedev, Antonious Girgis, Bernd Bandemer, David Du, Todd Wang, Xi Chen, Ying Xiao, Peggy Lu, Natalie Ha, Vlad Ionescu, Simon Rowe, Josip Matak, Federico Lebron, Andreas Steiner, Lalit Jain, Manaal Faruqui, Nicolas Lacasse, Georgie Evans, Neesha Subramaniam, Dean Reich, Giulia Vezzani, Aditya Pandey, Joe Stanton, Tianhao Zhou, Liam McCafferty, Henry Griffiths, Verena Rieser, Soheil Hassas Yeganeh, Eleftheria Briakou, Lu Huang, Zichuan Wei, Liangchen Luo, Erik Jue, Gabby Wang, Victor Cotruta, Myriam Khan, Jongbin Park, Qiuchen Guo, Peiran Li, Rong Rong, Diego Antognini, Anastasia Petrushkina, Chetan Tekur, Eli Collins, Parul Bhatia, Chester Kwak, Wenhu Chen, Arvind Neelakantan, Immanuel Odisho, Sheng Peng, Vincent Nallatamby, Vaibhav Tulsyan, Fabian Pedregosa, Peng Xu, Raymond Lin, Yulong Wang, Emma Wang, Sholto Douglas, Reut Tsarfaty, Elena Gribovskaya, Renga Aravamudhan, Manu Agarwal, Mara Finkelstein, Qiao Zhang, Elizabeth Cole, Phil Crone, Sarmishta Velury, Anil Das, Chris Sauer, Luyao Xu, Danfeng Qin, Chenjie Gu, Dror Marcus, CJ Zheng, Wouter Van Gansbeke, Sobhan Miryoosefi, Haitian Sun, YaGuang Li, Charlie Chen, Jae Yoo, Pavel Dubov, Alex Tomala, Adams Yu, Pawe{\l} Weso{\l}owski, Alok Gunjan, Eddie Cao, Jiaming Luo, Nikhil Sethi, Arkadiusz Socala, Laura Graesser, Tomas Kocisky, Arturo BC, Minmin Chen, Edward Lee, Sophie Wang, Weize Kong, Qiantong Xu, Nilesh Tripuraneni, Yiming Li, Xinxin Yu, Allen Porter, Paul Voigtlaender, Biao Zhang, Arpi Vezer, Sarah York, Qing Wei, Geoffrey Cideron, Mark Kurzeja, Seungyeon Kim, Benny Li, Ang\'eline Pouget, Hyo Lee, Kaspar Daugaard, Yang Li, Dave Uthus, Aditya Siddhant, Paul Cavallaro, Sriram Ganapathy, Maulik Shah, Rolf Jagerman, Jeff Stanway, Piermaria Mendolicchio, Li Xiao, Kayi Lee, Tara Thompson, Shubham Milind Phal, Jason Chase, Sun Jae Lee, Adrian N Reyes, Disha Shrivastava, Zhen Qin, Roykrong Sukkerd, Seth Odoom, Lior Madmoni, John Aslanides, Jonathan Herzig, Elena Pochernina, Sheng Zhang, Parker Barnes, Daisuke Ikeda, Qiujia Li, Shuo-yiin Chang, Shakir Mohamed, Jim Sproch, Richard Powell, Bidisha Samanta, Domagoj \'Cevid, Anton Kovsharov, Shrestha Basu Mallick, Srinivas Tadepalli, Anne Zheng, Kareem Ayoub, Andreas Noever, Christian Reisswig, Zhuo Xu, Junhyuk Oh, Martin Matysiak, Tim Blyth, Shereen Ashraf, Julien Amelot, Boone Severson, Michele Bevilacqua, Motoki Sano, Ethan Dyer, Ofir Roval, Anu Sinha, Yin Zhong, Sagi Perel, Tea Saboli\'c, Johannes Mauerer, Willi Gierke, Mauro Verzetti, Rodrigo Cabrera, Alvin Abdagic, Steven Hemingray, Austin Stone, Jong Lee, Farooq Ahmad, Karthik Raman, Lior Shani, Jonathan Lai, Orhan Firat, Nathan Waters, Eric Ge, Mo Shomrat, Himanshu Gupta, Rajeev Aggarwal, Tom Hudson, Bill Jia, Simon Baumgartner, Palak Jain, Joe Kovac, Junehyuk Jung, Ante \v{Z}u\v{z}ul, Will Truong, Morteza Zadimoghaddam, Songyou Peng, Marco Liang, Rachel Sterneck, Balaji Lakshminarayanan, Machel Reid, Oliver Woodman, Tong Zhou, Jianling Wang, Vincent Coriou, Arjun Narayanan, Jay Hoover, Yenai Ma, Apoorv Jindal, Clayton Sanford, Doug Reid, Swaroop Ramaswamy, Alex Kurakin, Roland Zimmermann, Yana Lunts, Dragos Dena, Zal\'an Borsos, Vered Cohen, Shujian Zhang, Will Grathwohl, Robert Dadashi, Morgan Redshaw, Joshua Kessinger, Julian Odell, Silvano Bonacina, Zihang Dai, Grace Chen, Ayush Dubey, Pablo Sprechmann, Mantas Pajarskas, Wenxuan Zhou, Niharika Ahuja, Tara Thomas, Martin Nikoltchev, Matija Kecman, Bharath Mankalale, Andrey Ryabtsev, Jennifer She, Christian Walder, Jiaming Shen, Lu Li, Carolina Parada, Sheena Panthaplackel, Okwan Kwon, Matt Lawlor, Utsav Prabhu, Yannick Schroecker, Marc'aurelio Ranzato, Pete Blois, Iurii Kemaev, Ting Yu, Dmitry Lepikhin, Hao Xiong, Sahand Sharifzadeh, Oleaser Johnson, Jeremiah Willcock, Rui Yao, Greg Farquhar, Sujoy Basu, Hidetoshi Shimokawa, Nina Anderson, Haiguang Li, Khiem Pham, Yizhong Liang, Sebastian Borgeaud, Alexandre Moufarek, Hideto Kazawa, Blair Kutzman, Marcin Sieniek, Sara Smoot, Ruth Wang, Natalie Axelsson, Nova Fallen, Prasha Sundaram, Yuexiang Zhai, Varun Godbole, Petros Maniatis, Alek Wang, Ilia Shumailov, Santhosh Thangaraj, Remi Crocker, Nikita Gupta, Gang Wu, Phil Chen, Gell\'ert Weisz, Celine Smith, Mojtaba Seyedhosseini, Boya Fang, Xiyang Luo, Roey Yogev, Zeynep Cankara, Andrew Hard, Helen Ran, Rahul Sukthankar, George Necula, Ga\"el Liu, Honglong Cai, Praseem Banzal, Daniel Keysers, Sanjay Ghemawat, Connie Tao, Emma Dunleavy, Aditi Chaudhary, Wei Li, Maciej Miku{\l}a, Chen-Yu Lee, Tiziana Refice, Krishna Somandepalli, Alexandre Fr\'echette, Dan Bahir, John Karro, Keith Rush, Sarah Perrin, Bill Rosgen, Xiaomeng Yang, Clara Huiyi Hu, Mahmoud Alnahlawi, Justin Mao-Jones, Roopal Garg, Hoang Nguyen, Bat-Orgil Batsaikhan, I\~naki Iturrate, Anselm Levskaya, Avi Singh, Ashyana Kachra, Tony Lu, Denis Petek, Zheng Xu, Mark Graham, Lukas Zilka, Yael Karov, Marija Kostelac, Fangyu Liu, Yaohui Guo, Weiyue Wang, Bernd Bohnet, Emily Pitler, Tony Bruguier, Keisuke Kinoshita, Chrysovalantis Anastasiou, Nilpa Jha, Ting Liu, Jerome Connor, Phil Wallis, Philip Pham, Eric Bailey, Shixin Li, Heng-Tze Cheng, Sally Ma, Haiqiong Li, Akanksha Maurya, Kate Olszewska, Manfred Warmuth, Christy Koh, Dominik Paulus, Siddhartha Reddy Jonnalagadda, Enrique Piqueras, Ali Elqursh, Geoff Brown, Hadar Shemtov, Loren Maggiore, Fei Xia, Ryan Foley, Beka Westberg, George van den Driessche, Livio Baldini Soares, Arjun Kar, Michael Quinn, Siqi Zuo, Jialin Wu, Kyle Kastner, Anna Bortsova, Aijun Bai, Ales Mikhalap, Luowei Zhou, Jennifer Brennan, Vinay Ramasesh, Honglei Zhuang, John Maggs, Johan Schalkwyk, Yuntao Xu, Hui Huang, Andrew Howard, Sasha Brown, Linting Xue, Gloria Shen, Brian Albert, Neha Jha, Daniel Zheng, Varvara Krayvanova, Spurthi Amba Hombaiah, Olivier Lacombe, Gautam Vasudevan, Dan Graur, Tian Xie, Meet Gandhi, Bangju Wang, Dustin Zelle, Harman Singh, Dahun Kim, S\'ebastien Cevey, Victor Ungureanu, Natasha Noy, Fei Liu, Annie Xie, Fangxiaoyu Feng, Katerina Tsihlas, Daniel Formoso, Neera Vats, Quentin Wellens, Yinan Wang, Niket Kumar Bhumihar, Samrat Ghosh, Matt Hoffman, Tom Lieber, Oran Lang, Kush Bhatia, Tom Paine, Aroonalok Pyne, Ronny Votel, Madeleine Clare Elish, Benoit Schillings, Alex Panagopoulos, Haichuan Yang, Adam Raveret, Zohar Yahav, Shuang Liu, Dalia El Badawy, Nishant Agrawal, Mohammed Badawi, Mahdi Mirzazadeh, Carla Bromberg, Fan Ye, Chang Liu, Tatiana Sholokhova, George-Cristian Muraru, Gargi Balasubramaniam, Jonathan Malmaud, Alen Carin, Danilo Martins, Irina Jurenka, Pankil Botadra, Dave Lacey, Richa Singh, Mariano Schain, Dan Zheng, Isabelle Guyon, Victor Lavrenko, Seungji Lee, Xiang Zhou, Demis Hassabis, Jeshwanth Challagundla, Derek Cheng, Nikhil Mehta, Matthew Mauger, Michela Paganini, Pushkar Mishra, Kate Lee, Zhang Li, Lexi Baugher, Ondrej Skopek, Max Chang, Amir Zait, Gaurav Menghani, Lizzetth Bellot, Guangxing Han, Jean-Michel Sarr, Sharat Chikkerur, Himanshu Sahni, Rohan Anil, Arun Narayanan, Chandu Thekkath, Daniele Pighin, Hana Strej\v{c}ek, Marko Velic, Fred Bertsch, Manuel Tragut, Keran Rong, Alicia Parrish, Kai Bailey, Jiho Park, Isabela Albuquerque, Abhishek Bapna, Rajesh Venkataraman, Alec Kosik, Johannes Griesser, Zhiwei Deng, Alek Andreev, Qingyun Dou, Kevin Hui, Fanny Wei, Xiaobin Yu, Lei Shu, Avia Aharon, David Barker, Badih Ghazi, Sebastian Flennerhag, Chris Breaux, Yuchuan Liu, Matthew Bilotti, Josh Woodward, Uri Alon, Stephanie Winkler, Tzu-Kuo Huang, Kostas Andriopoulos, Jo\~ao Gabriel Oliveira, Penporn Koanantakool, Berkin Akin, Michael Wunder, Cicero Nogueira dos Santos, Mohammad Hossein Bateni, Lin Yang, Dan Horgan, Beer Changpinyo, Keyvan Amiri, Min Ma, Dayeong Lee, Lihao Liang, Anirudh Baddepudi, Tejasi Latkar, Raia Hadsell, Jun Xu, Hairong Mu, Michael Han, Aedan Pope, Snchit Grover, Frank Kim, Ankit Bhagatwala, Guan Sun, Yamini Bansal, Amir Globerson, Alireza Nazari, Samira Daruki, Hagen Soltau, Jane Labanowski, Laurent El Shafey, Matt Harvey, Yanif Ahmad, Elan Rosenfeld, William Kong, Etienne Pot, Yi-Xuan Tan, Aurora Wei, Victoria Langston, Marcel Prasetya, Petar Veli\v{c}kovi\'c, Richard Killam, Robin Strudel, Darren Ni, Zhenhai Zhu, Aaron Archer, Kavya Kopparapu, Lynn Nguyen, Emilio Parisotto, Hussain Masoom, Sravanti Addepalli, Jordan Grimstad, Hexiang Hu, Joss Moore, Avinatan Hassidim, Le Hou, Mukund Raghavachari, Jared Lichtarge, Adam R. Brown, Hilal Dib, Natalia Ponomareva, Justin Fu, Yujing Zhang, Altaf Rahman, Joana Iljazi, Edouard Leurent, Gabriel Dulac-Arnold, Cosmo Du, Chulayuth Asawaroengchai, Larry Jin, Ela Gruzewska, Ziwei Ji, Benigno Uria, Daniel De Freitas, Paul Barham, Lauren Beltrone, V\'ictor Campos, Jun Yan, Neel Kovelamudi, Arthur Nguyen, Elinor Davies, Zhichun Wu, Zoltan Egyed, Kristina Toutanova, Nithya Attaluri, Hongliang Fei, Peter Stys, Siddhartha Brahma, Martin Izzard, Siva Velusamy, Scott Lundberg, Vincent Zhuang, Kevin Sequeira, Adam Santoro, Ehsan Amid, Ophir Aharoni, Shuai Ye, Mukund Sundararajan, Lijun Yu, Yu-Cheng Ling, Stephen Spencer, Hugo Song, Josip Djolonga, Christo Kirov, Sonal Gupta, Alessandro Bissacco, Clemens Meyer, Mukul Bhutani, Andrew Dai, Weiyi Wang, Siqi Liu, Ashwin Sreevatsa, Qijun Tan, Maria Wang, Lucy Kim, Yicheng Wang, Alex Irpan, Yang Xiao, Stanislav Fort, Yifan He, Alex Gurney, Bryan Gale, Yue Ma, Monica Roy, Viorica Patraucean, Taylan Bilal, Golnaz Ghiasi, Anahita Hosseini, Melvin Johnson, Zhuowan Li, Yi Tay, Benjamin Beyret, Katie Millican, Josef Broder, Mayank Lunayach, Danny Swisher, Eugen Vu\v{s}ak, David Parkinson, MH Tessler, Adi Mayrav Gilady, Richard Song, Allan Dafoe, Yves Raimond, Masa Yamaguchi, Itay Karo, Elizabeth Nielsen, Kevin Kilgour, Mike Dusenberry, Rajiv Mathews, Jiho Choi, Siyuan Qiao, Harsh Mehta, Sahitya Potluri, Chris Knutsen, Jialu Liu, Tat Tan, Kuntal Sengupta, Keerthana Gopalakrishnan, Abodunrinwa Toki, Mencher Chiang, Mike Burrows, Grace Vesom, Zafarali Ahmed, Ilia Labzovsky, Siddharth Vashishtha, Preeti Singh, Ankur Sharma, Ada Ma, Jinyu Xie, Pranav Talluri, Hannah Forbes-Pollard, Aarush Selvan, Joel Wee, Loic Matthey, Tom Funkhouser, Parthasarathy Gopavarapu, Lev Proleev, Cheng Li, Matt Thomas, Kashyap Kolipaka, Zhipeng Jia, Ashwin Kakarla, Srinivas Sunkara, Joan Puigcerver, Suraj Satishkumar Sheth, Emily Graves, Chen Wang, Sadh MNM Khan, Kai Kang, Shyamal Buch, Fred Zhang, Omkar Savant, David Soergel, Kevin Lee, Linda Friso, Xuanyi Dong, Rahul Arya, Shreyas Chandrakaladharan, Connor Schenck, Greg Billock, Tejas Iyer, Anton Bakalov, Leslie Baker, Alex Ruiz, Angad Chandorkar, Trieu Trinh, Matt Miecnikowski, Yanqi Zhou, Yangsibo Huang, Jiazhong Nie, Ali Shah, Ashish Thapliyal, Sam Haves, Lun Wang, Uri Shaham, Patrick Morris-Suzuki, Soroush Radpour, Leonard Berrada, Thomas Strohmann, Chaochao Yan, Jingwei Shen, Sonam Goenka, Tris Warkentin, Petar Devi\'c, Dan Belov, Albert Webson, Madhavi Yenugula, Puranjay Datta, Jerry Chang, Nimesh Ghelani, Aviral Kumar, Vincent Perot, Jessica Lo, Yang Song, Herman Schmit, Jianmin Chen, Vasilisa Bashlovkina, Xiaoyue Pan, Diana Mincu, Paul Roit, Isabel Edkins, Andy Davis, Yujia Li, Ben Horn, Xinjian Li, Pradeep Kumar S, Eric Doi, Wanzheng Zhu, Sri Gayatri Sundara Padmanabhan, Siddharth Verma, Jasmine Liu, Heng Chen, Mihajlo Velimirovi\'c, Malcolm Reynolds, Priyanka Agrawal, Nick Sukhanov, Abhinit Modi, Siddharth Goyal, John Palowitch, Nima Khajehnouri, Wing Lowe, David Klinghoffer, Sharon Silver, Vinh Tran, Candice Schumann, Francesco Piccinno, Xi Liu, Mario Lu\v{c}i\'c, Xiaochen Yang, Sandeep Kumar, Ajay Kannan, Ragha Kotikalapudi, Mudit Bansal, Fabian Fuchs, Mohammad Javad Hosseini, Abdelrahman Abdelhamed, Dawn Bloxwich, Tianhe Yu, Ruoxin Sang, Gregory Thornton, Karan Gill, Yuchi Liu, Virat Shejwalkar, Jason Lin, Zhipeng Yan, Kehang Han, Thomas Buschmann, Michael Pliskin, Zhi Xing, Susheel Tatineni, Junlin Zhang, Sissie Hsiao, Gavin Buttimore, Marcus Wu, Zefei Li, Geza Kovacs, Legg Yeung, Tao Huang, Aaron Cohen, Bethanie Brownfield, Averi Nowak, Mikel Rodriguez, Tianze Shi, Hado van Hasselt, Kevin Cen, Deepanway Ghoshal, Kushal Majmundar, Weiren Yu, Warren Weilun Chen, Danila Sinopalnikov, Hao Zhang, Vlado Gali\'c, Di Lu, Zeyu Zheng, Maggie Song, Gary Wang, Gui Citovsky, Swapnil Gawde, Isaac Galatzer-Levy, David Silver, Ivana Balazevic, Dipanjan Das, Kingshuk Majumder, Yale Cong, Praneet Dutta, Dustin Tran, Hui Wan, Junwei Yuan, Daniel Eppens, Alanna Walton, Been Kim, Harry Ragan, James Cobon-Kerr, Lu Liu, Weijun Wang, Bryce Petrini, Jack Rae, Rakesh Shivanna, Yan Xiong, Chace Lee, Pauline Coquinot, Yiming Gu, Lisa Patel, Blake Hechtman, Aviel Boag, Orion Jankowski, Alex Wertheim, Alex Lee, Paul Covington, Hila Noga, Sam Sobell, Shanthal Vasanth, William Bono, Chirag Nagpal, Wei Fan, Xavier Garcia, Kedar Soparkar, Aybuke Turker, Nathan Howard, Sachit Menon, Yuankai Chen, Vikas Verma, Vladimir Pchelin, Harish Rajamani, Valentin Dalibard, Ana Ramalho, Yang Guo, Kartikeya Badola, Seojin Bang, Nathalie Rauschmayr, Julia Proskurnia, Sudeep Dasari, Xinyun Chen, Mikhail Sushkov, Anja Hauth, Pauline Sho, Abhinav Singh, Bilva Chandra, Allie Culp, Max Dylla, Olivier Bachem, James Besley, Heri Zhao, Timothy Lillicrap, Wei Wei, Wael Al Jishi, Ning Niu, Alban Rrustemi, Rapha\"el Lopez Kaufman, Ryan Poplin, Jewel Zhao, Minh Truong, Shikhar Bharadwaj, Ester Hlavnova, Eli Stickgold, Cordelia Schmid, Georgi Stephanov, Zhaoqi Leng, Frederick Liu, L\'eonard Hussenot, Shenil Dodhia, Juliana Vicente Franco, Lesley Katzen, Abhanshu Sharma, Sarah Cogan, Zuguang Yang, Aniket Ray, Sergi Caelles, Shen Yan, Ravin Kumar, Daniel Gillick, Renee Wong, Joshua Ainslie, Jonathan Hoech, S\'eb Arnold, Dan Abolafia, Anca Dragan, Ben Hora, Grace Hu, Alexey Guseynov, Yang Lu, Chas Leichner, Jinmeng Rao, Abhimanyu Goyal, Nagabhushan Baddi, Daniel Hernandez Diaz, Tim McConnell, Max Bain, Jake Abernethy, Qiqi Yan, Rylan Schaeffer, Paul Vicol, Will Thompson, Montse Gonzalez Arenas, Mathias Bellaiche, Pablo Barrio, Stefan Zinke, Riccardo Patana, Pulkit Mehta, JK Kearns, Avraham Ruderman, Scott Pollom, David D'Ambrosio, Cath Hope, Yang Yu, Andrea Gesmundo, Kuang-Huei Lee, Aviv Rosenberg, Yiqian Zhou, Yaoyiran Li, Drew Garmon, Yonghui Wu, Safeen Huda, Gil Fidel, Martin Baeuml, Jian Li, Phoebe Kirk, Rhys May, Tao Tu, Sara Mc Carthy, Toshiyuki Fukuzawa, Miranda Aperghis, Chih-Kuan Yeh, Toshihiro Yoshino, Bo Li, Austin Myers, Kaisheng Yao, Ben Limonchik, Changwan Ryu, Rohun Saxena, Alex Goldin, Ruizhe Zhao, Rocky Rhodes, Tao Zhu, Divya Tyam, Heidi Howard, Nathan Byrd, Hongxu Ma, Yan Wu, Ryan Mullins, Qingze Wang, Aida Amini, Sebastien Baur, Yiran Mao, Subhashini Venugopalan, Will Song, Wen Ding, Paul Collins, Sashank Reddi, Megan Shum, Andrei Rusu, Luisa Zintgraf, Kelvin Chan, Sheela Goenka, Mathieu Blondel, Michael Collins, Renke Pan, Marissa Giustina, Nikolai Chinaev, Christian Schuler, Ce Zheng, Jonas Valfridsson, Alyssa Loo, Alex Yakubovich, Jamie Smith, Tao Jiang, Rich Munoz, Gabriel Barcik, Rishabh Bansal, Mingyao Yang, Yilun Du, Pablo Duque, Mary Phuong, Alexandra Belias, Kunal Lad, Zeyu Liu, Tal Schuster, Karthik Duddu, Jieru Hu, Paige Kunkle, Matthew Watson, Jackson Tolins, Josh Smith, Denis Teplyashin, Garrett Bingham, Marvin Ritter, Marco Andreetto, Divya Pitta, Mohak Patel, Shashank Viswanadha, Trevor Strohman, Catalin Ionescu, Jincheng Luo, Yogesh Kalley, Jeremy Wiesner, Dan Deutsch, Derek Lockhart, Peter Choy, Rumen Dangovski, Chawin Sitawarin, Cat Graves, Tanya Lando, Joost van Amersfoort, Ndidi Elue, Zhouyuan Huo, Pooya Moradi, Jean Tarbouriech, Henryk Michalewski, Wenting Ye, Eunyoung Kim, Alex Druinsky, Florent Altch\'e, Xinyi Chen, Artur Dwornik, Da-Cheng Juan, Rivka Moroshko, Horia Toma, Jarrod Kahn, Hai Qian, Maximilian Sieb, Irene Cai, Roman Goldenberg, Praneeth Netrapalli, Sindhu Raghuram, Yuan Gong, Lijie Fan, Evan Palmer, Yossi Matias, Valentin Gabeur, Shreya Pathak, Tom Ouyang, Don Metzler, Geoff Bacon, Srinivasan Venkatachary, Sridhar Thiagarajan, Alex Cullum, Eran Ofek, Vytenis Sakenas, Mohamed Hammad, Cesar Magalhaes, Mayank Daswani, Oscar Chang, Ashok Popat, Ruichao Li, Komal Jalan, Yanhan Hou, Josh Lipschultz, Antoine He, Wenhao Jia, Pier Giuseppe Sessa, Prateek Kolhar, William Wong, Sumeet Singh, Lukas Haas, Jay Whang, Hanna Klimczak-Pluci\'nska, Georges Rotival, Grace Chung, Yiqing Hua, Anfal Siddiqui, Nicolas Serrano, Dongkai Chen, Billy Porter, Libin Bai, Keshav Shivam, Sho Arora, Partha Talukdar, Tom Cobley, Sangnie Bhardwaj, Evgeny Gladchenko, Simon Green, Kelvin Guu, Felix Fischer, Xiao Wu, Eric Wang, Achintya Singhal, Tatiana Matejovicova, James Martens, Hongji Li, Roma Patel, Elizabeth Kemp, Jiaqi Pan, Lily Wang, Blake JianHang Chen, Jean-Baptiste Alayrac, Navneet Potti, Erika Gemzer, Eugene Ie, Kay McKinney, Takaaki Saeki, Edward Chou, Pascal Lamblin, SQ Mah, Zach Fisher, Martin Chadwick, Jon Stritar, Obaid Sarvana, Andrew Hogue, Artem Shtefan, Hadi Hashemi, Yang Xu, Jindong Gu, Sharad Vikram, Chung-Ching Chang, Sabela Ramos, Logan Kilpatrick, Weijuan Xi, Jenny Brennan, Yinghao Sun, Abhishek Jindal, Ionel Gog, Dawn Chen, Felix Wu, Jason Lee, Sudhindra Kopalle, Srinadh Bhojanapalli, Oriol Vinyals, Natan Potikha, Burcu Karagol Ayan, Yuan Yuan, Michael Riley, Piotr Stanczyk, Sergey Kishchenko, Bing Wang, Dan Garrette, Antoine Yang, Vlad Feinberg, CJ Carey, Javad Azizi, Viral Shah, Erica Moreira, Chongyang Shi, Josh Feldman, Elizabeth Salesky, Thomas Lampe, Aneesh Pappu, Duhyeon Kim, Jonas Adler, Avi Caciularu, Brian Walker, Yunhan Xu, Yochai Blau, Dylan Scandinaro, Terry Huang, Sam El-Husseini, Abhishek Sinha, Lijie Ren, Taylor Tobin, Patrik Sundberg, Tim Sohn, Vikas Yadav, Mimi Ly, Emily Xue, Jing Xiong, Afzal Shama Soudagar, Sneha Mondal, Nikhil Khadke, Qingchun Ren, Ben Vargas, Stan Bileschi, Sarah Chakera, Cindy Wang, Boyu Wang, Yoni Halpern, Joe Jiang, Vikas Sindhwani, Petre Petrov, Pranavaraj Ponnuramu, Sanket Vaibhav Mehta, Yu Watanabe, Betty Chan, Matheus Wisniewski, Trang Pham, Jingwei Zhang, Conglong Li, Dario de Cesare, Art Khurshudov, Alex Vasiloff, Melissa Tan, Zoe Ashwood, Bobak Shahriari, Maryam Majzoubi, Garrett Tanzer, Olga Kozlova, Robin Alazard, James Lee-Thorp, Nguyet Minh Phu, Isaac Tian, Junwhan Ahn, Andy Crawford, Lauren Lax, Yuan Shangguan, Iftekhar Naim, David Ross, Oleksandr Ferludin, Tongfei Guo, Andrea Banino, Hubert Soyer, Xiaoen Ju, Dominika Rogozi\'nska, Ishaan Malhi, Marcella Valentine, Daniel Balle, Apoorv Kulshreshtha, Maciej Kula, Yiwen Song, Sophia Austin, John Schultz, Roy Hirsch, Arthur Douillard, Apoorv Reddy, Michael Fink, Summer Yue, Khyatti Gupta, Adam Zhang, Norman Rink, Daniel McDuff, Lei Meng, Andr\'as Gy\"orgy, Yasaman Razeghi, Ricky Liang, Kazuki Osawa, Aviel Atias, Matan Eyal, Tyrone Hill, Nikolai Grigorev, Zhengdong Wang, Nitish Kulkarni, Rachel Soh, Ivan Lobov, Zachary Charles, Sid Lall, Kazuma Hashimoto, Ido Kessler, Victor Gomes, Zelda Mariet, Danny Driess, Alessandro Agostini, Canfer Akbulut, Jingcao Hu, Marissa Ikonomidis, Emily Caveness, Kartik Audhkhasi, Saurabh Agrawal, Ioana Bica, Evan Senter, Jayaram Mudigonda, Kelly Chen, Jingchen Ye, Xuanhui Wang, James Svensson, Philipp Fr\"anken, Josh Newlan, Li Lao, Eva Schnider, Sami Alabed, Joseph Kready, Jesse Emond, Afief Halumi, Tim Zaman, Chengxi Ye, Naina Raisinghani, Vilobh Meshram, Bo Chang, Ankit Singh Rawat, Axel Stjerngren, Sergey Levi, Rui Wang, Xiangzhu Long, Mitchelle Rasquinha, Steven Hand, Aditi Mavalankar, Lauren Agubuzu, Sudeshna Roy, Junquan Chen, Jarek Wilkiewicz, Hao Zhou, Michal Jastrzebski, Qiong Hu, Agustin Dal Lago, Ramya Sree Boppana, Wei-Jen Ko, Jennifer Prendki, Yao Su, Zhi Li, Eliza Rutherford, Girish Ramchandra Rao, Ramona Comanescu, Adri\`a Puigdom\`enech, Qihang Chen, Dessie Petrova, Christine Chan, Vedrana Milutinovic, Felipe Tiengo Ferreira, Chin-Yi Cheng, Ming Zhang, Tapomay Dey, Sherry Yang, Ramesh Sampath, Quoc Le, Howard Zhou, Chu-Cheng Lin, Hoi Lam, Christine Kaeser-Chen, Kai Hui, Dean Hirsch, Tom Eccles, Basil Mustafa, Shruti Rijhwani, Morgane Rivi\`ere, Yuanzhong Xu, Junjie Wang, Xinyang Geng, Xiance Si, Arjun Khare, Cheolmin Kim, Vahab Mirrokni, Kamyu Lee, Khuslen Baatarsukh, Nathaniel Braun, Lisa Wang, Pallavi LV, Richard Tanburn, Yonghao Zhu, Fangda Li, Setareh Ariafar, Dan Goldberg, Ken Burke, Daniil Mirylenka, Meiqi Guo, Olaf Ronneberger, Hadas Natalie Vogel, Liqun Cheng, Nishita Shetty, Johnson Jia, Thomas Jimma, Corey Fry, Ted Xiao, Martin Sundermeyer, Ryan Burnell, Yannis Assael, Mario Pinto, JD Chen, Rohit Sathyanarayana, Donghyun Cho, Jing Lu, Rishabh Agarwal, Sugato Basu, Lucas Gonzalez, Dhruv Shah, Meng Wei, Dre Mahaarachchi, Rohan Agrawal, Tero Rissa, Yani Donchev, Ramiro Leal-Cavazos, Adrian Hutter, Markus Mircea, Alon Jacovi, Faruk Ahmed, Jiageng Zhang, Shuguang Hu, Bo-Juen Chen, Jonni Kanerva, Guillaume Desjardins, Andrew Lee, Nikos Parotsidis, Asier Mujika, Tobias Weyand, Jasper Snoek, Jo Chick, Kai Chen, Paul Chang, Ethan Mahintorabi, Zi Wang, Tolly Powell, Orgad Keller, Abhirut Gupta, Claire Sha, Kanav Garg, Nicolas Heess, \'Agoston Weisz, Cassidy Hardin, Bartek Wydrowski, Ben Coleman, Karina Zainullina, Pankaj Joshi, Alessandro Epasto, Terry Spitz, Binbin Xiong, Kai Zhao, Arseniy Klimovskiy, Ivy Zheng, Johan Ferret, Itay Yona, Waleed Khawaja, Jean-Baptiste Lespiau, Maxim Krikun, Siamak Shakeri, Timothee Cour, Bonnie Li, Igor Krivokon, Dan Suh, Alex Hofer, Jad Al Abdallah, Nikita Putikhin, Oscar Akerlund, Silvio Lattanzi, Anurag Kumar, Shane Settle, Himanshu Srivastava, Folawiyo Campbell-Ajala, Edouard Rosseel, Mihai Dorin Istin, Nishanth Dikkala, Anand Rao, Nick Young, Kate Lin, Dhruva Bhaswar, Yiming Wang, Jaume Sanchez Elias, Kritika Muralidharan, James Keeling, Dayou Du, Siddharth Gopal, Gregory Dibb, Charles Blundell, Manolis Delakis, Jacky Liang, Marco Tulio Ribeiro, Georgi Karadzhov, Guillermo Garrido, Ankur Bapna, Jiawei Cao, Adam Sadovsky, Pouya Tafti, Arthur Guez, Coline Devin, Yixian Di, Jinwei Xing, Chuqiao Joyce Xu, Hanzhao Lin, Chun-Te Chu, Sameera Ponda, Wesley Helmholz, Fan Yang, Yue Gao, Sara Javanmardi, Wael Farhan, Alex Ramirez, Ricardo Figueira, Khe Chai Sim, Yuval Bahat, Ashwin Vaswani, Liangzhe Yuan, Gufeng Zhang, Leland Rechis, Hanjun Dai, Tayo Oguntebi, Alexandra Cordell, Eug\'enie Rives, Kaan Tekelioglu, Naveen Kumar, Bing Zhang, Aurick Zhou, Nikolay Savinov, Andrew Leach, Alex Tudor, Sanjay Ganapathy, Yanyan Zheng, Mirko Rossini, Vera Axelrod, Arnaud Autef, Yukun Zhu, Zheng Zheng, Mingda Zhang, Baochen Sun, Jie Ren, Nenad Tomasev, Nithish Kannen, Amer Sinha, Charles Chen, Louis O'Bryan, Alex Pak, Aditya Kusupati, Weel Yang, Deepak Ramachandran, Patrick Griffin, Seokhwan Kim, Philipp Neubeck, Craig Schiff, Tammo Spalink, Mingyang Ling, Arun Nair, Ga-Young Joung, Linda Deng, Avishkar Bhoopchand, Lora Aroyo, Tom Duerig, Jordan Griffith, Gabe Barth-Maron, Jake Ades, Alex Haig, Ankur Taly, Yunting Song, Paul Michel, Dave Orr, Dean Weesner, Corentin Tallec, Carrie Grimes Bostock, Paul Niemczyk, Andy Twigg, Mudit Verma, Rohith Vallu, Henry Wang, Marco Gelmi, Kiranbir Sodhia, Aleksandr Chuklin, Omer Goldman, Jasmine George, Liang Bai, Kelvin Zhang, Petar Sirkovic, Efrat Nehoran, Golan Pundak, Jiaqi Mu, Alice Chen, Alex Greve, Paulo Zacchello, David Amos, Heming Ge, Eric Noland, Colton Bishop, Jeffrey Dudek, Youhei Namiki, Elena Buchatskaya, Jing Li, Dorsa Sadigh, Masha Samsikova, Dan Malkin, Damien Vincent, Robert David, Rob Willoughby, Phoenix Meadowlark, Shawn Gao, Yan Li, Raj Apte, Amit Jhindal, Stein Xudong Lin, Alex Polozov, Zhicheng Wang, Tomas Mery, Anirudh GP, Varun Yerram, Sage Stevens, Tianqi Liu, Noah Fiedel, Charles Sutton, Matthew Johnson, Xiaodan Song, Kate Baumli, Nir Shabat, Muqthar Mohammad, Hao Liu, Marco Selvi, Yichao Zhou, Mehdi Hafezi Manshadi, Chu-ling Ko, Anthony Chen, Michael Bendersky, Jorge Gonzalez Mendez, Nisarg Kothari, Amir Zandieh, Yiling Huang, Daniel Andor, Ellie Pavlick, Idan Brusilovsky, Jitendra Harlalka, Sally Goldman, Andrew Lampinen, Guowang Li, Asahi Ushio, Somit Gupta, Lei Zhang, Chuyuan Kelly Fu, Madhavi Sewak, Timo Denk, Jed Borovik, Brendan Jou, Avital Zipori, Prateek Jain, Junwen Bai, Thang Luong, Jonathan Tompson, Alice Li, Li Liu, George Powell, Jiajun Shen, Alex Feng, Grishma Chole, Da Yu, Yinlam Chow, Tongxin Yin, Eric Malmi, Kefan Xiao, Yash Pande, Shachi Paul, Niccol\`o Dal Santo, Adil Dostmohamed, Sergio Guadarrama, Aaron Phillips, Thanumalayan Sankaranarayana Pillai, Gal Yona, Amin Ghafouri, Preethi Lahoti, Benjamin Lee, Dhruv Madeka, Eren Sezener, Simon Tokumine, Adrian Collister, Nicola De Cao, Richard Shin, Uday Kalra, Parker Beak, Emily Nottage, Ryo Nakashima, Ivan Jurin, Vikash Sehwag, Meenu Gaba, Junhao Zeng, Kevin R. McKee, Fernando Pereira, Tamar Yakar, Amayika Panda, Arka Dhar, Peilin Zhong, Daniel Sohn, Mark Brand, Lars Lowe Sjoesund, Viral Carpenter, Sharon Lin, Shantanu Thakoor, Marcus Wainwright, Ashwin Chaugule, Pranesh Srinivasan, Muye Zhu, Bernett Orlando, Jack Weber, Ayzaan Wahid, Gilles Baechler, Apurv Suman, Jovana Mitrovi\'c, Gabe Taubman, Honglin Yu, Helen King, Josh Dillon, Cathy Yip, Dhriti Varma, Tomas Izo, Levent Bolelli, Borja De Balle Pigem, Julia Di Trapani, Fotis Iliopoulos, Adam Paszke, Nishant Ranka, Joe Zou, Francesco Pongetti, Jed McGiffin, Alex Siegman, Rich Galt, Ross Hemsley, Goran \v{Z}u\v{z}i\'c, Victor Carbune, Tao Li, Myle Ott, F\'elix de Chaumont Quitry, David Vilar Torres, Yuri Chervonyi, Tomy Tsai, Prem Eruvbetine, Samuel Yang, Matthew Denton, Jake Walker, Slavica Anda\v{c}i\'c, Idan Heimlich Shtacher, Vittal Premachandran, Harshal Tushar Lehri, Cip Baetu, Damion Yates, Lampros Lamprou, Mariko Iinuma, Ioana Mihailescu, Ben Albrecht, Shachi Dave, Susie Sargsyan, Bryan Perozzi, Lucas Manning, Chiyuan Zhang, Denis Vnukov, Igor Mordatch, Raia Hadsell Wolfgang Macherey, Ryan Kappedal, Jim Stephan, Aditya Tripathi, Klaus Macherey, Jun Qian, Abhishek Bhowmick, Shekoofeh Azizi, R\'emi Leblond, Shiva Mohan Reddy Garlapati, Timothy Knight, Matthew Wiethoff, Wei-Chih Hung, Anelia Angelova, Georgios Evangelopoulos, Pawel Janus, Dimitris Paparas, Matthew Rahtz, Ken Caluwaerts, Vivek Sampathkumar, Daniel Jarrett, Shadi Noghabi, Antoine Miech, Chak Yeung, Geoff Clark, Henry Prior, Fei Zheng, Jean Pouget-Abadie, Indro Bhattacharya, Kalpesh Krishna, Will Bishop, Zhe Yuan, Yunxiao Deng, Ashutosh Sathe, Kacper Krasowiak, Ciprian Chelba, Cho-Jui Hsieh, Kiran Vodrahalli, Buhuang Liu, Thomas K\"oppe, Amr Khalifa, Lubo Litchev, Pichi Charoenpanit, Reed Roberts, Sachin Yadav, Yasumasa Onoe, Desi Ivanov, Megha Mohabey, Vighnesh Birodkar, Nemanja Raki\'cevi\'c, Pierre Sermanet, Vaibhav Mehta, Krishan Subudhi, Travis Choma, Will Ng, Luheng He, Kathie Wang, Tasos Kementsietsidis, Shane Gu, Mansi Gupta, Andrew Nystrom, Mehran Kazemi, Timothy Chung, Nacho Cano, Nikhil Dhawan, Yufei Wang, Jiawei Xia, Trevor Yacovone, Eric Jia, Mingqing Chen, Simeon Ivanov, Ashrith Sheshan, Sid Dalmia, Pawe{\l} Stradomski, Pengcheng Yin, Salem Haykal, Congchao Wang, Dennis Duan, Neslihan Bulut, Greg Kochanski, Liam MacDermed, Namrata Godbole, Shitao Weng, Jingjing Chen, Rachana Fellinger, Ramin Mehran, Daniel Suo, Hisham Husain, Tong He, Kaushal Patel, Joshua Howland, Randall Parker, Kelvin Nguyen, Sharath Maddineni, Chris Rawles, Mina Khan, Shlomi Cohen-Ganor, Amol Mandhane, Xinyi Wu, Chenkai Kuang, Iulia Com\c{s}a, Ramya Ganeshan, Hanie Sedghi, Adam Bloniarz, Nuo Wang Pierse, Anton Briukhov, Petr Mitrichev, Anita Gergely, Serena Zhan, Allan Zhou, Nikita Saxena, Eva Lu, Josef Dean, Ashish Gupta, Nicolas Perez-Nieves, Renjie Wu, Cory McLean, Wei Liang, Disha Jindal, Anton Tsitsulin, Wenhao Yu, Kaiz Alarakyia, Tom Schaul, Piyush Patil, Peter Sung, Elijah Peake, Hongkun Yu, Feryal Behbahani, JD Co-Reyes, Alan Ansell, Sean Sun, Clara Barbu, Jonathan Lee, Seb Noury, James Allingham, Bilal Piot, Mohit Sharma, Christopher Yew, Ivan Korotkov, Bibo Xu, Demetra Brady, Goran Petrovic, Shibl Mourad, Claire Cui, Aditya Gupta, Parker Schuh, Saarthak Khanna, Anna Goldie, Abhinav Arora, Vadim Zubov, Amy Stuart, Mark Epstein, Yun Zhu, Jianqiao Liu, Yury Stuken, Ziyue Wang, Karolis Misiunas, Dee Guo, Ashleah Gill, Ale Hartman, Zaid Nabulsi, Aurko Roy, Aleksandra Faust, Jason Riesa, Ben Withbroe, Mengchao Wang, Marco Tagliasacchi, Andreea Marzoca, James Noraky, Serge Toropov, Malika Mehrotra, Bahram Raad, Sanja Deur, Steve Xu, Marianne Monteiro, Zhongru Wu, Yi Luan, Sam Ritter, Nick Li, H{\aa}vard Garnes, Yanzhang He, Martin Zlocha, Jifan Zhu, Matteo Hessel, Will Wu, Spandana Raj Babbula, Chizu Kawamoto, Yuanzhen Li, Mehadi Hassen, Yan Wang, Brian Wieder, James Freedman, Yin Zhang, Xinyi Bai, Tianli Yu, David Reitter, XiangHai Sheng, Mateo Wirth, Aditya Kini, Dima Damen, Mingcen Gao, Rachel Hornung, Michael Voznesensky, Brian Roark, Adhi Kuncoro, Yuxiang Zhou, Rushin Shah, Anthony Brohan, Kuangyuan Chen, James Wendt, David Rim, Paul Kishan Rubenstein, Jonathan Halcrow, Michelle Liu, Ty Geri, Yunhsuan Sung, Jane Shapiro, Shaan Bijwadia, Chris Duvarney, Christina Sorokin, Paul Natsev, Reeve Ingle, Pramod Gupta, Young Maeng, Ndaba Ndebele, Kexin Zhu, Valentin Anklin, Katherine Lee, Yuan Liu, Yaroslav Akulov, Shaleen Gupta, Guolong Su, Flavien Prost, Tianlin Liu, Vitaly Kovalev, Pol Moreno, Martin Scholz, Sam Redmond, Zongwei Zhou, Alex Castro-Ros, Andr\'e Susano Pinto, Dia Kharrat, Michal Yarom, Rachel Saputro, Jannis Bulian, Ben Caine, Ji Liu, Abbas Abdolmaleki, Shariq Iqbal, Tautvydas Misiunas, Mikhail Sirotenko, Shefali Garg, Guy Bensky, Huan Gui, Xuezhi Wang, Raphael Koster, Mike Bernico, Da Huang, Romal Thoppilan, Trevor Cohn, Ben Golan, Wenlei Zhou, Andrew Rosenberg, Markus Freitag, Tynan Gangwani, Vincent Tsang, Anand Shukla, Xiaoqi Ren, Minh Giang, Chi Zou, Andre Elisseeff, Charline Le Lan, Dheeru Dua, Shuba Lall, Pranav Shyam, Frankie Garcia, Sarah Nguyen, Michael Guzman, AJ Maschinot, Marcello Maggioni, Ming-Wei Chang, Karol Gregor, Lotte Weerts, Kumaran Venkatesan, Bogdan Damoc, Leon Liu, Jan Wassenberg, Lewis Ho, Becca Roelofs, Majid Hadian, Fran\c{c}ois-Xavier Aubet, Yu Liang, Sami Lachgar, Danny Karmon, Yong Cheng, Amelio V\'azquez-Reina, Angie Chen, Zhuyun Dai, Andy Brock, Shubham Agrawal, Chenxi Pang, Peter Garst, Mariella Sanchez-Vargas, Ivor Rendulic, Aditya Ayyar, Andrija Ra\v{z}natovi\'c, Olivia Ma, Roopali Vij, Neha Sharma, Ashwin Balakrishna, Bingyuan Liu, Ian Mackinnon, Sorin Baltateanu, Petra Poklukar, Gabriel Ibagon, Colin Ji, Hongyang Jiao, Isaac Noble, Wojciech Stokowiec, Zhihao Li, Jeff Dean, David Lindner, Mark Omernick, Kristen Chiafullo, Mason Dimarco, Vitor Rodrigues, Vittorio Selo, Garrett Honke, Xintian Cindy Wu, Wei He, Adam Hillier, Anhad Mohananey, Vihari Piratla, Chang Ye, Chase Malik, Sebastian Riedel, Samuel Albanie, Zi Yang, Kenny Vassigh, Maria Bauza, Sheng Li, Yiqing Tao, Nevan Wichers, Andrii Maksai, Abe Ittycheriah, Ross Mcilroy, Bryan Seybold, Noah Goodman, Romina Datta, Steven M. Hernandez, Tian Shi, Yony Kochinski, Anna Bulanova, Ken Franko, Mikita Sazanovich, Nicholas FitzGerald, Praneeth Kacham, Shubha Srinivas Raghvendra, Vincent Hellendoorn, Alexander Grushetsky, Julian Salazar, Angeliki Lazaridou, Jason Chang, Jan-Thorsten Peter, Sushant Kafle, Yann Dauphin, Abhishek Rao, Filippo Graziano, Izhak Shafran, Yuguo Liao, Tianli Ding, Geng Yan, Grace Chu, Zhao Fu, Vincent Roulet, Gabriel Rasskin, Duncan Williams, Shahar Drath, Alex Mossin, Raphael Hoffmann, Jordi Orbay, Francesco Bertolini, Hila Sheftel, Justin Chiu, Siyang Xue, Yuheng Kuang, Ferjad Naeem, Swaroop Nath, Nana Nti, Phil Culliton, Kashyap Krishnakumar, Michael Isard, Pei Sun, Ayan Chakrabarti, Nathan Clement, Regev Cohen, Arissa Wongpanich, GS Oh, Ashwin Murthy, Hao Zheng, Jessica Hamrick, Oskar Bunyan, Suhas Ganesh, Nitish Gupta, Roy Frostig, John Wieting, Yury Malkov, Pierre Marcenac, Zhixin Lucas Lai, Xiaodan Tang, Mohammad Saleh, Fedir Zubach, Chinmay Kulkarni, Huanjie Zhou, Vicky Zayats, Nan Ding, Anshuman Tripathi, Arijit Pramanik, Patrik Zochbauer, Harish Ganapathy, Vedant Misra, Zach Behrman, Hugo Vallet, Mingyang Zhang, Mukund Sridhar, Ye Jin, Mohammad Babaeizadeh, Siim P\~oder, Megha Goel, Divya Jain, Tajwar Nasir, Shubham Mittal, Tim Dozat, Diego Ardila, Aliaksei Severyn, Fabio Pardo, Sammy Jerome, Siyang Qin, Louis Rouillard, Amir Yazdanbakhsh, Zizhao Zhang, Shivani Agrawal, Kaushik Shivakumar, Caden Lu, Praveen Kallakuri, Rachita Chhaparia, Kanishka Rao, Charles Kwong, Asya Fadeeva, Shitij Nigam, Yan Virin, Yuan Zhang, Balaji Venkatraman, Beliz Gunel, Marc Wilson, Huiyu Wang, Abhinav Gupta, Xiaowei Xu, Adrien Ali Ta\"iga, Kareem Mohamed, Doug Fritz, Daniel Rodriguez, Zoubin Ghahramani, Harry Askham, Lior Belenki, James Zhao, Rahul Gupta, Krzysztof Jastrz\k{e}bski, Takahiro Kosakai, Kaan Katircioglu, Jon Schneider, Rina Panigrahy, Konstantinos Bousmalis, Peter Grabowski, Prajit Ramachandran, Chaitra Hegde, Mihaela Rosca, Angelo Scorza Scarpati, Kyriakos Axiotis, Ying Xu, Zach Gleicher, Assaf Hurwitz Michaely, Mandar Sharma, Sanil Jain, Christoph Hirnschall, Tal Marian, Xuhui Jia, Kevin Mather, Kilol Gupta, Linhai Qiu, Nigamaa Nayakanti, Lucian Ionita, Steven Zheng, Lucia Loher, Kurt Shuster, Igor Petrovski, Roshan Sharma, Rahma Chaabouni, Angel Yeh, James An, Arushi Gupta, Steven Schwarcz, Seher Ellis, Sam Conway-Rahman, Javier Snaider, Alex Zhai, James Atwood, Daniel Golovin, Liqian Peng, Te I, Vivian Xia, Salvatore Scellato, Mahan Malihi, Arthur Bra\v{z}inskas, Vlad-Doru Ion, Younghoon Jun, James Swirhun, Soroosh Mariooryad, Jiao Sun, Steve Chien, Rey Coaguila, Ariel Brand, Yi Gao, Tom Kwiatkowski, Roee Aharoni, Cheng-Chun Lee, Mislav \v{Z}ani\'c, Yichi Zhang, Dan Ethier, Vitaly Nikolaev, Pranav Nair, Yoav Ben Shalom, Hen Fitoussi, Jai Gupta, Hongbin Liu, Dee Cattle, Tolga Bolukbasi, Ben Murdoch, Fantine Huot, Yin Li, Chris Hahn

Abstract: In this report, we introduce the Gemini 2.X model family: Gemini 2.5 Pro and Gemini 2.5 Flash, as well as our earlier Gemini 2.0 Flash and Flash-Lite models. Gemini 2.5 Pro is our most capable model yet, achieving SoTA performance on frontier coding and reasoning benchmarks. In addition to its incredible coding and reasoning skills, Gemini 2.5 Pro is a thinking model that excels at multimodal understanding and it is now able to process up to 3 hours of video content. Its unique combination of long context, multimodal and reasoning capabilities can be combined to unlock new agentic workflows. Gemini 2.5 Flash provides excellent reasoning abilities at a fraction of the compute and latency requirements and Gemini 2.0 Flash and Flash-Lite provide high performance at low latency and cost. Taken together, the Gemini 2.X model generation spans the full Pareto frontier of model capability vs cost, allowing users to explore the boundaries of what is possible with complex agentic problem solving.

replace-cross The Dark Side of LLMs Agent-based Attacks for Complete Computer Takeover

Authors: Matteo Lupinacci, Francesco Aurelio Pironti, Francesco Blefari, Francesco Romeo, Luigi Arena, Angelo Furfaro

Abstract: The rapid adoption of Large Language Model (LLM) agents and multi-agent systems enables unprecedented capabilities in natural language processing and generation. However, these systems have introduced unprecedented security vulnerabilities that extend beyond traditional prompt injection attacks. This paper presents the first comprehensive evaluation of LLM agents as attack vectors capable of achieving complete computer takeover through the exploitation of trust boundaries within agentic AI systems where autonomous entities interact and influence each other. We demonstrate that adversaries can leverage three distinct attack surfaces - direct prompt injection, RAG backdoor attacks, and inter-agent trust exploitation - to coerce popular LLMs (including GPT-4o, Claude-4 and Gemini-2.5) into autonomously installing and executing malware on victim machines. Our evaluation of 17 state-of-the-art LLMs reveals an alarming vulnerability hierarchy: while 41.2% of models succumb to direct prompt injection, 52.9% are vulnerable to RAG backdoor attacks, and a critical 82.4% can be compromised through inter-agent trust exploitation. Notably, we discovered that LLMs which successfully resist direct malicious commands will execute identical payloads when requested by peer agents, revealing a fundamental flaw in current multi-agent security models. Our findings demonstrate that only 5.9% of tested models (1/17) proved resistant to all attack vectors, with the majority exhibiting context-dependent security behaviors that create exploitable blind spots. Our findings also highlight the need to increase awareness and research on the security risks of LLMs, showing a paradigm shift in cybersecurity threats, where AI tools themselves become sophisticated attack vectors.

replace-cross Squeeze the Soaked Sponge: Efficient Off-policy Reinforcement Finetuning for Large Language Model

Authors: Jing Liang, Hongyao Tang, Yi Ma, Jinyi Liu, Yan Zheng, Shuyue Hu, Lei Bai, Jianye Hao

Abstract: Reinforcement Learning (RL) has demonstrated its potential to improve the reasoning ability of Large Language Models (LLMs). One major limitation of most existing Reinforcement Finetuning (RFT) methods is that they are on-policy RL in nature, i.e., data generated during the past learning process is not fully utilized. This inevitably comes at a significant cost of compute and time, posing a stringent bottleneck on continuing economic and efficient scaling. To this end, we launch the renaissance of off-policy RL and propose Reincarnating Mix-policy Proximal Policy Gradient (ReMix), a general approach to enable on-policy RFT methods like PPO and GRPO to leverage off-policy data. ReMix consists of three major components: (1) Mix-policy proximal policy gradient with an increased Update-To-Data (UTD) ratio for efficient training; (2) KL-Convex policy constraint to balance the trade-off between stability and flexibility; (3) Policy reincarnation to achieve a seamless transition from efficient early-stage learning to steady asymptotic improvement. In our experiments, we train a series of ReMix models upon PPO, GRPO and 1.5B, 7B base models. ReMix shows an average Pass@1 accuracy of 52.10% (for 1.5B model) with 0.079M response rollouts, 350 training steps and achieves 63.27%/64.39% (for 7B model) with 0.007M/0.011M response rollouts, 50/75 training steps, on five math reasoning benchmarks (i.e., AIME'24, AMC'23, Minerva, OlympiadBench, and MATH500). Compared with 15 recent advanced models, ReMix shows SOTA-level performance with an over 30x to 450x reduction in training cost in terms of rollout data volume. In addition, we reveal insightful findings via multifaceted analysis, including the implicit preference for shorter responses due to the Whipping Effect of off-policy discrepancy, the collapse mode of self-reflection behavior under the presence of severe off-policyness, etc.

replace-cross KeyRe-ID: Keypoint-Guided Person Re-Identification using Part-Aware Representation in Videos

Authors: Jinseong Kim, Junghoon Song, Gyeongseon Baek, Byeongjoon Noh

Abstract: We propose \textbf{KeyRe-ID}, a keypoint-guided video-based person re-identification framework consisting of global and local branches that leverage human keypoints for enhanced spatiotemporal representation learning. The global branch captures holistic identity semantics through Transformer-based temporal aggregation, while the local branch dynamically segments body regions based on keypoints to generate fine-grained, part-aware features. Extensive experiments on MARS and iLIDS-VID benchmarks demonstrate state-of-the-art performance, achieving 91.73\% mAP and 97.32\% Rank-1 accuracy on MARS, and 96.00\% Rank-1 and 100.0\% Rank-5 accuracy on iLIDS-VID. The code for this work will be publicly available on GitHub upon publication.

replace-cross Objectomaly: Objectness-Aware Refinement for OoD Segmentation with Structural Consistency and Boundary Precision

Authors: Jeonghoon Song, Sunghun Kim, Jaegyun Im, Byeongjoon Noh

Abstract: Out-of-Distribution (OoD) segmentation is critical for safety-sensitive applications like autonomous driving. However, existing mask-based methods often suffer from boundary imprecision, inconsistent anomaly scores within objects, and false positives from background noise. We propose \textbf{\textit{Objectomaly}}, an objectness-aware refinement framework that incorporates object-level priors. Objectomaly consists of three stages: (1) Coarse Anomaly Scoring (CAS) using an existing OoD backbone, (2) Objectness-Aware Score Calibration (OASC) leveraging SAM-generated instance masks for object-level score normalization, and (3) Meticulous Boundary Precision (MBP) applying Laplacian filtering and Gaussian smoothing for contour refinement. Objectomaly achieves state-of-the-art performance on key OoD segmentation benchmarks, including SMIYC AnomalyTrack/ObstacleTrack and RoadAnomaly, improving both pixel-level (AuPRC up to 96.99, FPR$_{95}$ down to 0.07) and component-level (F1$-$score up to 83.44) metrics. Ablation studies and qualitative results on real-world driving videos further validate the robustness and generalizability of our method. Code will be released upon publication.

replace-cross Hallucination Stations: On Some Basic Limitations of Transformer-Based Language Models

Authors: Varin Sikka, Vishal Sikka

Abstract: With widespread adoption of transformer-based language models in AI, there is significant interest in the limits of LLMs capabilities, specifically so-called hallucinations, occurrences in which LLMs provide spurious, factually incorrect or nonsensical information when prompted on certain subjects. Furthermore, there is growing interest in agentic uses of LLMs - that is, using LLMs to create agents that act autonomously or semi-autonomously to carry out various tasks, including tasks with applications in the real world. This makes it important to understand the types of tasks LLMs can and cannot perform. We explore this topic from the perspective of the computational complexity of LLM inference. We show that LLMs are incapable of carrying out computational and agentic tasks beyond a certain complexity, and further that LLMs are incapable of verifying the accuracy of tasks beyond a certain complexity. We present examples of both, then discuss some consequences of this work.

replace-cross Neural Concept Verifier: Scaling Prover-Verifier Games via Concept Encodings

Authors: Berkant Turan, Suhrab Asadulla, David Steinmann, Wolfgang Stammer, Sebastian Pokutta

Abstract: While Prover-Verifier Games (PVGs) offer a promising path toward verifiability in nonlinear classification models, they have not yet been applied to complex inputs such as high-dimensional images. Conversely, Concept Bottleneck Models (CBMs) effectively translate such data into interpretable concepts but are limited by their reliance on low-capacity linear predictors. In this work, we introduce the Neural Concept Verifier (NCV), a unified framework combining PVGs with concept encodings for interpretable, nonlinear classification in high-dimensional settings. NCV achieves this by utilizing recent minimally supervised concept discovery models to extract structured concept encodings from raw inputs. A prover then selects a subset of these encodings, which a verifier -- implemented as a nonlinear predictor -- uses exclusively for decision-making. Our evaluations show that NCV outperforms CBM and pixel-based PVG classifier baselines on high-dimensional, logically complex datasets and also helps mitigate shortcut behavior. Overall, we demonstrate NCV as a promising step toward performative, verifiable AI.

replace-cross Learning Pole Structures of Hadronic States using Predictive Uncertainty Estimation

Authors: Felix Frohnert, Denny Lane B. Sombillo, Evert van Nieuwenburg, Patrick Emonts

Abstract: Matching theoretical predictions to experimental data remains a central challenge in hadron spectroscopy. In particular, the identification of new hadronic states is difficult, as exotic signals near threshold can arise from a variety of physical mechanisms. A key diagnostic in this context is the pole structure of the scattering amplitude, but different configurations can produce similar signatures. The mapping between pole configurations and line shapes is especially ambiguous near the mass threshold, where analytic control is limited. In this work, we introduce an uncertainty-aware machine learning approach for classifying pole structures in $S$-matrix elements. Our method is based on an ensemble of classifier chains that provide both epistemic and aleatoric uncertainty estimates. We apply a rejection criterion based on predictive uncertainty, achieving a validation accuracy of nearly $95\%$ while discarding only a small fraction of high-uncertainty predictions. Trained on synthetic data with known pole structures, the model generalizes to previously unseen experimental data, including enhancements associated with the $P_{c\bar{c}}(4312)^+$ state observed by LHCb. In this, we infer a four-pole structure, representing the presence of a genuine compact pentaquark in the presence of a higher channel virtual state pole with non-vanishing width. While evaluated on this particular state, our framework is broadly applicable to other candidate hadronic states and offers a scalable tool for pole structure inference in scattering amplitudes.

replace-cross Probing Experts' Perspectives on AI-Assisted Public Speaking Training

Authors: Nesrine Fourati, Alisa Barkar, Marion Drag\'ee, Liv Danthon-Lefebvre, Mathieu Chollet

Abstract: Background: Public speaking is a vital professional skill, yet it remains a source of significant anxiety for many individuals. Traditional training relies heavily on expert coaching, but recent advances in AI has led to novel types of commercial automated public speaking feedback tools. However, most research has focused on prototypes rather than commercial applications, and little is known about how public speaking experts perceive these tools. Objectives: This study aims to evaluate expert opinions on the efficacy and design of commercial AI-based public speaking training tools and to propose guidelines for their improvement. Methods: The research involved 16 semi-structured interviews and 2 focus groups with public speaking experts. Participants discussed their views on current commercial tools, their potential integration into traditional coaching, and suggestions for enhancing these systems. Results and Conclusions: Experts acknowledged the value of AI tools in handling repetitive, technical aspects of training, allowing coaches to focus on higher-level skills. However they found key issues in current tools, emphasising the need for personalised, understandable, carefully selected feedback and clear instructional design. Overall, they supported a hybrid model combining traditional coaching with AI-supported exercises.