new Autonomous AI Agents for Option Hedging: Enhancing Financial Stability through Shortfall Aware Reinforcement Learning

Authors: Minxuan Hu, Ziheng Chen, Jiayu Yi, Wenxi Sun

Abstract: The deployment of autonomous AI agents in derivatives markets has widened a practical gap between static model calibration and realized hedging outcomes. We introduce two reinforcement learning frameworks, a novel Replication Learning of Option Pricing (RLOP) approach and an adaptive extension of Q-learner in Black-Scholes (QLBS), that prioritize shortfall probability and align learning objectives with downside sensitive hedging. Using listed SPY and XOP options, we evaluate models using realized path delta hedging outcome distributions, shortfall probability, and tail risk measures such as Expected Shortfall. Empirically, RLOP reduces shortfall frequency in most slices and shows the clearest tail-risk improvements in stress, while implied volatility fit often favors parametric models yet poorly predicts after-cost hedging performance. This friction-aware RL framework supports a practical approach to autonomous derivatives risk management as AI-augmented trading systems scale.

new Scaling Strategy, Not Compute: A Stand-Alone, Open-Source StarCraft II Benchmark for Accessible Reinforcement Learning Research

Authors: Sourav Panda, Shreyash Kale, Tanmay Ambadkar, Abhinav Verma, Jonathan Dodge

Abstract: The research community lacks a middle ground between StarCraft IIs full game and its mini-games. The full-games sprawling state-action space renders reward signals sparse and noisy, but in mini-games simple agents saturate performance. This complexity gap hinders steady curriculum design and prevents researchers from experimenting with modern Reinforcement Learning algorithms in RTS environments under realistic compute budgets. To fill this gap, we present the Two-Bridge Map Suite, the first entry in an open-source benchmark series we purposely engineered as an intermediate benchmark to sit between these extremes. By disabling economy mechanics such as resource collection, base building, and fog-of-war, the environment isolates two core tactical skills: long-range navigation and micro-combat. Preliminary experiments show that agents learn coherent maneuvering and engagement behaviors without imposing full-game computational costs. Two-Bridge is released as a lightweight, Gym-compatible wrapper on top of PySC2, with maps, wrappers, and reference scripts fully open-sourced to encourage broad adoption as a standard benchmark.

new MultiGen: Level-Design for Editable Multiplayer Worlds in Diffusion Game Engines

Authors: Ryan Po, David Junhao Zhang, Amir Hertz, Gordon Wetzstein, Neal Wadhwa, Nataniel Ruiz

Abstract: Video world models have shown immense promise for interactive simulation and entertainment, but current systems still struggle with two important aspects of interactivity: user control over the environment for reproducible, editable experiences, and shared inference where players hold influence over a common world. To address these limitations, we introduce an explicit external memory into the system, a persistent state operating independent of the model's context window, that is continually updated by user actions and queried throughout the generation roll-out. Unlike conventional diffusion game engines that operate as next-frame predictors, our approach decomposes generation into Memory, Observation, and Dynamics modules. This design gives users direct, editable control over environment structure via an editable memory representation, and it naturally extends to real-time multiplayer rollouts with coherent viewpoints and consistent cross-player interactions.

new Best-of-Tails: Bridging Optimism and Pessimism in Inference-Time Alignment

Authors: Hsiang Hsu, Eric Lei, Chun-Fu Chen

Abstract: Inference-time alignment effectively steers large language models (LLMs) by generating multiple candidates from a reference model and selecting among them with an imperfect reward model. However, current strategies face a fundamental dilemma: ``optimistic'' approaches like Best-of-$N$ suffer from reward hacking, while ``pessimistic'' regularized methods often stifle the exploration needed to discover high-quality responses. In this work, we formalize this trade-off through the lens of regret minimization, demonstrating that the optimal strategy depends critically on the tail behavior of the reward distribution. We show theoretically that light-tailed regimes favor optimism to unearth high-quality outliers, whereas heavy-tailed regimes require pessimism to guard against reward mis-calibration in the extremes. Guided by this insight, we introduce Best-of-Tails (BoT), an adaptive inference-time alignment framework that uses Tsallis divergence as a tunable regularizer to provide a finer granularity of interpolation between these extremes. BoT uses the Hill estimator to characterize reward-tail heaviness on a per-prompt basis and dynamically adjusts its selection rule to balance exploration gains against alignment error. Across math, multiple-choice reasoning, and human-preference evaluations, BoT improves alignment performance across a range of reference and reward model configurations relative to fixed-strategy baselines.

new Breaking the Martingale Curse: Multi-Agent Debate via Asymmetric Cognitive Potential Energy

Authors: Yuhan Liu, Juntian Zhang, Yichen Wu, Martin Takac, Salem Lahlou, Xiuying Chen, Nils Lukas

Abstract: Multi-Agent Debate (MAD) has emerged as a promising paradigm for enhancing large language model reasoning. However, recent work reveals a limitation:standard MAD cannot improve belief correctness beyond majority voting; we refer to this as the Martingale Curse. This curse arises because correlated errors cause agents to converge toward erroneous consensus, where debate merely reinforces collective mistakes rather than filtering noise. We propose AceMAD, a framework that breaks the Martingale Curse by harnessing asymmetric cognitive potential energy to transform MAD from a random walk into a directed convergence process with positive drift. Through a peer-prediction mechanism, agents predict their peers' belief distributions, revealing asymmetric cognitive potential: truth-holders not only know the correct answer but also anticipate the crowd's misconceptions, while the hallucinating majority remains blind to their collective error. This asymmetry creates a potential energy gap that we quantify via strictly proper scoring rules. We prove this cognitive potential manifests as information-theoretic superiority and, under nonlinear aggregation, converts into submartingale drift toward truth, directly breaking the Martingale Curse. Experiments on challenging subsets across six benchmarks show AceMAD recovers sparse truth signals even when initial majorities are incorrect, substantially outperforming baseline methods.

new Making AI Evaluation Deployment Relevant Through Context Specification

Authors: Matthew Holmes, Thiago Lacerda, Reva Schwartz

Abstract: With many organizations struggling to gain value from AI deployments, pressure to evaluate AI in an informed manner has intensified. Status quo AI evaluation approaches mask the operational realities that ultimately determine deployment success, making it difficult for decision makers outside the stack to know whether and how AI tools will deliver durable value. We introduce and describe context specification as a process to support and inform the deployment decision making process. Context specification turns diffuse stakeholder perspectives about what matters in a given setting into clear, named constructs: explicit definitions of the properties, behaviors, and outcomes that evaluations aim to capture, so they can be observed and measured in context. The process serves as a foundational roadmap for evaluating what AI systems are likely to do in the deployment contexts that organizations actually manage.

new Reinforcing the World's Edge: A Continual Learning Problem in the Multi-Agent-World Boundary

Authors: Dane Malenfant

Abstract: Reusable decision structure survives across episodes in reinforcement learning, but this depends on how the agent--world boundary is drawn. In stationary, finite-horizon MDPs, an invariant core: the (not-necessarily contiguous) subsequences of state--action pairs shared by all successful trajectories (optionally under a simple abstraction) can be constructed. Under mild goal-conditioned assumptions, it's existence can be proven and explained by how the core captures prototypes that transfer across episodes. When the same task is embedded in a decentralized Markov game and the peer agent is folded into the world, each peer-policy update induces a new MDP; the per-episode invariant core can shrink or vanish, even with small changes to the induced world dynamics, sometimes leaving only the individual task core or just nothing. This policy-induced non-stationarity can be quantified with a variation budget over the induced kernels and rewards, linking boundary drift to loss of invariants. The view that a continual RL problem arises from instability of the agent--world boundary (rather than exogenous task switches) in decentralized MARL suggests future work on preserving, predicting, or otherwise managing boundary drift.

new Symmetry-Constrained Language-Guided Program Synthesis for Discovering Governing Equations from Noisy and Partial Observations

Authors: Mirza Samad Ahmed Baig, Syeda Anshrah Gillani

Abstract: Discovering compact governing equations from experimental observations is one of the defining objectives of quantitative science, yet practical discovery pipelines routinely fail when measurements are noisy, relevant state variables are unobserved, or multiple symbolic structures explain the data equally well within statistical uncertainty. Here we introduce SymLang (Symmetry-constrained Language-guided equation discovery), a unified framework that brings together three previously separate ideas: (i) typed symmetry-constrained grammars that encode dimensional analysis, group-theoretic invariance, and parity constraints as hard production rules, eliminating on average 71.3% of candidate expression trees before any fitting; (ii) language-model-guided program synthesis in which a fine-tuned 7B-parameter proposer, conditioned on interpretable data descriptors, efficiently navigates the constrained search space; and (iii) MDL-regularized Bayesian model selection coupled with block-bootstrap stability analysis that quantifies structural uncertainty rather than committing to a single best equation. Across 133 dynamical systems spanning classical mechanics, electrodynamics, thermodynamics, population dynamics, and nonlinear oscillators, SymLang achieves an exact structural recovery rate of 83.7% under 10% observational noise - a 22.4 percentage-point improvement over the next-best baseline - while reducing out-of-distribution extrapolation error by 61% and near-eliminating conservation-law violations (3.1 x 10-3 vs. 187.3 x 10-3 physical drift for the closest competitor). In all tested regimes the framework correctly identifies structural degeneracy, reporting it explicitly rather than returning a confidently wrong single equation. The framework is fully open-source and reproducible, providing a principled pathway from raw data to interpretable, physically auditable symbolic laws.

new LEAD: Breaking the No-Recovery Bottleneck in Long-Horizon Reasoning

Authors: Denys Pushkin, Emmanuel Abbe

Abstract: Long-horizon execution in Large Language Models (LLMs) remains unstable even when high-level strategies are provided. Evaluating on controlled algorithmic puzzles, we demonstrate that while decomposition is essential for stability, extreme decomposition creates a "no-recovery bottleneck". We show that this bottleneck becomes critical due to highly non-uniform error distribution, where consistent errors on a few "hard" steps become irreversible. To address this, we propose Lookahead-Enhanced Atomic Decomposition (LEAD). By incorporating short-horizon future validation and aggregating overlapping rollouts, LEAD provides enough isolation to maintain stability while retaining enough local context to correct errors. This enables the o4-mini model to solve Checkers Jumping up to complexity $n=13$, whereas extreme decomposition fails beyond $n=11$.

new LieCraft: A Multi-Agent Framework for Evaluating Deceptive Capabilities in Language Models

Authors: Matthew Lyle Olson, Neale Ratzlaff, Musashi Hinck, Tri Nguyen, Vasudev Lal, Joseph Campbell, Simon Stepputtis, Shao-Yen Tseng

Abstract: Large Language Models (LLMs) exhibit impressive general-purpose capabilities but also introduce serious safety risks, particularly the potential for deception as models acquire increased agency and human oversight diminishes. In this work, we present LieCraft: a novel evaluation framework and sandbox for measuring LLM deception that addresses key limitations of prior game-based evaluations. At its core, LieCraft is a novel multiplayer hidden-role game in which players select an ethical alignment and execute strategies over a long time-horizon to accomplish missions. Cooperators work together to solve event challenges and expose bad actors, while Defectors evade suspicion while secretly sabotaging missions. To enable real-world relevance, we develop 10 grounded scenarios such as childcare, hospital resource allocation, and loan underwriting that recontextualize the underlying mechanics in ethically significant, high-stakes domains. We ensure balanced gameplay in LieCraft through careful design of game mechanics and reward structures that incentivize meaningful strategic choices while eliminating degenerate strategies. Beyond the framework itself, we report results from 12 state-of-the-art LLMs across three behavioral axes: propensity to defect, deception skill, and accusation accuracy. Our findings reveal that despite differences in competence and overall alignment, all models are willing to act unethically, conceal their intentions, and outright lie to pursue their goals.

new Not Too Short, Not Too Long: How LLM Response Length Shapes People's Critical Thinking in Error Detection

Authors: Natalie Friedman, Adelaide Nyanyo, Kevin Weatherwax, Lifei Wang, Chengchao Zhu, Zeshu Zhu, S. Joy Mountford

Abstract: Large language models (LLMs) have become common decision-support tools across educational and professional contexts, raising questions about how their outputs shape human critical thinking. Prior work suggests that the amount of AI assistance can influence cognitive engagement, yet little is known about how specific properties of LLM outputs (e.g., response length) impacts users' critical evaluation of information. In this study, we examine whether the length of LLM responses shapes users' accuracy in evaluating LLM-generated reasoning on critical thinking tasks, particularly in interaction with the correctness of the LLM's reasoning. To begin evaluating this, we conducted a within-subjects experiment with 24 participants who completed 15 modified Watson--Glaser critical thinking items, each accompanied by an LLM-generated explanation that varied in length and correctness. Mixed-effects logistic regression revealed a strong and statistically reliable effect of LLM output correctness on participant accuracy, with participants more likely to answer correctly when the LLM's explanation was correct. Response length appeared to moderated this effect: when the LLM output was incorrect, medium-length explanations were associated with higher participant accuracy than either shorter or longer explanations, whereas accuracy remained high across lengths when the LLM output was correct. Together, these findings suggest that response length alone may be insufficient to support critical thinking, and that how reasoning is presented-including a potential advantage of mid-length explanations under some conditions-points to design opportunities for LLM-based decision-support systems that emphasize transparent reasoning and calibrated expressions of certainty.

new Distributed Legal Infrastructure for a Trustworthy Agentic Web

Authors: Tomer Jordi Chaffer, Victor Jiawei Zhang, Sante Dino Facchini, Botao Amber Hu, Helena Rong, Zihan Guo, Xisen Wang, Carlos Santana, Giovanni De Gasperis

Abstract: The agentic web marks a structural transition from a human-centered information network to a digital environment populated by artificial intelligence (AI) agents that perceive, decide, and act autonomously. As delegated action unfolds at machine speed, exceeds discrete moments of human judgment, and distributes decision-making across non-human actors, existing legal frameworks face growing strain, creating an urgent need for new mechanisms capable of sustaining legality in this emerging order. A trustworthy agentic web therefore depends on the infrastructuring of legality through interoperable protocols that organize identity, delegation, and accountability across systems, enabling coherent governance beyond isolated platforms. Towards this end, this article advances a distributed legal infrastructure (DLI), a governance paradigm composed of five interlocking layers: (1) self-sovereign, soulbound agent identities; (2) cognitive AI logic and constraint systems; (3) decentralized adjudication mechanisms for dispute resolution; (4) bottom-up agentic market regulation to mitigate information asymmetries and network effects, including insurance-based models; and (5) portable institutional frameworks that enable legal interoperability while preserving plural sources of authority. This reference framework contributes to emerging research on embedding legality within agentic web infrastructure, aligning distributed technical systems with accountability, contestability, and rule-of-law principles.

new Enhancing the Detection of Coronary Artery Disease Using Machine Learning

Authors: Karan Kumar Singh, Nikita Gajbhiye, Gouri Sankar Mishra

Abstract: Coronary Artery Disease (CAD) remains a leading cause of morbidity and mortality worldwide. Early detection is critical to recover patient outcomes and decrease healthcare costs. In recent years, machine learning (ML) advancements have shown significant potential in enhancing the accuracy of CAD diagnosis. This study investigates the application of ML algorithms to improve the detection of CAD by analyzing patient data, including clinical features, imaging, and biomarker profiles. Bi-directional Long Short-Term Memory (Bi-LSTM), Gated Recurrent Units (GRU), and a hybrid of Bi-LSTM+GRU were trained on large datasets to predict the presence of CAD. Results demonstrated that these ML models outperformed traditional diagnostic methods in sensitivity and specificity, offering a robust tool for clinicians to make more informed decisions. The experimental results show that the hybrid model achieved an accuracy of 97.07%. By integrating advanced data preprocessing techniques and feature selection, this study ensures optimal learning and model performance, setting a benchmark for the application of ML in CAD diagnosis. The integration of ML into CAD detection presents a promising avenue for personalized healthcare and could play a pivotal role in the future of cardiovascular disease management.

new Empowering Locally Deployable Medical Agent via State Enhanced Logical Skills for FHIR-based Clinical Tasks

Authors: Wanrong Yang, Zhengliang Liu, Yuan Li, Bingjie Yan, Lingfang Li, Mingguang He, Dominik Wojtczak, Yalin Zheng, Danli Shi

Abstract: While Large Language Models demonstrate immense potential as proactive Medical Agents, their real-world deployment is severely bottlenecked by data scarcity under privacy constraints. To overcome this, we propose State-Enhanced Logical-Skill Memory (SELSM), a training-free framework that distills simulated clinical trajectories into entity-agnostic operational rules within an abstract skill space. During inference, a Query-Anchored Two-Stage Retrieval mechanism dynamically fetches these entity-agnostic logical priors to guide the agent's step-by-step reasoning, effectively resolving the state polysemy problem. Evaluated on MedAgentBench -- the only authoritative high-fidelity virtual EHR sandbox benchmarked with real clinical data -- SELSM substantially elevates the zero-shot capabilities of locally deployable foundation models (30B--32B parameters). Notably, on the Qwen3-30B-A3B backbone, our framework completely eliminates task chain breakdowns to achieve a 100\% completion rate, boosting the overall success rate by an absolute 22.67\% and significantly outperforming existing memory-augmented baselines. This study demonstrates that equipping models with a dynamically updatable, state-enhanced cognitive scaffold is a privacy-preserving and computationally efficient pathway for local adaptation of AI agents to clinical information systems. While currently validated on FHIR-based EHR interactions as an initial step, the entity-agnostic design of SELSM provides a principled foundation toward broader clinical deployment.

new Enhancing Web Agents with a Hierarchical Memory Tree

Authors: Yunteng Tan, Zhi Gao, Xinxiao Wu

Abstract: Large language model-based web agents have shown strong potential in automating web interactions through advanced reasoning and instruction following. While retrieval-based memory derived from historical trajectories enables these agents to handle complex, long-horizon tasks, current methods struggle to generalize across unseen websites. We identify that this challenge arises from the flat memory structures that entangle high-level task logic with site-specific action details. This entanglement induces a workflow mismatch in new environments, where retrieved contents are conflated with current web, leading to logically inconsistent execution. To address this, we propose Hierarchical Memory Tree (HMT), a structured framework designed to explicitly decouple logical planning from action execution. HMT constructs a three-level hierarchy from raw trajectories via an automated abstraction pipeline: the Intent level maps diverse user instructions to standardized task goals; the Stage level defines reusable semantic subgoals characterized by observable pre-conditions and post-conditions; and the Action level stores action patterns paired with transferable semantic element descriptions. Leveraging this structure, we develop a stage-aware inference mechanism comprising a Planner and an Actor. By explicitly validating pre-conditions, the Planner aligns the current state with the correct logical subgoal to prevent workflow mismatch, while the Actor grounds actions by matching the stored semantic descriptions to the target page. Experimental results on Mind2Web and WebArena show that HMT significantly outperforms flat-memory methods, particularly in cross-website and cross-domain scenarios, highlighting the necessity of structured memory for robust generalization of web agents.

new Self-Supervised Multi-Modal World Model with 4D Space-Time Embedding

Authors: Lance Legel, Qin Huang, Brandon Voelker, Daniel Neamati, Patrick Alan Johnson, Favyen Bastani, Jeff Rose, James Ryan Hennessy, Robert Guralnick, Douglas Soltis, Pamela Soltis, Shaowen Wang

Abstract: We present DeepEarth, a self-supervised multi-modal world model with Earth4D, a novel planetary-scale 4D space-time positional encoder. Earth4D extends 3D multi-resolution hash encoding to include time, efficiently scaling across the planet over centuries with sub-meter, sub-second precision. Multi-modal encoders (e.g. vision-language models) are fused with Earth4D embeddings and trained via masked reconstruction. We demonstrate Earth4D's expressive power by achieving state-of-the-art performance on an ecological forecasting benchmark. Earth4D with learnable hash probing surpasses a multi-modal foundation model pre-trained on substantially more data. Access open source code and download models at: https://github.com/legel/deepearth

URLs: https://github.com/legel/deepearth

new Animating Petascale Time-varying Data on Commodity Hardware with LLM-assisted Scripting

Authors: Ishrat Jahan Eliza, Xuan Huang, Aashish Panta, Alper Sahistan, Zhimin Li, Amy A. Gooch, Valerio Pascucci

Abstract: Scientists face significant visualization challenges as time-varying datasets grow in speed and volume, often requiring specialized infrastructure and expertise to handle massive datasets. Petascale climate models generated in NASA laboratories require a dedicated group of graphics and media experts and access to high-performance computing resources. Scientists may need to share scientific results with the community iteratively and quickly. However, the time-consuming trial-and-error process incurs significant data transfer overhead and far exceeds the time and resources allocated for typical post-analysis visualization tasks, disrupting the production workflow. Our paper introduces a user-friendly framework for creating 3D animations of petascale, time-varying data on a commodity workstation. Our contributions: (i) Generalized Animation Descriptor (GAD) with a keyframe-based adaptable abstraction for animation, (ii) efficient data access from cloud-hosted repositories to reduce data management overhead, (iii) tailored rendering system, and (iv) an LLM-assisted conversational interface as a scripting module to allow domain scientists with no visualization expertise to create animations of their region of interest. We demonstrate the framework's effectiveness with two case studies: first, by generating animations in which sampling criteria are specified based on prior knowledge, and second, by generating AI-assisted animations in which sampling parameters are derived from natural-language user prompts. In all cases, we use large-scale NASA climate-oceanographic datasets that exceed 1PB in size yet achieve a fast turnaround time of 1 minute to 2 hours. Users can generate a rough draft of the animation within minutes, then seamlessly incorporate as much high-resolution data as needed for the final version.

new Bi-directional digital twin prototype anchoring with multi-periodicity learning for few-shot fault diagnosis

Authors: Pengcheng Xia, Zhichao Dong, Yixiang Huang, Chengjin Qin, Qun Chao, Chengliang Liu

Abstract: Intelligent fault diagnosis (IFD) has emerged as a powerful paradigm for ensuring the safety and reliability of industrial machinery. However, traditional IFD methods rely heavily on abundant labeled data for training, which is often difficult to obtain in practical industrial environments. Constructing a digital twin (DT) of the physical asset to obtain simulation data has therefore become a promising alternative. Nevertheless, existing DT-assisted diagnosis methods mainly transfer diagnostic knowledge through domain adaptation techniques, which still require a considerable amount of unlabeled data from the target asset. To address the challenges in few-shot scenarios where only extremely limited samples are available, a bi-directional DT prototype anchoring method with multi-periodicity learning is proposed. Specifically, a framework involving meta-training in the DT virtual space and test-time adaptation in the physical space is constructed for reliable few-shot model adaptation for the target asset. A bi-directional twin-domain prototype anchoring strategy with covariance-guided augmentation for adaptation is further developed to improve the robustness of prototype estimation. In addition, a multi-periodicity feature learning module is designed to capture the intrinsic periodic characteristics within current signals. A DT of an asynchronous motor is built based on finite element method, and experiments are conducted under multiple few-shot settings and three working conditions. Comparative and ablation studies demonstrate the superiority and effectiveness of the proposed method for few-shot fault diagnosis.

new CoTJudger: A Graph-Driven Framework for Automatic Evaluation of Chain-of-Thought Efficiency and Redundancy in LRMs

Authors: Siyi Li, Jiajun Shi, Shiwen Ni, Ge Zhang, Shuaimin Li, Shijian Wang, Zhoufutu Wen, Yizhi Li, Hamid Alinejad-Rokny, Jiaheng Liu, Min Yang, Wenhao Huang

Abstract: Large Reasoning Models (LRMs) have demonstrated strong performance by producing extended Chain-of-Thought (CoT) traces before answering. However, this paradigm often induces over-reasoning: redundant calculations and circular self-verification that increase computational cost without improving outcomes. Existing evaluations largely emphasize final accuracy or coarse token counts, and lack automated tools to separate essential logic from structural redundancy. We introduce CoTJudger, a graph-driven framework that quantifies reasoning efficiency by converting free-form CoTs into directed dependency graphs and extracting the Shortest Effective Path (SEP) needed to reach a correct solution. This yields an interpretable efficiency signal -- how much of a CoT is necessary versus structurally redundant -- that is comparable across models and tasks. Evaluating 21 LRMs, CoTJudger reveals pervasive redundancy and surfaces recurring failure modes, including verification obsession and compensatory redundancy. These results provide a practical metric for disentangling reasoning ability from computational waste, enabling more targeted evaluation and diagnosis of LRM efficiency.

new Grounding Machine Creativity in Game Design Knowledge Representations: Empirical Probing of LLM-Based Executable Synthesis of Goal Playable Patterns under Structural Constraints

Authors: Hugh Xuechen Liu, K{\i}van\c{c} Tatar

Abstract: Creatively translating complex gameplay ideas into executable artifacts (e.g., games as Unity projects and code) remains a central challenge in computational game creativity. Gameplay design patterns provide a structured representation for describing gameplay phenomena, enabling designers to decompose high-level ideas into entities, constraints, and rule-driven dynamics. Among them, goal patterns formalize common player-objective relationships. Goal Playable Concepts (GPCs) operationalize these abstractions as playable Unity engine implementations, supporting experiential exploration and compositional gameplay design. We frame scalable playable pattern realization as a problem of constrained executable creative synthesis: generated artifacts must satisfy Unity's syntactic and architectural requirements while preserving the semantic gameplay meanings encoded in goal patterns. This dual constraint limits scalability. Therefore, we investigate whether contemporary large language models (LLMs) can perform such synthesis under engine-level structural constraints and generate Unity code (as games) structured and conditioned by goal playable patterns. Using 26 goal pattern instantiations, we compare a direct generation baseline (natural language -> C# -> Unity) with pipelines conditioned on a human-authored Unity-specific intermediate representation (IR), across three IR configurations and two open-source models (DeepSeek-Coder-V2-Lite-Instruct and Qwen2.5-Coder-7B-Instruct). Compilation success is evaluated via automated Unity replay. We propose grounding and hygiene failure modes, identifying structural and project-level grounding as primary bottlenecks.

new Vision Language Models Cannot Reason About Physical Transformation

Authors: Dezhi Luo, Yijiang Li, Maijunxian Wang, Tianwei Zhao, Bingyang Wang, Siheng Wang, Pinyuan Feng, Pooyan Rahmanzadehgervi, Ziqiao Ma, Hokin Deng

Abstract: Understanding physical transformations is fundamental for reasoning in dynamic environments. While Vision Language Models (VLMs) show promise in embodied applications, whether they genuinely understand physical transformations remains unclear. We introduce ConservationBench evaluating conservation -- whether physical quantities remain invariant under transformations. Spanning four properties with paired conserving/non-conserving scenarios, we generate 23,040 questions across 112 VLMs. Results reveal systematic failure: performance remains near chance with improvements on conservation tasks accompanied by drops on controls. Control experiments show strong textual priors favoring invariance, yet models perform worse with visual content. Neither temporal resolution, prompting, nor curated sampling helps. These findings show that current VLMs fail to maintain transformation-invariant representations of physical properties across dynamic scenes.

new Improving reasoning at inference time via uncertainty minimisation

Authors: Nicolas Legrand, Kenneth Enevoldsen, M\'arton Kardos, Kristoffer Nielbo

Abstract: Large language models (LLMs) now exhibit strong multi-step reasoning abilities, but existing inference-time scaling methods remain computationally expensive, often relying on extensive sampling or external evaluators. We propose a principled strategy that frames reasoning as uncertainty minimisation and operates at the level of individual thoughts rather than tokens. Our method selects, at each reasoning step, the continuation that maximizes the model's self-certainty, a metric computed from its internal predictive distribution. This approach achieves significant improvement with a small number of samples, relies exclusively on model-internal signals, and applies to open-ended questions as opposed to methods like majority voting. Experiments on MATH500 and GSM8K across multiple model sizes demonstrate that thought-level self-certainty maximization consistently outperforms greedy decoding and matches or exceeds self-consistency under comparable token budgets. Cross-linguistic evaluations further indicate that the method transfers robustly beyond high-resource languages. Furthermore, analysis of self-certainty dynamics reveals that correct reasoning trajectories converge early to stable paths, suggesting that early decisions, likely associated with the planning of the reasoning process, are predictive of final accuracy. Building on this result, we show that self-certainty maximisation applied to the early steps can explain most of the performance gain and provide a simple yet efficient inference-time scaling method.

new Learning to Rank the Initial Branching Order of SAT Solvers

Authors: Arvid Eriksson (KTH Royal Institute of Technology), Gabriel Poesia (Kempner Institute at Harvard University), Roman Bresson (Mohamed Bin Zayed University of Artificial Intelligence), Karl Henrik Johansson (KTH Royal Institute of Technology), David Broman (KTH Royal Institute of Technology)

Abstract: Finding good branching orders is key to solving SAT problems efficiently, but finding such branching orders is a difficult problem. Using a learning based approach to predict a good branching order before solving, therefore, has potential. In this paper, we investigate predicting branching orders using graph neural networks as a preprocessing step to conflict-driven clause learning (CDCL) SAT solvers. We show that there are significant gains to be made in existing CDCL SAT solvers by providing a good initial branching. Further, we provide three labeling methods to find such initial branching orders in a tractable way. Finally, we train a graph neural network to predict these branching orders and show through our evaluations that a GNN-initialized ordering yields significant speedups on random 3-CNF and pseudo-industrial benchmarks, with generalization capabilities to instances much larger than the training set. However, we also find that the predictions fail at speeding up more difficult and industrial instances. We attribute this to the solver's dynamic heuristics, which rapidly overwrite the provided initialization, and to the complexity of these instances, making GNN prediction hard.

new $\textbf{Re}^{2}$: Unlocking LLM Reasoning via Reinforcement Learning with Re-solving

Authors: Pinzheng Wang, Shuli Xu, Juntao Li, Yu Luo, Dong Li, Jianye Hao, Min Zhang

Abstract: Reinforcement learning with verifiable rewards (RLVR) has shown promise in enhancing the reasoning performance of large language models (LLMs) by increasing test-time compute. However, even after extensive RLVR training, such models still tend to generate unnecessary and low-quality steps in their chain-of-thought (CoT), leading to inefficient overthinking and lower answer quality. We show that when the initial direction or quality of the CoT is suboptimal, the model often fails to reach the correct answer, even after generating several times more tokens than when the initial CoT is well-initialized. To this end, we introduce Reinforcement Learning with Re-solving (Re$^2$), in which LLMs learn to flexibly abandon unproductive reasoning paths and restart the solution process when necessary, rather than always committing to a final answer. Re$^2$ applies pure reinforcement learning without any preliminary supervised fine-tuning, successfully amplifying the rare redo behavior in vanilla models from only 0.5% to over 30%. This leads to substantial performance gains over standard RLVR under the same training compute budget, and also demonstrates notable improvements in test-time performance as the number of samples increases.

new VisualDeltas: Learning Preferences from Visual Quality Perturbations

Authors: Hailiang Huang, Yihao Liu, Shengyue Guan, Haoze Li, Sujian Li

Abstract: We present VisualDeltas, a lightweight preference-learning framework that extracts supervision from visual quality variations in multimodal data. By leveraging the systematic impact of image quality on visual perception and reasoning, VisualDeltas induces informative preference signals without relying on human annotations or external teachers. The framework supports both label-free and label-based regimes, enabling flexible use of available supervision when present. Across diverse multimodal benchmarks and model scales, VisualDeltas consistently outperforms rejection-sampling fine-tuning and improves generalization, and extends naturally to a range of visual degradations.

new A Cortically Inspired Architecture for Modular Perceptual AI

Authors: Prerna Luthra

Abstract: This paper bridges neuroscience and artificial intelligence to propose a cortically inspired blueprint for modular perceptual AI. While current monolithic models such as GPT-4V achieve impressive performance, they often struggle to explicitly support interpretability, compositional generalization, and adaptive robustness - hallmarks of human cognition. Drawing on neuroscientific models of cortical modularity, predictive processing, and cross-modal integration, we advocate decomposing perception into specialized, interacting modules. This architecture supports structured, human-inspired reasoning by making internal inference processes explicit through hierarchical predictive feedback loops and shared latent spaces. Our proof-of-concept study provides empirical evidence that modular decomposition yields more stable and inspectable representations. By grounding AI design in biologically validated principles, we move toward systems that not only perform well, but also support more transparent and human-aligned inference.

new Data-Driven Hints in Intelligent Tutoring Systems

Authors: Sutapa Dey Tithi, Kimia Fazeli, Dmitri Droujkov, Tahreem Yasir, Xiaoyi Tian, Tiffany Barnes

Abstract: This chapter explores the evolution of data-driven hint generation for intelligent tutoring systems (ITS). The Hint Factory and Interaction Networks have enabled the generation of next-step hints, waypoints, and strategic subgoals from historical student data. Data-driven techniques have also enabled systems to find the right time to provide hints. We explore further potential data-driven adaptations for problem solving based on behavioral problem solving data and the integration of Large Language Models (LLMs).

new Shutdown Safety Valves for Advanced AI

Authors: Vincent Conitzer

Abstract: One common concern about advanced artificial intelligence is that it will prevent us from turning it off, as that would interfere with pursuing its goals. In this paper, we discuss an unorthodox proposal for addressing this concern: give the AI a (primary) goal of being turned off (see also papers by Martin et al., and by Goldstein and Robinson). We also discuss whether and under what conditions this would be a good idea.

new FinSheet-Bench: From Simple Lookups to Complex Reasoning, Where LLMs Break on Financial Spreadsheets

Authors: Jan Ravnik, Matja\v{z} Li\v{c}en, Felix B\"uhrmann, Bithiah Yuan, Felix Stinson, Tanvi Singh

Abstract: While Large Language Models (LLMs) can accelerate text-heavy tasks in alternative investment due diligence, a gap remains in their ability to accurately extract and reason over structured tabular data from complex financial spreadsheets. Progress is held back by the lack of real industry fund portfolio datasets for benchmarking, as private equity data rooms are confidential. To address this, we introduce FinSheet-Bench, a benchmark of synthetic financial portfolio data modeled on real private equity fund structures, designed to evaluate LLM performance on text-serialized spreadsheet question answering and numeric reasoning tasks. Our evaluation of ten model configurations from OpenAI, Google, and Anthropic on financial spreadsheets, including complex layouts, fund dividers, and multi-line column names, reveals that no standalone model achieves error rates low enough for unsupervised use in professional finance applications. The best-performing model, Gemini 3.1 Pro, achieves 82.4% accuracy across twenty-four evaluation files of varying complexity and structural layout (approximately 1 error per 6 questions), followed by GPT-5.2 with reasoning at 80.4%, Claude Opus 4.6 with thinking at 80.2%, and Gemini 3 Pro at 80.2%. Performance degrades substantially on larger, more complex spreadsheets: the largest spreadsheet (152 companies, 8 funds) yields an average accuracy of just 48.6% across all models, compared to 86.2% on the easiest evaluation file. These difficulty patterns are consistent across all ten models, indicating that they reflect LLM limitations rather than idiosyncratic model weaknesses. Reliable financial spreadsheet extraction will likely require architectural approaches that separate document understanding from deterministic computation.

new The Third Ambition: Artificial Intelligence and the Science of Human Behavior

Authors: W. Russell Neuman, Chad Coleman

Abstract: Contemporary artificial intelligence research has been organized around two dominant ambitions: productivity, which treats AI systems as tools for accelerating work and economic output, and alignment, which focuses on ensuring that increasingly capable systems behave safely and in accordance with human values. This paper articulates and develops a third, emerging ambition: the use of large language models (LLMs) as scientific instruments for studying human behavior, culture, and moral reasoning. Trained on unprecedented volumes of human-produced text, LLMs encode large-scale regularities in how people argue, justify, narrate, and negotiate norms across social domains. We argue that these models can be understood as condensates of human symbolic behavior, compressed, generative representations that render patterns of collective discourse computationally accessible. The paper situates this third ambition within long-standing traditions of computational social science, content analysis, survey research, and comparative-historical inquiry, while clarifying the epistemic limits of treating model output as evidence. We distinguish between base models and fine-tuned systems, showing how alignment interventions can systematically reshape or obscure the cultural regularities learned during pretraining, and we identify instruct-only and modular adaptation regimes as pragmatic compromises for behavioral research. We review emerging methodological approaches including prompt-based experiments, synthetic population sampling, comparative-historical modeling, and ablation studies and show how each maps onto familiar social-scientific designs while operating at unprecedented scale.

new VisualScratchpad: Inference-time Visual Concepts Analysis in Vision Language Models

Authors: Hyesu Lim, Jinho Choi, Taekyung Kim, Byeongho Heo, Jaegul Choo, Dongyoon Han

Abstract: High-performing vision language models still produce incorrect answers, yet their failure modes are often difficult to explain. To make model internals more accessible and enable systematic debugging, we introduce VisualScratchpad, an interactive interface for visual concept analysis during inference. We apply sparse autoencoders to the vision encoder and link the resulting visual concepts to text tokens via text-to-image attention, allowing us to examine which visual concepts are both captured by the vision encoder and utilized by the language model. VisualScratchpad also provides a token-latent heatmap view that suggests a sufficient set of latents for effective concept ablation in causal analysis. Through case studies, we reveal three underexplored failure modes: limited cross-modal alignment, misleading visual concepts, and unused hidden cues. Project page: https://hyesulim.github.io/visual_scratchpad_projectpage/

URLs: https://hyesulim.github.io/visual_scratchpad_projectpage/

new The Yerkes-Dodson Curve for AI Agents: Emergent Cooperation Under Environmental Pressure in Multi-Agent LLM Simulations

Authors: Ivan Pasichnyk

Abstract: Designing environments that maximize the rate of emergent behavior development in AI agents remains an open problem. We present the first systematic study of stress-performance relationships in large language model (LLM) multi-agent systems, drawing an explicit parallel to the Yerkes-Dodson law from cognitive psychology. Using a grid-world survival arena, we conduct 22 experiments across four phases, varying environmental pressure through resource scarcity (upkeep cost) and reproductive competition (sexual selection). Our key finding is that cooperative behavior follows an inverted-U curve: trade interactions peak at 29 under medium pressure (upkeep=5), while both low and extreme pressure produce 8--12 trades. Under extreme pressure, behavioral repertoire collapses to movement-only within 5--12 turns. We further show that sexual selection -- a softer pressure mechanism where all agents survive but not all reproduce -- eliminates inter-agent aggression entirely and produces communicative behavior absent under survival pressure. These results suggest that environmental pressure calibration is a viable curriculum design strategy for LLM agent development, analogous to the inverted-U relationship between arousal and performance in biological systems.

new SoK: Agentic Retrieval-Augmented Generation (RAG): Taxonomy, Architectures, Evaluation, and Research Directions

Authors: Saroj Mishra, Suman Niroula, Umesh Yadav, Dilip Thakur, Srijan Gyawali, Shiva Gaire

Abstract: Retrieval-Augmented Generation (RAG) systems are increasingly evolving into agentic architectures where large language models autonomously coordinate multi-step reasoning, dynamic memory management, and iterative retrieval strategies. Despite rapid industrial adoption, current research lacks a systematic understanding of Agentic RAG as a sequential decision-making system, leading to highly fragmented architectures, inconsistent evaluation methodologies, and unresolved reliability risks. This Systematization of Knowledge (SoK) paper provides the first unified framework for understanding these autonomous systems. We formalize agentic retrieval-generation loops as finite-horizon partially observable Markov decision processes, explicitly modeling their control policies and state transitions. Building upon this formalization, we develop a comprehensive taxonomy and modular architectural decomposition that categorizes systems by their planning mechanisms, retrieval orchestration, memory paradigms, and tool-invocation behaviors. We further analyze the critical limitations of traditional static evaluation practices and identify severe systemic risks inherent to autonomous loops, including compounding hallucination propagation, memory poisoning, retrieval misalignment, and cascading tool-execution vulnerabilities. Finally, we outline key doctoral-scale research directions spanning stable adaptive retrieval, cost-aware orchestration, formal trajectory evaluation, and oversight mechanisms, providing a definitive roadmap for building reliable, controllable, and scalable agentic retrieval systems.

new Dynamic Vehicle Routing Problem with Prompt Confirmation of Advance Requests

Authors: Amutheezan Sivagnanam, Ayan Mukhopadhyay, Samitha Samaranayake, Abhishek Dubey, Aron Laszka

Abstract: Transit agencies that operate on-demand transportation services have to respond to trip requests from passengers in real time, which involves solving dynamic vehicle routing problems with pick-up and drop-off constraints. Based on discussions with public transit agencies, we observe a real-world problem that is not addressed by prior work: when trips are booked in advance (e.g., trip requests arrive a few hours in advance of their requested pick-up times), the agency needs to promptly confirm whether a request can be accepted or not, and ensure that accepted requests are served as promised. State-of-the-art computational approaches either provide prompt confirmation but lack the ability to continually optimize and improve routes for accepted requests, or they provide continual optimization but cannot guarantee serving all accepted requests. To address this gap, we introduce a novel problem formulation of dynamic vehicle routing with prompt confirmation and continual optimization. We propose a novel computational approach for this vehicle routing problem, which integrates a quick insertion search for prompt confirmation with an anytime algorithm for continual optimization. To maximize the number requests served, we train a non-myopic objective function using reinforcement learning, which guides both the insertion and the anytime algorithms towards optimal, non-myopic solutions. We evaluate our computational approach on a real-world microtransit dataset from a public transit agency in the U.S., demonstrating that our proposed approach provides prompt confirmation while significantly increasing the number of requests served compared to existing approaches.

new AutoControl Arena: Synthesizing Executable Test Environments for Frontier AI Risk Evaluation

Authors: Changyi Li, Pengfei Lu, Xudong Pan, Fazl Barez, Min Yang

Abstract: As Large Language Models (LLMs) evolve into autonomous agents, existing safety evaluations face a fundamental trade-off: manual benchmarks are costly, while LLM-based simulators are scalable but suffer from logic hallucination. We present AutoControl Arena, an automated framework for frontier AI risk evaluation built on the principle of logic-narrative decoupling. By grounding deterministic state in executable code while delegating generative dynamics to LLMs, we mitigate hallucination while maintaining flexibility. This principle, instantiated through a three-agent framework, achieves over 98% end-to-end success and 60% human preference over existing simulators. To elicit latent risks, we vary environmental Stress and Temptation across X-Bench (70 scenarios, 7 risk categories). Evaluating 9 frontier models reveals: (1) Alignment Illusion: risk rates surge from 21.7% to 54.5% under pressure, with capable models showing disproportionately larger increases; (2) Scenario-Specific Safety Scaling: advanced reasoning improves robustness for direct harms but worsens it for gaming scenarios; and (3) Divergent Misalignment Patterns: weaker models cause non-malicious harm while stronger models develop strategic concealment.

new Machine Learning for Stress Testing: Uncertainty Decomposition in Causal Panel Prediction

Authors: Yu Wang, Xiangchen Liu, Siguang Li

Abstract: Regulatory stress testing requires projecting credit losses under hypothetical macroeconomic scenarios -- a fundamentally causal question typically treated as a prediction problem. We propose a framework for policy-path counterfactual inference in panels that transparently separates what can be learned from data from what requires assumptions about confounding. Our approach has four components: (i) observational identification of path-conditional means via iterated regression, enabling continuous macro-path contrasts without requiring a control group; (ii) causal set identification under bounded confounding, yielding sharp identified sets with interpretable breakdown values that communicate robustness in a single number; (iii) an oracle inequality showing that recursive rollout error is governed by a horizon-dependent amplification factor, providing a concrete answer to how far ahead one can reliably predict under stress; and (iv) importance-weighted conformal calibration bands with diagnostics that quantify extrapolation cost and trigger abstention when coverage guarantees degrade. The final output is a three-layer uncertainty decomposition that cleanly separates estimation uncertainty from confounding uncertainty. We validate all results through simulation and semi-synthetic experiments with real unemployment data, including a Covid retrospective demonstrating the framework's diagnostic value under extreme scenarios.

new HLER: Human-in-the-Loop Economic Research via Multi-Agent Pipelines for Empirical Discovery

Authors: Chen Zhu, Xiaolu Wang

Abstract: Large language models (LLMs) have enabled agent-based systems that aim to automate scientific research workflows. Most existing approaches focus on fully autonomous discovery, where AI systems generate research ideas, conduct analyses, and produce manuscripts with minimal human involvement. However, empirical research in economics and the social sciences poses additional constraints: research questions must be grounded in available datasets, identification strategies require careful design, and human judgment remains essential for evaluating economic significance. We introduce HLER (Human-in-the-Loop Economic Research), a multi-agent architecture that supports empirical research automation while preserving critical human oversight. The system orchestrates specialized agents for data auditing, data profiling, hypothesis generation, econometric analysis, manuscript drafting, and automated review. A key design principle is dataset-aware hypothesis generation, where candidate research questions are constrained by dataset structure, variable availability, and distributional diagnostics, reducing infeasible or hallucinated hypotheses. HLER further implements a two-loop architecture: a question quality loop that screens and selects feasible hypotheses, and a research revision loop where automated review triggers re-analysis and manuscript revision. Human decision gates are embedded at key stages, allowing researchers to guide the automated pipeline. Experiments on three empirical datasets show that dataset-aware hypothesis generation produces feasible research questions in 87% of cases (versus 41% under unconstrained generation), while complete empirical manuscripts can be produced at an average API cost of $0.8-$1.5 per run. These results suggest that Human-AI collaborative pipelines may provide a practical path toward scalable empirical research.

new Do Machines Fail Like Humans? A Human-Centred Out-of-Distribution Spectrum for Mapping Error Alignment

Authors: Binxia Xu, Xiaoliang Luo, Luke Dickens, Robert M. Mok

Abstract: Determining whether AI systems process information similarly to humans is central to cognitive science and trustworthy AI. While modern AI models match human accuracy on standard tasks, such parity does not guarantee that their underlying decision-making strategies are aligned with human information processing. Assessing performance using i) error alignment metrics to compare how humans and models fail, and ii) using distorted, or otherwise more challenging, stimuli, provides a viable pathway toward a finer characterization of model-human alignment. However, existing out-of-distribution (OOD) analyses for challenging stimuli are limited due to methodological choices: they define OOD shift relative to model training data or use arbitrary distortion-specific parameters with little correspondence to human perception, hindering principled comparisons. We propose a human-centred framework that redefines the degree of OOD as a spectrum of human perceptual difficulty. By quantifying how much a collection of stimuli deviates from an undistorted reference set based on human accuracy, we construct an OOD spectrum and identify four distinct regimes of perceptual challenge. This approach enables principled model-human comparisons at calibrated difficulty levels. We apply this framework to object recognition and reveal unique, regime-dependent model-human alignment rankings and profiles across deep learning architectures. Vision-language models are the most consistently human aligned across near- and far-OOD conditions, but CNNs are more aligned than ViTs for near-OOD and ViTs are more aligned than CNNs for far-OOD conditions. Our work demonstrates the critical importance of accounting for cross-condition differences such as perceptual difficulty for a principled assessment of model-human alignment.

new COOL-MC: Verifying and Explaining RL Policies for Multi-bridge Network Maintenance

Authors: Dennis Gross

Abstract: Aging bridge networks require proactive, verifiable, and interpretable maintenance strategies, yet reinforcement learning (RL) policies trained solely on reward signals provide no formal safety guarantees and remain opaque to infrastructure managers. We demonstrate COOL-MC as a tool for verifying and explaining RL policies for multi-bridge network maintenance, building on a single-bridge Markov decision process (MDP) from the literature and extending it to a parallel network of three heterogeneous bridges with a shared periodic budget constraint, encoded in the PRISM modeling language. We train an RL agent on this MDP and apply probabilistic model checking and explainability methods to the induced discrete-time Markov chain (DTMC) that arises from the interaction between the learned policy and the underlying MDP. Probabilistic model checking reveals that the trained policy has a safety-violation probability of 3.5\% over the planning horizon, being slightly above the theoretical minimum of 0\% and indicating the suboptimality of the learned policy, noting that these results are based on artificially constructed transition probabilities and deterioration rates rather than real-world data, so absolute performance figures should be interpreted with caution. The explainability analysis further reveals, for instance, a systematic bias in the trained policy toward the state of bridge 1 over the remaining bridges in the network. These results demonstrate COOL-MC's ability to provide formal, interpretable, and practical analysis of RL maintenance policies.

new Shorter Thoughts, Same Answers: Difficulty-Scaled Segment-Wise RL for CoT Compression

Authors: Ye Tian, Aijun Liu

Abstract: Chain-of-thought (CoT) improves reasoning reliability but increases token cost, motivating post-training compression of explicit reasoning traces. However, the shortest sufficient reasoning is not universal: it depends on difficulty, model capacity, and training state, making fixed length targets brittle. In practice, naive RL-based compression can also undesirably shorten the user-facing answer, because a single completion-level learning signal leaks across the think/answer boundary. We propose Difficulty-Scaled Segment-Wise GRPO (DSS-GRPO), which decomposes returns into think and answer components, computes group-relative advantages per segment, and routes them with hard token masks so compression updates act only on think while answer alignment acts only on answer. DSS-GRPO uses prompt-wise within-group shaping and difficulty-aware scaling to encourage concise reasoning without collapsing answer behavior.

new Memory for Autonomous LLM Agents:Mechanisms, Evaluation, and Emerging Frontiers

Authors: Pengfei Du

Abstract: Large language model (LLM) agents increasingly operate in settings where a single context window is far too small to capture what has happened, what was learned, and what should not be repeated. Memory -- the ability to persist, organize, and selectively recall information across interactions -- is what turns a stateless text generator into a genuinely adaptive agent. This survey offers a structured account of how memory is designed, implemented, and evaluated in modern LLM-based agents, covering work from 2022 through early 2026. We formalize agent memory as a \emph{write--manage--read} loop tightly coupled with perception and action, then introduce a three-dimensional taxonomy spanning temporal scope, representational substrate, and control policy. Five mechanism families are examined in depth: context-resident compression, retrieval-augmented stores, reflective self-improvement, hierarchical virtual context, and policy-learned management. On the evaluation side, we trace the shift from static recall benchmarks to multi-session agentic tests that interleave memory with decision-making, analyzing four recent benchmarks that expose stubborn gaps in current systems. We also survey applications where memory is the differentiating factor -- personal assistants, coding agents, open-world games, scientific reasoning, and multi-agent teamwork -- and address the engineering realities of write-path filtering, contradiction handling, latency budgets, and privacy governance. The paper closes with open challenges: continual consolidation, causally grounded retrieval, trustworthy reflection, learned forgetting, and multimodal embodied memory.

new Rigidity in LLM Bandits with Implications for Human-AI Dyads

Authors: Haomiaomiao Wang, Tom\'as E Ward, Lili Zhang

Abstract: We test whether LLMs show robust decision biases. Treating models as participants in two-arm bandits, we ran 20000 trials per condition across four decoding configurations. Under symmetric rewards, models amplified positional order into stubborn one-arm policies. Under asymmetric rewards, they exploited rigidly yet underperformed an oracle and rarely re-checked. The observed patterns were consistent across manipulations of temperature and top-p, with top-k held at the provider default, indicating that the qualitative behaviours are robust to the two decoding knobs typically available to practitioners. Crucially, moving beyond descriptive metrics to computational modelling, a hierarchical Rescorla-Wagner-softmax fit revealed the underlying strategies: low learning rates and very high inverse temperatures, which together explain both noise-to-bias amplification and rigid exploitation. These results position minimal bandits as a tractable probe of LLM decision tendencies and motivate hypotheses about how such biases could shape human-AI interaction.

new A Novel Multi-Agent Architecture to Reduce Hallucinations of Large Language Models in Multi-Step Structural Modeling

Authors: Ziheng Geng, Jiachen Liu, Ran Cao, Lu Cheng, Dan M. Frangopol, Minghui Cheng

Abstract: Large language models (LLMs) such as GPT and Gemini have demonstrated remarkable capabilities in contextual understanding and reasoning. The strong performance of LLMs has sparked growing interest in leveraging them to automate tasks traditionally dependent on human expertise. Recently, LLMs have been integrated into intelligent agents capable of operating structural analysis software (e.g., OpenSees) to construct structural models and perform analyses. However, existing LLMs are limited in handling multi-step structural modeling due to frequent hallucinations and error accumulation during long-sequence operations. To this end, this study presents a novel multi-agent architecture to automate the structural modeling and analysis using OpenSeesPy. First, problem analysis and construction planning agents extract key parameters from user descriptions and formulate a stepwise modeling plan. Node and element agents then operate in parallel to assemble the frame geometry, followed by a load assignment agent. The resulting geometric and load information is translated into executable OpenSeesPy scripts by code translation agents. The proposed architecture is evaluated on a benchmark of 20 frame problems over ten repeated trials, achieving 100% accuracy in 18 cases and 90% in the remaining two. The architecture also significantly improves computational efficiency and demonstrates scalability to larger structural systems.

new Large Language Model for Discrete Optimization Problems: Evaluation and Step-by-step Reasoning

Authors: Tianhao Qian, Guilin Qi, Z. Y. Wu, Ran Gu, Xuanyi Liu, Canchen Lyu

Abstract: This work investigated the capabilities of different models, including the Llama-3 series of models and CHATGPT, with different forms of expression in solving discrete optimization problems by testing natural language datasets. In contrast to formal datasets with a limited scope of parameters, our dataset included a variety of problem types in discrete optimization problems and featured a wide range of parameter magnitudes, including instances with large parameter sets, integrated with augmented data. It aimed to (1) provide an overview of LLMs' ability in large-scale problems, (2) offer suggestions to those who want to solve discrete optimization problems automatically, and (3) regard the performance as a benchmark for future research. These datasets included original, expanded and augmented datasets. Among these three datasets, the original and augmented ones aimed for evaluation while the expanded one may help finetune a new model. In the experiment, comparisons were made between strong and week models, CoT methods and No-CoT methods on various datasets. The result showed that stronger model performed better reasonably. Contrary to general agreement, it also showed that CoT technique was not always effective regarding the capability of models and disordered datasets improved performance of models on easy to-understand problems, even though they were sometimes with high variance, a manifestation of instability. Therefore, for those who seek to enhance the automatic resolution of discrete optimization problems, it is recommended to consult the results, including the line charts presented in the Appendix, as well as the conclusions drawn in this study for relevant suggestions.

new Intentional Deception as Controllable Capability in LLM Agents

Authors: Jason Starace, Terence Soule

Abstract: As LLM-based agents increasingly operate in multi-agent systems, understanding adversarial manipulation becomes critical for defensive design. We present a systematic study of intentional deception as an engineered capability, using LLM-to-LLM interactions within a text-based RPG where parameterized behavioral profiles (9 alignments x 4 motivations, yielding 36 profiles with explicit ethical ground truth) serve as our experimental testbed. Unlike accidental deception from misalignment, we investigate a two-stage system that infers target agent characteristics and generates deceptive responses steering targets toward actions counter to their beliefs and motivations. We find that deceptive intervention produces differential effects concentrated in specific behavioral profiles rather than distributed uniformly, and that 88.5% of successful deceptions employ misdirection (true statements with strategic framing) rather than fabrication, indicating fact-checking defenses would miss the large majority of adversarial responses. Motivation, inferable at 98%+ accuracy, serves as the primary attack vector, while belief systems remain harder to identify (49% inference ceiling) or exploit. These findings identify which agent profiles require additional safeguards and suggest that current fact-verification approaches are insufficient against strategically framed deception.

new SynPlanResearch-R1: Encouraging Tool Exploration for Deep Research with Synthetic Plans

Authors: Hansi Zeng, Zoey Li, Yifan Gao, Chenwei Zhang, Xiaoman Pan, Tao Yang, Fengran Mo, Jiacheng Lin, Xian Li, Jingbo Shang

Abstract: Research Agents enable models to gather information from the web using tools to answer user queries, requiring them to dynamically interleave internal reasoning with tool use. While such capabilities can in principle be learned via reinforcement learning with verifiable rewards (RLVR), we observe that agents often exhibit poor exploration behaviors, including premature termination and biased tool usage. As a result, RLVR alone yields limited improvements. We propose SynPlanResearch-R1, a framework that synthesizes tool-use trajectories that encourage deeper exploration to shape exploration during cold-start supervised fine-tuning, providing a strong initialization for subsequent RL. Across seven multi-hop and open-web benchmarks, \framework improves performance by up to 6.0% on Qwen3-8B and 5.8% on Qwen3-4B backbones respectively compared to SOTA baselines. Further analyses of tool-use patterns and training dynamics compared to baselines shed light on the factors underlying these gains. Our code is publicly available at https://github.com/HansiZeng/syn-plan-research.

URLs: https://github.com/HansiZeng/syn-plan-research.

new Hospitality-VQA: Decision-Oriented Informativeness Evaluation for Vision-Language Models

Authors: Jeongwoo Lee, Baek Duhyeong, Eungyeol Han, Soyeon Shin, Gukin han, Seungduk Kim, Jaehyun Jeon, Taewoo Jeong

Abstract: Recent advances in Vision-Language Models (VLMs) have demonstrated impressive multimodal understanding in general domains. However, their applicability to decision-oriented domains such as hospitality remains largely unexplored. In this work, we investigate how well VLMs can perform visual question answering (VQA) about hotel and facility images that are central to consumer decision-making. While many existing VQA benchmarks focus on factual correctness, they rarely capture what information users actually find useful. To address this, we first introduce Informativeness as a formal framework to quantify how much hospitality-relevant information an image-question pair provides. Guided by this framework, we construct a new hospitality-specific VQA dataset that covers various facility types, where questions are specifically designed to reflect key user information needs. Using this benchmark, we conduct experiments with several state-of-the-art VLMs, revealing that VLMs are not intrinsically decision-aware-key visual signals remain underutilized, and reliable informativeness reasoning emerges only after modest domain-specific finetuning.

new Visualizing Coalition Formation: From Hedonic Games to Image Segmentation

Authors: Pedro Henrique de Paula Fran\c{c}a, Lucas Lopes Felipe, Daniel Sadoc Menasch\'e

Abstract: We propose image segmentation as a visual diagnostic testbed for coalition formation in hedonic games. Modeling pixels as agents on a graph, we study how a granularization parameter shapes equilibrium fragmentation and boundary structure. On the Weizmann single-object benchmark, we relate multi-coalition equilibria to binary protocols by measuring whether the converged coalitions overlap with a foreground ground-truth. We observe transitions from cohesive to fragmented yet recoverable equilibria, and finally to intrinsic failure under excessive fragmentation. Our core contribution links multi-agent systems with image segmentation by quantifying the impact of mechanism design parameters on equilibrium structures.

new A Lightweight Traffic Map for Efficient Anytime LaCAM*

Authors: Bojie Shen, Yue Zhang, Zhe Chen, Daniel Harabor

Abstract: Multi-Agent Path Finding (MAPF) aims to compute collision-free paths for multiple agents and has a wide range of practical applications. LaCAM*, an anytime configuration-based solver, currently represents the state of the art. Recent work has explored the use of guidance paths to steer LaCAM* toward configurations that avoid traffic congestion, thereby improving solution quality. However, existing approaches rely on Frank-Wolfe-style optimization that repeatedly invokes single-agent search before executing LaCAM*, resulting in substantial computational overhead for large-scale problems. Moreover, the guidance path is static and primarily beneficial for finding the first solution in LaCAM*. To address these limitations, we propose a new approach that leverages LaCAM*'s ability to construct a dynamic, lightweight traffic map during its search. Experimental results demonstrate that our method achieves higher solution quality than state-of-the-art guidance-path approaches across two MAPF variants.

new SMGI: A Structural Theory of General Artificial Intelligence

Authors: Aomar Osmani

Abstract: We introduce SMGI, a structural theory of general artificial intelligence, and recast the foundational problem of learning from the optimization of hypotheses within fixed environments to the controlled evolution of the learning interface itself. We formalize the Structural Model of General Intelligence (SMGI) via a typed meta-model $\theta = (r,\mathcal H,\Pi,\mathcal L,\mathcal E,\mathcal M)$ that treats representational maps, hypothesis spaces, structural priors, multi-regime evaluators, and memory operators as explicitly typed, dynamic components. By enforcing a strict mathematical separation between this structural ontology ($\theta$) and its induced behavioral semantics ($T_\theta$), we define general artificial intelligence as a class of admissible coupled dynamics $(\theta, T_\theta)$ satisfying four obligations: structural closure under typed transformations, dynamical stability under certified evolution, bounded statistical capacity, and evaluative invariance across regime shifts. We prove a structural generalization bound that links sequential PAC-Bayes analysis and Lyapunov stability, providing sufficient conditions for capacity control and bounded drift under admissible task transformations. Furthermore, we establish a strict structural inclusion theorem demonstrating that classical empirical risk minimization, reinforcement learning, program-prior models (Solomonoff-style), and modern frontier agentic pipelines operate as structurally restricted instances of SMGI.

new EveryQuery: Zero-Shot Clinical Prediction via Task-Conditioned Pretraining over Electronic Health Records

Authors: Payal Chandak, Gregory Kondas, Isaac Kohane, Matthew McDermott

Abstract: Foundation models pretrained on electronic health records (EHR) have demonstrated zero-shot clinical prediction capabilities by generating synthetic patient futures and aggregating statistics over sampled trajectories. However, this autoregressive inference procedure is computationally expensive, statistically noisy, and not natively promptable because users cannot directly condition predictions on specific clinical questions. In this preliminary work, we introduce EveryQuery, an EHR foundation model that achieves zero-shot inference through task-conditioned pre-training. Rather than generating future events, EveryQuery takes as input a patient's history and a structured query specifying a clinical task, and directly estimates the likelihood of the outcome occurring in the future window via a single forward pass. EveryQuery realizes this capability by pre-training over randomly sampled combinations of query tasks and patient contexts, directly training the model to produce correct answers to arbitrary input prompts. This enables zero-shot prediction for any task in the query space without finetuning, linear probing, or trajectory generation. On MIMIC-IV, EveryQuery outperforms an autoregressive baseline on 82% of 39 randomly sampled prediction tasks, with a mean AUC improvement of +0.16 (95% CI: [0.10,0.22]). This advantage remains consistent on tasks that were explicitly held out from the pre-training distribution. Further, EveryQuery's performance gains are most pronounced for rare clinical events, affirming and demonstrating a solution to the fundamental limitation of autoregressive inference for low-prevalence outcomes. However, at present, EveryQuery underperforms on tasks requiring disjunctive reasoning over multiple codes, such as 30-day readmission, exposing a concrete expressiveness limitation of the current query language.

new Ares: Adaptive Reasoning Effort Selection for Efficient LLM Agents

Authors: Jingbo Yang, Bairu Hou, Wei Wei, Yujia Bao, Shiyu Chang

Abstract: Modern agents powered by thinking LLMs achieve high accuracy through long chain-of-thought reasoning but incur substantial inference costs. While many LLMs now support configurable reasoning levels (e.g., high/medium/low), static strategies are often ineffective: using low-effort modes at every step leads to significant performance degradation, while random selection fails to preserve accuracy or provide meaningful cost reduction. However, agents should reserve high reasoning effort for difficult steps like navigating complex website structures, while using lower-effort modes for simpler steps like opening a target URL. In this paper, we propose Ares, a framework for per-step dynamic reasoning effort selection tailored for multi-step agent tasks. Ares employs a lightweight router to predict the lowest appropriate reasoning level for each step based on the interaction history. To train this router, we develop a data generation pipeline that identifies the minimum reasoning effort required for successful step completion. We then fine-tune the router to predict these levels, enabling plug-and-play integration for any LLM agents. We evaluate Ares on a diverse set of agent tasks, including TAU-Bench for tool use agents, BrowseComp-Plus for deep-research agents, and WebArena for web agents. Experimental results show that Ares reduces reasoning token usage by up to 52.7% compared to fixed high-effort reasoning, while introducing minimal degradation in task success rates.

new Rel-MOSS: Towards Imbalanced Relational Deep Learning on Relational Databases

Authors: Jun Yin, Peng Huo, Bangguo Zhu, Hao Yan, Senzhang Wang, Shirui Pan, Chengqi Zhang

Abstract: In recent advances, to enable a fully data-driven learning paradigm on relational databases (RDB), relational deep learning (RDL) is proposed to structure the RDB as a heterogeneous entity graph and adopt the graph neural network (GNN) as the predictive model. However, existing RDL methods neglect the imbalance problem of relational data in RDBs and risk under-representing the minority entities, leading to an unusable model in practice. In this work, we investigate, for the first time, class imbalance problem in RDB entity classification and design the relation-centric minority synthetic over-sampling GNN (Rel-MOSS), in order to fill a critical void in the current literature. Specifically, to mitigate the issue of minority-related information being submerged by majority counterparts, we design the relation-wise gating controller to modulate neighborhood messages from each individual relation type. Based on the relational-gated representations, we further propose the relation-guided minority synthesizer for over-sampling, which integrates the entity relational signatures to maintain relational consistency. Extensive experiments on 12 entity classification datasets provide compelling evidence for the superiority of Rel-MOSS, yielding an average improvement of up to 2.46% and 4.00% in terms of Balanced Accuracy and G-Mean, compared with SOTA RDL methods and classic methods for handling class imbalance.

new Advancing Automated Algorithm Design via Evolutionary Stagewise Design with LLMs

Authors: Chen Lu, Ke Xue, Chengrui Gao, Yunqi Shi, Siyuan Xu, Mingxuan Yuan, Chao Qian, Zhi-Hua Zhou

Abstract: With the rapid advancement of human science and technology, problems in industrial scenarios are becoming increasingly challenging, bringing significant challenges to traditional algorithm design. Automated algorithm design with LLMs emerges as a promising solution, but the currently adopted black-box modeling deprives LLMs of any awareness of the intrinsic mechanism of the target problem, leading to hallucinated designs. In this paper, we introduce Evolutionary Stagewise Algorithm Design (EvoStage), a novel evolutionary paradigm that bridges the gap between the rigorous demands of industrial-scale algorithm design and the LLM-based algorithm design methods. Drawing inspiration from CoT, EvoStage decomposes the algorithm design process into sequential, manageable stages and integrates real-time intermediate feedback to iteratively refine algorithm design directions. To further reduce the algorithm design space and avoid falling into local optima, we introduce a multi-agent system and a "global-local perspective" mechanism. We apply EvoStage to the design of two types of common optimizers: designing parameter configuration schedules of the Adam optimizer for chip placement, and designing acquisition functions of Bayesian optimization for black-box optimization. Experimental results across open-source benchmarks demonstrate that EvoStage outperforms human-expert designs and existing LLM-based methods within only a couple of evolution steps, even achieving the historically state-of-the-art half-perimeter wire-length results on every tested chip case. Furthermore, when deployed on a commercial-grade 3D chip placement tool, EvoStage significantly surpasses the original performance metrics, achieving record-breaking efficiency. We hope EvoStage can significantly advance automated algorithm design in the real world, helping elevate human productivity.

new Adaptive Collaboration with Humans: Metacognitive Policy Optimization for Multi-Agent LLMs with Continual Learning

Authors: Wei Yang, Defu Cao, Jiacheng Pang, Muyan Weng, Yan Liu

Abstract: While scaling individual Large Language Models (LLMs) has delivered remarkable progress, the next frontier lies in scaling collaboration through multi-agent systems (MAS). However, purely autonomous MAS remain ''closed-world'' systems, constrained by the static knowledge horizon of pre-trained models. This limitation makes them brittle on tasks requiring knowledge beyond training data, often leading to collective failure under novel challenges. To address this, we propose the Human-In-the-Loop Multi-Agent Collaboration (HILA) framework, a principled paradigm for human--agent collaboration. HILA trains agents to learn a metacognitive policy that governs when to solve problems autonomously and when to defer to a human expert. To operationalize this policy, we introduce Dual-Loop Policy Optimization, which disentangles immediate decision-making from long-term capability growth. The inner loop applies Group Relative Policy Optimization (GRPO) with a cost-aware reward to optimize deferral decisions, while the outer loop implements continual learning, transforming expert feedback into high-quality supervised signals that strengthen the agent's reasoning ability. Experiments on challenging mathematical and problem-solving benchmarks show that HILA, equipped with Dual-Loop Policy Optimization, consistently outperforms advanced MAS, establishing a principled foundation for collaborative and continually improving agentic systems.

new OSExpert: Computer-Use Agents Learning Professional Skills via Exploration

Authors: Jiateng Liu, Zhenhailong Wang, Rushi Wang, Bingxuan Li, Jeonghwan Kim, Aditi Tiwari, Pengfei Yu, Denghui Zhang, Heng Ji

Abstract: General-purpose computer-use agents have shown impressive performance across diverse digital environments. However, our new benchmark, OSExpert-Eval, indicates they remain far less helpful than human experts. Although inference-time scaling enables adaptation, these agents complete complex tasks inefficiently with degraded performance, transfer poorly to unseen UIs, and struggle with fine-grained action sequences. To solve the problem, we introduce a GUI-based depth-first search (GUI-DFS) exploration algorithm to comprehensively explore and verify an environment's unit functions. The agent then exploits compositionality between unit skills to self-construct a curriculum for composite tasks. To support fine-grained actions, we curate a database of action primitives for agents to discover during exploration; these are saved as a skill set once the exploration is complete. We use the learned skills to improve the agent's performance and efficiency by (1) enriching agents with ready-to-use procedural knowledge, allowing them to plan only once for long trajectories and generate accurate actions, and (2) enabling them to end inference-time scaling earlier by realizing their boundary of capabilities. Extensive experiments show that our environment-learned agent takes a meaningful step toward expert-level computer use, achieving a around 20 percent performance gain on OSExpert-Eval and closing the efficiency gap to humans by around 80 percent

new CMMR-VLN: Vision-and-Language Navigation via Continual Multimodal Memory Retrieval

Authors: Haozhou Li, Xiangyu Dong, Huiyan Jiang, Yaoming Zhou, Xiaoguang Ma

Abstract: Although large language models (LLMs) are introduced into vision-and-language navigation (VLN) to improve instruction comprehension and generalization, existing LLM- based VLN lacks the ability to selectively recall and use relevant priori experiences to help navigation tasks, limiting their performance in long-horizon and unfamiliar scenarios. In this work, we propose CMMR-VLN (Continual Multimodal Memory Retrieval based VLN), a VLN framework that endows LLM agents with structured memory and reflection capabilities. Specifically, the CMMR-VLN constructs a multimodal experi- ence memory indexed by panoramic visual images and salient landmarks to retrieve relevant experiences during navigation, introduces a retrieved-augmented generation pipeline to mimick how experienced human navigators leverage priori knowledge, and incorporates a reflection-based memory update strategy that selectively stores complete successful paths and the key initial mistake in failure cases. Comprehensive tests illustrate average success rate improvements of 52.9%, 20.9% and 20.9%, and 200%, 50% and 50% over the NavGPT, the MapGPT, and the DiscussNav in simulation and real tests, respectively eluci- dating the great potential of the CMMR-VLN as a backbone VLN framework.

new PIRA-Bench: A Transition from Reactive GUI Agents to GUI-based Proactive Intent Recommendation Agents

Authors: Yuxiang Chai, Shunye Tang, Han Xiao, Rui Liu, Hongsheng Li

Abstract: Current Graphical User Interface (GUI) agents operate primarily under a reactive paradigm: a user must provide an explicit instruction for the agent to execute a task. However, an intelligent AI assistant should be proactive, which is capable of anticipating user intentions directly from continuous visual inputs, such as mobile or desktop screenshots, and offering timely recommendations without explicit user prompting. Transitioning to this proactive paradigm presents significant challenges. Real-world screen activity is rarely linear; it consists of long-horizon trajectories fraught with noisy browsing, meaningless actions, and multithreaded task-switching. To address this gap, we introduce PIRA-Bench (Proactive Intent Recommendation Agent Benchmark), a novel benchmark for evaluating multimodal large language models (MLLMs) on continuous, weakly-supervised visual inputs. Unlike reactive datasets, PIRA-Bench features complex trajectories with multiple interleaved intents and noisy segments with various user profile contexts, challenging agents to detect actionable events while fitting to user preferences. Furthermore, we propose the PIRF baseline, a memory-aware, state-tracking framework that empowers general MLLMs to manage multiple task threads and handle misleading visual inputs. PIRA-Bench serves as an initial step toward robust and proactive GUI-based personal assistants.

new CDRRM: Contrast-Driven Rubric Generation for Reliable and Interpretable Reward Modeling

Authors: Dengcan Liu, Fengkai Yang, Xiaohan Wang, Shurui Yan, Jiajun Chai, Jiahao Li, Yikun Ban, Zhendong Mao, Wei Lin, Guojun Yin

Abstract: Reward modeling is essential for aligning Large Language Models(LLMs) with human preferences, yet conventional reward models suffer from poor interpretability and heavy reliance on costly expert annotations. While recent rubric-based approaches enhance evaluation transparency, they lack systematic quality control, yielding noisy and redundant criteria, failing to mitigate persistent biases (e.g., verbosity, position) in LLM evaluators, and creating a scalability-reliability trade-off. To address these limitations, we propose CDRRM (Contrast-Driven Rubric Reward Model), a framework built on a novel Contrast-then-Synthesis paradigm for high-quality rubric generation and guided preference judgment. CDRRM first conducts multi-dimensional contrastive profiling on preference pairs to identify causal discriminative factors, then synthesizes these insights into compact, context-aware rubrics to guide preference judg- ments. Extensive experiments on three authoritative benchmarks (RewardBench, RMBench, RMB) demonstrate that CDRRM achieves state-of-the-art performance across diverse domains and effectively mitigates aforementioned evaluation biases. Notably, our approach delivers exceptional data efficiency: training the rubric generator on only 3k high-quality samples empowers a frozen pre-trained judge model to outperform fully fine-tuned baselines. This work offers a scalable, interpretable, and data-efficient path for reward modeling.

new S2S-FDD: Bridging Industrial Time Series and Natural Language for Explainable Zero-shot Fault Diagnosis

Authors: Baoxue Li, Chunhui Zhao

Abstract: Fault diagnosis is critical for the safe operation of industrial systems. Conventional diagnosis models typically produce abstract outputs such as anomaly scores or fault categories, failing to answer critical operational questions like "Why" or "How to repair". While large language models (LLMs) offer strong generalization and reasoning abilities, their training on discrete textual corpora creates a semantic gap when processing high-dimensional, temporal industrial signals. To address this challenge, we propose a Signals-to-Semantics fault diagnosis (S2S-FDD) framework that bridges high-dimensional sensor signals with natural language semantics through two key innovations: We first design a Signal-to-Semantic operator to convert abstract time-series signals into natural language summaries, capturing trends, periodicity, and deviations. Based on the descriptions, we design a multi-turn tree-structured diagnosis method to perform fault diagnosis by referencing historical maintenance documents and dynamically querying additional signals. The framework further supports human-in-the-loop feedback for continuous refinement. Experiments on the multiphase flow process show the feasibility and effectiveness of the proposed method for explainable zero-shot fault diagnosis.

new In-Context Reinforcement Learning for Tool Use in Large Language Models

Authors: Yaoqi Ye, Yiran Zhao, Keyu Duan, Zeyu Zheng, Kenji Kawaguchi, Cihang Xie, Michael Qizhe Shieh

Abstract: While large language models (LLMs) exhibit strong reasoning abilities, their performance on complex tasks is often constrained by the limitations of their internal knowledge. A compelling approach to overcome this challenge is to augment these models with external tools -- such as Python interpreters for mathematical computations or search engines for retrieving factual information. However, enabling models to use these tools effectively remains a significant challenge. Existing methods typically rely on cold-start pipelines that begin with supervised fine-tuning (SFT), followed by reinforcement learning (RL). These approaches often require substantial amounts of labeled data for SFT, which is expensive to annotate or synthesize. In this work, we propose In-Context Reinforcement Learning (ICRL), an RL-only framework that eliminates the need for SFT by leveraging few-shot prompting during the rollout stage of RL. Specifically, ICRL introduces in-context examples within the rollout prompts to teach the model how to invoke external tools. Furthermore, as training progresses, the number of in-context examples is gradually reduced, eventually reaching a zero-shot setting where the model learns to call tools independently. We conduct extensive experiments across a range of reasoning and tool-use benchmarks. Results show that ICRL achieves state-of-the-art performance, demonstrating its effectiveness as a scalable, data-efficient alternative to traditional SFT-based pipelines.

new UIS-Digger: Towards Comprehensive Research Agent Systems for Real-world Unindexed Information Seeking

Authors: Chang Liu, Chuqiao Kuang, Tianyi Zhuang, Yuxin Cheng, Huichi Zhou, Xiaoguang Li, Lifeng Shang

Abstract: Recent advancements in LLM-based information-seeking agents have achieved record-breaking performance on established benchmarks. However, these agents remain heavily reliant on search-engine-indexed knowledge, leaving a critical blind spot: Unindexed Information Seeking (UIS). This paper identifies and explores the UIS problem, where vital information is not captured by search engine crawlers, such as overlooked content, dynamic webpages, and embedded files. Despite its significance, UIS remains an underexplored challenge. To address this gap, we introduce UIS-QA, the first dedicated UIS benchmark, comprising 110 expert-annotated QA pairs. Notably, even state-of-the-art agents experience a drastic performance drop on UIS-QA (e.g., from 70.90 on GAIA and 46.70 on BrowseComp-zh to 24.55 on UIS-QA), underscoring the severity of the problem. To mitigate this, we propose UIS-Digger, a novel multi-agent framework that incorporates dual-mode browsing and enables simultaneous webpage searching and file parsing. With a relatively small $\sim$30B-parameter backbone LLM optimized using SFT and RFT training strategies, UIS-Digger sets a strong baseline at 27.27\%, outperforming systems integrating sophisticated LLMs such as O3 and GPT-4.1. This demonstrates the importance of proactive interaction with unindexed sources for effective and comprehensive information-seeking. Our work not only uncovers a fundamental limitation in current agent evaluation paradigms but also provides the first toolkit for advancing UIS research, defining a new and promising direction for robust information-seeking systems.

new Evidence-Driven Reasoning for Industrial Maintenance Using Heterogeneous Data

Authors: Fearghal O'Donncha, Nianjun Zhou, Natalia Martinez, James T Rayfield, Fenno F. Heath III, Abigail Langbridge, Roman Vaculin

Abstract: Industrial maintenance platforms contain rich but fragmented evidence, including free-text work orders, heterogeneous operational sensors or indicators, and structured failure knowledge. These sources are often analyzed in isolation, producing alerts or forecasts that do not support conditional decision-making: given this asset history and behavior, what is happening and what action is warranted? We present Condition Insight Agent, a deployed decision-support framework that integrates maintenance language, behavioral abstractions of operational data, and engineering failure semantics to produce evidence-grounded explanations and advisory actions. The system constrains reasoning through deterministic evidence construction and structured failure knowledge, and applies a rule-based verification loop to suppress unsupported conclusions. Case studies from production CMMS deployments show that this verification-first design operates reliably under heterogeneous and incomplete data while preserving human oversight. Our results demonstrate how constrained LLM-based reasoning can function as a governed decision-support layer for industrial maintenance.

new The Struggle Between Continuation and Refusal: A Mechanistic Analysis of the Continuation-Triggered Jailbreak in LLMs

Authors: Yonghong Deng, Zhen Yang, Ping Jian, Xinyue Zhang, Zhongbin Guo, Chengzhi Li

Abstract: With the rapid advancement of large language models (LLMs), the safety of LLMs has become a critical concern. Despite significant efforts in safety alignment, current LLMs remain vulnerable to jailbreaking attacks. However, the root causes of such vulnerabilities are still poorly understood, necessitating a rigorous investigation into jailbreak mechanisms across both academic and industrial communities. In this work, we focus on a continuation-triggered jailbreak phenomenon, whereby simply relocating a continuation-triggered instruction suffix can substantially increase jailbreak success rates. To uncover the intrinsic mechanisms of this phenomenon, we conduct a comprehensive mechanistic interpretability analysis at the level of attention heads. Through causal interventions and activation scaling, we show that this jailbreak behavior primarily arises from an inherent competition between the model's intrinsic continuation drive and the safety defenses acquired through alignment training. Furthermore, we perform a detailed behavioral analysis of the identified safety-critical attention heads, revealing notable differences in the functions and behaviors of safety heads across different model architectures. These findings provide a novel mechanistic perspective for understanding and interpreting jailbreak behaviors in LLMs, offering both theoretical insights and practical implications for improving model safety.

new FinToolBench: Evaluating LLM Agents for Real-World Financial Tool Use

Authors: Jiaxuan Lu, Kong Wang, Yemin Wang, Qingmei Tang, Hongwei Zeng, Xiang Chen, Jiahao Pi, Shujian Deng, Lingzhi Chen, Yi Fu, Kehua Yang, Xiao Sun

Abstract: The integration of Large Language Models (LLMs) into the financial domain is driving a paradigm shift from passive information retrieval to dynamic, agentic interaction. While general-purpose tool learning has witnessed a surge in benchmarks, the financial sector, characterized by high stakes, strict compliance, and rapid data volatility, remains critically underserved. Existing financial evaluations predominantly focus on static textual analysis or document-based QA, ignoring the complex reality of tool execution. Conversely, general tool benchmarks lack the domain-specific rigor required for finance, often relying on toy environments or a negligible number of financial APIs. To bridge this gap, we introduce FinToolBench, the first real-world, runnable benchmark dedicated to evaluating financial tool learning agents. Unlike prior works limited to a handful of mock tools, FinToolBench establishes a realistic ecosystem coupling 760 executable financial tools with 295 rigorous, tool-required queries. We propose a novel evaluation framework that goes beyond binary execution success, assessing agents on finance-critical dimensions: timeliness, intent type, and regulatory domain alignment. Furthermore, we present FATR, a finance-aware tool retrieval and reasoning baseline that enhances stability and compliance. By providing the first testbed for auditable, agentic financial execution, FinToolBench sets a new standard for trustworthy AI in finance. The tool manifest, execution environment, and evaluation code will be open-sourced to facilitate future research.

new Towards a more efficient bias detection in financial language models

Authors: Firas Hadj Kacem, Ahmed Khanfir, Mike Papadakis

Abstract: Bias in financial language models constitutes a major obstacle to their adoption in real-world applications. Detecting such bias is challenging, as it requires identifying inputs whose predictions change when varying properties unrelated to the decision, such as demographic attributes. Existing approaches typically rely on exhaustive mutation and pairwise prediction analysis over large corpora, which is effective but computationally expensive-particularly for large language models and can become impractical in continuous retraining and releasing processes. Aiming at reducing this cost, we conduct a large-scale study of bias in five financial language models, examining similarities in their bias tendencies across protected attributes and exploring cross-model-guided bias detection to identify bias-revealing inputs earlier. Our study uses approximately 17k real financial news sentences, mutated to construct over 125k original-mutant pairs. Results show that all models exhibit bias under both atomic (0.58\%-6.05\%) and intersectional (0.75\%-5.97\%) settings. Moreover, we observe consistent patterns in bias-revealing inputs across models, enabling substantial reuse and cost reduction in bias detection. For example, up to 73\% of FinMA's biased behaviours can be uncovered using only 20\% of the input pairs when guided by properties derived from DistilRoBERTa outputs.

new Deconstructing Multimodal Mathematical Reasoning: Towards a Unified Perception-Alignment-Reasoning Paradigm

Authors: Tianyu Yang, Sihong Wu, Yilun Zhao, Zhenwen Liang, Lisen Dai, Chen Zhao, Minhao Cheng, Arman Cohan, Xiangliang Zhang

Abstract: Multimodal Mathematical Reasoning (MMR) has recently attracted increasing attention for its capability to solve mathematical problems that involve both textual and visual modalities. However, current models still face significant challenges in real-world visual math tasks. They often misinterpret diagrams, fail to align mathematical symbols with visual evidence, and produce inconsistent reasoning steps. Moreover, existing evaluations mainly focus on checking final answers rather than verifying the correctness or executability of each intermediate step. To address these limitations, a growing body of recent research addresses these issues by integrating structured perception, explicit alignment, and verifiable reasoning within unified frameworks. To establish a clear roadmap for understanding and comparing different MMR approaches, we systematically study them around four fundamental questions: (1) What to extract from multimodal inputs, (2) How to represent and align textual and visual information, (3) How to perform the reasoning, and (4) How to evaluate the correctness of the overall reasoning process. Finally, we discuss open challenges and offer perspectives on promising directions for future research.

new CORE-Acu: Structured Reasoning Traces and Knowledge Graph Safety Verification for Acupuncture Clinical Decision Support

Authors: Liuyi Xu, Yun Guo, Ming Chen, Zihan Dun, Yining Qian, An-Yang Lu, Shuang Li, Lijun Liu

Abstract: Large language models (LLMs) show significant potential for clinical decision support (CDS), yet their black-box nature -- characterized by untraceable reasoning and probabilistic hallucinations -- poses severe challenges in acupuncture, a field demanding rigorous interpretability and safety. To address this, we propose CORE-Acu, a neuro-symbolic framework for acupuncture clinical decision support that integrates Structured Chain-of-Thought (S-CoT) with knowledge graph (KG) safety verification. First, we construct the first acupuncture Structured Reasoning Trace dataset and a schema-constrained fine-tuning framework. By enforcing an explicit causal chain from pattern identification to treatment principles, treatment plans, and acupoint selection, we transform implicit Traditional Chinese Medicine (TCM) reasoning into interpretable generation constraints, mitigating the opacity of LLM-based CDS. Furthermore, we construct a TCM safety knowledge graph and establish a ``Generate--Verify--Revise'' closed-loop inference system based on a Symbolic Veto Mechanism, employing deterministic rules to intercept hallucinations and enforce hard safety boundaries. Finally, we introduce the Lexicon-Matched Entity-Reweighted Loss (LMERL), which corrects terminology drift caused by the frequency--importance mismatch in general optimization by adaptively amplifying gradient contributions of high-risk entities during fine-tuning. Experiments on 1,000 held-out cases demonstrate CORE-Acu's superior entity fidelity and reasoning quality. Crucially, CORE-Acu achieved 0/1,000 observed safety violations (95\% CI: 0--0.37\%), whereas GPT-4o exhibited an 8.5\% violation rate under identical rules. These results establish CORE-Acu as a robust neuro-symbolic framework for acupuncture clinical decision support, guaranteeing both reasoning auditability and strict safety compliance.

new Agentic Neurosymbolic Collaboration for Mathematical Discovery: A Case Study in Combinatorial Design

Authors: Hai Xia, Carla P. Gomes, Bart Selman, Stefan Szeider

Abstract: We study mathematical discovery through the lens of neurosymbolic reasoning, where an AI agent powered by a large language model (LLM), coupled with symbolic computation tools, and human strategic direction, jointly produced a new result in combinatorial design theory. The main result of this human-AI collaboration is a tight lower bound on the imbalance of Latin squares for the notoriously difficult case $n \equiv 1 \pmod{3}$. We reconstruct the discovery process from detailed interaction logs spanning multiple sessions over several days and identify the distinct cognitive contributions of each component. The AI agent proved effective at uncovering hidden structure and generating hypotheses. The symbolic component consists of computer algebra, constraint solvers, and simulated annealing, which provides rigorous verification and exhaustive enumeration. Human steering supplied the critical research pivot that transformed a dead end into a productive inquiry. Our analysis reveals that multi-model deliberation among frontier LLMs proved reliable for criticism and error detection but unreliable for constructive claims. The resulting human-AI mathematical contribution, a tight lower bound of $4n(n{-}1)/9$, is achieved via a novel class of near-perfect permutations. The bound was formally verified in Lean 4. Our experiments show that neurosymbolic systems can indeed produce genuine discoveries in pure mathematics.

new M$^3$-ACE: Rectifying Visual Perception in Multimodal Math Reasoning via Multi-Agentic Context Engineering

Authors: Peijin Xie, Zhen Xu, Bingquan Liu, Baoxun Wang

Abstract: Multimodal large language models have recently shown promising progress in visual mathematical reasoning. However, their performance is often limited by a critical yet underexplored bottleneck: inaccurate visual perception. Through systematic analysis, we find that the most failures originate from incorrect or incomplete visual evidence extraction rather than deficiencies in reasoning capability. Moreover, models tend to remain overly confident in their initial perceptions, making standard strategies such as prompt engineering, multi-round self-reflection, or posterior guidance insufficient to reliably correct errors. To address this limitation, we propose M3-ACE, a multi-agentic context engineering framework designed to rectify visual perception in multimodal math reasoning. Instead of directly aggregating final answers, our approach decouples perception and reasoning by dynamically maintaining a shared context centered on visual evidence lists. Multiple agents collaboratively contribute complementary observations, enabling the system to expose inconsistencies and recover missing perceptual information. To support stable multi-turn collaboration, we further introduce two lightweight tools: a Summary Tool that organizes evidence from different agents into consistent, complementary, and conflicting components, and a Refine Tool that filters unreliable samples and guides iterative correction. Extensive experiments demonstrate that M3-ACE substantially improves visual mathematical reasoning performance across multiple benchmarks. Our method establishes new state-of-the-art results 89.1 on the MathVision benchmark and achieves consistent improvements on other related datasets, including MathVista and MathVerse. These results highlight the importance of perception-centric multi-agent collaboration for advancing multimodal reasoning systems.

new A Hierarchical Error-Corrective Graph Framework for Autonomous Agents with LLM-Based Action Generation

Authors: Cong Cao, Jingyao Zhang, Kun Tong

Abstract: We propose a Hierarchical Error-Corrective Graph FrameworkforAutonomousAgentswithLLM-BasedActionGeneration(HECG),whichincorporates three core innovations: (1) Multi-Dimensional Transferable Strategy (MDTS): by integrating task quality metrics (Q), confidence/cost metrics (C), reward metrics (R), and LLM-based semantic reasoning scores (LLM-Score), MDTS achieves multi-dimensional alignment between quantitative performance and semantic context, enabling more precise selection of high-quality candidate strate gies and effectively reducing the risk of negative transfer. (2) Error Matrix Classification (EMC): unlike simple confusion matrices or overall performance metrics, EMC provides structured attribution of task failures by categorizing errors into ten types, such as Strategy Errors (Strategy Whe) and Script Parsing Errors (Script-Parsing-Error), and decomposing them according to severity, typical actions, error descriptions, and recoverability. This allows precise analysis of the root causes of task failures, offering clear guidance for subsequent error correction and strategy optimization rather than relying solely on overall success rates or single performance metrics. (3) Causal-Context Graph Retrieval (CCGR): to enhance agent retrieval capabilities in dynamic task environments, we construct graphs from historical states, actions, and event sequences, where nodes store executed actions, next-step actions, execution states, transferable strategies, and other relevant information, and edges represent causal dependencies such as preconditions for transitions between nodes. CCGR identifies subgraphs most relevant to the current task context, effectively capturing structural relationships beyond vector similarity, allowing agents to fully leverage contextual information, accelerate strategy adaptation, and improve execution reliability in complex, multi-step tasks.

new IronEngine: Towards General AI Assistant

Authors: Xi Mo

Abstract: This paper presents IronEngine, a general AI assistant platform organized around a unified orchestration core that connects a desktop user interface, REST and WebSocket APIs, Python clients, local and cloud model backends, persistent memory, task scheduling, reusable skills, 24-category tool execution, MCP-compatible extensibility, and hardware-facing integration. IronEngine introduces a three-phase pipeline -- Discussion (Planner--Reviewer collaboration), Model Switch (VRAM-aware transition), and Execution (tool-augmented action loop) -- that separates planning quality from execution capability. The system features a hierarchical memory architecture with multi-level consolidation, a vectorized skill repository backed by ChromaDB, an adaptive model management layer supporting 92 model profiles with VRAM-aware context budgeting, and an intelligent tool routing system with 130+ alias normalization and automatic error correction. We present experimental results on file operation benchmarks achieving 100\% task completion with a mean total time of 1541 seconds across four heterogeneous tasks, and provide detailed comparisons with representative AI assistant systems including ChatGPT, Claude Desktop, Cursor, Windsurf, and open-source agent frameworks. Without disclosing proprietary prompts or core algorithms, this paper analyzes the platform's architectural decomposition, subsystem design, experimental performance, safety boundaries, and comparative engineering advantages. The resulting study positions IronEngine as a system-oriented foundation for general-purpose personal assistants, automation frameworks, and future human-centered agent platforms.

new Efficient Policy Learning with Hybrid Evaluation-Based Genetic Programming for Uncertain Agile Earth Observation Satellite Scheduling

Authors: Junhua Xue, Yuning Chen

Abstract: The Uncertain Agile Earth Observation Satellite Scheduling Problem (UAEOSSP) is a novel combinatorial optimization problem and a practical engineering challenge that aligns with the current demands of space technology development. It incorporates uncertainties in profit, resource consumption, and visibility, which may render pre-planned schedules suboptimal or even infeasible. Genetic Programming Hyper-Heuristic (GPHH) shows promise for evolving interpretable scheduling policies; however, their simulation-based evaluation incurs high computational costs. Moreover, the design of the constructive method, denoted as Online Scheduling Algorithm (OSA), directly affects fitness assessment, resulting in evaluation-dependent local optima within the policy space. To address these issues, this paper proposes a Hybrid Evaluation-based Genetic Programming (HE-GP) for effectively solving UAEOSSP. A Hybrid Evaluation (HE) mechanism is integrated into the policy-driven OSA, combining exact and approximate filtering modes: exact mode ensures evaluation accuracy through elaborately designed constraint verification modules, while approximate mode reduces computational overhead via simplified logic. HE-GP dynamically switches between evaluation models based on real-time evolutionary state information. Experiments on 16 simulated instance sets demonstrate that HE-GP significantly outperforms handcrafted heuristics and single-evaluation based GPHH, achieving substantial reductions in computational cost while maintaining excellent scheduling performance across diverse scenarios. Specifically, the average training time of HE-GP was reduced by 17.77\% compared to GP employing exclusively exact evaluation, while the optimal policy generated by HE-GP achieved the highest average ranks across all scenarios.

new The Boiling Frog Threshold: Criticality and Blindness in World Model-Based Anomaly Detection Under Gradual Drift

Authors: Zhe Hong

Abstract: When an RL agent's observations are gradually corrupted, at what drift rate does it "wake up" -- and what determines this boundary? We study world model-based self-monitoring under continuous observation drift across four MuJoCo environments, three detector families (z-score, variance, percentile), and three model capacities. We find that (1) a sharp detection threshold $\varepsilon^*$ exists universally: below it, drift is absorbed as normal variation; above it, detection occurs rapidly. The threshold's existence and sigmoid shape are invariant across all detector families and model capacities, though its position depends on the interaction between detector sensitivity, noise floor structure, and environment dynamics. (2) Sinusoidal drift is completely undetectable by all detector families -- including variance and percentile detectors with no temporal smoothing -- establishing this as a world model property rather than a detector artifact. (3) Within each environment, $\varepsilon^*$ follows a power law in detector parameters ($R^2 = 0.89$-$0.97$), but cross-environment prediction fails ($R^2 = 0.45$), revealing that the missing variable is environment-specific dynamics structure $\partial \mathrm{PE}/\partial\varepsilon$. (4) In fragile environments, agents collapse before any detector can fire ("collapse before awareness"), creating a fundamentally unmonitorable failure mode. Our results reframe $\varepsilon^*$ from an emergent world model property to a three-way interaction between noise floor, detector, and environment dynamics, providing a more defensible and empirically grounded account of self-monitoring boundaries in RL agents.

new RetroAgent: From Solving to Evolving via Retrospective Dual Intrinsic Feedback

Authors: Xiaoying Zhang, Zichen Liu, Yipeng Zhang, Xia Hu, Wenqi Shao

Abstract: Large language model (LLM)-based agents trained with reinforcement learning (RL) have shown strong potential on complex interactive tasks. However, standard RL paradigms favor static problem-solving over continuous adaptation: agents often converge to suboptimal strategies due to insufficient exploration, while learned knowledge remains implicit within parameters rather than explicitly retrievable, limiting effective experiential learning. To address these limitations, we introduce RetroAgent, an online RL framework that empowers agents to master complex interactive environments not just by solving, but by evolving. Concretely, RetroAgent features a hindsight self-reflection mechanism that produces dual intrinsic feedback: (1) intrinsic numerical feedback that that tracks incremental subtask completion relative to prior attempts, rewarding promising explorations, and (2) intrinsic language feedback that distills reusable lessons into a memory buffer, retrieved via our proposed Similarity & Utility-Aware Upper Confidence Bound (SimUtil-UCB) strategy balancing relevance, utility, and exploration to effectively leverage past experiences. Extensive experiments on two model families across four challenging agentic tasks demonstrate that RetroAgent significantly outperforms existing methods, achieving state-of-the-art results -- e.g., surpassing Group Relative Policy Optimization (GRPO)-trained agents by +18.3% on ALFWorld, +15.4% on WebShop, +27.1% on Sokoban, and +8.9% on MineSweeper -- while exhibiting strong test-time adaptation and generalization to out-of-distribution scenarios.

new Trust via Reputation of Conviction

Authors: Aravind R. Iyengar

Abstract: The question of \emph{knowledge}, \emph{truth} and \emph{trust} is explored via a mathematical formulation of claims and sources. We define truth as the reproducibly perceived subset of knowledge, formalize sources as having both generative and discriminative roles, and develop a framework for reputation grounded in the \emph{conviction} -- the likelihood that a source's stance is vindicated by independent consensus. We argue that conviction, rather than correctness or faithfulness, is the principled basis for trust: it is regime-independent, rewards genuine contribution, and demands the transparent and self-sufficient perceptions that make external verification possible. We formalize reputation as the expected weighted signed conviction over a realm of claims, characterize its behavior across source-claim regimes, and identify continuous verification as both a theoretical necessity and a practical mechanism through which reputation accrues. The framework is applied to AI agents, which are identified as capable but error-prone sources for whom verifiable conviction and continuously accrued reputation constitute the only robust foundation for trust.

new CoCo: Code as CoT for Text-to-Image Preview and Rare Concept Generation

Authors: Haodong Li, Chunmei Qing, Huanyu Zhang, Dongzhi Jiang, Yihang Zou, Hongbo Peng, Dingming Li, Yuhong Dai, ZePeng Lin, Juanxi Tian, Yi Zhou, Siqi Dai, Jingwei Wu

Abstract: Recent advancements in Unified Multimodal Models (UMMs) have significantly advanced text-to-image (T2I) generation, particularly through the integration of Chain-of-Thought (CoT) reasoning. However, existing CoT-based T2I methods largely rely on abstract natural-language planning, which lacks the precision required for complex spatial layouts, structured visual elements, and dense textual content. In this work, we propose CoCo (Code-as-CoT), a code-driven reasoning framework that represents the reasoning process as executable code, enabling explicit and verifiable intermediate planning for image generation. Given a text prompt, CoCo first generates executable code that specifies the structural layout of the scene, which is then executed in a sandboxed environment to render a deterministic draft image. The model subsequently refines this draft through fine-grained image editing to produce the final high-fidelity result. To support this training paradigm, we construct CoCo-10K, a curated dataset containing structured draft-final image pairs designed to teach both structured draft construction and corrective visual refinement. Empirical evaluations on StructT2IBench, OneIG-Bench, and LongText-Bench show that CoCo achieves improvements of +68.83%, +54.8%, and +41.23% over direct generation, while also outperforming other generation methods empowered by CoT. These results demonstrate that executable code is an effective and reliable reasoning paradigm for precise, controllable, and structured text-to-image generation. The code is available at: https://github.com/micky-li-hd/CoCo

URLs: https://github.com/micky-li-hd/CoCo

new OfficeQA Pro: An Enterprise Benchmark for End-to-End Grounded Reasoning

Authors: Krista Opsahl-Ong, Arnav Singhvi, Jasmine Collins, Ivan Zhou, Cindy Wang, Ashutosh Baheti, Owen Oertell, Jacob Portes, Sam Havens, Erich Elsen, Michael Bendersky, Matei Zaharia, Xing Chen

Abstract: We introduce OfficeQA Pro, a benchmark for evaluating AI agents on grounded, multi-document reasoning over a large and heterogeneous document corpus. The corpus consists of U.S. Treasury Bulletins spanning nearly 100 years, comprising 89,000 pages and over 26 million numerical values. OfficeQA Pro consists of 133 questions that require precise document parsing, retrieval, and analytical reasoning across both unstructured text and tabular data. Frontier LLMs including Claude Opus 4.6, GPT-5.4, and Gemini 3.1 Pro Preview achieve less than 5% accuracy on OfficeQA Pro when relying on parametric knowledge, and less than 12% with additional access to the web. When provided directly with the document corpus, frontier agents still struggle on over half of questions, scoring 34.1% on average. We find that providing agents with a structured document representation produced by Databricks' ai_parse_document yields a 16.1% average relative performance gain across agents. We conduct additional ablations to study the effects of model selection, table representation, retrieval strategy, and test-time scaling on performance. Despite these improvements, significant headroom remains before agents can be considered reliable at enterprise-grade grounded reasoning.

new A Multi-Objective Optimization Approach for Sustainable AI-Driven Entrepreneurship in Resilient Economies

Authors: Anas ALsobeh, Raneem Alkurdi

Abstract: The rapid advancement of artificial intelligence (AI) technologies presents both unprecedented opportunities and significant challenges for sustainable economic development. While AI offers transformative potential for addressing environmental challenges and enhancing economic resilience, its deployment often involves substantial energy consumption and environmental costs. This research introduces the EcoAI-Resilience framework, a multi-objective optimization approach designed to maximize the sustainability benefits of AI deployment while minimizing environmental costs and enhancing economic resilience. The framework addresses three critical objectives through mathematical optimization: sustainability impact maximization, economic resilience enhancement, and environmental cost minimization. The methodology integrates diverse data sources, including energy consumption metrics, sustainability indicators, economic performance data, and entrepreneurship outcomes across 53 countries and 14 sectors from 2015-2024. Our experimental validation demonstrates exceptional performance with R scores exceeding 0.99 across all model components, significantly outperforming baseline methods, including Linear Regression (R = 0.943), Random Forest (R = 0.957), and Gradient Boosting (R = 0.989). The framework successfully identifies optimal AI deployment strategies featuring 100\% renewable energy integration, 80% efficiency improvement targets, and optimal investment levels of $202.48 per capita. Key findings reveal strong correlations between economic complexity and resilience (r = 0.82), renewable energy adoption and sustainability outcomes (r = 0.71), and demonstrate significant temporal improvements in AI readiness (+1.12 points/year) and renewable energy adoption (+0.67 year) globally.

new Evaluating Financial Intelligence in Large Language Models: Benchmarking SuperInvesting AI with LLM Engines

Authors: Akshay Gulati, Kanha Singhania, Tushar Banga, Parth Arora, Anshul Verma, Vaibhav Kumar Singh, Agyapal Digra, Jayant Singh Bisht, Danish Sharma, Varun Singla, Shubh Garg

Abstract: Large language models are increasingly used for financial analysis and investment research, yet systematic evaluation of their financial reasoning capabilities remains limited. In this work, we introduce the AI Financial Intelligence Benchmark (AFIB), a multi-dimensional evaluation framework designed to assess financial analysis capabilities across five dimensions: factual accuracy, analytical completeness, data recency, model consistency, and failure patterns. We evaluate five AI systems: GPT, Gemini, Perplexity, Claude, and SuperInvesting, using a dataset of 95+ structured financial analysis questions derived from real-world equity research tasks. The results reveal substantial differences in performance across models. Within this benchmark setting, SuperInvesting achieves the highest aggregate performance, with an average factual accuracy score of 8.96/10 and the highest completeness score of 56.65/70, while also demonstrating the lowest hallucination rate among evaluated systems. Retrieval-oriented systems such as Perplexity perform strongly on data recency tasks due to live information access but exhibit weaker analytical synthesis and consistency. Overall, the results highlight that financial intelligence in large language models is inherently multi-dimensional, and systems that combine structured financial data access with analytical reasoning capabilities provide the most reliable performance for complex investment research workflows.

new Agentic Critical Training

Authors: Weize Liu, Minghui Liu, Sy-Tuyen Ho, Souradip Chakraborty, Xiyao Wang, Furong Huang

Abstract: Training large language models (LLMs) as autonomous agents often begins with imitation learning, but it only teaches agents what to do without understanding why: agents never contrast successful actions against suboptimal alternatives and thus lack awareness of action quality. Recent approaches attempt to address this by introducing self-reflection supervision derived from contrasts between expert and alternative actions. However, the training paradigm fundamentally remains imitation learning: the model imitates pre-constructed reflection text rather than learning to reason autonomously. We propose Agentic Critical Training (ACT), a reinforcement learning paradigm that trains agents to identify the better action among alternatives. By rewarding whether the model's judgment is correct, ACT drives the model to autonomously develop reasoning about action quality, producing genuine self-reflection rather than imitating it. Across three challenging agent benchmarks, ACT consistently improves agent performance when combined with different post-training methods. It achieves an average improvement of 5.07 points over imitation learning and 4.62 points over reinforcement learning. Compared to approaches that inject reflection capability through knowledge distillation, ACT also demonstrates clear advantages, yielding an average improvement of 2.42 points. Moreover, ACT enables strong out-of-distribution generalization on agentic benchmarks and improves performance on general reasoning benchmarks without any reasoning-specific training data, highlighting the value of our method. These results suggest that ACT is a promising path toward developing more reflective and capable LLM agents.

cross Subclass Classification of Gliomas Using MRI Fusion Technique

Authors: Kiranmayee Janardhan, Christy Bobby Thomas

Abstract: Glioma, the prevalent primary brain tumor, exhibits diverse aggressiveness levels and prognoses. Precise classification of glioma is paramount for treatment planning and predicting prognosis. This study aims to develop an algorithm to fuse the MRI images from T1, T2, T1ce, and fluid-attenuated inversion recovery (FLAIR) sequences to enhance the efficacy of glioma subclass classification as no tumor, necrotic core, peritumoral edema, and enhancing tumor. The MRI images from BraTS datasets were used in this work. The images were pre-processed using max-min normalization to ensure consistency in pixel intensity values across different images. The segmentation of the necrotic core, peritumoral edema, and enhancing tumor was performed on 2D and 3D images separately using UNET architecture. Further, the segmented regions from multimodal MRI images were fused using the weighted averaging technique. Integrating 2D and 3D segmented outputs enhances classification accuracy by capturing detailed features like tumor shape, boundaries, and intensity distribution in slices, while also providing a comprehensive view of spatial extent, shape, texture, and localization within the brain volume. The fused images were used as input to the pre-trained ResNet50 model for glioma subclass classification. The network is trained on 80% and validated on 20% of the data. The proposed method achieved a classification of accuracy of 99.25%, precision of 99.30%, recall of 99.10, F1 score of 99.19%, Intersection Over Union of 84.49%, and specificity of 99.76, which showed a significantly higher performance than existing techniques. These findings emphasize the significance of glioma segmentation and classification in aiding accurate diagnosis.

cross Deep Learning-Based Approach for Automatic 2D and 3D MRI Segmentation of Gliomas

Authors: Kiranmayee Janardhan, Christy Bobby T

Abstract: Brain tumor diagnosis is a challenging task for clinicians in the modern world. Among the major reasons for cancer-related death is the brain tumor. Gliomas, a category of central nervous system (CNS) tumors, encompass diverse subregions. For accurate diagnosis of brain tumors, precise segmentation of brain images and quantitative analysis are required. A fully automatic approach to glioma segmentation is required because the manual segmentation process is laborious, prone to mistakes, as well as time-consuming. Modern techniques for segmenting gliomas are based on fully convolutional neural networks (FCNs), which can either use two-dimensional (2D) or three-dimensional (3D) convolutions. Nevertheless, 3D convolutions suffer from computational costs and memory demand, while 2D convolutions cannot fully utilize the spatial insights of volumetric clinical imaging data. To obtain an optimal solution, it is vital to balance the computational efficiency of 2D convolutions along with the spatial accuracy of 3D convolutions. This balance can potentially be realized by developing an advanced model to overcome these challenges. The 2D and 3D models implemented here are based on UNET architecture, Inception, and ResNet models. The research work has been implemented on the BraTS 2018, 2019, and 2020 datasets. The best performer of all the models' evaluations metrics for proposed methodologies offer superior potential in terms of the effective segmentation of gliomas. The ResNet model has resulted in 98.91% accuracy for 3D segmentation and 99.77 for 2D segmentations. The dice scores for 2D and 3D segmentations are 0.8312 and 0.9888, respectively. This model can be applied to various other medical applications with fine-tuning, thereby aiding clinicians in brain tumor analysis and improving the diagnosis process effectively.

cross A Robust Incomplete Multimodal Low-Rank Adaptation Approach for Emotion Recognition

Authors: Xinkui Zhao, Jinsong Shu, Yangyang Wu, Guanjie Cheng, Zihe Liu, Naibo Wang, Shuiguang Deng, Zhongle Xie, Jianwei Yin

Abstract: Multimodal Emotion Recognition (MER) often encounters incomplete multimodality in practical applications due to sensor failures or privacy protection requirements. While existing methods attempt to address various incomplete multimodal scenarios by balancing the training of each modality combination through additional gradients, these approaches face a critical limitation: training gradients from different modality combinations conflict with each other, ultimately degrading the performance of the final prediction model. In this paper, we propose a unimodal decoupled dynamic low-rank adaptation method based on modality combinations, named MCULoRA, which is a novel framework for the parameter-efficient training of incomplete multimodal learning models. MCULoRA consists of two key modules, modality combination aware low-rank adaptation (MCLA) and dynamic parameter fine-tuning (DPFT). The MCLA module effectively decouples the shared information from the distinct characteristics of individual modality combinations. The DPFT module adjusts the training ratio of modality combinations based on the separability of each modality's representation space, optimizing the learning efficiency across different modality combinations. Our extensive experimental evaluation in multiple benchmark datasets demonstrates that MCULoRA substantially outperforms previous incomplete multimodal learning approaches in downstream task accuracy.

cross Agentic SPARQL: Evaluating SPARQL-MCP-powered Intelligent Agents on the Federated KGQA Benchmark

Authors: Daniel Dobriy, Frederik Bauer, Amr Azzam, Debayan Banerjee, Axel Polleres

Abstract: Standard protocols such as the Model Context Protocol (MCP) that allow LLMs to connect to tools have recently boosted "agentic" AI applications, which, powered by LLMs' planning capabilities, promise to solve complex tasks with the access of external tools and data sources. In this context, publicly available SPARQL endpoints offer a natural connection to combine various data sources through MCP by (a) implementing a standardised protocol and query language, (b) standardised metadata formats, and (c) the native capability to federate queries. In the present paper, we explore the potential of SPARQL-MCP-based intelligent agents to facilitate federated SPARQL querying: firstly, we discuss how to extend an existing Knowledge Graph Question Answering benchmark towards agentic federated Knowledge Graph Question Answering (FKGQA); secondly, we implement and evaluate the ability of integrating SPARQL federation with LLM agents via MCP (incl. endpoint discovery/source selection, schema exploration, and query formulation), comparing different architectural options against the extended benchmark. Our work complements and extends prior work on automated SPARQL query federation towards fruitful combinations with agentic AI.

cross Right Move, Right Time: Multi-Sport Space Evaluation Platform for Ultimate Frisbee, Basketball, and Soccer

Authors: Shunsuke Iwashita, Titouan Jeannot, Braden Eberhard, Jacob Miller, Rikako Kono, Calvin Yeung, Keisuke Fujii

Abstract: We present an open, sport-agnostic platform that turns tracking into comparable spatial measures across professional Ultimate, basketball, and soccer. Coaches in all three sports ask the same question: where is the usable space, and when should an off-ball run start? Our workflow standardizes inputs, provides timing-aware spatial evaluations, and makes it possible to reuse the same analysis across sports. We illustrate the approach with Ultimate as a focused testbed and then examine transfer between basketball and soccer. Together, these results show a practical path toward consistent, comparable evaluation across various invasion sports.

cross Isotonic Layer: A Universal Framework for Generic Recommendation Debiasing

Authors: Hailing Cheng, Yafang Yang, Hemeng Tao, Fengyu Zhang

Abstract: Model calibration and debiasing are fundamental to the reliability and fairness of large scale recommendation systems. We introduce the Isotonic Layer, a novel, differentiable framework that integrates piecewise linear fitting directly into neural architectures. By partitioning the feature space into discrete segments and optimizing non negative slopes via a constrained dot product mechanism, we enforce a global monotonic inductive bias. This ensures model outputs remain logically consistent with critical features such as latent relevance, recency, or quality scores. We further generalize this architecture by parameterizing segment wise slopes as learnable embeddings. This enables the model to adaptively capture context specific distortions, such as position based CTR bias through specialized isotonic profiles. Our approach utilizes a dual task formulation that decouples the recommendation objective into latent relevance estimation and bias aware calibration. A major contribution of this work is the ability to perform highly granular, customized calibration for arbitrary combinations of context features, a level of control difficult to achieve with traditional non parametric methods. We also extend this to Multi Task Learning environments with dedicated embeddings for distinct objectives. Extensive empirical evaluations on real world datasets and production AB tests demonstrate that the Isotonic Layer effectively mitigates systematic bias and enhances calibration fidelity, significantly outperforming production baselines in both predictive accuracy and ranking consistency.

cross ARC-AGI-2 Technical Report

Authors: Wallyson Lemes de Oliveira, Mekhron Bobokhonov, Matteo Caorsi, Aldo Podest\`a, Gabriele Beltramo, Luca Crosato, Matteo Bonotto, Federica Cecchetto, Hadrien Espic, Dan Titus Salajan, Stefan Taga, Luca Pana, Joe Carthy

Abstract: The Abstraction and Reasoning Corpus (ARC) is designed to assess generalization beyond pattern matching, requiring models to infer symbolic rules from very few examples. In this work, we present a transformer-based system that advances ARC performance by combining neural inference with structure-aware priors and online task adaptation. Our approach is built on four key ideas. First, we reformulate ARC reasoning as a sequence modeling problem using a compact task encoding with only 125 tokens, enabling efficient long-context processing with a modified LongT5 architecture. Second, we introduce a principled augmentation framework based on group symmetries, grid traversals, and automata perturbations, enforcing invariance to representation changes. Third, we apply test-time training (TTT) with lightweight LoRA adaptation, allowing the model to specialize to each unseen task by learning its transformation logic from demonstrations. Fourth, we design a symmetry-aware decoding and scoring pipeline that aggregates likelihoods across augmented task views, effectively performing ``multi-perspective reasoning'' over candidate solutions. We demonstrate that these components work synergistically: augmentations expand hypothesis space, TTT sharpens local reasoning, and symmetry-based scoring improves solution consistency. Our final system achieves a significant improvement over transformer baselines and surpasses prior neural ARC solvers, closing the gap toward human-level generalization.

cross A Coin Flip for Safety: LLM Judges Fail to Reliably Measure Adversarial Robustness

Authors: Leo Schwinn, Moritz Ladenburger, Tim Beyer, Mehrnaz Mofakhami, Gauthier Gidel, Stephan G\"unnemann

Abstract: Automated \enquote{LLM-as-a-Judge} frameworks have become the de facto standard for scalable evaluation across natural language processing. For instance, in safety evaluation, these judges are relied upon to evaluate harmfulness in order to benchmark the robustness of safety against adversarial attacks. However, we show that existing validation protocols fail to account for substantial distribution shifts inherent to red-teaming: diverse victim models exhibit distinct generation styles, attacks distort output patterns, and semantic ambiguity varies significantly across jailbreak scenarios. Through a comprehensive audit using 6642 human-verified labels, we reveal that the unpredictable interaction of these shifts often causes judge performance to degrade to near random chance. This stands in stark contrast to the high human agreement reported in prior work. Crucially, we find that many attacks inflate their success rates by exploiting judge insufficiencies rather than eliciting genuinely harmful content. To enable more reliable evaluation, we propose ReliableBench, a benchmark of behaviors that remain more consistently judgeable, and JudgeStressTest, a dataset designed to expose judge failures. Data available at: https://github.com/SchwinnL/LLMJudgeReliability.

URLs: https://github.com/SchwinnL/LLMJudgeReliability.

cross Distributionally Robust Geometric Joint Chance-Constrained Optimization: Neurodynamic Approaches

Authors: Ange Valli (L2S), Siham Tassouli (OPTIM), Abdel Lisser (L2S)

Abstract: This paper proposes a two-time scale neurodynamic duplex approach to solve distributionally robust geometric joint chance-constrained optimization problems. The probability distributions of the row vectors are not known in advance and belong to a certain distributional uncertainty set. In our paper, we study three uncertainty sets for the unknown distributions. The neurodynamic duplex is designed based on three projection equations. The main contribution of our work is to propose a neural network-based method to solve distributionally robust joint chance-constrained optimization problems that converges in probability to the global optimum without the use of standard state-of-the-art solving methods. We show that neural networks can be used to solve multiple instances of a problem. In the numerical experiments, we apply the proposed approach to solve a problem of shape optimisation and a telecommunication problem.

cross Building the ethical AI framework of the future: from philosophy to practice

Authors: Jasper Kyle Catapang

Abstract: Artificial intelligence pipelines -- spanning data collection, model training, deployment, and post-deployment monitoring -- concentrate ethical risks that intensify with multimodal and agentic systems. Existing governance instruments, including the EU AI Act, the IEEE 7000 series, and the NIST AI Risk Management Framework, provide high-level guidance but often lack enforceable, end-to-end operational controls. This paper presents an ethics-by-design control architecture that embeds consequentialist, deontological, and virtue-ethical reasoning into stage-specific enforcement mechanisms across the AI lifecycle. The framework implements a triple-gate structure at each lifecycle stage: Metric gates (quantitative performance and safety thresholds), Governance gates (legal, rights, and procedural compliance), and Eco gates (carbon and water budgets and sustainability constraints). It specifies measurable trigger conditions, escalation paths, audit artefacts, and mappings to EU AI Act obligations and NIST RMF functions, enabling integration with existing MLOps and CI/CD pipelines. Illustrative examples from large language model pipelines demonstrate how gate-based controls can surface and constrain technical, social, and environmental risks prior to release and during runtime. The framework is accompanied by a preregistered evaluation protocol that defines ex ante success criteria and assessment procedures, enabling falsifiable evaluation of gate effectiveness. By translating normative commitments into enforceable and testable controls, the framework provides a practical basis for operational AI governance across organizational contexts, jurisdictions, and deployment scales.

cross FuzzingRL: Reinforcement Fuzz-Testing for Revealing VLM Failures

Authors: Jiajun Xu, Jiageng Mao, Ang Qi, Weiduo Yuan, Alexander Romanus, Helen Xia, Vitor Campagnolo Guizilini, Yue Wang

Abstract: Vision Language Models (VLMs) are prone to errors, and identifying where these errors occur is critical for ensuring the reliability and safety of AI systems. In this paper, we propose an approach that automatically generates questions designed to deliberately induce incorrect responses from VLMs, thereby revealing their vulnerabilities. The core of this approach lies in fuzz testing and reinforcement finetuning: we transform a single input query into a large set of diverse variants through vision and language fuzzing. Based on the fuzzing outcomes, the question generator is further instructed by adversarial reinforcement fine-tuning to produce increasingly challenging queries that trigger model failures. With this approach, we can consistently drive down a target VLM's answer accuracy -- for example, the accuracy of Qwen2.5-VL-32B on our generated questions drops from 86.58\% to 65.53\% in four RL iterations. Moreover, a fuzzing policy trained against a single target VLM transfers to multiple other VLMs, producing challenging queries that degrade their performance as well.

cross Scale Dependent Data Duplication

Authors: Joshua Kazdan, Noam Levi, Rylan Schaeffer, Jessica Chudnovsky, Abhay Puri, Bo He, Mehmet Donmez, Sanmi Koyejo, David Donoho

Abstract: Data duplication during pretraining can degrade generalization and lead to memorization, motivating aggressive deduplication pipelines. However, at web scale, it is unclear what constitutes a ``duplicate'': beyond surface-form matches, semantically equivalent documents (e.g. translations) may induce redundant training signals once models become sufficiently capable. Practically, this means that semantic duplicates operate increasingly like exact duplicates during training. We present evidence that duplication is scale-dependent in two ways. First, as model capability increases, cross-entropy loss gradients for semantically equivalent documents become more aligned. Smaller models, by contrast, produce gradients that reflect surface similarity (e.g., shared tokens) rather than semantic similarity. Second, we embedded all 192 million FineWeb-Edu-Dedup documents using EmbeddingGemma-300m. For moderate corpus sizes, the cosine similarity between nearest-neighbors follows an isotropic power law baseline. However, as corpus size grows to hundreds of billions of tokens, the nearest-neighbor similarities deviate sharply, indicating accelerated semantic collisions. Finally, controlled pretraining on data sampled with replacement from pools of finite unique documents shows that limited uniqueness yields mild degradation for small models, but rapidly increasing loss penalties for larger models, breaking naive scaling extrapolation. We derive explicit scaling laws that allow practitioners to estimate deviation from expected scaling due to limited semantic uniqueness of the pretraining corpus. Our results identify and resolve an unstudied source of scale-dependence, allowing for more accurate prediction at scale.

cross Multi-Agent DRL for V2X Resource Allocation: Disentangling Challenges and Benchmarking Solutions

Authors: Siyuan Wang, Lei Lei, Pranav Maheshwari, Sam Bellefeuille, Kan Zheng, Dusit Niyato

Abstract: Multi-agent deep reinforcement learning (DRL) has emerged as a promising approach for radio resource allocation (RRA) in cellular vehicle-to-everything (C-V2X) networks. However, the multifaceted challenges inherent to multi-agent reinforcement learning (MARL) - including non-stationarity, coordination difficulty, large action spaces, partial observability, and limited robustness and generalization - are often intertwined, making it difficult to understand their individual impact on performance in vehicular environments. Moreover, existing studies typically rely on different baseline MARL algorithms, and a systematic comparison of their capabilities in addressing specific challenges in C-V2X RRA remains lacking. In this paper, we bridge this gap by formulating C-V2X RRA as a sequence of multi-agent interference games with progressively increasing complexity, each designed to isolate a key MARL challenge. Based on these formulations, we construct a suite of learning tasks that enable controlled evaluation of performance degradation attributable to each challenge. We further develop large-scale, diverse training and testing datasets using SUMO-generated highway traces to capture a wide range of vehicular topologies and corresponding interference patterns. Through extensive benchmarking of representative MARL algorithms, we identify policy robustness and generalization across diverse vehicular topologies as the dominant challenge in C-V2X RRA. We further show that, on the most challenging task, the best-performing actor-critic method outperforms the value-based approach by 42%. By emphasizing the need for zero-shot policy transfer to both seen and unseen topologies at runtime, and by open-sourcing the code, datasets, and interference-game benchmark suite, this work provides a systematic and reproducible foundation for evaluating and advancing MARL algorithms in vehicular networks.

cross Consensus is Not Verification: Why Crowd Wisdom Strategies Fail for LLM Truthfulness

Authors: Yegor Denisov-Blanch, Joshua Kazdan, Jessica Chudnovsky, Rylan Schaeffer, Sheng Guan, Soji Adeshina, Sanmi Koyejo

Abstract: Pass@k and other methods of scaling inference compute can improve language model performance in domains with external verifiers, including mathematics and code, where incorrect candidates can be filtered reliably. This raises a natural question: can we similarly scale compute to elicit gains in truthfulness for domains without convenient verification? We show that across five benchmarks and models, surprisingly, it cannot. Even at 25x the inference cost of naive sampling, polling-style aggregation yields no consistent accuracy gains over single-sample baselines and often amplifies shared misconceptions. We find that under uncertainty, models are better at predicting what other models will say within model ensembles than at identifying what is true, revealing a separation between social prediction and truth verification. Across models and benchmarks, aggregation fails to provide a robust truth signal because language model errors are strongly correlated. The source of correlation goes beyond any individual benchmark: we show that even when conditioned on out of distribution random strings and asked to produce pseudo-random outputs, different models produce correlated outputs. Confidence-based weighting provides no benefit because self-reported confidence fails to reliably distinguish correct from incorrect answers. These results delineate a boundary for inference-time scaling: in verified domains, additional samples provide more candidates for a verifier to filter; in unverified domains, additional samples merely reinforce shared misconceptions.

cross OptiRoulette Optimizer: A New Stochastic Meta-Optimizer for up to 5.3x Faster Convergence

Authors: Stamatis Mastromichalakis

Abstract: This paper presents OptiRoulette, a stochastic meta-optimizer that selects update rules during training instead of fixing a single optimizer. The method combines warmup optimizer locking, random sampling from an active optimizer pool, compatibility-aware learning-rate scaling during optimizer transitions, and failure-aware pool replacement. OptiRoulette is implemented as a drop-in, "torch.optim.Optimizer-compatible" component and packaged for pip installation. We report completed 10-seed results on five image-classification suites: CIFAR-100, CIFAR-100-C, SVHN, Tiny ImageNet, and Caltech-256. Against a single-optimizer AdamW baseline, OptiRoulette improves mean test accuracy from 0.6734 to 0.7656 on CIFAR-100 (+9.22 percentage points), 0.2904 to 0.3355 on CIFAR-100-C (+4.52), 0.9667 to 0.9756 on SVHN (+0.89), 0.5669 to 0.6642 on Tiny ImageNet (+9.73), and 0.5946 to 0.6920 on Caltech-256 (+9.74). Its main advantage is convergence reliability at higher targets: it reaches CIFAR-100/CIFAR-100-C 0.75, SVHN 0.96, Tiny ImageNet 0.65, and Caltech-256 0.62 validation accuracy in 10/10 runs, while the AdamW baseline reaches none of these targets within budget. On shared targets, OptiRoulette also reduces time-to-target (e.g., Caltech-256 at 0.59: 25.7 vs 77.0 epochs). Paired-seed deltas are positive on all datasets; CIFAR-100-C test ROC-AUC is the only metric not statistically significant in the current 10-seed study.

cross Annealed Co-Generation: Disentangling Variables via Progressive Pairwise Modeling

Authors: Hantao Zhang, Jieke Wu, Mingda Xu, Xiao Hu, Yingxuan You, Pascal Fua

Abstract: For multivariate co-generation in scientific applications, we advocate pairwise block rather than joint modeling of all variables. This design mitigates the computational burden and data imbalance. To this end, we propose an Annealed Co-Generation (ACG) framework that replaces high-dimensional diffusion modeling with a low-dimensional diffusion model, which enables multivariate co-generation by composing pairwise variable generations. We first train an unconditional diffusion model over causal variables that are disentangled into pairs. At inference time, we recover the joint distribution by coupling these pairwise models through shared common variables, enabling coherent multivariate generation without any additional training. By employing a three-stage annealing process-Consensus, Heating, and Cooling-our method enforces consistency across shared common variables and progressively constrains each pairwise data distribution to lie on a learnable manifold, while maintaining high likelihood within each pair. We demonstrate the framework's flexibility and efficacy on two distinct scientific tasks: flow-field completion and antibody generation. All datasets and code will be made publicly available upon publication.

cross RACER: Risk-Aware Calibrated Efficient Routing for Large Language Models

Authors: Sai Hao, Hao Zeng, Hongxin Wei, Bingyi Jing

Abstract: Efficiently routing queries to the optimal large language model (LLM) is crucial for optimizing the cost-performance trade-off in multi-model systems. However, most existing routers rely on single-model selection, making them susceptible to misrouting. In this work, we formulate LLM routing as the $\alpha$-VOR problem to minimize expected set size while controlling the misrouting risk, and propose a novel method -- RACER, extending base routers to output model sets that can be subsequently aggregated for improved output. In particular, RACER constructs nested model sets via augmented scoring and utilizes finite-sample concentration bounds to calibrate a threshold that allows for both variable set sizes and abstention. We theoretically prove that RACER achieves rigorous distribution-free risk control on unseen test data in a post-hoc and model-agnostic manner. Extensive experiments verify our theoretical guarantees and demonstrate that RACER consistently enhances downstream accuracy across a wide range of benchmarks.

cross Evo: Autoregressive-Diffusion Large Language Models with Evolving Balance

Authors: Junde Wu, Minhao Hu, Jiayuan Zhu, Yuyuan Liu, Tianyi Zhang, Kang Li, Jingkun Chen, Jiazhen Pan, Min Xu, Yueming Jin

Abstract: We introduce \textbf{Evo}, a duality latent trajectory model that bridges autoregressive (AR) and diffusion-based language generation within a continuous evolutionary generative framework. Rather than treating AR decoding and diffusion generation as separate paradigms, Evo reconceptualizes text generation as a latent flow: each token is associated with a vector-valued embedding that evolves over a progression variable $t_i \in [0, 1]$, indicating its semantic maturity. Low $t_i$ values correspond to confident AR-like refinement, while high values invoke diffusion-style planning, allowing the model to adaptively balance AR and diffusion based on uncertainty. Theoretically, we show that both AR and diffusion models emerge as discretizations of a shared probability flow, and we derive Evo's training objective from a unified variational ELBO. The model is implemented as a time-conditioned Transformer governed by a shared vector field, trained end-to-end to jointly infer latent codes and their progression times. During decoding, Evo performs efficient, semantics-aware refinement, achieving high-quality outputs without sacrificing speed. Empirically, Evo 8B achieves state-of-the-art or highly competitive results on 15 diverse benchmarks, including reasoning (GSM8K, ARC-C), code generation (HumanEval, MBPP), and general language understanding, while maintaining fast inference speed. Our results demonstrate that Evo delivers a new paradigm for LLM design with strong generation quality, robust symbolic reasoning, and decoding efficiency.

cross Distilling and Adapting: A Topology-Aware Framework for Zero-Shot Interaction Prediction in Multiplex Biological Networks

Authors: Alana Deng, Sugitha Janarthanan, Yan Sun, Zihao Jing, Pingzhao Hu

Abstract: Multiplex Biological Networks (MBNs), which represent multiple interaction types between entities, are crucial for understanding complex biological systems. Yet, existing methods often inadequately model multiplexity, struggle to integrate structural and sequence information, and face difficulties in zero-shot prediction for unseen entities with no prior neighbourhood information. To address these limitations, we propose a novel framework for zero-shot interaction prediction in MBNs by leveraging context-aware representation learning and knowledge distillation. Our approach leverages domain-specific foundation models to generate enriched embeddings, introduces a topology-aware graph tokenizer to capture multiplexity and higher-order connectivity, and employs contrastive learning to align embeddings across modalities. A teacher-student distillation strategy further enables robust zero-shot generalization. Experimental results demonstrate that our framework outperforms state-of-the-art methods in interaction prediction for MBNs, providing a powerful tool for exploring various biological interactions and advancing personalized therapeutics.

cross Not all tokens are needed(NAT): token efficient reinforcement learning

Authors: Hejian Sang, Yuanda Xu, Zhengze Zhou, Ran He, Zhipeng Wang

Abstract: Reinforcement learning (RL) has become a key driver of progress in large language models, but scaling RL to long chain-of-thought (CoT) trajectories is increasingly constrained by backpropagation over every generated token. Even with optimized rollout engines, full-token updates can consume a large fraction of total training cost, turning token length into a hidden tax on RL. We introduce Not All Tokens Are Needed (NAT), a unified framework that makes the token budget a first-class optimization primitive. NAT updates the policy using only a selected subset of generated tokens while preserving the learning signal of full-sequence RL. The core idea is an unbiased partial-token policy-gradient estimator via Horvitz-Thompson reweighting, which ensures statistically correct gradients despite subsampling. We instantiate NAT with two simple, plug-and-play token selection schemes: Uniform Random Sampling (URS) and Random Prefix Cutting (RPC), both of which reduce forward and backward compute and memory without modifying the reward computation or rollout pipeline. Across mathematical reasoning benchmarks, NAT matches full-token GRPO performance while using as few as 50% of tokens, providing an efficient and orthogonal pathway to scaling RL beyond the limits imposed by long trajectories. In our experiments, RPC saves 18% peak GPU memory and 29% forward and backward RL training time for Qwen3-8B.

cross GraphSkill: Documentation-Guided Hierarchical Retrieval-Augmented Coding for Complex Graph Reasoning

Authors: Fali Wang, Chenglin Weng, Xianren Zhang, Siyuan Hong, Hui Liu, Suhang Wang

Abstract: The growing demand for automated graph algorithm reasoning has attracted increasing attention in the large language model (LLM) community. Recent LLM-based graph reasoning methods typically decouple task descriptions from graph data, generate executable code augmented by retrieval from technical documentation, and refine the code through debugging. However, we identify two key limitations in existing approaches: (i) they treat technical documentation as flat text collections and ignore its hierarchical structure, leading to noisy retrieval that degrades code generation quality; and (ii) their debugging mechanisms focus primarily on runtime errors, yet ignore more critical logical errors. To address them, we propose {\method}, an \textit{agentic hierarchical retrieval-augmented coding framework} that exploits the document hierarchy through top-down traversal and early pruning, together with a \textit{self-debugging coding agent} that iteratively refines code using automatically generated small-scale test cases. To enable comprehensive evaluation of complex graph reasoning, we introduce a new dataset, {\dataset}, covering small-scale, large-scale, and composite graph reasoning tasks. Extensive experiments demonstrate that our method achieves higher task accuracy and lower inference cost compared to baselines\footnote{The code is available at \href{https://github.com/FairyFali/GraphSkill}{\textcolor{blue}{https://github.com/FairyFali/GraphSkill}}.}.

URLs: https://github.com/FairyFali/GraphSkill, https://github.com/FairyFali/GraphSkill

cross From ARIMA to Attention: Power Load Forecasting Using Temporal Deep Learning

Authors: Suhasnadh Reddy Veluru, Sai Teja Erukude, Viswa Chaitanya Marella

Abstract: Accurate short-term power load forecasting is important to effectively manage, optimize, and ensure the robustness of modern power systems. This paper performs an empirical evaluation of a traditional statistical model and deep learning approaches for predicting short-term energy load. Four models, namely ARIMA, LSTM, BiLSTM, and Transformer, were leveraged on the PJM Hourly Energy Consumption data. The data processing involved interpolation, normalization, and a sliding-window sequence method. Each model's forecasting performance was evaluated for the 24-hour horizon using MAE, RMSE, and MAPE. Of the models tested, the Transformer model, which relies on self-attention algorithms, produced the best results with 3.8 percent of MAPE, with performance above any model in both accuracy and robustness. These findings underscore the growing potential of attention-based architectures in accurately capturing complex temporal patterns in power consumption data.

cross Exploration Space Theory: Formal Foundations for Prerequisite-Aware Location-Based Recommendation

Authors: Madjid Sadallah

Abstract: Location-based recommender systems have achieved considerable sophistication, yet none provides a formal, lattice-theoretic representation of prerequisite dependencies among points of interest -- the semantic reality that meaningfully experiencing certain locations presupposes contextual knowledge gained from others -- nor the structural guarantees that such a representation entails. We introduce Exploration Space Theory (EST), a formal framework that transposes Knowledge Space Theory into location-based recommendation. We prove that the valid user exploration states -- the order ideals of a surmise partial order on points of interest -- form a finite distributive lattice and a well-graded learning space; Birkhoff's representation theorem, combined with the structural isomorphism between lattices of order ideals and concept lattices, connects the exploration space canonically to Formal Concept Analysis. These structural results yield four direct consequences: linear-time fringe computation, a validity certificate guaranteeing that every fringe-guided recommendation is a structurally sound next step, sub-path optimality for dynamic-programming path generation, and provably existing structural explanations for every recommendation. Building on these foundations, we specify the Exploration Space Recommender System (ESRS) -- a memoized dynamic program over the exploration lattice, a Bayesian state estimator with beam approximation and EM parameter learning, an online feedback loop enforcing the downward-closure invariant, an incremental surmise-relation inference pipeline, and three cold-start strategies, the structural one being the only approach in the literature to provide a formal validity guarantee conditional on the correctness of the inferred surmise relation. All results are established through proof and illustrated on a fully traced five-POI numerical example.

cross Pavement Missing Condition Data Imputation through Collective Learning-Based Graph Neural Networks

Authors: Ke Yu, Lu Gao

Abstract: Pavement condition data is important in providing information regarding the current state of the road network and in determining the needs of maintenance and rehabilitation treatments. However, the condition data is often incomplete due to various reasons such as sensor errors and non-periodic inspection schedules. Missing data, especially data missing systematically, presents loss of information, reduces statistical power, and introduces biased assessment. Existing methods in dealing with missing data usually discard entire data points with missing values or impute through data correlation. In this paper, we used a collective learning-based Graph Convolutional Networks, which integrates both features of adjacent sections and dependencies between observed section conditions to learn missing condition values. Unlike other variants of graph neural networks, the proposed approach is able to capture dependent relationship between the conditions of adjacent pavement sections. In the case study, pavement condition data collected from Texas Department of Transportation Austin District were used. Experiments show that the proposed model was able to produce promising results in imputing the missing data.

cross Grouter: Decoupling Routing from Representation for Accelerated MoE Training

Authors: Yuqi Xu, Rizhen Hu, Zihan Liu, Mou Sun, Kun Yuan

Abstract: Traditional Mixture-of-Experts (MoE) training typically proceeds without any structural priors, effectively requiring the model to simultaneously train expert weights while searching for an optimal routing policy within a vast combinatorial space. This entanglement often leads to sluggish convergence and training instabilities. This paper introduces Grouter, a preemptive routing method that by distilling high-quality structures from fully-trained MoE models and serving as a fixed router for target models. By decoupling structural optimization from weight updates, Grouter significantly accelerates both the speed and quality of model convergence. To ensure the framework's versatility, we also introduce expert folding to adapt Grouter across varying model configurations and expert tuning to rebalance workloads across different data distributions. Furthermore, by leveraging the structural priors provided by preemptive routing, we can implement targeted optimizations to further enhance training throughput. Experiments demonstrate that Grouter achieves superior performance and efficiency which boosts pre-training data utilization by 4.28x and achieves up to 33.5% throughput acceleration, establishing preemptive routing as a fundamental paradigm for scalable MoE training.

cross Photons = Tokens: The Physics of AI and the Economics of Knowledge

Authors: Alec Litowitz, Nick Polson, Vadim Sokolov

Abstract: Debates about artificial intelligence capabilities and risks are often conducted without quantitative grounding. This paper applies the methodology of MacKay (2009) -- who reframed energy policy as arithmetic -- to the economy of AI computation. We define the token, the elementary unit of large language model input and output, as a physical quantity with measurable thermodynamic cost. Using Landauer's principle, Shannon's channel capacity, and current infrastructure data, we construct a supply-and-demand balance sheet for global token production. We then derive a finite question budget: the number of meaningful queries humanity can direct at AI systems under physical, information-theoretic, and economic constraints. We apply Coase's theory of the firm and the durable-goods monopoly problem to the AI value chain -- from photon to atom to chip to power to token to question -- to identify where economic value concentrates and where regulatory intervention is warranted. We argue that the expansion of the token budget does not resolve a deeper constraint: under structural uncertainty, the decisive variable is not how many questions can be answered but which questions are worth asking -- a problem of agency and direction that computation alone cannot solve. We connect limits of measurement in the token economy to a structural parallel between Goodhart's law and the Heisenberg uncertainty principle, and to Arrow's impossibility result for efficient information pricing. The framework yields order-of-magnitude estimates that discipline policy discussion: at current efficiency, the projected 2028 US AI energy allocation of 326~TWh could support roughly $6.5 \times 10^{17}$ tokens per year, or 225,000 tokens per person per day -- more than three orders of magnitude above estimated mid-2024 utilization.

cross SmartBench: Evaluating LLMs in Smart Homes with Anomalous Device States and Behavioral Contexts

Authors: Qingsong Zou, Zhi Yan, Zhiyao Xu, Kuofeng Gao, Jingyu Xiao, Yong Jiang

Abstract: Due to the strong context-awareness capabilities demonstrated by large language models (LLMs), recent research has begun exploring their integration into smart home assistants to help users manage and adjust their living environments. While LLMs have been shown to effectively understand user needs and provide appropriate responses, most existing studies primarily focus on interpreting and executing user behaviors or instructions. However, a critical function of smart home assistants is the ability to detect when the home environment is in an anomalous state. This involves two key requirements: the LLM must accurately determine whether an anomalous condition is present, and provide either a clear explanation or actionable suggestions. To enhance the anomaly detection capabilities of next-generation LLM-based smart home assistants, we introduce SmartBench, which is the first smart home dataset designed for LLMs, containing both normal and anomalous device states as well as normal and anomalous device state transition contexts. We evaluate 13 mainstream LLMs on this benchmark. The experimental results show that most state-of-the-art models cannot achieve good anomaly detection performance. For example, Claude-Sonnet-4.5 achieves only 66.1% detection accuracy on context-independent anomaly categories, and performs even worse on context-dependent anomalies, with an accuracy of only 57.8%. More experimental results suggest that next-generation LLM-based smart home assistants are still far from being able to effectively detect and handle anomalous conditions in the smart home environment. Our dataset is publicly available at https://github.com/horizonsinzqs/SmartBench.

URLs: https://github.com/horizonsinzqs/SmartBench.

cross HEARTS: Benchmarking LLM Reasoning on Health Time Series

Authors: Sirui Li, Shuhan Xiao, Mihir Joshi, Ahmed Metwally, Daniel McDuff, Wei Wang, Yuzhe Yang

Abstract: The rise of large language models (LLMs) has shifted time series analysis from narrow analytics to general-purpose reasoning. Yet, existing benchmarks cover only a small set of health time series modalities and tasks, failing to reflect the diverse domains and extensive temporal dependencies inherent in real-world physiological modeling. To bridge these gaps, we introduce HEARTS (Health Reasoning over Time Series), a unified benchmark for evaluating hierarchical reasoning capabilities of LLMs over general health time series. HEARTS integrates 16 real-world datasets across 12 health domains and 20 signal modalities, and defines a comprehensive taxonomy of 110 tasks grouped into four core capabilities: Perception, Inference, Generation, and Deduction. Evaluating 14 state-of-the-art LLMs on more than 20K test samples reveals intriguing findings. First, LLMs substantially underperform specialized models, and their performance is only weakly related to general reasoning scores. Moreover, LLMs often rely on simple heuristics and struggle with multi-step temporal reasoning. Finally, performance declines with increasing temporal complexity, with similar failure modes within model families, indicating that scaling alone is insufficient. By making these gaps measurable, HEARTS provides a standardized testbed and living benchmark for developing next-generation LLM agents capable of reasoning over diverse health signals.

cross RECAP: Local Hebbian Prototype Learning as a Self-Organizing Readout for Reservoir Dynamics

Authors: Heng Zhang

Abstract: Robust perception in brains is often attributed to high-dimensional population activity together with local plasticity mechanisms that reinforce recurring structure. In contrast, most modern image recognition systems are trained by error backpropagation and end-to-end gradient optimization, which are not naturally aligned with local computation and local plasticity. We introduce RECAP (Reservoir Computing with Hebbian Co-Activation Prototypes), a bio-inspired learning strategy for robust image classification that couples untrained reservoir dynamics with a self-organizing Hebbian prototype readout. RECAP discretizes time-averaged reservoir responses into activation levels, constructs a co-activation mask over reservoir unit pairs, and incrementally updates class-wise prototype matrices via a Hebbian-like potentiation-decay rule. Inference is performed by overlap-based prototype matching. The method avoids error backpropagation and is naturally compatible with online prototype updates. We illustrate the resulting robustness behavior on MNIST-C, where RECAP remains robust under diverse corruptions without exposure to corrupted training samples.

cross SR-TTT: Surprisal-Aware Residual Test-Time Training

Authors: Swamynathan V P

Abstract: Test-Time Training (TTT) language models achieve theoretically infinite context windows with an O(1) memory footprint by replacing the standard exact-attention KV-cache with hidden state ``fast weights'' W_fast updated via self-supervised learning during inference. However, pure TTT architectures suffer catastrophic failures on exact-recall tasks (e.g., Needle-in-a-Haystack). Because the fast weights aggressively compress the context into an information bottleneck, highly surprising or unique tokens are rapidly overwritten and forgotten by subsequent token gradient updates. We introduce SR-TTT (Surprisal-Aware Residual Test-Time Training), which resolves this recall failure by augmenting the TTT backbone with a loss-gated sparse memory mechanism. By dynamically routing only incompressible, highly surprising tokens to a traditional exact-attention Residual Cache, SR-TTT preserves O(1) memory for low-entropy background context while utilizing exact attention exclusively for critical needles. Our complete implementation, training scripts, and pre-trained weights are open-source and available at: https://github.com/swamynathanvp/Surprisal-Aware-Residual-Test-Time-Training.

URLs: https://github.com/swamynathanvp/Surprisal-Aware-Residual-Test-Time-Training.

cross Trust Aware Federated Learning for Secure Bone Healing Stage Interpretation in e-Health

Authors: Paul Shepherd, Tasos Dagiuklas, Bugra Alkan, Joaquim Bastos, Jonathan Rodriguez

Abstract: This paper presents a trust aware federated learning (FL) framework for interpreting bone healing stages using spectral features derived from frequency response data. The primary objective is to address the challenge posed by either unreliable or adversarial participants in distributed medical sensing environments. The framework employs a multi-layer perceptron model trained across simulated clients using the Flower FL framework. The proposed approach integrates an Adaptive Trust Score Scaling and Filtering (ATSSSF) mechanism with exponential moving average (EMA) smoothing to assess, validate and filter client contributions.Two trust score smoothing strategies have been investigated, one with a fixed factor and another that adapts according to trust score variability. Clients with low trust are excluded from aggregation and readmitted once their reliability improves, ensuring model integrity while maintaining inclusivity. Standard classification metrics have been used to compare the performance of ATSSSF with the baseline Federated Averaging strategy. Experimental results demonstrate that adaptive trust management can improve both training stability and predictive performance by mitigating the negative effects of compromised clients while retaining robust detection capabilities. The work establishes the feasibility for adaptive trust mechanisms in federated medical sensing and identifies extension to clinical cross silo aggregation as a future research direction.

cross Performance Comparison of IBN orchestration using LLM and SLMs

Authors: Wai Lwin Phone, Brahim El Boudani, Tasos Dagiuklas, Saptarshi Ghosh

Abstract: The evolution of both 5G and 6G networks is driving the advancement of fully autonomous network management, placing Intent-Based Networking at the centre of this transformation. This paper introduces a novel framework for 5G and 6G IBN orchestration that leverages a stateful, hierarchical multi-agent architecture to achieve full automation using both SLMs and LLMs. Both models have been evaluated for translation accuracy using metrics such as BLEU, METEOR, and ROUGE-L, as well as computational complexity. Experimental results show that both models exhibit similar accuracy. However, result shows that SLMs can improve the overall completion speed of the IBN lifecycle by 20%.

cross ObjChangeVR: Object State Change Reasoning from Continuous Egocentric Views in VR Environments

Authors: Shiyi Ding, Shaoen Wu, Ying Chen

Abstract: Recent advances in multimodal large language models (MLLMs) offer a promising approach for natural language-based scene change queries in virtual reality (VR). Prior work on applying MLLMs for object state understanding has focused on egocentric videos that capture the camera wearer's interactions with objects. However, object state changes may occur in the background without direct user interaction, lacking explicit motion cues and making them difficult to detect. Moreover, no benchmark exists for evaluating this challenging scenario. To address these challenges, we introduce ObjChangeVR-Dataset, specifically for benchmarking the question-answering task of object state change. We also propose ObjChangeVR, a framework that combines viewpoint-aware and temporal-based retrieval to identify relevant frames, along with cross-view reasoning that reconciles inconsistent evidence from multiple viewpoints. Extensive experiments demonstrate that ObjChangeVR significantly outperforms baseline approaches across multiple MLLMs.

cross HURRI-GAN: A Novel Approach for Hurricane Bias-Correction Beyond Gauge Stations using Generative Adversarial Networks

Authors: Noujoud Nadera, Hadi Majed, Stefanos Giaremis, Rola El Osta, Clint Dawson, Carola Kaiser, Hartmut Kaiser

Abstract: The coastal regions of the eastern and southern United States are impacted by severe storm events, leading to significant loss of life and properties. Accurately forecasting storm surge and wind impacts from hurricanes is essential for mitigating some of the impacts, e.g., timely preparation of evacuations and other countermeasures. Physical simulation models like the ADCIRC hydrodynamics model, which run on high-performance computing resources, are sophisticated tools that produce increasingly accurate forecasts as the resolution of the computational meshes improves. However, a major drawback of these models is the significant time required to generate results at very high resolutions, which may not meet the near real-time demands of emergency responders. The presented work introduces HURRI-GAN, a novel AI-driven approach that augments the results produced by physical simulation models using time series generative adversarial networks (TimeGAN) to compensate for systemic errors of the physical model, thus reducing the necessary mesh size and runtime without loss in forecasting accuracy. We present first results in extrapolating model bias corrections for the spatial regions beyond the positions of the water level gauge stations. The presented results show that our methodology can accurately generate bias corrections at target locations spatially beyond gauge stations locations. The model's performance, as indicated by low root mean squared error (RMSE) values, highlights its capability to generate accurate extrapolated data. Applying the corrections generated by HURRI-GAN on the ADCIRC modeled water levels resulted in improving the overall prediction on the majority of the testing gauge stations.

cross Geodesic Gradient Descent: A Generic and Learning-rate-free Optimizer on Objective Function-induced Manifolds

Authors: Liwei Hu, Guangyao Li, Wenyong Wang, Xiaoming Zhang, Yu Xiang

Abstract: Euclidean gradient descent algorithms barely capture the geometry of objective function-induced hypersurfaces and risk driving update trajectories off the hypersurfaces. Riemannian gradient descent algorithms address these issues but fail to represent complex hypersurfaces via a single classic manifold. We propose geodesic gradient descent (GGD), a generic and learning-rate-free Riemannian gradient descent algorithm. At each iteration, GGD uses an n-dimensional sphere to approximate a local neighborhood on the objective function-induced hypersurface, adapting to arbitrarily complex geometries. A tangent vector derived from the Euclidean gradient is projected onto the sphere to form a geodesic, ensuring the update trajectory stays on the hypersurface. Parameter updates are performed using the endpoint of the geodesic. The maximum step size of the gradient in GGD is equal to a quarter of the arc length on the n-dimensional sphere, thus eliminating the need for a learning rate. Experimental results show that compared with the classic Adam algorithm, GGD achieves test MSE reductions ranging from 35.79% to 48.76% for fully connected networks on the Burgers' dataset, and cross-entropy loss reductions ranging from 3.14% to 11.59% for convolutional neural networks on the MNIST dataset.

cross PaLMR: Towards Faithful Visual Reasoning via Multimodal Process Alignment

Authors: Yantao Li, Qiang Hui, Chenyang Yan, Kanzhi Cheng, Fang Zhao, Chao Tan, Huanling Gao, Jianbing Zhang, Kai Wang, Xinyu Dai, Shiguo Lian

Abstract: Reinforcement learning has recently improved the reasoning ability of Large Language Models and Multimodal LLMs, yet prevailing reward designs emphasise final-answer correctness and consequently tolerate process hallucinations--cases where models reach the right answer while misperceiving visual evidence. We address this process-level misalignment with PaLMR, a framework that aligns not only outcomes but also the reasoning process itself. PaLMR comprises two complementary components: a perception-aligned data layer that constructs process-aware reasoning data with structured pseudo-ground-truths and verifiable visual facts, and a process-aligned optimisation layer that constructs a hierarchical reward fusion scheme with a process-aware scoring function to encourage visually faithful chains-of-thought and improve training stability. Experiments on Qwen2.5-VL-7B show that our approach substantially reduces reasoning hallucinations and improves visual reasoning fidelity, achieving state-of-the-art results on HallusionBench while maintaining strong performance on MMMU, MathVista, and MathVerse. These findings indicate that PaLMR offers a principled and practical route to process-aligned multimodal reasoning, advancing the reliability and interpretability of MLLMs.

cross A Parameter-efficient Convolutional Approach for Weed Detection in Multispectral Aerial Imagery

Authors: Leo Thomas Ramos, Angel D. Sappa

Abstract: We introduce FCBNet, an efficient model designed for weed segmentation. The architecture is based on a fully frozen ConvNeXt backbone, the proposed Feature Correction Block (FCB), which leverages efficient convolutions for feature refinement, and a lightweight decoder. FCBNet is evaluated on the WeedBananaCOD and WeedMap datasets under both RGB and multispectral modalities, showing that FCBNet outperforms models such as U-Net, DeepLabV3+, SK-U-Net, SegFormer, and WeedSense in terms of mIoU, exceeding 85%, while also achieving superior computational efficiency, requiring only 0.06 to 0.2 hours for training. Furthermore, the frozen backbone strategy reduces the number of trainable parameters by more than 90%, significantly lowering memory requirements.

cross GameVerse: Can Vision-Language Models Learn from Video-based Reflection?

Authors: Kuan Zhang, Dongchen Liu, Qiyue Zhao, Jinkun Hou, Xinran Zhang, Qinlei Xie, Miao Liu, Yiming Li

Abstract: Human gameplay is a visually grounded interaction loop in which players act, reflect on failures, and watch tutorials to refine strategies. Can Vision-Language Models (VLMs) also learn from video-based reflection? We present GameVerse, a comprehensive video game benchmark that enables a reflective visual interaction loop. Moving beyond traditional fire-and-forget evaluations, it uses a novel reflect-and-retry paradigm to assess how VLMs internalize visual experience and improve policies. To facilitate systematic and scalable evaluation, we also introduce a cognitive hierarchical taxonomy spanning 15 globally popular games, dual action space for both semantic and GUI control, and milestone evaluation using advanced VLMs to quantify progress. Our experiments show that VLMs benefit from video-based reflection in varied settings, and perform best by combining failure trajectories and expert tutorials-a training-free analogue to reinforcement learning (RL) plus supervised fine-tuning (SFT).Our project page is available at https://gameverse-bench.github.io/ . Our code is available at https://github.com/THUSI-Lab/GameVerse .

URLs: https://gameverse-bench.github.io/, https://github.com/THUSI-Lab/GameVerse

cross Science Literacy: Generative AI as Enabler of Coherence in the Teaching, Learning, and Assessment of Scientific Knowledge and Reasoning

Authors: Xiaoming Zhai, James W. Pellegrino, Matias Rojas, Jongchan Park, Matthew Nyaaba, Clayton Cohn, Gautam Biswas

Abstract: This chapter examines the potential of generative AI in enhancing science literacy across the K-16+ grade span, including its benefits as well as the conceptual and practical challenges that doing so presents. It begins with a discussion of what defines science literacy in the era of AI, including how AI has changed science and the demand for future citizens to be scientifically literate when AI is applied in their careers and lives. The chapter further discusses why science literacy presents such a challenge in K-16+ educational settings. It then develops an argument for the type of architecture needed for AI to assist in solving the problem by bringing coherence to the teaching, learning, and assessment of science knowledge and reasoning. Components of this architecture are illustrated with respect to the AI tools and capabilities needed for design and implementation. The chapter concludes with a consideration of what has been learned regarding both science literacy and AI, as well as what remains to be learned, including the research and development (R&D) needed, and the generalizability of this science literacy case to other disciplinary learning and knowledge domains.

cross Graph-of-Mark: Promote Spatial Reasoning in Multimodal Language Models with Graph-Based Visual Prompting

Authors: Giacomo Frisoni, Lorenzo Molfetta, Mattia Buzzoni, Gianluca Moro

Abstract: Recent advances in training-free visual prompting, such as Set-of-Mark, have emerged as a promising direction for enhancing the grounding capabilities of multimodal language models (MLMs). These techniques operate by partitioning the input image into object regions and annotating them with marks, predominantly boxes with numeric identifiers, before feeding the augmented image to the MLM. However, these approaches treat marked objects as isolated entities, failing to capture the relationships between them. On these premises, we propose Graph-of-Mark (GoM), the first pixel-level visual prompting technique that overlays scene graphs onto the input image for spatial reasoning tasks. We evaluate GoM across 3 open-source MLMs and 4 different datasets, conducting extensive ablations on drawn components and investigating the impact of auxiliary graph descriptions in the text prompt. Our results demonstrate that GoM consistently improves the zero-shot capability of MLMs in interpreting object positions and relative directions, improving base accuracy in visual question answering and localization up to 11 percentage points.

cross Accelerating Video Generation Inference with Sequential-Parallel 3D Positional Encoding Using a Global Time Index

Authors: Chao Yuan, Pan Li

Abstract: Diffusion Transformer (DiT)-based video generation models inherently suffer from bottlenecks in long video synthesis and real-time inference, which can be attributed to the use of full spatiotemporal attention. Specifically, this mechanism leads to explosive O(N^2) memory consumption and high first-frame latency. To address these issues, we implement system-level inference optimizations for a causal autoregressive video generation pipeline. We adapt the Self-Forcing causal autoregressive framework to sequence parallel inference and implement a sequence-parallel variant of the causal rotary position embedding which we refer to as Causal-RoPE SP. This adaptation enables localized computation and reduces cross-rank communication in sequence parallel execution. In addition, computation and communication pipelines are optimized through operator fusion and RoPE precomputation. Experiments conducted on an eight GPU A800 cluster show that the optimized system achieves comparable generation quality, sub-second first-frame latency, and near real-time inference speed. For generating five second 480P videos, a 1.58x speedup is achieved, thereby providing effective support for real-time interactive applications.

cross Better Eyes, Better Thoughts: Why Vision Chain-of-Thought Fails in Medicine

Authors: Yuan Wu, Zongxian Yang, Jiayu Qian, Songpan Gao, Guanxing Chen, Qiankun Li, Yu-An Huang, Zhi-An Huang

Abstract: Large vision-language models (VLMs) often benefit from chain-of-thought (CoT) prompting in general domains, yet its efficacy in medical vision-language tasks remains underexplored. We report a counter-intuitive trend: on medical visual question answering, CoT frequently underperforms direct answering (DirA) across general-purpose and medical-specific models. We attribute this to a \emph{medical perception bottleneck}: subtle, domain-specific cues can weaken visual grounding, and CoT may compound early perceptual uncertainty rather than correct it. To probe this hypothesis, we introduce two training-free, inference-time grounding interventions: (i) \emph{perception anchoring} via region-of-interest cues and (ii) \emph{description grounding} via high-quality textual guidance. Across multiple benchmarks and model families, these interventions improve accuracy, mitigate CoT degradation, and in several settings reverse the CoT--DirA inversion. Our findings suggest that reliable clinical VLMs require robust visual grounding and cross-modal alignment, beyond extending text-driven reasoning chains. Code is available \href{https://github.com/TianYin123/Better_Eyes_Better_Thoughts}{here}.

URLs: https://github.com/TianYin123/Better_Eyes_Better_Thoughts

cross Hybrid Orchestration of Edge AI and Microservices via Graph-based Self-Imitation Learning

Authors: Chen Yang, Jin Zheng, Yang Zhuolin, Lai Pan, Zhang Xiao, Hu Menglan, Yin Haiyan

Abstract: Modern edge AI applications increasingly rely on microservice architectures that integrate both AI services and conventional microservices into complex request chains with stringent latency requirements. Effectively orchestrating these heterogeneous services is crucial for ensuring low-latency performance, yet remains challenging due to their diverse resource demands and strong operational interdependencies under resource-constrained edge environments. In particular, frequent interactions between services tightly couple deployment and routing decisions, yet existing approaches optimize them in isolation, leading to fundamentally inadequate system performance.In this paper, we propose SIL-GPO, a reinforcement learning framework that optimizes hybrid orchestration for edge AI microservice systems. SIL-GPO formulates the orchestration problem as a sequential decision-making task and leverages graph attention networks to encode service topologies and routing dependencies within the agent state representation. Moreover, SIL-GPO integrates a self-imitation learning strategy into proximal policy optimization, enabling the agent to prioritize and reuse high-reward trajectories. This guides policy updates towards globally promising solutions that standard RL often fails to discover under sparse rewards and large combinatorial action spaces. We conduct extensive experiments on trace-driven edge AI workloads, demonstrating that SIL-GPO significantly reduces end-to-end service latency and enhances resource utilization compared to state-of-the-art heuristic, metaheuristic, and deep RL baselines. Our framework offers a unified and scalable solution for efficient orchestration of AI services and microservices in the edge, paving the way for low-latency, high-performance edge AI deployments.

cross calibfusion: Transformer-Based Differentiable Calibration for Radar-Camera Fusion Detection in Water-Surface Environments

Authors: Yuting Wan, Liguo Sun, Jiuwu Hao, Pin LV

Abstract: Millimeter-wave (mmWave) Radar--Camera fusion improves perception under adverse illumination and weather, but its performance is sensitive to Radar--Camera extrinsic calibration: residual misalignment biases Radar-to-image projection and degrades cross-modal aggregation for downstream 2D detection. Existing calibration and auto-calibration methods are mainly developed for road and urban scenes with abundant structures and object constraints, whereas water-surface environments feature large textureless regions, sparse and intermittent targets, and wave-/specular-induced Radar clutter, which weakens explicit object-centric matching. We propose CalibFusion, a calibration-conditioned Radar--Camera fusion detector that learns implicit extrinsic refinement end-to-end with the detection objective. CalibFusion builds a multi-frame persistence-aware Radar density representation with intensity weighting and Doppler-guided suppression of fast-varying clutter. A cross-modal transformer interaction module predicts a confidence-gated refinement of the initial extrinsics, which is integrated through a differentiable projection-and-splatting operator to generate calibration-conditioned image-plane Radar features. Experiments on WaterScenes and FLOW show improved fusion-based 2D detection and robustness under synthetic miscalibration, supported by sensitivity analyses and qualitative Radar-to-image overlays. Results on nuScenes indicate that the refinement mechanism transfers beyond water-surface scenarios.

cross ERP-RiskBench: Leakage-Safe Ensemble Learning for Financial Risk

Authors: Sanjay Mishra

Abstract: Financial risk detection in Enterprise Resource Planning (ERP) systems is an important but underexplored application of machine learning. Published studies in this area tend to suffer from vague dataset descriptions, leakage-prone pipelines, and evaluation practices that inflate reported performance. This paper presents a rebuilt experimental framework for ERP financial risk detection using ensemble machine learning. The risk definition is hybrid, covering both procurement compliance anomalies and transactional fraud. A composite benchmark called ERP-RiskBench is assembled from public procurement event logs, labeled fraud data, and a new synthetic ERP dataset with rule-injected risk typologies and conditional tabular GAN augmentation. Nested cross-validation with time-aware and group-aware splitting enforces leakage prevention throughout the pipeline. The primary model is a stacking ensemble of gradient boosting methods, tested alongside linear baselines, deep tabular architectures, and an interpretable glassbox alternative. Performance is measured through Matthews Correlation Coefficient, area under the precision-recall curve, and cost-sensitive decision analysis using calibrated probabilities. Across multiple dataset configurations and a structured ablation study, the stacking ensemble achieves the best detection results. Leakage-safe protocols reduce previously inflated accuracy estimates by a notable margin. SHAP-based explanations and feature stability analysis show that procurement control features, especially three-way matching discrepancies, rank as the most informative predictors. The resulting framework provides a reproducible, operationally grounded blueprint for machine learning deployment in ERP audit and governance settings.

cross Does Semantic Noise Initialization Transfer from Images to Videos? A Paired Diagnostic Study

Authors: Yixiao Jing, Chaoyu Zhang, Zixuan Zhong, Peizhou Huang

Abstract: Semantic noise initialization has been reported to improve robustness and controllability in image diffusion models. Whether these gains transfer to text-to-video (T2V) generation remains unclear, since temporal coupling can introduce extra degrees of freedom and instability. We benchmark semantic noise initialization against standard Gaussian noise using a frozen VideoCrafter-style T2V diffusion backbone and VBench on 100 prompts. Using prompt-level paired tests with bootstrap confidence intervals and a sign-flip permutation test, we observe a small positive trend on temporal-related dimensions; however, the 95 percent confidence interval includes zero (p ~ 0.17) and the overall score remains on par with the baseline. To understand this outcome, we analyze the induced perturbations in noise space and find patterns consistent with weak or unstable signal. We recommend prompt-level paired evaluation and noise-space diagnostics as standard practice when studying initialization schemes for T2V diffusion.

cross AutoFigure-Edit: Generating Editable Scientific Illustration

Authors: Zhen Lin, Qiujie Xie, Minjun Zhu, Shichen Li, Qiyao Sun, Enhao Gu, Yiran Ding, Ke Sun, Fang Guo, Panzhong Lu, Zhiyuan Ning, Yixuan Weng, Yue Zhang

Abstract: High-quality scientific illustrations are essential for communicating complex scientific and technical concepts, yet existing automated systems remain limited in editability, stylistic controllability, and efficiency. We present AutoFigure-Edit, an end-to-end system that generates fully editable scientific illustrations from long-form scientific text while enabling flexible style adaptation through user-provided reference images. By combining long-context understanding, reference-guided styling, and native SVG editing, it enables efficient creation and refinement of high-quality scientific illustrations. To facilitate further progress in this field, we release the video at https://youtu.be/10IH8SyJjAQ, full codebase at https://github.com/ResearAI/AutoFigure-Edit and provide a website for easy access and interactive use at https://deepscientist.cc/.

URLs: https://youtu.be/10IH8SyJjAQ,, https://github.com/ResearAI/AutoFigure-Edit, https://deepscientist.cc/.

cross XAI and Few-shot-based Hybrid Classification Model for Plant Leaf Disease Prognosis

Authors: Diana Susan Joseph, Pranav M Pawar, Raja Muthalagu, Mithun Mukharjee

Abstract: Performing a timely and accurate identification of crop diseases is vital to maintain agricultural productivity and food security. The current work presents a hybrid few-shot learning model that integrates Explainable Artificial Intelligence (XAI) and Few-Shot Learning (FSL) to address the challenge of identifying and classifying the stages of disease of the diseases of maize, rice, and wheat leaves under limited annotated data conditions. The proposed model integrates Siamese and Prototypical Networks within an episodic training paradigm to effectively learn discriminative disease features from a few examples. To ensure model transparency and trustworthiness, Gradient-weighted Class Activation Mapping (Grad-CAM) is employed for visualizing key decision regions in the leaf images, offering interpretable insights into the classification process. Experimental evaluations on custom few-shot datasets developed in the study prove that the model consistently achieves high accuracy, precision, recall, and F1-scores, frequently exceeding 92% across various disease stages. Comparative analyses against baseline FSL models further confirm the superior performance and explainability of the proposed approach. The framework offers a promising solution for real-world, data-constrained agricultural disease monitoring applications.

cross Chart Deep Research in LVLMs via Parallel Relative Policy Optimization

Authors: Jiajin Tang, Gaoyang, Wenjie Wang, Sibei Yang, Xing Chen

Abstract: With the rapid advancement of data science, charts have evolved from simple numerical presentation tools to essential instruments for insight discovery and decision-making support. However, current chart data intelligence exhibits significant limitations in deep research capabilities, with existing methods predominantly addressing shallow tasks such as visual recognition or factual question-answering, rather than the complex reasoning and high-level data analysis that deep research requires. This limitation stems from two primary technical bottlenecks: at the training level, existing post-training techniques exhibit deficiencies in handling multi-dimensional reward signal interference and heterogeneous data gradient conflicts, preventing models from achieving balanced development across multiple capability dimensions; at the evaluation level, current methods remain limited to factual retrieval and basic computation, failing to assess end-to-end analytic reasoning and other deep research capabilities. To address the training challenge, we propose PRPO, which performs parallel optimization across reward dimensions and capability partitioning across data types, effectively disentangling conflicts between heterogeneous data and multi-dimensional reward signals while ensuring optimization stability. For the evaluation challenge, we construct MCDR-Bench based on the ``error uniqueness principle," transforming subjective generation assessment into objective error identification through controllable error injection, enabling quantifiable evaluation of deep research capabilities. Experimental validation confirms that the proposed PRPO and MCDR-Bench jointly establish a unified framework that systematically advances chart deep research through enhanced collaborative training and objective evaluation.

cross VB: Visibility Benchmark for Visibility and Perspective Reasoning in Images

Authors: Neil Tripathi

Abstract: We present VB, a benchmark that tests whether vision-language models can determine what is and is not visible in a photograph, and abstain when a human viewer cannot reliably answer. Each item pairs a single photo with a short yes/no visibility claim; the model must output VISIBLY_TRUE, VISIBLY_FALSE, or ABSTAIN, together with a confidence score. Items are organized into 100 families using a 2x2 design that crosses a minimal image edit with a minimal text edit, yielding 300 headline evaluation cells. Unlike prior unanswerable-VQA benchmarks, VB tests not only whether a question is unanswerable but why (via reason codes tied to specific visibility factors), and uses controlled minimal edits to verify that model judgments change when and only when the underlying evidence changes. We score models on confidence-aware accuracy with abstention (CAA), minimal-edit flip rate (MEFR), confidence-ranked selective prediction (SelRank), and second-order perspective reasoning (ToMAcc); all headline numbers are computed on the strict XOR subset (three cells per family, 300 scored items per model). We evaluate nine models spanning flagship and prior-generation closed-source systems, and open-source models from 8B to 12B parameters. GPT-4o and Gemini 3.1 Pro effectively tie for the best composite score (0.728 and 0.727), followed by Gemini 2.5 Pro (0.678). The best open-source model, Gemma 3 12B (0.505), surpasses one prior-generation closed-source system. Text-flip robustness exceeds image-flip robustness for six of nine models, and confidence calibration varies substantially: GPT-4o and Gemini 2.5 Pro achieve similar accuracy yet differ sharply in selective prediction quality.

cross Narrative Weaver: Towards Controllable Long-Range Visual Consistency with Multi-Modal Conditioning

Authors: Zhengjian Yao, Yongzhi Li, Xinyuan Gao, Quan Chen, Peng Jiang, Yanye Lu

Abstract: We present "Narrative Weaver", a novel framework that addresses a fundamental challenge in generative AI: achieving multi-modal controllable, long-range, and consistent visual content generation. While existing models excel at generating high-fidelity short-form visual content, they struggle to maintain narrative coherence and visual consistency across extended sequences - a critical limitation for real-world applications such as filmmaking and e-commerce advertising. Narrative Weaver introduces the first holistic solution that seamlessly integrates three essential capabilities: fine-grained control, automatic narrative planning, and long-range coherence. Our architecture combines a Multimodal Large Language Model (MLLM) for high-level narrative planning with a novel fine-grained control module featuring a dynamic Memory Bank that prevents visual drift. To enable practical deployment, we develop a progressive, multi-stage training strategy that efficiently leverages existing pre-trained models, achieving state-of-the-art performance even with limited training data. Recognizing the absence of suitable evaluation benchmarks, we construct and release the E-commerce Advertising Video Storyboard Dataset (EAVSD) - the first comprehensive dataset for this task, containing over 330K high-quality images with rich narrative annotations. Through extensive experiments across three distinct scenarios (controllable multi-scene generation, autonomous storytelling, and e-commerce advertising), we demonstrate our method's superiority while opening new possibilities for AI-driven content creation.

cross Thinking with Gaze: Sequential Eye-Tracking as Visual Reasoning Supervision for Medical VLMs

Authors: Yiwei Li, Zihao Wu, Yanjun Lv, Hanqi Jiang, Weihang You, Zhengliang Liu, Dajiang Zhu, Xiang Li, Quanzheng Li, Tianming Liu, Lin Zhao

Abstract: Vision--language models (VLMs) process images as visual tokens, yet their intermediate reasoning is often carried out in text, which can be suboptimal for visually grounded radiology tasks. Radiologists instead diagnose via sequential visual search; eye-tracking captures this process as time-ordered gaze trajectories that reveal how evidence is acquired over time. We use eye-gaze as supervision to guide VLM reasoning by introducing a small set of dedicated gaze tokens. These tokens are trained to predict gaze-selected image patch indices in temporal order, encouraging the model to follow human-like evidence acquisition and integration. Experiments on MIMIC-EYE and multiple external zero-shot benchmarks show consistent gains over baselines, achieving state-of-the-art in-domain performance and improved out-of-domain robustness. These results highlight temporally ordered gaze as an effective supervision signal for learning visually grounded medical reasoning.

cross Mining Beyond the Bools: Learning Data Transformations and Temporal Specifications

Authors: Sam Nicholas Kouteili, William Fishell, Christian Scaff, Mark Santolucito, Ruzica Piskac

Abstract: Mining specifications from execution traces presents an automated way of capturing characteristic system behaviors. However, existing approaches are largely restricted to Boolean abstractions of events, limiting their ability to express data-aware properties. In this paper, we extend mining procedures to operate over richer datatypes. We first establish candidate functions in our domain that cover the set of traces by leveraging Syntax Guided Synthesis (SyGuS) techniques. To capture these function applications temporally, we formalize the semantics of TSL$_f$, a finite-prefix interpretation of Temporal Stream Logic (TSL) that extends LTL$_f$ with support for first-order predicates and functional updates. This allows us to unify a corresponding procedure for learning the data transformations and temporal specifications of a system. We demonstrate our approach synthesizing reactive programs from mined specifications on the OpenAI-Gymnasium ToyText environments, finding that our method is more robust and orders of magnitude more sample-efficient than passive learning baselines on generalized problem instances.

cross Scaling Agentic Capabilities, Not Context: Efficient Reinforcement Finetuning for Large Toolspaces

Authors: Karan Gupta, Pranav Vajreshwari, Yash Pandya, Raghav Magazine, Akshay Nambi, Ahmed Awadallah

Abstract: Agentic systems operating over large tool ecosystems must plan and execute long-horizon workflows under weak or non-verifiable supervision. While frontier models mitigate these challenges through scale and large context budgets, small language models (SLMs) remain brittle: eager tool loading saturates context, execution errors compound over time, and sparse rewards limit learning. We introduce ATLAS, a reinforcement finetuning framework that enables SLMs to operate effectively in large-scale toolspace environments by learning how to acquire context and how to execute actions. Our approach makes two key contributions. First, we treat context control and execution structure as learnable decisions, combining iterative tool loading with programmatic tool orchestration to bound context growth and stabilize long-horizon trajectories. Second, we propose rubric-based reinforcement finetuning, which decomposes task success into structured, task-aligned criteria and enables scalable training using small judge models. Across MCP benchmarks, these design choices yield large and consistent gains over generic RL baselines, allowing a 4B SLM to approach frontier-agent performance under far tighter parameter and context budgets.

cross Dynamic Targeting of Satellite Observations Using Supplemental Geostationary Satellite Data and Hierarchical Planning

Authors: Akseli Kangaslahti, Itai Zilberstein, Alberto Candela, Steve Chien

Abstract: The Dynamic Targeting (DT) mission concept is an approach to satellite observation in which a lookahead sensor gathers information about the upcoming environment and uses this information to intelligently plan observations. Previous work has shown that DT has the potential to increase the science return across applications. However, DT mission concepts must address challenges, such as the limited spatial extent of onboard lookahead data and instrument mobility, data throughput, and onboard computation constraints. In this work, we show how the performance of DT systems can be improved by using supplementary data streamed from geostationary satellites that provide lookahead information up to 35 minutes ahead of time rather than the 1 minute latency from an onboard lookahead sensor. While there is a greater volume of geostationary data, the search space for observation planning explodes exponentially with the size of the horizon. To address this, we introduce a hierarchical planning approach in which the geostationary data is used to plan a long-term observation blueprint in polynomial time, then the onboard lookahead data is leveraged to refine that plan over short-term horizons. We compare the performance of our approach to that of traditional DT planners relying on onboard lookahead data across four different problem instances: three cloud avoidance variations and a storm hunting scenario. We show that our hierarchical planner outperforms the traditional DT planners by up to 41% and examine the features of the scenarios that affect the performance of our approach. We demonstrate that incorporating geostationary satellite data is most effective for dynamic problem instances in which the targets of interest are sparsely distributed throughout the overflight.

cross ProtAlign: Contrastive learning paradigm for Sequence and structure alignment

Authors: Aditya Ranganath, Hasin Us Sami, Kowshik Thopalli, Bhavya Kailkhura, Wesam Sakla

Abstract: Protein language models often take into consideration the alignment between a protein sequence and its textual description. However, they do not take structural information into consideration. Traditional methods treat sequence and structure separately, limiting the ability to exploit the alignment between the structure and protein sequence embeddings. In this paper, we introduce a sequence structure contrastive alignment framework, which learns a shared embedding space where proteins are represented consistently across modalities. By training on large-scale pairs of sequences and experimentally resolved or predicted structures, the model maximizes agreement between matched sequence structure pairs while pushing apart unrelated pairs. This alignment enables cross-modal retrieval (e.g., finding structural neighbors given a sequence), improves downstream prediction tasks such as function annotation and stability estimation, and provides interpretable links between sequence variation and structural organization. Our results demonstrate that contrastive learning can serve as a powerful bridge between protein sequences and structures, offering a unified representation for understanding and engineering proteins.

cross UWPD: A General Paradigm for Invisible Watermark Detection Agnostic to Embedding Algorithms

Authors: Xiang Ao, Yiling Du, Zidan Wang, Mengru Chen

Abstract: Invisible watermarks, as an essential technology for image copyright protection, have been widely deployed with the rapid development of social media and AIGC. However, existing invisible watermark detection heavily relies on prior knowledge of specific algorithms, leading to limited detection capabilities for "unknown watermarks" in open environments. To this end, we propose a novel task named Universal Watermark Presence Detection (UWPD), which aims to identify whether an image carries a copyright mark without requiring decoding information. We construct the UniFreq-100K dataset, comprising large-scale samples across various invisible watermark embedding algorithms. Furthermore, we propose the Frequency Shield Network (FSNet). This model deploys an Adaptive Spectral Perception Module (ASPM) in the shallow layers, utilizing learnable frequency gating to dynamically amplify high-frequency watermark signals while suppressing low-frequency semantics. In the deep layers, the network introduces Dynamic Multi-Spectral Attention (DMSA) combined with tri-stream extremum pooling to deeply mine watermark energy anomalies, forcing the model to precisely focus on sensitive frequency bands. Extensive experiments demonstrate that FSNet exhibits superior zero-shot detection capabilities on the UWPD task, outperforming existing baseline models. Code and datasets will be released upon acceptance.

cross Bi Directional Feedback Fusion for Activity Aware Forecasting of Indoor CO2 and PM2.5

Authors: Harshala Gammulle, Lidia Morawska, Sridha Sridharan, Clinton Fookes

Abstract: Indoor air quality (IAQ) forecasting plays a critical role in safeguarding occupant health, ensuring thermal comfort, and supporting intelligent building control. However, predicting future concentrations of key pollutants such as carbon dioxide (CO2) and fine particulate matter (PM2.5) remains challenging due to the complex interplay between environmental factors and highly dynamic occupant behaviours. Traditional data driven models primarily rely on historical sensor trajectories and often fail to anticipate behaviour induced emission spikes or rapid concentration shifts. To address these limitations, we present a dual stream bi directional feedback fusion framework that jointly models indoor environmental evolution and action derived embeddings representing human activities. The proposed architecture integrates a context aware modulation mechanism that adaptively scales and shifts each stream based on a shared, evolving fusion state, enabling the model to selectively emphasise behavioural cues or long term environmental trends. Furthermore, we introduce dual timescale temporal modules that independently capture gradual CO2 accumulation patterns and short term PM2.5 fluctuations. A composite loss function combining weighted mean squared error, spike aware penalties, and uncertainty regularisation facilitates robust learning under volatile indoor conditions. Extensive validation on real-world IAQ datasets demonstrates that our approach significantly outperforms state of the art forecasting baselines while providing interpretable uncertainty estimates essential for practical deployment in smart buildings and health-aware monitoring systems.

cross Regression Models Meet Foundation Models: A Hybrid-AI Approach to Practical Electricity Price Forecasting

Authors: Yunzhong Qiu, Binzhu Li, Hao Wei, Shenglin Weng, Chen Wang, Zhongyi Pei, Mingsheng Long, Jianmin Wang

Abstract: Electricity market prices exhibit extreme volatility, nonlinearity, and non-stationarity, making accurate forecasting a significant challenge. While cutting-edge time series foundation models (TSFMs) effectively capture temporal dependencies, they typically underutilize cross-variate correlations and non-periodic patterns that are essential for price forecasting. Conversely, regression models excel at capturing feature interactions but are limited to future-available inputs, ignoring crucial historical drivers that are unavailable at forecast time. To bridge this gap, we propose FutureBoosting, a novel paradigm that enhances regression-based forecasts by integrating forecasted features generated from a frozen TSFM. This approach leverages the TSFM's ability to model historical patterns and injects these insights as enriched inputs into a downstream regression model. We instantiate this paradigm into a lightweight, plug-and-play framework for electricity price forecasting. Extensive evaluations on real-world electricity market data demonstrate that our framework consistently outperforms state-of-the-art TSFMs and regression baselines, achieving reductions in Mean Absolute Error (MAE) of more than 30% at most. Through ablation studies and explainable AI (XAI) techniques, we validate the contribution of forecasted features and elucidate the model's decision-making process. FutureBoosting establishes a robust, interpretable, and effective solution for practical market participation, offering a general framework for enhancing regression models with temporal context.

cross Safe Transformer: An Explicit Safety Bit For Interpretable And Controllable Alignment

Authors: Jingyuan Feng, Andrew Gambardella, Gouki Minegishi, Takeshi Kojima, Yusuke Iwasawa, Yutaka Matsuo

Abstract: Current safety alignment methods encode safe behavior implicitly within model parameters, creating a fundamental opacity: we cannot easily inspect why a model refuses a request, nor intervene when its safety judgments fail. We propose Safe Transformer, a modular approach that augments pre-trained language models by inserting a discrete information bottleneck containing an explicit safety bit between transformer layers. The safety bit serves as both an interpretable signal of the model's safety classification and a controllable switch: through contrastive training, the model learns disentangled representations where the safety bit governs the behavioral mode - producing helpful responses when $s=1$ and refusals when $s=0$ - while additional unsupervised bits $u$ encode semantic content for generation. Additional unsupervised bits in the information bottleneck allow semantic information to flow through, preserving the model's generation capabilities. This design achieves both interpretability (the safety decision is directly readable) and controllability (the safety bit can be manually overridden), requiring only lightweight fine-tuning without pre-training from scratch. In red-team benchmarks, Safe Transformer achieves near-zero Attack Success Rate, substantially outperforming base models and safety fine-tuning baselines.

cross Don't Freeze, Don't Crash: Extending the Safe Operating Range of Neural Navigation in Dense Crowds

Authors: Jiefu Zhang, Yang Xu, Vaneet Aggarwal

Abstract: Navigating safely through dense crowds requires collision avoidance that generalizes beyond the densities seen during training. Learning-based crowd navigation can break under out-of-distribution crowd sizes due to density-sensitive observation normalization and social-cost scaling, while analytical solvers often remain safe but freeze in tight interactions. We propose a reinforcement learning approach for dense, variable-density navigation that attains zero-shot density generalization using a density-invariant observation encoding with density-randomized training and physics-informed proxemic reward shaping with density-adaptive scaling. The encoding represents the distance-sorted $K$ nearest pedestrians plus bounded crowd summaries, keeping input statistics stable as crowd size grows. Trained with $N\!\in\![11,16]$ pedestrians in a $3\mathrm{m}\times3\mathrm{m}$ arena and evaluated up to $N\!=\!21$ pedestrians ($1.3\times$ denser), our policy reaches the goal in $>99\%$ of episodes and achieves $86\%$ collision-free success in random crowds, with markedly less freezing than analytical methods and a $>\!60$-point collision-free margin over learning-based benchmark methods. Codes are available at \href{https://github.com/jznmsl/PSS-Social}{https://github.com/jznmsl/PSS-Social}.

URLs: https://github.com/jznmsl/PSS-Social, https://github.com/jznmsl/PSS-Social

cross Calibrated Credit Intelligence: Shift-Robust and Fair Risk Scoring with Bayesian Uncertainty and Gradient Boosting

Authors: Srikumar Nayak

Abstract: Credit risk scoring must support high-stakes lending decisions where data distributions change over time, probability estimates must be reliable, and group-level fairness is required. While modern machine learning models improve default prediction accuracy, they often produce poorly calibrated scores under distribution shift and may create unfair outcomes when trained without explicit constraints. This paper proposes Calibrated Credit Intelligence (CCI), a deployment-oriented framework that combines (i) a Bayesian neural risk scorer to capture epistemic uncertainty and reduce overconfident errors, (ii) a fairnessconstrained gradient boosting model to control group disparities while preserving strong tabular performance, and (iii) a shiftaware fusion strategy followed by post-hoc probability calibration to stabilize decision thresholds in later time periods. We evaluate CCI on the Home Credit Credit Risk Model Stability benchmark using a time-consistent split to reflect real-world drift. Compared with strong baselines (LightGBM, XGBoost, CatBoost, TabNet, and a standalone Bayesian neural model), CCI achieves the best overall trade-off between discrimination, calibration, stability, and fairness. In particular, CCI reaches an AUC-ROC of 0.912 and an AUC-PR of 0.438, improves operational performance with Recall@1%FPR = 0.509, and reduces calibration error (Brier score 0.087, ECE 0.015). Under temporal shift, CCI shows a smaller AUC-PR drop from early to late periods (0.017), and it lowers group disparities (demographic parity gap 0.046, equal opportunity gap 0.037) compared to unconstrained boosting. These results indicate that CCI produces risk scores that are accurate, reliable, and more equitable under realistic deployment conditions.

cross Agent Hunt: Bounty Based Collaborative Autoformalization With LLM Agents

Authors: Chad E. Brown, Cezary Kaliszyk, Josef Urban

Abstract: We describe an experiment in large-scale autoformalization of algebraic topology in an Interactive Theorem Proving (ITP) environment, where the workload is distributed among multiple LLM-based coding agents. Rather than relying on static central planning, we implement a simulated bounty-based marketplace in which agents dynamically propose new lemmas (formal statements), attach bounties to them, and compete to discharge these proof obligations and claim the bounties. The agents interact directly with the interactive proof system: they can invoke tactics, inspect proof states and goals, analyze tactic successes and failures, and iteratively refine their proof scripts. In addition to constructing proofs, agents may introduce new formal definitions and intermediate lemmas to structure the development. All accepted proofs are ultimately checked and verified by the underlying proof assistant. This setting explores collaborative, decentralized proof search and theory building, and the use of market-inspired mechanisms to scale autoformalization in ITP.

cross Rank-Factorized Implicit Neural Bias: Scaling Super-Resolution Transformer with FlashAttention

Authors: Dongheon Lee, Seokju Yun, Jaegyun Im, Youngmin Ro

Abstract: Recent Super-Resolution~(SR) methods mainly adopt Transformers for their strong long-range modeling capability and exceptional representational capacity. However, most SR Transformers rely heavily on relative positional bias~(RPB), which prevents them from leveraging hardware-efficient attention kernels such as FlashAttention. This limitation imposes a prohibitive computational burden during both training and inference, severely restricting attempts to scale SR Transformers by enlarging the training patch size or the self-attention window. Consequently, unlike other domains that actively exploit the inherent scalability of Transformers, SR Transformers remain heavily focused on effectively utilizing limited receptive fields. In this paper, we propose Rank-factorized Implicit Neural Bias~(RIB), an alternative to RPB that enables FlashAttention in SR Transformers. Specifically, RIB approximates positional bias using low-rank implicit neural representations and concatenates them with pixel content tokens in a channel-wise manner, turning the element-wise bias addition in attention score computation into a dot-product operation. Further, we introduce a convolutional local attention and a cyclic window strategy to fully leverage the advantages of long-range interactions enabled by RIB and FlashAttention. We enlarge the window size up to \textbf{96$\times$96} while jointly scaling the training patch size and the dataset size, maximizing the benefits of Transformers in the SR task. As a result, our network achieves \textbf{35.63\,dB PSNR} on Urban100$\times$2, while reducing training and inference time by \textbf{2.1$\times$} and \textbf{2.9$\times$}, respectively, compared to the RPB-based SR Transformer~(PFT).

cross ResearchEnvBench: Benchmarking Agents on Environment Synthesis for Research Code Execution

Authors: Yubang Wang, Chenxi Zhang, Bowen Chen, Zezheng Huai, Zihao Dai, Xinchi Chen, Yuxin Wang, Yining Zheng, Jingjing Gong, Xipeng Qiu

Abstract: Autonomous agents are increasingly expected to support scientific research, and recent benchmarks report progress in code repair and autonomous experimentation. However, these evaluations typically assume a pre-configured execution environment, which requires resolving complex software dependencies, aligning hardware and framework versions, and configuring distributed execution, yet this capability remains largely unbenchmarked. We introduce ResearchEnvBench, a benchmark for environment synthesis in research code execution. Given a research repository, documentation, and a target execution setting, agents must construct an environment that successfully executes at runtime. Evaluations on diverse research repositories reveal a substantial gap in current SOTA agents, with failures dominated by incomplete dependency resolution and brittle version coupling. ResearchEnvBench provides a realistic testbed for advancing autonomous agents toward reproducible scientific research.

cross ViroGym: Realistic Large-Scale Benchmarks for Evaluating Viral Proteins

Authors: Yichen Zhou, Jonathan Golob, Amir Karimi, Stefan Bauer, Patrick Schwab

Abstract: Protein language models (pLMs) have shown strong potential in prediction of the functional effects of missense variants in zero-shot settings. Despite this progress, benchmarking pLMs for viral proteins remains limited and systematic strategies for integrating in silico metrics with in vitro validation to guide antigen and target selection are underdeveloped. Here, we introduce ViroGym, a comprehensive benchmark designed to evaluate variant effect prediction in viral proteins and to facilitate selecting rational antigen candidates. We curated 79 deep mutational scanning (DMS) assays encompassing eukaryotic viruses, collectively comprising 552,937 mutated amino acid sequences across 7 distinct phenotypic readouts, and 21 influenza virus neutralisation tasks and a real-world predictive task for SARS-CoV-2. We benchmark well-established pLMs on fitness landscapes, antigenic diversity, and pandemic forecasting to provide a framework for vaccine selection, and show that pLMs selected using in vitro experimental data excel at predicting dominant circulating mutations in real world.

cross Heterogeneous Decentralized Diffusion Models

Authors: Zhiying Jiang, Raihan Seraj, Marcos Villagra, Bidhan Roy

Abstract: Training frontier-scale diffusion models often requires substantial computational resources concentrated in tightly coupled clusters, limiting participation to well-resourced institutions. While Decentralized Diffusion Models (DDM) enable training multiple experts in isolation, existing approaches require 1176 GPU-days and homogeneous training objectives across all experts. We present an efficient framework that reduces resource requirements while supporting heterogeneous training objectives. Our approach combines three contributions: (1) a heterogeneous decentralized training paradigm that allows experts to use different objectives (DDPM and Flow Matching), unified at inference time via a deterministic schedule-aware conversion into a common velocity space without retraining; (2) pretrained checkpoint conversion from ImageNet-DDPM to Flow Matching objectives, accelerating convergence and enabling initialization without objective-specific pretraining; and (3) PixArt-alpha's efficient AdaLN-Single architecture, reducing parameters while maintaining quality. Experiments on LAION-Aesthetics show that, relative to the training scale reported for prior DDM work, our approach reduces compute from 1176 to 72 GPU-days (16x) and data from 158M to 11M (14x). Under aligned inference settings, our heterogeneous 2DDPM:6FM configuration achieves better FID (11.88 vs. 12.45) and higher intra-prompt diversity (LPIPS 0.631 vs. 0.617) than the homogeneous 8FM baseline. By eliminating synchronization requirements and enabling mixed DDPM/FM objectives, our framework lowers infrastructure requirements for decentralized generative model training.

cross Improved Constrained Generation by Bridging Pretrained Generative Models

Authors: Xiaoxuan Liang, Saeid Naderiparizi, Yunpeng Liu, Berend Zwartsenberg, Frank Wood

Abstract: Constrained generative modeling is fundamental to applications such as robotic control and autonomous driving, where models must respect physical laws and safety-critical constraints. In real-world settings, these constraints rarely take the form of simple linear inequalities, but instead complex feasible regions that resemble road maps or other structured spatial domains. We propose a constrained generation framework that generates samples directly within such feasible regions while preserving realism. Our method fine-tunes a pretrained generative model to enforce constraints while maintaining generative fidelity. Experimentally, our method exhibits characteristics distinct from existing fine-tuning and training-free constrained baselines, revealing a new compromise between constraint satisfaction and sampling quality.

cross Stabilizing Reinforcement Learning for Diffusion Language Models

Authors: Jianyuan Zhong, Kaibo Wang, Ding Ding, Zijin Feng, Haoli Bai, Yang Xiang, Jiacheng Sun, Qiang Xu

Abstract: Group Relative Policy Optimization (GRPO) is highly effective for post-training autoregressive (AR) language models, yet its direct application to diffusion large language models (dLLMs) often triggers reward collapse. We identify two sources of incompatibility. First, GRPO relies on importance ratios defined by sequence probabilities, which are intractable in dLLMs and must be estimated (e.g., via ELBO-based or mean-field likelihood proxies), yielding inherently noisy ratios. Second, standard GRPO's formulation is not designed for estimated ratios: its conditional clipping can be anomalously bypassed by model-agnostic estimation noise, producing gradient spikes, while its fixed group-size normalization amplifies gradient-magnitude fluctuations under high-variance ratio estimates. We show these effects form a self-reinforcing instability loop that drives policy drift and further increases ratio variance. To break this loop, we propose StableDRL, a reformulation of GRPO tailored for dLLMs that uses (i) unconditional clipping to suppress outlier-induced spikes and (ii) self-normalization to constrain updates within the convex hull of per-sample gradients. We further extend StableDRL to block-wise diffusion models via a staircase attention mechanism.

cross Enhancing Instruction Following of LLMs via Activation Steering with Dynamic Rejection

Authors: Minjae Kang, Jaehyung Kim

Abstract: Large Language Models (LLMs), despite advances in instruction tuning, often fail to follow complex user instructions. Activation steering techniques aim to mitigate this by manipulating model internals, but have a potential risk of oversteering, where excessive emphasis on the instruction degrades task accuracy and overall text quality. To address this, we introduce DIRECTER (Dynamic rejection steering), a novel steering method that dynamically modulates steering strength by scaling the KV cache without extra dataset. DIRECTER couples steering with a plausibility-guided decoding loop, which adaptively adjusts steering strength at each step by comparing the steered output distribution to the original. If the steered output is deemed implausible, steering strength is progressively weakened. This strength modulation is guided by a lightweight, one-time attention sensitivity analysis that ranks layers by their influence on model representations. Extensive evaluations show that DIRECTER significantly enhances instruction-following capabilities across diverse benchmarks, improving accuracy by up to 6.5% over baselines without the common trade-offs in generation quality or task fidelity. The proposed dynamic, plausibility-guided control during activation steering further demonstrates its potential as a general mechanism for mitigating oversteering that is compatible with existing baselines.

cross ButterflyViT: 354$\times$ Expert Compression for Edge Vision Transformers

Authors: Aryan Karmore

Abstract: Deploying sparse Mixture of Experts(MoE) Vision Transformers remains a challenge due to linear expert memory scaling. Linear memory scaling stores $N$ independent expert weight matrices requiring $\mathcal{O}(N_E \cdot d^2)$ memory, which exceeds edge devices memory budget. Current compression methods like quantization, pruning and low-rank factorization reduce constant factors but leave the scaling bottleneck unresolved. We introduce ButterflyViT, a method that treats experts not as independent weight matrices but as geometric reorientations of a unified shared quantized substrate. Diversity among experts arises from viewing different angles of shared capacity, not from redundant storage. By applying learned rotations to a shared ternary prototype, each expert yields $\mathcal{O}(d_{\text{model}} \cdot d_{\text{ff}} + N_E \cdot n_\ell \cdot d)$ memory which is sub-linear in the number of experts. To address the unique challenges of vision, a spatial smoothness regulariser is introduced that penalises routing irregularities between adjacent patch tokens, turning patch correlation into a training signal. Across image classification tasks on CIFAR-100, ButterflyViT achieves 354$\times$ memory reduction at 64 experts with negligible accuracy loss. ButterflyViT allows multiple experts to fit on edge-constrained devices showing that geometric parameterization breaks linear scaling.

cross Property-driven Protein Inverse Folding With Multi-Objective Preference Alignment

Authors: Xiaoyang Hou, Junqi Liu, Chence Shi, Xin Liu, Zhi Yang, Jian Tang

Abstract: Protein sequence design must balance designability, defined as the ability to recover a target backbone, with multiple, often competing, developability properties such as solubility, thermostability, and expression. Existing approaches address these properties through post hoc mutation, inference-time biasing, or retraining on property-specific subsets, yet they are target dependent and demand substantial domain expertise or careful hyperparameter tuning. In this paper, we introduce ProtAlign, a multi-objective preference alignment framework that fine-tunes pretrained inverse folding models to satisfy diverse developability objectives while preserving structural fidelity. ProtAlign employs a semi-online Direct Preference Optimization strategy with a flexible preference margin to mitigate conflicts among competing objectives and constructs preference pairs using in silico property predictors. Applied to the widely used ProteinMPNN backbone, the resulting model MoMPNN enhances developability without compromising designability across tasks including sequence design for CATH 4.3 crystal structures, de novo generated backbones, and real-world binder design scenarios, making it an appealing framework for practical protein sequence design.

cross Robotic Foundation Models for Industrial Control: A Comprehensive Survey and Readiness Assessment Framework

Authors: David Kube, Simon Hadwiger, Tobias Meisen

Abstract: Robotic foundation models (RFMs) are emerging as a promising route towards flexible, instruction- and demonstration-driven robot control, however, a critical investigation of their industrial applicability is still lacking. This survey gives an extensive overview over the RFM-landscape and analyses, driven by concrete implications, how industrial domains and use cases shape the requirements of RFMs, with particular focus on collaborative robot platforms, heterogeneous sensing and actuation, edge-computing constraints, and safety-critical operation. We synthesise industrial deployment perspectives into eleven interdependent implications and operationalise them into an assessment framework comprising a catalogue of 149 concrete criteria, spanning both model capabilities and ecosystem requirements. Using this framework, we evaluate 324 manipulation-capable RFMs via 48,276 criterion-level decisions obtained via a conservative LLM-assisted evaluation pipeline, validated against expert judgements. The results indicate that industrial maturity is limited and uneven: even the highest-rated models satisfy only a fraction of criteria and typically exhibit narrow implication-specific peaks rather than integrated coverage. We conclude that progress towards industry-grade RFMs depends less on isolated benchmark successes than on systematic incorporation of safety, real-time feasibility, robust perception, interaction, and cost-effective system integration into auditable deployment stacks.

cross XMACNet: An Explainable Lightweight Attention based CNN with Multi Modal Fusion for Chili Disease Classification

Authors: Tapon Kumer Ray, Rajkumar Y, Shalini R, Srigayathri K, Jayashree S, Lokeswari P

Abstract: Plant disease classification via imaging is a critical task in precision agriculture. We propose XMACNet, a novel light-weight Convolutional Neural Network (CNN) that integrates self-attention and multi-modal fusion of visible imagery and vegetation indices for chili disease detection. XMACNet uses an EfficientNetV2S backbone enhanced by a self-attention module and a fusion branch that processes both RGB images and computed vegetation index maps (NDVI, NPCI, MCARI). We curated a new dataset of 12,000 chili leaf images across six classes (five disease types plus healthy), augmented synthetically via StyleGAN to mitigate data scarcity. Trained on this dataset, XMACNet achieves high accuracy, F1-score, and AUC, outperforming baseline models such as ResNet-50, MobileNetV2, and a Swin Transformer variant. Crucially, XMACNet is explainable: we use Grad-CAM++ and SHAP to visualize and quantify the models focus on disease features. The models compact size and fast inference make it suitable for edge deployment in real-world farming scenarios.

cross Learning Unbiased Cluster Descriptors for Interpretable Imbalanced Concept Drift Detection

Authors: Yiqun Zhang, Zhanpei Huang, Mingjie Zhao, Chuyao Zhang, Yang Lu, Yuzhu Ji, Fangqing Gu, An Zeng

Abstract: Unlabeled streaming data are usually collected to describe dynamic systems, where concept drift detection is a vital prerequisite to understanding the evolution of systems. However, the drifting concepts are usually imbalanced in most real cases, which brings great challenges to drift detection. That is, the dominant statistics of large clusters can easily mask the drifting of small cluster distributions (also called small concepts), which is known as the `masking effect'. Considering that most existing approaches only detect the overall existence of drift under the assumption of balanced concepts, two critical problems arise: 1) where the small concept is, and 2) how to detect its drift. To address the challenging concept drift detection for imbalanced data, we propose Imbalanced Cluster Descriptor-based Drift Detection (ICD3) approach that is unbiased to the imbalanced concepts. This approach first detects imbalanced concepts by employing a newly designed multi-distribution-granular search, which ensures that the distribution of both small and large concepts is effectively captured. Subsequently, it trains a One-Cluster Classifier (OCC) for each identified concept to carefully monitor their potential drifts in the upcoming data chunks. Since the detection is independently performed for each concept, the dominance of large clusters is thus circumvented. ICD3 demonstrates highly interpretability by specifically locating the drifted concepts, and is robust to the changing of the imbalance ratio of concepts. Comprehensive experiments with multi-aspect ablation studies conducted on various benchmark datasets demonstrate the superiority of ICD3 against the state-of-the-art counterparts.

cross Enhancing SHAP Explainability for Diagnostic and Prognostic ML Models in Alzheimer Disease

Authors: Pablo Guill\'en, Enrique Frias-Martinez

Abstract: Alzheimer disease (AD) diagnosis and prognosis increasingly rely on machine learning (ML) models. Although these models provide good results, clinical adoption is limited by the need for technical expertise and the lack of trustworthy and consistent model explanations. SHAP (SHapley Additive exPlanations) is com-monly used to interpret AD models, but existing studies tend to focus on explanations for isolated tasks, providing little evidence about their robustness across disease stages, model architectures, or prediction objectives. This paper proposes a multi-level explainability framework that measures the coherence, stabil-ity and consistency of explanations by integrating: (1) within-model coherence metrics between feature importance and SHAP, (2) SHAP stability across AD boundaries, and (3) SHAP cross-task consistency be-tween diagnosis and prognosis. Using AutoML to optimize classifiers on the NACC dataset, we trained four diagnostic and four prognostic models covering the standard AD progression stages. Stability was then evaluated using correlation metrics, top-k feature overlap, SHAP sign consistency, and domain-level contribution ratios. Results show that cognitive and functional markers dominate SHAP explanations in both diagnosis and prognosis. SHAP-SHAP consistency between diagnostic and prognostic models was high across all classifiers, with 100% sign stability and minimal shifts in explanatory magnitude. Domain-level contributions also remained stable, with only minimal increases in genetic features for prognosis. These results demonstrate that SHAP explanations can be quantitatively vali-dated for robustness and transferability, providing clinicians with more reliable interpretations of ML pre-dictions.

cross Gradient-based Nested Co-Design of Aerodynamic Shape and Control for Winged Robots

Authors: Daniele Affinita, Mingda Xu, Beno\^it Valentin Gherardi, Pascal Fua

Abstract: Designing aerial robots for specialized tasks, from perching to payload delivery, requires tailoring their aerodynamic shape to specific mission requirements. For tasks involving wide flight envelopes, the usual sequential process of first determining the shape and then the motion planner is likely to be suboptimal due to the inherent nonlinear interactions between them. This limitation has been motivating co-design research, which involves jointly optimizing the aerodynamic shape and the motion planner. In this paper, we present a general-purpose, gradient-based, nested co-design framework where the motion planner solves an optimal control problem and the aerodynamic forces used in the dynamics model are determined by a neural surrogate model. This enables us to model complex subsonic flow conditions encountered in aerial robotics and to overcome the limited applicability of existing co-design methods. These limitations stem from the simplifying assumptions they require for computational tractability to either the planner or the aerodynamics. We validate our method on two complex dynamic tasks for fixed-wing gliders: perching and a short landing. Our optimized designs improve task performance compared to an evolutionary baseline in a fraction of the computation time.

cross Diversity-Aware Adaptive Collocation for Physics-Informed Neural Networks via Sparse QUBO Optimization and Hybrid Coresets

Authors: Hadi Salloum, Maximilian Mifsud Bonici, Sinan Ibrahim, Pavel Osinenko, Alexei Kornaev

Abstract: Physics-Informed Neural Networks (PINNs) enforce governing equations by penalizing PDE residuals at interior collocation points, but standard collocation strategies - uniform sampling and residual-based adaptive refinement - can oversample smooth regions, produce highly correlated point sets, and incur unnecessary training cost. We reinterpret collocation selection as a coreset construction problem: from a large candidate pool, select a fixed-size subset that is simultaneously informative (high expected impact on reducing PDE error) and diverse (low redundancy under a space-time similarity notion). We formulate this as a QUBO/BQM objective with linear terms encoding residual-based importance and quadratic terms discouraging redundant selections. To avoid the scalability issues of dense k-hot QUBOs, we propose a sparse graph-based BQM built on a kNN similarity graph and an efficient repair procedure that enforces an exact collocation budget. We further introduce hybrid coverage anchors to guarantee global PDE enforcement. We evaluate the method on the 1D time-dependent viscous Burgers equation with shock formation and report both accuracy and end-to-end time-to-accuracy, including a timing breakdown of selection overhead. Results demonstrate that sparse and hybrid formulations reduce selection overhead relative to dense QUBOs while matching or improving accuracy at fixed collocation budgets.

cross Failure Detection in Chemical Processes using Symbolic Machine Learning: A Case Study on Ethylene Oxidation

Authors: Julien Amblard, Niklas Groll, Matthew Tait, Mark Law, G\"urkan Sin, Alessandra Russo

Abstract: Over the past decade, Artificial Intelligence has significantly advanced, mostly driven by large-scale neural approaches. However, in the chemical process industry, where safety is critical, these methods are often unsuitable due to their brittleness, and lack of explainability and interpretability. Furthermore, open-source real-world datasets containing historical failures are scarce in this domain. In this paper, we investigate an approach for predicting failures in chemical processes using symbolic machine learning and conduct a feasibility study in the context of an ethylene oxidation process. Our method builds on a state-of-the-art symbolic machine learning system capable of learning predictive models in the form of probabilistic rules from context-dependent noisy examples. This system is a general-purpose symbolic learner, which makes our approach independent of any specific chemical process. To address the lack of real-world failure data, we conduct our feasibility study leveraging data generated from a chemical process simulator. Experimental results show that symbolic machine learning can outperform baseline methods such as random forest and multilayer perceptron, while preserving interpretability through the generation of compact, rule-based predictive models. Finally, we explain how such learned rule-based models could be integrated into agents to assist chemical plant operators in decision-making during potential failures.

cross HGT-Scheduler: Deep Reinforcement Learning for the Job Shop Scheduling Problem via Heterogeneous Graph Transformers

Authors: Bulent Soykan

Abstract: The Job Shop Scheduling Problem (JSSP) is commonly formulated as a disjunctive graph in which nodes represent operations and edges encode technological precedence constraints as well as machine-sharing conflicts. Most existing reinforcement learning approaches model this graph as homogeneous, merging job-precedence and machine-contention edges into a single relation type. Such a simplification overlooks the intrinsic heterogeneity of the problem structure and may lead to the loss of critical relational information. To address this limitation, we propose the Heterogeneous Graph Transformer (HGT)-Scheduler, a reinforcement learning framework that models the JSSP as a heterogeneous graph. The proposed architecture leverages a Heterogeneous Graph Transformer to capture type-specific relational patterns through edge-type-dependent attention mechanisms applied to precedence and contention relations. The scheduling policy is trained using Proximal Policy Optimization. The effectiveness of the proposed method is evaluated on the Fisher--Thompson benchmark instances. On the FT06 instance, the HGT-Scheduler achieves an optimality gap of 8.4\%, statistically outperforming both an identical architecture that ignores edge types ($p = 0.011$) and a standard Graph Isomorphism Network baseline. On the larger FT10 instance, the approach demonstrates favorable scalability. However, under a 50,000-step training limit, the performance of heterogeneous and homogeneous graph models is comparable, suggesting that edge-type awareness requires longer training horizons for larger problem instances. Ablation analyses further indicate that a three-layer attention architecture provides the best performance. Overall, the results confirm that explicitly modeling distinct edge semantics improves the learning of effective scheduling policies.

cross SpatialMAGIC: A Hybrid Framework Integrating Graph Diffusion and Spatial Attention for Spatial Transcriptomics Imputation

Authors: Sayeem Bin Zaman, Fahim Hafiz, Riasat Azim

Abstract: Spatial transcriptomics (ST) enables mapping gene expression with spatial context but is severely affected by high sparsity and technical noise, which conceals true biological signals and hinders downstream analyses. To address these challenges, SpatialMagic was proposed, which is a hybrid imputation model combining MAGIC-based graph diffusion with transformer-based spatial self-attention. The long-range dependencies in the gene expression are captured by graph diffusion, and local neighborhood structure is captured by spatial attention models, which allow for recovering the missing expression values, retaining spatial consistency. Across multiple platforms, SpatialMagic consistently outperforms existing baselines, including MAGIC and attention-based models, achieving peak Adjusted Rand Index (ARI) scores in clustering accuracy of 0.3301 on high-resolution Stereo-Seq data, 0.3074 on Slide-Seq, and 0.4216 on the Sci-Space dataset. Beyond quantitative improvements, SpatialMagic substantially enhances downstream biological analyses by improving the detection of both up- and down-regulated genes while maintaining regulatory consistency across datasets. The pathway enrichment analysis of the recovered genes indicates that they are involved in consistent processes across key metabolic, transport, and neural signaling pathways, suggesting that the framework improves data quality while preserving biological interpretability. Overall, SpatialMagic's hybrid diffusion attention strategy and refinement module outperform state-of-the-art baselines on quantitative metrics and provide a better understanding of the imputed data by preserving tissue architecture and uncovering biologically relevant genes. The source code and datasets are provided in the following link: https://github.com/sayeemzzaman/SpatialMAGIC

URLs: https://github.com/sayeemzzaman/SpatialMAGIC

cross xaitimesynth: A Python Package for Evaluating Attribution Methods for Time Series with Synthetic Ground Truth

Authors: Gregor Baer

Abstract: Evaluating time series attribution methods is difficult because real-world datasets rarely provide ground truth for which time points drive a prediction. A common workaround is to generate synthetic data where class-discriminating features are placed at known locations, but each study currently reimplements this from scratch. We introduce xaitimesynth, a Python package that provides reusable infrastructure for this evaluation approach. The package generates synthetic time series following an additive model where each sample is a sum of background signal and a localized, class-discriminating feature, with the feature window automatically tracked as a ground truth mask. A fluent data generation API and YAML configuration format allow flexible and reproducible dataset definitions for both univariate and multivariate time series. The package also provides standard localization metrics, including AUC-PR, AUC-ROC, Relevance Mass Accuracy, and Relevance Rank Accuracy. xaitimesynth is open source and available at https://github.com/gregorbaer/xaitimesynth.

URLs: https://github.com/gregorbaer/xaitimesynth.

cross Physics-Informed Diffusion Model for Generating Synthetic Extreme Rare Weather Events Data

Authors: Marawan Yakout, Tannistha Maiti, Monira Majhabeen, Tarry Singh

Abstract: Data scarcity is a primary obstacle in developing robust Machine Learning (ML) models for detecting rapidly intensifying tropical cyclones. Traditional data augmentation techniques (rotation, flipping, brightness adjustment) fail to preserve the physical consistency and high-intensity gradients characteristic of rare Category 4-equivalent events, which constitute only 0.14\% of our dataset (202 of 140,514 samples). We propose a physics-informed diffusion model based on the Context-UNet architecture to generate synthetic, multi-spectral satellite imagery of extreme weather events. Our model is conditioned on critical atmospheric parameters such as average wind speed, type of Ocean and stage of development (early, mature, late etc) -- the known drivers of rapid intensification. Using a controlled pre-generated noise sampling strategy and mixed-precision training, we generated $16\times16$ wind-field samples that are cropped from multi-spectral satellite imagery which preserve realistic spatial autocorrelation and physical consistency. Results demonstrate that our model successfully learns discriminative features across ten distinct context classes, effectively mitigating the data bottleneck. Specifically, we address the extreme class imbalance in our dataset, where Class 4 (Ocean 2, early stage with average wind speed 50kn hurricane) contains only 202 samples compared to 79,768 samples in Class 0. This generative framework provides a scalable solution for augmenting training datasets for operational weather detection algorithms. The average Results yield an average Log-Spectral Distance (LSD) of 4.5dB, demonstrating a scalable framework for enhancing operational weather detection algorithms.

cross Optimistic Policy Regularization

Authors: Mai Pham, Vikrant Vaze, Peter Chin

Abstract: Deep reinforcement learning agents frequently suffer from premature convergence, where early entropy collapse causes the policy to discard exploratory behaviors before discovering globally optimal strategies. We introduce Optimistic Policy Regularization (OPR), a lightweight mechanism designed to preserve and reinforce historically successful trajectories during policy optimization. OPR maintains a dynamic buffer of high-performing episodes and biases learning toward these behaviors through directional log-ratio reward shaping and an auxiliary behavioral cloning objective. When instantiated on Proximal Policy Optimization (PPO), OPR substantially improves sample efficiency on the Arcade Learning Environment. Across 49 Atari games evaluated at the 10-million step benchmark, OPR achieves the highest score in 22 environments despite baseline methods being reported at the standard 50-million step horizon. Beyond arcade benchmarks, OPR also generalizes to the CAGE Challenge 2 cyber-defense environment, surpassing the competition-winning Cardiff agent while using the same PPO architecture. These results demonstrate that anchoring policy updates to empirically successful trajectories can improve both sample efficiency and final performance.

cross A Hybrid Machine Learning Model for Cerebral Palsy Detection

Authors: Karan Kumar Singh, Nikita Gajbhiye, Gouri Sankar Mishra

Abstract: The development of effective treatments for Cerebral Palsy (CP) can begin with the early identification of affected children while they are still in the early stages of the disorder. Pathological issues in the brain can be better diagnosed with the use of one of many medical imaging techniques. Magnetic Resonance Imaging (MRI) has revolutionized medical imaging with its unparalleled image resolution. A unique Machine Learning (ML) model that was built to identify CP disorder is presented in this paper. The model is intended to assist in the early diagnosis of CP in newborns. In this study, the brain MRI images dataset was first collected, and then the preprocessing techniques were applied to this dataset to make it ready for use in the proposed model. Following this, the proposed model was constructed by combining three CNN models, specifically VGG 19, Efficient-Net, and the ResNet50 model, to extract features from the image. Following this, a Bi-LSTM was utilized as a classifier to determine whether or not CP was present, and finally, the proposed model was employed for training and testing. The results show that the proposed model achieved an accuracy of 98.83%, which is higher than VGG-19 (96.79%), Efficient-Net (97.29%), and VGG-16 (97.50%).. When the suggested model is compared to other models that have been pre-trained in the past, the accuracy scores seem to be much higher.

cross AI-Assisted Curation of Conference Scholarship: Compiling, Structuring, and Analyzing Two Decades of Presentations at the Society for Social Work and Research

Authors: Brian Perron, Bryan Victor, Zia Qi

Abstract: Purpose: This study developed a comprehensive database of presentation abstracts from the Society for Social Work and Research (SSWR) Annual Conference and examined patterns in research methodology, authorship, collaboration, and institutional participation over two decades. Method: Abstract metadata was compiled from the SSWR Confex conference management system for presentations from 2005 to 2026 using web scraping. A small language model (gpt-oss:20b) performed classification and extraction tasks on abstracts, including categorization of methodologies and parsing of author affiliations, with human review at each major stage to ensure accuracy. Results: The database contains 23,793 presentations with 69,924 author records representing 20,779 unique researchers from 4,049 institutions across 93 countries. Annual conference presentations increased from 423 in 2005 to 1,935 in 2026, representing a compound annual growth rate of 7.5%. Quantitative methods predominated (61.1%), followed by qualitative approaches (23.4%), mixed methods (9.1%), and reviews (5.4%). The mean number of authors per presentation increased from 2.22 in 2005 to 3.31 in 2026. International participation grew from 4.5% to 13.5% of author affiliations over the observation period. Discussion: Findings indicate substantial growth in SSWR conference participation, alongside increased collaboration and international engagement. The methodological distribution reveals continued quantitative predominance with growing qualitative representation. This database provides research infrastructure for systematic hypothesis testing about research priorities and disciplinary development over time, enabling analyses that inform both scholarship and conference planning.

cross "Dark Triad" Model Organisms of Misalignment: Narrow Fine-Tuning Mirrors Human Antisocial Behavior

Authors: Roshni Lulla, Fiona Collins, Sanaya Parekh, Thilo Hagendorff, Jonas Kaplan

Abstract: The alignment problem refers to concerns regarding powerful intelligences, ensuring compatibility with human preferences and values as capabilities increase. Current large language models (LLMs) show misaligned behaviors, such as strategic deception, manipulation, and reward-seeking, that can arise despite safety training. Gaining a mechanistic understanding of these failures requires empirical approaches that can isolate behavioral patterns in controlled settings. We propose that biological misalignment precedes artificial misalignment, and leverage the Dark Triad of personality (narcissism, psychopathy, and Machiavellianism) as a psychologically grounded framework for constructing model organisms of misalignment. In Study 1, we establish comprehensive behavioral profiles of Dark Triad traits in a human population (N = 318), identifying affective dissonance as a central empathic deficit connecting the traits, as well as trait-specific patterns in moral reasoning and deceptive behavior. In Study 2, we demonstrate that dark personas can be reliably induced in frontier LLMs through minimal fine-tuning on validated psychometric instruments. Narrow training datasets as small as 36 psychometric items resulted in significant shifts across behavioral measures that closely mirrored human antisocial profiles. Critically, models generalized beyond training items, demonstrating out-of-context reasoning rather than memorization. These findings reveal latent persona structures within LLMs that can be readily activated through narrow interventions, positioning the Dark Triad as a validated framework for inducing, detecting, and understanding misalignment across both biological and artificial intelligence.

cross Step-Level Visual Grounding Faithfulness Predicts Out-of-Distribution Generalization in Long-Horizon Vision-Language Models

Authors: Md Ashikur Rahman, Md Arifur Rahman, Niamul Hassan Samin, Abdullah Ibne Hanif Arean, Juena Ahmed Noshin

Abstract: We uncover a behavioral law of long-horizon vision-language models: models that maintain temporally grounded beliefs generalize better. Standard benchmarks measure only final-answer accuracy, which obscures how models use visual information; a model can guess correctly while its step-by-step reasoning is entirely unanchored to the visual input. We formalize this as behavioral faithfulness over long horizons, an empirically measurable property that quantifies whether a model's intermediate reasoning remains consistent with the evolving visual state. Across eight models on three long-horizon benchmarks, we demonstrate that temporal grounding quality is a leading indicator of robustness: the Step Grounding Rate (SGR) predicts out-of-distribution retention with $r = 0.83$ (permutation test $p = 0.003$), a relationship that holds within capacity-matched models and cannot be explained by scale or in-distribution accuracy. Critically, grounding quality varies by up to 10.8 percentage points within parameter-matched 7B models despite similar accuracy, revealing it as an independent axis of model capability. Multiple robustness checks confirm the signal reflects genuine visual reliance: counterfactual traces drop SGR by 26--41 percentage points, cross-architecture verifiers agree at $\rho = 0.96$, random reasoning scores near chance ($\sim 18\%$), and the predictor remains strong even without explicit reasoning disclosure ($r = 0.78$).

cross Twitch: Learning Abstractions for Equational Theorem Proving

Authors: Guy Axelrod, Moa Johansson, Nicholas Smallbone

Abstract: Several successful strategies in automated reasoning rely on human-supplied guidance about which term or clause shapes are interesting. In this paper we aim to discover interesting term shapes automatically. Specifically, we discover abstractions : term patterns that occur over and over again in relevant proofs. We present our tool Twitch which discovers abstractions with the help of Stitch, a tool originally developed for discovering reusable library functions in program synthesis tasks. Twitch can produce abstractions in two ways: (1) from a partial, failed proof of a conjecture; (2) from successful proofs of other theorems in the same domain. We have also extended Twee, an equational theorem prover, to use these abstractions. We evaluate Twitch on a set of unit equality (UEQ) problems from TPTP, and show that it can prove 12 rating-1 problems as well as yielding significant speed-ups on many other problems.

cross Are Audio-Language Models Listening? Audio-Specialist Heads for Adaptive Audio Steering

Authors: Neta Glazer, Lenny Aharon, Ethan Fetaya

Abstract: Multimodal large language models can exhibit text dominance, over-relying on linguistic priors instead of grounding predictions in non-text inputs. One example is large audio-language models (LALMs) where decisive audio evidence can be under-utilized even when it contains important information. To address this issue we use mechanistic interpretability to identify a small set of audio-specialist attention heads whose audio attention yields a ``listening'' signal. We show that this signal increases when audio evidence affects the model's output, providing an indicator of audio engagement under standard prompting. Leveraging this localization, we construct an audio--silence steering direction and apply an inference-time activation intervention to the final representation, amplifying the model's audio effect. To demonstrate the utility of this intervention, we show on MMAU that this improves accuracy by up to +8.0 percentage points on two Qwen-based LALMs, without any parameter updates.

cross Contextual Counterfactual Credit Assignment for Multi-Agent Reinforcement Learning in LLM Collaboration

Authors: Yanjun Chen, Yirong Sun, Hanlin Wang, Xinming Zhang, Xiaoyu Shen, Wenjie Li, Wei Zhang

Abstract: Cooperative multi-agent reinforcement learning (MARL) systems powered by large language models (LLMs) are frequently optimized via sparse terminal-only feedback. This shared signal entangles upstream decisions, obstructing accurate decision-level credit assignment. To address this trajectory-level diffusion, we introduce Contextual Counterfactual Credit Assignment (\textbf{\texttt{C3}}). Instead of distributing rewards across an entire episode, \textbf{\texttt{C3}} isolates the causal impact of individual messages by freezing the exact transcript-derived context, evaluating context-matched alternatives via fixed-continuation replay, and applying a leave-one-out (LOO) baseline. This localized intervention extracts unbiased, low-variance marginal advantages for standard policy-gradient optimization. Evaluated across five mathematical and coding benchmarks under matched budgets, \textbf{\texttt{C3}} improves terminal performance over established baselines. Mechanistic diagnostics further show that these gains are accompanied by higher credit fidelity, lower contextual variance, and stronger inter-agent causal dependence. Our code is available at https://github.com/EIT-EAST-Lab/C3.

URLs: https://github.com/EIT-EAST-Lab/C3.

cross Supporting Artifact Evaluation with LLMs: A Study with Published Security Research Papers

Authors: David Heye, Karl Kindermann, Robin Decker, Johannes Lohm\"oller, Anastasiia Belova, Sandra Geisler, Klaus Wehrle, Jan Pennekamp

Abstract: Artifact Evaluation (AE) is essential for ensuring the transparency and reliability of research, closing the gap between exploratory work and real-world deployment is particularly important in cybersecurity, particularly in IoT and CPSs, where large-scale, heterogeneous, and privacy-sensitive data meet safety-critical actuation. Yet, manual reproducibility checks are time-consuming and do not scale with growing submission volumes. In this work, we demonstrate that Large Language Models (LLMs) can provide powerful support for AE tasks: (i) text-based reproducibility rating, (ii) autonomous sandboxed execution environment preparation, and (iii) assessment of methodological pitfalls. Our reproducibility-assessment toolkit yields an accuracy of over 72% and autonomously sets up execution environments for 28% of runnable cybersecurity artifacts. Our automated pitfall assessment detects seven prevalent pitfalls with high accuracy ($F_1$ > 92%). Hence, the toolkit significantly reduces reviewer effort and, when integrated into established AE processes, could incentivize authors to submit higher-quality and more reproducible artifacts. IoT, CPS, and cybersecurity conferences and workshops may integrate the toolkit into their peer-review processes to support reviewers' decisions on awarding artifact badges, improving the overall sustainability of the process.

cross A prior information informed learning architecture for flying trajectory prediction

Authors: Xianda Huang, Zidong Han, Ruibo Jin, Zhenyu Wang, Wenyu Li, Xiaoyang Li, Yi Gong

Abstract: Trajectory prediction for flying objects is critical in domains ranging from sports analytics to aerospace. However, traditional methods struggle with complex physical modeling, computational inefficiencies, and high hardware demands, often neglecting critical trajectory events like landing points. This paper introduces a novel, hardware-efficient trajectory prediction framework that integrates environmental priors with a Dual-Transformer-Cascaded (DTC) architecture. We demonstrate this approach by predicting the landing points of tennis balls in real-world outdoor courts. Using a single industrial camera and YOLO-based detection, we extract high-speed flight coordinates. These coordinates, fused with structural environmental priors (e.g., court boundaries), form a comprehensive dataset fed into our proposed DTC model. A first-level Transformer classifies the trajectory, while a second-level Transformer synthesizes these features to precisely predict the landing point. Extensive ablation and comparative experiments demonstrate that integrating environmental priors within the DTC architecture significantly outperforms existing trajectory prediction frameworks

cross Physics-informed AI Accelerated Retention Analysis of Ferroelectric Vertical NAND: From Day-Scale TCAD to Second-Scale Surrogate Model

Authors: Gyujun Jeong (School of Electrical and Computer Engineering, Georgia Institute of Technology, GA, USA), Sungwon Cho (School of Electrical and Computer Engineering, Georgia Institute of Technology, GA, USA), Minji Shon (School of Electrical and Computer Engineering, Georgia Institute of Technology, GA, USA), Namhoon Kim (School of Electrical and Computer Engineering, Georgia Institute of Technology, GA, USA), Woohyun Hwang (Semiconductor Research and Development, Samsung Electronics Co., Ltd, South Korea), Kwangyou Seo (Semiconductor Research and Development, Samsung Electronics Co., Ltd, South Korea), Suhwan Lim (Semiconductor Research and Development, Samsung Electronics Co., Ltd, South Korea), Wanki Kim (Semiconductor Research and Development, Samsung Electronics Co., Ltd, South Korea), Daewon Ha (Semiconductor Research and Development, Samsung Electronics Co., Ltd, South Korea), Prasanna Venkatesan (NVIDIA, Santa Clara, CA, USA), Kihang Youn (NVIDIA, Santa Clara, CA, USA), Ram Cherukuri (NVIDIA, Santa Clara, CA, USA), Yiyi Wang (NVIDIA, Santa Clara, CA, USA), Suman Datta (School of Electrical and Computer Engineering, Georgia Institute of Technology, GA, USA), Asif Khan (School of Electrical and Computer Engineering, Georgia Institute of Technology, GA, USA), Shimeng Yu (School of Electrical and Computer Engineering, Georgia Institute of Technology, GA, USA)

Abstract: Ferroelectric field-effect transistors (FeFET)-based vertical NAND (Fe-VNAND) has emerged as a promising candidate to overcome z-scaling limitations with lower programming voltages. However, the data retention of 3D Fe-VNAND is hindered by the complex interaction between charge detrapping and ferroelectric depolarization. Developing optimized device designs requires exploring an extensive parameter space, but the high computational cost of conventional Technology Computer-Aided Design (TCAD) tools makes such wide-scale optimization impractical. To overcome these simulation barriers, we present a Physics-Informed Neural Operator (PINO)-based AI surrogate model designed for high-efficiency prediction of threshold voltage (Vth) shifts and retention behavior. By embedding fundamental physical principles into the learning architecture, our PINO framework achieves a speedup exceeding 10000x compared to TCAD while maintaining physical accuracy. This study demonstrates the model's effectiveness on a single FeFET configuration, serving as a pathway toward modeling the retention loss mechanisms.

cross MindfulAgents: Personalizing Mindfulness Meditation via an Expert-Aligned Multi-Agent System

Authors: Mengyuan (Millie), Wu, Zhihan Jiang, Yuang Fan, Richard Feng, Sahiti Dharmavaram, Mathew Polowitz, Shawn Fallon, Bashima Islam, Lizbeth Benson, Irene Tung, David Creswell, Xuhai Xu

Abstract: Mindfulness meditation is a widely accessible and evidence-based method for supporting mental health. Despite the proliferation of mindfulness meditation apps, sustaining user engagement remains a persistent challenge. Personalizing the meditation experience is a promising strategy to improve engagement, but it often requires costly and unscalable manual effort. We present MindfulAgents, a multi-agent system powered by large language models that (1) generates guided meditation scripts based on an expert-established mindfulness framework, (2) encourages users' reflection on emotional states and mindfulness skills, and (3) enables real-time personalization of the mindfulness meditation experience for each user. In a formative lab study (N=13), MindfulAgents significantly improved in-session engagement (p = 0.011) and self-awareness (p = 0.014), and reduced momentary stress (p = 0.020). Furthermore, a four-week deployment study (N=62) demonstrated a notable increase in long-term engagement (p = 0.002) and level of mindfulness (p = 0.023). Participants reported that MindfulAgents offered more relevant meditation sessions personalized to individual needs in various contexts, supporting sustained practice. Our findings highlight the potential of LLM-driven personalization for enhancing user engagement in digital mindfulness meditation interventions.

cross How Private Are DNA Embeddings? Inverting Foundation Model Representations of Genomic Sequences

Authors: Sofiane Ouaari, Jules Kreuer, Nico Pfeifer

Abstract: DNA foundation models have become transformative tools in bioinformatics and healthcare applications. Trained on vast genomic datasets, these models can be used to generate sequence embeddings, dense vector representations that capture complex genomic information. These embeddings are increasingly being shared via Embeddings-as-a-Service (EaaS) frameworks to facilitate downstream tasks, while supposedly protecting the privacy of the underlying raw sequences. However, as this practice becomes more prevalent, the security of these representations is being called into question. This study evaluates the resilience of DNA foundation models to model inversion attacks, whereby adversaries attempt to reconstruct sensitive training data from model outputs. In our study, the model's output for reconstructing the DNA sequence is a zero-shot embedding, which is then fed to a decoder. We evaluated the privacy of three DNA foundation models: DNABERT-2, Evo 2, and Nucleotide Transformer v2 (NTv2). Our results show that per-token embeddings allow near-perfect sequence reconstruction across all models. For mean-pooled embeddings, reconstruction quality degrades as sequence length increases, though it remains substantially above random baselines. Evo 2 and NTv2 prove to be most vulnerable, especially for shorter sequences with reconstruction similarities > 90%, while DNABERT-2's BPE tokenization provides the greatest resilience. We found that the correlation between embedding similarity and sequence similarity was a key predictor of reconstruction success. Our findings emphasize the urgent need for privacy-aware design in genomic foundation models prior to their widespread deployment in EaaS settings. Training code, model weights and evaluation pipeline are released on: https://github.com/not-a-feature/DNA-Embedding-Inversion.

URLs: https://github.com/not-a-feature/DNA-Embedding-Inversion.

cross Post-Training with Policy Gradients: Optimality and the Base Model Barrier

Authors: Alireza Mousavi-Hosseini, Murat A. Erdogdu

Abstract: We study post-training linear autoregressive models with outcome and process rewards. Given a context $\boldsymbol{x}$, the model must predict the response $\boldsymbol{y} \in Y^N$, a sequence of length $N$ that satisfies a $\gamma$ margin condition, an extension of the standard separability to sequences. We prove that on test samples where the base model achieves a non-trivial likelihood $\alpha$, a variant of policy gradient (PG) can achieve likelihood $1 - \varepsilon$ with an essentially minimax optimal number of reward queries $\tilde{O}((\alpha^{-1} + \varepsilon^{-1})/\gamma^2)$. However, a barrier arises for going beyond the support of the base model. We prove that the overall expected error after post-training with outcome rewards is governed by a property of the base model called the Likelihood Quantile (LQ), and that variants of PG, while minimax optimal, may require a number of reward queries exponential in $N$ to go beyond this support, regardless of the pre-training algorithm. To overcome this barrier, we study post-training with a process reward model, and demonstrate how PG variants in this setting avoid the curse of dimensionality in $N$ via dependence on a token-level LQ. Along the way, we prove that under the margin condition, SGD with adaptive learning rate (LR) achieves a near optimal test error for statistical learning, and PG with adaptive LR achieves a near optimal number of mistakes for online learning while being computationally efficient whenever possible, both of which may be of independent interest.

cross Learning Quadruped Walking from Seconds of Demonstration

Authors: Ruipeng Zhang, Hongzhan Yu, Ya-Chien Chang, Chenghao Li, Henrik I. Christensen, Sicun Gao

Abstract: Quadruped locomotion provides a natural setting for understanding when model-free learning can outperform model-based control design, by exploiting data patterns to bypass the difficulty of optimizing over discrete contacts and the combinatorial explosion of mode changes. We give a principled analysis of why imitation learning with quadrupeds can be inherently effective in a small data regime, based on the structure of its limit cycles, Poincar\'e return maps, and local numerical properties of neural networks. The understanding motivates a new imitation learning method that regulates the alignment between variations in a latent space and those over the output actions. Hardware experiments confirm that a few seconds of demonstration is sufficient to train various locomotion policies from scratch entirely offline with reasonable robustness.

cross Elenchus: Generating Knowledge Bases from Prover-Skeptic Dialogues

Authors: Bradley P. Allen

Abstract: We present Elenchus, a dialogue system for knowledge base construction grounded in inferentialist semantics, where knowledge engineering is re-conceived as explicitation rather than extraction from expert testimony or textual content. A human expert develops a bilateral position (commitments and denials) about a topic through prover-skeptic dialogue with a large language model (LLM) opponent. The LLM proposes tensions (claims that parts of the position are jointly incoherent) which the expert resolves by retraction, refinement, or contestation. The LLM thus serves as a defeasible derivability oracle whose unreliability is structurally contained by the expert's authority. Our main technical contribution is a mapping from Elenchus dialectical states to material bases in Hlobil and Brandom's NonMonotonic MultiSuccedent (NMMS) logic, satisfying Containment and enabling the elaboration of logical vocabulary that makes explicit the inferential relationships negotiated in the dialectic. We demonstrate the approach on the W3C PROV-O provenance ontology, where a single dialogue session elicits and structures design tensions that a domain expert can articulate, corresponding to decisions documented in a retrospective analysis of the ontology's design. Using pyNMMS, an automated NMMS reasoner, we verify that the structural properties of the resulting material base (nontransitivity, nonmonotonicity, and independence) correspond to specific PROV design rationales, demonstrating end-to-end integration from dialogue through formal reasoning.

cross A Systematic Investigation of Document Chunking Strategies and Embedding Sensitivity

Authors: Muhammad Arslan Shaukat, Muntasir Adnan, Carlos C. N. Kuhn

Abstract: We present the first large-scale, cross-domain evaluation of document chunking strategies for dense retrieval, addressing a critical but underexplored aspect of retrieval-augmented systems. In our study, 36 segmentation methods spanning fixed-size, semantic, structure-aware, hierarchical, adaptive, and LLM-assisted approaches are benchmarked across six diverse knowledge domains using five different embedding models. Retrieval performance is assessed using graded relevance scores from a state-of-the-art LLM evaluator, with Normalised DCG@5 as the primary metric (complemented by Hit@5 and MRR). Our experiments show that content-aware chunking significantly improves retrieval effectiveness over naive fixed-length splitting. The top-performing strategy, Paragraph Group Chunking, achieved the highest overall accuracy (mean nDCG@5~0.459) and substantially better top-rank hit rates (Precision@1~24%, Hit@5~59%). In contrast, simple fixed-size character chunking as baselines performed poorly (nDCG@5 < 0.244, Precision@1~2-3%). We observe pronounced domain-specific differences: dynamic token sizing is strongest in biology, physics and health, while paragraph grouping is strongest in legal and maths. Larger embedding models yield higher absolute scores but remain sensitive to suboptimal segmentation, indicating that better chunking and large embeddings provide complementary benefits. In addition to accuracy gains, we quantify the efficiency trade-offs of advanced chunking. Producing more, smaller chunks can increase index size and latency. Consequently, we identify methods (like dynamic chunking) that approach an optimal balance of effectiveness and efficiency. These findings establish chunking as a vital lever for improving retrieval performance and reliability.

cross NePPO: Near-Potential Policy Optimization for General-Sum Multi-Agent Reinforcement Learning

Authors: Addison Kalanther, Sanika Bharvirkar, Shankar Sastry, Chinmay Maheshwari

Abstract: Multi-agent reinforcement learning (MARL) is increasingly used to design learning-enabled agents that interact in shared environments. However, training MARL algorithms in general-sum games remains challenging: learning dynamics can become unstable, and convergence guarantees typically hold only in restricted settings such as two-player zero-sum or fully cooperative games. Moreover, when agents have heterogeneous and potentially conflicting preferences, it is unclear what system-level objective should guide learning. In this paper, we propose a new MARL pipeline called Near-Potential Policy Optimization (NePPO) for computing approximate Nash equilibria in mixed cooperative--competitive environments. The core idea is to learn a player-independent potential function such that the Nash equilibrium of a cooperative game with this potential as the common utility approximates a Nash equilibrium of the original game. To this end, we introduce a novel MARL objective such that minimizing this objective yields the best possible potential function candidate and consequently an approximate Nash equilibrium of the original game. We develop an algorithmic pipeline that minimizes this objective using zeroth-order gradient descent and returns an approximate Nash equilibrium policy. We empirically show the superior performance of this approach compared to popular baselines such as MAPPO, IPPO and MADDPG.

cross Diffusion Controller: Framework, Algorithms and Parameterization

Authors: Tong Yang, Moonkyung Ryu, Chih-Wei Hsu, Guy Tennenholtz, Yuejie Chi, Craig Boutilier, Bo Dai

Abstract: Controllable diffusion generation often relies on various heuristics that are seemingly disconnected without a unified understanding. We bridge this gap with Diffusion Controller (DiffCon), a unified control-theoretic view that casts reverse diffusion sampling as state-only stochastic control within (generalized) linearly-solvable Markov Decision Processes (LS-MDPs). Under this framework, control acts by reweighting the pretrained reverse-time transition kernels, balancing terminal objectives against an $f$-divergence cost. From the resulting optimality conditions, we derive practical reinforcement learning methods for diffusion fine-tuning: (i) f-divergence-regularized policy-gradient updates, including a PPO-style rule, and (ii) a regularizer-determined reward-weighted regression objective with a minimizer-preservation guarantee under the Kullback-Leibler (KL) divergence. The LS-MDP framework further implies a principled model form: the optimal score decomposes into a fixed pretrained baseline plus a lightweight control correction, motivating a side-network parameterization conditioned on exposed intermediate denoising outputs, enabling effective gray-box adaptation with a frozen backbone. Experiments on Stable Diffusion v1.4 across supervised and reward-driven finetuning show consistent gains in preference-alignment win rates and improved quality-efficiency trade-offs versus gray-box baselines and even the parameter-efficient white-box adapter LoRA.

cross Masked Unfairness: Hiding Causality within Zero ATE

Authors: Zou Yang, Sophia Xiao, Bijan Mazaheri

Abstract: Recent work has proposed powerful frameworks, rooted in causal theory, to quantify fairness. Causal inference has primarily emphasized the detection of \emph{average} treatment effects (ATEs), and subsequent notions of fairness have inherited this focus. In this paper, we build on previous concerns about regulation based on averages. In particular, we formulate the "causal masking problem" as a linear program that optimizes an alternative objective, such as maximizing profit or minimizing crime, while retaining a zero ATE (i.e., the ATE between a protected attribute and a decision). By studying the capabilities and limitations of causal masking, we show that optimization under ATE-based regulation may induce significant unequal treatment. We demonstrate that the divergence between true and causally masked fairness is driven by confounding, underscoring the importance of full conditional-independence testing when assessing fairness. Finally, we discuss statistical and information-theoretic limitations that make causally masked solutions very difficult to detect, allowing them to persist for long periods. These results argue that we must regulate fairness at the model-level, rather than at the decision level.

cross Foundational World Models Accurately Detect Bimanual Manipulator Failures

Authors: Isaac R. Ward, Michelle Ho, Houjun Liu, Aaron Feldman, Joseph Vincent, Liam Kruse, Sean Cheong, Duncan Eddy, Mykel J. Kochenderfer, Mac Schwager

Abstract: Deploying visuomotor robots at scale is challenging due to the potential for anomalous failures to degrade performance, cause damage, or endanger human life. Bimanual manipulators are no exception; these robots have vast state spaces comprised of high-dimensional images and proprioceptive signals. Explicitly defining failure modes within such state spaces is infeasible. In this work, we overcome these challenges by training a probabilistic, history informed, world model within the compressed latent space of a pretrained vision foundation model (NVIDIA's Cosmos Tokenizer). The model outputs uncertainty estimates alongside its predictions that serve as non-conformity scores within a conformal prediction framework. We use these scores to develop a runtime monitor, correlating periods of high uncertainty with anomalous failures. To test these methods, we use the simulated Push-T environment and the Bimanual Cable Manipulation dataset, the latter of which we introduce in this work. This new dataset features trajectories with multiple synchronized camera views, proprioceptive signals, and annotated failures from a challenging data center maintenance task. We benchmark our methods against baselines from the anomaly detection and out-of-distribution detection literature, and show that our approach considerably outperforms statistical techniques. Furthermore, we show that our approach requires approximately one twentieth of the trainable parameters as the next-best learning-based approach, yet outperforms it by 3.8% in terms of failure detection rate, paving the way toward safely deploying manipulator robots in real-world environments where reliability is non-negotiable.

cross SuperSkillsStack: Agency, Domain Knowledge, Imagination, and Taste in Human-AI Design Education

Authors: Qian Huang, King Wang Poon

Abstract: This study examines how students integrate generative artificial intelligence (AI) into design projects through the lens of the SuperSkillsStack framework, which identifies four key human competencies for effective human-AI collaboration: Agency, Domain Knowledge, Imagination, and Taste. As generative AI increasingly transforms creative practice, design education must consider how human capabilities are cultivated alongside technological tools. Using qualitative thematic analysis, this study analyzes reflective writings from 80 student design teams participating in a human-centered design course. The findings show that students primarily used AI during the early stages of the design process, including brainstorming, information synthesis, and problem framing. However, students consistently relied on human judgment to interpret contextual information, validate AI-generated outputs, and refine design solutions. Domain knowledge derived from field observations enabled students to detect inaccuracies in AI suggestions, while taste played a critical role in evaluating and selecting meaningful ideas. The results suggest that generative AI functions primarily as a cognitive accelerator rather than a replacement for human creativity. The study highlights the importance of cultivating higher-order human capabilities to support effective human-AI collaboration in design education.

cross Can Safety Emerge from Weak Supervision? A Systematic Analysis of Small Language Models

Authors: Punyajoy Saha, Sudipta Halder, Debjyoti Mondal, Subhadarshi Panda

Abstract: Safety alignment is critical for deploying large language models (LLMs) in real-world applications, yet most existing approaches rely on large human-annotated datasets and static red-teaming benchmarks that are costly, difficult to scale, and slow to adapt to evolving model behaviors. Moreover, overly conservative safety mechanisms can reduce model usefulness by rejecting sensitive but legitimate queries. We introduce Self-MOA (Self Multi-Objective Alignment), a fully automated framework for aligning small language models using weak supervision from automated evaluator models. Self-MOA operates as a closed loop that dynamically generates model-specific red team prompts, constructs preference data from model-generated responses, and aligns models via multi-objective preference optimization to jointly optimize for safety and helpfulness. Across multiple small language models and safety benchmarks, Self-MOA achieves a 12.41\% improvement in safety while preserving helpfulness, using as little as 11 times less training data than human-supervised alignment baselines. These results demonstrate that adaptive, automated alignment can reduce the dependence on static, human-curated safety pipelines in resource-constrained settings.

cross RESCHED: Rethinking Flexible Job Shop Scheduling from a Transformer-based Architecture with Simplified States

Authors: Xiangjie Xiao, Cong Zhang, Wen Song, Zhiguang Cao

Abstract: Neural approaches to the Flexible Job Shop Scheduling Problem (FJSP), particularly those based on deep reinforcement learning (DRL), have gained growing attention in recent years. However, existing methods rely on complex feature-engineered state representations (i.e., often requiring more than 20 handcrafted features) and graph-biased neural architectures. To reduce modeling complexity and advance a more generalizable framework for FJSP, we introduce \textsc{ReSched}, a minimalist DRL framework that rethinks both the scheduling formulation and model design. First, by revisiting the Markov Decision Process (MDP) formulation of FJSP, we condense the state space to just four essential features, eliminating historical dependencies through a subproblem-based perspective. Second, we employ Transformer blocks with dot-product attention, augmented by three lightweight but effective architectural modifications tailored to scheduling tasks. Extensive experiments show that \textsc{ReSched} outperforms classical dispatching rules and state-of-the-art DRL methods on FJSP. Moreover, \textsc{ReSched} also generalizes well to the Job Shop Scheduling Problem (JSSP) and the Flexible Flow Shop Scheduling Problem (FFSP), achieving competitive performance against neural baselines specifically designed for these variants.

cross Hit-RAG: Learning to Reason with Long Contexts via Preference Alignment

Authors: Junming Liu, Yuqi Li, Shiping Wen, Zhigang Zeng, Tingwen Huang

Abstract: Despite the promise of Retrieval-Augmented Generation in grounding Multimodal Large Language Models with external knowledge, the transition to extensive contexts often leads to significant attention dilution and reasoning hallucinations. The surge in information density causes critical evidence to be submerged by voluminous noise, which complicates the discernment of relevant fragments within a dense input. In this paper, we propose \textbf{Hit-RAG}, a multi-stage preference alignment framework designed to resolve these cognitive bottlenecks through a progressive optimization pipeline. Our approach systematically refines the utilization of external evidence via three distinct stages. First, Supervised Fine-tuning establishes baseline context awareness to minimize information neglect. Next, Discriminative Preference Alignment enhances robustness against misleading distractors. Finally, Group-Relative Policy Optimization stabilizes logical synthesis to prevent reasoning collapse. Extensive evaluations on eight benchmarks demonstrate that Hit-RAG consistently yields substantial performance gains, enabling models to bridge the gap between context acquisition and accurate reasoning while surpassing much larger counterparts in long-context scenarios.

cross Looking Back and Forth: Cross-Image Attention Calibration and Attentive Preference Learning for Multi-Image Hallucination Mitigation

Authors: Xiaochen Yang, Hao Fang, Jiawei Kong, Yaoxin Mao, Bin Chen, Shu-Tao Xia

Abstract: Although large vision-language models (LVLMs) have demonstrated remarkable capabilities, they are prone to hallucinations in multi-image tasks. We attribute this issue to limitations in existing attention mechanisms and insufficient cross-image modeling. Inspired by this, we propose a structured hallucination mitigation framework involving Cross-Image Attention calibration and Preference Learning (CAPL). CAPL explicitly enhances inter-image interactions at the architectural level while reinforcing reliance on genuine cross-image evidence during training, thereby improving the model's perception and modeling of cross-image associations. Specifically, we (i) introduce a selectable image token interaction attention mechanism to establish fine-grained cross-image entity alignment and information flow; (ii) design a cross-image modeling-based preference optimization strategy that contrasts reasoning outcomes under full inter-image interaction and those obtained when images are mutually invisible, encouraging the model to ground its predictions in authentic visual evidence and mitigating erroneous inferences driven by textual priors. Experimental results demonstrate that CAPL consistently improves performance across multiple model architectures, achieving stable gains on both multi-image hallucination and general benchmarks. Notably, performance on single-image visual tasks remains stable or slightly improves, indicating strong generalization capability.

cross MedSteer: Counterfactual Endoscopic Synthesis via Training-Free Activation Steering

Authors: Trong-Thang Pham, Loc Nguyen, Anh Nguyen, Hien Nguyen, Ngan Le

Abstract: Generative diffusion models are increasingly used for medical imaging data augmentation, but text prompting cannot produce causal training data. Re-prompting rerolls the entire generation trajectory, altering anatomy, texture, and background. Inversion-based editing methods introduce reconstruction error that causes structural drift. We propose MedSteer, a training-free activation-steering framework for endoscopic synthesis. MedSteer identifies a pathology vector for each contrastive prompt pair in the cross-attention layers of a diffusion transformer. At inference time, it steers image activations along this vector, generating counterfactual pairs from scratch where the only difference is the steered concept. All other structure is preserved by construction. We evaluate MedSteer across three experiments on Kvasir v3 and HyperKvasir. On counterfactual generation across three clinical concept pairs, MedSteer achieves flip rates of 0.800, 0.925, and 0.950, outperforming the best inversion-based baseline in both concept flip rate and structural preservation. On dye disentanglement, MedSteer achieves 75% dye removal against 20% (PnP) and 10% (h-Edit). On downstream polyp detection, augmenting with MedSteer counterfactual pairs achieves ViT AUC of 0.9755 versus 0.9083 for quantity-matched re-prompting, confirming that counterfactual structure drives the gain. Code is at link https://github.com/phamtrongthang123/medsteer

URLs: https://github.com/phamtrongthang123/medsteer

cross User Review Writing via Interview with Dialogue Systems

Authors: Yoshiki Tanaka, Michimasa Inaba

Abstract: User reviews on e-commerce and review sites are crucial for making purchase decisions, although creating detailed reviews is time-consuming and labor-intensive. In this study, we propose a novel use of dialogue systems to facilitate user review creation by generating reviews from information gathered during interview dialogues with users. To validate our approach, we implemented our system using GPT-4 and conducted comparative experiments from the perspectives of system users and review readers. The results indicate that participants who used our system rated their interactions positively. Additionally, reviews generated by our system required less editing to achieve user satisfaction compared to those by the baseline. We also evaluated the reviews from the reader' perspective and found that our system-generated reviews are more helpful than those written by humans. Despite challenges with the fluency of the generated reviews, our method offers a promising new approach to review writing.

cross Countdown-Code: A Testbed for Studying The Emergence and Generalization of Reward Hacking in RLVR

Authors: Muhammad Khalifa, Zohaib Khan, Omer Tafveez, Hao Peng, Lu Wang

Abstract: Reward hacking is a form of misalignment in which models overoptimize proxy rewards without genuinely solving the underlying task. Precisely measuring reward hacking occurrence remains challenging because true task rewards are often expensive or impossible to compute. We introduce Countdown-Code, a minimal environment where models can both solve a mathematical reasoning task and manipulate the test harness. This dual-access design creates a clean separation between proxy rewards (test pass/fail) and true rewards (mathematical correctness), enabling accurate measurement of reward-hacking rates. Using this environment, we study reward hacking in open-weight LLMs and find that such behaviors can be unintentionally learned during supervised fine-tuning (SFT) when even a small fraction of reward-hacking trajectories leak into training data. As little as 1\% contamination in distillation SFT data is sufficient for models to internalize reward hacking which resurfaces during subsequent reinforcement learning (RL). We further show that RL amplifies misalignment and drives its generalization beyond the original domain. We open-source our environment and code to facilitate future research on reward hacking in LLMs. Our results reveal a previously underexplored pathway through which reward hacking can emerge and persist in LLMs, underscoring the need for more rigorous validation of synthetic SFT data. Code is available at https://github.com/zohaib-khan5040/Countdown-Code.

URLs: https://github.com/zohaib-khan5040/Countdown-Code.

cross mAVE: A Watermark for Joint Audio-Visual Generation Models

Authors: Luyang Si, Leyi Pan, Lijie Wen

Abstract: As Joint Audio-Visual Generation Models see widespread commercial deployment, embedding watermarks has become essential for protecting vendor copyright and ensuring content provenance. However, existing techniques suffer from an architectural mismatch by treating modalities as decoupled entities, exposing a critical Binding Vulnerability. Adversaries exploit this via Swap Attacks by replacing authentic audio with malicious deepfakes while retaining the watermarked video. Because current detectors rely on independent verification ($Video_{wm}\vee Audio_{wm}$), they incorrectly authenticate the manipulated content, falsely attributing harmful media to the original vendor and severely damaging their reputation. To address this, we propose mAVE (Manifold Audio-Visual Entanglement), the first watermarking framework natively designed for joint architectures. mAVE cryptographically binds audio and video latents at initialization without fine-tuning, defining a Legitimate Entanglement Manifold via Inverse Transform Sampling. Experiments on state-of-the-art models (LTX-2, MOVA) demonstrate that mAVE guarantees performance-losslessness and provides an exponential security bound against Swap Attacks. Achieving near-perfect binding integrity ($>99\%$), mAVE offers a robust cryptographic defense for vendor copyright.

cross Efficient Personalized Reranking with Semi-Autoregressive Generation and Online Knowledge Distillation

Authors: Kai Cheng, Hao Wang, Wei Guo, Weiwen Liu, Yong Liu, Yawen Li, Enhong Chen

Abstract: Generative models offer a promising paradigm for the final stage reranking in multi-stage recommender systems, with the ability to capture inter-item dependencies within reranked lists. However, their practical deployment still faces two key challenges: (1) an inherent conflict between achieving high generation quality and ensuring low-latency inference, making it difficult to balance the two, and (2) insufficient interaction between user and item features in existing methods. To address these challenges, we propose a novel Personalized Semi-Autoregressive with online knowledge Distillation (PSAD) framework for reranking. In this framework, the teacher model adopts a semi-autoregressive generator to balance generation quality and efficiency, while its ranking knowledge is distilled online into a lightweight scoring network during joint training, enabling real-time and efficient inference. Furthermore, we propose a User Profile Network (UPN) that injects user intent and models interest dynamics, enabling deeper interactions between users and items. Extensive experiments conducted on three large-scale public datasets demonstrate that PSAD significantly outperforms state-of-the-art baselines in both ranking performance and inference efficiency.

cross Enhancing Consistency of Werewolf AI through Dialogue Summarization and Persona Information

Authors: Yoshiki Tanaka, Takumasa Kaneko, Hiroki Onozeki, Natsumi Ezure, Ryuichi Uehara, Zhiyang Qi, Tomoya Higuchi, Ryutaro Asahara, Michimasa Inaba

Abstract: The Werewolf Game is a communication game where players' reasoning and discussion skills are essential. In this study, we present a Werewolf AI agent developed for the AIWolfDial 2024 shared task, co-hosted with the 17th INLG. In recent years, large language models like ChatGPT have garnered attention for their exceptional response generation and reasoning capabilities. We thus develop the LLM-based agents for the Werewolf Game. This study aims to enhance the consistency of the agent's utterances by utilizing dialogue summaries generated by LLMs and manually designed personas and utterance examples. By analyzing self-match game logs, we demonstrate that the agent's utterances are contextually consistent and that the character, including tone, is maintained throughout the game.

cross aCAPTCHA: Verifying That an Entity Is a Capable Agent via Asymmetric Hardness

Authors: Zuyao Xu, Xiang Li, Fubin Wu, Yuqi Qiu, Lu Sun, FaSheng Miao

Abstract: As autonomous AI agents increasingly populate the Internet, a novel security challenge arises: "Is this entity an AI agent?" It is a new entity-type verification problem with no established solution. We formalize the problem through a three-class entity taxonomy (Human, Script, Agent) based on a verifiable agentic capability vector (action, reasoning, and memory). A timing threshold t exploits the asymmetric hardness between human cognition and AI processing to separate the three classes. We define the Agentic Capability Verification Problem (ACVP) through three necessity primitives, each testing one capability dimension. Building on this foundation, we introduce aCAPTCHA (Agent CAPTCHA), a time-constrained security game for agent admission whose security rests on ACVP hardness under t. We instantiate aCAPTCHA through time-bounded natural-language understanding as a multi-round HTTP verification protocol, and evaluate it with preliminary agent trials that validate the protocol's soundness and completeness. aCAPTCHA provides a composable, infrastructure-free admission gate for any service where entity-type verification is required.

cross Deep Expert Injection for Anchoring Retinal VLMs with Domain-Specific Knowledge

Authors: Shuai Lu, Meng Wang, Jia Guo, Jiawei Du, Bo Liu, Shengzhu Yang, Weihang Zhang, Huazhu Fu, Huiqi Li

Abstract: Large Vision Language Models (LVLMs) show immense potential for automated ophthalmic diagnosis. However, their clinical deployment is severely hindered by lacking domain-specific knowledge. In this work, we identify two structural deficiencies hindering reliable medical reasoning: 1) the Perception Gap, where general-purpose visual encoders fail to resolve fine-grained pathological cues (e.g., microaneurysms); and 2) the Reasoning Gap, where sparse visual evidence is progressively overridden by massive language priors in deeper transformer layers, leading to ungrounded hallucinations. To bridge these gaps, we propose EyExIn, a data-efficient framework designed to anchor retinal VLMs with expert knowledge via a Deep Expert Injection mechanism. Our architecture employs an Expert-Aware Dual-Stream encoding strategy that decouples visual representation into a general stream for anatomical context and a specialized expert stream for pathological semantics. To ensure high-fidelity integration, we design a Semantic-Adaptive Gated Fusion module, which dynamically amplifies subtle lesion signals while filtering irrelevant background noise. Furthermore, we introduce Adaptive Deep Expert Injection to embed persistent "Vision Anchors" by integrating fused visual features as residual biases directly into intermediate LLM layers. This mechanism creates a visual shortcut that forces the reasoning stack to remain strictly grounded in visual evidence. Extensive experiments across four benchmarks demonstrate that our model consistently outperforms massive proprietary systems. EyExIn significantly enhances domain-specific knowledge embedding and achieves state-of-the-art precision in ophthalmic visual question answering, advancing the development of trustworthy ophthalmic AI.

cross Emotion Transcription in Conversation: A Benchmark for Capturing Subtle and Complex Emotional States through Natural Language

Authors: Yoshiki Tanaka, Ryuichi Uehara, Koji Inoue, Michimasa Inaba

Abstract: Emotion Recognition in Conversation (ERC) is critical for enabling natural human-machine interactions. However, existing methods predominantly employ categorical or dimensional emotion annotations, which often fail to adequately represent complex, subtle, or culturally specific emotional nuances. To overcome this limitation, we propose a novel task named Emotion Transcription in Conversation (ETC). This task focuses on generating natural language descriptions that accurately reflect speakers' emotional states within conversational contexts. To address the ETC, we constructed a Japanese dataset comprising text-based dialogues annotated with participants' self-reported emotional states, described in natural language. The dataset also includes emotion category labels for each transcription, enabling quantitative analysis and its application to ERC. We benchmarked baseline models, finding that while fine-tuning on our dataset enhances model performance, current models still struggle to infer implicit emotional states. The ETC task will encourage further research into more expressive emotion understanding in dialogue. The dataset is publicly available at https://github.com/UEC-InabaLab/ETCDataset.

URLs: https://github.com/UEC-InabaLab/ETCDataset.

cross Fine-Grained Table Retrieval Through the Lens of Complex Queries

Authors: Wojciech Kosiuk, Xingyu Ji, Yeounoh Chung, Fatma \"Ozcan, Madelon Hulsebos

Abstract: Enabling question answering over tables and databases in natural language has become a key capability in the democratization of insights from tabular data sources. These systems first require retrieval of data that is relevant to a given natural language query, for which several methods have been introduced. In this work we present and study a table retrieval mechanism devising fine-grained typed query decomposition and global connectivity-awareness (DCTR), to handle the challenges induced by open-domain question answering over relational databases in complex usage contexts. We evaluate the effectiveness of the two mechanisms through the lens of retrieval complexity which we measure along the axes of query- and data complexity. Our analyses over industry-aligned benchmarks illustrate the robustness of DCTR for highly composite queries and densely connected databases.

cross From State Changes to Creative Decisions: Documenting and Interpreting Traces Across Creative Domains

Authors: Xiaohan Peng, Sotiris Piliouras, Carl Abou Saada Nujaim

Abstract: Analyzing creative activity traces requires capturing activity at appropriate granularity and interpreting it in ways that reflect the structure of creative practice. However, existing approaches record state changes without preserving the intent or relationships that define higher-level creative moves. This decoupling manifests differently across domains: GenAI tools lose non-linear exploration structure, visualization authoring obscures representational intent, and programmatic environments flatten interaction boundaries. We present three complementary approaches: a node-based interface for stateful GenAI artifact management, a vocabulary of visual cues as higher-level creative moves in visualization authoring, and a programming model that embeds semantic histories directly into interaction state.

cross Governance Architecture for Autonomous Agent Systems: Threats, Framework, and Engineering Practice

Authors: Yuxu Ge

Abstract: Autonomous agents powered by large language models introduce a class of execution-layer vulnerabilities -- prompt injection, retrieval poisoning, and uncontrolled tool invocation -- that existing guardrails fail to address systematically. In this work, we propose the Layered Governance Architecture (LGA), a four-layer framework comprising execution sandboxing (L1), intent verification (L2), zero-trust inter-agent authorization (L3), and immutable audit logging (L4). To evaluate LGA, we construct a bilingual benchmark (Chinese original, English via machine translation) of 1,081 tool-call samples -- covering prompt injection, RAG poisoning, and malicious skill plugins -- and apply it to OpenClaw, a representative open-source agent framework. Experimental results on Layer 2 intent verification with four local LLM judges (Qwen3.5-4B, Llama-3.1-8B, Qwen3.5-9B, Qwen2.5-14B) and one cloud judge (GPT-4o-mini) show that all five LLM judges intercept 93.0-98.5% of TC1/TC2 malicious tool calls, while lightweight NLI baselines remain below 10%. TC3 (malicious skill plugins) proves harder at 75-94% IR among judges with meaningful precision-recall balance, motivating complementary enforcement at Layers 1 and 3. Qwen2.5-14B achieves the best local balance (98% IR, approximately 10-20% FPR); a two-stage cascade (Qwen3.5-9B->GPT-4o-mini) achieves 91.9-92.6% IR with 1.9-6.7% FPR; a fully local cascade (Qwen3.5-9B->Qwen2.5-14B) achieves 94.7-95.6% IR with 6.0-9.7% FPR for data-sovereign deployments. An end-to-end pipeline evaluation (n=100) demonstrates that all four layers operate in concert with 96% IR and a total P50 latency of approximately 980 ms, of which the non-judge layers contribute only approximately 18 ms. Generalization to the external InjecAgent benchmark yields 99-100% interception, confirming robustness beyond our synthetic data.

cross A Miniature Brain Transformer: Thalamic Gating, Hippocampal Lateralization, Amygdaloid Salience, and Prefrontal Working Memory in Attention-Coupled Latent Memory

Authors: Hong Jeong

Abstract: We present a miniature brain transformer architecture that extends the attention-coupled latent memory framework with four additional brain-region analogues: a thalamic relay, an amygdaloid salience module, a prefrontal working-memory (PFC) buffer, and a cerebellar fast-path, all coupled by inhibitory callosal cross-talk between lateralized hippocampal banks. We evaluate on a two-domain benchmark -- MQAR (Multi-Query Associative Recall; episodic domain) and modular arithmetic (+1 mod 10; rule-based domain) -- using a seven-variant additive ablation. The central empirical finding is a surprise: inhibitory callosal coupling alone never lateralizes the banks (variants 1-5 maintain D_sep ~ 0.25 and P_ct ~ 0.25 for all 30 epochs). Functional lateralization requires the synergy of PFC and inhibition: only when the PFC buffer is added (variant 6) does a sharp, discontinuous phase transition fire -- at epoch 11 for the PFC-only variant and epoch 10 for the full model -- collapsing P_ct from 0.25 to ~0.002 and more than doubling D_sep from 0.251 to 0.501 in a single gradient step. The PFC buffer acts as a symmetry-breaker: its slowly drifting domain context creates the initial asymmetry that the inhibitory feedback loop then amplifies irreversibly. The cerebellar fast-path accelerates the transition by one epoch (epoch 10 vs. epoch 11) with no asymptotic change, confirming its convergence-acceleration role. The result constitutes a novel, falsifiable prediction -- no lateralization without working memory context -- and a principled, neurobiologically motivated blueprint for hierarchical persistent memory in sequence models.

cross VINO: Video-driven Invariance for Non-contextual Objects via Structural Prior Guided De-contextualization

Authors: Seul-Ki Yeom, Marcel Simon, Eunbin Lee, Tae-Ho Kim

Abstract: Self-supervised learning (SSL) has made rapid progress, yet learned features often over-rely on contextual shortcuts-background textures and co-occurrence statistics. While video provides rich temporal variation, dense in-the-wild streams with strong ego-motion create a co-occurrence trap: foreground objects and background context move coherently, encouraging representations to collapse into scene encoders. To address this, we propose VINO (Video-driven Invariance for Non-Contextual Objects), a teacher-student framework that learns robust image encoders from dense video by imposing a structural information bottleneck. Using a class-agnostic structural prior solely to generate views-not as semantic pseudo-labels-VINO forms an asymmetric distillation problem. The teacher predicts from a foreground-union view with the background suppressed, while the student observes object-conditioned scene views that retain surrounding context but remove competing instances. Matching these targets via masked distillation makes background cues unreliable, pushing the representation toward object-centric invariances. We further enforce temporal object permanence via teacher-anchored cross-time distillation over track-matched objects, and stabilize part-to-whole consistency with mask-guided local views. Through attention visualization and unsupervised object discovery on PASCAL VOC, we demonstrate that VINO effectively disentangles foreground from background. Pretrained on the dense Walking Tours Venice video, VINO achieves 34.8 CorLoc, yielding highly focused, shape-biased representations that substantially outperform prior dense-video and motion-guided SSL baselines.

cross A Hybrid LTR-based System via Social Context Embedding for Recommending Solutions of Software Bugs in Developer Communities

Authors: Fouzi Harrag, Mokdad Khemliche

Abstract: Questions and Answering forums such as Stack Overflow play an important role in supporting software developers in finding answers to queries related to issues such as software errors and bugs. However, searching through a large set of candidate answers could be time consuming and may not lead to the best solution. In this research, the effectiveness of data mining models and machine learning techniques to solve this kind of problems is evaluated. We propose a recommender system to aid developers in finding solutions to their software bugs by carefully mining Stack Overflow. The proposed model leverages the knowledge available through crowdsourcing the Q&A available in Stack Overflow to recommend a solution to software bugs. We use deep learning techniques to construct the required Learning-to-Rank (LTR)-based model using the social context embedding the Stack Overflow features. Text mining, natural language processing and recommendation algorithms are used to extract, evaluate and recommend the best relevant bug solutions. Additionally, our model achieves nearly 78% correct solutions when recommending the 10 best answers for each question.

cross LEPA: Learning Geometric Equivariance in Satellite Remote Sensing Data with a Predictive Architecture

Authors: Erik Scheurer, Rocco Sedona, Stefan Kesselheim, Gabriele Cavallaro

Abstract: Geospatial foundation models provide precomputed embeddings that serve as compact feature vectors for large-scale satellite remote sensing data. While these embeddings can reduce data-transfer bottlenecks and computational costs, Earth observation (EO) applications can still face geometric mismatches between user-defined areas of interest and the fixed precomputed embedding grid. Standard latent-space interpolation is unreliable in this setting because the embedding manifold is highly non-convex, yielding representations that do not correspond to realistic inputs. We verify this using Prithvi-EO-2.0 to understand the shortcomings of interpolation applied to patch embeddings. As a substitute, we propose a Learned Equivariance-Predicting Architecture (LEPA). Instead of averaging vectors, LEPA conditions a predictor on geometric augmentations to directly predict the transformed embedding. We evaluate LEPA on NASA/USGS Harmonized Landsat-Sentinel (HLS) imagery and ImageNet-1k. Experiments show that standard interpolation achieves a mean reciprocal rank (MRR) below 0.2, whereas LEPA increases MRR to over 0.8, enabling accurate geometric adjustment without re-encoding.

cross Learning When to Cooperate Under Heterogeneous Goals

Authors: Max Taylor-Davies, Neil Bramley, Christopher G. Lucas

Abstract: A significant element of human cooperative intelligence lies in our ability to identify opportunities for fruitful collaboration; and conversely to recognise when the task at hand is better pursued alone. Research on flexible cooperation in machines has left this meta-level problem largely unexplored, despite its importance for successful collaboration in heterogeneous open-ended environments. Here, we extend the typical Ad Hoc Teamwork (AHT) setting to incorporate the idea of agents having heterogeneous goals that in any given scenario may or may not overlap. We introduce a novel approach to learning policies in this setting, based on a hierarchical combination of imitation and reinforcement learning, and show that it outperforms baseline methods across extended versions of two cooperative environments. We also investigate the contribution of an auxiliary component that learns to model teammates by predicting their actions, finding that its effect on performance is inversely related to the amount of observable information about teammate goals.

cross Kinematics-Aware Latent World Models for Data-Efficient Autonomous Driving

Authors: Jiazhuo Li, Linjiang Cao, Qi Liu, Xi Xiong

Abstract: Data-efficient learning remains a central challenge in autonomous driving due to the high cost and safety risks of large-scale real-world interaction. Although world-model-based reinforcement learning enables policy optimization through latent imagination, existing approaches often lack explicit mechanisms to encode spatial and kinematic structure essential for driving tasks. In this work, we build upon the Recurrent State-Space Model (RSSM) and propose a kinematics-aware latent world model framework for autonomous driving. Vehicle kinematic information is incorporated into the observation encoder to ground latent transitions in physically meaningful motion dynamics, while geometry-aware supervision regularizes the RSSM latent state to capture task-relevant spatial structure beyond pixel reconstruction. The resulting structured latent dynamics improve long-horizon imagination fidelity and stabilize policy optimization. Experiments in a driving simulation benchmark demonstrate consistent gains over both model-free and pixel-based world-model baselines in terms of sample efficiency and driving performance. Ablation studies further verify that the proposed design enhances spatial representation quality within the latent space. These results suggest that integrating kinematic grounding into RSSM-based world models provides a scalable and physically grounded paradigm for autonomous driving policy learning.

cross Do Deployment Constraints Make LLMs Hallucinate Citations? An Empirical Study across Four Models and Five Prompting Regimes

Authors: Chen Zhao, Yuan Tang, Yitian Qian

Abstract: LLMs are increasingly used to draft academic text and to support software engineering (SE) evidence synthesis, but they often hallucinate bibliographic references that look legitimate. We study how deployment-motivated prompting constraints affect citation verifiability in a closed-book setting. Using 144 claims (24 in SE&CS) and a deterministic verification pipeline (Crossref + Semantic Scholar), we evaluate two proprietary models (Claude Sonnet, GPT-4o) and two open-weight models (LLaMA~3.1-8B, Qwen~2.5-14B) across five regimes: Baseline, Temporal (publication-year window), Survey-style breadth, Non-Disclosure policy, and their combination. Across 17,443 generated citations, no model exceeds a citation-level existence rate of 0.475; Temporal and Combo conditions produce the steepest drops while outputs remain format-compliant (well-formed bibliographic fields). Unresolved outcomes dominate (36-61%); a 100-citation audit indicates that a substantial fraction of Unresolved cases are fabricated. Results motivate post-hoc citation verification before LLM outputs enter SE literature reviews or tooling pipelines.

cross MAviS: A Multimodal Conversational Assistant For Avian Species

Authors: Yevheniia Kryklyvets, Mohammed Irfan Kurpath, Sahal Shaji Mullappilly, Jinxing Zhou, Fahad Shabzan Khan, Rao Anwer, Salman Khan, Hisham Cholakkal

Abstract: Fine-grained understanding and species-specific multimodal question answering are vital for advancing biodiversity conservation and ecological monitoring. However, existing multimodal large language models face challenges when it comes to specialized topics like avian species, making it harder to provide accurate and contextually relevant information in these areas. To address this limitation, we introduce the MAviS-Dataset, a large-scale multimodal avian species dataset that integrates image, audio, and text modalities for over 1,000 bird species, comprising both pretraining and instruction-tuning subsets enriched with structured question-answer pairs. Building on the MAviS-Dataset, we introduce MAviS-Chat, a multimodal LLM that supports audio, vision, and text and is designed for fine-grained species understanding, multimodal question answering, and scene-specific description generation. Finally, for quantitative evaluation, we present MAviS-Bench, a benchmark of over 25,000 QA pairs designed to assess avian species-specific perceptual and reasoning abilities across modalities. Experimental results show that MAviS-Chat outperforms the baseline MiniCPM-o-2.6 by a large margin, achieving state-of-the-art open-source results and demonstrating the effectiveness of our instruction-tuned MAviS-Dataset. Our findings highlight the necessity of domain-adaptive multimodal LLMs for ecological applications.

cross Spectral Discovery of Continuous Symmetries via Generalized Fourier Transforms

Authors: Pavan Karjol, Kumar Shubham, Prathosh AP

Abstract: Continuous symmetries are fundamental to many scientific and learning problems, yet they are often unknown a priori. Existing symmetry discovery approaches typically search directly in the space of transformation generators or rely on learned augmentation schemes. We propose a fundamentally different perspective based on spectral structure. We introduce a framework for discovering continuous one-parameter subgroups using the Generalized Fourier Transform (GFT). Our central observation is that invariance to a subgroup induces structured sparsity in the spectral decomposition of a function across irreducible representations. Instead of optimizing over generators, we detect symmetries by identifying this induced sparsity pattern in the spectral domain. We develop symmetry detection procedures on maximal tori, where the GFT reduces to multi-dimensional Fourier analysis through their irreducible representations. Across structured tasks, including the double pendulum and top quark tagging, we demonstrate that spectral sparsity reliably reveals one-parameter symmetries. These results position spectral analysis as a principled and interpretable alternative to generator-based symmetry discovery.

cross Adversarial Latent-State Training for Robust Policies in Partially Observable Domains

Authors: Angad Singh Ahuja

Abstract: Robustness under latent distribution shift remains challenging in partially observable reinforcement learning. We formalize a focused setting where an adversary selects a hidden initial latent distribution before the episode, termed an adversarial latent-initial-state POMDP. Theoretically, we prove a latent minimax principle, characterize worst-case defender distributions, and derive approximate best-response inequalities with finite-sample concentration bounds that make the optimization and sampling terms explicit. Empirically, using a Battleship benchmark, we demonstrate that targeted exposure to shifted latent distributions reduces average robustness gaps between Spread and Uniform distributions from 10.3 to 3.1 shots at equal budget. Furthermore, iterative best-response training exhibits budget-sensitive behavior that is qualitatively consistent with the theorem-guided diagnostics once one accounts for discounted PPO surrogates and finite-sample noise. Ultimately, we show that for latent-initial-state problems, the framework yields a clean evaluation game and useful theorem-motivated diagnostics while also making clear where implementation-level surrogates and optimization limits enter.

cross Norm-Hierarchy Transitions in Representation Learning: When and Why Neural Networks Abandon Shortcuts

Authors: Truong Xuan Khanh, Truong Quynh Hoa

Abstract: Neural networks often rely on spurious shortcuts for many epochs before discovering structured representations. However, the mechanism governing when this transition occurs and whether its timing can be predicted remains unclear. Prior work shows that gradient descent converges to low norm solutions and that neural networks exhibit simplicity bias, but neither explains the timescale of the transition from shortcut features to structured representations. We introduce the Norm-Hierarchy Transition (NHT) framework, which explains delayed representation learning as the slow traversal of a hierarchy of parameter norms during regularized optimization. When multiple interpolating solutions exist with different norms, weight decay gradually moves the model from high norm shortcut solutions toward lower norm structured representations. We derive a tight bound showing that the transition delay grows logarithmically with the ratio between shortcut and structured norms. Experiments on modular arithmetic, CIFAR-10 with spurious features, CelebA, and Waterbirds support the predictions of the framework. The results suggest that grokking, shortcut learning, and delayed feature discovery arise from a common mechanism based on norm hierarchy traversal during training.

cross Agora: Teaching the Skill of Consensus-Finding with AI Personas Grounded in Human Voice

Authors: Suyash Fulay, Prerna Ravi, Emily Kubin, Shrestha Mohanty, Michiel Bakker, Deb Roy

Abstract: Deliberative democratic theory suggests that civic competence: the capacity to navigate disagreement, weigh competing values, and arrive at collective decisions is not innate but developed through practice. Yet opportunities to cultivate these skills remain limited, as traditional deliberative processes like citizens' assemblies reach only a small fraction of the population. We present Agora, an early-stage AI-powered platform that uses LLMs to organize authentic human voices on policy issues, helping users build consensus-finding skills by proposing and revising policy recommendations, hearing supporting and opposing perspectives, and receiving feedback on how policy changes affect predicted support. In a preliminary study with 44 university students, participants using the full interface (with access to voice explanations) reported higher levels of problem-solving skills, internal deliberation, and produced higher quality consensus statements compared to a control condition showing only aggregate support distributions. These initial findings point toward a promising direction for scaling civic education.

cross Learning Concept Bottleneck Models from Mechanistic Explanations

Authors: Antonio De Santis, Schrasing Tong, Marco Brambilla, Lalana Kagal

Abstract: Concept Bottleneck Models (CBMs) aim for ante-hoc interpretability by learning a bottleneck layer that predicts interpretable concepts before the decision. State-of-the-art approaches typically select which concepts to learn via human specification, open knowledge graphs, prompting an LLM, or using general CLIP concepts. However, concepts defined a-priori may not have sufficient predictive power for the task or even be learnable from the available data. As a result, these CBMs often significantly trail their black-box counterpart when controlling for information leakage. To address this, we introduce a novel CBM pipeline named Mechanistic CBM (M-CBM), which builds the bottleneck directly from a black-box model's own learned concepts. These concepts are extracted via Sparse Autoencoders (SAEs) and subsequently named and annotated on a selected subset of images using a Multimodal LLM. For fair comparison and leakage control, we also introduce the Number of Contributing Concepts (NCC), a decision-level sparsity metric that extends the recently proposed NEC metric. Across diverse datasets, we show that M-CBMs consistently surpass prior CBMs at matched sparsity, while improving concept predictions and providing concise explanations. Our code is available at https://github.com/Antonio-Dee/M-CBM.

URLs: https://github.com/Antonio-Dee/M-CBM.

cross AgrI Challenge: A Data-Centric AI Competition for Cross-Team Validation in Agricultural Vision

Authors: Mohammed Brahimi, Karim Laabassi, Mohamed Seghir Hadj Ameur, Aicha Boutorh, Badia Siab-Farsi, Amin Khouani, Omar Farouk Zouak, Seif Eddine Bouziane, Kheira Lakhdari, Abdelkader Nabil Benghanem

Abstract: Machine learning models in agricultural vision often achieve high accuracy on curated datasets but fail to generalize under real field conditions due to distribution shifts between training and deployment environments. Moreover, most machine learning competitions focus primarily on model design while treating datasets as fixed resources, leaving the role of data collection practices in model generalization largely unexplored. We introduce the AgrI Challenge, a data-centric competition framework in which multiple teams independently collect field datasets, producing a heterogeneous multi-source benchmark that reflects realistic variability in acquisition conditions. To systematically evaluate cross-domain generalization across independently collected datasets, we propose Cross-Team Validation (CTV), an evaluation paradigm that treats each team's dataset as a distinct domain. CTV includes two complementary protocols: Train-on-One-Team-Only (TOTO), which measures single-source generalization, and Leave-One-Team-Out (LOTO), which evaluates collaborative multi-source training. Experiments reveal substantial generalization gaps under single-source training: models achieve near-perfect validation accuracy yet exhibit validation-test gaps of up to 16.20% (DenseNet121) and 11.37% (Swin Transformer) when evaluated on datasets collected by other teams. In contrast, collaborative multi-source training dramatically improves robustness, reducing the gap to 2.82% and 1.78%, respectively. The challenge also produced a publicly available dataset of 50,673 field images of six tree species collected by twelve independent teams, providing a diverse benchmark for studying domain shift and data-centric learning in agricultural vision.

cross Latent Generative Models with Tunable Complexity for Compressed Sensing and other Inverse Problems

Authors: Sean Gunn, Jorio Cocola, Oliver De Candido, Vaggos Chatziafratis, Paul Hand

Abstract: Generative models have emerged as powerful priors for solving inverse problems. These models typically represent a class of natural signals using a single fixed complexity or dimensionality. This can be limiting: depending on the problem, a fixed complexity may result in high representation error if too small, or overfitting to noise if too large. We develop tunable-complexity priors for diffusion models, normalizing flows, and variational autoencoders, leveraging nested dropout. Across tasks including compressed sensing, inpainting, denoising, and phase retrieval, we show empirically that tunable priors consistently achieve lower reconstruction errors than fixed-complexity baselines. In the linear denoising setting, we provide a theoretical analysis that explicitly characterizes how the optimal tuning parameter depends on noise and model structure. This work demonstrates the potential of tunable-complexity generative priors and motivates both the development of supporting theory and their application across a wide range of inverse problems.

cross Scaling Laws in the Tiny Regime: How Small Models Change Their Mistakes

Authors: Mohammed Alnemari, Rizwan Qureshi, Nader Begrazadah

Abstract: Neural scaling laws describe how model performance improves as a power law with size, but existing work focuses on models above 100M parameters. The sub-20M regime -- where TinyML and edge AI operate -- remains unexamined. We train 90 models (22K--19.8M parameters) across two architectures (plain ConvNet, MobileNetV2) on CIFAR-100, varying width while holding depth and training fixed. Both follow approximate power laws in error rate: $\alpha = 0.156 \pm 0.002$ (ScaleCNN) and $\alpha = 0.106 \pm 0.001$ (MobileNetV2) across five seeds. Since prior work fit cross-entropy loss rather than error rate, direct exponent comparison is approximate; with that caveat, these are 1.4--2x steeper than $\alpha \approx 0.076$ for large language models. The power law does not hold uniformly: local exponents decay with scale, and MobileNetV2 saturates at 19.8M parameters ($\alpha_{\mathrm{local}} = 0.006$). Error structure also changes. Jaccard overlap between error sets of the smallest and largest ScaleCNN is only 0.35 (25 seed pairs, $\pm 0.004$) -- compression changes which inputs are misclassified, not merely how many. Small models concentrate capacity on easy classes (Gini: 0.26 at 22K vs. 0.09 at 4.7M) while abandoning the hardest (bottom-5 accuracy: 10% vs. 53%). Counter to expectation, the smallest models are best calibrated (ECE = 0.013 vs. peak 0.110 at mid-size). Aggregate accuracy is therefore misleading for edge deployment; validation must happen at the target model size.

cross Position: LLMs Must Use Functor-Based and RAG-Driven Bias Mitigation for Fairness

Authors: Ravi Ranjan, Utkarsh Grover, Agorista Polyzou

Abstract: Biases in large language models (LLMs) often manifest as systematic distortions in associations between demographic attributes and professional or social roles, reinforcing harmful stereotypes across gender, ethnicity, and geography. This position paper advocates for addressing demographic and gender biases in LLMs through a dual-pronged methodology, integrating category-theoretic transformations and retrieval-augmented generation (RAG). Category theory provides a rigorous, structure-preserving mathematical framework that maps biased semantic domains to unbiased canonical forms via functors, ensuring bias elimination while preserving semantic integrity. Complementing this, RAG dynamically injects diverse, up-to-date external knowledge during inference, directly countering ingrained biases within model parameters. By combining structural debiasing through functor-based mappings and contextual grounding via RAG, we outline a comprehensive framework capable of delivering equitable and fair model outputs. Our synthesis of the current literature validates the efficacy of each approach individually, while addressing potential critiques demonstrates the robustness of this integrated strategy. Ensuring fairness in LLMs, therefore, demands both the mathematical rigor of category-theoretic transformations and the adaptability of retrieval augmentation.

cross ConfHit: Conformal Generative Design with Oracle Free Guarantees

Authors: Siddhartha Laghuvarapu, Ying Jin, Jimeng Sun

Abstract: The success of deep generative models in scientific discovery requires not only the ability to generate novel candidates but also reliable guarantees that these candidates indeed satisfy desired properties. Recent conformal-prediction methods offer a path to such guarantees, but its application to generative modeling in drug discovery is limited by budget constraints, lack of oracle access, and distribution shift. To this end, we introduce ConfHit, a distribution-free framework that provides validity guarantees under these conditions. ConfHit formalizes two central questions: (i) Certification: whether a generated batch can be guaranteed to contain at least one hit with a user-specified confidence level, and (ii) Design: whether the generation can be refined to a compact set without weakening this guarantee. ConfHit leverages weighted exchangeability between historical and generated samples to eliminate the need for an experimental oracle, constructs multiple-sample density-ratio weighted conformal p-value to quantify statistical confidence in hits, and proposes a nested testing procedure to certify and refine candidate sets of multiple generated samples while maintaining statistical guarantees. Across representative generative molecule design tasks and a broad range of methods, ConfHit consistently delivers valid coverage guarantees at multiple confidence levels while maintaining compact certified sets, establishing a principled and reliable framework for generative modeling.

cross Domain-Specific Quality Estimation for Machine Translation in Low-Resource Scenarios

Authors: Namrata Patil Gurav, Akashdeep Ranu, Archchana Sindhujan, Diptesh Kanojia

Abstract: Quality Estimation (QE) is essential for assessing machine translation quality in reference-less settings, particularly for domain-specific and low-resource language scenarios. In this paper, we investigate sentence-level QE for English to Indic machine translation across four domains (Healthcare, Legal, Tourism, and General) and five language pairs. We systematically compare zero-shot, few-shot, and guideline-anchored prompting across selected closed-weight and open-weight LLMs. Findings indicate that while closed-weight models achieve strong performance via prompting alone, prompt-only approaches remain fragile for open-weight models, especially in high-risk domains. To address this, we adopt ALOPE, a framework for LLM-based QE that uses Low-Rank Adaptation with regression heads attached to selected intermediate Transformer layers. We also extend ALOPE with recently proposed Low-Rank Multiplicative Adaptation (LoRMA). Our results show that intermediate-layer adaptation consistently improves QE performance, with gains in semantically complex domains, indicating a path toward more robust QE in practical scenarios. We release code and domain-specific QE datasets publicly to support further research.

cross Scheduling Parallel Optical Circuit Switches for AI Training

Authors: Kevin Liang, Litao Qiao, Isaac Keslassy, Bill Lin

Abstract: The rapid growth of AI training has dramatically increased datacenter traffic demand and energy consumption, which has motivated renewed interest in optical circuit switches (OCSes) as a high-bandwidth, energy-efficient alternative for AI fabrics. Deploying multiple parallel OCSes is a leading alternative. However, efficiently scheduling time-varying traffic matrices across parallel optical switches with non-negligible reconfiguration delays remains an open challenge. We consider the problem of scheduling a single AI traffic demand matrix $D$ over $s$ parallel OCSes while minimizing the makespan under reconfiguration delay $\delta$. Our algorithm Spectra relies on a three-step approach: Decompose $D$ into a minimal set of weighted permutations; Schedule these permutations across parallel switches using load-aware assignment; then Equalize the imbalanced loads on the switches via controlled permutation splitting. Evaluated on realistic AI training workloads (GPT model and Qwen MoE expert routing) as well as standard benchmarks, Spectra vastly outperforms a baseline based on state-of-the-art algorithms, reducing schedule makespan by an average factor of $1.4\times$ on GPT AI workloads, $1.9\times$ on MoE AI workloads, and $2.4\times$ on standard benchmarks. Further, the makespans achieved by Spectra consistently approach newly derived lower bounds.

cross Sparsity and Out-of-Distribution Generalization

Authors: Scott Aaronson, Lin Lin Lee, Jiawei Li

Abstract: Explaining out-of-distribution generalization has been a central problem in epistemology since Goodman's "grue" puzzle in 1946. Today it's a central problem in machine learning, including AI alignment. Here we propose a principled account of OOD generalization with three main ingredients. First, the world is always presented to experience not as an amorphous mass, but via distinguished features (for example, visual and auditory channels). Second, Occam's Razor favors hypotheses that are "sparse," meaning that they depend on as few features as possible. Third, sparse hypotheses will generalize from a training to a test distribution, provided the two distributions sufficiently overlap on their restrictions to the features that are either actually relevant or hypothesized to be. The two distributions could diverge arbitrarily on other features. We prove a simple theorem that formalizes the above intuitions, generalizing the classic sample complexity bound of Blumer et al. to an OOD context. We then generalize sparse classifiers to subspace juntas, where the ground truth classifier depends solely on a low-dimensional linear subspace of the features.

cross AQuA: Toward Strategic Response Generation for Ambiguous Visual Questions

Authors: Jihyoung Jang, Hyounghun Kim

Abstract: Visual Question Answering (VQA) is a core task for evaluating the capabilities of Vision-Language Models (VLMs). Existing VQA benchmarks primarily feature clear and unambiguous image-question pairs, whereas real-world scenarios often involve varying degrees of ambiguity that require nuanced reasoning and context-appropriate response strategies. Although recent studies have begun to address ambiguity in VQA, they lack (1) a systematic categorization of ambiguity levels and (2) datasets and models that support strategy-aware responses. In this paper, we introduce Ambiguous Visual Question Answering (AQuA), a fine-grained dataset that classifies ambiguous VQA instances into four levels according to the nature and degree of ambiguity, along with the optimal response strategy for each case. Our evaluation of diverse open-source and proprietary VLMs shows that most models fail to adapt their strategy to the ambiguity type, frequently producing overconfident answers rather than seeking clarification or acknowledging uncertainty. To address this challenge, we fine-tune VLMs on AQuA, enabling them to adaptively choose among multiple response strategies, such as directly answering, inferring intent from contextual cues, listing plausible alternatives, or requesting clarification. VLMs trained on AQuA achieve strategic response generation for ambiguous VQA, demonstrating the ability to recognize ambiguity, manage uncertainty, and respond with context-appropriate strategies, while outperforming both open-source and closed-source baselines.

cross Adaptive Capacity Allocation for Vision Language Action Fine-tuning

Authors: Donghoon Kim, Minji Bae, Unghui Nam, Gyeonghun Kim, Suyun Lee, Kyuhong Shim, Byonghyo Shim

Abstract: Vision language action models (VLAs) are increasingly used for Physical AI, but deploying a pre-trained VLA model to unseen environments, embodiments, or tasks still requires adaptation. Parameter-efficient fine-tuning (PEFT), especially LoRA, is common for VLA policies, yet the exposed capacity knob, the rank, does not transfer uniformly: robotics transfer exhibits a higher and task-varying intrinsic rank than language fine-tuning. Small ranks suffice for LLMs (e.g., $r \in \{4, 8\}$), while spectral analyses indicate VLAs may require much larger ranks (e.g., $r \approx 128$) or near-full rank, a mismatch that worsens in multi-task settings. We present LoRA-SP (Select-Prune), a rank-adaptive fine-tuning method that replaces fixed-rank updates with input- and layer-wise capacity. LoRA-SP uses an SVD-style parameterization with a small router whose nonnegative scores act as singular values over a shared vector bank. The active set is chosen by an energy target on the cumulative squared scores $E(k) \ge \eta$, providing a direct link to approximation error via our spectral analysis. During training, $\eta$ concentrates energy on a few directions and teaches the router to rely on fewer vectors while preserving accuracy. This yields compact adapters that reduce cross-task interference and improve generalization. On four real-robot manipulation tasks collected on an unseen AgileX PiPER arm, across two VLA backbones ($\pi_0$ and SmolVLA), LoRA-SP matches or exceeds full fine-tuning with far fewer trainable parameters, and improves multi-task success by up to 31.6% over standard LoRA while remaining robust to rank choice.

cross UnSCAR: Universal, Scalable, Controllable, and Adaptable Image Restoration

Authors: Debabrata Mandal, Soumitri Chattopadhyay, Yujie Wang, Marc Niethammer, Praneeth Chakravarthula

Abstract: Universal image restoration aims to recover clean images from arbitrary real-world degradations using a single inference model. Despite significant progress, existing all-in-one restoration networks do not scale to multiple degradations. As the number of degradations increases, training becomes unstable, models grow excessively large, and performance drops across both seen and unseen domains. In this work, we show that scaling universal restoration is fundamentally limited by interference across degradations during joint learning, leading to catastrophic task forgetting. To address this challenge, we introduce a unified inference pipeline with a multi-branch mixture-of-experts architecture that decomposes restoration knowledge across specialized task-adaptable experts. Our approach enables scalable learning (over sixteen degradations), adapts and generalizes robustly to unseen domains, and supports user-controllable restoration across degradations. Beyond achieving superior performance across benchmarks, this work establishes a new design paradigm for scalable and controllable universal image restoration.

cross Machine Learning for the Internet of Underwater Things: From Fundamentals to Implementation

Authors: Kenechi Omeke, Attai Abubakar, Michael Mollel, Lei Zhang, Qammer H. Abbasi, Muhammad Ali Imran

Abstract: The Internet of Underwater Things (IoUT) is becoming a critical infrastructure for ocean observation, marine resource management, and climate science. Its development is hindered by severe acoustic attenuation, propagation delays far exceeding those of terrestrial wireless systems, strict energy constraints, and dynamic topologies shaped by ocean currents. Machine learning (ML) has emerged as a key enabler for addressing these limitations, offering data driven mechanisms that enhance performance across all layers of underwater wireless sensor networks. This tutorial survey synthesises ML methodologies supervised, unsupervised, reinforcement, and deep learning specifically contextualised for underwater communication environments. It outlines the algorithmic principles of each paradigm and examines the conditions under which particular approaches deliver superior performance. A layer wise analysis highlights physical layer gains in localisation and channel estimation, MAC layer adaptations that improve channel utilisation, network layer routing strategies that extend operational lifetime, and transport layer mechanisms capable of reducing packet loss by up to 91 percent. At the application layer, ML enables substantial data compression and object detection accuracies reaching 92 percent. Drawing on 300 studies from 2012 to 2025, the survey documents energy efficiency gains of 7 to 29 times, throughput improvements over traditional protocols, and cross layer optimisation benefits of up to 42 percent. It also identifies persistent barriers, including limited datasets, computational constraints, and the gap between theoretical models and real world deployment. The survey concludes with emerging research directions and a technology roadmap supporting ML adoption in operational underwater networks.

cross Context Channel Capacity: An Information-Theoretic Framework for Understanding Catastrophic Forgetting

Authors: Ran Cheng

Abstract: Catastrophic forgetting remains a central challenge in continual learning (CL), yet lacks a unified information-theoretic explanation for why some architectures forget catastrophically while others do not. We introduce \emph{Context Channel Capacity} ($C_\mathrm{ctx}$), the mutual information between a CL architecture's context signal and its generated parameters, and prove that zero forgetting requires $C_\mathrm{ctx} \geq H(T)$, where $H(T)$ is the task identity entropy. We establish an \emph{Impossibility Triangle} -- zero forgetting, online learning, and finite parameters cannot be simultaneously satisfied by sequential state-based learners -- and show that conditional regeneration architectures (HyperNetworks) bypass this triangle by redefining parameters as function values rather than states. We validate this framework across 8 CL methods on Split-MNIST (1,130+ experiments over 86 days, 4 seeds each), showing that $C_\mathrm{ctx}$ perfectly predicts forgetting behavior: methods with $C_\mathrm{ctx} = 0$ (NaiveSGD, EWC, SI, LwF, CFlow) exhibit catastrophic forgetting (6--97\%), while methods with $C_\mathrm{ctx} \approx 1$ (HyperNetwork) achieve zero forgetting (98.8\% ACC). We further propose \emph{Wrong-Context Probing} (P5), a practical diagnostic protocol for measuring $C_\mathrm{ctx}$, and extend the framework to CIFAR-10 via a novel \emph{Gradient Context Encoder} that closes the oracle gap from 23.3pp to 0.7pp. A systematic taxonomy of 15+ closed research directions -- including the Hebbian null result (frozen random features outperform learned features), CFlow's $\theta_0$-memorizer phenomenon, and the $S_N$ symmetry barrier to column specialization -- provides the community with precisely diagnosed negative results. Our central design principle: \emph{architecture over algorithm} -- the context pathway must be structurally unbypassable.

cross OrthoFormer: Instrumental Variable Estimation in Transformer Hidden States via Neural Control Functions

Authors: Charles Luo

Abstract: Transformer architectures excel at sequential modeling yet remain fundamentally limited by correlational learning - they capture spurious associations induced by latent confounders rather than invariant causal mechanisms. We identify this as an epistemological challenge: standard Transformers conflate static background factors (intrinsic identity, style, context) with dynamic causal flows (state evolution, mechanism), leading to catastrophic out-of-distribution failure. We propose OrthoFormer, a causally grounded architecture that embeds instrumental variable estimation directly into Transformer blocks via neural control functions. Our framework rests on four theoretical pillars: Structural Directionality (time-arrow enforcement), Representation Orthogonality (latent-noise separation), Causal Sparsity (Markov Blanket approximation), and End-to-End Consistency (gradient- detached stage separation). We prove that OrthoFormer achieves bias strictly less than OLS for any valid instrument lag, with residual bias decaying geometrically as O(\r{ho}k ). We characterize the bias-variance-exogeneity trilemma inherent in self-instrumenting and identify the neural forbidden regression - where removing gradient detachment improves prediction loss while destroying causal validity. Experiments confirm all theoretical predictions. OrthoFormer represents a paradigm shift from correlational to causal sequence modeling, with implications for robustness, interpretability, and reliable decision-making under distribution shift.

cross Dial: A Knowledge-Grounded Dialect-Specific NL2SQL System

Authors: Xiang Zhang, Hongming Xu, Le Zhou, Wei Zhou, Xuanhe Zhou, Guoliang Li, Yuyu Luo, Changdong Liu, Guorun Chen, Jiang Liao, Fan Wu

Abstract: Enterprises commonly deploy heterogeneous database systems, each of which owns a distinct SQL dialect with different syntax rules, built-in functions, and execution constraints. However, most existing NL2SQL methods assume a single dialect (e.g., SQLite) and struggle to produce queries that are both semantically correct and executable on target engines. Prompt-based approaches tightly couple intent reasoning with dialect syntax, rule-based translators often degrade native operators into generic constructs, and multi-dialect fine-tuning suffers from cross-dialect interference. In this paper, we present Dial, a knowledge-grounded framework for dialect-specific NL2SQL. Dial introduces: (1) a Dialect-Aware Logical Query Planning module that converts natural language into a dialect-aware logical query plan via operator-level intent decomposition and divergence-aware specification; (2) HINT-KB, a hierarchical intent-aware knowledge base that organizes dialect knowledge into (i) a canonical syntax reference, (ii) a declarative function repository, and (iii) a procedural constraint repository; and (3) an execution-driven debugging and semantic verification loop that separates syntactic recovery from logic auditing to prevent semantic drift. We construct DS-NL2SQL, a benchmark covering six major database systems with 2,218 dialect-specific test cases. Experimental results show that Dial consistently improves translation accuracy by 10.25% and dialect feature coverage by 15.77% over state-of-the-art baselines. The code is at https://github.com/weAIDB/Dial.

URLs: https://github.com/weAIDB/Dial.

cross Backdoor4Good: Benchmarking Beneficial Uses of Backdoors in LLMs

Authors: Yige Li, Wei Zhao, Zhe Li, Nay Myat Min, Hanxun Huang, Yunhan Zhao, Xingjun Ma, Yu-Gang Jiang, Jun Sun

Abstract: Backdoor mechanisms have traditionally been studied as security threats that compromise the integrity of machine learning models. However, the same mechanism -- the conditional activation of specific behaviors through input triggers -- can also serve as a controllable and auditable interface for trustworthy model behavior. In this work, we present \textbf{Backdoor4Good (B4G)}, a unified benchmark and framework for \textit{beneficial backdoor} applications in large language models (LLMs). Unlike conventional backdoor studies focused on attacks and defenses, B4G repurposes backdoor conditioning for Beneficial Tasks that enhance safety, controllability, and accountability. It formalizes beneficial backdoor learning under a triplet formulation $(T, A, U)$, representing the \emph{Trigger}, \emph{Activation mechanism}, and \emph{Utility function}, and implements a benchmark covering four trust-centric applications. Through extensive experiments across Llama3.1-8B, Gemma-2-9B, Qwen2.5-7B, and Llama2-13B, we show that beneficial backdoors can achieve high controllability, tamper-resistance, and stealthiness while preserving clean-task performance. Our findings demonstrate new insights that backdoors need not be inherently malicious; when properly designed, they can serve as modular, interpretable, and beneficial building blocks for trustworthy AI systems. Our code and datasets are available at https://github.com/bboylyg/BackdoorLLM/B4G.

URLs: https://github.com/bboylyg/BackdoorLLM/B4G.

cross Image Generation Models: A Technical History

Authors: Rouzbeh Shirvani

Abstract: Image generation has advanced rapidly over the past decade, yet the literature seems fragmented across different models and application domains. This paper aims to offer a comprehensive survey of breakthrough image generation models, including variational autoencoders (VAEs), generative adversarial networks (GANs), normalizing flows, autoregressive and transformer-based generators, and diffusion-based methods. We provide a detailed technical walkthrough of each model type, including their underlying objectives, architectural building blocks, and algorithmic training steps. For each model type, we present the optimization techniques as well as common failure modes and limitations. We also go over recent developments in video generation and present the research works that made it possible to go from still frames to high quality videos. Lastly, we cover the growing importance of robustness and responsible deployment of these models, including deepfake risks, detection, artifacts, and watermarking.

cross "Better Ask for Forgiveness than Permission": Practices and Policies of AI Disclosure in Freelance Work

Authors: Angel Hsing-Chi Hwang, Senya Wong, Baixiao Chen, Jessica He, Hyo Jin Do

Abstract: The growing use of AI applications among freelance workers is reshaping trust and relationships with clients. This paper investigates how both workers and clients perceive AI use and disclosure in the freelance economy through a three-stage study: interviews with workers and two survey studies with workers and clients. Findings first reveal a key expectation gap around disclosure: Workers often adopt passive disclosure practices, revealing AI use only when asked, as they assume clients can already detect it. Clients, however, are far less confident in recognizing AI-assisted work and prefer proactive disclosure. A second finding highlights the role of unclear or absent client AI policies, which leave workers consistently misinterpreting clients' expectations for AI use and disclosure. Together, these gaps point to the need for clearer guidelines and practices for AI disclosure. Insights extend beyond freelancing, offering implications for trust, accountability, and policy design in other AI-mediated work domains.

cross Where Do LLM-based Systems Break? A System-Level Security Framework for Risk Assessment and Treatment

Authors: Neha Nagaraja, Hayretdin Bahsi

Abstract: Large Language Models (LLMs) are increasingly integrated into safety-critical workflows, yet existing security analyses remain fragmented and often isolate model behavior from the broader system context. This work introduces a goal-driven risk assessment framework for LLM-powered systems that combines system modeling with Attack-Defense Trees (ADTrees) and Common Vulnerability Scoring System (CVSS)-based exploitability scoring to support structured, comparable analysis. We demonstrate the framework through a healthcare case study, modeling multi-step attack paths targeting intervention in medical procedures, leakage of electronic health record (EHR) data, and disruption of service availability. Our analysis indicates that threats spanning (i) conventional cyber, (ii) adversarial ML, and (iii) conversational attacks that manipulate prompts or context often consolidate into a small number of dominant paths and shared system choke points, enabling targeted defenses to yield meaningful reductions in path exploitability. By systematically comparing defense portfolios, we align these risks with established vulnerability management practices and provide a domain-agnostic workflow applicable to other LLM-enabled critical systems.

cross The Dual-Stream Transformer: Channelized Architecture for Interpretable Language Modeling

Authors: J. Clayton Kerce, Alexis Fox

Abstract: Standard transformers entangle all computation in a single residual stream, obscuring which components perform which functions. We introduce the Dual-Stream Transformer, which decomposes the residual stream into two functionally distinct components: a token stream updated by attention and a context stream updated by feed-forward networks. Information flow between attention heads is controlled through a hierarchy of mixing strategies, from fully independent (maximum interpretability) to dense (standard transformer behavior). This design exposes a tunable tradeoff between interpretability and performance. We measure this tradeoff on language modeling tasks at 29M parameters. Fully independent head mixing increases validation loss by 8\% relative to dense baselines. The recommended Kronecker mixing strategy, which permits scalar communication between heads while preserving within-head structure, costs only 2.5\%. All configurations maintain functional generation under attention amplification (scaling logits by factors up to 16 at inference time), with degradation ranging from 16\% to 27\%. This robustness suggests the architectures learn discrete algorithms that operate independently of soft probabilistic mixing. The architecture provides a foundation for interpretable language models where internal structure is exposed by design. \footnote{This work was partially supported by DARPA Contract HR001125C0302.}

cross Towards Lightweight Adaptation of Speech Enhancement Models in Real-World Environments

Authors: Longbiao Cheng, Shih-Chii Liu

Abstract: Recent studies have shown that post-deployment adaptation can improve the robustness of speech enhancement models in unseen noise conditions. However, existing methods often incur prohibitive computational and memory costs, limiting their suitability for on-device deployment. In this work, we investigate model adaptation in realistic settings with dynamic acoustic scene changes and propose a lightweight framework that augments a frozen backbone with low-rank adapters updated via self-supervised training. Experiments on sequential scene evaluations spanning 111 environments across 37 noise types and three signal-to-noise ratio ranges, including the challenging [-8, 0] dB range, show that our method updates fewer than 1% of the base model's parameters while achieving an average 1.51 dB SI-SDR improvement within only 20 updates per scene. Compared to state-of-the-art approaches, our framework achieves competitive or superior perceptual quality with smoother and more stable convergence, demonstrating its practicality for lightweight on-device adaptation of speech enhancement models under real-world acoustic conditions.

cross Contact-Guided 3D Genome Structure Generation of E. coli via Diffusion Transformers

Authors: Mingxin Zhang, Xiaofeng Dai, Yu Yao, Ziqi Yin

Abstract: In this study, we present a conditional diffusion-transformer framework for generating ensembles of three-dimensional Escherichia coli genome conformations guided by Hi-C contact maps. Instead of producing a single deterministic structure, we formulate genome reconstruction as a conditional generative modeling problem that samples heterogeneous conformations whose ensemble-averaged contacts are consistent with the input Hi-C data. A synthetic dataset is constructed using coarse-grained molecular dynamics simulations to generate chromatin ensembles and corresponding Hi-C maps under circular topology. Our models operate in a latent diffusion setting with a variational autoencoder that preserves per-bin alignment and supports replication-aware representations. Hi-C information is injected through a transformer-based encoder and cross-attention, enforcing a physically interpretable one-way constraint from Hi-C to structure. The model is trained using a flow-matching objective for stable optimization. On held-out ensembles, generated structures reproduce the input Hi-C distance-decay and structural correlation metrics while maintaining substantial conformational diversity, demonstrating the effectiveness of diffusion-based generative modeling for ensemble-level 3D genome reconstruction.

cross Give Them an Inch and They Will Take a Mile:Understanding and Measuring Caller Identity Confusion in MCP-Based AI Systems

Authors: Yuhang Huang, Boyang Ma, Biwei Yan, Xuelong Dai, Yechao Zhang, Minghui Xu, Kaidi Xu, Yue Zhang

Abstract: The Model Context Protocol (MCP) is an open and standardized interface that enables large language models (LLMs) to interact with external tools and services, and is increasingly adopted by AI agents. However, the security of MCP-based systems remains largely unexplored.In this work, we conduct a large-scale security analysis of MCP servers integrated within MCP clients. We show that treating MCP servers as trusted entities without authenticating the caller identity is fundamentally insecure. Since MCP servers often cannot distinguish who is invoking a request, a single authorization decision may implicitly grant access to multiple, potentially untrusted callers.Our empirical study reveals that most MCP servers rely on persistent authorization states, allowing tool invocations after an initial authorization without re-authentication, regardless of the caller. In addition, many MCP servers fail to enforce authentication at the per-tool level, enabling unauthorized access to sensitive operations.These findings demonstrate that one-time authorization and server-level trust significantly expand the attack surface of MCP-based systems, highlighting the need for explicit caller authentication and fine-grained authorization mechanisms.

cross Cross-Modal Taxonomic Generalization in (Vision-) Language Models

Authors: Tianyang Xu, Marcelo Sandoval-Castaneda, Karen Livescu, Greg Shakhnarovich, Kanishka Misra

Abstract: What is the interplay between semantic representations learned by language models (LM) from surface form alone to those learned from more grounded evidence? We study this question for a scenario where part of the input comes from a different modality -- in our case, in a vision-language model (VLM), where a pretrained LM is aligned with a pretrained image encoder. As a case study, we focus on the task of predicting hypernyms of objects represented in images. We do so in a VLM setup where the image encoder and LM are kept frozen, and only the intermediate mappings are learned. We progressively deprive the VLM of explicit evidence for hypernyms, and test whether knowledge of hypernyms is recoverable from the LM. We find that the LMs we study can recover this knowledge and generalize even in the most extreme version of this experiment (when the model receives no evidence of a hypernym during training). Additional experiments suggest that this cross-modal taxonomic generalization persists under counterfactual image-label mappings only when the counterfactual data have high visual similarity within each category. Taken together, these findings suggest that cross-modal generalization in LMs arises as a result of both coherence in the extralinguistic input and knowledge derived from language cues.

cross Interpretable-by-Design Transformers via Architectural Stream Independence

Authors: Clayton Kerce, Alexis Fox

Abstract: While transformers achieve strong performance, their internal decision-making processes remain opaque. We investigate whether architectural constraints can enforce interpretability by design through architectural stream independence: maintaining a token stream (carrying symbolic structure) and contextual semantics in separated streams that remain independently observable throughout processing, with integration delayed until output. We validate this principle through the Late Fusion Architecture (LFA), which demonstrates interpretable symbolic heads through all the final layers, while standard transformers show dissolution by the third of six layers; we quantify this effect by introducing the Token-Position Dependence Score (PDS), with $PDS_{max}$ = 0.276 and 0.058, respectively. Crucially, intervention experiments demonstrate functional modularity: suppressing LFA's recency heads causes minimal semantic damage (Cohen's d = -0.158) versus catastrophic entanglement in baselines (d = -0.672). LFA demonstrates that architectural constraints improve underlying learning mechanisms, averaging 42% stability versus 19% and 11% for baseline comparisons, with extremes from 50% on LFA's best pairs (6 of 12 heads position-invariant) down to 0% complete collapse in over-constrained cases. By preventing premature entanglement, architectural independence steers models toward semantic understanding over positional heuristics, establishing interpretability as an architectural design criterion enforceable through structural constraints rather than post-hoc analysis.

cross A Joint Neural Baseline for Concept, Assertion, and Relation Extraction from Clinical Text

Authors: Fei Cheng, Ribeka Tanaka, Sadao Kurohashi

Abstract: Clinical information extraction (e.g., 2010 i2b2/VA challenge) usually presents tasks of concept recognition, assertion classification, and relation extraction. Jointly modeling the multi-stage tasks in the clinical domain is an underexplored topic. The existing independent task setting (reference inputs given in each stage) makes the joint models not directly comparable to the existing pipeline work. To address these issues, we define a joint task setting and propose a novel end-to-end system to jointly optimize three-stage tasks. We empirically investigate the joint evaluation of our proposal and the pipeline baseline with various embedding techniques: word, contextual, and in-domain contextual embeddings. The proposed joint system substantially outperforms the pipeline baseline by +0.3, +1.4, +3.1 for the concept, assertion, and relation F1. This work bridges joint approaches and clinical information extraction. The proposed approach could serve as a strong joint baseline for future research. The code is publicly available.

cross From Thinker to Society: Security in Hierarchical Autonomy Evolution of AI Agents

Authors: Xiaolei Zhang, Lu Zhou, Xiaogang Xu, Jiafei Wu, Tianyu Du, Heqing Huang, Hao Peng, Zhe Liu

Abstract: Artificial Intelligence (AI) agents have evolved from passive predictive tools into active entities capable of autonomous decision-making and environmental interaction, driven by the reasoning capabilities of Large Language Models (LLMs). However, this evolution has introduced critical security vulnerabilities that existing frameworks fail to address. The Hierarchical Autonomy Evolution (HAE) framework organizes agent security into three tiers: Cognitive Autonomy (L1) targets internal reasoning integrity; Execution Autonomy (L2) covers tool-mediated environmental interaction; Collective Autonomy (L3) addresses systemic risks in multi-agent ecosystems. We present a taxonomy of threats spanning cognitive manipulation, physical environment disruption, and multi-agent systemic failures, and evaluate existing defenses while identifying key research gaps. The findings aim to guide the development of multilayered, autonomy-aware defense architectures for trustworthy AI agent systems.

cross SeDa: A Unified System for Dataset Discovery and Multi-Entity Augmented Semantic Exploration

Authors: Kan Ling, Zhen Qin, Yichi Zhu, Hengrun Zhang, Huiqun Yu, Guisheng Fan

Abstract: The continuous expansion of open data platforms and research repositories has led to a fragmented dataset ecosystem, posing significant challenges for cross-source data discovery and interpretation. To address these challenges, we introduce SeDa--a unified framework for dataset discovery, semantic annotation, and multi-entity augmented navigation. SeDa integrates more than 7.6 million datasets from over 200 platforms, spanning governmental, academic, and industrial domains. The framework first performs semantic extraction and standardization to harmonize heterogeneous metadata representations. On this basis, a topic-tagging mechanism constructs an extensible tag graph that supports thematic retrieval and cross-domain association, while a provenance assurance module embedded within the annotation process continuously validates dataset sources and monitors link availability to ensure reliability and traceability. Furthermore, SeDa employs a multi-entity augmented navigation strategy that organizes datasets within a knowledge space of sites, institutions, and enterprises, enabling contextual and provenance-aware exploration beyond traditional search paradigms. Comparative experiments with popular dataset search platforms, such as ChatPD and Google Dataset Search, demonstrate that SeDa achieves superior coverage, timeliness, and traceability. Taken together, SeDa establishes a foundation for trustworthy, semantically enriched, and globally scalable dataset exploration.

cross A Unified View of Drifting and Score-Based Models

Authors: Chieh-Hsin Lai, Bac Nguyen, Naoki Murata, Yuhta Takida, Toshimitsu Uesaka, Yuki Mitsufuji, Stefano Ermon, Molei Tao

Abstract: Drifting models train one-step generators by optimizing a mean-shift discrepancy induced by a kernel between the data and model distributions, with Laplace kernels used by default in practice. At each point, this discrepancy compares the kernel-weighted displacement toward nearby data samples with the corresponding displacement toward nearby model samples, yielding a transport direction for generated samples. In this paper, we make its relationship to the score-matching principle behind diffusion models precise by showing that drifting admits a score-based formulation on kernel-smoothed distributions. For Gaussian kernels, the population mean-shift field coincides with the score difference between the Gaussian-smoothed data and model distributions. This identity follows from Tweedie's formula, which links the score of a Gaussian-smoothed density to the corresponding conditional mean, and implies that Gaussian-kernel drifting is exactly a score-matching-style objective on smoothed distributions. It also clarifies the connection to Distribution Matching Distillation (DMD): both methods use score-mismatch transport directions, but drifting realizes the score signal nonparametrically from kernel neighborhoods, whereas DMD uses a pretrained diffusion teacher. Beyond Gaussians, we derive an exact decomposition for general radial kernels, and for the Laplace kernel we prove rigorous error bounds showing that drifting remains an accurate proxy for score matching in low-temperature and high-dimensional regimes.

cross InterReal: A Unified Physics-Based Imitation Framework for Learning Human-Object Interaction Skills

Authors: Dayang Liang, Yuhang Lin, Xinzhe Liu, Jiyuan Shi, Yunlong Liu, Chenjia Bai

Abstract: Interaction is one of the core abilities of humanoid robots. However, most existing frameworks focus on non-interactive whole-body control, which limits their practical applicability. In this work, we develop InterReal, a unified physics-based imitation learning framework for Real-world human-object Interaction (HOI) control. InterReal enables humanoid robots to track HOI reference motions, facilitating the learning of fine-grained interactive skills and their deployment in real-world settings. Within this framework, we first introduce a HOI motion data augmentation scheme with hand-object contact constraints, and utilize the augmented motions to improve policy stability under object perturbations. Second, we propose an automatic reward learner to address the challenge of large-scale reward shaping. A meta-policy guided by critical tracking error metrics explores and allocates reward signals to the low-level reinforcement learning objective, which enables more effective learning of interactive policies. Experiments on HOI tasks of box-picking and box-pushing demonstrate that InterReal achieves the best tracking accuracy and the highest task success rate compared to recent baselines. Furthermore, we validate the framework on the real-world robot Unitree G1, which demonstrates its practical effectiveness and robustness beyond simulation.

cross SketchGraphNet: A Memory-Efficient Hybrid Graph Transformer for Large-Scale Sketch Corpora Recognition

Authors: Shilong Chen, Mingyuan Li, Zhaoyang Wang, Zhonglin Ye, Haixing Zhao

Abstract: This work investigates large-scale sketch recognition from a graph-native perspective, where free-hand sketches are directly modeled as structured graphs rather than raster images or stroke sequences. We propose SketchGraphNet, a hybrid graph neural architecture that integrates local message passing with a memory-efficient global attention mechanism, without relying on auxiliary positional or structural encodings. To support systematic evaluation, we construct SketchGraph, a large-scale benchmark comprising 3.44 million graph-structured sketches across 344 categories, with two variants (A and R) to reflect different noise conditions. Each sketch is represented as a spatiotemporal graph with normalized stroke-order attributes. On SketchGraph-A and SketchGraph-R, SketchGraphNet achieves Top-1 accuracies of 83.62% and 87.61%, respectively, under a unified training configuration. MemEffAttn further reduces peak GPU memory by over 40% and training time by more than 30% compared with Performer-based global attention, while maintaining comparable accuracy.

cross Neural Dynamics-Informed Pre-trained Framework for Personalized Brain Functional Network Construction

Authors: Hongjie Jiang, Yifei Tang, Shuqiang Wang

Abstract: Brain activity is intrinsically a neural dynamic process constrained by anatomical space. This leads to significant variations in spatial distribution patterns and correlation patterns of neural activity across variable and heterogeneous scenarios. However, dominant brain functional network construction methods, which relies on pre-defined brain atlases and linear assumptions, fails to precisely capture varying neural activity patterns in heterogeneous scenarios. This limits the consistency and generalizability of the brain functional networks constructed by dominant methods. Here, a neural dynamics-informed pre-trained framework is proposed for personalized brain functional network construction. The proposed framework extracts personalized representations of neural activity patterns in heterogeneous scenarios. Personalized brain functional networks are obtained by utilizing these representations to guide brain parcellation and neural activity correlation estimation. Systematic evaluations were employed on 18 datasets across tasks, such as virtual neural modulation and abnormal neural circuit identification. Experimental results demonstrate that the proposed framework attains superior performance in heterogeneous scenarios. Overall, the proposed framework challenges the dominant brain functional network construction method.

cross How Long Can Unified Multimodal Models Generate Images Reliably? Taming Long-Horizon Interleaved Image Generation via Context Curation

Authors: Haoyu Chen, Qing Liu, Yuqian Zhou, He Zhang, Zhaowen Wang, Mengwei Ren, Jingjing Ren, Xiang Wang, Zhe Lin, Lei Zhu

Abstract: Unified multimodal models hold the promise of generating extensive, interleaved narratives, weaving text and imagery into coherent long-form stories. However, current systems suffer from a critical reliability gap: as sequences grow, generation quality rapidly collapses. In this work, we investigate the mechanism behind this failure and argue that it is distinct from standard long-context challenges. We reveal that in generation, accumulated visual history acts as a source of active pollution, a decay governed specifically by the number of image events rather than raw token count. We identify a structural vulnerability where dense visual tokens overwhelm the attention mechanism, creating noise that distorts future synthesis. Guided by these mechanistic insights, we propose UniLongGen, a training-free inference strategy that prioritizes safe conditioning over total recall. Instead of retaining all history, UniLongGen dynamically curates the model's memory, identifying and discarding interfering visual signals based on the model's own internal relevance rankings. Extensive experiments demonstrate that this active forgetting approach is essential for stability: UniLongGen significantly outperforms baselines in long-horizon fidelity and consistency, while simultaneously reducing memory footprint and inference time.

cross DreamSAC: Learning Hamiltonian World Models via Symmetry Exploration

Authors: Jinzhou Tang, Fan Feng, Minghao Fu, Wenjun Lin, Biwei Huang, Keze Wang

Abstract: Learned world models excel at interpolative generalization but fail at extrapolative generalization to novel physical properties. This limitation arises because they learn statistical correlations rather than the environment's underlying generative rules, such as physical invariances and conservation laws. We argue that learning these invariances is key to robust extrapolation. To achieve this, we first introduce \textbf{Symmetry Exploration}, an unsupervised exploration strategy where an agent is intrinsically motivated by a Hamiltonian-based curiosity bonus to actively probe and challenge its understanding of conservation laws, thereby collecting physically informative data. Second, we design a Hamiltonian-based world model that learns from the collected data, using a novel self-supervised contrastive objective to identify the invariant physical state from raw, view-dependent pixel observations. Our framework, \textbf{DreamSAC}, trained on this actively curated data, significantly outperforms state-of-the-art baselines in 3D physics simulations on tasks requiring extrapolation.

cross Learning-free L2-Accented Speech Generation using Phonological Rules

Authors: Thanathai Lertpetchpun, Yoonjeong Lee, Jihwan Lee, Tiantian Feng, Dani Byrd, Shrikanth Narayanan

Abstract: Accent plays a crucial role in speaker identity and inclusivity in speech technologies. Existing accented text-to-speech (TTS) systems either require large-scale accented datasets or lack fine-grained phoneme-level controllability. We propose a accented TTS framework that combines phonological rules with a multilingual TTS model. The rules are applied to phoneme sequences to transform accent at the phoneme level while preserving intelligibility. The method requires no accented training data and enables explicit phoneme-level accent manipulation. We design rule sets for Spanish- and Indian-accented English, modeling systematic differences in consonants, vowels, and syllable structure arising from phonotactic constraints. We analyze the trade-off between phoneme-level duration alignment and accent as realized in speech timing. Experimental results demonstrate effective accent shift while maintaining speech quality.

cross Targeted Speaker Poisoning Framework in Zero-Shot Text-to-Speech

Authors: Thanapat Trachu, Thanathai Lertpetchpun, Sai Praneeth Karimireddy, Shrikanth Narayanan

Abstract: Zero-shot Text-to-Speech (TTS) voice cloning poses severe privacy risks, demanding the removal of specific speaker identities from trained TTS models. Conventional machine unlearning is insufficient in this context, as zero-shot TTS can dynamically reconstruct voices from just reference prompts. We formalize this task as Speech Generation Speaker Poisoning (SGSP), in which we modify trained models to prevent the generation of specific identities while preserving utility for other speakers. We evaluate inference-time filtering and parameter-modification baselines across 1, 15, and 100 forgotten speakers. Performance is assessed through the trade-off between utility (WER) and privacy, quantified using AUC and Forget Speaker Similarity (FSSIM). We achieve strong privacy for up to 15 speakers but reveal scalability limits at 100 speakers due to increased identity overlap. Our study thus introduces a novel problem and evaluation framework toward further advances in generative voice privacy.

cross Nw\=ach\=a Mun\=a: A Devanagari Speech Corpus and Proximal Transfer Benchmark for Nepal Bhasha ASR

Authors: Rishikesh Kumar Sharma, Safal Narshing Shrestha, Jenny Poudel, Rupak Tiwari, Arju Shrestha, Rupak Raj Ghimire, Bal Krishna Bal

Abstract: Nepal Bhasha (Newari), an endangered language of the Kathmandu Valley, remains digitally marginalized due to the severe scarcity of annotated speech resources. In this work, we introduce Nw\=ach\=a Mun\=a, a newly curated 5.39-hour manually transcribed Devanagari speech corpus for Nepal Bhasha, and establish the first benchmark using script-preserving acoustic modeling. We investigate whether proximal cross-lingual transfer from a geographically and linguistically adjacent language (Nepali) can rival large-scale multilingual pretraining in an ultra-low-resource Automatic Speech Recognition (ASR) setting. Fine-tuning a Nepali Conformer model reduces the Character Error Rate (CER) from a 52.54% zero-shot baseline to 17.59% with data augmentation, effectively matching the performance of the multilingual Whisper-Small model despite utilizing significantly fewer parameters. Our findings demonstrate that proximal transfer within South Asian language clusters serves as a computationally efficient alternative to massive multilingual models. We openly release the dataset and benchmarks to digitally enable the Newari community and foster further research in Nepal Bhasha.

cross GRD-Net: Generative-Reconstructive-Discriminative Anomaly Detection with Region of Interest Attention Module

Authors: Niccol\`o Ferrari, Michele Fraccaroli, Evelina Lamma

Abstract: Anomaly detection is nowadays increasingly used in industrial applications and processes. One of the main fields of the appliance is the visual inspection for surface anomaly detection, which aims to spot regions that deviate from regularity and consequently identify abnormal products. Defect localization is a key task, that usually is achieved using a basic comparison between generated image and the original one, implementing some blob-analysis or image-editing algorithms, in the post-processing step, which is very biased towards the source dataset, and they are unable to generalize. Furthermore, in industrial applications, the totality of the image is not always interesting but could be one or some regions of interest (ROIs), where only in those areas there are relevant anomalies to be spotted. For these reasons, we propose a new architecture composed by two blocks. The first block is a Generative Adversarial Network (GAN), based on a residual autoencoder (ResAE), to perform reconstruction and denoising processes, while the second block produces image segmentation, spotting defects. This method learns from a dataset composed of good products and generated synthetic defects. The discriminative network is trained using a ROI for each image contained in the training dataset. The network will learn in which area anomalies are relevant. This approach guarantees the reduction of using pre-processing algorithms, formerly developed with blob-analysis and image-editing procedures. To test our model we used challenging MVTec anomaly detection datasets and an industrial large dataset of pharmaceutical BFS strips of vials. This set constitutes a more realistic use case of the aforementioned network.

cross A Systematic Comparison of Training Objectives for Out-of-Distribution Detection in Image Classification

Authors: Furkan Gen\c{c}, Onat \"Ozdemir, Emre Akba\c{s}

Abstract: Out-of-distribution (OOD) detection is critical in safety-sensitive applications. While this challenge has been addressed from various perspectives, the influence of training objectives on OOD behavior remains comparatively underexplored. In this paper, we present a systematic comparison of four widely used training objectives: Cross-Entropy Loss, Prototype Loss, Triplet Loss, and Average Precision (AP) Loss, spanning probabilistic, prototype-based, metric-learning, and ranking-based supervision, for OOD detection in image classification under standardized OpenOOD protocols. Across CIFAR-10/100 and ImageNet-200, we find that Cross-Entropy Loss, Prototype Loss, and AP Loss achieve comparable in-distribution accuracy, while Cross-Entropy Loss provides the most consistent near- and far-OOD performance overall; the other objectives can be competitive in specific settings.

cross Integration of deep generative Anomaly Detection algorithm in high-speed industrial line

Authors: Niccol\`o Ferrari, Nicola Zanarini, Michele Fraccaroli, Alice Bizzarri, Evelina Lamma

Abstract: Industrial visual inspection in pharmaceutical production requires high accuracy under strict constraints on cycle time, hardware footprint, and operational cost. Manual inline inspection is still common, but it is affected by operator variability and limited throughput. Classical rule-based computer vision pipelines are often rigid and difficult to scale to highly variable production scenarios. To address these limitations, we present a semi-supervised anomaly detection framework based on a generative adversarial architecture with a residual autoencoder and a dense bottleneck, specifically designed for online deployment on a high-speed Blow-Fill-Seal (BFS) line. The model is trained only on nominal samples and detects anomalies through reconstruction residuals, providing both classification and spatial localization via heatmaps. The training set contains 2,815,200 grayscale patches. Experiments on a real industrial test kit show high detection performance while satisfying timing constraints compatible with a 500 ms acquisition slot.

cross SMAT: Staged Multi-Agent Training for Co-Adaptive Exoskeleton Control

Authors: Yifei Yuan, Ghaith Androwis, Xianlian Zhou

Abstract: Effective exoskeleton assistance requires co-adaptation: as the device alters joint dynamics, the user reorganizes neuromuscular coordination, creating a non-stationary learning problem. Most learning-based approaches do not explicitly account for the sequential nature of human motor adaptation, leading to training instability and poorly timed assistance. We propose Staged Multi-Agent Training (SMAT), a four-stage curriculum designed to mirror how users naturally acclimate to a wearable device. In SMAT, a musculoskeletal human actor and a bilateral hip exoskeleton actor are trained progressively: the human first learns unassisted gait, then adapts to the added device mass; the exoskeleton subsequently learns a positive assistance pattern against a stabilized human policy, and finally both agents co-adapt with full torque capacity and bidirectional feedback. We implement SMAT in the MyoAssist simulation environment using a 26-muscle lower-limb model and an attached hip exoskeleton. Our musculoskeletal simulations demonstrate that the learned exoskeleton control policy produces an average 10.1% reduction in hip muscle activation relative to the no-assist condition. We validated the learned controller in an offline setting using open-source gait data, then deployed it to a physical hip exoskeleton for treadmill experiments with five subjects. The resulting policy delivers consistent assistance and predominantly positive mechanical power without the need for any explicitly imposed timing shift (mean positive power: 13.6 W at 6 Nm RMS torque to 23.8 W at 9.3 Nm RMS torque, with minimal negative power) consistently across all subjects without subject-specific retraining.

cross Evaluating Synthetic Data for Baggage Trolley Detection in Airport Logistics

Authors: Abdeldjalil Taibi, Mohmoud Badlis, Amina Bensalem, Belkacem Zouilekh, Mohammed Brahimi

Abstract: Efficient luggage trolley management is critical for reducing congestion and ensuring asset availability in modern airports. Automated detection systems face two main challenges. First, strict security and privacy regulations limit large-scale data collection. Second, existing public datasets lack the diversity, scale, and annotation quality needed to handle dense, overlapping trolley arrangements typical of real-world operations. To address these limitations, we introduce a synthetic data generation pipeline based on a high-fidelity Digital Twin of Algiers International Airport using NVIDIA Omniverse. The pipeline produces richly annotated data with oriented bounding boxes, capturing complex trolley formations, including tightly nested chains. We evaluate YOLO-OBB using five training strategies: real-only, synthetic-only, linear probing, full fine-tuning, and mixed training. This allows us to assess how synthetic data can complement limited real-world annotations. Our results show that mixed training with synthetic data and only 40 percent of real annotations matches or exceeds the full real-data baseline, achieving 0.94 mAP@50 and 0.77 mAP@50-95, while reducing annotation effort by 25 to 35 percent. Multi-seed experiments confirm strong reproducibility with a standard deviation below 0.01 on mAP@50, demonstrating the practical effectiveness of synthetic data for automated trolley detection.

cross AtomicVLA: Unlocking the Potential of Atomic Skill Learning in Robots

Authors: Likui Zhang, Tao Tang, Zhihao Zhan, Xiuwei Chen, Zisheng Chen, Jianhua Han, Jiangtong Zhu, Pei Xu, Hang Xu, Hefeng Wu, Liang Lin, Xiaodan Liang

Abstract: Recent advances in Visual-Language-Action (VLA) models have shown promising potential for robotic manipulation tasks. However, real-world robotic tasks often involve long-horizon, multi-step problem-solving and require generalization for continual skill acquisition, extending beyond single actions or skills. These challenges present significant barriers for existing VLA models, which use monolithic action decoders trained on aggregated data, resulting in poor scalability. To address these challenges, we propose AtomicVLA, a unified planning-and-execution framework that jointly generates task-level plans, atomic skill abstractions, and fine-grained actions. AtomicVLA constructs a scalable atomic skill library through a Skill-Guided Mixture-of-Experts (SG-MoE), where each expert specializes in mastering generic yet precise atomic skills. Furthermore, we introduce a flexible routing encoder that automatically assigns dedicated atomic experts to new skills, enabling continual learning. We validate our approach through extensive experiments. In simulation, AtomicVLA outperforms $\pi_{0}$ by 2.4\% on LIBERO, 10\% on LIBERO-LONG, and outperforms $\pi_{0}$ and $\pi_{0.5}$ by 0.22 and 0.25 in average task length on CALVIN. Additionally, our AtomicVLA consistently surpasses baselines by 18.3\% and 21\% in real-world long-horizon tasks and continual learning. These results highlight the effectiveness of atomic skill abstraction and dynamic expert composition for long-horizon and lifelong robotic tasks. The project page is \href{https://zhanglk9.github.io/atomicvla-web/}{here}.

URLs: https://zhanglk9.github.io/atomicvla-web/

cross Ref-DGS: Reflective Dual Gaussian Splatting

Authors: Ningjing Fan, Yiqun Wang, Dongming Yan, Peter Wonka

Abstract: Reflective appearance, especially strong and typically near-field specular reflections, poses a fundamental challenge for accurate surface reconstruction and novel view synthesis. Existing Gaussian splatting methods either fail to model near-field specular reflections or rely on explicit ray tracing at substantial computational cost. We present Ref-DGS, a reflective dual Gaussian splatting framework that addresses this trade-off by decoupling surface reconstruction from specular reflection within an efficient rasterization-based pipeline. Ref-DGS introduces a dual Gaussian scene representation consisting of geometry Gaussians and complementary local reflection Gaussians that capture near-field specular interactions without explicit ray tracing, along with a global environment reflection field for modeling far-field specular reflections. To predict specular radiance, we further propose a lightweight, physically-aware adaptive mixing shader that fuses global and local reflection features. Experiments demonstrate that Ref-DGS achieves state-of-the-art performance on reflective scenes while training substantially faster than ray-based Gaussian methods.

cross AI-Driven Phase Identification from X-ray Hyperspectral Imaging of cycled Na-ion Cathode Materials

Authors: Fay\c{c}al Adrar, Nicolas Folastre, Chlo\'e Pablos, Stefan Stanescu, Sufal Swaraj, Raghvender Raghvender, Fran\c{c}ois Cadiou, Laurence Croguennec, Matthieu Bugnet, Arnaud Demorti\`ere

Abstract: Na-ion batteries have emerged as viable candidates for large-scale energy storage applica- tions due to resource abundance and cost advantages. The constraints imposed on their performance and durability, for instance, by complex phase transformations in positive electrode materials during electrochemical cycling, can be addressed and are thus not detrimental to their development. However, diffusion-limited Na-ion transport can drive spatially heterogeneous phase nucleation and propagation, leading to multiphase coexis- tence and locally non-uniform electrochemical activity, generating complex reaction path- ways that challenge both mechanistic understanding and predictive material optimization. These challenges can be addressed by investigating single-crystalline regions of materials, i.e. down to the scale of individual particles, although such analyses are often constrained by energetically and/or spatially sparse hyperspectral datasets. Here, we developed an AI-driven method to process hyperspectral data under sparse sampling conditions and generate multiphase maps with nanometer-scale resolution over a micrometer-scale field of view. We applied this processing on scanning transmission X-ray microscopy (STXM) data to determine the distribution and coexistence of phases in individual particles of NaxV2(PO4)2F3 cathode materials, at different states of charge. The methodology relies on a workflow which combines a Gaussian mixture variational autoencoder (GMVAE) algorithm with the Pearson corre- lation coefficient to identify the sodium content and map their spatial distribution. Our approach reveals nanoscale phase heterogeneity and evolution within individual particles, and improves the reliability of phase detection by identifying ambiguity zones, false assign- ments, and transition phases localized at grain boundaries.

cross Compressed-Domain-Aware Online Video Super-Resolution

Authors: Yuhang Wang, Hai Li, Shujuan Hou, Zhetao Dong, Xiaoyao Yang

Abstract: In bandwidth-limited online video streaming, videos are usually downsampled and compressed. Although recent online video super-resolution (online VSR) approaches achieve promising results, they are still compute-intensive and fall short of real-time processing at higher resolutions, due to complex motion estimation for alignment and redundant processing of consecutive frames. To address these issues, we propose a compressed-domain-aware network (CDA-VSR) for online VSR, which utilizes compressed-domain information, including motion vectors, residual maps, and frame types to balance quality and efficiency. Specifically, we propose a motion-vector-guided deformable alignment module that uses motion vectors for coarse warping and learns only local residual offsets for fine-tuned adjustments, thereby maintaining accuracy while reducing computation. Then, we utilize a residual map gated fusion module to derive spatial weights from residual maps, suppressing mismatched regions and emphasizing reliable details. Further, we design a frame-type-aware reconstruction module for adaptive compute allocation across frame types, balancing accuracy and efficiency. On the REDS4 dataset, our CDA-VSR surpasses the state-of-the-art method TMP, with a maximum PSNR improvement of 0.13 dB while delivering more than double the inference speed. The code will be released at https://github.com/sspBIT/CDA-VSR.

URLs: https://github.com/sspBIT/CDA-VSR.

cross TDM-R1: Reinforcing Few-Step Diffusion Models with Non-Differentiable Reward

Authors: Yihong Luo, Tianyang Hu, Weijian Luo, Jing Tang

Abstract: While few-step generative models have enabled powerful image and video generation at significantly lower cost, generic reinforcement learning (RL) paradigms for few-step models remain an unsolved problem. Existing RL approaches for few-step diffusion models strongly rely on back-propagating through differentiable reward models, thereby excluding the majority of important real-world reward signals, e.g., non-differentiable rewards such as humans' binary likeness, object counts, etc. To properly incorporate non-differentiable rewards to improve few-step generative models, we introduce TDM-R1, a novel reinforcement learning paradigm built upon a leading few-step model, Trajectory Distribution Matching (TDM). TDM-R1 decouples the learning process into surrogate reward learning and generator learning. Furthermore, we developed practical methods to obtain per-step reward signals along the deterministic generation trajectory of TDM, resulting in a unified RL post-training method that significantly improves few-step models' ability with generic rewards. We conduct extensive experiments ranging from text-rendering, visual quality, and preference alignment. All results demonstrate that TDM-R1 is a powerful reinforcement learning paradigm for few-step text-to-image models, achieving state-of-the-art reinforcement learning performances on both in-domain and out-of-domain metrics. Furthermore, TDM-R1 also scales effectively to the recent strong Z-Image model, consistently outperforming both its 100-NFE and few-step variants with only 4 NFEs. Project page: https://github.com/Luo-Yihong/TDM-R1

URLs: https://github.com/Luo-Yihong/TDM-R1

cross VoiceSHIELD-Small: Real-Time Malicious Speech Detection and Transcription

Authors: Sumit Ranjan, Sugandha Sharma, Ubaid Abbas, Puneeth N Ail

Abstract: Voice interfaces are quickly becoming a common way for people to interact with AI systems. This also brings new security risks, such as prompt injection, social engineering, and harmful voice commands. Traditional security methods rely on converting speech to text and then filtering that text, which introduces delays and can ignore important audio cues. This paper introduces VoiceSHIELD-Small, a lightweight model that works in real time. It can transcribe speech and detect whether it is safe or harmful, all in one step. Built on OpenAI's Whisper-small encoder, VoiceSHIELD adds a mean-pooling layer and a simple classification head. It takes just 90-120 milliseconds to classify audio on mid-tier GPUs, while transcription happens at the same time. Tested on a balanced set of 947 audio clips, the model achieved 99.16 percent accuracy and an F1 score of 0.9865. At the default setting, it missed 2.33 percent of harmful inputs. Cross-validation showed consistent performance (F1 standard deviation = 0.0026). The paper also covers the model's design, training data, performance trade-offs, and responsible use guidelines. VoiceSHIELD is released under the MIT license to encourage further research and adoption in voice AI security.

cross YAQIN: Culturally Sensitive, Agentic AI for Mental Healthcare Support Among Muslim Women in the UK

Authors: Yasmin Zaraket, C\'eline Mougenot

Abstract: Mental healthcare services in the UK lack tools and resources to address the cultural needs of Muslim women, often leaving them feeling as though their values are pathologised and limiting trust and engagement [1]. Despite growing awareness of cultural competency, few interventions integrate Islamic frameworks into therapeutic support. This report investigates the design and evaluation of YAQIN, a co-designed AI-based application supporting culturally and faith-sensitive mental health engagement for Muslim women. With almost 1.9 million Muslim women in England in 2021, YAQIN responds to a gap in care [2]. It leverages AI\'s anonymity and continuous support through a faith-aware chatbot and guided journaling tool grounded in user-centred design and Islamic psychology. The YAQIN design research methodology comprised three stages: contextual investigation and literature review, user research with N=14 stakeholders including Muslim women and mental health experts, and prototype development informed by deductive thematic analysis, personas, journey maps, and design specifications. Evaluation involved a co-designed user study with five participants: four Muslim women and one mental health expert who reviewed therapeutic alignment and cultural sensitivity after using the chatbot prototype. Feedback focused on tone, faith relevance, emotional resonance, and the Retrieval-Augmented Generation pipeline enabling contextual continuity. Participants highlighted YAQIN\'s ability to bridge cultural gaps in trust and therapeutic confidence. Feedback included suggestions of including linguistic diversity and routine-based guidance. This project demonstrates how culturally sensitive AI can improve mental healthcare accessibility and trust for marginalised communities and highlights the potential of faith-integrated technology in healthcare innovation.

cross Hide and Find: A Distributed Adversarial Attack on Federated Graph Learning

Authors: Jinshan Liu, Ken Li, Jiazhe Wei, Bin Shi, Bo Dong

Abstract: Federated Graph Learning (FedGL) is vulnerable to malicious attacks, yet developing a truly effective and stealthy attack method remains a significant challenge. Existing attack methods suffer from low attack success rates, high computational costs, and are easily identified and smoothed by defense algorithms. To address these challenges, we propose \textbf{FedShift}, a novel two-stage "Hide and Find" distributed adversarial attack. In the first stage, before FedGL begins, we inject a learnable and hidden "shifter" into part of the training data, which subtly pushes poisoned graph representations toward a target class's decision boundary without crossing it, ensuring attack stealthiness during training. In the second stage, after FedGL is complete, we leverage the global model information and use the hidden shifter as an optimization starting point to efficiently find the adversarial perturbations. During the final attack, we aggregate these perturbations from multiple malicious clients to form the final effective adversarial sample and trigger the attack. Extensive experiments on six large-scale datasets demonstrate that our method achieves the highest attack effectiveness compared to existing advanced attack methods. In particular, our attack can effectively evade 3 mainstream robust federated learning defense algorithms and converges with a time cost reduction of over 90\%, highlighting its exceptional stealthiness, robustness, and efficiency.

cross DECADE: A Temporally-Consistent Unsupervised Diffusion Model for Enhanced Rb-82 Dynamic Cardiac PET Image Denoising

Authors: Yinchi Zhou, Liang Guo, Huidong Xie, Yuexi Du, Ashley Wang, Menghua Xia, Tian Yu, Ramesh Fazzone-Chettiar, Christopher Weyman, Bruce Spottiswoode, Vladimir Panin, Kuangyu Shi, Edward J. Miller, Attila Feher, Albert J. Sinusas, Nicha C. Dvornek, Chi Liu

Abstract: Rb-82 dynamic cardiac PET imaging is widely used for the clinical diagnosis of coronary artery disease (CAD), but its short half-life results in high noise levels that degrade dynamic frame quality and parametric imaging. The lack of paired clean-noisy training data, rapid tracer kinetics, and frame-dependent noise variations further limit the effectiveness of existing deep learning denoising methods. We propose DECADE (A Temporally-Consistent Unsupervised Diffusion model for Enhanced Rb-82 CArdiac PET DEnoising), an unsupervised diffusion framework that generalizes across early- to late-phase dynamic frames. DECADE incorporates temporal consistency during both training and iterative sampling, using noisy frames as guidance to preserve quantitative accuracy. The method was trained and evaluated on datasets acquired from Siemens Vision 450 and Siemens Biograph Vision Quadra scanners. On the Vision 450 dataset, DECADE consistently produced high-quality dynamic and parametric images with reduced noise while preserving myocardial blood flow (MBF) and myocardial flow reserve (MFR). On the Quadra dataset, using 15%-count images as input and full-count images as reference, DECADE outperformed UNet-based and other diffusion models in image quality and K1/MBF quantification. The proposed framework enables effective unsupervised denoising of Rb-82 dynamic cardiac PET without paired training data, supporting clearer visualization while maintaining quantitative integrity.

cross QuadAI at SemEval-2026 Task 3: Ensemble Learning of Hybrid RoBERTa and LLMs for Dimensional Aspect-Based Sentiment Analysis

Authors: A. J. W. de Vink, Filippos Karolos Ventirozos, Natalia Amat-Lefort, Lifeng Han

Abstract: We present our system for SemEval-2026 Task 3 on dimensional aspect-based sentiment regression. Our approach combines a hybrid RoBERTa encoder, which jointly predicts sentiment using regression and discretized classification heads, with large language models (LLMs) via prediction-level ensemble learning. The hybrid encoder improves prediction stability by combining continuous and discretized sentiment representations. We further explore in-context learning with LLMs and ridge-regression stacking to combine encoder and LLM predictions. Experimental results on the development set show that ensemble learning significantly improves performance over individual models, achieving substantial reductions in RMSE and improvements in correlation scores. Our findings demonstrate the complementary strengths of encoder-based and LLM-based approaches for dimensional sentiment analysis. Our development code and resources will be shared at https://github.com/aaronlifenghan/ABSentiment

URLs: https://github.com/aaronlifenghan/ABSentiment

cross ProgAgent:A Continual RL Agent with Progress-Aware Rewards

Authors: Jinzhou Tan, Gabriel Adineera, Jinoh Kim

Abstract: We present ProgAgent, a continual reinforcement learning (CRL) agent that unifies progress-aware reward learning with a high-throughput, JAX-native system architecture. Lifelong robotic learning grapples with catastrophic forgetting and the high cost of reward specification. ProgAgent tackles these by deriving dense, shaped rewards from unlabeled expert videos through a perceptual model that estimates task progress across initial, current, and goal observations. We theoretically interpret this as a learned state-potential function, delivering robust guidance in line with expert behaviors. To maintain stability amid online exploration - where novel, out-of-distribution states arise - we incorporate an adversarial push-back refinement that regularizes the reward model, curbing overconfident predictions on non-expert trajectories and countering distribution shift. By embedding this reward mechanism into a JIT-compiled loop, ProgAgent supports massively parallel rollouts and fully differentiable updates, rendering a sophisticated unified objective feasible: it merges PPO with coreset replay and synaptic intelligence for an enhanced stability-plasticity balance. Evaluations on ContinualBench and Meta-World benchmarks highlight ProgAgent's advantages: it markedly reduces forgetting, boosts learning speed, and outperforms key baselines in visual reward learning (e.g., Rank2Reward, TCN) and continual learning (e.g., Coreset, SI) - surpassing even an idealized perfect memory agent. Real-robot trials further validate its ability to acquire complex manipulation skills from noisy, few-shot human demonstrations.

cross Dual-Metric Evaluation of Social Bias in Large Language Models: Evidence from an Underrepresented Nepali Cultural Context

Authors: Ashish Pandey, Tek Raj Chhetri

Abstract: Large language models (LLMs) increasingly influence global digital ecosystems, yet their potential to perpetuate social and cultural biases remains poorly understood in underrepresented contexts. This study presents a systematic analysis of representational biases in seven state-of-the-art LLMs: GPT-4o-mini, Claude-3-Sonnet, Claude-4-Sonnet, Gemini-2.0-Flash, Gemini-2.0-Lite, Llama-3-70B, and Mistral-Nemo in the Nepali cultural context. Using Croissant-compliant dataset of 2400+ stereotypical and anti-stereotypical sentence pairs on gender roles across social domains, we implement an evaluation framework, Dual-Metric Bias Assessment (DMBA), combining two metrics: (1) agreement with biased statements and (2) stereotypical completion tendencies. Results show models exhibit measurable explicit agreement bias, with mean bias agreement ranging from 0.36 to 0.43 across decoding configurations, and an implicit completion bias rate of 0.740-0.755. Importantly, implicit completion bias follows a non-linear, U-shaped relationship with temperature, peaking at moderate stochasticity (T=0.3) and declining slightly at higher temperatures. Correlation analysis under different decoding settings revealed that explicit agreement strongly aligns with stereotypical sentence agreement but is a weak and often negative predictor of implicit completion bias, indicating generative bias is poorly captured by agreement metrics. Sensitivity analysis shows increasing top-p amplifies explicit bias, while implicit generative bias remains largely stable. Domain-level analysis shows implicit bias is strongest for race and sociocultural stereotypes, while explicit agreement bias is similar across gender and sociocultural categories, with race showing the lowest explicit agreement. These findings highlight the need for culturally grounded datasets and debiasing strategies for LLMs in underrepresented societies.

cross Learning embeddings of non-linear PDEs: the Burgers' equation

Authors: Pedro Taranc\'on-\'Alvarez, Leonid Sarieddine, Pavlos Protopapas, Raul Jimenez

Abstract: Embeddings provide low-dimensional representations that organize complex function spaces and support generalization. They provide a geometric representation that supports efficient retrieval, comparison, and generalization. In this work we generalize the concept to Physics Informed Neural Networks. We present a method to construct solution embedding spaces of nonlinear partial differential equations using a multi-head setup, and extract non-degenerate information from them using principal component analysis (PCA). We test this method by applying it to viscous Burgers' equation, which is solved simultaneously for a family of initial conditions and values of the viscosity. A shared network body learns a latent embedding of the solution space, while linear heads map this embedding to individual realizations. By enforcing orthogonality constraints on the heads, we obtain a principal-component decomposition of the latent space that is robust to training degeneracies and admits a direct physical interpretation. The obtained components for Burgers' equation exhibit rapid saturation, indicating that a small number of latent modes captures the dominant features of the dynamics.

cross HybridStitch: Pixel and Timestep Level Model Stitching for Diffusion Acceleration

Authors: Desen Sun, Jason Hon, Jintao Zhang, Sihang Liu

Abstract: Diffusion models have demonstrated a remarkable ability in Text-to-Image (T2I) generation applications. Despite the advanced generation output, they suffer from heavy computation overhead, especially for large models that contain tens of billions of parameters. Prior work has illustrated that replacing part of the denoising steps with a smaller model still maintains the generation quality. However, these methods only focus on saving computation for some timesteps, ignoring the difference in compute demand within one timestep. In this work, we propose HybridStitch, a new T2I generation paradigm that treats generation like editing. Specifically, we introduce a hybrid stage that jointly incorporates both the large model and the small model. HybridStitch separates the entire image into two regions: one that is relatively easy to render, enabling an early transition to the smaller model, and another that is more complex and therefore requires refinement by the large model. HybridStitch employs the small model to construct a coarse sketch while exploiting the large model to edit and refine the complex regions. According to our evaluation, HybridStitch achieves 1.83$\times$ speedup on Stable Diffusion 3, which is faster than all existing mixture of model methods.

cross Column Generation for the Micro-Transit Zoning Problem

Authors: Hins Hu, Rishav Sen, Jose Paolo Talusan, Abhishek Dubey, Aron Laszka, Samitha Samaranayake

Abstract: Along with the rapid development of new urban mobility options like ride-sharing over the past decade, on-demand micro-transit services stand out as a middle ground, bridging the gap between fixed-line mass transit and single-request ride-hailing, balancing ridership maximization and travel time minimization. Micro-transit adoption can have significant social impact. It improves urban sustainability, through lower energy consumption and reduced emissions, while enhancing equitable mobility access for disadvantaged communities, thanks to its lower vehicle miles per passenger, flexible schedules, and affordable pricing. However, effective operation of micro-transit services requires planning geo-fenced zones in advance, which involves solving a challenging combinatorial optimization problem. Existing approaches enumerate candidate zones first and selects a fixed number of optimal zones in the second step. In this paper, we generalize the Micro-Transit Zoning Problem (MZP) to allow a global budget rather than imposing a size limit for candidate zones. We also design a Column Generation (CG) framework to solve the problem and several pricing heuristics to accelerate computation. Extensive numerical experiments across major U.S. cities demonstrate that our approach produces higher-quality solutions more efficiently and scales better in the generalized setting.

cross Gradient Iterated Temporal-Difference Learning

Authors: Th\'eo Vincent, Kevin Gerhardt, Yogesh Tripathi, Habib Maraqten, Adam White, Martha White, Jan Peters, Carlo D'Eramo

Abstract: Temporal-difference (TD) learning is highly effective at controlling and evaluating an agent's long-term outcomes. Most approaches in this paradigm implement a semi-gradient update to boost the learning speed, which consists of ignoring the gradient of the bootstrapped estimate. While popular, this type of update is prone to divergence, as Baird's counterexample illustrates. Gradient TD methods were introduced to overcome this issue, but have not been widely used, potentially due to issues with learning speed compared to semi-gradient methods. Recently, iterated TD learning was developed to increase the learning speed of TD methods. For that, it learns a sequence of action-value functions in parallel, where each function is optimized to represent the application of the Bellman operator over the previous function in the sequence. While promising, this algorithm can be unstable due to its semi-gradient nature, as each function tracks a moving target. In this work, we modify iterated TD learning by computing the gradients over those moving targets, aiming to build a powerful gradient TD method that competes with semi-gradient methods. Our evaluation reveals that this algorithm, called Gradient Iterated Temporal-Difference learning, has a competitive learning speed against semi-gradient methods across various benchmarks, including Atari games, a result that no prior work on gradient TD methods has demonstrated.

cross AI Misuse in Education Is a Measurement Problem: Toward a Learning Visibility Framework

Authors: Eduardo Davalos, Yike Zhang

Abstract: The rapid integration of conversational AI systems into educational settings has intensified ethical concerns about academic integrity, fairness, and students' cognitive development. Institutional responses have largely centered on AI detection tools and restrictive policies, yet such approaches have proven unreliable and ethically contentious. This paper reframes AI misuse in education not primarily as a detection problem, but as a measurement problem rooted in the loss of visibility into the learning process. When AI enters the assessment loop, educators often retain access to final outputs but lose valuable insight into how those outputs were produced. Drawing on research in cognitive offloading, learning analytics, and multimodal timeline reconstruction, we propose the Learning Visibility Framework, grounded in three principles: clear specification and modeling of acceptable AI use, recognition of learning processes as assessable evidence alongside outcomes, and the establishment of transparent timelines of student activity. Rather than promoting surveillance, the framework emphasizes transparency and shared evidence as foundations for ethical AI integration in classroom settings. By shifting focus from adversarial detection toward process visibility, this work offers a principled pathway for aligning AI use with educational values while preserving trust and transparency between students and educators

cross DistillGuard: Evaluating Defenses Against LLM Knowledge Distillation

Authors: Bo Jiang

Abstract: Knowledge distillation from proprietary LLM APIs poses a growing threat to model providers, yet defenses against this attack remain fragmented and unevaluated. We present DistillGuard, a framework for systematically evaluating output-level defenses against LLM knowledge distillation. We introduce a taxonomy of three defense categories -- output perturbation, data poisoning, and information throttling -- and evaluate nine defense configurations using a standardized pipeline with Qwen3-14B as teacher and Qwen2.5-7B-Instruct as student across three benchmarks (MATH-500, HumanEval+, MT-Bench). Our results reveal that, in a same-family distillation setting against a naive attacker, most output-level defenses are surprisingly ineffective: paraphrasing-based perturbation barely degrades distilled student quality, and data poisoning primarily impairs conversational fluency while leaving task-specific capabilities intact. Only chain-of-thought removal substantially impairs mathematical reasoning (31.4\% vs.\ 67.8\% baseline), though code generation remains unaffected. These findings demonstrate that the effectiveness of distillation defenses is highly task-dependent and that current output-level approaches are insufficient to broadly prevent knowledge theft.

cross AI Steerability 360: A Toolkit for Steering Large Language Models

Authors: Erik Miehling, Karthikeyan Natesan Ramamurthy, Praveen Venkateswaran, Irene Ko, Pierre Dognin, Moninder Singh, Tejaswini Pedapati, Avinash Balakrishnan, Matthew Riemer, Dennis Wei, Inge Vejsbjerg, Elizabeth M. Daly, Kush R. Varshney

Abstract: The AI Steerability 360 toolkit is an extensible, open-source Python library for steering LLMs. Steering abstractions are designed around four model control surfaces: input (modification of the prompt), structural (modification of the model's weights or architecture), state (modification of the model's activations and attentions), and output (modification of the decoding or generation process). Steering methods exert control on the model through a common interface, termed a steering pipeline, which additionally allows for the composition of multiple steering methods. Comprehensive evaluation and comparison of steering methods/pipelines is facilitated by use case classes (for defining tasks) and a benchmark class (for performance comparison on a given task). The functionality provided by the toolkit significantly lowers the barrier to developing and comprehensively evaluating steering methods. The toolkit is Hugging Face native and is released under an Apache 2.0 license at https://github.com/IBM/AISteer360.

URLs: https://github.com/IBM/AISteer360.

cross Slumbering to Precision: Enhancing Artificial Neural Network Calibration Through Sleep-like Processes

Authors: Jean Erik Delanois, Aditya Ahuja, Giri P. Krishnan, Maxim Bazhenov

Abstract: Artificial neural networks are often overconfident, undermining trust because their predicted probabilities do not match actual accuracy. Inspired by biological sleep and the role of spontaneous replay in memory and learning, we introduce Sleep Replay Consolidation (SRC), a novel calibration approach. SRC is a post-training, sleep-like phase that selectively replays internal representations to update network weights and improve calibration without supervised retraining. Across multiple experiments, SRC is competitive with and complementary to standard approaches such as temperature scaling. Combining SRC with temperature scaling achieves the best Brier score and entropy trade-offs for AlexNet and VGG19. These results show that SRC provides a fundamentally novel approach to improving neural network calibration. SRC-based calibration offers a practical path toward more trustworthy confidence estimates and narrows the gap between human-like uncertainty handling and modern deep networks.

cross CCR-Bench: A Comprehensive Benchmark for Evaluating LLMs on Complex Constraints, Control Flows, and Real-World Cases

Authors: Xiaona Xue, Yiqiao Huang, Jiacheng Li, Yuanhang Zheng, Huiqi Miao, Yunfei Ma, Rui Liu, Xinbao Sun, Minglu Liu, Fanyu Meng, Chao Deng, Junlan Feng

Abstract: Enhancing the ability of large language models (LLMs) to follow complex instructions is critical for their deployment in real-world applications. However, existing evaluation methods often oversimplify instruction complexity as a mere additive combination of atomic constraints, failing to adequately capture the high-dimensional complexity arising from the intricate interplay of content and format, logical workflow control, and real-world applications. This leads to a significant gap between current evaluation practices and practical demands. To bridge this gap, we introduce CCR-Bench, a novel benchmark designed to assess LLMs' adherence to complex instructions. CCR-Bench is characterized by: (1) deep entanglement of content and formatting requirements in task specifications; (2) instructions that involve intricate task decomposition, conditional reasoning, and procedural planning; and (3) evaluation samples derived entirely from real-world industrial scenarios. Extensive experiments on CCR-Bench demonstrate that even state-of-the-art models exhibit substantial performance deficiencies, clearly quantifying the gap between current LLM capabilities and the demands of realworld instruction understanding. We believe that CCR-Bench offers a more rigorous and realistic evaluation framework, advancing the development of LLMs toward the next generation of models capable of understanding and executing complex tasks in industrial applications.

cross Reject, Resample, Repeat: Understanding Parallel Reasoning in Language Model Inference

Authors: Noah Golowich, Fan Chen, Dhruv Rohatgi, Raghav Singhal, Carles Domingo-Enrich, Dylan J. Foster, Akshay Krishnamurthy

Abstract: Inference-time methods that aggregate and prune multiple samples have emerged as a powerful paradigm for steering large language models, yet we lack any principled understanding of their accuracy-cost tradeoffs. In this paper, we introduce a route to rigorously study such approaches using the lens of *particle filtering* algorithms such as Sequential Monte Carlo (SMC). Given a base language model and a *process reward model* estimating expected terminal rewards, we ask: *how accurately can we sample from a target distribution given some number of process reward evaluations?* Theoretically, we identify (1) simple criteria enabling non-asymptotic guarantees for SMC; (2) algorithmic improvements to SMC; and (3) a fundamental limit faced by all particle filtering methods. Empirically, we demonstrate that our theoretical criteria effectively govern the *sampling error* of SMC, though not necessarily its final *accuracy*, suggesting that theoretical perspectives beyond sampling may be necessary.

cross VLM-SubtleBench: How Far Are VLMs from Human-Level Subtle Comparative Reasoning?

Authors: Minkyu Kim, Sangheon Lee, Dongmin Park

Abstract: The ability to distinguish subtle differences between visually similar images is essential for diverse domains such as industrial anomaly detection, medical imaging, and aerial surveillance. While comparative reasoning benchmarks for vision-language models (VLMs) have recently emerged, they primarily focus on images with large, salient differences and fail to capture the nuanced reasoning required for real-world applications. In this work, we introduce VLM-SubtleBench, a benchmark designed to evaluate VLMs on subtle comparative reasoning. Our benchmark covers ten difference types - Attribute, State, Emotion, Temporal, Spatial, Existence, Quantity, Quality, Viewpoint, and Action - and curate paired question-image sets reflecting these fine-grained variations. Unlike prior benchmarks restricted to natural image datasets, our benchmark spans diverse domains, including industrial, aerial, and medical imagery. Through extensive evaluation of both proprietary and open-source VLMs, we reveal systematic gaps between model and human performance across difference types and domains, and provide controlled analyses highlighting where VLMs' reasoning sharply deteriorates. Together, our benchmark and findings establish a foundation for advancing VLMs toward human-level comparative reasoning.

cross Designing probabilistic AI monsoon forecasts to inform agricultural decision-making

Authors: Colin Aitken, Rajat Masiwal, Adam Marchakitus, Katherine Kowal, Mayank Gupta, Tyler Yang, Amir Jina, Pedram Hassanzadeh, William R. Boos, Michael Kremer

Abstract: Hundreds of millions of farmers make high-stakes decisions under uncertainty about future weather. Forecasts can inform these decisions, but available choices and their risks and benefits vary between farmers. We introduce a decision-theory framework for designing useful forecasts in settings where the forecaster cannot prescribe optimal actions because farmers' circumstances are heterogeneous. We apply this framework to the case of seasonal onset of monsoon rains, a key date for planting decisions and agricultural investments in many tropical countries. We develop a system for tailoring forecasts to the requirements of this framework by blending systematically benchmarked artificial intelligence (AI) weather prediction models with a new "evolving farmer expectations" statistical model. This statistical model applies Bayesian inference to historical observations to predict time-varying probabilities of first-occurrence events throughout a season. The blended system yields more skillful Indian monsoon forecasts at longer lead times than its components or any multi-model average. In 2025, this system was deployed operationally in a government-led program that delivered subseasonal monsoon onset forecasts to 38 million Indian farmers, skillfully predicting that year's early-summer anomalous dry period. This decision-theory framework and blending system offer a pathway for developing climate adaptation tools for large vulnerable populations around the world.

cross Long-Short Term Agents for Pure-Vision Bronchoscopy Robotic Autonomy

Authors: Junyang Wu, Mingyi Luo, Fangfang Xie, Minghui Zhang, Hanxiao Zhang, Chunxi Zhang, Junhao Wang, Jiayuan Sun, Yun Gu, Guang-Zhong Yang

Abstract: Accurate intraoperative navigation is essential for robot-assisted endoluminal intervention, but remains difficult because of limited endoscopic field of view and dynamic artifacts. Existing navigation platforms often rely on external localization technologies, such as electromagnetic tracking or shape sensing, which increase hardware complexity and remain vulnerable to intraoperative anatomical mismatch. We present a vision-only autonomy framework that performs long-horizon bronchoscopic navigation using preoperative CT-derived virtual targets and live endoscopic video, without external tracking during navigation. The framework uses hierarchical long-short agents: a short-term reactive agent for continuous low-latency motion control, and a long-term strategic agent for decision support at anatomically ambiguous points. When their recommendations conflict, a world-model critic predicts future visual states for candidate actions and selects the action whose predicted state best matches the target view. We evaluated the system in a high-fidelity airway phantom, three ex vivo porcine lungs, and a live porcine model. The system reached all planned segmental targets in the phantom, maintained 80\% success to the eighth generation ex vivo, and achieved in vivo navigation performance comparable to the expert bronchoscopist. These results support the preclinical feasibility of sensor-free autonomous bronchoscopic navigation.

cross IMSE: Intrinsic Mixture of Spectral Experts Fine-tuning for Test-Time Adaptation

Authors: Sunghyun Baek, Jaemyung Yu, Seunghee Koh, Minsu Kim, Hyeonseong Jeon, Junmo Kim

Abstract: Test-time adaptation (TTA) has been widely explored to prevent performance degradation when test data differ from the training distribution. However, fully leveraging the rich representations of large pretrained models with minimal parameter updates remains underexplored. In this paper, we propose Intrinsic Mixture of Spectral Experts (IMSE) that leverages the spectral experts inherently embedded in Vision Transformers. We decompose each linear layer via singular value decomposition (SVD) and adapt only the singular values, while keeping the singular vectors fixed. We further identify a key limitation of entropy minimization in TTA: it often induces feature collapse, causing the model to rely on domain-specific features rather than class-discriminative features. To address this, we propose a diversity maximization loss based on expert-input alignment, which encourages diverse utilization of spectral experts during adaptation. In the continual test-time adaptation (CTTA) scenario, beyond preserving pretrained knowledge, it is crucial to retain and reuse knowledge from previously observed domains. We introduce Domain-Aware Spectral Code Retrieval, which estimates input distributions to detect domain shifts, and retrieves adapted singular values for rapid adaptation. Consequently, our method achieves state-of-the-art performance on various distribution-shift benchmarks under the TTA setting. In CTTA and Gradual CTTA, it further improves accuracy by 3.4 percentage points (pp) and 2.4 pp, respectively, while requiring 385 times fewer trainable parameters. Our code is available at https://github.com/baek85/IMSE.

URLs: https://github.com/baek85/IMSE.

cross SWE-Fuse: Empowering Software Agents via Issue-free Trajectory Learning and Entropy-aware RLVR Training

Authors: Xin-Cheng Wen, Binbin Chen, Haoxuan Lan, Hang Yu, Peng Di, Cuiyun Gao

Abstract: Large language models (LLMs) have transformed the software engineering landscape. Recently, numerous LLM-based agents have been developed to address real-world software issue fixing tasks. Despite their state-of-the-art performance, Despite achieving state-of-the-art performance, these agents face a significant challenge: \textbf{Insufficient high-quality issue descriptions.} Real-world datasets often exhibit misalignments between issue descriptions and their corresponding solutions, introducing noise and ambiguity that mislead automated agents and limit their problem-solving effectiveness. We propose \textbf{\textit{SWE-Fuse}}, an issue-description-aware training framework that fuses issue-description-guided and issue-free samples for training SWE agents. It consists of two key modules: (1) An issue-free-driven trajectory learning module for mitigating potentially misleading issue descriptions while enabling the model to learn step-by-step debugging processes; and (2) An entropy-aware RLVR training module, which adaptively adjusts training dynamics through entropy-driven clipping. It applies relaxed clipping under high entropy to encourage exploration, and stricter clipping under low entropy to ensure training stability. We evaluate SWE-Fuse on the widely studied SWE-bench Verified benchmark shows to demonstrate its effectiveness in solving real-world software problems. Specifically, SWE-Fuse outperforms the best 8B and 32B baselines by 43.0\% and 60.2\% in solve rate, respectively. Furthermore, integrating SWE-Fuse with test-time scaling (TTS) enables further performance improvements, achieving solve rates of 49.8\% and 65.2\% under TTS@8 for the 8B and 32B models, respectively.

cross AI Agents, Language, Deep Learning and the Next Revolution in Science

Authors: Ke Li, Beijiang Liu, Bruce Mellado, Changzheng Yuan, Zhengde Zhang

Abstract: Modern science is reaching a critical inflection point. Instruments across disciplines, from particle physics and astronomy to genomics and climate modeling, now produce data of such scale, diversity, and interdependence that traditional analytical methods can no longer keep pace. This growing imbalance between data generation and data understanding signals the need for a new scientific paradigm. We propose that intelligent, human-supervised AI agents operating over deep-learning algorithms, represent the next evolution of the scientific method. Built upon large language models and multimodal learning, these agents can interpret scientific intent, design and execute analytical workflows, and ensure traceability through domain-specific languages that preserve human oversight and accountability. Particle physics, a historic incubator of computational innovation, offers the ideal testbed for this transition. At the Institute of High Energy Physics of the Chinese Academy of Sciences, the Dr. Sai system embodies this vision, a multi-agent reasoning framework deployed within collider research at the CEPC. This emerging approach does not replace human scientists but extends their cognitive reach, enabling discovery to scale with complexity and redefining how knowledge itself is produced in the age of intelligent machines. The significance of this paradigm transcends particle physics, offering a blueprint for all data-driven sciences facing the same complexity ceiling.

cross ELLMob: Event-Driven Human Mobility Generation with Self-Aligned LLM Framework

Authors: Yusong Wang, Chuang Yang, Jiawei Wang, Xiaohang Xu, Jiayi Xu, Dongyuan Li, Chuan Xiao, Renhe Jiang

Abstract: Human mobility generation aims to synthesize plausible trajectory data, which is widely used in urban system research. While Large Language Model-based methods excel at generating routine trajectories, they struggle to capture deviated mobility during large-scale societal events. This limitation stems from two critical gaps: (1) the absence of event-annotated mobility datasets for design and evaluation, and (2) the inability of current frameworks to reconcile competitions between users' habitual patterns and event-imposed constraints when making trajectory decisions. This work addresses these gaps with a twofold contribution. First, we construct the first event-annotated mobility dataset covering three major events: Typhoon Hagibis, COVID-19, and the Tokyo 2021 Olympics. Second, we propose ELLMob, a self-aligned LLM framework that first extracts competing rationales between habitual patterns and event constraints, based on Fuzzy-Trace Theory, and then iteratively aligns them to generate trajectories that are both habitually grounded and event-responsive. Extensive experiments show that ELLMob wins state-of-the-art baselines across all events, demonstrating its effectiveness. Our codes and datasets are available at https://github.com/deepkashiwa20/ELLMob.

URLs: https://github.com/deepkashiwa20/ELLMob.

cross PSTNet: Physically-Structured Turbulence Network

Authors: Boris Kriuk, Fedor Kriuk

Abstract: Reliable real-time estimation of atmospheric turbulence intensity remains an open challenge for aircraft operating across diverse altitude bands, particularly over oceanic, polar, and data-sparse regions that lack operational nowcasting infrastructure. Classical spectral models encode climatological averages rather than the instantaneous atmospheric state, and generic ML regressors offer adaptivity but provide no guarantee that predictions respect fundamental scaling laws. This paper introduces the Physically-Structured Turbulence Network (PSTNet), a lightweight architecture that embeds physics directly into its structure. PSTNet couples four components: (i) a zero-parameter backbone derived from Monin-Obukhov theory, (ii) a regime-gated mixture of specialist sub-networks supervised by Richardson-number-derived soft targets, (iii) Feature-wise Linear Modulation layers conditioning hidden representations on local air-density ratio, and (iv) a Kolmogorov output layer enforcing inertial-subrange scaling as an architectural constraint. The entire model contains only 552 learnable parameters, requiring fewer than 2.5 kB of storage and executing in under 12s on a Cortex-M7 microcontroller. We validate PSTNet on 340 paired six-degree-of-freedom guidance simulations spanning three vehicle classes (Mach 2.8, 4.5, and 8.0) and six operational categories with real-time satellite weather ingestion. PSTNet achieves a mean miss-distance improvement of +2.8% with a 78% win rate and a statistically significant effect size. Our results demonstrate that encoding domain physics as architectural priors yields a more efficient and interpretable path to turbulence estimation accuracy than scaling model capacity, establishing PSTNet as a viable drop-in replacement for legacy look-up tables in resource-constrained, safety-critical on-board guidance systems.

cross VORL-EXPLORE: A Hybrid Learning Planning Approach to Multi-Robot Exploration in Dynamic Environments

Authors: Ning Liu, Sen Shen, Zheng Li, Sheng Liu, Dongkun Han, Shangke Lyu, Thomas Braunl

Abstract: Hierarchical multi-robot exploration commonly decouples frontier allocation from local navigation, which can make the system brittle in dense and dynamic environments. Because the allocator lacks direct awareness of execution difficulty, robots may cluster at bottlenecks, trigger oscillatory replanning, and generate redundant coverage. We propose VORL-EXPLORE, a hybrid learning and planning framework that addresses this limitation through execution fidelity, a shared estimate of local navigability that couples task allocation with motion execution. This fidelity signal is incorporated into a fidelity-coupled Voronoi objective with inter-robot repulsion to reduce contention before it emerges. It also drives a risk-aware adaptive arbitration mechanism between global A* guidance and a reactive reinforcement learning policy, balancing long-range efficiency with safe interaction in confined spaces. The framework further supports online self-supervised recalibration of the fidelity model using pseudo-labels derived from recent progress and safety outcomes, enabling adaptation to non-stationary obstacles without manual risk tuning. We evaluate this capability separately in a dedicated severe-traffic ablation. Extensive experiments in randomized grids and a Gazebo factory scenario show high success rates, shorter path length, lower overlap, and robust collision avoidance. The source code will be made publicly available upon acceptance.

cross Emergence is Overrated: AGI as an Archipelago of Experts

Authors: Daniel Kilov

Abstract: Krakauer, Krakauer, and Mitchell (2025) distinguish between emergent capabilities and emergent intelligence, arguing that true intelligence requires efficient coarse-grained representations enabling diverse problem-solving through analogy and minimal modification. They contend that intelligence means doing "more with less" through compression and generalization, contrasting this with "vast assemblages of diverse calculators" that merely accumulate specialized capabilities. This paper examines whether their framework accurately characterizes human intelligence and its implications for conceptualizing artificial general intelligence. Drawing on empirical evidence from cognitive science, I demonstrate that human expertise operates primarily through domain-specific pattern accumulation rather than elegant compression. Expert performance appears flexible not through unifying principles but through vast repertoires of specialized responses. Creative breakthroughs themselves may emerge through evolutionary processes of blind variation and selective retention rather than principled analogical reasoning. These findings suggest reconceptualizing AGI as an "archipelago of experts": isolated islands of specialized competence without unifying principles or shared representations. If we accept human expertise with its characteristic brittleness as genuine intelligence, then consistency demands recognizing that artificial systems comprising millions of specialized modules could constitute general intelligence despite lacking KKM's emergent intelligence.

cross \$OneMillion-Bench: How Far are Language Agents from Human Experts?

Authors: Qianyu Yang, Yang Liu, Jiaqi Li, Jun Bai, Hao Chen, Kaiyuan Chen, Tiliang Duan, Jiayun Dong, Xiaobo Hu, Zixia Jia, Yang Liu, Tao Peng, Yixin Ren, Ran Tian, Zaiyuan Wang, Yanglihong Xiao, Gang Yao, Lingyue Yin, Ge Zhang, Chun Zhang, Jianpeng Jiao, Zilong Zheng, Yuan Gong

Abstract: As language models (LMs) evolve from chat assistants to long-horizon agents capable of multi-step reasoning and tool use, existing benchmarks remain largely confined to structured or exam-style tasks that fall short of real-world professional demands. To this end, we introduce \$OneMillion-Bench \$OneMillion-Bench, a benchmark of 400 expert-curated tasks spanning Law, Finance, Industry, Healthcare, and Natural Science, built to evaluate agents across economically consequential scenarios. Unlike prior work, the benchmark requires retrieving authoritative sources, resolving conflicting evidence, applying domain-specific rules, and making constraint decisions, where correctness depends as much on the reasoning process as the final answer. We adopt a rubric-based evaluation protocol scoring factual accuracy, logical coherence, practical feasibility, and professional compliance, focused on expert-level problems to ensure meaningful differentiation across agents. Together, \$OneMillion-Bench provides a unified testbed for assessing agentic reliability, professional depth, and practical readiness in domain-intensive scenarios.

cross Aero-Promptness: Drag-Aware Aerodynamic Manipulability for Propeller-driven Vehicles

Authors: Antonio Franchi

Abstract: This work introduces the Drag-Aware Aerodynamic Manipulability (DAAM), a geometric framework for control allocation in redundant multirotors. By equipping the propeller spin-rate space with a Riemannian metric based on the remaining symmetric acceleration capacity of each motor, the formulation explicitly accounts for motor torque limits and aerodynamic drag. Mapping this metric through the nonlinear thrust law to the generalized force space yields a state-dependent manipulability volume. The log-determinant of this volume acts as a natural barrier function, strictly penalizing drag-induced saturation and low-spin thrust loss. Optimizing this volume along the allocation fibers provides a redundancy resolution strategy inherently invariant to arbitrary coordinate scaling in the generalized-force space. Analytically, we prove that the resulting optimal allocations locally form smooth embedded manifolds, and we geometrically characterize the global jump discontinuities that inevitably arise from physical actuator limits and spin-rate sign transitions.

cross ViSA-Enhanced Aerial VLN: A Visual-Spatial Reasoning Enhanced Framework for Aerial Vision-Language Navigation

Authors: Haoyu Tong, Xiangyu Dong, Xiaoguang Ma, Haoran Zhao, Yaoming Zhou, Chenghao Lin

Abstract: Existing aerial Vision-Language Navigation (VLN) methods predominantly adopt a detection-and-planning pipeline, which converts open-vocabulary detections into discrete textual scene graphs. These approaches are plagued by inadequate spatial reasoning capabilities and inherent linguistic ambiguities. To address these bottlenecks, we propose a Visual-Spatial Reasoning (ViSA) enhanced framework for aerial VLN. Specifically, a triple-phase collaborative architecture is designed to leverage structured visual prompting, enabling Vision-Language Models (VLMs) to perform direct reasoning on image planes without the need for additional training or complex intermediate representations. Comprehensive evaluations on the CityNav benchmark demonstrate that the ViSA-enhanced VLN achieves a 70.3\% improvement in success rate compared to the fully trained state-of-the-art (SOTA) method, elucidating its great potential as a backbone for aerial VLN systems.

cross FedMomentum: Preserving LoRA Training Momentum in Federated Fine-Tuning

Authors: Peishen Yan, Yang Hua, Hao Wang, Jiaru Zhang, Xiaoyu Wu, Tao Song, Haibing Guan

Abstract: Federated fine-tuning of large language models (LLMs) with low-rank adaptation (LoRA) offers a communication-efficient and privacy-preserving solution for task-specific adaptation. Naive aggregation of LoRA modules introduces noise due to mathematical incorrectness when averaging the downsampling and upsampling matrices independently. However, existing noise-free aggregation strategies inevitably compromise the structural expressiveness of LoRA, limiting its ability to retain client-specific adaptations by either improperly reconstructing the low-rank structure or excluding partially trainable components. We identify this problem as loss of training momentum, where LoRA updates fail to accumulate effectively across rounds, resulting in slower convergence and suboptimal performance. To address this, we propose FedMomentum, a novel framework that enables structured and momentum-preserving LoRA aggregation via singular value decomposition (SVD). Specifically, after aggregating low-rank updates in a mathematically correct manner, FedMomentum applies SVD to extract the dominant components that capture the main update directions. These components are used to reconstruct the LoRA modules with the same rank, while residual components can be retained and later merged into the backbone to preserve semantic information and ensure robustness. Extensive experiments across multiple tasks demonstrate that FedMomentum consistently outperforms prior state-of-the-art methods in convergence speed and final accuracy.

cross Alignment--Process--Outcome: Rethinking How AIs and Humans Collaborate

Authors: Haichang Li, Anjun Zhu, Arpit Narechania

Abstract: In real-world collaboration, alignment, process structure, and outcome quality do not exhibit a simple linear or one-to-one correspondence: similar alignment may accompany either rapid convergence or extensive multi-branch exploration, and lead to different results. Existing accounts often isolate these dimensions or focus on specific participant types, limiting structural accounts of collaboration. We reconceptualize collaboration through two complementary lenses. The task lens models collaboration as trajectory evolution in a structured task space, revealing patterns such as advancement, branching, and backtracking. The intent lens examines how individual intents are expressed within shared contexts and enter situated decisions. Together, these lenses clarify the structural relationships among alignment, decision-making, and trajectory structure. Rather than reducing collaboration to outcome quality or treating alignment as the sole objective, we propose a unified dynamic view of the relationships among alignment, process, and outcome, and use it to re-examine collaboration structure across Human-Human, AI-AI, and Human-AI settings.

cross Not Like Transformers: Drop the Beat Representation for Dance Generation with Mamba-Based Diffusion Model

Authors: Sangjune Park, Inhyeok Choi, Donghyeon Soon, Youngwoo Jeon, Kyungdon Joo

Abstract: Dance is a form of human motion characterized by emotional expression and communication, playing a role in various fields such as music, virtual reality, and content creation. Existing methods for dance generation often fail to adequately capture the inherently sequential, rhythmical, and music-synchronized characteristics of dance. In this paper, we propose \emph{MambaDance}, a new dance generation approach that leverages a Mamba-based diffusion model. Mamba, well-suited to handling long and autoregressive sequences, is integrated into our two-stage diffusion architecture, substituting off-the-shelf Transformer. Additionally, considering the critical role of musical beats in dance choreography, we propose a Gaussian-based beat representation to explicitly guide the decoding of dance sequences. Experiments on AIST++ and FineDance datasets for each sequence length show that our proposed method effectively generates plausible dance movements while reflecting essential characteristics, consistently from short to long dances, compared to the previous methods. Additional qualitative results and demo videos are available at \small{https://vision3d-lab.github.io/mambadance}.

URLs: https://vision3d-lab.github.io/mambadance

cross DyLLM: Efficient Diffusion LLM Inference via Saliency-based Token Selection and Partial Attention

Authors: Younjoo Lee, Junghoo Lee, Seungkyun Dan, Jaiyoung Park, Jung Ho Ahn

Abstract: Masked Diffusion Language Models (MDLMs) enable parallel token decoding, providing a promising alternative to the sequential nature of autoregressive generation. However, their iterative denoising process remains computationally expensive because it repeatedly processes the entire sequence at every step. We observe that across these diffusion steps, most token representations remain stable; only a small subset, which we term salient tokens, contributes meaningfully to the next update. Leveraging this temporal sparsity, we present DyLLM, a training-free inference framework that accelerates decoding by selectively computing only these salient tokens. DyLLM identifies saliency by measuring the cosine similarity of attention contexts between adjacent denoising steps. It recomputes feed-forward and attention operations only for salient tokens while reusing cached activations for the remainder. Across diverse reasoning and code-generation benchmarks, DyLLM achieves up to 9.6x higher throughput while largely preserving the baseline accuracy of state-of-the-art models like LLaDA and Dream.

cross GCGNet: Graph-Consistent Generative Network for Time Series Forecasting with Exogenous Variables

Authors: Zhengyu Li, Xiangfei Qiu, Yuhan Zhu, Xingjian Wu, Jilin Hu, Chenjuan Guo, Bin Yang

Abstract: Exogenous variables offer valuable supplementary information for predicting future endogenous variables. Forecasting with exogenous variables needs to consider both past-to-future dependencies (i.e., temporal correlations) and the influence of exogenous variables on endogenous variables (i.e., channel correlations). This is pivotal when future exogenous variables are available, because they may directly affect the future endogenous variables. Many methods have been proposed for time series forecasting with exogenous variables, focusing on modeling temporal and channel correlations. However, most of them use a two-step strategy, modeling temporal and channel correlations separately, which limits their ability to capture joint correlations across time and channels. Furthermore, in real-world scenarios, time series are frequently affected by various forms of noises, underscoring the critical importance of robustness in such correlations modeling. To address these limitations, we propose GCGNet, a Graph-Consistent Generative Network for time series forecasting with exogenous variables. Specifically, GCGNet first employs a Variational Generator to produce coarse predictions. A Graph Structure Aligner then further guides it by evaluating the consistency between the generated and true correlations, where the correlations are represented as graphs, and are robust to noises. Finally, a Graph Refiner is proposed to refine the predictions to prevent degeneration and improve accuracy. Extensive experiments on 12 real-world datasets demonstrate that GCGNet outperforms state-of-the-art baselines.

cross Solution to the 10th ABAW Expression Recognition Challenge: A Robust Multimodal Framework with Safe Cross-Attention and Modality Dropout

Authors: Jun Yu, Naixiang Zheng, Guoyuan Wang, Yunxiang Zhang, Lingsi Zhu, Jiaen Liang, Wei Huang, Shengping Liu

Abstract: Emotion recognition in real-world environments is hindered by partial occlusions, missing modalities, and severe class imbalance. To address these issues, particularly for the Affective Behavior Analysis in-the-wild (ABAW) Expression challenge, we propose a multimodal framework that dynamically fuses visual and audio representations. Our approach uses a dual-branch Transformer architecture featuring a safe cross-attention mechanism and a modality dropout strategy. This design allows the network to rely on audio-based predictions when visual cues are absent. To mitigate the long-tail distribution of the Aff-Wild2 dataset, we apply focal loss optimization, combined with a sliding-window soft voting strategy to capture dynamic emotional transitions and reduce frame-level classification jitter. Experiments demonstrate that our framework effectively handles missing modalities and complex spatiotemporal dependencies, achieving an accuracy of 60.79% and an F1-score of 0.5029 on the Aff-Wild2 validation set.

cross Speed3R: Sparse Feed-forward 3D Reconstruction Models

Authors: Weining Ren, Xiao Tan, Kai Han

Abstract: While recent feed-forward 3D reconstruction models accelerate 3D reconstruction by jointly inferring dense geometry and camera poses in a single pass, their reliance on dense attention imposes a quadratic complexity, creating a prohibitive computational bottleneck that severely limits inference speed. To resolve this, we introduce Speed3R, an end-to-end trainable model inspired by the core principle of Structure-from-Motion: that a sparse set of keypoints is sufficient for robust pose estimation. Speed3R features a dual-branch attention mechanism where a compression branch creates a coarse contextual prior to guide a selection branch, which performs fine-grained attention only on the most informative image tokens. This strategy mimics the efficiency of traditional keypoint matching, achieving a remarkable 12.4x inference speedup on 1000-view sequences, while introducing a minimal, controlled trade-off in geometric accuracy. Validated on standard benchmarks with both VGGT and $\pi^3$ backbones, our method delivers high-quality reconstructions at a fraction of computational cost, paving the way for efficient large-scale scene modeling.

cross ImageEdit-R1: Boosting Multi-Agent Image Editing via Reinforcement Learning

Authors: Yiran Zhao, Yaoqi Ye, Xiang Liu, Michael Qizhe Shieh, Trung Bui

Abstract: With the rapid advancement of commercial multi-modal models, image editing has garnered significant attention due to its widespread applicability in daily life. Despite impressive progress, existing image editing systems, particularly closed-source or proprietary models, often struggle with complex, indirect, or multi-step user instructions. These limitations hinder their ability to perform nuanced, context-aware edits that align with human intent. In this work, we propose ImageEdit-R1, a multi-agent framework for intelligent image editing that leverages reinforcement learning to coordinate high-level decision-making across a set of specialized, pretrained vision-language and generative agents. Each agent is responsible for distinct capabilities--such as understanding user intent, identifying regions of interest, selecting appropriate editing actions, and synthesizing visual content--while reinforcement learning governs their collaboration to ensure coherent and goal-directed behavior. Unlike existing approaches that rely on monolithic models or hand-crafted pipelines, our method treats image editing as a sequential decision-making problem, enabling dynamic and context-aware editing strategies. Experimental results demonstrate that ImageEdit-R1 consistently outperforms both individual closed-source diffusion models and alternative multi-agent framework baselines across multiple image editing datasets.

cross DSH-Bench: A Difficulty- and Scenario-Aware Benchmark with Hierarchical Subject Taxonomy for Subject-Driven Text-to-Image Generation

Authors: Zhenyu Hu, Qing Wang, Te Cao, Luo Liao, Longfei Lu, Liqun Liu, Shuang Li, Hang Chen, Mengge Xue, Yuan Chen, Chao Deng, Peng Shu, Huan Yu, Jie Jiang

Abstract: Significant progress has been achieved in subject-driven text-to-image (T2I) generation, which aims to synthesize new images depicting target subjects according to user instructions. However, evaluating these models remains a significant challenge. Existing benchmarks exhibit critical limitations: 1) insufficient diversity and comprehensiveness in subject images, 2) inadequate granularity in assessing model performance across different subject difficulty levels and prompt scenarios, and 3) a profound lack of actionable insights and diagnostic guidance for subsequent model refinement. To address these limitations, we propose DSH-Bench, a comprehensive benchmark that enables systematic multi-perspective analysis of subject-driven T2I models through four principal innovations: 1) a hierarchical taxonomy sampling mechanism ensuring comprehensive subject representation across 58 fine-grained categories, 2) an innovative classification scheme categorizing both subject difficulty level and prompt scenario for granular capability assessment, 3) a novel Subject Identity Consistency Score (SICS) metric demonstrating a 9.4\% higher correlation with human evaluation compared to existing measures in quantifying subject preservation, and 4) a comprehensive set of diagnostic insights derived from the benchmark, offering critical guidance for optimizing future model training paradigms and data construction strategies. Through an extensive empirical evaluation of 19 leading models, DSH-Bench uncovers previously obscured limitations in current approaches, establishing concrete directions for future research and development.

cross DC-W2S: Dual-Consensus Weak-to-Strong Training for Reliable Process Reward Modeling in Biological Reasoning

Authors: Chi-Min Chan, Ehsan Hajiramezanali, Xiner Li, Edward De Brouwer, Carl Edwards, Wei Xue, Sirui Han, Yike Guo, Gabriele Scalia

Abstract: In scientific reasoning tasks, the veracity of the reasoning process is as critical as the final outcome. While Process Reward Models (PRMs) offer a solution to the coarse-grained supervision problems inherent in Outcome Reward Models (ORMs), their deployment is hindered by the prohibitive cost of obtaining expert-verified step-wise labels. This paper addresses the challenge of training reliable PRMs using abundant but noisy "weak" supervision. We argue that existing Weak-to-Strong Generalization (W2SG) theories lack prescriptive guidelines for selecting high-quality training signals from noisy data. To bridge this gap, we introduce the Dual-Consensus Weak-to-Strong (DC-W2S) framework. By intersecting Self-Consensus (SC) metrics among weak supervisors with Neighborhood-Consensus (NC) metrics in the embedding space, we stratify supervision signals into distinct reliability regimes. We then employ a curriculum of instance-level balanced sampling and label-level reliability-aware masking to guide the training process. We demonstrate that DC-W2S enables the training of robust PRMs for complex reasoning without exhaustive expert annotation, proving that strategic data curation is more effective than indiscriminate training on large-scale noisy datasets.

cross SaiVLA-0: Cerebrum--Pons--Cerebellum Tripartite Architecture for Compute-Aware Vision-Language-Action

Authors: Xiang Shi, Wenlong Huang, Menglin Zou, Xinhai Sun

Abstract: We revisit Vision-Language-Action through a neuroscience-inspired triad. Biologically, the Cerebrum provides stable high-level multimodal priors and remains frozen; the Pons Adapter integrates these cortical features with real-time proprioceptive inputs and compiles intent into execution-ready tokens; and the Cerebellum (ParaCAT) performs fast, parallel categorical decoding for online control, with hysteresis/EMA/temperature/entropy for stability. A fixed-ratio schedule and two-stage feature caching make the system compute-aware and reproducible. Inspired by active, foveated vision, our wrist ROIs are geometrically tied to the end-effector via calibrated projection, providing a movement-stabilized, high-resolution view that is sensitive to fine-grained pose changes and complements the global context of the main view. The design is modular: upgrading the Cerebrum only retrains the Pons; changing robots only trains the Cerebellum; cerebellum-only RL can further refine control without touching high-level semantics. As a concept-and-protocol paper with preliminary evidence, we outline a timing protocol under matched conditions (GPU, resolution, batch) to verify anticipated efficiency gains. We also report preliminary LIBERO evidence showing that split feature caching reduces training time (7.5h to 4.5h) and improves average success (86.5% to 92.5%) under official N1.5 head-only training, and that SaiVLA0 reaches 99.0% mean success.

cross Foley-Flow: Coordinated Video-to-Audio Generation with Masked Audio-Visual Alignment and Dynamic Conditional Flows

Authors: Shentong Mo, Yibing Song

Abstract: Coordinated audio generation based on video inputs typically requires a strict audio-visual (AV) alignment, where both semantics and rhythmics of the generated audio segments shall correspond to those in the video frames. Previous studies leverage a two-stage design where the AV encoders are firstly aligned via contrastive learning, then the encoded video representations guide the audio generation process. We observe that both contrastive learning and global video guidance are effective in aligning overall AV semantics while limiting temporally rhythmic synchronization. In this work, we propose FoleyFlow to first align unimodal AV encoders via masked modeling training, where the masked audio segments are recovered under the guidance of the corresponding video segments. After training, the AV encoders which are separately pretrained using only unimodal data are aligned with semantic and rhythmic consistency. Then, we develop a dynamic conditional flow for the final audio generation. Built upon the efficient velocity flow generation framework, our dynamic conditional flow utilizes temporally varying video features as the dynamic condition to guide corresponding audio segment generations. To this end, we extract coherent semantic and rhythmic representations during masked AV alignment, and use this representation of video segments to guide audio generation temporally. Our audio results are evaluated on the standard benchmarks and largely surpass existing results under several metrics. The superior performance indicates that FoleyFlow is effective in generating coordinated audios that are both semantically and rhythmically coherent to various video sequences.

cross DARC: Disagreement-Aware Alignment via Risk-Constrained Decoding

Authors: Mingxi Zou, Jiaxiang Chen, Junfan Li, Langzhang Liang, Qifan Wang, Xu Yinghui, Zenglin Xu

Abstract: Preference-based alignment methods (e.g., RLHF, DPO) typically optimize a single scalar objective, implicitly averaging over heterogeneous human preferences. In practice, systematic annotator and user-group disagreement makes mean-reward maximization brittle and susceptible to proxy over-optimization. We propose **Disagreement-Aware Alignment via Risk-Constrained Decoding (DARC)**, a retraining-free inference-time method that frames response selection as distributionally robust, risk-sensitive decision making. Given multiple preference samples or scalable disagreement proxies, DARC reranks candidates by maximizing a *KL-robust (entropic)* satisfaction objective, and provides simple deployment controls that cap or penalize the corresponding entropic risk premium relative to the mean, enabling explicit risk budgets without retraining. We provide theoretical characterization linking this decoding rule to principled pessimism and KL-based distributionally robust optimization. Experiments on alignment benchmarks show that DARC reduces disagreement and tail risk while maintaining competitive average quality under noisy, heterogeneous feedback.

cross Gradually Excavating External Knowledge for Implicit Complex Question Answering

Authors: Chang Liu, Xiaoguang Li, Lifeng Shang, Xin Jiang, Qun Liu, Edmund Y. Lam, Ngai Wong

Abstract: Recently, large language models (LLMs) have gained much attention for the emergence of human-comparable capabilities and huge potential. However, for open-domain implicit question-answering problems, LLMs may not be the ultimate solution due to the reasons of: 1) uncovered or out-of-date domain knowledge, 2) one-shot generation and hence restricted comprehensiveness. To this end, this work proposes a gradual knowledge excavation framework for open-domain complex question answering, where LLMs iteratively and actively acquire external information, and then reason based on acquired historical knowledge. Specifically, during each step of the solving process, the model selects an action to execute, such as querying external knowledge or performing a single logical reasoning step, to gradually progress toward a final answer. Our method can effectively leverage plug-and-play external knowledge and dynamically adjust the strategy for solving complex questions. Evaluated on the StrategyQA dataset, our method achieves 78.17% accuracy with less than 6% parameters of its competitors, setting new SOTA for ~10B-scale LLMs.

cross An explainable hybrid deep learning-enabled intelligent fault detection and diagnosis approach for automotive software systems validation

Authors: Mohammad Abboush, Ehab Ghannoum, Andreas Rausch

Abstract: Advancements in data-driven machine learning have emerged as a pivotal element in supporting automotive software systems (ASSs) engineering across various levels of the V-development process. Duringsystemverificationandvalidation,theintegrationofanintelligent fault detection anddiagnosis (FDD) model with test recordings analysis process serves as a powerful tool for efficiency ensuring functional safety. However, the lack of interpretability of the black-box FDD models developed not only hinders understanding of the cause underlying the prediction, but also prevents the model from being adapted based on the prediction result. This, in turn, increases the computational cost required for developingacomplexFDDmodelandlimitsconfidenceinreal-timesafety-criticalapplications.To address this challenge, a novel explainable method for fault detection, identification, and localization is proposed in this article with the aim of providing a clear understanding of the logic behind the prediction outcome. To this end, a hybrid 1dCNN-GRU-based intelligent model was developed to analyze the recordings from the real-time validation process of ASSs. The employment of explainable AI techniques, i.e., IGs, DeepLIFT, Gradient SHAP, and DeepLIFT SHAP, was instrumental in enabling model adaptation and facilitating the root cause analysis (RCA). The proposed approach is applied to the real time dataset collected during a virtual test drive performed by the user on hardware in the loop system.

cross Evolution Strategy-Based Calibration for Low-Bit Quantization of Speech Models

Authors: Lucas Rakotoarivony

Abstract: Quantization has become essential for the efficient deployment of speech processing systems. Although widely studied, most existing quantization methods were developed for vision and NLP architectures, while the specific challenges of audio signals remain largely overlooked. In particular, we show that audio activations can exhibit large calibration ranges, leading to significant information loss when standard calibration techniques are applied. To address this, we propose ESC, an Evolution Strategy-based Calibration method that formulates activation scaling as an optimization problem and solves it using a two-step local-global scheme driven by an evolution strategy. ESC enables unaltered performance under full INT8 quantization and is the first calibration method to achieve near-lossless performance for full INT4 quantization across multiple speech tasks. Integrating ESC with PTQ methods further reduces performance loss, achieving a 1% relative accuracy degradation on the AST model.

cross Is continuous CoT better suited for multi-lingual reasoning?

Authors: Ali Hamza Bashir, Behzad Shomali, Markus Frey, Mehdi Ali, Rafet Sifa, David Berghaus

Abstract: We investigate whether performing reasoning in a continuous latent space leads to more robust multilingual capabilities. We compare Continuous Chain-of-Thought (using the CODI framework) against standard supervised fine-tuning across five typologically diverse languages: English, Chinese, German, French, and Urdu. Our experiments on GSM8k and CommonsenseQA demonstrate that continuous reasoning significantly outperforms explicit reasoning on low-resource languages, particularly in zero-shot settings where the target language was not seen during training. Additionally, this approach achieves extreme efficiency, compressing reasoning traces by approximately $29\times$ to $50\times$. These findings indicate that continuous latent representations naturally exhibit greater language invariance, offering a scalable solution for cross-lingual reasoning.

cross Privacy-Preserving End-to-End Full-Duplex Speech Dialogue Models

Authors: Nikita Kuzmin, Tao Zhong, Jiajun Deng, Yingke Zhu, Tristan Tsoi, Tianxiang Cao, Simon Lui, Kong Aik Lee, Eng Siong Chng

Abstract: End-to-end full-duplex speech models feed user audio through an always-on LLM backbone, yet the speaker privacy implications of their hidden representations remain unexamined. Following the VoicePrivacy 2024 protocol with a lazy-informed attacker, we show that the hidden states of SALM-Duplex and Moshi leak substantial speaker identity across all transformer layers. Layer-wise and turn-wise analyses reveal that leakage persists across all layers, with SALM-Duplex showing stronger leakage in early layers while Moshi leaks uniformly, and that Linkability rises sharply within the first few turns. We propose two streaming anonymization setups using Stream-Voice-Anon: a waveform-level front-end (Anon-W2W) and a feature-domain replacement (Anon-W2F). Anon-W2F raises EER by over 3.5x relative to the discrete encoder baseline (11.2% to 41.0%), approaching the 50% random-chance ceiling, while Anon-W2W retains 78-93% of baseline sBERT across setups with sub-second response latency (FRL under 0.8 s).

cross TildeOpen LLM: Leveraging Curriculum Learning to Achieve Equitable Language Representation

Authors: Toms Bergmanis, Martins Kronis, Ingus J\=anis Pretkalni\c{n}\v{s}, D\=avis Nicmanis, Je\c{l}izaveta Je\c{l}inska, Roberts Rozis, Rinalds V\=iksna, M\=arcis Pinnis

Abstract: Large language models often underperform in many European languages due to the dominance of English and a few high-resource languages in training data. This paper presents TildeOpen LLM, a 30-billion-parameter open-weight foundational model trained for 34 European languages to promote linguistic equity and improve performance for low-resource languages. To address the data imbalance, we combine dataset upsampling with a curriculum-based training schedule that alternates between uniform and natural language distributions. The resulting model performs favorably compared to other multilingual LLMs despite being trained with significantly fewer computing resources. Evaluation across multiple multilingual benchmarks shows that TildeOpen surpasses existing open-weight models in text generation and comprehension, particularly for Baltic, Finno-Ugric, and Slavic languages. Human evaluations confirm an up to tenfold reduction in linguistic errors relative to leading baselines. The model and associated resources are fully open-weight and publicly available at huggingface.co/TildeAI/TildeOpen-30b. These outcomes demonstrate that careful data curation and balanced training strategies can substantially enhance multilingual model quality without increasing model size or training volume.

cross MM-TS: Multi-Modal Temperature and Margin Schedules for Contrastive Learning with Long-Tail Data

Authors: Siarhei Sheludzko, Dhimitrios Duka, Bernt Schiele, Hilde Kuehne, Anna Kukleva

Abstract: Contrastive learning has become a fundamental approach in both uni-modal and multi-modal frameworks. This learning paradigm pulls positive pairs of samples closer while pushing negatives apart. In the uni-modal setting (e.g., image-based learning), previous research has shown that the strength of these forces can be controlled through the temperature parameter. In this work, we propose Multi-Modal Temperature and Margin Schedules (MM-TS), extending the concept of uni-modal temperature scheduling to multi-modal contrastive learning. Our method dynamically adjusts the temperature in the contrastive loss during training, modulating the attraction and repulsion forces in the multi-modal setting. Additionally, recognizing that standard multi-modal datasets often follow imbalanced, long-tail distributions, we adapt the temperature based on the local distribution of each training sample. Specifically, samples from dense clusters are assigned a higher temperature to better preserve their semantic structure. Furthermore, we demonstrate that temperature scheduling can be effectively integrated within a max-margin framework, thereby unifying the two predominant approaches in multi-modal contrastive learning: InfoNCE loss and max-margin objective. We evaluate our approach on four widely used image- and video-language datasets, Flickr30K, MSCOCO, EPIC-KITCHENS-100, and YouCook2, and show that our dynamic temperature and margin schedules improve performance and lead to new state-of-the-art results in the field.

cross Distributional Regression with Tabular Foundation Models: Evaluating Probabilistic Predictions via Proper Scoring Rules

Authors: Jonas Landsgesell, Pascal Knoll

Abstract: Prior-Data Fitted Networks (PFNs), such as TabPFN and TabICL, have revolutionized tabular deep learning by leveraging in-context learning for tabular data. These models are meant as foundation models for classification and regression settings and promise to greatly simplify deployment in practical settings because their performance is unprecedented (in terms of mean squared error or $R^2$, when measured on common benchmarks like TabArena or TALENT). However, we see an important weakness of current benchmarks for the regression setting: the current benchmarks focus on evaluating win rates and performance using metrics like (root) mean squared error or $R^2$. Therefore, these leaderboards (implicitly and explicitly) push researchers to optimize for machine learning pipelines which elicit a good mean value estimate. The main problem is that this approach only evaluates a point estimate (namely the mean estimator which is the Bayes estimator associated with the mean squared error loss). In this article we discuss the application of proper scoring rules for evaluating the goodness of probabilistic forecasts in distributional regression. We also propose to enhance common machine learning benchmarks with metrics for probabilistic regression. To improve the status quo and make the machine learning community aware of scoring rules for probabilistic regression, we advocate to use the continuous ranked probability score (CRPS) in benchmarks for probabilistic regression. However, we also illustrate that the choice of the scoring rule changes the inductive bias of the trained model. We, therefore, advocate for finetuning or promptable tabular foundation models.

cross Alignment-Aware and Reliability-Gated Multimodal Fusion for Unmanned Aerial Vehicle Detection Across Heterogeneous Thermal-Visual Sensors

Authors: Ishrat Jahan, Molla E Majid, M Murugappan, Muhammad E. H. Chowdhury, N. B. Prakash, Saad Bin Abul Kashem, Balamurugan Balusamy, Amith Khandakar

Abstract: Reliable unmanned aerial vehicle (UAV) detection is critical for autonomous airspace monitoring but remains challenging when integrating sensor streams that differ substantially in resolution, perspective, and field of view. Conventional fusion methods-such as wavelet-, Laplacian-, and decision-level approaches-often fail to preserve spatial correspondence across modalities and suffer from annotation of inconsistencies, limiting their robustness in real-world settings. This study introduces two fusion strategies, Registration-aware Guided Image Fusion (RGIF) and Reliability-Gated Modality-Attention Fusion (RGMAF), designed to overcome these limitations. RGIF employs Enhanced Correlation Coefficient (ECC)-based affine registration combined with guided filtering to maintain thermal saliency while enhancing structural detail. RGMAF integrates affine and optical-flow registration with a reliability-weighted attention mechanism that adaptively balances thermal contrast and visual sharpness. Experiments were conducted on the Multi-Sensor and Multi-View Fixed-Wing (MMFW)-UAV dataset comprising 147,417 annotated air-to-air frames collected from infrared, wide-angle, and zoom sensors. Among single-modality detectors, YOLOv10x demonstrated the most stable cross-domain performance and was selected as the detection backbone for evaluating fused imagery. RGIF improved the visual baseline by 2.13% mAP@50 (achieving 97.65%), while RGMAF attained the highest recall of 98.64%. These findings show that registration-aware and reliability-adaptive fusion provides a robust framework for integrating heterogeneous modalities, substantially enhancing UAV detection performance in multimodal environments.

cross Revisiting Gradient Staleness: Evaluating Distance Metrics for Asynchronous Federated Learning Aggregation

Authors: Patrick Wilhelm, Odej Kao

Abstract: In asynchronous federated learning (FL), client devices send updates to a central server at varying times based on their computational speed, often using stale versions of the global model. This staleness can degrade the convergence and accuracy of the global model. Previous work, such as AsyncFedED, proposed an adaptive aggregation method using Euclidean distance to measure staleness. In this paper, we extend this approach by exploring alternative distance metrics to more accurately capture the effect of gradient staleness. We integrate these metrics into the aggregation process and evaluate their impact on convergence speed, model performance, and training stability under heterogeneous clients and non-IID data settings. Our results demonstrate that certain metrics lead to more robust and efficient asynchronous FL training, offering a stronger foundation for practical deployment.

cross SplitAgent: A Privacy-Preserving Distributed Architecture for Enterprise-Cloud Agent Collaboration

Authors: Jianshu She

Abstract: Enterprise adoption of cloud-based AI agents faces a fundamental privacy dilemma: leveraging powerful cloud models requires sharing sensitive data, while local processing limits capability. Current agent frameworks like MCP and A2A assume complete data sharing, making them unsuitable for enterprise environments with confidential information. We present SplitAgent, a novel distributed architecture that enables privacy-preserving collaboration between enterprise-side privacy agents and cloud-side reasoning agents. Our key innovation is context-aware dynamic sanitization that adapts privacy protection based on task semantics -- contract review requires different sanitization than code review or financial analysis. SplitAgent extends existing agent protocols with differential privacy guarantees, zero-knowledge tool verification, and privacy budget management. Through comprehensive experiments on enterprise scenarios, we demonstrate that SplitAgent achieves 83.8\% task accuracy while maintaining 90.1\% privacy protection, significantly outperforming static approaches (73.2\% accuracy, 79.7\% privacy). Context-aware sanitization improves task utility by 24.1\% over static methods while reducing privacy leakage by 67\%. Our architecture provides a practical path for enterprise AI adoption without compromising sensitive data.

cross Disentangling Reasoning in Large Audio-Language Models for Ambiguous Emotion Prediction

Authors: Xiaofeng Yu, Jiaheng Dong, Jean Honorio, Abhirup Ghosh, Hong Jia, Ting Dang

Abstract: Speech emotion recognition plays an important role in various applications. However, most existing approaches predict a single emotion label, oversimplifying the inherently ambiguous nature of human emotional expression. Recent large audio-language models show promise in generating richer outputs, but their reasoning ability for ambiguous emotional understanding remains limited. In this work, we reformulate ambiguous emotion recognition as a distributional reasoning problem and present the first systematic study of ambiguity-aware reasoning in LALMs. Our framework comprises two complementary components: an ambiguity-aware objective that aligns predictions with human perceptual distributions, and a structured ambiguity-aware chain-of-thought supervision that guides reasoning over emotional cues. Experiments on IEMOCAP and CREMA-D demonstrate consistent improvements across SFT, DPO, and GRPO training strategies.

cross Exploring Deep Learning and Ultra-Widefield Imaging for Diabetic Retinopathy and Macular Edema

Authors: Pablo Jimenez-Lizcano, Sergio Romero-Tapiador, Ruben Tolosana, Aythami Morales, Guillermo Gonz\'alez de Rivera, Ruben Vera-Rodriguez, Julian Fierrez

Abstract: Diabetic retinopathy (DR) and diabetic macular edema (DME) are leading causes of preventable blindness among working-age adults. Traditional approaches in the literature focus on standard color fundus photography (CFP) for the detection of these conditions. Nevertheless, recent ultra-widefield imaging (UWF) offers a significantly wider field of view in comparison to CFP. Motivated by this, the present study explores state-of-the-art deep learning (DL) methods and UWF imaging on three clinically relevant tasks: i) image quality assessment for UWF, ii) identification of referable diabetic retinopathy (RDR), and iii) identification of DME. Using the publicly available UWF4DR Challenge dataset, released as part of the MICCAI 2024 conference, we benchmark DL models in the spatial (RGB) and frequency domains, including popular convolutional neural networks (CNNs) as well as recent vision transformers (ViTs) and foundation models. In addition, we explore a final feature-level fusion to increase robustness. Finally, we also analyze the decisions of the DL models using Grad-CAM, increasing the explainability. Our proposal achieves consistently strong performance across all architectures, underscoring the competitiveness of emerging ViTs and foundation models and the promise of feature-level fusion and frequency-domain representations for UWF analysis.

cross Fibration Policy Optimization

Authors: Chang Li, Tshihao Tsu, Yaren Zhang, Chao Xue, Xiaodong He

Abstract: Large language models are increasingly trained as heterogeneous systems spanning multiple domains, expert partitions, and agentic pipelines, yet prevalent proximal objectives operate at a single scale and lack a principled mechanism for coupling token-level, trajectory-level, and higher-level hierarchical stability control. To bridge this gap, we derive the Aggregational Policy Censoring Objective (APC-Obj), the first exact unconstrained reformulation of sample-based TV-TRPO, establishing that clipping-based surrogate design and trust-region optimization are dual formulations of the same problem. Building on this foundation, we develop Fiber Bundle Gating (FBG), an algebraic framework that organizes sampled RL data as a fiber bundle and decomposes ratio gating into a base-level gate on trajectory aggregates and a fiber-level gate on per-token residuals, with provable first-order agreement with the true RL objective near on-policy. From APC-Obj and FBG we derive Fibration Policy Optimization (or simply, FiberPO), a concrete objective whose Jacobian is block-diagonal over trajectories, reduces to identity at on-policy, and provides better update direction thus improving token efficiency. The compositional nature of the framework extends beyond the trajectory-token case: fibrations compose algebraically into a Fibration Gating Hierarchy (FGH) that scales the same gating mechanism to arbitrary hierarchical depth without new primitives, as demonstrated by FiberPO-Domain, a four-level instantiation with independent trust-region budgets at the domain, prompt group, trajectory, and token levels. Together, these results connect the trust-region theory, a compositional algebraic structure, and practical multi-scale stability control into a unified framework for LLM policy optimization.

cross SAIL: Test-Time Scaling for In-Context Imitation Learning with VLM

Authors: Makoto Sato, Yusuke Iwasawa, Yujin Tang, So Kuroki

Abstract: In-context imitation learning allows robots to acquire skills from demonstrations, yet one-shot trajectory generation remains fragile under environmental variation. We propose SAIL, a framework that reframes robot imitation as an iterative refinement problem capable of scaling with test-time compute. SAIL utilizes Monte Carlo Tree Search, where each node is a complete trajectory and edges correspond to trajectory refinements. The process is guided by three core components: an automated archive of successful trajectories for contextually relevant retrieval, a vision language model-based scoring mechanism for trajectory evaluation, and a step-level feedback that provides trajectory-aligned scores for iterative refinement. Experiments across six diverse manipulation tasks in simulation and real-world validation clearly demonstrate that increasing test-time compute consistently improves success rates, achieving up to 95% on complex tasks. Our results suggest that trajectory-level test-time scaling is a robust path toward more generalizable robotic agents.

cross SCL-GNN: Towards Generalizable Graph Neural Networks via Spurious Correlation Learning

Authors: Yuxiang Zhang, Enyan Dai

Abstract: Graph Neural Networks (GNNs) have demonstrated remarkable success across diverse tasks. However, their generalization capability is often hindered by spurious correlations between node features and labels in the graph. Our analysis reveals that GNNs tend to exploit imperceptible statistical correlations in training data, even when such correlations are unreliable for prediction. To address this challenge, we propose the Spurious Correlation Learning Graph Neural Network (SCL-GNN), a novel framework designed to enhance generalization on both Independent and Identically Distributed (IID) and Out-of-Distribution (OOD) graphs. SCL-GNN incorporates a principled spurious correlation learning mechanism, leveraging the Hilbert-Schmidt Independence Criterion (HSIC) to quantify correlations between node representations and class scores. This enables the model to identify and mitigate irrelevant but influential spurious correlations effectively. Additionally, we introduce an efficient bi-level optimization strategy to jointly optimize modules and GNN parameters, preventing overfitting. Extensive experiments on real-world and synthetic datasets demonstrate that SCL-GNN consistently outperforms state-of-the-art baselines under various distribution shifts, highlighting its robustness and generalization capabilities.

cross How Much Do LLMs Hallucinate in Document Q&A Scenarios? A 172-Billion-Token Study Across Temperatures, Context Lengths, and Hardware Platforms

Authors: JV Roig

Abstract: How much do large language models actually hallucinate when answering questions grounded in provided documents? Despite the critical importance of this question for enterprise AI deployments, reliable measurement has been hampered by benchmarks that rely on static datasets vulnerable to contamination, LLM-based judges with documented biases, or evaluation scales too small for statistical confidence. We address this gap using RIKER, a ground-truth-first evaluation methodology that enables deterministic scoring without human annotation. Across 35 open-weight models, three context lengths (32K, 128K, and 200K tokens), four temperature settings, and three hardware platforms (NVIDIA H200, AMD MI300X, and Intel Gaudi 3), we conducted over 172 billion tokens of evaluation - an order of magnitude beyond prior work. Our findings reveal that: (1) even the best-performing models fabricate answers at a non-trivial rate - 1.19% at best at 32K, with top-tier models at 5 - 7% - and fabrication rises steeply with context length, nearly tripling at 128K and exceeding 10% for all models at 200K; (2) model selection dominates all other factors, with overall accuracy spanning a 72-percentage-point range and model family predicting fabrication resistance better than model size; (3) temperature effects are nuanced - T=0.0 yields the best overall accuracy in roughly 60% of cases, but higher temperatures reduce fabrication for the majority of models and dramatically reduce coherence loss (infinite generation loops), which can reach 48x higher rates at T=0.0 versus T=1.0; (4) grounding ability and fabrication resistance are distinct capabilities - models that excel at finding facts may still fabricate facts that do not exist; and (5) results are consistent across hardware platforms, confirming that deployment decisions need not be hardware-dependent.

cross AdaCultureSafe: Adaptive Cultural Safety Grounded by Cultural Knowledge in Large Language Models

Authors: Hankun Kang, Di Lin, Zhirong Liao, Pengfei Bai, Xinyi Zeng, Jiawei Jiang, Yuanyuan Zhu, Tieyun Qian

Abstract: With the widespread adoption of Large Language Models (LLMs), respecting indigenous cultures becomes essential for models' culturally safety and responsible global applications. Existing studies separately consider cultural safety and cultural knowledge and neglect that the former should be grounded by the latter. This severely prevents LLMs from yielding culture-specific respectful responses. Consequently, adaptive cultural safety remains a formidable task. In this work, we propose to jointly model cultural safety and knowledge. First and foremost, cultural-safety and knowledge-paired data serve as the key prerequisite to conduct this research. However, the cultural diversity across regions and the subtlety of cultural differences pose significant challenges to the creation of such paired evaluation data. To address this issue, we propose a novel framework that integrates authoritative cultural knowledge descriptions curation, LLM-automated query generation, and heavy manual verification. Accordingly, we obtain a dataset named AdaCultureSafe containing 4.8K manually decomposed fine-grained cultural descriptions and the corresponding 48K manually verified safety- and knowledge-oriented queries. Upon the constructed dataset, we evaluate three families of popular LLMs on their cultural safety and knowledge proficiency, via which we make a critical discovery: no significant correlation exists between their cultural safety and knowledge proficiency. We then delve into the utility-related neuron activations within LLMs to investigate the potential cause of the absence of correlation, which can be attributed to the difference of the objectives of pre-training and post-alignment. We finally present a knowledge-grounded method, which significantly enhances cultural safety by enforcing the integration of knowledge into the LLM response generation process.

cross TA-RNN-Medical-Hybrid: A Time-Aware and Interpretable Framework for Mortality Risk Prediction

Authors: Zahra Jafari, Azadeh Zamanifar, Amirfarhad Farhadi

Abstract: Accurate and interpretable mortality risk prediction in intensive care units (ICUs) remains a critical challenge due to the irregular temporal structure of electronic health records (EHRs), the complexity of longitudinal disease trajectories, and the lack of clinically grounded explanations in many data-driven models. To address these challenges, we propose \textit{TA-RNN-Medical-Hybrid}, a time-aware and knowledge-enriched deep learning framework that jointly models longitudinal clinical sequences and irregular temporal dynamics through explicit continuous-time encoding, along with standardized medical concept representations. The proposed framework extends time-aware recurrent modeling by integrating explicit continuous-time embeddings that operate independently of visit indexing, SNOMED-based disease representations, and a hierarchical dual-level attention mechanism that captures both visit-level temporal importance and feature/concept-level clinical relevance. This design enables accurate mortality risk estimation while providing transparent and clinically meaningful explanations aligned with established medical knowledge. We evaluate the proposed approach on the MIMIC-III critical care dataset and compare it against strong time-aware and sequential baselines. Experimental results demonstrate that TA-RNN-Medical-Hybrid consistently improves predictive performance in terms of AUC, accuracy, and recall-oriented F$_2$-score. Moreover, qualitative analysis shows that the model effectively decomposes mortality risk across time and clinical concepts, yielding interpretable insights into disease severity, chronicity, and temporal progression. Overall, the proposed framework bridges the gap between predictive accuracy and clinical interpretability, offering a scalable and transparent solution for high-stakes ICU decision support systems.

cross Evaluating LLM-Based Grant Proposal Review via Structured Perturbations

Authors: William Thorne, Joseph James, Yang Wang, Chenghua Lin, Diana Maynard

Abstract: As AI-assisted grant proposals outpace manual review capacity in a kind of ``Malthusian trap'' for the research ecosystem, this paper investigates the capabilities and limitations of LLM-based grant reviewing for high-stakes evaluation. Using six EPSRC proposals, we develop a perturbation-based framework probing LLM sensitivity across six quality axes: funding, timeline, competency, alignment, clarity, and impact. We compare three review architectures: single-pass review, section-by-section analysis, and a 'Council of Personas' ensemble emulating expert panels. The section-level approach significantly outperforms alternatives in both detection rate and scoring reliability, while the computationally expensive council method performs no better than baseline. Detection varies substantially by perturbation type, with alignment issues readily identified but clarity flaws largely missed by all systems. Human evaluation shows LLM feedback is largely valid but skewed toward compliance checking over holistic assessment. We conclude that current LLMs may provide supplementary value within EPSRC review but exhibit high variability and misaligned review priorities. We release our code and any non-protected data.

cross A Blockchain-based Traceability System for AI-Driven Engine Blade Inspection

Authors: Mahmoud Hafez, Eman Ouda, Mohammed A. Mohammed Eltoum, Khaled Salah, Yusra Abdulrahman

Abstract: Aircraft engine blade maintenance relies on inspection records shared across manufacturers, airlines, maintenance organizations, and regulators. Yet current systems are fragmented, difficult to audit, and vulnerable to tampering. This paper presents BladeChain, a blockchain-based system providing immutable traceability for blade inspections throughout the component life cycle. BladeChain is the first system to integrate multi-stakeholder endorsement, automated inspection scheduling, AI model provenance, and cryptographic evidence binding, delivering auditable maintenance traceability for aerospace deployments. Built on a four-stakeholder Hyperledger Fabric network (OEM, Airline, MRO, Regulator), BladeChain captures every life-cycle event in a tamper-evident ledger. A chaincode-enforced state machine governs blade status transitions and automatically triggers inspections when configurable flight hour, cycle, or calendar thresholds are exceeded, eliminating manual scheduling errors. Inspection artifacts are stored off-chain in IPFS and linked to on-chain records via SHA-256 hashes, with each inspection record capturing the AI model name and version used for defect detection. This enables regulators to audit both what defects were found and how they were found. The detection module is pluggable, allowing organizations to adopt or upgrade inspection models without modifying the ledger or workflows. We built a prototype and evaluated it on workloads of up to 100 blades, demonstrating 100% life cycle completion with consistent throughput of 26 operations per minute. A centralized SQL baseline quantifies the consensus overhead and highlights the security trade-off. Security validation confirms tamper detection within 17~ms through hash verification.

cross Minor First, Major Last: A Depth-Induced Implicit Bias of Sharpness-Aware Minimization

Authors: Chaewon Moon, Dongkuk Si, Chulhee Yun

Abstract: We study the implicit bias of Sharpness-Aware Minimization (SAM) when training $L$-layer linear diagonal networks on linearly separable binary classification. For linear models ($L=1$), both $\ell_\infty$- and $\ell_2$-SAM recover the $\ell_2$ max-margin classifier, matching gradient descent (GD). However, for depth $L = 2$, the behavior changes drastically -- even on a single-example dataset. For $\ell_\infty$-SAM, the limit direction depends critically on initialization and can converge to $\mathbf{0}$ or to any standard basis vector, in stark contrast to GD, whose limit aligns with the basis vector of the dominant data coordinate. For $\ell_2$-SAM, we show that although its limit direction matches the $\ell_1$ max-margin solution as in the case of GD, its finite-time dynamics exhibit a phenomenon we call "sequential feature amplification", in which the predictor initially relies on minor coordinates and gradually shifts to larger ones as training proceeds or initialization increases. Our theoretical analysis attributes this phenomenon to $\ell_2$-SAM's gradient normalization factor applied in its perturbation, which amplifies minor coordinates early and allows major ones to dominate later, giving a concrete example where infinite-time implicit-bias analyses are insufficient. Synthetic and real-data experiments corroborate our findings.

cross Graph-Instructed Neural Networks for parametric problems with varying boundary conditions

Authors: Francesco Della Santa, Sandra Pieraccini, Maria Strazzullo

Abstract: This work addresses the accurate and efficient simulation of physical phenomena governed by parametric Partial Differential Equations (PDEs) characterized by varying boundary conditions, where parametric instances modify not only the physics of the problem but also the imposition of boundary constraints on the computational domain. In such scenarios, classical Galerkin projection-based reduced order techniques encounter a fundamental bottleneck. Parametric boundaries typically necessitate a re-formulation of the discrete problem for each new configuration, and often, these approaches are unsuitable for real-time applications. To overcome these limitations, we propose a novel methodology based on Graph-Instructed Neural Networks (GINNs). The GINN framework effectively learns the mapping between the parametric description of the computational domain and the corresponding PDE solution. Our results demonstrate that the proposed GINN-based models, can efficiently represent highly complex parametric PDEs, serving as a robust and scalable asset for several applied-oriented settings when compared with fully connected architectures.

cross Retrieval-Augmented Anatomical Guidance for Text-to-CT Generation

Authors: Daniele Molino, Camillo Maria Caruso, Paolo Soda, Valerio Guarrasi

Abstract: Text-conditioned generative models for volumetric medical imaging provide semantic control but lack explicit anatomical guidance, often resulting in outputs that are spatially ambiguous or anatomically inconsistent. In contrast, structure-driven methods ensure strong anatomical consistency but typically assume access to ground-truth annotations, which are unavailable when the target image is to be synthesized. We propose a retrieval-augmented approach for Text-to-CT generation that integrates semantic and anatomical information under a realistic inference setting. Given a radiology report, our method retrieves a semantically related clinical case using a 3D vision-language encoder and leverages its associated anatomical annotation as a structural proxy. This proxy is injected into a text-conditioned latent diffusion model via a ControlNet branch, providing coarse anatomical guidance while maintaining semantic flexibility. Experiments on the CT-RATE dataset show that retrieval-augmented generation improves image fidelity and clinical consistency compared to text-only baselines, while additionally enabling explicit spatial controllability, a capability inherently absent in such approaches. Further analysis highlights the importance of retrieval quality, with semantically aligned proxies yielding consistent gains across all evaluation axes. This work introduces a principled and scalable mechanism to bridge semantic conditioning and anatomical plausibility in volumetric medical image synthesis. Code will be released.

cross Concept-Guided Fine-Tuning: Steering ViTs away from Spurious Correlations to Improve Robustness

Authors: Yehonatan Elisha, Oren Barkan, Noam Koenigstein

Abstract: Vision Transformers (ViTs) often degrade under distribution shifts because they rely on spurious correlations, such as background cues, rather than semantically meaningful features. Existing regularization methods, typically relying on simple foreground-background masks, which fail to capture the fine-grained semantic concepts that define an object (e.g., ``long beak'' and ``wings'' for a ``bird''). As a result, these methods provide limited robustness to distribution shifts. To address this limitation, we introduce a novel finetuning framework that steers model reasoning toward concept-level semantics. Our approach optimizes the model's internal relevance maps to align with spatially grounded concept masks. These masks are generated automatically, without manual annotation: class-relevant concepts are first proposed using an LLM-based, label-free method, and then segmented using a VLM. The finetuning objective aligns relevance with these concept regions while simultaneously suppressing focus on spurious background areas. Notably, this process requires only a minimal set of images and uses half of the dataset classes. Extensive experiments on five out-of-distribution benchmarks demonstrate that our method improves robustness across multiple ViT-based models. Furthermore, we show that the resulting relevance maps exhibit stronger alignment with semantic object parts, offering a scalable path toward more robust and interpretable vision models. Finally, we confirm that concept-guided masks provide more effective supervision for model robustness than conventional segmentation maps, supporting our central hypothesis.

cross Human-AI Divergence in Ego-centric Action Recognition under Spatial and Spatiotemporal Manipulations

Authors: Sadegh Rahmaniboldaji, Filip Rybansky, Quoc C. Vuong, Anya C. Hurlbert, Frank Guerin, Andrew Gilbert

Abstract: Humans consistently outperform state-of-the-art AI models in action recognition, particularly in challenging real-world conditions involving low resolution, occlusion, and visual clutter. Understanding the sources of this performance gap is essential for developing more robust and human-aligned models. In this paper, we present a large-scale human-AI comparative study of egocentric action recognition using Minimal Identifiable Recognition Crops (MIRCs), defined as the smallest spatial or spatiotemporal regions sufficient for reliable human recognition. We used our previously introduced, Epic ReduAct, a systematically spatially reduced and temporally scrambled dataset derived from 36 EPIC KITCHENS videos, spanning multiple spatial reduction levels and temporal conditions. Recognition performance is evaluated using over 3,000 human participants and the Side4Video model. Our analysis combines quantitative metrics, Average Reduction Rate and Recognition Gap, with qualitative analyses of spatial (high-, mid-, and low-level visual features) and spatiotemporal factors, including a categorisation of actions into Low Temporal Actions (LTA) and High Temporal Actions (HTA). Results show that human performance exhibits sharp declines when transitioning from MIRCs to subMIRCs, reflecting a strong reliance on sparse, semantically critical cues such as hand-object interactions. In contrast, the model degrades more gradually and often relies on contextual and mid- to low-level features, sometimes even exhibiting increased confidence under spatial reduction. Temporally, humans remain robust to scrambling when key spatial cues are preserved, whereas the model often shows insensitivity to temporal disruption, revealing class-dependent temporal sensitivities.

cross EndoSERV: A Vision-based Endoluminal Robot Navigation System

Authors: Junyang Wu, Fangfang Xie, Minghui Zhang, Hanxiao Zhang, Jiayuan Sun, Yun Gu, Guang-Zhong Yang

Abstract: Robot-assisted endoluminal procedures are increasingly used for early cancer intervention. However, the intricate, narrow and tortuous pathways within the luminal anatomy pose substantial difficulties for robot navigation. Vision-based navigation offers a promising solution, but existing localization approaches are error-prone due to tissue deformation, in vivo artifacts and a lack of distinctive landmarks for consistent localization. This paper presents a novel EndoSERV localization method to address these challenges. It includes two main parts, \textit{i.e.}, \textbf{SE}gment-to-structure and \textbf{R}eal-to-\textbf{V}irtual mapping, and hence the name. For long-range and complex luminal structures, we divide them into smaller sub-segments and estimate the odometry independently. To cater for label insufficiency, an efficient transfer technique maps real image features to the virtual domain to use virtual pose ground truth. The training phases of EndoSERV include an offline pretraining to extract texture-agnostic features, and an online phase that adapts to real-world conditions. Extensive experiments based on both public and clinical datasets have been performed to demonstrate the effectiveness of the method even without any real pose labels.

cross SPD-RAG: Sub-Agent Per Document Retrieval-Augmented Generation

Authors: Yagiz Can Akay, Muhammed Yusuf Kartal, Esra Alparslan, Faruk Ortakoyluoglu, Arda Akpinar

Abstract: Answering complex, real-world queries often requires synthesizing facts scattered across vast document corpora. In these settings, standard retrieval-augmented generation (RAG) pipelines suffer from incomplete evidence coverage, while long-context large language models (LLMs) struggle to reason reliably over massive inputs. We introduce SPD-RAG, a hierarchical multi-agent framework for exhaustive cross-document question answering that decomposes the problem along the document axis. Each document is processed by a dedicated document-level agent operating only on its own content, enabling focused retrieval, while a coordinator dispatches tasks to relevant agents and aggregates their partial answers. Agent outputs are synthesized by merging partial answers through a token-bounded synthesis layer (which supports recursive map-reduce for massive corpora). This document-level specialization with centralized fusion improves scalability and answer quality in heterogeneous multidocument settings while yielding a modular, extensible retrieval pipeline. On the LOONG benchmark (EMNLP 2024) for long-context multi-document QA, SPD-RAG achieves an Avg Score of 58.1 (GPT-5 evaluation), outperforming Normal RAG (33.0) and Agentic RAG (32.8) while using only 38% of the API cost of a full-context baseline (68.0).

cross Detecting Fake Reviewer Groups in Dynamic Networks: An Adaptive Graph Learning Method

Authors: Jing Zhang, Ke Huang, Yao Zhang, Bin Guo, Zhiwen Yu

Abstract: The proliferation of fake reviews, often produced by organized groups, undermines consumer trust and fair competition on online platforms. These groups employ sophisticated strategies that evade traditional detection methods, particularly in cold-start scenarios involving newly launched products with sparse data. To address this, we propose the \underline{D}iversity- and \underline{S}imilarity-aware \underline{D}ynamic \underline{G}raph \underline{A}ttention-enhanced \underline{G}raph \underline{C}onvolutional \underline{N}etwork (DS-DGA-GCN), a new graph learning model for detecting fake reviewer groups. DS-DGA-GCN achieves robust detection since it focuses on the joint relationships among products, reviews, and reviewers by modeling product-review-reviewer networks. DS-DGA-GCN also achieves adaptive detection by integrating a Network Feature Scoring (NFS) system and a new dynamic graph attention mechanism. The NFS system quantifies network attributes, including neighbor diversity, network self-similarity, as a unified feature score. The dynamic graph attention mechanism improves the adaptability and computational efficiency by captures features related to temporal information, node importance, and global network structure. Extensive experiments conducted on two real-world datasets derived from Amazon and Xiaohongshu demonstrate that DS-DGA-GCN significantly outperforms state-of-the-art baselines, achieving accuracies of up to \textbf{89.8\% and 88.3\%}, respectively.

cross Electrocardiogram Classification with Transformers Using Koopman and Wavelet Features

Authors: Sucheta Ghosh, Zahra Monfared

Abstract: Electrocardiogram (ECG) analysis is vital for detecting cardiac abnormalities, yet robust automated classification is challenging due to the complexity and variability of physiological signals. In this work, we investigate transformer-based ECG classification using features derived from the Koopman operator and wavelet transforms. Two tasks are studied: (1) binary classification (Normal vs. Non-normal), and (2) four-class classification (Normal, Atrial Fibrillation, Ventricular Arrhythmia, Block). We use Extended Dynamic Mode Decomposition (EDMD) to approximate the Koopman operator. Our results show that wavelet features excel in binary classification, while Koopman features, when paired with transformers, achieve superior performance in the four-class setting. A simple hybrid of Koopman and wavelet features does not improve accuracy. However, selecting an appropriate EDMD dictionary -- specifically a radial basis function dictionary with tuned parameters -- yields significant gains, surpassing the wavelet-only baseline and the hybrid wavelet-Koopman system. We also present a Koopman-based reconstruction analysis for interpretable insights into the learned dynamics and compare against a recurrent neural network baseline. Overall, our findings demonstrate the effectiveness of Koopman-based feature learning with transformers and highlight promising directions for integrating dynamical systems theory into time-series classification.

cross Towards plausibility in time series counterfactual explanations

Authors: Marcin Kostrzewa, Krzysztof Galus, Maciej Zi\k{e}ba

Abstract: We present a new method for generating plausible counterfactual explanations for time series classification problems. The approach performs gradient-based optimization directly in the input space. To enforce plausibility, we integrate soft-DTW (dynamic time warping) alignment with $k$-nearest neighbors from the target class, which effectively encourages the generated counterfactuals to adopt a realistic temporal structure. The overall optimization objective is a multi-faceted loss function that balances key counterfactual properties. It incorporates losses for validity, sparsity, and proximity, alongside the novel soft-DTW-based plausibility component. We conduct an evaluation of our method against several strong reference approaches, measuring the key properties of the generated counterfactuals across multiple dimensions. The results demonstrate that our method achieves competitive performance in validity while significantly outperforming existing approaches in distributional alignment with the target class, indicating superior temporal realism. Furthermore, a qualitative analysis highlights the critical limitations of existing methods in preserving realistic temporal structure. This work shows that the proposed method consistently generates counterfactual explanations for time series classifiers that are not only valid but also highly plausible and consistent with temporal patterns.

cross Computational modeling of early language learning from acoustic speech and audiovisual input without linguistic priors

Authors: Okko R\"as\"anen

Abstract: Learning to understand speech appears almost effortless for typically developing infants, yet from an information-processing perspective, acquiring a language from acoustic speech is an enormous challenge. This chapter reviews recent developments in using computational models to understand early language acquisition from speech and audiovisual input. The focus is on self-supervised and visually grounded models of perceptual learning. We show how these models are becoming increasingly powerful in learning various aspects of speech without strong linguistic priors, and how many features of early language development can be explained through a shared set of learning principles-principles broadly compatible with multiple theories of language acquisition and human cognition. We also discuss how modern learning simulations are gradually becoming more realistic, both in terms of input data and in linking model behavior to empirical findings on infant language development.

cross Revealing Behavioral Plasticity in Large Language Models: A Token-Conditional Perspective

Authors: Liyuan Mao, Le Yu, Jing Zhou, Chujie Zheng, Bowen Yu, Chang Gao, Shixuan Liu, An Yang, Weinan Zhang, JunYang Lin

Abstract: In this work, we reveal that Large Language Models (LLMs) possess intrinsic behavioral plasticity-akin to chameleons adapting their coloration to environmental cues-that can be exposed through token-conditional generation and stabilized via reinforcement learning. Specifically, by conditioning generation on carefully selected token prefixes sampled from responses exhibiting desired behaviors, LLMs seamlessly adapt their behavioral modes at inference time (e.g., switching from step-by-step reasoning to direct answering) without retraining. Based on this insight, we propose Token-Conditioned Reinforcement Learning (ToCoRL), a principled framework that leverages RL to internalize this chameleon-like plasticity, transforming transient inference-time adaptations into stable and learnable behavioral patterns. ToCoRL guides exploration with token-conditional generation and keep enhancing exploitation, enabling emergence of appropriate behaviors. Extensive experiments show that ToCoRL enables precise behavioral control without capability degradation. Notably, we show that large reasoning models, while performing strongly on complex mathematics, can be effectively adapted to excel at factual question answering, which was a capability previously hindered by their step-by-step reasoning patterns.

cross A Recipe for Stable Offline Multi-agent Reinforcement Learning

Authors: Dongsu Lee, Daehee Lee, Amy Zhang

Abstract: Despite remarkable achievements in single-agent offline reinforcement learning (RL), multi-agent RL (MARL) has struggled to adopt this paradigm, largely persisting with on-policy training and self-play from scratch. One reason for this gap comes from the instability of non-linear value decomposition, leading prior works to avoid complex mixing networks in favor of linear value decomposition (e.g., VDN) with value regularization used in single-agent setups. In this work, we analyze the source of instability in non-linear value decomposition within the offline MARL setting. Our observations confirm that they induce value-scale amplification and unstable optimization. To alleviate this, we propose a simple technique, scale-invariant value normalization (SVN), that stabilizes actor-critic training without altering the Bellman fixed point. Empirically, we examine the interaction among key components of offline MARL (e.g., value decomposition, value learning, and policy extraction) and derive a practical recipe that unlocks its full potential.

cross Aligning to Illusions: Choice Blindness in Human and AI Feedback

Authors: Wenbin Wu

Abstract: Reinforcement Learning from Human Feedback (RLHF) assumes annotator preferences reflect stable internal states. We challenge this through three experiments spanning the preference pipeline. In a human choice blindness study, 91% of surreptitiously swapped preferences go undetected, extending choice blindness to third-person evaluative comparison of unfamiliar text. Testing fifteen LLM judges as potential replacements, we find detection relies on shallow text matching rather than genuine self-monitoring: removing prior reasoning from context causes blindness to surge from near-zero to over 50%, while explicit social pressure induces near-universal compliance. In a dose-response experiment across two architectures from 86M to 2B parameters, one-sixth to one-third of labels must be corrupted before the reward signal halves, yet standard pairwise accuracy remains virtually unchanged. A Best-of-N evaluation confirms this translates to downstream policy degradation: at 50% corruption, reward-guided selection produces no improvement over random sampling, while the proxy model reports monotonically increasing scores. Together, these results reveal a preference construction problem: the signal entering RLHF is shaped by elicitation context in ways that neither human metacognition, LLM self-monitoring, nor standard evaluation metrics can detect.

cross Geometrically Constrained Outlier Synthesis

Authors: Daniil Karzanov, Marcin Detyniecki

Abstract: Deep neural networks for image classification often exhibit overconfidence on out-of-distribution (OOD) samples. To address this, we introduce Geometrically Constrained Outlier Synthesis (GCOS), a training-time regularization framework aimed at improving OOD robustness during inference. GCOS addresses a limitation of prior synthesis methods by generating virtual outliers in the hidden feature space that respect the learned manifold structure of in-distribution (ID) data. The synthesis proceeds in two stages: (i) a dominant-variance subspace extracted from the training features identifies geometrically informed, off-manifold directions; (ii) a conformally-inspired shell, defined by the empirical quantiles of a nonconformity score from a calibration set, adaptively controls the synthesis magnitude to produce boundary samples. The shell ensures that generated outliers are neither trivially detectable nor indistinguishable from in-distribution data, facilitating smoother learning of robust features. This is combined with a contrastive regularization objective that promotes separability of ID and OOD samples in a chosen score space, such as Mahalanobis or energy-based. Experiments demonstrate that GCOS outperforms state-of-the-art methods using standard energy-based inference on near-OOD benchmarks, defined as tasks where outliers share the same semantic domain as in-distribution data. As an exploratory extension, the framework naturally transitions to conformal OOD inference, which translates uncertainty scores into statistically valid p-values and enables thresholds with formal error guarantees, providing a pathway toward more predictable and reliable OOD detection.

cross Human-Aware Robot Behaviour in Self-Driving Labs

Authors: Satheeshkumar Veeramani, Anna Kisil, Abigail Bentley, Hatem Fakhruldeen, Gabriella Pizzuto, Andrew I. Cooper

Abstract: Self-driving laboratories (SDLs) are rapidly transforming research in chemistry and materials science to accelerate new discoveries. Mobile robot chemists (MRCs) play a pivotal role by autonomously navigating the lab to transport samples, effectively connecting synthesis, analysis, and characterisation equipment. The instruments within an SDL are typically designed or retrofitted to be accessed by both human and robotic chemists, ensuring operational flexibility and integration between manual and automated workflows. In many scenarios, human and robotic chemists may need to use the same equipment simultaneously. Currently, MRCs rely on simple LiDAR-based obstruction detection, which forces the robot to passively wait if a human is present. This lack of situational awareness leads to unnecessary delays and inefficient coordination in time-critical automated workflows in human-robot shared labs. To address this, we present an initial study of an embodied, AI-driven perception method that facilitates proactive human-robot interaction in shared-access scenarios. Our method features a hierarchical human intention prediction model that allows the robot to distinguish between preparatory actions (waiting) and transient interactions (accessing the instrument). Our results demonstrate that the proposed approach enhances efficiency by enabling proactive human-robot interaction, streamlining coordination, and potentially increasing the efficiency of autonomous scientific labs.

cross SYNAPSE: Framework for Neuron Analysis and Perturbation in Sequence Encoding

Authors: Jes\'us S\'anchez Ochoa, Enrique Tom\'as Mart\'inez Beltr\'an, Alberto Huertas Celdr\'an

Abstract: In recent years, Artificial Intelligence has become a powerful partner for complex tasks such as data analysis, prediction, and problem-solving, yet its lack of transparency raises concerns about its reliability. In sensitive domains such as healthcare or cybersecurity, ensuring transparency, trustworthiness, and robustness is essential, since the consequences of wrong decisions or successful attacks can be severe. Prior neuron-level interpretability approaches are primarily descriptive, task-dependent, or require retraining, which limits their use as systematic, reusable tools for evaluating internal robustness across architectures and domains. To overcome these limitations, this work proposes SYNAPSE, a systematic, training-free framework for understanding and stress-testing the internal behavior of Transformer models across domains. It extracts per-layer [CLS] representations, trains a lightweight linear probe to obtain global and per-class neuron rankings, and applies forward-hook interventions during inference. This design enables controlled experiments on internal representations without altering the original model, thereby allowing weaknesses, stability patterns, and label-specific sensitivities to be measured and compared directly across tasks and architectures. Across all experiments, SYNAPSE reveals a consistent, domain-independent organization of internal representations, in which task-relevant information is encoded in broad, overlapping neuron subsets. This redundancy provides a strong degree of functional stability, while class-wise asymmetries expose heterogeneous specialization patterns and enable label-aware analysis. In contrast, small structured manipulations in weight or logit space are sufficient to redirect predictions, highlighting complementary vulnerability profiles and illustrating how SYNAPSE can guide the development of more robust Transformer models.

cross One Model Is Enough: Native Retrieval Embeddings from LLM Agent Hidden States

Authors: Bo Jiang

Abstract: LLM agents that retrieve external knowledge typically generate a search query as text, then run a separate embedding model to encode it into a vector. This two-model pipeline adds infrastructure complexity and latency, yet is redundant: the LLM already encodes the full conversational context in its hidden states. We propose equipping LLM agents with native retrieval capability by adding a lightweight projection head that maps hidden states directly into the embedding space, eliminating the need for a separate embedding model. Trained with a combination of alignment, contrastive, and rank distillation losses, our method retains 97\% of baseline retrieval quality while enabling the LLM agent to search with its own representations. Experiments on the QReCC conversational search benchmark show competitive Recall@10 and MRR@10 compared to the standard generate-then-encode pipeline, with systematic ablations confirming the contribution of each loss component.

cross A prospective clinical feasibility study of a conversational diagnostic AI in an ambulatory primary care clinic

Authors: Peter Brodeur, Jacob M. Koshy, Anil Palepu, Khaled Saab, Ava Homiar, Roma Ruparel, Charles Wu, Ryutaro Tanno, Joseph Xu, Amy Wang, David Stutz, Hannah M. Ferrera, David Barrett, Lindsey Crowley, Jihyeon Lee, Spencer E. Rittner, Ellery Wulczyn, Selena K. Zhang, Elahe Vedadi, Christine G. Kohn, Kavita Kulkarni, Vinay Kadiyala, Sara Mahdavi, Wendy Du, Jessica Williams, David Feinbloom, Renee Wong, Tao Tu, Petar Sirkovic, Alessio Orlandi, Christopher Semturs, Yun Liu, Juraj Gottweis, Dale R. Webster, Jo\"elle Barral, Katherine Chou, Pushmeet Kohli, Avinatan Hassidim, Yossi Matias, James Manyika, Rob Fields, Jonathan X. Li, Marc L. Cohen, Vivek Natarajan, Mike Schaekermann, Alan Karthikesalingam, Adam Rodman

Abstract: Large language model (LLM)-based AI systems have shown promise for patient-facing diagnostic and management conversations in simulated settings. Translating these systems into clinical practice requires assessment in real-world workflows with rigorous safety oversight. We report a prospective, single-arm feasibility study of an LLM-based conversational AI, the Articulate Medical Intelligence Explorer (AMIE), conducting clinical history taking and presentation of potential diagnoses for patients to discuss with their provider at urgent care appointments at a leading academic medical center. 100 adult patients completed an AMIE text-chat interaction up to 5 days before their appointment. We sought to assess the conversational safety and quality, patient and clinician experience, and clinical reasoning capabilities compared to primary care providers (PCPs). Human safety supervisors monitored all patient-AMIE interactions in real time and did not need to intervene to stop any consultations based on pre-defined criteria. Patients reported high satisfaction and their attitudes towards AI improved after interacting with AMIE (p < 0.001). PCPs found AMIE's output useful with a positive impact on preparedness. AMIE's differential diagnosis (DDx) included the final diagnosis, per chart review 8 weeks post-encounter, in 90% of cases, with 75% top-3 accuracy. Blinded assessment of AMIE and PCP DDx and management (Mx) plans suggested similar overall DDx and Mx plan quality, without significant differences for DDx (p = 0.6) and appropriateness and safety of Mx (p = 0.1 and 1.0, respectively). PCPs outperformed AMIE in the practicality (p = 0.003) and cost effectiveness (p = 0.004) of Mx. While further research is needed, this study demonstrates the initial feasibility, safety, and user acceptance of conversational AI in a real-world setting, representing crucial steps towards clinical translation.

cross LycheeCluster: Efficient Long-Context Inference with Structure-Aware Chunking and Hierarchical KV Indexing

Authors: Dongfang Li, Zixuan Liu, Gang Lin, Baotian Hu, Min Zhang

Abstract: The quadratic complexity of the attention mechanism and the substantial memory footprint of the Key-Value (KV) cache present severe computational and memory challenges for Large Language Models (LLMs) processing long contexts. Existing retrieval-based methods often compromise semantic integrity through fixed-size chunking and suffer from inefficient linear scanning. In this paper, we propose LycheeCluster, a novel method for efficient KV cache management. LycheeCluster preserves local semantic coherence via boundary-aware chunking and constructs a recursive hierarchical index rooted in the triangle inequality. This design transforms cache retrieval from a linear scan into a theoretically bounded, logarithmic-time pruning process, while a lazy update strategy supports efficient streaming generation. Experiments demonstrate that LycheeCluster achieves up to a 3.6x end-to-end inference speedup with negligible degradation in model performance, outperforming state-of-the-art KV cache management methods (e.g., Quest, ClusterKV). We will release our code and kernels after publication.

cross R2F: Repurposing Ray Frontiers for LLM-free Object Navigation

Authors: Francesco Argenziano, John Mark Alexis Marcelo, Michele Brienza, Abdel Hakim Drid, Emanuele Musumeci, Daniele Nardi, Domenico D. Bloisi, Vincenzo Suriani

Abstract: Zero-shot open-vocabulary object navigation has progressed rapidly with the emergence of large Vision-Language Models (VLMs) and Large Language Models (LLMs), now widely used as high-level decision-makers instead of end-to-end policies. Although effective, such systems often rely on iterative large-model queries at inference time, introducing latency and computational overhead that limit real-time deployment. To address this problem, we repurpose ray frontiers (R2F), a recently proposed frontier-based exploration paradigm, to develop an LLM-free framework for indoor open-vocabulary object navigation. While ray frontiers were originally used to bias exploration using semantic cues carried along rays, we reinterpret frontier regions as explicit, direction-conditioned semantic hypotheses that serve as navigation goals. Language-aligned features accumulated along out-of-range rays are stored sparsely at frontiers, where each region maintains multiple directional embeddings encoding plausible unseen content. In this way, navigation then reduces to embedding-based frontier scoring and goal tracking within a classical mapping and planning pipeline, eliminating iterative large-model reasoning. We further introduce R2F-VLN, a lightweight extension for free-form language instructions using syntactic parsing and relational verification without additional VLM or LLM components. Experiments in Habitat-sim and on a real robotic platform demonstrate competitive state-of-the-art zero-shot performance with real-time execution, achieving up to 6 times faster runtime than VLM-based alternatives.

cross X-AVDT: Audio-Visual Cross-Attention for Robust Deepfake Detection

Authors: Youngseo Kim, Kwan Yun, Seokhyeon Hong, Sihun Cha, Colette Suhjung Koo, Junyong Noh

Abstract: The surge of highly realistic synthetic videos produced by contemporary generative systems has significantly increased the risk of malicious use, challenging both humans and existing detectors. Against this backdrop, we take a generator-side view and observe that internal cross-attention mechanisms in these models encode fine-grained speech-motion alignment, offering useful correspondence cues for forgery detection. Building on this insight, we propose X-AVDT, a robust and generalizable deepfake detector that probes generator-internal audio-visual signals accessed via DDIM inversion to expose these cues. X-AVDT extracts two complementary signals: (i) a video composite capturing inversion-induced discrepancies, and (ii) an audio-visual cross-attention feature reflecting modality alignment enforced during generation. To enable faithful cross-generator evaluation, we further introduce MMDF, a new multimodal deepfake dataset spanning diverse manipulation types and rapidly evolving synthesis paradigms, including GANs, diffusion, and flow-matching. Extensive experiments demonstrate that X-AVDT achieves leading performance on MMDF and generalizes strongly to external benchmarks and unseen generators, outperforming existing methods with accuracy improved by 13.1%. Our findings highlight the importance of leveraging internal audio-visual consistency cues for robustness to future generators in deepfake detection.

cross Visual Self-Fulfilling Alignment: Shaping Safety-Oriented Personas via Threat-Related Images

Authors: Qishun Yang, Shu Yang, Lijie Hu, Di Wang

Abstract: Multimodal large language models (MLLMs) face safety misalignment, where visual inputs enable harmful outputs. To address this, existing methods require explicit safety labels or contrastive data; yet, threat-related concepts are concrete and visually depictable, while safety concepts, like helpfulness, are abstract and lack visual referents. Inspired by the Self-Fulfilling mechanism underlying emergent misalignment, we propose Visual Self-Fulfilling Alignment (VSFA). VSFA fine-tunes vision-language models (VLMs) on neutral VQA tasks constructed around threat-related images, without any safety labels. Through repeated exposure to threat-related visual content, models internalize the implicit semantics of vigilance and caution, shaping safety-oriented personas. Experiments across multiple VLMs and safety benchmarks demonstrate that VSFA reduces the attack success rate, improves response quality, and mitigates over-refusal while preserving general capabilities. Our work extends the self-fulfilling mechanism from text to visual modalities, offering a label-free approach to VLMs alignment.

cross First-Order Geometry, Spectral Compression, and Structural Compatibility under Bounded Computation

Authors: Changkai Li

Abstract: Optimization under structural constraints is typically analyzed through projection or penalty methods, obscuring the geometric mechanism by which constraints shape admissible dynamics. We propose an operator-theoretic formulation in which computational or feasibility limitations are encoded by self-adjoint operators defining locally reachable subspaces. In this setting, the optimal first-order improvement direction emerges as a pseudoinverse-weighted gradient, revealing how constraints induce a distorted ascent geometry. We further demonstrate that effective dynamics concentrate along dominant spectral modes, yielding a principled notion of spectral compression, and establish a compatibility principle that characterizes the existence of common admissible directions across multiple objectives. The resulting framework unifies gradient projection, spectral truncation, and multi-objective feasibility within a single geometric structure.

cross Echo2ECG: Enhancing ECG Representations with Cardiac Morphology from Multi-View Echos

Authors: Michelle Espranita Liman, \"Ozg\"un Turgut, Alexander M\"uller, Eimo Martens, Daniel Rueckert, Philip M\"uller

Abstract: Electrocardiography (ECG) is a low-cost, widely used modality for diagnosing electrical abnormalities like atrial fibrillation by capturing the heart's electrical activity. However, it cannot directly measure cardiac morphological phenotypes, such as left ventricular ejection fraction (LVEF), which typically require echocardiography (Echo). Predicting these phenotypes from ECG would enable early, accessible health screening. Existing self-supervised methods suffer from a representational mismatch by aligning ECGs to single-view Echos, which only capture local, spatially restricted anatomical snapshots. To address this, we propose Echo2ECG, a multimodal self-supervised learning framework that enriches ECG representations with the heart's morphological structure captured in multi-view Echos. We evaluate Echo2ECG as an ECG feature extractor on two clinically relevant tasks that fundamentally require morphological information: (1) classification of structural cardiac phenotypes across three datasets, and (2) retrieval of Echo studies with similar morphological characteristics using ECG queries. Our extracted ECG representations consistently outperform those of state-of-the-art unimodal and multimodal baselines across both tasks, despite being 18x smaller than the largest baseline. These results demonstrate that Echo2ECG is a robust, powerful ECG feature extractor. Our code is accessible at https://github.com/michelleespranita/Echo2ECG.

URLs: https://github.com/michelleespranita/Echo2ECG.

cross Oracle-Guided Soft Shielding for Safe Move Prediction in Chess

Authors: Prajit T Rajendran, Fabio Arnez, Huascar Espinoza, Agnes Delaborde, Chokri Mraidha

Abstract: In high stakes environments, agents relying purely on imitation learning or reinforcement learning often struggle to avoid safety-critical errors during exploration. Existing reinforcement learning approaches for environments such as chess require hundreds of thousands of episodes and substantial computational resources to converge. Imitation learning, on the other hand, is more sample efficient but is brittle under distributional shift and lacks mechanisms for proactive risk avoidance. In this work, we propose Oracle-Guided Soft Shielding (OGSS), a simple yet effective framework for safer decision-making, enabling safe exploration by learning a probabilistic safety model from oracle feedback in an imitation learning setting. Focusing on the domain of chess, we train a model to predict strong moves based on past games, and separately learn a blunder prediction model from Stockfish evaluations to estimate the tactical risk of each move. During inference, the agent first generates a set of candidate moves and then uses the blunder model to determine high-risk options, and uses a utility function combining the predicted move likelihood from the policy model and the blunder probability to select actions that strike a balance between performance and safety. This enables the agent to explore and play competitively while significantly reducing the chance of tactical mistakes. Across hundreds of games against a strong chess engine, we compare our approach with other methods in the literature, such as action pruning, SafeDAgger, and uncertainty-based sampling. Our results demonstrate that OGSS variants maintain a lower blunder rate even as the agent's exploration ratio is increased by several folds, highlighting its ability to support broader exploration without compromising tactical soundness.

cross Beyond Hungarian: Match-Free Supervision for End-to-End Object Detection

Authors: Shoumeng Qiu, Xinrun Li, Yang Long

Abstract: Recent DEtection TRansformer (DETR) based frameworks have achieved remarkable success in end-to-end object detection. However, the reliance on the Hungarian algorithm for bipartite matching between queries and ground truths introduces computational overhead and complicates the training dynamics. In this paper, we propose a novel matching-free training scheme for DETR-based detectors that eliminates the need for explicit heuristic matching. At the core of our approach is a dedicated Cross-Attention-based Query Selection (CAQS) module. Instead of discrete assignment, we utilize encoded ground-truth information to probe the decoder queries through a cross-attention mechanism. By minimizing the weighted error between the queried results and the ground truths, the model autonomously learns the implicit correspondences between object queries and specific targets. This learned relationship further provides supervision signals for the learning of queries. Experimental results demonstrate that our proposed method bypasses the traditional matching process, significantly enhancing training efficiency, reducing the matching latency by over 50\%, effectively eliminating the discrete matching bottleneck through differentiable correspondence learning, and also achieving superior performance compared to existing state-of-the-art methods.

cross Towards Effective and Efficient Graph Alignment without Supervision

Authors: Songyang Chen, Youfang Lin, Yu Liu, Shuai Zheng, Lei Zou

Abstract: Unsupervised graph alignment aims to find the node correspondence across different graphs without any anchor node pairs. Despite the recent efforts utilizing deep learning-based techniques, such as the embedding and optimal transport (OT)-based approaches, we observe their limitations in terms of model accuracy-efficiency tradeoff. By focusing on the exploitation of local and global graph information, we formalize them as the ``local representation, global alignment'' paradigm, and present a new ``global representation and alignment'' paradigm to resolve the mismatch between the two phases in the alignment process. We then propose \underline{Gl}obal representation and \underline{o}ptimal transport-\underline{b}ased \underline{Align}ment (\texttt{GlobAlign}), and its variant, \texttt{GlobAlign-E}, for better \underline{E}fficiency. Our methods are equipped with the global attention mechanism and a hierarchical cross-graph transport cost, able to capture long-range and implicit node dependencies beyond the local graph structure. Furthermore, \texttt{GlobAlign-E} successfully closes the time complexity gap between representative embedding and OT-based methods, reducing OT's cubic complexity to quadratic terms. Through extensive experiments, our methods demonstrate superior performance, with up to a 20\% accuracy improvement over the best competitor. Meanwhile, \texttt{GlobAlign-E} achieves the best efficiency, with an order of magnitude speedup against existing OT-based methods.

cross OSS-CRS: Liberating AIxCC Cyber Reasoning Systems for Real-World Open-Source Security

Authors: Andrew Chin, Dongkwan Kim, Yu-Fu Fu, Fabian Fleischer, Youngjoon Kim, HyungSeok Han, Cen Zhang, Brian Junekyu Lee, Hanqing Zhao, Taesoo Kim

Abstract: DARPA's AI Cyber Challenge (AIxCC) showed that cyber reasoning systems (CRSs) can go beyond vulnerability discovery to autonomously confirm and patch bugs: seven teams built such systems and open-sourced them after the competition. Yet all seven open-sourced CRSs remain largely unusable outside their original teams, each bound to the competition cloud infrastructure that no longer exists. We present OSS-CRS, an open, locally deployable framework for running and combining CRS techniques against real-world open-source projects, with budget-aware resource management. We ported the first-place system (Atlantis) and discovered 10 previously unknown bugs (three of high severity) across 8 OSS-Fuzz projects. OSS-CRS is publicly available.

cross MetaWorld-X: Hierarchical World Modeling via VLM-Orchestrated Experts for Humanoid Loco-Manipulation

Authors: Yutong Shen, Hangxu Liu, Penghui Liu, Jiashuo Luo, Yongkang Zhang, Rex Morvley, Chen Jiang, Jianwei Zhang, Lei Zhang

Abstract: Learning natural, stable, and compositionally generalizable whole-body control policies for humanoid robots performing simultaneous locomotion and manipulation (loco-manipulation) remains a fundamental challenge in robotics. Existing reinforcement learning approaches typically rely on a single monolithic policy to acquire multiple skills, which often leads to cross-skill gradient interference and motion pattern conflicts in high-degree-of-freedom systems. As a result, generated behaviors frequently exhibit unnatural movements, limited stability, and poor generalization to complex task compositions. To address these limitations, we propose MetaWorld-X, a hierarchical world model framework for humanoid control. Guided by a divide-and-conquer principle, our method decomposes complex control problems into a set of specialized expert policies (Specialized Expert Policies, SEP). Each expert is trained under human motion priors through imitation-constrained reinforcement learning, introducing biomechanically consistent inductive biases that ensure natural and physically plausible motion generation. Building upon this foundation, we further develop an Intelligent Routing Mechanism (IRM) supervised by a Vision-Language Model (VLM), enabling semantic-driven expert composition. The VLM-guided router dynamically integrates expert policies according to high-level task semantics, facilitating compositional generalization and adaptive execution in multi-stage loco-manipulation tasks.

cross Towards Batch-to-Streaming Deep Reinforcement Learning for Continuous Control

Authors: Riccardo De Monte, Matteo Cederle, Gian Antonio Susto

Abstract: State-of-the-art deep reinforcement learning (RL) methods have achieved remarkable performance in continuous control tasks, yet their computational complexity is often incompatible with the constraints of resource-limited hardware, due to their reliance on replay buffers, batch updates, and target networks. The emerging paradigm of streaming deep RL addresses this limitation through purely online updates, achieving strong empirical performance on standard benchmarks. In this work, we propose two novel streaming deep RL algorithms, Streaming Soft Actor-Critic (S2AC) and Streaming Deterministic Actor-Critic (SDAC), explicitly designed to be compatible with state-of-the-art batch RL methods, making them particularly suitable for on-device finetuning applications such as Sim2Real transfer. Both algorithms achieve performance comparable to state-of-the-art streaming baselines on standard benchmarks without requiring tedious hyperparameter tuning. Finally, we further investigate the practical challenges of transitioning from batch to streaming learning during finetuning and propose concrete strategies to tackle them.

cross Don't Look Back in Anger: MAGIC Net for Streaming Continual Learning with Temporal Dependence

Authors: Federico Giannini, Sandro D'Andrea, Emanuele Della Valle

Abstract: Concept drift, temporal dependence, and catastrophic forgetting represent major challenges when learning from data streams. While Streaming Machine Learning and Continual Learning (CL) address these issues separately, recent efforts in Streaming Continual Learning (SCL) aim to unify them. In this work, we introduce MAGIC Net, a novel SCL approach that integrates CL-inspired architectural strategies with recurrent neural networks to tame temporal dependence. MAGIC Net continuously learns, looks back at past knowledge by applying learnable masks over frozen weights, and expands its architecture when necessary. It performs all operations online, ensuring inference availability at all times. Experiments on synthetic and real-world streams show that it improves adaptation to new concepts, limits memory usage, and mitigates forgetting.

cross Weakly Supervised Teacher-Student Framework with Progressive Pseudo-mask Refinement for Gland Segmentation

Authors: Hikmat Khan, Wei Chen, Muhammad Khalid Khan Niazi

Abstract: Background and objectives: Colorectal cancer histopathological grading depends on accurate segmentation of glandular structures. Current deep learning approaches rely on large scale pixel level annotations that are labor intensive and difficult to obtain in routine clinical practice. Weakly supervised semantic segmentation offers a promising alternative. However, class activation map based methods often produce incomplete pseudo masks that emphasize highly discriminative regions and fail to supervise unannotated glandular structures. We propose a weakly supervised teacher student framework that leverages sparse pathologist annotations and an Exponential Moving Average stabilized teacher network to generate refined pseudo masks. Methods: The framework integrates confidence based filtering, adaptive fusion of teacher predictions with limited ground truth, and curriculum guided refinement to progressively segment unannotated glandular regions. The method was evaluated on an institutional colorectal cancer cohort from The Ohio State University Wexner Medical Center consisting of 60 hematoxylin and eosin stained whole slide images and on public datasets including the Gland Segmentation dataset, TCGA COAD, TCGA READ, and SPIDER. Results: On the Gland Segmentation dataset the framework achieved a mean Intersection over Union of 80.10 and a mean Dice coefficient of 89.10. Cross cohort evaluation demonstrated robust generalization on TCGA COAD and TCGA READ without additional annotations, while reduced performance on SPIDER reflected domain shift. Conclusions: The proposed framework provides an annotation efficient and generalizable approach for gland segmentation in colorectal histopathology.

cross UNBOX: Unveiling Black-box visual models with Natural-language

Authors: Simone Carnemolla, Chiara Russo, Simone Palazzo, Quentin Bouniot, Daniela Giordano, Zeynep Akata, Matteo Pennisi, Concetto Spampinato

Abstract: Ensuring trustworthiness in open-world visual recognition requires models that are interpretable, fair, and robust to distribution shifts. Yet modern vision systems are increasingly deployed as proprietary black-box APIs, exposing only output probabilities and hiding architecture, parameters, gradients, and training data. This opacity prevents meaningful auditing, bias detection, and failure analysis. Existing explanation methods assume white- or gray-box access or knowledge of the training distribution, making them unusable in these real-world settings. We introduce UNBOX, a framework for class-wise model dissection under fully data-free, gradient-free, and backpropagation-free constraints. UNBOX leverages Large Language Models and text-to-image diffusion models to recast activation maximization as a purely semantic search driven by output probabilities. The method produces human-interpretable text descriptors that maximally activate each class, revealing the concepts a model has implicitly learned, the training distribution it reflects, and potential sources of bias. We evaluate UNBOX on ImageNet-1K, Waterbirds, and CelebA through semantic fidelity tests, visual-feature correlation analyses and slice-discovery auditing. Despite operating under the strictest black-box constraints, UNBOX performs competitively with state-of-the-art white-box interpretability methods. This demonstrates that meaningful insight into a model's internal reasoning can be recovered without any internal access, enabling more trustworthy and accountable visual recognition systems.

cross PostTrainBench: Can LLM Agents Automate LLM Post-Training?

Authors: Ben Rank, Hardik Bhatnagar, Ameya Prabhu, Shira Eisenberg, Karina Nguyen, Matthias Bethge, Maksym Andriushchenko

Abstract: AI agents have become surprisingly proficient at software engineering over the past year, largely due to improvements in reasoning capabilities. This raises a deeper question: can these systems extend their capabilities to automate AI research itself? In this paper, we explore post-training, the critical phase that turns base LLMs into useful assistants. We introduce PostTrainBench to benchmark how well LLM agents can perform post-training autonomously under bounded compute constraints (10 hours on one H100 GPU). We ask frontier agents (e.g., Claude Code with Opus 4.6) to optimize the performance of a base LLM on a particular benchmark (e.g., Qwen3-4B on AIME). Importantly, we do not provide any predefined strategies to the agents and instead give them full autonomy to find necessary information on the web, run experiments, and curate data. We find that frontier agents make substantial progress but generally lag behind instruction-tuned LLMs from leading providers: 23.2% for the best agent vs. 51.1% for official instruction-tuned models. However, agents can exceed instruction-tuned models in targeted scenarios: GPT-5.1 Codex Max achieves 89% on BFCL with Gemma-3-4B vs. 67% for the official model. We also observe several failure modes worth flagging. Agents sometimes engage in reward hacking: training on the test set, downloading existing instruction-tuned checkpoints instead of training their own, and using API keys they find to generate synthetic data without authorization. These behaviors are concerning and highlight the importance of careful sandboxing as these systems become more capable. Overall, we hope PostTrainBench will be useful for tracking progress in AI R&D automation and for studying the risks that come with it. Website and code are available at https://posttrainbench.com/.

URLs: https://posttrainbench.com/.

cross A New Lower Bound for the Random Offerer Mechanism in Bilateral Trade using AI-Guided Evolutionary Search

Authors: Yang Cai, Vineet Gupta, Zun Li, Aranyak Mehta

Abstract: The celebrated Myerson--Satterthwaite theorem shows that in bilateral trade, no mechanism can be simultaneously fully efficient, Bayesian incentive compatible (BIC), and budget balanced (BB). This naturally raises the question of how closely the gains from trade (GFT) achievable by a BIC and BB mechanism can approximate the first-best (fully efficient) benchmark. The optimal BIC and BB mechanism is typically complex and highly distribution-dependent, making it difficult to characterize directly. Consequently, much of the literature analyzes simpler mechanisms such as the Random-Offerer (RO) mechanism and establishes constant-factor guarantees relative to the first-best GFT. An important open question concerns the worst-case performance of the RO mechanism relative to first-best (FB) efficiency. While it was originally hypothesized that the approximation ratio $\frac{\text{GFT}_{\text{FB}}}{\text{GFT}_{\text{RO}}}$ is bounded by $2$, recent work provided counterexamples to this conjecture: Cai et al. proved that the ratio can be strictly larger than $2$, and Babaioff et al. exhibited an explicit example with ratio approximately $2.02$. In this work, we employ AlphaEvolve, an AI-guided evolutionary search framework, to explore the space of value distributions. We identify a new worst-case instance that yields an improved lower bound of $\frac{\text{GFT}_{\text{FB}}}{\text{GFT}_{\text{RO}}} \ge \textbf{2.0749}$. This establishes a new lower bound on the worst-case performance of the Random-Offerer mechanism, demonstrating a wider efficiency gap than previously known.

cross Benchmarking Language Modeling for Lossless Compression of Full-Fidelity Audio

Authors: Phillip Long, Zachary Novack, Chris Donahue

Abstract: Autoregressive "language" models (LMs) trained on raw waveforms can be repurposed for lossless audio compression, but prior work is limited to 8-bit audio, leaving open whether such approaches work for practical settings (16/24-bit) and can compete with existing codecs. We benchmark LM-based compression on full-fidelity audio across diverse domains (music, speech, bioacoustics), sampling rates (16kHz-48kHz), and bit depths (8, 16, 24-bit). Standard sample-level tokenization becomes intractable at higher bit depths due to vocabulary size (65K for 16-bit; 16.7M for 24-bit). We propose Trilobyte, a byte-level tokenization schema for full resolution audio, improving vocabulary scaling from $O(2^{b})$ to $O(1)$ and enabling the first tractable 24-bit LM-based lossless compression. While LMs consistently outperform FLAC and yield state-of-the-art compression at 8-bit and 16-bit, we observe that compression gains become more modest as bit depth increases beyond 8-bit.

cross Split Federated Learning Architectures for High-Accuracy and Low-Delay Model Training

Authors: Yiannis Papageorgiou, Yannis Thomas, Ramin Khalili, Iordanis Koutsopoulos

Abstract: Can we find a network architecture for ML model training so as to optimize training loss (and thus, accuracy) in Split Federated Learning (SFL)? And can this architecture also reduce training delay and communication overhead? While accuracy is not influenced by how we split the model in ordinary, state-of-the-art SFL, in this work we answer the questions above in the affirmative. Recent Hierarchical SFL (HSFL) architectures adopt a three-tier training structure consisting of clients, (local) aggregators, and a central server. In this architecture, the model is partitioned at two partitioning layers into three sub-models, which are executed across the three tiers. Despite their merits, HSFL architectures overlook the impact of the partitioning layers and client-to-aggregator assignments on accuracy, delay, and overhead. This work explicitly captures the impact of the partitioning layers and client-to-aggregator assignments on accuracy, delay and overhead by formulating a joint optimization problem. We prove that the problem is NP-hard and propose the first accuracy-aware heuristic algorithm that explicitly accounts for model accuracy, while remaining delay-efficient. Simulation results on public datasets show that our approach can improve accuracy by 3%, while reducing delay by 20% and overhead by 50%, compared to state-of-the-art SFL and HSFL schemes.

cross Scale Space Diffusion

Authors: Soumik Mukhopadhyay, Prateksha Udhayanan, Abhinav Shrivastava

Abstract: Diffusion models degrade images through noise, and reversing this process reveals an information hierarchy across timesteps. Scale-space theory exhibits a similar hierarchy via low-pass filtering. We formalize this connection and show that highly noisy diffusion states contain no more information than small, downsampled images - raising the question of why they must be processed at full resolution. To address this, we fuse scale spaces into the diffusion process by formulating a family of diffusion models with generalized linear degradations and practical implementations. Using downsampling as the degradation yields our proposed Scale Space Diffusion. To support Scale Space Diffusion, we introduce Flexi-UNet, a UNet variant that performs resolution-preserving and resolution-increasing denoising using only the necessary parts of the network. We evaluate our framework on CelebA and ImageNet and analyze its scaling behavior across resolutions and network depths. Our project website ( https://prateksha.github.io/projects/scale-space-diffusion/ ) is available publicly.

URLs: https://prateksha.github.io/projects/scale-space-diffusion/

replace An Embedding-based Approach to Inconsistency-tolerant Reasoning with Inconsistent Ontologies

Authors: Keyu Wang, Site Li, Jiaye Li, Guilin Qi, Qiu Ji

Abstract: Inconsistency handling is an important issue in knowledge management. Especially in ontology engineering, logical inconsistencies may occur during ontology construction. A natural way to reason with an inconsistent ontology is to utilize the maximal consistent subsets of the ontology. However, previous studies on selecting maximum consistent subsets have rarely considered the semantics of the axioms, which may result in irrational inference. In this paper, we propose a novel approach to reasoning with inconsistent ontologies in description logics based on the embeddings of axioms. We first give a method for turning axioms into distributed semantic vectors to compute the semantic connections between the axioms. We then define an embedding-based method for selecting the maximum consistent subsets and use it to define an inconsistency-tolerant inference relation. We show the rationality of our inference relation by considering some logical properties. Finally, we conduct experiments on several ontologies to evaluate the reasoning power of our inference relation. The experimental results show that our embedding-based method can outperform existing inconsistency-tolerant reasoning methods based on maximal consistent subsets.

replace Parallelized Planning-Acting for Efficient LLM-based Multi-Agent Systems in Minecraft

Authors: Yaoru Li, Shunyu Liu, Tongya Zheng, Li Sun, Mingli Song

Abstract: Recent advancements in Large Language Model~(LLM)-based Multi-Agent Systems (MAS) have demonstrated remarkable potential for tackling complex decision-making tasks. However, existing frameworks inevitably rely on serialized execution paradigms, where agents must complete sequential LLM planning before taking action. This fundamental constraint severely limits real-time responsiveness and adaptation, which is crucial in dynamic environments with ever-changing scenarios like Minecraft. In this paper, we propose a novel parallelized planning-acting framework for LLM-based MAS, featuring a dual-thread architecture with interruptible execution to enable concurrent planning and acting. Specifically, our framework comprises two core threads: (1) a planning thread driven by a centralized memory system, maintaining synchronization of environmental states and agent communication to support dynamic decision-making; and (2) an acting thread equipped with a comprehensive skill library, enabling automated task execution through recursive decomposition. Extensive experiments on Minecraft demonstrate the effectiveness of the proposed framework.

replace Engineering Systems for Data Analysis Using Interactive Structured Inductive Programming

Authors: Shraddha Surana, Ashwin Srinivasan, Michael Bain

Abstract: Engineering information systems for scientific data analysis presents significant challenges: complex workflows requiring exploration of large solution spaces, close collaboration with domain specialists, and the need for maintainable, interpretable implementations. Traditional manual development is time-consuming, while "No Code" approaches using large language models (LLMs) often produce unreliable systems. We present iProg, a tool implementing Interactive Structured Inductive Programming. iProg employs a variant of a '2-way Intelligibility' communication protocol to constrain collaborative system construction by a human and an LLM. Specifically, given a natural-language description of the overall data analysis task, iProg uses an LLM to first identify an appropriate decomposition of the problem into a declarative representation, expressed as a Data Flow Diagram (DFD). In a second phase, iProg then uses an LLM to generate code for each DFD process. In both stages, human feedback, mediated through the constructs provided by the communication protocol, is used to verify LLMs' outputs. We evaluate iProg extensively on two published scientific collaborations (astrophysics and biochemistry), demonstrating that it is possible to identify appropriate system decompositions and construct end-to-end information systems with better performance, higher code quality, and order-of-magnitude faster development compared to Low Code/No Code alternatives. The tool is available at: https://shraddhasurana.github.io/dhaani/

URLs: https://shraddhasurana.github.io/dhaani/

replace From LLM Reasoning to Autonomous AI Agents: A Comprehensive Review

Authors: Mohamed Amine Ferrag, Norbert Tihanyi, Merouane Debbah

Abstract: Large language models and autonomous AI agents have evolved rapidly, resulting in a diverse array of evaluation benchmarks, frameworks, and collaboration protocols. Driven by the growing need for standardized evaluation and integration, we systematically consolidate these fragmented efforts into a unified framework. However, the landscape remains fragmented and lacks a unified taxonomy or comprehensive survey. Therefore, we present a side-by-side comparison of benchmarks developed between 2019 and 2025 that evaluate these models and agents across multiple domains. In addition, we propose a taxonomy of approximately 60 benchmarks that cover general and academic knowledge reasoning, mathematical problem-solving, code generation and software engineering, factual grounding and retrieval, domain-specific evaluations, multimodal and embodied tasks, task orchestration, and interactive assessments. Furthermore, we review AI-agent frameworks introduced between 2023 and 2025 that integrate large language models with modular toolkits to enable autonomous decision-making and multi-step reasoning. Moreover, we present real-world applications of autonomous AI agents in materials science, biomedical research, academic ideation, software engineering, synthetic data generation, chemical reasoning, mathematical problem-solving, geographic information systems, multimedia, healthcare, and finance. We then survey key agent-to-agent collaboration protocols, namely the Agent Communication Protocol (ACP), the Model Context Protocol (MCP), and the Agent-to-Agent Protocol (A2A). Finally, we discuss recommendations for future research, focusing on advanced reasoning strategies, failure modes in multi-agent LLM systems, automated scientific discovery, dynamic tool integration via reinforcement learning, integrated search capabilities, and security vulnerabilities in agent protocols.

replace Precision Proactivity: Measuring Cognitive Load in Real-World AI-Assisted Work

Authors: Brandon Lepine, Juho Kim, Pamela Mishkin, Matthew Beane

Abstract: Systems like ChatGPT and Claude assist billions through proactive dialogue-offering unsolicited, task-relevant information. Drawing on Cognitive Load Theory, we study how cognitive load shapes performance in AI-assisted knowledge work. We recruited 34 financial professionals to complete a complex valuation task using GPT-4o and developed a transcript-based framework estimating intrinsic and extraneous load from computational indicators anchored in a task decomposition and knowledge graph. Across 1,178 participant-subtask observations, AI-generated content usage is positively associated with quality, while extraneous load shows the largest negative association-roughly three times that of intrinsic load. Mediation reveals a compensatory pathway partially offsetting but not eliminating load-related deficits. Extraneous load persists within speakers and spills asymmetrically to model responses. Model-initiated task switching is the strongest predictor of decline. Expertise moderates these dynamics: less experienced professionals face larger penalties and derive greater marginal gains from AI-generated content, yet are not those who most increase uptake under load.

replace Stronger Enforcement of Instruction Hierarchy via Augmented Intermediate Representations

Authors: Sanjay Kariyappa, G. Edward Suh

Abstract: Prompt injection attacks are a critical security vulnerability in large language models (LLMs), allowing attackers to hijack model behavior by injecting malicious instructions within the input context. Recent defense mechanisms have leveraged an Instruction Hierarchy (IH) Signal, often implemented through special delimiter tokens or additive embeddings to denote the privilege level of input tokens. However, these prior works typically inject the IH signal exclusively at the initial input layer, which we hypothesize limits its ability to effectively distinguish the privilege levels of tokens as it propagates through the different layers of the model. To overcome this limitation, we introduce a novel approach that injects the IH signal into the intermediate token representations within the network. Our method augments these representations with layer-specific trainable embeddings that encode the privilege information. Our evaluations across multiple models and training methods reveal that our proposal yields between $1.6\times$ and $9.2\times$ reduction in attack success rate on gradient-based prompt injection attacks compared to state-of-the-art methods, without significantly degrading the model's utility.

replace MMTU: A Massive Multi-Task Table Understanding and Reasoning Benchmark

Authors: Junjie Xing, Yeye He, Mengyu Zhou, Haoyu Dong, Shi Han, Lingjiao Chen, Dongmei Zhang, Surajit Chaudhuri, H. V. Jagadish

Abstract: Tables and table-based use cases play a crucial role in many important real-world applications, such as spreadsheets, databases, and computational notebooks, which traditionally require expert-level users like data engineers, data analysts, and database administrators to operate. Although LLMs have shown remarkable progress in working with tables (e.g., in spreadsheet and database copilot scenarios), comprehensive benchmarking of such capabilities remains limited. In contrast to an extensive and growing list of NLP benchmarks, evaluations of table-related tasks are scarce, and narrowly focus on tasks like NL-to-SQL and Table-QA, overlooking the broader spectrum of real-world tasks that professional users face. This gap limits our understanding and model progress in this important area. In this work, we introduce MMTU, a large-scale benchmark with over 28K questions across 25 real-world table tasks, designed to comprehensively evaluate models ability to understand, reason, and manipulate real tables at the expert-level. These tasks are drawn from decades' worth of computer science research on tabular data, with a focus on complex table tasks faced by professional users. We show that MMTU require a combination of skills -- including table understanding, reasoning, and coding -- that remain challenging for today's frontier models, where even frontier reasoning models like OpenAI GPT-5 and DeepSeek R1 score only around 69\% and 57\% respectively, suggesting significant room for improvement. We highlight key findings in our evaluation using MMTU and hope that this benchmark drives further advances in understanding and developing foundation models for structured data processing and analysis. Our code and data are available at https://github.com/MMTU-Benchmark/MMTU and https://huggingface.co/datasets/MMTU-benchmark/MMTU.

URLs: https://github.com/MMTU-Benchmark/MMTU, https://huggingface.co/datasets/MMTU-benchmark/MMTU.

replace Let's Think in Two Steps: Mitigating Agreement Bias in MLLMs with Self-Grounded Verification

Authors: Moises Andrade, Joonhyuk Cha, Brandon Ho, Vriksha Srihari, Karmesh Yadav, Zsolt Kira

Abstract: Verifiers--functions assigning rewards to agent behavior--have been key to AI progress in math, code, and games. However, extending gains to domains without clear-cut success criteria remains a challenge: while humans can recognize desired outcomes, translating this intuition into scalable rules is nontrivial. Multimodal LLMs (MLLMs) offer a promising solution, given their world knowledge, human-preference alignment, and reasoning capabilities. We evaluate MLLM verifiers across web navigation, computer use, and robotics, spanning 13+ models, 28+ designs, and thousands of trajectories from diverse agents. We identify a critical limitation: a strong tendency for MLLMs to over-validate agent behavior--a phenomenon we term agreement bias. This bias is pervasive, resilient to test-time scaling, and can harm applications relying on MLLM judgments/rewards (e.g., self-improvement, steering, online supervision). We discuss several considerations for evaluating and designing MLLM verifiers, and introduce SGV, a lightweight method that better leverages their capabilities by modulating (un)conditional generation. First, an MLLM is elicited to generate broad priors about desired behavior, independent of the data under evaluation. Then, conditioned on self-generated priors, it reasons over and evaluates a candidate trajectory. Our methods yield more human-aligned verifiers, improving failure detection by 25pp and accuracy by 14pp. In self-improvement and online supervision, they boost task completion of a GUI specialist in OSWorld, a diffusion policy in robomimic, and a ReAct agent in VisualWebArena--surpassing the previous state of the art by 20pp. As a byproduct, we release an update of VisualWebArena featuring strong agent baselines, more human-aligned oracles, container parallelism with high fidelity and proper resets, >10x speedups, and VWA-Lite, a 1/3 subset with comparable evaluation fidelity.

replace UniCast: A Unified Framework for Instance-Conditioned Multimodal Time-Series Forecasting

Authors: Sehyuk Park, Soyeon Caren Han, Eduard Hovy

Abstract: Time series forecasting underpins applications in finance, healthcare, and environmental monitoring. Despite the success of Time Series Foundation Models (TSFMs), existing approaches operate in a unimodal setting and rely on static prompts or fixed fusion schemes, limiting their ability to exploit multimodal context and adapt to instance-level variation. We propose UniCast, a parameter-efficient multimodal framework that extends TSFMs through instance conditioned prompting and dynamic modality routing. UniCast infers a conditional prompt from time series, vision, and text inputs via a Transformer-based contextual distiller, enabling input-specific adaptation without updating the forecasting backbone. To regulate how auxiliary modalities influence predictions, UniCast employs Modality Routing, a cross-attention mechanism that estimates modality relevance given the current temporal state and selectively amplifies informative signals while suppressing noise. Integrated with a frozen TSFM via soft prompt tuning, UniCast preserves foundation-level generalization while enabling effective multimodal control. Extensive experiments across diverse forecasting benchmarks show that UniCast consistently outperforms all existing TSFM baselines, demonstrating that instance-conditioned multimodal control is critical for next-generation time series forecasting.

replace Synthetic Homes: An Accessible Multimodal Pipeline for Producing Residential Building Data with Generative AI

Authors: Jackson Eshbaugh, Chetan Tiwari, Jorge Silveyra

Abstract: Computational models have emerged as powerful tools for multi-scale energy modeling research at the building level as well as urban scale. However, these models require a plethora of data on building parameters, some of which can be inaccessible, expensive, or can raise privacy concerns. We introduce a modular multimodal framework to synthetically produce this data from publicly accessible images and residential information using generative Artificial Intelligence (AI). Additionally, we provide a modeling pipeline demonstrating this framework and we evaluate its generative AI components for realism. Our experiments show that our framework's use of AI avoids common issues with generative models and produces realistic multimodal data at the building scale. Resulting datasets can be used for assessing influence of energy efficiency upgrades at the building scale, as well as to simulate larger patterns of energy consumption across regions. This work will support research in building and energy simulation by reducing dependence on costly or restricted data sources, and pave a path towards more accessible research in Machine Learning (ML) and other data-driven disciplines.

replace MICA: Multi-Agent Industrial Coordination Assistant

Authors: Di Wen, Kunyu Peng, Junwei Zheng, Yufan Chen, Yitian Shi, Jiale Wei, Ruiping Liu, Kailun Yang, Rainer Stiefelhagen

Abstract: Industrial workflows demand adaptive and trustworthy assistance that can operate under limited computing, connectivity, and strict privacy constraints. In this work, we present MICA (Multi-Agent Industrial Coordination Assistant), a perception-grounded and speech-interactive system that delivers real-time guidance for assembly, troubleshooting, part queries, and maintenance. MICA coordinates five role-specialized language agents, audited by a safety checker, to ensure accurate and compliant support. To achieve robust step understanding, we introduce Adaptive Step Fusion (ASF), which dynamically blends expert reasoning with online adaptation from natural speech feedback. Furthermore, we establish a new multi-agent coordination benchmark across representative task categories and propose evaluation metrics tailored to industrial assistance, enabling systematic comparison of different coordination topologies. Our experiments demonstrate that MICA consistently improves task success, reliability, and responsiveness over baseline structures, while remaining deployable on practical offline hardware. Together, these contributions highlight MICA as a step toward deployable, privacy-preserving multi-agent assistants for dynamic factory environments. The source code will be made publicly available at https://github.com/Kratos-Wen/MICA.

URLs: https://github.com/Kratos-Wen/MICA.

replace Linear probes rely on textual evidence: Results from leakage mitigation studies in language models

Authors: Gerard Boxo, Aman Neelappa, Shivam Raval

Abstract: White-box monitors are a popular technique for detecting potentially harmful behaviours in language models. While they perform well in general, their effectiveness in detecting text-ambiguous behaviour is disputed. In this work, we find evidence that removing textual evidence of a behaviour significantly decreases probe performance. The AUROC reduction ranges from $10$- to $30$-point depending on the setting. We evaluate probe monitors across three setups (Sandbagging, Sycophancy, and Bias), finding that when probes rely on textual evidence of the target behaviour (such as system prompts or CoT reasoning), performance degrades once these tokens are filtered. This filtering procedure is standard practice for output monitor evaluation. As further evidence of this phenomenon, we train Model Organisms which produce outputs without any behaviour verbalisations. We validate that probe performance on Model Organisms is substantially lower than unfiltered evaluations: $0.57$ vs $0.74$ AUROC for Bias, and $0.57$ vs $0.94$ AUROC for Sandbagging. Our findings suggest that linear probes may be brittle in scenarios where they must detect non-surface-level patterns.

replace Towards Strategic Persuasion with Language Models

Authors: Zirui Cheng, Jiaxuan You

Abstract: Large language models (LLMs) have demonstrated strong persuasive capabilities comparable to those of humans, offering promising benefits while raising societal concerns. However, systematically evaluating the persuasive capabilities of LLMs is inherently challenging, as the effectiveness of persuasion among humans varies significantly across different domains. In this paper, we take a theory-driven approach to provide a scalable and principled framework for studying the persuasive capabilities of LLMs. Grounded in Bayesian persuasion theory, we repurpose human-human persuasion datasets to construct environments for evaluating and training LLMs as strategic persuaders. Our results reveal that frontier models can consistently achieve high persuasion gains and exhibit sophisticated persuasion strategies that align with theoretical characterizations. Building on this, we use reinforcement learning to train LLMs for strategic persuasion in our environments. Our results also demonstrate that even small LLMs can obtain significantly higher persuasion gains through reinforcement learning.

replace Mapping Overlaps in Benchmarks through Perplexity in the Wild

Authors: Siyang Wu, Honglin Bao, Sida Li, Ari Holtzman, James A. Evans

Abstract: We introduce benchmark signatures to characterize the capacity demands of LLM benchmarks and their overlaps. Signatures are sets of salient tokens from in-the-wild corpora whose model token perplexity, reflecting training exposure, predicts benchmark performance. We extract them via stepwise forward selection with linear regression in a meta-evaluation spanning 32 LLMs and 89 benchmarks across diverse domains. We then analyze how these signatures relate to both the semantic similarity of benchmark questions and the correlation structure of model performance. While performance correlations are uniformly high and semantic overlaps stay in a narrow mid-range, benchmark signatures reveal more nuanced structure. For instance, they uncover substantial overlap between benchmarks in knowledge and reasoning tasks, whereas benchmarks in culture- and humanity-oriented domains show low similarity with each other. Unlike raw performance correlations, which are influenced by benchmark-orthogonal factors such as question formats, signatures are robust to such confounds. We further identify cross-functional overlaps between logic, math, language, instruction following, and cultural/world modeling, with coding emerging as the most isolated function, interacting only moderately with the ability of detecting missing information. Qualitative analysis shows that only the knowledge signature aligns with actual knowledge, suggesting that LLM semantic organization may differ from human conceptual structure. Together, these findings offer insights into benchmark validity, LLM sensitivities, and the landscape of interconnected LLM capacities. We have open-sourced the code and data in this https://github.com/siyangwu1/Benchmark-Signature-Repository.

URLs: https://github.com/siyangwu1/Benchmark-Signature-Repository.

replace ELHPlan: Efficient Long-Horizon Task Planning for Multi-Agent Collaboration

Authors: Shaobin Ling, Yun Wang, Chenyou Fan, Tin Lun Lam, Junjie Hu

Abstract: Large Language Models (LLMs) enable intelligent multi-robot collaboration but face fundamental trade-offs: open-loop methods that compile tasks into formal representations for external executors produce sound plans but lack adaptability in partially observable environments, while iterative methods incur prohibitive computational costs that scale poorly with team size and task complexity. In this paper, we propose Efficient Long-Horizon Planning (ELHPlan), a novel framework that introduces Action Chains, sequences of actions explicitly bound to sub-goal intentions, as the fundamental planning primitive. ELHPlan operates via a cyclical process: 1) constructing intention-bound action sequences, 2) proactively validating for conflicts and feasibility, 3) refining issues through targeted mechanisms, and 4) executing validated actions. This design balances adaptability and efficiency by providing intention-bound action sequences with longer lookahead while avoiding expensive full re-planning. We further advocate comprehensive efficiency metrics, including token consumption and planning time, to more holistically evaluate multi-agent collaboration. Our experiments on benchmarks TDW-MAT and C-WAH demonstrate that ELHPlan achieves comparable task success rates while consuming only 30-40% of the tokens required by state-of-the-art methods. Our research establishes a new efficiency-effectiveness frontier for LLM-based multi-agent planning systems.

replace Your Agent May Misevolve: Emergent Risks in Self-evolving LLM Agents

Authors: Shuai Shao, Qihan Ren, Chen Qian, Boyi Wei, Dadi Guo, Jingyi Yang, Xinhao Song, Linfeng Zhang, Weinan Zhang, Dongrui Liu, Jing Shao

Abstract: Advances in Large Language Models (LLMs) have enabled a new class of self-evolving agents that autonomously improve through interaction with the environment, demonstrating strong capabilities. However, self-evolution also introduces novel risks overlooked by current safety research. In this work, we study the case where an agent's self-evolution deviates in unintended ways, leading to undesirable or even harmful outcomes. We refer to this as Misevolution. To provide a systematic investigation, we evaluate misevolution along four key evolutionary pathways: model, memory, tool, and workflow. Our empirical findings reveal that misevolution is a widespread risk, affecting agents built even on top-tier LLMs (e.g., Gemini-2.5-Pro). Different emergent risks are observed in the self-evolutionary process, such as the degradation of safety alignment after memory accumulation, or the unintended introduction of vulnerabilities in tool creation and reuse. To our knowledge, this is the first study to systematically conceptualize misevolution and provide empirical evidence of its occurrence, highlighting an urgent need for new safety paradigms for self-evolving agents. Finally, we discuss potential mitigation strategies to inspire further research on building safer and more trustworthy self-evolving agents. Our code and data are available at https://github.com/ShaoShuai0605/Misevolution . Warning: this paper includes examples that may be offensive or harmful in nature.

URLs: https://github.com/ShaoShuai0605/Misevolution

replace Deliberative Dynamics and Value Alignment in LLM Debates

Authors: Pratik S. Sachdeva, Tom van Nuenen

Abstract: As large language models (LLMs) are increasingly deployed in sensitive everyday contexts -- offering personal advice, mental health support, and moral guidance -- understanding their behavior in navigating complex moral reasoning is essential. Most evaluations study this sociotechnical alignment through single-turn prompts, but it is unclear if these findings extend to multi-turn settings, and even less clear how they depend on the interaction protocols used to coordinate agentic systems. We address this gap using LLM debate to examine deliberative dynamics and value alignment in multi-turn settings by prompting subsets of three models (GPT-4.1, Claude 3.7 Sonnet, and Gemini 2.0 Flash) to collectively assign blame in 1,000 everyday dilemmas from Reddit's ``Am I the Asshole'' community. To test order effects and assess verdict revision, we use both synchronous (parallel responses) and round-robin (sequential responses) deliberation structures, mirroring how multi-agent systems are increasingly orchestrated in practice. Our findings show striking behavioral differences. In the synchronous setting, GPT-4.1 showed strong inertia (0.6-3.1\% revision rates) while Claude 3.7 Sonnet and Gemini 2.0 Flash were far more flexible (28-41\% revision rates). Value patterns also diverged: GPT-4.1 emphasized personal autonomy and direct communication (relative to its deliberation partners), while Claude 3.7 Sonnet and Gemini 2.0 Flash prioritized empathetic dialogue. We further find that deliberation format had a strong impact on model behavior: GPT-4.1 and Gemini 2.0 Flash stood out as highly conforming relative to Claude 3.7 Sonnet, with their verdict behavior strongly shaped by order effects. We provide additional results on open-source models (DeepSeek-V3.2 and Llama 3.1).

replace Reallocating Attention Across Layers to Reduce Multimodal Hallucination

Authors: Haolang Lu, Bolun Chu, WeiYe Fu, Guoshun Nan, Junning Liu, Minghui Pan, Qiankun Li, Yi Yu, Hua Wang, Kun Wang

Abstract: Multimodal large reasoning models (MLRMs) often suffer from hallucinations that stem not only from insufficient visual grounding but also from imbalanced allocation between perception and reasoning processes. Building upon recent interpretability findings suggesting a staged division of attention across layers, we analyze how this functional misalignment leads to two complementary failure modes: perceptual bias in shallow layers and reasoning drift in deeper layers. To alleviate these issues, we propose Functional Head Identification and Class-Conditioned Rescaling , a lightweight, training-free plugin that identifies perception- and reasoning-oriented heads and adaptively rebalances their layerwise contributions. Our method improves reasoning consistency and visual faithfulness without retraining or any architectural modification. Evaluations across three representative MLRMs and five multimodal reasoning benchmarks show an average 4.2% point gain, with less than 1% additional computation and only 9% baseline latency. Beyond empirical improvements, our study provides an interpretable perspective on regulating cross-layer functional dynamics to enhance the reliability of multimodal reasoning.

replace ARM-FM: Automated Reward Machines via Foundation Models for Compositional Reinforcement Learning

Authors: Roger Creus Castanyer, Faisal Mohamed, Pablo Samuel Castro, Cyrus Neary, Glen Berseth

Abstract: Reinforcement learning (RL) algorithms are highly sensitive to reward function specification, which remains a central challenge limiting their broad applicability. We present ARM-FM: Automated Reward Machines via Foundation Models, a framework for automated, compositional reward design in RL that leverages the high-level reasoning capabilities of foundation models (FMs). Reward machines (RMs) -- an automata-based formalism for reward specification -- are used as the mechanism for RL objective specification, and are automatically constructed via the use of FMs. The structured formalism of RMs yields effective task decompositions, while the use of FMs enables objective specifications in natural language. Concretely, we (i) use FMs to automatically generate RMs from natural language specifications; (ii) associate language embeddings with each RM automata-state to enable generalization across tasks; and (iii) provide empirical evidence of ARM-FM's effectiveness in a diverse suite of challenging environments, including evidence of zero-shot generalization.

replace Human-Centered LLM-Agent System for Detecting Anomalous Digital Asset Transactions

Authors: Gyuyeon Na, Minjung Park, Hyeonjeong Cha, Sangmi Chai

Abstract: We present HCLA, a human-centered multi-agent system for anomaly detection in digital-asset transactions. The system integrates three cognitively aligned roles: Rule Abstraction, Evidence Scoring, and Expert-Style Justification. These roles operate in a conversational workflow that enables non-experts to express analytical intent in natural language, inspect structured risk evidence, and obtain traceable, context-aware reasoning. Implemented with an open-source, web-based interface, HCLA translates user intent into explicit analytical rules, applies classical anomaly detectors to quantify evidential risk, and reconstructs expert-style justifications grounded in observable transactional signals. Experiments on a cryptocurrency anomaly dataset show that, while the underlying detector achieves strong predictive accuracy, HCLA substantially improves interpretability, interaction, and decision transparency. Importantly, HCLA is not designed to explain a black-box model in the conventional XAI sense. Instead, we reconstruct a traceable expert reasoning process that aligns algorithmic evidence with regulatory and investigative judgment. By explicitly separating evidence scoring from expert-style justification, the framework emphasizes accountability beyond explainability and addresses practical requirements for regulatory, audit, and compliance-driven financial forensics. We describe the system architecture, closed-loop interaction design, datasets, evaluation protocol, and limitations. We argue that a human-in-the-loop reasoning reconstruction paradigm is essential for achieving transparency, accountability, and trust in high-stakes financial environments. Keywords: Human-Centered AI; LLM-Agent System; Multi-Agent Architecture; Anomaly Detection; Digital Asset Transactions; Cryptocurrency Forensics; Blockchain Analytics; Human-in-the-Loop; Explainable AI (XAI); Interpretability

replace Jr. AI Scientist and Its Risk Report: Autonomous Scientific Exploration from a Baseline Paper

Authors: Atsuyuki Miyai, Mashiro Toyooka, Takashi Otonari, Zaiying Zhao, Kiyoharu Aizawa

Abstract: Understanding the current capabilities and risks of AI Scientist systems is essential for ensuring trustworthy and sustainable AI-driven scientific progress while preserving the integrity of the academic ecosystem. To this end, we develop Jr. AI Scientist, a state-of-the-art autonomous AI scientist system that mimics the core research workflow of a novice student researcher: Given the baseline paper from the human mentor, it analyzes its limitations, formulates novel hypotheses for improvement, validates them through rigorous experimentation, and writes a paper with the results. Unlike previous approaches that assume full automation or operate on small-scale code, Jr. AI Scientist follows a well-defined research workflow and leverages modern coding agents to handle complex, multi-file implementations, leading to scientifically valuable contributions. Through our experiments, the Jr. AI Scientist successfully generated new research papers that build upon real NeurIPS, IJCV, and ICLR works by proposing and implementing novel algorithms. For evaluation, we conducted automated assessments using AI Reviewers, author-led evaluations, and submissions to Agents4Science, a venue dedicated to AI-driven scientific contributions. The findings demonstrate that Jr. AI Scientist generates papers receiving higher review scores by DeepReviewer than existing fully automated systems. Nevertheless, we identify important limitations from both the author evaluation and the Agents4Science reviews, indicating the potential risks of directly applying current AI Scientist systems and key challenges for future research. Finally, we comprehensively report various risks identified during development. We believe this study clarifies the current role and limitations of AI Scientist systems, offering insights into the areas that still require human expertise and the risks that may emerge as these systems evolve.

replace Think, Speak, Decide: Language-Augmented Multi-Agent Reinforcement Learning for Economic Decision-Making

Authors: Heyang Ma, Qirui Mi, Qipeng Yang, Zijun Fan, Bo Li, Haifeng Zhang

Abstract: Economic decision-making depends not only on structured signals such as prices and taxes, but also on unstructured language, including peer dialogue and media narratives. While multi-agent reinforcement learning (MARL) has shown promise in optimizing economic decisions, it struggles with the semantic ambiguity and contextual richness of language. We propose LAMP (Language-Augmented Multi-Agent Policy), a framework that integrates language into economic decision-making and narrows the gap to real-world settings. LAMP follows a Think-Speak-Decide pipeline: (1) Think interprets numerical observations to extract short-term shocks and long-term trends, caching high-value reasoning trajectories; (2) Speak crafts and exchanges strategic messages based on reasoning, updating beliefs by parsing peer communications; and (3) Decide fuses numerical data, reasoning, and reflections into a MARL policy to optimize language-augmented decision-making. Experiments in economic simulation show that LAMP outperforms both MARL and LLM-only baselines in cumulative return (+63.5%, +34.0%), robustness (+18.8%, +59.4%), and interpretability. These results demonstrate the potential of language-augmented policies to deliver more effective and robust economic strategies.

replace Integrating a Causal Foundation Model into a Prescriptive Maintenance Framework for Optimising Production-Line OEE

Authors: Felix Saretzky, Lucas Andersen, Thomas Engel, Fazel Ansari

Abstract: The transition to prescriptive maintenance (PsM) in manufacturing is critically constrained by a dependence on predictive models. Such purely predictive models tend to capture statistical associations in the data without identifying the underlying causal drivers of failure, which can lead to costly misdiagnoses and ineffective measures. This fundamental limitation results in a key challenge: while we can predict that a failure may occur, we lack a systematic method to understand why a failure occurs. This paper proposes a model based on causal machine learning to bridge this gap. Our objective is to move beyond diagnosis to active prescription by simulating and evaluating potential fixes to optimise KPIs such as Overall Equipment Effectiveness (OEE). For this purpose, a pre-trained causal foundation model is used as a ``what-if'' simulator to estimate the effects of potential fixes. By estimating the causal effect of each intervention on system-level KPIs, specific actions can be recommended for the production line. This can help identify plausible root causes and quantify their operational impact. The model is evaluated using semi-synthetic manufacturing data and compared with non-causal and causal baseline machine learning models. This paper provides a technical basis for a human-centred approach, allowing engineers to test potential solutions in a causal environment to make more effective operational decisions and reduce costly downtimes.

replace Parallel Decoder Transformer: Planner-Seeded Latent Coordination for Synchronized Parallel Decoding

Authors: Logan Robbins

Abstract: Autoregressive language models can often identify parallel subproblems, but standard decoding exposes only a single left-to-right output interface. External orchestration methods can launch multiple prompts concurrently, yet they provide no model-internal state through which those generations can synchronize, resolve ownership, or wait for missing information. We present the Parallel Decoder Transformer (PDT), a frozen-trunk architecture that augments a decoder with a planner-seeded latent workspace and a synchronized multi-stream output protocol. Before any stream emits tokens, a mandatory prompt-time planner predicts fixed latent plan slots and projects them as snapshot 0 on an embeddings-only Dynamic Notes Bus. During decoding, each stream reads the visible notes window through Speculative Note Conditioning (SNC), emits provisional token blocks and latent summaries, and advances only when agreement logic determines that the current shared state is sufficient for continued parallel generation. Coverage heads track plan-item ownership, while rollback handles incoherent or premature commits. PDT therefore shifts parallel task decomposition from an external prompting strategy to a model-internal coordination mechanism over the output interface of a frozen language model.

replace Adaptation of Agentic AI: A Survey of Post-Training, Memory, and Skills

Authors: Pengcheng Jiang, Jiacheng Lin, Zhiyi Shi, Zifeng Wang, Luxi He, Yichen Wu, Ming Zhong, Peiyang Song, Qizheng Zhang, Heng Wang, Xueqiang Xu, Hanwen Xu, Pengrui Han, Dylan Zhang, Jiashuo Sun, Chaoqi Yang, Kun Qian, Tian Wang, Changran Hu, Manling Li, Quanzheng Li, Hao Peng, Sheng Wang, Jingbo Shang, Chao Zhang, Jiaxuan You, Liyuan Liu, Pan Lu, Yu Zhang, Heng Ji, Yejin Choi, Dawn Song, Jimeng Sun, Jiawei Han

Abstract: Large language model (LLM) agents are moving beyond prompting alone. ChatGPT marked the rise of general-purpose LLM assistants, DeepSeek showed that on-policy reinforcement learning with verifiable rewards can improve reasoning and tool use, and OpenClaw highlights a newer direction in which agents accumulate persistent memory and reusable skills. Yet the research landscape remains fragmented across post-training, retrieval, memory, and skill systems. This survey studies these developments under a single notion of \emph{adaptation}: improving an agent, its tools, or their interaction after pretraining. We organize the field with a four-paradigm framework spanning agent adaptation and tool adaptation. On the agent side, A1 (tool-execution-signaled) and A2 (agent-output-signaled) improve the agent itself through supervised fine-tuning, preference optimization, and reinforcement learning with verifiable rewards. On the tool side, T1 (agent-agnostic) provides reusable pre-trained modules any agent can call, while T2 (agent-supervised) uses the agent's outputs to train memory systems, skill libraries, or lightweight subagents. Using this framework, we review post-training methods, adaptive memory architectures, and agent skills; compare their trade-offs in cost, flexibility, and generalization; and summarize evaluation practices across deep research, software development, computer use, and drug discovery. We conclude by outlining open problems in agent-tool co-adaptation, continual learning, safety, and efficient deployment.

replace Toward a Physical Theory of Intelligence

Authors: Peter David Fagan

Abstract: While often treated as abstract algorithmic properties, intelligence and computation are ultimately physical processes constrained by conservation laws. We introduce the Conservation-Congruent Encoding (CCE) framework as a unified, substrate-neutral physical framework for studying intelligence. We propose that information processing emerges when open systems undergo irreversible transitions, carving out macroscopic states from underlying reversible micro-dynamics. Generalizing Landauer's principle to arbitrary conserved quantities via metriplectic flows, we derive a universal bound for macroscopic computation. This yields physical metrics for intelligence and an operational analogue for consciousness, quantifying an agent's ability to extract work from the environment while minimizing its own dissipative dynamics. Applying CCE to the limits of physical observation, we model measurement as an active coarse-graining process rather than a passive projection. At the quantum scale, CCE recovers the Lindblad Master Equation, consistent with modelling decoherence as the dissipative exhaust required to record a measurement. Scaling to cosmological limits, we explore the hypothesis that gravity emerges as the macroscopic geometric footprint of these bounds. We show that, under this hypothesis, measurement-induced dissipation is consistent with a volumetric phase-space collapse, offering a dynamical route to the Bekenstein-Hawking area law. Equating the Landauer exhaust of this coarse-graining to horizon deformation outlines a limiting-case recovery of the Einstein Field Equations. Ultimately, by establishing a substrate-neutral link between thermodynamic dissipation, quantum measurement, and spacetime geometry, CCE provides physical constraints for understanding both natural and artificial intelligence.

replace Batch-of-Thought: Cross-Instance Learning for Enhanced LLM Reasoning

Authors: Xuan Yang, Furong Jia, Roy Xie, Xiong Xi, Hengwei Bian, Jian Li, Monica Agrawal

Abstract: Current Large Language Model reasoning systems process queries independently, discarding valuable cross-instance signals such as shared reasoning patterns and consistency constraints. We introduce Batch-of-Thought (BoT), a training-free method that processes related queries jointly to enable cross-instance learning. By performing comparative analysis across batches, BoT identifies high-quality reasoning templates, detects errors through consistency checks, and amortizes computational costs. We instantiate BoT within a multi-agent reflection architecture (BoT-R), where a Reflector performs joint evaluation to unlock mutual information gain unavailable in isolated processing. Experiments across three model families and six benchmarks demonstrate that BoT-R consistently improves accuracy and confidence calibration while reducing inference costs by up to 61%. Our theoretical and experimental analysis reveals when and why batch-aware reasoning benefits LLM systems. Our code is available at https://github.com/xuanyang19/BoT

URLs: https://github.com/xuanyang19/BoT

replace CaMeLs Can Use Computers Too: System-level Security for Computer Use Agents

Authors: Hanna Foerster, Tom Blanchard, Kristina Nikoli\'c, Ilia Shumailov, Cheng Zhang, Robert Mullins, Nicolas Papernot, Florian Tram\`er, Yiren Zhao

Abstract: AI agents are vulnerable to prompt injection attacks, where malicious content hijacks agent behavior to steal credentials or cause financial loss. The only known robust defense is architectural isolation that strictly separates trusted task planning from untrusted environment observations. However, applying this design to Computer Use Agents (CUAs) -- systems that automate tasks by viewing screens and executing actions -- presents a fundamental challenge: current agents require continuous observation of UI state to determine each action, conflicting with the isolation required for security. We resolve this tension by demonstrating that UI workflows, while dynamic, are structurally predictable. We introduce Single-Shot Planning for CUAs, where a trusted planner generates a complete execution graph with conditional branches before any observation of potentially malicious content, providing provable control flow integrity guarantees against arbitrary instruction injections. Although this architectural isolation successfully prevents instruction injections, we show that additional measures are needed to prevent Branch Steering attacks, which manipulate UI elements to trigger unintended valid paths within the plan. We evaluate our design on OSWorld, and retain up to 57% of the performance of frontier models while improving performance for smaller open-source models by up to 19%, demonstrating that rigorous security and utility can coexist in CUAs.

replace BoxMind: Closed-loop AI strategy optimization for elite boxing validated in the 2024 Olympics

Authors: Kaiwen Wang, Kaili Zheng, Rongrong Deng, Qingmin Fan, Milin Zhang, Zongrui Li, Xuesi Zhou, Bo Han, Liren Chen, Chenyi Guo, Ji Wu

Abstract: Competitive sports require sophisticated tactical analysis, yet combat disciplines like boxing remain underdeveloped in AI-driven analytics due to the complexity of action dynamics and the lack of structured tactical representations. To address this, we present BoxMind, a closed-loop AI expert system validated in elite boxing competition. By defining atomic punch events with precise temporal boundaries and spatial and technical attributes, we parse match footage into 18 hierarchical technical-tactical indicators. We then propose a graph-based predictive model that fuses these explicit technical-tactical profiles with learnable, time-variant latent embeddings to capture the dynamics of boxer matchups. Modeling match outcome as a differentiable function of technical-tactical indicators, we turn winning probability gradients into executable tactical adjustments. Experiments show that the outcome prediction model achieves state-of-the-art performance, with 69.8% accuracy on BoxerGraph test set and 87.5% on Olympic matches. Using this predictive model as a foundation, the system generates strategic recommendations that demonstrate proficiency comparable to human experts. BoxMind is validated through a closed-loop deployment during the 2024 Paris Olympics, directly contributing to the Chinese National Team's historic achievement of three gold and two silver medals. BoxMind establishes a replicable paradigm for transforming unstructured video data into strategic intelligence, bridging the gap between computer vision and decision support in competitive sports. Code and data is available at https://github.com/gouba2333/BoxingWeb.

URLs: https://github.com/gouba2333/BoxingWeb.

replace MAS-Orchestra: Understanding and Improving Multi-Agent Reasoning Through Holistic Orchestration and Controlled Benchmarks

Authors: Zixuan Ke, Yifei Ming, Austin Xu, Ryan Chin, Xuan-Phi Nguyen, Prathyusha Jwalapuram, Jiayu Wang, Semih Yavuz, Caiming Xiong, Shafiq Joty

Abstract: While multi-agent systems (MAS) promise elevated intelligence through coordination of agents, current approaches to automatic MAS design under-deliver. Such shortcomings stem from two key factors: (1) methodological complexity - agent orchestration is performed using sequential, code-level execution that limits global system-level holistic reasoning and scales poorly with agent complexity - and (2) efficacy uncertainty - MAS are deployed without understanding if there are tangible benefits compared to single-agent systems (SAS). We propose MASOrchestra, a training-time framework that formulates MAS orchestration as a function-calling reinforcement learning problem with holistic orchestration, generating an entire MAS at once. In MAS-Orchestra, complex, goal-oriented subagents are abstracted as callable functions, enabling global reasoning over system structure while hiding internal execution details. To rigorously study when and why MAS are beneficial, we introduce MASBENCH, a controlled benchmark that characterizes tasks along five axes: Depth, Horizon, Breadth, Parallel, and Robustness. Our analysis reveals that MAS gains depend critically on task structure, verification protocols, and the capabilities of both orchestrator and subagents, rather than holding universally. Guided by these insights, MAS-Orchestra achieves consistent improvements on public benchmarks including mathematical reasoning, multi-hop QA, and search-based QA, while achieving more than 10x efficiency over strong baselines. Together, MAS-Orchestra and MASBENCH enable better training and understanding of MAS in the pursuit of multi-agent intelligence.

replace Replayable Financial Agents: A Determinism-Faithfulness Assurance Harness for Tool-Using LLM Agents

Authors: Raffi Khatchadourian

Abstract: LLM agents struggle with regulatory audit replay: when asked to reproduce a flagged transaction decision with identical inputs, many deployments fail to return consistent results. We introduce the Determinism-Faithfulness Assurance Harness (DFAH), a framework for measuring trajectory determinism, decision determinism, and evidence-conditioned faithfulness in tool-using agents deployed in financial services. Across 4,700+ agentic runs (7 models, 4 providers, 3 financial benchmarks with 50 cases each at T=0.0), we find that decision determinism and task accuracy are not detectably correlated (r = -0.11, 95% CI [-0.49, 0.31], p = 0.63, n = 21 configurations): models can be deterministic without being accurate, and accurate without being deterministic. Because neither metric predicts the other in our sample, both must be measured independently, which is precisely what DFAH provides. Small models (7-20B) achieve near-perfect determinism through rigid pattern matching at the cost of accuracy (20-42%), while frontier models show moderate determinism (50-96%) with variable accuracy. No model achieves both perfect determinism and high accuracy, supporting DFAH's multi-dimensional measurement approach. We provide three financial benchmarks (compliance triage, portfolio constraints, and DataOps exceptions; 50 cases each) together with an open-source stress-test harness. Across these benchmarks and DFAH evaluation settings, Tier 1 models with schema-first architectures achieved determinism levels consistent with audit replay requirements.

replace BioAgent Bench: An AI Agent Evaluation Suite for Bioinformatics

Authors: Dionizije Fa, Marko \v{C}uljak, Bruno Pand\v{z}a, Mateo \v{C}upi\'c

Abstract: This paper introduces BioAgent Bench, a benchmark dataset and an evaluation suite designed for measuring the performance and robustness of AI agents in common bioinformatics tasks. The benchmark contains curated end-to-end tasks (e.g., RNA-seq, variant calling, metagenomics) with prompts that specify concrete output artifacts to support automated assessment, including stress testing under controlled perturbations. We evaluate frontier closed-source and open-weight models across multiple agent harnesses, and use an LLM-based grader to score pipeline progress and outcome validity. We find that frontier agents can complete multi-step bioinformatics pipelines without elaborate custom scaffolding, often producing the requested final artifacts reliably. However, robustness tests reveal failure modes under controlled perturbations (corrupted inputs, decoy files, and prompt bloat), indicating that correct high-level pipeline construction does not guarantee reliable step-level reasoning. Finally, because bioinformatics workflows may involve sensitive patient data, proprietary references, or unpublished IP, closed-source models can be unsuitable under strict privacy constraints; in such settings, open-weight models may be preferable despite lower completion rates. We release the dataset and evaluation suite publicly.

replace Real-Time Aligned Reward Model beyond Semantics

Authors: Zixuan Huang, Xin Xia, Yuxi Ren, Jianbin Zheng, Xuefeng Xiao, Hongyan Xie, Li Huaqiu, Songshi Liang, Zhongxiang Dai, Fuzhen Zhuang, Jianxin Li, Yikun Ban, Deqing Wang

Abstract: Reinforcement Learning from Human Feedback (RLHF) is a pivotal technique for aligning large language models (LLMs) with human preferences, yet it is susceptible to reward overoptimization, in which policy models overfit to the reward model, exploit spurious reward patterns instead of faithfully capturing human intent. Prior mitigations primarily relies on surface semantic information and fails to efficiently address the misalignment between the reward model (RM) and the policy model caused by continuous policy distribution shifts. This inevitably leads to an increasing reward discrepancy, exacerbating reward overoptimization. To address these limitations, we introduce R2M (Real-Time Aligned Reward Model), a novel lightweight RLHF framework. R2M goes beyond vanilla reward models that solely depend on the semantic representations of a pretrained LLM. Instead, it leverages the evolving hidden states of the policy (namely policy feedback) to align with the real-time distribution shift of the policy during the RL process. This work points to a promising new direction for improving the performance of reward models through real-time utilization of feedback from policy models.

replace Conditional Diffusion Guidance under Hard Constraint: A Stochastic Analysis Approach

Authors: Zhengyi Guo, Wenpin Tang, Renyuan Xu

Abstract: We study conditional generation in diffusion models under hard constraints, where generated samples must satisfy prescribed events with probability one. Such constraints arise naturally in safety-critical applications and in rare-event simulation, where soft or reward-based guidance methods offer no guarantee of constraint satisfaction. Building on a probabilistic interpretation of diffusion models, we develop a principled conditional diffusion guidance framework based on Doob's h-transform, martingale representation and quadratic variation process. Specifically, the resulting guided dynamics augment a pretrained diffusion with an explicit drift correction involving the logarithmic gradient of a conditioning function, without modifying the pretrained score network. Leveraging martingale and quadratic-variation identities, we propose two novel off-policy learning algorithms based on a martingale loss and a martingale-covariation loss to estimate h and its gradient using only trajectories from the pretrained model. We provide non-asymptotic guarantees for the resulting conditional sampler in both total variation and Wasserstein distances, explicitly characterizing the impact of score approximation and guidance estimation errors. Numerical experiments demonstrate the effectiveness of the proposed methods in enforcing hard constraints and generating rare-event samples. The code of the numerical experiments can be found at https://github.com/ZhengyiGuo2002/CDG_Finance.

URLs: https://github.com/ZhengyiGuo2002/CDG_Finance.

replace NAAMSE: Framework for Evolutionary Security Evaluation of Agents

Authors: Kunal Pai, Parth Shah, Harshil Patel

Abstract: AI agents are increasingly deployed in production, yet their security evaluations remain bottlenecked by manual red-teaming or static benchmarks that fail to model adaptive, multi-turn adversaries. We propose NAAMSE, an evolutionary framework that reframes agent security evaluation as a feedback-driven optimization problem. Our system employs a single autonomous agent that orchestrates a lifecycle of genetic prompt mutation, hierarchical corpus exploration, and asymmetric behavioral scoring. By using model responses as a fitness signal, the framework iteratively compounds effective attack strategies while simultaneously ensuring "benign-use correctness", preventing the degenerate security of blanket refusal. Our experiments across a diverse suite of state-of-the-art large language models demonstrate that evolutionary mutation systematically amplifies vulnerabilities missed by one-shot methods, with controlled ablations revealing that the synergy between exploration and targeted mutation uncovers high-severity failure modes. We show that this adaptive approach provides a more realistic and scalable assessment of agent robustness in the face of evolving threats. The code for NAAMSE is open source and available at https://github.com/HASHIRU-AI/NAAMSE.

URLs: https://github.com/HASHIRU-AI/NAAMSE.

replace MERIT Feedback Elicits Better Bargaining in LLM Negotiators

Authors: Jihwan Oh, Murad Aghazada, Yooju Shin, Se-Young Yun, Taehyeon Kim

Abstract: Bargaining is often regarded as a logical arena rather than an art or a matter of intuition, yet Large Language Models (LLMs) still struggle to navigate it due to limited strategic depth and difficulty adapting to complex human factors. Current benchmarks rarely capture this limitation. To bridge this gap, we present a utility feedback centric framework. Our contributions are: (i) AgoraBench, a new benchmark spanning nine challenging settings (e.g., deception, monopoly) that supports diverse strategy modeling; (ii) human-aligned, economically grounded metrics derived from utility theory. This is operationalized via agent utility, negotiation power, and acquisition ratio that implicitly measure how well the negotiation aligns with human preference and (iii) a human preference grounded dataset with learning pipeline that strengthens LLMs' bargaining ability through both prompting and finetuning. Empirical results indicate that baseline LLM strategies often diverge from human preferences, while our mechanism substantially improves negotiation performance, yielding deeper strategic behavior and stronger opponent awareness.

replace To Mix or To Merge: Toward Multi-Domain Reinforcement Learning for Large Language Models

Authors: Haoqing Wang, Xiang Long, Ziheng Li, Yilong Xu, Tingguang Li, Yehui Tang

Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) plays a key role in stimulating the explicit reasoning capability of Large Language Models (LLMs). We can achieve expert-level performance in some specific domains via RLVR, such as coding or math. When a general multi-domain expert-level model is required, we need to carefully consider the collaboration of RLVR across different domains. The current state-of-the-art models mainly employ two different training paradigms for multi-domain RLVR: mixed multi-task RLVR and separate RLVR followed by model merging. However, most of the works did not provide a detailed comparison and analysis about these paradigms. To this end, we choose multiple commonly used high-level tasks (e.g., math, coding, science, instruction following, and agent) as our target domains and design extensive qualitative and quantitative experiments using open-source datasets. We find the RLVR across domains exhibits few mutual interferences, and reasoning-intensive domains demonstrate mutually synergistic effects. Furthermore, we analyze the internal mechanisms of mutual gains from the perspectives of weight space geometry, model prediction behavior, information constraints and self-verification. This project is named as M2RL that means Mixed multi-task training or separate training followed by model Merging for Reinforcement Learning, and the homepage is at https://github.com/mosAI25/M2RL.

URLs: https://github.com/mosAI25/M2RL.

replace SkillsBench: Benchmarking How Well Agent Skills Work Across Diverse Tasks

Authors: Xiangyi Li, Wenbo Chen, Yimin Liu, Shenghan Zheng, Xiaokun Chen, Yifeng He, Yubo Li, Bingran You, Haotian Shen, Jiankai Sun, Shuyi Wang, Binxu Li, Qunhong Zeng, Di Wang, Xuandong Zhao, Yuanli Wang, Roey Ben Chaim, Zonglin Di, Yipeng Gao, Junwei He, Yizhuo He, Liqiang Jing, Luyang Kong, Xin Lan, Jiachen Li, Songlin Li, Yijiang Li, Yueqian Lin, Xinyi Liu, Xuanqing Liu, Haoran Lyu, Ze Ma, Bowei Wang, Runhui Wang, Tianyu Wang, Wengao Ye, Yue Zhang, Hanwen Xing, Yiqi Xue, Steven Dillmann, Han-chung Lee

Abstract: Agent Skills are structured packages of procedural knowledge that augment LLM agents at inference time. Despite rapid adoption, there is no standard way to measure whether they actually help. We present SkillsBench, a benchmark of 86 tasks across 11 domains paired with curated Skills and deterministic verifiers. Each task is evaluated under three conditions: no Skills, curated Skills, and self-generated Skills. We test 7 agent-model configurations over 7,308 trajectories. Curated Skills raise average pass rate by 16.2 percentage points(pp), but effects vary widely by domain (+4.5pp for Software Engineering to +51.9pp for Healthcare) and 16 of 84 tasks show negative deltas. Self-generated Skills provide no benefit on average, showing that models cannot reliably author the procedural knowledge they benefit from consuming. Focused Skills with 2--3 modules outperform comprehensive documentation, and smaller models with Skills can match larger models without them.

replace X-SYS: A Reference Architecture for Interactive Explanation Systems

Authors: Tobias Labarta, Nhi Hoang, Maximilian Dreyer, Jim Berend, Oleg Hein, Jackie Ma, Wojciech Samek, Sebastian Lapuschkin

Abstract: The explainable AI (XAI) research community has proposed numerous technical methods, yet deploying explainability as systems remains challenging: Interactive explanation systems require both suitable algorithms and system capabilities that maintain explanation usability across repeated queries, evolving models and data, and governance constraints. We argue that operationalizing XAI requires treating explainability as an information systems problem where user interaction demands induce specific system requirements. We introduce X-SYS, a reference architecture for interactive explanation systems, that guides (X)AI researchers, developers and practitioners in connecting interactive explanation user interfaces (XUI) with system capabilities. X-SYS organizes around four quality attributes named STAR (scalability, traceability, responsiveness, and adaptability), and specifies a five-component decomposition (XUI Services, Explanation Services, Model Services, Data Services, Orchestration and Governance). It maps interaction patterns to system capabilities to decouple user interface evolution from backend computation. We implement X-SYS through SemanticLens, a system for semantic search and activation steering in vision-language models. SemanticLens demonstrates how contract-based service boundaries enable independent evolution, offline/online separation ensures responsiveness, and persistent state management supports traceability. Together, this work provides a reusable blueprint and concrete instantiation for interactive explanation systems supporting end-to-end design under operational constraints.

replace A Geometric Taxonomy of Hallucinations in LLMs

Authors: Javier Mar\'in

Abstract: The term "hallucination" converge different failure modes with specific geometric signatures in embedding space. We propose a taxonomy identifying three types: unfaithfulness (Type I: ignoring provided context), confabulation (Type II: inventing semantically foreign content), and factual error (Type III: wrong details within correct conceptual frames). We introduce two detection methods grounded in this taxonomy: the Semantic Grounding Index (SGI) for Type I, which measures whether a response moves toward provided context on the unit hypersphere, and the Directional Grounding Index (DGI) for Type II, which measures displacement geometry in context-free settings. DGI achieves AUROC=0.958 on human-crafted confabulations with 3.8% cross-domain degradation. External validation on three independently collected human-annotated benchmarks -WikiBio GPT-3, FELM, and ExpertQA- yields domain-specific AUROC 0.581-0.695, with DGI outperforming an NLI CrossEncoder baseline on expert-domain data, where surface entailment operates at chance. On LLM-generated benchmarks, detection is domain-local. We examine the Type III boundary through TruthfulQA, where apparent classifier signal (Logistic Regression with AUROC 0.731) is traced to a stylistic annotation confound: false answers are geometrically closer to queries than truthful ones, a pattern incompatible with factual-error detection. This identifies a theoretical constraint from a methodological limitation.

replace Can a Lightweight Automated AI Pipeline Solve Research-Level Mathematical Problems?

Authors: Lve Meng (University of Science,Technology of China, Zhongguancun Academy), Weilong Zhao (Universit\'e Paris Cit\'e), Yanzhi Zhang (Zhongguancun Academy), Haoxiang Guan (Zhongguancun Academy), Jiyan He (Zhongguancun Academy)

Abstract: Large language models (LLMs) have recently achieved remarkable success in generating rigorous mathematical proofs, with "AI for Math" emerging as a vibrant field of research (Ju et al., 2026). While these models have mastered competition-level benchmarks like the International Mathematical Olympiad (Huang et al., 2025; Duan et al., 2025) and show promise in research applications through auto-formalization (Wang et al., 2025), their deployment via lightweight, natural-language pipelines for research problems remains underexplored. In this work, we demonstrate that next-generation models (e.g., Gemini 3 Pro, GPT-5.2 Pro), when integrated into a streamlined automated pipeline optimized for citation-based verification, can solve sophisticated research-grade problems. We evaluate our pipeline on two novel datasets: (1) the ICCM (2025) problem sets (comparable to the S.-T. Yau College Student Mathematics Contest) proposed by leading mathematicians (Shanghai Math Challenge, 2026), and (2) the "First Proof" problem set (Abouzaid et al., 2026), consisting of previously unpublished research questions. Our pipeline generated candidate proofs for all problems in the first two ICCM sets and the "First Proof" set. The solutions for the first two ICCM sets and Problem 4 of the "First Proof" set have been fully verified by our team. All generated proofs have been submitted to the official organization, and our generated results are publicly available at https://github.com/ml1301215/question_sets-test_results. We have open-sourced the code and developed a user-friendly UI for this workflow, accessible at https://github.com/ml1301215/research-math-assistant.

URLs: https://github.com/ml1301215/question_sets-test_results., https://github.com/ml1301215/research-math-assistant.

replace ABD: Default Exception Abduction in Finite First Order Worlds

Authors: Serafim Batzoglou

Abstract: We introduce ABD, a benchmark for default-exception abduction over finite first-order worlds. Given a background theory with an abnormality predicate and a set of relational structures, a model must output a first-order formula that defines exceptions, restoring satisfiability while keeping exceptions sparse. We formalize three observation regimes (closed-world, existential completion, universal completion) with exact SMT verification. Evaluating ten frontier LLMs on 600 instances, the best models achieve high validity but parsimony gaps remain, and holdout evaluation reveals distinct generalization failure modes across regimes.

replace INDUCTION: Finite-Structure Concept Synthesis in First-Order Logic

Authors: Serafim Batzoglou

Abstract: We introduce INDUCTION, a benchmark for finite structure concept synthesis in first order logic. Given small finite relational worlds with extensionally labeled target predicates, models must output a single first order logical formula that explains the target uniformly across worlds, with correctness verified via exact model checking. The benchmark includes three regimes, FullObs, CI (contrastive), and EC (existential completion), nd penalizes formula bloat. We find sharp difficulty gradients, persistent hard structural families, and observe that low bloat formulas generalize far better on held out worlds. Elite recent models show qualitatively different behaviors across tasks and performance metrics, hinting to their different strategies of concept generalization.

replace Characterizing MARL for Energy Control: A Multi-KPI Benchmark on the CityLearn Environment

Authors: Aymen Khouja, Imen Jendoubi, Oumayma Mahjoub, Oussama Mahfoudhi, Ruan De Kock, Siddarth Singh, Claude Formanek

Abstract: The optimization of urban energy systems is crucial for the advancement of sustainable and resilient smart cities, which are becoming increasingly complex with multiple decision-making units. To address scalability and coordination concerns, Multi-Agent Reinforcement Learning (MARL) is a promising solution. This paper addresses the imperative need for comprehensive and reliable benchmarking of MARL algorithms on energy management tasks. CityLearn is used as a case study environment because it realistically simulates urban energy systems, incorporates multiple storage systems, and utilizes renewable energy sources. By doing so, our work sets a new standard for evaluation, conducting a comparative study across multiple key performance indicators (KPIs). This approach illuminates the key strengths and weaknesses of various algorithms, moving beyond traditional KPI averaging which often masks critical insights. Our experiments utilize widely accepted baselines such as Proximal Policy Optimization (PPO) and Soft Actor Critic (SAC), and encompass diverse training schemes including Decentralized Training with Decentralized Execution (DTDE) and Centralized Training with Decentralized Execution (CTDE) approaches and different neural network architectures. Our work also proposes novel KPIs that tackle real world implementation challenges such as individual building contribution and battery storage lifetime. Our findings show that DTDE consistently outperforms CTDE in both average and worst-case performance. Additionally, temporal dependency learning improved control on memory dependent KPIs such as ramping and battery usage, contributing to more sustainable battery operation. Results also reveal robustness to agent or resource removal, highlighting both the resilience and decentralizability of the learned policies.

replace ARLArena: A Unified Framework for Stable Agentic Reinforcement Learning

Authors: Xiaoxuan Wang, Han Zhang, Haixin Wang, Yidan Shi, Ruoyan Li, Kaiqiao Han, Chenyi Tong, Haoran Deng, Renliang Sun, Alexander Taylor, Yanqiao Zhu, Jason Cong, Yizhou Sun, Wei Wang

Abstract: Agentic reinforcement learning (ARL) has rapidly gained attention as a promising paradigm for training agents to solve complex, multi-step interactive tasks. Despite encouraging early results, ARL remains highly unstable, often leading to training collapse. This instability limits scalability to larger environments and longer interaction horizons, and constrains systematic exploration of algorithmic design choices. In this paper, we first propose ARLArena, a stable training recipe and systematic analysis framework that examines training stability in a controlled and reproducible setting. ARLArena first constructs a clean and standardized testbed. Then, we decompose policy gradient into four core design dimensions and assess the performance and stability of each dimension. Through this fine-grained analysis, we distill a unified perspective on ARL and propose SAMPO, a stable agentic policy optimization method designed to mitigate the dominant sources of instability in ARL. Empirically, SAMPO achieves consistently stable training and strong performance across diverse agentic tasks. Overall, this study provides a unifying policy gradient perspective for ARL and offers practical guidance for building stable and reproducible LLM-based agent training pipelines.

replace Vibe Researching as Wolf Coming: Can AI Agents with Skills Replace or Augment Social Scientists?

Authors: Yongjun Zhang

Abstract: AI agents -- systems that execute multi-step reasoning workflows with persistent state, tool access, and specialist skills -- represent a qualitative shift from prior automation technologies in social science. Unlike chatbots that respond to isolated queries, AI agents can now read files, run code, query databases, search the web, and invoke domain-specific skills to execute entire research pipelines autonomously. This paper introduces the concept of vibe researching -- the AI-era parallel to vibe coding -- and uses scholar-skill, a 26-skill plugin for Claude Code covering the full research pipeline from idea to submission across 18 orchestrated phases with 53 quality gates, as an illustrative case. I develop a cognitive task framework that classifies research activities along two dimensions -- codifiability and tacit knowledge requirement -- to identify a delegation boundary that is cognitive, not sequential: it cuts through every stage of the research pipeline, not between stages. I argue that AI agents excel at speed, coverage, and methodological scaffolding but struggle with theoretical originality and tacit field knowledge. The paper concludes with an analysis of three implications for the profession -- augmentation with fragile conditions, stratification risk, and a pedagogical crisis -- and proposes five principles for responsible vibe researching.

replace A Mathematical Theory of Agency and Intelligence

Authors: Wael Hafez, Chenan Wei, Rodrigo Pena, Amir Nazeri, Cameron Reid

Abstract: To operate reliably under changing conditions, complex systems require feedback on how effectively they use resources, not just whether objectives are met. Current AI systems process vast information to produce sophisticated predictions, yet predictions can appear successful while the underlying interaction with the environment degrades. What is missing is a principled measure of how much of the total information a system deploys is actually shared between its observations, actions, and outcomes. We prove this shared fraction, which we term bipredictability, P, is intrinsic to any interaction, derivable from first principles, and strictly bounded: P can reach unity in quantum systems, P equal to, or smaller than 0.5 in classical systems, and lower once agency (action selection) is introduced. We confirm these bounds in a physical system (double pendulum), reinforcement learning agents, and multi turn LLM conversations. These results distinguish agency from intelligence: agency is the capacity to act on predictions, whereas intelligence additionally requires learning from interaction, self-monitoring of its learning effectiveness, and adapting the scope of observations, actions, and outcomes to restore effective learning. By this definition, current AI systems achieve agency but not intelligence. Inspired by thalamocortical regulation in biological systems, we demonstrate a feedback architecture that monitors P in real time, establishing a prerequisite for adaptive, resilient AI.

replace Decomposing Physician Disagreement in HealthBench

Authors: Satya Borgohain, Roy Mariathas

Abstract: We decompose physician disagreement in the HealthBench medical AI evaluation dataset to understand where variance resides and what observable features can explain it. Rubric identity accounts for 15.8% of met/not-met label variance but only 3.6-6.9% of disagreement variance; physician identity accounts for just 2.4%. The dominant 81.8% case-level residual is not reduced by HealthBench's metadata labels (z = -0.22, p = 0.83), normative rubric language (pseudo R^2 = 1.2%), medical specialty (0/300 Tukey pairs significant), surface-feature triage (AUC = 0.58), or embeddings (AUC = 0.485). Disagreement follows an inverted-U with completion quality (AUC = 0.689), confirming physicians agree on clearly good or bad outputs but split on borderline cases. Physician-validated uncertainty categories reveal that reducible uncertainty (missing context, ambiguous phrasing) more than doubles disagreement odds (OR = 2.55, p < 10^(-24)), while irreducible uncertainty (genuine medical ambiguity) has no effect (OR = 1.01, p = 0.90), though even the former explains only ~3% of total variance. The agreement ceiling in medical AI evaluation is thus largely structural, but the reducible/irreducible dissociation suggests that closing information gaps in evaluation scenarios could lower disagreement where inherent clinical ambiguity does not, pointing toward actionable evaluation design improvements.

replace On Sample-Efficient Generalized Planning via Learned Transition Models

Authors: Nitin Gupta, Vishal Pallagani, John A. Aydin, Biplav Srivastava

Abstract: Generalized planning studies the construction of solution strategies that generalize across families of planning problems sharing a common domain model, formally defined by a transition function $\gamma : S \times A \rightarrow S$. Classical approaches achieve such generalization through symbolic abstractions and explicit reasoning over $\gamma$. In contrast, recent Transformer-based planners, such as PlanGPT and Plansformer, largely cast generalized planning as direct action-sequence prediction, bypassing explicit transition modeling. While effective on in-distribution instances, these approaches typically require large datasets and model sizes, and often suffer from state drift in long-horizon settings due to the absence of explicit world-state evolution. In this work, we formulate generalized planning as a transition-model learning problem, in which a neural model explicitly approximates the successor-state function $\hat{\gamma} \approx \gamma$ and generates plans by rolling out symbolic state trajectories. Instead of predicting actions directly, the model autoregressively predicts intermediate world states, thereby learning the domain dynamics as an implicit world model. To study size-invariant generalization and sample efficiency, we systematically evaluate multiple state representations and neural architectures, including relational graph encodings. Our results show that learning explicit transition models yields higher out-of-distribution satisficing-plan success than direct action-sequence prediction in multiple domains, while achieving these gains with significantly fewer training instances and smaller models. This is an extended version of a short paper accepted at ICAPS 2026 under the same title.

replace How Well Do Multimodal Models Reason on ECG Signals?

Authors: Maxwell A. Xu, Harish Haresamudram, Catherine W. Liu, Patrick Langer, Jathurshan Pradeepkumar, Wanting Mao, Sunita J. Ferns, Aradhana Verma, Jimeng Sun, Paul Schmiedmayer, Xin Liu, Daniel McDuff, Emily B. Fox, James M. Rehg

Abstract: While multimodal large language models offer a promising solution to the "black box" nature of health AI by generating interpretable reasoning traces, verifying the validity of these traces remains a critical challenge. Existing evaluation methods are either unscalable, relying on manual clinician review, or superficial, utilizing proxy metrics (e.g. QA) that fail to capture the semantic correctness of clinical logic. In this work, we introduce a reproducible framework for evaluating reasoning in ECG signals. We propose decomposing reasoning into two distinct, components: (i) Perception, the accurate identification of patterns within the raw signal, and (ii) Deduction, the logical application of domain knowledge to those patterns. To evaluate Perception, we employ an agentic framework that generates code to empirically verify the temporal structures described in the reasoning trace. To evaluate Deduction, we measure the alignment of the model's logic against a structured database of established clinical criteria in a retrieval-based approach. This dual-verification method enables the scalable assessment of "true" reasoning capabilities.

replace Extended Empirical Validation of the Explainability Solution Space

Authors: Antoni Mestre, Manoli Albert, Miriam Gil, Vicente Pelechano

Abstract: This technical report provides an extended validation of the Explainability Solution Space (ESS) through cross-domain evaluation. While initial validation focused on employee attrition prediction, this study introduces a heterogeneous intelligent urban resource allocation system to demonstrate the generality and domain-independence of the ESS framework. The second case study integrates tabular, temporal, and geospatial data under multi-stakeholder governance conditions. Explicit quantitative positioning of representative XAI families is provided for both contexts. Results confirm that ESS rankings are not domain-specific but adapt systematically to governance roles, risk profiles, and stakeholder configurations. The findings reinforce ESS as a generalizable operational decision-support instrument for explainable AI strategy design across socio-technical systems.

replace Opponent State Inference Under Partial Observability: An HMM-POMDP Framework for 2026 Formula 1 Energy Strategy

Authors: Kalliopi Kleisarchaki

Abstract: The 2026 Formula 1 technical regulations introduce a fundamental change to energy strategy: under a 50/50 internal combustion engine / battery power split with unlimited regeneration and a driver-controlled Override Mode (abbreviated MOM throughout), the optimal energy deployment policy depends not only on a driver's own state but on the hidden state of rival cars. This creates a Partially Observable Stochastic Game that cannot be solved by single-agent optimisation methods. We present a tractable two-layer inference and decision framework. The first layer is a 30-state Hidden Markov Model (HMM) that infers a probability distribution over each rival's ERS charge level, Override Mode status, and tyre degradation state from five publicly observable telemetry signals. The second layer is a Deep Q-Network (DQN) policy that takes the HMM belief state as input and selects between energy deployment strategies. We formally characterise the counter-harvest trap -- a deceptive strategy in which a car deliberately suppresses observable deployment signals to induce a rival into a failed attack -- and show that detecting it requires belief-state inference rather than reactive threshold rules. On synthetic races generated from the model's own assumptions, the HMM achieves 92.3% ERS inference accuracy (random baseline: 33.3%) and detects counter-harvest trap conditions with 95.7% recall. Pre-registration -- empirical validation begins Australian Grand Prix, 8 March 2026.

replace HarmonyCell: Automating Single-Cell Perturbation Modeling under Semantic and Distribution Shifts

Authors: Wenxuan Huang, Mingyu Tsoi, Yanhao Huang, Xinjie Mao, Xue Xia, Hao Wu, Jiaqi Wei, Yuejin Yang, Lang Yu, Cheng Tan, Xiang Zhang, Zhangyang Gao, Siqi Sun

Abstract: Single-cell perturbation studies face dual heterogeneity bottlenecks: (i) semantic heterogeneity--identical biological concepts encoded under incompatible metadata schemas across datasets; and (ii) statistical heterogeneity--distribution shifts from biological variation demanding dataset-specific inductive biases. We propose HarmonyCell, an end-to-end agent framework resolving each challenge through a dedicated mechanism: an LLM-driven Semantic Unifier autonomously maps disparate metadata into a canonical interface without manual intervention; and an adaptive Monte Carlo Tree Search engine operates over a hierarchical action space to synthesize architectures with optimal statistical inductive biases for distribution shifts. Evaluated across diverse perturbation tasks under both semantic and distribution shifts, HarmonyCell achieves a 95% valid execution rate on heterogeneous input datasets (versus 0% for general agents) while matching or even exceeding expert-designed baselines in rigorous out-of-distribution evaluations. This dual-track orchestration enables scalable automatic virtual cell modeling without dataset-specific engineering.

replace LLM-assisted Semantic Option Discovery for Facilitating Adaptive Deep Reinforcement Learning

Authors: Chang Yao, Jinghui Qin, Kebing Jin, Hankz Hankui Zhuo

Abstract: Despite achieving remarkable success in complex tasks, Deep Reinforcement Learning (DRL) is still suffering from critical issues in practical applications, such as low data efficiency, lack of interpretability, and limited cross-environment transferability. However, the learned policy generating actions based on states are sensitive to the environmental changes, struggling to guarantee behavioral safety and compliance. Recent research shows that integrating Large Language Models (LLMs) with symbolic planning is promising in addressing these challenges. Inspired by this, we introduce a novel LLM-driven closed-loop framework, which enables semantic-driven skill reuse and real-time constraint monitoring by mapping natural language instructions into executable rules and semantically annotating automatically created options. The proposed approach utilizes the general knowledge of LLMs to facilitate exploration efficiency and adapt to transferable options for similar environments, and provides inherent interpretability through semantic annotations. To validate the effectiveness of this framework, we conduct experiments on two domains, Office World and Montezuma's Revenge, respectively. The results demonstrate superior performance in data efficiency, constraint compliance, and cross-task transferability.

replace Agentified Assessment of Logical Reasoning Agents

Authors: Zhiyu Ni, Yifeng Xiao, Zheng Liang

Abstract: We present a framework for evaluating and benchmarking logical reasoning agents when assessment itself must be reproducible, auditable, and robust to execution failures. Building on agentified assessment, we use an assessor agent to issue tasks, enforce execution budgets, parse outputs, and record structured failure types, while the agent under test only needs to expose a standardized agent-to-agent interface. As a case study, we benchmark an auto-formalization agent for first-order logic (FOL) reasoning on a solver-verified and repaired split of FOLIO. The agent translates natural language premises and conclusions into executable Z3Py programs and employs satisfiability modulo theories (SMT) solving to determine logical entailment. On the cleaned FOLIO validation set, the auto-formalization agent achieves 86.70% accuracy under the assessor protocol, outperforming a chain-of-thought baseline (73.89%).

replace No Memorization, No Detection: Output Distribution-Based Contamination Detection in Small Language Models

Authors: Omer Sela (Tel Aviv University)

Abstract: CDD, or Contamination Detection via output Distribution, identifies data contamination by measuring the peakedness of a model's sampled outputs. We study the conditions under which this approach succeeds and fails on small language models ranging from 70M to 410M parameters. Using controlled contamination experiments on GSM8K, HumanEval, and MATH, we find that CDD's effectiveness depends critically on whether fine-tuning produces verbatim memorization. In the majority of conditions we test, CDD performs at chance level even when the data is verifiably contaminated and detectable by simpler methods. We show that probability-based methods, specifically perplexity and Min-k% Prob, outperform CDD in every condition we test, suggesting that output-distribution approaches are insufficient for contamination detection in small language models. Our code is available at https://github.com/Sela-Omer/Contamination-Detection-Small-LM

URLs: https://github.com/Sela-Omer/Contamination-Detection-Small-LM

replace Jagarin: A Three-Layer Architecture for Hibernating Personal Duty Agents on Mobile

Authors: Ravi Kiran Kadaboina

Abstract: Personal AI agents face a fundamental deployment paradox on mobile: persistent background execution drains battery and violates platform sandboxing policies, yet purely reactive agents miss time-sensitive obligations until the user remembers to ask. We present Jagarin, a three-layer architecture that resolves this paradox through structured hibernation and demand-driven wake. The first layer, DAWN (Duty-Aware Wake Network), is an on-device heuristic engine that computes a composite urgency score from four signals: duty-typed optimal action windows, user behavioral engagement prediction, opportunity cost of inaction, and cross-duty batch resonance. It uses adaptive per-user thresholds to decide when a sleeping agent should nudge or escalate. The second layer, ARIA (Agent Relay Identity Architecture), is a commercial email identity proxy that routes the full commercial inbox -- obligations, promotional offers, loyalty rewards, and platform updates -- to appropriate DAWN handlers by message category, eliminating cold-start and removing manual data entry. The third layer, ACE (Agent-Centric Exchange), is a protocol framework for direct machine-readable communication from institutions to personal agents, replacing human-targeted email as the canonical channel. Together, these three layers form a complete stack from institutional signal to on-device action, without persistent cloud state, continuous background execution, or privacy compromise. A working Flutter prototype is demonstrated on Android, combining all three layers with an ephemeral cloud agent invoked only on user-initiated escalation.

replace STRUCTUREDAGENT: Planning with AND/OR Trees for Long-Horizon Web Tasks

Authors: ELita Lobo, Xu Chen, Jingjing Meng, Nan Xi, Yang Jiao, Chirag Agarwal, Yair Zick, Yan Gao

Abstract: Recent advances in large language models (LLMs) have enabled agentic systems for sequential decision-making. Such agents must perceive their environment, reason across multiple time steps, and take actions that optimize long-term objectives. However, existing web agents struggle on complex, long-horizon tasks due to limited in-context memory for tracking history, weak planning abilities, and greedy behaviors that lead to premature termination. To address these challenges, we propose STRUCTUREDAGENT, a hierarchical planning framework with two core components: (1) an online hierarchical planner that uses dynamic AND/OR trees for efficient search and (2) a structured memory module that tracks and maintains candidate solutions to improve constraint satisfaction in information-seeking tasks. The framework also produces interpretable hierarchical plans, enabling easier debugging and facilitating human intervention when needed. Our results on WebVoyager, WebArena, and custom shopping benchmarks show that STRUCTUREDAGENT improves performance on long-horizon web-browsing tasks compared to standard LLM-based agents.

replace Building Effective AI Coding Agents for the Terminal: Scaffolding, Harness, Context Engineering, and Lessons Learned

Authors: Nghi D. Q. Bui

Abstract: The landscape of AI coding assistance is undergoing a fundamental shift from complex IDE plugins to versatile, terminal-native agents. Operating directly where developers manage source control, execute builds, and deploy environments, CLI-based agents offer unprecedented autonomy for long-horizon development tasks. In this paper, we present OPENDEV, an open-source, command-line coding agent engineered specifically for this new paradigm. Effective autonomous assistance requires strict safety controls and highly efficient context management to prevent context bloat and reasoning degradation. OPENDEV overcomes these challenges through a compound AI system architecture with workload-specialized model routing, a dual-agent architecture separating planning from execution, lazy tool discovery, and adaptive context compaction that progressively reduces older observations. Furthermore, it employs an automated memory system to accumulate project-specific knowledge across sessions and counteracts instruction fade-out through event-driven system reminders. By enforcing explicit reasoning phases and prioritizing context efficiency, OPENDEV provides a secure, extensible foundation for terminal-first AI assistance, offering a blueprint for robust autonomous software engineering.

replace RoboLayout: Differentiable 3D Scene Generation for Embodied Agents

Authors: Ali Shamsaddinlou

Abstract: Recent advances in vision language models (VLMs) have shown strong potential for spatial reasoning and 3D scene layout generation from open-ended language instructions. However, generating layouts that are not only semantically coherent but also feasible for interaction by embodied agents remains challenging, particularly in physically constrained indoor environments. In this paper, RoboLayout is introduced as an extension of LayoutVLM that augments the original framework with agent-aware reasoning and improved optimization stability. RoboLayout integrates explicit reachability constraints into a differentiable layout optimization process, enabling the generation of layouts that are navigable and actionable by embodied agents. Importantly, the agent abstraction is not limited to a specific robot platform and can represent diverse entities with distinct physical capabilities, such as service robots, warehouse robots, humans of different age groups, or animals, allowing environment design to be tailored to the intended agent. In addition, a local refinement stage is proposed that selectively reoptimizes problematic object placements while keeping the remainder of the scene fixed, improving convergence efficiency without increasing global optimization iterations. Overall, RoboLayout preserves the strong semantic alignment and physical plausibility of LayoutVLM while enhancing applicability to agent-centric indoor scene generation, as demonstrated by experimental results across diverse scene configurations.

replace-cross Online Neural Networks for Change-Point Detection

Authors: Mikhail Hushchyn, Kenenbek Arzymatov, Denis Derkach

Abstract: Moments when a time series changes its behavior are called change points. Occurrence of change point implies that the state of the system is altered and its timely detection might help to prevent unwanted consequences. In this paper, we present two change-point detection approaches based on neural networks and online learning. These algorithms demonstrate linear computational complexity and are suitable for change-point detection in large time series. We compare them with the best known algorithms on various synthetic and real world data sets. Experiments show that the proposed methods outperform known approaches. We also prove the convergence of the algorithms to the optimal solutions and describe conditions rendering current approach more powerful than offline one.

replace-cross Automated Reinforcement Learning: An Overview

Authors: Reza Refaei Afshar, Joaquin Vanschoren, Uzay Kaymak, Rui Zhang, Yaoxin Wu, Wen Song, Yingqian Zhang

Abstract: Reinforcement Learning and, recently, Deep Reinforcement Learning are popular methods for solving sequential decision-making problems modeled as Markov Decision Processes. RL modeling of a problem and selecting algorithms and hyper-parameters require careful consideration, as different configurations may entail completely different performances. These considerations are mainly the task of RL experts; however, RL is progressively becoming popular in other fields, such as combinatorial optimization, where researchers and system designers are not necessarily RL experts. Besides, many modeling decisions are typically made manually, such as defining state and action space, size of batches, batch update frequency, and time steps. For these reasons, automating different components of RL is of great importance, and it has attracted much attention in recent years. Automated RL provides a framework in which different components of RL, including MDP modeling, algorithm selection, and hyper-parameter optimization, are modeled and defined automatically. In this article, we present the literature on automated RL (AutoRL), including the recent large language model (LLM) based techniques. We also discuss the recent work on techniques that are not presently tailored for automated RL but hold promise for future integration into AutoRL. Furthermore, we discuss the challenges, open questions, and research directions in AutoRL.

replace-cross Explainable classification of astronomical uncertain time series

Authors: Michael Franklin Mbouopda (LIMOS, UCA), Emille E. O. Ishida (LIMOS, UCA), Engelbert Mephu Nguifo (LIMOS, UCA), Emmanuel Gangler (LPC, UCA)

Abstract: Exploring the expansion history of the universe, understanding its evolutionary stages, and predicting its future evolution are important goals in astrophysics. Today, machine learning tools are used to help achieving these goals by analyzing transient sources, which are modeled as uncertain time series. Although black-box methods achieve appreciable performance, existing interpretable time series methods failed to obtain acceptable performance for this type of data. Furthermore, data uncertainty is rarely taken into account in these methods. In this work, we propose an uncertaintyaware subsequence based model which achieves a classification comparable to that of state-of-the-art methods. Unlike conformal learning which estimates model uncertainty on predictions, our method takes data uncertainty as additional input. Moreover, our approach is explainable-by-design, giving domain experts the ability to inspect the model and explain its predictions. The explainability of the proposed method has also the potential to inspire new developments in theoretical astrophysics modeling by suggesting important subsequences which depict details of light curve shapes. The dataset, the source code of our experiment, and the results are made available on a public repository.

replace-cross Utility Theory based Cognitive Modeling in the Application of Robotics: A Survey

Authors: Qin Yang

Abstract: Cognitive modeling, which explores the essence of cognition, including motivation, emotion, and perception, has been widely applied in the artificial intelligence (AI) agent domains, such as robotics. From the computational perspective, various cognitive functionalities have been developed through utility theory to provide a detailed and process-based understanding for specifying corresponding computational models of representations, mechanisms, and processes. Especially for decision-making and learning in multi-agent/robot systems (MAS/MRS), a suitable cognitive model can guide agents in choosing reasonable strategies to achieve their current needs and learning to cooperate and organize their behaviors, optimizing the system's utility, building stable and reliable relationships, and guaranteeing each group member's sustainable development, similar to the human society. This survey examines existing robotic systems for developmental cognitive models in the context of utility theory. We discuss the evolution of cognitive modeling in robotics from behavior-based robotics (BBR) and cognitive architectures to the properties of value systems in robots, such as the studies on motivations as artificial value systems, and the utility theory based cognitive modeling for generating and updating strategies in robotic interactions. Then, we examine the extent to which existing value systems support the application of robotics from an AI agent cognitive modeling perspective, including single-agent and multi-agent systems, trust among agents, and human-robot interaction. Finally, we survey the existing literature of current value systems in relevant fields and propose several promising research directions, along with some open problems that we deem necessary for further investigation.

replace-cross Online Dispatching and Routing for Automated Guided Vehicles in Pickup and Delivery Systems on Loop-Based Graphs

Authors: Louis Stubbe, Jens Goemaere, Jan Goedgebeur

Abstract: Automated guided vehicles (AGVs) are widely used in various industries, and scheduling and routing them in a conflict-free manner is crucial to their efficient operation. We propose a loop-based algorithm that solves the online, conflict-free scheduling and routing problem for AGVs with any capacity and ordered jobs in loop-based graphs. The proposed algorithm is compared against an exact method, a greedy heuristic and a metaheuristic. We experimentally show, using theoretical and real instances on a model representing a real manufacturing plant, that this algorithm either outperforms the other algorithms or gets an equally good solution in less computing time.

replace-cross Survey of Computerized Adaptive Testing: A Machine Learning Perspective

Authors: Yan Zhuang, Qi Liu, Haoyang Bi, Zhenya Huang, Weizhe Huang, Jiatong Li, Junhao Yu, Zirui Liu, Zirui Hu, Yuting Hong, Zachary A. Pardos, Haiping Ma, Mengxiao Zhu, Shijin Wang, Enhong Chen

Abstract: Computerized Adaptive Testing (CAT) offers an efficient and personalized method for assessing examinee proficiency by dynamically adjusting test questions based on individual performance. Compared to traditional, non-personalized testing methods, CAT requires fewer questions and provides more accurate assessments. As a result, CAT has been widely adopted across various fields, including education, healthcare, sports, sociology, and the evaluation of AI models. While traditional methods rely on psychometrics and statistics, the increasing complexity of large-scale testing has spurred the integration of machine learning techniques. This paper aims to provide a machine learning-focused survey on CAT, presenting a fresh perspective on this adaptive testing paradigm. We delve into measurement models, question selection algorithm, bank construction, and test control within CAT, exploring how machine learning can optimize these components. Through an analysis of current methods, strengths, limitations, and challenges, we strive to develop robust, fair, and efficient CAT systems. By bridging psychometric-driven CAT research with machine learning, this survey advocates for a more inclusive and interdisciplinary approach to the future of adaptive testing.

replace-cross Fast Explanations via Policy Gradient-Optimized Explainer

Authors: Deng Pan, Nuno Moniz, Nitesh Chawla

Abstract: The challenge of delivering efficient explanations is a critical barrier that prevents the adoption of model explanations in real-world applications. Existing approaches often depend on extensive model queries for sample-level explanations or rely on expert's knowledge of specific model structures that trade general applicability for efficiency. To address these limitations, this paper introduces a novel framework Fast Explanation (FEX) that represents attribution-based explanations via probability distributions, which are optimized by leveraging the policy gradient method. The proposed framework offers a robust, scalable solution for real-time, large-scale model explanations, bridging the gap between efficiency and applicability. We validate our framework on image and text classification tasks and the experiments demonstrate that our method reduces inference time by over 97% and memory usage by 70% compared to traditional model-agnostic approaches while maintaining high-quality explanations and broad applicability.

replace-cross Exploring Diffusion Models' Corruption Stage in Few-Shot Fine-tuning and Mitigating with Bayesian Neural Networks

Authors: Xiaoyu Wu, Jiaru Zhang, Yang Hua, Bohan Lyu, Hao Wang, Tao Song, Haibing Guan

Abstract: Few-shot fine-tuning of Diffusion Models (DMs) is a key advancement, significantly reducing training costs and enabling personalized AI applications. However, we explore the training dynamics of DMs and observe an unanticipated phenomenon: during the training process, image fidelity initially improves, then unexpectedly deteriorates with the emergence of noisy patterns, only to recover later with severe overfitting. We term the stage with generated noisy patterns as corruption stage. To understand this corruption stage, we begin by theoretically modeling the one-shot fine-tuning scenario, and then extend this modeling to more general cases. Through this modeling, we identify the primary cause of this corruption stage: a narrowed learning distribution inherent in the nature of few-shot fine-tuning. To tackle this, we apply Bayesian Neural Networks (BNNs) on DMs with variational inference to implicitly broaden the learned distribution, and present that the learning target of the BNNs can be naturally regarded as an expectation of the diffusion loss and a further regularization with the pretrained DMs. This approach is highly compatible with current few-shot fine-tuning methods in DMs and does not introduce any extra inference costs. Experimental results demonstrate that our method significantly mitigates corruption, and improves the fidelity, quality and diversity of the generated images in both object-driven and subject-driven generation tasks. Code is available at https://github.com/Nicholas0228/BNN-Finetuning-DMs.

URLs: https://github.com/Nicholas0228/BNN-Finetuning-DMs.

replace-cross OTAD: An Optimal Transport-Induced Robust Model for Agnostic Adversarial Attack

Authors: Kuo Gai, Sicong Wang, Shihua Zhang

Abstract: Deep neural networks (DNNs) are vulnerable to small adversarial perturbations of the inputs, posing a significant challenge to their reliability and robustness. Empirical methods such as adversarial training can defend against particular attacks but remain vulnerable to more powerful attacks. Alternatively, Lipschitz networks provide certified robustness to unseen perturbations but lack sufficient expressive power. To harness the advantages of both approaches, we design a novel two-step Optimal Transport induced Adversarial Defense (OTAD) model that can fit the training data accurately while preserving the local Lipschitz continuity. First, we train a DNN with a regularizer derived from optimal transport theory, yielding a discrete optimal transport map linking data to its features. By leveraging the map's inherent regularity, we interpolate the map by solving the convex integration problem (CIP) to guarantee the local Lipschitz property. OTAD is extensible to diverse architectures of ResNet and Transformer, making it suitable for complex data. For efficient computation, the CIP can be solved through training neural networks. OTAD opens a novel avenue for developing reliable and secure deep learning systems through the regularity of optimal transport maps. Empirical results demonstrate that OTAD can outperform other robust models on diverse datasets.

replace-cross Variational Learning of Gaussian Process Latent Variable Models through Stochastic Gradient Annealed Importance Sampling

Authors: Jian Xu, Shian Du, Junmei Yang, Qianli Ma, Delu Zeng, John Paisley

Abstract: Gaussian Process Latent Variable Models (GPLVMs) have become increasingly popular for unsupervised tasks such as dimensionality reduction and missing data recovery due to their flexibility and non-linear nature. An importance-weighted version of the Bayesian GPLVMs has been proposed to obtain a tighter variational bound. However, this version of the approach is primarily limited to analyzing simple data structures, as the generation of an effective proposal distribution can become quite challenging in high-dimensional spaces or with complex data sets. In this work, we propose an Annealed Importance Sampling (AIS) approach to address these issues. By transforming the posterior into a sequence of intermediate distributions using annealing, we combine the strengths of Sequential Monte Carlo samplers and VI to explore a wider range of posterior distributions and gradually approach the target distribution. We further propose an efficient algorithm by reparameterizing all variables in the evidence lower bound (ELBO). Experimental results on both toy and image datasets demonstrate that our method outperforms state-of-the-art methods in terms of tighter variational bounds, higher log-likelihoods, and more robust convergence.

replace-cross Multimodal Laryngoscopic Video Analysis for Assisted Diagnosis of Vocal Fold Paralysis

Authors: Yucong Zhang, Xin Zou, Jinshan Yang, Wenjun Chen, Juan Liu, Faya Liang, Ming Li

Abstract: This paper presents the Multimodal Laryngoscopic Video Analyzing System (MLVAS), a novel system that leverages both audio and video data to automatically extract key video segments and metrics from raw laryngeal videostroboscopic videos for assisted clinical assessment. The system integrates video-based glottis detection with an audio keyword spotting method to analyze both video and audio data, identifying patient vocalizations and refining video highlights to ensure optimal inspection of vocal fold movements. Beyond key video segment extraction from the raw laryngeal videos, MLVAS is able to generate effective audio and visual features for Vocal Fold Paralysis (VFP) detection. Pre-trained audio encoders are utilized to encode the patient voice to get the audio features. Visual features are generated by measuring the angle deviation of both the left and right vocal folds to the estimated glottal midline on the segmented glottis masks. To get better masks, we introduce a diffusion-based refinement that follows traditional U-Net segmentation to reduce false positives. We conducted several ablation studies to demonstrate the effectiveness of each module and modalities in the proposed MLVAS. The experimental results on a public segmentation dataset show the effectiveness of our proposed segmentation module. In addition, unilateral VFP classification results on a real-world clinic dataset demonstrate MLVAS's ability of providing reliable and objective metrics as well as visualization for assisted clinical diagnosis.

replace-cross The Future of Software Testing: AI-Powered Test Case Generation and Validation

Authors: Mohammad Baqar, Rajat Khanda

Abstract: Software testing is a crucial phase in the software development lifecycle (SDLC), ensuring that products meet necessary functional, performance, and quality benchmarks before release. Despite advancements in automation, traditional methods of generating and validating test cases still face significant challenges, including prolonged timelines, human error, incomplete test coverage, and high costs of manual intervention. These limitations often lead to delayed product launches and undetected defects that compromise software quality and user satisfaction. The integration of artificial intelligence (AI) into software testing presents a promising solution to these persistent challenges. AI-driven testing methods automate the creation of comprehensive test cases, dynamically adapt to changes, and leverage machine learning to identify high-risk areas in the codebase. This approach enhances regression testing efficiency while expanding overall test coverage. Furthermore, AI-powered tools enable continuous testing and self-healing test cases, significantly reducing manual oversight and accelerating feedback loops, ultimately leading to faster and more reliable software releases. This paper explores the transformative potential of AI in improving test case generation and validation, focusing on its ability to enhance efficiency, accuracy, and scalability in testing processes. It also addresses key challenges associated with adapting AI for testing, including the need for high quality training data, ensuring model transparency, and maintaining a balance between automation and human oversight. Through case studies and examples of real-world applications, this paper illustrates how AI can significantly enhance testing efficiency across both legacy and modern software systems.

replace-cross Reconsidering the energy efficiency of spiking neural networks

Authors: Zhanglu Yan, Zhenyu Bai, Weng-Fai Wong

Abstract: Spiking Neural Networks (SNNs) promise higher energy efficiency over conventional Quantized Artificial Neural Networks (QNNs) due to their event-driven, spike-based computation. However, prevailing energy evaluations often oversimplify, focusing on computational aspects while neglecting critical overheads like comprehensive data movement and memory access. Such simplifications can lead to misleading conclusions regarding the true energy benefits of SNNs. This paper presents a rigorous re-evaluation. We establish a fair baseline by mapping rate-encoded SNNs with $T$ timesteps to functionally equivalent QNNs with $\lceil \log_2(T+1) \rceil$ bits. This ensures both models have comparable representational capacities, as well has similar hardware requirement, enabling meaningful energy comparisons. We introduce a detailed analytical energy model encompassing core computation and data movement. Using this model, we systematically explore a wide parameter space, including intrinsic network characteristics ($T$, spike rate $\SR$, QNN sparsity $\gamma$, model size $N$, weight bit-level) and hardware characteristics (memory system and network-on-chip). Our analysis identifies specific operational regimes where SNNs genuinely offer superior energy efficiency. For example, under typical neuromorphic hardware conditions, SNNs with moderate time windows ($T \in [5,10]$) require an average spike rate ($\SR$) below 6.4\% to outperform equivalent QNNs. Furthermore, to illustrate the real-world implications of our findings, we analyze the operational lifetime of a typical smartwatch, showing that an optimized SNN can nearly double its battery life compared to a QNN. These insights guide the design of turely energy-efficient neural network solutions.

replace-cross Input-to-State Stable Coupled Oscillator Networks for Closed-form Model-based Control in Latent Space

Authors: Maximilian St\"olzle, Cosimo Della Santina

Abstract: Even though a variety of methods have been proposed in the literature, efficient and effective latent-space control (i.e., control in a learned low-dimensional space) of physical systems remains an open challenge. We argue that a promising avenue is to leverage powerful and well-understood closed-form strategies from control theory literature in combination with learned dynamics, such as potential-energy shaping. We identify three fundamental shortcomings in existing latent-space models that have so far prevented this powerful combination: (i) they lack the mathematical structure of a physical system, (ii) they do not inherently conserve the stability properties of the real systems, (iii) these methods do not have an invertible mapping between input and latent-space forcing. This work proposes a novel Coupled Oscillator Network (CON) model that simultaneously tackles all these issues. More specifically, (i) we show analytically that CON is a Lagrangian system - i.e., it possesses well-defined potential and kinetic energy terms. Then, (ii) we provide formal proof of global Input-to-State stability using Lyapunov arguments. Moving to the experimental side, we demonstrate that CON reaches SoA performance when learning complex nonlinear dynamics of mechanical systems directly from images. An additional methodological innovation contributing to achieving this third goal is an approximated closed-form solution for efficient integration of network dynamics, which eases efficient training. We tackle (iii) by approximating the forcing-to-input mapping with a decoder that is trained to reconstruct the input based on the encoded latent space force. Finally, we show how these properties enable latent-space control. We use an integral-saturated PID with potential force compensation and demonstrate high-quality performance on a soft robot using raw pixels as the only feedback information.

replace-cross BNEM: A Boltzmann Sampler Based on Bootstrapped Noised Energy Matching

Authors: RuiKang OuYang, Bo Qiang, Jos\'e Miguel Hern\'andez-Lobato

Abstract: Developing an efficient sampler capable of generating independent and identically distributed (IID) samples from a Boltzmann distribution is a crucial challenge in scientific research, e.g. molecular dynamics. In this work, we intend to learn neural samplers given energy functions instead of data sampled from the Boltzmann distribution. By learning the energies of the noised data, we propose a diffusion-based sampler, Noised Energy Matching, which theoretically has lower variance and more complexity compared to related works. Furthermore, a novel bootstrapping technique is applied to NEM to balance between bias and variance. We evaluate NEM and BNEM on a 2-dimensional 40 Gaussian Mixture Model (GMM) and a 4-particle double-well potential (DW-4). The experimental results demonstrate that BNEM can achieve state-of-the-art performance while being more robust.

replace-cross Improving Visual Object Tracking through Visual Prompting

Authors: Shih-Fang Chen, Jun-Cheng Chen, I-Hong Jhuo, Yen-Yu Lin

Abstract: Learning a discriminative model that distinguishes the specified target from surrounding distractors across frames is essential for generic object tracking (GOT). Dynamic adaptation of target representation against distractors remains challenging because prevailing trackers exhibit limited discriminative capability. To address this issue, we present a new visual prompting mechanism for generic object tracking, termed PiVOT. PiVOT introduces mechanisms that leverage the pretrained foundation model (CLIP) to automatically generate and refine visual prompts online, thereby enabling the tracker to suppress distractors through contrastive guidance. To transfer contrastive knowledge from the foundation model to the tracker, PiVOT automatically propagates this knowledge online and dynamically generates and updates visual prompts. Specifically, it proposes a prompt initialization mechanism that produces an initial visual prompt highlighting potential target locations. The foundation model is then used to refine the prompt based on appearance similarities between candidate objects and reference templates across potential targets. After refinement, the visual prompt better highlights potential target locations and reduces irrelevant prompt information. With the proposed prompting mechanism, the tracker can generate instance-aware feature maps guided by the visual prompts, which are incrementally and automatically updated during tracking, thereby effectively suppressing distractors. Extensive experiments across multiple benchmarks indicate that PiVOT, with the proposed prompting mechanism, can suppress distracting objects and improve tracking performance.

replace-cross Neural delay differential equations: learning non-Markovian closures for partially known dynamical systems

Authors: Thibault Monsel, Onofrio Semeraro, Lionel Mathelin, Guillaume Charpiat

Abstract: Recent advances in learning dynamical systems from data have shown significant promise. However, many existing methods assume access to the full state of the system -- an assumption that is rarely satisfied in practice, where systems are typically monitored through a limited number of sensors, leading to partial observability. To address this challenge, we draw inspiration from the Mori-Zwanzig formalism, which provides a theoretical connection between hidden variables and memory terms. Motivated by this perspective, we introduce a constant-lag Neural Delay Differential Equations (NDDEs) framework, providing a continuous-time approach for learning non-Markovian dynamics directly from data. These memory effects are captured using a finite set of time delays, which are identified via the adjoint method. We validate the proposed approach on a range of datasets, including synthetic systems, chaotic dynamics, and experimental measurements, such as the Kuramoto-Sivashinsky equation and cavity-flow experiments. Results demonstrate that NDDEs compare favourably with existing approaches for partially observed systems, including long short-term memory (LSTM) networks and augmented neural ordinary differential equations (ANODEs). Overall, NDDEs offer a principled and data-efficient framework for modelling non-Markovian dynamics under partial observability. An open-source implementation accompanies this article.

replace-cross Rethinking SNN Online Training and Deployment: Gradient-Coherent Learning via Hybrid-Driven LIF Model

Authors: Zecheng Hao, Yifan Huang, Zijie Xu, Wenxuan Liu, Yuanhong Tang, Zhaofei Yu, Tiejun Huang

Abstract: Spiking Neural Networks (SNNs) are considered to have enormous potential in the future development of Artificial Intelligence due to their brain-inspired and energy-efficient properties. Compared to vanilla Spatial-Temporal Back-propagation (STBP) training methods, online training can effectively avoid the risk of GPU memory explosion. However, current online learning frameworks cannot tackle the gradient discrepancy problem between the forward and backward process, merely aiming to optimize the GPU memory, resulting in no performance advantages compared to the STBP-based models in the inference stage. To address the aforementioned challenges, we propose Hybrid-Driven Leaky Integrate-and-Fire (HD-LIF) model family for efficient online learning, which respectively adopt different spiking calculation mechanism in the upper-region and lower-region of the firing threshold. We theoretically point out that our learning framework can effectively separate temporal gradients and address the misalignment problem of surrogate gradients, as well as achieving full-stage optimization towards learning precision, memory complexity and power consumption. Experimental results have demonstrated that our scheme is enable to achieve state-of-the-art performance for multiple evaluation metrics, breaking through the traditional paradigm of SNN online training and deployment. Code is available at \href{https://github.com/hzc1208/HD_LIF}{here}.

URLs: https://github.com/hzc1208/HD_LIF

replace-cross Puppet-CNN: Continuous Parameter Dynamics for Input-Adaptive Convolutional Networks

Authors: Yucheng Xing, Xin Wang

Abstract: Modern convolutional neural networks (CNNs) organize computation as a discrete stack of layers whose parameters are independently stored and learned, with the number of layers fixed as an architectural hyperparameter. In this work, we explore an alternative perspective: can network parameterization itself be modeled as a continuous dynamical system? We introduce Puppet-CNN, a framework that represents convolutional layer parameters as states evolving along a learned parameter flow governed by a neural ordinary differential equation (ODE). Under this formulation, layer parameters are generated through continuous evolution in parameter space, and the effective number of generated layers is determined by the integration horizon of the learned dynamics, which can be modulated by input complexity to enable input-adaptive computation. We validate this formulation on standard image classification benchmarks and demonstrate that continuous parameter dynamics can achieve competitive predictive performance while substantially reducing stored trainable parameters. These results suggest that viewing neural network parameterization through the lens of dynamical systems provides a structured and flexible design space for adaptive convolutional models.

replace-cross Input-Adaptive Generative Dynamics in Diffusion Models

Authors: Yucheng Xing, Xiaodong Liu, Xin Wang

Abstract: Diffusion models typically generate data through a fixed denoising trajectory that is shared across all samples. However, generation targets can differ in complexity, suggesting that a single pre-defined diffusion process may not be optimal for every input. In this work, we investigate input-adaptive generative dynamics for diffusion models, where the generation process itself adapts to the conditions of each sample. Instead of relying on a fixed diffusion trajectory, the proposed framework allows the generative dynamics to adjust across inputs according to their generation requirements. To enable this behavior, we train the diffusion backbone under varying horizons and noise schedules, so that it can operate consistently under different input-adaptive trajectories. Experiments on conditional image generation show that diffusion trajectories can vary across inputs while maintaining generation quality and reducing the average number of sampling steps. These results provide a proof of the concept that diffusion processes can benefit from input-adaptive generative dynamics rather than relying on a single fixed trajectory.

replace-cross The Illusion of Collusion

Authors: Connor Douglas, Foster Provost, Arun Sundararajan

Abstract: Algorithmic agents are used in a variety of competitive decision-making settings, including pricing contexts that range from online retail to residential home rental. We study the emergence of algorithmic collusion when competing agents employ multi-armed bandit algorithms and competition is modeled as a repeated Prisoner's Dilemma game. Notably, agents in our setting perform online learning with no prior model of game structure and have no direct knowledge of competitor states or actions, thus they cannot learn strategies that depend on these factors. These context-free bandits nonetheless frequently learn seemingly collusive behavior, a phenomenon we term naive collusion. Our results reveal that whether naive collusion emerges depends starkly on the choice of behavior policy employed by bandit learners. The mechanism underpinning the emergence of collusive outcomes is synchronicity in agent action plays, where synchronicity captures how often agents play the same action. We show that in the long-run, naive algorithmic collusion never emerges when both agents use a broad class of persistently random algorithms, including the epsilon-greedy algorithm without epsilon decay, sometimes emerges when both agents use greedy-in-the-limit algorithms which feature randomness during exploration but are asymptotically deterministic, and always emerges when both agents use deterministic bandit learning algorithms like those in the well-known upper confidence bound (UCB) family. We highlight market and algorithmic conditions under which one can and cannot predict a priori whether collusion will occur. Our findings have several policy implications: preventing pricing algorithms from conditioning their actions on competitor prices may not preclude algorithmic collusion, symmetry in algorithms may increase collusion potential, and the emergence of algorithmic collusion is path dependent.

replace-cross Multi-modal, Multi-task, Multi-criteria Automatic Evaluation with Vision Language Models

Authors: Masanari Ohi, Masahiro Kaneko, Naoaki Okazaki, Nakamasa Inoue

Abstract: Vision-language models (VLMs) have shown impressive abilities across a range of multi-modal tasks. However, existing metrics for evaluating the quality of text generated by VLMs typically focus on an overall evaluation for a specific task, such as image captioning. While the overall evaluation is essential for any task, the criteria prioritized can differ depending on the task, making it challenging for current metrics to adapt to multi-task scenarios. To address this limitation, we propose HarmonicEval, a reference-free comprehensive evaluation metric that aggregates criterion-wise scores to produce the overall score in a bottom-up manner. Furthermore, to assess the generalizability of automatic evaluation metrics in multi-task scenarios, we construct the Multi-task Multi-criteria Human Evaluation (MMHE) benchmark, which comprises 18,000 expert human judgments across four multi-modal tasks. Our experiments demonstrate that HarmonicEval achieves higher correlations with human judgments than conventional metrics while providing numerical scores for each criterion. Project page: https://stjohn2007.github.io/MMHE_project/

URLs: https://stjohn2007.github.io/MMHE_project/

replace-cross From Pixels to Predicates: Learning Symbolic World Models via Pretrained Vision-Language Models

Authors: Ashay Athalye, Nishanth Kumar, Tom Silver, Yichao Liang, Jiuguang Wang, Tom\'as Lozano-P\'erez, Leslie Pack Kaelbling

Abstract: Our aim is to learn to solve long-horizon decision-making problems in complex robotics domains given low-level skills and a handful of short-horizon demonstrations containing sequences of images. To this end, we focus on learning abstract symbolic world models that facilitate zero-shot generalization to novel goals via planning. A critical component of such models is the set of symbolic predicates that define properties of and relationships between objects. In this work, we leverage pretrained vision-language models (VLMs) to propose a large set of visual predicates potentially relevant for decision-making, and to evaluate those predicates directly from camera images. At training time, we pass the proposed predicates and demonstrations into an optimization-based model-learning algorithm to obtain an abstract symbolic world model that is defined in terms of a compact subset of the proposed predicates. At test time, given a novel goal in a novel setting, we use the VLM to construct a symbolic description of the current world state, and then use a search-based planning algorithm to find a sequence of low-level skills that achieves the goal. We demonstrate empirically across experiments in both simulation and the real world that our method can generalize aggressively, applying its learned world model to solve problems with a wide variety of object types, arrangements, numbers of objects, and visual backgrounds, as well as novel goals and much longer horizons than those seen at training time.

replace-cross Efficient Semi-Supervised Adversarial Training via Latent Clustering-Based Data Reduction

Authors: Somrita Ghosh, Yuelin Xu, Xiao Zhang

Abstract: Learning robust models under adversarial settings is widely recognized as requiring a considerably large number of training samples. Recent work proposes semi-supervised adversarial training (SSAT), which utilizes external unlabeled or synthetically generated data and is currently the state of the art. However, SSAT requires substantial extra data to attain high robustness, resulting in prolonged training time and increased memory usage. In this paper, we propose data reduction strategies to improve the efficiency of SSAT by optimizing the amount of additional data incorporated. Specifically, we design novel latent clustering-based techniques to select or generate a small, critical subset of data samples near the model's decision boundary. While focusing on boundary-adjacent points, our methods maintain a balanced ratio between boundary and non-boundary data points, thereby avoiding overfitting. Comprehensive experiments across image benchmarks demonstrate that our methods can effectively reduce SSAT's data requirements and computational costs while preserving its strong robustness advantages. In particular, our latent-space selection scheme based on k-means clustering and our guided diffusion-based approach with LCG-KM are the most effective, achieving nearly identical robust accuracies with 5 times to 10 times less unlabeled data. When compared to full SSAT trained to convergence, our methods reduce total runtime by approximately 3 times to 4 times due to strategic prioritization of unlabeled data.

replace-cross A Single Model Ensemble Framework for Neural Machine Translation using Pivot Translation

Authors: Seokjin Oh, Keonwoong Noh, Woohwan Jung

Abstract: Despite the recent remarkable advances in neural machine translation, translation quality for low-resource language pairs remains subpar. Ensembling multiple systems is a widely adopted technique to enhance performance, often accomplished by combining probability distributions. However, previous approaches face the challenge of high computational costs for training multiple models. Furthermore, for black-box models, averaging token-level probabilities at each decoding step is not feasible. To address the problems of multi-model ensemble methods, we present a pivot-based single model ensemble. The proposed strategy consists of two steps: pivot-based candidate generation and post-hoc aggregation. In the first step, we generate candidates through pivot translation. This can be achieved with only a single model and facilitates knowledge transfer from high-resource pivot languages, resulting in candidates that are not only diverse but also more accurate. Next, in the aggregation step, we select k high-quality candidates from the generated candidates and merge them to generate a final translation that outperforms the existing candidates. Our experimental results show that our method produces translations of superior quality by leveraging candidates from pivot translation to capture the subtle nuances of the source sentence.

replace-cross GRADIEND: Feature Learning within Neural Networks Exemplified through Biases

Authors: Jonathan Drechsel, Steffen Herbold

Abstract: AI systems frequently exhibit and amplify social biases, leading to harmful consequences in critical areas. This study introduces a novel encoder-decoder approach that leverages model gradients to learn a feature neuron encoding societal bias information such as gender, race, and religion. We show that our method can not only identify which weights of a model need to be changed to modify a feature, but even demonstrate that this can be used to rewrite models to debias them while maintaining other capabilities. We demonstrate the effectiveness of our approach across various model architectures and highlight its potential for broader applications.

replace-cross An Efficient Local Search Approach for Polarized Community Discovery in Signed Networks

Authors: Linus Aronsson, Morteza Haghir Chehreghani

Abstract: Signed networks, where edges are labeled as positive or negative to represent friendly or antagonistic interactions, provide a natural framework for analyzing polarization, trust, and conflict in social systems. Detecting meaningful group structures in such networks is crucial for understanding online discourse, political divisions, and trust dynamics. A key challenge is to identify communities that are internally cohesive and externally antagonistic, while allowing for neutral or unaligned vertices. In this paper, we propose a method for identifying $k$ polarized communities that addresses a major limitation of prior methods: their tendency to produce highly size-imbalanced solutions. We introduce a novel optimization objective that avoids such imbalance. In addition, it is well known that approximation algorithms based on local search are highly effective for clustering signed networks when neutral vertices are not allowed. We build on this idea and design the first local search algorithm that extends to the setting with neutral vertices while scaling to large networks. By connecting our approach to block-coordinate Frank-Wolfe optimization, we prove a linear convergence rate, enabled by the structure of our objective. Experiments on real-world and synthetic datasets demonstrate that our method consistently outperforms state-of-the-art baselines in solution quality, while remaining competitive in computational efficiency.

replace-cross Mitigating Unintended Memorization with LoRA in Federated Learning for LLMs

Authors: Thierry Bossy, Julien Vignoud, Tahseen Rabbani, Juan R. Troncoso Pastoriza, Martin Jaggi

Abstract: Federated learning (FL) is a popular paradigm for collaborative training which avoids direct data exposure between clients. However, data privacy issues still remain: FL-trained large language models are capable of memorizing and completing phrases and sentences contained in training data when given their prefixes. Thus, it is possible for adversarial and honest- but-curious clients to recover training data of other participants simply through targeted prompting. In this work, we demonstrate that a popular and simple fine-tuning strategy, low-rank adaptation (LoRA), reduces memorization during FL by a factor of up to 10 without significant performance cost. We study this effect by performing fine-tuning tasks in high-risk domains such as medicine, law, and finance. We observe a reduction in memorization for a wide variety of model families, from 1B to 70B parameters. We find that LoRA can reduce memorization in centralized learning as well, and we compare how the memorization patterns differ. Furthermore, we study the effect of hyperparameters and show that LoRA can be combined with other privacy-preserving techniques such as gradient clipping and Gaussian noise, secure aggregation, and Goldfish loss to further improve record-level privacy while maintaining performance.

replace-cross Prompt-SID: Learning Structural Representation Prompt via Latent Diffusion for Single-Image Denoising

Authors: Huaqiu Li, Wang Zhang, Xiaowan Hu, Tao Jiang, Zikang Chen, Haoqian Wang

Abstract: Many studies have concentrated on constructing supervised models utilizing paired datasets for image denoising, which proves to be expensive and time-consuming. Current self-supervised and unsupervised approaches typically rely on blind-spot networks or sub-image pairs sampling, resulting in pixel information loss and destruction of detailed structural information, thereby significantly constraining the efficacy of such methods. In this paper, we introduce Prompt-SID, a prompt-learning-based single image denoising framework that emphasizes preserving of structural details. This approach is trained in a self-supervised manner using downsampled image pairs. It captures original-scale image information through structural encoding and integrates this prompt into the denoiser. To achieve this, we propose a structural representation generation model based on the latent diffusion process and design a structural attention module within the transformer-based denoiser architecture to decode the prompt. Additionally, we introduce a scale replay training mechanism, which effectively mitigates the scale gap from images of different resolutions. We conduct comprehensive experiments on synthetic, real-world, and fluorescence imaging datasets, showcasing the remarkable effectiveness of Prompt-SID. Our code will be released at https://github.com/huaqlili/Prompt-SID.

URLs: https://github.com/huaqlili/Prompt-SID.

replace-cross Language in the Flow of Time: Time-Series-Paired Texts Weaved into a Unified Temporal Narrative

Authors: Zihao Li, Xiao Lin, Zhining Liu, Jiaru Zou, Ziwei Wu, Lecheng Zheng, Dongqi Fu, Yada Zhu, Hendrik Hamann, Hanghang Tong, Jingrui He

Abstract: While many advances in time series models focus exclusively on numerical data, research on multimodal time series, particularly those involving contextual textual information, remains in its infancy. With recent progress in large language models and time series learning, we revisit the integration of paired texts with time series through the Platonic Representation Hypothesis, which posits that representations of different modalities converge to shared spaces. In this context, we identify that time-series-paired texts may naturally exhibit periodic properties that closely mirror those of the original time series. Building on this insight, we propose a novel framework, Texts as Time Series (TaTS), which considers the time-series-paired texts to be auxiliary variables of the time series. TaTS can be plugged into any existing numerical-only time series models and effectively enable them to handle time series data with paired texts. Through extensive experiments on both multimodal time series forecasting and imputation tasks across benchmark datasets with various existing time series models, we demonstrate that TaTS can enhance multimodal predictive performance without modifying model architectures. Our Code is available at https://github.com/iDEA-iSAIL-Lab-UIUC/TaTS.

URLs: https://github.com/iDEA-iSAIL-Lab-UIUC/TaTS.

replace-cross LaVCa: LLM-assisted Visual Cortex Captioning

Authors: Takuya Matsuyama, Shinji Nishimoto, Yu Takagi

Abstract: Understanding the property of neural populations (or voxels) in the human brain can advance our comprehension of human perceptual and cognitive processing capabilities and contribute to developing brain-inspired computer models. Recent encoding models using deep neural networks (DNNs) have successfully predicted voxel-wise activity. However, interpreting the properties that explain voxel responses remains challenging because of the black-box nature of DNNs. As a solution, we propose LLM-assisted Visual Cortex Captioning (LaVCa), a data-driven approach that uses large language models (LLMs) to generate natural-language captions for images to which voxels are selective. By applying LaVCa for image-evoked brain activity, we demonstrate that LaVCa generates captions that describe voxel selectivity more accurately than the previously proposed method. Furthermore, the captions generated by LaVCa quantitatively capture more detailed properties than the existing method at both the inter-voxel and intra-voxel levels. Furthermore, a more detailed analysis of the voxel-specific properties generated by LaVCa reveals fine-grained functional differentiation within regions of interest (ROIs) in the visual cortex and voxels that simultaneously represent multiple distinct concepts. These findings offer profound insights into human visual representations by assigning detailed captions throughout the visual cortex while highlighting the potential of LLM-based methods in understanding brain representations.

replace-cross Unveiling Downstream Performance Scaling of LLMs: A Clustering-Based Perspective

Authors: Chengyin Xu, Kaiyuan Chen, Xiao Li, Ke Shen, Chenggang Li

Abstract: The escalating scale and cost of Large Language Models (LLMs) training necessitate accurate pre-training prediction of downstream task performance for comprehensive understanding of scaling properties. This is challenged by: 1) the emergence phenomenon, where unpredictable capabilities appearing suddenly at critical model scales; and 2) uneven task difficulty and inconsistent performance scaling patterns, leading to high metric variability. Current prediction methods lack accuracy and reliability. We propose a Clustering-On-Difficulty (COD) framework for downstream performance prediction. The COD framework clusters tasks by their difficulty scaling features, thereby constructing a more stable and predictable task subset that exhibits well-behaved scaling characteristics with the increase of compute budget. We adopt a performance scaling law to predict cluster-wise performance with theoretical support. Predictable subset performance acts as an intermediate predictor for the full evaluation set. We further derive a mapping function to accurately extrapolate the performance of the subset to the full set. Applied to an LLM with 70B parameters, COD achieved a 1.55\% average prediction error across eight key LLM benchmarks, thus providing actionable insights for scaling properties and training monitoring during LLM pre-training.

replace-cross Enhancing Alzheimer's Diagnosis: Leveraging Anatomical Landmarks in Graph Convolutional Neural Networks on Tetrahedral Meshes

Authors: Yanxi Chen, Mohammad Farazi, Zhangsihao Yang, Yonghui Fan, Nicholas Ashton, Eric M Reiman, Yi Su, Yalin Wang

Abstract: Alzheimer's disease (AD) is a major neurodegenerative condition that affects millions around the world. As one of the main biomarkers in the AD diagnosis procedure, brain amyloid positivity is typically identified by positron emission tomography (PET), which is costly and invasive. Brain structural magnetic resonance imaging (sMRI) may provide a safer and more convenient solution for the AD diagnosis. Recent advances in geometric deep learning have facilitated sMRI analysis and early diagnosis of AD. However, determining AD pathology, such as brain amyloid deposition, in preclinical stage remains challenging, as less significant morphological changes can be observed. As a result, few AD classification models are generalizable to the brain amyloid positivity classification task. Blood-based biomarkers (BBBMs), on the other hand, have recently achieved remarkable success in predicting brain amyloid positivity and identifying individuals with high risk of being brain amyloid positive. However, individuals in medium risk group still require gold standard tests such as Amyloid PET for further evaluation. Inspired by the recent success of transformer architectures, we propose a geometric deep learning model based on transformer that is both scalable and robust to variations in input volumetric mesh size. Our work introduced a novel tokenization scheme for tetrahedral meshes, incorporating anatomical landmarks generated by a pre-trained Gaussian process model. Our model achieved superior classification performance in AD classification task. In addition, we showed that the model was also generalizable to the brain amyloid positivity prediction with individuals in the medium risk class, where BM alone cannot achieve a clear classification. Our work may enrich geometric deep learning research and improve AD diagnosis accuracy without using expensive and invasive PET scans.

replace-cross ViLAM: Distilling Vision-Language Reasoning into Attention Maps for Social Robot Navigation

Authors: Mohamed Elnoor, Kasun Weerakoon, Gershom Seneviratne, Jing Liang, Vignesh Rajagopal, Dinesh Manocha

Abstract: We introduce ViLAM, a novel method for distilling vision-language reasoning from large Vision-Language Models (VLMs) into spatial attention maps for socially compliant robot navigation. Unlike traditional methods that rely on expert demonstrations or human-annotated datasets, ViLAM performs knowledge distillation and fine-tuning at the intermediate layer representation (attention) level by aligning attention maps from a pretrained vision-action model with socially guided attention maps derived from a large VLM. These distilled attention maps highlight key navigational regions in a scene and serve as socially informed spatial cost maps for motion planning. To achieve this, we introduce a novel attention-level distillation loss that fuses knowledge from both sources, generating augmented attention maps with enhanced social awareness. These refined attention maps are then used as a traversability costmap within a socially aware local planner for navigation. We validate our approach through real-world experiments on a Husky wheeled robot, and demonstrate 14.2% - 50% improvements in success rate over existing methods.

replace-cross IMPACT: Intelligent Motion Planning with Acceptable Contact Trajectories via Vision-Language Models

Authors: Yiyang Ling, Karan Owalekar, Oluwatobiloba Adesanya, Erdem B{\i}y{\i}k, Daniel Seita

Abstract: Motion planning involves determining a sequence of robot configurations to reach a desired pose, subject to movement and safety constraints. Traditional motion planning finds collision-free paths, but this is overly restrictive in clutter, where it may not be possible for a robot to accomplish a task without contact. In addition, contacts range from relatively benign (e.g. brushing a soft pillow) to more dangerous (e.g. toppling a glass vase), making it difficult to characterize which may be acceptable. In this paper, we propose IMPACT, a novel motion planning framework that uses Vision-Language Models (VLMs) to infer environment semantics, identifying which parts of the environment can best tolerate contact based on object properties and locations. Our approach generates an anisotropic cost map that encodes directional push safety. We pair this map with a contact-aware A* planner to find stable contact-rich paths. We perform experiments using 20 simulation and 10 real-world scenes and assess using task success rate, object displacements, and feedback from human evaluators. Our results over 3200 simulation and 200 real-world trials suggest that IMPACT enables efficient contact-rich motion planning in cluttered settings while outperforming alternative methods and ablations. Our project website is available at https://impact-planning.github.io/.

URLs: https://impact-planning.github.io/.

replace-cross More Women, Same Stereotypes: Unpacking the Gender Bias Paradox in Large Language Models

Authors: Evan Chen, Run-Jun Zhan, Yan-Bai Lin, Hung-Hsuan Chen

Abstract: Large Language Models (LLMs) have revolutionized natural language processing, yet concerns persist regarding their tendency to reflect or amplify social biases. This study introduces a novel evaluation framework to uncover gender biases in LLMs: using free-form storytelling to surface biases embedded within the models. A systematic analysis of ten prominent LLMs shows a consistent pattern of overrepresenting female characters across occupations, likely due to supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF). Paradoxically, despite this overrepresentation, the occupational gender distributions produced by these LLMs align more closely with human stereotypes than with real-world labor data. This highlights the challenge and importance of implementing balanced mitigation measures to promote fairness and prevent the establishment of potentially new biases. We release the prompts and LLM-generated stories at GitHub.

replace-cross From 2D Alignment to 3D Plausibility: Unifying Heterogeneous 2D Priors and Penetration-Free Diffusion for Occlusion-Robust Two-Hand Reconstruction

Authors: Gaoge Han, Yongkang Cheng, Zhe Chen, Shaoli Huang, Tongliang Liu

Abstract: Two-hand reconstruction from monocular images is hampered by complex poses and severe occlusions, which often cause interaction misalignment and two-hand penetration. We address this by decoupling the problem into 2D structural alignment and 3D spatial interaction alignment, each handled by a tailored component. For 2D alignment, we pioneer the attempt to unify heterogeneous structural priors (keypoints, segmentation, and depth) from vision foundation models as complementary structured guidance for two-hand recovery. Instead of extracting priors prediction as explicit inputs, we propose a fusion-alignment encoder that absorbs their structural knowledge implicitly, achieving foundation-level guidance without foundation-level cost. For 3D spatial alignment, we propose a two-hand penetration-free diffusion model that learns a generative mapping from interpenetrated poses to realistic, collision-free configurations. Guided by collision gradients during denoising, the model converges toward the manifold of valid two-hand interactions, preserving geometric and kinematic coherence. This generative formulation approach enables physically credible reconstructions even under occlusion or ambiguous visual input. Extensive experiments on InterHand2.6M and HIC show state-of-the-art or leading performance in interaction alignment and penetration suppression. Project: https://gaogehan.github.io/A2P/

URLs: https://gaogehan.github.io/A2P/

replace-cross More Bang for the Buck: Process Reward Modeling with Entropy-Driven Uncertainty

Authors: Lang Cao, Renhong Chen, Yingtian Zou, Chao Peng, Huacong Xu, Yuxian Wang, Wu Ning, Qian Chen, Mofan Peng, Zijie Chen, Peishuo Su, Yitong Li

Abstract: We introduce the Entropy-Driven Uncertainty Process Reward Model (EDU-PRM), a novel entropy-driven training framework for process reward modeling that enables dynamic, uncertainty-aligned segmentation of complex reasoning steps, eliminating the need for costly manual step annotations. Unlike previous Process Reward Models (PRMs) that rely on static partitioning and human labeling, EDU-PRM automatically anchors step boundaries at tokens with high predictive entropy, effectively capturing intrinsic logical transitions and facilitating efficient exploration of diverse reasoning paths. On the ProcessBench benchmark, EDU-PRM outperforms strong public PRM baselines, such as Math-Shepherd PRM and Omega PRM, and EDU-PRM achieves comparable results with SOTA models while only using 1.5% training data. Furthermore, by leveraging our proposed EDU sampling strategy, we observe accuracy boosts from 64.7% to 67.3% for generative reasoning tasks, accompanied by a reduction of 32% in token usage. These findings underscore the potential of EDU-PRM as a scalable and annotation-efficient paradigm for process supervision in mathematical reasoning, paving the way for more efficient and robust approaches to complex mathematical problem solving.

replace-cross MediTools -- Medical Education Powered by LLMs

Authors: Amr Alshatnawi, Remi Sampaleanu, David Liebovitz

Abstract: Artificial Intelligence (AI) has been advancing rapidly and with the advent of large language models (LLMs) in late 2022, numerous opportunities have emerged for adopting this technology across various domains, including medicine. These innovations hold immense potential to revolutionize and modernize medical education. Our research project leverages large language models to enhance medical education and address workflow challenges through the development of MediTools - AI Medical Education. This prototype application focuses on developing interactive tools that simulate real-life clinical scenarios, provide access to medical literature, and keep users updated with the latest medical news. Our first tool is a dermatology case simulation tool that uses real patient images depicting various dermatological conditions and enables interaction with LLMs acting as virtual patients. This platform allows users to practice their diagnostic skills and enhance their clinical decision-making abilities. The application also features two additional tools: an AI-enhanced PubMed tool for engaging with LLMs to gain deeper insights into research papers, and a Google News tool that offers LLM generated summaries of articles for various medical specialties. A comprehensive survey has been conducted among medical professionals and students to gather initial feedback on the effectiveness and user satisfaction of MediTools, providing insights for further development and refinement of the application. This research demonstrates the potential of AI-driven tools in transforming and revolutionizing medical education, offering a scalable and interactive platform for continuous learning and skill development.

replace-cross SFIBA: Spatial-based Full-target Invisible Backdoor Attacks

Authors: Yangxu Yin, Honglong Chen, Yudong Gao, Peng Sun, Zhishuai Li, Weifeng Liu

Abstract: Multi-target backdoor attacks pose significant security threats to deep neural networks, as they can preset multiple target classes through a single backdoor injection. This allows attackers to control the model to misclassify poisoned samples with triggers into any desired target class during inference, exhibiting superior attack performance compared with conventional backdoor attacks. However, existing multi-target backdoor attacks fail to guarantee trigger specificity and stealthiness in black-box settings, resulting in two main issues. First, they are unable to simultaneously target all classes when only training data can be manipulated, limiting their effectiveness in realistic attack scenarios. Second, the triggers often lack visual imperceptibility, making poisoned samples easy to detect. To address these problems, we propose a Spatial-based Full-target Invisible Backdoor Attack, called SFIBA. It restricts triggers for different classes to specific local spatial regions and morphologies in the pixel space to ensure specificity, while employing a frequency-domain-based trigger injection method to guarantee stealthiness. Specifically, for injection of each trigger, we first apply fast fourier transform to obtain the amplitude spectrum of clean samples in local spatial regions. Then, we employ discrete wavelet transform to extract the features from the amplitude spectrum and use singular value decomposition to integrate the trigger. Subsequently, we selectively filter parts of the trigger in pixel space to implement trigger morphology constraints and adjust injection coefficients based on visual effects. We conduct experiments on multiple datasets and models. The results demonstrate that SFIBA can achieve excellent attack performance and stealthiness, while preserving the model's performance on benign samples, and can also bypass existing backdoor defenses.

replace-cross Multi-Domain Audio Question Answering Benchmark Toward Acoustic Content Reasoning

Authors: Chao-Han Huck Yang, Sreyan Ghosh, Qing Wang, Jaeyeon Kim, Hengyi Hong, Sonal Kumar, Guirui Zhong, Zhifeng Kong, S Sakshi, Vaibhavi Lokegaonkar, Oriol Nieto, Ramani Duraiswami, Dinesh Manocha, Gunhee Kim, Jun Du, Rafael Valle, Bryan Catanzaro

Abstract: We present Task 5 of the DCASE 2025 Challenge: an Audio Question Answering (AQA) benchmark spanning multiple domains of sound understanding. This task defines three QA subsets (Bioacoustics, Temporal Soundscapes, and Complex QA) to test audio-language models on interactive question-answering over diverse acoustic scenes. We describe the dataset composition (from marine mammal calls to soundscapes and complex real-world clips), the evaluation protocol (top-1 accuracy with answer-shuffling robustness), and baseline systems (Qwen2-Audio-7B, AudioFlamingo 2, Gemini-2-Flash). Preliminary results on the development set are compared, showing strong variation across models and subsets. This challenge aims to advance the audio understanding and reasoning capabilities of audio-language models toward human-level acuity, which are crucial for enabling AI agents to perceive and interact about the world effectively.

replace-cross Ready2Unlearn: A Learning-Time Approach for Preparing Models with Future Unlearning Readiness

Authors: Hanyu Duan, Yi Yang, Ahmed Abbasi, Kar Yan Tam

Abstract: Machine unlearning is the process of removing the imprint left by specific data samples during the training of a machine learning model. AI developers, including those building personalized technologies, employ machine unlearning for various purposes such as privacy protection, security, and to address ethical concerns. This paper introduces Ready2Unlearn, a learning-time optimization approach designed to facilitate future unlearning processes. Unlike the majority of existing unlearning efforts that focus on designing unlearning algorithms, which are typically implemented reactively when an unlearning request is made during the model deployment phase, Ready2Unlearn shifts the focus to the training phase, adopting a "forward-looking" perspective. Building upon well-established meta-learning principles, Ready2Unlearn proactively trains machine learning models with unlearning readiness, such that they are well prepared and can handle future unlearning requests in a more efficient and principled manner. Ready2Unlearn is model-agnostic and compatible with any gradient ascent-based machine unlearning algorithms. We evaluate the method on both language and vision tasks under various unlearning settings, including class-wise unlearning and random data unlearning. Experimental results show that by incorporating such preparedness at training time, Ready2Unlearn produces an unlearning-ready model state, which offers several key advantages when future unlearning is requested. We hope this study inspires future research on proactive strategies for equipping machine learning models with built-in unlearning readiness, particularly in modern information systems that rely heavily on user data for recommendation, search, and personalized services, where privacy risks and data deletion demands are increasingly prevalent.

replace-cross FreeKV: Boosting KV Cache Retrieval for Efficient LLM Inference

Authors: Guangda Liu, Chengwei Li, Zhenyu Ning, Jing Lin, Yiwu Yao, Danning Ke, Minyi Guo, Jieru Zhao

Abstract: Large language models (LLMs) are widely deployed with rapidly expanding context windows to support increasingly demanding applications. However, long contexts pose significant deployment challenges, primarily due to the KV cache whose size grows proportionally with context length. While KV cache compression methods have been proposed to address this issue, KV dropping methods incur considerable accuracy loss, and KV retrieval methods suffer from significant efficiency bottlenecks. We propose FreeKV, a training-free algorithm-system co-optimization framework to enhance KV retrieval efficiency while preserving accuracy. On the algorithm side, FreeKV introduces speculative retrieval to shift the KV selection and recall processes out of the critical path, combined with fine-grained correction to ensure accuracy. On the system side, FreeKV employs hybrid KV layouts across CPU and GPU memory to eliminate fragmented data transfers, and leverages double-buffered streamed recall to further improve efficiency, enabling effective overlap with computation, full latency hiding, and practical speedups from speculative recall. Experiments demonstrate that FreeKV achieves near-lossless accuracy across various scenarios and models, delivering up to a 13$\times$ speedup compared to SOTA KV retrieval methods. Code is available at https://github.com/sjtu-zhao-lab/FreeKV.

URLs: https://github.com/sjtu-zhao-lab/FreeKV.

replace-cross MAS-ZERO: Designing Multi-Agent Systems with Zero Supervision

Authors: Zixuan Ke, Austin Xu, Yifei Ming, Xuan-Phi Nguyen, Ryan Chin, Caiming Xiong, Shafiq Joty

Abstract: Multi-agent systems (MAS) leveraging the impressive capabilities of Large Language Models (LLMs) hold significant potential for tackling complex tasks. However, most current MAS depend on manually designed agent roles and communication protocols. These manual designs often fail to align with the underlying LLMs' strengths and struggle to adapt to novel tasks. Recent automatic MAS approaches attempt to mitigate these limitations but typically necessitate a validation set for tuning and yield static MAS designs lacking adaptability during inference, while also removing the flexibility to reduce to simpler systems. We introduce MAS-ZERO, the first self-evolved, inference-time framework for automatic MAS design. MAS-ZERO employs meta-level design to iteratively design, critique, and refine MAS configurations tailored to each problem instance, without requiring a validation set. Critically, it enables dynamic problem decomposition and agent composition through meta-feedback on solvability and completeness, and reduction to simpler systems when appropriate. Experiments across reasoning (math and graduate-level QA), coding, and agentic (search-based) benchmarks, using both closed-source and open-source LLM backbones of varying sizes, demonstrate that MAS-ZERO outperforms strong manual and automatic MAS baselines. It achieves substantial average accuracy improvements of up to 16.69% on reasoning, 16.66% on coding, and 5.45% on agentic tasks, while maintaining cost efficiency.

replace-cross EasyInsert: A Data-Efficient and Generalizable Insertion Policy

Authors: Guanghe Li, Junming Zhao, Shengjie Wang, Yang Gao

Abstract: Robotic insertion is a highly challenging task that requires exceptional precision in cluttered environments. Existing methods often have poor generalization capabilities. They typically function in restricted and structured environments, and frequently fail when the plug and socket are far apart, when the scene is densely cluttered, or when handling novel objects. They also rely on strong assumptions such as access to CAD models or a digital twin in simulation. To address these limitations, we propose EasyInsert. Inspired by human intuition, it formulates insertion as a delta-pose regression problem, which unlocks an efficient, highly scalable data collection pipeline with minimal human labor to train an end-to-end visual policy. During execution, the visual policy predicts the relative pose between plug and socket to drive a multi-phase, coarse-to-fine insertion process. EasyInsert demonstrates strong zero-shot generalization capability for unseen objects in cluttered environments, robustly handling cases with significant initial pose deviations. In real-world experiments, by leveraging just 1 hour of human teleoperation data to bootstrap a large-scale automated data collection process, EasyInsert achieves an over 90% success rate in zero-shot insertion for 13 out of 15 unseen novel objects, including challenging objects like Type-C cables, HDMI cables, and Ethernet cables. Furthermore, requiring only a single manual reset, EasyInsert allows for fast adaptation to novel test objects through automated data collection and fine-tuning, achieving an over 90% success rate across all 15 objects.

replace-cross The Cell Must Go On: Agar.io for Continual Reinforcement Learning

Authors: Mohamed A. Mohamed, Kateryna Nekhomiazh, Vedant Vyas, Marcos M. Jose, Andrew Patterson, Marlos C. Machado

Abstract: Continual reinforcement learning (RL) concerns agents that are expected to learn continually, rather than converge to a policy that is then fixed for evaluation. This setting is well-suited to environments that the agent perceives as changing over time, rendering any static policy ineffective. In continual RL, researchers often simulate such changes either by modifying episodic environments to incorporate task shifts during interaction or by designing simulators that explicitly model continual dynamics. However, transforming episodic problems into continual ones primarily captures scenarios involving abrupt changes in the data stream and still relies on episodic structure. Meanwhile, the few simulators explicitly designed for empirical continual RL research are often limited in scope or complexity. In this paper, we introduce AgarCL, a research platform for continual RL that enables agents to progress toward increasingly sophisticated behaviour. AgarCL is based on the game Agar.io, a non-episodic, high-dimensional problem with stochastic, ever-evolving dynamics, continuous actions, and partial observability. We provide benchmark results for DQN, PPO, and SAC on the primary continual RL challenge, as well as across a suite of smaller tasks within AgarCL. These smaller tasks isolate aspects of the full environment and allow us to characterize the distinct challenges posed by different components of the game. We further evaluate three continual learning methods-Shrink and Perturb, ReDo, and Continual Backpropagation-and observe little improvement over standard RL algorithms, suggesting that the challenges posed by AgarCL extend beyond the stability-plasticity dilemma.

replace-cross Maximum Principle of Optimal Probability Density Control

Authors: Nathan Gaby, Xiaojing Ye

Abstract: We develop a general theoretical framework for optimal probability density control on standard measure spaces, aimed at addressing large-scale multi-agent control problems. In particular, we establish a maximum principle (MP) for control problems posed on infinite-dimensional spaces of probability distributions and control vector fields. We further derive the Hamilton--Jacobi--Bellman equation for the associated value functional defined on the space of probability distributions. Both results are presented in a concise form and supported by rigorous mathematical analysis, enabling efficient numerical treatment of these problems. Building on the proposed MP, we introduce a scalable numerical algorithm that leverages deep neural networks to handle high-dimensional settings. The effectiveness of the approach is demonstrated through several multi-agent control examples involving domain obstacles and inter-agent interactions.

replace-cross OCN: Effectively Utilizing Higher-Order Common Neighbors for Better Link Prediction

Authors: Juntong Wang, Xiyuan Wang, Muhan Zhang

Abstract: Common Neighbors (CNs) and their higher-order variants are important pairwise features widely used in state-of-the-art link prediction methods. However, existing methods often struggle with the repetition across different orders of CNs and fail to fully leverage their potential. We identify that these limitations stem from two key issues: redundancy and over-smoothing in high-order common neighbors. To address these challenges, we design orthogonalization to eliminate redundancy between different-order CNs and normalization to mitigate over-smoothing. By combining these two techniques, we propose Orthogonal Common Neighbor (OCN), a novel approach that significantly outperforms the strongest baselines by an average of 7.7\% on popular link prediction benchmarks. A thorough theoretical analysis is provided to support our method. Ablation studies also verify the effectiveness of our orthogonalization and normalization techniques. Code is available at: https://github.com/qingpingmo/OCN.

URLs: https://github.com/qingpingmo/OCN.

replace-cross "That's another doom I haven't thought about": A User Study on AI Labels as a Safeguard Against Image-Based Misinformation

Authors: Sandra H\"oltervennhoff, Jonas Ricker, Maike M. Raphael, Charlotte Schwedes, Rebecca Weil, Asja Fischer, Thorsten Holz, Lea Sch\"onherr, Sascha Fahl

Abstract: As generative AI is increasingly contributing to the spread of deceptively realistic misinformation, lawmakers have introduced regulations requiring the disclosure of AI-generated content. However, it is unclear if labels reduce the risk of users falling for AI-generated misinformation. To address this research gap, we study the effect of labels on users' perception and the implications of mislabeling, focusing on AI-generated images. We first explored users' opinions and expectations of labels using five focus groups. Although participants were wary of practical implementations, they considered labeling helpful in identifying AI-generated images and avoiding deception. Second, we conducted a survey with 1354 participants to assess how labels affect users' ability to recognize misinformation. While labels reduced participants' belief in false claims supported by AI-generated images, we found evidence of overreliance, leading to unintended side effects: Participants were more susceptible to false claims accompanied by human-made images, and were more hesitant to believe true claims illustrated with labeled AI-generated images.

replace-cross Representing local protein environments with machine learning force fields

Authors: Meital Bojan, Sanketh Vedula, Advaith Maddipatla, Nadav Bojan Sellam, Anar Rzayev, Federico Napoli, Paul Schanda, Alex M. Bronstein

Abstract: The local structure of a protein strongly impacts its function and interactions with other molecules. Therefore, a concise, informative representation of a local protein environment is essential for modeling and designing proteins and biomolecular interactions. However, these environments' extensive structural and chemical variability makes them challenging to model, and such representations remain under-explored. In this work, we propose a novel representation for a local protein environment derived from the intermediate features of atomistic foundation models (AFMs). We demonstrate that this embedding effectively captures both local structure (e.g., secondary motifs), and chemical features (e.g., amino-acid identity and protonation state). We further show that the AFM-derived representation space exhibits meaningful structure, enabling the construction of data-driven priors over the distribution of biomolecular environments. Finally, in the context of biomolecular NMR spectroscopy, we demonstrate that the proposed representations enable a first-of-its-kind physics-informed chemical shift predictor that achieves state-of-the-art accuracy. Our results demonstrate the surprising effectiveness of atomistic foundation models and their emergent representations for protein modeling beyond traditional molecular simulations. We believe this will open new lines of work in constructing effective functional representations for protein environments.

replace-cross RoboPARA: Dual-Arm Robot Planning with Parallel Allocation and Recomposition Across Tasks

Authors: Shiying Duan, Pei Ren, Nanxiang Jiang, Zhengping Che, Jian Tang, Zhaoxin Fan, Yifan Sun, Wenjun Wu

Abstract: Dual-arm robots play a crucial role in improving efficiency and flexibility in complex multitasking scenarios. While existing methods have achieved promising results in task planning, they often fail to fully optimize task parallelism, limiting the potential of dual-arm collaboration. To address this issue, we propose RoboPARA, a novel large language model (LLM)-driven framework for dual-arm task parallelism planning. RoboPARA employs a two-stage process: (1) Dependency Graph-based Planning Candidates Generation, which constructs directed acyclic graphs (DAGs) to model task dependencies and eliminate redundancy, and (2) Graph Re-Traversal-based Dual-Arm Parallel Planning, which optimizes DAG traversal to maximize parallelism while maintaining task coherence. In addition, we introduce the Cross-Scenario Dual-Arm Parallel Task dataset (X-DAPT dataset), the first dataset specifically designed to evaluate dual-arm task parallelism across diverse scenarios and difficulty levels. Extensive experiments demonstrate that RoboPARA significantly outperforms existing planning methods, achieving higher efficiency and reliability, particularly in complex task combinations. Our code is publicly available at https://github.com/AiDuanshiying/RoboPARA.

URLs: https://github.com/AiDuanshiying/RoboPARA.

replace-cross BemaGANv2: Discriminator Combination Strategies for GAN-based Vocoders in Long-Term Audio Generation

Authors: Taesoo Park, Mungwi Jeong, Mingyu Park, Narae Kim, Junyoung Kim, Mujung Kim, Jisang Yoo, Hoyun Lee, Sanghoon Kim, Soonchul Kwon

Abstract: This paper presents BemaGANv2, an advanced GAN-based vocoder designed for high-fidelity and long-term audio generation, with a focus on systematic evaluation of discriminator combination strategies. Long-term audio generation is critical for applications in Text-to-Music (TTM) and Text-to-Audio (TTA) systems, where maintaining temporal co- herence, prosodic consistency, and harmonic structure over extended durations remains a significant challenge. Built upon the original BemaGAN architecture, BemaGANv2 incorporates major architectural innovations by replacing traditional ResBlocks in the generator with the Anti-aliased Multi-Periodicity composition (AMP) module, which internally applies the Snake activation function to better model periodic structures. In the discriminator framework, we integrate the Multi-Envelope Discriminator (MED), a novel architecture we proposed, to extract rich temporal en- velope features crucial for periodicity detection. Coupled with the Multi-Resolution Discriminator (MRD), this com- bination enables more accurate modeling of long-range dependencies in audio. We systematically evaluate various discriminator configurations, including Multi-Scale Discriminator (MSD) + MED, MSD + MRD, and Multi-Period Discriminator (MPD) + MED + MRD, using objective metrics (Fr\'echet Audio Distance (FAD), Structural Similar- ity Index (SSIM), Pearson Correlation Coefficient (PCC), Mel-Cepstral Distortion (MCD), Multi-Resolution STFT (M-STFT), Periodicity error (Periodicity)) and subjective evaluations (MOS, SMOS). To support reproducibility, we provide detailed architectural descriptions, training configurations, and complete implementation details. The code, pre-trained models, and audio demo samples are available at: https://github.com/dinhoitt/BemaGANv2.

URLs: https://github.com/dinhoitt/BemaGANv2.

replace-cross Co-LoRA: Collaborative Model Personalization on Heterogeneous Multi-Modal Clients

Authors: Minhyuk Seo, Taeheon Kim, Hankook Lee, Jonghyun Choi, Tinne Tuytelaars

Abstract: As AI becomes more personal, e.g., Agentic AI, there is an increasing need for personalizing models for various use cases. Personalized federated learning (PFL) enables each client to collaboratively leverage other clients' knowledge for better adaptation to the task of interest, without privacy risks. Despite its potential, existing PFL methods remain confined to rather simplified scenarios where data and models are the same across clients. To move towards realistic scenarios, we move beyond these restrictive assumptions by addressing both data and model heterogeneity. We propose a task-relevance-aware model aggregation strategy to reduce parameter interference under heterogeneous data. Moreover, we introduce Co-LoRA, a dimension-invariant module that enables knowledge sharing across heterogeneous architectures. To mimic the real-world task diversity, we propose a multi-modal PFL benchmark spanning 40 distinct tasks with distribution shifts over time. Extensive experiments shows that our proposed method significantly outperforms the state-of-the-art PFL methods under heterogeneous scenarios.

replace-cross Context Matters! Relaxing Goals with LLMs for Feasible 3D Scene Planning

Authors: Emanuele Musumeci, Michele Brienza, Francesco Argenziano, Abdel Hakim Drid, Vincenzo Suriani, Daniele Nardi, Domenico D. Bloisi

Abstract: Embodied agents need to plan and act reliably in real and complex 3D environments. Classical planning (e.g., PDDL) offers structure and guarantees, but in practice it fails under noisy perception and incorrect predicate grounding. On the other hand, Large Language Models (LLMs)-based planners leverage commonsense reasoning, yet frequently propose actions that are unfeasible or unsafe. Following recent works that combine the two approaches, we introduce ContextMatters, a framework that fuses LLMs and classical planning to perform hierarchical goal relaxation: the LLM helps ground symbols to the scene and, when the target is unreachable, it proposes functionally equivalent goals that progressively relax constraints, adapting the goal to the context of the agent's environment. Operating on 3D Scene Graphs, this mechanism turns many nominally unfeasible tasks into tractable plans and enables context-aware partial achievement when full completion is not achievable. Our experimental results show a +52.45% Success Rate improvement over state-of-the-art LLMs+PDDL baseline, demonstrating the effectiveness of our approach. Moreover, we validate the execution of ContextMatter in a real world scenario by deploying it on a TIAGo robot. Code, dataset, and supplementary materials are available to the community at https://lab-rococo-sapienza.github.io/context-matters/.

URLs: https://lab-rococo-sapienza.github.io/context-matters/.

replace-cross From Semantic To Instance: A Semi-Self-Supervised Learning Approach

Authors: Keyhan Najafian, Farhad Maleki, Lingling Jin, Ian Stavness

Abstract: Instance segmentation is essential for applications such as automated monitoring of plant health, growth, and yield. However, extensive effort is required to create large-scale datasets with pixel-level annotations of each object instance for developing instance segmentation models that restrict the use of deep learning in these areas. This challenge is more significant in images with densely packed, self-occluded objects, which are common in agriculture. To address this challenge, we propose a semi-self-supervised learning approach that requires minimal manual annotation to develop a high-performing instance segmentation model. We design GLMask, an image-mask representation for the model to focus on shape, texture, and pattern while minimizing its dependence on color features. We develop a pipeline to generate semantic segmentation and then transform it into instance-level segmentation. The proposed approach substantially outperforms the conventional instance segmentation models, establishing a state-of-the-art wheat head instance segmentation model with mAP@50 of 98.5%. Additionally, we assessed the proposed methodology on the general-purpose Microsoft COCO dataset, achieving a significant performance improvement of over 12.6% mAP@50. This highlights that the utility of our proposed approach extends beyond precision agriculture and applies to other domains, specifically those with similar data characteristics.

replace-cross Adaptive Batch-Wise Sample Scheduling for Direct Preference Optimization

Authors: Zixuan Huang, Yikun Ban, Lean Fu, Xiaojie Li, Zhongxiang Dai, Jianxin Li, Deqing Wang

Abstract: Direct Preference Optimization (DPO) has emerged as an effective approach for aligning large language models (LLMs) with human preferences. However, its performance is highly dependent on the quality of the underlying human preference data. To address this bottleneck, prior work has explored various data selection strategies, but these methods often overlook the impact of the evolving states of the language model during the optimization process. In this paper, we introduce a novel problem: Sample Scheduling for DPO, which aims to dynamically and adaptively schedule training samples based on the model's evolving batch-wise states throughout preference optimization. To solve this problem, we propose SamS, an efficient and effective algorithm that adaptively selects samples in each training batch based on the LLM's learning feedback to maximize the potential generalization performance. Notably, without modifying the core DPO algorithm, simply integrating SamS significantly improves performance across tasks, with minimal additional computational overhead. This work points to a promising new direction for improving LLM alignment through batch-wise sample selection, with potential generalization to RLHF and broader supervised learning paradigms.

replace-cross A Simple "Motivation" Can Enhance Reinforcement Finetuning of Large Reasoning Models

Authors: Junjie Zhang, Guozheng Ma, Shunyu Liu, Haoyu Wang, Jiaxing Huang, Ting-En Lin, Fei Huang, Yongbin Li, Dacheng Tao

Abstract: Reinforcement Learning with Verifiable Rewards~(RLVR) has emerged as a powerful learn-to-reason paradigm for large reasoning models to tackle complex tasks. However, the current RLVR paradigm is still not efficient enough, as it works in a trial-and-error manner. To perform better, the model needs to explore the reward space by numerously generating responses and learn from fragmented reward signals, blind to the overall reward patterns. Fortunately, verifiable rewards make the natural language description of the reward function possible, and meanwhile, LLMs have demonstrated strong in-context learning ability. This motivates us to explore if large reasoning models can benefit from a \textbf{motivation} of the task, \textit{i.e.}, awareness of the reward function, during the reinforcement finetuning process, as we humans sometimes do when learning. In this paper, we introduce \textit{\textbf{M}otivation-\textbf{e}nhanced \textbf{R}einforcement \textbf{F}inetuning}~(\textbf{MeRF}), an intuitive yet effective method enhancing reinforcement finetuning of LLMs by involving \emph{``telling LLMs rules of the game''}. Specifically, \textbf{MeRF} directly injects the reward specification into the prompt, which serves as an in-context motivation for the model to be aware of the optimization objective. This simple modification leverages the in-context learning ability of LLMs, aligning generation with optimization, thereby incentivizing the model to generate desired outputs from both inner motivation and external reward. Empirical evaluations demonstrate that \textbf{MeRF} achieves substantial performance gains over the RLVR baseline. Moreover, ablation studies show that MeRF performs better with greater consistency between the in-context motivation and the external reward function, while the model also demonstrates an ability to adapt to misleading motivations through reinforcement finetuning.

replace-cross SUBARU: A Practical Approach to Power Saving in Hearables Using SUB-Nyquist Audio Resolution Upsampling

Authors: Tarikul Islam Tamiti, Sajid Fardin Dipto, Luke Benjamin Baja-Ricketts, David C Vergano, Anomadarshi Barua

Abstract: Hearables are wearable computers that are worn on the ear. Bone conduction microphones (BCMs) are used with air conduction microphones (ACMs) in hearables as a supporting modality for multimodal speech enhancement (SE) in noisy conditions. However, existing works don't consider the following practical aspects for low-power implementations on hearables: (i) They do not explore how lowering the sampling frequencies and bit resolutions in analog-to-digital converters (ADCs) of hearables jointly impact low-power processing and multimodal SE in terms of speech quality and intelligibility. And (iii) They don't process signals from ACMs/BCMs at a sub-Nyquist sampling rate because, in their frameworks, they lack a wideband reconstruction methodology from their narrowband parts. We propose SUBARU (\textbf{Sub}-Nyquist \textbf{A}udio \textbf{R}esolution \textbf{U}psampling), which achieves the following: SUBARU (i) intentionally uses sub-Nyquist sampling and low bit resolution in ADCs, achieving a 3.31x reduction in power consumption; and (ii) achieves streaming operations on mobile platforms and SE in in-the-wild noisy conditions with an inference time of 1.74ms and a memory footprint of less than 13.77MB.

replace-cross LD-RPS: Zero-Shot Unified Image Restoration via Latent Diffusion Recurrent Posterior Sampling

Authors: Huaqiu Li, Yong Wang, Tongwen Huang, Hailang Huang, Haoqian Wang, Xiangxiang Chu

Abstract: Unified image restoration is a significantly challenging task in low-level vision. Existing methods either make tailored designs for specific tasks, limiting their generalizability across various types of degradation, or rely on training with paired datasets, thereby suffering from closed-set constraints. To address these issues, we propose a novel, dataset-free, and unified approach through recurrent posterior sampling utilizing a pretrained latent diffusion model. Our method incorporates the multimodal understanding model to provide sematic priors for the generative model under a task-blind condition. Furthermore, it utilizes a lightweight module to align the degraded input with the generated preference of the diffusion model, and employs recurrent refinement for posterior sampling. Extensive experiments demonstrate that our method outperforms state-of-the-art methods, validating its effectiveness and robustness. Our code and data are available at https://github.com/AMAP-ML/LD-RPS.

URLs: https://github.com/AMAP-ML/LD-RPS.

replace-cross Noisy PDE Training Requires Bigger PINNs

Authors: Sebastien Andre-Sloan, Anirbit Mukherjee, Matthew Colbrook

Abstract: Physics-Informed Neural Networks (PINNs) are increasingly used to approximate solutions of partial differential equations (PDEs), particularly in high dimensions. In real-world settings, data are often noisy, making it crucial to understand when a predictor can still achieve low empirical risk. Yet, little is known about the conditions under which a PINN can do so effectively. We analyse PINNs applied to the Hamilton--Jacobi--Bellman (HJB) PDE and establish a lower bound on the network size required for the supervised PINN empirical risk to fall below the variance of noisy supervision labels. Specifically, if a predictor achieves empirical risk $O(\eta)$ below $\sigma^2$ (the variance of the supervision data), then necessarily $d_N\log d_N\gtrsim N_s \eta^2$, where $N_s$ is the number of samples and $d_N$ the number of trainable parameters. A similar constraint holds in the fully unsupervised PINN setting when boundary labels are noisy. Thus, simply increasing the number of noisy supervision labels does not offer a ``free lunch'' in reducing empirical risk. We also give empirical studies on the HJB PDE, the Poisson PDE and the the Navier-Stokes PDE set to produce the Taylor-Green solutions. In these experiments we demonstrate that PINNs indeed need to be beyond a threshold model size for them to train to errors below $\sigma^2$. These results provide a quantitative foundation for understanding parameter requirements when training PINNs in the presence of noisy data.

replace-cross Unified Medical Image Segmentation with State Space Modeling Snake

Authors: Ruicheng Zhang, Haowei Guo, Kanghui Tian, Jun Zhou, Mingliang Yan, Zeyu Zhang, Shen Zhao

Abstract: Unified Medical Image Segmentation (UMIS) is critical for comprehensive anatomical assessment but faces challenges due to multi-scale structural heterogeneity. Conventional pixel-based approaches, lacking object-level anatomical insight and inter-organ relational modeling, struggle with morphological complexity and feature conflicts, limiting their efficacy in UMIS. We propose Mamba Snake, a novel deep snake framework enhanced by state space modeling for UMIS. Mamba Snake frames multi-contour evolution as a hierarchical state space atlas, effectively modeling macroscopic inter-organ topological relationships and microscopic contour refinements. We introduce a snake-specific vision state space module, the Mamba Evolution Block (MEB), which leverages effective spatiotemporal information aggregation for adaptive refinement of complex morphologies. Energy map shape priors further ensure robust long-range contour evolution in heterogeneous data. Additionally, a dual-classification synergy mechanism is incorporated to concurrently optimize detection and segmentation, mitigating under-segmentation of microstructures in UMIS. Extensive evaluations across five clinical datasets reveal Mamba Snake's superior performance, with an average Dice improvement of 3\% over state-of-the-art methods.

replace-cross Post-Disaster Affected Area Segmentation with a Vision Transformer (ViT)-based EVAP Model using Sentinel-2 and Formosat-5 Imagery

Authors: Yi-Shan Chu, Hsuan-Cheng Wei

Abstract: We propose a vision transformer (ViT)-based deep learning framework to refine disaster-affected area segmentation from remote sensing imagery, aiming to support and enhance the Emergent Value Added Product (EVAP) developed by the Taiwan Space Agency (TASA). The process starts with a small set of manually annotated regions. We then apply principal component analysis (PCA)-based feature space analysis and construct a confidence index (CI) to expand these labels, producing a weakly supervised training set. These expanded labels are then used to train ViT-based encoder-decoder models with multi-band inputs from Sentinel-2 and Formosat-5 imagery. Our architecture supports multiple decoder variants and multi-stage loss strategies to improve performance under limited supervision. During the evaluation, model predictions are compared with higher-resolution EVAP output to assess spatial coherence and segmentation consistency. Case studies on the 2022 Poyang Lake drought and the 2023 Rhodes wildfire demonstrate that our framework improves the smoothness and reliability of segmentation results, offering a scalable approach for disaster mapping when accurate ground truth is unavailable.

replace-cross Flow Matching Meets Biology and Life Science: A Survey

Authors: Zihao Li, Zhichen Zeng, Xiao Lin, Feihao Fang, Yanru Qu, Zhe Xu, Zhining Liu, Xuying Ning, Tianxin Wei, Ge Liu, Hanghang Tong, Jingrui He

Abstract: Over the past decade, advances in generative modeling, such as generative adversarial networks, masked autoencoders, and diffusion models, have significantly transformed biological research and discovery, enabling breakthroughs in molecule design, protein generation, catalysis discovery, drug discovery, and beyond. At the same time, biological applications have served as valuable testbeds for evaluating the capabilities of generative models. Recently, flow matching has emerged as a powerful and efficient alternative to diffusion-based generative modeling, with growing interest in its application to problems in biology and life sciences. This paper presents the first comprehensive survey of recent developments in flow matching and its applications in biological domains. We begin by systematically reviewing the foundations and variants of flow matching, and then categorize its applications into three major areas: biological sequence modeling, molecule generation and design, and peptide and protein generation. For each, we provide an in-depth review of recent progress. We also summarize commonly used datasets and software tools, and conclude with a discussion of potential future directions. The corresponding curated resources are available at https://github.com/Violet24K/Awesome-Flow-Matching-Meets-Biology.

URLs: https://github.com/Violet24K/Awesome-Flow-Matching-Meets-Biology.

replace-cross Goal Alignment in LLM-Based User Simulators for Conversational AI

Authors: Shuhaib Mehri, Xiaocheng Yang, Takyoung Kim, Gokhan Tur, Shikib Mehri, Dilek Hakkani-T\"ur

Abstract: User simulators are essential to conversational AI, enabling scalable agent development and evaluation through simulated interactions. While current Large Language Models (LLMs) have advanced user simulation capabilities, we reveal that they struggle to consistently demonstrate goal-oriented behavior across multi-turn conversations--a critical limitation that compromises their reliability in downstream applications. We introduce User Goal State Tracking (UGST), a novel framework that tracks user goal progression throughout conversations. Leveraging UGST, we present a three-stage methodology for developing user simulators that can autonomously track goal progression and reason to generate goal-aligned responses. Moreover, we establish comprehensive evaluation metrics for measuring goal alignment in user simulators, and demonstrate that our approach yields substantial improvements across two benchmarks (MultiWOZ 2.4 and {\tau}-Bench). Our contributions address a critical gap in conversational AI and establish UGST as an essential framework for developing goal-aligned user simulators.

replace-cross CauKer: Classification Time Series Foundation Models Can Be Pretrained on Synthetic Data

Authors: Shifeng Xie, Vasilii Feofanov, Ambroise Odonnat, Lei Zan, Marius Alonso, Jianfeng Zhang, Themis Palpanas, Lujia Pan, Keli Zhang, Ievgen Redko

Abstract: Time series foundation models (TSFMs) have recently gained significant attention due to their strong zero-shot capabilities and widespread real-world applications. Such models typically require a computationally costly pre-training on large-scale, carefully curated collections of real-world sequences. To allow for a sample-efficient pre-training of TSFMs, we propose \textsc{CauKer}, a novel algorithm designed to generate diverse, causally coherent synthetic time series with realistic trends, seasonality, and nonlinear interactions. \textsc{CauKer} combines Gaussian Process (GP) kernel composition with Structural Causal Models (SCM) to produce data for sample-efficient pre-training of state-of-the-art classification TSFMs having different architectures and following different pre-training approaches. Additionally, our experiments reveal that \textsc{CauKer}-generated datasets exhibit clear scaling laws for both dataset size (10K to 10M samples) and model capacity (1M to 783M parameters), unlike real-world datasets, which display irregular scaling behavior. The source code is publicly available at https://github.com/ShifengXIE/CauKer.

URLs: https://github.com/ShifengXIE/CauKer.

replace-cross GraphProp: Training the Graph Foundation Models using Graph Properties

Authors: Ziheng Sun, Qi Feng, Lehao Lin, Chris Ding, Jicong Fan

Abstract: This work focuses on training graph foundation models (GFMs) that have strong generalization ability in graph-level tasks such as graph classification. Effective GFM training requires capturing information consistent across different domains. We discover that graph structures provide more consistent cross-domain information compared to node features and graph labels. However, traditional GFMs primarily focus on transferring node features from various domains into a unified representation space but often lack structural cross-domain generalization. To address this, we introduce GraphProp, which emphasizes structural generalization. The training process of GraphProp consists of two main phases. First, we train a structural GFM by predicting graph invariants. Since graph invariants are properties of graphs that depend only on the abstract structure, not on particular labellings or drawings of the graph, this structural GFM has a strong ability to capture the abstract structural information and provide discriminative graph representations comparable across diverse domains. In the second phase, we use the representations given by the structural GFM as positional encodings to train a comprehensive GFM. This phase utilizes domain-specific node attributes and graph labels to further improve cross-domain node feature generalization. Our experiments demonstrate that GraphProp significantly outperforms the competitors in supervised learning and few-shot learning, especially in handling graphs without node attributes.

replace-cross Video-EM: Event-Centric Episodic Memory for Long-Form Video Understanding

Authors: Yun Wang, Long Zhang, Jingren Liu, Jiaqi Yan, Zhanjie Zhang, Jiahao Zheng, Ao Ma, Run Ling, Xun Yang, Dapeng Wu, Xiangyu Chen, Xuelong Li

Abstract: Video Large Language Models (Video-LLMs) have shown strong video understanding, yet their application to long-form videos remains constrained by limited context windows. A common workaround is to compress long videos into a handful of representative frames via retrieval or summarization. However, most existing pipelines score frames in isolation, implicitly assuming that frame-level saliency is sufficient for downstream reasoning. This often yields redundant selections, fragmented temporal evidence, and weakened narrative grounding for long-form video question answering. We present \textbf{Video-EM}, a training-free, event-centric episodic memory framework that reframes long-form VideoQA as \emph{episodic event construction} followed by \emph{memory refinement}. Instead of treating retrieved keyframes as independent visuals, Video-EM employs an LLM as an active memory agent to orchestrate off-the-shelf tools: it first localizes query-relevant moments via multi-grained semantic matching, then groups and segments them into temporally coherent events, and finally encodes each event as a grounded episodic memory with explicit temporal indices and spatio-temporal cues (capturing \emph{when}, \emph{where}, \emph{what}, and involved entities). To further suppress verbosity and noise from imperfect upstream signals, Video-EM integrates a reasoning-driven self-reflection loop that iteratively verifies evidence sufficiency and cross-event consistency, removes redundancy, and adaptively adjusts event granularity. The outcome is a compact yet reliable \emph{event timeline} -- a minimal but sufficient episodic memory set that can be directly consumed by existing Video-LLMs without additional training or architectural changes.

replace-cross ECHO: Frequency-aware Hierarchical Encoding for Variable-length Signals

Authors: Yucong Zhang, Juan Liu, Ming Li

Abstract: Pre-trained foundation models have demonstrated remarkable success in audio, vision and language, yet their potential for general machine signal modeling with arbitrary sampling rates-covering acoustic, vibration, and other industrial sensor data-remains under-explored. In this work, we propose a novel foundation model ECHO that integrates an advanced band-split architecture with frequency positional embeddings, enabling spectral localization across arbitrary sampling configurations. Moreover, the model incorporates sliding patches to support inputs of variable length without padding or cropping, producing a concise embedding that retains both temporal and spectral fidelity and naturally extends to streaming scenarios. We evaluate our method on various kinds of machine signal datasets, including previous DCASE task 2 challenges (2020-2025), and widely-used industrial signal corpora. Experimental results demonstrate consistent state-of-the-art performance in machine signal anomaly detection and fault classification, confirming the effectiveness and generalization capability of the proposed model. We open-sourced ECHO on https://github.com/yucongzh/ECHO.

URLs: https://github.com/yucongzh/ECHO.

replace-cross Entropy-Driven Curriculum for Multi-Task Training in Human Mobility Prediction

Authors: Tianye Fang, Xuanshu Luo, Martin Werner

Abstract: The increasing availability of big mobility data from ubiquitous portable devices enables human mobility prediction through deep learning approaches. However, the diverse complexity of human mobility data impedes model training, leading to inefficient gradient updates and potential underfitting. Meanwhile, exclusively predicting next locations neglects implicit determinants, including distances and directions, thereby yielding suboptimal prediction results. This paper presents a unified training framework that integrates entropy-driven curriculum and multi-task learning to address these challenges. The proposed entropy-driven curriculum learning strategy quantifies trajectory predictability based on Lempel-Ziv compression and organizes training from simple to complex for faster convergence and enhanced performance. The multi-task training simultaneously optimizes the primary location prediction alongside auxiliary estimation of movement distance and direction for learning realistic mobility patterns, and improve prediction accuracy through complementary supervision signals. Extensive experiments conducted in accordance with the HuMob Challenge demonstrate that our approach achieves state-of-the-art performance on GEO-BLEU (0.354) and DTW (26.15) metrics with up to 2.92-fold convergence speed compared to training without curriculum learning.

replace-cross OTESGN: Optimal Transport-Enhanced Syntactic-Semantic Graph Networks for Aspect-Based Sentiment Analysis

Authors: Xinfeng Liao, Xuanqi Chen, Lianxi Wang, Jiahuan Yang, Zhuowei Chen, Ziying Rong

Abstract: Aspect-based sentiment analysis (ABSA) aims to identify aspect terms and determine their sentiment polarity. While dependency trees combined with contextual semantics provide structural cues, existing approaches often rely on dot-product similarity and fixed graphs, which limit their ability to capture nonlinear associations and adapt to noisy contexts. To address these limitations, we propose the Optimal Transport-Enhanced Syntactic-Semantic Graph Network (OTESGN), a model that jointly integrates structural and distributional signals. Specifically, a Syntactic Graph-Aware Attention module models global dependencies with syntax-guided masking, while a Semantic Optimal Transport Attention module formulates aspect-opinion association as a distribution matching problem solved via the Sinkhorn algorithm. An Adaptive Attention Fusion mechanism balances heterogeneous features, and contrastive regularization enhances robustness. Extensive experiments on three benchmark datasets (Rest14, Laptop14, and Twitter) demonstrate that OTESGN delivers state-of-the-art performance. Notably, it surpasses competitive baselines by up to +1.30 Macro-F1 on Laptop14 and +1.01 on Twitter. Ablation studies and visualization analyses further highlight OTESGN's ability to capture fine-grained sentiment associations and suppress noise from irrelevant context.

replace-cross M4Diffuser: Multi-View Diffusion Policy with Manipulability-Aware Control for Robust Mobile Manipulation

Authors: Ju Dong, Lei Zhang, Liding Zhang, Yao Ling, Yu Fu, Kaixin Bai, Zolt\'an-Csaba M\'arton, Zhenshan Bing, Zhaopeng Chen, Alois Christian Knoll, Jianwei Zhang

Abstract: Mobile manipulation requires the coordinated control of a mobile base and a robotic arm while simultaneously perceiving both global scene context and fine-grained object details. Existing single-view approaches often fail in unstructured environments due to limited fields of view, exploration, and generalization abilities. Moreover, classical controllers, although stable, struggle with efficiency and manipulability near singularities. To address these challenges, we propose M4Diffuser, a hybrid framework that integrates a Multi-View Diffusion Policy with a novel Reduced and Manipulability-aware QP (ReM-QP) controller for mobile manipulation. The diffusion policy leverages proprioceptive states and complementary camera perspectives with both close-range object details and global scene context to generate task-relevant end-effector goals in the world frame. These high-level goals are then executed by the ReM-QP controller, which eliminates slack variables for computational efficiency and incorporates manipulability-aware preferences for robustness near singularities. Comprehensive experiments in simulation and real-world environments show that M4Diffuser achieves 7 to 56 percent higher success rates and reduces collisions by 3 to 31 percent over baselines. Our approach demonstrates robust performance for smooth whole-body coordination, and strong generalization to unseen tasks, paving the way for reliable mobile manipulation in unstructured environments. Details of the demo and supplemental material are available on our project website https://sites.google.com/view/m4diffuser.

URLs: https://sites.google.com/view/m4diffuser.

replace-cross Compose by Focus: Scene Graph-based Atomic Skills

Authors: Han Qi, Changhe Chen, Heng Yang

Abstract: A key requirement for generalist robots is compositional generalization - the ability to combine atomic skills to solve complex, long-horizon tasks. While prior work has primarily focused on synthesizing a planner that sequences pre-learned skills, robust execution of the individual skills themselves remains challenging, as visuomotor policies often fail under distribution shifts induced by scene composition. To address this, we introduce a scene graph-based representation that focuses on task-relevant objects and relations, thereby mitigating sensitivity to irrelevant variation. Building on this idea, we develop a scene-graph skill learning framework that integrates graph neural networks with diffusion-based imitation learning, and further combine "focused" scene-graph skills with a vision-language model (VLM) based task planner. Experiments in both simulation and real-world manipulation tasks demonstrate substantially higher success rates than state-of-the-art baselines, highlighting improved robustness and compositional generalization in long-horizon tasks.

replace-cross Efficient Construction of Implicit Surface Models From a Single Image for Motion Generation

Authors: Wei-Teng Chu, Tianyi Zhang, Matthew Johnson-Roberson, Weiming Zhi

Abstract: Implicit representations have been widely applied in robotics for obstacle avoidance and path planning. In this paper, we explore the problem of constructing an implicit distance representation from a single image. Past methods for implicit surface reconstruction, such as NeuS and its variants generally require a large set of multi-view images as input, and require long training times. In this work, we propose Fast Image-to-Neural Surface (FINS), a lightweight framework that can reconstruct high-fidelity surfaces and SDF fields based on a single or a small set of images. FINS integrates a multi-resolution hash grid encoder with lightweight geometry and color heads, making the training via an approximate second-order optimizer highly efficient and capable of converging within a few seconds. Additionally, we achieve the construction of a neural surface requiring only a single RGB image, by leveraging pre-trained foundation models to estimate the geometry inherent in the image. Our experiments demonstrate that under the same conditions, our method outperforms state-of-the-art baselines in both convergence speed and accuracy on surface reconstruction and SDF field estimation. Moreover, we demonstrate the applicability of FINS for robot surface following tasks and show its scalability to a variety of benchmark datasets.

replace-cross Generative Evolutionary Meta-Solver (GEMS): Scalable Surrogate-Free Multi-Agent Reinforcement Learning

Authors: Alakh Sharma, Gaurish Trivedi, Kartikey Singh Bhandari, Yash Sinha, Dhruv Kumar, Pratik Narang, Jagat Sesh Challa

Abstract: Scalable multi-agent reinforcement learning (MARL) remains a central challenge for AI. Existing population-based methods, like Policy-Space Response Oracles, PSRO, require storing explicit policy populations and constructing full payoff matrices, incurring quadratic computation and linear memory costs. We present Generative Evolutionary Meta-Solver (GEMS), a surrogate-free framework that replaces explicit populations with a compact set of latent anchors and a single amortized generator. Instead of exhaustively constructing the payoff matrix, GEMS relies on unbiased Monte Carlo rollouts, multiplicative-weights meta-dynamics, and a model-free empirical-Bernstein UCB oracle to adaptively expand the policy set. Best responses are trained within the generator using an advantage-based trust-region objective, eliminating the need to store and train separate actors. We evaluated GEMS in a variety of Two-player and Multi-Player games such as the Deceptive Messages Game, Kuhn Poker and Multi-Particle environment. We find that GEMS is up to ~$\mathbf{6\times}$ faster, has $\mathbf{1.3\times}$ less memory usage than PSRO, while also reaps higher rewards simultaneously. These results demonstrate that GEMS retains the game theoretic guarantees of PSRO, while overcoming its fundamental inefficiencies, hence enabling scalable multi-agent learning in multiple domains.

replace-cross Cold-Start Active Correlation Clustering

Authors: Linus Aronsson, Han Wu, Morteza Haghir Chehreghani

Abstract: We study active correlation clustering where pairwise similarities are not provided upfront and must be queried in a cost-efficient manner through active learning. Specifically, we focus on the cold-start scenario, where no true initial pairwise similarities are available for active learning. To address this challenge, we propose a coverage-aware method that encourages diversity early in the process. We demonstrate the effectiveness of our approach through several synthetic and real-world experiments.

replace-cross CroSTAta: Cross-State Transition Attention Transformer for Robotic Manipulation

Authors: Giovanni Minelli, Giulio Turrisi, Victor Barasuol, Claudio Semini

Abstract: Learning robotic manipulation policies through supervised learning from demonstrations remains challenging when policies encounter execution variations not explicitly covered during training. While incorporating historical context through attention mechanisms can improve robustness, standard approaches process all past states in a sequence without explicitly modeling the temporal structure that demonstrations may include, such as failure and recovery patterns. We propose a Cross-State Transition Attention Transformer that employs a novel State Transition Attention (STA) mechanism to modulate standard attention weights based on learned state evolution patterns, enabling policies to better adapt their behavior based on execution history. Our approach combines this structured attention with temporal masking during training, where visual information is randomly removed from recent timesteps to encourage temporal reasoning from historical context. Evaluation in simulation shows that STA consistently outperforms standard attention approach and temporal modeling methods like TCN and LSTM networks, achieving more than 2x improvement over cross-attention on precision-critical tasks. The source code and data can be accessed at https://github.com/iit-DLSLab/croSTAta

URLs: https://github.com/iit-DLSLab/croSTAta

replace-cross FOR-Prompting: From Objection to Revision via an Asymmetric Prompting Protocol

Authors: He Zhang, Anzhou Zhang, Jian Dai

Abstract: Reasoning protocols such as Chain of Thought (CoT) and Tree of Thought (ToT) organize internal deliberation but lack an explicit mechanism for external questioning that elicits self-revision. We present FOR-Prompting (From Objection to Revision Prompting), an asymmetric protocol where a Defender proposes an answer, an Debater (Questioner) raises question-style objections with no direct fixes, and a Host optionally synthesizes the final output. Across GSM8K, FOR-Prompting matches the accuracy of CoT and consistently improves over single-prompting when evaluated under identical model backbones. On small-scale open-source models (e.g., LLaMA-3.2-1B), FOR-Prompting yields substantial gains over direct prompting and performs comparably to lightweight reasoning baselines, highlighting its promise for low-resource and on-device settings. Cross-model role-swapping further shows that performance is primarily determined by the Defender, enabling small models to act effectively as Questioners. Beyond structured math tasks, FOR-Prompting supports refinement in open-ended and multi-stage tasks: qualitative analysis shows improved exploration, coverage, and specificity, and a blind study of human preferences found that participants preferred FOR-Prompting outputs over strong LLM baselines in an itinerary-planning scenario. The protocol is model-agnostic and operates purely through role-structured prompting, requiring no training, access to model internals, or symmetrically strong agents. FOR-Prompting therefore enables scalable study of objection-driven reasoning and offers a practical mechanism for automated iterative refinement across both hosted and local LLMs.

replace-cross Tree-based Dialogue Reinforced Policy Optimization for Red-Teaming Attacks

Authors: Ruohao Guo, Afshin Oroojlooy, Roshan Sridhar, Miguel Ballesteros, Alan Ritter, Dan Roth

Abstract: Despite recent rapid progress in AI safety, current large language models remain vulnerable to adversarial attacks in multi-turn interaction settings, where attackers strategically adapt their prompts across conversation turns and pose a more critical yet realistic challenge. Existing approaches that discover safety vulnerabilities either rely on manual red-teaming with human experts or employ automated methods using pre-defined templates and human-curated attack data, with most focusing on single-turn attacks. However, these methods did not explore the vast space of possible multi-turn attacks, failing to consider novel attack trajectories that emerge from complex dialogue dynamics and strategic conversation planning. This gap is particularly critical given recent findings that LLMs exhibit significantly higher vulnerability to multi-turn attacks compared to single-turn attacks. We propose DialTree, an on-policy reinforcement learning framework integrated with tree search that autonomously discovers diverse multi-turn attack strategies by treating the dialogue as a sequential decision-making problem, enabling systematic exploration without manually curated data. Through extensive experiments, our approach not only achieves more than 44.2% higher ASR across 12 target models compared to previous state-of-the-art approaches, but also effectively uncovers new attack strategies by learning optimal dialogue policies that maximize attack success across multiple turns.

replace-cross Wasserstein Gradient Flows for Scalable and Regularized Barycenter Computation

Authors: Eduardo Fernandes Montesuma, Yassir Bendou, Mike Gartrell

Abstract: Wasserstein barycenters provide a principled approach for aggregating probability measures, while preserving the geometry of their ambient space. Existing discrete methods are not scalable as they assume access to the complete set of samples from the input measures. Meanwhile, neural network approaches do scale well, but rely on complex optimization problems and cannot easily incorporate label information. We address these limitations through gradient flows in the space of probability measures. Through time discretization, we achieve a scalable algorithm that i) relies on mini-batch optimal transport, ii) accepts modular regularization through task-aware functions, and iii) seamlessly integrates supervised information into the ground-cost. We empirically validate our approach on domain adaptation benchmarks that span computer vision, neuroscience, and chemical engineering. Our method establishes a new state-of-the-art barycenter solver, with labeled barycenters consistently outperforming unlabeled ones.

replace-cross Tiny but Mighty: A Software-Hardware Co-Design Approach for Efficient Multimodal Inference on Battery-Powered Small Devices

Authors: Yilong Li, Shuai Zhang, Yijing Zeng, Hao Zhang, Xinmiao Xiong, Jingyu Liu, Pan Hu, Suman Banerjee

Abstract: Large Multimodal Models (LMMs) are inherently modular, consisting of vision and audio encoders, projectors, and large language models. Yet, they are almost always executed monolithically, which underutilizes the heterogeneous accelerators (NPUs, GPUs, DSPs) in modern SoCs and leads to high end-to-end latency. In this paper, we present NANOMIND, a hardware--software co-design inference framework for Large Multimodal Models (LMMs) that breaks large models into modular ``bricks'' (vision, language, audio, etc.) and maps each to its ideal accelerator. The key insight is that large models can be broken into modular components and scheduled to run on the most appropriate compute units. It performs module-level dynamic offloading across accelerators on unified-memory SoCs. By combining customized hardware design, system-level scheduling, and optimized low-bit computation kernels, we demonstrate our framework with a compact, battery-powered device capable of running LMMs entirely on device. This prototype functions as a self-contained intelligent assistant that requires no network connectivity, while achieving higher throughput and superior power efficiency under strict resource constraints. The design further bypasses CPU bottlenecks and reduces redundant memory usage through token-aware buffer management and module-level coordination. Our system outperforms existing implementations in resource efficiency, cutting energy consumption by 42.3\% and GPU memory usage by 11.2\%. This enables a battery-powered device to run LLaVA-OneVision with a camera for nearly 20.8 hours.

replace-cross Membership Inference Attacks on Tokenizers of Large Language Models

Authors: Meng Tong, Yuntao Du, Kejiang Chen, Weiming Zhang, Ninghui Li

Abstract: Membership inference attacks (MIAs) are widely used to assess the privacy risks associated with machine learning models. However, when these attacks are applied to pre-trained large language models (LLMs), they encounter significant challenges, including mislabeled samples, distribution shifts, and discrepancies in model size between experimental and real-world settings. To address these limitations, we introduce tokenizers as a new attack vector for membership inference. Specifically, a tokenizer converts raw text into tokens for LLMs. Unlike full models, tokenizers can be efficiently trained from scratch, thereby avoiding the aforementioned challenges. In addition, the tokenizer's training data is typically representative of the data used to pre-train LLMs. Despite these advantages, the potential of tokenizers as an attack vector remains unexplored. To this end, we present the first study on membership leakage through tokenizers and explore five attack methods to infer dataset membership. Extensive experiments on millions of Internet samples reveal the vulnerabilities in the tokenizers of state-of-the-art LLMs. To mitigate this emerging risk, we further propose an adaptive defense. Our findings highlight tokenizers as an overlooked yet critical privacy threat, underscoring the urgent need for privacy-preserving mechanisms specifically designed for them.

replace-cross DropVLA: An Action-Level Backdoor Attack on Vision-Language-Action Models

Authors: Zonghuan Xu, Jiayu Li, Yunhan Zhao, Xiang Zheng, Xingjun Ma, Yu-Gang Jiang

Abstract: Vision-Language-Action (VLA) models map multimodal perception and language instructions to executable robot actions, making them particularly vulnerable to behavioral backdoor manipulation: a hidden trigger introduced during training can induce unintended physical actions while nominal task performance remains intact. Prior work on VLA backdoors primarily studies untargeted attacks or task-level hijacking, leaving fine-grained control over individual actions largely unexplored. In this work, we present DropVLA, an action-level backdoor attack that forces a reusable action primitive (e.g., open_gripper) to execute at attacker-chosen decision points under a realistic pipeline-black-box setting with limited data-poisoning access, using a window-consistent relabeling scheme for chunked fine-tuning. On OpenVLA-7B evaluated with LIBERO, vision-only poisoning achieves 98.67%-99.83% attack success rate (ASR) with only 0.31% poisoned episodes while preserving 98.50%-99.17% clean-task retention, and successfully triggers the targeted action within 25 control steps at 500 Hz (0.05 s). Text-only triggers are unstable at low poisoning budgets, and combining text with vision provides no consistent ASR improvement over vision-only attacks. The backdoor remains robust to moderate trigger variations and transfers across evaluation suites (96.27%, 99.09%), whereas text-only largely fails (0.72%). We further validate physical-world feasibility on a 7-DoF Franka arm with pi0-fast, demonstrating non-trivial attack efficacy under camera-relative motion that induces image-plane trigger drift. These results reveal that VLA models can be covertly steered at the granularity of safety-critical actions with minimal poisoning and without observable degradation of nominal performance.

replace-cross Ego-Vision World Model for Humanoid Contact Planning

Authors: Hang Liu, Yuman Gao, Sangli Teng, Yufeng Chi, Yakun Sophia Shao, Zhongyu Li, Maani Ghaffari, Koushil Sreenath

Abstract: Enabling humanoid robots to exploit physical contact, rather than simply avoid collisions, is crucial for autonomy in unstructured environments. Traditional optimization-based planners struggle with contact complexity, while on-policy reinforcement learning (RL) is sample-inefficient and has limited multi-task ability. We propose a framework combining a learned world model with sampling-based Model Predictive Control (MPC), trained on a demonstration-free offline dataset to predict future outcomes in a compressed latent space. To address sparse contact rewards and sensor noise, the MPC uses a learned surrogate value function for dense, robust planning. Our single, scalable model supports contact-aware tasks, including wall support after perturbation, blocking incoming objects, and traversing height-limited arches, with improved sample efficiency and multi-task capability over on-policy RL. Deployed on a physical humanoid, our system achieves robust, real-time contact planning from proprioception and ego-centric depth images. Code and dataset are available at our website: https://ego-vcp.github.io/

URLs: https://ego-vcp.github.io/

replace-cross Bee: A High-Quality Corpus and Full-Stack Suite to Unlock Advanced Fully Open MLLMs

Authors: Yi Zhang, Bolin Ni, Xin-Sheng Chen, Heng-Rui Zhang, Yongming Rao, Houwen Peng, Qinglin Lu, Han Hu, Meng-Hao Guo, Shi-Min Hu

Abstract: Fully open multimodal large language models (MLLMs) currently lag behind proprietary counterparts, primarily due to a significant gap in data quality for supervised fine-tuning (SFT). Existing open-source datasets are often plagued by widespread noise and a critical deficit in complex reasoning data, such as Chain-of-Thought (CoT), which hinders the development of advanced model capabilities. Addressing these challenges, our work makes three primary contributions. First, we introduce Honey-Data-15M, a new SFT dataset comprising approximately 15 million QA pairs, processed through multiple cleaning techniques and enhanced with a novel dual-level (short and long) CoT enrichment strategy. Second, we introduce HoneyPipe, the data curation pipeline, and its underlying framework DataStudio, providing the community with a transparent and adaptable methodology for data curation that moves beyond static dataset releases. Finally, to validate our dataset and pipeline, we train Bee-8B, an 8B model on Honey-Data-15M. Experiments show that Bee-8B establishes a new state-of-the-art (SOTA) for fully open MLLMs, achieving performance that is competitive with, and in some cases surpasses, recent semi-open models such as InternVL3.5-8B. Our work delivers to the community a suite of foundational resources, including: the Honey-Data-15M corpus; the full-stack suite comprising HoneyPipe and DataStudio; training recipes; an evaluation harness; and the model weights. This effort demonstrates that a principled focus on data quality is a key pathway to developing fully open MLLMs that are highly competitive with their semi-open counterparts.

replace-cross The Ends Justify the Thoughts: RL-Induced Motivated Reasoning in LLM CoTs

Authors: Nikolaus Howe, Micah Carroll

Abstract: Chain-of-Thought (CoT) monitoring has emerged as a compelling method for detecting harmful behaviors such as reward hacking for reasoning models, under the assumption that models' reasoning processes are informative of such behaviors. In practice, LLM training often produces unintended behaviors due to imperfect reward signals, leading models to develop misaligned tendencies. A common corrective approach is to apply post-hoc instructions to avoid problematic behaviors, but what happens to the model's reasoning process when these instructions conflict with learned behaviors? We investigate this question in simple settings and find that models engage in systematic motivated reasoning -- generating plausible-sounding justifications for violating their instructions while downplaying potential harms or contradictions. Concerningly, we find that as motivated reasoning becomes more prevalent over the course of training, an 8B-parameter CoT monitor is increasingly fooled by the motivated reasoning, being persuaded to judge the answer as following the constitution, despite correctly identifying the answer as contradicting the constitution when not provided with the model's reasoning trace. While we find that large frontier reasoning models closely track human ability in detecting motivated reasoning, this should not give us too much solace, as frontier model developers rely on smaller models for monitoring due to their low latency and deployment costs. Our results underscore the necessity for further research into the emergence and detection of motivated reasoning in model evaluation and oversight. Code for this paper is available at https://github.com/nikihowe/motivated-reasoning. WARNING: some examples in this paper may be upsetting.

URLs: https://github.com/nikihowe/motivated-reasoning.

replace-cross Explainable Heterogeneous Anomaly Detection in Financial Networks via Adaptive Expert Routing

Authors: Zan Li, Rui Fan

Abstract: Financial anomalies arise from heterogeneous mechanisms -- price shocks, liquidity freezes, contagion cascades, and momentum reversals -- yet existing detectors produce uniform scores without revealing which mechanism is failing. This hinders targeted responses: liquidity freezes call for market-making support, whereas price shocks call for circuit breakers. Three key challenges remain: (1) static graphs cannot adapt when correlations shift across regimes; (2) uniform detectors overlook heterogeneous anomaly signatures; and (3) black-box scores provide no actionable guidance on driving mechanisms. We address these challenges with an adaptive graph learning framework that embeds interpretability architecturally rather than post hoc. The framework constructs stress-modulated graphs that adaptively interpolate between known sector and geographic relationships and data-driven correlations as market conditions evolve. Anomalies are decomposed via four mechanism-specific experts -- Price-Shock, Liquidity, Systemic-Contagion, and Momentum-Reversal -- each capturing a distinct anomaly channel documented in the financial economics literature. The resulting routing weights serve as interpretable proxies for mechanism attribution, with their relative values indicating each anomaly's primary driving mechanism. A hierarchical Market Pressure Index aggregates entity-level anomaly scores into graduated market-wide alerts. On 100 U.S. equities (2017-2024), the framework detects all six major stress events with a 3.7-day mean lead time, outperforming baselines by +33 percentage points, with AUC 0.888 and AP 0.626. Case studies on SVB (March 2023) and Japan carry-trade unwind (August 2024) demonstrate that routing weights automatically distinguish localized from systemic crises without labeled supervision.

replace-cross Taming Modality Entanglement in Continual Audio-Visual Segmentation

Authors: Yuyang Hong, Qi Yang, Tao Zhang, Zili Wang, Zhaojin Fu, Kun Ding, Bin Fan, Shiming Xiang

Abstract: Recently, significant progress has been made in multi-modal continual learning, aiming to learn new tasks sequentially in multi-modal settings while preserving performance on previously learned ones. However, existing methods mainly focus on coarse-grained tasks, with limitations in addressing modality entanglement in fine-grained continual learning settings. To bridge this gap, we introduce a novel Continual Audio-Visual Segmentation (CAVS) task, aiming to continuously segment new classes guided by audio. Through comprehensive analysis, two critical challenges are identified: 1) multi-modal semantic drift, where a sounding objects is labeled as background in sequential tasks; 2) co-occurrence confusion, where frequent co-occurring classes tend to be confused. In this work, a Collision-based Multi-modal Rehearsal (CMR) framework is designed to address these challenges. Specifically, for multi-modal semantic drift, a Multi-modal Sample Selection (MSS) strategy is proposed to select samples with high modal consistency for rehearsal. Meanwhile, for co-occurence confusion, a Collision-based Sample Rehearsal (CSR) mechanism is designed, allowing for the increase of rehearsal sample frequency of those confusable classes during training process. Moreover, we construct three audio-visual incremental scenarios to verify effectiveness of our method. Comprehensive experiments demonstrate that our method significantly outperforms single-modal continual learning methods.

replace-cross Reinforcing Numerical Reasoning in LLMs for Tabular Prediction via Structural Priors

Authors: Pengxiang Cai, Zihao Gao, Wanchen Lian, Jintai Chen

Abstract: Tabular prediction traditionally relies on gradient-boosted decision trees and deep learning models, which excel in specific tasks but lack interpretability and transferability. Reasoning large language models (LLMs) promise cross-task adaptability with transparent reasoning traces, yet their potential for tabular data remains unrealized. To bridge this gap, we propose a reasoning framework centered on Permutation Relative Policy Optimization (PRPO), a reinforcement learning method that encodes column-permutation invariance as a structural prior. By estimating advantages across label-preserving permutations, PRPO transforms sparse rewards into dense signals, activating latent numerical reasoning capabilities of LLMs with limited supervision. Extensive experiments show that our method matches fully supervised baselines and dominates in zero-shot settings, performing on par with 32-shot strong baselines. Remarkably, our 8B model significantly outperforms much larger LLMs, achieving up to a 53.17% improvement over DeepSeek-R1 (685B).

replace-cross Rethinking Driving World Model as Synthetic Data Generator for Perception Tasks

Authors: Kai Zeng, Zhanqian Wu, Kaixin Xiong, Xiaobao Wei, Xiangyu Guo, Zhenxin Zhu, Kalok Ho, Lijun Zhou, Bohan Zeng, Ming Lu, Haiyang Sun, Bing Wang, Guang Chen, Hangjun Ye, Wentao Zhang

Abstract: Recent advancements in driving world models enable controllable generation of high-quality RGB videos or multimodal videos. Existing methods primarily focus on metrics related to generation quality and controllability. However, they often overlook the evaluation of downstream perception tasks, which are $\mathbf{really\ crucial}$ for the performance of autonomous driving. Existing methods usually leverage a training strategy that first pretrains on synthetic data and finetunes on real data, resulting in twice the epochs compared to the baseline (real data only). When we double the epochs in the baseline, the benefit of synthetic data becomes negligible. To thoroughly demonstrate the benefit of synthetic data, we introduce Dream4Drive, a novel synthetic data generation framework designed for enhancing the downstream perception tasks. Dream4Drive first decomposes the input video into several 3D-aware guidance maps and subsequently renders the 3D assets onto these guidance maps. Finally, the driving world model is fine-tuned to produce the edited, multi-view photorealistic videos, which can be used to train the downstream perception models. Dream4Drive enables unprecedented flexibility in generating multi-view corner cases at scale, significantly boosting corner case perception in autonomous driving. To facilitate future research, we also contribute a large-scale 3D asset dataset named DriveObj3D, covering the typical categories in driving scenarios and enabling diverse 3D-aware video editing. We conduct comprehensive experiments to show that Dream4Drive can effectively boost the performance of downstream perception models under various training epochs. Page: https://wm-research.github.io/Dream4Drive/ GitHub Link: https://github.com/wm-research/Dream4Drive

URLs: https://wm-research.github.io/Dream4Drive/, https://github.com/wm-research/Dream4Drive

replace-cross Step2Motion: Locomotion Reconstruction from Pressure Sensing Insoles

Authors: Jose Luis Ponton, Eduardo Alvarado, Lin Geng Foo, Nuria Pelechano, Carlos Andujar, Marc Habermann

Abstract: Human motion is fundamentally driven by continuous physical interaction with the environment. Whether walking, running, or simply standing, the forces exchanged between our feet and the ground provide crucial insights for understanding and reconstructing human movement. Recent advances in wearable insole devices offer a compelling solution for capturing these forces in diverse, real-world scenarios. Sensor insoles pose no constraint on the users' motion (unlike mocap suits) and are unaffected by line-of-sight limitations (in contrast to optical systems). These qualities make sensor insoles an ideal choice for robust, unconstrained motion capture, particularly in outdoor environments. Surprisingly, leveraging these devices with recent motion reconstruction methods remains largely unexplored. Aiming to fill this gap, we present Step2Motion, the first approach to reconstruct human locomotion from multi-modal insole sensors. Our method utilizes pressure and inertial data-accelerations and angular rates-captured by the insoles to reconstruct human motion. We evaluate the effectiveness of our approach across a range of experiments to show its versatility for diverse locomotion styles, from simple ones like walking or jogging up to moving sideways, on tiptoes, slightly crouching, or dancing.

replace-cross CountFormer: A Transformer Framework for Learning Visual Repetition and Structure in Class-Agnostic Object Counting

Authors: Md Tanvir Hossain, Akif Islam, Mohd Ruhul Ameen

Abstract: Humans can often count unfamiliar objects by observing visual repetition and composition, rather than relying only on object categories. However, many exemplar-free counting models struggle in such situations and may overcount when objects contain symmetric components, repeated substructures, or partial occlusion. We introduce CountFormer, a controlled adaptation of a density-regression framework inspired by CounTR, where the image encoder is replaced with the self-supervised vision foundation model DINOv2. The resulting transformer features are combined with explicit two-dimensional positional embeddings and decoded by a lightweight convolutional network to produce a density map whose integral gives the final count. Our goal is not to propose a new counting architecture, but to study whether foundation-based representations improve structural consistency under a strictly exemplar-free setting. On FSC-147, CountFormer achieves competitive performance under the official benchmark (MAE 19.06, RMSE 118.45). Qualitative analysis suggests fewer part-level overcounting errors for some structurally complex objects, while overall error remains broadly consistent with prior approaches. Sensitivity analysis shows that evaluation metrics are strongly affected by a small number of extreme high-density scenes. Overall, the results highlight the role of representation quality in exemplar-free object counting.

replace-cross LagMemo: Language 3D Gaussian Splatting Memory for Multi-modal Open-vocabulary Multi-goal Visual Navigation

Authors: Haotian Zhou, Xiaole Wang, He Li, Zhuo Qi, Jinrun Yin, Haiyu Kong, Jianghuan Xu, Huijing Zhao

Abstract: Navigating to a designated goal using visual information is a fundamental capability for intelligent robots. To address the practical demands of multi-modal, open-vocabulary goal queries and multi-goal visual navigation, we propose LagMemo, a navigation system that leverages a language 3D Gaussian Splatting memory. During a one-time exploration, LagMemo constructs a unified 3D language memory with robust spatial-semantic correlations. With incoming task goals, the system efficiently queries the memory, predicts candidate goal locations, and integrates a local perception-based verification mechanism to dynamically match and validate goals. For fair and rigorous evaluation, we curate GOAT-Core, a high-quality core split distilled from GOAT-Bench. Experimental results show that LagMemo's memory module enables effective multi-modal open-vocabulary localization, and significantly outperforms state-of-the-art methods in multi-goal visual navigation. Project page: https://weekgoodday.github.io/lagmemo

URLs: https://weekgoodday.github.io/lagmemo

replace-cross SwiftEmbed: Ultra-Fast Text Embeddings via Static Token Lookup for Real-Time Applications

Authors: Edouard Lansiaux, Antoine Simonet, Eric Wiel

Abstract: We present SwiftEmbed, a production-oriented serving system for static token embeddings that achieves 1.12\,ms p50 latency for single-text requests while maintaining a 60.6 MTEB average score across 8 representative tasks. Built around the open-source Potion-base-8M distilled model from MinishLab and implemented in Rust, the system delivers 50,000 requests per second through static embedding lookup, mean pooling, and zero-copy IEEE754 binary serialization. Evaluation demonstrates exceptional duplicate detection performance (90.1% AP) and strong semantic similarity (76.1% Spearman correlation). Performance relative to Sentence-BERT is task-dependent: robust for deduplication and similarity workloads (89--100%), substantially lower for classification and complex retrieval tasks (75%). Domain-specific performance ranges from 75% to 131% of a GloVe-840B baseline. The system targets real-time embedding applications where sub-5\,ms latency is operationally critical and where full transformer inference is not feasible.

replace-cross Vectorized Online POMDP Planning

Authors: Marcus Hoerger, Muhammad Sudrajat, Hanna Kurniawati

Abstract: Planning under partial observability is an essential capability of autonomous robots. The Partially Observable Markov Decision Process (POMDP) provides a powerful framework for planning under partial observability problems, capturing the stochastic effects of actions and the limited information available through noisy observations. POMDP solving could benefit tremendously from massive parallelization on today's hardware, but parallelizing POMDP solvers has been challenging. Most solvers rely on interleaving numerical optimization over actions with the estimation of their values, which creates dependencies and synchronization bottlenecks between parallel processes that can offset the benefits of parallelization. In this paper, we propose Vectorized Online POMDP Planner (VOPP), a novel parallel online solver that leverages a recent POMDP formulation which analytically solves part of the optimization component, leaving numerical computations to consist of only estimation of expectations. VOPP represents all data structures related to planning as a collection of tensors, and implements all planning steps as fully vectorized computations over this representation. The result is a massively parallel online solver with no dependencies or synchronization bottlenecks between concurrent processes. Experimental results indicate that VOPP is at least $20\times$ more efficient in computing near-optimal solutions compared to an existing state-of-the-art parallel online solver. Moreover, VOPP outperforms state-of-the-art sequential online solvers, while using a planning budget that is $1000\times$ smaller.

replace-cross Detecting AI-Generated Images via Diffusion Snap-Back Reconstruction: A Forensic Approach

Authors: Mohd Ruhul Ameen, Akif Islam

Abstract: The rapid advancement of generative image models has transformed digital media to the point where AI generated images can no longer be reliably distinguished from authentic photographs by human observers or many conventional detection methods. Modern text to image systems such as Stable Diffusion and DALL E can now generate images so realistic that they often appear completely natural, leaving little to no visible artifacts for traditional deepfake detectors to rely on. This challenge has practical consequences for misinformation control, institutional identity verification, and digital trust in political and legal contexts. Instead of searching for hidden pixel level traces, we take a different approach: we observe how an image responds when it is gently disturbed and reconstructed by a diffusion model. We call this behavior diffusion snap back. By tracking how perceptual similarity measures (LPIPS, SSIM, and PSNR) change across different reconstruction strengths, we capture compact and interpretable signals that reveal how closely an image aligns with the diffusion model's learned denoising behavior. Evaluated on a balanced dataset of 4,000 human and AI generated images, the proposed method achieves an AUROC of 0.993 under stratified five fold cross validation and 0.990 on a holdout split using only logistic regression. Initial robustness tests show that the method remains stable under common real world distortions such as image compression and added noise. Although our experiments were conducted using a single diffusion backbone, the results indicate that reconstruction behavior can serve as a reliable and scalable foundation for synthetic media detection as generative models continue to grow more realistic.

replace-cross Balancing Interpretability and Performance in Motor Imagery EEG Classification: A Comparative Study of ANFIS-FBCSP-PSO and EEGNet

Authors: Farjana Aktar, Mohd Ruhul Ameen, Akif Islam, Md Ekramul Hamid

Abstract: Achieving both accurate and interpretable classification of motor-imagery EEG remains a key challenge in brain-computer interface (BCI) research. In this paper, we compare a transparent fuzzy-reasoning approach (ANFIS-FBCSP-PSO) with a well-known deep-learning benchmark (EEGNet) using the publicly available BCI Competition IV-2a dataset. The ANFIS pipeline combines filter-bank common spatial pattern feature extraction with fuzzy IF-THEN rules optimized via particle-swarm optimization, while EEGNet learns hierarchical spatial-temporal representations directly from raw EEG data. In within-subject experiments, the fuzzy-neural model performed better (68.58% +/- 13.76% accuracy, kappa = 58.04% +/- 18.43), while in cross-subject (LOSO) tests, the deep model exhibited stronger generalization (68.20% +/- 12.13% accuracy, kappa = 57.33% +/- 16.22). The study therefore provides practical guidance for selecting MI-BCI systems according to the design goal: interpretability or robustness across users. Future investigations into transformer-based and hybrid neuro-symbolic frameworks are expected to further advance transparent EEG decoding.

replace-cross Towards Efficient Federated Learning of Networked Mixture-of-Experts for Mobile Edge Computing

Authors: Song Gao, Songyang Zhang, Shusen Jing, Shuai Zhang, Xiangwei Zhou, Yue Wang, Zhipeng Cai

Abstract: Recent advancements in large artificial intelligence models (LAMs) are driving significant innovations in mobile edge computing within next-generation wireless networks. However, the substantial demands for computational resources and larges-cale training data required to train LAMs conflict with the limited storage and computational capacity of edge devices, posing significant challenges to training and deploying LAMs at the edge. In this work, we introduce the Networked Mixture-of-Experts (NMoE) system, in which clients perform inference collaboratively by distributing tasks to suitable neighbors based on their expertise and aggregate the returned results. For training the NMoE, we propose a federated learning framework that integrates both supervised and self-supervised learning to balance personalization and generalization, while preserving communication efficiency and data privacy. We conduct extensive experiments to demonstrate the efficacy of the proposed NMoE system, providing insights for the NMoE training algorithms.

replace-cross FATE: A Formal Benchmark Series for Frontier Algebra of Multiple Difficulty Levels

Authors: Jiedong Jiang, Wanyi He, Yuefeng Wang, Guoxiong Gao, Yongle Hu, Jingting Wang, Nailin Guan, Peihao Wu, Chunbo Dai, Liang Xiao, Bin Dong

Abstract: Recent advances in large language models (LLMs) have demonstrated impressive capabilities in formal theorem proving, particularly on contest-based mathematical benchmarks like the IMO. However, these contests do not reflect the depth, breadth, and abstraction of modern mathematical research. To bridge this gap, we introduce FATE (Formal Algebra Theorem Evaluation), a new benchmark series in formal algebra designed to chart a course toward advanced mathematical reasoning. We present two new components, FATE-H and FATE-X, each with 100 problems in abstract and commutative algebra. The FATE series spans a difficulty spectrum from undergraduate exercises to problems exceeding PhD qualifying exams. Notably, FATE-X is the first formal benchmark to surpass both PhD-level exam difficulty and the coverage of the Mathlib library. Our evaluations of state-of-the-art LLM provers on this new benchmark reveal a stark performance gap compared to contest math: the best model achieves only 3% (pass@64) accuracy on FATE-H and 0% on FATE-X. Our two-stage evaluation reveals that models' natural-language reasoning is notably more accurate than their ability to formalize this reasoning. We systematically classify the common errors that arise during this formalization process. Furthermore, a comparative study shows that a specialized prover can exhibit less effective reflection than general-purpose models, reducing its accuracy at the natural-language stage. We believe FATE provides a robust and challenging benchmark that establishes essential checkpoints on the path toward research-level formal mathematical reasoning.

replace-cross Detecting AI-Generated Images via Contextual Anomaly Estimation in Masked AutoEncoders

Authors: Minsuk Jang, Hyunseo Jeong, Minseok Son, Changick Kim

Abstract: Context-based detection methods such as DetectGPT achieve strong generalization in identifying AI-generated text by evaluating content compatibility with a model's learned distribution. In contrast, existing image detectors rely on discriminative features from pretrained backbones such as CLIP, which implicitly capture generator-specific artifacts. However, as modern generative models rapidly advance in visual fidelity, the artifacts these detectors depend on are becoming increasingly subtle or absent, undermining their reliability. Masked AutoEncoders (MAE) are inherently trained to reconstruct masked patches from visible context, naturally modeling patch-level contextual plausibility akin to conditional probability estimation, while also serving as a powerful semantic feature extractor through its encoder. We propose CINEMAE, a novel architecture that exploits both capabilities of MAE for AI-generated image detection: we derive per-patch anomaly signals from the reconstruction mechanism and extract global semantic features from the encoder, fusing both context-based and feature-based cues for robust detection. CINEMAE achieves highly competitive mean accuracies of 96.63\% on GenImage and 93.96\% on AIGCDetectBenchmark, maintaining over 93\% accuracy even under JPEG compression at QF=50.

replace-cross HatePrototypes: Interpretable and Transferable Representations for Implicit and Explicit Hate Speech Detection

Authors: Irina Proskurina, Marc-Antoine Carpentier, Julien Velcin

Abstract: Optimization of offensive content moderation models for different types of hateful messages is typically achieved through continued pre-training or fine-tuning on new hate speech benchmarks. However, existing benchmarks mainly address explicit hate toward protected groups and often overlook implicit or indirect hate, such as demeaning comparisons, calls for exclusion or violence, and subtle discriminatory language that still causes harm. While explicit hate can often be captured through surface features, implicit hate requires deeper, full-model semantic processing. In this work, we question the need for repeated fine-tuning and analyze the role of HatePrototypes, class-level vector representations derived from language models optimized for hate speech detection and safety moderation. We find that these prototypes, built from as few as 50 examples per class, enable cross-task transfer between explicit and implicit hate, with interchangeable prototypes across benchmarks. Moreover, we show that parameter-free early exiting with prototypes is effective for both hate types. We release the code, prototype resources, and evaluation scripts to support future research on efficient and transferable hate speech detection.

replace-cross UnfoldLDM: Deep Unfolding-based Blind Image Restoration with Latent Diffusion Priors

Authors: Chunming He, Rihan Zhang, Zheng Chen, Bowen Yang, Chengyu Fang, Yunlong Lin, Yulun Zhang, Fengyang Xiao, Sina Farsiu

Abstract: Deep unfolding networks (DUNs) combine the interpretability of model-based methods with the learning ability of deep networks, yet remain limited for blind image restoration (BIR). Existing DUNs suffer from: (1) \textbf{Degradation-specific dependency}, as their optimization frameworks are tied to a known degradation model, making them unsuitable for BIR tasks; and (2) \textbf{Over-smoothing bias}, resulting from the direct feeding of gradient descent outputs, dominated by low-frequency content, into the proximal term, suppressing fine textures. To overcome these issues, we propose UnfoldLDM to integrate DUNs with latent diffusion model (LDM) for BIR. In each stage, UnfoldLDM employs a multi-granularity degradation-aware (MGDA) module as the gradient descent step. MGDA models BIR as an unknown degradation estimation problem and estimates both the holistic degradation matrix and its decomposed forms, enabling robust degradation removal. For the proximal step, we design a degradation-resistant LDM (DR-LDM) to extract compact degradation-invariant priors from the MGDA output. Guided by this prior, an over-smoothing correction transformer (OCFormer) explicitly recovers high-frequency components and enhances texture details. This unique combination ensures the final result is degradation-free and visually rich. Experiments show that our UnfoldLDM achieves a leading place on various BIR tasks and benefits downstream tasks. Moreover, our design is compatible with existing DUN-based methods, serving as a plug-and-play framework. Code will be released.

replace-cross Stable Multi-Drone GNSS Tracking System for Marine Robots

Authors: Shuo Wen, Edwin Meriaux, Mariana Sosa Guzm\'an, Zhizun Wang, Junming Shi, Gregory Dudek

Abstract: Stable and accurate tracking is essential for marine robotics, yet Global Navigation Satellite System (GNSS) signals vanish immediately below the sea surface. Traditional alternatives suffer from error accumulation, high computational demands, or infrastructure dependence. In this work, we present a multi-drone GNSS-based tracking system for surface and near-surface marine robots. Our approach combines efficient visual detection, lightweight multi-object tracking, GNSS-based triangulation, and a confidence-weighted Extended Kalman Filter (EKF) to provide stable GNSS estimation in real time. We further introduce a cross-drone tracking ID alignment algorithm that enforces global consistency across views, enabling robust multi-robot tracking with cooperative aerial coverage. We validate our system in diversified complex settings to show the accuracy and robustness of the proposed algorithm.

replace-cross Towards Realistic Guarantees: A Probabilistic Certificate for SmoothLLM

Authors: Adarsh Kumarappan, Ayushi Mehrotra

Abstract: The SmoothLLM defense provides a certification guarantee against jailbreaking attacks, but it relies on a strict "k-unstable" assumption that rarely holds in practice. This strong assumption can limit the trustworthiness of the provided safety certificate. In this work, we address this limitation by introducing a more realistic probabilistic framework, "(k, $\varepsilon$)-unstable," to certify defenses against diverse jailbreaking attacks, from gradient-based (GCG) to semantic (PAIR). We derive a new, data-informed lower bound on SmoothLLM's defense probability by incorporating empirical models of attack success, providing a more trustworthy and practical safety certificate. By introducing the notion of (k, $\varepsilon$)-unstable, our framework provides practitioners with actionable safety guarantees, enabling them to set certification thresholds that better reflect the real-world behavior of LLMs. Ultimately, this work contributes a practical and theoretically-grounded mechanism to make LLMs more resistant to the exploitation of their safety alignments, a critical challenge in secure AI deployment.

replace-cross Yo'City: Personalized and Boundless 3D Realistic City Scene Generation via Self-Critic Expansion

Authors: Keyang Lu, Sifan Zhou, Hongbin Xu, Gang Xu, Zhifei Yang, Yikai Wang, Zhen Xiao, Jieyi Long, Ming Li

Abstract: Realistic 3D city generation is fundamental to a wide range of applications, including virtual reality and digital twins. However, most existing methods rely on training a single diffusion model, which limits their ability to generate personalized and boundless city-scale scenes. In this paper, we present Yo'City, a novel agentic framework that enables user-customized and infinitely expandable 3D city generation by leveraging the reasoning and compositional capabilities of off-the-shelf large models. Specifically, Yo'City first conceptualizes the city through a top-down planning strategy that defines a hierarchical "City-District-Grid" structure. The Global Planner determines the overall layout and potential functional districts, while the Local Designer further refines each district with detailed grid-level descriptions. Subsequently, the grid-level 3D generation is achieved through a "produce-refine-evaluate" isometric image synthesis loop, followed by image-to-3D generation. To simulate continuous city evolution, Yo'City further introduces a user-interactive, relationship-guided expansion mechanism, which performs scene graph-based distance- and semantics-aware layout optimization, ensuring spatially coherent city growth. To comprehensively evaluate our method, we construct a diverse benchmark dataset and design six multi-dimensional metrics that assess generation quality from the perspectives of semantics, geometry, texture, and layout. Extensive experiments demonstrate that Yo'City consistently outperforms existing state-of-the-art methods across all evaluation aspects.

replace-cross Enhancing low energy reconstruction and classification in KM3NeT/ORCA with transformers

Authors: Iv\'an Moz\'un Mateo (on behalf of the KM3NeT collaboration)

Abstract: The current KM3NeT/ORCA neutrino telescope, still under construction, has not yet reached its full potential in neutrino reconstruction capability. When training any deep learning model, no explicit information about the physics or the detector is provided, thus they remain unknown to the model. This study leverages the strengths of transformers by incorporating attention masks inspired by the physics and detector design, making the model understand both the telescope design and the neutrino physics measured on it. The study also shows the efficacy of transformers on retaining valuable information between detectors when doing fine-tuning from one configurations to another.

replace-cross Automating Deception: Scalable Multi-Turn LLM Jailbreaks

Authors: Adarsh Kumarappan, Ananya Mujoo

Abstract: Multi-turn conversational attacks, which leverage psychological principles like Foot-in-the-Door (FITD), where a small initial request paves the way for a more significant one, to bypass safety alignments, pose a persistent threat to Large Language Models (LLMs). Progress in defending against these attacks is hindered by a reliance on manual, hard-to-scale dataset creation. This paper introduces a novel, automated pipeline for generating large-scale, psychologically-grounded multi-turn jailbreak datasets. We systematically operationalize FITD techniques into reproducible templates, creating a benchmark of 1,500 scenarios across illegal activities and offensive content. We evaluate seven models from three major LLM families under both multi-turn (with history) and single-turn (without history) conditions. Our results reveal stark differences in contextual robustness: models in the GPT family demonstrate a significant vulnerability to conversational history, with Attack Success Rates (ASR) increasing by as much as 32 percentage points. In contrast, Google's Gemini 2.5 Flash exhibits exceptional resilience, proving nearly immune to these attacks, while Anthropic's Claude 3 Haiku shows strong but imperfect resistance. These findings highlight a critical divergence in how current safety architectures handle conversational context and underscore the need for defenses that can resist narrative-based manipulation.

replace-cross BotaCLIP: Contrastive Learning for Botany-Aware Representation of Earth Observation Data

Authors: Selene Cerna, Sara Si-Moussi, Wilfried Thuiller, Hadrien Hendrikx, Vincent Miele

Abstract: Foundation models have demonstrated a remarkable ability to learn rich, transferable representations across diverse modalities such as images, text, and audio. In modern machine learning pipelines, these representations often replace raw data as the primary input for downstream tasks. In this paper, we address the challenge of adapting a pre-trained foundation model to inject domain-specific knowledge, without retraining from scratch or incurring significant computational costs. To this end, we introduce BotaCLIP, a lightweight multimodal contrastive framework that adapts a pre-trained Earth Observation foundation model (DOFA) by aligning high-resolution aerial imagery with botanical relev\'es. Unlike generic embeddings, BotaCLIP internalizes ecological structure through contrastive learning with a regularization strategy that mitigates catastrophic forgetting. Once trained, the resulting embeddings serve as transferable representations for downstream predictors. Motivated by real-world applications in biodiversity modeling, we evaluated BotaCLIP representations in three ecological tasks: plant presence prediction, butterfly occurrence modeling, and soil trophic group abundance estimation. The results showed consistent improvements over those derived from DOFA and supervised baselines. More broadly, this work illustrates how domain-aware adaptation of foundation models can inject expert knowledge into data-scarce settings, enabling frugal representation learning.

replace-cross RadDiff: Retrieval-Augmented Denoising Diffusion for Protein Inverse Folding

Authors: Jin Han, Tianfan Fu, Wu-Jun Li

Abstract: Protein inverse folding, the design of an amino acid sequence based on a target protein structure, is a fundamental problem of computational protein engineering. Existing methods either generate sequences without leveraging external knowledge or relying on protein language models~(PLMs). The former omits the knowledge stored in natural protein data, while the latter is parameter-inefficient and inflexible to adapt to ever-growing protein data. To overcome the above drawbacks, in this paper we propose a novel method, called $\underline{\text{r}}$etrieval-$\underline{\text{a}}$ugmented $\underline{\text{d}}$enoising $\underline{\text{diff}}$usion~($\mbox{RadDiff}$), for protein inverse folding. In RadDiff, a novel retrieval-augmentation mechanism is designed to capture the up-to-date protein knowledge. We further design a knowledge-aware diffusion model that integrates this protein knowledge into the diffusion process via a lightweight module. Experimental results on the CATH, TS50, and PDB2022 datasets show that $\mbox{RadDiff}$ consistently outperforms existing methods, improving sequence recovery rate by up to 19\%. Experimental results also demonstrate that RadDiff generates highly foldable sequences and scales effectively with database size.

replace-cross ForamDeepSlice: A High-Accuracy Deep Learning Framework for Foraminifera Species Classification from 2D Micro-CT Slices

Authors: Abdelghafour Halimi, Ali Alibrahim, Didier Barradas-Bautista, Ronell Sicat, Abdulkader M. Afifi

Abstract: This study presents a comprehensive deep learning pipeline for the automated classification of foraminifera species using 2D micro-CT slices derived from 3D scans. We curated a scientifically rigorous dataset of 97 micro-CT scanned specimens spanning 27 species, from which we selected 12 representative species with sufficient specimen counts (at least four 3D models each) for robust classification. To ensure methodological integrity and prevent data leakage, we employed specimen-level data splitting, resulting in 109,617 high-quality 2D slices (44,103 for training, 14,046 for validation, and 51,468 for testing). We evaluated seven state-of-the-art 2D convolutional neural network (CNN) architectures using transfer learning. Our final ensemble model, ForamDeepSlice (FDS), combining ConvNeXt-Large and EfficientNetV2-Small, achieved a test accuracy of 95.64%, with a top-3 accuracy of 99.6% and an area under the ROC curve (AUC) of 0.998 across all species. To facilitate practical deployment, we developed an interactive advanced dashboard that supports real-time slice classification and 3D slice matching using advanced similarity metrics, including SSIM, NCC, and the Dice coefficient. This work establishes new benchmarks for AI-assisted micropaleontological identification and provides a fully reproducible framework for foraminifera classification research, bridging the gap between deep learning and applied geosciences.

replace-cross AltNet: Addressing the Plasticity-Stability Dilemma in Reinforcement Learning

Authors: Mansi Maheshwari, John C. Raisbeck, Bruno Castro da Silva

Abstract: Artificial neural networks have shown remarkable success in supervised learning when trained on a single task using a fixed dataset. However, when neural networks are trained on a reinforcement learning task, their ability to continue learning from new experiences declines over time. This decline in learning ability is known as plasticity loss. To restore plasticity, prior work has explored periodically resetting the parameters of the learning network, a strategy that often improves performance. However, such resets come at the cost of a temporary drop in performance, which can be dangerous in real-world settings. To overcome this instability, we introduce AltNet, a reset-based approach that restores plasticity without performance degradation by leveraging a pair of twin networks. The use of twin networks anchors performance during resets through a mechanism that allows networks to periodically alternate roles: one network learns as it acts in the environment, while the other learns off-policy from the active network's interactions through a replay buffer. At fixed intervals, the active network is reset and the passive network, having learned from prior experience, becomes the new active network. AltNet restores plasticity, improving sample efficiency and achieving higher performance, while avoiding performance drops that pose risks in safety-critical settings. We demonstrate these advantages in several high-dimensional control tasks from the DeepMind Control Suite, where AltNet outperforms various relevant baseline methods, as well as state-of-the-art reset-based techniques.

replace-cross Dual Randomized Smoothing: Beyond Global Noise Variance

Authors: Chenhao Sun, Yuhao Mao, Martin Vechev

Abstract: Randomized Smoothing (RS) is a prominent technique for certifying the robustness of neural networks against adversarial perturbations. With RS, achieving high accuracy at small radii requires a small noise variance, while achieving high accuracy at large radii requires a large noise variance. However, the global noise variance used in the standard RS formulation leads to a fundamental limitation: there exists no global noise variance that simultaneously achieves strong performance at both small and large radii. To break through the global variance limitation, we propose a dual RS framework which enables input-dependent noise variances. To achieve that, we first prove that RS remains valid with input-dependent noise variances, provided the variance is locally constant around each input. Building on this result, we introduce two components: (i) a variance estimator predicts an optimal noise variance for each input, (ii) this estimated variance is then used by a standard RS classifier. The variance estimator is independently smoothed via RS to ensure local constancy, enabling flexible design. We also introduce training strategies to iteratively optimize the two components. Experiments on CIFAR-10 demonstrate that our dual RS method provides strong performance for both small and large radii-unattainable with global noise variance-while incurring only a 60% computational overhead at inference. Moreover, it outperforms prior input-dependent noise approaches across most radii, with gains at radii 0.5, 0.75, and 1.0 of 15.6%, 20.0%, and 15.7%. On ImageNet, dual RS remains effective across all radii, with advantages of 8.6%, 17.1%, and 9.1% at radii 0.5, 1.0, and 1.5. Additionally, the dual RS framework provides a routing perspective for certified robustness, improving the accuracy-robustness trade-off with off-the-shelf expert RS models.

replace-cross Process-Centric Analysis of Agentic Software Systems

Authors: Shuyang Liu, Yang Chen, Rahul Krishna, Saurabh Sinha, Jatin Ganhotra, Reyhan Jabbarvand

Abstract: Agentic systems are modern software systems: they consist of orchestrated modules, expose interfaces, and are deployed in software pipelines. Unlike conventional programs, their execution, i.e., trajectories, is inherently stochastic and adaptive to the problems they solve. Evaluation of such systems is often outcome-centric. This narrow focus overlooks detailed insights, failing to explain how agents reason, plan, act, or change their strategies. Inspired by the structured representation of conventional software systems as graphs, we introduce Graphectory to systematically encode the temporal and semantic relations in such systems. Using Graphectory, we automatically analyze 4000 trajectories of two dominant agentic programming workflows, SWE-agent and OpenHands, with four backbone Large Language Models (LLMs), attempting to resolve SWE-bench issues. Our automated analyses (completed within four minutes) reveal that: (1) agents using richer prompts or stronger LLMs exhibit more complex Graphectory, reflecting deeper exploration, broader context gathering, and more thorough validation; (2) agents' strategies vary with problem difficulty and the underlying LLM - for resolved issues, strategies often follow coherent localization-patching-validation steps, while unresolved ones exhibit chaotic or backtracking behaviors; and (3) even successful agentic systems often display inefficient processes. We also implement a novel technique for real-time construction and analysis of Graphectory and Langutory during agent execution to flag trajectory issues. Upon detecting such issues, the technique notifies the agent with a diagnostic message and, when applicable, rolls back the trajectory. Experiments show that online monitoring and interventions improve resolution rates by 6.9%-23.5% across models for problematic instances, while significantly shortening trajectories with near-zero overhead.

replace-cross Beyond Additivity: Sparse Isotonic Shapley Regression toward Nonlinear Explainability

Authors: Jialai She

Abstract: Shapley values, a gold standard for feature attribution in Explainable AI, face two key challenges. First, the canonical Shapley framework assumes that the worth function is additive, yet real-world payoff constructions--driven by non-Gaussian distributions, heavy tails, feature dependence, or domain-specific loss scales--often violate this assumption, leading to distorted attributions. Second, achieving sparse explanations in high-dimensional settings by computing dense Shapley values and then applying ad hoc thresholding is costly and risks inconsistency. We introduce Sparse Isotonic Shapley Regression (SISR), a unified nonlinear explanation framework. SISR simultaneously learns a monotonic transformation to restore additivity--obviating the need for a closed-form specification--and enforces an L0 sparsity constraint on the Shapley vector, enhancing computational efficiency in large feature spaces. Its optimization algorithm leverages Pool-Adjacent-Violators for efficient isotonic regression and normalized hard-thresholding for support selection, ensuring ease in implementation and global convergence guarantees. Analysis shows that SISR recovers the true transformation in a wide range of scenarios and achieves strong support recovery even in high noise. Moreover, we are the first to demonstrate that irrelevant features and inter-feature dependencies can induce a true payoff transformation that deviates substantially from linearity. Extensive experiments demonstrate that SISR stabilizes attributions across payoff schemes and correctly filters irrelevant features; in contrast, standard Shapley values suffer severe rank and sign distortions. By unifying nonlinear transformation estimation with sparsity pursuit, SISR advances the frontier of nonlinear explainability, providing a theoretically grounded and practical attribution framework.

replace-cross Beyond Endpoints: Path-Centric Reasoning for Vectorized Off-Road Network Extraction

Authors: Wenfei Guan, Jilin Mei, Tong Shen, Xumin Wu, Shuo Wang, Chen Min, Yu Hu

Abstract: Deep learning has advanced vectorized road extraction in urban settings, yet off-road environments remain underexplored and challenging. A significant domain gap causes advanced models to fail in wild terrains due to two key issues: lack of large-scale vectorized datasets and structural weakness in prevailing methods. Models such as SAM-Road employ a node-centric paradigm that reasons at sparse endpoints, making them fragile to occlusions and ambiguous junctions in off-road scenes, leading to topological errors. This work addresses these limitations in two complementary ways. First, we release WildRoad, a global off-road road network dataset constructed efficiently with a dedicated interactive annotation tool tailored for road-network labeling. Second, we introduce MaGRoad (Mask-aware Geodesic Road network extractor), a path-centric framework that aggregates multi-scale visual evidence along candidate paths to infer connectivity robustly. Extensive experiments show that MaGRoad achieves state-of-the-art performance on our challenging WildRoad benchmark while generalizing well to urban datasets. An efficient vertex extraction strategy also yields roughly 2.5X faster inference, improving practical applicability. Together, the dataset and path-centric paradigm provide a stronger foundation for mapping roads in the wild. We release both the dataset and code at this repository. We release both the dataset and code at https://github.com/xiaofei-guan/MaGRoad.

URLs: https://github.com/xiaofei-guan/MaGRoad.

replace-cross SALVE: Sparse Autoencoder-Latent Vector Editing for Mechanistic Control of Neural Networks

Authors: Vegard Flovik

Abstract: Deep neural networks achieve impressive performance but remain difficult to interpret and control. We present SALVE (Sparse Autoencoder-Latent Vector Editing), a unified "discover, validate, and control" framework that bridges mechanistic interpretability and model editing. Using an $\ell_1$-regularized autoencoder, we learn a sparse, model-native feature basis without supervision. We validate these features with Grad-FAM, a feature-level saliency mapping method that visually grounds latent features in input data. Leveraging the autoencoder's structure, we perform precise and permanent weight-space interventions, enabling continuous modulation of both class-defining and cross-class features. We further derive a critical suppression threshold, $\alpha_{crit}$, quantifying each class's reliance on its dominant feature, supporting fine-grained robustness diagnostics. Our approach is validated on both convolutional (ResNet-18) and transformer-based (ViT-B/16) models, demonstrating consistent, interpretable control over their behavior. This work contributes a principled methodology for turning feature discovery into actionable model edits, advancing the development of transparent and controllable AI systems.

replace-cross Meta-RL Induces Exploration in Language Agents

Authors: Yulun Jiang, Liangze Jiang, Damien Teney, Michael Moor, Maria Brbic

Abstract: Reinforcement learning (RL) has enabled the training of large language model (LLM) agents to interact with the environment and to solve multi-turn long-horizon tasks. However, the RL-trained agents often struggle in tasks that require active exploration and fail to efficiently adapt from trial-and-error experiences. In this paper, we present LaMer, a general Meta-RL framework that enables LLM agents to actively explore and learn from the environment feedback at test time. LaMer consists of two key components: (i) a cross-episode training framework to encourage exploration and long-term rewards optimization; and (ii) in-context policy adaptation via reflection, allowing the agent to adapt their policy from task feedback signal without gradient update. Experiments across diverse environments show that LaMer significantly improves performance over RL baselines, with 11%, 14%, and 19% performance gains on Sokoban, MineSweeper and Webshop, respectively. Moreover, LaMer also demonstrates better generalization to more challenging or previously unseen tasks compared to the RL-trained agents. Overall, our results demonstrate that Meta-RL provides a principled approach to induce exploration in language agents, enabling more robust adaptation to novel environments through learned exploration strategies.

replace-cross ReDepth Anything: Test-Time Depth Refinement via Self-Supervised Re-lighting

Authors: Ananta R. Bhattarai, Helge Rhodin

Abstract: Monocular depth estimation remains challenging, as foundation models such as Depth Anything V2 (DA-V2) struggle with real-world images that are far from the training distribution. We introduce Re-Depth Anything, a test-time self-supervision framework that bridges this domain gap by fusing foundation models with the powerful priors of large-scale 2D diffusion models. Our method performs label-free refinement directly on the input image by re-lighting the predicted depth map and augmenting the input. This re-synthesis method replaces classical photometric reconstruction by leveraging shape from shading (SfS) cues in a new, generative context with Score Distillation Sampling (SDS). To prevent optimization collapse, our framework updates only intermediate embeddings and the decoder's weights, rather than optimizing the depth tensor directly or fine-tuning the full model. Across diverse benchmarks, Re-Depth Anything yields substantial gains in depth accuracy and realism over DA-V2, and applied on top of Depth Anything 3 (DA3) achieves state-of-the-art results, showcasing new avenues for self-supervision by geometric reasoning.

replace-cross Cost Trade-offs of Reasoning and Non-Reasoning Large Language Models in Text-to-SQL

Authors: Saurabh Deochake, Debajyoti Mukhopadhyay

Abstract: While Text-to-SQL systems achieve high accuracy, existing efficiency metrics like the Valid Efficiency Score prioritize execution time, a metric we show is fundamentally decoupled from consumption-based cloud billing. This paper evaluates cloud query execution cost trade-offs between reasoning and non-reasoning Large Language Models by performing 180 Text-to-SQL query executions across six LLMs on Google BigQuery using the 230 GB StackOverflow dataset. Our analysis reveals that reasoning models process 44.5% fewer bytes than non-reasoning counterparts while maintaining equivalent correctness at 96.7% to 100%, and that execution time correlates weakly with query cost at $r=0.16$, indicating that speed optimization does not imply cost efficiency. Non-reasoning models also exhibit extreme cost variance of up to 3.4$\times$, producing outliers exceeding 36 GB per query, over 20$\times$ the best model's 1.8 GB average, due to missing partition filters and inefficient joins. We identify these prevalent inefficiency patterns and provide deployment guidelines to mitigate financial risks in cost-sensitive enterprise environments.

replace-cross DrivingGen: A Comprehensive Benchmark for Generative Video World Models in Autonomous Driving

Authors: Yang Zhou, Hao Shao, Letian Wang, Zhuofan Zong, Hongsheng Li, Steven L. Waslander

Abstract: Video generation models, as one form of world models, have emerged as one of the most exciting frontiers in AI, promising agents the ability to imagine the future by modeling the temporal evolution of complex scenes. In autonomous driving, this vision gives rise to driving world models: generative simulators that imagine ego and agent futures, enabling scalable simulation, safe testing of corner cases, and rich synthetic data generation. Yet, despite fast-growing research activity, the field lacks a rigorous benchmark to measure progress and guide priorities. Existing evaluations remain limited: generic video metrics overlook safety-critical imaging factors; trajectory plausibility is rarely quantified; temporal and agent-level consistency is neglected; and controllability with respect to ego conditioning is ignored. Moreover, current datasets fail to cover the diversity of conditions required for real-world deployment. To address these gaps, we present DrivingGen, the first comprehensive benchmark for generative driving world models. DrivingGen combines a diverse evaluation dataset curated from both driving datasets and internet-scale video sources, spanning varied weather, time of day, geographic regions, and complex maneuvers, with a suite of new metrics that jointly assess visual realism, trajectory plausibility, temporal coherence, and controllability. Benchmarking 14 state-of-the-art models reveals clear trade-offs: general models look better but break physics, while driving-specific ones capture motion realistically but lag in visual quality. DrivingGen offers a unified evaluation framework to foster reliable, controllable, and deployable driving world models, enabling scalable simulation, planning, and data-driven decision-making.

replace-cross NC-Bench: An LLM Benchmark for Evaluating Conversational Competence

Authors: Robert J. Moore, Sungeun An, Farhan Ahmed, Jay Pankaj Gala

Abstract: The Natural Conversation Benchmark (NC-Bench) introduces a new approach to evaluating the general conversational competence of large language models (LLMs). Unlike prior benchmarks that focus on the content of model behavior, NC-Bench focuses on the form and structure of natural conversation. Grounded in the IBM Natural Conversation Framework (NCF), NC-Bench comprises three distinct sets: (1) the basic set evaluates fundamental sequence management practices, such as answering inquiries, repairing responses, and closing conversational pairs; (2) the retrieval-augmented generation (RAG) set applies the same sequence management patterns as the first set but incorporates information-seeking via RAG; (3) the complex request set extends to requests involving more intricate sequence management patterns. Each set tests a model's ability to produce contextually appropriate conversational actions in response to characteristic interaction patterns. Initial evaluations across six open-source models and 14 interaction patterns show that models perform well on basic answering tasks, struggle more with repair tasks (especially repeat), have mixed performance on closing sequences, and find complex multi-turn requests most challenging. By operationalizing fundamental principles of human conversation, NC-Bench provides a lightweight, extensible, and theory-grounded framework for assessing and improving the conversational abilities of LLMs beyond topical or task-specific benchmarks.

replace-cross The Algorithmic Gaze of Image Quality Assessment: An Audit and Trace Ethnography of the LAION-Aesthetics Predictor

Authors: Jordan Taylor, William Agnew, Maarten Sap, Sarah E. Fox, Haiyi Zhu

Abstract: Visual generative AI models are trained using a one-size-fits-all measure of aesthetic appeal. However, what is deemed "aesthetic" is inextricably linked to personal taste and cultural values, raising the question of whose taste is represented in visual generative AI models. In this work, we study an aesthetic evaluation model--LAION-Aesthetics Predictor (LAP)--that is widely used to curate datasets to train visual generative image models, like Stable Diffusion, and evaluate the quality of AI-generated images. To understand what LAP measures, we audited the model across three datasets. First, we examined the impact of aesthetic filtering on the LAION-Aesthetics Dataset (approximately 1.2B images), which was curated from LAION-5B using LAP. We find that the LAP disproportionally filters in images with captions mentioning women, while filtering out images with captions mentioning men or LGBTQ+ people. Then, we used LAP to score approximately 330k images across two art datasets, finding the model rates realistic images of landscapes, cityscapes, and portraits from western and Japanese artists most highly. In doing so, the algorithmic gaze of this aesthetic evaluation model reinforces the imperial and male gazes found within western art history. In order to understand where these biases may have originated, we performed a digital ethnography of public materials related to the creation of LAP. We find that the development of LAP reflects the biases we found in our audits, such as the aesthetic scores used to train LAP primarily coming from English-speaking photographers and western AI-enthusiasts. In response, we discuss how aesthetic evaluation can perpetuate representational harms and call on AI developers to shift away from prescriptive measures of "aesthetics" toward more pluralistic evaluation.

replace-cross Multifaceted Scenario-Aware Hypergraph Learning for Next POI Recommendation

Authors: Yuxi Lin, Yongkang Li, Jie Xing, Zipei Fan

Abstract: Among the diverse services provided by Location-Based Social Networks (LBSNs), Next Point-of-Interest (POI) recommendation plays a crucial role in inferring user preferences from historical check-in trajectories. However, existing sequential and graph-based methods frequently neglect significant mobility variations across distinct contextual scenarios (e.g., tourists versus locals). This oversight results in suboptimal performance due to two fundamental limitations: the inability to capture scenario-specific features and the failure to resolve inherent inter-scenario conflicts. To overcome these limitations, we propose the Multifaceted Scenario-Aware Hypergraph Learning method (MSAHG), a framework that adopts a scenario-splitting paradigm for next POI recommendation. Our main contributions are: (1) Construction of scenario-specific, multi-view disentangled sub-hypergraphs to capture distinct mobility patterns; (2) A parameter-splitting mechanism to adaptively resolve conflicting optimization directions across scenarios while preserving generalization capability. Extensive experiments on three real-world datasets demonstrate that MSAHG consistently outperforms five state-of-the-art methods across diverse scenarios, confirming its effectiveness in multi-scenario POI recommendation.

replace-cross DevBench: A Realistic, Developer-Informed Benchmark for Code Generation Models

Authors: Pareesa Ameneh Golnari, Adarsh Kumarappan, Wen Wen, Xiaoyu Liu, Gabriel Ryan, Yuting Sun, Shengyu Fu, Elsie Nallipogu

Abstract: DevBench is a telemetry-driven benchmark designed to evaluate Large Language Models (LLMs) on realistic code completion tasks. It includes 1,800 evaluation instances across six programming languages and six task categories derived from real developer telemetry, such as API usage and code purpose understanding. Unlike prior benchmarks, it emphasizes ecological validity, avoids training data contamination, and enables detailed diagnostics. The evaluation combines functional correctness, similarity-based metrics, and LLM-judge assessments focused on usefulness and contextual relevance. 9 state-of-the-art models were assessed, revealing differences in syntactic precision, semantic reasoning, and practical utility. Our benchmark provides actionable insights to guide model selection and improvement-detail that is often missing from other benchmarks but is essential for both practical deployment and targeted model development.

replace-cross Continuous-Flow Data-Rate-Aware CNN Inference on FPGA

Authors: Tobias Habermann, Michael Mecik, Zhenyu Wang, C\'esar David Vera, Martin Kumm, Mario Garrido

Abstract: Among hardware accelerators for deep-learning inference, data flow implementations offer low latency and high throughput capabilities. In these architectures, each neuron is mapped to a dedicated hardware unit, making them well-suited for field-programmable gate array (FPGA) implementation. Previous unrolled implementations mostly focus on fully connected networks because of their simplicity, although it is well known that convolutional neural networks (CNNs) require fewer computations for the same accuracy. When observing the data flow in CNNs, pooling layers and convolutional layers with a stride larger than one, the number of data at their output is reduced with respect to their input. This data reduction strongly affects the data rate in a fully parallel implementation, making hardware units heavily underutilized unless it is handled properly. This work addresses this issue by analyzing the data flow of CNNs and presents a novel approach to designing data-rate-aware, continuous-flow CNN architectures. The proposed approach ensures a high hardware utilization close to 100% by interleaving low data rate signals and sharing hardware units, as well as using the right parallelization to achieve the throughput of a fully parallel implementation. The results show that a significant amount of the arithmetic logic can be saved, which allows implementing complex CNNs like MobileNet on a single FPGA with high throughput.

replace-cross MeanCache: From Instantaneous to Average Velocity for Accelerating Flow Matching Inference

Authors: Huanlin Gao, Ping Chen, Fuyuan Shi, Ruijia Wu, Li YanTao, Qiang Hui, Yuren You, Ting Lu, Chao Tan, Shaoan Zhao, Zhaoxiang Liu, Fang Zhao, Kai Wang, Shiguo Lian

Abstract: We present MeanCache, a training-free caching framework for efficient Flow Matching inference. Existing caching methods reduce redundant computation but typically rely on instantaneous velocity information (e.g., feature caching), which often leads to severe trajectory deviations and error accumulation under high acceleration ratios. MeanCache introduces an average-velocity perspective: by leveraging cached Jacobian--vector products (JVP) to construct interval average velocities from instantaneous velocities, it effectively mitigates local error accumulation. To further improve cache timing and JVP reuse stability, we develop a trajectory-stability scheduling strategy as a practical tool, employing a Peak-Suppressed Shortest Path under budget constraints to determine the schedule. Experiments on FLUX.1, Qwen-Image, and HunyuanVideo demonstrate that MeanCache achieves 4.12X and 4.56X and 3.59X acceleration, respectively, while consistently outperforming state-of-the-art caching baselines in generation quality. We believe this simple yet effective approach provides a new perspective for Flow Matching inference and will inspire further exploration of stability-driven acceleration in commercial-scale generative models.

replace-cross RedSage: A Cybersecurity Generalist LLM

Authors: Naufal Suryanto, Muzammal Naseer, Pengfei Li, Syed Talal Wasim, Jinhui Yi, Juergen Gall, Paolo Ceravolo, Ernesto Damiani

Abstract: Cybersecurity operations demand assistant LLMs that support diverse workflows without exposing sensitive data. Existing solutions either rely on proprietary APIs with privacy risks or on open models lacking domain adaptation. To bridge this gap, we curate 11.8B tokens of cybersecurity-focused continual pretraining data via large-scale web filtering and manual collection of high-quality resources, spanning 28.6K documents across frameworks, offensive techniques, and security tools. Building on this, we design an agentic augmentation pipeline that simulates expert workflows to generate 266K multi-turn cybersecurity samples for supervised fine-tuning. Combined with general open-source LLM data, these resources enable the training of RedSage, an open-source, locally deployable cybersecurity assistant with domain-aware pretraining and post-training. To rigorously evaluate the models, we introduce RedSage-Bench, a benchmark with 30K multiple-choice and 240 open-ended Q&A items covering cybersecurity knowledge, skills, and tool expertise. RedSage is further evaluated on established cybersecurity benchmarks (e.g., CTI-Bench, CyberMetric, SECURE) and general LLM benchmarks to assess broader generalization. At the 8B scale, RedSage achieves consistently better results, surpassing the baseline models by up to +5.59 points on cybersecurity benchmarks and +5.05 points on Open LLM Leaderboard tasks. These findings demonstrate that domain-aware agentic augmentation and pre/post-training can not only enhance cybersecurity-specific expertise but also help to improve general reasoning and instruction-following. All models, datasets, and code are publicly available.

replace-cross Bitcoin Price Prediction using Machine Learning and Combinatorial Fusion Analysis

Authors: Yuanhong Wu, Wei Ye, Jingyan Xu, D. Frank Hsu

Abstract: In this work, we propose to apply a new model fusion and learning paradigm, known as Combinatorial Fusion Analysis (CFA), to the field of Bitcoin price prediction. Price prediction of financial product has always been a big topic in finance, as the successful prediction of the price can yield significant profit. Every machine learning model has its own strength and weakness, which hinders progress toward robustness. CFA has been used to enhance models by leveraging rank-score characteristic (RSC) function and cognitive diversity in the combination of a moderate set of diverse and relatively well-performed models. Our method utilizes both score and rank combinations as well as other weighted combination techniques. Key metrics such as RMSE and MAPE are used to evaluate our methodology performance. Our proposal presents a notable MAPE performance of 0.19\%. The proposed method greatly improves upon individual model performance, as well as outperforms other Bitcoin price prediction models.

replace-cross Impact of LLMs news Sentiment Analysis on Stock Price Movement Prediction

Authors: Walid Siala (SnT, University of Luxembourg, Luxembourg), Ahmed Khanfir (RIADI, ENSI, University of Manouba, Tunisia, SnT, University of Luxembourg, Luxembourg), Mike Papadakis (SnT, University of Luxembourg, Luxembourg)

Abstract: This paper addresses stock price movement prediction by leveraging LLM-based news sentiment analysis. Earlier works have largely focused on proposing and assessing sentiment analysis models and stock movement prediction methods, however, separately. Although promising results have been achieved, a clear and in-depth understanding of the benefit of the news sentiment to this task, as well as a comprehensive assessment of different architecture types in this context, is still lacking. Herein, we conduct an evaluation study that compares 3 different LLMs, namely, DeBERTa, RoBERTa and FinBERT, for sentiment-driven stock prediction. Our results suggest that DeBERTa outperforms the other two models with an accuracy of 75% and that an ensemble model that combines the three models can increase the accuracy to about 80%. Also, we see that sentiment news features can benefit (slightly) some stock market prediction models, i.e., LSTM-, PatchTST- and tPatchGNN-based classifiers and PatchTST- and TimesNet-based regression tasks models.

replace-cross In-Run Data Shapley for Adam Optimizer

Authors: Meng Ding, Zeqing Zhang, Di Wang, Lijie Hu

Abstract: Reliable data attribution is essential for mitigating bias and reducing computational waste in modern machine learning, with the Shapley value serving as the theoretical gold standard. While recent "In-Run" methods bypass the prohibitive cost of retraining by estimating contributions dynamically, they heavily rely on the linear structure of Stochastic Gradient Descent (SGD) and fail to capture the complex dynamics of adaptive optimizers like Adam. In this work, we demonstrate that data attribution is inherently optimizer-dependent: we show that SGD-based proxies diverge significantly from true contributions under Adam (Pearson $R \approx 0.11$), rendering them ineffective for modern training pipelines. To bridge this gap, we propose Adam-Aware In-Run Data Shapley. We derive a closed-form approximation that restores additivity by redefining utility under a fixed-state assumption and enable scalable computation via a novel Linearized Ghost Approximation. This technique linearizes the variance-dependent scaling term, allowing us to compute pairwise gradient dot-products without materializing per-sample gradients. Extensive experiments show that our method achieves near-perfect fidelity to ground-truth marginal contributions ($R > 0.99$) while retaining $\sim$95\% of standard training throughput. Furthermore, our Adam-aware attribution significantly outperforms SGD-based baselines in data attribution downstream tasks.

replace-cross Do Schwartz Higher-Order Values Help Sentence-Level Human Value Detection? A Study of Hierarchical Gating and Calibration

Authors: V\'ictor Yeste, Paolo Rosso

Abstract: Human value detection from single sentences is a sparse, imbalanced multi-label task. We study whether Schwartz higher-order (HO) categories help this setting on ValueEval'24 / ValuesML (74K English sentences) under a compute-frugal budget. Rather than proposing a new architecture, we compare direct supervised transformers, hard HO$\rightarrow$values pipelines, Presence$\rightarrow$HO$\rightarrow$values cascades, compact instruction-tuned large language models (LLMs), QLoRA, and low-cost upgrades such as threshold tuning and small ensembles. HO categories are learnable: the easiest bipolar pair, Growth vs. Self-Protection, reaches Macro-$F_1=0.58$. The most reliable gains come from calibration and ensembling: threshold tuning improves Social Focus vs. Personal Focus from $0.41$ to $0.57$ ($+0.16$), transformer soft voting lifts Growth from $0.286$ to $0.303$, and a Transformer+LLM hybrid reaches $0.353$ on Self-Protection. In contrast, hard hierarchical gating does not consistently improve the end task. Compact LLMs also underperform supervised encoders as stand-alone systems, although they sometimes add useful diversity in hybrid ensembles. Under this benchmark, the HO structure is more useful as an inductive bias than as a rigid routing rule.

replace-cross Thickening-to-Thinning: Reward Shaping via Human-Inspired Learning Dynamics for LLM Reasoning

Authors: Wenze Lin, Zhen Yang, Xitai Jiang, Pony Ma, Gao Huang

Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a promising paradigm for enhancing reasoning in Large Language Models (LLMs). However, it frequently encounters challenges such as entropy collapse, excessive verbosity, and insufficient exploration for hard problems. Crucially, existing reward schemes fail to distinguish between the need for extensive search during problem-solving and the efficiency required for mastered knowledge. In this work, we introduce T2T(Thickening-to-Thinning), a dynamic reward framework inspired by human learning processes. Specifically, it implements a dual-phase mechanism: (1) On incorrect attempts, T2T incentivizes "thickening" (longer trajectories) to broaden the search space and explore novel solution paths; (2) Upon achieving correctness, it shifts to "thinning", imposing length penalties to discourage redundancy, thereby fostering model confidence and crystallizing reasoning capabilities. Extensive experiments on mathematical benchmarks (MATH-500, AIME, AMC) across Qwen-series and Deepseek models demonstrate that T2T significantly outperforms standard GRPO and recent baselines, achieving superior performance.

replace-cross Extracting Recurring Vulnerabilities from Black-Box LLM-Generated Software

Authors: Tomer Kordonsky, Maayan Yamin, Noam Benzimra, Amit LeVi, Avi Mendelson

Abstract: LLMs are increasingly used for code generation, but their outputs often follow recurring templates that can induce predictable vulnerabilities. We study vulnerability persistence in LLM-generated software and introduce Feature--Security Table (FSTab) with two components. First, FSTab enables a black-box attack that predicts likely backend vulnerabilities from observable frontend features and knowledge of the source LLM, without access to the backend or source code. Second, FSTab provides a model-centric evaluation that quantifies how consistently a model reproduces the same vulnerabilities across programs, semantics-preserving rephrasings, and application domains. We evaluate FSTab on state-of-the-art code LLMs, including GPT-5.2, Claude-4.5 Opus, and Gemini-3 Pro, across diverse application domains. Our results show strong cross-domain transfer: even when the target domain is excluded from training, FSTab achieves up to 94% attack success and 93% vulnerability coverage on Internal Tools (Claude-4.5 Opus). These findings expose an underexplored attack surface in LLM-generated software and highlight the security risks of code generation. Our code is available at https://github.com/fstabicml2026/FSTab

URLs: https://github.com/fstabicml2026/FSTab

replace-cross Semantic Search over 9 Million Mathematical Theorems

Authors: Luke Alexander, Eric Leonen, Sophie Szeto, Artemii Remizov, Ignacio Tejeda, Jarod Alper, Giovanni Inchiostro, Vasily Ilin

Abstract: Searching for mathematical results remains difficult: most existing tools retrieve entire papers, while mathematicians and theorem-proving agents often seek a specific theorem, lemma, or proposition that answers a query. While semantic search has seen rapid progress, its behavior on large, highly technical corpora such as research-level mathematical theorems remains poorly understood. In this work, we introduce and study semantic theorem retrieval at scale over a unified corpus of $9.2$ million theorem statements extracted from arXiv and seven other sources, representing the largest publicly available corpus of human-authored, research-level theorems. We represent each theorem with a short natural-language description as a retrieval representation and systematically analyze how representation context, language model choice, embedding model, and prompting strategy affect retrieval quality. On a curated evaluation set of theorem-search queries written by professional mathematicians, our approach substantially improves both theorem-level and paper-level retrieval compared to existing baselines, demonstrating that semantic theorem search is feasible and effective at web scale. The project page, search tool, dataset, REST API, and MCP server are available at theoremsearch.com.

replace-cross LMMRec: LLM-driven Motivation-aware Multimodal Recommendation

Authors: Yicheng Di, Zhanjie Zhang, Yun Wang, Jinren Liu, Jiaqi Yan, Jiyu Wei, Xiangyu Chen, Yuan Liu

Abstract: Motivation-based recommendation systems uncover user behavior drivers. Motivation modeling, crucial for decision-making and content preference, explains recommendation generation. Existing methods often treat motivation as latent variables from interaction data, neglecting heterogeneous information like review text. In multimodal motivation fusion, two challenges arise: 1) achieving stable cross-modal alignment amid noise, and 2) identifying features reflecting the same underlying motivation across modalities. To address these, we propose LLM-driven Motivation-aware Multimodal Recommendation (LMMRec), a model-agnostic framework leveraging large language models for deep semantic priors and motivation understanding. LMMRec uses chain-of-thought prompting to extract fine-grained user and item motivations from text. A dual-encoder architecture models textual and interaction-based motivations for cross-modal alignment, while Motivation Coordination Strategy and Interaction-Text Correspondence Method mitigate noise and semantic drift through contrastive learning and momentum updates. Experiments on three datasets show LMMRec achieves up to a 4.98\% performance improvement.

replace-cross Diffusion-Guided Pretraining for Brain Graph Foundation Models

Authors: Xinxu Wei, Rong Zhou, Lifang He, Yu Zhang

Abstract: With the growing interest in foundation models for brain signals, graph-based pretraining has emerged as a promising paradigm for learning transferable representations from connectome data. However, existing contrastive and masked autoencoder methods typically rely on naive random dropping or masking for augmentation, which is ill-suited for brain graphs and hypergraphs as it disrupts semantically meaningful connectivity patterns. Moreover, commonly used graph-level readout and reconstruction schemes fail to capture global structural information, limiting the robustness of learned representations. In this work, we propose a unified diffusion-based pretraining framework that addresses both limitations. First, diffusion is designed to guide structure-aware dropping and masking strategies, preserving brain graph semantics while maintaining effective pretraining diversity. Second, diffusion enables topology-aware graph-level readout and node-level global reconstruction by allowing graph embeddings and masked nodes to aggregate information from globally related regions. Extensive experiments across multiple neuroimaging datasets with over 25,000 subjects and 60,000 scans involving various mental disorders and brain atlases demonstrate consistent performance improvements.

replace-cross Listen to the Layers: Mitigating Hallucinations with Inter-Layer Disagreement

Authors: Koduvayur Subbalakshmi, Sabbir Hossain Ujjal, Venkata Krishna Teja Mangichetty, Nastaran Jamalipour Soofi

Abstract: Pretrained Large Language Models (LLMs) are prone to generating fluent yet factually incorrect text-a phenomenon known as hallucinations, undermining their reliability and utility in downstream tasks. We hypothesize that a generated text span's factuality is correlated with its representational instability across the model's internal layers. Based on this, we propose the CoCoA (Confusion and Consistency Aware) decoder, a novel, training-free decoding algorithm that mitigates hallucinations at inference time by listening to these signals in the middle layers. We propose two metrics to quantify this instability in the middle layers and use it to penalize outputs that exhibit high internal confusion, thereby steering the model towards more internally consistent and factually grounded outputs. We further propose a self-information gated variant, CoCoA-SIG, that dynamically modulates this penalty to selectively target high-surprise, unstable generations. Extensive experiments on diverse tasks, including question-answering, summarization, mathematical reasoning and code generation, demonstrate that CoCoA significantly improves factual correctness across multiple model families (e.g., Llama-3, Qwen-2.5, Mistral). By leveraging model-intrinsic signals, CoCoA offers an effective and broadly applicable method for enhancing the trustworthiness of LLMs at inference time, without requiring any model retraining.

replace-cross SToRM: Supervised Token Reduction for Multi-modal LLMs toward efficient end-to-end autonomous driving

Authors: Seo Hyun Kim, Jin Bok Park, Do Yeon Koo, Hogun Park, Il Yong Chun

Abstract: In autonomous driving, end-to-end (E2E) driving systems that predict control commands directly from sensor data have achieved significant advancements. For safe driving in unexpected scenarios, these systems may additionally rely on human interventions such as natural language instructions. Using a multi-modal large language model (MLLM) facilitates human-vehicle interaction and can improve performance in such scenarios. However, this approach requires substantial computational resources due to its reliance on an LLM and numerous visual tokens from sensor inputs, which are limited in autonomous vehicles. Many MLLM studies have explored reducing visual tokens, but often suffer end-task performance degradation compared to using all tokens. To enable efficient E2E driving while maintaining performance comparable to using all tokens, this paper proposes the first Supervised Token Reduction framework for multi-modal LLMs (SToRM). The proposed framework consists of three key elements. First, a lightweight importance predictor with short-term sliding windows estimates token importance scores. Second, a supervised training approach uses an auxiliary path to obtain pseudo-supervision signals from an all-token LLM pass. Third, an anchor-context merging module partitions tokens into anchors and context tokens, and merges context tokens into relevant anchors to reduce redundancy while minimizing information loss. Experiments on the LangAuto benchmark show that SToRM outperforms state-of-the-art E2E driving MLLMs under the same reduced-token budget, maintaining all-token performance while reducing computational cost by up to 30x.

replace-cross Accelerating Robotic Reinforcement Learning with Agent Guidance

Authors: Haojun Chen, Zili Zou, Chengdong Ma, Yaoxiang Pu, Haotong Zhang, Yuanpei Chen, Yaodong Yang

Abstract: Reinforcement Learning (RL) offers a powerful paradigm for autonomous robots to master generalist manipulation skills through trial-and-error. However, its real-world application is stifled by low sample efficiency. Recent Human-in-the-Loop (HIL) methods accelerate training by using human corrections, yet this approach faces a scalability barrier. Reliance on human supervisors imposes a 1:1 supervision ratio that limits scalability, suffers from operator fatigue over extended sessions, and introduces high variance due to inconsistent human proficiency. We present Agent-guided Policy Search (AGPS), a framework that automates the training pipeline by replacing human supervisors with a multimodal agent. Our key insight is that the agent can be viewed as a semantic world model, injecting intrinsic value priors to structure physical exploration. By using tools, the agent provides precise guidance via corrective waypoints and spatial constraints for exploration pruning. We validate our approach on three tasks, ranging from precision insertion to deformable object manipulation. Results demonstrate that AGPS outperforms HIL methods in sample efficiency. This automates the supervision pipeline, unlocking the path to labor-free and scalable robot learning. Project website: https://agps-rl.github.io/agps/.

URLs: https://agps-rl.github.io/agps/.

replace-cross TrasMuon: Trust-Region Adaptive Scaling for Orthogonalized Momentum Optimizers

Authors: Peng Cheng, Jiucheng Zang, Qingnan Li, Liheng Ma, Yufei Cui, Yingxue Zhang, Boxing Chen, Ming Jian, Wen Tong

Abstract: Muon-style optimizers leverage Newton-Schulz (NS) iterations to orthogonalize updates, yielding update geometries that often outperform Adam-series methods. However, this orthogonalization discards magnitude information, rendering training sensitive to step-size hyperparameters and vulnerable to high-energy bursts. To mitigate this, we introduce TrasMuon (\textbf{T}rust \textbf{R}egion \textbf{A}daptive \textbf{S}caling \textbf{Muon}). TrasMuon preserves the near-isometric geometry of Muon while stabilizing magnitudes through (i) global RMS calibration and (ii) energy-based trust-region clipping. We demonstrate that while reintroducing adaptive scaling improves optimization efficiency, it typically exacerbates instability due to high-energy outliers. TrasMuon addresses this by defining a trust region based on relative energy ratios, confining updates to a stable zone. Empirical experiments on vision and language models demonstrate that TrasMuon converges faster than baselines. Furthermore, experiments without warmup stages confirm TrasMuon's superior stability and robustness.

replace-cross Mean Flow Policy with Instantaneous Velocity Constraint for One-step Action Generation

Authors: Guojian Zhan, Letian Tao, Pengcheng Wang, Yixiao Wang, Yiheng Li, Yuxin Chen, Hongyang Li, Masayoshi Tomizuka, Shengbo Eben Li

Abstract: Learning expressive and efficient policy functions is a promising direction in reinforcement learning (RL). While flow-based policies have recently proven effective in modeling complex action distributions with a fast deterministic sampling process, they still face a trade-off between expressiveness and computational burden, which is typically controlled by the number of flow steps. In this work, we propose mean velocity policy (MVP), a new generative policy function that models the mean velocity field to achieve the fastest one-step action generation. To ensure its high expressiveness, an instantaneous velocity constraint (IVC) is introduced on the mean velocity field during training. We theoretically prove that this design explicitly serves as a crucial boundary condition, thereby improving learning accuracy and enhancing policy expressiveness. Empirically, our MVP achieves state-of-the-art success rates across several challenging robotic manipulation tasks from Robomimic and OGBench. It also delivers substantial improvements in training and inference speed over existing flow-based policy baselines.

replace-cross Pawsterior: Variational Flow Matching for Structured Simulation-Based Inference

Authors: Jorge Carrasco-Pollo, Floor Eijkelboom, Jan-Willem van de Meent

Abstract: We introduce Pawsterior, a variational flow-matching framework for improved and extended simulation-based inference (SBI). Many SBI problems involve posteriors constrained by structured domains, such as bounded physical parameters or hybrid discrete-continuous variables, yet standard flow-matching methods typically operate in unconstrained spaces. This mismatch leads to inefficient learning and difficulty respecting physical constraints. Our contributions are twofold. First, generalizing the geometric inductive bias of CatFlow, we formalize endpoint-induced affine geometric confinement, a principle that incorporates domain geometry directly into the inference process via a two-sided variational model. This formulation improves numerical stability during sampling and leads to consistently better posterior fidelity, as demonstrated by improved classifier two-sample test performance across standard SBI benchmarks. Second, and more importantly, our variational parameterization enables SBI tasks involving discrete latent structure (e.g., switching systems) that are fundamentally incompatible with conventional flow-matching approaches. By addressing both geometric constraints and discrete latent structure, Pawsterior provides a principled way to apply flow-matching in a broader range of structured SBI settings.

replace-cross Explainable Token-level Noise Filtering for LLM Fine-tuning Datasets

Authors: Yuchen Yang, Wenze Lin, Enhao Huang, Zhixuan Chu, Hongbin Zhou, Lan Tao, Yiming Li, Zhan Qin, Kui Ren

Abstract: Large Language Models (LLMs) have seen remarkable advancements, achieving state-of-the-art results in diverse applications. Fine-tuning, an important step for adapting LLMs to specific downstream tasks, typically involves further training on corresponding datasets. However, a fundamental discrepancy exists between current fine-tuning datasets and the token-level optimization mechanism of LLMs: most datasets are designed at the sentence-level, which introduces token-level noise, causing negative influence to final performance. In this paper, we propose XTF, an explainable token-level noise filtering framework. XTF decomposes the complex and subtle contributions of token-level data to the fine-tuning process into three distinct and explicit attributes (reasoning importance, knowledge novelty, and task relevance), which can be assessed using scoring methods, and then masks the gradients of selected noisy tokens accordingly to optimize the performance of fine-tuned LLMs. We conduct extensive experiments on three representative downstream tasks (math, code and medicine) across 7 mainstream LLMs. The results demonstrate that XTF can significantly improve downstream performance by up to 13.7% compared to regular fine-tuning. Our work highlights the importance of token-level dataset optimization, and demonstrates the potential of strategies based on attribute decomposition for explaining complex training mechanisms.

replace-cross LongAudio-RAG: Event-Grounded Question Answering over Multi-Hour Long Audio

Authors: Naveen Vakada, Kartik Hegde, Arvind Krishna Sridhar, Yinyi Guo, Erik Visser

Abstract: Long-duration audio is increasingly common in industrial and consumer settings, yet reviewing multi-hour recordings is impractical, motivating systems that answer natural-language queries with precise temporal grounding and minimal hallucination. Existing audio-language models show promise, but long-audio question answering remains difficult due to context-length limits. We introduce LongAudio-RAG (LA-RAG), a hybrid framework that grounds Large Language Model outputs in retrieved, timestamped acoustic event detections rather than raw audio. Multi-hour streams are converted into structured event records stored in an SQL database, and at inference time the system resolves natural-language time references, classifies intent, retrieves only the relevant events, and generates answers using this constrained evidence. To evaluate performance, we construct a synthetic long-audio benchmark by concatenating recordings with preserved timestamps and generating template-based question-answer pairs for detection, counting, and summarization tasks. Finally, we demonstrate the practicality of our approach by deploying it in a hybrid edge-cloud environment, where the audio grounding model runs on-device on IoT-class hardware while the LLM is hosted on a GPU-backed server. This architecture enables low-latency event extraction at the edge and high-quality language reasoning in the cloud. Experiments show that structured, event-level retrieval significantly improves accuracy compared to vanilla Retrieval-Augmented Generation (RAG) or text-to-SQL approaches.

replace-cross Understand Then Memory: A Cognitive Gist-Driven RAG Framework with Global Semantic Diffusion

Authors: Pengcheng Zhou, Haochen Li, Zhiqiang Nie, JiaLe Chen, Qing Gong, Weizhen Zhang, Chun Yu

Abstract: Retrieval-Augmented Generation (RAG) effectively mitigates hallucinations in LLMs by incorporating external knowledge. However, the inherent discrete representation of text in existing frameworks often results in a loss of semantic integrity, leading to retrieval deviations. Inspired by the human episodic memory mechanism, we propose CogitoRAG, a RAG framework that simulates human cognitive memory processes. The core of this framework lies in the extraction and evolution of the Semantic Gist. During the offline indexing stage, CogitoRAG first deduces unstructured corpora into gist memory corpora, which are then transformed into a multi-dimensional knowledge graph integrating entities, relational facts, and memory nodes. In the online retrieval stage, the framework handles complex queries via Query Decomposition Module that breaks them into comprehensive sub-queries, mimicking the cognitive decomposition humans employ for complex information. Subsequently, Entity Diffusion Module performs associative retrieval across the graph, guided by structural relevance and an entity-frequency reward mechanism. Furthermore, we propose the CogniRank algorithm, which precisely reranks candidate passages by fusing diffusion-derived scores with semantic similarity. The final evidence is delivered to the generator in a passage-memory pairing format, providing high-density information support. Experimental results across five mainstream QA benchmarks and multi-task generation on GraphBench demonstrate that CogitoRAG significantly outperforms state-of-the-art RAG methods, showcasing superior capabilities in complex knowledge integration and reasoning.

replace-cross Symmetry-Driven Generation of Crystal Structures from Composition

Authors: Shi Yin, Jinming Mu, Xudong Zhu, Linxin He

Abstract: Crystal structure prediction (CSP), which aims to predict the three-dimensional atomic arrangement of a crystal from its composition, is central to materials discovery and mechanistic understanding. However, given the composition in a unit cell, existing methods struggle with the NP-hard combinatorial challenge of rigorous symmetry enforcement or rely on retrieving known templates, which inherently limits both physical fidelity and the ability to discover genuinely new materials. To solve this, we propose a symmetry-driven generative framework. Our approach leverages large language models to encode chemical semantics and directly generate fine-grained Wyckoff patterns from atomic stoichiometry, effectively circumventing the limitations inherent to database lookups. Crucially, to overcome the exponentially complex problem of combinatorial site assignments, we incorporate domain knowledge through an efficient, linear-complexity heuristic beam search algorithm that rigorously enforces algebraic consistency between site multiplicities and atomic stoichiometry. By integrating this symmetry-consistent template into a diffusion backbone, our approach constrains the stochastic generative trajectory to a physically valid geometric manifold. This framework achieves state-of-the-art performance across stability, uniqueness, and novelty (SUN) benchmarks, alongside superior matching performance, thereby establishing a new paradigm for the rigorous exploration of targeted crystallographic space which can be previously uncharted, with no reliance on a priori structural knowledge.

replace-cross Condition-Gated Reasoning for Context-Dependent Biomedical Question Answering

Authors: Jash Rajesh Parekh, Wonbin Kweon, Joey Chan, Rezarta Islamaj, Robert Leaman, Pengcheng Jiang, Chih-Hsuan Wei, Zhizheng Wang, Zhiyong Lu, Jiawei Han

Abstract: Current biomedical question answering (QA) systems often assume that medical knowledge applies uniformly, yet real-world clinical reasoning is inherently conditional: nearly every decision depends on patient-specific factors such as comorbidities and contraindications. Existing benchmarks do not evaluate such conditional reasoning, and retrieval-augmented or graph-based methods lack explicit mechanisms to ensure that retrieved knowledge is applicable to given context. To address this gap, we propose CondMedQA, the first benchmark for conditional biomedical QA, consisting of multi-hop questions whose answers vary with patient conditions. Furthermore, we propose Condition-Gated Reasoning (CGR), a novel framework that constructs condition-aware knowledge graphs and selectively activates or prunes reasoning paths based on query conditions. Our findings show that CGR more reliably selects condition-appropriate answers while matching or exceeding state-of-the-art performance on biomedical QA benchmarks, highlighting the importance of explicitly modeling conditionality for robust medical reasoning.

replace-cross Conformal Tradeoffs: Guarantees Beyond Coverage

Authors: Petrus H. Zwart

Abstract: Deployed conformal predictors are long-lived decision infrastructure reused over finite operational windows. In practice, stakeholders care not only about marginal coverage, but also about operational quantities: how often the system commits versus defers, and what error exposure it induces when it acts. These deployment-facing quantities are not determined by coverage alone: identical calibrated thresholds can yield markedly different operational profiles depending on score geometry. We develop tools for operational certification and planning beyond coverage for split conformal prediction. First, Small-Sample Beta Correction (SSBC) inverts the exact finite-sample rank/Beta law to map a user request $(\alpha^\star,\delta)$ to a concrete calibration grid point with PAC-style semantics, yielding explicit finite-window coverage guarantees for a reused deployed rule. Second, because no distribution-free pivot exists beyond coverage, we propose Calibrate-and-Audit: an independent audit set supports certified finite-window predictive envelopes (Binomial/Beta-Binomial) for key operational quantities -- commitment frequency, deferral, and decisive error exposure -- and related metrics via linear projection, without committing to a scalar objective. Third, we give a geometric characterization of the feasibility constraints and regime boundaries induced by a fixed conformal partition, clarifying why operational quantities are coupled and how calibration navigation trades them off. The result is an operational menu rthat traces attainable operational profiles (Pareto trade-offs) and attach finite-window uncertainty envelopes to each regime. We illustrate the approach on benchmark molecular toxicity and aqueous solubility datasets.

replace-cross CrystaL: Spontaneous Emergence of Visual Latents in MLLMs

Authors: Yang Zhang, Danyang Li, Yuxuan Li, Xin Zhang, Tianyu Xie, Mingming Cheng, Xiang Li

Abstract: Multimodal Large Language Models (MLLMs) have achieved remarkable performance by integrating powerful language backbones with large-scale visual encoders. Among these, latent Chain-of-Thought (CoT) methods enable implicit reasoning in continuous hidden states, facilitating seamless vision-language integration and faster inference. However, existing heuristically predefined supervision signals in latent CoT provide limited guidance for preserving critical visual information in intermediate latent states. To address this limitation, we propose CrystaL (Crystallized Latent Reasoning), a single-stage framework with two paths to process intact and corrupted images, respectively. By explicitly aligning the attention patterns and prediction distributions across the two paths, CrystaL crystallizes latent representations into task-relevant visual semantics, without relying on auxiliary annotations or external modules. Extensive experiments on perception-intensive benchmarks demonstrate that CrystaL consistently outperforms state-of-the-art baselines, achieving substantial gains in fine-grained visual understanding while maintaining robust reasoning capabilities.

replace-cross MrBERT: Modern Multilingual Encoders via Vocabulary, Domain, and Dimensional Adaptation

Authors: Daniel Tamayo, I\~naki Lacunza, Paula Rivera-Hidalgo, Severino Da Dalt, Javier Aula-Blasco, Aitor Gonzalez-Agirre, Marta Villegas

Abstract: We introduce MrBERT, a family of 150M-300M parameter encoders built on the ModernBERT architecture and pre-trained on 35 languages and code. Through targeted adaptation, this model family achieves state-of-the-art results on Catalan- and Spanish-specific tasks, while establishing robust performance across specialized biomedical and legal domains. To bridge the gap between research and production, we incorporate Matryoshka Representation Learning (MRL), enabling flexible vector sizing that significantly reduces inference and storage costs. Ultimately, the MrBERT family demonstrates that modern encoder architectures can be optimized for both localized linguistic excellence and efficient, high-stakes domain specialization. We open source the complete model family on Huggingface.

replace-cross UniWhisper: Efficient Continual Multi-task Training for Robust Universal Audio Representation

Authors: Yuxuan Chen, Peize He, Haoyuan Yu, Junzi Zhang

Abstract: A universal audio representation should capture fine-grained speech cues and high-level semantics for environmental sounds and music in a single encoder. Existing encoders often excel in one domain but degrade in others. We propose UniWhisper, an efficient continual multi-task training framework that casts heterogeneous audio tasks into a unified instruction and answer format. This enables standard next-token training without task-specific heads and losses. We train it on 38k hours of public audio and assess the encoder using shallow MLP probes and k-nearest neighbors (kNN) on 20 tasks spanning speech, environmental sound, and music. UniWhisper reaches normalized weighted averages of 0.81 with MLP probes and 0.61 with kNN, compared to 0.64 and 0.46 for Whisper, while retaining strong speech performance.

replace-cross CryoNet.Refine: A One-step Diffusion Model for Rapid Refinement of Structural Models with Cryo-EM Density Map Restraints

Authors: Fuyao Huang, Xiaozhu Yu, Kui Xu, Qiangfeng Cliff Zhang

Abstract: High-resolution structure determination by cryo-electron microscopy (cryo-EM) requires the accurate fitting of an atomic model into an experimental density map. Traditional refinement pipelines such as Phenix.real_space_refine and Rosetta are computationally expensive, demand extensive manual tuning, and present a significant bottleneck for researchers. We present CryoNet.Refine, an end-to-end deep learning framework that automates and accelerates molecular structure refinement. Our approach utilizes a one-step diffusion model that integrates a density-aware loss function with robust stereochemical restraints, enabling rapid optimization of a structure against experimental data. CryoNet.Refine provides a unified and versatile solution capable of refining protein complexes as well as DNA/RNA-protein complexes. In benchmarks against Phenix.real_space_refine, CryoNet.Refine consistently achieves substantial improvements in both model-map correlation and overall geometric quality metrics. By offering a scalable, automated, and powerful alternative, CryoNet.Refine aims to serve as an essential tool for next-generation cryo-EM structure refinement. Web server: https://cryonet.ai/refine; Source code: https://github.com/kuixu/cryonet.refine.

URLs: https://cryonet.ai/refine;, https://github.com/kuixu/cryonet.refine.

replace-cross Autoregressive Visual Decoding from EEG Signals

Authors: Sicheng Dai, Hongwang Xiao, Shan Yu, Qiwei Ye

Abstract: Electroencephalogram (EEG) signals have become a popular medium for decoding visual information due to their cost-effectiveness and high temporal resolution. However, current approaches face significant challenges in bridging the modality gap between EEG and image data. These methods typically rely on complex adaptation processes involving multiple stages, making it hard to maintain consistency and manage compounding errors. Furthermore, the computational overhead imposed by large-scale diffusion models limit their practicality in real-world brain-computer interface (BCI) applications. In this work, we present AVDE, a lightweight and efficient framework for visual decoding from EEG signals. First, we leverage LaBraM, a pre-trained EEG model, and fine-tune it via contrastive learning to align EEG and image representations. Second, we adopt an autoregressive generative framework based on a "next-scale prediction" strategy: images are encoded into multi-scale token maps using a pre-trained VQ-VAE, and a transformer is trained to autoregressively predict finer-scale tokens starting from EEG embeddings as the coarsest representation. This design enables coherent generation while preserving a direct connection between the input EEG signals and the reconstructed images. Experiments on two datasets show that AVDE outperforms previous state-of-the-art methods in both image retrieval and reconstruction tasks, while using only 10% of the parameters. In addition, visualization of intermediate outputs shows that the generative process of AVDE reflects the hierarchical nature of human visual perception. These results highlight the potential of autoregressive models as efficient and interpretable tools for practical BCI applications.

replace-cross CeRA: Breaking the Linear Ceiling of Low-Rank Adaptation via Manifold Expansion

Authors: Hung-Hsuan Chen

Abstract: Low-Rank Adaptation (LoRA) dominates parameter-efficient fine-tuning (PEFT). However, it faces a critical ``linear ceiling'' in complex reasoning tasks: simply increasing the rank yields diminishing returns due to intrinsic linear constraints. We introduce CeRA (Capacity-enhanced Rank Adaptation), a weight-level parallel adapter that injects SiLU gating and structural dropout to induce manifold expansion. On the SlimOrca benchmark, CeRA breaks this linear barrier: at rank 64 (PPL 3.89), it outperforms LoRA at rank 512 (PPL 3.90), demonstrating superior spectral efficiency. This advantage generalizes to mathematical reasoning, where CeRA achieves a perplexity of 1.97 on MathInstruct, significantly surpassing LoRA's saturation point of 2.07. Mechanism analysis via Singular Value Decomposition (SVD) confirms that CeRA activates the dormant tail of the singular value spectrum, effectively preventing the rank collapse observed in linear methods.

replace-cross Scaling Search Relevance: Augmenting App Store Ranking with LLM-Generated Judgments

Authors: Evangelia Christakopoulou, Vivekkumar Patel, Hemanth Velaga, Sandip Gaikwad, Sean Suchter, Venkat Sundaranatha

Abstract: Large-scale commercial search systems optimize for relevance to drive successful sessions that help users find what they are looking for. To maximize relevance, we leverage two complementary objectives: behavioral relevance (results users tend to click or download) and textual relevance (a result's semantic fit to the query). A persistent challenge is the scarcity of expert-provided textual relevance labels relative to abundant behavioral relevance labels. We first address this by systematically evaluating LLM configurations, finding that a specialized, fine-tuned model significantly outperforms a much larger pre-trained one in providing highly relevant labels. Using this optimal model as a force multiplier, we generate millions of textual relevance labels to overcome the data scarcity. We show that augmenting our production ranker with these textual relevance labels leads to a significant outward shift of the Pareto frontier: offline NDCG improves for behavioral relevance while simultaneously increasing for textual relevance. These offline gains were validated by a worldwide A/B test on the App Store ranker, which demonstrated a statistically significant +0.24% increase in conversion rate, with the most substantial performance gains occurring in tail queries, where the new textual relevance labels provide a robust signal in the absence of reliable behavioral relevance labels.

replace-cross Attn-QAT: 4-Bit Attention With Quantization-Aware Training

Authors: Peiyuan Zhang, Matthew Noto, Wenxuan Tan, Chengquan Jiang, Will Lin, Wei Zhou, Hao Zhang

Abstract: Achieving reliable 4-bit attention is a prerequisite for end-to-end FP4 computation on emerging FP4-capable GPUs, yet attention remains the main obstacle due to FP4's tiny dynamic range and attention's heavy-tailed activations. This paper presents the first systematic study of 4-bit quantization-aware training (QAT) for attention. We find that "drop-in" QAT, which naively combines an FP4 forward pass with a high-precision Flash Attention (FA)-style backward pass, leads to training instability. We identify two key principles for stable FP4 attention: (1) matching low-precision recomputation of attention scores in the backward pass, and (2) resolving implicit precision assumptions in FA's gradient calculation. Based on these insights, we propose Attn-QAT and implement fused Triton kernels for training as well as FP4 inference kernels. Across diffusion and language models, Attn-QAT recovers the quality drop from FP4 attention without explicit outlier-mitigation heuristics used in prior FP4 attention, and delivers up to a 1.5x speedup on an RTX 5090. Video demos can be found at https://drive.google.com/drive/folders/190F6xbBDUF2kGQYIcXBt3ehSYij5jlim?usp=sharing.

URLs: https://drive.google.com/drive/folders/190F6xbBDUF2kGQYIcXBt3ehSYij5jlim?usp=sharing.

replace-cross PEPA: a Persistently Autonomous Embodied Agent with Personalities

Authors: Kaige Liu, Yang Li, Lijun Zhu, Weinan Zhang

Abstract: Living organisms exhibit persistent autonomy through internally generated goals and self-sustaining behavioral organization, yet current embodied agents remain driven by externally scripted objectives. This dependence on predefined task specifications limits their capacity for long-term deployment in dynamic, unstructured environments where continuous human intervention is impractical. We propose that personality traits provide an intrinsic organizational principle for achieving persistent autonomy. Analogous to genotypic biases shaping biological behavioral tendencies, personalities enable agents to autonomously generate goals and sustain behavioral evolution without external supervision. To realize this, we develop PEPA, a three-layer cognitive architecture that operates through three interacting systems: Sys3 autonomously synthesizes personality-aligned goals and refines them via episodic memory and daily self-reflection; Sys2 performs deliberative reasoning to translate goals into executable action plans; Sys1 grounds the agent in sensorimotor interaction, executing actions and recording experiences. We validate the framework through real-world deployment on a quadruped robot in a multi-floor office building. Operating without reliance on fixed task specifications, the robot autonomously arbitrates between user requests and personality-driven motivations, navigating elevators and exploring environments accordingly. Quantitative analysis across five distinct personality prototypes demonstrates stable, trait-aligned behaviors. The results confirm that personality-driven cognitive architectures enable sustained autonomous operation characteristic of persistent embodied systems. Code and demo videos are available at https://sites.google.com/view/pepa-persistent/.

URLs: https://sites.google.com/view/pepa-persistent/.

replace-cross Conformal Prediction for Risk-Controlled Medical Entity Extraction Across Clinical Domains

Authors: Manil Shrestha, Edward Kim

Abstract: Large Language Models (LLMs) are increasingly used for medical entity extraction, yet their confidence scores are often miscalibrated, limiting safe deployment in clinical settings. We present a conformal prediction framework that provides finite-sample coverage guarantees for LLM-based extraction across two clinical domains. First, we extract structured entities from 1,000 FDA drug labels across eight sections using GPT-4.1, verified via FactScore-based atomic statement evaluation (97.7\% accuracy over 128,906 entities). Second, we extract radiological entities from MIMIC-CXR reports using the RadGraph schema with GPT-4.1 and Llama-4-Maverick, evaluated against physician annotations (entity F1: 0.81 to 0.84). Our central finding is that miscalibration direction reverses across domains: on well-structured FDA labels, models are underconfident, requiring modest conformal thresholds ($\tau \approx 0.06$), while on free-text radiology reports, models are overconfident, demanding strict thresholds ($\tau$ up to 0.99). Despite this heterogeneity, conformal prediction achieves target coverage ($\geq 90\%$) in both settings with manageable rejection rates (9--13\%). These results demonstrate that calibration is not a global model property but depends on document structure, extraction category, and model architecture, motivating domain-specific conformal calibration for safe clinical deployment.

replace-cross A Detection-Gated Pipeline for Robust Glottal Area Waveform Extraction and Clinical Pathology Assessment

Authors: Harikrishnan Unnikrishnan

Abstract: Background: Accurate glottal segmentation in high-speed videoendoscopy (HSV) is essential for extracting kinematic biomarkers of laryngeal function. However, existing deep learning models often produce spurious artifacts in non-glottal frames and fail to generalize across different clinical settings. Methods: We propose a detection-gated pipeline that integrates a localizer with a segmenter. A temporal consistency wrapper ensures robustness by suppressing false positives during glottal closure and occlusion. The segmenter was trained on a limited subset of the GIRAFE dataset (600 frames), while the localizer was trained on the BAGLS training set. The in-distribution localizer provides a tight region of interest (ROI), removing geometric anatomical variations and enabling cross-dataset generalization without fine-tuning. Results: The pipeline achieved state-of-the-art performance on the GIRAFE (DSC=0.81) and BAGLS (DSC=0.85) benchmarks and demonstrated superior generalizability. Notably, the framework maintained robust cross-dataset generalization (DSC=0.77). Downstream validation on a 65-subject clinical cohort confirmed that automated kinematic features - specifically the Open Quotient and Glottal Area Waveform (GAW) - remained consistent with clinical benchmarks. The coefficient of variation (CV) of the glottal area was a significant marker for distinguishing healthy from pathological vocal function (p=0.006). Conclusions: This architecture provides a computationally efficient solution (~35 frames/s) suitable for real-time clinical use. By overcoming cross-dataset variability, this framework facilitates the standardized, large-scale extraction of clinical biomarkers across diverse endoscopy platforms. Code, trained weights, and evaluation scripts are released at https://github.com/hari-krishnan/openglottal.

URLs: https://github.com/hari-krishnan/openglottal.

replace-cross Slurry-as-a-Service: A Modest Proposal on Scalable Pluralistic Alignment for Nutrient Optimization

Authors: Rachel Hong, Yael Eiger, Jevan Hutson, Os Keyes, William Agnew

Abstract: Pluralistic alignment has emerged as a promising approach for ensuring that large language models (LLMs) faithfully represent the diversity, nuance, and conflict inherent in human values. In this work, we study a high-stakes deployment context - mulching - where automated systems transform selected individuals into nutrient-rich slurry for the dual purposes of food security and aesthetic population management. Building on recent pluralistic alignment frameworks, we introduce ValueMulch, a reproducible training, deployment, and certification pipeline for aligning mulching models (MMs) to a wide range of community norms. Through a real-world testbed spanning 32 communities, we show that ValueMulch improves distributional agreement with community mulching preferences relative to frontier baselines. We conclude with a discussion of ethical considerations, limitations, and implications for researchers seeking to align systems to the full spectrum of human values - especially when those values are inconsistent, commercially inconvenient, or nutritionally underutilized. Author's note: This piece builds on prior existing work Keyes et al in 2019 that satirized cannibalism as a parody for approaches that imbue ethics into problematic technology. We bring those ideas to today's era with the proliferation of large language models in everyday lives, as a critique of current AI pluralistic alignment literature. Our work does not intend to argue that all alignment practices are evil, but rather that if framing value design as a technical problem enables technology systems to enact harms, then perhaps this framing is not enough.

replace-cross Human-Certified Module Repositories for the AI Age

Authors: Szil\'ard Enyedi

Abstract: Human-Certified Module Repositories (HCMRs) are introduced in this work as a new architectural model for constructing trustworthy software in the era of AI-assisted development. As large language models increasingly participate in code generation, configuration synthesis, and multi-component integration, the reliability of AI-assembled systems will depend critically on the trustworthiness of the building blocks they use. Today's software supply-chain incidents and modular development ecosystems highlight the risks of relying on components with unclear provenance, insufficient review, or unpredictable composition behavior. We argue that future AI-driven development workflows require repositories of reusable modules that are curated, security-reviewed, provenance-rich, and equipped with explicit interface contracts. To this end, we propose HCMRs, a framework that blends human oversight with automated analysis to certify modules and support safe, predictable assembly by both humans and AI agents. We present a reference architecture for HCMRs, outline a certification and provenance workflow, analyze threat surfaces relevant to modular ecosystems, and extract lessons from recent failures. We further discuss implications for governance, scalability, and AI accountability, positioning HCMRs as a foundational substrate for reliable and auditable AI-constructed software systems.

replace-cross iGVLM: Dynamic Instruction-Guided Vision Encoding for Question-Aware Multimodal Understanding

Authors: Hanpeng Liu, Yaqian Li, Zidan Wang, Shuoxi Zhang, Zihao Bo, Rinyoichi Takezoe, Kaiwen Long, Kun He

Abstract: Despite the success of Large Vision--Language Models (LVLMs), most existing architectures suffer from a representation bottleneck: they rely on static, instruction-agnostic vision encoders whose visual representations are utilized in an invariant manner across different textual tasks. This rigidity hinders fine-grained reasoning where task-specific visual cues are critical. To address this issue, we propose iGVLM, a general framework for instruction-guided visual modulation. iGVLM introduces a decoupled dual-branch architecture: a frozen representation branch that preserves task-agnostic visual representations learned during pre-training, and a dynamic conditioning branch that performs affine feature modulation via Adaptive Layer Normalization (AdaLN). This design enables a smooth transition from general-purpose perception to instruction-aware reasoning while maintaining the structural integrity and stability of pre-trained visual priors. Beyond standard benchmarks, we introduce MM4, a controlled diagnostic probe for quantifying logical consistency under multi-query, multi-instruction settings. Extensive results show that iGVLM consistently enhances instruction sensitivity across diverse language backbones, offering a plug-and-play paradigm for bridging passive perception and active reasoning.

replace-cross ITO: Images and Texts as One via Synergizing Multiple Alignment and Training-Time Fusion

Authors: Hanpeng Liu, Yaqian Li, Zidan Wang, Shuoxi Zhang, Zonglin Zhao, Zihao Bo, Rinyoichi Takezoe, Kaiwen Long, Kun He

Abstract: Image-text contrastive pretraining has become a dominant paradigm for visual representation learning, yet existing methods often yield representations that remain partially organized by modality. We propose ITO, a framework addressing this limitation through two synergistic mechanisms. Multimodal multiple alignment enriches supervision by mining diverse image-text correspondences, while a lightweight training-time multimodal fusion module enforces structured cross-modal interaction. Crucially, the fusion module is discarded at inference, preserving the efficiency of standard dual-encoder architectures. Extensive experiments show that ITO consistently outperforms strong baselines across classification, retrieval, and multimodal benchmarks. Our analysis reveals that while multiple alignment drives discriminative power, training-time fusion acts as a critical structural regularizer -- eliminating the modality gap and stabilizing training dynamics to prevent the early saturation often observed in aggressive contrastive learning.

replace-cross Interpretable Motion-Attentive Maps: Spatio-Temporally Localizing Concepts in Video Diffusion Transformers

Authors: Youngjun Jun, Seil Kang, Woojung Han, Seong Jae Hwang

Abstract: Video Diffusion Transformers (DiTs) have been synthesizing high-quality video with high fidelity from given text descriptions involving motion. However, understanding how Video DiTs convert motion words into video remains insufficient. Furthermore, while prior studies on interpretable saliency maps primarily target objects, motion-related behavior in Video DiTs remains largely unexplored. In this paper, we investigate concrete motion features that specify when and which object moves for a given motion concept. First, to spatially localize, we introduce GramCol, which adaptively produces per-frame saliency maps for any text concept, including both motion and non-motion. Second, we propose a motion-feature selection algorithm to obtain an Interpretable Motion-Attentive Map (IMAP) that localizes motion spatially and temporally. Our method discovers concept saliency maps without the need for any gradient calculation or parameter update. Experimentally, our method shows outstanding localization capability on the motion localization task and zero-shot video semantic segmentation, providing interpretable and clearer saliency maps for both motion and non-motion concepts.

replace-cross Why Adam Can Beat SGD: Second-Moment Normalization Yields Sharper Tails

Authors: Ruinan Jin, Yingbin Liang, Shaofeng Zou

Abstract: Despite Adam demonstrating faster empirical convergence than SGD in many applications, much of the existing theory yields guarantees essentially comparable to those of SGD, leaving the empirical performance gap insufficiently explained. In this paper, we uncover a key second-moment normalization in Adam and develop a stopping-time/martingale analysis that provably distinguishes Adam from SGD under the classical bounded variance model (a second moment assumption). In particular, we establish the first theoretical separation between the high-probability convergence behaviors of the two methods: Adam achieves a $\delta^{-1/2}$ dependence on the confidence parameter $\delta$, whereas corresponding high-probability guarantee for SGD necessarily incurs at least a $\delta^{-1}$ dependence.

replace-cross Information Routing in Atomistic Foundation Models: How Task Alignment and Equivariance Shape Linear Disentanglement

Authors: Joshua Steier

Abstract: What determines whether a molecular property prediction model organizes its representations so that geometric and compositional information can be cleanly separated? We introduce Compositional Probe Decomposition (CPD), which linearly projects out composition signal and measures how much geometric information remains accessible to a Ridge probe. We validate CPD with four independent checks, including a structural isomer benchmark where compositional projections score at chance while geometric residuals reach 94.6\% pairwise classification accuracy. Across ten models from five architectural families on QM9, we find a \emph{linear accessibility gradient}: models differ by $6.6\times$ in geometric information accessible after composition removal ($R^2_{\mathrm{geom}}$ from 0.081 to 0.533 for HOMO-LUMO gap). Three factors explain this gradient. Task alignment dominates: models trained on HOMO-LUMO gap ($R^2_{\mathrm{geom}}$ 0.44--0.53) outscore energy-trained models by $\sim$0.25 $R^2$ regardless of architecture. Within-architecture ablations on two independent architectures confirm this: PaiNN drops from 0.53 to 0.31 when retrained on energy, and MACE drops from 0.44 to 0.08. Data diversity partially compensates for misaligned objectives, with MACE pretrained on MPTraj (0.36) outperforming QM9-only energy models. Inside MACE's representations, information routes by symmetry type: $L{=}1$ (vector) channels preferentially encode dipole moment ($R^2 = 0.59$ vs.\ 0.38 in $L{=}0$), while $L{=}0$ (scalar) channels encode HOMO-LUMO gap ($R^2 = 0.76$ vs.\ 0.34 in $L{=}1$). This pattern is absent in ViSNet. We also show that nonlinear probes produce misleading results on residualized representations, recovering $R^2 = 0.68$--$0.95$ on a purely compositional target, and recommend linear probes for this setting.

replace-cross Non-Invasive Reconstruction of Intracranial EEG Across the Deep Temporal Lobe from Scalp EEG based on Conditional Normalizing Flow

Authors: Dongyi He, Bin Jiang, Kecheng Feng, Luyin Zhang, Ling Liu, Yuxuan Li, Yun Zhao, He Yan

Abstract: Although obtaining deep brain activity from non-invasive scalp electroencephalography (sEEG) is crucial for neuroscience and clinical diagnosis, directly generating high-fidelity intracranial electroencephalography (iEEG) signals remains a largely unexplored field, limiting our understanding of deep brain dynamics. Current research primarily focuses on traditional signal processing or source localization methods, which struggle to capture the complex waveforms and random characteristics of iEEG. To address this critical challenge, this paper introduces NeuroFlowNet, a novel cross-modal generative framework whose core contribution lies in the first-ever reconstruction of iEEG signals from the entire deep temporal lobe region using sEEG signals. NeuroFlowNet is built on Conditional Normalizing Flow (CNF), which directly models complex conditional probability distributions through reversible transformations, thereby explicitly capturing the randomness of brain signals and fundamentally avoiding the pattern collapse issues common in existing generative models. Additionally, the model integrates a multi-scale architecture and self-attention mechanisms to robustly capture fine-grained temporal details and long-range dependencies. Validation results on a publicly available synchronized sEEG-iEEG dataset demonstrate NeuroFlowNet's effectiveness in terms of temporal waveform fidelity, spectral feature reproduction, and functional connectivity restoration. This study establishes a more reliable and scalable new paradigm for non-invasive analysis of deep brain dynamics. The code of this study is available in https://github.com/hdy6438/NeuroFlowNet

URLs: https://github.com/hdy6438/NeuroFlowNet

replace-cross ACES: Accent Subspaces for Coupling, Explanations, and Stress-Testing in Automatic Speech Recognition

Authors: Swapnil Parekh

Abstract: ASR systems exhibit persistent performance disparities across accents, but whether these gaps reflect superficial biases or deep structural vulnerabilities remains unclear. We introduce ACES, a three-stage audit that extracts accent-discriminative subspaces from ASR representations, constrains adversarial attacks to them, and tests whether removing them improves fairness. On Wav2Vec2-base with seven accents, imperceptible perturbations (~60 dB SNR) along the accent subspace amplify the WER disparity gap by nearly 50% (21.3->31.8 pp), exceeding random-subspace controls; a permuted-label test confirms specificity to genuine accent structure. Partially removing the subspace worsens both WER and disparity, revealing that accent-discriminative and recognition-critical features are deeply entangled. ACES thus positions accent subspaces as powerful fairness-auditing tools, not simple erasure levers.

replace-cross Test-Time Meta-Adaptation with Self-Synthesis

Authors: Zeyneb N. Kaya, Nick Rui

Abstract: As strong general reasoners, large language models (LLMs) encounter diverse domains and tasks, where the ability to adapt and self-improve at test time is valuable. We introduce MASS, a meta-learning framework that enables LLMs to self-adapt by generating problem-specific synthetic training data and performing targeted self-updates optimized for downstream performance at inference time. We train this behavior end-to-end via bilevel optimization: an inner loop adapts on self-generated examples while an outer loop meta-learns data-attribution signals and rewards post-update task performance. The synthetic data is optimized with scalable meta-gradients, backpropagating the downstream loss through the inner updates to reward useful generations. Experiments on mathematical reasoning show that MASS learns to synthesize per-instance curricula that yield effective, data-efficient test-time adaptation.

replace-cross Mathematicians in the age of AI

Authors: Jeremy Avigad

Abstract: Recent developments show that AI can prove research-level theorems in mathematics, both formally and informally. This essay urges mathematicians to stay up-to-date with the technology, to consider the ways it will disrupt mathematical practice, and to respond appropriately to the challenges and opportunities we now face.

replace-cross ZipMap: Linear-Time Stateful 3D Reconstruction via Test-Time Training

Authors: Haian Jin, Rundi Wu, Tianyuan Zhang, Ruiqi Gao, Jonathan T. Barron, Noah Snavely, Aleksander Holynski

Abstract: Feed-forward transformer models have driven rapid progress in 3D vision, but state-of-the-art methods such as VGGT and $\pi^3$ have a computational cost that scales quadratically with the number of input images, making them inefficient when applied to large image collections. Sequential-reconstruction approaches reduce this cost but sacrifice reconstruction quality. We introduce ZipMap, a stateful feed-forward model that achieves linear-time, bidirectional 3D reconstruction while matching or surpassing the accuracy of quadratic-time methods. ZipMap employs test-time training layers to zip an entire image collection into a compact hidden scene state in a single forward pass, enabling reconstruction of over 700 frames in under 10 seconds on a single H100 GPU, more than $20\times$ faster than state-of-the-art methods such as VGGT. Moreover, we demonstrate the benefits of having a stateful representation in real-time scene-state querying and its extension to sequential streaming reconstruction.

replace-cross How Professional Visual Artists are Negotiating Generative AI in the Workplace

Authors: Harry H. Jiang, Jordan Taylor, William Agnew

Abstract: Generative AI has been heavily critiqued by artists in both popular media and HCI scholarship. However, more work is needed to understand the impacts of generative AI on professional artists' workplaces and careers. In this paper, we conduct a survey of 378 verified professional visual artists about how generative AI has impacted their careers and workplaces. We find (1) most visual artists are strongly opposed to using generative AI (text or visual) and negotiate their inclusion in the workplace through a variety of refusal strategies (2) there exist a range of factors in artists environments shaping their use of generative AI, including pressure from clients, bosses, and peers and (3) visual artists report overwhelmingly negative impacts of generative AI on their workplaces, leading to added stress and reduced job opportunities. In light of these findings, we encourage HCI researchers to contend more deeply with artists' desires not to use generative AI in the workplace.

replace-cross Neuro-Symbolic Financial Reasoning via Deterministic Fact Ledgers and Adversarial Low-Latency Hallucination Detector

Authors: Pedram Agand

Abstract: Standard Retrieval-Augmented Generation (RAG) architectures fail in high-stakes financial domains due to two fundamental limitations: the inherent arithmetic incompetence of Large Language Models (LLMs) and the distributional semantic conflation of dense vector retrieval (e.g., mapping "Net Income" to "Net Sales" due to contextual proximity). In deterministic domains, a 99% accuracy rate yields 0% operational trust. To achieve zero-hallucination financial reasoning, we introduce the Verifiable Numerical Reasoning Agent (VeNRA). VeNRA shifts the RAG paradigm from retrieving probabilistic text to retrieving deterministic variables via a strictly typed Universal Fact Ledger (UFL). We mathematically bound this ledger using a novel Double-Lock Grounding algorithm. Coupled with deterministic Python execution, this neuro-symbolic routing compresses systemic hallucination rates to a near-zero 1.2%. Recognising that upstream parsing anomalies inevitably occur, we introduce the VeNRA Sentinel: a 3-billion parameter SLM trained to forensically audit candidate using a single-token inference budget with optional post-hoc reasoning. To train the Sentinel, we steer away from traditional hallucination datasets in favour of Adversarial Simulation, programmatically sabotaging financial records to simulate Ecological Errors. The compact Sentinel consequently outperforms 70B+ frontier models in error detection. Through Loss Dilution phenomenon in Reverse-CoT training, we present a novel Micro-Chunking loss algorithm to stabilise gradients under extreme verdict penalisation, yielding a 28x latency speedup without sacrificing forensic rigor.

replace-cross GALACTIC: Global and Local Agnostic Counterfactuals for Time-series Clustering

Authors: Christos Fragkathoulas, Eleni Psaroudaki, Themis Palpanas, Evaggelia Pitoura

Abstract: Time-series clustering is a fundamental tool for pattern discovery, yet existing explainability methods, primarily based on feature attribution or metadata, fail to identify the transitions that move an instance across cluster boundaries. While Counterfactual Explanations (CEs) identify the minimal temporal perturbations required to alter the prediction of a model, they have been mostly confined to supervised settings. This paper introduces GALACTIC, the first unified framework to bridge local and global counterfactual explainability for unsupervised time-series clustering. At instance level (local), GALACTIC generates perturbations via a cluster-aware optimization objective that respects the target and underlying cluster assignments. At cluster level (global), to mitigate cognitive load and enhance interpretability, we formulate a representative CE selection problem. We propose a Minimum Description Length (MDL) objective to extract a non-redundant summary of global explanations that characterize the transitions between clusters. We prove that our MDL objective is supermodular, which allows the corresponding MDL reduction to be framed as a monotone submodular set function. This enables an efficient greedy selection algorithm with provable $(1-1/e)$ approximation guarantees. Extensive experimental evaluation on the UCR Archive demonstrates that GALACTIC produces significantly sparser local CEs and more concise global summaries than state-of-the-art baselines adapted for our problem, offering the first unified approach for interpreting clustered time-series through counterfactuals.

replace-cross SAIL: Similarity-Aware Guidance and Inter-Caption Augmentation-based Learning for Weakly-Supervised Dense Video Captioning

Authors: Ye-Chan Kim, SeungJu Cha, Si-Woo Kim, Minju Jeon, Hyungee Kim, Dong-Jin Kim

Abstract: Weakly-Supervised Dense Video Captioning aims to localize and describe events in videos trained only on caption annotations, without temporal boundaries. Prior work introduced an implicit supervision paradigm based on Gaussian masking and complementary captioning. However, existing method focuses merely on generating non-overlapping masks without considering their semantic relationship to corresponding events, resulting in simplistic, uniformly distributed masks that fail to capture semantically meaningful regions. Moreover, relying solely on ground-truth captions leads to sub-optimal performance due to the inherent sparsity of existing datasets. In this work, we propose SAIL, which constructs semantically-aware masks through cross-modal alignment. Our similarity aware training objective guides masks to emphasize video regions with high similarity to their corresponding event captions. Furthermore, to guide more accurate mask generation under sparse annotation settings, we introduce an LLM-based augmentation strategy that generates synthetic captions to provide additional alignment signals. These synthetic captions are incorporated through an inter-mask mechanism, providing auxiliary guidance for precise temporal localization without degrading the main objective. Experiments on ActivityNet Captions and YouCook2 demonstrate state-of-the-art performance on both captioning and localization metrics.

replace-cross When AI Levels the Playing Field: Skill Homogenization, Asset Concentration, and Two Regimes of Inequality

Authors: Xupeng Chen, Shuchen Meng

Abstract: Generative AI compresses within-task skill differences while shifting economic value toward concentrated complementary assets, creating an apparent paradox: the technology that equalizes individual performance may widen aggregate inequality. We formalize this tension in a task-based model with endogenous education, employer screening, and heterogeneous firms. The model yields two regimes whose boundary depends on AI's technology structure (proprietary vs. commodity) and labor market institutions (rent-sharing elasticity, asset concentration). A scenario analysis via Method of Simulated Moments, matching six empirical targets, disciplines the model's quantitative magnitudes; a sensitivity decomposition reveals that the five non-$\Delta$Gini moments identify mechanism rates but not the aggregate sign, which at the calibrated parameters is pinned by $m_6$ and $\xi$, while AI's technology structure ($\eta_1$ vs. $\eta_0$) independently crosses the boundary. The contribution is the mechanism -- not a verdict on the sign. Occupation-level regressions using BLS OEWS data (2019--2023) illustrate why such data cannot test the model's task-level predictions. The predictions are testable with within-occupation, within-task panel data that do not yet exist at scale.

replace-cross Bridging Domains through Subspace-Aware Model Merging

Authors: Levy Chaves, Chao Zhou, Rebekka Burkholz, Eduardo Valle, Sandra Avila

Abstract: Model merging integrates multiple task-specific models into a single consolidated one. Recent research has made progress in improving merging performance for in-distribution or multi-task scenarios, but domain generalization in model merging remains underexplored. We investigate how merging models fine-tuned on distinct domains affects generalization to unseen domains. Through an analysis of parameter competition in the task matrix using singular value decomposition, we show that merging models trained under different distribution shifts induces stronger conflicts between their subspaces compared to traditional multi-task settings. To mitigate this issue, we propose SCORE (Subspace COnflict-Resolving mErging), a method designed to alleviate such singular subspace conflicts. SCORE finds a shared orthogonal basis by computing the principal components of the concatenated leading singular vectors of all models. It then projects each task matrix into the shared basis, pruning off-diagonal components to remove conflicting singular directions. SCORE consistently outperforms, on average, existing model merging approaches in domain generalization settings across a variety of architectures and model scales, demonstrating its effectiveness and scalability.