new WorkflowGen:an adaptive workflow generation mechanism driven by trajectory experience

Authors: Ruocan Wei, Shufeng Wang, Ziwei Shi

Abstract: Large language model (LLM) agents often suffer from high reasoning overhead, excessive token consumption, unstable execution, and inability to reuse past experiences in complex tasks like business queries, tool use, and workflow orchestration. Traditional methods generate workflows from scratch for every query, leading to high cost, slow response, and poor robustness. We propose WorkflowGen, an adaptive, trajectory experience-driven framework for automatic workflow generation that reduces token usage and improves efficiency and success rate. Early in execution, WorkflowGen captures full trajectories and extracts reusable knowledge at both node and workflow levels, including error fingerprints, optimal tool mappings, parameter schemas, execution paths, and exception-avoidance strategies. It then employs a closed-loop mechanism that performs lightweight generation only on variable nodes via trajectory rewriting, experience updating, and template induction. A three-tier adaptive routing strategy dynamically selects among direct reuse, rewriting-based generation, and full initialization based on semantic similarity to historical queries. Without large annotated datasets, we qualitatively compare WorkflowGen against real-time planning, static single trajectory, and basic in-context learning baselines. Our method reduces token consumption by over 40 percent compared to real-time planning, improves success rate by 20 percent on medium-similarity queries through proactive error avoidance and adaptive fallback, and enhances deployability via modular, traceable experiences and cross-scenario adaptability. WorkflowGen achieves a practical balance of efficiency, robustness, and interpretability, addressing key limitations of existing approaches.

new Transparent Screening for LLM Inference and Training Impacts

Authors: Arnault Pachot, Thierry Petit

Abstract: This paper presents a transparent screening framework for estimating inference and training impacts of current large language models under limited observability. The framework converts natural-language application descriptions into bounded environmental estimates and supports a comparative online observatory of current market models. Rather than claiming direct measurement for opaque proprietary services, it provides an auditable, source-linked proxy methodology designed to improve comparability, transparency, and reproducibility.

new Accelerating PayPal's Commerce Agent with Speculative Decoding: An Empirical Study on EAGLE3 with Fine-Tuned Nemotron Models

Authors: Ally Qin, Jian Wan, Sarat Mudunuri, Srinivasan Manoharan

Abstract: We evaluate speculative decoding with EAGLE3 as an inference-time optimization for PayPal's Commerce Agent, powered by a fine-tuned llama3.1-nemotron-nano-8B-v1 model. Building on prior work (NEMO-4-PAYPAL) that reduced latency and cost through domain-specific fine-tuning, we benchmark EAGLE3 via vLLM against NVIDIA NIM on identical 2xH100 hardware across 40 configurations spanning speculative token counts (gamma=3, gamma=5), concurrency levels (1-32), and sampling temperatures (0, 0.5). Key findings: (1) gamma=3 achieves 22-49% throughput improvement and 18-33% latency reduction at zero additional hardware cost; (2) acceptance rates remain stable at approximately 35.5% for gamma=3 across all conditions; (3) gamma=5 yields diminishing returns (approximately 25% acceptance rate); (4) LLM-as-Judge evaluation confirms fully preserved output quality; and (5) speculative decoding on a single H100 matches or exceeds NIM on two H100s, enabling 50% GPU cost reduction.

new On-Meter Graph Machine Learning: A Case Study of PV Power Forecasting for Grid Edge Intelligence

Authors: Jian Huang, Zixiang Ming, Yongli Zhu, Linna Xu

Abstract: This paper presents a detailed study of how graph neural networks can be used on edge intelligent meters in a microgrid to forecast photovoltaic power generation. The problem background and the adopted technologies are introduced, including ONNX and ONNX Runtime. The hardware and software specifications of the smart meter are also briefly described. Then, the paper focuses on the training and deployment of two graph machine learning models, GCN and GraphSAGE, with particular emphasis on developing and deploying a customized ONNX operator for GCN. Finally, a case study is conducted using real datasets from a village microgrid. The performance of the two models is compared on both the PC and the smart meter, exhibiting successful deployments and executions on the smart meter.

new Expert Upcycling: Shifting the Compute-Efficient Frontier of Mixture-of-Experts

Authors: Chaitanya Dwivedi, Binxuan Huang, Himanshu Gupta, Pratik Jayarao, Neeraj Varshney, Bing Yin

Abstract: Mixture-of-Experts (MoE) has become the dominant architecture for scaling large language models: frontier models routinely decouple total parameters from per-token computation through sparse expert routing. Scaling laws show that under fixed active computation, model quality scales predictably with total parameters, and MoEs realize this by increasing expert count. However, training large MoEs is expensive, as memory requirements and inter-device communication both scale with total parameter count. We propose expert upcycling, a method for progressively expanding MoE capacity by increasing the number of experts during continued pre-training (CPT). Given a trained E-expert model, the upcycling operator constructs an mE-expert model through expert duplication and router extension while holding top-K routing fixed, preserving per-token inference cost. Duplication provides a warm initialization: the expanded model inherits the source checkpoint's learned representations, starting from a substantially lower loss than random initialization. Subsequent CPT then breaks the symmetry among duplicated experts to drive specialization. We formalize the upcycling operator and develop a theoretical framework decomposing the quality gap into a capacity term and an initialization term. We further introduce utility-based expert selection, which uses gradient-based importance scores to guide non-uniform duplication, more than tripling gap closure when CPT is limited. In our 7B-13B total parameter experiments, the upcycled model matches the fixed-size baseline on validation loss while saving 32% of GPU hours. Comprehensive ablations across model scales, activation ratios, MoE architectures, and training budgets yield a practical recipe for deploying expert upcycling, establishing it as a principled, compute-efficient alternative to training large MoE models from scratch.

new Graph-Theoretic Models for the Prediction of Molecular Measurements

Authors: Anna Niane, Prudence Djagba

Abstract: Graph-theoretic approaches offer simplicity, interpretability, and low computational cost for molecular property prediction. Among these, the model proposed by Mukwembi and Nyabadza, based on the external activity $D(G)$ and internal activity $\zeta(G)$ indices, achieved strong results on a small flavonoid dataset. However, its ability to generalize to larger and chemically diverse datasets has not been tested. This study evaluates the baseline $D(G)$-$\zeta(G)$ polynomial model on five benchmark datasets from MoleculeNet, covering biological activity (BACE, 1,513 molecules), lipophilicity (LogP synthetic, 14,610 molecules; LogP experimental, 753 molecules), aqueous solubility (ESOL, 1,128 molecules), and hydration free energy (SAMPL, 642 molecules). The baseline model achieves an average $R^2 = 0.24$, confirming limited transferability. To address this, a systematic enhancement framework is proposed, progressively incorporating Ridge regularization, additional graph descriptors, physicochemical properties, ensemble learning with Gradient Boosting, Lasso feature selection, and a hybrid approach combining topological indices with Morgan fingerprints. The enhanced models raise the average best $R^2$ to 0.79, with individual improvements ranging from 165\% to 274\%. All improvements are statistically significant ($p < 0.001$). A direct comparison with a Graph Convolutional Network under identical experimental conditions shows that the enhanced classical models match or outperform deep learning on all five datasets. Comparison with the recent GNN+PGM hybrid of Djagba et al.\ further confirms competitiveness, with the enhanced models achieving the best results on two datasets and tying on one. The entire framework requires no GPU, trains in under five minutes, and uses only open-source tools, making it accessible for researchers in resource-limited settings.

new Rethinking Reinforcement Fine-Tuning in LVLM: Convergence, Reward Decomposition, and Generalization

Authors: Carter Adams, Rafael Oliveira, Gabriel Almeida, Sofia Torres

Abstract: Reinforcement fine-tuning with verifiable rewards (RLVR) has emerged as a powerful paradigm for equipping large vision-language models (LVLMs) with agentic capabilities such as tool use and multi-step reasoning. Despite striking empirical successes, most notably Visual Agentic Reinforcement Fine-Tuning (Visual-ARFT), the theoretical underpinnings of this paradigm remain poorly understood. In particular, two critical questions lack rigorous answers: (i)~how does the composite structure of verifiable rewards (format compliance, answer accuracy, tool executability) affect the convergence of Group Relative Policy Optimization (GRPO), and (ii)~why does training on a small set of tool-augmented tasks transfer to out-of-distribution domains? We address these gaps by introducing the \emph{Tool-Augmented Markov Decision Process} (TA-MDP), a formal framework that models multimodal agentic decision-making with bounded-depth tool calls. Within this framework, we establish three main results. First, we prove that GRPO under composite verifiable rewards converges to a first-order stationary point at rate $O(1/\sqrt{T})$ with explicit dependence on the number of reward components and group size (\textbf{Theorem~1}). Second, we derive a \emph{Reward Decomposition Theorem} that bounds the sub-optimality gap between decomposed per-component optimization and joint optimization, providing a precise characterization of when reward decomposition is beneficial (\textbf{Theorem~2}). Third, we establish a PAC-Bayes generalization bound for tool-augmented policies that explains the strong out-of-distribution transfer observed in Visual-ARFT (\textbf{Theorem~3}).

new DR-Venus: Towards Frontier Edge-Scale Deep Research Agents with Only 10K Open Data

Authors: Venus Team, Sunhao Dai, Yong Deng, Jinzhen Lin, Yusheng Song, Guoqing Wang, Xiaofeng Wu, Yuqi Zhou, Shuo Yang, Zhenzhe Ying, Zhanwei Zhang, Changhua Meng, Weiqiang Wang

Abstract: Edge-scale deep research agents based on small language models are attractive for real-world deployment due to their advantages in cost, latency, and privacy. In this work, we study how to train a strong small deep research agent under limited open-data by improving both data quality and data utilization. We present DR-Venus, a frontier 4B deep research agent for edge-scale deployment, built entirely on open data. Our training recipe consists of two stages. In the first stage, we use agentic supervised fine-tuning (SFT) to establish basic agentic capability, combining strict data cleaning with resampling of long-horizon trajectories to improve data quality and utilization. In the second stage, we apply agentic reinforcement learning (RL) to further improve execution reliability on long-horizon deep research tasks. To make RL effective for small agents in this setting, we build on IGPO and design turn-level rewards based on information gain and format-aware regularization, thereby enhancing supervision density and turn-level credit assignment. Built entirely on roughly 10K open-data, DR-Venus-4B significantly outperforms prior agentic models under 9B parameters on multiple deep research benchmarks, while also narrowing the gap to much larger 30B-class systems. Our further analysis shows that 4B agents already possess surprisingly strong performance potential, highlighting both the deployment promise of small models and the value of test-time scaling in this setting. We release our models, code, and key recipes to support reproducible research on edge-scale deep research agents.

new Super Apriel: One Checkpoint, Many Speeds

Authors: SLAM Labs, :, Oleksiy Ostapenko, Raymond Li, Torsten Scholak, Alireza Mousavi-Hosseini, Aman Tiwari, Denis Kocetkov, Joel Lamy Poirier, Kelechi Ogueji, Nanda H Krishna, Rafael Pardinas, Sathwik Tejaswi Madhusudhan, Shruthan Radhakrishna, Srinivas Sunkara, Valerie Becaert

Abstract: We release Super Apriel, a 15B-parameter supernet in which every decoder layer provides four trained mixer choices -- Full Attention (FA), Sliding Window Attention (SWA), Kimi Delta Attention (KDA), and Gated DeltaNet (GDN). A placement selects one mixer per layer; placements can be switched between requests at serving time without reloading weights, enabling multiple speed presets from a single checkpoint. The shared checkpoint also enables speculative decoding without a separate draft model. The all-FA preset matches the Apriel 1.6 teacher on all reported benchmarks; recommended hybrid presets span $2.9\times$ to $10.7\times$ decode throughput at 96% to 77% quality retention, with throughput advantages that compound at longer context lengths. With four mixer types across 48 layers, the configuration space is vast. A surrogate that predicts placement quality from the per-layer mixer assignment makes the speed-quality landscape tractable and identifies the best tradeoffs at each speed level. We investigate whether the best configurations at each speed level can be identified early in training or only after convergence. Rankings stabilize quickly at 0.5B scale, but the most efficient configurations exhibit higher instability at 15B, cautioning against extrapolation from smaller models. Super Apriel is trained by stochastic distillation from a frozen Apriel 1.6 teacher, followed by supervised fine-tuning. We release the supernet weights, Fast-LLM training code, vLLM serving code, and a placement optimization toolkit.

new A Multi-Plant Machine Learning Framework for Emission Prediction, Forecasting, and Control in Cement Manufacturing

Authors: Sheikh Junaid Fayaz, Nestor D. Montiel-Bohorquez, Wilson Ricardo Leal da Silva, Shashank Bishnoi, Matteo Romano, Manuele Gatti, N. M. Anoop Krishnan

Abstract: Cement production is among the largest contributors to industrial air pollution, emitting ~3 Mt NOx/year. The industry-standard mitigation approach, selective non-catalytic reduction (SNCR), exhibits low NH3 utilization efficiency, resulting in operational inefficiencies and increased reagent costs. Here, we develop a data-driven framework for emission control using large-scale operational data from four cement plants worldwide. Benchmarking nine machine learning architectures, we observe that prediction error varies ~3-5x across plants due to variation in data richness. Incorporating short-term process history nearly triples NOx prediction accuracy, revealing that NOx formation carries substantial process memory, a timescale dependence that is absent in CO and CO2. Further, we develop models that forecast NOx overshoots as early as nine minutes, providing a buffer for operational adjustments. The developed framework controls NOx formation at the source, reducing NH3 consumption in downstream SNCR. Surrogate model projections estimate a ~34-64% reduction in NOx while preserving clinker quality, corresponding to a reduction of ~290 t NOx/year and ~58,000 USD/year in NH3 savings. This work establishes a generalizable framework for data-driven emission control, offering a pathway toward low-emission operation without structural modifications or additional hardware, with potential applicability to other hard-to-abate industries such as steel, glass, and lime.

new Physics-Guided Dimension Reduction for Simulation-Free Operator Learning of Stiff Differential--Algebraic Systems

Authors: Huy Hoang Le, Haoguang Wang, Christian Moya, Marcos Netto, Guang Lin

Abstract: Neural surrogates for stiff differential-algebraic equations (DAEs) face two key challenges: soft-constraint methods leave algebraic residuals that stiffness amplifies into large errors, while hard-constraint methods require trajectory data from computationally expensive stiff integrators. We introduce an extended Newton implicit layer that enforces algebraic consistency and quasi-steady-state reduction within a single differentiable solve. Given slow-state predictions from a physics-informed DeepONet, the proposed layer recovers fast and algebraic states, eliminates the stiffness-amplification pathway within each time window, and reduces the output dimension to the slow states alone. Gradients derived via the implicit function theorem capture a stiffness-scaled coupling term that is absent in penalty-based approaches. Cascaded implicit layers further extend the framework to multi-component systems with provable convergence. On a grid-forming inverter DAE (21 states), the proposed method (7 outputs, 1.42 percent error) significantly outperforms penalty methods (39.3 percent), standard Newton approaches (57.0 percent), and augmented Lagrangian or feedback linearization baselines, which fail to converge. Two independently trained models compose into a 44-state system without retraining, achieving 0.72 to 1.16 percent error with zero algebraic residual. Conformal prediction further provides 90 percent coverage in-distribution and enables automatic out-of-distribution detection.

new Generalization and Membership Inference Attack a Practical Perspective

Authors: Fateme Rahmani, Mahdi Jafari Siavoshani, Mohammad Hossein Rohban

Abstract: With the emergence of new evaluation metrics and attack methodologies for Membership Inference Attacks (MIA), it becomes essential to reevaluate previously accepted assumptions. In this paper, we revisit the longstanding debate regarding the correlation between MIA success rates and model generalization using an empirical approach. We focused on employing augmentation techniques and early stopping to enhance model generalization and examined their impact on MIA success rates. We found that utilizing advanced generalization techniques can significantly decrease attack performance, potentially by up to 100 times. Moreover, combining these methods not only improves model generalization but also reduces attack effectiveness by introducing randomness during training. Additionally, our study confirmed the direct impact of generalization on MIA performance through an analysis of over 1K models in a controlled environment.

new Are LLM Uncertainty and Correctness Encoded by the Same Features? A Functional Dissociation via Sparse Autoencoders

Authors: Het Patel, Tiejin Chen, Hua Wei, Evangelos E. Papalexakis, Jia Chen

Abstract: Large language models can be uncertain yet correct, or confident yet wrong, raising the question of whether their output-level uncertainty and their actual correctness are driven by the same internal mechanisms or by distinct feature populations. We introduce a 2x2 framework that partitions model predictions along correctness and confidence axes, and uses sparse autoencoders to identify features associated with each dimension independently. Applying this to Llama-3.1-8B and Gemma-2-9B, we identify three feature populations that play fundamentally different functional roles. Pure uncertainty features are functionally essential: suppressing them severely degrades accuracy. Pure incorrectness features are functionally inert: despite showing statistically significant activation differences between correct and incorrect predictions, the majority produce near-zero change in accuracy when suppressed. Confounded features that encode both signals are detrimental to output quality, and targeted suppression of them yields a 1.1% accuracy improvement and a 75% entropy reduction, with effects transferring across the ARC-Challenge and RACE benchmarks. The feature categories are also informationally distinct: the activations of just 3 confounded features from a single mid-network layer predict model correctness (AUROC ~0.79), enabling selective abstention that raises accuracy from 62% to 81% at 53% coverage. The results demonstrate that uncertainty and correctness are distinct internal phenomena, with implications for interpretability and targeted inference-time intervention.

new Fast Amortized Fitting of Scientific Signals Across Time and Ensembles via Transferable Neural Fields

Authors: Sophia Zorek, Kushal Vyas, Yuhao Liu, David Lenz, Tom Peterka, Guha Balakrishnan

Abstract: Neural fields, also known as implicit neural representations (INRs), offer a powerful framework for modeling continuous geometry, but their effectiveness in high-dimensional scientific settings is limited by slow convergence and scaling challenges. In this study, we extend INR models to handle spatiotemporal and multivariate signals and show how INR features can be transferred across scientific signals to enable efficient and scalable representation across time and ensemble runs in an amortized fashion. Across controlled transformation regimes (e.g., geometric transformations and localized perturbations of synthetic fields) and high-fidelity scientific domains-including turbulent flows, fluid-material impact dynamics, and astrophysical systems-we show that transferable features improve not only signal fidelity but also the accuracy of derived geometric and physical quantities, including density gradients and vorticity. In particular, transferable features reduce iterations to reach target reconstruction quality by up to an order of magnitude, increase early-stage reconstruction quality by multiple dB (with gains exceeding 10 dB in some cases), and consistently improve gradient-based physical accuracy.

new Multi-Objective Reinforcement Learning for Generating Covalent Inhibitor Candidates

Authors: Renee Gil

Abstract: Rational design of covalent inhibitors requires simultaneously optimizing multiple properties, such as binding affinity, target selectivity, or electrophilic reactivity. This presents a multi-objective problem not easily addressed by screening alone. Here we present a machine learning pipeline for generating covalent inhibitor candidates using multi-objective reinforcement learning (RL), applied to two targets: epidermal growth factor receptor (EGFR) and acetylcholinesterase (ACHE). A SMILES-based pretrained LSTM serves as the generative model, optimized via policy gradient RL with Pareto crowding distance to balance competing scoring functions including synthetic accessibility, predicted covalent activity, residue affinity, and an approximated docking score. The pipeline rediscovers known covalent inhibitors at rates of up to 0.50% (EGFR) and 0.74% (ACHE) in 10,000-structure runs, with candidate structures achieving warhead-to-residue distances as short as 5.5 angstrom (EGFR) and 3.2 angstrom (ACHE) after further docking-based screening. More notably, the pipeline spontaneously generates structures bearing warhead motifs absent from the training data - including allenes, 3-oxo-$\beta$-sultams, and $\alpha$-methylene-$\beta$-lactones - all of which have independent literature support as covalent warheads. These results suggest that RL-guided generation can explore covalent chemical space beyond its training distribution, and may be useful as a tool for medicinal chemists working on covalent drug discovery.

new Continuous Semantic Caching for Low-Cost LLM Serving

Authors: Baran Atalar, Xutong Liu, Jinhang Zuo, Siwei Wang, Wei Chen, Carlee Joe-Wong

Abstract: As Large Language Models (LLMs) become increasingly popular, caching responses so that they can be reused by users with semantically similar queries has become a vital strategy for reducing inference costs and latency. Existing caching frameworks have proposed to decide which query responses to cache by assuming a finite, known universe of discrete queries and learning their serving costs and arrival probabilities. As LLMs' pool of users and queries expands, however, such an assumption becomes increasingly untenable: real-world LLM queries reside in an infinite, continuous embedding space. In this paper, we establish the first rigorous theoretical framework for semantic LLM response caching in continuous query space under uncertainty. To bridge the gap between discrete optimization and continuous representation spaces, we introduce dynamic $\epsilon$-net discretization coupled with Kernel Ridge Regression. This design enables the system to formally quantify estimation uncertainty and generalize partial feedback on LLM query costs across continuous semantic query neighborhoods. We develop both offline learning and online adaptive algorithms optimized to reduce switching costs incurred by changing the cached responses. We prove that our online algorithm achieves a sublinear regret bound against an optimal continuous oracle, which reduces to existing bounds for discrete query models. Extensive empirical evaluations demonstrate that our framework approximates the continuous optimal cache well while also reducing computational and switching overhead compared to existing methods.

new Statistics, Not Scale: Modular Medical Dialogue with Bayesian Belief Engine

Authors: Yusuf Kesmen, Fay Elhassan, Jiayi Ma, Julien Stalhandske, David Sasu, Alexandra Kulinkina, Akhil Arora, Lars Klein, Mary-Anne Hartley

Abstract: Large language models are increasingly deployed as autonomous diagnostic agents, yet they conflate two fundamentally different capabilities: natural-language communication and probabilistic reasoning. We argue that this conflation is an architectural flaw, not an engineering shortcoming. We introduce BMBE (Bayesian Medical Belief Engine), a modular diagnostic dialogue framework that enforces a strict separation between language and reasoning: an LLM serves only as a sensor, parsing patient utterances into structured evidence and verbalising questions, while all diagnostic inference resides in a deterministic, auditable Bayesian engine. Because patient data never enters the LLM, the architecture is private by construction; because the statistical backend is a standalone module, it can be replaced per target population without retraining. This separation yields three properties no autonomous LLM can offer: calibrated selective diagnosis with a continuously adjustable accuracy-coverage tradeoff, a statistical separation gap where even a cheap sensor paired with the engine outperforms a frontier standalone model from the same family at a fraction of the cost, and robustness to adversarial patient communication styles that cause standalone doctors to collapse. We validate across empirical and LLM-generated knowledge bases against frontier LLMs, confirming the advantage is architectural, not informational.

new Replicable Bandits with UCB based Exploration

Authors: Rohan Deb, Udaya Ghai, Karan Singh, Arindam Banerjee

Abstract: We study replicable algorithms for stochastic multi-armed bandits (MAB) and linear bandits with UCB (Upper Confidence Bound) based exploration. A bandit algorithm is $\rho$-replicable if two executions using shared internal randomness but independent reward realizations, produce the same action sequence with probability at least $1-\rho$. Prior work is primarily elimination-based and, in linear bandits with infinitely many actions, relies on discretization, leading to suboptimal dependence on the dimension $d$ and $\rho$. We develop optimistic alternatives for both settings. For stochastic multi-armed bandits, we propose RepUCB, a replicable batched UCB algorithm and show that it attains a regret $O\!\left(\frac{K^2\log^2 T}{\rho^2}\sum_{a:\Delta_a>0}\left(\Delta_a+\frac{\log(KT\log T)}{\Delta_a}\right)\right)$. For stochastic linear bandits, we first introduce RepRidge, a replicable ridge regression estimator that satisfies both a confidence guarantee and a $\rho$-replicability guarantee. Beyond its role in our bandit algorithm, this estimator and its guarantees may also be of independent interest in other statistical estimation settings. We then use RepRidge to design RepLinUCB, a replicable optimistic algorithm for stochastic linear bandits, and show that its regret is bounded by $\widetilde{O}\!\big(\big(d+\frac{d^3}{\rho}\big)\sqrt{T}\big)$. This improves the best prior regret guarantee by a factor of $O(d/\rho)$, showing that our optimistic algorithm can substantially reduce the price of replicability.

new Federated Learning over Blockchain-Enabled Cloud Infrastructure

Authors: Saloni Garg, Amit Sagtani, Kamal Kant Hiran

Abstract: The rise of IoT devices and the uptake of cloud computing have informed a new era of data-driven intelligence. Traditional centralized machine learning models that require a large volume of data to be stored in a single location have therefore become more susceptible to data breaches, privacy violations, and regulatory non-compliance. This report presents a thorough examination of the merging of Federated Learning (FL) and blockchain technology in a cloud-edge setting, demonstrating it as an effective solution to the stated concerns. We are proposing a detailed four-dimensional architectural categorization that meticulously assesses coordination frameworks, consensus algorithms, data storage practices, and trust models that are significant to these integrated systems. The manuscript presents a comprehensive comparative examination of two cutting-edge frameworks: the Multi-Objectives Reinforcement Federated Learning Blockchain (MORFLB), which is designed for intelligent transportation systems, and the Federated Blockchain-IoT Framework for Sustainable Healthcare Systems (FBCI-SHS), elucidating their distinctive contributions and inherent limitations. Lastly, we engage in a thorough evaluation of the literature that integrates a comparative perspective on current frameworks to discern the singular nature of this research within existing knowledge systems. The manuscript culminates in delineating the principal challenges and offering a strategic framework for prospective research trajectories, emphasizing the advancement of adaptive, resilient, and standardized BCFL systems across diverse application domains.

new Maximum Entropy Semi-Supervised Inverse Reinforcement Learning

Authors: Julien Audiffren, Michal Valko, Alessandro Lazaric, Mohammad Ghavamzadeh

Abstract: A popular approach to apprenticeship learning (AL) is to formulate it as an inverse reinforcement learning (IRL) problem. The MaxEnt-IRL algorithm successfully integrates the maximum entropy principle into IRL and unlike its predecessors, it resolves the ambiguity arising from the fact that a possibly large number of policies could match the expert's behavior. In this paper, we study an AL setting in which in addition to the expert's trajectories, a number of unsupervised trajectories is available. We introduce MESSI, a novel algorithm that combines MaxEnt-IRL with principles coming from semi-supervised learning. In particular, MESSI integrates the unsupervised data into the MaxEnt-IRL framework using a pairwise penalty on trajectories. Empirical results in a highway driving and grid-world problems indicate that MESSI is able to take advantage of the unsupervised trajectories and improve the performance of MaxEnt-IRL.

new Analysis of Nystrom method with sequential ridge leverage scores

Authors: Daniele Calandriello, Alessandro Lazaric, Michal Valko

Abstract: Large-scale kernel ridge regression (KRR) is limited by the need to store a large kernel matrix K_t. To avoid storing the entire matrix K_t, Nystrom methods subsample a subset of columns of the kernel matrix, and efficiently find an approximate KRR solution on the reconstructed matrix. The chosen subsampling distribution in turn affects the statistical and computational tradeoffs. For KRR problems, recent works show that a sampling distribution proportional to the ridge leverage scores (RLSs) provides strong reconstruction guarantees for the approximation. While exact RLSs are as difficult to compute as a KRR solution, we may be able to approximate them well enough. In this paper, we study KRR problems in a sequential setting and introduce the INK-ESTIMATE algorithm, that incrementally computes the RLSs estimates. INK-ESTIMATE maintains a small sketch of K_t, that at each step is used to compute an intermediate estimate of the RLSs. First, our sketch update does not require access to previously seen columns, and therefore a single pass over the kernel matrix is sufficient. Second, the algorithm requires a fixed, small space budget to run dependent only on the effective dimension of the kernel matrix. Finally, our sketch provides strong approximation guarantees on the distance between the true kernel matrix and its approximation, and on the statistical risk of the approximate KRR solution at any time, because all our guarantees hold at any intermediate step.

new Improved large-scale graph learning through ridge spectral sparsification

Authors: Daniele Calandriello, Ioannis Koutis, Alessandro Lazaric, Michal Valko

Abstract: Graph-based techniques and spectral graph theory have enriched the field of machine learning with a variety of critical advances. A central object in the analysis is the graph Laplacian L, which encodes the structure of the graph. We consider the problem of learning over this Laplacian in a distributed streaming setting, where new edges of the graph are observed in real time by a network of workers. In this setting, it is hard to learn quickly or approximately while keeping a distributed representation of L. To address this challenge, we present a novel algorithm, GSQUEAK, which efficiently sparsifies the Laplacian by maintaining a small subset of effective resistances. We show that our algorithm produces sparsifiers with strong spectral approximation guarantees, all while processing edges in a single pass and in a distributed fashion.

new On the Quantization Robustness of Diffusion Language Models in Coding Benchmarks

Authors: Aarav Gupta, Gururaj Deshpande, Chandreyi Chakraborty

Abstract: Auto-regressive Large Language Models (LLMs) achieve strong performance on coding tasks, but incur high memory and inference costs. Diffusion-based language models (d-LLMs) offer bounded inference cost via iterative denoising, but their behavior under post-training quantization (PTQ) has been sparsely explored. We investigate the application and robustness of PTQ techniques, specifically GPTQ and a modified Hessian-Aware Quantization (HAWQ) algorithm, on a diffusion-based coding LLM (CoDA) and observe that these methods applied to CoDA exhibit greater robustness at low bitwidths compared to Qwen3-1.7B, its auto-regressive counterpart, under a standardized evaluation pipeline. We find that in our setup, CoDA exhibits greater robustness at low bitwidths (2-4 bits), with smaller accuracy degradation across HumanEval and MBPP benchmarks. Additionally, mixed-precision configurations derived from HAWQ provide smooth trade-offs across accuracy, latency, and memory. The results suggest that diffusion LLMs may offer advantages for efficient deployment due to more quantization-resilience.

new Concept Graph Convolutions: Message Passing in the Concept Space

Authors: Lucie Charlotte Magister, Pietro Lio

Abstract: The trust in the predictions of Graph Neural Networks is limited by their opaque reasoning process. Prior methods have tried to explain graph networks via concept-based explanations extracted from the latent representations obtained after message passing. However, these explanations fall short of explaining the message passing process itself. To this aim, we propose the Concept Graph Convolution, the first graph convolution designed to operate on node-level concepts for improved interpretability. The proposed convolutional layer performs message passing on a combination of raw and concept representations using structural and attention-based edge weights. We also propose a pure variant of the convolution, only operating in the concept space. Our results show that the Concept Graph Convolution allows to obtain competitive task accuracy, while enabling an increased insight into the evolution of concepts across convolutional steps.

new Energy-Based Open-Set Active Learning for Object Classification

Authors: Zongyao Lyu, William J. Beksi

Abstract: Active learning (AL) has emerged as a crucial methodology for minimizing labeling costs in deep learning by selecting the most valuable samples from a pool of unlabeled data for annotation. Traditional AL operates under a closed-set assumption, where all classes in the dataset are known and consistent. However, real-world scenarios often present open-set conditions in which unlabeled data contains both known and unknown classes. In such environments, standard AL techniques struggle. They can mistakenly query samples from unknown categories, leading to inefficient use of annotation budgets. In this paper, we propose a novel dual-stage energy-based framework for open-set AL. Our method employs two specialized energy-based models (EBMs). The first, an energy-based known/unknown separator, filters out samples likely to belong to unknown classes. The second, an energy-based sample scorer, assesses the informativeness of the filtered known samples. Using the energy landscape, our models distinguish between data points from known and unknown classes in the unlabeled pool by assigning lower energy to known samples and higher energy to unknown samples, ensuring that only samples from classes of interest are selected for labeling. By integrating these components, our approach ensures efficient and targeted sample selection, maximizing learning impact in each iteration. Experiments on 2D (CIFAR-10, CIFAR-100, TinyImageNet) and 3D (ModelNet40) object classification benchmarks demonstrates that our framework outperforms existing approaches, achieving superior annotation efficiency and classification performance in open-set environments.

new Differentiable Conformal Training for LLM Reasoning Factuality

Authors: Nathan Hittesdorf, Marco Salzetta, Lu Cheng

Abstract: Large Language Models (LLMs) frequently hallucinate, limiting their reliability in critical applications. Conformal Prediction (CP) addresses this by calibrating error rates on held-out data to provide statistically valid confidence guarantees. Recent work extends CP to LLM factuality to filter out risky claims, ensuring that hallucination rates remain below a user-specified level (e.g., 10%). While prior methods treat claims independently, Coherent Factuality extends to multi-step reasoning by representing outputs as dependency graphs and jointly validating claims with their logical ancestors. A key limitation is that Coherent Factuality is not differentiable, requiring hand-crafted scorers that at high reliability levels remove nearly 60% of true claims. We introduce Differentiable Coherent Factuality (DCF), a fully differentiable relaxation that enables learning improved scorers while provably recovering the original algorithm's guarantees. Experiments on two benchmark reasoning datasets demonstrate DCF achieves up to 141% improvement in claim retention while maintaining reliability guarantees, representing a significant step towards reliable conformal LLM systems.

new Learning to Solve the Quadratic Assignment Problem with Warm-Started MCMC Finetuning

Authors: Yicheng Pan, Ruisong Zhou, Haijun Zou, Tianyou Li, Zaiwen Wen

Abstract: The quadratic assignment problem (QAP) is a fundamental NP-hard task that poses significant challenges for both traditional heuristics and modern learning-based solvers. Existing QAP solvers still struggle to achieve consistently competitive performance across structurally diverse real-world instances. To bridge this performance gap, we propose PLMA, an innovative permutation learning framework. PLMA features an efficient warm-started MCMC finetuning procedure to enhance deployment-time performance, leveraging short Markov chains to anchor the adaptation to the promising regions previously explored. For rapid exploration via MCMC over the permutation space, we design an additive energy-based model (EBM) that enables an $O(1)$-time 2-swap Metropolis-Hastings sampling step. Moreover, the neural network used to parameterize the EBM incorporates a scalable and flexible cross-graph attention mechanism to model interactions between facilities and locations in the QAP. Extensive experiments demonstrate that PLMA consistently outperforms state-of-the-art baselines across various benchmarks. In particular, PLMA achieves a near-zero average optimality gap on QAPLIB, exhibits remarkably superior robustness on the notoriously difficult Taixxeyy instances, and also serves as an effective QAP solver in bandwidth minimization.

new Meta Additive Model: Interpretable Sparse Learning With Auto Weighting

Authors: Xuelin Zhang, Xinyue Liu, Lingjuan Wu, Hong Chen

Abstract: Sparse additive models have attracted much attention in high-dimensional data analysis due to their flexible representation and strong interpretability. However, most existing models are limited to single-level learning under the mean-squared error criterion, whose empirical performance can degrade significantly in the presence of complex noise, such as non-Gaussian perturbations, outliers, noisy labels, and imbalanced categories. The sample reweighting strategy is widely used to reduce the model's sensitivity to atypical data; however, it typically requires prespecifying the weighting functions and manually selecting additional hyperparameters. To address this issue, we propose a new meta additive model (MAM) based on the bilevel optimization framework, which learns data-driven weighting of individual losses by parameterizing the weighting function via an MLP trained on meta data. MAM is capable of a variety of learning tasks, including variable selection, robust regression estimation, and imbalanced classification. Theoretically, MAM provides guarantees on convergence in computation, algorithmic generalization, and variable selection consistency under mild conditions. Empirically, MAM outperforms several state-of-the-art additive models on both synthetic and real-world data under various data corruptions.

new On the Stability and Generalization of First-order Bilevel Minimax Optimization

Authors: Xuelin Zhang, Peipei Yuan

Abstract: Bilevel optimization and bilevel minimax optimization have recently emerged as unifying frameworks for a range of machine-learning tasks, including hyperparameter optimization and reinforcement learning. The existing literature focuses on empirical efficiency and convergence guarantees, leaving a critical theoretical gap in understanding how well these algorithms generalize. To bridge this gap, we provide the first systematic generalization analysis for first-order gradient-based bilevel minimax solvers with lower-level minimax problems. Specifically, by leveraging algorithmic stability arguments, we derive fine-grained generalization bounds for three representative algorithms, including single-timescale stochastic gradient descent-ascent, and two variants of two-timescale stochastic gradient descent-ascent. Our results reveal a precise trade-off among algorithmic stability, generalization gaps, and practical settings. Furthermore, extensive empirical evaluations corroborate our theoretical insights on realistic optimization tasks with bilevel minimax structures.

new Adaptive Conformal Anomaly Detection with Time Series Foundation Models for Signal Monitoring

Authors: Natalia Martinez Gil, Fearghal O'Donncha, Wesley M. Gifford, Nianjun Zhou, Dhaval C. Patel, Roman Vaculin

Abstract: We propose a post-hoc adaptive conformal anomaly detection method for monitoring time series that leverages predictions from pre-trained foundation models without requiring additional fine-tuning. Our method yields an interpretable anomaly score directly interpretable as a false alarm rate (p-value), facilitating transparent and actionable decision-making. It employs weighted quantile conformal prediction bounds and adaptively learns optimal weighting parameters from past predictions, enabling calibration under distribution shifts and stable false alarm control, while preserving out-of-sample guarantees. As a model-agnostic solution, it integrates seamlessly with foundation models and supports rapid deployment in resource-constrained environments. This approach addresses key industrial challenges such as limited data availability, lack of training expertise, and the need for immediate inference, while taking advantage of the growing accessibility of time series foundation models. Experiments on both synthetic and real-world datasets show that the proposed approach delivers strong performance, combining simplicity, interpretability, robustness, and adaptivity.

new Trajectory-Aware Reliability Modeling of Democratic Systems

Authors: Dmitry Zaytsev, Valentina Kuskova, Michael Coppedge

Abstract: Failures in complex systems often emerge through gradual degradation and the propagation of stress across interacting components rather than through isolated shocks. Democratic systems exhibit similar dynamics, where weakening institutions can trigger cascading deterioration in related institutional structures. Traditional reliability and survival models typically estimate failure risk based on the current system state but do not explicitly capture how degradation propagates through institutional networks over time. This paper introduces a trajectory-aware reliability modeling framework based on Dynamic Causal Neural Autoregression (DCNAR). The framework first estimates a causal interaction structure among institutional indicators and then models their joint temporal evolution to generate forward trajectories of system states. Failure risk is defined as the probability that predicted trajectories cross predefined degradation thresholds within a fixed horizon. Using longitudinal institutional indicators, we compare DCNAR-based trajectory risk models with discrete-time hazard and Cox proportional hazards models. Results show that trajectory-aware modeling consistently outperforms Cox models and improves risk prediction for several propagation-driven institutional failures. These findings highlight the importance of modeling dynamic system interactions for reliability analysis and early detection of systemic degradation.

new A Delta-Aware Orchestration Framework for Scalable Multi-Agent Edge Computing

Authors: Samaresh Kumar Singh, Joyjit Roy

Abstract: The Synergistic Collapse occurs when scaling beyond 100 agents causes superlinear performance degradation that individual optimizations cannot prevent. We observe this collapse with 150 cameras in Smart City deployment using MADDPG, where Deadline Satisfaction drops from 78% to 34%, producing approximately $180,000 in annual cost overruns. Prior work has addressed each contributing factor in isolation: exponential action-space growth, computational redundancy among spatially adjacent agents, and task-agnostic hardware scheduling. None has examined how these three factors interact and amplify each other. We present DAOEF (Delta-Aware Orchestration for Edge Federations), a framework that addresses all three simultaneously through: (1) Differential Neural Caching, which stores intermediate layer activations and computes only the input deltas, achieving 2.1x higher hit ratios (72% vs. 35%) than output-level caching while staying within 2% accuracy loss through empirically calibrated similarity thresholds; (2) Criticality-Based Action Space Pruning, which organizes agents into priority tiers and reduces coordination complexity from O(n2) to O(n log n) with less than 6% optimality loss; and (3) Learned Hardware Affinity Matching, which assigns tasks to their optimal accelerator (GPU, CPU, NPU, or FPGA) to prevent compounding mismatch penalties. Controlled factor-isolation experiments confirm that each mechanism is necessary but insufficient on its own: removing any single mechanism increases latency by more than 40%, validating that the gains are interdependent rather than additive. Across four datasets (100-250 agents) and a 20-device physical testbed, DAOEF achieves a 1.45x multiplicative gain over applying the three mechanisms independently. A 200-agent cloud deployment yields 62% latency reduction (280 ms vs. 735 ms), sub-linear latency growth up to 250 agents.

new Pairing Regularization for Mitigating Many-to-One Collapse in GANs

Authors: Kuan-Yu Lin, Yu-Chih Huang, Tie Liu

Abstract: Mode collapse remains a fundamental challenge in training generative adversarial networks (GANs). While existing works have primarily focused on inter-mode collapse, such as mode dropping, intra-mode collapse-where many latent variables map to the same or highly similar outputs-has received significantly less attention. In this work, we propose a pairing regularizer jointly optimized with the generator to mitigate the many-to-one collapse by enforcing local consistency between latent variables and generated samples. We show that the effect of pairing regularization depends on the dominant failure mode of training. In collapse-prone regimes with limited exploration, pairing encourages structured local exploration, leading to improved coverage and higher recall. In contrast, under stabilized training with sufficient exploration, pairing refines the generator's induced data density by discouraging redundant mappings, thereby improving precision without sacrificing recall. Extensive experiments on both toy distributions and real-image benchmarks demonstrate that the proposed regularizer effectively complements existing stabilization techniques by directly addressing intra-mode collapse.

new Fourier Weak SINDy: Spectral Test Function Selection for Robust Model Identification

Authors: Zhiheng Chen, Urban Fasel, Anastasia Bizyaeva

Abstract: We introduce Fourier Weak SINDy, a minimal noise-robust and interpretable derivative-free equation learning method that combines weak-form sparse equation learning with spectral density estimation for data-driven test function selection. By using orthogonal sinusoidal test functions inspired by their prevalence in Modulating Function-based system identification, the weak-form sparse regression problem reduces to a regression over Fourier coefficients. Dominant frequencies are then selected via multitaper estimation of the frequency spectrum of the data. This formulation unifies weak-form learning and spectral estimation within a compact and flexible framework. We illustrate the effectiveness of this approach in numerical experiments across multiple chaotic and hyperchaotic ODE benchmarks.

new Temporally Extended Mixture-of-Experts Models

Authors: Zeyu Shen, Peter Henderson

Abstract: Mixture-of-Experts models, now popular for scaling capacity at fixed inference speed, switch experts at nearly every token. Once a model outgrows available GPU memory, this churn can render optimizations like offloading and pre-fetching ineffective. We make the case that the options framework in reinforcement learning is a perfect match to tackle this problem, and argue for temporally extended mixture-of-experts layers. Building on the option-critic framework with deliberation costs, we add a controller to each layer that learns when to switch expert sets and which to load. By applying this to gpt-oss-20b with low-rank adapters and a self-distillation reward, our method reduces switch rates from over 50% to below 5% while retaining up to 90% of base-model accuracy on MATH, MMLU, and MMMLU. This shows that even existing pre-trained models can be converted to temporally extended MoEs with lightweight training, with the deliberation cost allowing model trainers to trade off switching rates against capability. We hope this opens a principled path, grounded in the options framework, for memory-efficient serving and continual learning in ever-growing MoE models.

new SMART: A Spectral Transfer Approach to Multi-Task Learning

Authors: Boxin Zhao, Mladen Kolar, Jinchi Lv

Abstract: Multi-task learning is effective for related applications, but its performance can deteriorate when the target sample size is small. Transfer learning can borrow strength from related studies; yet, many existing methods rely on restrictive bounded-difference assumptions between the source and target models. We propose SMART, a spectral transfer method for multi-task linear regression that instead assumes spectral similarity: the target left and right singular subspaces lie within the corresponding source subspaces and are sparsely aligned with the source singular bases. Such an assumption is natural when studies share latent structures and enables transfer beyond the bounded-difference settings. SMART estimates the target coefficient matrix through structured regularization that incorporates spectral information from a source study. Importantly, it requires only a fitted source model rather than the raw source data, making it useful when data sharing is limited. Although the optimization problem is nonconvex, we develop a practical ADMM-based algorithm. We establish general, non-asymptotic error bounds and a minimax lower bound in the noiseless-source regime. Under additional regularity conditions, these results yield near-minimax Frobenius error rates up to logarithmic factors. Simulations confirm improved estimation accuracy and robustness to negative transfer, and analysis of multi-modal single-cell data demonstrates better predictive performance. The Python implementation of SMART, along with the code to reproduce all experiments in this paper, is publicly available at https://github.com/boxinz17/smart.

URLs: https://github.com/boxinz17/smart.

new Cover meets Robbins while Betting on Bounded Data: $\ln n$ Regret and Almost Sure $\ln\ln n$ Regret

Authors: Shubhada Agrawal, Aaditya Ramdas

Abstract: Consider betting against a sequence of data in $[0,1]$, where one is allowed to make any bet that is fair if the data have a conditional mean $m_0 \in (0,1)$. Cover's universal portfolio algorithm delivers a worst-case regret of $O(\ln n)$ compared to the best constant bet in hindsight, and this bound is unimprovable against adversarially generated data. In this work, we present a novel mixture betting strategy that combines insights from Robbins and Cover, and exhibits a different behavior: it eventually produces a regret of $O(\ln \ln n)$ on \emph{almost} all paths (a measure-one set of paths if each conditional mean equals $m_0$ and intrinsic variance increases to $\infty$), but has an $O(\log n)$ regret on the complement (a measure zero set of paths). Our paper appears to be the first to point out the value in hedging two very different strategies to achieve a best-of-both-worlds adaptivity to stochastic data and protection against adversarial data. We contrast our results to those in~\cite{agrawal2025regret} for a sub-Gaussian mixture on unbounded data: their worst-case regret has to be unbounded, but a similar hedging delivers both an optimal betting growth-rate and an almost sure $\ln\ln n$ regret on stochastic data. Finally, our strategy witnesses a sharp game-theoretic upper law of the iterated logarithm, analogous to~\cite{shafer2005probability}.

new Lever: Inference-Time Policy Reuse under Support Constraints

Authors: Ihor Vitenki, Noha Ibrahim, Sihem Amer-Yahia

Abstract: Reinforcement learning (RL) policies are typically trained for fixed objectives, making reuse difficult when task requirements change. We study inference-time policy reuse: given a library of pre-trained policies and a new composite objective, can a high-quality policy be constructed entirely offline, without additional environment interaction? We introduce lever (Leveraging Efficient Vector Embeddings for Reusable policies), an end-to-end framework that retrieves relevant policies, evaluates them using behavioral embeddings, and composes new policies via offline Q-value composition. We focus on the support-limited regime, where no value propagation is possible, and show that the effectiveness of reuse depends critically on the coverage of available transitions. To balance performance and computational cost, lever proposes composition strategies that control the exploration of candidate policies. Experiments in deterministic GridWorld environments show that inference-time composition can match, and in some cases exceed, training-from-scratch performance while providing substantial speedups. At the same time, performance degrades when long-horizon dependencies require value propagation, highlighting a fundamental limitation of offline reuse.

new Physics-Enhanced Deep Learning for Proactive Thermal Runaway Forecasting in Li-Ion Batteries

Authors: Salman Khan, Muhammad Zunair Zamir, Syed Sajid Ullah, Jie Li

Abstract: Accurate prediction of thermal runaway in lithium-ion batteries is essential for ensuring the safety, efficiency, and reliability of modern energy storage systems. Conventional data-driven approaches, such as Long Short-Term Memory (LSTM) networks, can capture complex temporal dependencies but often violate thermodynamic principles, resulting in physically inconsistent predictions. Conversely, physics-based thermal models provide interpretability but are computationally expensive and difficult to parameterize for real-time applications. To bridge this gap, this study proposes a Physics-Informed Long Short-Term Memory (PI-LSTM) framework that integrates governing heat transfer equations directly into the deep learning architecture through a physics-based regularization term in the loss function. The model leverages multi-feature input sequences, including state of charge, voltage, current, mechanical stress, and surface temperature, to forecast battery temperature evolution while enforcing thermal diffusion constraints. Extensive experiments conducted on thirteen lithium-ion battery datasets demonstrate that the proposed PI-LSTM achieves an 81.9% reduction in root mean square error (RMSE) and an 81.3% reduction in mean absolute error (MAE) compared to the standard LSTM baseline, while also outperforming CNN-LSTM and multilayer perceptron (MLP) models by wide margins. The inclusion of physical constraints enhances the model's generalization across diverse operating conditions and eliminates non-physical temperature oscillations. These results confirm that physics-informed deep learning offers a viable pathway toward interpretable, accurate, and real-time thermal management in next-generation battery systems.

new Structure-Aware Variational Learning of a Class of Generalized Diffusions

Authors: Yubin Lu, Xiaofan Li, Chun Liu, Qi Tang, Yiwei Wang

Abstract: Learning the underlying potential energy of stochastic gradient systems from partial and noisy observations is a fundamental problem arising in physics, chemistry, and data-driven modeling. Classical approaches often rely on direct regression of governing equations or velocity fields, which can be sensitive to noise and external perturbations and may fail when observations are incomplete. In this work, we propose a structure-aware, energy-based learning framework for inferring unknown potential functions in generalized diffusion processes, grounded in the energetic variational approach. Starting from the energy-dissipation law associated with the Fokker-Planck equation, we construct loss functions based on the De Giorgi dissipation functional, which consistently couple the free energy and the dissipation mechanism of the system. This formulation avoids explicit enforcement of the governing partial differential equation and preserves the underlying variational structure of the dynamics. Through numerical experiments in one, two, and three dimensions, we demonstrate that the proposed energy-based loss exhibits enhanced robustness with respect to observation time, noise level, and the diversity and amount of available training data. These results highlight the effectiveness of energy-dissipation principles as a reliable foundation for learning stochastic diffusion dynamics from data.

new ACT: Anti-Crosstalk Learning for Cross-Sectional Stock Ranking via Temporal Disentanglement and Structural Purification

Authors: Juntao Li, Liang Zhang

Abstract: Cross-sectional stock ranking is a fundamental task in quantitative investment, relying on both temporal modeling of individual stocks and the capture of inter-stock dependencies. While existing deep learning models leverage graph-based approaches to enhance ranking accuracy by propagating information over relational graphs, they suffer from a key challenge: crosstalk, namely unintended information interference across predictive factors. We identify two forms of crosstalk: temporal-scale crosstalk, where trends, fluctuations, and shocks are entangled in a shared representation and non-transferable local patterns contaminate cross-stock learning; and structural crosstalk, where heterogeneous relations are indiscriminately fused and relation-specific predictive signals are obscured. To address both issues, we propose the Anti-CrossTalk (ACT) framework for cross-sectional stock ranking via temporal disentanglement and structural purification. Specifically, ACT first decomposes each stock sequence into trend, fluctuation, and shock components, then extracts component-specific information through dedicated branches, which effectively decouples non-transferable local patterns. ACT further introduces a Progressive Structural Purification Encoder to sequentially purify structural crosstalk on the trend component after mitigating temporal-scale crosstalk. An adaptive fusion module finally integrates all branch representations for ranking. Experiments on CSI300 and CSI500 demonstrate that ACT achieves state-of-the-art ranking accuracy and superior portfolio performance, with improvements of up to 74.25% on the CSI300 dataset.

new Scaling Self-Play with Self-Guidance

Authors: Luke Bailey, Kaiyue Wen, Kefan Dong, Tatsunori Hashimoto, Tengyu Ma

Abstract: LLM self-play algorithms are notable in that, in principle, nothing bounds their learning: a Conjecturer model creates problems for a Solver, and both improve together. However, in practice, existing LLM self-play methods do not scale well with large amounts of compute, instead hitting learning plateaus. We argue this is because over long training runs, the Conjecturer learns to hack its reward, collapsing to artificially complex problems that do not help the Solver improve. To overcome this, we introduce Self-Guided Self-Play (SGS), a self-play algorithm in which the language model itself guides the Conjecturer away from degeneracy. In SGS, the model takes on three roles: Solver, Conjecturer, and a Guide that scores synthetic problems by their relevance to unsolved target problems and how clean and natural they are, providing supervision against Conjecturer collapse. Our core hypothesis is that language models can assess whether a subproblem is useful for achieving a goal. We evaluate the scaling properties of SGS by running training for significantly longer than prior works and by fitting scaling laws to cumulative solve rate curves. Applying SGS to formal theorem proving in Lean4, we find that it surpasses the asymptotic solve rate of our strongest RL baseline in fewer than 80 rounds of self-play and enables a 7B parameter model, after 200 rounds of self-play, to solve more problems than a 671B parameter model pass@4.

new Geometric Layer-wise Approximation Rates for Deep Networks

Authors: Shijun Zhang, Zuowei Shen, Yuesheng Xu

Abstract: Depth is widely viewed as a central contributor to the success of deep neural networks, whereas standard neural network approximation theory typically provides guarantees only for the final output and leaves the role of intermediate layers largely unclear. We address this gap by developing a quantitative framework in which depth admits a precise scale-dependent interpretation. Specifically, we design a single shared mixed-activation architecture of fixed width $2dN+d+2$ and any prescribed finite depth such that each intermediate readout $\Phi_\ell$ is itself an approximant to the target function $f$. For $f\in L^p([0,1]^d)$ with $p\in [1,\infty)$, the approximation error of $\Phi_\ell$ is controlled by $(2d+1)$ times the $L^p$ modulus of continuity at the geometric scale $N^{-\ell}$ for all $\ell$. The estimate reduces to the geometric rate $(2d+1)N^{-\ell}$ if $f$ is $1$-Lipschitz. Our network design is inspired by multigrade deep learning, where depth serves as a progressive refinement mechanism: each new correction targets residual information at a finer scale while the earlier correction terms remain part of the later readouts, yielding a nested architecture that supports adaptive refinement without redesigning the preceding network.

new Machine Learning for Two-Stage Graph Sparsification for the Travelling Salesman Problem

Authors: Bo-Cheng Lin, Yi Mei, Mengjie Zhang

Abstract: High-performance TSP solvers like LKH search within a sparsified candidate graph rather than over all possible edges. Graph sparsification is non-trivial: keep too many edges and the solver wastes time; cut too many and it loses edges that belong to the optimal tour. The two leading heuristic methods, $\alpha$-Nearest and POPMUSIC, produce high-quality candidate graphs, but no single heuristic is both sparse and reliable across all instance sizes and distributions. Machine learning methods can potentially learn better sparsification models. However, existing approaches operate on the complete graph, which is expensive and mostly restricted to Euclidean distances. To address this issue, we propose a two-stage graph sparsification approach: Stage~1 takes the union of $\alpha$-Nearest and POPMUSIC to maximise recall; Stage~2 trains a single model to reduce density. We conducted experiments across four TSPLIB distance types, five spatial distributions, and problem sizes from 50 to 500. The two-stage approach substantially reduces candidate-graph density while retaining high coverage, generalises across distance types and distributions, outperforms recent neural sparsification methods that are restricted to Euclidean distances, and becomes increasingly valuable at larger scales where single-stage heuristics degrade.

new uLEAD-TabPFN: Uncertainty-aware Dependency-based Anomaly Detection with TabPFN

Authors: Sha Lu, Jixue Liu, Stefan Peters, Thuc Duy Le, Craig Xie, Lin Liu, Jiuyong Li

Abstract: Anomaly detection in tabular data is challenging due to high dimensionality, complex feature dependencies, and heterogeneous noise. Many existing methods rely on proximity-based cues and may miss anomalies caused by violations of complex feature dependencies. Dependency-based anomaly detection provides a principled alternative by identifying anomalies as violations of dependencies among features. However, existing methods often struggle to model such dependencies robustly and to scale to high-dimensional data with complex dependency structures. To address these challenges, we propose uLEAD-TabPFN, a dependency-based anomaly detection framework built on Prior-Data Fitted Networks (PFNs). uLEAD-TabPFN identifies anomalies as violations of conditional dependencies in a learned latent space, leveraging frozen PFNs for dependency estimation. Combined with uncertainty-aware scoring, the proposed framework enables robust and scalable anomaly detection. Experiments on 57 tabular datasets from ADBench show that uLEAD-TabPFN achieves particularly strong performance in medium- and high-dimensional settings, where it attains the top average rank. On high-dimensional datasets, uLEAD-TabPFN improves the average ROC-AUC by nearly 20\% over the average baseline and by approximately 2.8\% over the best-performing baseline, while maintaining overall superior performance compared to state-of-the-art methods. Further analysis shows that uLEAD-TabPFN provides complementary anomaly detection capability, achieving strong performance on datasets where many existing methods struggle.

new Causal-Transformer with Adaptive Mutation-Locking for Early Prediction of Acute Kidney Injury

Authors: Weizhi Nie, Haolin Chen

Abstract: Accurate early prediction of Acute Kidney Injury (AKI) is critical for timely clinical intervention. However, existing deep learning models struggle with irregularly sampled data and suffer from the opaque "black-box" nature of sequential architectures, strictly limiting clinical trust. To address these challenges, we propose CT-Former, integrating continuous-time modeling with a Causal-Transformer. To handle data irregularity without biased artificial imputation, our framework utilizes a continuous-time state evolution mechanism to naturally track patient temporal trajectories. To resolve the black-box problem, our Causal-Attention module abandons uninterpretable hidden state aggregation. Instead, it generates a directed structural causal matrix to identify and trace the exact historical onset of severe physiological shocks. By establishing clear causal pathways between historical anomalies and current risk predictions, CT-Former provides native clinical interpretability. Training follows a decoupled two-stage protocol to optimize the causal-fusion process independently. Extensive experiments on the MIMIC-IV cohort (N=18,419) demonstrate that CT-Former significantly outperforms state-of-the-art baselines. The results confirm that our explicitly transparent architecture offers an accurate and trustworthy tool for clinical decision-making.

new Rethinking Intrinsic Dimension Estimation in Neural Representations

Authors: Rickmer Schulte, David R\"ugamer

Abstract: The analysis of neural representation has become an integral part of research aiming to better understand the inner workings of neural networks. While there are many different approaches to investigate neural representations, an important line of research has focused on doing so through the lens of intrinsic dimensions (IDs). Although this perspective has provided valuable insights and stimulated substantial follow-up research, important limitations of this approach have remained largely unaddressed. In this paper, we highlight a crucial discrepancy between theory and practice of IDs in neural representations, theoretically and empirically showing that common ID estimators are, in fact, not tracking the true underlying ID of the representation. We contrast this negative result with an investigation of the underlying factors that may drive commonly reported ID-related results on neural representation in the literature. Building on these insights, we offer a new perspective on ID estimation in neural representations.

new Generative Augmentation of Imbalanced Flight Records for Flight Diversion Prediction: A Multi-objective Optimisation Framework

Authors: Karim Aly, Alexei Sharpanskykh, Jacco Hoekstra

Abstract: Flight diversions are rare but high-impact events in aviation, making their reliable prediction vital for both safety and operational efficiency. However, their scarcity in historical records impedes the training of machine learning models utilised to predict them. This study addresses this scarcity gap by investigating how generative models can augment historical flight data with synthetic diversion records to enhance model training and improve predictive accuracy. We propose a multi-objective optimisation framework coupled with automated hyperparameter search to identify optimal configurations for three deep generative models: Tabular Variational Autoencoder (TVAE), Conditional Tabular Generative Adversarial Network (CTGAN), and CopulaGAN, with the Gaussian Copula (GC) model serving as a statistical baseline. The quality of the synthetic data was examined through a six-stage evaluation framework encompassing realism, diversity, operational validity, statistical similarity, fidelity, and predictive utility. Results show that the optimised models significantly outperform their non-optimised counterparts, and that synthetic augmentation substantially improves diversion prediction compared to models trained solely on real data. These findings demonstrate the effectiveness of hyperparameter-optimised generative models for advancing predictive modelling of rare events in air transportation.

new Synthetic Flight Data Generation Using Generative Models

Authors: Karim Aly, Alexei Sharpanskykh

Abstract: The increasing adoption of synthetic data in aviation research offers a promising solution to data scarcity and confidentiality challenges. This study investigates the potential of generative models to produce realistic synthetic flight data and evaluates their quality through a comprehensive four-stage assessment framework. The need for synthetic flight data arises from their potential to serve as an alternative to confidential real-world records and to augment rare events in historical datasets. These enhanced datasets can then be used to train machine learning models that predict critical events, such as flight delays, cancellations, diversions, and turnaround times. Two generative models, Tabular Variational Autoencoder (TVAE) and Gaussian Copula (GC), are adapted to generate synthetic flight information and compared based on their ability to preserve statistical similarity, fidelity, diversity, and predictive utility. Results indicate that while GC achieves higher statistical similarity and fidelity, its computational cost hinders its applicability to large datasets. In contrast, TVAE efficiently handles large datasets and enables scalable synthetic data generation. The findings demonstrate that synthetic data can support flight delay prediction models with accuracy comparable to those trained on real data. These results pave the way for leveraging synthetic flight data to enhance predictive modeling in air transportation.

new Sheaf Neural Networks on SPD Manifolds: Second-Order Geometric Representation Learning

Authors: Yuhan Peng, Junwen Dong, Yuzhi Zeng, Hao Li, Ce Ju, Huitao Feng, Diaaeldin Taha, Anna Wienhard, Kelin Xia

Abstract: Graph neural networks face two fundamental challenges rooted in the linear structure of Euclidean vector spaces: (1) Current architectures represent geometry through vectors (directions, gradients), yet many tasks require matrix-valued representations that capture relationships between directions-such as how atomic orientations covary in a molecule. These second-order representations are naturally captured by points on the symmetric positive definite matrices (SPD) manifold; (2) Standard message passing applies shared transformations across edges. Sheaf neural networks address this via edge-specific transformations, but existing formulations remain confined to vector spaces and therefore cannot propagate matrix-valued features. We address both challenges by developing the first sheaf neural network operates natively on the SPD manifold. Our key insight is that the SPD manifold admits a Lie group structure, enabling well-posed analogs of sheaf operators without projecting to Euclidean space. Theoretically, we prove that SPD-valued sheaves are strictly more expressive than Euclidean sheaves: they admit consistent configurations (global sections) that vector-valued sheaves cannot represent, directly translating to richer learned representations. Empirically, our sheaf convolution transforms effectively rank-1 directional inputs into full-rank matrices encoding local geometric structure. Our dual-stream architecture achieves SOTA on 6/7 MoleculeNet benchmarks, with the sheaf framework providing consistent depth robustness.

new Formalising the Logit Shift Induced by LoRA: A Technical Note

Authors: Xiang Shi, Shuaizhi Cheng, Mingwei Li

Abstract: This technical note provides a first-order formalisation of the logit shift and fact-margin change induced by Low-Rank Adaptation (LoRA). Using a first-order Fr\'echet approximation around the base model trajectory, we show that the multi-layer LoRA effect can be decomposed into a linear summation of layerwise contributions and a higher-order remainder term representing inter-layer coupling.

new R2IF: Aligning Reasoning with Decisions via Composite Rewards for Interpretable LLM Function Calling

Authors: Aijia Cheng, Kailong Wang, Ling Shi, Yongxin Zhao

Abstract: Function calling empowers large language models (LLMs) to interface with external tools, yet existing RL-based approaches suffer from misalignment between reasoning processes and tool-call decisions. We propose R2IF, a reasoning-aware RL framework for interpretable function calling, adopting a composite reward integrating format/correctness constraints, Chain-of-Thought Effectiveness Reward (CER), and Specification-Modification-Value (SMV) reward, optimized via GRPO. Experiments on BFCL/ACEBench show R2IF outperforms baselines by up to 34.62% (Llama3.2-3B on BFCL) with positive Average CoT Effectiveness (0.05 for Llama3.2-3B), enhancing both function-calling accuracy and interpretability for reliable tool-augmented LLM deployment.

new Cold-Start Forecasting of New Product Life-Cycles via Conditional Diffusion Models

Authors: Ruihan Zhou, Zishi Zhang, Jinhui Han, Yijie Peng, Xiaowei Zhang

Abstract: Forecasting the life-cycle trajectory of a newly launched product is important for launch planning, resource allocation, and early risk assessment. This task is especially difficult in the pre-launch and early post-launch phases, when product-specific outcome history is limited or unavailable, creating a cold-start problem. In these phases, firms must make decisions before demand patterns become reliably observable, while early signals are often sparse, noisy, and unstable We propose the Conditional Diffusion Life-cycle Forecaster (CDLF), a conditional generative framework for forecasting new-product life-cycle trajectories under cold start. CDLF combines three sources of information: static descriptors, reference trajectories from similar products, and newly arriving observations when available. Here, static descriptors refer to structured pre-launch characteristics of the product, such as category, price tier, brand or organization identity, scale, and access conditions. This structure allows the model to condition forecasts on relevant product context and to update them adaptively over time without retraining, yielding flexible multi-modal predictive distributions under extreme data scarcity. The method satisfies consistency with a horizon-uniform distributional error bound for recursive generation. Across studies on Intel microprocessor stock keeping unit (SKU) life cycles and the platform-mediated adoption of open large language model repositories, CDLF delivers more accurate point forecasts and higher-quality probabilistic forecasts than classical diffusion models, Bayesian updating approaches, and other state-of-the-art machine-learning baselines.

new Towards Event-Aware Forecasting in DeFi: Insights from On-chain Automated Market Maker Protocols

Authors: Huaiyu Jia, Jiehshun You, Yizhi Luo, Jingyu Liu, Shuo Sun

Abstract: Automated Market Makers (AMMs), as a core infrastructure of decentralized finance (DeFi), uniquely drive on-chain asset pricing through a deterministic reserve ratio mechanism. Unlike traditional markets, AMM price dynamics is triggered largely by on-chain events (e.g., swap) that change the reserve ratio, rather than by continuous responses to off-chain information. This makes event-level analysis crucial for understanding price formation mechanisms in AMMs. However, existing research generally neglects the micro-structural dynamics at the AMMs level, lacking both a comprehensive dataset covering multiple protocols with fine-grained event classification and an effective framework for event-aware modeling. To fill this gap, we construct a dataset containing 8.9 million on-chain event records from four representative AMMs protocols: Pendle, Uniswap v3, Aave and Morpho, with precise annotations of transaction type and block height timestamps. Furthermore, we propose an Uncertainty Weighted Mean Squared Error (UWM) loss function, which incorporates the block interval regression term into the traditional Time-Point Process (TPP) objective function by weighting the uncertainty with homoscedasticity. Extensive experiments on eight advanced TPP architectures demonstrate that this loss function reduces the time prediction error by an average of 56.41\% while maintaining the accuracy of event type prediction, establishing a robust benchmark for event-aware prediction in the AMMs ecosystem. This work provides the necessary data foundation and methodological framework for modeling the discreteness and event-driven characteristics of on-chain price discovery. All datasets and source code are publicly available. https://github.com/yosen-king/Deep-AMM-Events

URLs: https://github.com/yosen-king/Deep-AMM-Events

new Distributional Value Estimation Without Target Networks for Robust Quality-Diversity

Authors: Behrad Koohy, Jamie Bayne

Abstract: Quality-Diversity (QD) algorithms excel at discovering diverse repertoires of skills, but are hindered by poor sample efficiency and often require tens of millions of environment steps to solve complex locomotion tasks. Recent advances in Reinforcement Learning (RL) have shown that high Update-to-Data (UTD) ratios accelerate Actor-Critic learning. While effective, standard high-UTD algorithms typically utilise target networks to stabilise training. This requirement introduces a significant computational bottleneck, rendering them impractical for resource-intensive Quality-Diversity (QD) tasks where sample efficiency and rapid population adaptation are critical. In this paper, we introduce QDHUAC, a sample-efficient, target-free and distributional QD-RL algorithm that provides dense and low-variance gradient signals, which enables high-UTD training for Dominated Novelty Search whilst requiring an order of magnitude fewer environment steps. We demonstrate that our method enables stable training at high UTD ratios, achieving competitive coverage and fitness on high-dimensional Brax environments with an order of magnitude fewer samples than baselines. Our results suggest that combining target-free distributional critics with dominance-based selection is a key enabler for the next generation of sample-efficient evolutionary RL algorithms.

new Robustness of Spatio-temporal Graph Neural Networks for Fault Location in Partially Observable Distribution Grids

Authors: Burak Karabulut, Carlo Manna, Chris Develder

Abstract: Fault location in distribution grids is critical for reliability and minimizing outage durations. Yet, it remains challenging due to partial observability, given sparse measurement infrastructure. Recent works show promising results by combining Recurrent Neural Networks (RNNs) and Graph Neural Networks (GNNs) for spatio-temporal learning. Still, many modern GNN architectures remain untested for this grid application, while existing GNN solutions have not explored GNN topology definitions beyond simply adopting the full grid topology to construct the GNN graph. We address these gaps by (i) systematically comparing a newly proposed graph-forming strategy (measured-only) to the traditional full-topology approach, and (ii) introducing STGNN (Spatio-temporal GNN) models based on GraphSAGE and an improved Graph Attention (GATv2), for distribution grid fault location; (iii) benchmarking them against state-of-the-art STGNN and RNN baselines on the IEEE 123-bus feeder. In our experiments, all evaluated STGNN variants achieve high performance and consistently outperform a pure RNN baseline, with improvements up to 11 percentage points F1. Among STGNN models, the newly explored RGATv2 and RGSAGE achieve only marginally higher F1 scores. Still, STGNNs demonstrate superior stability, with tight confidence intervals (within +/- 1.4%) compared to the RNN baseline (up to +/- 7.5%) across different experiment runs. Finally, our proposed reduced GNN topology (measured-only) shows clear benefits in both (i) model training time (6-fold reduction) and (ii) model performance (up to 11 points F1). This suggests that measured-only graphs offer a more practical, efficient, and robust framework for partially observable distribution grids.

new Calibrating conditional risk

Authors: Andrey Vasilyev, Yikai Wang, Xiaocheng Li, Guanting Chen

Abstract: We introduce and study the problem of calibrating conditional risk, which involves estimating the expected loss of a prediction model conditional on input features. We analyze this problem in both classification and regression settings and show that it is fundamentally equivalent to a standard regression task. For classification settings, we further establish a connection between conditional risk calibration and individual/conditional probability calibration, and develop theoretical insights for the performance metric. This reveals that while conditional risk calibration is related to existing uncertainty quantification problems, it remains a distinct and standalone machine learning problem. Empirically, we validate our theoretical findings and demonstrate the practical implications of conditional risk calibration in the learning to defer (L2D) framework. Our systematic experiments provide both qualitative and quantitative assessments, offering guidance for future research in uncertainty-aware decision-making.

new Scalable AI Inference: Performance Analysis and Optimization of AI Model Serving

Authors: Hung Cuong Pham, Fatih Gedikli

Abstract: AI research often emphasizes model design and algorithmic performance, while deployment and inference remain comparatively underexplored despite being critical for real-world use. This study addresses that gap by investigating the performance and optimization of a BentoML-based AI inference system for scalable model serving developed in collaboration with graphworks.ai. The evaluation first establishes baseline performance under three realistic workload scenarios. To ensure a fair and reproducible assessment, a pre-trained RoBERTa sentiment analysis model is used throughout the experiments. The system is subjected to traffic patterns following gamma and exponential distributions in order to emulate real-world usage conditions, including steady, bursty, and high-intensity workloads. Key performance metrics, such as latency percentiles and throughput, are collected and analyzed to identify bottlenecks in the inference pipeline. Based on the baseline results, optimization strategies are introduced at multiple levels of the serving stack to improve efficiency and scalability. The optimized system is then reevaluated under the same workload conditions, and the results are compared with the baseline using statistical analysis to quantify the impact of the applied improvements. The findings demonstrate practical strategies for achieving efficient and scalable AI inference with BentoML. The study examines how latency and throughput scale under varying workloads, how optimizations at the runtime, service, and deployment levels affect response time, and how deployment in a single-node K3s cluster influences resilience during disruptions.

new Unlocking the Forecasting Economy: A Suite of Datasets for the Full Lifecycle of Prediction Market: [Experiments \& Analysis]

Authors: Huaiyu Jia, Luofeng Zhou, Wentao Zhang, Lin William Cong, Siguang Li, Shuo Sun

Abstract: Prediction markets are markets for trading claims on future events, such as presidential elections, and their prices provide continuously updated signals of collective beliefs. In decentralized platforms such as Polymarket, the market lifecycle spans market creation, token registration, trading, oracle interaction, dispute, and final settlement, yet the corresponding data are fragmented across heterogeneous off-chain and on-chain sources. We present the first continuously maintained dataset suite for the full lifecycle of decentralized prediction markets, built on Polymarket. To address the challenges of large-scale cross-source integration, incomplete linkage, and continuous synchronization, we build a unified relational data system that integrates three canonical layers: market metadata, fill-level trading records, and oracle-resolution events, through identifier resolution, on-chain recovery, and incremental updates. The resulting dataset spans October 2020 to March 2026 and comprises more than 770 thousand market records, over 943 million fill records, and nearly 2 million oracle events. We describe the data model, collection pipeline, and consistency mechanisms that make the dataset reproducible and extensible, and we demonstrate its utility through descriptive analyses of market activity and two downstream case studies: NBA outcome calibration and CPI expectation reconstruction.

new The Origin of Edge of Stability

Authors: Elon Litman

Abstract: Full-batch gradient descent on neural networks drives the largest Hessian eigenvalue to the threshold $2/\eta$, where $\eta$ is the learning rate. This phenomenon, the Edge of Stability, has resisted a unified explanation: existing accounts establish self-regulation near the edge but do not explain why the trajectory is forced toward $2/\eta$ from arbitrary initialization. We introduce the edge coupling, a functional on consecutive iterate pairs whose coefficient is uniquely fixed by the gradient-descent update. Differencing its criticality condition yields a step recurrence with stability boundary $2/\eta$, and a second-order expansion yields a loss-change formula whose telescoping sum forces curvature toward $2/\eta$. The two formulas involve different Hessian averages, but the mean value theorem localizes each to the true Hessian at an interior point of the step segment, yielding exact forcing of the Hessian eigenvalue with no gap. Setting both gradients of the edge coupling to zero classifies fixed points and period-two orbits; near a fixed point, the problem reduces to a function of the half-amplitude alone, which determines which directions support period-two orbits and on which side of the critical learning rate they appear.

new Surrogate Functionals for Machine-Learned Orbital-Free Density Functional Theory

Authors: Roman Remme, Fred A. Hamprecht

Abstract: We introduce surrogate functionals: machine-learned energy functionals for orbital-free density functional theory (OF-DFT) which are defined not by universal fidelity to a physical reference, but merely by the requirement that density optimization with a fixed procedure yields the true ground-state density. Helpfully, training surrogate functionals requires only ground-state densities, no energies or gradients away from the ground state. We here propose a gradient-descent-improvement loss that guarantees exponential convergence of the density to the ground state, and combine it with an adaptive sampling scheme that concentrates learning around the optimization trajectories actually visited during inference. On the QM9 and QMugs benchmarks, surrogate functionals achieve density errors competitive with or improving upon the state of the art for fully supervised machine-learned OF-DFT, while eliminating the need for the $O(N^3)$ orthononormalization step required by prior work, yielding improved runtime scaling for larger systems.

new Efficient Test-Time Inference via Deterministic Exploration of Truncated Decoding Trees

Authors: Xueyan Li, Johannes Zenn, Ekaterina Fadeeva, Guinan Su, Mrinmaya Sachan, Jonas Geiping

Abstract: Self-consistency boosts inference-time performance by sampling multiple reasoning traces in parallel and voting. However, in constrained domains like math and code, this strategy is compute-inefficient because it samples with replacement, repeatedly revisiting the same high-probability prefixes and duplicate completions. We propose Distinct Leaf Enumeration (DLE), a deterministic decoding method that treats truncated sampling as traversal of a pruned decoding tree and systematically enumerates distinct leaves instead of sampling with replacement. This strategy improves inference efficiency in two ways. Algorithmically, it increases coverage of the truncated search space under a fixed budget by exploring previously unvisited high-probability branches. Systemically, it reuses shared prefixes and reduces redundant token generation. Empirically, DLE explores higher-quality reasoning traces than stochastic self-consistency, yielding better performance on math, coding, and general reasoning tasks.

new Explicit Dropout: Deterministic Regularization for Transformer Architectures

Authors: Vidhi Agrawal, Illia Oleksiienko, Alexandros Iosifidis

Abstract: Dropout is a widely used regularization technique in deep learning, but its effects are typically realized through stochastic masking rather than explicit optimization objectives. We propose a deterministic formulation that expresses dropout as an additive regularizer directly incorporated into the training loss. The framework derives explicit regularization terms for Transformer architectures, covering attention query, key, value, and feed-forward components with independently controllable strengths. This formulation removes reliance on stochastic perturbations while providing clearer and fine-grained control over regularization strength. Experiments across image classification, temporal action detection, and audio classification show that explicit dropout matches or outperforms conventional implicit methods, with consistent gains when applied to attention and feed-forward network layers. Ablation studies demonstrate stable performance and controllable regularization through regularization coefficients and dropout rates. Overall, explicit dropout offers a practical and interpretable alternative to stochastic regularization while maintaining architectural flexibility across diverse tasks.

new CHASM: Unveiling Covert Advertisements on Chinese Social Media

Authors: Jingyi Zheng, Tianyi Hu, Yule Liu, Zhen Sun, Zongmin Zhang, Zifan Peng, Wenhan Dong, Xinlei He

Abstract: Current benchmarks for evaluating large language models (LLMs) in social media moderation completely overlook a serious threat: covert advertisements, which disguise themselves as regular posts to deceive and mislead consumers into making purchases, leading to significant ethical and legal concerns. In this paper, we present the CHASM, a first-of-its-kind dataset designed to evaluate the capability of Multimodal Large Language Models (MLLMs) in detecting covert advertisements on social media. CHASM is a high-quality, anonymized, manually curated dataset consisting of 4,992 instances, based on real-world scenarios from the Chinese social media platform Rednote. The dataset was collected and annotated under strict privacy protection and quality control protocols. It includes many product experience sharing posts that closely resemble covert advertisements, making the dataset particularly challenging.The results show that under both zero-shot and in-context learning settings, none of the current MLLMs are sufficiently reliable for detecting covert advertisements.Our further experiments revealed that fine-tuning open-source MLLMs on our dataset yielded noticeable performance gains. However, significant challenges persist, such as detecting subtle cues in comments and differences in visual and textual structures.We provide in-depth error analysis and outline future research directions. We hope our study can serve as a call for the research community and platform moderators to develop more precise defenses against this emerging threat.

new Amortized Vine Copulas for High-Dimensional Density and Information Estimation

Authors: Houman Safaai

Abstract: Modeling high-dimensional dependencies while keeping likelihoods tractable remains challenging. Classical vine-copula pipelines are interpretable but can be expensive, while many neural estimators are flexible but less structured. In this work, we propose Vine Denoising Copula (VDC), an amortized vine-copula pipeline that trains a single bivariate denoising model and reuses it across all vine edges. For each edge, given pseudo-observations, the model predicts a density grid. We then apply an IPFP/Sinkhorn projection that enforces non-negativity, unit mass, and uniform marginals. This keeps the exact vine likelihood and preserves the usual copula interpretation while replacing repeated per-edge optimization with GPU inference. Across synthetic and real-data benchmarks, VDC delivers strong bivariate density accuracy, competitive MI/TC estimation, and substantial speedups for high-dimensional vine fitting. In practice, these gains make explicit information estimation and dependence decomposition feasible at scales where repeated vine fitting would otherwise be costly, although conditional downstream inference remains mixed.

new A Hierarchical MARL-Based Approach for Coordinated Retail P2P Trading and Wholesale Market Participation of DERs

Authors: Patrick Wilk, Ethan Cantor, Yikui Liu, Jie Li

Abstract: The ongoing shift towards decentralization of the electric energy sector, driven by the growing electrification across end-use sectors, and widespread adoption of distributed energy resources (DERs), necessitates their active participation in the electricity markets to support grid operations. Furthermore, with bi-directional energy and communication flows becoming standard, intelligent, easy-to-deploy, resource-conservative demand-side participation is expected to play a critical role in securing power grid operational flexibility and market efficiency. This work proposes a market engagement framework that leverages a hierarchical multi-agent deep reinforcement learning (MARL) approach to enable individual prosumers to participate in peer-to-peer retail auctions and further aggregate these intelligent prosumers to facilitate effective DER participation in wholesale markets. Ultimately, a Stackelberg game is proposed to coordinate this hierarchical MARL-based DER market participation framework toward enhanced market performance.

new Differentially Private Clustered Federated Learning with Privacy-Preserving Initialization and Normality-Driven Aggregation

Authors: Jie Xu, Haaris Mehmood, Rogier Van Dalen, Karthikeyan Saravanan, Mete Ozay

Abstract: Federated learning (FL) enables training of a global model while keeping raw data on end-devices. Despite this, FL has shown to leak private user information and thus in practice, it is often coupled with methods such as differential privacy (DP) and secure vector sum to provide formal privacy guarantees to its participants. In realistic cross-device deployments, the data are highly heterogeneous, so vanilla federated learning converges slowly and generalizes poorly. Clustered federated learning (CFL) mitigates this by segregating users into clusters, leading to lower intra-cluster data heterogeneity. Nevertheless, coupling CFL with DP remains challenging: the injected DP noise makes individual client updates excessively noisy, and the server is unable to initialize cluster centroids with the less noisy aggregated updates. To address this challenge, we propose PINA, a two-stage framework that first lets each client fine-tune a lightweight low-rank adaptation (LoRA) adapter and privately share a compressed sketch of the update. The server leverages these sketches to construct robust cluster centroids. In the second stage, PINA introduces a normality-driven aggregation mechanism that improves convergence and robustness. Our method retains the benefits of clustered FL while providing formal privacy guarantees against an untrusted server. Extensive evaluations show that our proposed method outperforms state-of-the-art DP-FL algorithms by an average of 2.9% in accuracy for privacy budgets (epsilon in {2, 8}).

new Too Sharp, Too Sure: When Calibration Follows Curvature

Authors: Alessandro Morosini, Matea Gjika, Tomaso Poggio, Pierfrancesco Beneventano

Abstract: Modern neural networks can achieve high accuracy while remaining poorly calibrated, producing confidence estimates that do not match empirical correctness. Yet calibration is often treated as a post-hoc attribute. We take a different perspective: we study calibration as a training-time phenomenon on small vision tasks, and ask whether calibrated solutions can be obtained reliably by intervening on the training procedure. We identify a tight coupling between calibration, curvature, and margins during training of deep networks under multiple gradient-based methods. Empirically, Expected Calibration Error (ECE) closely tracks curvature-based sharpness throughout optimization. Mathematically, we show that both ECE and Gauss--Newton curvature are controlled, up to problem-specific constants, by the same margin-dependent exponential tail functional along the trajectory. Guided by this mechanism, we introduce a margin-aware training objective that explicitly targets robust-margin tails and local smoothness, yielding improved out-of-sample calibration across optimizers without sacrificing accuracy.

new Occupancy Reward Shaping: Improving Credit Assignment for Offline Goal-Conditioned Reinforcement Learning

Authors: Aravind Venugopal, Jiayu Chen, Xudong Wu, Chongyi Zheng, Benjamin Eysenbach, Jeff Schneider

Abstract: The temporal lag between actions and their long-term consequences makes credit assignment a challenge when learning goal-directed behaviors from data. Generative world models capture the distribution of future states an agent may visit, indicating that they have captured temporal information. How can that temporal information be extracted to perform credit assignment? In this paper, we formalize how the temporal information stored in world models encodes the underlying geometry of the world. Leveraging optimal transport, we extract this geometry from a learned model of the occupancy measure into a reward function that captures goal-reaching information. Our resulting method, Occupancy Reward Shaping, largely mitigates the problem of credit assignment in sparse reward settings. ORS provably does not alter the optimal policy, yet empirically improves performance by 2.2x across 13 diverse long-horizon locomotion and manipulation tasks. Moreover, we demonstrate the effectiveness of ORS in the real world for controlling nuclear fusion on 3 Tokamak control tasks. Code: https://github.com/aravindvenu7/occupancy_reward_shaping; Website: https://aravindvenu7.github.io/website/ors/

URLs: https://github.com/aravindvenu7/occupancy_reward_shaping;, https://aravindvenu7.github.io/website/ors/

new GRPO-VPS: Enhancing Group Relative Policy Optimization with Verifiable Process Supervision for Effective Reasoning

Authors: Jingyi Wang, Lei Zhu, Tengjin Weng, Song-Li Wu, Haochen Tan, Jierun Chen, Chaofan Tao, Haoli Bai, Lu Hou, Lifeng Shang, Xiao-Ping Zhang

Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has advanced the reasoning capabilities of Large Language Models (LLMs) by leveraging direct outcome verification instead of learned reward models. Building on this paradigm, Group Relative Policy Optimization (GRPO) eliminates the need for critic models but suffers from indiscriminate credit assignment for intermediate steps, which limits its ability to identify effective reasoning strategies and incurs overthinking. In this work, we introduce a model-free and verifiable process supervision via probing the model's belief in the correct answer throughout its reasoning trajectory. By segmenting the generation into discrete steps and tracking the conditional probability of the correct answer appended at each segment boundary, we efficiently compute interpretable segment-wise progress measurements to refine GRPO's trajectory-level feedback. This approach enables more targeted and sample-efficient policy updates, while avoiding the need for intermediate supervision derived from costly Monte Carlo rollouts or auxiliary models. Experiments on mathematical and general-domain benchmarks show consistent gains over GRPO across diverse models: up to 2.6-point accuracy improvements and 13.7% reasoning-length reductions on math tasks, and up to 2.4 points and 4% on general-domain tasks, demonstrating strong generalization.

new Improving clinical interpretability of linear neuroimaging models through feature whitening

Authors: Sara Petiton, Antoine Grigis, Rapha\"el Vock, Edouard Duchesnay

Abstract: Linear models are widely used in computational neuroimaging to identify biomarkers associated with brain pathologies. However, interpreting the learned weights remains challenging, as they do not always yield clinically meaningful insights. This difficulty arises in part from the inherent correlation between brain regions, which causes linear weights to reflect shared rather than region-specific contributions. In particular, some groups of regions, including homologous structures in the left and right hemispheres, are known to exhibit strong anatomical correlations. In this work, we leverage this prior neuroanatomical knowledge to introduce a whitening approach applied to groups of regions with known shared variance, designed to disentangle overlapping information across correlated brain measures. We additionally propose a regularized variant that allows controlled tuning of the degree of decorrelation. We evaluate this method using region-of-interest features in two psychiatric classification tasks, distinguishing individuals with bipolar disorder or schizophrenia from healthy controls. Importantly, unlike PCA or ICA which use whitening as a dimensionality reduction step, our approach decorrelates anatomically informed pairs of neuroanatomical regions while retaining the full input signal, making it specifically suited for feature interpretation rather than feature selection. Our findings demonstrate that whitening improves the interpretability of model weights while preserving predictive performance, providing a robust framework for linking linear model outputs to neurobiological mechanisms.

new Variance Is Not Importance: Structural Analysis of Transformer Compressibility Across Model Scales

Authors: Samuel Salfati

Abstract: We present a systematic empirical study of transformer compression through over 40 experiments on GPT-2 (124M parameters) and Mistral 7B (7.24B parameters). Our analysis covers spectral compression, block-level function replacement, rotation-based quantization, activation geometry, and adaptive early exit. We identify five structural properties relevant to compression. (1) Variance is not importance: high-variance activation directions are approximately 96 percent uncorrelated with predictive directions (measured via CCA), and projecting onto these subspaces preserves over 90 percent of variance while degrading perplexity. (2) Block linearity is conditional: transformer blocks are approximately linear (R^2 ~ 0.95 on GPT-2, 0.93 on Mistral block 31) only under the correct upstream distribution; modifying earlier blocks induces distribution shift that degrades downstream approximations. (3) The reconstruction wall: approaches that factor weights into quantized components amplify errors through cross-terms, making direct quantization strictly superior. (4) Linearity increases with depth: Mistral 7B exhibits a progression from R^2 = 0.17 (block 0) to R^2 = 0.93 (block 31), indicating a division between nonlinear feature construction and linear refinement. (5) Approximately 30 percent of tokens are computationally easy, confirmed via exit heads and KL divergence sensitivity. We demonstrate that single-block linear replacement achieves 34x compression with a 1.71 perplexity increase on the final block of Mistral 7B, while multi-block replacement fails due to residual error accumulation and distribution shift. These findings suggest fundamental limits to static post-training compression and motivate adaptive, per-token computation as a more effective direction.

new MGDA-Decoupled: Geometry-Aware Multi-Objective Optimisation for DPO-based LLM Alignment

Authors: Andor V\'ari-Kakas, Ji Won Park, Natasa Tagasovska

Abstract: Aligning large language models (LLMs) to desirable human values requires balancing multiple, potentially conflicting objectives such as helpfulness, truthfulness, and harmlessness, which presents a multi-objective optimisation challenge. Most alignment pipelines rely on a fixed scalarisation of these objectives, which can introduce procedural unfairness by systematically under-weighting harder-to-optimise or minority objectives. To promote more equitable trade-offs, we introduce MGDA-Decoupled, a geometry-based multi-objective optimisation algorithm that finds a shared descent direction while explicitly accounting for each objective's convergence dynamics. In contrast to prior methods that depend on reinforcement learning (e.g., GAPO) or explicit reward models (e.g., MODPO), our approach operates entirely within the lightweight Direct Preference Optimisation (DPO) paradigm. Experiments on the UltraFeedback dataset show that geometry-aware methods -- and MGDA-Decoupled in particular -- achieve the highest win rates against golden responses, both overall and per objective.

new StormNet: Improving storm surge predictions with a GNN-based spatio-temporal offset forecasting model

Authors: Noujoud Nader, Stefanos Giaremis, Clint Dawson, Carola Kaiser, Karame Mohammadiporshokooh, Hartmut Kaiser

Abstract: Storm surge forecasting remains a critical challenge in mitigating the impacts of tropical cyclones on coastal regions, particularly given recent trends of rapid intensification and increasing nearshore storm activity. Traditional high fidelity numerical models such as ADCIRC, while robust, are often hindered by inevitable uncertainties arising from various sources. To address these challenges, this study introduces StormNet, a spatio-temporal graph neural network (GNN) designed for bias correction of storm surge forecasts. StormNet integrates graph convolutional (GCN) and graph attention (GAT) mechanisms with long short-term memory (LSTM) components to capture complex spatial and temporal dependencies among water-level gauge stations. The model was trained using historical hurricane data from the U.S. Gulf Coast and evaluated on Hurricane Idalia (2023). Results demonstrate that StormNet can effectively reduce the root mean square error (RMSE) in water-level predictions by more than 70\% for 48-hour forecasts and above 50\% for 72-hour forecasts, as well as outperform a sequential LSTM baseline, particularly for longer prediction horizons. The model also exhibits low training time, enhancing its applicability in real-time operational forecasting systems. Overall, StormNet provides a computationally efficient and physically meaningful framework for improving storm surge prediction accuracy and reliability during extreme weather events.

new Generative Flow Networks for Model Adaptation in Digital Twins of Natural Systems

Authors: Pascal Archambault, Houari Sahraoui, Eugene Syriani

Abstract: Digital twins of natural systems must remain aligned with physical systems that evolve over time, are only partially observed, and are typically modeled by mechanistic simulators whose parameters cannot be measured directly. In such settings, model adaptation is naturally posed as a simulation-based inference problem. However, sparse and indirect observations often fail to identify a unique and optimal calibration, leaving several simulator parameterizations compatible with the available evidence. This article presents a GFlowNet-based approach to model adaptation for digital twins of natural systems. We formulate adaptation as a generative modeling problem over complete simulator configurations, so that plausible parameterizations can be sampled with probability proportional to a reward derived from agreement between simulated and observed behavior. Using a controlled environment agriculture case study based on a mechanistic tomato model, we show that the learned policy recovers dominant regions of the adaptation landscape, retrieves strong calibration hypotheses, and preserves multiple plausible configurations under uncertainty.

new COMPASS: COntinual Multilingual PEFT with Adaptive Semantic Sampling

Authors: Noah Flynn

Abstract: Large language models (LLMs) often exhibit performance disparities across languages, with naive multilingual fine-tuning frequently degrading performance due to negative cross-lingual interference. To address this, we introduce COMPASS (COntinual Multilingual PEFT with Adaptive Semantic Sampling), a novel data-centric framework for adapting LLMs to target languages. COMPASS leverages parameter-efficient fine-tuning (PEFT) by training lightweight, language-specific adapters on a judiciously selected subset of auxiliary multilingual data. The core of our method is a distribution-aware sampling strategy that uses multilingual embeddings and clustering to identify semantic gaps between existing training data and a target usage distribution. By prioritizing auxiliary data from under-represented semantic clusters, COMPASS maximizes positive cross-lingual transfer while minimizing interference. We extend this into a continual learning framework, COMPASS-ECDA, which monitors for data distribution shifts in production and dynamically updates adapters to prevent model staleness, balancing adaptation to new data with the preservation of existing knowledge. Across three different model architectures (Phi-4-Mini, Llama-3.1-8B, and Qwen2.5-7B) and multiple challenging multilingual benchmarks (Global-MMLU, MMLU-ProX), including unseen long-context tasks (OneRuler), we demonstrate that COMPASS consistently outperforms baseline methods guided by linguistic similarity, providing an effective, efficient, and sustainable solution for developing and maintaining high-performing multilingual models in dynamic environments.

new Tokenised Flow Matching for Hierarchical Simulation Based Inference

Authors: Giovanni Charles, Cosmo Santoni, Seth Flaxman, Elizaveta Semenova

Abstract: The cost of simulator evaluations is a key practical bottleneck for Simulation Based Inference (SBI). In hierarchical settings with shared global parameters and exchangeable site-level parameters and observations, this structure can be exploited to improve simulation efficiency. Existing hierarchical SBI approaches factorise the posterior yet still simulate across multiple sites per training sample; We instead explore likelihood factorisation (LF) to train from single-site simulations. In LF sampling we learn a per-site neural surrogate of the simulator and then assemble synthetic multi-site observations to amortise inference for the full hierarchical posterior. Building on this, we propose Tokenised Flow Matching for Posterior Estimation (TFMPE), a tokenised flow matching approach that supports function-valued observations through likelihood factorisation. To enable systematic evaluation, we introduce a benchmark for hierarchical SBI. We validate TFMPE on this benchmark and on realistic infectious disease and computational fluid dynamics models, finding well-calibrated posteriors while reducing computational cost.

new Supplement Generation Training for Enhancing Agentic Task Performance

Authors: Young Min Cho, Daniele Bonadiman, Divya Bhargavi, Tamer Alkhouli, Salvatore Romeo, Dongwei Jiang, Khushbu Pahwa, Yubin Ge, Etsuko Ishii, Monica Sunkara, Yi Zhang

Abstract: Training large foundation models for agentic tasks is increasingly impractical due to the high computational costs, long iteration cycles, and rapid obsolescence as new models are continuously released. Instead of post-training massive models for every new task or domain, we propose Supplement Generation Training (SGT), a more efficient and sustainable strategy. SGT trains a smaller LLM to generate useful supplemental text that, when appended to the original input, helps the larger LLM solve the task more effectively. These lightweight models can dynamically adapt supplements to task requirements, improving performance without modifying the underlying large models. This approach decouples task-specific optimization from large foundation models and enables more flexible, cost-effective deployment of LLM-powered agents in real-world applications.

new Near-Future Policy Optimization

Authors: Chuanyu Qin, Chenxu Yang, Qingyi Si, Naibin Gu, Dingyu Yao, Zheng Lin, Peng Fu, Nan Duan, Jiaqi Wang

Abstract: Reinforcement learning with verifiable rewards (RLVR) has become a core post-training recipe. Introducing suitable off-policy trajectories into on-policy exploration accelerates RLVR convergence and raises the performance ceiling, yet finding a source of such trajectories remains the key challenge. Existing mixed-policy methods either import trajectories from external teachers (high-quality but distributionally far) or replay past training trajectories (close but capped in quality), and neither simultaneously satisfies the strong enough (higher $Q$ , more new knowledge to learn) and close enough (lower $V$ , more readily absorbed) conditions required to maximize the effective learning signal $\mathcal{S} = Q/V$. We propose \textbf{N}ear-Future \textbf{P}olicy \textbf{O}ptimization (\textbf{NPO}), a simple mixed-policy scheme that learns from a policy's own near-future self: a later checkpoint from the same training run is a natural source of auxiliary trajectories that is both stronger than the current policy and closer than any external source, directly balancing trajectory quality against variance cost. We validate NPO through two manual interventions, early-stage bootstrapping and late-stage plateau breakthrough, and further propose \textbf{AutoNPO},an adaptive variant that automatically triggers interventions from online training signals and selects the guide checkpoint that maximizes $S$. On Qwen3-VL-8B-Instruct with GRPO, NPO improves average performance from 57.88 to 62.84, and AutoNPO pushes it to 63.15, raising the final performance ceiling while accelerating convergence.

new Fast Bayesian equipment condition monitoring via simulation based inference: applications to heat exchanger health

Authors: Peter Collett, Alexander Johannes Stasik, Simone Casolo, Signe Riemer-S{\o}rensen

Abstract: Accurate condition monitoring of industrial equipment requires inferring latent degradation parameters from indirect sensor measurements under uncertainty. While traditional Bayesian methods like Markov Chain Monte Carlo (MCMC) provide rigorous uncertainty quantification, their heavy computational bottlenecks render them impractical for real-time process control. To overcome this limitation, we propose an AI-driven framework utilizing Simulation-Based Inference (SBI) powered by amortized neural posterior estimation to diagnose complex failure modes in heat exchangers. By training neural density estimators on a simulated dataset, our approach learns a direct, likelihood-free mapping from thermal-fluid observations to the full posterior distribution of degradation parameters. We benchmark this framework against an MCMC baseline across various synthetic fouling and leakage scenarios, including challenging low-probability, sparse-event failures. The results show that SBI achieves comparable diagnostic accuracy and reliable uncertainty quantification, while accelerating inference time by a factor of82$\times$ compared to traditional sampling. The amortized nature of the neural network enables near-instantaneous inference, establishing SBI as a highly scalable, real-time alternative for probabilistic fault diagnosis and digital twin realization in complex engineering systems.

new F\textsuperscript{2}LP-AP: Fast \& Flexible Label Propagation with Adaptive Propagation Kernel

Authors: Yutong Shen, Ruizhe Xia, Jingyi Liu, Yinqi Liu

Abstract: Semi-supervised node classification is a foundational task in graph machine learning, yet state-of-the-art Graph Neural Networks (GNNs) are hindered by significant computational overhead and reliance on strong homophily assumptions. Traditional GNNs require expensive iterative training and multi-layer message passing, while existing training-free methods, such as Label Propagation, lack adaptability to heterophilo\-us graph structures. This paper presents \textbf{F$^2$LP-AP} (Fast and Flexible Label Propagation with Adaptive Propagation Kernel), a training-free, computationally efficient framework that adapts to local graph topology. Our method constructs robust class prototypes via the geometric median and dynamically adjusts propagation parameters based on the Local Clustering Coefficient (LCC), enabling effective modeling of both homophilous and heterophilous graphs without gradient-based training. Extensive experiments across diverse benchmark datasets demonstrate that \textbf{F$^2$LP-AP} achieves competitive or superior accuracy compared to trained GNNs, while significantly outperforming existing baselines in computational efficiency.

new Lifecycle-Aware Federated Continual Learning in Mobile Autonomous Systems

Authors: Beining Wu, Jun Huang

Abstract: Federated continual learning (FCL) allows distributed autonomous fleets to adapt collaboratively to evolving terrain types across extended mission lifecycles. However, current approaches face several key challenges: 1) they use uniform protection strategies that do not account for the varying sensitivities to forgetting on different network layers; 2) they focus primarily on preventing forgetting during training, without addressing the long-term effects of cumulative drift; and 3) they often depend on idealized simulations that fail to capture the real-world heterogeneity present in distributed fleets. In this paper, we propose a lifecycle-aware dual-timescale FCL framework that incorporates training-time (pre-forgetting) prevention and (post-forgetting) recovery. Under this framework, we design a layer-selective rehearsal strategy that mitigates immediate forgetting during local training, and a rapid knowledge recovery strategy that restores degraded models after long-term cumulative drift. We present a theoretical analysis that characterizes heterogeneous forgetting dynamics and establishes the inevitability of long-term degradation. Our experimental results show that this framework achieves up to 8.3\% mIoU improvement over the strongest federated baseline and up to 31.7\% over conventional fine-tuning. We also deploy the FCL framework on a real-world rover testbed to assess system-level robustness under realistic constraints; the testing results further confirm the effectiveness of our FCL design.

new Relative Entropy Estimation in Function Space: Theory and Applications to Trajectory Inference

Authors: Chao Wang, Luca Nepote, Giulio Franzese, Pietro Michiardi

Abstract: Trajectory Inference (TI) seeks to recover latent dynamical processes from snapshot data, where only independent samples from time-indexed marginals are observed. In applications such as single-cell genomics, destructive measurements make path-space laws non-identifiable from finitely many marginals, leaving held-out marginal prediction as the dominant but limited evaluation protocol. We introduce a general framework for estimating the Kullback-Leibler divergence (KL) divergence between probability measures on function space, yielding a tractable, data-driven estimator that is scalable to realistic snapshot datasets. We validate the accuracy of our estimator on a benchmark suite, where the estimated functional KL closely matches the analytic KL. Applying this framework to synthetic and real scRNA-seq datasets, we show that current evaluation metrics often give inconsistent assessments, whereas path-space KL enables a coherent comparison of trajectory inference methods and exposes discrepancies in inferred dynamics, especially in regions with sparse or missing data. These results support functional KL as a principled criterion for evaluating trajectory inference under partial observability.

new Efficient Multi-Cohort Inference for Long-Term Effects and Lifetime Value in A/B Testing with User Learning

Authors: Dario Simionato, Andrea Tonon, Mingxue Wang, Weiguo Wang, Tong Gui, Xiaoyue Li

Abstract: In streaming platforms churn is extremely costly, yet A/B tests are typically evaluated using outcomes observed within a limited experimental horizon. Even when both short- and predicted long-term engagement metrics are considered, they may fail to capture how a treatment affects users' retention. Consequently, an intervention may appear beneficial in the short term and neutral in the long term while still generating lower total value than the control due to users churn. To address this limitation, we introduce a method that estimates long-term treatment effects (LTE) and residual lifetime value change ($\Delta ERLV$) in short multi-cohort A/B tests under user learning. To estimate time-varying treatment effects efficiently, we introduce an inverse-variance weighted estimator that combines multiple cohorts estimates, reducing variance relative to standard approaches in the literature. The estimated treatment trajectory is then modeled as a parametric decay to recover both the asymptotic treatment effect and the cumulative value generated over time. Our framework enables simultaneous evaluation of steady-state impact and residual user value within a single experiment. Empirical results show improved precision in estimating LTE and $\Delta ERLV$ and identify scenarios in which relying on either short-term or long-term metrics alone would lead to incorrect product decisions.

new Physics-Conditioned Synthesis of Internal Ice-Layer Thickness for Incomplete Layer Traces

Authors: Zesheng Liu, Maryam Rahnemoonfar

Abstract: Internal ice layers imaged by radar provide key evidence of snow accumulation and ice dynamics, but radar-derived layer boundary observations are often incomplete, with discontinuous traces and sometimes entirely missing layers, due to limited resolution, sensor noise, and signal loss. Existing graph-based models for ice stratigraphy generally assume sufficiently complete layer profiles and focus on predicting deeper-layer thickness from reliably traced shallow layers. In this work, we address the layer-completion problem itself by synthesizing complete ice-layer thickness annotations from incomplete radar-derived layer traces by conditioning on colocated physical features synchronized from physical climate models. The proposed network combines geometric learning to aggregate within-layer spatial context with a transformer-based temporal module that propagates information across layers to encourage coherent stratigraphy and consistent thickness evolution. To learn from incomplete supervision, we optimize a mask-aware robust regression objective that evaluates errors only at observed thickness values and normalizes by the number of valid entries, enabling stable training under varying sparsity without imputation and steering completions toward physically plausible values. The model preserves observed thickness where available and infers only missing regions, recovering fragmented segments and even fully absent layers while remaining consistent with measured traces. As an additional benefit, the synthesized thickness stacks provide effective pretraining supervision for a downstream deep-layer predictor, improving fine-tuned accuracy over training from scratch on the same fully traced data.

new ParetoSlider: Diffusion Models Post-Training for Continuous Reward Control

Authors: Shelly Golan, Michael Finkelson, Ariel Bereslavsky, Yotam Nitzan, Or Patashnik

Abstract: Reinforcement Learning (RL) post-training has become the standard for aligning generative models with human preferences, yet most methods rely on a single scalar reward. When multiple criteria matter, the prevailing practice of ``early scalarization'' collapses rewards into a fixed weighted sum. This commits the model to a single trade-off point at training time, providing no inference-time control over inherently conflicting goals -- such as prompt adherence versus source fidelity in image editing. We introduce ParetoSlider, a multi-objective RL (MORL) framework that trains a single diffusion model to approximate the entire Pareto front. By training the model with continuously varying preference weights as a conditioning signal, we enable users to navigate optimal trade-offs at inference time without retraining or maintaining multiple checkpoints. We evaluate ParetoSlider across three state-of-the-art flow-matching backbones: SD3.5, FluxKontext, and LTX-2. Our single preference-conditioned model matches or exceeds the performance of baselines trained separately for fixed reward trade-offs, while uniquely providing fine-grained control over competing generative goals.

new Stream-CQSA: Avoiding Out-of-Memory in Attention Computation via Flexible Workload Scheduling

Authors: Yiming Bian, Joshua M. Akey

Abstract: The scalability of long-context large language models is fundamentally limited by the quadratic memory cost of exact self-attention, which often leads to out-of-memory (OOM) failures on modern hardware. Existing methods improve memory efficiency to near-linear complexity, while assuming that the full query, key, and value tensors fit in device memory. In this work, we remove this assumption by introducing CQS Divide, an operation derived from cyclic quorum sets (CQS) theory that decomposes attention into a set of independent subsequence computations whose recomposition yields exactly the same result as full-sequence attention. Exploiting this decomposition, we introduce Stream-CQSA, a memory-adaptive scheduling framework that partitions attention into subproblems that fit within arbitrary memory budgets. This recasts attention from a logically monolithic operation into a collection of schedulable tasks, enabling flexible execution across devices without inter-device communication. Experiments demonstrate predictable memory scaling and show that exact attention over billion-token sequences can be executed on a single GPU via streaming, without changing the underlying mathematical definition of attention or introducing approximation error.

new Closing the Domain Gap in Biomedical Imaging by In-Context Control Samples

Authors: Ana Sanchez-Fernandez, Thomas Pinetz, Werner Zellinger, G\"unter Klambauer

Abstract: The central problem in biomedical imaging are batch effects: systematic technical variations unrelated to the biological signal of interest. These batch effects critically undermine experimental reproducibility and are the primary cause of failure of deep learning systems on new experimental batches, preventing their practical use in the real world. Despite years of research, no method has succeeded in closing this performance gap for deep learning models. We propose Control-Stabilized Adaptive Risk Minimization via Batch Normalization (CS-ARM-BN), a meta-learning adaptation method that exploits negative control samples. Such unperturbed reference images are present in every experimental batch by design and serve as stable context for adaptation. We validate our novel method on Mechanism-of-Action (MoA) classification, a crucial task for drug discovery, on the large-scale JUMP-CP dataset. The accuracy of standard ResNets drops from 0.939 $\pm$ 0.005, on the training domain, to 0.862 $\pm$ 0.060 on data from new experimental batches. Foundation models, even after Typical Variation Normalization, fail to close this gap. We are the first to show that meta-learning approaches close the domain gap by achieving 0.935 $\pm$ 0.018. If the new experimental batches exhibit strong domain shifts, such as being generated in a different lab, meta-learning approaches can be stabilized with control samples, which are always available in biomedical experiments. Our work shows that batch effects in bioimaging data can be effectively neutralized through principled in-context adaptation, which also makes them practically usable and efficient.

new FedSIR: Spectral Client Identification and Relabeling for Federated Learning with Noisy Labels

Authors: Sina Gholami, Abdulmoneam Ali, Tania Haghighi, Ahmed Arafa, Minhaj Nur Alam

Abstract: Federated learning (FL) enables collaborative model training without sharing raw data; however, the presence of noisy labels across distributed clients can severely degrade the learning performance. In this paper, we propose FedSIR, a multi-stage framework for robust FL under noisy labels. Different from existing approaches that mainly rely on designing noise-tolerant loss functions or exploiting loss dynamics during training, our method leverages the spectral structure of client feature representations to identify and mitigate label noise. Our framework consists of three key components. First, we identify clean and noisy clients by analyzing the spectral consistency of class-wise feature subspaces with minimal communication overhead. Second, clean clients provide spectral references that enable noisy clients to relabel potentially corrupted samples using both dominant class directions and residual subspaces. Third, we employ a noise-aware training strategy that integrates logit-adjusted loss, knowledge distillation, and distance-aware aggregation to further stabilize federated optimization. Extensive experiments on standard FL benchmarks demonstrate that FedSIR consistently outperforms state-of-the-art methods for FL with noisy labels. The code is available at https://github.com/sinagh72/FedSIR.

URLs: https://github.com/sinagh72/FedSIR.

cross Algorithm Selection with Zero Domain Knowledge via Text Embeddings

Authors: Stefan Szeider

Abstract: We propose a feature-free approach to algorithm selection that replaces hand-crafted instance features with pretrained text embeddings. Our method, ZeroFolio, proceeds in three steps: it reads the raw instance file as plain text, embeds it with a pretrained embedding model, and selects an algorithm via weighted k-nearest neighbors. The key to our approach is the observation that pretrained embeddings produce representations that distinguish problem instances without any domain knowledge or task-specific training. This allows us to apply the same three-step pipeline (serialize, embed, select) across diverse problem domains with text-based instance formats. We evaluate our approach on 11 ASlib scenarios spanning 7 domains (SAT, MaxSAT, QBF, ASP, CSP, MIP, and graph problems). Our experiments show that this approach outperforms a random forest trained on hand-crafted features in 10 of 11 scenarios with a single fixed configuration, and in all 11 with two-seed voting; the margin is often substantial. Our ablation study shows that inverse-distance weighting, line shuffling, and Manhattan distance are the key design choices. On scenarios where both selectors are competitive, combining embeddings with hand-crafted features via soft voting yields further improvements.

cross Exploring Data Augmentation and Resampling Strategies for Transformer-Based Models to Address Class Imbalance in AI Scoring of Scientific Explanations in NGSS Classroom

Authors: Prudence Djagba, Kevin Haudek, Clare G. C. Franovic, Leonora Kaldaras

Abstract: Automated scoring of students' scientific explanations offers the potential for immediate, accurate feedback, yet class imbalance in rubric categories particularly those capturing advanced reasoning remains a challenge. This study investigates augmentation strategies to improve transformer-based text classification of student responses to a physical science assessment based on an NGSS-aligned learning progression. The dataset consists of 1,466 high school responses scored on 11 binary-coded analytic categories. This rubric identifies six important components including scientific ideas needed for a complete explanation along with five common incomplete or inaccurate ideas. Using SciBERT as a baseline, we applied fine-tuning and test these augmentation strategies: (1) GPT-4--generated synthetic responses, (2) EASE, a word-level extraction and filtering approach, and (3) ALP (Augmentation using Lexicalized Probabilistic context-free grammar) phrase-level extraction. While fine-tuning SciBERT improved recall over baseline, augmentation substantially enhanced performance, with GPT data boosting both precision and recall, and ALP achieving perfect precision, recall, and F1 scores across most severe imbalanced categories (5,6,7 and 9). Across all rubric categories EASE augmentation substantially increased alignment with human scoring for both scientific ideas (Categories 1--6) and inaccurate ideas (Categories 7--11). We compared different augmentation strategies to a traditional oversampling method (SMOTE) in an effort to avoid overfitting and retain novice-level data critical for learning progression alignment. Findings demonstrate that targeted augmentation can address severe imbalance while preserving conceptual coverage, offering a scalable solution for automated learning progression-aligned scoring in science education.

cross Explainable AML Triage with LLMs: Evidence Retrieval and Counterfactual Checks

Authors: Dorothy Torres, Wei Cheng, Ke Hu

Abstract: Anti-money laundering (AML) transaction monitoring generates large volumes of alerts that must be rapidly triaged by investigators under strict audit and governance constraints. While large language models (LLMs) can summarize heterogeneous evidence and draft rationales, unconstrained generation is risky in regulated workflows due to hallucinations, weak provenance, and explanations that are not faithful to the underlying decision. We propose an explainable AML triage framework that treats triage as an evidence-constrained decision process. Our method combines (i) retrieval-augmented evidence bundling from policy/typology guidance, customer context, alert triggers, and transaction subgraphs, (ii) a structured LLM output contract that requires explicit citations and separates supporting from contradicting or missing evidence, and (iii) counterfactual checks that validate whether minimal, plausible perturbations lead to coherent changes in both the triage recommendation and its rationale. We evaluate on public synthetic AML benchmarks and simulators and compare against rules, tabular and graph machine-learning baselines, and LLM-only/RAG-only variants. Results show that evidence grounding substantially improves auditability and reduces numerical and policy hallucination errors, while counterfactual validation further increases decision-linked explainability and robustness, yielding the best overall triage performance (PR-AUC 0.75; Escalate F1 0.62) and strong provenance and faithfulness metrics (citation validity 0.98; evidence support 0.88; counterfactual faithfulness 0.76). These findings indicate that governed, verifiable LLM systems can provide practical decision support for AML triage without sacrificing compliance requirements for traceability and defensibility.

cross ThermoQA: A Three-Tier Benchmark for Evaluating Thermodynamic Reasoning in Large Language Models

Authors: Kemal D\"uzkar

Abstract: We present ThermoQA, a benchmark of 293 open-ended engineering thermodynamics problems in three tiers: property lookups (110 Q), component analysis (101 Q), and full cycle analysis (82 Q). Ground truth is computed programmatically from CoolProp 7.2.0, covering water, R-134a, and variable-cp air. Six frontier LLMs are evaluated across three independent runs each. The composite leaderboard is led by Claude Opus 4.6 (94.1%), GPT-5.4 (93.1%), and Gemini 3.1 Pro (92.5%). Cross-tier degradation ranges from 2.8 pp (Opus) to 32.5 pp (MiniMax), confirming that property memorization does not imply thermodynamic reasoning. Supercritical water, R-134a refrigerant, and combined-cycle gas turbine analysis serve as natural discriminators with 40-60 pp performance spreads. Multi-run sigma ranges from +/-0.1% to +/-2.5%, quantifying reasoning consistency as a distinct evaluation axis. Dataset and code are open-source at https://huggingface.co/datasets/olivenet/thermoqa

URLs: https://huggingface.co/datasets/olivenet/thermoqa

cross EvoForest: A Novel Machine-Learning Paradigm via Open-Ended Evolution of Computational Graphs

Authors: Kamer Ali Yuksel, Hassan Sawaf

Abstract: Modern machine learning is still largely organized around a single recipe: choose a parameterized model family and optimize its weights. Although highly successful, this paradigm is too narrow for many structured prediction problems, where the main bottleneck is not parameter fitting but discovering what should be computed from the data. Success often depends on identifying the right transformations, statistics, invariances, interaction structures, temporal summaries, gates, or nonlinear compositions, especially when objectives are non-differentiable, evaluation is cross-validation-based, interpretability matters, or continual adaptation is required. We present EvoForest, a hybrid neuro-symbolic system for end-to-end open-ended evolution of computation. Rather than merely generating features, EvoForest jointly evolves reusable computational structure, callable function families, and trainable low-dimensional continuous components inside a shared directed acyclic graph. Intermediate nodes store alternative implementations, callable nodes encode reusable transformation families such as projections, gates, and activations, output nodes define candidate predictive computations, and persistent global parameters can be refined by gradient descent. For each graph configuration, EvoForest evaluates the discovered computation and uses a lightweight Ridge-based readout to score the resulting representation against a non-differentiable cross-validation target. The evaluator also produces structured feedback that guides future LLM-driven mutations. In the 2025 ADIA Lab Structural Break Challenge, EvoForest reached 94.13% ROC-AUC after 600 evolution steps, exceeding the publicly reported winning score of 90.14% under the same evaluation protocol.

cross TTKV: Temporal-Tiered KV Cache for Long-Context LLM Inference

Authors: Gradwell Dzikanyanga, Weihao Yang, Hao Huang, Donglei Wu, Shihao Wang, Wen Xia, Sanjeeb K C

Abstract: Key-value (KV) caching is critical for efficient inference in large language models (LLMs), yet its memory footprint scales linearly with context length, resulting in a severe scalability bottleneck. Existing approaches largely treat KV states as equally important across time, implicitly assuming uniform precision and accessibility. However, this assumption contrasts with human memory systems, where memories vary in clarity, recall frequency, and relevance with temporal proximity.Motivated by this insight, we propose TTKV, a KV cache management framework that maps the human memory system onto the KV cache. TTKV partitions the KV cache into temporal tiers with heterogeneous capacity and precision. The design addresses three aspects: (1) Tier Layout, decoupling fast and slow memory using HBM and DRAM; (2) Tier Content, assigning more recent KV states to faster, higher-precision tiers based on temporal proximity; and (3) Tier Interaction, employing block-wise streaming attention to overlap communication and computation when accessing slow tiers. Experiments show that TTKV reduces cross-tier traffic by 5.94x on 128K-context tasks, achieving up to 76% latency reduction and 2x throughput improvement over strong baselines.

cross Development and Preliminary Evaluation of a Domain-Specific Large Language Model for Tuberculosis Care in South Africa

Authors: Thokozile Khosa, Olawande Daramola

Abstract: Tuberculosis (TB) is one of the world's deadliest infectious diseases, and in South Africa, it contributes a significant burden to the country's health care system. This paper presents an experimental study on the development of a domain-specific Large Language Model (DS-LLM) for TB care that can help to alleviate the burden on patients and healthcare providers. To achieve this, a literature review was conducted to understand current LLM development strategies, specifically in the medical domain. Thereafter, data were collected from South African TB guidelines, selected TB literature, and existing benchmark medical datasets. We performed LLM fine-tuning by using the Quantised Low-Rank Adaptation (QLoRA) algorithm on a medical LLM (BioMistral-7B), and also implemented Retrieval-Augmented Generation using GraphRAG. The developed DS-LLM was evaluated against the base BioMistral-7B model and a general-purpose LLM using a mix of automated metrics and quantitative ratings. The results show that the DS-LLM had better performance compared to the base model in terms of its contextual alignment (lexical, semantic, and knowledge) for TB care in South Africa.

cross Hidden Reliability Risks in Large Language Models: Systematic Identification of Precision-Induced Output Disagreements

Authors: Yifei Wang, Tianlin Li, Xiaohan Zhang, Xiaoyu Zhang, Wei Ma, Mingfei Cheng, Li Pan

Abstract: Large language models (LLMs) are increasingly deployed under diverse numerical precision configurations, including standard floating-point formats (e.g., bfloat16 and float16) and quantized integer formats (e.g., int16 and int8), to meet efficiency and resource constraints. However, minor inconsistencies between LLMs of different precisions are difficult to detect and are often overlooked by existing evaluation methods. In this paper, we present PrecisionDiff, an automated differential testing framework for systematically detecting precision-induced behavioral disagreements in LLMs. PrecisionDiff generates precision-sensitive test inputs and performs cross-precision comparative analysis to uncover subtle divergences that remain hidden under conventional testing strategies. To demonstrate its practical significance, we instantiate PrecisionDiff on the alignment verification task, where precision-induced disagreements manifest as jailbreak divergence-inputs that are rejected under one precision may produce harmful responses under another. Experimental results show that such behavioral disagreements are widespread across multiple open-source aligned LLMs and precision settings, and that PrecisionDiff significantly outperforms vanilla testing methods in detecting these issues. Our work enables automated precision-sensitive test generation, facilitating effective pre-deployment evaluation and improving precision robustness during training.

cross SkillGraph: Graph Foundation Priors for LLM Agent Tool Sequence Recommendation

Authors: Hao Liu, Dongyu Li

Abstract: LLM agents must select tools from large API libraries and order them correctly. Existing methods use semantic similarity for both retrieval and ordering, but ordering depends on inter-tool data dependencies that are absent from tool descriptions. As a result, semantic-only methods can produce negative Kendall-$\tau$ in structured workflow domains. We introduce SkillGraph, a directed weighted execution-transition graph mined from 49,831 successful LLM agent trajectories, which encodes workflow-precedence regularities as a reusable graph foundation prior. Building on this graph foundation prior, we propose a two-stage decoupled framework: GS-Hybrid retrieval for candidate selection and a learned pairwise reranker for ordering. On ToolBench (9,965 test instances; ~16,000 tools), the method reaches Set-F1 = 0.271 and Kendall-$\tau$ = 0.096; on API-Bank, Kendall-$\tau$ improves from -0.433 to +0.613. Under identical Stage-1 inputs, the learned reranker also outperforms LLaMA-3.1-8B Stage-2 rerankers.

cross Improving Molecular Force Fields with Minimal Temporal Information

Authors: Ali Mollahosseini, Mohammed Haroon Dupty, Wee Sun Lee

Abstract: Accurate prediction of energy and forces for 3D molecular systems is one of fundamental challenges at the core of AI for Science applications. Many powerful and data-efficient neural networks predict molecular energies and forces from single atomic configurations. However, one crucial aspect of the data generation process is rarely considered while learning these models i.e. Molecular Dynamics (MD) simulation. MD simulations generate time-ordered trajectories of atomic positions that fluctuate in energy and explore regions of the potential energy surface (e.g., under standard NVE/NVT ensembles), rather than being constructed to steadily lower the potential energy toward a minimum as in geometry relaxations. This work explores a novel way to leverage MD data, when available, to improve the performance of such predictors. We introduce a novel training strategy called FRAMES, that use an auxiliary loss function for exploiting the temporal relationships within MD trajectories. Counter-intuitively, on two atomistic benchmarks and a synthetic system we observe that minimal temporal information, captured by pairs of just two consecutive frames, is often sufficient to obtain the best performance, while adding longer trajectory sequences can introduce redundancy and degrade performance. On the widely used MD17 and ISO17 benchmarks, FRAMES significantly outperforms its Equiformer baseline, achieving highly competitive results in both energy and force accuracy. Our work not only presents a novel training strategy which improves the accuracy of the model, but also provides evidence that for distilling physical priors of atomic systems, more temporal data is not always better.

cross MIRROR: A Hierarchical Benchmark for Metacognitive Calibration in Large Language Models

Authors: Jason Z Wang

Abstract: We introduce MIRROR, a benchmark comprising eight experiments across four metacognitive levels that evaluates whether large language models can use self-knowledge to make better decisions. We evaluate 16 models from 8 labs across approximately 250,000 evaluation instances using five independent behavioral measurement channels. Core experiments are run across the full model roster; experiments with specialized infrastructure requirements report explicitly marked model subsets. We find two phenomena with direct implications for agentic deployment: (1) compositional self-prediction fails universally -- the Compositional Calibration Error ranges from 0.500 to 0.943 on the original 15-model Exp3-v1 set (and 0.434 to 0.758 on the balanced 16-model Exp3-v2 expansion), indicating that models cannot predict their own performance on multi-domain tasks, and (2) models exhibit above-chance but imperfect domain-specific self-knowledge yet systematically fail to translate even this partial awareness into appropriate agentic action-selection -- external metacognitive control reduces the Confident Failure Rate from 0.600 to 0.143 (76% reduction at temperature 0; mean 70% at temperature 0.7 across 5 models from 4 labs). Providing models with their own calibration scores produces no significant improvement (p > 0.05); only architectural constraint is effective. This suggests that external metacognitive scaffolding -- not improved self-knowledge -- is the path to safer autonomous AI systems. Code, data, and Croissant metadata will be released publicly with the benchmark.

cross Rabies diagnosis in low-data settings: A comparative study on the impact of data augmentation and transfer learning

Authors: Khalil Akremi, Mariem Handous, Zied Bouslama, Farah Bassalah, Maryem Jebali, Mariem Hanachi, Ines Abdeljaoued-Tej

Abstract: Rabies remains a major public health concern across many African and Asian countries, where accurate diagnosis is critical for effective epidemiological surveillance. The gold standard diagnostic methods rely heavily on fluorescence microscopy, necessitating skilled laboratory personnel for the accurate interpretation of results. Such expertise is often scarce, particularly in regions with low annual sample volumes. This paper presents an automated, AI-driven diagnostic system designed to address these challenges. We developed a robust pipeline utilizing fluorescent image analysis through transfer learning with four deep learning architectures: EfficientNetB0, EfficientNetB2, VGG16, and Vision Transformer (ViTB16). Three distinct data augmentation strategies were evaluated to enhance model generalization on a dataset of 155 microscopic images (123 positive and 32 negative). Our results demonstrate that TrivialAugmentWide was the most effective augmentation technique, as it preserved critical fluorescent patterns while improving model robustness. The EfficientNetB0 model, utilizing Geometric & Color augmentation and selected through stratified 3fold cross-validation, achieved optimal classification performance on cropped images. Despite constraints posed by class imbalance and a limited dataset size, this work confirms the viability of deep learning for automating rabies diagnosis. The proposed method enables fast and reliable detection with significant potential for further optimization. An online tool was deployed to facilitate practical access, establishing a framework for future medical imaging applications. This research underscores the potential of optimized deep learning models to transform rabies diagnostics and improve public health outcomes.

cross Co-Located Tests, Better AI Code: How Test Syntax Structure Affects Foundation Model Code Generation

Authors: \'Eric Jacopin

Abstract: AI coding assistants increasingly generate code alongside tests. How developers structure test code, whether inline with the implementation or in separate blocks, has traditionally been a matter of testing philosophy. We investigate whether this choice affects AI code generation quality. We conduct a large-scale empirical study (830+ generated files, 12 models, 3 providers) using SEGA, a three-dimensional evaluation framework measuring Determinism, Preservation, and Correctness. Comparing inline test syntax (Python doctests) against separated test syntax (Rust #[test] blocks) on a d-ary heap implementation, we find that: (1) inline tests yield near-perfect preservation (100%) and correctness (92-100%) across all models; (2) separated tests expose stark model-tier gaps (0-100% correctness) and independence between preservation and correctness; (3) model behavior evolves across generations, and notably one model breaks the test suppression pattern of its three predecessors; (4) mechanistic analysis on 7 open-source architectures (6 transformers and a gated-linear Recurrent Neural Network (RNN)) reveals inline test markers receive 2.8-4.4$\times$ stronger attention in 5/7 models, with causal validation via knockout and steering experiments on the 4 code-specialized transformers and RWKV-6; the co-location mechanism extends to a non-transformer architecture, suggesting the design recommendation is robust to future architectural shifts. In the Foundation Model era, test syntax structure is a software design concern: co-locating tests with implementation code produces measurably better AI-generated code. This arxiv long version includes appendices that further qualify the effect as bounded by both model capability and programming language.

cross Option Pricing on Noisy Intermediate-Scale Quantum Computers: A Quantum Neural Network Approach

Authors: Sebastian Zaj\k{a}c, Rafa{\l} Pracht

Abstract: In a global derivatives market with notional values in the hundreds of trillions of dollars, the accuracy and efficiency of pricing models are of fundamental importance, with direct implications for risk management, capital allocation, and regulatory compliance. In this work, we employ the Black-Scholes-Merton (BSM) framework not as an end in itself, but as a controlled benchmark environment in which to rigorously assess the capabilities of quantum machine learning methods. We propose a fully quantum approach to option pricing based on Quantum Neural Networks (QNNs), and, to the best of our knowledge, present one of the first implementations of such a methodology on currently available quantum hardware. Specifically, we investigate whether QNNs, by exploiting the geometric structure of Hilbert space, can effectively approximate option pricing functions. Our implementation utilizes a compact 2-qubit QNN architecture evaluated across multiple state-of-the-art quantum processors, including IBM Fez, IQM Garnet, IonQ Forte, and Rigetti Ankaa-3. This cross-platform study reveals distinct hardware-dependent performance characteristics while demonstrating that accurate pricing approximations can be achieved consistently across different devices despite the constraints of Noisy Intermediate-Scale Quantum (NISQ) hardware. The results provide empirical evidence that QNN-based approaches constitute a viable framework for derivative pricing. While the analysis is conducted within the BSM setting, the broader significance lies in the potential extension of these methods to more realistic and computationally demanding models, including local volatility, stochastic volatility, and interest rate frameworks commonly used in practice.

cross Spatio-temporal modelling of electric vehicle charging demand

Authors: Kaoutar Bouaachra, Yvenn Amara-Ouali, Yannig Goude, Rapha\"el Lachieze-Rey

Abstract: Accurate forecasting of electric vehicle (EV) charging demand is critical for grid management and infrastructure planning. Yet the field continues to rely on legacy benchmarks; such as the Palo Alto (2020) dataset; that fail to reflect the scale and behavioral diversity of modern charging networks. To address this, we introduce a novel large-scale longitudinal dataset collected across Scotland (2022 2025), which release it as an open benchmark for the community. Building on this dataset, we formulate EV charging demand as a spatio-temporal latent Gaussian field and perform approximate Bayesian inference via Integrated Nested Laplace Approximation (INLA). The resulting model jointly captures spatial dependence, temporal dynamics, and covariate effects within a unified proba bilistic framework. On station-level forecasting tasks, our approach achieves competitive predictive accuracy against machine learning baselines, while additionally providing principled uncertainty quan tification and interpretable spatial and temporal decompositions properties that are essential for risk-aware infrastructure planning.

cross Neural posterior estimation of the neutrino direction in IceCube using transformer-encoded normalizing flows on the sphere

Authors: R. Abbasi, M. Ackermann, J. Adams, J. A. Aguilar, M. Ahlers, J. M. Alameddine, S. Ali, N. M. Amin, K. Andeen, C. Arg\"uelles, Y. Ashida, S. Athanasiadou, S. N. Axani, R. Babu, X. Bai, A. Balagopal V., S. W. Barwick, V. Basu, R. Bay, J. J. Beatty, J. Becker Tjus, P. Behrens, J. Beise, C. Bellenghi, S. Benkel, S. BenZvi, D. Berley, E. Bernardini, D. Z. Besson, E. Blaufuss, L. Bloom, S. Blot, F. Bontempo, J. Y. Book Motzkin, C. Boscolo Meneguolo, S. B\"oser, O. Botner, J. B\"ottcher, J. Braun, B. Brinson, Z. Brisson-Tsavoussis, R. T. Burley, D. Butterfield, K. Carloni, J. Carpio, N. Chau, Z. Chen, D. Chirkin, S. Choi, A. Chubarov, B. A. Clark, G. H. Collin, D. A. Coloma Borja, A. Connolly, J. M. Conrad, D. F. Cowen, C. De Clercq, J. J. DeLaunay, D. Delgado, T. Delmeulle, S. Deng, P. Desiati, K. D. de Vries, G. de Wasseige, T. DeYoung, J. C. D\'iaz-V\'elez, S. DiKerby, T. Ding, M. Dittmer, A. Domi, L. Draper, L. Dueser, D. Durnford, K. Dutta, M. A. DuVernois, T. Ehrhardt, L. Eidenschink, A. Eimer, C. Eldridge, P. Eller, E. Ellinger, D. Els\"asser, R. Engel, H. Erpenbeck, W. Esmail, S. Eulig, J. Evans, P. A. Evenson, K. L. Fan, K. Fang, K. Farrag, A. R. Fazely, A. Fedynitch, N. Feigl, C. Finley, D. Fox, A. Franckowiak, S. Fukami, P. F\"urst, J. Gallagher, E. Ganster, A. Garcia, M. Garcia, E. Genton, L. Gerhardt, A. Ghadimi, C. Glaser, T. Gl\"usenkamp, J. G. Gonzalez, S. Goswami, A. Granados, D. Grant, S. J. Gray, S. Griffin, K. M. Groth, D. Guevel, C. G\"unther, P. Gutjahr, C. Ha, A. Hallgren, L. Halve, F. Halzen, L. Hamacher, M. Handt, K. Hanson, J. Hardin, A. A. Harnisch, P. Hatch, A. Haungs, J. H\"au{\ss}ler, K. Helbing, J. Hellrung, B. Henke, L. Hennig, F. Henningsen, L. Heuermann, R. Hewett, N. Heyer, S. Hickford, A. Hidvegi, C. Hill, G. C. Hill, R. Hmaid, K. D. Hoffman, A. Hollnagel, D. Hooper, S. Hori, K. Hoshina, M. Hostert, W. Hou, M. Hrywniak, T. Huber, K. Hultqvist, K. Hymon, A. Ishihara, W. Iwakiri, M. Jacquart, S. Jain, O. Janik, M. Jansson, M. Jin, N. Kamp, D. Kang, W. Kang, A. Kappes, L. Kardum, T. Karg, A. Karle, A. Katil, M. Kauer, J. L. Kelley, M. Khanal, A. Khatee Zathul, A. Kheirandish, T. Kim, H. Kimku, F. Kirchner, J. Kiryluk, C. Klein, S. R. Klein, Y. Kobayashi, S. Koch, A. Kochocki, R. Koirala, H. Kolanoski, T. Kontrimas, L. K\"opke, C. Kopper, D. J. Koskinen, P. Koundal, M. Kowalski, T. Kozynets, A. Kravka, N. Krieger, T. Krishnan, K. Kruiswijk, E. Krupczak, A. Kumar, E. Kun, N. Kurahashi, C. Lagunas Gualda, L. Lallement Arnaud, M. J. Larson, F. Lauber, J. P. Lazar, K. Leonard DeHolton, A. Leszczy\'nska, C. Li, J. Liao, C. Lin, Q. R. Liu, Y. T. Liu, M. Liubarska, C. Love, L. Lu, F. Lucarelli, W. Luszczak, Y. Lyu, M. Macdonald, E. Magnus, Y. Makino, E. Manao, S. Mancina, A. Mand, I. C. Mari\c{s}, S. Marka, Z. Marka, L. Marten, I. Martinez-Soler, R. Maruyama, J. Mauro, F. Mayhew, F. McNally, K. Meagher, A. Medina, M. Meier, Y. Merckx, L. Merten, J. Mitchell, L. Molchany, S. Mondal, T. Montaruli, R. W. Moore, Y. Morii, A. Mosbrugger, D. Mousadi, E. Moyaux, T. Mukherjee, M. Nakos, U. Naumann, J. Necker, L. Neste, M. Neumann, H. Niederhausen, M. U. Nisa, K. Noda, A. Noell, A. Novikov, A. Obertacke, V. O'Dell, A. Olivas, R. Orsoe, J. Osborn, E. O'Sullivan, B. Owens, V. Palusova, H. Pandya, A. Parenti, N. Park, V. Parrish, E. N. Paudel, L. Paul, C. P\'erez de los Heros, T. Pernice, T. C. Petersen, J. Peterson, S. Pick, M. Plum, A. Pont\'en, V. Poojyam, B. Pries, R. Procter-Murphy, G. T. Przybylski, L. Pyras, C. Raab, J. Rack-Helleis, N. Rad, M. Ravn, K. Rawlins, Z. Rechav, A. Rehman, I. Reistroffer, E. Resconi, S. Reusch, C. D. Rho, W. Rhode, L. Ricca, B. Riedel, A. Rifaie, E. J. Roberts, S. Rodan, M. Rongen, A. Rosted, C. Rott, T. Ruhe, L. Ruohan, D. Ryckbosch, J. Saffer, D. Salazar-Gallegos, P. Sampathkumar, A. Sandrock, G. Sanger-Johnson, M. Santander, S. Sarkar, M. Scarnera, M. Schaufel, H. Schieler, S. Schindler, L. Schlickmann, B. Schl\"uter, F. Schl\"uter, N. Schmeisser, T. Schmidt, A. Scholz, F. G. Schr\"oder, S. Schwirn, S. Sclafani, D. Seckel, L. Seen, M. Seikh, S. Seunarine, P. A. Sevle Myhr, R. Shah, S. Shah, S. Shefali, N. Shimizu, B. Skrzypek, R. Snihur, J. Soedingrekso, D. Soldin, P. Soldin, G. Sommani, C. Spannfellner, G. M. Spiczak, C. Spiering, J. Stachurska, M. Stamatikos, T. Stanev, T. Stezelberger, T. St\"urwald, T. Stuttard, G. W. Sullivan, I. Taboada, S. Ter-Antonyan, A. Terliuk, A. Thakuri, M. Thiesmeyer, W. G. Thompson, J. Thwaites, S. Tilav, K. Tollefson, J. A. Torres, S. Toscano, D. Tosi, K. Upshaw, A. Vaidyanathan, N. Valtonen-Mattila, J. Valverde, J. Vandenbroucke, T. Van Eeden, N. van Eijndhoven, L. Van Rootselaar, J. van Santen, J. Vara, F. Varsi, M. Venugopal, M. Vereecken, S. Vergara Carrasco, S. Verpoest, D. Veske, A. Vijai, J. Villarreal, C. Walck, A. Wang, E. H. S. Warrick, C. Weaver, P. Weigel, A. Weindl, J. Weldert, A. Y. Wen, C. Wendt, J. Werthebach, M. Weyrauch, N. Whitehorn, C. H. Wiebusch, D. R. Williams, L. Witthaus, G. Wrede, X. W. Xu, J. P. Yanez, Y. Yao, E. Yildizci, S. Yoshida, R. Young, F. Yu, S. Yu, T. Yuan, S. Yun-C\'arcamo, A. Zander Jurowitzki, A. Zegarelli, S. Zhang, Z. Zhang, P. Zhelnin, P. Zilberman

Abstract: IceCube is a cubic-kilometer-scale neutrino detector located at the geographic South Pole. A precise directional reconstruction of IceCube neutrinos is vital for associations with astronomical objects. In this context, we discuss neural posterior estimation of the neutrino direction via a transformer encoder that maps to a normalizing flow on the 2-sphere. It achieves a new state-of-the-art angular resolution for the two main event morphologies in IceCube - tracks and showers - while being significantly faster than traditional B-spline-based likelihood reconstructions. All-sky scans can be performed within seconds rather than hours, and take constant computation time, regardless of whether the posterior extent is arc-minutes or spans the whole sky. We utilize a combination of $C^2$-smooth rational-quadratic splines, scale transformations and rotations to define a novel spherical normalizing-flow distribution whose parameters are predicted as a whole as the output of the transformer encoder. We test several structural choices diverting from the vanilla transformer architecture. In particular, we find dual residual streams, nonlinear QKV projection and a separate class token with its own cross-attention processing to boost test-time performance. The angular resolution for both showers and tracks improves substantially over the whole trained energy range from 100 GeV to 100 PeV. At 100 TeV deposited energy, for example, the median angular resolution improves by a factor of $1.3$ for throughgoing tracks, by a factor of $1.7$ for showers and by a factor of $2.5$ for starting tracks compared to state-of-the art likelihood reconstructions based on B-splines. While previous machine-learning (ML) efforts have managed to obtain competitive shower resolutions, this is the first time an ML-based method outperforms likelihood-based muon reconstructions above 100 GeV.

cross What Makes a Bacterial Model a Good Reservoir Computer? Predicting Performance from Separability and Similarity

Authors: Laura Alonso Bartolom\'e (MICALIS, Mnemosyne), Jean-Loup Faulon (MICALIS), Xavier Hinaut (Mnemosyne)

Abstract: Biological systems are promising substrates for computation because they naturally process environmental information through complex internal dynamics. In this study, we investigate whether bacterial metabolic models can act as physical reservoirs and whether their computational performance can be predicted from dynamical properties linked to separability and similarity. We simulated the growth dynamics of five bacterial species, one yeast species, and 29 Escherichia coli single-gene deletion mutants using dynamic flux balance analysis (dFBA), with glucose and xylose concentrations as inputs and growth curves as reservoir states. Computational performance was assessed on random nonlinear classification tasks using a linear readout, while reservoir properties linked to separability and similarity were characterised through kernel and generalisation ranks computed from growth-curve state matrices. Several microbial models achieved high classification accuracy, showing that bacterial metabolic dynamics can support nonlinear computation. Clear differences were observed between species, with some models converging more rapidly and others reaching higher maximum accuracy, revealing a trade-off between convergence speed and peak performance. In contrast, all E. coli mutants were dominated by the wild-type model, suggesting that gene deletions reduce the dynamical richness required for efficient computation. The difference between kernel and generalisation ranks was generally associated with improved accuracy, but deviations across models and sensitivity at low rank values limited its predictive power in practice. Overall, these results show that bacterial metabolic models constitute promising substrates for reservoir computing and provide a first step towards identifying microbial strains with favourable computational properties for future experimental implementations.

cross ChipCraftBrain: Validation-First RTL Generation via Multi-Agent Orchestration

Authors: Cagri Eryilmaz

Abstract: Large Language Models (LLMs) show promise for generating Register-Transfer Level (RTL) code from natural language specifications, but single-shot generation achieves only 60-65% functional correctness on standard benchmarks. Multi-agent approaches such as MAGE reach 95.9% on VerilogEval yet remain untested on harder industrial benchmarks such as NVIDIA's CVDP, lack synthesis awareness, and incur high API costs. We present ChipCraftBrain, a framework combining symbolic-neural reasoning with adaptive multi-agent orchestration for automated RTL generation. Four innovations drive the system: (1) adaptive orchestration over six specialized agents via a PPO policy over a 168-dim state (an alternative world-model MPC planner is also evaluated); (2) a hybrid symbolic-neural architecture that solves K-map and truth-table problems algorithmically while specialized agents handle waveform timing and general RTL; (3) knowledge-augmented generation from a 321-pattern base plus 971 open-source reference implementations with focus-aware retrieval; and (4) hierarchical specification decomposition into dependency-ordered sub-modules with interface synchronization. On VerilogEval-Human, ChipCraftBrain achieves 97.2% mean pass@1 (range 96.15-98.72% across 7 runs, best 154/156), on par with ChipAgents (97.4%, self-reported) and ahead of MAGE (95.9%). On a 302-problem non-agentic subset of CVDP spanning five task categories, we reach 94.7% mean pass@1 (286/302, averaged over 3 runs), a 36-60 percentage-point lift per category over the published single-shot baseline; we additionally lead three of four categories shared with NVIDIA's ACE-RTL despite using roughly 30x fewer per-problem attempts. A RISC-V SoC case study demonstrates hierarchical decomposition generating 8/8 lint-passing modules (689 LOC) validated on FPGA, where monolithic generation fails entirely.

cross From Signal Degradation to Computation Collapse: Uncovering the Two Failure Modes of LLM Quantization

Authors: Chenxi Zhou, Pengfei Cao, Jiang Li, Bohan Yu, Jinyu Ye, Jun Zhao, Kang Liu

Abstract: Post-Training Quantization (PTQ) is critical for the efficient deployment of Large Language Models (LLMs). While 4-bit quantization is widely regarded as an optimal trade-off, reducing the precision to 2-bit usually triggers a catastrophic ``performance cliff.'' It remains unclear whether the underlying mechanisms differ fundamentally. Consequently, we conduct a systematic mechanistic analysis, revealing two qualitatively distinct failure modes: Signal Degradation, where the computational patterns remain intact but information precision is impaired by cumulative error; and Computation Collapse, where key components fail to function, preventing correct information processing and destroying the signal in the early layers. Guided by this diagnosis, we conduct mechanism-aware interventions, demonstrating that targeted, training-free repair can mitigate Signal Degradation, but remains ineffective for Computation Collapse. Our findings provide a systematic diagnostic framework for PTQ failures and suggest that addressing Computation Collapse requires structural reconstruction rather than mere compensation.

cross MMCORE: MultiModal COnnection with Representation Aligned Latent Embeddings

Authors: Zijie Li, Yichun Shi, Jingxiang Sun, Ye Wang, Yixuan Huang, Zhiyao Guo, Xiaochen Lian, Peihao Zhu, Yu Tian, Zhonghua Zhai, Peng Wang

Abstract: We present MMCORE, a unified framework designed for multimodal image generation and editing. MMCORE leverages a pre-trained Vision-Language Model (VLM) to predict semantic visual embeddings via learnable query tokens, which subsequently serve as conditioning signals for a diffusion model. This streamlined design effectively transfers the rich understanding and reasoning capabilities of VLMs into the visual generation process. By obviating the need for deep fusion between autoregressive and diffusion models or training from scratch, MMCORE significantly reduces computational overhead while maintaining high-fidelity synthesis. MMCORE seamlessly integrates text-to-image synthesis with interleaved image generation, demonstrating robust multimodal comprehension in complex scenarios such as spatial reasoning and visual grounding. Comprehensive evaluations indicate that MMCORE consistently outperforms state-of-the-art baselines across a broad spectrum of text-to-image and single/multi-image editing benchmarks.

cross DistortBench: Benchmarking Vision Language Models on Image Distortion Identification

Authors: Divyanshu Goyal, Akhil Eppa, Vanya Bannihatti Kumar

Abstract: Vision-language models (VLMs) are increasingly used in settings where sensitivity to low-level image degradations matters, including content moderation, image restoration, and quality monitoring. Yet their ability to recognize distortion type and severity remains poorly understood. We present DistortBench, a diagnostic benchmark for no-reference distortion perception in VLMs. DistortBench contains 13,500 four-choice questions covering 27 distortion types, six perceptual categories, and five severity levels: 25 distortions inherit KADID-10k calibrations, while two added rotation distortions use monotonic angle-based levels. We evaluate 18 VLMs, including 17 open-weight models from five families and one proprietary model. Despite strong performance on high-level vision-language tasks, the best model reaches only 61.9% accuracy, just below the human majority-vote baseline of 65.7% (average individual: 60.2%), indicating that low-level perceptual understanding remains a major weakness of current VLMs. Our analysis further reveals weak and non-monotonic scaling with model size, performance drops in most base--thinking pairs, and distinct severity-response patterns across model families. We hope DistortBench will serve as a useful benchmark for measuring and improving low-level visual perception in VLMs.

cross Algorithm and Hardware Co-Design for Efficient Complex-Valued Uncertainty Estimation

Authors: Zehuan Zhang, Mark Chen, He Li, Wayne Luk

Abstract: Complex-Valued Neural Networks (CVNNs) have significant advantages in handling tasks that involve complex numbers. However, existing CVNNs are unable to quantify predictive uncertainty. We propose, for the first time, dropout-based Bayesian Complex-Valued Neural Networks (BayesCVNNs) to enable uncertainty quantification for complex-valued applications, exhibiting broad applicability and efficiency for hardware implementation due to modularity. Furthermore, as the dual-part nature of complex values significantly broadens the design space and enables novel configurations based on layer-mixing and part-mixing, we introduce an automated search approach to effectively identify optimal configurations for both real and imaginary components. To facilitate deployment, we present a framework that generates customized FPGA-based accelerators for BayesCVNNs, leveraging a set of optimized building blocks. Experiments demonstrate the best configuration can be effectively found via the automated search, attaining higher performance with lower hardware costs compared with manually crafted models. The optimized accelerators achieve approximately 4.5x and 13x speedups on different models with less than 10% power consumption compared to GPU implementations, and outperform existing work in both algorithm and hardware aspects. Our code is publicly available at: https://github.com/zehuanzhang/BayesCVNN.git.

URLs: https://github.com/zehuanzhang/BayesCVNN.git.

cross scpFormer: A Foundation Model for Unified Representation and Integration of the Single-Cell Proteomics

Authors: Qifeng Zhou, Lei Yu, Yuzhi Guo, Yuwei Miao, Hehuan Ma, Wenliang Zhong, Lin Xu, Junzhou Huang

Abstract: The integration of single-cell proteomic data is often hindered by the fragmented nature of targeted antibody panels. To address this limitation, we introduce scpFormer, a transformer-based foundation model designed for single-cell proteomics. Pre-trained on over 390 million cells, scpFormer replaces standard index-based tokenization with a continuous, sequence-anchored approach. By combining Evolutionary Scale Modeling (ESM) with value-aware expression embeddings, it dynamically maps variable panels into a shared semantic space without artificial discretization. We demonstrate that scpFormer generates global cell representations that perform competitively in large-scale batch integration and unsupervised clustering. Moreover, its open-vocabulary architecture facilitates in silico panel expansion, assisting in the reconstruction of biological manifolds in sparse clinical datasets. Finally, this learned protein co-expression logic is transferable to bulk-omics tasks, supporting applications like cancer drug response prediction. scpFormer provides a versatile, panel-agnostic framework to facilitate scalable biomarker discovery and precision oncology.

cross Decision-Focused Federated Learning Under Heterogeneous Objectives and Constraints

Authors: Konstantinos Ziliaskopoulos, Alexander Vinel

Abstract: We consider what we refer to as {Decision-Focused Federated Learning (DFFL)} framework, i.e., a predict-then-optimize approach employed by a collection of agents, where each agent's predictive model is an input to a downstream linear optimization problem, and no direct exchange of raw data is allowed. Importantly, clients can differ both in objective functions and in feasibility constraints. We build on the well-known SPO+ approach and develop heterogeneity bounds for the SPO+ surrogate loss in this case. This is accomplished by employing a support function representation of the feasible region, separating (i) objective shift via norm distances between the cost vectors and (ii) feasible-set shift via shape distances between the constraint sets. In the case of strongly convex feasible regions, sharper bounds are derived due to the optimizer stability. Building on these results, we define a heuristic local-versus-federated excess risk decision rule which, under SPO+ risk, gives a condition for when federation can be expected to improve decision quality: the heterogeneity penalty must be smaller than the statistical advantage of pooling data. We implement a FedAvg-style DFFL set of experiments on both polyhedral and strongly convex problems and show that federation is broadly robust in the strongly convex setting, while performance in the polyhedral setting degrades primarily with constraint heterogeneity, especially for clients with many samples. In other words, especially for the strongly convex case, an approach following a direct implementation of FedAvg and SPO+ can still yield promising performance even when the downstream optimization problems are noticeably different.

cross Separable Pathways for Causal Reasoning: How Architectural Scaffolding Enables Hypothesis-Space Restructuring in LLM Agents

Authors: John Alderete, Sebastian Benthal, Connie Xu, John Xing

Abstract: Causal discovery through experimentation and intervention is fundamental to robust problem solving. It requires not just updating beliefs within a fixed framework but revising the hypothesis space itself, a capacity current AI agents lack when evidence demands representations they have not previously constructed. We extend the blicket detector paradigm from developmental science to test this capacity in AI agents equipped with architectural scaffolding that targets hypothesis-space restructuring. Our compositional architecture has two discrete components: context graphs, which structure exploration as typed state machines, and dynamic behaviors, which monitor for evidence that the current hypothesis space is inadequate and expand it at runtime. Across 1,085 experimental trials, these components make orthogonal contributions: context graphs drive reasoning quality within the post-switch hypothesis space, accounting for 94\% of the accuracy gain, while dynamic behaviors drive reasoning eligibility by detecting regime changes and preventing premature commitment to outdated hypotheses.

cross Bootstrapping Post-training Signals for Open-ended Tasks via Rubric-based Self-play on Pre-training Text

Authors: Chengyu Huang, Sheng-Yen Chou, Zhengxin Zhang, Claire Cardie

Abstract: Self-play has recently emerged as a promising paradigm to train Large Language Models (LLMs). In self-play, the target LLM creates the task input (e.g., ask a question), which it then addresses itself by producing a task output (e.g., give an answer). A reward model evaluates the output, and the rewards are then used to train the LLM, typically via Reinforcement Learning (RL). Self-play incurs minimal supervision costs, and this is especially helpful for post-training LLMs, which require high-quality input-output pairs that traditionally have to be written by humans or expensive proprietary models. However, existing work explores self-play only for verifiable tasks such as math and coding. Instead, we seek to extend it to more realistic open-ended tasks. In particular, we propose POP, a self-play framework that uses the same LLM to synthesize evaluation rubrics, along with input-output pairs, for each example. The rubric is then used to evaluate outputs and train the model. We further ground the framework on a content-rich pretraining corpus to (1) ensure a generation-verification gap and reduce reward hacking, and (2) prevent mode collapse. On Qwen-2.5-7B, POP increases performance of both pretrained and instruction-tuned models, across different tasks ranging from long-form Healthcare QA to creative writing and instruction following.

cross SkillLearnBench: Benchmarking Continual Learning Methods for Agent Skill Generation on Real-World Tasks

Authors: Shanshan Zhong, Yi Lu, Jingjie Ning, Yibing Wan, Lihan Feng, Yuyi Ao, Leonardo F. R. Ribeiro, Markus Dreyer, Sean Ammirati, Chenyan Xiong

Abstract: Skills have become the de facto way to enable LLM agents to perform complex real-world tasks with customized instructions, workflows, and tools, but how to learn them automatically and effectively remains unclear. We introduce SkillLearnBench, the first benchmark for evaluating continual skill learning methods, comprising 20 verified, skill-dependent tasks across 15 sub-domains derived from a real-world skill taxonomy , evaluated at three levels: skill quality, execution trajectory, and task outcome. Using this benchmark, we evaluate recent continual learning techniques, those leveraging one-shot, self/teacher feedback, and skill creator to generate skills from agent experiences. We find that all continual learning methods improve over the no-skill baseline, yet consistent gains remain elusive: no method leads across all tasks and LLMs, and scaling to stronger LLMs does not reliably help. Continual learning improves tasks with clear, reusable workflows but struggles on open-ended tasks, and using stronger LLM backbones does not consistently produce better skills. Our analysis also revealed that multiple iterations in continual learning facilitate genuine improvement via external feedback, whereas self-feedback alone induces recursive drift. Our data and code are open-source at https://github.com/cxcscmu/SkillLearnBench to enable further studies of automatic skill generation and continual learning techniques.

URLs: https://github.com/cxcscmu/SkillLearnBench

cross HiPO: Hierarchical Preference Optimization for Adaptive Reasoning in LLMs

Authors: Darsh Kachroo, Adriana Caraeni, Arjun Prasaath Anbazhagan, Brennan Lagasse, Kevin Zhu

Abstract: Direct Preference Optimization (DPO) is an effective framework for aligning large language models with human preferences, but it struggles with complex reasoning tasks. DPO optimizes for the likelihood of generating preferred over dispreferred responses in their entirety and lacks the granularity to provide feedback on subsections of many-step solutions typical of reasoning tasks. Existing methods excel at either stable preference learning (e.g., DPO variants like KTO and RSO) or structured reasoning (e.g., ReMA's multi-agent RL framework, Tree of Thoughts), but fail to merge these complementary strengths. We propose HiPO (Hierarchical Preference Optimization), an extension of DPO that separates responses into reasoning segments (query clarification and context, reasoning steps, and answer) and computes loss as a weighted sum of the DPO loss for each segment. Our approach enables segment-specific training while maintaining DPO's computational efficiency and training stability. We demonstrate that for multiple 7B LLMs fine-tuned using HiPO and DPO on the Math Stack Exchange preference dataset, the models trained with HiPO outperform the others on a variety of common math benchmarks and achieve greater organization, logical flow, and consistency as measured by GPT-4.1.

cross Machine learning moment closure models for the radiative transfer equation IV: enforcing symmetrizable hyperbolicity in two dimensions

Authors: Juntao Huang

Abstract: This is our fourth work in the series on machine learning (ML) moment closure models for the radiative transfer equation (RTE). In the first three papers of this series, we considered the RTE in slab geometry in 1D1V (i.e. one dimension in physical space and one dimension in angular space), and introduced a gradient-based ML moment closure [1], then enforced the hyperbolicity through a symmetrizer [2], or together with physical characteristic speeds by learning the eigenvalues of the Jacobian matrix [3]. Here, we extend our framework to the RTE in 2D2V (i.e. two dimensions in physical space and two dimensions in angular space). The main idea is to preserve the leading part of the classical $P_N$ model and modify only the highest-order block row. By analyzing the structural properties of the $P_N$ model, we show that its coefficient matrices are symmetric and admit a block-tridiagonal structure. Then we use this property to introduce a block-diagonal symmetrizer for the ML moment model and derive explicit algebraic conditions on the closure blocks which guarantee the symmetrizable hyperbolicity of the resulting ML system. These conditions lead to a natural parametrization of the closure in terms of a symmetric positive definite matrix together with symmetric closure blocks, which can be learned from data while automatically enforcing symmetrizable hyperbolicity by construction. The numerical results show that the proposed framework improves upon the classical $P_N$ model while maintaining hyperbolicity.

cross Pre-Execution Query Slot-Time Prediction in Cloud Data Warehouses: A Feature-Scoped Machine Learning Approach

Authors: Prashant Kumar Pathak

Abstract: Cloud data warehouses bill compute based on slot-time consumed. In shared multi-tenant environments, query cost is highly variable and hard to estimate before execution, causing budget overruns and degraded scheduling. Static query-planner heuristics fail to capture complex SQL structure, data skew, and workload contention. We present a feature-scoped machine learning approach that predicts BigQuery slot-time before execution using only pre-execution observable signals: a structured query complexity score derived from SQL operator costs, data volume features from planner estimates and workload metadata, and textual features from query text. We deliberately exclude runtime factors (slot-pool utilization, cache state, realized skew) unknowable at submission. The model uses a HistGradientBoostingRegressor trained on log-transformed slot-time, with a TF-IDF + TruncatedSVD-512 text pipeline fused with numeric and categorical features. Trained on 749 queries across seven deployment environments and evaluated out-of-distribution on 746 queries from two held-out environments, the model achieves MAE 1.17 slot-minutes, RMSE 4.71, and 74% explained variance on the full workload. On cost-significant queries (slot-time >= 0.01 min, N=282) the model achieves MAE 3.10 versus 4.95 for a predict-mean baseline and 4.54 for predict-median, a 30-37% reduction. On long-tail queries (>= 20 min, N=22) the model does not outperform trivial baselines, consistent with the hypothesis that long-tail queries are dominated by unobserved runtime factors outside the current feature scope. A complexity-routed dual-model architecture is described as a practical refinement, and directions for closing the long-tail gap are identified as future work.

cross Robust Out-of-Distribution Stochastic Optimization

Authors: Xianyu Li, Huan Xu, Xiaolin Huang, Chao Shang

Abstract: Data-driven decision-making under uncertainty typically presumes the collection of historical data from an unknown target probability distribution. However, one may have no access to any data from the target distribution prior to decision-making. To address this challenge, we propose robust out-of-distribution stochastic optimization, a novel data-driven framework that effectively utilizes relevant data distributions for robust decision-making under unseen distributions. A key feature of our framework is that all data distributions are assumed to be randomly generated from a meta-distribution over distributions. To describe uncertainty in distribution generation, we propose to learn a data-driven uncertainty set in a reproducing kernel Hilbert space (RKHS) from relevant data distributions, with adjustable conservatism. We then incorporate this set into a min-max stochastic program to derive robust decisions. Notably, under randomness of distribution generation, we establish rigorous out-of-distribution generalization guarantees for the uncertainty set as well as the solution. To ease problem-solving in RKHS, an approximate parametrization with a provably bounded suboptimality and a row generation strategy are presented. Extensive numerical experiments on multi-item newsvendor and portfolio optimization demonstrate the superior out-of-distribution performance of our decision-making framework under unseen data distribution, even when only a small or moderate number of relevant sources are available.

cross Meta-Tool: Efficient Few-Shot Tool Adaptation for Small Language Models

Authors: Sachin Kumar

Abstract: Can small language models achieve strong tool-use performance without complex adaptation mechanisms? This paper investigates this question through Meta-Tool, a controlled empirical study comparing hypernetwork-based LoRA adaptation against carefully designed few-shot prompting. Using a Llama-3.2-3B-Instruct backbone, we evaluate four adaptation mechanisms--few-shot prompting, documentation encoding, hypernetwork-generated LoRA weights, and value-guided beam search--across four diverse benchmarks: Gorilla APIBench, Spider 2.0, WebArena, and InterCode. Our central finding is a well-supported negative result: despite generating non-trivial weight matrices, the 227.8M-parameter hypernetwork provides no measurable improvement over few-shot prompting alone. Comprehensive ablation studies reveal that few-shot examples contribute +21.5% to performance and documentation contributes +5.0%, while the hypernetwork adds 0%. A 3B model with well-designed prompts achieves 79.7% of GPT-5's average performance at $10 \times$ lower latency. Error analysis across 722 failure cases spanning all shot counts (0--5) shows that at the 5-shot configuration (106 failures), failure modes are task-dependent: schema-heavy tasks (Spider 2.0, WebArena) show near-zero format errors with remaining failures semantic, while format errors dominate on Gorilla (100%) and InterCode (70%). These findings redirect practitioners toward prompt engineering and example curation rather than complex adaptation architectures.

cross Toward Safe Autonomous Robotic Endovascular Interventions using World Models

Authors: Harry Robertshaw, Nikola Fischer, Han-Ru Wu, Andrea Walker Perez, Weiyuan Deng, Benjamin Jackson, Christos Bergeles, Alejandro Granados, Thomas C Booth

Abstract: Autonomous mechanical thrombectomy (MT) presents substantial challenges due to highly variable vascular geometries and the requirements for accurate, real-time control. While reinforcement learning (RL) has emerged as a promising paradigm for the automation of endovascular navigation, existing approaches often show limited robustness when faced with diverse patient anatomies or extended navigation horizons. In this work, we investigate a world-model-based framework for autonomous endovascular navigation built on TD-MPC2, a model-based RL method that integrates planning and learned dynamics. We evaluate a TD-MPC2 agent trained on multiple navigation tasks across hold out patient-specific vasculatures and benchmark its performance against the state-of-the-art Soft Actor-Critic (SAC) algorithm agent. Both approaches are further validated in vitro using patient-specific vascular phantoms under fluoroscopic guidance. In simulation, TD-MPC2 demonstrates a significantly higher mean success rate than SAC (58% vs. 36%, p < 0.001), and mean tip contact forces of 0.15 N, well below the proposed 1.5 N vessel rupture threshold. Mean success rates for TD-MPC2 (68%) were comparable to SAC (60%) in vitro, but TD-MPC2 achieved superior path ratios (p = 0.017) at the cost of longer procedure times (p < 0.001). Together, these results provide the first demonstration of autonomous MT navigation validated across both hold out in silico data and fluoroscopy-guided in vitro experiments, highlighting the promise of world models for safe and generalizable AI-assisted endovascular interventions.

cross Maximum Likelihood Reconstruction for Multi-Look Digital Holography with Markov-Modeled Speckle Correlation

Authors: Xi Chen, Arian Maleki, Shirin Jalali

Abstract: Multi-look acquisition is a widely used strategy for reducing speckle noise in coherent imaging systems such as digital holography. By acquiring multiple measurements, speckle can be suppressed through averaging or joint reconstruction, typically under the assumption that speckle realizations across looks are statistically independent. In practice, however, hardware constraints limit measurement diversity, leading to inter-look correlation that degrades the performance of conventional methods. In this work, we study the reconstruction of speckle-free reflectivity from complex-valued multi-look measurements in the presence of correlated speckle. We model the inter-look dependence using a first-order Markov process and derive the corresponding likelihood under a first-order Markov approximation, resulting in a constrained maximum likelihood estimation problem. To solve this problem, we develop an efficient projected gradient descent framework that combines gradient-based updates with implicit regularization via deep image priors, and leverages Monte Carlo approximation and matrix-free operators for scalable computation. Simulation results demonstrate that the proposed approach remains robust under strong inter-look correlation, achieving performance close to the ideal independent-look scenario and consistently outperforming methods that ignore such dependencies. These results highlight the importance of explicitly modeling inter-look correlation and provide a practical framework for multi-look holographic reconstruction under realistic acquisition conditions. Our code is available at: https://github.com/Computational-Imaging-RU/MLE-Holography-Markov.

URLs: https://github.com/Computational-Imaging-RU/MLE-Holography-Markov.

cross WildFireVQA: A Large-Scale Radiometric Thermal VQA Benchmark for Aerial Wildfire Monitoring

Authors: Mobin Habibpour, Niloufar Alipour Talemi, John Spodnik, Camren J. Khoury, Fatemeh Afghah

Abstract: Wildfire monitoring requires timely, actionable situational awareness from airborne platforms, yet existing aerial visual question answering (VQA) benchmarks do not evaluate wildfire-specific multimodal reasoning grounded in thermal measurements. We introduce WildFireVQA, a large-scale VQA benchmark for aerial wildfire monitoring that integrates RGB imagery with radiometric thermal data. WildFireVQA contains 6,097 RGB-thermal samples, where each sample includes an RGB image, a color-mapped thermal visualization, and a radiometric thermal TIFF, and is paired with 34 questions, yielding a total of 207,298 multiple-choice questions spanning presence and detection, classification, distribution and segmentation, localization and direction, cross-modal reasoning, and flight planning for operational wildfire intelligence. To improve annotation reliability, we combine multimodal large language model (MLLM)-based answer generation with sensor-driven deterministic labeling, manual verification, and intra-frame and inter-frame consistency checks. We further establish a comprehensive evaluation protocol for representative MLLMs under RGB, Thermal, and retrieval-augmented settings using radiometric thermal statistics. Experiments show that across task categories, RGB remains the strongest modality for current models, while retrieved thermal context yields gains for stronger MLLMs, highlighting both the value of temperature-grounded reasoning and the limitations of existing MLLMs in safety-critical wildfire scenarios. The dataset and benchmark code are open-source at https://github.com/mobiiin/WildFire_VQA.

URLs: https://github.com/mobiiin/WildFire_VQA.

cross Vibrotactile Preference Learning: Uncertainty-Aware Preference Learning for Personalized Vibration Feedback

Authors: Rongtao Zhang, Xin Zhu, Masoume Pourebadi Khotbehsara, Warren Dao, Erdem B{\i}y{\i}k, Heather Culbertson

Abstract: Individual differences in vibrotactile perception underscore the growing importance of personalization as haptic feedback becomes more prevalent in interactive systems. We propose Vibrotactile Preference Learning (VPL), a system that captures user-specific preference spaces over vibrotactile parameters via Gaussian-process-based uncertainty-aware preference learning. VPL uses an expected information gain-based acquisition strategy to guide query selection over 40 rounds of pairwise comparisons of overall user preference, augmented with user-reported uncertainty, enabling efficient exploration of the parameter space. We evaluate VPL in a user study (N = 13) using the vibrotactile feedback from a Microsoft Xbox controller, showing that it efficiently learns individualized preferences while maintaining comfortable, low-workload user interactions. These results highlight the potential of VPL for scalable personalization of vibrotactile experiences.

cross Mol-Debate: Multi-Agent Debate Improves Structural Reasoning in Molecular Design

Authors: Wengyu Zhang, Xiao-Yong Wei, Qing Li

Abstract: Text-guided molecular design is a key capability for AI-driven drug discovery, yet it remains challenging to map sequential natural-language instructions with non-linear molecular structures under strict chemical constraints. Most existing approaches, including RAG, CoT prompting, and fine-tuning or RL, emphasize a small set of ad-hoc reasoning perspectives implemented in a largely one-shot generation pipeline. In contrast, real-world drug discovery relies on dynamic, multi-perspective critique and iterative refinement to reconcile semantic intent with structural feasibility. Motivated by this, we propose Mol-Debate, a generation paradigm that enables such dynamic reasoning through an iterative generate-debate-refine loop. We further characterize key challenges in this paradigm and address them through perspective-oriented orchestration, including developer-debater conflict, global-local structural reasoning, and static-dynamic integration. Experiments demonstrate that Mol-Debate achieves state-of-the-art performance against strong general and chemical baselines, reaching 59.82% exact match on ChEBI-20 and 50.52% weighted success rate on S$^2$-Bench. Our code is available at https://github.com/wyuzh/Mol-Debate.

URLs: https://github.com/wyuzh/Mol-Debate.

cross RADS: Reinforcement Learning-Based Sample Selection Improves Transfer Learning in Low-resource and Imbalanced Clinical Settings

Authors: Wei Han, David Martinez, Anna Khanina, Lawrence Cavedon, Karin Verspoor

Abstract: A common strategy in transfer learning is few shot fine-tuning, but its success is highly dependent on the quality of samples selected as training examples. Active learning methods such as uncertainty sampling and diversity sampling can select useful samples. However, under extremely low-resource and class-imbalanced conditions, they often favor outliers rather than truly informative samples, resulting in degraded performance. In this paper, we introduce RADS (Reinforcement Adaptive Domain Sampling), a robust sample selection strategy using reinforcement learning (RL) to identify the most informative samples. Experimental evaluations on several real world clinical datasets show our sample selection strategy enhances model transferability while maintaining robust performance under extreme class imbalance compared to traditional methods.

cross AROMA: Augmented Reasoning Over a Multimodal Architecture for Virtual Cell Genetic Perturbation Modeling

Authors: Zhenyu Wang, Geyan Ye, Wei Liu, Man Tat Alexander Ng

Abstract: Virtual cell modeling predicts molecular state changes under genetic perturbations in silico, which is essential for biological mechanism studies. However, existing approaches suffer from unconstrained reasoning, uninterpretable predictions, and retrieval signals that are weakly aligned with regulatory topology. To address these limitations, we propose AROMA, an Augmented Reasoning Over a Multimodal Architecture for virtual cell genetic perturbation modeling. AROMA integrates textual evidence, graph-topology information, and protein sequence features to model perturbation-target dependencies, and is trained with a two-stage optimization strategy to yield predictions that are both accurate and interpretable. We also construct two knowledge graphs and a perturbation reasoning dataset, PerturbReason, containing more than 498k samples, as reusable resources for the virtual cell domain. Experiments show that AROMA outperforms existing methods across multiple cell lines, and remains robust under zero-shot evaluation on an unseen cell line, as well as in knowledge-sparse, long-tail scenarios. Overall, AROMA demonstrates that combining knowledge-driven multimodal modeling with evidence retrieval provides a promising pathway toward more reliable and interpretable virtual cell perturbation prediction. Model weights are available at https://huggingface.co/blazerye/AROMA. Code is available at https://github.com/blazerye/AROMA.

URLs: https://huggingface.co/blazerye/AROMA., https://github.com/blazerye/AROMA.

cross Online Survival Analysis: A Bandit Approach under Cox PH Model

Authors: Yang Xu, Wenbin Lu, Rui Song

Abstract: Survival analysis is a widely used statistical framework for modeling time-to-event data under censoring. Classical methods, such as the Cox proportional hazards (Cox PH) model, offer a semiparametric approach to estimating the effects of covariates on the hazard function. Despite its importance, survival analysis has been largely unexplored in online settings, particularly within the bandit framework, where decisions must be made sequentially to optimize treatments as new data arrive over time. In this work, we take an initial step toward integrating survival analysis into a purely online learning setting under the Cox PH model, addressing key challenges including staggered entry, delayed feedback, and right censoring. We adapt three canonical bandit algorithms to balance exploration and exploitation, with theoretical guarantees of sublinear regret bounds. Extensive simulations and semi-real experiments using SEER cancer data demonstrate that our approach enables rapid and effective learning of near-optimal treatment policies.

cross Properties and limitations of geometric tempering for gradient flow dynamics

Authors: Francesca Romana Crucinio, Sahani Pathiraja

Abstract: We consider the problem of sampling from a probability distribution $\pi$. It is well known that this can be written as an optimisation problem over the space of probability distributions in which we aim to minimise the Kullback--Leibler divergence from $\pi$. We consider the effect of replacing $\pi$ with a sequence of moving targets $(\pi_t)_{t\ge0}$ defined via geometric tempering on the Wasserstein and Fisher--Rao gradient flows. We show that convergence occurs exponentially in continuous time, providing novel bounds in both cases. We also consider popular time discretisations and explore their convergence properties. We show that in the Fisher--Rao case, replacing the target distribution with a geometric mixture of initial and target distribution never leads to a convergence speed up both in continuous time and in discrete time. Finally, we explore the gradient flow structure of tempered dynamics and derive novel adaptive tempering schedules.

cross Surrogate modeling for interpreting black-box LLMs in medical predictions

Authors: Changho Han, Songsoo Kim, Dong Won Kim, Leo Anthony Celi, Jaewoong Kim, SungA Bae, Dukyong Yoon

Abstract: Large language models (LLMs), trained on vast datasets, encode extensive real-world knowledge within their parameters, yet their black-box nature obscures the mechanisms and extent of this encoding. Surrogate modeling, which uses simplified models to approximate complex systems, can offer a path toward better interpretability of black-box models. We propose a surrogate modeling framework that quantitatively explains LLM-encoded knowledge. For a specific hypothesis derived from domain knowledge, this framework approximates the latent LLM knowledge space using observable elements (input-output pairs) through extensive prompting across a comprehensive range of simulated scenarios. Through proof-of-concept experiments in medical predictions, we demonstrate our framework's effectiveness in revealing the extent to which LLMs "perceive" each input variable in relation to the output. Particularly, given concerns that LLMs may perpetuate inaccuracies and societal biases embedded in their training data, our experiments using this framework quantitatively revealed both associations that contradict established medical knowledge and the persistence of scientifically refuted racial assumptions within LLM-encoded knowledge. By disclosing these issues, our framework can act as a red-flag indicator to support the safe and reliable application of these models.

cross AI models of unstable flow exhibit hallucination

Authors: Ramdhan Wibawa, Birendra Jha

Abstract: We report the first systematic evidence of hallucination in AI models of fluid dynamics, demonstrated in the canonical problem of hydrodynamically unstable transport known as viscous fingering. AI-based modeling of flow with instabilities remains challenging because rapidly evolving, multiscale fingering patterns are difficult to resolve accurately. We identify solutions that appear visually realistic yet are physically implausible, analogous to hallucinations in large language models. These hallucinations manifest as spurious fluid interfaces and reverse diffusion that violate conservation laws. We show that their origin lies in the spectral bias of AI models, which becomes dominant at high flow rates and viscosity contrasts. Guided by this insight, we introduce DeepFingers, a new framework for AI-driven fluid dynamics that enforces balanced learning across the full spectrum of spatial modes by combining the Fourier Neural Operator with a Deep Operator Network to predict the spatiotemporal evolution of viscous fingers. By conditioning on both time and viscosity contrast, DeepFingers learns mappings between successive concentration fields across regimes. The framework accurately captures tip splitting, finger merging, and channel formation while preserving global metrics of mixing. The results open a new research direction to investigate fundamental limitations in AI models of physical systems.

cross WebGen-R1: Incentivizing Large Language Models to Generate Functional and Aesthetic Websites with Reinforcement Learning

Authors: Juyong Jiang, Chenglin Cai, Chansung Park, Jiasi Shen, Sunghun Kim, Jianguo Li, Yue Wang

Abstract: While Large Language Models (LLMs) excel at function-level code generation, project-level tasks such as generating functional and visually aesthetic multi-page websites remain highly challenging. Existing works are often limited to single-page static websites, while agentic frameworks typically rely on multi-turn execution with proprietary models, leading to substantial token costs, high latency, and brittle integration. Training a small LLM end-to-end with reinforcement learning (RL) is a promising alternative, yet it faces a critical bottleneck in designing reliable and computationally feasible rewards for website generation. Unlike single-file coding tasks that can be verified by unit tests, website generation requires evaluating inherently subjective aesthetics, cross-page interactions, and functional correctness. To this end, we propose WebGen-R1, an end-to-end RL framework tailored for project-level website generation. We first introduce a scaffold-driven structured generation paradigm that constrains the large open-ended action space and preserves architectural integrity. We then design a novel cascaded multimodal reward that seamlessly couples structural guarantees with execution-grounded functional feedback and vision-based aesthetic supervision. Extensive experiments demonstrate that our WebGen-R1 substantially transforms a 7B base model from generating nearly nonfunctional websites into producing deployable, aesthetically aligned multi-page websites. Remarkably, our WebGen-R1 not only consistently outperforms heavily scaled open-source models (up to 72B), but also rivals the state-of-the-art DeepSeek-R1 (671B) in functional success, while substantially exceeding it in valid rendering and aesthetic alignment. These results position WebGen-R1 as a viable path for scaling small open models from function-level code generation to project-level web application generation.

cross DialToM: A Theory of Mind Benchmark for Forecasting State-Driven Dialogue Trajectories

Authors: Neemesh Yadav, Palakorn Achananuparp, Jing Jiang, Ee-Peng Lim

Abstract: Large Language Models (LLMs) have been shown to possess Theory of Mind (ToM) abilities. However, it remains unclear whether this stems from robust reasoning or spurious correlations. We introduce DialToM, a human-verified benchmark built from natural human dialogue using a multiple-choice framework. We evaluate not only mental state prediction (Literal ToM) but also the functional utility of these states (Functional ToM) through Prospective Diagnostic Forecasting -- probing whether models can identify state-consistent dialogue trajectories solely from mental-state profiles. Our results reveal a significant reasoning asymmetry: while LLMs excel at identifying mental states, most (except for Gemini 3 Pro) fail to leverage this understanding to forecast social trajectories. Additionally, we find only weak semantic similarities between human and LLM-generated inferences. To facilitate reproducibility, the DialToM dataset and evaluation code are publicly available at https://github.com/Stealth-py/DialToM.

URLs: https://github.com/Stealth-py/DialToM.

cross VTouch++: A Multimodal Dataset with Vision-Based Tactile Enhancement for Bimanual Manipulation

Authors: Qianxi Hua, Xinyue Li, Zheng Yan, Yang Li, Chi Zhang, Yongyao Li, Yufei Liu

Abstract: Embodied intelligence has advanced rapidly in recent years; however, bimanual manipulation-especially in contact-rich tasks remains challenging. This is largely due to the lack of datasets with rich physical interaction signals, systematic task organization, and sufficient scale. To address these limitations, we introduce the VTOUCH dataset. It leverages vision based tactile sensing to provide high-fidelity physical interaction signals, adopts a matrix-style task design to enable systematic learning, and employs automated data collection pipelines covering real-world, demand-driven scenarios to ensure scalability. To further validate the effectiveness of the dataset, we conduct extensive quantitative experiments on cross-modal retrieval as well as real-robot evaluation. Finally, we demonstrate real-world performance through generalizable inference across multiple robots, policies, and tasks.

cross Mechanistic Interpretability Tool for AI Weather Models

Authors: Kirsten I. Tempest, Matthias Beylich, George C. Craig

Abstract: Artificial Intelligence (AI) weather models are improving rapidly, and their forecasts are already competitive with long-established traditional Numerical Weather Prediction (NWP). To build confidence in this new methodology, it is critical that we understand how these predictions are generated. This is a huge challenge as these AI weather models remain largely black boxes. In other areas of Machine Learning (ML), mechanistic interpretability has emerged as a framework for understanding ML predictions by analysing the building blocks responsible for them. Here we present an open-source, highly adaptable tool which incorporates concepts from mechanistic interpretability. The tool organises internal latent representations from the model processor and allows for initial analyses, including cosine similarity and Principal Component Analysis (PCA), enabling the user to identify directions in latent space potentially associated with meteorological features. Applying our tool to the graph neural network GraphCast, we present preliminary case studies for mid-latitude synoptic-scale waves and specific humidity. These demonstrate the tool's ability to identify linear combinations of latent channels that appear to correspond to interpretable features.

cross MOMO: A framework for seamless physical, verbal, and graphical robot skill learning and adaptation

Authors: Markus Knauer, Edoardo Fiorini, Maximilian M\"uhlbauer, Stefan Schneyer, Promwat Angsuratanawech, Florian Samuel Lay, Timo Bachmann, Samuel Bustamante, Korbinian Nottensteiner, Freek Stulp, Alin Albu-Sch\"affer, Jo\~ao Silv\'erio, Thomas Eiband

Abstract: Industrial robot applications require increasingly flexible systems that non-expert users can easily adapt for varying tasks and environments. However, different adaptations benefit from different interaction modalities. We present an interactive framework that enables robot skill adaptation through three complementary modalities: kinesthetic touch for precise spatial corrections, natural language for high-level semantic modifications, and a graphical web interface for visualizing geometric relations and trajectories, inspecting and adjusting parameters, and editing via-points by drag-and-drop. The framework integrates five components: energy-based human-intention detection, a tool-based LLM architecture (where the LLM selects and parameterizes predefined functions rather than generating code) for safe natural language adaptation, Kernelized Movement Primitives (KMPs) for motion encoding, probabilistic Virtual Fixtures for guided demonstration recording, and ergodic control for surface finishing. We demonstrate that this tool-based LLM architecture generalizes skill adaptation from KMPs to ergodic control, enabling voice-commanded surface finishing. Validation on a 7-DoF torque-controlled robot at the Automatica 2025 trade fair demonstrates the practical applicability of our approach in industrial settings.

cross Temporal Difference Calibration in Sequential Tasks: Application to Vision-Language-Action Models

Authors: Shelly Francis-Meretzki, Mirco Mutti, Yaniv Romano, Aviv Tamar

Abstract: Recent advances in vision-language-action (VLA) models for robotics have highlighted the importance of reliable uncertainty quantification in sequential tasks. However, assessing and improving calibration in such settings remains mostly unexplored, especially when only partial trajectories are observed. In this work, we formulate sequential calibration for episodic tasks, where task-success confidence is produced along an episode, while success is determined at the end of it. We introduce a sequential extension of the Brier score and show that, for binary outcomes, its risk minimizer coincides with the VLA policy's value function. This connection bridges uncertainty calibration and reinforcement learning, enabling the use of temporal-difference (TD) value estimation as a principled calibration mechanism over time. We empirically show that TD calibration improves performance relative to the state-of-the-art on simulated and real-robot data. Interestingly, we show that when calibrated using TD, the VLA's single-step action probabilities can yield competitive uncertainty estimates, in contrast to recent findings that employed different calibration techniques.

cross Forecasting Individual NetFlows using a Predictive Masked Graph Autoencoder

Authors: Georgios Anyfantis, Pere Barlet-Ros

Abstract: In this paper, we propose a proof-of-concept Graph Neural Network model that can successfully predict network flow-level traffic (NetFlow) by accurately modelling the graph structure and the connection features. We use sliding-windows to split the network traffic in equal-sized heterogeneous bidirectional graphs containing IP, Port, and Connection nodes. We then use the GNN to model the evolution of the graph structure and the connection features. Our approach shows superior results when identifying the Port and IP to which connections attach, while feature reconstruction remains competitive with strong forecasting baselines. Overall, our work showcases the use of GNNs for per-flow NetFlow prediction.

cross Decentralized Machine Learning with Centralized Performance Guarantees via Gibbs Algorithms

Authors: Yaiza Bermudez, Samir Perlaza, I\~naki Esnaola

Abstract: In this paper, it is shown, for the first time, that centralized performance is achievable in decentralized learning without sharing the local datasets. Specifically, when clients adopt an empirical risk minimization with relative-entropy regularization (ERM-RER) learning framework and a forward-backward communication between clients is established, it suffices to share the locally obtained Gibbs measures to achieve the same performance as that of a centralized ERM-RER with access to all the datasets. The core idea is that the Gibbs measure produced by client~$k$ is used, as reference measure, by client~$k+1$. This effectively establishes a principled way to encode prior information through a reference measure. In particular, achieving centralized performance in the decentralized setting requires a specific scaling of the regularization factors with the local sample sizes. Overall, this result opens the door to novel decentralized learning paradigms that shift the collaboration strategy from sharing data to sharing the local inductive bias via the reference measures over the set of models.

cross Towards Certified Malware Detection: Provable Guarantees Against Evasion Attacks

Authors: Nandakrishna Giri, Asmitha K. A., Serena Nicolazzo, Antonino Nocera, Vinod P

Abstract: Machine learning-based static malware detectors remain vulnerable to adversarial evasion techniques, such as metamorphic engine mutations. To address this vulnerability, we propose a certifiably robust malware detection framework based on randomized smoothing through feature ablation and targeted noise injection. During evaluation, our system analyzes an executable by generating multiple ablated variants, classifies them by using a smoothed classifier, and identifies the final label based on the majority vote. By analyzing the top-class voting distribution and the Wilson score interval, we derive a formal certificate that guarantees robustness within a specific radius against feature-space perturbations. We evaluate our approach by comparing the performance of the base classifier and the smoothed classifier on both clean executables and ablated variants generated using PyMetaEngine. Our results demonstrate that the proposed smoothed classifier successfully provides certifiable robustness against metamorphic evasion attacks without requiring modifications to the underlying machine learning architecture.

cross Efficient Symbolic Computations for Identifying Causal Effects

Authors: Benjamin Hollering, Pratik Misra, Nils Sturma

Abstract: Determining identifiability of causal effects from observational data under latent confounding is a central challenge in causal inference. For linear structural causal models, identifiability of causal effects is decidable through symbolic computation. However, standard approaches based on Gr\"obner bases become computationally infeasible beyond small settings due to their doubly exponential complexity. In this work, we study how to practically use symbolic computation for deciding rational identifiability. In particular, we present an efficient algorithm that provably finds the lowest degree identifying formulas. For a causal effect of interest, if there exists an identification formula of a prespecified maximal degree, our algorithm returns such a formula in quasi-polynomial time.

cross On Bayesian Softmax-Gated Mixture-of-Experts Models

Authors: Nicola Bariletto, Huy Nguyen, Nhat Ho, Alessandro Rinaldo

Abstract: Mixture-of-experts models provide a flexible framework for learning complex probabilistic input-output relationships by combining multiple expert models through an input-dependent gating mechanism. These models have become increasingly prominent in modern machine learning, yet their theoretical properties in the Bayesian framework remain largely unexplored. In this paper, we study Bayesian mixture-of-experts models, focusing on the ubiquitous softmax-based gating mechanism. Specifically, we investigate the asymptotic behavior of the posterior distribution for three fundamental statistical tasks: density estimation, parameter estimation, and model selection. First, we establish posterior contraction rates for density estimation, both in the regimes with a fixed, known number of experts and with a random learnable number of experts. We then analyze parameter estimation and derive convergence guarantees based on tailored Voronoi-type losses, which account for the complex identifiability structure of mixture-of-experts models. Finally, we propose and analyze two complementary strategies for selecting the number of experts. Taken together, these results provide one of the first systematic theoretical analyses of Bayesian mixture-of-experts models with softmax gating, and yield several theory-grounded insights for practical model design.

cross Evaluating Assurance Cases as Text-Attributed Graphs for Structure and Provenance Analysis

Authors: Fariz Ikhwantri, Dusica Marijan

Abstract: An assurance case is a structured argument document that justifies claims about a system's requirements or properties, which are supported by evidence. In regulated domains, these are crucial for meeting compliance and safety requirements to industry standards. We propose a graph diagnostic framework for analysing the structure and provenance of assurance cases. We focus on two main tasks: (1) link prediction, to learn and identify connections between argument elements, and (2) graph classification, to differentiate between assurance cases created by a state-of-the-art large language model and those created by humans, aiming to detect bias. We compiled a publicly available dataset of assurance cases, represented as graphs with nodes and edges, supporting both link prediction and provenance analysis. Experiments show that graph neural networks (GNNs) achieve strong link prediction performance (ROC-AUC 0.760) on real assurance cases and generalise well across domains and semi-supervised settings. For provenance detection, GNNs effectively distinguish human-authored from LLM-generated cases (F1 0.94). We observed that LLM-generated assurance cases have different hierarchical linking patterns compared to human-authored cases. Furthermore, existing GNN explanation methods show only moderate faithfulness, revealing a gap between predicted reasoning and the true argument structure.

cross An explicit operator explains end-to-end computation in the modern neural networks used for sequence and language modeling

Authors: Anif N. Shikder, Ramit Dey, Sayantan Auddy, Luisa Liboni, Alexandra N. Busch, Arthur Powanwe, J\'an Min\'a\v{c}, Roberto C. Budzinski, Lyle E. Muller

Abstract: We establish a mathematical correspondence between state space models, a state-of-the-art architecture for capturing long-range dependencies in data, and an exactly solvable nonlinear oscillator network. As a specific example of this general correspondence, we analyze the diagonal linear time-invariant implementation of the Structured State Space Sequence model (S4). The correspondence embeds S4D, a specific implementation of S4, into a ring network topology, in which recent inputs are encoded, as waves of activity traveling over the one-dimensional spatial layout of the network. We then derive an exact operator expression for the full forward pass of S4D, yielding an analytical characterization of its complete input-output map. This expression reveals that the nonlinear decoder in the system induces interactions between these information-carrying waves that enable classifying real-world sequences. These results generalize across modern SSM architectures, and show that they admit an exact mathematical description with a clear physical interpretation. These insights enable a new level of interpretability for these systems in terms of nonlinear oscillator networks.

cross Self-Aware Vector Embeddings for Retrieval-Augmented Generation: A Neuroscience-Inspired Framework for Temporal, Confidence-Weighted, and Relational Knowledge

Authors: Naizhong Xu

Abstract: Modern retrieval-augmented generation (RAG) systems treat vector embeddings as static, context-free artifacts: an embedding has no notion of when it was created, how trustworthy its source is, or which other embeddings depend on it. This flattening of knowledge has a measurable cost: recent work on VersionRAG reports that conventional RAG achieves only 58% accuracy on versioned technical queries, because retrieval returns semantically similar but temporally invalid content. We propose SmartVector, a framework that augments dense embeddings with three explicit properties -- temporal awareness, confidence decay, and relational awareness -- and a five-stage lifecycle modeled on hippocampal-neocortical memory consolidation. A retrieval pipeline replaces pure cosine similarity with a four-signal score that mixes semantic relevance, temporal validity, live confidence, and graph-relational importance. A background consolidation agent detects contradictions, builds dependency edges, and propagates updates along those edges as graph-neural-network-style messages. Confidence is governed by a closed-form function combining an Ebbinghaus-style exponential decay, user-feedback reconsolidation, and logarithmic access reinforcement. We formalize the model, relate it to temporal knowledge graph embedding, agentic memory architectures, and uncertainty-aware RAG, and present a reference implementation. On a reproducible synthetic versioned-policy benchmark of 258 vectors and 138 queries, SmartVector roughly doubles top-1 accuracy over plain cosine RAG (62.0% vs. 31.0% on a held-out split), drops stale-answer rate from 35.0% to 13.3%, cuts Expected Calibration Error by nearly 2x (0.244 vs. 0.470), reduces re-embedding cost per single-word edit by 77%, and is robust across contradiction-injection rates from 0% to 75%.

cross pAI/MSc: ML Theory Research with Humans on the Loop

Authors: Mahmoud Abdelmoneum, Pierfrancesco Beneventano, Tomaso Poggio

Abstract: We present pAI/MSc, an open-source, customizable, modular multi-agent system for academic research workflows. Our goal is not autonomous scientific ideation, nor fully automated research. It is narrower and more practical: to reduce by orders of magnitude the human steering required to turn a specified hypothesis into a literature-grounded, mathematically established, experimentally supported, submission-oriented manuscript draft. pAI/MSc is built with a current emphasis on machine learning theory and adjacent quantitative fields.

cross A weighted angle distance on strings

Authors: Grant Molnar

Abstract: We define a multi-scale metric $d_\rho$ on strings by aggregating angle distances between all $n$-gram count vectors with exponential weights $\rho^n$. We benchmark $d_\rho$ in DBSCAN clustering against edit and $n$-gram baselines, give a linear-time suffix-tree algorithm for evaluation, prove metric and stability properties (including robustness under tandem-repeat stutters), and characterize isometries.

cross Auto-ART: Structured Literature Synthesis and Automated Adversarial Robustness Testing

Authors: Abhijit Talluri

Abstract: Adversarial robustness evaluation underpins every claim of trustworthy ML deployment, yet the field suffers from fragmented protocols and undetected gradient masking. We make two contributions. (1) Structured synthesis. We analyze nine peer-reviewed corpus sources (2020--2026) through seven complementary protocols, producing the first end-to-end structured analysis of the field's consensus and unresolved challenges. (2) Auto-ART framework. We introduce Auto-ART, an open-source framework that operationalizes identified gaps: 50+ attacks, 28 defense modules, the Robustness Diagnostic Index (RDI), and gradient-masking detection. It supports multi-norm evaluation (l1/l2/linf/semantic/spatial) and compliance mapping to NIST AI RMF, OWASP LLM Top 10, and the EU AI Act. Empirical validation on RobustBench demonstrates that Auto-ART's pre-screening identifies gradient masking in 92% of flagged cases, and RDI rankings correlate highly with full AutoAttack. Multi-norm evaluation exposes a 23.5 pp gap between average and worst-case robustness on state-of-the-art models. No prior work combines such structured meta-scientific analysis with an executable evaluation framework bridging literature gaps into engineering.

cross AAC: Admissible-by-Architecture Differentiable Landmark Compression for ALT

Authors: An T. Le, Vien Ngo

Abstract: We introduce \textbf{AAC} (Architecturally Admissible Compressor), a differentiable landmark-selection module for ALT (A*, Landmarks, and Triangle inequality) shortest-path heuristics whose outputs are admissible by construction: each forward pass is a row-stochastic mixture of triangle-inequality lower bounds, so the heuristic is admissible for \emph{every} parameter setting without requiring convergence, calibration, or projection. At deployment, the module reduces to classical ALT on a learned subset, composing end-to-end with neural encoders while preserving the classical toolchain. The construction is the first differentiable instance of the compress-while-preserving-admissibility tradition in classical heuristic search. Under a matched per-vertex memory protocol, we establish that ALT with farthest-point-sampling landmarks (FPS-ALT) has provably near-optimal coverage on metric graphs, leaving at most a few percentage points of headroom for \emph{any} selector. AAC operates near this ceiling: the gap is $0.9$--$3.9$ percentage points on 9 road networks and ${\leq}1.3$ percentage points on synthetic graphs, with zero admissibility violations across $1{,}500+$ queries and all logged runs. At matched memory, AAC is also $1.2$--$1.5{\times}$ faster than FPS-ALT at the median query on DIMACS road networks, amortizing its offline cost within $170$--$1{,}924$ queries. A controlled ablation isolates the binding constraint: training-objective drift under default initialization, not architectural capacity; identity-on-first-$m$ initialization closes the expansion-count gap entirely. We release the module, a reusable matched-memory benchmarking protocol with paired two-one-sided-test (TOST) equivalence and pre-registration, and a reference compressed-differential-heuristics baseline.

cross V-tableR1: Process-Supervised Multimodal Table Reasoning with Critic-Guided Policy Optimization

Authors: Yubo Jiang, Yitong An, Xin Yang, Abudukelimu Wuerkaixi, Xuxin Cheng, Fengying Xie, Zhiguo Jiang, Cao Liu, Ke Zeng, Haopeng Zhang

Abstract: We introduce V-tableR1, a process-supervised reinforcement learning framework that elicits rigorous, verifiable reasoning from multimodal large language models (MLLMs). Current MLLMs trained solely on final outcomes often treat visual reasoning as a black box, relying on superficial pattern matching rather than performing rigorous multi-step inference. While Reinforcement Learning with Verifiable Rewards could enforce transparent reasoning trajectories, extending it to visual domains remains severely hindered by the ambiguity of grounding abstract logic into continuous pixel space. We solve this by leveraging the deterministic grid structure of tables as an ideal visual testbed. V-tableR1 employs a specialized critic VLM to provide dense, step-level feedback on the explicit visual chain-of-thought generated by a policy VLM. To optimize this system, we propose Process-Guided Direct Alignment Policy Optimization (PGPO), a novel RL algorithm integrating process rewards, decoupled policy constraints, and length-aware dynamic sampling. Extensive evaluations demonstrate that V-tableR1 explicitly penalizes visual hallucinations and shortcut guessing. By fundamentally shifting multimodal inference from black-box pattern matching to verifiable logical derivation, V-tableR1 4B establishes state-of-the-art accuracy among open-source models on complex tabular benchmarks, outperforming models up to 18x its size and improving over its SFT baseline

cross Coverage, Not Averages: Semantic Stratification for Trustworthy Retrieval Evaluation

Authors: Andrew Klearman, Radu Revutchi, Rohin Garg, Rishav Chakravarti, Samuel Marc Denton, Yuan Xue

Abstract: Retrieval quality is the primary bottleneck for accuracy and robustness in retrieval-augmented generation (RAG). Current evaluation relies on heuristically constructed query sets, which introduce a hidden intrinsic bias. We formalize retrieval evaluation as a statistical estimation problem, showing that metric reliability is fundamentally limited by the evaluation-set construction. We further introduce \emph{semantic stratification}, which grounds evaluation in corpus structure by organizing documents into an interpretable global space of entity-based clusters and systematically generating queries for missing strata. This yields (1) formal semantic coverage guarantees across retrieval regimes and (2) interpretable visibility into retrieval failure modes. Experiments across multiple benchmarks and retrieval methods validate our framework. The results expose systematic coverage gaps, identify structural signals that explain variance in retrieval performance, and show that stratified evaluation yields more stable and transparent assessments while supporting more trustworthy decision-making than aggregate metrics.

cross Personalized electric vehicle energy consumption estimation framework that integrates driver behavior with map data

Authors: Sreechakra Vasudeva Raju Rachavelpula, Sangwhan Cha

Abstract: This paper presents a personalized Battery Electric Vehicle (BEV) energy consumption estimation framework that integrates map-based contextual features with driver-specific velocity prediction and physics-based energy consumption modeling. The system combines route selection, detailed road feature processing, a rule-based reference velocity generator, a PID controller-based vehicle dynamics simulator, and a Bidirectional LSTM model trained to reproduce individual driving behavior. The predicted individual-specific velocity profiles are coupled with a quasi-steady backward energy consumption model to compute tractive power, regenerative braking, and State-of-Charge (SOC) evolution. Evaluation across urban, freeway, and hilly routes demonstrates that the proposed approach captures key driver behavioral patterns such as deceleration at intersections, speed-limit tracking, and road grade-dependent responses, while producing accurate power and SOC trajectories. The results highlight the effectiveness of combining learned driver behavior with map-based context and physics-based energy consumption modeling to produce accurate, personalized BEV SOC depletion profiles.

cross Working Memory Constraints Scaffold Learning in Transformers under Data Scarcity

Authors: Pranava Madhyastha, Dagmar Adamcova

Abstract: We investigate the integration of human-like working memory constraints into the Transformer architecture and implement several cognitively inspired attention variants, including fixed-width windows based and temporal decay based attention mechanisms. Our modified GPT-2 models are trained from scratch on developmentally plausible datasets (10M and 100M words). Performance is evaluated on grammatical judgment tasks (BLiMP) and alignment with human reading time data. Our results indicate that these cognitively-inspired constraints, particularly fixed-width attention, can significantly improve grammatical accuracy especially when training data is scarce. These constrained models also tend to show a stronger alignment with human processing metrics. The findings suggest that such constraints may serve as a beneficial inductive bias, guiding models towards more robust linguistic representations, especially in data-limited settings.

cross Gauge-Equivariant Graph Neural Networks for Lattice Gauge Theories

Authors: Ali Rayat, Yaohang Li, Gia-Wei Chern

Abstract: Local gauge symmetry underlies fundamental interactions and strongly correlated quantum matter, yet existing machine-learning approaches lack a general, principled framework for learning under site-dependent symmetries, particularly for intrinsically nonlocal observables. Here we introduce a gauge-equivariant graph neural network that embeds non-Abelian symmetry directly into message passing via matrix-valued, gauge-covariant features and symmetry-compatible updates, extending equivariant learning from global to fully local symmetries. In this formulation, message passing implements gauge-covariant transport across the lattice, allowing nonlocal correlations and loop-like structures to emerge naturally from local operations. We validate the approach across pure gauge, gauge-matter, and dynamical regimes, establishing gauge-equivariant message passing as a general paradigm for learning in systems governed by local symmetry.

cross LEXIS: LatEnt ProXimal Interaction Signatures for 3D HOI from an Image

Authors: Dimitrije Anti\'c, Alvaro Budria, George Paschalidis, Sai Kumar Dwivedi, Dimitrios Tzionas

Abstract: Reconstructing 3D Human-Object Interaction from an RGB image is essential for perceptive systems. Yet, this remains challenging as it requires capturing the subtle physical coupling between the body and objects. While current methods rely on sparse, binary contact cues, these fail to model the continuous proximity and dense spatial relationships that characterize natural interactions. We address this limitation via InterFields, a representation that encodes dense, continuous proximity across the entire body and object surfaces. However, inferring these fields from single images is inherently ill-posed. To tackle this, our intuition is that interaction patterns are characteristically structured by the action and object geometry. We capture this structure in LEXIS, a novel discrete manifold of interaction signatures learned via a VQ-VAE. We then develop LEXIS-Flow, a diffusion framework that leverages LEXIS signatures to estimate human and object meshes alongside their InterFields. Notably, these InterFields help in a guided refinement that ensures physically-plausible, proximity-aware reconstructions without requiring post-hoc optimization. Evaluation on Open3DHOI and BEHAVE shows that LEXIS-Flow significantly outperforms existing SotA baselines in reconstruction, contact, and proximity quality. Our approach not only improves generalization but also yields reconstructions perceived as more realistic, moving us closer to holistic 3D scene understanding. Code & models will be public at https://anticdimi.github.io/lexis.

URLs: https://anticdimi.github.io/lexis.

cross Convergent Evolution: How Different Language Models Learn Similar Number Representations

Authors: Deqing Fu, Tianyi Zhou, Mikhail Belkin, Vatsal Sharan, Robin Jia

Abstract: Language models trained on natural text learn to represent numbers using periodic features with dominant periods at $T=2, 5, 10$. In this paper, we identify a two-tiered hierarchy of these features: while Transformers, Linear RNNs, LSTMs, and classical word embeddings trained in different ways all learn features that have period-$T$ spikes in the Fourier domain, only some learn geometrically separable features that can be used to linearly classify a number mod-$T$. To explain this incongruity, we prove that Fourier domain sparsity is necessary but not sufficient for mod-$T$ geometric separability. Empirically, we investigate when model training yields geometrically separable features, finding that the data, architecture, optimizer, and tokenizer all play key roles. In particular, we identify two different routes through which models can acquire geometrically separable features: they can learn them from complementary co-occurrence signals in general language data, including text-number co-occurrence and cross-number interaction, or from multi-token (but not single-token) addition problems. Overall, our results highlight the phenomenon of convergent evolution in feature learning: A diverse range of models learn similar features from different training signals.

cross Global Offshore Wind Infrastructure: Deployment and Operational Dynamics from Dense Sentinel-1 Time Series

Authors: Thorsten Hoeser, Felix Bachofer, Claudia Kuenzer

Abstract: The offshore wind energy sector is expanding rapidly, increasing the need for independent, high-temporal-resolution monitoring of infrastructure deployment and operation at global scale. While Earth Observation based offshore wind infrastructure mapping has matured for spatial localization, existing open datasets lack temporally dense and semantically fine-grained information on construction and operational dynamics. We introduce a global Sentinel-1 synthetic aperture radar (SAR) time series data corpus that resolves deployment and operational phases of offshore wind infrastructure from 2016Q1 to 2025Q1. Building on an updated object detection workflow, we compile 15,606 time series at detected infrastructure locations, with overall 14,840,637 events as analysis-ready 1D SAR backscatter profiles, one profile per Sentinel-1 acquisition and location. To enable direct use and benchmarking, we release (i) the analysis ready 1D SAR profiles, (ii) event-level baseline semantic labels generated by a rule-based classifier, and (iii) an expert-annotated benchmark dataset of 553 time series with 328,657 event labels. The baseline classifier achieves a macro F1 score of 0.84 in event-wise evaluation and an area under the collapsed edit similarity-quality threshold curve (AUC) of 0.785, indicating temporal coherence. We demonstrate that the resulting corpus supports global-scale analyses of deployment dynamics, the identification of differences in regional deployment patterns, vessel interactions, and operational events, and provides a reference for developing and comparing time series classification methods for offshore wind infrastructure monitoring.

replace Issues with Value-Based Multi-objective Reinforcement Learning: Value Function Interference and Overestimation Sensitivity

Authors: Peter Vamplew (EJ), Ethan (EJ), Watkins, Cameron Foale, Richard Dazeley

Abstract: Multi-objective reinforcement learning (MORL) algorithms extend conventional reinforcement learning (RL) to the more general case of problems with multiple, conflicting objectives, represented by vector-valued rewards. Widely-used scalar RL methods such as Q-learning can be modified to handle multiple objectives by (1) learning vector-valued value functions, and (2) performing action selection using a scalarisation or ordering operator which reflects the user's preferences with respect to the different objectives. This paper investigates two previously unreported issues which can hinder the performance of value-based MORL algorithms when applied in conjunction with a non-linear utility function -- value function interference, and sensitivity to overestimation. We illustrate the nature of these phenomena on simple multi-objective MDPs using a tabular implementation of multiobjective Q-learning.

replace FlashNorm: Fast Normalization for Transformers

Authors: Nils Graef, Filip Makraduli, Andrew Wasielewski, Matthew Clapp

Abstract: Normalization layers are ubiquitous in large language models (LLMs) yet represent a compute bottleneck: on hardware with distinct vector and matrix execution units, the RMS calculation blocks the subsequent matrix multiplication, preventing parallel execution. We present FlashNorm, an exact reformulation of RMSNorm followed by a linear layer that (i) eliminates the normalization weights by folding them into the subsequent linear layer, and (ii) defers the scalar RMS normalization to the output of the matrix multiplication, enabling the two operations to execute in parallel. FlashNorm is mathematically identical to the original computation, it introduces no approximation and requires no retraining. The same technique extends to LayerNorm, Dynamic Tanh (DyT), feed-forward networks with GLU variants, and RoPE-based attention. On an NVIDIA T4 GPU, FlashNorm achieves 33 to 35% lower latency on the norm-then-project operation in the compute-bound (prefill) regime at SmolLM2-135M scale, and 12 to 14% at Llama-7B scale. We verify zero-loss weight folding on SmolLM2-135M, Llama-3.2-1B, and Llama-3.1-8B. Beyond inference speed, FlashNorm simplifies model implementations by reducing parameter tensor count, analogous to the simplification achieved by PaLM's removal of bias-parameters from all linear layers. Watch our explainer video https://youtu.be/GEuJv34_XgU?si and see https://github.com/OpenMachine-ai/transformer-tricks for code.

URLs: https://youtu.be/GEuJv34_XgU?si, https://github.com/OpenMachine-ai/transformer-tricks

replace Fairness-Aware Multi-Group Target Detection in Online Discussion

Authors: Soumyajit Gupta, Maria De-Arteaga, Matthew Lease

Abstract: Target-group detection is the task of detecting which group(s) a piece of content is ``directed at or about''. Applications include targeted marketing, content recommendation, and group-specific content assessment. Key challenges include: 1) that a single post may target multiple groups; and 2) ensuring consistent detection accuracy across groups for fairness. In this work, we investigate fairness implications of target-group detection in the context of toxicity detection, where the perceived harm of a social media post often depends on which group(s) it targets. Because toxicity is highly contextual, language that appears benign in general can be harmful when targeting specific demographic groups. We show our {\em fairness-aware multi-group target detection} approach both reduces bias across groups and shows strong predictive performance, surpassing existing fairness-aware baselines. To enable reproducibility and spur future work, we share our code online.

replace The Costs of Pretending That There Are Data-Generating Probability Distributions in the Social World

Authors: Benedikt H\"oltgen, Robert C. Williamson

Abstract: Machine Learning research, including work promoting fair or equitable algorithms, often relies on the concept of a data-generating probability distribution. The standard presumption is that since data points are 'sampled from' such a distribution, one can learn from observed data about this distribution and, thus, predict future data points which are also drawn from it. We argue, however, that such true probability distributions do not exist and that the rhetoric around them is harmful in social settings. We show that alternative frameworks focusing directly on relevant populations rather than abstract distributions are available and leave classical learning theory almost unchanged. Furthermore, we argue that the assumption of true probabilities or data-generating distributions can be misleading and obscure both the choices made and the goals pursued in machine learning practice. Based on these considerations, we suggest avoiding the assumption of data-generating probability distributions in the social world.

replace Towards Certified Unlearning for Deep Neural Networks

Authors: Binchi Zhang, Yushun Dong, Tianhao Wang, Jundong Li

Abstract: In the field of machine unlearning, certified unlearning has been extensively studied in convex machine learning models due to its high efficiency and strong theoretical guarantees. However, its application to deep neural networks (DNNs), known for their highly nonconvex nature, still poses challenges. To bridge the gap between certified unlearning and DNNs, we propose several simple techniques to extend certified unlearning methods to nonconvex objectives. To reduce the time complexity, we develop an efficient computation method by inverse Hessian approximation without compromising certification guarantees. In addition, we extend our discussion of certification to nonconvergence training and sequential unlearning, considering that real-world users can send unlearning requests at different time points. Extensive experiments on three real-world datasets demonstrate the efficacy of our method and the advantages of certified unlearning in DNNs.

replace Verification of Machine Unlearning is Fragile

Authors: Binchi Zhang, Zihan Chen, Cong Shen, Jundong Li

Abstract: As privacy concerns escalate in the realm of machine learning, data owners now have the option to utilize machine unlearning to remove their data from machine learning models, following recent legislation. To enhance transparency in machine unlearning and avoid potential dishonesty by model providers, various verification strategies have been proposed. These strategies enable data owners to ascertain whether their target data has been effectively unlearned from the model. However, our understanding of the safety issues of machine unlearning verification remains nascent. In this paper, we explore the novel research question of whether model providers can circumvent verification strategies while retaining the information of data supposedly unlearned. Our investigation leads to a pessimistic answer: \textit{the verification of machine unlearning is fragile}. Specifically, we categorize the current verification strategies regarding potential dishonesty among model providers into two types. Subsequently, we introduce two novel adversarial unlearning processes capable of circumventing both types. We validate the efficacy of our methods through theoretical analysis and empirical experiments using real-world datasets. This study highlights the vulnerabilities and limitations in machine unlearning verification, paving the way for further research into the safety of machine unlearning.

replace MixLLM: LLM Quantization with Global Mixed-precision between Output-features and Highly-efficient System Design

Authors: Zhen Zheng, Xiaonan Song, Chuanjie Liu

Abstract: Quantization has become one of the most effective methodologies to compress LLMs into smaller size. However, the existing quantization solutions still show limitations of either non-negligible accuracy drop or low system efficiency. In this paper, we propose MixLLM that explores the optimization space of mixed-precision quantization between output features, based on the insight that different features matter differently in the model. MixLLM identifies the important output features in the global view rather than within each single layer, effectively assigning larger bit-width to output features that need it the most to achieve high accuracy and low memory usage. We present the sweet spot of quantization configuration of algorithm-system co-design with high accuracy and system efficiency. To address the system challenge, we design the two-step dequantization to make use of the Tensor Core easily and fast data type conversion to reduce dequantization overhead, and present the software pipeline to overlap the memory access, dequantization and the MatMul to the best. Extensive experiments show that with only 10\% more bits, the perplexity increase can be reduced from about 0.5 in SOTA to within 0.2 for Llama 3.1 70B, while MMLU-Pro loss can be reduced from 1.92 to 0.99 over the SOTA of three popular models. Besides its superior accuracy, MixLLM also achieves state-of-the-art system efficiency. Code is released at https://github.com/microsoft/MixLLM.

URLs: https://github.com/microsoft/MixLLM.

replace Best Policy Learning from Trajectory Preference Feedback

Authors: Akhil Agnihotri, Rahul Jain, Deepak Ramachandran, Zheng Wen

Abstract: Reinforcement Learning from Human Feedback (RLHF) has emerged as a powerful approach for aligning generative models, but its reliance on learned reward models makes it vulnerable to mis-specification and reward hacking. Preference-based Reinforcement Learning (PbRL) offers a more robust alternative by directly leveraging noisy binary comparisons over trajectories. We study the best policy identification problem in PbRL, motivated by post-training optimization of generative models, for example, during multi-turn interactions. Learning in this setting combines an offline preference dataset - potentially biased or out-of-distribution and collected from a rater of subpar `competence' - with online pure exploration, making systematic online learning essential. To this end, we propose Posterior Sampling for Preference Learning ($\mathsf{PSPL}$), a novel algorithm inspired by Top-Two Thompson Sampling that maintains posteriors over the reward model and dynamics. We provide the first Bayesian simple regret guarantees for PbRL and introduce an efficient approximation that outperforms existing baselines on simulation and image generation benchmarks.

replace Recency Biased Causal Attention for Time-series Forecasting

Authors: Kareem Hegazy, Michael W. Mahoney, N. Benjamin Erichson

Abstract: Recency bias is a useful inductive prior for sequential modeling: it emphasizes nearby observations and can still allow longer-range dependencies. Standard Transformer attention lacks this property, relying on all-to-all interactions that overlook the causal and often local structure of temporal data. We propose a simple mechanism to introduce recency bias by reweighting attention scores with a smooth heavy-tailed decay. This adjustment strengthens local temporal dependencies without sacrificing the flexibility to capture broader and data-specific correlations. We show that recency-biased attention consistently improves sequential modeling, aligning Transformer more closely with the read, ignore, and write operations of RNNs. Finally, we demonstrate that our approach achieves competitive and often superior performance on challenging time-series forecasting benchmarks.

replace Not All Rollouts are Useful: Down-Sampling Rollouts in LLM Reinforcement Learning

Authors: Yixuan Even Xu, Yash Savani, Fei Fang, J. Zico Kolter

Abstract: Reinforcement learning with verifiable rewards (RLVR) has emerged as the leading approach for enhancing reasoning capabilities in large language models. However, it faces a fundamental compute and memory asymmetry: rollout generation is embarrassingly parallel and memory-light, whereas policy updates are communication-heavy and memory-intensive. To address this, we introduce PODS (Policy Optimization with Down-Sampling), which decouples rollout generation from policy updates by training only on a strategically selected subset of rollouts, maintaining learning quality while dramatically reducing update costs. We propose a principled subset selection criterion, max-variance down-sampling, that maximizes reward diversity, and provide an efficient $O(n\log n)$ implementation. Empirically, Group Relative Policy Optimization (GRPO) with PODS achieves the peak test accuracy of vanilla GRPO at least $\mathbf{1.7\times}$ faster across the different reasoning benchmarks and hardware configurations we tested.

replace Kalman Filter Enhanced GRPO for Reinforcement Learning-Based Language Model Reasoning

Authors: Hu Wang, Congbo Ma, Ian Reid, Mohammad Yaqub

Abstract: The advantage function is a central concept in RL that helps reduce variance in policy gradient estimates. For language modeling, Group Relative Policy Optimization (GRPO) was proposed to use the within-group sample mean as a baseline for advantage normalization. This estimator can be sensitive to small group size and rollout-level stochasticity, which may lead to suboptimal advantage estimates in some settings. In this paper, we propose Kalman Filter Enhanced Group Relative Policy Optimization (KRPO), a lightweight variant that treats per-group rewards as noisy observations of a latent prompt-level reward baseline and uses a 1D Kalman filter to estimate both the baseline and its uncertainty. KRPO introduces no additional learned parameters and can be integrated into GRPO with minimal computational overhead. On mathematical reasoning benchmarks, KRPO consistently improves training reward curves and final accuracy over GRPO. These results suggest that adaptive advantage estimation is a promising direction for critic-free reinforcement learning in language model reasoning. The code is available at https://github.com/billhhh/KRPO_LLMs_RL.

URLs: https://github.com/billhhh/KRPO_LLMs_RL.

replace On the definition and importance of interpretability in scientific machine learning

Authors: Conor Rowan, Alireza Doostan

Abstract: Though neural networks trained on large datasets have been successfully used to describe and predict many physical phenomena, there is a sense among scientists that, unlike traditional scientific models comprising simple mathematical expressions, their findings cannot be integrated into the body of scientific knowledge. Critics of machine learning's inability to produce human-understandable relationships have converged on the concept of "interpretability" as its point of departure from more traditional forms of science. As the growing interest in interpretability has shown, researchers in the physical sciences seek not just predictive models, but also to uncover the fundamental principles that govern a system of interest. However, clarity around a definition of interpretability and the precise role that it plays in science is lacking in the literature. In this work, we argue that researchers in equation discovery and symbolic regression tend to conflate the concept of sparsity with interpretability. We review key papers on interpretable machine learning from outside the scientific community and argue that, though the definitions and methods they propose can inform questions of interpretability for scientific machine learning (SciML), they are inadequate for this new purpose. Noting these deficiencies, we propose an operational definition of interpretability for the physical sciences. Our notion of interpretability emphasizes understanding of the mechanism over mathematical sparsity. Innocuous though it may seem, this emphasis on mechanism shows that sparsity is often unnecessary. It also questions the possibility of interpretable scientific discovery when prior knowledge is lacking. We believe a precise and philosophically informed definition of interpretability in SciML will help focus research efforts toward the most significant obstacles to realizing a data-driven scientific future.

replace Latent Stochastic Interpolants

Authors: Saurabh Singh, Dmitry Lagun

Abstract: Stochastic Interpolants (SI) is a powerful framework for generative modeling, capable of flexibly transforming between two probability distributions. However, its use in jointly optimized latent variable models remains unexplored as it requires direct access to the samples from the two distributions. This work presents Latent Stochastic Interpolants (LSI) enabling joint learning in a latent space with end-to-end optimized encoder, decoder and latent SI models. We achieve this by developing a principled Evidence Lower Bound (ELBO) objective derived directly in continuous time. The joint optimization allows LSI to learn effective latent representations along with a generative process that transforms an arbitrary prior distribution into the encoder-defined aggregated posterior. LSI sidesteps the simple priors of the normal diffusion models and mitigates the computational demands of applying SI directly in high-dimensional observation spaces, while preserving the generative flexibility of the SI framework. We demonstrate the efficacy of LSI through comprehensive experiments on the standard large scale ImageNet generation benchmark.

replace On the Existence of Universal Simulators of Attention

Authors: Debanjan Dutta, Anish Chakrabarty, Faizanuddin Ansari, Swagatam Das

Abstract: Previous work on the learnability of transformers \textemdash\ focused primarily on examining their ability to approximate specific algorithmic patterns through training \textemdash\ has largely been data-driven, offering only probabilistic guarantees rather than deterministic solutions. Expressivity, on the contrary, has been devised to address the problems \emph{computable} by such architecture theoretically. These results proved the Turing-completeness of transformers, investigated bounds focused on circuit complexity, and formal logic. Being at the crossroad between learnability and expressivity, the question remains: \emph{can a transformer, as a computational model, simulate an arbitrary attention mechanism, or in particular, the underlying operations?} In this study, we investigate the transformer encoder's ability to simulate a vanilla attention mechanism. By constructing a universal simulator $\mathcal{U}$ composed of transformer encoders, we present algorithmic solutions to replicate attention outputs and the underlying elementary matrix and activation operations via RASP, a formal framework for transformer computation. We show the existence of an algorithmically achievable, data-agnostic solution, previously known to be approximated only by learning.

replace Optimal Single-Policy Sample Complexity and Transient Coverage for Average-Reward Offline RL

Authors: Matthew Zurek, Guy Zamir, Yudong Chen

Abstract: We study offline reinforcement learning in average-reward MDPs, which presents increased challenges from the perspectives of distribution shift and non-uniform coverage, and has been relatively underexamined from a theoretical perspective. While previous work obtains performance guarantees under single-policy data coverage assumptions, such guarantees utilize additional complexity measures which are uniform over all policies, such as the uniform mixing time. We develop sharp guarantees depending only on the target policy, specifically the bias span and a novel policy hitting radius, yielding the first fully single-policy sample complexity bound for average-reward offline RL. We are also the first to handle general weakly communicating MDPs, contrasting restrictive structural assumptions made in prior work. To achieve this, we introduce an algorithm based on pessimistic discounted value iteration enhanced by a novel quantile clipping technique, which enables the use of a sharper empirical-span-based penalty function. Our algorithm also does not require any prior parameter knowledge for its implementation. Remarkably, we show via hard examples that learning under our conditions requires coverage assumptions beyond the stationary distribution of the target policy, distinguishing single-policy complexity measures from previously examined cases. We also develop lower bounds nearly matching our main result.

replace FeDa4Fair: Client-Level Federated Datasets for Fairness Evaluation

Authors: Xenia Heilmann, Luca Corbucci, Mattia Cerrato, Anna Monreale

Abstract: Federated Learning (FL) enables collaborative training while preserving privacy, yet it introduces a critical challenge: the "illusion of fairness''. A global model, usually evaluated on the server, appears fair on average while keeping persistent discrimination at the client level. Current fairness-enhancing FL solutions often fall short, as they typically mitigate biases for a single, usually binary, sensitive attribute, while ignoring two realistic and conflicting scenarios: attribute-bias (where clients are unfair toward different sensitive attributes) and value-bias (where clients exhibit conflicting biases toward different values of the same attribute). To support more robust and reproducible fairness research in FL, we introduce FeDa4Fair, the first benchmarking framework designed to stress-test fairness methods under these heterogeneous conditions. Our contributions are three-fold: (1) We introduce FeDa4Fair, a library designed to create datasets tailored to evaluating fair FL methods under heterogeneous client bias; (2) we release a benchmark suite generated by the FeDa4Fair library to standardize the evaluation of fair FL methods; (3) we provide ready-to-use functions for evaluating fairness outcomes for these datasets.

replace The Ratchet Effect in Silico through Interaction-Driven Cumulative Intelligence in Large Language Models

Authors: Ren Zhuang

Abstract: Human intelligence scales through cumulative cultural evolution (CCE), a ratchet process in which innovations are retained against entropic drift. Large language model training, by contrast, still depends primarily on static corpora and parameter growth, leaving little room for endogenous accumulation through interaction. We present POLIS (Population Orchestrated Learning and Inference Society), a framework in which heterogeneous agents generate solutions, verify one another's outputs, retain validated artifacts in shared cultural memory, and internalize them through parameter updates. On mathematical reasoning benchmarks, populations of 1--4B-parameter models achieved average gains of 8.8--18.9 points over base models and narrowed the gap to 70B+ monoliths. Mechanistic ablations identify peer verification as the main ratchet operator and show that internalization sustains accumulation across rounds, providing computational evidence that epistemic vigilance organizes durable knowledge growth. These results position structured social interaction as a scaling lever orthogonal to parameter count.

replace FLOSS: Federated Learning with Opt-Out and Straggler Support

Authors: David J Goetze, Dahlia J Felten, Jeannie R Albrecht, Rohit Bhattacharya

Abstract: Previous work on data privacy in federated learning systems focuses on privacy-preserving operations for data from users who have agreed to share their data for training. However, modern data privacy agreements also empower users to use the system while opting out of sharing their data as desired. When combined with stragglers that arise from heterogeneous device capabilities, the result is missing data from a variety of sources that introduces bias and degrades model performance. In this paper, we present FLOSS, a system that mitigates the impacts of such missing data on federated learning in the presence of stragglers and user opt-out, and empirically demonstrate its performance in simulations.

replace KANMixer: a minimal KAN-centered mixer for long-term time series forecasting

Authors: Lingyu Jiang, Dengzhe Hou, Yuping Wang, Yao Su, Shuo Xing, Wenjing Chen, Xin Zhang, Zhengzhong Tu, Ziming Zhang, Fangzhou Lin, Michael Zielewski, Kazunori D Yamada

Abstract: Long-term time series forecasting (LTSF) underpins critical applications from energy management to weather prediction, yet achieving reliable multi-step-ahead accuracy remains challenging. Existing LTSF approaches, dominated by MLP- and Transformer-based architectures, either rely on simple linear mappings or introduce increasingly complex hand-crafted inductive biases, raising the question of whether a more expressive and principled nonlinear core could offer a better alternative. Therefore, we investigate whether Kolmogorov-Arnold Networks (KANs), a recently proposed model featuring adaptive basis functions capable of granular modulation of nonlinearities, can improve LTSF performance, and under which design choices they are most effective. Specifically, we propose KANMixer, a minimal KAN-centered architecture consisting of a multi-scale pooling frontend, a KAN-based temporal mixing backbone, and prediction heads. By avoiding heavy auxiliary modules, KANMixer enables a clear assessment of KAN components in LTSF. Across 28 benchmark-horizon settings against nine baselines, KANMixer achieves the best MSE in 16 settings and the best MAE in 11. Furthermore, extensive ablations on three representative datasets show that KAN effectiveness depends strongly on the choice of edge function; B-spline bases outperform Fourier and Wavelet alternatives; the prediction head contributes most to the gains; moderate depth is preferred over deeper unstable stacks; and decomposition priors help MLP but harm KAN. Beyond practical guidance for integrating KAN into LTSF, these results reveal an underexplored dependency between structural priors and backbone nonlinearity: design choices that benefit MLP can degrade KAN.

replace Local Diffusion Models and Phases of Data Distributions

Authors: Fangjun Hu, Guangkuo Liu, Yifan F. Zhang, Xun Gao

Abstract: As a class of generative artificial intelligence frameworks inspired by statistical physics, diffusion models have shown extraordinary performance in synthesizing complicated data distributions through a denoising process gradually guided by score functions. Real-life data, like images, is often spatially structured in low-dimensional spaces. However, ordinary diffusion models ignore this local structure and learn spatially global score functions, which are often computationally expensive. In this work, motivated by recent advances in non-equilibrium statistical physics, we develop a generic framework for defining phases of data distributions and use it to analyze the locality requirements of denoisers in diffusion models. We define two distributions as belonging to the same data distribution phase if they can be mutually connected via spatially local operations such as local denoisers, along the same evolution path as the diffusion. We demonstrate that the reverse denoising process consists of an early trivial phase and a late data phase, sandwiching a rapid phase transition where local denoisers must fail. We further demonstrate that the performance of local denoisers is closely tied to spatial Markovianity, which provides an operational criterion for diagnosing such phase transitions. We validate this criterion through numerical experiments on real-world datasets. Our work suggests guidance for simpler and more efficient architectures of diffusion models: far from the phase transition point, we can use small local neural networks to compute the score function; global neural networks are only necessary around the narrow time interval of phase transitions. This result also opens up new directions for studying phases of data distributions, the broader science of generative artificial intelligence, and guiding the design of neural networks inspired by physics concepts.

replace From Nodes to Narratives: Explaining Graph Neural Networks with LLMs and Graph Context

Authors: Peyman Baghershahi, Gregoire Fournier, Pranav Nyati, Sourav Medya

Abstract: Graph Neural Networks (GNNs) have emerged as powerful tools for learning over structured data, including text-attributed graphs (TAGs), which are common in domains such as citation networks, social platforms, and knowledge graphs. GNNs are not inherently interpretable and thus, many explanation methods have been proposed. However, existing explanation methods often struggle to generate interpretable, fine-grained rationales, especially when node attributes include rich natural language. In this work, we introduce GSPELL, a lightweight, post-hoc framework that uses large language models (LLMs) to generate faithful and interpretable explanations for GNN predictions. GSPELL projects GNN node embeddings into the LLM embedding space and constructs hybrid prompts that interleave soft prompts with textual inputs from the graph structure. This enables the LLM to reason about GNN internal representations and to produce natural-language explanations, along with concise explanation subgraphs. Our experiments across real-world TAG datasets demonstrate that GSPELL achieves a favorable trade-off between fidelity and sparsity, while improving human-centric metrics such as insightfulness. GSPELL sets a new direction for LLM-based explainability in graph learning by aligning GNN internals with human reasoning.

replace WISCA: A Lightweight Model Transition Method to Improve LLM Training via Weight Scaling

Authors: Jiacheng Li, Jianchao Tan, Zhidong Yang, Pingwei Sun, Feiye Huo, Jiayu Qin, Xiangyu Zhang, Maoxin He, Yerui Sun, Yuchen Xie, Guangming Tan, Weile Jia, Xunliang Cai, Tong Zhao

Abstract: Transformer architecture gradually dominates the LLM field. Recent advances in training optimization for Transformer-based large language models (LLMs) primarily focus on architectural modifications or optimizer adjustments. However, these approaches lack systematic optimization of weight patterns during training. Weight pattern refers to the distribution and relative magnitudes of weight parameters in a neural network. To address this issue, we propose a Weight Scaling method called WISCA to enhance training efficiency and model quality by strategically improving neural network weight patterns without changing network structures. By rescaling weights while preserving model outputs, WISCA indirectly optimizes the model's training trajectory. Experiments demonstrate that WISCA significantly improves convergence quality (measured by generalization capability and loss reduction), particularly in LLMs with Grouped Query Attention (GQA) architectures and LoRA fine-tuning tasks. Empirical results show 5.6% average improvement on zero-shot validation tasks and 2.12% average reduction in training perplexity across multiple architectures.

replace Evaluating the Quality of the Quantified Uncertainty for (Re)Calibration of Data-Driven Regression Models

Authors: Jelke Wibbeke, Nico Sch\"onfisch, Sebastian Rohjans, Andreas Rauh

Abstract: In safety-critical applications data-driven models must not only be accurate but also provide reliable uncertainty estimates. This property, commonly referred to as calibration, is essential for risk-aware decision-making. In regression a wide variety of calibration metrics and recalibration methods have emerged. However, these metrics differ significantly in their definitions, assumptions and scales, making it difficult to interpret and compare results across studies. Moreover, most recalibration methods have been evaluated using only a small subset of metrics, leaving it unclear whether improvements generalize across different notions of calibration. In this work, we systematically extract and categorize regression calibration metrics from the literature and benchmark these metrics independently of specific modelling methods or recalibration approaches. Through controlled experiments with real-world, synthetic and artificially miscalibrated data, we demonstrate that calibration metrics frequently produce conflicting results. Our analysis reveals substantial inconsistencies: many metrics disagree in their evaluation of the same recalibration result, and some even indicate contradictory conclusions. This inconsistency is particularly concerning as it potentially allows cherry-picking of metrics to create misleading impressions of success. We identify the Expected Normalized Calibration Error (ENCE) and the Coverage Width-based Criterion (CWC) as the most dependable metrics in our tests. Our findings highlight the critical role of metric selection in calibration research.

replace EvolveSignal: A Large Language Model Powered Coding Agent for Discovering Traffic Signal Control Strategies

Authors: Leizhen Wang, Peibo Duan, Hao Wang, Yue Wang, Jian Xu, Nan Zheng, Zhenliang Ma

Abstract: In traffic engineering, fixed-time traffic signal control remains widely used for its low cost, stability, and interpretability. However, its design relies on hand-crafted formulas (e.g., Webster) and manual re-timing by engineers to adapt to demand changes, which is labor-intensive and often yields suboptimal results under heterogeneous or congested conditions. This paper introduces EvolveSignal, an LLM-powered coding agent for automatically discovering interpretable heuristic strategies for fixed-time traffic signal control. Rather than deriving entirely new analytical formulations, the proposed framework focuses on exploring code-level variations of existing control logic and identifying effective combinations of heuristic modifications. We formulate the problem as program synthesis, where candidate strategies are represented as Python functions with fixed input-output structures and iteratively optimized through external evaluations (e.g., a traffic simulator) and evolutionary search. Experiments on a signalized intersection demonstrate that the discovered strategies outperform a classical baseline (Webster's method), reducing average delay by 20.1\% and average stops by 47.1\%. Beyond performance, ablation and incremental analyses reveal that EvolveSignal can identify meaningful modifications, such as adjusting cycle length bounds, incorporating right-turn demand, and rescaling green allocations, that provide useful insights for traffic engineers. This work highlights the potential of LLM-driven program synthesis for supporting interpretable and automated heuristic design in traffic signal control.

replace How Will My Business Process Unfold? Predicting Case Suffixes With Start and End Timestamps

Authors: Muhammad Awais Ali, Marlon Dumas, Fredrik Milani

Abstract: Predictive process monitoring supports operational decision-making by forecasting future states of ongoing business cases. A key task is case suffix prediction, which estimates the remaining sequence of activities for a case. Most existing approaches only generate activities with a single timestamp (usually the completion time). However, this is insufficient for resource capacity planning, which requires distinguishing between waiting time and processing time to accurately schedule resources and manage workloads. This paper introduces a technique to predict case suffixes that include both start and end timestamps. By predicting distinct waiting and processing intervals, the method provides a more granular view of future resource demands.

replace Representational Alignment Across Model Layers and Brain Regions with Multi-Level Optimal Transport

Authors: Shaan Shah, Meenakshi Khosla

Abstract: Standard representational similarity methods align each layer of a network to its best match in another independently, producing asymmetric results, lacking a global alignment score, and struggling with networks of different depths. These limitations arise from ignoring global activation structure and restricting mappings to rigid one-to-one layer correspondences. We propose Multi-Level Optimal Transport (MOT), a unified framework that jointly infers soft, globally consistent layer-to-layer couplings and neuron-level transport plans. MOT allows source neurons to distribute mass across multiple target layers while minimizing total transport cost under marginal constraints. This yields both a single alignment score for the entire network comparison and a soft transport plan that naturally handles depth mismatches through mass distribution. We evaluate MOT on vision models, large language models, and human visual cortex recordings. Across all domains, MOT matches or surpasses standard pairwise matching in alignment quality. Moreover, it reveals smooth, fine-grained hierarchical correspondences: early layers map to early layers, deeper layers maintain relative positions, and depth mismatches are resolved by distributing representations across multiple layers. These structured patterns emerge naturally from global optimization without being imposed, yet are absent in greedy layer-wise methods. MOT thus enables richer, more interpretable comparisons between representations, particularly when networks differ in architecture or depth. We further extend our method to a three-level MOT framework, providing a proof-of-concept alignment of two networks across their training trajectories and demonstrating that MOT uncovers checkpoint-wise correspondences missed by greedy layer-wise matching.

replace Improving Large-Scale Recommender Systems with Auxiliary Learning

Authors: Mertcan Cokbas, Ziteng Liu, Zeyi Tao, Elder Veliz, Qin Huang, Ellie Wen, Huayu Li, Qiang Jin, Murat Duman, Benjamin Au, Guy Lebanon, Sagar Chordia, Chengkai Zhang

Abstract: Training large-scale recommendation models under a single global objective implicitly assumes homogeneity across user populations. However, real-world data are composites of heterogeneous cohorts with distinct conditional distributions. As models increase in scale and complexity and as more data is used for training, they become dominated by central distribution patterns, neglecting head and tail regions. This imbalance limits the model's learning ability and can result in inactive attention weights or dead neurons. In this paper, we reveal how the attention mechanism can play a key role in factorization machines for shared embedding selection, and propose to address this challenge by analyzing the substructures in the dataset and exposing those with strong distributional contrast through auxiliary learning. Unlike previous research, which heuristically applies weighted labels or multi-task heads to mitigate such biases, we leverage partially conflicting auxiliary labels to regularize the shared representation. This approach customizes the learning process of attention layers to preserve mutual information with minority cohorts while improving global performance. We evaluated proposed method on massive production datasets with billions of data points each for six SOTA models. Experiments show that the factorization machine is able to capture fine-grained user-ad interactions using the proposed method, achieving up to a 0.16% reduction in normalized entropy overall and delivering gains exceeding 0.30% on targeted minority cohorts.

replace Distributional Inverse Reinforcement Learning

Authors: Feiyang Wu, Ye Zhao, Anqi Wu

Abstract: We propose a distributional framework for offline Inverse Reinforcement Learning (IRL) that jointly models uncertainty over reward functions and full distributions of returns. Unlike conventional IRL approaches that recover a deterministic reward estimate or match only expected returns, our method captures richer structure in expert behavior, particularly in learning the reward distribution, by minimizing first-order stochastic dominance (FSD) violations and thus integrating distortion risk measures (DRMs) into policy learning, enabling the recovery of both reward distributions and distribution-aware policies. This formulation is well-suited for behavior analysis and risk-aware imitation learning. Theoretical analysis shows that the algorithm converges with $\mathcal{O}(\varepsilon^{-2})$ iteration complexity. Empirical results on synthetic benchmarks, real-world neurobehavioral data, and MuJoCo control tasks demonstrate that our method recovers expressive reward representations and achieves state-of-the-art performance.

replace SAMix: Calibrated and Accurate Continual Learning via Sphere-Adaptive Mixup and Neural Collapse

Authors: Trung-Anh Dang, Vincent Nguyen, Ngoc-Son Vu, Christel Vrain

Abstract: While most continual learning methods focus on mitigating forgetting and improving accuracy, they often overlook the critical aspect of network calibration, despite its importance. Neural collapse, a phenomenon where last-layer features collapse to their class means, has demonstrated advantages in continual learning by reducing feature-classifier misalignment. Few works aim to improve the calibration of continual models for more reliable predictions. Our work goes a step further by proposing a novel method that not only enhances calibration but also improves performance by reducing overconfidence, mitigating forgetting, and increasing accuracy. We introduce Sphere-Adaptive Mixup (SAMix), an adaptive mixup strategy tailored for neural collapse-based methods. SAMix adapts the mixing process to the geometric properties of feature spaces under neural collapse, ensuring more robust regularization and alignment. Experiments show that SAMix significantly boosts performance, surpassing SOTA methods in continual learning while also improving model calibration. SAMix enhances both across-task accuracy and the broader reliability of predictions, making it a promising advancement for robust continual learning systems.

replace From Competition to Synergy: Unlocking Reinforcement Learning for Subject-Driven Image Generation

Authors: Ziwei Huang, Ying Shu, Hao Fang, Quanyu Long, Wenya Wang, Qiushi Guo, Tiezheng Ge, Leilei Gan

Abstract: Subject-driven image generation models face a fundamental trade-off between identity preservation (fidelity) and prompt adherence (editability). While online reinforcement learning (RL), specifically GPRO, offers a promising solution, we find that a naive application of GRPO leads to competitive degradation, as the simple linear aggregation of rewards with static weights causes conflicting gradient signals and a misalignment with the temporal dynamics of the diffusion process. To overcome these limitations, we propose Customized-GRPO, a novel framework featuring two key innovations: (i) Synergy-Aware Reward Shaping (SARS), a non-linear mechanism that explicitly penalizes conflicted reward signals and amplifies synergistic ones, providing a sharper and more decisive gradient. (ii) Time-Aware Dynamic Weighting (TDW), which aligns the optimization pressure with the model's temporal dynamics by prioritizing prompt-following in the early, identity preservation in the later. Extensive experiments demonstrate that our method significantly outperforms naive GRPO baselines, successfully mitigating competitive degradation. Our model achieves a superior balance, generating images that both preserve key identity features and accurately adhere to complex textual prompts.

replace From Raw Features to Effective Embeddings: A Three-Stage Approach for Multimodal Recipe Recommendation

Authors: Jeeho Shin, Kyungho Kim, Kijung Shin

Abstract: Recipe recommendation has become an essential task in web-based food platforms. A central challenge is effectively leveraging rich multimodal features beyond user-recipe interactions. Our analysis shows that even simple uses of multimodal signals yield competitive performance, suggesting that systematic enhancement of these signals is highly promising. We propose TESMR, a 3-stage framework for recipe recommendation that progressively refines raw multimodal features into effective embeddings through: (1) content-based enhancement using foundation models with multimodal comprehension, (2) relation-based enhancement via message propagation over user-recipe interactions, and (3) learning-based enhancement through contrastive learning with learnable embeddings. Experiments on two real-world datasets show that TESMR outperforms existing methods, achieving 7-15% higher Recall@10.

replace Understanding the Staged Dynamics of Transformers in Learning Latent Structure

Authors: Rohan Saha, Farzane Aminmansour, Alona Fyshe

Abstract: Language modeling has shown us that transformers can discover latent structure from context, but the dynamics of how they acquire different components of that structure remain poorly understood, leading to assertions that models just remix training data. In this work, we use the Alchemy benchmark in a controlled setting (Wang et al.,2021) to investigate latent structure learning. We train a small decoder-only transformer on three task variants: 1) inferring missing transitions from partial contextual information, 2) composing simple rules to solve multi-transition sequences, and 3) decomposing complex multi-step examples to infer intermediate transitions. By factorizing each task into interpretable components, we show that the model learns the different latent structure components in discrete stages. We also observe an asymmetry: the model composes fundamental transitions robustly, but struggles to decompose complex examples to discover the atomic transitions. Finally, using causal interventions, we identify layer-specific plasticity windows during which freezing substantially delays or prevents stage completion. These findings provide insight into how a transformer model acquires latent structure, offering a detailed view of how capabilities evolve during training.

replace Hybrid-AIRL: Enhancing Inverse Reinforcement Learning with Supervised Expert Guidance

Authors: Bram Silue, Santiago Amaya-Corredor, Patrick Mannion, Lander Willem, Pieter Libin

Abstract: Adversarial Inverse Reinforcement Learning (AIRL) has shown promise in addressing the sparse reward problem in reinforcement learning (RL) by inferring dense reward functions from expert demonstrations. However, its performance in highly complex, imperfect-information settings remains largely unexplored. To explore this gap, we evaluate AIRL in the context of Heads-Up Limit Hold'em (HULHE) poker, a domain characterized by sparse, delayed rewards and significant uncertainty. In this setting, we find that AIRL struggles to infer a sufficiently informative reward function. To overcome this limitation, we contribute Hybrid-AIRL (H-AIRL), an extension that enhances reward inference and policy learning by incorporating a supervised loss derived from expert data and a stochastic regularization mechanism. We evaluate H-AIRL on a carefully selected set of Gymnasium benchmarks and the HULHE poker setting. Additionally, we analyze the learned reward function through visualization to gain deeper insights into the learning process. Our experimental results show that H-AIRL achieves higher sample efficiency and more stable learning compared to AIRL. This highlights the benefits of incorporating supervised signals into inverse RL and establishes H-AIRL as a promising framework for tackling challenging, real-world settings.

replace A Unified Theory of Sparse Dictionary Learning in Mechanistic Interpretability: Piecewise Biconvexity and Spurious Minima

Authors: Yiming Tang, Harshvardhan Saini, Zhaoqian Yao, Zheng Lin, Yizhen Liao, Qianxiao Li, Mengnan Du, Dianbo Liu

Abstract: As AI models achieve remarkable capabilities across diverse domains, understanding what representations they learn and how they encode concepts has become increasingly important for both scientific progress and trustworthy deployment. Recent works in mechanistic interpretability have widely reported that neural networks represent meaningful concepts as linear directions in their representation spaces and often encode diverse concepts in superposition. Various sparse dictionary learning (SDL) methods, including sparse autoencoders, transcoders, and crosscoders, are utilized to address this by training auxiliary models with sparsity constraints to disentangle these superposed concepts into monosemantic features. These methods are the backbone of modern mechanistic interpretability, yet in practice they consistently produce polysemantic features, feature absorption, and dead neurons, with very limited theoretical understanding of why these phenomena occur. Existing theoretical work is limited to tied-weight sparse autoencoders, leaving the broader family of SDL methods without formal grounding. We develop the first unified theoretical framework that casts all major SDL variants as a single piecewise biconvex optimization problem, and characterize its global solution set, non-identifiability, and spurious optima. This analysis yields principled explanations for feature absorption and dead neurons. To expose these pathologies under full ground-truth access, we introduce the Linear Representation Bench. Guided by our theory, we propose feature anchoring, a novel technique that restores SDL identifiability, substantially improving feature recovery across synthetic benchmarks and real neural representations.

replace Eventually LIL Regret: Almost Sure $\ln\ln T$ Regret for a sub-Gaussian Mixture on Unbounded Data

Authors: Shubhada Agrawal, Aaditya Ramdas

Abstract: We prove that a classic sub-Gaussian mixture proposed by Robbins in a stochastic setting actually satisfies a path-wise (deterministic) regret bound. For every path in a natural ``Ville event'' $\mathcal E_\alpha$, this regret till time $T$ is bounded by $\ln^2(1/\alpha)/V_T + \ln (1/\alpha) + \ln \ln V_T$ up to universal constants, where $V_T$ is a nonnegative, nondecreasing, cumulative variance process. (The bound reduces to $\ln(1/\alpha) + \ln \ln V_T$ if $V_T \geq \ln(1/\alpha)$.) If the data were stochastic, then one can show that $\mathcal E_\alpha$ has probability at least $1-\alpha$ under a wide class of distributions (eg: sub-Gaussian, symmetric, variance-bounded, etc.). In fact, we show that on the Ville event $\mathcal E_0$ of probability one, the regret on every path in $\mathcal E_0$ is eventually bounded by $\ln \ln V_T$ (up to constants). We explain how this work helps bridge the world of adversarial online learning (which usually deals with regret bounds for bounded data), with game-theoretic statistics (which can handle unbounded data, albeit using stochastic assumptions). In short, conditional regret bounds serve as a bridge between stochastic and adversarial betting.

replace Bridging Mechanistic Interpretability and Prompt Engineering with Gradient Ascent for Interpretable Persona Control

Authors: Harshvardhan Saini, Yiming Tang, Dianbo Liu

Abstract: Controlling emergent behavioral personas (e.g., sycophancy, hallucination) in Large Language Models (LLMs) is critical for AI safety, yet remains a persistent challenge. Existing solutions face a dilemma: manual prompt engineering is intuitive but unscalable and imprecise, while automatic optimization methods are effective but operate as "black boxes" with no interpretable connection to model internals. We propose a novel framework that adapts gradient ascent to LLMs, enabling targeted prompt discovery. In specific, we propose two methods, RESGA and SAEGA, that both optimize randomly initialized prompts to achieve better aligned representation with an identified persona direction. We introduce fluent gradient ascent to control the fluency of discovered persona steering prompts. We demonstrate RESGA and SAEGA's effectiveness across Llama 3.1, Qwen 2.5, and Gemma 3 for steering three different personas, sycophancy, hallucination, and myopic reward. Crucially, on sycophancy, our automatically discovered prompts achieve significant improvement (49.90% compared with 79.24%). By grounding prompt discovery in mechanistically meaningful features, our method offers a new paradigm for controllable and interpretable behavior modification.

replace CEDAR: Context Engineering for Agentic Data Science

Authors: Rishiraj Saha Roy, Chris Hinze, Luzian Hahn, Fabian Kuech

Abstract: We demonstrate CEDAR, an application for automating data science (DS) tasks with an agentic setup. Solving DS problems with LLMs is an underexplored area that has immense market value. The challenges are manifold: task complexities, data sizes, computational limitations, and context restrictions. We show that these can be alleviated via effective context engineering. We first impose structure into the initial prompt with DS-specific input fields, that serve as instructions for the agentic system. The solution is then materialized as an enumerated sequence of interleaved plan and code blocks generated by separate LLM agents, providing a readable structure to the context at any step of the workflow. Function calls for generating these intermediate texts, and for corresponding Python code, ensure that data stays local, and only aggregate statistics and associated instructions are injected into LLM prompts. Fault tolerance and context management are introduced via iterative code generation and smart history rendering. The viability of our agentic data scientist is demonstrated using canonical Kaggle challenges.

replace MetaboNet: The Largest Publicly Available Consolidated Dataset for Type 1 Diabetes Management

Authors: Miriam K. Wolff, Peter Calhoun, Eleonora Maria Aiello, Yao Qin, Sam F. Royston

Abstract: Progress in Type 1 Diabetes (T1D) algorithm development is limited by the fragmentation and lack of standardization across existing T1D management datasets. Current datasets differ substantially in structure and are time-consuming to access and process, which impedes data integration and reduces the comparability and generalizability of algorithmic developments. This work aims to establish a unified and accessible data resource for T1D algorithm development. Multiple publicly available T1D datasets were consolidated into a unified resource, termed the MetaboNet dataset. Inclusion required the availability of both continuous glucose monitoring (CGM) data and corresponding insulin pump dosing records. Additionally, auxiliary information such as reported carbohydrate intake and physical activity was retained when present. The MetaboNet dataset comprises 3135 subjects and 1228 patient-years of overlapping CGM and insulin data, making it substantially larger than existing standalone benchmark datasets. The resource is distributed as a fully public subset available for immediate download at https://metabo-net.org/ , and with a Data Use Agreement (DUA)-restricted subset accessible through their respective application processes. For the datasets in the latter subset, processing pipelines are provided to automatically convert the data into the standardized MetaboNet format. A consolidated public dataset for T1D research is presented, and the access pathways for both its unrestricted and DUA-governed components are described. The resulting dataset covers a broad range of glycemic profiles and demographics and thus can yield more generalizable algorithmic performance than individual datasets.

URLs: https://metabo-net.org/

replace Agnostic Language Identification and Generation

Authors: Mikael M{\o}ller H{\o}gsgaard, Chirag Pabbaraju

Abstract: Recent works on language identification and generation have established tight statistical rates at which these tasks can be achieved. These works typically operate under a strong realizability assumption: that the input data is drawn from an unknown distribution necessarily supported on some language in a given collection. In this work, we relax this assumption of realizability entirely, and impose no restrictions on the distribution of the input data. We propose objectives to study both language identification and generation in this more general "agnostic" setup. Across both problems, we obtain novel interesting characterizations and nearly tight rates.

replace Analyzing Shapley Additive Explanations to Understand Anomaly Detection Algorithm Behaviors and Their Complementarity

Authors: Jordan Levy, Paul Saves, Moncef Garouani, Nicolas Verstaevel, Benoit Gaudou

Abstract: Unsupervised anomaly detection is a challenging problem due to the diversity of data distributions and the lack of labels. Ensemble methods are often adopted to mitigate these challenges by combining multiple detectors, which can reduce individual biases and increase robustness. Yet building an ensemble that is genuinely complementary remains challenging, since many detectors rely on similar decision cues and end up producing redundant anomaly scores. As a result, the potential of ensemble learning is often limited by the difficulty of identifying models that truly capture different types of irregularities. To address this, we propose a methodology for characterizing anomaly detectors through their decision mechanisms. Using SHapley Additive exPlanations, we quantify how each model attributes importance to input features, and we use these attribution profiles to measure similarity between detectors. We show that detectors with similar explanations tend to produce correlated anomaly scores and identify largely overlapping anomalies. Conversely, explanation divergence reliably indicates complementary detection behavior. Our results demonstrate that explanation-driven metrics offer a different criterion than raw outputs for selecting models in an ensemble. However, we also demonstrate that diversity alone is insufficient; high individual model performance remains a prerequisite for effective ensembles. By explicitly targeting explanation diversity while maintaining model quality, we are able to construct ensembles that are more diverse, more complementary, and ultimately more effective for unsupervised anomaly detection.

replace SwiftRepertoire: Few-Shot Immune-Signature Synthesis via Dynamic Kernel Codes

Authors: Rong Fu, Muge Qi, Yang Li, Yabin Jin, Jiekai Wu, Jiaxuan Lu, Chunlei Meng, Youjin Wang, Zeli Su, Juntao Gao, Li Bao, Qi Zhao, Wei Luo, Simon Fong

Abstract: Repertoire-level analysis of T cell receptors offers a biologically grounded signal for disease detection and immune monitoring, yet practical deployment is impeded by label sparsity, cohort heterogeneity, and the computational burden of adapting large encoders to new tasks. We introduce a framework that synthesizes compact task-specific parameterizations from a learned dictionary of prototypes conditioned on lightweight task descriptors derived from repertoire probes and pooled embedding statistics. This synthesis produces small adapter modules applied to a frozen pretrained backbone, enabling immediate adaptation to novel tasks with only a handful of support examples and without full model fine-tuning. The architecture preserves interpretability through motif-aware probes and a calibrated motif discovery pipeline that links predictive decisions to sequence-level signals. Together, these components yield a practical, sample-efficient, and interpretable pathway for translating repertoire-informed models into diverse clinical and research settings where labeled data are scarce and computational resources are constrained.

replace Rashomon Sets and Model Multiplicity in Federated Learning

Authors: Xenia Heilmann, Luca Corbucci, Mattia Cerrato

Abstract: The Rashomon set captures the collection of models that achieve near-identical empirical performance yet may differ substantially in their decision boundaries. Understanding the differences among these models, i.e., their multiplicity, is recognized as a crucial step toward model transparency, fairness, and robustness, as it reveals decision boundaries instabilities that standard metrics obscure. However, the existing definitions of Rashomon set and multiplicity metrics assume centralized learning and do not extend naturally to decentralized, multi-party settings like Federated Learning (FL). In FL, multiple clients collaboratively train models under a central server's coordination without sharing raw data, which preserves privacy but introduces challenges from heterogeneous client data distribution and communication constraints. In this setting, the choice of a single best model may homogenize predictive behavior across diverse clients, amplify biases, or undermine fairness guarantees. In this work, we provide the first formalization of Rashomon sets in FL.First, we adapt the Rashomon set definition to FL, distinguishing among three perspectives: (I) a global Rashomon set defined over aggregated statistics across all clients, (II) a t-agreement Rashomon set representing the intersection of local Rashomon sets across a fraction t of clients, and (III) individual Rashomon sets specific to each client's local distribution.Second, we show how standard multiplicity metrics can be estimated under FL's privacy constraints. Finally, we introduce a multiplicity-aware FL pipeline and conduct an empirical study on standard FL benchmark datasets. Our results demonstrate that all three proposed federated Rashomon set definitions offer valuable insights, enabling clients to deploy models that better align with their local data, fairness considerations, and practical requirements.

replace Explainability in Generative Medical Diffusion Models: A Faithfulness-Based Analysis on MRI Synthesis

Authors: Surjo Dey, Pallabi Saikia

Abstract: This study investigates the explainability of generative diffusion models in the context of medical imaging, focusing on Magnetic resonance imaging (MRI) synthesis. Although diffusion models have shown strong performance in generating realistic medical images, their internal decision making process remains largely opaque. We present a faithfulness-based explainability framework that analyzes how prototype-based explainability methods like ProtoPNet (PPNet), Enhanced ProtoPNet (EPPNet), and ProtoPool can link the relationship between generated and training features. Our study focuses on understanding the reasoning behind image formation through denoising trajectory of diffusion model and subsequently prototype explainability with faithfulness analysis. Experimental analysis shows that EPPNet achieves the highest faithfulness (with score 0.1534), offering more reliable insights, and explainability into the generative process. The results highlight that diffusion models can be made more transparent and trustworthy through faithfulness-based explanations, contributing to safer and more interpretable applications of generative AI in healthcare.

replace Towards Explainable Federated Learning: Understanding the Impact of Differential Privacy

Authors: J\'ulio Oliveira, Rodrigo Ferreira, Andr\'e Riker, Glaucio H. S. Carvalho, Eirini Eleni Tsilopoulou

Abstract: Data privacy and eXplainable Artificial Intelligence (XAI) are two important aspects for modern Machine Learning systems. To enhance data privacy, recent machine learning models have been designed as a Federated Learning (FL) system. On top of that, additional privacy layers can be added, via Differential Privacy (DP). On the other hand, to improve explainability, ML must consider more interpretable approaches with reduced number of features and less complex internal architecture. In this context, this paper aims to achieve a machine learning (ML) model that combines enhanced data privacy with explainability. So, we propose a FL solution, called Federated EXplainable Trees with Differential Privacy (FEXT-DP), that: (i) is based on Decision Trees, since they are lightweight and have superior explainability than neural networks-based FL systems; (ii) provides additional layer of data privacy protection applying Differential Privacy (DP) to the Tree-Based model. However, there is a side effect adding DP: it harms the explainability of the system. So, this paper also presents the impact of DP protection on the explainability of the ML model, analyzing the obtained results for SHAP (SHapley Additive exPlanations) and Mean Decrease in Impurity (MDI).

replace Training-free retrieval-augmented generation with reinforced reasoning for flood damage nowcasting

Authors: Lipai Huang, Kai Yin, Chia-Fu Liu, Ali Mostafavi

Abstract: We propose R2RAG-Flood, a training-free retrieval-augmented generation framework for flood damage nowcasting with reinforced reasoning. The framework builds a reasoning-centric knowledge base from labeled tabular records, where each sample includes structured predictors, a compact text-mode summary, and a model-generated reasoning trajectory. During inference, the target prompt is augmented with geographically local neighbors and selected free-shots to support case-based reasoning without task-specific fine-tuning. A two-stage procedure first determines damage occurrence and then refines severity within a three-level Property Damage Extent (PDE) classification, followed by a conservative downgrade check for weakly supported over-severe outputs. In a Hurricane Harvey case study in Harris County, Texas, the supervised tabular baseline achieves 0.714 overall accuracy and 0.859 accuracy on the damaged classes (medium and high PDE). Across seven LLM backbones, R2RAG-Flood achieves 0.613--0.668 overall accuracy and 0.757--0.896 accuracy on the damaged classes while providing a structured rationale for each prediction. Under the severity-per-cost metric used in this study, lighter R2RAG-Flood variants are more cost-efficient than the supervised baseline and larger LLM backbones. These results demonstrate the feasibility of a reasoning-centric, training-free pipeline for flood damage nowcasting in a realistic case-study setting.

replace Colorful Talks with Graphs: Human-Interpretable Graph Encodings for Large Language Models

Authors: Angelo Zangari, Peyman Baghershahi, Sourav Medya

Abstract: Graph problems are fundamentally challenging for large language models (LLMs). While LLMs excel at processing unstructured text, graph tasks require reasoning over explicit structure, permutation invariance, and computationally complex relationships, creating a mismatch with the representations of text-based models. Our work investigates how LLMs can be effectively applied to graph problems despite these barriers. We introduce a human-interpretable structural encoding strategy for graph-to-text translation that injects graph structure directly into natural language prompts. Our method involves computing a variant of Weisfeiler-Lehman (WL) similarity classes and maps them to human-like color tokens rather than numeric labels. The key insight is that semantically meaningful and human-interpretable cues may be more effectively processed by LLMs than opaque symbolic encoding. Experimental results on multiple algorithmic and predictive graph tasks show the considerable improvements by our method on both synthetic and real-world datasets. By capturing both local and global-range dependencies, our method enhances LLM performance especially on graph tasks that require reasoning over global graph structure.

replace An Adaptive Horizon-Aware Model Selection Framework for Demand Forecasting under Horizon-Induced Degradation

Authors: Adolfo Gonz\'alez, V\'ictor Parada

Abstract: Business environments characterized by intermittent demand, high variability, and multi-step planning require model selection procedures aligned with future operational horizons rather than static test-horizon evaluation. Because no forecasting model is universally dominant, and rankings vary across metrics, demand structures, and forecast horizons, assigning an appropriate model to each series remains a difficult problem in inventory planning, procurement, and supply management. This study addresses that problem by introducing the Metric Degradation by Forecast Horizon (MDFH) procedure as its main methodological contribution. MDFH projects out-of-sample error metrics from the test horizon to a future operational horizon under structural stability conditions, converting conventional static evaluation into a horizon-aware scheme for multi-step decision contexts. From this basis, the study derives RMSSEh as the most parsimonious operational realization of MDFH and proposes the Adaptive Hybrid Selector for Intermittency and Variability (AHSIV) as an adaptive extension for cases where monometric horizon-aware selection is insufficient due to intermittency, variability, metric conflict, and forecast bias. Empirical evaluation on the Walmart, M3, M4, and M5 datasets, using multiple train-test partitions and 12-step forecasting horizons, compares RMSSEh, AHSIV, and ERA as selector mechanisms. Results show that MDFH provides a coherent basis for horizon-aware selector design, that RMSSEh and AHSIV remain competitive across heterogeneous demand environments, and that AHSIV adds robustness in structurally complex settings. Overall, forecasting model selection in multi-SKU environments should be treated as a horizon-aware, structure-sensitive assignment problem aligned with operational planning requirements.

replace SphUnc: Hyperspherical Uncertainty Decomposition and Causal Identification via Information Geometry

Authors: Rong Fu, Chunlei Meng, Jinshuo Liu, Dianyu Zhao, Yongtai Liu, Yibo Meng, Xiaowen Ma, Wangyu Wu, Yangchen Zeng, Shuaishuai Cao, Simon Fong

Abstract: Reliable decision-making in complex multi-agent systems requires calibrated predictions and interpretable uncertainty. We introduce SphUnc, a unified framework combining hyperspherical representation learning with structural causal modeling. The model maps features to unit hypersphere latents using von Mises-Fisher distributions, decomposing uncertainty into epistemic and aleatoric components through information-geometric fusion. A structural causal model on spherical latents enables directed influence identification and interventional reasoning via sample-based simulation. Empirical evaluations on social and affective benchmarks demonstrate improved accuracy, better calibration, and interpretable causal signals, establishing a geometric-causal foundation for uncertainty-aware reasoning in multi-agent settings with higher-order interactions.

replace Overcoming the Modality Gap in Context-Aided Forecasting

Authors: Vincent Zhihao Zheng, \'Etienne Marcotte, Arjun Ashok, Andrew Robert Williams, Lijun Sun, Alexandre Drouin, Valentina Zantedeschi

Abstract: Context-aided forecasting (CAF) holds promise for integrating domain knowledge and forward-looking information, enabling AI systems to surpass traditional statistical methods. However, recent empirical studies reveal a puzzling gap: multimodal models often fail to outperform their unimodal counterparts. We hypothesize that this underperformance stems from poor context quality in existing datasets, as verification is challenging. To address these limitations, we introduce a semi-synthetic data augmentation method that generates contexts both descriptive of temporal dynamics and verifiably complementary to numerical histories. This approach enables massive-scale dataset creation, resulting in CAF-7M, a corpus of 7 million context-augmented time series windows, including a rigorously verified test set. We demonstrate that semi-synthetic pre-training transfers effectively to real-world evaluation, and show clear evidence of context utilization. Our results suggest that dataset quality, rather than architectural limitations, has been the primary bottleneck in context-aided forecasting.

replace Evaluating Black-Box Vulnerabilities with Wasserstein-Constrained Data Perturbations

Authors: Adriana Laurindo Monteiro, Jean-Michel Loubes

Abstract: The growing use of Machine Learning (ML) tools comes with critical challenges, such as limited model explainability. We propose a global explainability framework that leverages Optimal Transport and Distributionally Robust Optimization to analyze how ML algorithms respond to constrained data perturbations. Our approach enforces constraints on feature-level statistics (e.g., brightness, age distribution), generating realistic perturbations that preserve semantic structure. We provide a model-agnostic diagnostic bench that applies to both tabular and image domains with solid theoretical guarantees. We validate the approach on real-world datasets providing interpretable robustness diagnostics that complement standard evaluation and fairness auditing tools.

replace Auto-Unrolled Proximal Gradient Descent: An AutoML Approach to Interpretable Waveform Optimization

Authors: Ahmet Kaplan

Abstract: This study explores the combination of automated machine learning (AutoML) with model-based deep unfolding (DU) for optimizing wireless beamforming and waveforms. We convert the iterative proximal gradient descent (PGD) algorithm into a deep neural network, wherein the parameters of each layer are learned instead of being predetermined. Additionally, we enhance the architecture by incorporating a hybrid layer that performs a learnable linear gradient transformation prior to the proximal projection. By utilizing AutoGluon with a tree-structured parzen estimator (TPE) for hyperparameter optimization (HPO) across an expanded search space, which includes network depth, step-size initialization, optimizer, learning rate scheduler, layer type, and post-gradient activation, the proposed auto-unrolled PGD (Auto-PGD) achieves 98.8% of the spectral efficiency of a traditional 200-iteration PGD solver using only five unrolled layers, while requiring only 100 training samples. We also address a gradient normalization issue to ensure consistent performance during training and evaluation, and we illustrate per-layer sum-rate logging as a tool for transparency. These contributions highlight a notable reduction in the amount of training data and inference cost required, while maintaining high interpretability compared to conventional black-box architectures.

replace PLR: Plackett-Luce for Reordering In-Context Learning Examples

Authors: Pawel Batorski, Paul Swoboda

Abstract: In-context learning (ICL) adapts large language models by conditioning on a small set of ICL examples, avoiding costly parameter updates. Among other factors, performance is often highly sensitive to the ordering of the examples. However, exhaustive search over the $n!$ possible orderings is infeasible. Therefore more efficient ordering methods use model confidence measures (e.g., label-probability entropy) over label sets or take a direct approach to finding the best ordering. We propose PLR, a probabilistic approach to in-context example ordering that replaces discrete ordering search with learning a probability distribution over orderings with the Plackett-Luce model. PLR models orderings using a Plackett-Luce distribution and iteratively updates its parameters to concentrate probability mass on high-performing orderings under a task-level metric. Candidate orderings are sampled efficiently via a Gumbel perturb-and-sort procedure. Experiments on multiple classification benchmarks show that PLR consistently improves few-shot accuracy for $k \in \{4, 8, 16, 32\}$ examples, and we further demonstrate gains on mathematical reasoning tasks where label-based ordering methods are not applicable. Our code is available at https://github.com/Batorskq/PLR.

URLs: https://github.com/Batorskq/PLR.

replace A Heterogeneous Long-Micro Scale Cascading Architecture for General Aviation Health Management

Authors: Xinhang Chen, Zhihuan Wei, Yang Hu, Zhiguo Zeng, Kang Zeng, Wei Wang

Abstract: BACKGROUND: General aviation fleet expansion demands intelligent health monitoring under computational constraints. Real-world aircraft health diagnosis requires balancing accuracy with computational constraints under extreme class imbalance and environmental uncertainty. Existing end-to-end approaches suffer from the receptive field paradox: global attention introduces excessive operational heterogeneity noise for fine-grained fault classification, while localized constraints sacrifice critical cross-temporal context essential for anomaly detection. METHODS: This paper presents an AI-driven heterogeneous cascading architecture for general aviation health management. The proposed Long-Micro Scale Diagnostician (LMSD) explicitly decouples global anomaly detection (full-sequence attention) from micro-scale fault classification (restricted receptive fields), resolving the receptive field paradox while minimizing training overhead. A knowledge distillation-based interpretability module provides physically traceable explanations for safety-critical validation. RESULTS: Experiments on the public National General Aviation Flight Information Database (NGAFID) dataset (28,935 flights, 36 categories) demonstrate 4--8% improvement in safety-critical metrics (MCWPM) with 4.2 times training acceleration and 46% model compression compared to end-to-end baselines. CONCLUSIONS: The AI-driven heterogeneous architecture offers deployable solutions for aviation equipment health management, with potential for digital twin integration in future work. The proposed framework substantiates deployability in resource-constrained aviation environments while maintaining stringent safety requirements.

replace Assessing the Robustness of Climate Foundation Models under No-Analog Distribution Shifts

Authors: Maria Conchita Agana Navarro, Geng Li, Theo Wolf, Maria Perez-Ortiz

Abstract: The accelerating pace of climate change introduces profound non-stationarities that challenge the ability of Machine Learning based climate emulators to generalize beyond their training distributions. While these emulators offer computationally efficient alternatives to traditional Earth System Models, their reliability remains a potential bottleneck under "no-analog" future climate states, which we define here as regimes where external forcing drives the system into conditions outside the empirical range of the historical training data. A fundamental challenge in evaluating this reliability is data contamination; because many models are trained on simulations that already encompass future scenarios, true out-of-distribution (OOD) performance is often masked. To address this, we benchmark the OOD robustness of three state-of-the-art architectures: U-Net, ConvLSTM, and the ClimaX foundation model specifically restricted to a historical-only training regime (1850-2014). We evaluate these models using two complementary strategies: (i) temporal extrapolation to the recent climate (2015-2023) and (ii) cross-scenario forcing shifts across divergent emission pathways. Our analysis within this experimental setup reveals an accuracy vs. stability trade-off: while the ClimaX foundation model achieves the lowest absolute error, it exhibits higher relative performance changes under distribution shifts, with precipitation errors increasing by up to 8.44% under extreme forcing scenarios. These findings suggest that when restricted to historical training dynamics, even high-capacity foundation models are sensitive to external forcing trajectories. Our results underscore the necessity of scenario-aware training and rigorous OOD evaluation protocols to ensure the robustness of climate emulators under a changing climate.

replace VAN-AD: Visual Masked Autoencoder with Normalizing Flow For Time Series Anomaly Detection

Authors: PengYu Chen, Shang Wan, Xiaohou Shi, Yuan Chang, Yan Sun, Sajal K. Das

Abstract: Time series anomaly detection (TSAD) is essential for maintaining the reliability and security of IoT-enabled service systems. Existing methods require training one specific model for each dataset, which exhibits limited generalization capability across different target datasets, hindering anomaly detection performance in various scenarios with scarce training data. To address this limitation, foundation models have emerged as a promising direction. However, existing approaches either repurpose large language models (LLMs) or construct largescale time series datasets to develop general anomaly detection foundation models, and still face challenges caused by severe cross-modal gaps or in-domain heterogeneity. In this paper, we investigate the applicability of large-scale vision models to TSAD. Specifically, we adapt a visual Masked Autoencoder (MAE) pretrained on ImageNet to the TSAD task. However, directly transferring MAE to TSAD introduces two key challenges: overgeneralization and limited local perception. To address these challenges, we propose VAN-AD, a novel MAE-based framework for TSAD. To alleviate the over-generalization issue, we design an Adaptive Distribution Mapping Module (ADMM), which maps the reconstruction results before and after MAE into a unified statistical space to amplify discrepancies caused by abnormal patterns. To overcome the limitation of local perception, we further develop a Normalizing Flow Module (NFM), which combines MAE with normalizing flow to estimate the probability density of the current window under the global distribution. Extensive experiments on nine real-world datasets demonstrate that VAN-AD consistently outperforms existing state-of-the-art methods across multiple evaluation metrics.We make our code and datasets available at https://github.com/PenyChen/VAN-AD.

URLs: https://github.com/PenyChen/VAN-AD.

replace Semantic Interaction Information mediates compositional generalization in latent space

Authors: John Schwarcz

Abstract: Are there still barriers to generalization once all relevant variables are known? We address this question via a framework that casts compositional generalization as a variational inference problem over latent variables with parametric interactions. To explore this, we develop the Cognitive Gridworld, a stationary Partially Observable Markov Decision Process (POMDP) where observations are generated jointly by multiple latent variables, yet feedback is provided for only a single goal variable. This setting allows us to define Semantic Interaction Information (SII): a metric measuring the contribution of latent variable interactions to task performance. Using SII, we analyze Recurrent Neural Networks (RNNs) provided with these interactions, finding that SII explains the accuracy gap between Echo State and Fully Trained networks. Our analysis also uncovers a theoretically predicted failure mode where confidence decouples from accuracy, suggesting that utilizing interactions between relevant variables is a non-trivial capability. We then address a harder regime where the interactions must be learned by an embedding model. Learning how latent variables interact requires accurate inference, yet accurate inference depends on knowing those interactions. The Cognitive Gridworld reveals this circular dependence as a core challenge for continual meta-learning. We approach this dilemma via Representation Classification Chains (RCCs), a JEPA-style architecture that disentangles these processes: variable inference and variable embeddings are learned by separate modules through Reinforcement Learning and self-supervised learning, respectively. Lastly, we demonstrate that RCCs facilitate compositional generalization to novel combinations of relevant variables. Together, these results establish a grounded setting for evaluating goal-directed generalist agents.

replace Towards Initialization-dependent and Non-vacuous Generalization Bounds for Overparameterized Shallow Neural Networks

Authors: Yunwen Lei, Yufeng Xie

Abstract: Overparameterized neural networks often show a benign overfitting property in the sense of achieving excellent generalization behavior despite the number of parameters exceeding the number of training examples. A promising direction to explain benign overfitting is to relate generalization to the norm of distance from initialization, motivated by the empirical observations that this distance is often significantly smaller than the norm itself. However, the existing initialization-dependent complexity analyses measure the distance from initialization by the Frobenius norm, and often imply vacuous bounds in practice for overparamterized models. In this paper, we develop initialization-dependent complexity bounds for shallow neural networks with general Lipschitz activation functions. Our bounds depend on the path-norm of the distance from initialization, which are derived by introducing a new peeling technique to handle the challenge along with the initialization-dependent constraint. We also develop a lower bound tight up to a constant factor. Finally, we conduct empirical comparisons and show that our generalization analysis implies non-vacuous bounds for overparameterized networks.

replace Thinking While Listening: Fast-Slow Recurrence for Long-Horizon Sequential Modeling

Authors: Shota Takashiro, Masanori Koyama, Takeru Miyato, Yusuke Iwasawa, Yutaka Matsuo, Kohei Hayashi

Abstract: We extend the recent latent recurrent modeling to sequential input streams. By interleaving fast, recurrent latent updates with self-organizational ability between slow observation updates, our method facilitates the learning of stable internal structures that evolve alongside the input. This mechanism allows the model to maintain coherent and clustered representations over long horizons, improving out-of-distribution generalization in reinforcement learning and algorithmic tasks compared to sequential baselines such as LSTM, state space models, and Transformer variants.

replace QuanBench+: A Unified Multi-Framework Benchmark for LLM-Based Quantum Code Generation

Authors: Ali Slim, Haydar Hamieh, Jawad Kotaich, Yehya Ghosn, Mahdi Chehimi, Ammar Mohanna, Hasan Abed Al Kader Hammoud, Bernard Ghanem

Abstract: Large Language Models (LLMs) are increasingly used for code generation, yet quantum code generation is still evaluated mostly within single frameworks, making it difficult to separate quantum reasoning from framework familiarity. We introduce QuanBench+, a unified benchmark spanning Qiskit, PennyLane, and Cirq, with 42 aligned tasks covering quantum algorithms, gate decomposition, and state preparation. We evaluate models with executable functional tests, report Pass@1 and Pass@5, and use KL-divergence-based acceptance for probabilistic outputs. We additionally study Pass@1 after feedback-based repair, where a model may revise code after a runtime error or wrong answer. Across frameworks, the strongest one-shot scores reach 59.5% in Qiskit, 54.8% in Cirq, and 42.9% in PennyLane; with feedback-based repair, the best scores rise to 83.3%, 76.2%, and 66.7%, respectively. These results show clear progress, but also that reliable multi-framework quantum code generation remains unsolved and still depends strongly on framework-specific knowledge.

replace Hardware-Efficient Neuro-Symbolic Networks with the Exp-Minus-Log Operator

Authors: Eymen Ipek

Abstract: Deep neural networks (DNNs) deliver state-of-the-art accuracy on regression and classification tasks, yet two structural deficits persistently obstruct their deployment in safety-critical, resource-constrained settings: (i) opacity of the learned function, which precludes formal verification, and (ii) reliance on heterogeneous, library-bound activation functions that inflate latency and silicon area on edge hardware. The recently introduced Exp-Minus-Log (EML) Sheffer operator, eml(x, y) = exp(x) - ln(y), was shown by Odrzywolek (2026) to be sufficient - together with the constant 1 - to express every standard elementary function as a binary tree of identical nodes. We propose to embed EML primitives inside conventional DNN architectures, yielding a hybrid DNN-EML model in which the trunk learns distributed representations and the head is a depth-bounded, weight-sparse EML tree whose snapped weights collapse to closed-form symbolic sub-expressions. We derive the forward equations, prove computational-cost bounds, analyse inference and training acceleration relative to multilayer perceptrons (MLPs) and physics-informed neural networks (PINNs), and quantify the trade-offs for FPGA/analog deployment. We argue that the DNN-EML pairing closes a literature gap: prior neuro-symbolic and equation-learner approaches (EQL, KAN, AI-Feynman) work with heterogeneous primitive sets and do not exploit a single hardware-realisable Sheffer element. A balanced assessment shows that EML is unlikely to accelerate training, and on commodity CPU/GPU it is also unlikely to accelerate inference; however, on a custom EML cell (FPGA logic block or analog circuit) the asymptotic latency advantage can reach an order of magnitude with simultaneous gain in interpretability and formal-verification tractability.

replace Stability and Generalization in Looped Transformers

Authors: Asher Labovich

Abstract: Looped transformers promise test-time compute scaling by spending more iterations on harder problems, but it remains unclear which architectural choices let them extrapolate to harder problems at test time rather than memorize training-specific solutions. We introduce a fixed-point based framework for analyzing looped architectures along three axes of stability -- reachability, input-dependence, and geometry -- and use it to characterize when fixed-point iteration yields meaningful predictions. Theoretically, we prove that looped networks without recall have countable fixed points and cannot achieve strong input-dependence at any spectral regime, while recall combined with outer normalization reliably produces a regime in which fixed points are simultaneously reachable, locally smooth in the input, and supported by stable backpropagation. Empirically, we train single-layer looped transformers on chess, sudoku, and prefix-sums and find that downstream performance tracks the framework's predictions across tasks and architectural configurations. We additionally introduce internal recall, a novel recall placement variant, and show that it becomes competitive with -- and on sudoku, substantially better than -- standard recall placement once outer normalization is applied.

replace A multimodal and temporal foundation model for virtual patient representations at healthcare system scale

Authors: Andrew Zhang, Tong Ding, Sophia J. Wagner, Caiwei Tian, Ming Y. Lu, Rowland Pettit, Joshua E. Lewis, Alexandre Misrahi, Dandan Mo, Long Phi Le, Faisal Mahmood

Abstract: Modern medicine generates vast multimodal data across siloed systems, yet no existing model integrates the full breadth and temporal depth of the clinical record into a unified patient representation. We introduce Apollo, a multimodal temporal foundation model trained and evaluated on over three decades of longitudinal hospital records from a major US hospital system, composed of 25 billion records from 7.2 million patients, representing 28 distinct medical modalities and 12 major medical specialties. Apollo learns a unified representation space integrating over 100 thousand unique medical events in our clinical vocabulary as well as images and clinical text. This "atlas of medical concepts" forms a computational substrate for modeling entire patient care journeys comprised of sequences of structured and unstructured events, which are compressed by Apollo into virtual patient representations. To assess the potential of these whole-patient representations, we created 322 prognosis and retrieval tasks from a held-out test set of 1.4 million patients. We demonstrate the generalized clinical forecasting potential of Apollo embeddings, including predicting new disease onset risk up to five years in advance (95 tasks), disease progression (78 tasks), treatment response (59 tasks), risk of treatment-related adverse events (17 tasks), and hospital operations endpoints (12 tasks). Using feature attribution techniques, we show that model predictions align with clinically-interpretable multimodal biomarkers. We evaluate semantic similarity search on 61 retrieval tasks, and moreover demonstrate the potential of Apollo as a multimodal medical search engine using text and image queries. Together, these modeling capabilities establish the foundation for computable medicine, where the full context of patient care becomes accessible to computational reasoning.

replace Bounded Ratio Reinforcement Learning

Authors: Yunke Ao, Le Chen, Bruce D. Lee, Assefa S. Wahd, Aline Czarnobai, Philipp F\"urnstahl, Bernhard Sch\"olkopf, Andreas Krause

Abstract: Proximal Policy Optimization (PPO) has become the predominant algorithm for on-policy reinforcement learning due to its scalability and empirical robustness across domains. However, there is a significant disconnect between the underlying foundations of trust region methods and the heuristic clipped objective used in PPO. In this paper, we bridge this gap by introducing the Bounded Ratio Reinforcement Learning (BRRL) framework. We formulate a novel regularized and constrained policy optimization problem and derive its analytical optimal solution. We prove that this solution ensures monotonic performance improvement. To handle parameterized policy classes, we develop a policy optimization algorithm called Bounded Policy Optimization (BPO) that minimizes an advantage-weighted divergence between the policy and the analytic optimal solution from BRRL. We further establish a lower bound on the expected performance of the resulting policy in terms of the BPO loss function. Notably, our framework also provides a new theoretical lens to interpret the success of the PPO loss, and connects trust region policy optimization and the Cross-Entropy Method (CEM). We additionally extend BPO to Group-relative BPO (GBPO) for LLM fine-tuning. Empirical evaluations of BPO across MuJoCo, Atari, and complex IsaacLab environments (e.g., Humanoid locomotion), and of GBPO for LLM fine-tuning tasks, demonstrate that BPO and GBPO generally match or outperform PPO and GRPO in stability and final performance.

replace FASE : A Fairness-Aware Spatiotemporal Event Graph Framework for Predictive Policing

Authors: Pronob Kumar Barman, Pronoy Kumar Barman, Plaban Kumar Barman, Rohan Mandar Salvi

Abstract: Predictive policing systems that allocate patrol resources based solely on predicted crime risk can unintentionally amplify racial disparities through feedback driven data bias. We present FASE, a Fairness Aware Spatiotemporal Event Graph framework, which integrates spatiotemporal crime prediction with fairness constrained patrol allocation and a closed loop deployment feedback simulator. We model Baltimore as a graph of 25 ZIP Code Tabulation Areas and use 139,982 Part 1 crime incidents from 2017 to 2019 at hourly resolution, producing a sparse feature tensor. The prediction module combines a spatiotemporal graph neural network with a multivariate Hawkes process to capture spatial dependencies and self exciting temporal dynamics. Outputs are modeled using a Zero Inflated Negative Binomial distribution, suitable for overdispersed and zero heavy crime counts. The model achieves a validation loss of 0.4800 and a test loss of 0.4857. Patrol allocation is formulated as a fairness constrained linear optimization problem that maximizes risk weighted coverage while enforcing a Demographic Impact Ratio constraint with deviation bounded by 0.05. Across six simulated deployment cycles, fairness remains within 0.9928 to 1.0262, and coverage ranges from 0.876 to 0.936. However, a persistent detection rate gap of approximately 3.5 percentage points remains between minority and non minority areas. This result shows that allocation level fairness constraints alone do not eliminate feedback induced bias in retraining data, highlighting the need for fairness interventions across the full pipeline.

replace-cross Survival of the Cheapest: Cost-Aware Hardware Adaptation for Adversarial Robustness

Authors: Charles Meyers, Mohammad Reza Saleh Sedghpour, Tommy L\"ofstedt, Erik Elmroth

Abstract: Deploying adversarially robust machine learning systems requires continuous trade-offs between robustness, cost, and latency. We present an autonomic decision-support framework providing a quantitative foundation for adaptive hardware selection and hyper-parameter tuning in cloud-native deep learning. The framework applies accelerated failure time (AFT) models to quantify the effect of hardware choice, batch size, epochs, and validation accuracy on model survival time. This framework can be naturally integrated into an autonomic control loop (monitor--analyse--plan--execute, MAPE-K), where system metrics such as cost, robustness, and latency are continuously evaluated and used to adapt model configurations and hardware selection. Experiments across three GPU architectures confirm the framework is both sound and cost-effective: the Nvidia L4 yields a 20% increase in adversarial survival time while costing 75% less than the V100, demonstrating that expensive hardware does not necessarily improve robustness. The analysis further reveals that model inference latency is a stronger predictor of adversarial robustness than training time or hardware configuration.

replace-cross LLM Agents Grounded in Self-Reports Enable General-Purpose Simulation of Individuals

Authors: Joon Sung Park, Carolyn Q. Zou, Jonne Kamphorst, Niles Egan, Aaron Shaw, Benjamin Mako Hill, Carrie Cai, Meredith Ringel Morris, Percy Liang, Robb Willer, Michael S. Bernstein

Abstract: Machine learning can predict human behavior well when substantial structured data and well-defined outcomes are available, but these models are typically limited to specific outcomes and cannot readily be applied to new domains. We test whether large language models (LLMs) can support a more general-purpose approach by building person-specific simulations (i.e., "generative agents") grounded in self-report data. Using data from a diverse national sample of 1,052 Americans, we build agents from (i) two-hour, semi-structured interviews (elicited using the American Voices Project interview schedule), (ii) structured surveys (the General Social Survey and Big Five personality inventory), or (iii) both sources combined. On held-out General Social Survey items, agent accuracy reached 83% (interview only), 82% (surveys only), and 86% (combined) of participants' two-week test-retest consistency, compared with agents prompted only with individuals' demographics (74%). Agents predicted personality traits and behaviors in experiments with similar accuracy, and reduced disparities in accuracy across racial and ideological groups relative to demographics-only baselines. Together, these results show that LLMs agents grounded in rich qualitative or quantitative self-report data can support general-purpose simulation of individuals across outcomes, without requiring task-specific training data.

replace-cross Learn2Synth: Learning Optimal Data Synthesis Using Hypergradients for Brain Image Segmentation

Authors: Xiaoling Hu, Xiangrui Zeng, Oula Puonti, Juan Eugenio Iglesias, Bruce Fischl, Yael Balbastre

Abstract: Domain randomization through synthesis is a powerful strategy to train networks that are unbiased with respect to the domain of the input images. Randomization allows networks to see a virtually infinite range of intensities and artifacts during training, thereby minimizing overfitting to appearance and maximizing generalization to unseen data. Although powerful, this approach relies on the accurate tuning of a large set of hyperparameters that govern the probabilistic distribution of the synthesized images. Instead of manually tuning these parameters, we introduce Learn2Synth, a novel procedure in which synthesis parameters are learned using a small set of real labeled data. Unlike methods that impose constraints to align synthetic data with real data (e.g., contrastive or adversarial techniques), which risk misaligning the image and its label map, we tune an augmentation engine such that a segmentation network trained on synthetic data has optimal accuracy when applied to real data. This approach allows the training procedure to benefit from real labeled examples, without ever using these real examples to train the segmentation network, which avoids biasing the network towards the properties of the training set. Specifically, we develop parametric and nonparametric strategies to enhance synthetic images in a way that improves the performance of the segmentation network. We demonstrate the effectiveness of this learning strategy on synthetic and real-world brain scans. Code is available at: https://github.com/HuXiaoling/Learn2Synth.

URLs: https://github.com/HuXiaoling/Learn2Synth.

replace-cross BatchLLM: Optimizing Large Batched LLM Inference with Global Prefix Sharing and Throughput-oriented Token Batching

Authors: Zhen Zheng, Xin Ji, Taosong Fang, Fanghao Zhou, Chuanjie Liu, Gang Peng

Abstract: Large language models (LLMs) increasingly play an important role in a wide range of information processing and management tasks in industry. Many of these tasks are performed in large batches or even offline, and the performance indicator for which is throughput. These tasks usually show the characteristic of prefix sharing, where different prompt input can partially show the common prefix. However, the existing LLM inference engines tend to optimize the streaming requests and show limitations of supporting the large batched tasks with the prefix sharing characteristic. The existing solutions use the LRU-based cache to reuse the KV context of common prefix between requests. The KV context that are about to be reused may be prematurely evicted with the implicit cache management. Besides, the streaming oriented systems do not leverage the request-batch information and can not mix the decoding tokens with the prefill chunks to the best for the batched scenarios, and thus fails to saturate the GPU. We propose BatchLLM to address the above problems. BatchLLM explicitly identifies the common prefixes globally. The requests sharing the same prefix will be scheduled together to reuse the KV context the best. BatchLLM reorders the requests and schedules the requests with larger ratio of decoding first to better mix the decoding tokens with the latter prefill chunks, and applies memory-centric token batching to enlarge the token-batch sizes, which helps to increase the GPU utilization. Extensive evaluation shows that BatchLLM outperforms vLLM and SGLang by $1.3\times$ to $10.8\times$ on a set of microbenchmarks and a typical industry workload under different hardware environments. Code is available at https://github.com/microsoft/MixLLM/tree/batchllm_vllm_064.

URLs: https://github.com/microsoft/MixLLM/tree/batchllm_vllm_064.

replace-cross Throat and acoustic paired speech dataset for deep learning-based speech enhancement

Authors: Yunsik Kim, Yonghun Song, Yoonyoung Chung

Abstract: In high-noise environments such as factories, subways, and busy streets, capturing clear speech is challenging. Throat microphones can offer a solution because of their inherent noise-suppression capabilities; however, the passage of sound waves through skin and tissue attenuates high-frequency information, reducing speech clarity. Recent deep learning approaches have shown promise in enhancing throat microphone recordings, but further progress is constrained by the lack of a standard dataset. Here, we introduce the Throat and Acoustic Paired Speech (TAPS) dataset, a collection of paired utterances recorded from 60 native Korean speakers using throat and acoustic microphones. Furthermore, an optimal alignment approach was developed and applied to address the inherent signal mismatch between the two microphones. We tested three baseline deep learning models on the TAPS dataset and found mapping-based approaches to be superior for improving speech quality and restoring content. These findings demonstrate the TAPS dataset's utility for speech enhancement tasks and support its potential as a standard resource for advancing research in throat microphone-based applications.

replace-cross The Optical and Infrared Are Connected

Authors: Christian K. Jespersen, Peter Melchior, David N. Spergel, Andy D. Goulding, ChangHoon Hahn, Kartheik G. Iyer

Abstract: Galaxies are often modelled as composites of separable components with distinct spectral signatures, implying that different wavelength ranges are only weakly correlated. They are not. We present a data-driven model which exploits subtle correlations between physical processes to accurately predict infrared (IR) WISE photometry from a neural summary of optical SDSS spectra. The model achieves accuracies of $\chi^2_N \approx 1$ for all photometric bands in WISE, as well as good colors. We are able to tightly constrain typically IR-derived properties, e.g., the bolometric luminosities of AGN and dust parameters such as $\mathrm{q_{PAH}}$. We also test whether current SED-fitting methods reproduce such panchromatic relations, but find their predictions biased and overconfident, likely due to model misspecification, with correlated biases in star-formation rates and AGN luminosities being most evident. To help improve SED models, we determine which features of the optical spectrum are responsible for our improved predictions, and identify several lines (CaII, SrII, FeI, [OII] and H$\alpha$), which point to the complex chronology of star formation and chemical enrichment being incorrectly modelled.

replace-cross Quantum Adaptive Self-Attention for Quantum Transformer Models

Authors: Chi-Sheng Chen, En-Jui Kuo

Abstract: Integrating quantum computing into deep learning architectures is a promising but poorly understood endeavor: when does a quantum layer actually help, and how much quantum is enough? We address both questions through Quantum Adaptive Self-Attention (QASA), a hybrid Transformer that replaces the value projection in a \emph{single} encoder layer with a parameterized quantum circuit (PQC), while keeping all other layers classical. This \emph{minimal quantum integration} strategy uses only 36 trainable quantum parameters -- fewer than any competing quantum model -- yet achieves the best MSE on 4 of 9 synthetic benchmarks and a 6.0\% MAE reduction on the real-world ETTh1 dataset. An ablation study reveals that quantum layer \emph{position} matters more than \emph{count}: adding more quantum layers degrades performance, while a single layer at the optimal position consistently outperforms multi-layer quantum configurations. Comparison with two recent quantum time-series baselines -- QLSTM and QnnFormer -- confirms that QASA matches or exceeds models with $2$--$4\times$ more quantum parameters, significantly outperforming QLSTM on the seasonal trend task ($p{=}0.009$, Cohen's $d{>}6$). Crucially, the benefit is \emph{task-conditional}: QASA excels on chaotic, noisy, and trend-dominated signals, while classical Transformers remain superior for clean periodic waveforms -- providing a practical taxonomy for when quantum enhancement is warranted. These findings establish an \emph{architectural parsimony} principle for hybrid quantum-classical design: maximal quantum benefit is achieved not by maximizing quantum resources, but by strategically placing minimal quantum computation where it matters most.

replace-cross The effect of the number of parameters and the number of local feature patches on loss landscapes in distributed quantum neural networks

Authors: Yoshiaki Kawase

Abstract: Quantum neural networks hold promise for tackling computationally challenging tasks that are intractable for classical computers. However, their practical application is hindered by significant optimization challenges, arising from complex loss landscapes characterized by barren plateaus and numerous local minima. These problems become more severe as the number of parameters or qubits increases, hampering effective training. To mitigate these optimization challenges, particularly for classical data, we distribute overlapping local patches across multiple quantum neural networks, processing each patch with an independent quantum neural network, and aggregating their outputs for prediction. In this study, we investigate how the number of parameters and patches affects the loss landscape geometry of this distributed quantum neural network architecture via theoretical and empirical Hessian analyses and loss landscape visualization. Our results confirm that increasing the number of parameters tends to lead to deeper and sharper loss landscapes. Crucially, we theoretically derive and empirically demonstrate that increasing the number of patches significantly reduces the largest Hessian eigenvalue at minima. Furthermore, our analysis of the full Hessian eigenspectrum reveals a structure consisting of a bulk of near-zero eigenvalues and distinct outlier spikes corresponding to the number of classes, similar to classical deep learning models. These findings suggest that our distributed patch approach acts as a form of implicit structural regularization, promoting optimization stability and potentially enhancing generalization. Our study provides valuable insights into optimization challenges and highlights that the distributed patch approach is a promising strategy for developing more trainable and scalable quantum machine learning models for classical data tasks.

replace-cross CubeDAgger: Interactive Imitation Learning for Dynamic Systems with Efficient yet Low-risk Interaction

Authors: Taisuke Kobayashi

Abstract: Interactive imitation learning makes an agent's control policy robust by stepwise supervisions from an expert. The recent algorithms mostly employ expert-agent switching systems to reduce the expert's burden by limitedly selecting the supervision timing. However, this approach is useful only for static tasks; in dynamic tasks, timing discrepancies cause abrupt changes in actions, losing the robot's dynamic stability. This paper therefore proposes a novel method, named CubeDAgger, which improves robustness with less dynamic stability violations even for dynamic tasks. The proposed method is designed on a baseline, EnsembleDAgger, with three improvements. The first adds a regularization to explicitly activate the threshold for deciding the supervision timing. The second transforms the expert-agent switching system to an optimal consensus system of multiple action candidates. Third, autoregressive colored noise is injected to the agent's actions for time-consistent exploration. These improvements are verified by simulations, showing that the trained policies are sufficiently robust while maintaining dynamic stability during interaction. Finally, real-robot scooping experiments with a human expert demonstrate that the proposed method can learn robust policies from scratch based on just 30 minutes of interaction. https://youtu.be/kBl3SCTnVEM

URLs: https://youtu.be/kBl3SCTnVEM

replace-cross Model Internal Sleuthing: Finding Lexical Identity and Inflectional Features in Modern Language Models

Authors: Michael Li, Nishant Subramani

Abstract: Large transformer-based language models dominate modern NLP, yet our understanding of how they encode linguistic information relies primarily on studies of early models like BERT and GPT-2. We systematically probe 25 models from BERT Base to Qwen2.5-7B focusing on two linguistic properties: lexical identity and inflectional features across 6 diverse languages. We find a consistent pattern: inflectional features are linearly decodable throughout the model, while lexical identity is prominent early but increasingly weakens with depth. Further analysis of the representation geometry reveals that models with aggressive mid-layer dimensionality compression show reduced steering effectiveness in those layers, despite probe accuracy remaining high. Pretraining analysis shows that inflectional structure stabilizes early while lexical identity representations continue evolving. Taken together, our findings suggest that transformers maintain inflectional features across layers, while trading off lexical identity for compact, predictive representations. Our code is available at https://github.com/ml5885/model_internal_sleuthing

URLs: https://github.com/ml5885/model_internal_sleuthing

replace-cross Multi-Armed Bandits With Machine Learning-Generated Surrogate Rewards

Authors: Wenlong Ji, Yihan Pan, Ruihao Zhu, Lihua Lei

Abstract: Multi-armed bandit (MAB) is a widely adopted framework for sequential decision-making under uncertainty. Traditional bandit algorithms rely solely on online data, which tends to be scarce as it must be gathered during the online phase when the arms are actively pulled. However, in many practical settings, rich auxiliary data, such as covariates of past users, is available prior to deploying any arms. We introduce a new setting for MAB where pre-trained machine learning (ML) models are applied to convert side information and historical data into \emph{surrogate rewards}. A prominent challenge of this setting is that the surrogate rewards may exhibit substantial bias, as true reward data is typically unavailable in the offline phase, forcing ML predictions to heavily rely on extrapolation. To address the issue, we propose the Machine Learning-Assisted Upper Confidence Bound (MLA-UCB) algorithm, which can be applied to any reward prediction model and any form of auxiliary data. When the predicted and true rewards are jointly Gaussian, it provably improves the cumulative regret, even in cases where the mean surrogate reward completely misaligns with the true mean rewards, and achieves the asymptotic optimality among a broad class of policies. Notably, our method requires no prior knowledge of the covariance matrix between true and surrogate rewards. We further extend the method to a batched reward MAB problem, where each arm pull yields a batch of observations and rewards may be non-Gaussian, and we derive computable confidence bounds and regret guarantees that improve upon classical UCB algorithms. Finally, extensive simulations with both Gaussian and ML-generated surrogates, together with real-world studies on language model selection and video recommendation, demonstrate consistent and often substantial regret reductions with moderate offline surrogate sample sizes and correlations.

replace-cross Faster Fixed-Point Methods for Multichain MDPs

Authors: Matthew Zurek, Yudong Chen

Abstract: We study value-iteration (VI) algorithms for solving general (a.k.a. multichain) Markov decision processes (MDPs) under the average-reward criterion, a fundamental but theoretically challenging setting. Beyond the difficulties inherent to all average-reward problems posed by the lack of contractivity and non-uniqueness of solutions to the Bellman operator, in the multichain setting an optimal policy must solve the navigation subproblem of steering towards the best connected component, in addition to optimizing long-run performance within each component. We develop algorithms which better solve this navigational subproblem in order to achieve faster convergence for multichain MDPs, obtaining improved rates of convergence and sharper measures of complexity relative to prior work. Many key components of our results are of potential independent interest, including novel connections between average-reward and discounted problems, optimal fixed-point methods for discounted VI which extend to general Banach spaces, new sublinear convergence rates for the discounted value error, and refined suboptimality decompositions for multichain MDPs. Overall our results yield faster convergence rates for discounted and average-reward problems and expand the theoretical foundations of VI approaches.

replace-cross Artifacts of Numerical Integration in Learning Dynamical Systems

Authors: Bing-Ze Lu, Richard Tsai

Abstract: In many applications, one needs to learn a dynamical system from its solutions sampled at a finite number of time points. The learning problem is often formulated as an optimization problem over a chosen function class. However, in the optimization procedure, prediction data from generic dynamics requires a numerical integrator to assess the mismatch with the observed data. This paper reveals potentially serious effects of a chosen numerical scheme on the learning outcome. Specifically, the analysis demonstrates that a damped oscillatory system may be incorrectly identified as having "anti-damping" and exhibiting a reversed oscillation direction, even though it adequately fits the given data points. This paper shows that the stability region of the selected integrator will distort the nature of the learned dynamics. Crucially, reducing the step size or raising the order of an explicit integrator does not, in general, remedy this artifact, because higher-order explicit methods have stability regions that extend further into the right half complex plane. Furthermore, it is shown that the implicit midpoint method can preserve either conservative or dissipative properties from discrete data, offering a principled integrator choice even when the only prior knowledge is that the system is autonomous.

replace-cross Adaptive Multi-task Learning for Multi-sector Portfolio Optimization

Authors: Qingliang Fan, Ruike Wu, Yanrong Yang

Abstract: Accurate transfer of information across multiple sectors to enhance model estimation is both significant and challenging in multi-sector portfolio optimization involving a large number of assets in different classes. Within the framework of factor modeling, we propose a novel data-adaptive multi-task learning methodology that quantifies and learns the relatedness among the principal temporal subspaces (spanned by factors) across multiple sectors under study. This approach not only improves the simultaneous estimation of multiple factor models but also enhances multi-sector portfolio optimization, which heavily depends on the accurate recovery of these factor models. Additionally, a novel and easy-to-implement algorithm, termed projection-penalized principal component analysis, is developed to accomplish the multi-task learning procedure. Diverse simulation designs and practical application on daily return data from Russell 3000 index demonstrate the advantages of multi-task learning methodology.

replace-cross ReasonRank: Empowering Passage Ranking with Strong Reasoning Ability

Authors: Wenhan Liu, Xinyu Ma, Weiwei Sun, Yutao Zhu, Yuchen Li, Dawei Yin, Zhicheng Dou

Abstract: Large Language Model (LLM) based listwise ranking has shown superior performance in many passage ranking tasks. With the development of Large Reasoning Models (LRMs), many studies have demonstrated that step-by-step reasoning during test-time helps improve listwise ranking performance. However, due to the scarcity of reasoning-intensive training data, existing rerankers perform poorly in many complex ranking scenarios, and the ranking ability of reasoning-intensive rerankers remains largely underdeveloped. In this paper, we first propose an automated reasoning-intensive training data synthesis framework, which sources training queries and passages from diverse domains and applies DeepSeek-R1 to generate high-quality training labels. To empower the listwise reranker with strong reasoning ability, we further propose a two-stage training approach, which includes a cold-start supervised fine-tuning (SFT) stage and a reinforcement learning (RL) stage. During the RL stage, we design a novel multi-view ranking reward tailored to the multi-turn nature of listwise ranking. Extensive experiments demonstrate that our trained reasoning-intensive reranker \textbf{ReasonRank} outperforms existing baselines significantly and also achieves much lower latency than the pointwise reranker. Our codes are available at https://github.com/8421BCD/ReasonRank.

URLs: https://github.com/8421BCD/ReasonRank.

replace-cross Neural Bandit Based Optimal LLM Selection for a Pipeline of Subtasks

Authors: Baran Atalar, Eddie Zhang, Carlee Joe-Wong

Abstract: As large language models (LLMs) become increasingly popular, there is a growing need to predict which out of a set of LLMs will yield a successful answer to a given query at low cost. This problem promises to become even more relevant as LLM agents are asked to solve an increasing variety of "agentic'' AI tasks. Such tasks are often broken into smaller subtasks, each of which can then be executed by a LLM expected to perform well on that specific subtask. For example, to extract a diagnosis from medical records, one can first select an LLM to summarize the record, select another to validate the summary, and then select a possibly different LLM to extract the diagnosis from the summarized record. Unlike existing LLM selection or routing algorithms, this setting requires selecting a sequence of LLMs, with the output of each LLM feeding into the next and potentially influencing its success. Thus, unlike single LLM selection, the quality of each subtask's output directly affects the inputs, and hence the cost and success rate, of downstream LLMs, creating complex performance dependencies that must be learned during selection. We propose a neural contextual bandit-based algorithm that trains neural networks to guide LLM selections for the different subtasks, without requiring historical LLM performance data. We prove that our proposed Sequential Bandits algorithm achieves a sublinear regret in the number of tasks, and we experimentally validate its superior performance compared to other LLM selection algorithms on two real datasets.

replace-cross Foundational Design Principles and Patterns for Building Robust and Adaptive GenAI-Native Systems

Authors: Frederik Vandeputte

Abstract: Generative AI (GenAI) has emerged as a transformative technology, demonstrating remarkable capabilities across diverse application domains. However, GenAI faces several major challenges in developing reliable and efficient GenAI-empowered systems due to its unpredictability and inefficiency. This paper advocates for a paradigm shift: future GenAI-native systems should integrate GenAI's cognitive capabilities with traditional software engineering principles to create robust, adaptive, and efficient systems. We introduce foundational GenAI-native design principles centered around five key pillars -- reliability, excellence, evolvability, self-reliance, and assurance -- and propose architectural patterns such as GenAI-native cells, organic substrates, and programmable routers to guide the creation of resilient and self-evolving systems. Additionally, we outline the key ingredients of a GenAI-native software stack and discuss the impact of these systems from technical, user adoption, economic, and legal perspectives, underscoring the need for further validation and experimentation. Our work aims to inspire future research and encourage relevant communities to implement and refine this conceptual framework.

replace-cross Task-Stratified Knowledge Scaling Laws for Post-Training Quantized Large Language Models

Authors: Chenxi Zhou, Pengfei Cao, Jiang Li, Bohan Yu, Jinyu Ye, Jun Zhao, Kang Liu

Abstract: Post-Training Quantization (PTQ) is a critical strategy for efficient Large Language Models (LLMs) deployment. However, existing scaling laws primarily focus on general performance, overlooking crucial fine-grained factors and how quantization differentially impacts diverse knowledge capabilities. To address this, we establish Task-Stratified Knowledge Scaling Laws. By stratifying capabilities into memorization, application, and reasoning, we develop a framework that unifies model size, bit-width, and fine-grained factors: group size and calibration set size. Validated on 293 diverse PTQ configurations, our framework demonstrates strong fit and cross-architecture consistency. It reveals distinct sensitivities across knowledge capabilities: reasoning is precision-critical, application is scale-responsive, and memorization is calibration-sensitive. We highlight that in low-bit scenarios, optimizing these fine-grained factors is essential for preventing performance collapse. These findings provide an empirically-backed foundation for designing knowledge-aware quantization strategies.

replace-cross Gauge-covariant stochastic neural fields: Stability and finite-width effects

Authors: Rodrigo Carmo Terin

Abstract: We develop a gauge-covariant stochastic effective field theory for stability and finite-width effects in deep neural systems. The model uses classical commuting fields: a complex matter field, a real Abelian connection field, and a fictitious stochastic depth variable. Using the Martin--Siggia--Rose--Janssen--de~Dominicis formalism, we derive its functional representation and a two-replica linear-response construction defining the maximal Lyapunov exponent and the amplification factor for the edge of chaos. Finite-width effects appear as perturbative corrections to dressed kernels, and the marginality condition remains unchanged at the order considered for fixed kernel geometry. Numerically, finite-width multilayer perceptrons follow the mean-field instability threshold, and a linear stochastic effective sector reproduces the predicted low-frequency spectral deformation.

replace-cross MasconCube: Fast and Accurate Gravity Modeling with an Explicit Representation

Authors: Pietro Fanti, Dario Izzo

Abstract: The geodesy of irregularly shaped small bodies presents fundamental challenges for gravitational field modeling, particularly as deep space exploration missions increasingly target asteroids and comets. Traditional approaches suffer from critical limitations: spherical harmonics diverge within the Brillouin sphere where spacecraft typically operate, polyhedral models assume unrealistic homogeneous density distributions, and existing machine learning methods like GeodesyNets and Physics-Informed Neural Networks (PINN-GM) require extensive computational resources and training time. This work introduces MasconCubes, a novel self-supervised learning approach that formulates gravity inversion as a direct optimization problem over a regular 3D grid of point masses (mascons). Unlike implicit neural representations, MasconCubes explicitly model mass distributions while leveraging known asteroid shape information to constrain the solution space. Comprehensive evaluation on diverse asteroid models including Bennu, Eros, Itokawa, and synthetic planetesimals demonstrates that MasconCubes achieve superior performance across multiple metrics. Most notably, MasconCubes demonstrate computational efficiency advantages with training times approximately 40 times faster than GeodesyNets while maintaining physical interpretability through explicit mass distributions. These results establish MasconCubes as a promising approach for mission-critical gravitational modeling applications requiring high accuracy, computational efficiency, and physical insight into internal mass distributions of irregular celestial bodies.

replace-cross Scalable Quantum Reinforcement Learning on NISQ Devices with Dynamic-Circuit Qubit Reuse and Grover Optimization

Authors: Thet Htar Su, Shaswot Shresthamali, Masaaki Kondo

Abstract: A scalable and resource-efficient quantum reinforcement learning framework is presented that eliminates the linear qubit-scaling barrier in multi-step quantum Markov decision processes (QMDPs). The proposed framework integrates a QMDP formulation, dynamic-circuit execution, and Grover-based amplitude amplification into a unified quantum-native architecture. Environment dynamics are encoded entirely within quantum Hilbert space, enabling coherent superposition over state-action sequences and a direct quantum agent-environment interface without intermediate quantum-to-classical conversion. The central contribution is a dynamic execution model for multi-step QMDPs that employs mid-circuit measurement and reset to recycle a fixed physical quantum register across sequential interactions. This approach preserves trajectory fidelity relative to a static unrolled QMDP, generating identical state-action sequences while reducing the physical qubit requirement from 7xT to a constant 7, independent of the interaction horizon T. Thus, the qubit complexity of multi-step QMDPs is transformed from O(T) to O(1) while maintaining functional equivalence at the level of trajectory generation. Trajectory returns are evaluated via quantum arithmetic, and high-return trajectories are marked and amplified using amplitude amplification to increase their sampling probability. Simulations confirm preservation of trajectory fidelity with a 66% qubit reduction compared to a static design. Experimental execution on an IBM Heron-class processor demonstrates feasibility on noisy intermediate-scale quantum hardware, establishing a scalable and resource-efficient foundation for large-scale quantum-native reinforcement learning.

replace-cross Transformers Can Learn Connectivity in Some Graphs but Not Others

Authors: Amit Roy, Abulhair Saparov

Abstract: Reasoning capability is essential to ensure the factual correctness of the responses of transformer-based Large Language Models (LLMs), and robust reasoning about transitive relations is instrumental in many settings, such as causal inference. Hence, it is essential to investigate the capability of transformers in the task of inferring transitive relations (e.g., knowing A causes B and B causes C, then A causes C). The task of inferring transitive relations is equivalent to the task of connectivity in directed graphs (e.g., knowing there is a path from A to B, and there is a path from B to C, then there is a path from A to C). Past research focused on whether transformers can learn to infer transitivity from in-context examples provided in the input prompt. However, transformers' capability to infer transitive relations from training examples and how scaling affects the ability is unexplored. In this study, we seek to answer this question by generating directed graphs to train transformer models of varying sizes and evaluate their ability to infer transitive relations for various graph sizes. Our findings suggest that transformers are capable of learning connectivity on "grid-like'' directed graphs where each node can be embedded in a low-dimensional subspace, and connectivity is easily inferable from the embeddings of the nodes. We find that the dimensionality of the underlying grid graph is a strong predictor of transformers' ability to learn the connectivity task, where higher-dimensional grid graphs pose a greater challenge than low-dimensional grid graphs. In addition, we observe that increasing the model scale leads to increasingly better generalization to infer connectivity over grid graphs. However, if the graph is not a grid graph and contains many disconnected components, transformers struggle to learn the connectivity task, especially when the number of components is large.

replace-cross M\"obius transforms and Shapley values for vector-valued functions on weighted directed acyclic multigraphs

Authors: Patrick Forr\'e, Abel Jansma

Abstract: M\"obius inversion and Shapley values are two mathematical tools for characterizing and decomposing higher-order structure in complex systems. The former defines higher-order interactions as discrete derivatives over a partial order; the latter provides a principled way to attribute those interactions back to the `atomic' elements of the system. Both have found wide application, from combinatorics and cooperative game theory to machine learning and explainable AI. We generalize both tools simultaneously in two orthogonal directions: 1) from real-valued functions to functions valued in any abelian group (in particular, vector-valued functions), and 2) from partial orders and lattices to directed acyclic multigraphs (DAMGs) and weighted versions thereof. The classical axioms, linearity, efficiency, null player, and symmetry, which uniquely characterize Shapley values on lattices, are insufficient in this more general setting. We resolve this by introducing projection operators that recursively re-attribute higher-order synergies down to the roots of the graph, and by proposing two natural axioms: weak elements (coalitions with zero synergy can be removed without affecting any attribution) and flat hierarchy (on graphs with no intermediate hierarchy, attributions are distributed proportionally to edge counts). Together with linearity, these three axioms uniquely determine the Shapley values via a simple explicit formula, while automatically implying efficiency, null player, symmetry, and a novel projection property. The resulting framework recovers all existing lattice-based definitions as special cases, and naturally handles settings, such as games on non-lattice partial orders, which were previously out of reach. The extension to vector-valued functions and general DAMG-structured hierarchies opens new application areas in machine learning, natural language processing, and explainable AI.

replace-cross Accumulated Aggregated D-Optimal Designs for Estimating Main Effects in Black-Box Models

Authors: Chih-Yu Chang, Ming-Chung Chang

Abstract: Estimating how individual input variables affect the output of a black-box model is a central task in explainable machine learning. However, existing methods suffer from two key limitations: sensitivity to out-of-distribution (OOD) evaluations, which arises when query points are placed far from the data manifold, and instability under feature correlation, which can lead to unreliable effect estimates in practice. We introduce a unified view of main effect estimation as a design problem, which reveals that all existing methods differ only in their choice of evaluation locations. Building on this formulation, we propose A2D2E, an Estimator based on Accumulated Aggregated D-Optimal Designs, which replaces evaluations with a D-optimal hypercube design to minimize the variance of main effect estimation. A2D2E is model-agnostic, requires no differentiability of the predictor, and admits a closed-form estimator with complexity comparable to existing approaches. We establish that A2D2E is consistent to the same population target as ALE, and extend this result to the realistic setting where only a surrogate model is available. Through extensive simulations across multiple predictive models and dependence settings, we demonstrate that A2D2E outperforms ALE-based methods, with the largest gains under high feature correlation.

replace-cross Combo-Gait: Unified Transformer Framework for Multi-Modal Gait Recognition and Attribute Analysis

Authors: Zhao-Yang Wang, Zhimin Shao, Anirudh Nanduri, Basudha Pal, Laura McDaniel, Jieneng Chen, Rama Chellappa

Abstract: Gait recognition is an important biometric for human identification at a distance, particularly under low-resolution or unconstrained environments. Current works typically focus on either 2D representations (e.g., silhouettes and skeletons) or 3D representations (e.g., meshes and SMPLs), but relying on a single modality often fails to capture the full geometric and dynamic complexity of human walking patterns. In this paper, we propose a multi-modal and multi-task framework that combines 2D temporal silhouettes with 3D SMPL features for robust gait analysis. Beyond identification, we introduce a multitask learning strategy that jointly performs gait recognition and human attribute estimation, including age, body mass index (BMI), and gender. A unified transformer is employed to effectively fuse multi-modal gait features and better learn attribute-related representations, while preserving discriminative identity cues. Extensive experiments on the large-scale BRIAR datasets, collected under challenging conditions such as long-range distances (up to 1 km) and extreme pitch angles (up to 50{\deg}), demonstrate that our approach outperforms state-of-the-art methods in gait recognition and provides accurate human attribute estimation. These results highlight the promise of multi-modal and multitask learning for advancing gait-based human understanding in real-world scenarios.

replace-cross High-Level Multi-Robot Trajectory Planning And Spurious Behavior Detection

Authors: Fernando Salanova, Jes\'us Roche, Cristian Mahulea, Eduardo Montijano

Abstract: The reliable execution of high-level missions in multi-robot systems with heterogeneous agents, requires robust methods for detecting spurious behaviors. In this paper, we address the challenge of identifying spurious executions of plans specified as a Linear Temporal Logic (LTL) formula, as incorrect task sequences, violations of spatial constraints, timing inconsistencies, or deviations from intended mission semantics. To tackle this, we introduce a structured data generation framework based on the Nets-within-Nets (NWN) paradigm, which coordinates robot actions with LTL-derived global mission specifications. We further propose a Transformer-based anomaly detection pipeline that classifies robot trajectories as normal or anomalous. Experimental evaluations show that our method achieves high accuracy (91.3%) in identifying execution inefficiencies, and demonstrates robust detection capabilities for core mission violations (88.3%) and constraint-based adaptive anomalies (66.8%). An ablation experiment of the embedding and architecture was carried out, obtaining successful results where our novel proposition performs better than simpler representations.

replace-cross Language Models Learn Universal Representations of Numbers and Here's Why You Should Care

Authors: Michal \v{S}tef\'anik, Timothee Mickus, Marek Kadl\v{c}\'ik, Bertram H{\o}jer, Michal Spiegel, Ra\'ul V\'azquez, Aman Sinha, Josef Kucha\v{r}, Philipp Mondorf, Pontus Stenetorp

Abstract: Prior work has shown that large language models (LLMs) often converge to accurate input embedding for numbers, based on sinusoidal representations. In this work, we quantify that these representations are in fact strikingly systematic, to the point of being almost perfectly universal: different LLM families develop equivalent sinusoidal structures, and number representations are broadly interchangeable in a large swathe of experimental setups. We show that properly factoring in this characteristic is crucial when it comes to assessing how accurately LLMs encode numeric and other ordinal information, and that mechanistically enhancing this sinusoidality can also lead to reductions of LLMs' arithmetic errors.

replace-cross X-IONet: Cross-Platform Inertial Odometry Network for Pedestrian and Legged Robot

Authors: Dehan Shen, Changhao Chen

Abstract: Learning-based inertial odometry has achieved remarkable progress in pedestrian navigation. However, extending these methods to quadruped robots remains challenging due to their distinct and highly dynamic motion patterns. Models that perform well on pedestrian data often experience severe degradation when deployed on legged platforms. To tackle this challenge, we introduce X-IONet, a cross-platform inertial odometry framework that operates solely using a single Inertial Measurement Unit (IMU). X-IONet incorporates a rule-based expert selection module to classify motion platforms and route IMU sequences to platform-specific expert networks. The displacement prediction network features a dual-stage attention architecture that jointly models long-range temporal dependencies and inter-axis correlations, enabling accurate motion representation. It outputs both displacement and associated uncertainty, which are further fused through an Extended Kalman Filter (EKF) for robust state estimation. Extensive experiments on the public RoNIN pedestrian dataset, the GrandTour quadruped dataset, and a self-collected Go2 quadruped dataset demonstrate that X-IONet achieves state-of-the-art performance, reducing ATE and RTE by 14.3% and 11.4% on RoNIN, 11.8% and 9.7% on GrandTour, and 52.8% and 41.3% on Go2. These results highlight X-IONet's effectiveness for accurate and robust inertial navigation across both human and legged robot platforms.

replace-cross AutoGraphAD: Unsupervised network anomaly detection using Variational Graph Autoencoders

Authors: Georgios Anyfantis, Pere Barlet-Ros

Abstract: Network Intrusion Detection Systems (NIDS) are essential tools for detecting network attacks and intrusions. While extensive research has explored the use of supervised Machine Learning for attack detection and characterisation, these methods require accurately labelled datasets, which are very costly to obtain. Moreover, existing public datasets have limited and/or outdated attacks, and many of them suffer from mislabelled data. To reduce the reliance on labelled data, we propose AutoGraphAD, a novel unsupervised anomaly detection based on a Heterogeneous Variational Graph Autoencoder. AutoGraphAD operates on heterogeneous graphs, made from connection and IP nodes that represent network activity. The model is trained using unsupervised and contrastive learning, without relying on any labelled data. The model's losses are then weighted and combined in an anomaly score used for anomaly detection. Overall, AutoGraphAD yields the same, and in some cases better, results than Anomal-E, but without requiring costly downstream anomaly detectors. As a result, AutoGraphAD achieves around 1.18 orders of magnitude faster training and 1.03 orders of magnitude faster inference, which represents a significant advantage for operational deployment.

replace-cross Spira: Exploiting Voxel Data Structural Properties for Efficient Sparse Convolution in Point Cloud Networks

Authors: Dionysios Adamopoulos, Anastasia Poulopoulou, Georgios Goumas, Christina Giannoula

Abstract: Sparse Convolution (SpC) powers 3D point cloud networks widely used in autonomous driving and augmented/virtual reality. SpC builds a kernel map that stores mappings between input voxel coordinates, output coordinates, and weight offsets, then uses this map to compute feature vectors for output coordinates. Our work identifies three key properties of voxel coordinates: they are integer-valued, bounded within a limited spatial range, and geometrically continuous, i.e., neighboring voxels on the same object surface are highly likely to exist at small spatial offsets from each other. Prior SpC engines do not fully exploit these properties and suffer from high pre-processing and post-processing overheads during kernel map construction. To address this, we design Spira, the first voxel-property-aware SpC engine for GPUs. Spira proposes (i) a high-performance one-shot search algorithm that builds the kernel map with no pre-processing and high data locality, (ii) an effective packed-native processing scheme that accesses packed voxel coordinates at low cost, (iii) a flexible dual-dataflow execution mechanism that efficiently computes output feature vectors by adapting to layer characteristics, and (iv) a network-wide parallelization strategy that builds kernel maps for all SpC layers concurrently at network start. Our evaluation shows that Spira significantly outperforms prior state-of-the-art SpC engines by 1.68x on average and up to 3.04x for end-to-end inference, and by 2.11x on average and up to 3.44x for layer-wise execution across diverse layer configurations. The source code of Spira is freely available at github.com/SPIN-Research-Group/Spira.

replace-cross Control Consistency Losses for Diffusion Bridges

Authors: Samuel Howard, Nikolas N\"usken, Jakiw Pidstrigach

Abstract: Simulating the conditioned dynamics of diffusion processes, given their initial and terminal states, is an important but challenging problem in the sciences. The difficulty is particularly pronounced for rare events, for which the unconditioned dynamics rarely reach the terminal state. In this work, we propose a novel approach for learning diffusion bridges based on a self-consistency property of the optimal control. The resulting algorithm learns the conditioned dynamics in an iterative online manner, and exhibits strong performance in a range of empirical settings without requiring differentiation through simulated trajectories. Beyond the diffusion bridge setting, we draw connections between our self-consistency framework and recent advances in the wider stochastic optimal control literature.

replace-cross Understanding Overparametrization in Survival Models through Interpolation

Authors: Yin Liu, Jianwen Cai, Didong Li

Abstract: Classical statistical learning theory predicts a U-shaped relationship between test loss and model capacity, driven by the bias-variance trade-off. Recent advances in modern machine learning have revealed a more complex pattern, double-descent, in which test loss, after peaking near the interpolation threshold, decreases again as model capacity continues to grow. While this behavior has been extensively analyzed in regression and classification, its manifestation in survival analysis remains unexplored. This study investigates overparametrization in four representative survival models: DeepSurv, PC-Hazard, Nnet-Survival, and N-MTLR. We rigorously define interpolation and finite-norm interpolation, two key characteristics of loss-based models to understand double-descent. We then show the existence (or absence) of (finite-norm) interpolation of all four models. Our findings clarify how likelihood-based losses and model implementation jointly determine the feasibility of interpolation and show that overparametrization should not be regarded as benign for survival models. All theoretical results are supported by numerical experiments that highlight the distinct generalization behaviors of survival models.

replace-cross Foundation Models in Biomedical Imaging: Turning Hype into Reality

Authors: Amgad Muneer, Kai Zhang, Ibraheem Hamdi, Rizwan Qureshi, Muhammad Waqas, Shereen Fouad, Hazrat Ali, Syed Muhammad Anwar, Jia Wu

Abstract: Foundation models (FMs) are driving a prominent shift in biomedical imaging from task-specific models to unified backbone models for diverse tasks. This opens an avenue to integrate imaging, pathology, clinical records, and genomics data into a composite system. However, this vision contrasts sharply with modern medicine's trajectory toward more granular sub-specialization. This tension, coupled with data scarcity, domain heterogeneity, and limited interpretability, creates a gap between benchmark success and real-world clinical value. We argue that the immediate role of FMs lies in augmenting, not replacing, clinical expertise. To separate hype from reality, we introduce REAL-FM (Real-world Evaluation and Assessment of Foundation Models), a multi-dimensional framework for assessing data, technical readiness, clinical value, workflow integration, and responsible AI. Using REAL-FM, we find that while FMs excel in pattern recognition, they fall short in causal reasoning, domain robustness, and safety. Clinical translation is hindered by scarce representative data for model training, unverified generalization beyond oversimplified benchmark settings, and a lack of prospective outcome-based validation. We further examine FM reasoning paradigms, including sequential logic, spatial understanding, and symbolic domain knowledge. We envision that the path forward lies not in a monolithic medical oracle, but in coordinated subspecialist AI systems that are transparent, safe, and clinically grounded.

replace-cross Device-Native Autonomous Agents for Privacy-Preserving Negotiations

Authors: Joyjit Roy, Samaresh Kumar Singh

Abstract: Automated negotiations in insurance and business-to-business (B2B) commerce encounter substantial challenges. Current systems force a trade-off between convenience and privacy by routing sensitive financial data through centralized servers, increasing security risks, and diminishing user trust. This study introduces a device-native autonomous Agentic AI system for privacy-preserving negotiations. The proposed system operates exclusively on user hardware, enabling real-time bargaining while maintaining sensitive constraints locally. It integrates zero-knowledge proofs to ensure privacy and employs distilled world models to support advanced on-device reasoning. The architecture incorporates six technical components within an Agentic AI workflow. Agents autonomously plan negotiation strategies, conduct secure multi-party bargaining, and generate cryptographic audit trails without exposing user data to external servers. The system is evaluated in insurance and B2B procurement scenarios across diverse device configurations. Results show an average success rate of 87 %, a 2.4x reduction in latency relative to cloud baselines, and strong privacy preservation through zero-knowledge proofs. User studies show 27 % higher trust scores when decision trails are available. These findings establish a foundation for trustworthy autonomous agents in privacy-sensitive financial domains.

replace-cross Epistemology gives a Future to Complementarity in Human-AI Interactions

Authors: Andrea Ferrario, Alessandro Facchini, Juan M. Dur\'an

Abstract: Human-AI complementarity is the claim that a human supported by an AI system can outperform either alone in a decision-making process. Since its introduction in the humanAI interaction literature, it has gained traction by generalizing the reliance paradigm and by offering a more practical alternative to the contested construct of trust in AI. Yet complementarity faces key theoretical challenges: it lacks precise theoretical anchoring, it is formalized only as a post hoc indicator of relative predictive accuracy, it remains silent about other desiderata of human-AI interactions, and it abstracts away from the magnitude-cost profile of its performance gain. As a result, complementarity is difficult to obtain in empirical settings. In this work, we leverage epistemology to address these challenges by reframing complementarity within the discourse on justificatory AI. Drawing on computational reliabilism, we argue that historical instances of complementarity function as evidence that a given human-AI interaction is a reliable epistemic process for a given predictive task. Together with other reliability indicators assessing the alignment of the human-AI team with the epistemic standards and socio-technical practices, complementarity contributes to the degree of reliability of human-AI teams when generating predictions. This repositioning supports the practical reasoning of those affected by these outputs -- patients, managers, regulators, and others. Our approach suggests that the role and value of complementarity lie not in providing a stand-alone measure of relative predictive accuracy, but in helping calibrate decision-making to the reliability of AI-supported processes. We conclude by translating this repositioning into design- and governance-oriented recommendations, including a minimal reporting checklist for justificatory human-AI interactions and measures of efficient complementarity.

replace-cross KOCO-BENCH: Can Large Language Models Leverage Domain Knowledge in Software Development?

Authors: Xue Jiang, Ge Li, Jiaru Qian, Xianjie Shi, Chenjie Li, Hao Zhu, Ziyu Wang, Jielun Zhang, Zheyu Zhao, Kechi Zhang, Jia Li, Wenpin Jiao, Zhi Jin, Yihong Dong

Abstract: Large language models (LLMs) excel at general programming but struggle with domain-specific software development, necessitating domain specialization methods for LLMs to learn and utilize domain knowledge and data. However, existing domain-specific code benchmarks cannot evaluate the effectiveness of domain specialization methods, which focus on assessing what knowledge LLMs possess rather than how they acquire and apply new knowledge, lacking explicit knowledge corpora for developing domain specialization methods. To this end, we present KOCO-BENCH, a novel benchmark designed for evaluating domain specialization methods in real-world software development. KOCO-BENCH contains 6 emerging domains with 11 software frameworks and 25 projects, featuring curated knowledge corpora alongside multi-granularity evaluation tasks including domain code generation (from function-level to project-level with rigorous test suites) and domain knowledge understanding (via multiple-choice Q&A). Unlike previous benchmarks that only provide test sets for direct evaluation, KOCO-BENCH requires acquiring and applying diverse domain knowledge (APIs, rules, constraints, etc.) from knowledge corpora to solve evaluation tasks. Our evaluations reveal that KOCO-BENCH poses significant challenges to state-of-the-art LLMs. Even with domain specialization methods (e.g., SFT, RAG, kNN-LM) applied, improvements remain marginal. Best-performing coding agent, Claude Code, achieves only 34.2%, highlighting the urgent need for more effective domain specialization methods. We release KOCO-BENCH, evaluation code, and baselines to advance further research at https://github.com/jiangxxxue/KOCO-bench.

URLs: https://github.com/jiangxxxue/KOCO-bench.

replace-cross veScale-FSDP: Flexible and High-Performance FSDP at Scale

Authors: Zezhou Wang, Youjie Li, Zhiqi Lin, Jiacheng Yang, Cong Xie, Guanyu Feng, Zheng Zhong, Ziyue Huang, Hongyu Zhu, Zhi Zhang, Yanghua Peng, Xin Liu

Abstract: Fully Sharded Data Parallel (FSDP), also known as Zero Redundancy Optimizer (ZeRO), is widely used for large-scale model training, because of its memory efficiency and minimal intrusion on model code. However, existing FSDP systems rely on fixed element-wise or row-wise sharding formats that conflict with block-structured computations. As a result, they struggle to support modern structure-aware training methods, including block-wise quantization and non-element-wise optimizers such as Shampoo and Muon. In addition, today's implementations incur communication and memory overheads that degrade efficiency at the scale of tens of thousands of GPUs. We introduce veScale-FSDP, a novel FSDP system that combines RaggedShard, a flexible sharding format, with a structure-aware planning algorithm to deliver both flexibility and performance. veScale-FSDP enables zero-copy FSDP communications and natively supports block-wise quantization and non-element-wise optimizers, achieving 5% to 66% higher throughput and 16% to 30% lower memory usage than existing FSDP systems, while scaling efficiently to tens of thousands of GPUs.

replace-cross FlexServe: A Fast and Secure LLM Serving System for Mobile Devices with Flexible Resource Isolation

Authors: Yinpeng Wu, Yitong Chen, Lixiang Wang, Jinyu Gu, Zhichao Hua, Yubin Xia

Abstract: Device-side Large Language Models (LLMs) have witnessed explosive growth, offering higher privacy and availability compared to cloud-side LLMs. During LLM inference, both model weights and user data are valuable, and attackers may even compromise the OS kernel to steal them. ARM TrustZone is the de facto hardware-based isolation technology on mobile devices, used to protect sensitive applications from a compromised OS. However, protecting LLM inference with TrustZone incurs significant overhead due to its inflexible isolation of memory and the NPU. To address these challenges, this paper introduces FlexServe, a fast and secure LLM serving system for mobile devices. It first introduces a Flexible Resource Isolation mechanism to construct Flexible Secure Memory (Flex-Mem) and Flexible Secure NPU (Flex-NPU). Both memory pages and the NPU can be efficiently switched between unprotected and protected modes. Based on these mechanisms, FlexServe designs a fast and secure LLM inference framework within TrustZone's secure world. The LLM-Aware Memory Management and Secure Inference Pipeline are introduced to accelerate inference. A Multi-Model Scheduler is proposed to optimize multi-model workflows. We implement a prototype of FlexServe and compare it with two TrustZone-based strawman designs. The results show that FlexServe achieves an average $10.05\times$ speedup in Time to First Token (TTFT) compared to the strawman, and an average $2.44\times$ TTFT speedup compared to an optimized strawman with pipeline and secure NPU enabled. For multi-model agent workflows, the end-to-end speedup is up to $24.30\times$ and $4.05\times$ compared to the strawman and optimized strawman, respectively.

replace-cross Robust Principal Component Completion

Authors: Yinjian Wang, Wei Li, Yuanyuan Gui, James E. Fowler, Gemine Vivone

Abstract: Robust principal component analysis (RPCA) seeks a low-rank component and a sparse component from their summation. Yet, in many applications of interest, the sparse foreground actually replaces, or occludes, elements from the low-rank background. To address this mismatch, a new framework is proposed in which the sparse component is identified indirectly through determining its support. This approach, called robust principal component completion (RPCC), is solved via variational Bayesian inference applied to a fully probabilistic Bayesian sparse tensor factorization. Convergence to a hard classifier for the support is shown, thereby eliminating the post-hoc thresholding required of most prior RPCA-driven approaches. Experimental results reveal that the proposed approach delivers near-optimal estimates on synthetic data as well as robust foreground-extraction and anomaly-detection performance on real color video and hyperspectral datasets, respectively. Source implementation and Appendices are available at https://github.com/WongYinJ/BCP-RPCC.

URLs: https://github.com/WongYinJ/BCP-RPCC.

replace-cross From Diffusion to Flow: Efficient Motion Generation in MotionGPT3

Authors: Jaymin Ban, JiHong Jeon, SangYeop Jeong

Abstract: Recent text-driven motion generation methods span both discrete token-based approaches and continuous-latent formulations. MotionGPT3 exemplifies the latter paradigm, combining a learned continuous motion latent space with a diffusion-based prior for text-conditioned synthesis. While rectified flow objectives have recently demonstrated favorable convergence and inference-time properties relative to diffusion in image and audio generation, it remains unclear whether these advantages transfer cleanly to the motion generation setting. In this work, we conduct a controlled empirical study comparing diffusion and rectified flow objectives within the MotionGPT3 framework. By holding the model architecture, training protocol, and evaluation setup fixed, we isolate the effect of the generative objective on training dynamics, final performance, and inference efficiency. Experiments on the HumanML3D dataset show that rectified flow converges in fewer training epochs, reaches strong test performance earlier, and matches or exceeds diffusion-based motion quality under identical conditions. Moreover, flow-based priors exhibit stable behavior across a wide range of inference step counts and achieve competitive quality with fewer sampling steps, yielding improved efficiency-quality trade-offs. Overall, our results suggest that several known benefits of rectified flow objectives do extend to continuous-latent text-to-motion generation, highlighting the importance of the training objective choice in motion priors.

replace-cross Rays as Pixels: Learning A Joint Distribution of Videos and Camera Trajectories

Authors: Wonbong Jang, Shikun Liu, Soubhik Sanyal, Juan Camilo Perez, Kam Woh Ng, Sanskar Agrawal, Juan-Manuel Perez-Rua, Yiannis Douratsos, Tao Xiang

Abstract: Recovering camera parameters from images and rendering scenes from novel viewpoints have been treated as separate tasks in computer vision and graphics. This separation breaks down when image coverage is sparse or poses are ambiguous, since each task depends on what the other produces. We propose Rays as Pixels, a Video Diffusion Model (VDM) that learns a joint distribution over videos and camera trajectories. To our knowledge, this is the first model to predict camera poses and do camera-controlled video generation within a single framework. We represent each camera as dense ray pixels (raxels), a pixel-aligned encoding that lives in the same latent space as video frames, and denoise the two jointly through a Decoupled Self-Cross Attention mechanism. A single trained model handles three tasks: predicting camera trajectories from video, generating video from input images along a pre-defined trajectory, and jointly synthesizing video and trajectory from input images. We evaluate on pose estimation and camera-controlled video generation, and introduce a closed-loop self-consistency test showing that the model's predicted poses and its renderings conditioned on those poses agree. Ablations against Pl\"ucker embeddings confirm that representing cameras in a shared latent space with video is subtantially more effective.

replace-cross Seven simple steps for log analysis in AI systems

Authors: Magda Dubois, Ekin Zorer, Maia Hamin, Joe Skinner, Alexandra Souly, Jerome Wynne, Harry Coppock, Lucas Sato, Sayash Kapoor, Sunishchal Dev, Keno Juchems, Kimberly Mai, Timo Flesch, Lennart Luettgau, Charles Teague, Eric Patey, JJ Allaire, Lorenzo Pacchiardi, Jose Hernandez-Orallo, Cozmin Ududec

Abstract: AI systems produce large volumes of logs as they interact with tools and users. Analysing these logs can help understand model capabilities, propensities, and behaviours, or assess whether an evaluation worked as intended. Researchers have started developing methods for log analysis, but a standardised approach is still missing. Here we suggest a pipeline based on current best practices. We illustrate it with concrete code examples in the Inspect Scout library, provide detailed guidance on each step, and highlight common pitfalls. Our framework provides researchers with a foundation for rigorous and reproducible log analysis.

replace-cross Efficient Transceiver Design for Aerial Image Transmission and Large-scale Scene Reconstruction

Authors: Zeyi Ren, Jialin Dong, Wei Zuo, Yikun Wang, Bingyang Cheng, Sheng Zhou, Zhisheng Niu

Abstract: Large-scale three-dimensional (3D) scene reconstruction in low-altitude intelligent networks (LAIN) demands highly efficient wireless image transmission. However, existing schemes struggle to balance severe pilot overhead with the transmission accuracy required to maintain reconstruction fidelity. To strike a balance between efficiency and reliability, this paper proposes a novel deep learning-based end-to-end (E2E) transceiver design that integrates 3D Gaussian Splatting (3DGS) directly into the training process. By jointly optimizing the communication modules via the combined 3DGS rendering loss, our approach explicitly improves scene recovery quality. Furthermore, this task-driven framework enables the use of a sparse pilot scheme, significantly reducing transmission overhead while maintaining robust image recovery under low-altitude channel conditions. Extensive experiments on real-world aerial image datasets demonstrate that the proposed E2E design significantly outperforms existing baselines, delivering superior transmission performance and accurate 3D scene reconstructions.

replace-cross Rhetorical Questions in LLM Representations: A Linear Probing Study

Authors: Louie Hong Yao, Vishesh Anand, Yuan Zhuang, Tianyu Jiang

Abstract: Rhetorical questions are asked not to seek information but to persuade or signal stance. How large language models internally represent them remains unclear. We analyze rhetorical questions in LLM representations using linear probes on two social-media datasets with different discourse contexts, and find that rhetorical signals emerge early and are most stably captured by last-token representations. Rhetorical questions are linearly separable from information-seeking questions within datasets, and remain detectable under cross-dataset transfer, reaching AUROC around 0.7-0.8. However, we demonstrate that transferability does not simply imply a shared representation. Probes trained on different datasets produce different rankings when applied to the same target corpus, with overlap among the top-ranked instances often below 0.2. Qualitative analysis shows that these divergences correspond to distinct rhetorical phenomena: some probes capture discourse-level rhetorical stance embedded in extended argumentation, while others emphasize localized, syntax-driven interrogative acts. Together, these findings suggest that rhetorical questions in LLM representations are encoded by multiple linear directions emphasizing different cues, rather than a single shared direction.

replace-cross ExoNet: Calibrated Multimodal Deep Learning for TESS Exoplanet Candidate Vetting using Phase-Folded Light Curves, Stellar Parameters, and Multi-Head Attention

Authors: Md. Rashadul Islam

Abstract: The discovery of exoplanets at scale has become one of the defining data science challenges in modern astrophysics. NASA's Transiting Exoplanet Survey Satellite (TESS) had catalogued over 7,800 planet candidates by early 2026, yet confirmation stands at fewer than 720. This paper introduces ExoNet, a multimodal deep learning framework that jointly processes phase-folded global and local light curve views alongside stellar parameter features through a calibrated late-fusion architecture combining 1D Convolutional Neural Networks, 8-head Multi-Head Attention over temporal feature maps, and a residual fusion head with post-hoc Temperature Scaling calibration. Trained on 7,585 labeled Kepler Objects of Interest, ExoNet achieves Test AUC = 0.9549 and 86.3% accuracy. Applied to 4,720 verified unconfirmed TESS Planet Candidates with TOI-TIC cross-identification verified against the NASA Exoplanet Archive, the model yields 1,754 high-confidence signals, 52 habitable-zone candidates, and six Earth-sized habitable-zone targets below 1.6 Earth radii. TOI-5728.01 and TOI-6716.01 emerge as the most Earth-like unconfirmed candidates. Full ablation confirms each modality improves AUC. Code and catalog are openly released.

replace-cross BARD: Bridging AutoRegressive and Diffusion Vision-Language Models Via Highly Efficient Progressive Block Merging and Stage-Wise Distillation

Authors: Baoyou Chen, Hanchen Xia, Peng Tu, Haojun Shi, Shan Mu, Weihao Yuan, Siyu Zhu

Abstract: Autoregressive vision-language models (VLMs) deliver strong multimodal capability, but their token-by-token decoding imposes a fundamental inference bottleneck. Diffusion VLMs offer a more parallel decoding paradigm, yet directly converting a pretrained autoregressive VLM into a large-block diffusion VLM (dVLM) often leads to substantial quality degradation. In this work, we present BARD, a simple and effective bridging framework that converts a pretrained autoregressive VLM into a same-architecture, decoding-efficient dVLM. Our approach combines progressive supervised block merging, which gradually enlarges the decoding block size, with stage-wise intra-dVLM distillation from a fixed small-block diffusion anchor to recover performance lost at larger blocks. We further incorporate a mixed noise scheduler to improve robustness and token revision during denoising, and memory-friendly training to enable efficient training on long multimodal sequences. A key empirical finding is that direct autoregressive-to-diffusion distillation is poorly aligned and can even hurt performance, whereas distillation within the diffusion regime is consistently effective. Experimental results show that, with $\leq$ 4.4M data, BARD-VL transfers strong multimodal capability from Qwen3-VL to a large-block dVLM. Remarkably, BARD-VL establishes a new SOTA among comparable-scale open dVLMs on our evaluation suite at both 4B and 8B scales. At the same time, BARD-VL achieves up to 3$\times$ decoding throughput speedup compared to the source model. Code is available at: $\href{https://github.com/fudan-generative-vision/Bard-VL}{this~https~URL}$.

URLs: https://github.com/fudan-generative-vision/Bard-VL

replace-cross Q-SINDy: Quantum-Kernel Sparse Identification of Nonlinear Dynamics with Provable Coefficient Debiasing

Authors: Samrendra Roy, Syed Bahauddin Alam

Abstract: Quantum feature maps offer expressive embeddings for classical learning tasks, and augmenting sparse identification of nonlinear dynamics (SINDy) with such features is a natural but unexplored direction. We introduce \textbf{Q-SINDy}, a quantum-kernel-augmented SINDy framework, and identify a specific failure mode that arises: \emph{coefficient cannibalization}, in which quantum features absorb coefficient mass that rightfully belongs to the polynomial basis, corrupting equation recovery. We derive the exact cannibalization-bias formula $\Delta\xi_P = (P^\top P)^{-1}P^\top Q\,\hat\xi_Q$ and prove that orthogonalizing quantum features against the polynomial column space at fit time eliminates this bias exactly. The claim is verified numerically to machine precision ($<10^{-12}$) on multiple systems. Empirically, across six canonical dynamical systems (Duffing, Van der Pol, Lorenz, Lotka-Volterra, cubic oscillator, R\"ossler) and three quantum feature map architectures (ZZ-angle encoding, IQP, data re-uploading), orthogonalized Q-SINDy consistently matches vanilla SINDy's structural recovery while uncorrected augmentation degrades true-positive rates by up to 100\%. A refined dynamics-aware diagnostic, $R^2_Q$ for $\dot X$, predicts cannibalization severity with statistical significance (Pearson $r=0.70$, $p=0.023$). An RBF classical-kernel control across 20 hyperparameter configurations fails more severely than any quantum variant, ruling out feature count as the cause. Orthogonalization remains robust under depolarizing hardware noise up to 2\% per gate, and the framework extends without modification to Burgers' equation.

replace-cross RoLegalGEC: Legal Domain Grammatical Error Detection and Correction Dataset for Romanian

Authors: Mircea Timpuriu, Mihaela-Claudia Cercel, Dumitru-Clementin Cercel

Abstract: The importance of clear and correct text in legal documents cannot be understated, and, consequently, a grammatical error correction tool meant to assist a professional in the law must have the ability to understand the possible errors in the context of a legal environment, correcting them accordingly, and implicitly needs to be trained in the same environment, using realistic legal data. However, the manually annotated data required by such a process is in short supply for languages such as Romanian, much less for a niche domain. The most common approach is the synthetic generation of parallel data; however, it requires a structured understanding of the Romanian grammar. In this paper, we introduce, to our knowledge, the first Romanian-language parallel dataset for the detection and correction of grammatical errors in the legal domain, RoLegalGEC, which aggregates 350,000 examples of errors in legal passages, along with error annotations. Moreover, we evaluate several neural network models that transform the dataset into a valuable tool for both detecting and correcting grammatical errors, including knowledge-distillation Transformers, sequence tagging architectures for detection, and a variety of pre-trained text-to-text Transformer models for correction. We consider that the set of models, together with the novel RoLegalGEC dataset, will enrich the resource base for further research on Romanian.