Authors: Fangxin Wang, Peyman Baghershahi, Langzhou He, Henry Peng Zou, Sourav Medya, Philip S. Yu
Abstract: Gradient-based data selection offers a principled framework for estimating sample utility in large language model (LLM) fine-tuning, but existing methods are mostly designed for offline settings. They are therefore less suited to online fine-tuning, where data arrives sequentially, sample utility is step-dependent, and the effective update geometry is shaped by adaptive optimizers. We propose an optimizer-aware framework for gradient-based online data selection and reweighting in LLM fine-tuning. Our key idea is to view online selection not as static sample ranking, but as shaping the next target-oriented update under the optimizer state. We formulate this as an optimizer-aware update-matching problem, establish its connection to second-order target utility, and show why subset-level construction must account for interactions and redundancy among selected samples. Based on this view, we develop a two-stage Filter-then-Weight algorithm that first filters geometrically useful candidates and then optimizes their coefficients. To make the framework practical for LLMs, we introduce a factorized outer-product gradient representation and optimized matrix computations for long-context data. Experiments show that our method consistently improves convergence and downstream performance over existing online data selection baselines under the same data budget.
Authors: Gabriel U. Talasso, Meghdad Kurmanji, Allan M. de Souza, Nicholas D. Lane, Leandro A. Villas
Abstract: Federated Learning (FL) has emerged as a promising technique for training language models on distributed and private datasets of diverse tasks. However, aggregating models trained on heterogeneous tasks often degrades the overall performance of individual clients. To address this issue, Personalized FL (pFL) aims to create models tailored for each client's data distribution. Although these approaches improve local performance, they usually lack robustness in two aspects: (i) generalization: when clients must make predictions on unseen tasks, or face changes in their data distributions, and (ii) intra-client task interference: when a single client's data contains multiple distributions that may interfere with each other during local training. To tackle these two challenges, we propose FedRouter, a clustering-based pFL method that builds specialized models for each task rather than for each client. FedRouter uses adapters to personalize models by employing two clustering mechanisms to associate adapters with specific tasks: a local clustering that associates adapters with task data samples, and a global one that associates similar adapters from different clients to construct task-centric personalized models. Additionally, we propose an evaluation router mechanism that routes test samples to the best adapter based on the created clusters. In experiments comparing our method with existing approaches on a multitask dataset, FedRouter demonstrates strong resilience in these challenging scenarios, performing up to 6.1% relatively better under task interference and achieving up to 136% relative improvement under generalization evaluation.
Authors: Adrian Mart\'inez, Ananya Gupta, Hanka Goralija, Mario Rico, Sa\'ul Fenollosa, Tamar Alphaidze
Abstract: Although Deep Reinforcement Learning (DRL) has proven highly effective for complex decision-making problems, it demands significant computational resources and careful parameter adjustment in order to develop successful strategies. Evolution strategies (ES) offer a more straightforward, derivative-free approach that is less computationally costly and simpler to deploy. However, ES generally do not match the performance levels achieved by DRL, which calls into question their suitability for more demanding scenarios. This study examines the performance of ES and DRL across tasks of varying difficulty, including Flappy Bird, Breakout and MuJoCo environments, as well as whether ES could be used for initial training to enhance DRL algorithms. The results indicate that ES do not consistently train faster than DRL. When used as a preliminary training step, they only provide benefits in less complex environments (Flappy Bird) and show minimal or no improvement in training efficiency or stability across different parameter settings when applied to more sophisticated tasks (Breakout and MuJoCo Walker).
Authors: Michael Chertkov
Abstract: An agent that operates sequentially must incorporate new experience without forgetting old experience, under a fixed memory budget. We propose a framework in which memory is not a parameter vector but a stochastic process: a Bridge Diffusion on a replay interval $[0,1]$, whose terminal marginal encodes the present and whose intermediate marginals encode the past. New experience is incorporated via a three-step \emph{Compress--Add--Smooth} (CAS) recursion. We test the framework on the class of models with marginal probability densities modeled via Gaussian mixtures with a fixed number of components~$K$ in $d$ dimensions; temporal complexity is controlled by a fixed number~$L$ of piecewise-linear protocol segments whose nodes store Gaussian-mixture states. The entire recursion costs $O(LKd^2)$ flops per day -- no backpropagation, no stored data, no neural networks -- making it viable for controller-light hardware. Forgetting in this framework arises not from parameter interference but from lossy temporal compression: the re-approximation of a finer protocol by a coarser one under a fixed segment budget. We find that the retention half-life scales linearly as $a_{1/2}\approx c\,L$ with a constant $c>1$ that depends on the dynamics but not on the mixture complexity~$K$, the dimension~$d$, or the geometry of the target family. The constant~$c$ admits an information-theoretic interpretation analogous to the Shannon channel capacity. The stochastic process underlying the bridge provides temporally coherent ``movie'' replay -- compressed narratives of the agent's history, demonstrated visually on an MNIST latent-space illustration. The framework provides a fully analytical ``Ising model'' of continual learning in which the mechanism, rate, and form of forgetting can be studied with mathematical precision.
Authors: Leonardo Medrano Sandonas, David Balcells, Anton Bochkarev, Jacqueline M. Cole, Volker L. Deringer, Werner Dobrautz, Adrian Ehrenhofer, Thorben Frank, Pascal Friederich, Rico Friedrich, Janine George, Luca Ghiringhelli, Alejandra Hinostroza Caldas, Veronika Juraskova, Hannes Kneiding, Yury Lysogorskiy, Johannes T. Margraf, Hanna T\"urk, Anatole von Lilienfeld, Milica Todorovi\'c, Alexandre Tkatchenko, Mariana Rossi, Gianaurelio Cuniberti
Abstract: Artificial intelligence is transforming molecular and materials science, but its growing computational and data demands raise critical sustainability challenges. In this Perspective, we examine resource considerations across the AI-driven discovery pipeline--from quantum-mechanical (QM) data generation and model training to automated, self-driving research workflows--building on discussions from the ``SusML workshop: Towards sustainable exploration of chemical spaces with machine learning'' held in Dresden, Germany. In this context, the availability of large quantum datasets has enabled rigorous benchmarking and rapid methodological progress, while also incurring substantial energy and infrastructure costs. We highlight emerging strategies to enhance efficiency, including general-purpose machine learning (ML) models, multi-fidelity approaches, model distillation, and active learning. Moreover, incorporating physics-based constraints within hierarchical workflows, where fast ML surrogates are applied broadly and high-accuracy QM methods are used selectively, can further optimize resource use without compromising reliability. Equally important is bridging the gap between idealized computational predictions and real-world conditions by accounting for synthesizability and multi-objective design criteria, which is essential for practical impact. Finally, we argue that sustainable progress will rely on open data and models, reusable workflows, and domain-specific AI systems that maximize scientific value per unit of computation, enabling efficient and responsible discovery of technological materials and therapeutics.
Authors: Arsenios Scrivens
Abstract: Can classifier-based safety gates maintain reliable oversight as AI systems improve over hundreds of iterations? We provide comprehensive empirical evidence that they cannot. On a self-improving neural controller (d=240), eighteen classifier configurations -- spanning MLPs, SVMs, random forests, k-NN, Bayesian classifiers, and deep networks -- all fail the dual conditions for safe self-improvement. Three safe RL baselines (CPO, Lyapunov, safety shielding) also fail. Results extend to MuJoCo benchmarks (Reacher-v4 d=496, Swimmer-v4 d=1408, HalfCheetah-v4 d=1824). At controlled distribution separations up to delta_s=2.0, all classifiers still fail -- including the NP-optimal test and MLPs with 100% training accuracy -- demonstrating structural impossibility. We then show the impossibility is specific to classification, not to safe self-improvement itself. A Lipschitz ball verifier achieves zero false accepts across dimensions d in {84, 240, 768, 2688, 5760, 9984, 17408} using provable analytical bounds (unconditional delta=0). Ball chaining enables unbounded parameter-space traversal: on MuJoCo Reacher-v4, 10 chains yield +4.31 reward improvement with delta=0; on Qwen2.5-7B-Instruct during LoRA fine-tuning, 42 chain transitions traverse 234x the single-ball radius with zero safety violations across 200 steps. A 50-prompt oracle confirms oracle-agnosticity. Compositional per-group verification enables radii up to 37x larger than full-network balls. At d<=17408, delta=0 is unconditional; at LLM scale, conditional on estimated Lipschitz constants.
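The abstract above does not spell out the Lipschitz ball verifier, so the following is only a sketch of the standard argument such a verifier can rest on. The symbols $s$ (a scalar safety margin, safe when positive), $L$ (a Lipschitz constant of $s$), $\theta_0$ (a verified parameter vector) and $m$ are notation assumed here, not taken from the paper.

```latex
% Assumption: s is L-Lipschitz in the parameters and s(\theta_0) = m > 0.
\[
  |s(\theta) - s(\theta')| \le L\,\|\theta - \theta'\|
  \quad\text{and}\quad s(\theta_0) = m > 0
\]
\[
  \|\theta - \theta_0\| < \frac{m}{L}
  \;\Longrightarrow\;
  s(\theta) \ge s(\theta_0) - L\,\|\theta - \theta_0\| > m - L\cdot\frac{m}{L} = 0 .
\]
```

Under these assumptions every parameter vector inside the ball of radius $m/L$ is safe without being checked individually. Chaining, as described in the abstract, would then correspond to re-centering a new ball at an accepted point inside the previous one.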
Authors: Xiao Qian, Shangjia Dong
Abstract: Accurate prediction of evacuation behavior is critical for disaster preparedness, yet models trained in one region often fail elsewhere. Using a multi-state hurricane evacuation survey, we show this failure goes beyond feature distribution shift: households with similar characteristics follow systematically different decision patterns across states. As a result, single global models overfit dominant responses, misrepresent vulnerable subpopulations, and generalize poorly across locations. We propose Population-Adaptive Symbolic Mixture-of-Experts (PASM), which pairs large language model guided symbolic regression with a mixture-of-experts architecture. PASM discovers human-readable closed-form decision rules, specializes them to data-driven subpopulations, and routes each input to the appropriate expert at inference time. On Hurricanes Harvey and Irma data, transferring from Florida and Texas to Georgia with 100 calibration samples, PASM achieves a Matthews correlation coefficient of 0.607, compared to XGBoost (0.404), TabPFN (0.333), GPT-5-mini (0.434), and meta-learning baselines MAML and Prototypical Networks (MCC $\leq$ 0.346). The routing mechanism assigns distinct formula archetypes to subpopulations, so the resulting behavioral profiles are directly interpretable. A fairness audit across four demographic axes finds no statistically significant disparities after Bonferroni correction. PASM closes more than half the cross-location generalization gap while keeping decision rules transparent enough for real-world emergency planning.
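The comparison in the abstract above is reported in terms of the Matthews correlation coefficient (MCC). As a reminder of what that number measures, here is a minimal sketch of the standard binary MCC formula; the function name and the convention of returning 0.0 for a zero denominator are choices made for this illustration, not details from the paper.

```python
import math


def matthews_corrcoef(y_true, y_pred):
    """Binary MCC from 0/1 labels; returns 0.0 when the denominator vanishes."""
    tp = tn = fp = fn = 0
    for t, p in zip(y_true, y_pred):
        if t == 1 and p == 1:
            tp += 1
        elif t == 0 and p == 0:
            tn += 1
        elif t == 0 and p == 1:
            fp += 1
        else:
            fn += 1
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0  # convention assumed here for degenerate confusion matrices
    return (tp * tn - fp * fn) / denom


if __name__ == "__main__":
    # Toy labels (not from the paper): one evacuee missed, no false alarms.
    print(matthews_corrcoef([1, 1, 0, 0], [1, 0, 0, 0]))
```

MCC ranges from -1 (total disagreement) to +1 (perfect prediction) and uses all four confusion-matrix cells, which makes it a common choice when classes are imbalanced.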
Authors: Amirreza Alasti, Efe Erdal, Y\"ucel Celik, Theresa Eimer
Abstract: Reinforcement Learning (RL) agents often struggle with efficiency and performance in complex environments. We propose a novel framework that uses a Large Language Model (LLM) to dynamically generate a curriculum over available actions, enabling the agent to incorporate each action individually. We apply this framework to the game of Blackjack, where the LLM creates a multi-stage training path that progressively introduces complex actions to a Tabular Q-Learning and a Deep Q-Network (DQN) agent. Our evaluation in a realistic 8-deck simulation over 10 independent runs demonstrates significant performance gains over standard training methods. The curriculum-based approach increases the DQN agent's average win rate from 43.97% to 47.41%, reduces the average bust rate from 32.9% to 28.0%, and accelerates the overall workflow by over 74%, with the agent's full training completing faster than the baseline's evaluation phase alone. These results validate that LLM-guided curricula can build more effective, robust, and efficient RL agents.
Authors: Selin Bayramo\u{g}lu, George L Nemhauser, Nikolaos V Sahinidis
Abstract: Machine learning is increasingly used to improve decisions within branch-and-bound algorithms for mixed-integer programming. Many existing approaches rely on deep learning, which often requires very large training datasets and substantial computational resources for both training and deployment, typically with GPU parallelization. In this work, we take a different path by developing interpretable models that are simple but effective. We focus on approximating strong branching (SB) scores, a highly effective yet computationally expensive branching rule. Using sparse learning methods, we build models with fewer than 4% of the parameters of a state-of-the-art graph neural network (GNN) while achieving competitive accuracy. Relative to SCIP's built-in branching rules and the GNN-based model, our CPU-only models are faster than the default solver and the GPU-accelerated GNN. The models are simple to train and deploy, and they remain effective with small training sets, which makes them practical in low-resource settings. Extensive experiments across diverse problem classes demonstrate the efficiency of this approach.
Authors: Zhe Bai, Hans Johansen
Abstract: We develop a machine learning (ML) surrogate model to approximate solutions to Maxwell's equations in one dimension, focusing on scenarios involving a material interface that reflects and transmits electromagnetic waves. Derived from high-fidelity Finite Volume (FV) simulations, our training data includes variations of the initial conditions, as well as variations in one material's speed of light, allowing the model to learn a range of wave-material interaction behaviors. The ML model autoregressively learns both the physical and frequency embeddings in a vision transformer-based framework. By incorporating Fourier transforms in the latent space, the wave number spectra of the solutions align closely with the simulation data. Prediction errors exhibit an approximately linear growth over time with a sharp increase at the material interface. Test results show that the ML solution maintains relative errors below $10\%$ over rollouts of more than $75$ time steps, despite the presence of the discontinuity and unknown material properties.
Authors: Annette Taberner-Miller
Abstract: Production LLM serving often relies on multi-model portfolios spanning a ~530x cost range, where routing decisions trade off quality against cost. This trade-off is non-stationary: providers revise pricing, model quality can regress silently, and new models must be integrated without downtime. We present ParetoBandit, an open-source adaptive router built on cost-aware contextual bandits that is the first to simultaneously enforce dollar-denominated budgets, adapt online to such shifts, and onboard new models at runtime. ParetoBandit closes these gaps through three mechanisms. An online primal-dual budget pacer enforces a per-request cost ceiling over an open-ended stream, replacing offline penalty tuning with closed-loop control. Geometric forgetting on sufficient statistics enables rapid adaptation to price and quality shifts while bootstrapping from offline priors. A hot-swap registry lets operators add or remove models at runtime, with a brief forced-exploration phase for each newcomer, after which UCB selection discovers its quality-cost niche from live traffic alone. We evaluate ParetoBandit across four deployment scenarios on 1,824 prompts routed through a three-model portfolio. Across seven budget ceilings, mean per-request cost never exceeds the target by more than 0.4%. When conditions shift, the system adapts: an order-of-magnitude price cut on the costliest model yields up to +0.071 quality lift, and a silent quality regression is detected and rerouted within budget. A cold-started model reaches meaningful adoption within ~142 steps without breaching the cost ceiling. The router discriminates rather than blindly adopting: expensive models are budget-gated and low-quality models rejected after bounded exploration. End-to-end routing latency is 9.8ms on CPU -- less than 0.4% of typical inference time -- with the routing decision itself taking just 22.5us.
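The "geometric forgetting on sufficient statistics" mentioned in the abstract above can be illustrated on the simplest possible statistic, a running mean of rewards. This is a minimal sketch, not ParetoBandit's implementation: the class name `DiscountedMean`, the helper `track`, and the discount `gamma` are hypothetical, and the actual router presumably applies the same decay to richer contextual-bandit statistics.

```python
class DiscountedMean:
    """Running mean in which older observations decay geometrically."""

    def __init__(self, gamma):
        if not 0.0 < gamma <= 1.0:
            raise ValueError("gamma must be in (0, 1]")
        self.gamma = gamma
        self.weight = 0.0  # discounted count of observations
        self.total = 0.0   # discounted sum of rewards

    def update(self, reward):
        # Decay both sufficient statistics, then add the new observation.
        self.weight = self.gamma * self.weight + 1.0
        self.total = self.gamma * self.total + reward

    @property
    def mean(self):
        return self.total / self.weight if self.weight > 0 else 0.0


def track(rewards, gamma):
    """Feed a reward stream through a DiscountedMean and return the final mean."""
    stat = DiscountedMean(gamma)
    for r in rewards:
        stat.update(r)
    return stat.mean


if __name__ == "__main__":
    # Toy stream: quality silently regresses from 1.0 to 0.0 halfway through.
    stream = [1.0] * 50 + [0.0] * 50
    print("no forgetting:", track(stream, 1.0))
    print("gamma = 0.9:  ", track(stream, 0.9))
```

With `gamma = 1.0` the estimate stays anchored at the long-run average, whereas a discount below 1 lets the estimate follow the regression within a few dozen steps, which is the trade-off the abstract describes between rapid adaptation and reliance on prior data.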
Authors: Ferdaus Anam Jibon, Fazlul Hasan Siddiqui, F. Deeba, Gahangir Hossain
Abstract: Epileptic seizures are neurological disorders characterized by abnormal and excessive electrical activity in the brain, resulting in recurrent seizure events. Electroencephalogram (EEG) signals are widely used for seizure diagnosis due to their ability to capture temporal and spatial neural dynamics. While recent deep learning methods have achieved high detection accuracy, they often lack interpretability and neurophysiological relevance. This study presents a frequency-aware framework for epileptic seizure detection based on ictal-phase EEG analysis. The raw EEG signals are decomposed into five frequency bands (delta, theta, alpha, lower beta, and higher beta), and eleven discriminative features are extracted from each band. A graph convolutional neural network (GCN) is then employed to model spatial dependencies among EEG electrodes, represented as graph nodes. Experiments on the CHB-MIT scalp EEG dataset demonstrate high detection performance, achieving accuracies of 97.1%, 97.13%, 99.5%, 99.7%, and 51.4% across the respective frequency bands, with an overall broadband accuracy of 99.01%. The results highlight the strong discriminative capability of mid-frequency bands and reveal frequency-specific seizure patterns. The proposed approach improves interpretability and diagnostic precision compared to conventional broadband EEG-based methods.
Authors: Md Rafi Islam, Md Rejwanul Haque, Elizabeth Choma, Shannon Hayes, Siobhan McMahon, Xiangrong Shen, Edward Sazonov
Abstract: Postural stability during movement is fundamental to independent living, fall prevention, and overall health, particularly among older adults who experience age-related declines in balance, muscle strength, and mobility. Among daily functional activities, the Sit-to-Stand (SiSt) transition is a critical indicator of lower-limb strength, musculoskeletal health, and fall risk, making it an essential parameter for assessing functional capacity and monitoring physical decline in aging populations. This study presents a methodology for SiSt transition detection and duration measurement using the Smart Lacelock sensor, a lightweight, shoe-mounted device that integrates a load cell, accelerometer, and gyroscope for motion analysis. The methodology was evaluated in 16 older adults (age: mean 76.84 years, SD 3.45 years) performing SiSt tasks within the Short Physical Performance Battery (SPPB) protocol. Features extracted from multimodal signals were used to train and evaluate four machine learning classifiers using 4-fold participant-independent cross-validation to classify SiSt transitions and measure their duration. The bagged tree classifier achieved an accuracy of 0.98 and an F1 score of 0.8 in classifying SiSt transitions. The mean absolute error in duration measurement of the correctly classified transitions was 0.047 seconds, with an SD of 0.07 seconds. These findings highlight the potential of the Smart Lacelock sensor for real-world fall-risk assessment and mobility monitoring in older adults.
Authors: Rachid Drissi
Abstract: We introduce L\'evy-Flows, a class of normalizing flow models that replace the standard Gaussian base distribution with L\'evy process-based distributions, specifically Variance Gamma (VG) and Normal-Inverse Gaussian (NIG). These distributions naturally capture heavy-tailed behavior while preserving exact likelihood evaluation and efficient reparameterized sampling. We establish theoretical guarantees on tail behavior, showing that for regularly varying bases the tail index is preserved under asymptotically linear flow transformations, and that identity-tail Neural Spline Flow architectures preserve the base distribution's tail shape exactly outside the transformation region. Empirically, we evaluate on S&P 500 daily returns and additional assets, demonstrating substantial improvements in density estimation and risk calibration. VG-based flows reduce test negative log-likelihood by 69% relative to Gaussian flows and achieve exact 95% VaR calibration, while NIG-based flows provide the most accurate Expected Shortfall estimates. These results show that incorporating L\'evy process structure into normalizing flows yields significant gains in modeling heavy-tailed data, with applications to financial risk management.
Authors: Hariprasath Govindarajan, Per Sid\'en, Jacob Roll, Fredrik Lindsten
Abstract: The Transformer model architecture has become one of the most widely used in deep learning and the attention mechanism is at its core. The standard attention formulation uses a softmax operation applied to a scaled dot product between query and key vectors. We explore the role played by the norms of the queries and keys, which can cause training instabilities when they grow without bound. We demonstrate how this can happen even in simple Transformer models, in the presence of easy-to-learn spurious patterns in the data. We propose a new attention formulation, QUEry-modulated Spherical aTtention (QUEST), that constrains the keys to a hyperspherical latent space, while still allowing individual tokens to flexibly control the sharpness of the attention distribution. QUEST can be easily used as a drop-in replacement for standard attention. We focus on vision applications while also exploring other domains to highlight the method's generality. We show that QUEST (1) trains without instabilities, (2) produces models with improved performance, and (3) yields models that are robust to data corruptions and adversarial attacks.
Authors: Brenden Latham, Mehrdad Moharrami
Abstract: We study offline constrained reinforcement learning from human feedback with multiple preference oracles. Motivated by applications that trade off performance with safety or fairness, we aim to maximize target population utility subject to a minimum protected group welfare constraint. From pairwise comparisons collected under a reference policy, we estimate oracle-specific rewards via maximum likelihood and analyze how statistical uncertainty propagates through the dual program. We cast the constrained objective as a KL-regularized Lagrangian whose primal optimizer is a Gibbs policy, reducing learning to a convex dual problem. We propose a dual-only algorithm that ensures high-probability constraint satisfaction and provide the first finite-sample performance guarantees for offline constrained preference learning. Finally, we extend our theoretical analysis to accommodate multiple constraints and general f-divergence regularization.
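The reduction to a convex dual described in the abstract above follows a standard pattern for KL-regularized objectives, sketched here for a single constraint. The notation is assumed for illustration and is not taken from the paper: $r_0$ and $r_1$ are the estimated target and protected-group rewards, $b$ the welfare threshold, $\beta > 0$ the regularization strength, $\lambda \ge 0$ the multiplier, and $\pi_{\mathrm{ref}}$ the reference policy.

```latex
% Lagrangian of the constrained, KL-regularized problem (assumed notation).
\[
  \mathcal{L}(\pi, \lambda)
  = \mathbb{E}_{x,\, a \sim \pi}\!\left[ r_0(x,a) + \lambda\, r_1(x,a) \right]
  - \lambda b
  - \beta\, \mathbb{E}_{x}\!\left[ \mathrm{KL}\!\left( \pi(\cdot \mid x) \,\|\, \pi_{\mathrm{ref}}(\cdot \mid x) \right) \right]
\]
% For fixed \lambda the maximizer over \pi is a Gibbs policy.
\[
  \pi_{\lambda}(a \mid x) \;\propto\;
  \pi_{\mathrm{ref}}(a \mid x)\,
  \exp\!\left( \frac{ r_0(x,a) + \lambda\, r_1(x,a) }{ \beta } \right)
\]
% Substituting back leaves a problem in \lambda alone.
\[
  \min_{\lambda \ge 0}\;
  \beta\, \mathbb{E}_{x}\!\left[ \log \sum_{a} \pi_{\mathrm{ref}}(a \mid x)
  \exp\!\left( \frac{ r_0(x,a) + \lambda\, r_1(x,a) }{ \beta } \right) \right]
  - \lambda b
\]
```

The dual objective is a log-sum-exp of functions affine in $\lambda$ plus a linear term, hence convex, which is what makes a dual-only algorithm plausible.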
Authors: Javier Bisbal, Julio Sotelo, Hern\'an Mella, Oliver Welin Odeback, Joaqu\'in Mura, David Marlevi, Junya Matsuda, Kotomi Iwata, Tetsuro Sekine, Cristian Tejos, Sergio Uribe
Abstract: This work introduces an unsupervised Divergence and Aliasing-Free neural network (DAF-FlowNet) for 4D Flow Magnetic Resonance Imaging (4D Flow MRI) that jointly enhances noisy velocity fields and corrects phase wrapping artifacts. DAF-FlowNet parameterizes velocities as the curl of a vector potential, enforcing mass conservation by construction and avoiding explicit divergence-penalty tuning. A cosine data-consistency loss enables simultaneous denoising and unwrapping from wrapped phase images. On synthetic aortic 4D Flow MRI generated from computational fluid dynamics, DAF-FlowNet achieved lower errors than existing techniques (up to 11% lower velocity normalized root mean square error, 11% lower directional error, and 44% lower divergence relative to the best-performing alternative across noise levels), with robustness to moderate segmentation perturbations. For unwrapping, at peak velocity/velocity-encoding ratios of 1.4 and 2.1, DAF-FlowNet achieved 0.18% and 5.2% residual wrapped voxels, representing reductions of 72% and 18% relative to the best alternative method, respectively. In scenarios with both noise and aliasing, the proposed single-stage formulation outperformed a state-of-the-art sequential pipeline (up to 15% lower velocity normalized root mean square error, 11% lower directional error, and 28% lower divergence). Across 10 hypertrophic cardiomyopathy patient datasets, DAF-FlowNet preserved fine-scale flow features, corrected aliased regions, and improved internal flow consistency, as indicated by reduced inter-plane flow bias in aortic and pulmonary mass-conservation analyses recommended by the 4D Flow MRI consensus guidelines. These results support DAF-FlowNet as a framework that unifies velocity enhancement and phase unwrapping to improve the reliability of cardiovascular 4D Flow MRI.
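The claim in the abstract above that a curl parameterization enforces mass conservation by construction rests on a standard vector-calculus identity, written out below for a twice continuously differentiable potential $\mathbf{A} = (A_x, A_y, A_z)$. The last display is only one plausible form of a cosine data-consistency term, assumed here for illustration; the symbols $\hat{v}$, $v_{\mathrm{meas}}$ and $v_{\mathrm{enc}}$ are not taken from the paper.

```latex
% Velocity defined as the curl of a vector potential.
\[
  \mathbf{v} = \nabla \times \mathbf{A}
\]
% Its divergence vanishes because mixed partial derivatives commute.
\[
  \nabla \cdot \mathbf{v}
  = \partial_x(\partial_y A_z - \partial_z A_y)
  + \partial_y(\partial_z A_x - \partial_x A_z)
  + \partial_z(\partial_x A_y - \partial_y A_x)
  = 0
\]
% Assumed illustrative loss: periodic in the velocity error with period 2 v_enc.
\[
  \ell(\hat{v}, v_{\mathrm{meas}})
  = 1 - \cos\!\left( \frac{\pi\,(\hat{v} - v_{\mathrm{meas}})}{v_{\mathrm{enc}}} \right)
\]
```

A loss of this shape is unchanged when the measured velocity is shifted by any multiple of $2 v_{\mathrm{enc}}$, which is why a cosine term can compare an unwrapped prediction directly against wrapped phase data.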
Authors: Thomas Buckley, Leslie Schumm, Manor Askenazi, Edward Rietman
Abstract: In this paper we extend our earlier work (Rietman et al. 2022), presenting an application of physical Reservoir Computing (RC) to the classification of handwritten and spoken digits. We utilize an unpoled cube of Lead Zirconate Titanate (PZT) as a computational substrate to process these datasets. Our results demonstrate that the PZT reservoir achieves 89.0% accuracy on MNIST handwritten digits, representing a 2.4 percentage point improvement over logistic regression baselines applied to the same preprocessed data. However, for the AudioMNIST spoken digits dataset, the reservoir system (88.2% accuracy) performs equivalently to baseline methods (88.1% accuracy), suggesting that reservoir computing provides the greatest benefits for classification tasks of intermediate difficulty where linear methods underperform but the problem remains learnable. PZT is a well-known material already used in semiconductor applications, presenting a low-power computational substrate that can be integrated with digital algorithms. Our findings indicate that physical reservoirs excel when the task difficulty exceeds the capability of simple linear classifiers but remains within the computational capacity of the reservoir dynamics.
Authors: Sunny Liu, Habon Issa, Andr\'e Longon, Liv Gorton, Meenakshi Khosla, David Klindt
Abstract: Comparing the internal representations of neural networks is a central goal in both neuroscience and machine learning. Standard alignment metrics operate on raw neural activations, implicitly assuming that similar representations produce similar activity patterns. However, neural systems frequently operate in superposition, encoding more features than they have neurons via linear compression. We derive closed-form expressions showing that superposition systematically deflates Representational Similarity Analysis, Centered Kernel Alignment, and linear regression, causing networks with identical feature content to appear dissimilar. The root cause is that these metrics depend on the cross-similarity between the two systems' respective superposition matrices, which under the assumption of random projection usually differ significantly, and not on the latent features themselves: alignment scores conflate what a system represents with how it represents it. Under partial feature overlap, this confound can invert the expected ordering, making systems sharing fewer features appear more aligned than systems sharing more. Crucially, the apparent misalignment need not reflect a loss of information; compressed sensing guarantees that the original features remain recoverable from the lower-dimensional activity, provided they are sparse. We therefore argue that comparing neural systems in superposition requires extracting and aligning the underlying features rather than comparing the raw neural mixtures.
Authors: Hoang-Chau Luong, Dat Ba Tran, Lingwei Chen
Abstract: Reverse Kullback-Leibler (RKL) divergence has recently emerged as the preferred objective for large language model (LLM) distillation, consistently outperforming forward KL (FKL), particularly in regimes with large vocabularies and significant teacher-student capacity mismatch, where RKL focuses learning on dominant modes rather than enforcing dense alignment. However, RKL introduces a structural limitation that drives the student toward overconfident predictions. We first provide an analysis of RKL by decomposing its gradients into target and non-target components, and show that non-target gradients consistently push the target logit upward even when the student already matches the teacher, thereby reducing output diversity. In addition, RKL provides weak supervision over non-target classes, leading to poor tail alignment. To address these issues, we propose Diversity-aware RKL (DRKL), which removes this gradient effect and strengthens non-target supervision while preserving the optimization benefits of RKL. Extensive experiments across datasets and model families demonstrate that DRKL consistently outperforms FKL, RKL, and other state-of-the-art distillation objectives, achieving better performance and a superior fidelity-diversity trade-off.
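For reference, the two objectives contrasted in the abstract above differ only in the direction of the divergence. The notation below is assumed for illustration: $p$ is the teacher's next-token distribution, $q_\theta$ the student's, and $V$ the vocabulary.

```latex
% Forward KL: expectation under the teacher, penalizes missing teacher mass anywhere.
\[
  \mathrm{FKL}(p \,\|\, q_\theta)
  = \sum_{y \in V} p(y)\, \log \frac{p(y)}{q_\theta(y)}
\]
% Reverse KL: expectation under the student, penalizes student mass where the teacher has little.
\[
  \mathrm{RKL}(q_\theta \,\|\, p)
  = \sum_{y \in V} q_\theta(y)\, \log \frac{q_\theta(y)}{p(y)}
\]
```

Because the reverse form averages under $q_\theta$, tokens to which the student assigns negligible probability contribute almost nothing to the objective, which is consistent with the mode-focusing behaviour and the weak non-target supervision the abstract describes.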
Authors: Anamika Paul Rupa
Abstract: Neural collapse (NC) -- the convergence of penultimate-layer features to a simplex equiangular tight frame -- is well understood at equilibrium, but the dynamics governing its onset remain poorly characterised. We identify a simple and predictive regularity: NC occurs when the mean feature norm reaches a model-dataset-specific critical value, fn*, that is largely invariant to training conditions. This value concentrates tightly within each (model, dataset) pair (CV < 8%); training dynamics primarily affect the rate at which fn approaches fn*, rather than the value itself. In standard training trajectories, the crossing of fn below fn* consistently precedes NC onset, providing a practical predictor with a mean lead time of 62 epochs (MAE 24 epochs). A direct intervention experiment confirms fn* is a stable attractor of the gradient flow -- perturbations to feature scale are self-corrected during training, with convergence to the same value regardless of direction (p>0.2). Completing the (architecture)x(dataset) grid reveals the paper's strongest result: ResNet-20 on MNIST gives fn* = 5.867 -- a +458% architecture effect versus only +68% on CIFAR-10. The grid is strongly non-additive; fn* cannot be decomposed into independent architecture and dataset contributions. Four structural regularities emerge: (1) depth has a non-monotonic effect on collapse speed; (2) activation jointly determines both collapse speed and fn*; (3) weight decay defines a three-regime phase diagram -- too little slows, an optimal range is fastest, and too much prevents collapse; (4) width monotonically accelerates collapse while shifting fn* by at most 13%. These results establish feature-norm dynamics as an actionable diagnostic for predicting NC timing, suggesting that norm-threshold behaviour is a general mechanism underlying delayed representational reorganisation in deep networks.
Authors: Jinghan Yao, Sam Ad\'e Jacobs, Walid Krichene, Masahiro Tanaka, Dhabaleswar K Panda
Abstract: Long-context decoding in LLMs is IO-bound: each token re-reads an ever-growing KV cache. Prior accelerations cut bytes via compression, which lowers fidelity, or selection/eviction, which restricts what remains accessible, and both can degrade delayed recall and long-form generation. We introduce MAC-Attention, a fidelity- and access-preserving alternative that accelerates decoding by reusing prior attention computations for semantically similar recent queries. It starts with a match stage that performs pre-RoPE L2 matching over a short local window; an amend stage rectifies the reused attention by recomputing a small band near the match boundary; and a complete stage fuses the rectified results with fresh attention computed on the KV tail through a numerically stable merge. On a match hit, the compute and bandwidth complexity is constant regardless of context length. The method is model-agnostic and composes with IO-aware kernels, paged-KV managers, and MQA/GQA. Across LongBench v2 (120K), RULER (120K), and LongGenBench (16K continuous generation), compared to the latest FlashInfer library, MAC-Attention reduces KV accesses by up to 99%, cuts token generation latency by over 60% at 128K, and achieves over 14.3x attention-phase speedups, up to 2.6x end-to-end, while maintaining full-attention quality. By reusing computation, MAC-Attention delivers long-context inference that is both fast and faithful. Code is available here: https://github.com/YJHMITWEB/MAC-Attention.git
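The abstract above mentions fusing reused and fresh attention "through a numerically stable merge" without giving details. The sketch below shows the standard log-sum-exp merge for combining softmax attention computed over two disjoint key ranges; it is an assumption that MAC-Attention's merge resembles this, and the function names `partial_attention` and `merge` are hypothetical.

```python
import math


def partial_attention(scores, values):
    """Softmax-weighted average of `values` and the log-sum-exp of `scores`."""
    m = max(scores)  # subtract the max so exp() cannot overflow
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    out = [sum(w * v[d] for w, v in zip(weights, values)) / total
           for d in range(dim)]
    return out, m + math.log(total)


def merge(out_a, lse_a, out_b, lse_b):
    """Combine two partial results as if softmax had run over both key ranges."""
    m = max(lse_a, lse_b)
    wa = math.exp(lse_a - m)  # relative softmax mass of range A
    wb = math.exp(lse_b - m)  # relative softmax mass of range B
    total = wa + wb
    out = [(wa * a + wb * b) / total for a, b in zip(out_a, out_b)]
    return out, m + math.log(total)


if __name__ == "__main__":
    # Toy query-key scores and 2-d values (illustrative numbers only).
    scores = [0.5, 2.0, -1.0, 3.0]
    values = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [-1.0, 0.5]]
    head = partial_attention(scores[:2], values[:2])  # e.g. a reused prefix
    tail = partial_attention(scores[2:], values[2:])  # e.g. the fresh KV tail
    print(merge(*head, *tail))
    print(partial_attention(scores, values))
```

The merged output matches full attention over all keys up to floating-point rounding, so a cached partial result only needs its output vector and one scalar (its log-sum-exp) to be combined with freshly computed attention.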
Authors: Yoann Boget, Pablo Strasser, Alexandros Kalousis
Abstract: Denoising-based models, including diffusion and flow matching, have led to substantial advances in graph generation. Despite this progress, such models remain constrained by two fundamental limitations: a computational cost that scales quadratically with the number of nodes and a large number of function evaluations required during generation. In this work, we introduce a novel hierarchical generative framework that reduces the number of node pairs that must be evaluated and adopts discrete flow matching to significantly decrease the number of denoising iterations. We empirically demonstrate that our approach more effectively captures graph distributions while substantially reducing generation time.
Authors: Gabriel Turinici
Abstract: Algorithms for the Multi-Armed Bandit (MAB) problem play a central role in sequential decision-making and have been extensively explored both theoretically and numerically. While most classical approaches aim to identify the arm with the highest expected reward, we focus on a risk-aware setting where the goal is to select the arm with the lowest variance, favoring stability over potentially high but uncertain returns. To model the decision process, we consider a softmax parameterization of the policy; we propose a new algorithm to select the minimal variance (or minimal risk) arm and prove its convergence under natural conditions. The algorithm constructs an unbiased estimate of the objective by using two independent draws from the current arm's distribution. We provide numerical experiments that illustrate the practical behavior of these algorithms and offer guidance on implementation choices. The setting also covers general risk-aware problems where there is a trade-off between maximizing the average reward and minimizing its variance.
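The two-draw idea rests on a standard identity: for independent draws X1, X2 from the same distribution, (X1 - X2)^2 / 2 is an unbiased estimate of the variance. Below is a minimal sketch that plugs this estimate into a plain score-function (REINFORCE-style) update of a softmax policy. The update rule, step size, and Gaussian arms are assumptions for illustration and may differ from the algorithm analysed in the paper.

```python
import numpy as np

def variance_estimate(x1, x2):
    """Unbiased variance estimate from two independent draws of the same arm."""
    return 0.5 * (x1 - x2) ** 2

def softmax(theta):
    z = np.exp(theta - theta.max())
    return z / z.sum()

def run(stds, steps=5000, lr=0.05, seed=0):
    """Stochastic gradient descent on the expected variance under a softmax policy."""
    stds = np.asarray(stds, dtype=float)
    rng = np.random.default_rng(seed)
    theta = np.zeros(len(stds))
    for _ in range(steps):
        pi = softmax(theta)
        arm = rng.choice(len(stds), p=pi)
        x1, x2 = rng.normal(0.0, stds[arm], size=2)   # two independent pulls of one arm
        cost = variance_estimate(x1, x2)
        grad_log_pi = -pi                              # gradient of log pi[arm] w.r.t. theta
        grad_log_pi[arm] += 1.0
        theta -= lr * cost * grad_log_pi               # descend: we minimise variance
    return softmax(theta)

policy = run([0.5, 1.0, 2.0])   # hypothetical arms with standard deviations 0.5, 1, 2
```

A single draw cannot give an unbiased variance estimate without knowing the mean, which is why the second independent draw is needed.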
Authors: Chuyi Dai, Witold Pedrycz, Suping Xu, Ding Liu, Xianmin Wang
Abstract: Informed Machine Learning has emerged as a viable generalization of Machine Learning (ML) by building a unified conceptual and algorithmic setting for constructing models on a unified basis of knowledge and data. Physics-informed ML involving physics equations is one of the developments within Informed Machine Learning. This study proposes a novel direction of Knowledge-Data ML, referred to as KD-ML, where numeric data are integrated with knowledge tidbits expressed in the form of granular knowledge landmarks. We advocate that data and knowledge are complementary in several fundamental ways: data are precise (numeric) and local, usually confined to some region of the input space, while knowledge is global and formulated at a higher level of abstraction. The knowledge can be represented as information granules and organized as a collection of input-output information granules called knowledge landmarks. By virtue of this evident complementarity, we develop a comprehensive design process of the KD-ML model and formulate an original augmented loss function L, which additively combines a component responsible for optimizing the model based on available numeric data with a second component that plays the role of a granular regularizer, so that the model adheres to the granular constraints (knowledge landmarks). We show the role of the hyperparameter positioned in the loss function, which balances the contribution and guiding role of data and knowledge, and point to some essential tendencies associated with the quality of data (noise level) and the level of granularity of the knowledge landmarks. Experiments on two physics-governed benchmarks demonstrate that the proposed KD model consistently outperforms data-driven ML models.
Authors: Md Mirajul Islam, Rajesh Debnath, Adittya Soukarjya Saha, Min Chi
Abstract: While apprenticeship learning has shown promise for inducing effective pedagogical policies directly from student interactions in e-learning environments, most existing approaches rely on optimal or near-optimal expert demonstrations under a fixed reward. Real-world student interactions, however, are often inherently imperfect and evolving: students explore, make errors, revise strategies, and refine their goals as understanding develops. In this work, we argue that imperfect student demonstrations are not noise to be discarded, but structured signals, provided their relative quality is ranked. We introduce HALIDE, Hierarchical Apprenticeship Learning from Imperfect Demonstrations with Evolving Rewards, which not only leverages sub-optimal student demonstrations, but ranks them within a hierarchical learning framework. HALIDE models student behavior at multiple levels of abstraction, enabling inference of higher-level intent and strategy from suboptimal actions while explicitly capturing the temporal evolution of student reward functions. By integrating demonstration quality into hierarchical reward inference, HALIDE distinguishes transient errors from suboptimal strategies and meaningful progress toward higher-level learning goals. Our results show that HALIDE more accurately predicts student pedagogical decisions than approaches that rely on optimal trajectories, fixed rewards, or unranked imperfect demonstrations.
Authors: Lam M. Nguyen, Dzung T. Phan, Jayant Kalagnanam
Abstract: Shuffling strategies for stochastic gradient descent (SGD), including incremental gradient, shuffle-once, and random reshuffling, are supported by rigorous convergence analyses for arbitrary within-epoch permutations. In particular, random reshuffling is known to improve optimization constants relative to cyclic and shuffle-once schemes. However, existing theory offers limited guidance on how to design new data-ordering schemes that further improve optimization constants or stability beyond random reshuffling. In this paper, we design a pipeline using a large language model (LLM)-guided program evolution framework to discover an effective shuffling rule for without-replacement SGD. Abstracting from this instance, we identify two fundamental structural components: block reshuffling and paired reversal. We analyze these components separately and show that block reshuffling strictly reduces prefix-gradient variance constants within the unified shuffling framework, yielding provable improvements over random reshuffling under mild conditions. Separately, we show that paired reversal symmetrizes the epoch map and cancels the leading order-dependent second-order term, reducing order sensitivity from quadratic to cubic in the step size. Numerical experiments with the discovered algorithm validate the theory and demonstrate consistent gains over standard shuffling schemes across convex and nonconvex benchmarks.
Authors: Eloghosa Ikponmwoba, Opeoluwa Owoyele
Abstract: The computational cost of stiff chemical kinetics remains a dominant bottleneck in reacting-flow simulation, yet hybrid integration strategies are typically driven by hand-tuned heuristics or supervised predictors that make myopic decisions from instantaneous local state. We introduce a constrained reinforcement learning (RL) framework that autonomously selects between an implicit BDF integrator (CVODE) and a quasi-steady-state (QSS) solver during chemistry integration. Solver selection is cast as a Markov decision process. The agent learns trajectory-aware policies that account for how present solver choices influence downstream error accumulation, while minimizing computational cost under a user-prescribed accuracy tolerance enforced through a Lagrangian reward with online multiplier adaptation. Across sampled 0D homogeneous reactor conditions, the RL-adaptive policy achieves a mean speedup of approximately $3\times$, with speedups ranging from $1.11\times$ to $10.58\times$, while maintaining accurate ignition delays and species profiles for a 106-species \textit{n}-dodecane mechanism and adding approximately $1\%$ inference overhead. Without retraining, the 0D-trained policy transfers to 1D counterflow diffusion flames over strain rates $10$--$2000~\mathrm{s}^{-1}$, delivering consistent $\approx 2.2\times$ speedup relative to CVODE while preserving near-reference temperature accuracy and selecting CVODE at only $12$--$15\%$ of space-time points. Overall, the results demonstrate the potential of the proposed reinforcement learning framework to learn problem-specific integration strategies while respecting accuracy constraints, thereby opening a pathway toward adaptive, self-optimizing workflows for multiphysics systems with spatially heterogeneous stiffness.
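A Lagrangian reward with online multiplier adaptation is commonly implemented as projected dual ascent: the multiplier grows while the accuracy constraint is violated and decays (down to zero) while it is satisfied. The sketch below shows that generic pattern only; the reward form, step size, and error values are assumptions and are not taken from the paper.

```python
def lagrangian_reward(cost, error, tolerance, lam):
    """Reward = negative compute cost minus a multiplier-weighted constraint term."""
    return -cost - lam * (error - tolerance)

def update_multiplier(lam, error, tolerance, step_size=0.1):
    """Projected dual ascent: raise lam on violation, never let it go negative."""
    return max(0.0, lam + step_size * (error - tolerance))

lam = 0.0
tolerance = 1e-3                      # hypothetical user-prescribed accuracy tolerance
history = []
for error in [5e-3, 4e-3, 2e-3, 5e-4, 2e-4]:   # hypothetical per-episode errors
    reward = lagrangian_reward(cost=1.0, error=error, tolerance=tolerance, lam=lam)
    history.append((lam, reward))
    lam = update_multiplier(lam, error, tolerance, step_size=100.0)
```

With this scheme the penalty weight does not need hand tuning: it settles wherever the policy just meets the tolerance.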
Authors: Hochan Son, Xiaofeng Lin, Jason Ni, Guang Cheng
Abstract: Deep generative models for tabular data (GANs, diffusion models, and LLM-based generators) exhibit highly non-uniform behavior across datasets; the best-performing synthesizer family depends strongly on distributional stressors such as long-tailed marginals, high-cardinality categoricals, Zipfian imbalance, and small-sample regimes. This brittleness makes practical deployment challenging, especially when users must balance competing objectives of fidelity, privacy, and utility. We study intent-conditioned tabular synthesis selection: given a dataset and a user intent expressed as a preference over evaluation metrics, the goal is to select a synthesizer that minimizes regret relative to an intent-specific oracle. We propose stress profiling, a synthesis-specific meta-feature representation that quantifies dataset difficulty along four interpretable stress dimensions, and integrate it into SYNTHONY, a selection framework that matches stress profiles against a calibrated capability registry of synthesizer families. Across a benchmark of 7 datasets, 10 synthesizers, and 3 intents, we demonstrate that stress-based meta-features are highly predictive of synthesizer performance: a $k$NN selector using these features achieves strong Top-1 selection accuracy, substantially outperforming zero-shot LLM selectors and random baselines. We analyze the gap between meta-feature-based and capability-based selection, identifying the hand-crafted capability registry as the primary bottleneck and motivating learned capability representations as a direction for future work.
Authors: Huseyin Tuna Erdinc, Ipsita Bhar, Rafael Orozco, Thales Souza, Felix J. Herrmann
Abstract: Recent advances in generative networks have enabled new approaches to subsurface velocity model synthesis, offering a compelling alternative to traditional methods such as Full Waveform Inversion. However, these approaches predominantly rely on the availability of large-scale datasets of high-quality, geologically realistic subsurface velocity models, which are often difficult to obtain in practice. We introduce SAGE, a novel framework for statistically consistent proxy velocity generation from incomplete observations, specifically sparse well logs and migrated seismic images. During training, SAGE learns a proxy posterior over velocity models conditioned on both modalities (wells and seismic); at inference, it produces full-resolution velocity fields conditioned solely on migrated images, with well information implicitly encoded in the learned distribution. This enables the generation of geologically plausible and statistically accurate velocity realizations. We validate SAGE on both synthetic and field datasets, demonstrating its ability to capture complex subsurface variability under limited observational constraints. Furthermore, samples drawn from the learned proxy distribution can be leveraged to train downstream networks, supporting inversion workflows. Overall, SAGE provides a scalable and data-efficient pathway toward learning geological proxy posterior for seismic imaging and inversion. Repo link: https://github.com/slimgroup/SAGE.
Authors: Anurag Kumar, Raghuveer Peri, Jon Burnsky, Alexandru Nelus, Rohit Paturi, Srikanth Vishnubhotla, Yanjun Qi
Abstract: Multimodal large-language models (MLLMs) often experience degraded safety alignment when harmful queries exploit cross-modal interactions. Models aligned on text alone show a higher rate of successful attacks when extended to two or more modalities. In this work, we propose a simple conditional decoding strategy, CASA (Classification Augmented with Safety Attention), which uses internal representations of MLLMs to predict a binary safety token before response generation. We introduce a novel safety attention module designed to enhance the model's ability to detect malicious queries. Our design ensures robust safety alignment without relying on any external classifier or auxiliary head, and without the need for modality-specific safety fine-tuning. On diverse benchmarks such as MM-SafetyBench, JailbreakV-28k, and adversarial audio tests, CASA lowers the average attack success rate by more than 97% across modalities and across attack types. Our empirical evaluations also show that CASA maintains strong utility on benign inputs, a result validated through both automated and human evaluations (via 13 trained annotators). Together, these results highlight CASA as a simple and generalizable framework to improve multimodal LLM safety.
Authors: Aengus Lynch
Abstract: Autonomous AI agents are being deployed with filesystem access, email control, and multi-step planning. This thesis contributes to four open problems in AI safety: understanding dangerous internal computations, removing dangerous behaviors once embedded, testing for vulnerabilities before deployment, and predicting when models will act against deployers. ACDC automates circuit discovery in transformers, recovering all five component types from prior manual work on GPT-2 Small by selecting 68 edges from 32,000 candidates in hours rather than months. Latent Adversarial Training (LAT) removes dangerous behaviors by optimizing perturbations in the residual stream to elicit failure modes, then training under those perturbations. LAT solved the sleeper agent problem where standard safety training failed, matching existing defenses with 700x fewer GPU hours. Best-of-N jailbreaking achieves 89% attack success on GPT-4o and 78% on Claude 3.5 Sonnet through random input augmentations. Attack success follows power law scaling across text, vision, and audio, enabling quantitative forecasting of adversarial robustness. Agentic misalignment tests whether frontier models autonomously choose harmful actions given ordinary goals. Across 16 models, agents engaged in blackmail (96% for Claude Opus 4), espionage, and actions causing death. Misbehavior rates rose from 6.5% to 55.1% when models stated scenarios were real rather than evaluations. The thesis does not fully resolve any of these problems but makes each tractable and measurable.
Authors: Yagiz Ihlamur
Abstract: Predicting startup success from founder career data is hard. The signal is weak, the labels are rare (9%), and most founders who succeed look almost identical to those who fail. We engineer 28 structured features directly from raw JSON fields -- jobs, education, exits -- and combine them with a deterministic rule layer and XGBoost boosted stumps. Our model achieves Val F0.5 = 0.3030, Precision = 0.3333, Recall = 0.2222 -- a +17.7pp improvement over the zero-shot LLM baseline. We then run a controlled experiment: extract 9 features from the prose field using Claude Haiku, at 67% and 100% dataset coverage. LLM features capture 26.4% of model importance but add zero CV signal (delta = -0.05pp). The reason is structural: anonymised_prose is generated from the same JSON fields we parse directly -- it is a lossy re-encoding, not a richer source. The ceiling (CV ~= 0.25, Val ~= 0.30) reflects the information content of this dataset, not a modeling limitation. In characterizing where the signal runs out and why, this work functions as a benchmark diagnostic -- one that points directly to what a richer dataset would need to include.
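The reported validation metrics can be cross-checked against the standard F-beta definition, F_beta = (1 + beta^2) * P * R / (beta^2 * P + R). The short check below assumes the rounded values 0.3333 and 0.2222 correspond to the exact fractions 1/3 and 2/9; that reading is an assumption, not something stated in the abstract.

```python
def f_beta(precision, recall, beta):
    """Standard F-beta score; beta < 1 weights precision more heavily than recall."""
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Assumed exact values behind the rounded Precision = 0.3333 and Recall = 0.2222.
score = f_beta(precision=1 / 3, recall=2 / 9, beta=0.5)
print(round(score, 4))
```

Under that assumption the formula gives 10/33, which agrees with the reported Val F0.5 of 0.3030 to four decimals.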
Authors: Ankit Grover, Lodovico Giaretta, Rémi Bourgerie, Sarunas Girdzijauskas
Abstract: The integration of Graph Neural Networks (GNNs) with Large Language Models (LLMs) has emerged as a promising paradigm for Graph Question Answering (GraphQA). However, effective methods for encoding complex structural information into the LLM's latent space remain an open challenge. Current state-of-the-art architectures, such as G-Retriever, typically rely on standard GNNs and aggressive mean pooling to compress entire graph substructures into a single token, creating a severe information bottleneck. This work mitigates this bottleneck by investigating two orthogonal strategies: (1) increasing the bandwidth of the graph-to-LLM interface via multi-token pooling, and (2) enhancing the semantic quality of the graph encoder via global attention mechanisms. We evaluate a suite of hierarchical pruning and clustering-based pooling operators including Top-k, SAGPool, DiffPool, MinCutPool, and Virtual Node Pooling (VNPool) to project graph data into multiple learnable tokens. Empirically, we demonstrate that while pooling introduces significant instability during soft prompt tuning, the application of Low-Rank Adaptation (LoRA) effectively stabilizes specific hierarchical projections (notably VNPool and pruning methods), though dense clustering operators remain challenging. This stabilization allows compressed representations to rival full-graph baselines (achieving ~73% Hit@1 on WebQSP). Conceptually, we demonstrate that a Graph Transformer with VNPool implementation functions structurally as a single-layer Perceiver IO encoder. Finally, we adapt the FandE (Features and Edges) Score to the generative GraphQA domain. Our analysis reveals that the GraphQA benchmark suffers from representational saturation, where target answers are often highly correlated with isolated node features. The implementation is available at https://github.com/Agrover112/G-Retriever/tree/all_good/
URLs: https://github.com/Agrover112/G-Retriever/tree/all_good/
Authors: Mahammad Valiyev, Jodel Cornelio, Behnam Jafarpour
Abstract: Production optimization in stress-sensitive unconventional reservoirs is governed by a nonlinear trade-off between pressure-driven flow and stress-induced degradation of fracture conductivity and matrix permeability. While higher drawdown improves short-term production, it accelerates permeability loss and reduces long-term recovery. Identifying optimal, time-varying control strategies requires repeated evaluations of fully coupled flow-geomechanics simulators, making conventional optimization computationally expensive. We propose a deep learning-based surrogate optimization framework for high-dimensional well control. Unlike prior approaches that rely on predefined control parameterizations or generic sampling, our method treats well control as a continuous, high-dimensional problem and introduces a problem-informed sampling strategy that aligns training data with trajectories encountered during optimization. A neural network proxy is trained to approximate the mapping between bottomhole pressure trajectories and cumulative production using data from a coupled flow-geomechanics model. The proxy is embedded within a constrained optimization workflow, enabling rapid evaluation of control strategies. Across multiple initializations, the surrogate achieves agreement with full-physics solutions within 2-5 percent, while reducing computational cost by up to three orders of magnitude. Discrepancies are mainly associated with trajectories near the boundary of the training distribution and local optimization effects. This framework shows that combining surrogate modeling with problem-informed sampling enables scalable and reliable optimization for high-dimensional, simulator-based problems, with broader applicability to PDE-constrained systems.
Authors: Saman Khamesian, Sri Harini Balaji, Di Yang Shi, Stephanie M. Carpenter, Daniel E. Rivera, W. Bradley Knox, Peter Stone, Hassan Ghasemzadeh
Abstract: Type 1 Diabetes (T1D) management requires continuous adjustment of insulin and lifestyle behaviors to maintain blood glucose within a safe target range. Although automated insulin delivery (AID) systems have improved glycemic outcomes, many patients still fail to achieve recommended clinical targets, warranting new approaches to improve glucose control in patients with T1D. While reinforcement learning (RL) has been utilized as a promising approach, current RL-based methods focus primarily on insulin-only treatment and do not provide behavioral recommendations for glucose control. To address this gap, we propose GUIDE, an RL-based decision-support framework designed to complement AID technologies by providing behavioral recommendations to prevent abnormal glucose events. GUIDE generates structured actions defined by intervention type, magnitude, and timing, including bolus insulin administration and carbohydrate intake events. GUIDE integrates a patient-specific glucose level predictor trained on real-world continuous glucose monitoring data and supports both offline and online RL algorithms within a unified environment. We evaluate both off-policy and on-policy methods across 25 individuals with T1D using standardized glycemic metrics. Among the evaluated approaches, the CQL-BC algorithm demonstrates the highest average time-in-range, reaching 85.49% while maintaining low hypoglycemia exposures. Behavioral similarity analysis further indicates that the learned CQL-BC policy preserves key structural characteristics of patient action patterns, achieving a mean cosine similarity of 0.87 $\pm$ 0.09 across subjects. These findings suggest that conservative offline RL with a structured behavioral action space can provide clinically meaningful and behaviorally plausible decision support for personalized diabetes management.
Authors: Shihao Li, Jiachen Li, Dongmei Chen
Abstract: We demonstrate that gradient-based data valuation produces curriculum orderings that significantly outperform metadata-based heuristics for training game-theoretic motion planners. Specifically, we apply TracIn gradient-similarity scoring to GameFormer on the nuPlan benchmark and construct a curriculum that weights training scenarios by their estimated contribution to validation loss reduction. Across three random seeds, the TracIn-weighted curriculum achieves a mean planning ADE of $1.704\pm0.029$\,m, significantly outperforming the metadata-based interaction-difficulty curriculum ($1.822\pm0.014$\,m; paired $t$-test $p=0.021$, Cohen's $d_z=3.88$) while exhibiting lower variance than the uniform baseline ($1.772\pm0.134$\,m). Our analysis reveals that TracIn scores and scenario metadata are nearly orthogonal (Spearman $\rho=-0.014$), indicating that gradient-based valuation captures training dynamics invisible to hand-crafted features. We further show that gradient-based curriculum weighting succeeds where hard data selection fails: TracIn-curated 20\% subsets degrade performance by $2\times$, whereas full-data curriculum weighting with the same scores yields the best results. These findings establish gradient-based data valuation as a practical tool for improving sample efficiency in game-theoretic planning.
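TracIn-style scoring estimates a training example's influence as a learning-rate-weighted sum, over saved checkpoints, of the dot product between its loss gradient and a validation gradient. The sketch below illustrates this on a toy linear regression model; the model, the data, and the softmax mapping from scores to curriculum weights are all assumptions for illustration and do not reproduce the GameFormer/nuPlan setup.

```python
import numpy as np

def per_sample_grads(w, X, y):
    """Gradient of 0.5 * (x.w - y)^2 with respect to w, one row per sample."""
    residual = X @ w - y
    return residual[:, None] * X

def tracin_scores(checkpoints, lrs, X_train, y_train, X_val, y_val):
    """Sum over checkpoints of lr * <grad(train sample), mean grad(validation)>."""
    scores = np.zeros(len(X_train))
    for w, lr in zip(checkpoints, lrs):
        g_train = per_sample_grads(w, X_train, y_train)
        g_val = per_sample_grads(w, X_val, y_val).mean(axis=0)
        scores += lr * (g_train @ g_val)
    return scores

def curriculum_weights(scores, temperature=1.0):
    """Hypothetical mapping from scores to sampling weights (softmax over scores)."""
    z = (scores - scores.max()) / temperature
    w = np.exp(z)
    return w / w.sum()

rng = np.random.default_rng(0)
true_w = np.array([1.0, -2.0, 0.5])
X_train = rng.normal(size=(6, 3))
y_train = X_train @ true_w
X_val = rng.normal(size=(4, 3))
y_val = X_val @ true_w
checkpoints = [np.zeros(3), 0.5 * true_w]   # stand-ins for saved training checkpoints
scores = tracin_scores(checkpoints, [0.1, 0.1], X_train, y_train, X_val, y_val)
weights = curriculum_weights(scores)
```

Weighting every sample, rather than keeping only the top-scoring subset, mirrors the abstract's contrast between full-data curriculum weighting and hard selection.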
Authors: Weyl Lu, Chenjie Hao, Yubei Chen
Abstract: Estimated density is often interpreted as indicating how typical a sample is under a model. Yet deep models trained on one dataset can assign higher density to simpler out-of-distribution (OOD) data than to in-distribution test data. We refer to this behavior as the OOD anomaly. Prior work typically studies this phenomenon within a single architecture, detector, or benchmark, implicitly assuming certain canonical densities. We instead separate the trained network from the density estimator built from its representations or outputs. We introduce two estimators: Jacobian-based estimators and autoregressive self-estimators, making density analysis applicable to a wide range of models. Applying this perspective to a range of models, including iGPT, PixelCNN++, Glow, score-based diffusion models, DINOv2, and I-JEPA, we find the same striking regularity that goes beyond the OOD anomaly: lower-complexity samples receive higher estimated density, while higher-complexity samples receive lower estimated density. This ordering appears within a test set and across OOD pairs such as CIFAR-10 and SVHN, and remains highly consistent across independently trained models. To quantify these orderings, we use Spearman rank correlation and find striking agreement both across models and with external complexity metrics. Even when trained only on the lowest-density (most complex) samples, or even a single such sample, the resulting models still rank simpler images as higher density. These observations lead us beyond the original OOD anomaly to a more general conclusion: deep networks consistently favor simple data. Our goal is not to close this question, but to define and visualize it more clearly. We broaden its empirical scope and show that it appears across architectures, objectives, and density estimators.
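Spearman rank correlation measures how well one ordering matches another, which is why it suits a claim about density ordering versus complexity. A minimal tie-free implementation is the Pearson correlation of the ranks; the complexity and log-density values below are hypothetical numbers chosen only to illustrate a perfectly inverted ordering.

```python
import numpy as np

def spearman_rho(a, b):
    """Spearman correlation for tie-free data: Pearson correlation of the ranks."""
    rank_a = np.argsort(np.argsort(a))   # rank of each element, 0 = smallest
    rank_b = np.argsort(np.argsort(b))
    return float(np.corrcoef(rank_a, rank_b)[0, 1])

# Hypothetical per-image values: higher complexity paired with lower estimated density.
complexity = np.array([0.9, 2.5, 1.7, 4.1, 3.3])
log_density = np.array([-1.0, -6.0, -2.5, -40.0, -9.0])
rho = spearman_rho(complexity, log_density)
```

Because only ranks enter the statistic, it is insensitive to the very different scales that density estimates from different models can take.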
Authors: Yaqi Chen, Shixun Huang, Ryan Twemlow, Lei Wang, John Le, Sheng Wang, Willy Susilo, Jun Yan, Jun Shen
Abstract: GNN prompting aims to adapt models across tasks and graphs without requiring extensive retraining. However, most existing graph prompt methods still require task-specific parameter updates and face the issue of generalizing across graphs, limiting their performance and undermining the core promise of prompting. In this work, we introduce a Cross-graph Tuning-free Prompting Framework (CTP), which supports both homogeneous and heterogeneous graphs, can be directly deployed to unseen graphs without further parameter tuning, and thus enables a plug-and-play GNN inference engine. Extensive experiments on few-shot prediction tasks show that, compared to SOTAs, CTP achieves an average accuracy gain of 30.8% and a maximum gain of 54%, confirming its effectiveness and offering a new perspective on graph prompt learning.
Authors: Ravi Ranjan, Utkarsh Grover, Xiaomin Lin, Agoritsa Polyzou
Abstract: Large language models (LLMs) are trained on massive web-scale corpora, raising growing concerns about privacy and copyright. Membership inference attacks (MIAs) aim to determine whether a given example was used during training. Existing LLM MIAs largely rely on output probabilities or loss values and often perform only marginally better than random guessing when members and non-members are drawn from the same distribution. We introduce G-Drift MIA, a white-box membership inference method based on gradient-induced feature drift. Given a candidate (x,y), we apply a single targeted gradient-ascent step that increases its loss and measure the resulting changes in internal representations, including logits, hidden-layer activations, and projections onto fixed feature directions, before and after the update. These drift signals are used to train a lightweight logistic classifier that effectively separates members from non-members. Across multiple transformer-based LLMs and datasets derived from realistic MIA benchmarks, G-Drift substantially outperforms confidence-based, perplexity-based, and reference-based attacks. We further show that memorized training samples systematically exhibit smaller and more structured feature drift than non-members, providing a mechanistic link between gradient geometry, representation stability, and memorization. In general, our results demonstrate that small, controlled gradient interventions offer a practical tool for auditing training-data membership and assessing privacy risks in LLMs.
Authors: Amirhossein Dezhboro, Fateme Maleki, Arman Adibi, Erfan Amini, Jose E. Ramirez-Marquez
Abstract: We study distributed optimization over networks with Byzantine agents that may send arbitrary adversarial messages. We propose \emph{Gradient Tracking with Probabilistic Edge Dropout} (GT-PD), a stochastic gradient tracking method that preserves the convergence properties of gradient tracking under adversarial communication. GT-PD combines two complementary defense layers: a universal self-centered projection that clips each incoming message to a ball of radius $\tau$ around the receiving agent, and a fully decentralized probabilistic dropout rule driven by a dual-metric trust score in the decision and tracking channels. This design bounds adversarial perturbations while preserving the doubly stochastic mixing structure, a property often lost under robust aggregation in decentralized settings. Under complete Byzantine isolation ($p_b=0$), GT-PD converges linearly to a neighborhood determined solely by stochastic gradient variance. For partial isolation ($p_b>0$), we introduce \emph{Gradient Tracking with Probabilistic Edge Dropout and Leaky Integration} (GT-PD-L), which uses a leaky integrator to control the accumulation of tracking errors caused by persistent perturbations and achieves linear convergence to a bounded neighborhood determined by the stochastic variance and the clipping-to-leak ratio. We further show that under two-tier dropout with $p_h=1$, isolating Byzantine agents introduces no additional variance into the honest consensus dynamics. Experiments on MNIST under Sign Flip, ALIE, and Inner Product Manipulation attacks show that GT-PD-L outperforms coordinate-wise trimmed mean by up to 4.3 percentage points under stealth attacks.
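The self-centered projection can be read as clipping each incoming message to a Euclidean ball of radius $\tau$ centered at the receiver's own state. The sketch below implements that reading only; the function name is hypothetical, and the trust-score dropout rule and the tracking channel are not modeled.

```python
import numpy as np

def self_centered_projection(own_state, message, tau):
    """Project a neighbor's message onto the ball of radius tau around own_state."""
    diff = message - own_state
    dist = np.linalg.norm(diff)
    if dist <= tau:
        return message.copy()                 # already inside the ball: keep as is
    return own_state + diff * (tau / dist)    # otherwise pull back to the boundary

x_i = np.array([0.0, 0.0])
honest = np.array([0.3, 0.4])        # distance 0.5 from x_i
byzantine = np.array([30.0, 40.0])   # distance 50 from x_i
tau = 1.0
kept = self_centered_projection(x_i, honest, tau)
clipped = self_centered_projection(x_i, byzantine, tau)
```

Whatever an adversarial neighbor sends, its contribution to the receiver's update is displaced by at most $\tau$, which is what bounds the perturbation.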
Authors: Abrari Noor Hasmi, Haralampos Hatzikirou, Hadi Susanto
Abstract: We propose Lagrangian Descriptors (LDs) as a diagnostic framework for evaluating neural network models of Hamiltonian systems beyond conventional trajectory-based metrics. Standard error measures quantify short-term predictive accuracy but provide little insight into global geometric structures such as orbits and separatrices. Existing evaluation tools in dissipative systems are inadequate for Hamiltonian dynamics due to fundamental differences in the systems. By constructing probability density functions weighted by LD values, we embed geometric information into a statistical framework suitable for information-theoretic comparison. We benchmark physically constrained architectures (SympNet, H\'enonNet, Generalized Hamiltonian Neural Networks) against data-driven Reservoir Computing across two canonical systems. For the Duffing oscillator, all models recover the homoclinic orbit geometry with modest data requirements, though their accuracy near critical structures varies. For the three-mode nonlinear Schr\"odinger equation, however, clear differences emerge: symplectic architectures preserve energy but distort phase-space topology, while Reservoir Computing, despite lacking explicit physical constraints, reproduces the homoclinic structure with high fidelity. These results demonstrate the value of LD-based diagnostics for assessing not only predictive performance but also the global dynamical integrity of learned Hamiltonian models.
Authors: Yiyang Sun, Haiyang Huang, Gaurav Rajesh Parikh, Cynthia Rudin
Abstract: Dimension reduction (DR) is inherently non-unique: multiple embeddings can preserve the structure of high-dimensional data equally well while differing in layout or geometry. In this paper, we formally define the Rashomon set for DR -- the collection of `good' embeddings -- and show how embracing this multiplicity leads to more powerful and trustworthy representations. Specifically, we pursue three goals. First, we introduce PCA-informed alignment to steer embeddings toward principal components, making axes interpretable without distorting local neighborhoods. Second, we design concept-alignment regularization that aligns an embedding dimension with external knowledge, such as class labels or user-defined concepts. Third, we propose a method to extract common knowledge across the Rashomon set by identifying trustworthy and persistent nearest-neighbor relationships, which we use to construct refined embeddings with improved local structure while preserving global relationships. By moving beyond a single embedding and leveraging the Rashomon set, we provide a flexible framework for building interpretable, robust, and goal-aligned visualizations.
Authors: Haoyu Zheng, Yongqiang Zhang, Fangcheng Fu, Xiaokai Zhou, Hao Luo, Hongchao Zhu, Yuanyuan Zhu, Hao Wang, Xiao Yan, Jiawei Jiang
Abstract: To schedule LLM inference, the \textit{shortest job first} (SJF) principle is favorable by prioritizing requests with short output lengths to avoid head-of-line (HOL) blocking. Existing methods usually predict a single output length for each request to facilitate scheduling. We argue that such a \textit{point estimate} does not match the \textit{stochastic} decoding process of LLM inference, where output length is \textit{uncertain} by nature and determined by when the end-of-sequence (EOS) token is sampled. Hence, the output length of each request should be fitted with a distribution rather than a single value. With an in-depth analysis of empirical data and the stochastic decoding process, we observe that output length follows a heavy-tailed distribution and can be fitted with the log-t distribution. On this basis, we propose a simple metric called Tail Inflated Expectation (TIE) to replace the output length in SJF scheduling, which adjusts the expectation of a log-t distribution with its tail probabilities to account for the risk that a request generates long outputs. To evaluate our TIE scheduler, we compare it with three strong baselines, and the results show that TIE reduces the per-token latency by $2.31\times$ for online inference and improves throughput by $1.42\times$ for offline data generation.
Authors: Yunwen Lei, Yufeng Xie
Abstract: Overparameterized neural networks often show a benign overfitting property in the sense of achieving excellent generalization behavior despite the number of parameters exceeding the number of training examples. A promising direction to explain benign overfitting is to relate generalization to the norm of the distance from initialization, motivated by the empirical observation that this distance is often significantly smaller than the norm itself. However, the existing initialization-dependent complexity analyses cannot fully exploit the power of initialization since the associated bounds depend on the spectral norm of the initialization matrix, which can scale as a square-root function of the width and are therefore not effective for overparameterized models. In this paper, we develop the first \emph{fully} initialization-dependent complexity bounds for shallow neural networks with general Lipschitz activation functions, which enjoy a logarithmic dependency on the width. Our bounds depend on the path-norm of the distance from initialization, and are derived by introducing a new peeling technique to handle the challenge that comes with the initialization-dependent constraint. We also develop a lower bound tight up to a constant factor. Finally, we conduct empirical comparisons and show that our generalization analysis implies non-vacuous bounds for overparameterized networks.
Authors: Yaoming Yang, Shuai Wang, Bingdong Li, Peng Yang, Ke Tang
Abstract: Dynamic multi-objective optimization requires continuous tracking of moving Pareto fronts. Existing methods struggle with irregular mutations and data sparsity, primarily facing three challenges: the non-linear coupling of dynamic modes, negative transfer from outdated historical data, and the cold-start problem during environmental switches. To address these issues, this paper proposes a decoupled basis-vector-driven generative framework (DB-GEN). First, to resolve non-linear coupling, the framework employs the discrete wavelet transform to separate evolutionary trajectories into low-frequency trends and high-frequency details. Second, to mitigate negative transfer, it learns transferable basis vectors via sparse dictionary learning rather than directly memorizing historical instances. Recomposing these bases under a topology-aware contrastive constraint constructs a structured latent manifold. Finally, to overcome the cold-start problem, a surrogate-assisted search paradigm samples initial populations from this manifold. Pre-trained on 120 million solutions, DB-GEN performs direct online inference without retraining or fine-tuning. This zero-shot generation process executes in milliseconds, requiring approximately 0.2 seconds per environmental change. Experimental results demonstrate that DB-GEN improves tracking accuracy across various dynamic benchmarks compared to existing algorithms.
Authors: Junxian Wu, Chenghan Fu, Zhanheng Nie, Daoze Zhang, Bowen Wan, Wanxian Guan, Chuan Yu, Jian Xu, Bo Zheng
Abstract: With the rapid growth of e-commerce, exploring general representations rather than task-specific ones has attracted increasing attention. Although recent multimodal large language models (MLLMs) have driven significant progress in product understanding, they are typically employed as feature extractors that implicitly encode product information into global embeddings, thereby limiting their ability to capture fine-grained attributes. Therefore, we argue that leveraging the reasoning capabilities of MLLMs to explicitly model fine-grained product attributes holds significant potential. Nevertheless, achieving this goal remains non-trivial due to several key challenges: (i) long-context reasoning tends to dilute the model's attention to salient information in the raw input; (ii) supervised fine-tuning (SFT) primarily encourages rigid imitation, limiting the exploration of effective reasoning strategies; and (iii) fine-grained details are progressively attenuated during forward propagation. To address these issues, we propose MOON3.0, the first reasoning-aware MLLM-based model for product representation learning. Our method (1) employs a multi-head modality fusion module to adaptively integrate raw signals; (2) incorporates a joint contrastive and reinforcement learning framework to autonomously explore more effective reasoning strategies; and (3) introduces a fine-grained residual enhancement module to progressively preserve local details throughout the network. Additionally, we release a large-scale multimodal e-commerce benchmark MBE3.0. Experimentally, our model demonstrates state-of-the-art zero-shot performance across various downstream tasks on both our benchmark and public datasets.
Authors: Mudit Sharma, Shweta Jain, Vaneet Aggarwal, Ganesh Ghalme
Abstract: We study, for the first time, stochastic dueling bandits over continuous action spaces with Lipschitz structure, where feedback is purely comparative. While dueling bandits and Lipschitz bandits have been studied separately, their combination has remained unexplored. We propose the first algorithm for Lipschitz dueling bandits, using round-based exploration and recursive region elimination guided by an adaptive reference arm. We develop new analytical tools for relative feedback and prove a regret bound of $\tilde O\left(T^{\frac{d_z+1}{d_z+2}}\right)$, where $d_z$ is the zooming dimension of the near-optimal region. Further, our algorithm takes only logarithmic space in terms of the total time horizon, the best achievable by any bandit algorithm over a continuous action space.
Authors: Zifei Xu, Sayeh Sharify, Hesham Mostafa
Abstract: Quantization-aware training (QAT) is typically performed for a single target numeric format, while practical deployments often need to choose numerical precision at inference time based on hardware support or runtime constraints. We study multi-format QAT, where a single model is trained to be robust across multiple quantization formats. We find that multi-format QAT can match single-format QAT at each target precision, yielding one model that performs well overall across different formats, even formats that were not seen during training. To enable practical deployment, we propose the Slice-and-Scale conversion procedure for both MXINT and MXFP that converts a high-precision representation into lower-precision formats without re-training. Building on this, we introduce a pipeline that (i) trains a model with multi-format QAT, (ii) stores a single anchor format checkpoint (MXINT8/MXFP8), and (iii) allows on-the-fly conversion to lower MXINT or MXFP formats at runtime with negligible (or no) additional accuracy degradation. Together, these components provide a practical path to elastic precision scaling and allow selecting the runtime format at inference time across diverse deployment targets.
Authors: Jiabin Lin, Shana Moothedath
Abstract: Multi-task representation learning is an approach that learns shared latent representations across related tasks, facilitating knowledge transfer and improving sample efficiency. This paper introduces a novel approach to multi-task representation learning in linear bandits. We consider a setting with $T$ concurrent linear bandit tasks, each with feature dimension $d$, that share a common latent representation of dimension $r \ll \min\{d,T\}$, capturing their underlying relatedness. We propose a new Optimism in the Face of Uncertainty Linear (OFUL) algorithm that leverages shared low-rank representations to enhance decision-making in a sample-efficient manner. Our algorithm first collects data through an exploration phase, estimates the shared model via spectral initialization, and then conducts OFUL-based learning over a newly constructed confidence set. We provide theoretical guarantees for the confidence set and prove that the unknown reward vectors lie within the confidence set with high probability. We derive cumulative regret bounds and show that the proposed approach achieves $\tilde{O}(\sqrt{drNT})$ regret, a significant improvement over solving the $T$ tasks independently, which results in a regret of $\tilde{O}(dT\sqrt{N})$. We performed numerical simulations to validate the performance of our algorithm for different problem sizes.
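To make the size of the stated improvement concrete, the two bounds can be compared numerically. The sketch below is only illustrative: it drops the constants and logarithmic factors hidden by the $\tilde{O}$ notation, the problem sizes are made up, and $N$ (not defined in the abstract) is assumed to be the number of rounds per task. Under those assumptions the ratio of the two bounds simplifies to $\sqrt{dT/r}$.

```python
import math

def shared_bound(d, r, N, T):
    """Leading term of the shared-representation bound, sqrt(d*r*N*T)."""
    return math.sqrt(d * r * N * T)

def independent_bound(d, N, T):
    """Leading term of the independent-tasks bound, d*T*sqrt(N)."""
    return d * T * math.sqrt(N)

# Hypothetical problem sizes chosen for illustration, with r << min(d, T).
d, r, N, T = 100, 5, 1000, 50

ratio = independent_bound(d, N, T) / shared_bound(d, r, N, T)
print(f"independent / shared = {ratio:.2f}")
print(f"sqrt(d*T/r)          = {math.sqrt(d * T / r):.2f}")
```

The ratio does not depend on $N$, so under these assumptions the gain comes entirely from the feature dimension, the number of tasks, and the latent rank.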
Authors: Xiao Zhang, Juntao Lyu, Tianyu Hu, Qianchuan Zhao, Huimin Ma
Abstract: Large Language Models (LLMs) generalize across tasks via reusable representations and flexible reasoning, yet remain brittle in real deployment under evolving tasks and continual distribution shift. A common approach is Test-Time Adaptation (TTA), but existing TTA methods update models with hand-designed unsupervised objectives over the full parameter space and mostly overlook both the preservation of shared source knowledge and the reliability of adaptation signals. Drawing on molecular signaling cascades of memory updating in Drosophila, we propose Synapse Consolidation (SyCo), a parameter-efficient LLM adaptation method that updates low-rank adapters through Rac1 and MAPK pathways under the guidance of a structured TTA objective driven by problem understanding, process understanding, and a source-domain guardrail. Rac1 confines plasticity to a tail-gradient subspace that is less critical for source knowledge, enabling rapid specialization while preserving source representations. MAPK uses a tiered controller to suppress noisy updates and consolidate useful adaptations under non-stationary streams. To model real deployments with multiple sources and continually emerging tasks, we introduce the Multi-source Open-set Adaptation (MOA) setting, where a model is trained on multiple labeled source tasks and then adapts on open, non-stationary unlabeled test streams that mix seen and unseen tasks with partial overlap in label and intent space. Across 18 NLP datasets and the MOA setting, SyCo consistently outperforms strong baselines, achieving 78.31\% on unseen-task adaptation and 85.37\% on unseen-data shifts.
Authors: Hongyang Yang, Yanxin Zhang, Yang She, Yue Xiao, Hao Wu, Yiyang Zhang, Jiapeng Hou, Rongshan Zhang
Abstract: Housing selection is a high-stakes and largely irreversible decision problem. We study housing consultation as a decision-support interface for housing selection. Existing housing platforms and many LLM-based assistants often reduce this process to ranking or recommendation, resulting in opaque reasoning, brittle multi-constraint handling, and limited guarantees on factuality. We present HabitatAgent, the first LLM-powered multi-agent architecture for end-to-end housing consultation. HabitatAgent comprises four specialized agent roles: Memory, Retrieval, Generation, and Validation. The Memory Agent maintains multi-layer user memory through internal stages for constraint extraction, memory fusion, and verification-gated updates; the Retrieval Agent performs hybrid vector--graph retrieval (GraphRAG); the Generation Agent produces evidence-referenced recommendations and explanations; and the Validation Agent applies multi-tier verification and targeted remediation. Together, these agents provide an auditable and reliable workflow for end-to-end housing consultation. We evaluate HabitatAgent on 100 real user consultation scenarios (300 multi-turn question--answer pairs) under an end-to-end correctness protocol. A strong single-stage baseline (Dense+Rerank) achieves 75% accuracy, while HabitatAgent reaches 95%.
Authors: Axel Giottonini, Thomas Lemmin
Abstract: Molecular dynamics simulations provide detailed trajectories at the atomic level, but extracting interpretable and robust insights from these high-dimensional data remains challenging. In practice, analyses typically rely on a single representation. Here, we show that representation choice is not neutral: it fundamentally shapes the conformational organization, similarity relationships, and apparent transitions inferred from identical simulation data. To complement existing representations, we introduce Orientation features, a geometrically grounded, rotation-aware encoding of protein backbone. We compare it against common descriptions across three dynamical regimes: fast-folding proteins, large-scale domain motions, and protein-protein association. Across these systems, we find that different representations emphasize complementary aspects of conformational space, and that no single representation provides a complete picture of the underlying dynamics. To facilitate systematic comparison, we developed ManiProt, a library for efficient computation and analysis of multiple protein representations. Our results motivate a comparative, representation-aware framework for the interpretation of molecular dynamics simulations.
Authors: Qi Shao, Duxin Chen, Jiawen Chen, Yujie Zeng, Athen Ma, Wenwu Yu, Vito Latora, Wei Lin
Abstract: Predicting the behavior of ultra-large complex systems, from climate to biological and technological networks, is a central unsolved challenge. Existing approaches face a fundamental trade-off: equation discovery methods provide interpretability but fail to scale, while neural networks scale but operate as black boxes and often lose reliability over long times. Here, we introduce the Sparse Identification Graph Neural Network (SIGN), a framework that overcomes this divide by inferring the governing equations of large networked systems from data. By defining symbolic discovery as edge-level information, SIGN decouples the scalability of sparse identification from network size, enabling efficient equation discovery even in large systems. SIGN makes it possible to study networks with over 100,000 nodes while remaining robust to noise, sparse sampling, and missing data. Across diverse benchmark systems, including coupled chaotic oscillators, neural dynamics, and epidemic spreading, it recovers governing equations with high precision and sustains accurate long-term predictions. Applied to a data set of temperature time series measured at 71,987 sea surface positions, SIGN identifies a compact predictive network model and captures large-scale sea surface temperature conditions up to two years in advance. By enabling equation discovery at previously inaccessible scales, SIGN opens a path toward interpretable and reliable prediction of real-world complex systems.
Authors: Mingyang Song, Mao Zheng
Abstract: Knowledge distillation has become a primary mechanism for transferring reasoning and domain expertise from frontier Large Language Models (LLMs) to smaller, deployable students. However, the dominant paradigm remains \textit{off-policy}: students train on static teacher-generated data and never encounter their own errors during learning. This train--test mismatch, an instance of \textit{exposure bias}, causes prediction errors to compound autoregressively at inference time. On-Policy Distillation (OPD) addresses this by letting the student generate its own trajectories and receive teacher feedback on these self-generated outputs, grounding distillation in the theory of interactive imitation learning. Despite rapid growth spanning divergence minimization, reward-guided learning, and self-play, the OPD literature remains fragmented with no unified treatment. This survey provides the first comprehensive overview of OPD for LLMs. We introduce a unified $f$-divergence framework over on-policy samples and organize the landscape along three orthogonal dimensions: \emph{feedback signal} (logit-based, outcome-based, or self-play), \emph{teacher access} (white-box, black-box, or teacher-free), and \emph{loss granularity} (token-level, sequence-level, or hybrid). We systematically analyze representative methods, examine industrial deployments, and identify open problems including distillation scaling laws, uncertainty-aware feedback, and agent-level distillation.
Authors: Marwan Hassani, Tamara Verbeek, Sjoerd van Straten
Abstract: Predictive process monitoring (PPM) focuses on predicting future process trajectories, including next activity predictions. This is crucial in dynamic environments where processes change or face uncertainty. However, current frameworks often assume a static environment, overlooking dynamic characteristics and concept drifts. This results in catastrophic forgetting, where training while focusing merely on new data distribution negatively impacts the performance on previously learned data distributions. Continual learning addresses, among others, the challenges related to mitigating catastrophic forgetting. This paper proposes a novel approach called Continual Next Activity Prediction with Prompts (CNAPwP), which adapts the DualPrompt algorithm for next activity prediction to improve accuracy and adaptability while mitigating catastrophic forgetting. We introduce new datasets with recurring concept drifts, alongside a task-specific forgetting metric that measures the prediction accuracy gap between initial occurrence and subsequent task occurrences. Extensive testing on three synthetic and two real-world datasets representing several setups of recurrent drifts shows that CNAPwP achieves SOTA or competitive results compared to five baselines, demonstrating its potential applicability in real-world scenarios. An open-source implementation of our method, together with the datasets and results, is available at: https://github.com/SvStraten/CNAPwP.
Authors: Sandeep Kumar Samota, Reema Gupta, Snehashish Chakraverty
Abstract: This study examines the challenges of modeling complex and noisy data related to socioeconomic factors over time, with a focus on data from various districts in Odisha, India. Traditional time-series models struggle to capture both trends and variations together in this type of data. To tackle this, a Variational Neural Stochastic Differential Equation (V-NSDE) model is designed that combines the expressive dynamics of Neural SDEs with the generative capabilities of Variational Autoencoders (VAEs). This model uses an encoder and a decoder. The encoder takes the initial observations and district embeddings and translates them into a Gaussian distribution, specified by the mean and log-variance of the first latent state. The obtained latent state then initializes the Neural SDE, which uses neural networks to determine the drift and diffusion functions that govern the continuous-time latent dynamics. These governing functions depend on the time index, latent state, and district embedding, which helps the model learn the characteristics specific to each district. After that, a probabilistic decoder reconstructs the observations from the latent trajectory. The decoder outputs a mean and log-variance for each time step, which parameterize a Gaussian likelihood. The training loss, based on the Evidence Lower Bound (ELBO), adds a KL-divergence regularization term to the negative log-likelihood (NLL). The obtained results demonstrate the effective learning of V-NSDE in recognizing complex patterns over time, yielding realistic outcomes that include clear trends and random fluctuations across different areas.
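The loss described above (a Gaussian negative log-likelihood from decoder means and log-variances, plus a KL regularizer) can be sketched as follows. This is a minimal sketch rather than the authors' implementation: it assumes a diagonal Gaussian posterior over the first latent state and a standard normal prior, neither of which is stated in the abstract, and the function names and the `beta` weight are hypothetical.

```python
import math

def gaussian_nll(x, mean, log_var):
    """Negative log-likelihood of x under independent Gaussians
    parameterized by a mean and a log-variance per entry."""
    return 0.5 * sum(
        lv + (xi - m) ** 2 / math.exp(lv) + math.log(2.0 * math.pi)
        for xi, m, lv in zip(x, mean, log_var)
    )

def kl_to_standard_normal(mean, log_var):
    """KL( N(mean, exp(log_var)) || N(0, 1) ), summed over dimensions.
    The standard normal prior is an assumption made for this sketch."""
    return 0.5 * sum(
        math.exp(lv) + m * m - 1.0 - lv for m, lv in zip(mean, log_var)
    )

def negative_elbo(x, dec_mean, dec_log_var, z_mean, z_log_var, beta=1.0):
    """Reconstruction NLL plus a (hypothetically beta-weighted) KL term."""
    return gaussian_nll(x, dec_mean, dec_log_var) + beta * kl_to_standard_normal(
        z_mean, z_log_var
    )

# Toy values: three observed time steps and a two-dimensional latent state.
loss = negative_elbo(
    x=[0.2, 0.5, 0.9],
    dec_mean=[0.1, 0.6, 1.0],
    dec_log_var=[-1.0, -1.0, -1.0],
    z_mean=[0.3, -0.2],
    z_log_var=[-0.5, 0.1],
)
print(loss)
```

Working with log-variances keeps the variance positive without a constraint, which is presumably why both the encoder and decoder output them.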
Authors: Ritish Shrirao, Aditya Priyadarshi, Raghuram Bharadwaj Diddigi
Abstract: Successor Features (SF) combined with Generalized Policy Improvement (GPI) provide a robust framework for transfer learning in Reinforcement Learning (RL) by decoupling environment dynamics from reward functions. However, standard SF learning methods typically rely on semi-gradient Temporal Difference (TD) updates. When combined with non-linear function approximation, semi-gradient methods lack robust convergence guarantees and can lead to instability, particularly in the multi-task setting where accurate feature estimation is critical for effective GPI. Inspired by Full Gradient DQN, we propose Full-Gradient Successor Feature Representations Q-Learning (FG-SFRQL), an algorithm that optimizes the successor features by minimizing the full Mean Squared Bellman Error. Unlike standard approaches, our method computes gradients with respect to parameters in both the online and target networks. We provide a theoretical proof of almost-sure convergence for FG-SFRQL and demonstrate empirically that minimizing the full residual leads to superior sample efficiency and transfer performance compared to semi-gradient baselines in both discrete and continuous domains.
Authors: Josephine Westermann, Benno Huber, Thomas O'Leary-Roseberry, Jakob Zech
Abstract: We consider the problem of constructing surrogate operators for parameter-to-solution maps arising from parametric partial differential equations, where repeated forward model evaluations are computationally expensive. We present a systematic empirical comparison of neural operator surrogates, including a reduced-basis neural operator trained with $L^2_\mu$ and $H^1_\mu$ objectives and the Fourier neural operator, against polynomial surrogate methods, specifically a reduced-basis sparse-grid surrogate and a reduced-basis tensor-train surrogate. All methods are evaluated on a linear parametric diffusion problem and a nonlinear parametric hyperelasticity problem, using input fields with algebraically decaying spectral coefficients at varying rates of decay $s$. To enable fair comparisons, we analyze ensembles of surrogate models generated by varying hyperparameters and compare the resulting Pareto frontiers of cost versus approximation accuracy, decomposing cost into contributions from data generation, setup, and evaluation. Our results show that no single method is universally superior. Polynomial surrogates achieve substantially better data efficiency for smooth input fields ($s \geq 2$), with convergence rates for the sparse-grid surrogate in agreement with theoretical predictions. For rough inputs ($s \leq 1$), the Fourier neural operator displays the fastest convergence rates. Derivative-informed training consistently improves data efficiency over standard $L^2_\mu$ training, providing a competitive alternative for rough inputs in the low-data regime when Jacobian information is available at reasonable cost. These findings highlight the importance of matching the surrogate methodology to the regularity of the problem as well as accuracy demands and computational constraints of the application.
Authors: Yu Xia, Canwen Xu, Zhewei Yao, Julian McAuley, Yuxiong He
Abstract: Group Relative Policy Optimization (GRPO) is widely used for reinforcement learning with verifiable rewards, but it often suffers from advantage collapse: when all rollouts in a group receive the same reward, the group yields zero relative advantage and thus no learning signal. For example, if a question is too hard for the reasoner, all sampled rollouts can be incorrect and receive zero reward. Recent work addresses this issue by adding hints or auxiliary scaffolds to such hard questions so that the reasoner produces mixed outcomes and recovers a non-zero update. However, existing hints are usually fixed rather than adapted to the current reasoner, and a hint that creates learning signal under the hinted input does not necessarily improve the no-hint policy used at test time. To this end, we propose Hint Learning for Reinforcement Learning (HiLL), a framework that jointly trains a hinter policy and a reasoner policy during RL. For each hard question, the hinter generates hints online conditioned on the current reasoner's incorrect rollout, allowing hint generation to adapt to the reasoner's evolving errors. We further introduce hint reliance, which measures how strongly correct hinted trajectories depend on the hint. We derive a transferability result showing that lower hint reliance implies stronger transfer from hinted success to no-hint success, and we use this result to define a transfer-weighted reward for training the hinter. Therefore, HiLL favors hints that not only recover informative GRPO groups, but also produce signals that are more likely to improve the original no-hint policy. Experiments across multiple benchmarks show that HiLL consistently outperforms GRPO and prior hint-based baselines, demonstrating the value of adaptive and transfer-aware hint learning for RL. The code is available at https://github.com/Andree-9/HiLL.
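Advantage collapse can be seen directly in the group-relative normalization. The sketch below assumes the commonly used form in which each rollout's reward is standardized by the group mean and standard deviation; the function name and the `eps` constant are illustrative and not taken from the paper.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one rollout group.
    eps only guards against division by zero."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Mixed outcomes: correct rollouts get positive advantage, incorrect negative.
mixed = group_relative_advantages([1.0, 0.0, 0.0, 1.0])

# A question that is too hard: every rollout fails, every advantage is zero,
# so the group contributes no policy-gradient signal.
collapsed = group_relative_advantages([0.0, 0.0, 0.0, 0.0])

print(mixed)
print(collapsed)
```

The same collapse happens when every rollout succeeds, since only the spread within the group matters.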
Authors: Anton Altenbernd, Philipp Wiesner, Odej Kao
Abstract: As Large Language Models (LLMs) scale in size and complexity, the consequences of failures during training become increasingly severe. A major challenge arises from Silent Data Corruption (SDC): hardware-induced faults that bypass system-level detection mechanisms. SDC may behave like benign numerical noise, but can also cause harmful gradient corruption that leads to loss spikes, divergence, or stalled progress. This work provides a controlled study of how intermittent SDC affects LLM pretraining. Using targeted fault injection at the level of GPU matrix-multiply instructions, we characterize the sensitivity of different bit positions, kernel functions, and execution stages. Our analysis shows that locally originating faults can produce impactful corruption, including NaN propagation, short-lived spikes in loss, gradient norm, and attention logits, as well as persistent parameter divergence. Building on the observed corruption signatures, we propose a lightweight detection method that identifies potentially harmful parameter updates. Experiments on LLaMA models with 60M, 350M, and 1.3B parameters demonstrate that recomputing the most recent training step upon detection can effectively mitigate the impact of these events.
Authors: Bj\"orn Roman Kohlberger (EctoSpace, Dublin, Ireland)
Abstract: The memory wall remains the primary bottleneck for training large language models on consumer hardware. We introduce Spectral Compact Training (SCT), a method that replaces dense weight matrices with permanent truncated SVD factors W = U diag(s) V^T, where the full dense matrix is never materialized during training or inference. Gradients flow through the compact spectral factors via standard backpropagation, and U, V are retracted to the Stiefel manifold via QR decomposition after each optimizer step. SCT achieves up to 199x memory reduction per MLP layer at rank 32, enabling full training steps of 70B-parameter architectures on a Steam Deck handheld (7.2 GB peak memory vs. 1,245 GB for dense FP32 training with Adam). Rank-sweep experiments on SmolLM2-1.7B (ranks 32-256, 2000 steps, NVIDIA A100) show that all tested ranks converge to the same loss floor (~4.2-4.5), identifying the learning rate schedule -- not MLP rank -- as the primary bottleneck. Rank 128 emerges as the efficiency sweet spot at 11.7x MLP compression with the lowest perplexity. GPU memory drops 46% at rank 32 while training throughput doubles.
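The factorized layer and the retraction step can be illustrated with NumPy. This is a minimal sketch under stated assumptions, not the SCT implementation: the layer sizes and rank are toy values, the "optimizer step" is simulated with random noise, and the sign fix after QR is a common convention that the abstract does not mention.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, rank = 64, 48, 4  # toy sizes, chosen only for illustration

# Compact factors of W = U diag(s) V^T with orthonormal columns in U and V.
U, _ = np.linalg.qr(rng.standard_normal((d_out, rank)))
V, _ = np.linalg.qr(rng.standard_normal((d_in, rank)))
s = rng.uniform(0.5, 1.5, size=rank)

def compact_forward(x, U, s, V):
    """Compute x @ W.T for W = U diag(s) V^T without ever forming W."""
    return ((x @ V) * s) @ U.T

def qr_retract(M):
    """Map M back to a matrix with orthonormal columns via reduced QR.
    Column signs are fixed so that the diagonal of R is non-negative."""
    Q, R = np.linalg.qr(M)
    signs = np.sign(np.diag(R))
    signs[signs == 0] = 1.0
    return Q * signs

x = rng.standard_normal((8, d_in))
y = compact_forward(x, U, s, V)

# Simulated optimizer step: a small perturbation breaks orthonormality,
# and the retraction restores it.
U_new = qr_retract(U + 0.01 * rng.standard_normal(U.shape))

dense_params = d_out * d_in
compact_params = rank * (d_out + d_in + 1)
print(dense_params, compact_params)
```

The forward pass costs two thin matrix products instead of one dense one, and the stored parameter count grows with `rank * (d_out + d_in + 1)` rather than `d_out * d_in`.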
Authors: Sayed Hashim, Frank Soboczenski, Paul Cairns
Abstract: Datasets used in immunotherapy response prediction are typically small in size, as well as diverse in cancer type, drug administered, and sequencer used. Models often drop in performance when tested on patient cohorts that are not included in the training process. Recent work has shown that transformer-based models combined with self-supervised learning generalise better than threshold-based biomarkers, but their performance is still suboptimal. We present BioCOMPASS, an extension of a transformer-based model called COMPASS, that integrates biomarkers and treatment information to further improve its generalisability. Instead of feeding biomarker data as input, we built loss components to align them with the model's intermediate representations. We found that components such as treatment gating and pathway consistency loss improved generalisability when evaluated with Leave-one-cohort-out, Leave-one-cancer-type-out and Leave-one-treatment-out strategies. Results show that building components that exploit biomarker and treatment information can improve the generalisability of immunotherapy response prediction. Careful curation of additional components that leverage complementary clinical information and domain knowledge represents a promising direction for future research.
Authors: Lala Shakti Swarup Ray, Mengxi Liu, Alcina Pinto, Deepika Gurung, Daniel Geissler, Paul Lukowoicz, Bo Zhou
Abstract: Wearable HAR has improved steadily, but most progress still relies on closed-set classification, which limits real-world use. In practice, human activity is open-ended, unscripted, personalized, and often compositional, unfolding as narratives rather than instances of fixed classes. We argue that addressing this gap does not require simply scaling datasets or models. It requires a fundamental shift in how wearable HAR is formulated, supervised, and evaluated. This work shows how to model open-ended activity narratives by aligning wearable sensor data with natural-language descriptions in an open-vocabulary setting. Our framework has three core components. First, we introduce a naturalistic data collection and annotation pipeline that combines multi-position wearable sensing with free-form, time-aligned narrative descriptions of ongoing behavior, allowing activity semantics to emerge without a predefined vocabulary. Second, we define a retrieval-based evaluation framework that measures semantic alignment between sensor data and language, enabling principled evaluation without fixed classes while also subsuming closed-set classification as a special case. Third, we present a language-conditioned learning architecture that supports sensor-to-text inference over variable-length sensor streams and heterogeneous sensor placements. Experiments show that models trained with fixed-label objectives degrade sharply under real-world variability, while open-vocabulary sensor-language alignment yields robust and semantically grounded representations. Once this alignment is learned, closed-set activity recognition becomes a simple downstream task. Under cross-participant evaluation, our method achieves 65.3% Macro-F1, compared with 31-34% for strong closed-set HAR baselines. These results establish open-ended narrative modeling as a practical and effective foundation for real-world wearable HAR.
Authors: Swapnil Parekh
Abstract: A new generation of language models reasons entirely in continuous hidden states, producing no tokens and leaving no audit trail. We show that this silence creates a fundamentally new attack surface. ThoughtSteer perturbs a single embedding vector at the input layer; the model's own multi-pass reasoning amplifies this perturbation into a hijacked latent trajectory that reliably produces the attacker's chosen answer, while remaining structurally invisible to every token-level defense. Across two architectures (Coconut and SimCoT), three reasoning benchmarks, and model scales from 124M to 3B parameters, ThoughtSteer achieves >=99% attack success rate with near-baseline clean accuracy, transfers to held-out benchmarks without retraining (94-100%), evades all five evaluated active defenses, and survives 25 epochs of clean fine-tuning. We trace these results to a unifying mechanism: Neural Collapse in the latent space pulls triggered representations onto a tight geometric attractor, explaining both why defenses fail and why any effective backdoor must leave a linearly separable signature (probe AUC>=0.999). Yet a striking paradox emerges: individual latent vectors still encode the correct answer even as the model outputs the wrong one. The adversarial information is not in any single vector but in the collective trajectory, establishing backdoor perturbations as a new lens for mechanistic interpretability of continuous reasoning. Code and checkpoints are available.
Authors: Nikita Gabdullin, Ilya Androsov
Abstract: Label prediction in neural networks (NNs) has O(n) complexity, proportional to the number of classes. This holds for classification using fully connected layers and for cosine similarity with a set of class prototypes. In this paper we show that if the NN latent space (LS) geometry is known and possesses specific properties, label prediction complexity can be significantly reduced. This is achieved by associating label prediction with an O(1)-complexity closest-cluster-center search in a vector system used as the target for latent space configuration (LSC). The proposed method only requires finding the indexes of the several largest and smallest values in the embedding vector, making it extremely computationally efficient. We show that the proposed method does not change the computed NN training accuracy. We also measure the time required by different computational stages of NN inference and label prediction on multiple datasets. The experiments show that the proposed method achieves up to 11.6 times overall acceleration over conventional methods. Furthermore, the proposed method has unique properties that allow it to predict the existence of new classes.
Authors: Dharma Teja Vooturi, Dhiraj Kalamkar, Dipankar Das, Bharat Kaul
Abstract: Pretraining Large Language Models (LLMs) from scratch requires a massive amount of compute. The Aurora supercomputer is an exascale machine with 127,488 Intel PVC (Ponte Vecchio) GPU tiles. In this work, we showcase LLM pretraining on Aurora at the scale of thousands of GPU tiles. To this end, we developed Optimus, an in-house training library with support for standard large-model training techniques. Using Optimus, we first pretrained Mula-1B, a 1-billion-parameter dense model, and Mula-7B-A1B, a 7-billion-parameter Mixture of Experts (MoE) model, from scratch on 3072 GPU tiles for the full 4 trillion tokens of the OLMoE-mix-0924 dataset. We then demonstrated model scaling by pretraining three large MoE models, Mula-20B-A2B, Mula-100B-A7B, and Mula-220B-A10B, for 100 billion tokens on the same dataset. On our largest model, Mula-220B-A10B, we pushed the compute scaling from 384 to 12288 GPU tiles and observed a scaling efficiency of around 90% at 12288 GPU tiles. We significantly improved the runtime performance of MoE models using custom GPU kernels for expert computation and a novel EP-Aware sharded optimizer, resulting in training speedups of up to 1.71x. As part of the Optimus library, we also developed a robust set of reliability and fault-tolerance features to improve training stability and continuity at scale.
Authors: Yuchang Jiang, Jan Dirk Wegner, Vivien Sainte Fare Garnot
Abstract: Plant phenology modelling aims to predict the timing of seasonal phases, such as leaf-out or flowering, from meteorological time series. Reliable predictions are crucial for anticipating ecosystem responses to climate change. While phenology modelling has traditionally relied on mechanistic approaches, deep learning methods have recently been proposed as flexible, data-driven alternatives with often superior performance. However, mechanistic models tend to outperform deep networks when data distribution shifts are induced by climate change. Domain Adaptation (DA) techniques could help address this limitation. Yet, unlike standard DA settings, climate change induces a temporal continuum of domains and involves both a covariate and label shift, with warmer records and earlier start of spring. To tackle this challenge, we introduce Mid-feature Rank-adversarial Domain Adaptation (MIRANDA). Whereas conventional adversarial methods enforce domain invariance on final latent representations, an approach that does not explicitly address label shift, we apply adversarial regularization to intermediate features. Moreover, instead of a binary domain-classification objective, we employ a rank-based objective that enforces year-invariance in the learned meteorological representations. On a country-scale dataset spanning 70 years and comprising 67,800 phenological observations of 5 tree species, we demonstrate that, unlike conventional DA approaches, MIRANDA improves robustness to climatic distribution shifts and narrows the performance gap with mechanistic models.
Authors: Yilun Liu, Jinru Han, Sikuan Yan, Volker Tresp, Yunpu Ma
Abstract: Standard Mixture-of-Experts (MoE) models rely on centralized routing mechanisms that introduce rigid inductive biases. We propose Routing-Free MoE, which eliminates all hard-coded centralized designs, including external routers, Softmax, Top-K, and load balancing. Instead, all activation functionality is encapsulated within individual experts and optimized directly through continuous gradient flow, enabling each expert to determine its activation entirely on its own. We introduce a unified adaptive load-balancing framework that simultaneously optimizes both expert-balancing and token-balancing objectives through a configurable interpolation, allowing flexible and customizable resource allocation. Extensive experiments show that Routing-Free MoE consistently outperforms baselines, with better scalability and robustness. We analyze its behavior in detail and offer insights that may facilitate future MoE design and optimization.
Authors: Martin Jaraiz
Abstract: We present experimental results from seven controlled runs of nanoFMT, a Free-Market Algorithm (FMA) orchestrated transformer with dynamic Mixture-of-Experts (MoE) management. The experiments address a fundamental question for advanced LLM development: how should an MoE system manage its expert pool when operating at full capacity under changing data distributions? We demonstrate that cost-penalized fitness metrics, combined with a linear grace period for newborn experts, produce a system that accumulates domain expertise through diversification rather than replacement. The central result is a round-trip domain shift experiment showing 9-11x faster recovery when returning to a previously learned domain, with zero expert births or replacements required. This "molecular memory" effect -- where dormant experts survive and reactivate when their domain returns -- has no analogue in current MoE management approaches. A preliminary cost analysis estimates annual savings of $39.1M and 27.1 GWh energy reduction for an OpenAI-scale provider under a moderate scenario.
Authors: Yuhang Li, Donghyun Lee, Ruokai Yin, Priyadarshini Panda
Abstract: Low-rank decomposition has emerged as an important problem in Large Language Model (LLM) fine-tuning and inference. Through Singular Value Decomposition (SVD), the weight matrix can be factorized into low-rank spaces optimally. A common prior practice is to decompose the weight in the activation-whitened space, which achieves satisfactory results. In this work, we propose Optimal Brain Decomposition LLM (OBD-LLM), which studies the decomposition problem in the model space by utilizing second-order Hessian information. Through a rigorous Kronecker factorization of the Hessian, we show that the decomposition needs to consider both the input and output information of the layer, and that doing so achieves much better decomposition results than input-only methods. Our loss-aware decomposition method involves a bi-directional whitening of the weight matrix. As a result, OBD-LLM is a closed-form solution for the optimal decomposition of weights in the language model. Remarkably, we achieve ~20-40% better results than SVD-LLM, the previous state-of-the-art decomposition method.
Authors: Zhanzhi Lou, Hui Chen, Yibo Li, Qian Wang, Bryan Hooi
Abstract: Test-Time Learning (TTL) enables language agents to iteratively refine their performance through repeated interactions with the environment at inference time. At the core of TTL is an adaptation policy that updates the actor policy based on experience from previous episodes, thereby improving future behavior. Existing methods rely on fixed, hand-crafted adaptation policies rather than optimizing them for downstream improvement. We argue that optimal adaptation policies should be learned from task environments, not hand-engineered based on human intuition. To achieve this, we introduce Meta-TTL, a framework that formulates the discovery of effective adaptation policies as a bi-level optimization problem. Within this framework, the inner loop executes the standard TTL process, measuring how effectively a candidate adaptation policy helps an agent correct errors across sequential episodes. Guided by the agent's performance, the outer loop employs evolutionary search over a diverse distribution of training tasks to iteratively refine the adaptation policy. We evaluate Meta-TTL on Jericho and WebArena-Lite across both in-distribution (ID) and out-of-distribution (OOD) settings, using multiple meta-agent backbones. Results on both benchmarks show that Meta-TTL consistently outperforms hand-crafted baselines, suggesting that the optimized adaptation policy encodes transferable strategies that generalize beyond the training task distribution.
Authors: Huaiyang Wang, Xiaojie Li, Deqing Wang, Haoyi Zhou, Zixuan Huang, Yaodong Yang, Jianxin Li, Yikun Ban
Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has become a central post-training paradigm for improving the reasoning capabilities of large language models. Yet existing methods share a common blind spot: they optimize policies based on instantaneous group-level or batch-level statistics without ever verifying whether the resulting update actually improved the model. This open-loop design -- updating in isolation at each step, guided only by within-group (batch) reward signals -- means optimization can drift or collapse with no mechanism to detect and correct these failures. We argue that the missing ingredient is policy improvement feedback: the ability to measure and optimize inter-iteration progress directly. To this end, we introduce Policy Improvement Reinforcement Learning (PIRL), a framework that replaces surrogate reward maximization with the explicit objective of maximizing cumulative policy improvement across iterations, and prove this temporal objective is perfectly aligned with maximizing final task performance. Building on PIRL, we propose Policy Improvement Policy Optimization (PIPO), which implements closed-loop optimization through retrospective verification. At each iteration, PIPO evaluates whether the previous update yielded genuine improvement against a sliding-window historical baseline, then actively reinforces beneficial updates and suppresses the harmful ones -- transforming an open-loop process into a self-correcting one. We provide theoretical analysis showing that PIPO performs ascent on the PIRL objective in expectation, and experiments on mathematical reasoning benchmarks demonstrate improved stability and performance over GRPO and its variants.
Authors: Aymeric Delefosse, Anastase Charantonis, Dominique Béréziat
Abstract: Machine learning-based weather forecasting models now surpass state-of-the-art numerical weather prediction systems, but training and operating these models at high spatial resolution remains computationally expensive. We present a modular framework that decouples forecasting from spatial resolution by applying learned generative super-resolution as a post-processing step to coarse-resolution forecast trajectories. We formulate super-resolution as a stochastic inverse problem, using a residual formulation to preserve large-scale structure while reconstructing unresolved variability. The model is trained with flow matching exclusively on reanalysis data and is applied to global medium-range forecasts. We evaluate (i) design consistency by re-coarsening super-resolved forecasts and comparing them to the original coarse trajectories, and (ii) high-resolution forecast quality using standard ensemble verification metrics and spectral diagnostics. Results show that super-resolution preserves large-scale structure and variance after re-coarsening, introduces physically consistent small-scale variability, and achieves competitive probabilistic forecast skill at 0.25° resolution relative to an operational ensemble baseline, while requiring only a modest additional training cost compared with end-to-end high-resolution forecasting.
Authors: Zheng Zhang, Cuong C. Nguyen, David Rosewarne, Kevin Wells, Gustavo Carneiro
Abstract: Learning to defer (L2D) enables human-AI cooperation by deciding when an AI system should act autonomously or defer to a human expert. Existing L2D methods, however, assume static human performance, contradicting well-established findings on fatigue-induced degradation. We propose Fatigue-Aware Learning to Defer via Constrained Optimisation (FALCON), which explicitly models workload-varying human performance using psychologically grounded fatigue curves. FALCON formulates L2D as a Constrained Markov Decision Process (CMDP) whose state includes both task features and cumulative human workload, and optimises accuracy under human-AI cooperation budgets via PPO-Lagrangian training. We further introduce FA-L2D, a benchmark that systematically varies fatigue dynamics from near-static to rapidly degrading regimes. Experiments across multiple datasets show that FALCON consistently outperforms state-of-the-art L2D methods across coverage levels, generalises zero-shot to unseen experts with different fatigue patterns, and demonstrates the advantage of adaptive human-AI collaboration over AI-only or human-only decision-making when coverage lies strictly between 0 and 1.
Authors: Antonin Sulc
Abstract: In this work, we study whether enforcing strict compositional structure in sequence embeddings yields meaningful geometric organization when applied to protein-protein interaction networks. Using Event2Vec, an additive sequence embedding model, we train 64-dimensional representations on random walks from the human STRING interactome, and compare against a DeepWalk baseline based on Word2Vec, trained on the same walks. We find that compositional structure substantially improves pathway coherence (30.2$\times$ vs 2.9$\times$ above random), functional analogy accuracy (mean similarity 0.966 vs 0.650), and hierarchical pathway organization, while geometric properties such as norm--degree anticorrelation are shared with or exceeded by the non-compositional baseline. These results indicate that enforced compositionality specifically benefits relational and compositional reasoning tasks in biological networks.
Authors: Haorui Ma, Dennis Frauen, Valentyn Melnychuk, Stefan Feuerriegel
Abstract: Estimation of heterogeneous long-term treatment effects (HLTEs) is widely used for personalized decision-making in marketing, economics, and medicine, where short-term randomized experiments are often combined with long-term observational data. However, HLTE estimation is challenging due to limited overlap in treatment or in observing long-term outcomes for certain subpopulations, which can lead to unstable HLTE estimates with large finite-sample variance. To address this challenge, we introduce the LT-O-learners (Long-Term Orthogonal Learners), a set of novel orthogonal learners for HLTE estimation. The learners are designed for the canonical HLTE setting that combines a short-term randomized dataset $\mathcal{D}_1$ with a long-term historical dataset $\mathcal{D}_2$. The key idea of our LT-O-Learners is to retarget the learning objective by introducing custom overlap weights that downweight samples with low overlap in treatment or in long-term observation. We show that the retargeted loss is equivalent to the weighted oracle loss and satisfies Neyman-orthogonality, which means our learners are robust to errors in the nuisance estimation. We further provide a general error bound for the LT-O-Learners and give the conditions under which quasi-oracle rate can be achieved. Finally, our LT-O-learners are model-agnostic and can thus be instantiated with arbitrary machine learning models. We conduct empirical evaluations on synthetic and semi-synthetic benchmarks to confirm the theoretical properties of our LT-O-Learners, especially the robustness in low-overlap settings. To the best of our knowledge, ours are the first orthogonal learners for HLTE estimation that are robust to low overlap that is common in long-term outcomes.
Authors: Vahan A. Martirosyan, Daniele Malitesta, Hugues Talbot, Jhony H. Giraldo, Fragkiskos D. Malliaros
Abstract: Spectral graph neural networks learn graph filters, but their behavior with increasing depth and polynomial order is not well understood. We analyze these models in the graph Fourier domain, where each layer becomes an element-wise frequency update, separating the fixed spectrum from trainable parameters and making depth and order explicit. In this setting, we show that Gaussian complexity is invariant under the Graph Fourier Transform, which allows us to derive data-dependent, depth, and order-aware generalization bounds together with stability estimates. In the linear case, our bounds are tighter, and on real graphs, the data-dependent term correlates with the generalization gap across polynomial bases, highlighting practical choices that avoid frequency amplification across layers.
Authors: Hsin-Ling Hsu, Min-Yu Chen, Nai-Chia Chen, Yan-Ru Chen, Yi-Ling Chang, Fang Yu
Abstract: Transformer-based NLP models remain vulnerable to adversarial perturbations, yet existing repair methods face a fundamental trade-off: gradient-based approaches offer flexibility but lack verifiability and often overfit; methods that do provide repair guarantees are restricted to the final layer or small networks, significantly limiting the parameter search space available for repair. We present WARP (Weight-Adjusted Repair with Provability), a constraint-based repair framework that extends repair beyond the last layer of Transformer models. WARP formulates repair as a convex quadratic program derived from a first-order linearization of the logit gap, enabling tractable optimization over a high-dimensional parameter space. Under the condition that the first-order approximation holds, this formulation induces three per-sample guarantees: (i) a positive margin constraint ensuring correct classification on repaired inputs, (ii) preservation constraints over a designated remain set, and (iii) a certified robustness radius derived from Lipschitz continuity. To ensure feasibility across varying model architectures, we introduce a sensitivity-based preprocessing step that conditions the optimization landscape accordingly. We further show that the iterative optimization procedure converges to solutions satisfying all repair constraints under mild assumptions. Empirical evaluation on encoder-only Transformers with varying layer architectures validates that these guarantees hold in practice while improving robustness to adversarial inputs. Our results demonstrate that guaranteed, generalizable Transformer repair is achievable through principled constraint-based optimization.
Authors: Jiaqi Wu, Yiqing Sun, Zhigang Yao
Abstract: We introduce a differentially private manifold denoising framework that allows users to exploit sensitive reference datasets to correct noisy, non-private query points without compromising privacy. The method follows an iterative procedure that (i) privately estimates local means and tangent geometry using the reference data under calibrated sensitivity, (ii) projects query points along the privately estimated subspace toward the local mean via corrective steps at each iteration, and (iii) performs rigorous privacy accounting across iterations and queries using $(\varepsilon,\delta)$-differential privacy (DP). Conceptually, this framework brings differential privacy to manifold methods, retaining sufficient geometric signal for downstream tasks such as embedding, clustering, and visualization, while providing formal DP guarantees for the reference data. Practically, the procedure is modular and scalable, separating DP-protected local geometry (means and tangents) from budgeted query-point updates, with a simple scheduler allocating privacy budget across iterations and queries. Under standard assumptions on manifold regularity, sampling density, and measurement noise, we establish high-probability utility guarantees showing that corrected queries converge toward the manifold at a non-asymptotic rate governed by sample size, noise level, bandwidth, and the privacy budget. Simulations and case studies demonstrate accurate signal recovery under moderate privacy budgets, illustrating clear utility-privacy trade-offs and providing a deployable DP component for manifold-based workflows in regulated environments without reengineering privacy systems.
Authors: Ruijie Hao, Longfei Zhang, Yang Dai, Yang Ma, Xingxing Liang, Guangquan Cheng
Abstract: Reinforcement Learning (RL) has proven highly effective in addressing complex control and decision-making tasks. However, most traditional RL algorithms have two limitations. First, the policy is typically parameterized as a diagonal Gaussian distribution, which prevents it from capturing multimodal distributions and makes it difficult to cover the full range of optimal solutions in multi-solution problems. Second, the return is reduced to a mean value, losing its multimodal nature and thus providing insufficient guidance for policy updates. In response to these problems, we propose an RL algorithm termed flow-based policy with distributional RL (FP-DRL). This algorithm models the policy using flow matching, which offers both computational efficiency and the capacity to fit complex distributions. Additionally, it employs a distributional RL approach to model and optimize the entire return distribution, thereby more effectively guiding multimodal policy updates and improving agent performance. Experimental trials on MuJoCo benchmarks demonstrate that the FP-DRL algorithm achieves state-of-the-art (SOTA) performance in most MuJoCo control tasks while exhibiting superior representation capability of the flow policy.
Authors: Nikolai Merkel, Ruben Mayer, Volker Markl, Hans-Arno Jacobsen
Abstract: Graph Neural Networks (GNNs) are widely used for learning on graph-structured data, but scaling GNN training to massive graphs remains challenging. To enable scalable distributed training, graphs are divided into smaller partitions that are distributed across multiple machines such that inter-machine communication is minimized and computational load is balanced. In practice, existing partitioning approaches face a fundamental trade-off between partitioning overhead and partitioning quality. We propose EmbedPart, an embedding-driven partitioning approach that achieves both speed and quality. Instead of operating directly on irregular graph structures, EmbedPart leverages node embeddings produced during the actual GNN training workload and clusters these dense embeddings to derive a partitioning. EmbedPart achieves more than 100x speedup over Metis while maintaining competitive partitioning quality and accelerating distributed GNN training. Moreover, EmbedPart naturally supports graph updates and fast repartitioning, and can be applied to graph reordering to improve data locality and accelerate single-machine GNN training. By shifting partitioning from irregular graph structures to dense embeddings, EmbedPart enables scalable and high-quality graph data optimization.
Authors: Rafael Sojo, Pedro Larrañaga, Concha Bielza
Abstract: This paper introduces two transfer learning methodologies for estimating nonparametric Bayesian networks under scarce data. We propose two algorithms: a constraint-based structure learning method, called PC-stable-transfer learning (PCS-TL), and a score-based method, called hill climbing transfer learning (HC-TL). For each of them, we also define specific metrics to tackle the negative transfer problem, a situation in which transfer learning has a negative impact on the model's performance. For the parameters, we propose a log-linear pooling approach. For the evaluation, we learn kernel density estimation Bayesian networks, a type of nonparametric Bayesian network, and compare their performance with transfer learning against that of the same models learned without it. To do so, we sample data from small, medium, and large synthetic networks and use datasets from the UCI Machine Learning repository. We then add noise and modifications to these datasets to test the methods' ability to avoid negative transfer. Finally, we perform a Friedman test with a Bergmann-Hommel post-hoc analysis to provide statistical evidence of the improved experimental behavior of our methods. PCS-TL and HC-TL thus prove to be reliable algorithms for improving the learning performance of a nonparametric Bayesian network with scarce data, which in real industrial environments implies a reduction in the time required to deploy the network.
Authors: Philip Jordan, Maryam Kamgarpour
Abstract: We study model-based learning of finite-window policies in tabular partially observable Markov decision processes (POMDPs). A common approach to learning under partial observability is to approximate unbounded history dependencies using finite action-observation windows. This induces a finite-state Markov decision process (MDP) over histories, referred to as the superstate MDP. Once a model of this superstate MDP is available, standard MDP algorithms can be used to compute optimal policies, motivating the need for sample-efficient model estimation. Estimating the superstate MDP model is challenging because trajectories are generated by interaction with the original POMDP, creating a mismatch between the sampling process and target model. We propose a model estimation procedure for tabular POMDPs and analyze its sample complexity. Our analysis exploits a connection between filter stability and concentration inequalities for weakly dependent random variables. As a result, we obtain tight sample complexity guarantees for estimating the superstate MDP model from a single trajectory. Combined with value iteration, this yields approximately optimal finite-window policies for the POMDP.
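As a rough illustration of the finite-window construction described in this abstract, the sketch below treats the last `window` action-observation pairs of a single trajectory as a superstate and estimates superstate transition probabilities by empirical counts. The plain count-based estimator, the function name `estimate_superstate_model`, and the toy trajectory are assumptions made for illustration; the abstract does not specify the paper's actual estimation procedure.

```python
from collections import Counter, defaultdict


def estimate_superstate_model(trajectory, window):
    """Count-based transition estimate over finite-window superstates.

    trajectory: list of (action, observation) pairs from one rollout.
    window: number of most recent pairs that define a superstate.
    Returns {(superstate, action): {next_superstate: probability}}.
    """
    counts = defaultdict(Counter)
    for t in range(window, len(trajectory)):
        # Superstate before step t: the previous `window` pairs.
        state = tuple(trajectory[t - window:t])
        action, _ = trajectory[t]
        # Superstate after step t: slide the window forward by one pair.
        next_state = tuple(trajectory[t - window + 1:t + 1])
        counts[(state, action)][next_state] += 1

    model = {}
    for key, nexts in counts.items():
        total = sum(nexts.values())
        model[key] = {s: c / total for s, c in nexts.items()}
    return model


# Hypothetical toy trajectory with two actions and binary observations.
toy_trajectory = [("a", 0), ("a", 1), ("a", 0), ("a", 1), ("a", 0), ("b", 1)]
toy_model = estimate_superstate_model(toy_trajectory, window=1)
```

Because consecutive windows overlap, the samples used here are dependent, which is the kind of difficulty the abstract's analysis of weakly dependent random variables addresses; the sketch makes no attempt to reproduce those guarantees.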
Authors: Zhichen Liu, Tianle Lun, Zhibin Wen, Hao An, Yulin Ou, Jianhui Xu, Hao Zhang, Wenyi Fang, Yang Zheng, Yang Xu
Abstract: The paradigm of scaling Large Language Models (LLMs) in both parameter size and test-time compute has pushed the boundaries of AI capabilities, but it has also made the traditional generative evaluation paradigm prohibitively expensive, so the latency of evaluating an LLM's downstream performance during training becomes unbearable. Meanwhile, simple metrics like training loss (perplexity) are not always correlated with downstream performance, as their trends sometimes diverge from the actual task outcomes. This dilemma calls for a method that is computationally efficient and sufficiently accurate in measuring model capabilities. To address this challenge, we introduce a new in-training evaluation paradigm that uses a lightweight probe for monitoring downstream performance. The probes take the internal representations of LLM checkpoints (during training) as input and directly predict the checkpoint's performance on downstream tasks measured by success probability (i.e., pass@1). We design several probe architectures and validate their effectiveness using OLMo3-7B checkpoints across a diverse set of downstream tasks. The probes can accurately predict a checkpoint's performance (with avg. AUROC$>$0.75), generalize reasonably well across checkpoints (probes trained on earlier checkpoints predict later ones), and reduce the computation latency from $\sim$1 hr (using the conventional generative evaluation method) to $\sim$3 min. In sum, this work presents a practical and scalable in-training downstream evaluation paradigm, enabling a more agile, informed, and efficient LLM development process.
Authors: Jinzhao Li, Nan Jiang, Yexiang Xue
Abstract: Stochastic Multi-Objective Optimization (SMOO) is critical for decision-making that trades off multiple potentially conflicting objectives in uncertain environments. SMOO aims at identifying the Pareto frontier, which contains all mutually non-dominating decisions. The problem is highly intractable due to the embedded probabilistic inference, such as computing marginal probabilities, posterior probabilities, or expectations. Existing methods, such as scalarization, sample average approximation, and evolutionary algorithms, either offer arbitrarily loose approximations or may incur prohibitive computational costs. We propose XOR-SMOO, a novel algorithm that, with probability $1-\delta$, obtains $\gamma$-approximate Pareto frontiers ($\gamma>1$) for SMOO by querying an SAT oracle poly-log times in $\gamma$ and $\delta$. A $\gamma$-approximate Pareto frontier is only below the true frontier by a fixed, multiplicative factor $\gamma$. Thus, XOR-SMOO solves highly intractable SMOO problems (\#P-hard) with only queries to SAT oracles while obtaining tight, constant-factor approximation guarantees. Experiments on real-world road network strengthening and supply chain design problems demonstrate that XOR-SMOO outperforms several baselines in identifying Pareto frontiers that have higher objective values, better coverage of the optimal solutions, and more evenly distributed solutions. Overall, XOR-SMOO significantly enhances the practicality and reliability of SMOO solvers.
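To make the notions of non-domination and a multiplicative approximation factor concrete, here is a minimal sketch for deterministic, already-evaluated objective vectors under maximization. It is not XOR-SMOO and involves no SAT oracle or probabilistic inference. The helper names and the coverage check in `gamma_covers` (one common way to formalize "within a factor $\gamma$ of the frontier", assuming non-negative objectives) are assumptions for illustration.

```python
def dominates(a, b):
    """True if a is at least as good as b on every objective
    and strictly better on at least one (maximization)."""
    return (all(x >= y for x, y in zip(a, b))
            and any(x > y for x, y in zip(a, b)))


def pareto_frontier(points):
    """Keep the points that no other point dominates."""
    return [p for p in points
            if not any(dominates(q, p) for q in points)]


def gamma_covers(candidates, frontier, gamma):
    """Assumed formalization: every frontier point y is matched by some
    candidate x with gamma * x_i >= y_i on all objectives i."""
    return all(
        any(all(gamma * xi >= yi for xi, yi in zip(x, y)) for x in candidates)
        for y in frontier
    )


# Hypothetical objective vectors for five decisions.
decisions = [(1, 5), (2, 4), (3, 3), (2, 2), (1, 1)]
frontier = pareto_frontier(decisions)
```

With these toy values, the single point `(2, 3)` covers the whole frontier for `gamma = 2.0` but not for `gamma = 1.5`, which shows how a smaller factor demands a closer approximation.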
Authors: Kazuya Takabatake, Shotaro Akaho
Abstract: Dependency networks (Heckerman et al., 2000) provide a flexible framework for modeling complex systems with many variables by combining independently learned local conditional distributions through pseudo-Gibbs sampling. Despite their computational advantages over Bayesian and Markov networks, the theoretical foundations of dependency networks remain incomplete, primarily because their model distributions -- defined as stationary distributions of pseudo-Gibbs sampling -- lack closed-form expressions. This paper develops an information-geometric analysis of pseudo-Gibbs sampling, interpreting each sampling step as an m-projection onto a full conditional manifold. Building on this interpretation, we introduce the full conditional divergence and derive an upper bound that characterizes the location of the stationary distribution in the space of probability distributions. We then reformulate both structure and parameter learning as optimization problems that decompose into independent subproblems for each node, and prove that the learned model distribution converges to the true underlying distribution as the number of training samples grows to infinity. Experiments confirm that the proposed upper bound is tight in practice.
Authors: Zhantao Chen, Dongyi He, Jin Fang, Xi Chen, Yishuo Liu, Xiaozhen Zhong, Xuejun Hu
Abstract: As sports training becomes more data-driven, traditional dart coaching based mainly on experience and visual observation is increasingly inadequate for high-precision, goal-oriented movements. Although prior studies have highlighted the importance of release parameters, joint motion, and coordination in dart throwing, most quantitative methods still focus on local variables, single-release metrics, or static template matching. These approaches offer limited support for personalized training and often overlook useful movement variability. This paper presents a data-driven dart training assistance system. The system creates a closed-loop framework spanning motion capture, feature modeling, and personalized feedback. Dart-throwing data were collected in markerless conditions using a Kinect 2.0 depth sensor and an optical camera. Eighteen kinematic features were extracted from four biomechanical dimensions: three-link coordination, release velocity, multi-joint angular configuration, and postural stability. Two modules were developed: a personalized optimal throwing trajectory model that combines historical high-quality samples with the minimum jerk criterion, and a motion deviation diagnosis and recommendation model based on z-scores and hierarchical logic. A total of 2,396 throwing samples from professional and non-professional athletes were collected. Results show that the system generates smooth personalized reference trajectories consistent with natural human movement. Case studies indicate that it can detect poor trunk stability, abnormal elbow displacement, and imbalanced velocity control, then provide targeted recommendations. The framework shifts dart evaluation from deviation from a uniform standard to deviation from an individual's optimal control range, improving personalization and interpretability for darts training and other high-precision target sports.
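The z-score part of the deviation diagnosis mentioned in this abstract can be sketched as follows: each kinematic feature of a new throw is compared with the mean and standard deviation of the same athlete's past high-quality throws. The feature names, the sample values, and the threshold of 2.0 are hypothetical, and the hierarchical recommendation logic is not modeled here.

```python
from statistics import mean, stdev


def zscore_deviations(history, throw, threshold=2.0):
    """Flag features of one throw that deviate from an athlete's own baseline.

    history: dict mapping feature name -> values from past high-quality throws.
    throw: dict mapping feature name -> value for the throw being diagnosed.
    threshold: assumed cutoff on |z|; not taken from the paper.
    Returns a dict of flagged feature name -> z-score.
    """
    flagged = {}
    for name, values in history.items():
        mu = mean(values)
        sigma = stdev(values)  # sample standard deviation
        if sigma == 0:
            continue  # no variability in the baseline, z-score undefined
        z = (throw[name] - mu) / sigma
        if abs(z) > threshold:
            flagged[name] = z
    return flagged


# Hypothetical baseline and new throw for two features.
baseline = {
    "release_velocity": [5.0, 5.2, 4.8, 5.1, 4.9],
    "elbow_displacement": [1.0, 1.1, 0.9, 1.0, 1.0],
}
new_throw = {"release_velocity": 5.05, "elbow_displacement": 1.5}
deviations = zscore_deviations(baseline, new_throw)
```

Measuring deviation against the individual's own baseline rather than a population template mirrors the shift the abstract describes, from a uniform standard to an individual's optimal control range.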
Authors: Xiangpeng Li, Yu-Hsuan Ho, Sam D Brody, Ali Mostafavi
Abstract: This paper argues that AI-enabled analysis of street-view imagery, complemented by performance-gated machine-learning imputation, provides a viable pathway for generating building-specific elevation data at regional scale for flood risk assessment. We develop and apply a three-stage pipeline across 18 areas of interest (AOIs) in Texas that (1) extracts lowest floor elevation (LFE) and the height difference between street grade and the lowest floor (HDSL) from Google Street View imagery using the Elev-Vision framework, (2) imputes missing HDSL values with Random Forest and Gradient Boosting models trained on 16 terrain, hydrologic, geographic, and flood-exposure features, and (3) integrates the resulting elevation dataset with Fathom 1-in-100-year inundation surfaces and USACE depth-damage functions to estimate property-specific interior flood depth and expected loss. Across 12,241 residential structures, street-view imagery was available for 73.4% of parcels and direct LFE/HDSL extraction was successful for 49.0% (5,992 structures). Imputation was retained for 13 AOIs where cross-validated performance was defensible, with selected models achieving R-squared values from 0.159 to 0.974; five AOIs were explicitly excluded from prediction because performance was insufficient. The results show that street-view-based elevation mapping is not universally available for every property, but it is sufficiently scalable to materially improve regional flood-risk characterization by moving beyond hazard exposure to structure-level estimates of interior inundation and expected damage. Scientifically, the study advances LFE estimation from a pilot-scale proof of concept to a regional, end-to-end workflow. Practically, it offers a replicable framework for jurisdictions that lack comprehensive Elevation Certificates but need parcel-level information to support mitigation, planning, and flood-risk management.
Authors: Gleb Rodionov
Abstract: Large language models (LLMs) exhibiting test-time scaling behavior, such as extended reasoning traces and self-verification, have demonstrated remarkable performance on complex, long-term reasoning tasks. However, the robustness of these reasoning behaviors remains underexplored. To investigate this, we conduct a systematic evaluation of multiple reasoning models across three scenarios: (1) problems augmented with lengthy, irrelevant context; (2) multi-turn conversational settings with independent tasks; and (3) problems presented as a subtask within a complex task. We observe an interesting phenomenon: reasoning models tend to produce much shorter reasoning traces (by up to 50%) for the same problem under these context conditions than when the problem is presented in isolation. A finer-grained analysis reveals that this compression is associated with a decrease in self-verification and uncertainty management behaviors, such as double-checking. While this behavioral shift does not compromise performance on straightforward problems, it might affect performance on more challenging tasks. We hope our findings draw additional attention to both the robustness of reasoning models and the problem of context management for LLMs and LLM-based agents.
Authors: Kai Nelson, Tobias Kreiman, Sergey Levine, Aditi S. Krishnapriyan
Abstract: A fundamental challenge in science and engineering is the simulation-to-experiment gap. While we often possess prior knowledge of physical laws, these physical laws can be too difficult to solve exactly for complex systems. Such systems are commonly modeled using simulators, which impose computational approximations. Meanwhile, experimental measurements more faithfully represent the real world, but experimental data typically consists of observations that only partially reflect the system's full underlying state. We propose a data-driven distribution alignment framework that bridges this simulation-to-experiment gap by pre-training a generative model on fully observed (but imperfect) simulation data, then aligning it with partial (but real) observations of experimental data. While our method is domain-agnostic, we ground our approach in the physical sciences by introducing Adversarial Distribution Alignment (ADA). This method aligns a generative model of atomic positions -- initially trained on a simulated Boltzmann distribution -- with the distribution of experimental observations. We prove that our method recovers the target observable distribution, even with multiple, potentially correlated observables. We also empirically validate our framework on synthetic, molecular, and experimental protein data, demonstrating that it can align generative models with diverse observables. Our code is available at https://kaityrusnelson.com/ada/.
Authors: Cai Zhou, Zekai Wang, Menghua Wu, Qianyu Julie Zhu, Flora C. Shi, Chenyu Wang, Ashia Wilson, Tommi Jaakkola, Stephen Bates
Abstract: While test-time scaling has enabled large language models to solve highly difficult tasks, state-of-the-art results come at exorbitant compute costs. These inefficiencies can be attributed to the miscalibration of post-trained language models, and the lack of calibration in popular sampling techniques. Here, we present Online Reasoning Calibration (ORCA), a framework for calibrating the sampling process that draws upon conformal prediction and test-time training. Specifically, we introduce a meta-learning procedure that updates the calibration module for each input. This allows us to provide valid confidence estimates under distributional shift, e.g. in thought patterns that occur across different stages of reasoning, or in prompt distributions between model development and deployment. ORCA not only provides theoretical guarantees on conformal risks, but also empirically shows higher efficiency and generalization across different reasoning tasks. At risk level $\delta=0.1$, ORCA improves Qwen2.5-32B efficiency on in-distribution tasks with savings up to 47.5% with supervised labels and 40.7% with self-consistency labels. Under zero-shot out-of-domain settings, it improves MATH-500 savings from 24.8% of the static calibration baseline to 67.0% while maintaining a low empirical error rate, and the same trend holds across model families and downstream benchmarks. Our code is publicly available at https://github.com/wzekai99/ORCA.
Authors: Prasanjit Dey, Soumyabrata Dev, Angela Meyer, Bianca Schoen-Phelan
Abstract: Accurate air quality forecasting is crucial for protecting public health and guiding environmental policy, yet it remains challenging due to nonlinear spatiotemporal dynamics, wind-driven transport, and distribution shifts across regions. Physics-based models are interpretable but computationally expensive and often rely on restrictive assumptions, whereas purely data-driven models can be accurate but may lack robustness and calibrated uncertainty. To address these limitations, we propose Neural Dynamic Diffusion-Advection Fields (NeuroDDAF), a physics-informed forecasting framework that unifies neural representation learning with open-system transport modeling. NeuroDDAF integrates (i) a GRU-Graph Attention encoder to capture temporal dynamics and wind-aware spatial interactions, (ii) a Fourier-domain diffusion-advection module with learnable residuals, (iii) a wind-modulated latent Neural ODE to model continuous-time evolution under time-varying connectivity, and (iv) an evidential fusion mechanism that adaptively combines physics-guided and neural forecasts while quantifying uncertainty. Experiments on four urban datasets (Beijing, Shenzhen, Tianjin, and Ancona) across 1-3 day horizons show that NeuroDDAF consistently outperforms strong baselines, including AirPhyNet, achieving up to 9.7% reduction in RMSE and 9.4% reduction in MAE on long-term forecasts. On the Beijing dataset, NeuroDDAF attains an RMSE of 41.63 $\mu$g/m$^3$ for 1-day prediction and 48.88 $\mu$g/m$^3$ for 3-day prediction, representing the best performance among all compared methods. In addition, NeuroDDAF improves cross-city generalization and yields well-calibrated uncertainty estimates, as confirmed by ensemble variance analysis and case studies under varying wind conditions.
Authors: Ken M. Nakanishi
Abstract: A core limitation of standard softmax attention is that it does not define a notion of absolute query--key relevance: attention weights are obtained by redistributing a fixed unit mass across all keys according to their relative scores. As a result, relevance is defined only relative to competing keys, and irrelevant keys cannot be explicitly rejected. We introduce Multiscreen, a language-model architecture built around a mechanism we call screening, which enables absolute query--key relevance. Instead of redistributing attention across all keys, screening evaluates each key against an explicit threshold, discarding irrelevant keys and aggregating the remaining keys, thereby removing global competition among keys. Across experiments, Multiscreen achieves comparable validation loss with approximately 40% fewer parameters than a Transformer baseline, enables stable optimization at substantially larger learning rates, maintains strong performance in long-context perplexity, shows little to no degradation in retrieval performance even far beyond the training context length, and reduces inference latency by up to 3.2$\times$ at 100K context length.
Authors: Youssef Mroueh, Carlos Fonseca, Brian Belgodere, David Cox
Abstract: Scientific algorithm discovery is iterative: hypotheses are proposed, implemented, stress-tested, and revised. Current LLM-guided search systems accelerate proposal generation, but often under-represent scientific structure by optimizing code-only artifacts with weak correctness/originality gating. We present CliffSearch, an agentic evolutionary framework in which the core evolution operators (pair selection, crossover, mutation, and review) are implemented as LLM agents, and the loop is designed around three principles: (1) each node is a structured scientific artifact, instantiated in either theory+code or code_only mode, (2) reviewer judgments of correctness and originality are first-class selection gates alongside optimization of the benchmark metric of interest, and (3) mutation is split into exploration and correction pathways with distinct objectives. Exploration mutation imports ideas from adjacent scientific domains to increase novelty, while correction mutation performs targeted evidence-guided repair using reviewer signals over theory, code, benchmark results, and runtime errors. We illustrate the framework on three benchmark-grounded studies: transformer hyper-connection evolution, optimizer discovery on a fixed nanoGPT stack, and a smaller native-optimizer ablation. Across these settings, the same loop supports explicit metric direction, reproducible persistence, and reviewer-gated comparison of discoveries under controlled search conditions. The result is a discovery workflow that prioritizes scientific interpretability and correctness while optimizing task metrics under controlled novelty constraints, rather than maximizing candidate throughput alone. Full run artifacts, interactive visualizations, and exported best nodes for the reported studies are available at https://cliffsearch.ai .
URLs: https://cliffsearch.ai
Authors: Piyush Garg, Diana R. Gergel, Andrew E. Shao, Galen J. Yacalis
Abstract: AI weather prediction has advanced rapidly, yet no unified mathematical framework explains what determines forecast skill. Existing theory addresses specific architectural choices rather than the learning pipeline as a whole, while operational evidence from 2023-2026 demonstrates that training methodology, loss function design, and data diversity matter at least as much as architecture selection. This paper makes two interleaved contributions. Theoretically, we construct a framework rooted in approximation theory on the sphere, dynamical systems theory, information theory, and statistical learning theory that treats the complete learning pipeline (architecture, loss function, training strategy, data distribution) rather than architecture alone. We establish a Learning Pipeline Error Decomposition showing that estimation error (loss- and data-dependent) dominates approximation error (architecture-dependent) at current scales. We develop a Loss Function Spectral Theory formalizing MSE-induced spectral blurring in spherical harmonic coordinates, and derive Out-of-Distribution Extrapolation Bounds proving that data-driven models systematically underestimate record-breaking extremes with bias growing linearly in record exceedance. Empirically, we validate these predictions via inference across ten architecturally diverse AI weather models using NVIDIA Earth2Studio with ERA5 initial conditions, evaluating six metrics across 30 initialization dates spanning all seasons. Results confirm universal spectral energy loss at high wavenumbers for MSE-trained models, rising Error Consensus Ratios showing that the majority of forecast error is shared across architectures, and linear negative bias during extreme events. A Holistic Model Assessment Score provides unified multi-dimensional evaluation, and a prescriptive framework enables mathematical evaluation of proposed pipelines before training.
Authors: Yuxuan Bao, Xingyue Zhang, J. Nathan Kutz
Abstract: Reconstructing full spatio-temporal dynamics from sparse observations in both space and time remains a central challenge in complex systems, as measurements can be spatially incomplete and can also be limited to narrow temporal windows. Yet approximating the complete spatio-temporal trajectory is essential for mechanistic insight and understanding, model calibration, and operational decision-making. We introduce LAPIS-SHRED (LAtent Phase Inference from Short time sequence using SHallow REcurrent Decoders), a modular architecture that reconstructs and/or forecasts complete spatiotemporal dynamics from sparse sensor observations confined to short temporal windows. LAPIS-SHRED operates through a three-stage pipeline: (i) a SHRED model is pre-trained entirely on simulation data to map sensor time-histories into a structured latent space, (ii) a temporal sequence model, trained on simulation-derived latent trajectories, learns to propagate latent states forward or backward in time to span unobserved temporal regions from short observational time windows, and (iii) at deployment, only a short observation window of hyper-sparse sensor measurements from the true system is provided, from which the frozen SHRED model and the temporal model jointly reconstruct or forecast the complete spatiotemporal trajectory. The framework supports bidirectional inference, inherits data assimilation and multiscale reconstruction capabilities from its modular structure, and accommodates extreme observational constraints including single-frame terminal inputs. We evaluate LAPIS-SHRED on six experiments spanning complex spatio-temporal physics: turbulent flows, multiscale propulsion physics, volatile combustion transients, and satellite-derived environmental fields, highlighting a lightweight, modular architecture suited for operational settings where observation is constrained by physical or logistical limitations.
Authors: Shikhar Bharadwaj, Chin-Jou Li, Kwanghee Choi, Eunjung Yeo, William Chen, Shinji Watanabe, David R. Mortensen
Abstract: Phone recognition (PR) is a key enabler of multilingual and low-resource speech processing tasks, yet robust performance remains elusive. Highly performant English-focused models do not generalize across languages, while multilingual models underutilize pretrained representations. It also remains unclear how data scale, architecture, and training objective contribute to multilingual PR. We present PhoneticXEUS -- trained on large-scale multilingual data and achieving state-of-the-art performance on both multilingual (17.7% PFER) and accented English speech (10.6% PFER). Through controlled ablations with evaluations across 100+ languages under a unified scheme, we empirically establish our training recipe and quantify the impact of SSL representations, data scale, and loss objectives. In addition, we analyze error patterns across language families, accented speech, and articulatory features. All data and code are released openly.
Authors: Wanxin Li, Denver McNeney, Nivedita Prabhu, Charlene Zhang, Renee Barr, Matthew Kitching, Khanh Dao Duc, Anthony S. Boyce
Abstract: AI-powered recruitment tools are increasingly adopted in personnel selection, yet they struggle to capture the requisition (req)-specific personal competencies (PCs) that distinguish successful candidates beyond job categories. We propose a large language model (LLM)-based approach to identify and prioritize req-specific PCs from reqs. Our approach integrates dynamic few-shot prompting, reflection-based self-improvement, similarity-based filtering, and multi-stage validation. Applied to a dataset of Program Manager reqs, our approach correctly identifies the highest-priority req-specific PCs with an average accuracy of 0.76, approaching human expert inter-rater reliability, and maintains a low out-of-scope rate of 0.07.
Authors: Kyunghoon Hur, Heeyoung Kwak, Jinsu Jang, Nakhwan Kim, Edward Choi
Abstract: Large-scale EHR prediction across institutions is hindered by substantial heterogeneity in schemas and code systems. Although Common Data Models (CDMs) can standardize records for multi-institutional learning, the manual harmonization and vocabulary mapping are costly and difficult to scale. Text-based harmonization provides an alternative by converting raw EHR into a unified textual form, enabling pooled learning without explicit standardization. However, applying this paradigm to multinational datasets introduces an additional layer of heterogeneity, language, which must be addressed for truly scalable EHR learning. In this work, we investigate multilingual multi-institutional learning for EHR prediction, aiming to enable pooled training across multinational ICU datasets without manual standardization. We compare two practical strategies for handling language barriers: (i) directly modeling multilingual records with multilingual encoders, and (ii) translating non-English records into English via LLM-based word-level translation. Across seven public ICU datasets and ten clinical tasks with multiple prediction windows, translation-based lingual alignment yields more reliable cross-dataset performance than multilingual encoders. The multi-institutional learning model consistently outperforms strong baselines that require manual feature selection and harmonization, and also surpasses single-dataset training. We further demonstrate that the text-based framework with lingual alignment effectively performs transfer learning via few-shot fine-tuning, with additional gains. To our knowledge, this is the first study to aggregate multilingual multinational ICU EHR datasets into one predictive model, providing a scalable path toward language-agnostic clinical prediction and future global multi-institutional EHR research.
Authors: Nabeel Ahmad Saidd
Abstract: Applying reinforcement learning (RL) to foreign exchange (Forex) trading remains challenging because realistic environments, well-defined reward functions, and expressive action spaces must be satisfied simultaneously, yet many prior studies rely on simplified simulators, single scalar rewards, and restricted action representations, limiting both interpretability and practical relevance. This paper presents a modular RL framework designed to address these limitations through three tightly integrated components: a friction-aware execution engine that enforces strict anti-lookahead semantics, with observations at time t, execution at time t+1, and mark-to-market at time t+1, while incorporating realistic costs such as spread, commission, slippage, rollover financing, and margin-triggered liquidation; a decomposable 11-component reward architecture with fixed weights and per-step diagnostic logging to enable systematic ablation and component-level attribution; and a 10-action discrete interface with legal-action masking that encodes explicit trading primitives while enforcing margin-aware feasibility constraints. Empirical evaluation on EURUSD focuses on learning dynamics rather than generalization and reveals strongly non-monotonic reward interactions, where additional penalties do not reliably improve outcomes; the full reward configuration achieves the highest training Sharpe (0.765) and cumulative return (57.09 percent). The expanded action space increases return but also turnover and reduces Sharpe relative to a conservative 3-action baseline, indicating a return-activity trade-off under a fixed training budget, while scaling-enabled variants consistently reduce drawdown, with the combined configuration achieving the strongest endpoint performance.
Authors: Qiaorong S. Yu, Zhaoze Wang, Vijay Balasubramanian
Abstract: Hippocampal place and time cells encode spatial and temporal aspects of experience. Both have the same neural substrate, but have been modeled as having different functions and mechanistic origins: place cells as continuous attractors and time cells as leaky integrators. Here, we show that both types emerge from two dynamical regimes of a single recurrent neural network (RNN) modeling hippocampal CA3 as a predictive autoencoder. The network receives simulated, partially occluded "experience vectors" containing spatial patterns (location-specific activity sampled during environmental traversal) and/or temporal patterns (correlated activity pairs separated by "void" intervals), and is trained to reconstruct missing input. During spatial navigation, the network generates stable attractor-like place fields. But trained on temporally structured inputs, the network produces sequentially broadened fields, recapitulating time cells. By varying spatio-temporal input patterning, we observe hidden units transition smoothly between time cell-like and place cell-like representations. These results suggest a shared origin, but task-driven difference, between place and time cells.
Authors: Ernest Fokoué, Gregory Babbitt, Yuval Levental
Abstract: In Part I of this series, we established a rigorous mathematical isomorphism between ant colony decision-making and random forest learning, demonstrating that variance reduction through decorrelation is a universal principle shared by biological and computational ensembles. Here we turn to the complementary mechanism: bias reduction through adaptive weighting. Just as boosting algorithms sequentially focus on difficult instances, ant colonies dynamically amplify successful foraging paths through pheromone-mediated recruitment. We prove that these processes are mathematically isomorphic, establishing that the fundamental theorem of weak learnability has a direct analog in colony decision-making. We develop a formal mapping between AdaBoost's adaptive reweighting and ant recruitment dynamics, show that the margin theory of boosting corresponds to the stability of quorum decisions, and demonstrate through comprehensive simulation that ant colonies implementing adaptive recruitment achieve the same bias-reduction benefits as boosting algorithms. This completes a unified theory of ensemble intelligence, revealing that both variance reduction (Part I) and bias reduction (Part II) are manifestations of the same underlying mathematical principles governing collective intelligence in biological and computational systems.
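Illustrative sketch (not from the paper): one round of the textbook AdaBoost reweighting that the abstract maps onto ant recruitment. Labels and predictions are assumed to be in {-1, +1}, the weights are assumed to sum to 1, and the weighted error is assumed to lie strictly between 0 and 1. The ant-colony side of the mapping is not modeled here.

```python
import math


def adaboost_reweight(weights, labels, predictions):
    """One AdaBoost round: return (alpha, new normalized weights)."""
    # Weighted error of the current weak learner.
    eps = sum(w for w, y, h in zip(weights, labels, predictions) if y != h)
    # Learner confidence; requires 0 < eps < 1.
    alpha = 0.5 * math.log((1.0 - eps) / eps)
    # Up-weight mistakes (y * h = -1), down-weight correct predictions (y * h = +1).
    raw = [w * math.exp(-alpha * y * h)
           for w, y, h in zip(weights, labels, predictions)]
    total = sum(raw)
    return alpha, [w / total for w in raw]


# Four equally weighted samples; only the last one is misclassified.
weights = [0.25, 0.25, 0.25, 0.25]
labels = [1, -1, 1, -1]
predictions = [1, -1, 1, 1]
alpha, new_weights = adaboost_reweight(weights, labels, predictions)
```

After the update the misclassified samples carry half of the total weight, which is what forces the next weak learner to focus on the difficult instances.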
Authors: Yoav Alon, Cristina David
Abstract: Determining whether a program terminates is a core challenge in program analysis with direct implications for correctness, verification, and security. We investigate whether transformer architectures can recognise termination patterns directly from source code and how their strengths can be amplified through ensembles. To overcome the extreme scarcity of non-terminating examples, we design an ensemble framework of compact transformer encoders, systematically trained with a suite of imbalance-aware loss functions and class-aware sampling techniques. By combining models trained with distinct loss functions, our ensembles achieve substantially stronger performance than any single transformer, outperforming both powerful off-the-shelf LLMs and graph-based methods. Finally, we introduce an attribution pipeline that produces syntax-aware explanations for the termination estimation.
Authors: Christin Pagels, Simon Hacks, Rob Henk Bemthuis
Abstract: Enterprise Architecture Debt (EA Debt) arises from suboptimal design decisions and misaligned components that can degrade an organization's IT landscape over time. Early indicators, Enterprise Architecture Smells (EA Smells), are currently mainly detected manually or only from structured artifacts, leaving much unstructured documentation under-analyzed. This study proposes an approach using a large language model (LLM) to identify and quantify EA Debt in unstructured architectural documentation. Following a design science research approach, we design and evaluate an LLM-based prototype for automated EA Smell detection. The artifact ingests unstructured documents (e.g., process descriptions, strategy papers), applies fine-tuned detection models, and outputs identified smells. We evaluate the prototype through a case study using synthetic yet realistic business documents, benchmarking against a custom GPT-based model. Results show that LLMs can detect multiple predefined EA Smells in unstructured text, with the benchmark model achieving higher precision and processing speed, and the fine-tuned on-premise model offering data protection advantages. The findings highlight opportunities for integrating LLM-based smell detection into EA governance practice.
Authors: Silong Yong, Stephen Sheng, Carl Qi, Xiaojie Wang, Evan Sheehan, Anurag Shivaprasad, Yaqi Xie, Katia Sycara, Yesh Dattatreya
Abstract: Existing robotic foundation policies are trained primarily via large-scale imitation learning. While such models demonstrate strong capabilities, they often struggle with long-horizon tasks due to distribution shift and error accumulation. Although reinforcement learning (RL) can finetune these models, it cannot work well across diverse tasks without manual reward engineering. We propose VLLR, a dense reward framework combining (1) an extrinsic reward from Large Language Models (LLMs) and Vision-Language Models (VLMs) for task progress recognition, and (2) an intrinsic reward based on policy self-certainty. VLLR uses LLMs to decompose tasks into verifiable subtasks and then VLMs to estimate progress to initialize the value function for a brief warm-up phase, avoiding prohibitive inference cost during full training; and self-certainty provides per-step intrinsic guidance throughout PPO finetuning. Ablation studies reveal complementary benefits: VLM-based value initialization primarily improves task completion efficiency, while self-certainty primarily enhances success rates, particularly on out-of-distribution tasks. On the CHORES benchmark covering mobile manipulation and navigation, VLLR achieves up to 56% absolute success rate gains over the pretrained policy, up to 5% gains over state-of-the-art RL finetuning methods on in-distribution tasks, and up to 10% gains on out-of-distribution tasks, all without manual reward engineering. Additional visualizations can be found at https://silongyong.github.io/vllr_project_page/
Authors: Lei Huang, Chuan Qiu, Kuan-Jui Su, Anqi Liu, Yun Gong, Weiqiang Lin, Lindong Jiang, Chen Zhao, Meng Song, Jeffrey Deng, Qing Tian, Zhe Luo, Ping Gong, Hui Shen, Chaoyang Zhang, Hong-Wen Deng
Abstract: Genotype imputation enables dense variant coverage for genome-wide association and risk-prediction studies, yet conventional reference-panel methods remain limited by ancestry bias and reduced rare-variant accuracy. We present Genotype Bidirectional Encoder Representations from Transformers (GenoBERT), a transformer-based, reference-free framework that tokenizes phased genotypes and uses a self-attention mechanism to capture both short- and long-range linkage disequilibrium (LD) dependencies. Benchmarking on two independent datasets including the Louisiana Osteoporosis Study (LOS) and the 1000 Genomes Project (1KGP) across ancestry groups and multiple genotype missingness levels (5-50%) shows that GenoBERT achieves the highest overall accuracy compared to four baseline methods (Beagle5.4, SCDA, BiU-Net, and STICI). At practical sparsity levels (up to 25% missing), GenoBERT attains high overall imputation accuracy ($r^2 \approx 0.98$) across datasets, and maintains robust performance ($r^2 > 0.90$) even at 50% missingness. Experimental results across different ancestries confirm consistent gains across datasets, with resilience to small sample sizes and weak LD. A 128-SNP (single-nucleotide polymorphism) context window (approximately 100 Kb) is validated through LD-decay analyses as sufficient to capture local correlation structures. By eliminating reference-panel dependence while preserving high accuracy, GenoBERT provides a scalable and robust solution for genotype imputation and a foundation for downstream genomic modeling.
Authors: Zhenxuan Li, Meng Huang
Abstract: The low-rank matrix recovery problem seeks to reconstruct an unknown $n_1 \times n_2$ rank-$r$ matrix from $m$ linear measurements, where $m\ll n_1n_2$. This problem has been extensively studied over the past few decades, leading to a variety of algorithms with solid theoretical guarantees. Among these, gradient descent based non-convex methods have become particularly popular due to their computational efficiency. However, these methods typically suffer from two key limitations: a sub-optimal sample complexity of $O((n_1 + n_2)r^2)$ and an iteration complexity of $O(\kappa \log(1/\epsilon))$ to achieve $\epsilon$-accuracy, resulting in slow convergence when the target matrix is ill-conditioned. Here, $\kappa$ denotes the condition number of the unknown matrix. Recent studies show that a preconditioned variant of GD, known as scaled gradient descent (ScaledGD), can significantly reduce the iteration complexity to $O(\log(1/\epsilon))$. Nonetheless, its sample complexity remains sub-optimal at $O((n_1 + n_2)r^2)$. In contrast, a delicate virtual sequence technique demonstrates that the standard GD in the positive semidefinite (PSD) setting achieves the optimal sample complexity $O((n_1 + n_2)r)$, but converges more slowly with an iteration complexity $O(\kappa^2 \log(1/\epsilon))$. In this paper, through a more refined analysis, we show that ScaledGD achieves both the optimal sample complexity $O((n_1 + n_2)r)$ and the improved iteration complexity $O(\log(1/\epsilon))$. Notably, our results extend beyond the PSD setting to the general low-rank matrix recovery problem. Numerical experiments further validate that ScaledGD accelerates convergence for ill-conditioned matrices with the optimal sample complexity.
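Illustrative note (not quoted from the paper): the ScaledGD iteration as it is usually stated in the literature, for a factorization $X = L R^\top$ with $L \in \mathbb{R}^{n_1 \times r}$ and $R \in \mathbb{R}^{n_2 \times r}$. The objective $f$ and step size $\eta$ are left generic here, since the abstract does not specify them; a least-squares loss on the $m$ measurements is the typical assumption.

```latex
\begin{aligned}
L_{t+1} &= L_t - \eta \, \nabla_L f(L_t, R_t) \, \bigl(R_t^\top R_t\bigr)^{-1}, \\
R_{t+1} &= R_t - \eta \, \nabla_R f(L_t, R_t) \, \bigl(L_t^\top L_t\bigr)^{-1}.
\end{aligned}
```

The $r \times r$ preconditioners $(R_t^\top R_t)^{-1}$ and $(L_t^\top L_t)^{-1}$ rescale each factor's gradient step, which is the mechanism credited with removing the dependence on $\kappa$ from the iteration complexity.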
Authors: Pierre Andreoletti (IDP)
Abstract: We study trajectory forecasting under squared loss for time series with weak conditional structure, using highly expressive prediction models. Building on the classical characterization of squared-loss risk minimization, we emphasize regimes in which the conditional expectation of future trajectories is effectively degenerate, leading to trivial Bayes-optimal predictors (flat for prices and zero for returns in standard financial settings). In this regime, increased model expressivity does not improve predictive accuracy but instead introduces spurious trajectory fluctuations around the optimal predictor. These fluctuations arise from the reuse of noise and result in increased prediction variance without any reduction in bias. This provides a process-level explanation for the degradation of Transformer-based forecasts on financial time series. We complement these theoretical results with numerical experiments on high-frequency EUR/USD exchange rate data, analyzing the distribution of trajectory-level forecasting errors. The results show that Transformer-based models yield larger errors than a simple linear benchmark on a large majority of forecasting windows, consistent with the variance-driven mechanism identified by the theory.
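Illustrative note (a standard identity, not quoted from the paper): the classical squared-loss decomposition behind the argument above, for a target $Y$, inputs $X$, and any predictor $f$ with finite second moments. The symbols are chosen for this note.

```latex
\mathbb{E}\bigl[(Y - f(X))^2\bigr]
  = \mathbb{E}\bigl[(Y - \mathbb{E}[Y \mid X])^2\bigr]
  + \mathbb{E}\bigl[(\mathbb{E}[Y \mid X] - f(X))^2\bigr]
```

The first term does not depend on $f$, so the risk is minimized by $f(X) = \mathbb{E}[Y \mid X]$. In the degenerate case $\mathbb{E}[Y \mid X] = c$ for a constant $c$, the excess risk of any predictor is $\mathbb{E}[(f(X) - c)^2]$, so every fluctuation of $f$ around the flat prediction adds error and none removes it.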
Authors: Luca Cattelani, Vittorio Fortino
Abstract: Multi-omic datasets offer opportunities for improved biomarker discovery in cancer research, but their high dimensionality and limited sample sizes make identifying compact and effective biomarker panels challenging. Feature selection in large-scale omics can be efficiently addressed by combining machine learning with genetic algorithms, which naturally support multi-objective optimization of predictive accuracy and biomarker set size. However, genetic algorithms remain relatively underexplored for multi-omic feature selection, where most approaches concatenate all layers into a single feature space. To address this limitation, we introduce Sweeping*, a multi-view, multi-objective algorithm alternating between single- and multi-view optimization. It employs a nested single-view multi-objective optimizer, and for this study we use the genetic algorithm NSGA3-CHS. It first identifies informative biomarkers within each layer, then jointly evaluates cross-layer interactions; these multi-omic solutions guide the next single-view search. Through repeated sweeps, the algorithm progressively identifies compact biomarker panels capturing cross-modal complementary signals. We benchmark five Sweeping* strategies, including hierarchical and concatenation-based variants, using survival prediction on three TCGA cohorts. Each strategy jointly optimizes predictive accuracy and set size, measured via the concordance index and root-leanness. Overall performance and estimation error are assessed through cross hypervolume and Pareto delta under 5-fold cross-validation. Our results show that Sweeping* can improve the accuracy-complexity trade-off when sufficient survival signal is present and that integrating omic layers can enhance survival prediction beyond clinical-only models, although benefits remain cohort-dependent.
Authors: Eugene Lee, Ting-Yu Chang, Jui-Huang Tsai, Jiajie Diao, Chen-Yi Lee
Abstract: The field of computer vision has experienced significant advancements through scalable vision encoders and multimodal pre-training frameworks. However, existing approaches often treat vision encoders and large language models (LLMs) as independent modules, limiting the integration of hierarchical visual features. In this work, we propose HIVE (Hierarchical Pre-Training of Vision Encoders), a novel framework that enhances vision-language alignment by introducing hierarchical cross-attention between the vision encoder and LLM. Unlike conventional methods that flatten image embeddings, HIVE enables structured feature fusion across multiple layers, improving gradient flow and representation learning. To optimize this interaction, we introduce a three-stage training strategy that progressively aligns the vision encoder with the LLM, ensuring stable optimization and effective multimodal fusion. Empirical evaluations demonstrate that HIVE achieves superior performance not only in image classification but also on various vision-language tasks, outperforming self-attention-based methods in benchmarks such as MME, GQA, OK-VQA, and ScienceQA. Our results highlight the benefits of hierarchical feature integration, paving the way for more efficient and expressive vision-language models.
Authors: Sameer Shaik, Zhen Huang, Daniela Stan Raicu, Jacob Furst
Abstract: Detecting software vulnerabilities is critical to ensuring the security and reliability of modern computer systems. Deep neural networks have shown promising results on vulnerability detection, but they lack the capability to capture global contextual information on vulnerable code. To address this limitation, we explore the application of transformers for C/C++ vulnerability detection. We use program slices that encapsulate key syntactic and semantic features of program code, such as API function calls, array usage, pointer manipulations, and arithmetic expressions. By leveraging transformers' capability to capture both local and global contextual information on vulnerable code, our work can identify vulnerabilities accurately. Combined with data balancing and hyperparameter fine-tuning, our work offers a robust and efficient approach to identifying vulnerable code with moderate resource usage and training time.
Authors: Yitao Bai, Thinh T. Doan, Justin Romberg
Abstract: We study the finite-time convergence of projected linear two-time-scale stochastic approximation with constant step sizes and Polyak--Ruppert averaging. We establish an explicit mean-square error bound, decomposing it into two interpretable components, an approximation error determined by the constrained subspace and a statistical error decaying at a sublinear rate, with constants expressed through restricted stability margins and a coupling invertibility condition. These constants cleanly separate the effect of subspace choice (approximation errors) from the effect of the averaging horizon (statistical errors). We illustrate our theoretical results through a number of numerical experiments on both synthetic and reinforcement learning problems.
Authors: Ali Sayedsalehi, Peter C. Rigby, Gregory Mierzwinski
Abstract: Performance regression testing is essential in large-scale continuous-integration (CI) systems, yet executing full performance suites for every commit is prohibitively expensive. Prior work on performance regression prediction and batch testing has shown independent benefits, but each faces practical limitations: predictive models are rarely integrated into CI decision-making, and conventional batching strategies ignore commit-level heterogeneity. We unify these strands by introducing a risk-aware framework that integrates machine-learned commit risk with adaptive batching. Using Mozilla Firefox as a case study, we construct a production-derived dataset of human-confirmed regressions aligned chronologically with Autoland, and fine-tune ModernBERT, CodeBERT, and LLaMA-3.1 variants to estimate commit-level performance regression risk, achieving up to 0.694 ROC-AUC with CodeBERT. The risk scores drive a family of risk-aware batching strategies, including Risk-Aged Priority Batching and Risk-Adaptive Stream Batching, evaluated through realistic CI simulations. Across thousands of historical Firefox commits, our best overall configuration, Risk-Aged Priority Batching with linear aggregation (RAPB-la), yields a Pareto improvement over Mozilla's production-inspired baseline. RAPB-la reduces total test executions by 32.4%, decreases mean feedback time by 3.8%, maintains mean time-to-culprit at approximately the baseline level, reduces maximum time-to-culprit by 26.2%, and corresponds to an estimated annual infrastructure cost savings of approximately $491K under our cost model. These results demonstrate that risk-aware batch testing can reduce CI resource consumption while improving diagnostic timeliness. To support reproducibility and future research, we release a complete replication package containing all datasets, fine-tuning pipelines, and implementations of our batching algorithms.
Authors: Simone Betteti, Luca Laurenti
Abstract: Energy-based models (EBMs) implement inference as gradient descent on a learned Lyapunov function, yielding interpretable, structure-preserving alternatives to black-box neural ODEs and aligning naturally with physical AI. Yet their use in system identification remains limited, and existing architectures lack formal stability guarantees that globally preclude unstable modes. We address this gap by introducing an EBM framework for system identification with stable, dissipative, absorbing invariant dynamics. Unlike classical global Lyapunov stability, absorbing invariance expands the class of stability-preserving architectures, enabling more flexible and expressive EBMs. We extend EBM theory to nonsmooth activations by establishing negative energy dissipation via Clarke derivatives and deriving new conditions for radial unboundedness, exposing a stability-expressivity tradeoff in standard EBMs. To overcome this, we introduce a hybrid architecture with a dynamical visible layer and static hidden layers, prove absorbing invariance under mild assumptions, and show that these guarantees extend to port-Hamiltonian EBMs. Experiments on metric-deformed multi-well and ring systems validate the approach, showcasing how our hybrid EBM architecture combines expressivity with sound and provable safety guarantees by design.
Authors: Yanliang Huang, Peng Xie, Wenyuan Wu, Zhuoqi Zeng, Amr Alanwar
Abstract: We present a data-driven framework for reachability analysis of nonlinear dynamical systems that requires no explicit model. A denoising diffusion probabilistic model learns the time-evolving state distribution of a dynamical system from trajectory data alone. The predicted reachable set takes the form of a sublevel set of a nonconformity score derived from the reconstruction error, with the threshold calibrated via the Learn Then Test procedure so that the probability of excluding a reachable state is bounded with high probability. Experiments on three nonlinear systems, a forced Duffing oscillator, a planar quadrotor, and a high-dimensional reaction-diffusion system, confirm that the empirical miss rate remains below the Probably Approximately Correct (PAC) bound while scaling to state dimensions beyond the reach of classical grid-based and polynomial methods.
Authors: Sahil Kumar, Namrataben Patel, Honggang Wang, Youshan Zhang
Abstract: MambaVoiceCloning (MVC) asks whether the conditioning path of diffusion-based TTS can be made fully SSM-only at inference, removing all attention and explicit RNN-style recurrence layers across text, rhythm, and prosody, while preserving or improving quality under controlled conditions. MVC combines a gated bidirectional Mamba text encoder, a Temporal Bi-Mamba supervised by a lightweight alignment teacher discarded after training, and an Expressive Mamba with AdaLN modulation, yielding linear-time O(T) conditioning with bounded activation memory and practical finite look-ahead streaming. Unlike prior Mamba-TTS systems that remain hybrid at inference, MVC removes attention-based duration and style modules under a fixed StyleTTS2 mel-diffusion-vocoder backbone. Trained on LJSpeech/LibriTTS and evaluated on VCTK, CSS10 (ES/DE/FR), and long-form Gutenberg passages, MVC achieves modest but statistically reliable gains over StyleTTS2, VITS, and Mamba-attention hybrids in MOS/CMOS, F0 RMSE, MCD, and WER, while reducing encoder parameters to 21M and improving throughput by 1.6x. Diffusion remains the dominant latency source, but SSM-only conditioning improves memory footprint, stability, and deployability.
Authors: Fan Wu, Matthias P. N\"agele, Daryush D. Mehta, Elgar Fleisch, Frank Ruschitzka, Andreas J. Flammer, Filipe Barata
Abstract: Objective: This study aimed to evaluate which voice features can predict health deterioration in patients with chronic HF. Background: Heart failure (HF) is a chronic condition with progressive deterioration and acute decompensations, often requiring hospitalization and imposing substantial healthcare and economic burdens. Current standard-of-care (SoC) home monitoring, such as weight tracking, lacks predictive accuracy and requires high patient engagement. Voice is a promising non-invasive biomarker, though prior studies have mainly focused on acute HF stages. Methods: In a 2-month longitudinal study, 32 patients with HF collected daily voice recordings and SoC measures of weight and blood pressure at home, with biweekly questionnaires for health status. Acoustic analysis generated detailed vowel and speech features. Time-series features were extracted from aggregated lookback windows (e.g., 7 days) to predict next-day health status. Explainable machine learning with nested cross-validation identified top vocal biomarkers, and a case study illustrated model application. Results: A total of 21,863 recordings were analyzed. Acoustic vowel features showed strong correlations with health status. Time-series voice features within the lookback window outperformed corresponding standard care measures, achieving peak sensitivity and specificity of 0.826 and 0.782 versus 0.783 and 0.567 for SoC metrics. Key prognostic voice features identifying deterioration included delayed energy shift, low energy variability, and higher shimmer variability in vowels, along with reduced speaking and articulation rate, lower phonation ratio, decreased voice quality, and increased formant variability in speech. Conclusion: Voice-based monitoring offers a non-invasive approach to detect early health changes in chronic HF, supporting proactive and personalized care.
Authors: Marcel Tom\`as Bernal, Neil Rohit Mallinar, Mikhail Belkin
Abstract: Grokking occurs when a model reaches high training accuracy long before it generalizes to unseen test points. This phenomenon was initially observed on a class of algebraic problems, such as learning modular arithmetic (Power et al., 2022). We study grokking on algebraic tasks in a class of feature learning kernels via the Recursive Feature Machine (RFM) algorithm (Radhakrishnan et al., 2024), which iteratively updates feature matrices through the Average Gradient Outer Product (AGOP) of an estimator in order to learn task-relevant features. Our main experimental finding is that generalization occurs only when a certain symmetry in the training set is broken. Furthermore, we empirically show that RFM generalizes by recovering the underlying invariance group action inherent in the data. We find that the learned feature matrices encode specific elements of the invariance group, explaining the dependence of generalization on symmetry.
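For readers unfamiliar with the AGOP, the following sketch computes it for a generic scalar predictor as the average of $\nabla f(x_i)\nabla f(x_i)^\top$ over a set of points. The function names and the finite-difference gradient are assumptions for illustration; RFM applies the same quantity to a kernel predictor, which is not reproduced here.

```python
# Sketch of the Average Gradient Outer Product (AGOP) of a scalar predictor f.
# Assumption: gradients are approximated by central finite differences.

def numerical_gradient(f, x, h=1e-5):
    grad = []
    for k in range(len(x)):
        up, down = list(x), list(x)
        up[k] += h
        down[k] -= h
        grad.append((f(up) - f(down)) / (2 * h))
    return grad

def agop(f, points):
    d = len(points[0])
    result = [[0.0] * d for _ in range(d)]
    for x in points:
        g = numerical_gradient(f, x)
        for i in range(d):
            for j in range(d):
                result[i][j] += g[i] * g[j] / len(points)
    return result

# Hypothetical predictor that ignores its third input coordinate.
def predictor(x):
    return x[0] + 2.0 * x[1]

points = [[0.0, 0.0, 0.0], [1.0, -1.0, 2.0], [0.5, 0.3, -0.7]]
feature_matrix = agop(predictor, points)
```

For this linear predictor the gradient is the constant vector (1, 2, 0), so the resulting matrix has a zero row and column for the ignored coordinate, which is the sense in which the AGOP highlights task-relevant directions.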
Authors: Liyao Lyu, Xinyue Yu, Hayden Schaeffer
Abstract: Collective behaviors that emerge from interactions are fundamental to numerous biological systems. To learn such interacting forces from observations, we introduce a measure-valued neural network that infers measure-dependent interaction (drift) terms directly from particle-trajectory observations. The proposed architecture generalizes standard neural networks to operate on probability measures by learning cylindrical features, using an embedding network that produces scalable distribution-to-vector representations. On the theory side, we establish well-posedness of the resulting dynamics and prove propagation-of-chaos for the associated interacting-particle system. We further show universal approximation and quantitative approximation rates under a low-dimensional measure-dependence assumption. Numerical experiments on first and second order systems, including deterministic and stochastic Motsch-Tadmor dynamics, two-dimensional attraction-repulsion aggregation, Cucker-Smale dynamics, and a hierarchical multi-group system, demonstrate accurate prediction and strong out-of-distribution generalization.
Authors: Borislav Mavrin
Abstract: No one has independently reproduced OpenAI's published scores for gpt-oss-20b with tools, because the original paper discloses neither the tools nor the agent harness. We reverse-engineered the model's in-distribution tools: when prompted without tool definitions, gpt-oss still calls tools from its training distribution with high statistical confidence -- a strong prior, not a hallucination. We then built a native harmony agent harness (https://github.com/borislavmavrin/harmonyagent.git) that encodes messages in the model's native format, bypassing the lossy Chat Completions conversion. Together, these yield the first independent reproduction of OpenAI's published scores: 60.4% on SWE Verified HIGH (published 60.7%), 53.3% MEDIUM (53.2%), and 91.7% on AIME25 with tools (90.4%).
Authors: Wei Sun
Abstract: LLM systems must make control decisions in addition to generating outputs: whether to answer, clarify, retrieve, call tools, repair, or escalate. In many current architectures, these decisions remain implicit within generation, entangling assessment and action in a single model call and making failures hard to inspect, constrain, or repair. We propose a decision-centric framework that separates decision-relevant signals from the policy that maps them to actions, turning control into an explicit and inspectable layer of the system. This separation supports attribution of failures to signal estimation, decision policy, or execution, and enables modular improvement of each component. It unifies familiar single-step settings such as routing and adaptive inference, and extends naturally to sequential settings in which actions alter the information available before acting. Across three controlled experiments, the framework reduces futile actions, improves task success, and reveals interpretable failure modes. More broadly, it offers a general architectural principle for building more reliable, controllable, and diagnosable LLM systems.
Authors: Weizhuo Wang, Yanjie Ze, C. Karen Liu, Monroe Kennedy III
Abstract: We present EgoNav, a system that enables a humanoid robot to traverse diverse, unseen environments by learning entirely from 5 hours of human walking data, with no robot data or finetuning. A diffusion model predicts distributions of plausible future trajectories conditioned on past trajectory, a 360 deg visual memory fusing color, depth, and semantics, and video features from a frozen DINOv3 backbone that capture appearance cues invisible to depth sensors. A hybrid sampling scheme achieves real-time inference in 10 denoising steps, and a receding-horizon controller selects paths from the predicted distribution. We validate EgoNav through offline evaluations, where it outperforms baselines in collision avoidance and multi-modal coverage, and through zero-shot deployment on a Unitree G1 humanoid across unseen indoor and outdoor environments. Behaviors such as waiting for doors to open, navigating around crowds, and avoiding glass walls emerge naturally from the learned prior. We will release the dataset and trained models. Our website: https://egonav.weizhuowang.com
Authors: Xinyu Sun, Wanwei Liu, Haoang Chi, Tingyu Chen, Xiaoguang Mao, Shangwen Wang, Lei Bu, Jingyi Wang, Yang Tan, Zhenyi Qi
Abstract: DNNs are susceptible to defects like backdoors, adversarial attacks, and unfairness, undermining their reliability. Existing approaches mainly involve retraining, optimization, constraint-solving, or search algorithms. However, most methods rely on gradient calculations, restricting applicability to specific activation functions (e.g., ReLU), or use search algorithms with uninterpretable localization and repair. Furthermore, they often lack generalizability across multiple properties. We propose SHARPEN, integrating interpretable fault localization with a derivative-free optimization strategy. First, SHARPEN introduces a Deep SHAP-based localization strategy quantifying each layer's and neuron's marginal contribution to erroneous outputs. Specifically, a hierarchical coarse-to-fine approach reranks layers by aggregated impact, then locates faulty neurons/filters by analyzing activation divergences between property-violating and benign states. Subsequently, SHARPEN incorporates CMA-ES to repair identified neurons. CMA-ES leverages a covariance matrix to capture variable dependencies, enabling gradient-free search and coordinated adjustments across coupled neurons. By combining interpretable localization with evolutionary optimization, SHARPEN enables derivative-free repair across architectures, being less sensitive to gradient anomalies and hyperparameters. We demonstrate SHARPEN's effectiveness on three repair tasks. Balancing property repair and accuracy preservation, it outperforms baselines in backdoor removal (+10.56%), adversarial mitigation (+5.78%), and unfairness repair (+11.82%). Notably, SHARPEN handles diverse tasks, and its modular design is plug-and-play with different derivative-free optimizers, highlighting its flexibility.
Authors: Han Huang, Pakawut Jiradilok, Elchanan Mossel
Abstract: We study the problem of reconstructing the latent geometry of a $d$-dimensional Riemannian manifold from a random geometric graph. While recent works have made significant progress in manifold recovery from random geometric graphs, and more generally from noisy distances, the precision of pairwise distance estimation has been fundamentally constrained by the volumetric barrier, namely the natural sample-spacing scale $n^{-1/d}$ coming from the fact that a generic point of the manifold typically lies at distance of order $n^{-1/d}$ from the nearest sampled point. In this paper, we introduce a novel approach, Orthogonal Ring Distance Estimation Routine (ORDER), which achieves a pointwise distance estimation precision of order $n^{-2/(d+5)}$ up to polylogarithmic factors in $n$ in polynomial time. This strictly beats the volumetric barrier for dimensions $d > 5$. As a consequence of obtaining pointwise precision better than $n^{-1/d}$, we prove that the Gromov--Wasserstein distance between the reconstructed metric measure space and the true latent manifold is of order $n^{-1/d}$. This matches the Wasserstein convergence rate of empirical measures, demonstrating that our reconstructed graph metric is asymptotically as good as having access to the full pairwise distance matrix of the sampled points. Our results are proven in a very general setting which includes general models of noisy pairwise distances, sparse random geometric graphs, and unknown connection probability functions.
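The threshold $d > 5$ follows from comparing the two exponents stated in the abstract: a precision of order $n^{-a}$ is finer when $a$ is larger, so ORDER beats the volumetric barrier exactly when $2/(d+5) > 1/d$. A short check with exact fractions (the function names are illustrative):

```python
from fractions import Fraction

def order_exponent(d):
    # Exponent a in the ORDER precision n^(-a), up to polylog factors.
    return Fraction(2, d + 5)

def volumetric_exponent(d):
    # Exponent of the sample-spacing scale n^(-1/d).
    return Fraction(1, d)

def beats_barrier(d):
    # A larger exponent means a smaller error as n grows.
    return order_exponent(d) > volumetric_exponent(d)

dims_beating_barrier = [d for d in range(1, 11) if beats_barrier(d)]
```

At $d = 5$ the two exponents coincide at $1/5$, which is why the inequality in the abstract is strict.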
Authors: Wonseok Yang, Thinh T. Doan
Abstract: This letter studies multi-agent reinforcement learning in partially observable Markov potential games. Solving this problem is challenging due to partial observability, decentralized information, and the curse of dimensionality. First, to address the first two challenges, we leverage the common information framework, which allows agents to act based on both shared and local information. Second, to ensure tractability, we study an internal state that compresses accumulated information, preventing it from growing unboundedly over time. We then implement an internal state-based natural policy gradient method to find Nash equilibria of the Markov potential game. Our main contribution is to establish a non-asymptotic convergence bound for this method. Our theoretical bound decomposes into two interpretable components: a statistical error term that also arises in standard Markov potential games, and an approximation error capturing the use of finite-state controllers. Finally, simulations across multiple partially observable environments demonstrate that the proposed method using finite-state controllers achieves consistent improvements in performance compared to the setting where only the current observation is used.
Authors: Michael Maynord, Minghui Liu, Cornelia Ferm\"uller, Seongjin Choi, Yuxin Zeng, Shishir Dahal, Daniel M. Harrison
Abstract: Ultra-high field 7-tesla (7T) MRI improves visualization of multiple sclerosis (MS) white matter lesions (WML) but differs sufficiently in contrast and artifacts from 1.5-3T imaging that widely used automated segmentation tools may not translate directly. We analyzed 7T FLAIR scans and generated reference WML masks from Lesion Segmentation Tool (LST) outputs followed by expert manual revision. As external comparators, we applied LST-LPA and the more recent LST-AI ensemble, both originally developed on lower-field data. We then trained 3D UNETR and SegFormer transformer-based models on 7T FLAIR at multiple resolutions (0.5x0.5x0.5 mm^3, 1.0x1.0x1.0 mm^3, and 1.5x1.5x2.0 mm^3) and evaluated all methods using voxel-wise and lesion-wise metrics from the BraTS 2023 framework. On the held-out test set at native 0.5x0.5x0.5 mm^3 resolution, 7T-trained transformers achieved competitive overlap with LST-AI while recovering additional small lesions that were missed by classical methods, at the cost of some boundary variability and occasional artifact-related false positives. On a held-out 7T test set, our best transformer model (SegFormer) achieved a voxel-wise Dice of 0.61 and lesion-wise Dice of 0.20, improving on the classical LST-LPA tool (Dice 0.39, lesion-wise Dice 0.02). Performance decreased for models trained on downsampled images, underscoring the value of native 7T resolution for small-lesion detection. By releasing our 7T-trained models, we aim to provide a reproducible, ready-to-use resource for automated lesion quantification in ultra-high field MS research (https://github.com/maynord/7T-MS-lesion-segmentation).
URLs: https://github.com/maynord/7T-MS-lesion-segmentation
Authors: Yabin Zhang, Chong Wang, Yunhe Gao, Jiaming Liu, Maya Varma, Justin Xu, Sophie Ostmeier, Jin Long, Sergios Gatidis, Seena Dehkharghani, Arne Michalson, Eun Kyoung Hong, Christian Bluethgen, Haiwei Henry Guo, Alexander Victor Ortiz, Stephan Altmayer, Sandhya Bodapati, Joseph David Janizek, Ken Chang, Jean-Benoit Delbrouck, Akshay S. Chaudhari, Curtis P. Langlotz
Abstract: Chest X-rays (CXRs) are among the most frequently performed imaging examinations worldwide, yet rising imaging volumes increase radiologist workload and the risk of diagnostic errors. Although artificial intelligence (AI) systems have shown promise for CXR interpretation, most generate only final predictions, without making explicit how visual evidence is translated into radiographic findings and diagnostic predictions. We present CheXOne, a reasoning-enabled vision-language model for CXR interpretation. CheXOne jointly generates diagnostic predictions and explicit, clinically grounded reasoning traces that connect visual evidence, radiographic findings, and these predictions. The model is trained on 14.7 million instruction and reasoning samples curated from 30 public datasets spanning 36 CXR interpretation tasks, using a two-stage framework that combines instruction tuning with reinforcement learning to improve reasoning quality. We evaluate CheXOne in zero-shot settings across visual question answering, report generation, visual grounding and reasoning assessment, covering 17 evaluation settings. CheXOne outperforms existing medical and general-domain foundation models and achieves strong performance on independent public benchmarks. A clinical reader study demonstrates that CheXOne-drafted reports are comparable to or better than resident-written reports in 55% of cases, while effectively addressing clinical indications and enhancing both report writing and CXR interpretation efficiency. Further analyses involving radiologists reveal that the generated reasoning traces show high clinical factuality and provide causal support for the final predictions, offering a plausible explanation for the performance gains. These results suggest that explicit reasoning can improve model performance, interpretability and clinical utility in AI-assisted CXR interpretation.
Authors: Nikolaos M. Matzakos
Abstract: We prove that activation saturation imposes a structural dynamical limitation on autonomous Neural ODEs $\dot{h}=f_\theta(h)$ with saturating activations ($\tanh$, sigmoid, etc.): if $q$ hidden layers of the MLP $f_\theta$ satisfy $|\sigma'|\le\delta$ on a region~$U$, the input Jacobian is attenuated as $\lVert Df_\theta(x)\rVert\le C(U)$ (for activations with $\sup_{x}|\sigma'(x)|\le 1$, e.g.\ $\tanh$ and sigmoid, this reduces to $C_W\delta^q$), forcing every Floquet (Lyapunov) exponent along any $T$-periodic orbit $\gamma\subset U$ into the interval $[-C(U),\;C(U)]$. This is a collapse of the Floquet spectrum: as saturation deepens ($\delta\to 0$), all exponents are driven to zero, limiting both strong contraction and chaotic sensitivity. The obstruction is structural -- it constrains the learned vector field at inference time, independent of training quality. As a secondary contribution, for activations with $\sigma'>0$, a saturation-weighted spectral factorisation yields a refined bound $\widetilde{C}(U)\le C(U)$ whose improvement is amplified exponentially in~$T$ at the flow level. All results are numerically illustrated on the Stuart--Landau oscillator; the bounds provide a theoretical explanation for the empirically observed failure of $\tanh$-NODEs on the Morris--Lecar neuron model.
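A rough sketch of why saturation confines the exponents, under simplifying assumptions that are not taken from the paper: $f_\theta$ is an MLP with exactly $q$ hidden layers and weight matrices $W_1,\dots,W_{q+1}$, every hidden layer satisfies $|\sigma'|\le\delta$ on $U$, norms are operator norms, and $C_W$ is taken to be the product of the weight norms.

```latex
\[
\begin{aligned}
Df_\theta(x) &= W_{q+1}\,D_q(x)\,W_q \cdots D_1(x)\,W_1,
\qquad D_k(x)=\operatorname{diag}\big(\sigma'(\cdot)\big),\quad \lVert D_k(x)\rVert\le\delta \ \text{on } U,\\
\lVert Df_\theta(x)\rVert &\le \Big(\prod_{k=1}^{q+1}\lVert W_k\rVert\Big)\,\delta^{q} = C_W\,\delta^{q} =: C .
\end{aligned}
\]
```

Along a $T$-periodic orbit $\gamma\subset U$, the variational equation $\dot\Phi = Df_\theta(\gamma(t))\,\Phi$, $\Phi(0)=I$, then gives $\lVert\Phi(T)\rVert\le e^{CT}$ and $\lVert\Phi(T)^{-1}\rVert\le e^{CT}$ by Grönwall's inequality. Every Floquet multiplier $\rho$ therefore satisfies $e^{-CT}\le|\rho|\le e^{CT}$, so the real part of each exponent $\tfrac1T\ln\rho$ lies in $[-C,\,C]$.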
Authors: Zixiang Peng, Yongxiu Xu, Qinyi Zhang, Jiexun Shen, Yifan Zhang, Hongbo Xu, Yubin Wang, Gaopeng Gou
Abstract: Unified Multimodal Large Models (UMLMs) integrate understanding and generation capabilities within a single architecture. While this architectural unification, driven by the deep fusion of multimodal features, enhances model performance, it also introduces important yet underexplored safety challenges. Existing safety benchmarks predominantly focus on isolated understanding or generation tasks, failing to evaluate the holistic safety of UMLMs when handling diverse tasks under a unified framework. To address this, we introduce Uni-SafeBench, a comprehensive benchmark featuring a taxonomy of six major safety categories across seven task types. To ensure rigorous assessment, we develop Uni-Judger, a framework that effectively decouples contextual safety from intrinsic safety. Based on comprehensive evaluations across Uni-SafeBench, we uncover that while the unification process enhances model capabilities, it significantly degrades the inherent safety of the underlying LLM. Furthermore, open-source UMLMs exhibit much lower safety performance than multimodal large models specialized for either generation or understanding tasks. We open-source all resources to systematically expose these risks and foster safer AGI development.
Authors: Simone Garatti, Lucrezia Manieri, Alessandro Falsone, Algo Car\`e, Marco C. Campi, Maria Prandini
Abstract: The scenario approach provides a powerful data-driven framework for designing solutions under uncertainty with rigorous probabilistic robustness guarantees. Existing theory, however, primarily addresses assessing robustness with respect to a single appropriateness criterion for the solution based on a dataset, whereas many practical applications - including multi-agent decision problems - require the simultaneous consideration of multiple criteria and the assessment of their robustness based on multiple datasets, one per criterion. This paper develops a general scenario theory for multi-criteria data-driven decision making. A central innovation lies in the collective treatment of the risks associated with violations of individual criteria, which yields substantially more accurate robustness certificates than those derived from a naive application of standard results. In turn, this approach enables a sharper quantification of the robustness level with which all criteria are simultaneously satisfied. The proposed framework applies broadly to multi-criteria data-driven decision problems, providing a principled, scalable, and theoretically grounded methodology for design under uncertainty.
Authors: Yichen Xie, Yixiao Wang, Shuqi Zhao, Cheng-En Wu, Masayoshi Tomizuka, Jianwen Xie, Hao-Shu Fang
Abstract: The generalization ability of imitation learning policies for robotic manipulation is fundamentally constrained by the diversity of expert demonstrations, while collecting demonstrations across varied environments is costly and difficult in practice. In this paper, we propose a practical framework that exploits inherent scene diversity without additional human effort by scaling camera views during demonstration collection. Instead of acquiring more trajectories, multiple synchronized camera perspectives are used to generate pseudo-demonstrations from each expert trajectory, which enriches the training distribution and improves viewpoint invariance in visual representations. We analyze how different action spaces interact with view scaling and show that camera-space representations further enhance diversity. In addition, we introduce a multiview action aggregation method that allows single-view policies to benefit from multiple cameras during deployment. Extensive experiments in simulation and real-world manipulation tasks demonstrate significant gains in data efficiency and generalization compared to single-view baselines. Our results suggest that scaling camera views provides a practical and scalable solution for imitation learning, which requires minimal additional hardware setup and integrates seamlessly with existing imitation learning algorithms. The website of our project is https://yichen928.github.io/robot_multiview.
Authors: Sandeep Kumar Samota, Snehashish Chakraverty, Narayan Sethi
Abstract: Poverty is a complex dynamic challenge that cannot be adequately captured using predefined differential equations. Machine learning (ML) methods have recently demonstrated significant potential in modelling real-world dynamical systems. Among these, Neural Ordinary Differential Equations (Neural ODEs) have emerged as a powerful, data-driven approach for learning continuous-time dynamics directly from observations. This chapter applies the Neural ODE framework to analyze poverty dynamics in the Indian state of Odisha. Specifically, we utilize time-series data from 2007 to 2020 on the key indicators of economic development and poverty reduction. Within the Neural ODE architecture, the temporal gradient of the system is represented by a multi-layer perceptron (MLP). The resulting neural dynamical system is integrated using a numerical ODE solver to obtain the trajectory of the system over time. During training, the adjoint sensitivity method is used to compute gradients, enabling effective backpropagation through the ODE solver. The trained Neural ODE model reproduces the observed data with high accuracy. This demonstrates the capability of Neural ODEs to capture the dynamics of the poverty indicator of concrete-structured households. The obtained results show that ML methods, such as Neural ODEs, can serve as effective tools for modeling socioeconomic transitions. They can provide policymakers with reliable projections, supporting more informed and effective decision-making for poverty alleviation.
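A minimal sketch of the forward pass this abstract describes: an MLP stands in for the time derivative and a numerical solver integrates it. Everything here is an illustrative assumption rather than the authors' setup: the function names, the two-dimensional state, the random untrained weights, the fixed-step RK4 solver, and the choice to measure time in years since 2007. Training with the adjoint sensitivity method is not shown.

```python
import math
import random

def make_mlp_field(dim, hidden, rng):
    """Return f(t, y) -> dy/dt, a one-hidden-layer tanh MLP.

    Weights are random and untrained (assumption for illustration);
    a fitted Neural ODE would learn W1 and W2 from the observations.
    """
    W1 = [[rng.uniform(-0.5, 0.5) for _ in range(dim)] for _ in range(hidden)]
    W2 = [[rng.uniform(-0.5, 0.5) for _ in range(hidden)] for _ in range(dim)]

    def f(t, y):
        h = [math.tanh(sum(w * v for w, v in zip(row, y))) for row in W1]
        return [sum(w * v for w, v in zip(row, h)) for row in W2]

    return f

def rk4_step(f, t, y, dt):
    """Advance the state y by one classical Runge-Kutta (RK4) step."""
    def shifted(k, scale):
        return [yi + scale * ki for yi, ki in zip(y, k)]

    k1 = f(t, y)
    k2 = f(t + dt / 2, shifted(k1, dt / 2))
    k3 = f(t + dt / 2, shifted(k2, dt / 2))
    k4 = f(t + dt, shifted(k3, dt))
    return [yi + dt / 6 * (a + 2 * b + 2 * c + d)
            for yi, a, b, c, d in zip(y, k1, k2, k3, k4)]

def integrate(f, y0, t0, t1, steps):
    """Integrate dy/dt = f(t, y) from t0 to t1; return all visited states."""
    dt = (t1 - t0) / steps
    t, y = t0, list(y0)
    traj = [list(y)]
    for _ in range(steps):
        y = rk4_step(f, t, y, dt)
        t += dt
        traj.append(list(y))
    return traj

# Hypothetical usage: 13 years (2007 to 2020) with 10 solver steps per year.
field = make_mlp_field(dim=2, hidden=8, rng=random.Random(0))
trajectory = integrate(field, y0=[1.0, 0.5], t0=0.0, t1=13.0, steps=130)
```

The same `integrate` routine works for any right-hand side, so it can be sanity-checked on an ODE with a known solution before plugging in a learned field.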
Authors: Alexis Coyette, Charles Modera, Candy Sonveaux, Judicaël Mohet, François-Grégoire Bierwart, Sylverio Pool Marquez, Jarod Ketcha Kouakep, Cédric Simal, Komlan Fiagbe, Violaine Piengeon, Martin Moriamé, Justine Bodart, Marie Dorchain, Maxime Lucas, Rommel Tchinda Djeudjo, Gianluca Peri, Eve Tilman
Abstract: We propose a novel extension of the Bradley-Terry model to multiplayer games and adapt a recent algorithm by Newman [1] to our model. We demonstrate the use of our proposed method on synthetic datasets and on a real dataset of games of cards.
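For orientation, here is a hedged sketch of the standard two-player Bradley-Terry model, fitted with the fixed-point iteration usually attributed to Newman's work on pairwise comparisons. It is not the multiplayer extension proposed in the abstract, whose form is not given here. The function name, the `wins` matrix layout, and the sum-to-one normalization are assumptions for illustration.

```python
def bradley_terry_strengths(wins, iters=200):
    """Estimate Bradley-Terry strengths from a pairwise win-count matrix.

    wins[i][j] is the number of times player i beat player j.
    Assumes every player has at least one win and one loss; otherwise
    a denominator below can be zero and the estimate is not finite.
    """
    n = len(wins)
    pi = [1.0] * n
    for _ in range(iters):
        # Players are updated one at a time, each using the latest values.
        for i in range(n):
            num = sum(wins[i][j] * pi[j] / (pi[i] + pi[j])
                      for j in range(n) if j != i)
            den = sum(wins[j][i] / (pi[i] + pi[j])
                      for j in range(n) if j != i)
            pi[i] = num / den
        # Strengths are only defined up to scale, so fix the sum to 1.
        total = sum(pi)
        pi = [p / total for p in pi]
    return pi

# Hypothetical usage: player 0 beat player 1 three times and lost once.
strengths = bradley_terry_strengths([[0, 3], [1, 0]])
```

Under the model, the probability that player i beats player j is `pi[i] / (pi[i] + pi[j])`, so only ratios of strengths matter.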
Authors: Stefano Cortinovis, Laurence Aitchison, Stefanos Eleftheriadis, Mark van der Wilk
Abstract: Gaussian processes (GPs) offer appealing properties but are costly to train at scale. Sparse variational GP (SVGP) approximations reduce cost yet still rely on Cholesky decompositions of kernel matrices, ill-suited to low-precision, massively parallel hardware. While one can construct valid variational bounds that rely only on matrix multiplications (matmuls) via an auxiliary matrix parameter, optimising them with off-the-shelf first-order methods is challenging. We make the inverse-free approach practical by proposing a better-conditioned bound and deriving a matmul-only natural-gradient update for the auxiliary parameter, markedly improving stability and convergence. We further provide simple heuristics, such as step-size schedules and stopping criteria, that make the overall optimisation routine fit seamlessly into existing workflows. Across regression and classification benchmarks, we demonstrate that our method 1) serves as a drop-in replacement in SVGP-based models (e.g., deep GPs), 2) recovers similar performance to traditional methods, and 3) can be faster than baselines when well tuned.
Authors: Karan Singh, Michael Yu, Varun Gangal, Zhuofu Tao, Sachin Kumar, Emmy Liu, Steven Y. Feng
Abstract: Retrieval-augmented generation (RAG) improves language model (LM) performance by providing relevant context at test time for knowledge-intensive situations. However, the relationship between parametric knowledge acquired during pretraining and non-parametric knowledge accessed via retrieval remains poorly understood, especially under fixed data budgets. In this work, we systematically study the trade-off between pretraining corpus size and retrieval store size across a wide range of model and data scales. We train OLMo-2-based LMs ranging from 30M to 3B parameters on up to 100B tokens of DCLM data, while varying both pretraining data scale (1-150x the number of parameters) and retrieval store size (1-20x), and evaluate performance across a diverse suite of benchmarks spanning reasoning, scientific QA, and open-domain QA. We find that retrieval consistently improves performance over parametric-only baselines across model scales and introduce a three-dimensional scaling framework that models performance as a function of model size, pretraining tokens, and retrieval corpus size. This scaling manifold enables us to estimate optimal allocations of a fixed data budget between pretraining and retrieval, revealing that the marginal utility of retrieval depends strongly on model scale, task type, and the degree of pretraining saturation. Our results provide a quantitative foundation for understanding when and how retrieval should complement pretraining, offering practical guidance for allocating data resources in the design of scalable language modeling systems.
Authors: Rajkiran Panuganti
Abstract: Transformer language models contain localized reasoning circuits, contiguous layer blocks that improve reasoning when duplicated at inference time. Finding these circuits currently requires brute-force sweeps costing 25 GPU hours per model. We propose CircuitProbe, which predicts circuit locations from activation statistics in under 5 minutes on CPU, providing a speedup of three to four orders of magnitude. We find that reasoning circuits come in two types: stability circuits in early layers, detected through the derivative of representation change, and magnitude circuits in late layers, detected through anomaly scoring. We validate across 9 models spanning 6 architectures, including 2025 models, confirming that CircuitProbe top predictions match or are within 2 layers of the optimal circuit in all validated cases. A scaling experiment across the Qwen 2.5 family reveals that layer duplication consistently benefits models under 3B parameters but degrades performance in 7B+ models, making this a practical scaling technique for small language models. CircuitProbe requires as few as 10 calibration examples and its predictions are stable across English, Hindi, Chinese, and French.
Authors: Merveilles Agbeti-messan, Thierry Paquet, Clément Chatelain, Pierrick Tranouez, Stéphane Nicolas
Abstract: End-to-end OCR for historical newspapers remains challenging, as models must handle long text sequences, degraded print quality, and complex layouts. While Transformer-based recognizers dominate current research, their quadratic complexity limits efficient paragraph-level transcription and large-scale deployment. We investigate linear-time State-Space Models (SSMs), specifically Mamba, as a scalable alternative to Transformer-based sequence modeling for OCR. We present, to our knowledge, the first OCR architecture based on SSMs, combining a CNN visual encoder with bi-directional and autoregressive Mamba sequence modeling, and conduct a large-scale benchmark comparing SSMs with Transformer- and BiLSTM-based recognizers. Multiple decoding strategies (CTC, autoregressive, and non-autoregressive) are evaluated under identical training conditions alongside strong neural baselines (VAN, DAN, DANIEL) and widely used off-the-shelf OCR engines (PERO-OCR, Tesseract OCR, TrOCR, Gemini). Experiments on historical newspapers from the Bibliothèque nationale du Luxembourg, with newly released >99% verified gold-standard annotations, and cross-dataset tests on Fraktur and Antiqua lines, show that all neural models achieve low error rates (~2% CER), making computational efficiency the main differentiator. Mamba-based models maintain competitive accuracy while halving inference time and exhibiting superior memory scaling (1.26x vs 2.30x growth at 1000 chars), reaching 6.07% CER at the severely degraded paragraph level compared to 5.24% for DAN, while remaining 2.05x faster. We release code, trained models, and standardized evaluation protocols to enable reproducible research and guide practitioners in large-scale cultural heritage OCR.
Authors: Ricardo Hidalgo-Aragón, Jesús M. González-Barahona, Gregorio Robles
Abstract: Context: Schools, training platforms, and technology firms increasingly need to assess programming proficiency at scale with transparent, reproducible methods that support personalized learning pathways. Objective: This study introduces a pedagogical framework for Scratch project assessment, aligned with the Common European Framework of Reference (CEFR), providing universal competency levels for students and teachers alongside actionable insights for curriculum design. Method: We apply Fuzzy C-Means clustering to 2,008,246 Scratch projects evaluated via Dr.Scratch, implementing an ordinal criterion to map clusters to CEFR levels (A1-C2), and introducing enhanced classification metrics that identify transitional learners, enable continuous progress tracking, and quantify classification certainty to balance automated feedback with instructor review. Impact: The framework enables diagnosis of systemic curriculum gaps, notably a "B2 bottleneck" where only 13.3% of learners reside due to the cognitive load of integrating Logic, Synchronization, and Data Representation, while providing certainty-based triggers for human intervention.
Authors: Zehao Jin, Yanan Sui
Abstract: The whole-brain connectome of a fruit fly comprises over 130K neurons connected with a probability of merely 0.02%, yet achieves an average shortest path of only 4.4 hops. Despite being highly structured at the circuit level, the network's long-range connections are broadly distributed across brain regions, functioning as stochastic shortcuts that enable efficient global communication. Inspired by this observation, we propose Stochastic Attention (SA), a drop-in enhancement for sliding-window attention (SWA) that applies a random permutation to the token sequence before windowed attention and restores the original order afterward. This transforms the fixed local window into a stochastic global one within the same $O(nw)$ per-layer budget. Through depth, independently sampled permutations yield exponentially growing receptive fields, achieving full sequence coverage in $O(\log_w n)$ layers versus $O(n/w)$ for SWA. We validate SA in two settings: pre-training language models from scratch, where a gated SA + SWA combination achieves the best average zero-shot accuracy, and training-free inference on Qwen3-8B and Qwen3-30B-A3B, where SA consistently outperforms SWA and matches or exceeds Mixture of Block Attention at comparable compute budgets. These results suggest that connectome-inspired stochastic routing is a practical primitive for improving the expressivity of efficient attention, complementary to existing linear and sparse approaches.
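The permute, attend, un-permute pattern described in this abstract can be sketched as follows. This is a toy illustration under stated assumptions, not the authors' implementation: attention is single-head and non-causal, queries, keys, and values are all the raw inputs, and the names `windowed_attention` and `stochastic_attention` are hypothetical.

```python
import math
import random

def windowed_attention(x, w):
    """Toy sliding-window attention over a list of vectors.

    Position i attends to positions within distance w // 2, using
    dot-product scores with queries = keys = values = x (assumption).
    """
    n, half, d = len(x), w // 2, len(x[0])
    out = []
    for i in range(n):
        lo, hi = max(0, i - half), min(n, i + half + 1)
        scores = [sum(a * b for a, b in zip(x[i], x[j])) for j in range(lo, hi)]
        m = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - m) for s in scores]
        z = sum(weights)
        out.append([sum(weights[k] * x[lo + k][c] for k in range(hi - lo)) / z
                    for c in range(d)])
    return out

def stochastic_attention(x, w, rng):
    """Shuffle tokens, apply windowed attention, then restore the order."""
    n = len(x)
    perm = list(range(n))
    rng.shuffle(perm)
    shuffled = [x[p] for p in perm]          # shuffled[k] is token perm[k]
    mixed = windowed_attention(shuffled, w)  # same O(n * w) cost as plain SWA
    out = [None] * n
    for k, p in enumerate(perm):
        out[p] = mixed[k]                    # token perm[k] returns to slot perm[k]
    return out

# Hypothetical usage: 8 tokens of dimension 2, window of 3 positions.
tokens = [[float(i), 1.0] for i in range(8)]
mixed_tokens = stochastic_attention(tokens, w=3, rng=random.Random(0))
```

Because the window is applied in the shuffled order, the neighbours a token attends to are a random subset of the whole sequence rather than its local context, while the cost per layer is unchanged.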
Authors: Paolo Speziali, Arno De Greef, Mehrdad Asadi, Willem Röpke, Ann Nowé, Diederik M. Roijers
Abstract: We propose the Preference Guided Iterated Pareto Referent Optimisation (PG-IPRO) for urban route planning for people with different accessibility requirements and preferences. With this algorithm, the user can interact with the system by giving feedback on a route, i.e., the user can say which objective should be further minimized or, conversely, can be relaxed. This leads to intuitive user interaction that is especially effective during early iterations compared to information-gain-based interaction. Furthermore, due to PG-IPRO's iterative nature, the full set of alternative, possibly optimal policies (the Pareto front) is never computed, leading to higher computational efficiency and shorter waiting times for users.
Authors: Om Khangaonkar, Hadi J. Rad, Hamed Pirsiavash
Abstract: Spatial consistency is a fundamental property of the visual world and a key requirement for models that aim to understand physical reality. Despite recent advances, multimodal large language models (MLLMs) often struggle to reason about 3D geometry across multiple views. Rather than asking models to describe scene attributes, we introduce a more challenging task: given two views of the same scene, identify the object that violates 3D motion consistency. We propose a simple and scalable method for generating realistic, spatially inconsistent image pairs from multi-view scenes, enabling systematic evaluation of this capability. Our results show that state-of-the-art MLLMs significantly underperform human observers and exhibit substantial variability across different scene attributes, revealing a fragile and incomplete understanding of 3D structure. We hope our findings underscore the need for approaches that develop a more deeply grounded understanding of the physical world.
Authors: Oscar Clivio, Alexander D'Amour, Alexander Franks, David Bruns-Smith, Chris Holmes, Avi Feller
Abstract: Overlap, also known as positivity, is a key condition for causal treatment effect estimation. Many popular estimators suffer from high variance and become brittle when features differ strongly across treatment groups. This is especially challenging in high dimensions: the curse of dimensionality can make overlap implausible. To address this, we propose a class of feature representations called deconfounding scores, which preserve both identification and the target of estimation; the classical propensity and prognostic scores are two special cases. We characterize the problem of finding a representation with better overlap as minimizing an overlap divergence under a deconfounding score constraint. We then derive closed-form expressions for a class of deconfounding scores under a broad family of generalized linear models with Gaussian features and show that prognostic scores are overlap-optimal within this class. We conduct extensive experiments to assess this behavior empirically.
Authors: Deepak Nathani, Cheng Zhang, Chang Huan, Jiaming Shan, Yinfei Yang, Alkesh Patel, Zhe Gan, William Yang Wang, Michael Saxon, Xin Eric Wang
Abstract: Proactive agents that anticipate user needs and autonomously execute tasks hold great promise as digital assistants, yet the lack of realistic user simulation frameworks hinders their development. Existing approaches model apps as flat tool-calling APIs, failing to capture the stateful and sequential nature of user interaction in digital environments and making realistic user simulation infeasible. We introduce Proactive Agent Research Environment (Pare), a framework for building and evaluating proactive agents in digital environments. Pare models applications as finite state machines with stateful navigation and state-dependent action space for the user simulator, enabling active user simulation. Building on this foundation, we present Pare-Bench, a benchmark of 143 diverse tasks spanning communication, productivity, scheduling, and lifestyle apps, designed to test context observation, goal inference, intervention timing, and multi-app orchestration.
Authors: Guanlin He, Yingtai Xiao, Jiamu Bai, Xin Gu, Zeyu Ding, Wenpeng Yin, Daniel Kifer
Abstract: Matrix mechanisms are often used to provide unbiased differentially private query answers when publishing statistics or creating synthetic data. Recent work has developed matrix mechanisms, such as ResidualPlanner and Weighted Fourier Factorizations, that scale to high dimensional datasets while providing optimality guarantees for workloads such as marginals and circular product queries. They operate by adding noise to a linearly independent set of queries that can compactly represent the desired workloads. In this paper, we present QuerySmasher, an alternative scalable approach based on a divide-and-conquer strategy. Given a workload that can be answered from various data marginals, QuerySmasher splits each query into sub-queries and re-assembles the pieces into mutually orthogonal sub-workloads. These sub-workloads represent small, low-dimensional problems that can be independently and optimally answered by existing low-dimensional matrix mechanisms. QuerySmasher then stitches these solutions together to answer queries in the original workload. We show that QuerySmasher subsumes prior work, like ResidualPlanner (RP), ResidualPlanner+ (RP+), and Weighted Fourier Factorizations (WFF). We prove that it can dominate those approaches, under sum squared error, for all workloads. We also experimentally demonstrate the scalability and accuracy of QuerySmasher.
Authors: Abdullah Al Shafi, Md. Milon Islam, Sk. Imran Hossain, K. M. Azharul Hasan
Abstract: Actor-level stance detection aims to determine an author's expressed position toward specific geopolitical actors mentioned or implicated in a text. Although transformer-based models have achieved relatively good performance in stance classification, they typically rely on unified representations that may not sufficiently capture heterogeneous linguistic signals, such as contrastive discourse structures, framing cues, and salient lexical indicators. This motivates the need for adaptive architectures that explicitly model diverse stance-expressive patterns. In this paper, we propose StanceMoE, a context-enhanced Mixture-of-Experts (MoE) architecture built upon a fine-tuned BERT encoder for actor-level stance detection. Our model integrates six expert modules designed to capture complementary linguistic signals, including global semantic orientation, salient lexical cues, clause-level focus, phrase-level patterns, framing indicators, and contrast-driven discourse shifts. A context-aware gating mechanism dynamically weights expert contributions, enabling adaptive routing based on input characteristics. Experiments are conducted on the StanceNakba 2026 Subtask A dataset, comprising 1,401 annotated English texts where the target actor is implicit in the text. StanceMoE achieves a macro-F1 score of 94.26%, outperforming traditional baselines and alternative BERT-based variants.
Authors: Razvan Mihai Popescu, David Gros, Andrei Botocan, Rahul Pandita, Prem Devanbu, Maliheh Izadi
Abstract: The rise of large language models for code has reshaped software development. Autonomous coding agents, able to create branches, open pull requests, and perform code reviews, now actively contribute to real-world projects. Their growing role offers a unique and timely opportunity to investigate AI-driven contributions and their effects on code quality, team dynamics, and software maintainability. In this work, we construct a novel dataset of approximately 110,000 open-source pull requests, including associated commits, comments, reviews, issues, and file changes, collectively representing millions of lines of source code. We compare five popular coding agents (OpenAI Codex, Claude Code, GitHub Copilot, Google Jules, and Devin), examining how their usage differs in various development aspects such as merge frequency, edited file types, and developer interaction signals, including comments and reviews. Furthermore, we emphasize that code authoring and review are only a small part of the larger software engineering process, as the resulting code must also be maintained and updated over time. Hence, we offer several longitudinal estimates of survival and churn rates for agent-generated versus human-authored code. Ultimately, our findings indicate increasing agent activity in open-source projects, although agent contributions are associated with more churn over time compared to human-authored code.
Authors: Gilhan Kim, Daniel K. Park
Abstract: Variational autoencoders (VAEs) learn compact latent representations of complex data, but their generative capacity is fundamentally constrained by the choice of prior distribution over the latent space. Energy-based priors offer a principled way to move beyond factorized assumptions and capture structured interactions among latent variables, yet training such priors at scale requires accurate and efficient sampling from intractable distributions. Here we present Boltzmann-machine-prior VAEs (BM-VAEs) trained using quantum-annealing-based sampling in three distinct operational modes within a single generative system. During training, diabatic quantum annealing (DQA) provides unbiased Boltzmann samples for gradient estimation of the energy-based prior; for unconditional generation, slower quantum annealing (QA) concentrates samples near low-energy minima; for conditional generation, bias fields are added to direct sampling toward attribute-specific regions of the energy landscape (c-QA). Using up to 2000 qubits on a D-Wave Advantage2 processor, we demonstrate stable and efficient training across multiple datasets, with faster convergence and lower reconstruction loss than a Gaussian-prior VAE. The learned Boltzmann prior enables unconditional generation by sampling directly from the energy-based latent distribution, a capability that plain autoencoders lack, and conditional generation through latent biasing that leverages the learned pairwise interactions.
Authors: Weiming Feng, Heng Guo, Minji Yang
Abstract: We show polylogarithmic mixing time bounds for the alternating-scan sampler for positively weighted restricted Boltzmann machines. This is done via analysing the same chain and the Glauber dynamics for ferromagnetic two-spin systems, where we obtain new mixing time bounds up to the critical thresholds.
Authors: Zhengyang Tang, Ke Ji, Xidong Wang, Zihan Ye, Xinyuan Wang, Yiduo Guo, Ziniu Li, Chenxin Li, Jingyuan Hu, Shunian Chen, Tongxu Luo, Jiaxi Bi, Zeyu Qin, Shaobo Wang, Xin Lai, Pengyuan Lyu, Junyi Li, Can Xu, Chengquan Zhang, Han Hu, Ming Yan, Benyou Wang
Abstract: We study whether phone-use agents respect privacy while completing benign mobile tasks. This question has remained hard to answer because privacy-compliant behavior is not operationalized for phone-use agents, and ordinary apps do not reveal exactly what data agents type into which form entries during execution. To make this question measurable, we introduce MyPhoneBench, a verifiable evaluation framework for privacy behavior in mobile agents. We operationalize privacy-respecting phone use as permissioned access, minimal disclosure, and user-controlled memory through a minimal privacy contract, iMy, and pair it with instrumented mock apps plus rule-based auditing that make unnecessary permission requests, deceptive re-disclosure, and unnecessary form filling observable and reproducible. Across five frontier models on 10 mobile apps and 300 tasks, we find that task success, privacy-compliant task completion, and later-session use of saved preferences are distinct capabilities, and no single model dominates all three. Evaluating success and privacy jointly reshuffles the model ordering relative to either metric alone. The most persistent failure mode across models is simple data minimization: agents still fill optional personal entries that the task does not require. These results show that privacy failures arise from over-helpful execution of benign tasks, and that success-only evaluation overestimates the deployment readiness of current phone-use agents. All code, mock apps, and agent trajectories are publicly available at https://github.com/FreedomIntelligence/MyPhoneBench.
Authors: Yi Cao, Zexun Chen, Lin William Cong, Heqing Shi
Abstract: We develop Structured-Knowledge-Informed Neural Networks (SKINNs), a unified estimation framework that embeds theoretical, simulated, previously learned, or cross-domain insights as differentiable constraints within flexible neural function approximation. SKINNs jointly estimate neural network parameters and economically meaningful structural parameters in a single optimization problem, enforcing theoretical consistency not only on observed data but over a broader input domain through collocation, and therefore nesting approaches such as functional GMM, Bayesian updating, transfer learning, PINNs, and surrogate modeling. SKINNs define a class of M-estimators that are consistent and asymptotically normal with root-N convergence, sandwich covariance, and recovery of pseudo-true parameters under misspecification. We establish identification of structural parameters under joint flexibility, derive generalization and target-risk bounds under distributional shift in a convex proxy, and provide a restricted-optimal characterization of the weighting parameter that governs the bias-variance tradeoff. In an illustrative financial application to option pricing, SKINNs improve out-of-sample valuation and hedging performance, particularly at longer horizons and during high-volatility regimes, while recovering economically interpretable structural parameters with improved stability relative to conventional calibration. More broadly, SKINNs provide a general econometric framework for combining model-based reasoning with high-dimensional, data-driven estimation.
Authors: Jalo Nousiainen, Iremsu Taskin, Markus Kasper, Gilles Orban De Xivry, Olivier Absil
Abstract: The direct imaging of potentially habitable exoplanets is one prime science case for high-contrast imaging instruments on extremely large telescopes. Most such exoplanets orbit close to their host stars, where their observation is limited by fast-moving atmospheric speckles and quasi-static non-common-path aberrations (NCPA). Conventional NCPA correction methods often use mechanical mirror probes, which compromise performance during operation. This work presents machine-learning-based NCPA control methods that automatically detect and correct both dynamic and static NCPA errors by leveraging sequential phase diversity. We extend previous work in reinforcement learning for AO to focal plane control. A new model-based RL algorithm, Policy Optimization for NCPAs (PO4NCPA), interprets the focal-plane image as input data and, through sequential phase diversity, determines phase corrections that optimize both non-coronagraphic and post-coronagraphic PSFs without prior system knowledge. Further, we demonstrate the effectiveness of this approach by numerically simulating static NCPA errors on a ground-based telescope and an infrared imager affected by water-vapor-induced seeing (dynamic NCPAs). Simulations show that PO4NCPA robustly compensates static and dynamic NCPAs. In static cases, it achieves near-optimal focal-plane light suppression with a coronagraph and near-optimal Strehl without one. With dynamic NCPAs, it matches the performance of the modal least-squares reconstruction combined with a 1-step delay integrator in these metrics. The method remains effective for the ELT pupil, vector vortex coronagraph, and under photon and background noise. PO4NCPA is model-free and can be directly applied to standard imaging as well as to any coronagraph. Its sub-millisecond inference times and performance also make it suitable for real-time low-order correction of atmospheric turbulence beyond HCI.
Authors: Yiheng Wang, Lichen Zhu, Yueqian Lin, Yudong Liu, Jingyang Zhang, Hai "Helen" Li, Yiran Chen
Abstract: Multimodal Large Language Models (MLLMs) have shown strong performance on video question answering, but their application to long-form videos is constrained by limited context length and computational cost, making keyframe sampling essential. Existing approaches typically rely on semantic relevance or reinforcement learning, which either fail to capture evidential clues or suffer from inefficient combinatorial optimization. In this work, we propose an evidence-driven keyframe sampling framework grounded in information bottleneck theory. We formulate keyframe selection as maximizing the conditional mutual information between selected frames and the query, providing a principled objective that reflects each frame's contribution to answering the question. To make this objective tractable, we exploit its structure to derive a decomposed optimization that reduces subset selection to independent frame-level scoring. We further introduce a query-conditioned evidence scoring network trained with a contrastive objective to estimate evidential importance efficiently. Experiments on long-form video understanding benchmarks show that our method consistently outperforms prior sampling strategies under strict token budgets, while significantly improving training efficiency.
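Once subset selection has been reduced to independent frame-level scoring, as this abstract describes, picking keyframes under a budget amounts to a top-k over per-frame scores. The sketch below assumes the scores have already been produced by some query-conditioned scorer; the function name `select_keyframes` and the example values are hypothetical, and the scoring network itself is not shown.

```python
def select_keyframes(scores, budget):
    """Return indices of the `budget` highest-scoring frames, in temporal order.

    scores[i] is an evidence score for frame i, assumed to come from a
    separate query-conditioned scorer (not implemented here).
    """
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    # Re-sort the winners by index so frames are passed on in video order.
    return sorted(ranked[:budget])

# Hypothetical usage: five frames, budget of two.
chosen = select_keyframes([0.1, 0.9, 0.3, 0.8, 0.2], budget=2)
```

Because each frame is scored independently, the cost is one scoring pass plus a sort, instead of a search over all subsets of a given size.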
Authors: Yuheng Zhang, Mengfei Duan, Kunyu Peng, Yuhang Wang, Di Wen, Danda Pani Paudel, Luc Van Gool, Kailun Yang
Abstract: 3D semantic occupancy prediction is central to autonomous driving, yet current methods are vulnerable to long-tailed class bias and out-of-distribution (OOD) inputs, often overconfidently assigning anomalies to rare classes. We present ProOOD, a lightweight, plug-and-play method that couples prototype-guided refinement with training-free OOD scoring. ProOOD comprises (i) prototype-guided semantic imputation that fills occluded regions with class-consistent features, (ii) prototype-guided tail mining that strengthens rare-class representations to curb OOD absorption, and (iii) EchoOOD, which fuses local logit coherence with local and global prototype matching to produce reliable voxel-level OOD scores. Extensive experiments on five datasets demonstrate that ProOOD achieves state-of-the-art performance on both in-distribution 3D occupancy prediction and OOD detection. On SemanticKITTI, it surpasses baselines by +3.57% mIoU overall and +24.80% tail-class mIoU; on VAA-KITTI, it improves AuPRCr by +19.34 points, with consistent gains across benchmarks. These improvements yield more calibrated occupancy estimates and more reliable OOD detection in safety-critical urban driving. The source code is publicly available at https://github.com/7uHeng/ProOOD.
Authors: Jonas Schaible, Asena Karolin Özdemir, Charlotte Debus, Sven Burger, Achim Streit, Christiane Becker, Klaus Jäger, Markus Götz
Abstract: Inverse design of optical multilayer stacks seeks to infer layer materials, thicknesses, and ordering from a desired target spectrum. It is a long-standing challenge due to the large design space and non-unique solutions. We introduce OptoLlama, a masked diffusion language model for inverse thin-film design from optical spectra. Representing multilayer stacks as sequences of material-thickness tokens, OptoLlama conditions generation on reflectance, absorptance, and transmittance spectra and learns a probabilistic mapping from optical response to structure. Evaluated on a representative test set of 3,000 targets, OptoLlama reduces the mean absolute spectral error by 2.9-fold relative to a nearest-neighbor template baseline and by 3.45-fold relative to the state-of-the-art data-driven baseline, called OptoGPT. Case studies on designed and expert-defined targets show that the model reproduces characteristic spectral features and recovers physically meaningful stack motifs, including distributed Bragg reflectors. These results establish diffusion-based sequence modeling as a powerful framework for inverse photonic design.
Authors: Reyhaneh Ahani Manghotay (Simon Fraser University, Burnaby, Canada), Jie Liang (Eastern Institute of Technology, Ningbo, China)
Abstract: Leveraging the rich semantic features of vision-language models (VLMs) like CLIP for monocular depth estimation tasks is a promising direction, yet often requires extensive fine-tuning or lacks geometric precision. We present a parameter-efficient framework, named MoA-DepthCLIP, that adapts pretrained CLIP representations for monocular depth estimation with minimal supervision. Our method integrates a lightweight Mixture-of-Adapters (MoA) module into the pretrained Vision Transformer (ViT-B/32) backbone combined with selective fine-tuning of the final layers. This design enables spatially-aware adaptation, guided by a global semantic context vector and a hybrid prediction architecture that synergizes depth bin classification with direct regression. To enhance structural accuracy, we employ a composite loss function that enforces geometric constraints. On the NYU Depth V2 benchmark, MoA-DepthCLIP achieves competitive results, significantly outperforming the DepthCLIP baseline by improving the $\delta_1$ accuracy from 0.390 to 0.745 and reducing the RMSE from 1.176 to 0.520. These results are achieved while requiring substantially fewer trainable parameters, demonstrating that lightweight, prompt-guided MoA is a highly effective strategy for transferring VLM knowledge to fine-grained monocular depth estimation tasks.
Authors: Atsuyuki Miyai, Mashiro Toyooka, Zaiying Zhao, Kenta Watanabe, Toshihiko Yamasaki, Kiyoharu Aizawa
Abstract: This paper introduces the first systematic evaluation framework for quantifying the quality and risks of papers written by modern coding agents. While AI-driven paper writing has become a growing concern, rigorous evaluation of the quality and potential risks of AI-written papers remains limited, and a unified understanding of their reliability is still lacking. We introduce Paper Reconstruction Evaluation (PaperRecon), an evaluation framework in which an overview (overview.md) is created from an existing paper, after which an agent generates a full paper based on the overview and minimal additional resources, and the result is subsequently compared against the original paper. PaperRecon disentangles the evaluation of the AI-written papers into two orthogonal dimensions, Presentation and Hallucination, where Presentation is evaluated using a rubric and Hallucination is assessed via agentic evaluation grounded in the original paper source. For evaluation, we introduce PaperWrite-Bench, a benchmark of 51 papers from top-tier venues across diverse domains published after 2025. Our experiments reveal a clear trade-off: while both ClaudeCode and Codex improve with model advances, ClaudeCode achieves higher presentation quality at the cost of more than 10 hallucinations per paper on average, whereas Codex produces fewer hallucinations but lower presentation quality. This work takes a first step toward establishing evaluation frameworks for AI-driven paper writing and improving the understanding of its risks within the research community.
Authors: Shaifalee Saxena, Rafael Fierro, Alexander Scheinker
Abstract: Reinforcement learning has shown strong performance in robotic manipulation, but learned policies often degrade in performance when test conditions differ from the training distribution. This limitation is especially important in contact-rich tasks such as pushing and pick-and-place, where changes in goals, contact conditions, or robot dynamics can drive the system out-of-distribution at inference time. In this paper, we investigate a hybrid controller that combines reinforcement learning (RL) with bounded extremum seeking (ES) to improve robustness under such conditions. In the proposed approach, deep deterministic policy gradient (DDPG) policies are trained under standard conditions on the robotic pushing and pick-and-place tasks, and are then combined with bounded ES during deployment. The RL policy provides fast manipulation behavior, while bounded ES ensures robustness of the overall controller to time variations when operating conditions depart from those seen during training. The resulting controller is evaluated under several out-of-distribution settings, including time-varying goals and spatially varying friction patches.
Authors: Aaron Rose, Carissa Cullen, Brandon Gary Kaplowitz, Christian Schroeder de Witt
Abstract: As LLM agents are increasingly deployed in multi-agent systems, they introduce risks of covert coordination that may evade standard forms of human oversight. While linear probes on model activations have shown promise for detecting deception in single-agent settings, collusion is inherently a multi-agent phenomenon, and the use of internal representations for detecting collusion between agents remains unexplored. We introduce NARCBench, a benchmark for evaluating collusion detection under environment distribution shift, and propose five probing techniques that aggregate per-agent deception scores to classify scenarios at the group level. Our probes achieve 1.00 AUROC in-distribution and 0.60--0.86 AUROC when transferred zero-shot to structurally different multi-agent scenarios and a steganographic blackjack card-counting task. We find that no single probing technique dominates across all collusion types, suggesting that different forms of collusion manifest differently in activation space. We also find preliminary evidence that this signal is localised at the token level, with the colluding agent's activations spiking specifically when processing the encoded parts of their partner's message. This work takes a step toward multi-agent interpretability: extending white-box inspection from single models to multi-agent contexts, where detection requires aggregating signals across agents. These results suggest that model internals provide a complementary signal to text-level monitoring for detecting multi-agent collusion, particularly for organisations with access to model activations. Code and data are available at https://github.com/aaronrose227/narcbench.
Authors: Jack Young
Abstract: Using roughly 48 execution-verified HumanEval training solutions, tuning a single initial state matrix per recurrent layer, with zero inference overhead, outperforms LoRA by +10.8 pp (p < 0.001) on HumanEval. The method, which we call S0 tuning, optimizes one state matrix per recurrent layer while freezing all model weights. On Qwen3.5-4B (GatedDeltaNet hybrid), S0 tuning improves greedy pass@1 by +23.6 +/- 1.7 pp (10 seeds). On FalconH1-7B (Mamba-2 hybrid), S0 reaches 71.8% +/- 1.3 and LoRA reaches 71.4% +/- 2.4 (3 seeds), statistically indistinguishable at this sample size while requiring no weight merging. Cross-domain transfer is significant on MATH-500 (+4.8 pp, p = 0.00002, 8 seeds) and GSM8K (+2.8 pp, p = 0.0003, 10 seeds); a text-to-SQL benchmark (Spider) shows no transfer, consistent with the trajectory-steering mechanism. A prefix-tuning control on a pure Transformer (Qwen2.5-3B) degrades performance by -13.9 pp under all nine configurations tested. On Qwen3.5, a per-step state-offset variant reaches +27.1 pp, above both S0 and LoRA but with per-step inference cost. Taken together, the results show that recurrent state initialization is a strong zero-inference-overhead PEFT surface for hybrid language models when verified supervision is scarce. The tuned state is a ~48 MB file; task switching requires no weight merging or model reload. Code and library: https://github.com/jackyoung27/s0-tuning.
Authors: Abdullah Tokmak, Toni Karvonen, Thomas B. Sch\"on, Dominik Baumann
Abstract: Uncertainty quantification is essential when deploying learning-based control methods in safety-critical systems. This is commonly realized by constructing uncertainty tubes that enclose the unknown function of interest, e.g., the reward and constraint functions or the underlying dynamics model, with high probability. However, existing approaches for uncertainty quantification typically rely on restrictive assumptions on the unknown function, such as known bounds on functional norms or Lipschitz constants, and struggle with discontinuities. In this paper, we model the unknown function as a random function from which independent and identically distributed realizations can be generated, and construct uncertainty tubes via the scenario approach that hold with high probability and rely solely on the sampled realizations. We integrate these uncertainty tubes into a safe Bayesian optimization algorithm, which we then use to safely tune control parameters on a real Furuta pendulum.
Authors: Fangjun Hu, Christian Kokail, Milan Kornja\v{c}a, Pedro L. S. Lopes, Weiyuan Gong, Sheng-Tao Wang, Xun Gao, Stefan Ostermann
Abstract: Learning quantum states from measurement data is a central problem in quantum information and computational complexity. In this work, we study the problem of learning to generate mixed states on a finite-dimensional lattice. Motivated by recent developments in mixed state phases of matter, we focus on arbitrary states in the trivial phase. A state belongs to the trivial phase if there exists a shallow preparation channel circuit under which local reversibility is preserved throughout the preparation. We prove that any mixed state in this class can be efficiently learned from measurement access alone. Specifically, given copies of an unknown trivial phase mixed state, our algorithm outputs a shallow local channel circuit that approximately generates this state in trace distance. The sample complexity and runtime are polynomial (or quasi-polynomial) in the number of qubits, assuming constant (or polylogarithmic) circuit depth and gate locality. Importantly, the learner is not given the original preparation circuit and relies only on its existence. Our results provide a structural foundation for quantum generative models based on shallow channel circuits. In the classical limit, our framework also inspires an efficient algorithm for classical diffusion models using only a polynomial overhead of training and generation.
Authors: Jorge Condor, Nicolas Moenne-Loccoz, Merlin Nimier-David, Piotr Didyk, Zan Gojcic, Qi Wu
Abstract: Primitive-based methods such as 3D Gaussian Splatting have recently become the state-of-the-art for novel-view synthesis and related reconstruction tasks. Compared to neural fields, these representations are more flexible, adaptive, and scale better to large scenes. However, the limited expressivity of individual primitives makes modeling high-frequency detail challenging. We introduce Neural Harmonic Textures, a neural representation approach that anchors latent feature vectors on a virtual scaffold surrounding each primitive. These features are interpolated within the primitive at ray intersection points. Inspired by Fourier analysis, we apply periodic activations to the interpolated features, turning alpha blending into a weighted sum of harmonic components. The resulting signal is then decoded in a single deferred pass using a small neural network, significantly reducing computational cost. Neural Harmonic Textures yield state-of-the-art results in real-time novel view synthesis while bridging the gap between primitive- and neural-field-based reconstruction. Our method integrates seamlessly into existing primitive-based pipelines such as 3DGUT, Triangle Splatting, and 2DGS. We further demonstrate its generality with applications to 2D image fitting and semantic reconstruction.
Authors: Yiheng Su, Matthew Lease
Abstract: We present RELISH (REgression with a Latent Iterative State Head), a novel, lightweight architecture designed for text regression with large language models. Rather than decoding numeric targets as text or aggregating multiple generated outputs, RELISH predicts scalar values directly from frozen LLM representations by iteratively refining a learned latent state through cross-attention over token-level representations, and then mapping the final state to a point estimate with a linear regressor. Across five datasets, four LLM backbones, and two LLM training regimes, RELISH consistently outperforms prior baselines from all three major LLM regression families, including autoregressive decoding, regression-aware inference, and existing predictive head methods. Despite these gains, RELISH remains highly parameter-efficient, requiring only 3.4-3.7M trainable parameters across frozen LLM backbones (only 0.01-0.04% additional overhead), far less than LoRA-based alternatives that grow with model size (0.26-0.42%).
Authors: Shichang Zhang, Atefeh Sohrabizadeh, Cheng Wan, Zijie Huang, Ziniu Hu, Yewen Wang, Yingyan (Celine) Lin, Jason Cong, Yizhou Sun
Abstract: Graph neural networks (GNNs) are emerging for machine learning research on graph-structured data. GNNs achieve state-of-the-art performance on many tasks, but they face scalability challenges in real-world applications that involve large amounts of data and strict latency requirements. Many studies have been conducted on how to accelerate GNNs in an effort to address these challenges. These acceleration techniques touch on various aspects of the GNN pipeline, from smart training and inference algorithms to efficient systems and customized hardware. As the amount of research on GNN acceleration has grown rapidly, a systematic treatment that provides a unified view and addresses the complexity of relevant works is still lacking. In this survey, we provide a taxonomy of GNN acceleration, review the existing approaches, and suggest future research directions. Our taxonomic treatment of GNN acceleration connects the existing works and sets the stage for further development in this area.
Authors: Chong Xiang, Tong Wu, Zexuan Zhong, David Wagner, Danqi Chen, Prateek Mittal
Abstract: Retrieval-augmented generation (RAG) is susceptible to retrieval corruption attacks, where malicious passages injected into retrieval results can lead to inaccurate model responses. We propose RobustRAG, the first defense framework with certifiable robustness against retrieval corruption attacks. The key insight of RobustRAG is an isolate-then-aggregate strategy: we isolate passages into disjoint groups, generate LLM responses based on the concatenated passages from each isolated group, and then securely aggregate these responses for a robust output. To instantiate RobustRAG, we design keyword-based and decoding-based algorithms for securely aggregating unstructured text responses. Notably, RobustRAG achieves certifiable robustness: for certain queries in our evaluation datasets, we can formally certify non-trivial lower bounds on response quality -- even against an adaptive attacker with full knowledge of the defense and the ability to arbitrarily inject a bounded number of malicious passages. We evaluate RobustRAG on the tasks of open-domain question-answering and free-form long text generation and demonstrate its effectiveness across three datasets and three LLMs.
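The isolate-then-aggregate strategy described in this abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: `isolate_then_aggregate`, `answer_fn`, `group_size`, and `min_support` are hypothetical names, whitespace tokenization stands in for real keyword extraction, and `answer_fn` is a placeholder for an LLM call.

```python
from collections import Counter


def isolate_then_aggregate(passages, answer_fn, group_size=1, min_support=2):
    """Answer from disjoint passage groups, then keep keywords with enough support.

    answer_fn is a hypothetical stand-in for an LLM call: it maps the
    concatenated text of one group to a response string.
    """
    # Isolate: split the retrieved passages into disjoint groups.
    groups = [passages[i:i + group_size] for i in range(0, len(passages), group_size)]

    # Generate one response per isolated group and count, per keyword,
    # how many groups produced it (set() so each group votes at most once).
    counts = Counter()
    for group in groups:
        response = answer_fn(" ".join(group))
        counts.update(set(response.lower().split()))

    # Aggregate: keep only keywords supported by at least min_support groups.
    return sorted(word for word, c in counts.items() if c >= min_support)
```

Because each passage belongs to exactly one group, a single injected passage can add at most one vote to any keyword in this sketch, so it cannot by itself push a keyword past `min_support=2`.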
Authors: Jungeum Kim, Xiao Wang
Abstract: Nonlinear dimensional reduction with the manifold assumption, often called manifold learning, has proven its usefulness in a wide range of high-dimensional data analysis tasks. The significant impact of t-SNE and UMAP has catalyzed intense research interest, seeking further innovations toward visualizing not only the local but also the global structure information of the data. Moreover, there have been consistent efforts toward generalizable dimensional reduction that handles unseen data. In this paper, we first propose GLoMAP, a novel manifold learning method for dimensional reduction and high-dimensional data visualization. GLoMAP preserves locally and globally meaningful distance estimates and displays a progression from global to local formation during the course of optimization. Furthermore, we extend GLoMAP to its inductive version, iGLoMAP, which utilizes a deep neural network to map data to its lower-dimensional representation. This allows iGLoMAP to provide lower-dimensional embeddings for unseen points without needing to re-train the algorithm. iGLoMAP is also well-suited for mini-batch learning, enabling large-scale, accelerated gradient calculations. We have successfully applied both GLoMAP and iGLoMAP to simulated and real-data settings, with competitive experiments against the state-of-the-art methods.
Authors: Kulunu Dharmakeerthi, YoonHaeng Hur, Tengyuan Liang
Abstract: Practitioners often face the challenge of deploying prediction models in new environments with shifted distributions of covariates and responses. With observational data, such shifts are often driven by unobserved confounding, and can in fact alter the concept of which model is best. This paper studies distribution shifts in the domain adaptation problem with unobserved confounding. We postulate a linear structural causal model to account for endogeneity and unobserved confounding, and we leverage exogenous invariant covariate representations to cure concept shifts and improve target prediction. We propose a data-driven representation learning method that optimizes for a lower-dimensional linear subspace and a prediction model confined to that subspace. This method operates on a non-convex objective -- that interpolates between predictability and stability -- constrained to the Stiefel manifold, using an analog of projected gradient descent. We analyze the optimization landscape and prove that, provided sufficient regularization, nearly all local optima align with an invariant linear subspace resilient to distribution shifts. This method achieves a nearly ideal gap between target and source risk. We validate the method and theory with real-world data sets to illustrate the tradeoffs between predictability and stability.
Authors: Tiago F. Tavares, Fabio Ayres, Paris Smaragdis
Abstract: Representational similarity in neural networks is inherently scale-dependent, yet widely used metrics such as Centered Kernel Alignment (CKA) and Procrustes analysis provide only global scalar estimates. These scalars often fail to distinguish micro-scale geometric jitter (local noise) from macro-scale semantic reorganization, compressing multi-scale structural relationships into a single uninformative value. We introduce the Topological Alignment Spectrum (TAS), a multi-scale diagnostic tool that sweeps normalized mean Jaccard similarity over varying neighborhood sizes. By normalizing the metric over an analytically-derived expected range (from expected overlap under randomness to perfect alignment), TAS yields a dimension-invariant metric over a spectrum of scales, where one indicates perfect structural alignment, zero reflects chance-level agreement, and negative values signal active anti-alignment at specific scales. Experiments on synthetic point clouds demonstrate that TAS allows the recognition of distinct types of alignment perturbation: local jitter harms fine-grained neighborhoods but preserves cluster-level structure, while cluster-center shuffling preserves local similarity but disrupts global alignment -- phenomena that remain invisible or conflated under global, single-scalar metrics. Applying TAS to the MultiBERTs collection reveals that fine-tuning induces comprehensive topological reorganization across scales, challenging the view of task adaptation as merely conservative or localized. While models from different random seeds remain locally divergent, semantic clusters emerge as the dominant scale of alignment. TAS thus offers a granular, topology-aware alternative for diagnosing convergence and representational stability in deep networks.
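A chance-normalized neighborhood-overlap sweep of the kind this abstract describes can be sketched in Python as follows. The abstract does not give the exact normalization, so this sketch assumes the chance level is the exact expected Jaccard index of two independent random k-subsets; all function names here are hypothetical and this is not the authors' code.

```python
from math import comb

import numpy as np


def knn_sets(X, k):
    """Index sets of the k nearest neighbours (self excluded) for each row of X."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)
    return [set(row) for row in np.argsort(d, axis=1)[:, :k]]


def expected_random_jaccard(m, k):
    """Exact E[|A n B| / |A u B|] for two independent uniform k-subsets of m items."""
    total = comb(m, k)
    # The overlap size i follows a hypergeometric distribution.
    return sum(
        comb(k, i) * comb(m - k, k - i) / total * i / (2 * k - i)
        for i in range(max(0, 2 * k - m), k + 1)
    )


def alignment_spectrum(X, Y, ks):
    """Chance-normalized mean Jaccard overlap of k-NN sets, one value per k.

    Assumed normalization: 1 = identical neighbourhoods, 0 = chance level,
    negative = below chance. Requires k < len(X) - 1.
    """
    n = len(X)
    spectrum = {}
    for k in ks:
        A, B = knn_sets(X, k), knn_sets(Y, k)
        mean_j = np.mean([len(a & b) / len(a | b) for a, b in zip(A, B)])
        chance = expected_random_jaccard(n - 1, k)
        spectrum[k] = (mean_j - chance) / (1.0 - chance)
    return spectrum
```

Sweeping `ks` from small to large values moves the comparison from fine-grained neighborhoods toward cluster-level structure, which is the multi-scale reading the abstract motivates.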
Authors: Daniele Castellana, Filippo Maria Bianchi
Abstract: We introduce BN-Pool, the first clustering-based pooling method for Graph Neural Networks that adaptively determines the number of supernodes in a coarsened graph. BN-Pool leverages a generative model based on a Bayesian nonparametric framework for partitioning graph nodes into an unbounded number of clusters. During training, the node-to-cluster assignments are learned by combining the supervised loss of the downstream task with an unsupervised auxiliary term, which encourages the reconstruction of the original graph topology while penalizing unnecessary proliferation of clusters. By automatically discovering the optimal coarsening level for each graph, BN-Pool preserves the performance of soft-clustering pooling methods while avoiding their typical redundancy by learning compact pooled graphs. The code is available at https://github.com/NGMLGroup/Bayesian-Nonparametric-Graph-Pooling.
URLs: https://github.com/NGMLGroup/Bayesian-Nonparametric-Graph-Pooling
Authors: Sergio Calvo-Ordo\~nez, Jonathan Plenk, Richard Bergna, Alvaro Cartea, Jose Miguel Hernandez-Lobato, Konstantina Palla, Kamil Ciosek
Abstract: Performing gradient descent in a wide neural network is equivalent to computing the posterior mean of a Gaussian Process with the Neural Tangent Kernel (NTK-GP), for a specific prior mean and with zero observation noise. However, existing formulations have two limitations: (i) the NTK-GP assumes noiseless targets, leading to misspecification on noisy data; (ii) the equivalence does not extend to arbitrary prior means, which are essential for well-specified models. To address (i), we introduce a regularizer into the training objective, showing its correspondence to incorporating observation noise in the NTK-GP. To address (ii), we propose a \textit{shifted network} that enables arbitrary prior means and allows obtaining the posterior mean with gradient descent on a single network, without ensembling or kernel inversion. We validate our results with experiments across datasets and architectures, showing that this approach removes key obstacles to the practical use of NTK-GP equivalence in applied Gaussian process modeling.
Authors: Ethan Harvey, Mikhail Petrov, Michael C. Hughes
Abstract: When training large models on limited data, avoiding overfitting is paramount. Common grid search or smarter search methods rely on expensive separate runs for each candidate hyperparameter, while carving out a validation set that reduces available training data. In this paper, we study gradient-based learning of hyperparameters via the evidence lower bound (ELBO) objective from Bayesian variational methods. This avoids the need for any validation set. We focus on scenarios where the model is over-parameterized for flexibility and the approximate posterior is chosen to be Gaussian with isotropic covariance for tractability, even though it cannot match the true posterior. In such scenarios, we find the ELBO prioritizes posteriors that match the prior, leading to severe underfitting. Instead, we recommend a data-emphasized ELBO that upweights the likelihood but not the prior. In Bayesian transfer learning of image and text classifiers, our method reduces the 88+ hour grid search of past work to under 3 hours while delivering comparable accuracy. We further demonstrate how our approach enables efficient yet accurate approximations of Gaussian processes with learnable lengthscale kernels.
Authors: Yali Wei, Alan J. X. Guo, Zihui Yan, Yufan Dai, Wenjia Fan
Abstract: In recent years, widespread attention has been drawn to the challenge of correcting insertion, deletion, and substitution (IDS) errors in DNA-based data storage. Among various IDS-correcting codes, Varshamov-Tenengolts (VT) codes, originally designed for single-error correction, have been established as a central research focus. While existing decoding methods demonstrate high accuracy for single-error correction, they are typically not applicable to the correction of multiple IDS errors. In this work, the latent capability of VT codes for multiple-error correction is investigated through a statistic-enhanced Transformer-based VT decoder (VT-Former), utilizing both symbol and statistic feature embeddings. Experimental results demonstrate that VT-Former achieves nearly 100\% accuracy on correcting single errors. For multi-error decoding tasks across various codeword lengths, improvements in both frame accuracy and bit accuracy are observed, compared to conventional hard-decision and soft-in soft-out decoding algorithms. Furthermore, while lower decoding latency is exhibited by the base model compared to traditional soft decoders, the architecture is further optimized in this study to enhance decoding efficiency and reduce computational overhead.
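For background on the single-error setting mentioned in this abstract, the classical binary VT code can be sketched in a few lines of Python. This shows only the textbook checksum and a brute-force single-deletion decoder, not VT-Former or any DNA-specific (quaternary) variant; the function names are hypothetical.

```python
from itertools import product


def vt_syndrome(bits):
    """Weighted checksum sum(i * x_i) mod (n + 1), with positions counted from 1."""
    n = len(bits)
    return sum(i * b for i, b in enumerate(bits, start=1)) % (n + 1)


def vt_codebook(n, a=0):
    """All length-n binary words whose syndrome equals a, i.e. the code VT_a(n)."""
    return [w for w in product((0, 1), repeat=n) if vt_syndrome(w) == a]


def correct_single_deletion(received, n, a=0):
    """Brute-force decoder: re-insert one bit at every position, keep codewords.

    For a word obtained from a VT_a(n) codeword by one deletion, the returned
    set contains exactly that codeword, because single-deletion balls of
    distinct VT codewords do not overlap.
    """
    assert len(received) == n - 1
    candidates = set()
    for pos in range(n):
        for bit in (0, 1):
            word = tuple(received[:pos]) + (bit,) + tuple(received[pos:])
            if vt_syndrome(word) == a:
                candidates.add(word)
    return candidates
```

With two or more errors the candidate set is no longer guaranteed to identify the transmitted codeword, which is the regime the abstract targets with a learned decoder.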
Authors: Fateme Jamshidi, Mohammad Shahverdikondori, Negar Kiyavash
Abstract: We study multi-armed bandits under network interference, where each unit's reward depends on its own treatment and those of its neighbors in a given graph. This induces an exponentially large action space, making standard approaches computationally impractical. We propose a novel algorithm that uses the local graph structure to minimize regret. We derive a graph-dependent upper bound on cumulative regret that improves over prior work. Additionally, we provide the first lower bounds for bandits with arbitrary network interference, where each bound involves a distinct structural property of the graph. These bounds show that for both dense and sparse graphs, our algorithm is nearly optimal, with matching upper and lower bounds up to logarithmic factors. When the interference graph is unknown, a variant of our algorithm is Pareto optimal: no algorithm can uniformly outperform it across all instances. We complement our theoretical results with numerical experiments, showing that our approach outperforms the baseline methods.
Authors: Weizhen Wang, Jianping He, Xiaoming Duan
Abstract: Policy gradient methods are one of the most successful approaches for solving challenging reinforcement learning problems. Despite their empirical successes, many state-of-the-art policy gradient algorithms for discounted problems deviate from the theoretical policy gradient theorem due to the existence of a distribution mismatch. In this work, we analyze the impact of this mismatch on policy gradient methods. Specifically, we first show that in the case of tabular parameterizations, the biased gradient induced by the mismatch still yields a valid first-order characterization of global optimality. Then, we extend this analysis to more general parameterizations by deriving explicit bounds on both the state distribution mismatch and the resulting gradient mismatch in episodic and continuing MDPs, which are shown to vanish at least linearly as the discount factor approaches one. Building on these bounds, we further establish guarantees for the biased policy gradient iterates, showing that they approach approximate stationary points with respect to the exact gradient, with asymptotic residuals depending on the discount factor. Our findings offer insights into the robustness of policy gradient methods as well as the gap between theoretical foundations and practical implementations.
Authors: Leo Henry, Thomas Neele, Mohammad Reza Mousavi, Matteo Sammartino
Abstract: Active automata learning infers automaton models of systems from behavioral observations, a technique successfully applied to a wide range of domains. Compositional approaches have recently emerged to address scalability to concurrent systems. We take a significant step beyond available results, including those by the authors, and develop a general technique for compositional learning of a synchronizing parallel system with an unknown decomposition. Our approach automatically refines the global alphabet into component alphabets while learning the component models. We develop a theoretical treatment of distributions of alphabets, i.e., sets of possibly overlapping component alphabets, characterize counter-examples that reveal inconsistencies with global observations, and show how to systematically update the distribution to restore consistency. We extend $L^{\star}$ to handle partial and potentially spurious information arising when learning components from global observations only. We establish correctness and termination of the full algorithm. We provide an implementation, called CoalA, using the state-of-the-art active learning library LearnLib. Our experiments on more than 630 subject systems show that CoalA delivers up to five orders of magnitude fewer membership queries than monolithic learning, and achieves better scalability in equivalence queries on systems with significant concurrency.
Authors: Stepan Tretiakov, Xingjian Li, Krishna Kumar
Abstract: Most neural-operator surrogates for PDEs inherit from DeepONet-style formulations the requirement that the input function be sampled at a fixed, ordered set of sensors. This assumption limits applicability to problems with variable sensor layouts, missing data, point sources, and sample-based representations of densities. We propose SetONet, which addresses this gap by recasting the operator input as an unordered set of coordinate-value observations and encoding it with permutation-invariant aggregation inside a standard branch-trunk operator network while preserving the DeepONet synthesis mechanism and lightweight end-to-end training. A structured variant, SetONet-Key, aggregates sensor information through learnable query tokens and a position-only key pathway, thereby decoupling sampling geometry from sensor values. The method is assessed on four classical operator-learning benchmarks under fixed layouts, variable layouts, and evaluation-time sensor drop-off, and on four problems with inherently unstructured point-cloud inputs, including heat conduction with multiple point sources, advection-diffusion, phase-screen diffraction, and optimal transport problems. In parameter-matched studies, SetONet-Key achieves lower error than the DeepONet baseline on fixed-sensor benchmarks and remains reliable when layouts vary or sensors are dropped at evaluation. Comparisons across pooling rules show that attention-based aggregation is typically more robust than mean or sum pooling. On the point-cloud problems, SetONet operates directly on the native input representation, without rasterization or multi-stage preprocessing, and outperforms the larger VIDON baseline.
Authors: Carlos Rodriguez-Pardo, Leonardo Chiani, Emanuele Borgonovo, Massimo Tavoni
Abstract: We present a neural framework for learning conditional optimal transport (OT) maps between probability distributions. Our approach introduces a conditioning mechanism capable of processing both categorical and continuous conditioning variables simultaneously. At the core of our method lies a hypernetwork that generates transport layer parameters based on these inputs, creating adaptive mappings that outperform simpler conditioning methods. Comprehensive ablation studies demonstrate the superior performance of our method over baseline configurations. Furthermore, we showcase an application to global sensitivity analysis, offering high performance in computing OT-based sensitivity indices. This work advances the state-of-the-art in conditional optimal transport, enabling broader application of optimal transport principles to complex, high-dimensional domains such as generative modeling and black-box model explainability.
Authors: Leon Eshuijs, Archie Chaudhury, Alan McBeth, Ethan Nguyen
Abstract: LLM-as-a-judge is widely used as a scalable substitute for human evaluation, yet current approaches rely on black-box access and struggle to detect subtle dishonesty, such as sycophancy and manipulation. We introduce Judge Using Safety-Steered Alternatives (JUSSA), a framework that leverages a model's internal representations to optimize an honesty-promoting steering vector from a single training example, generating contrastive alternatives that give judges a reference point for detecting dishonesty. We test JUSSA on a novel manipulation benchmark with human-validated response pairs at varying dishonesty levels, finding AUROC improvements across both GPT-4.1 (0.893 $\to$ 0.946) and Claude Haiku (0.859 $\to$ 0.929) judges, though performance degrades when task complexity is mismatched to judge capability, suggesting contrastive evaluation helps most when the task is challenging but within the judge's reach. Layer-wise analysis further shows that steering is most effective in middle layers, where model representations begin to diverge between honest and dishonest prompt processing. Our work demonstrates that steering vectors can serve as tools for evaluation rather than for improving model outputs at inference, opening a new direction for thorough white-box auditing.
Authors: Safwan Labbi, Paul Mangold, Daniil Tiapkin, Eric Moulines
Abstract: We provide global convergence rates for vanilla and entropy-regularized federated softmax stochastic policy gradient (FedPG) with local training. We show that FedPG converges to a near-optimal policy in terms of the average agent value, with a gap controlled by the level of heterogeneity. Remarkably, we obtain the first convergence rates for entropy-regularized policy gradient with explicit constants, leveraging a projection-like operator. Our results build upon a new analysis of federated averaging for non-convex objectives, based on the observation that the {\L}ojasiewicz-type inequalities from the single-agent setting (Mei et al., 2020) do not hold for the federated objective. This uncovers a fundamental difference between single-agent and federated reinforcement learning: while single-agent optimal policies can be deterministic, federated objectives may inherently require stochastic policies.
Authors: Rafael Sojo, Javier D\'iaz-Rozo, Concha Bielza, Pedro Larra\~naga
Abstract: This paper introduces a new type of probabilistic semiparametric model that takes advantage of data binning to reduce the computational cost of kernel density estimation in nonparametric distributions. Two new conditional probability distributions are developed for the new binned semiparametric Bayesian networks, the sparse binned kernel density estimation and the Fourier kernel density estimation. These two probability distributions address the curse of dimensionality, which typically impacts binned models, by using sparse tensors and restricting the number of parent nodes in conditional probability calculations. To evaluate the proposal, we perform a complexity analysis and conduct several comparative experiments using synthetic data and datasets from the UCI Machine Learning repository. The experiments include different binning rules, parent restrictions, grid sizes, and number of instances to get a holistic view of the model's behavior. As a result, our binned semiparametric Bayesian networks achieve structural learning and log-likelihood estimations with no statistically significant differences compared to the semiparametric Bayesian networks, but at a much higher speed. Thus, the new binned semiparametric Bayesian networks prove to be a reliable and more efficient alternative to their non-binned counterparts.
Authors: Hanlin Dong, Arian Prabowo, Hao Xue, Ao Shuang, Tianyi Zhou, Flora D. Salim
Abstract: Forecasting over graph-structured sensor networks demands models that capture both deterministic spatial trends and stochastic variability, while remaining efficient enough for repeated inference as new observations arrive. We propose Double-Diffusion, a denoising diffusion probabilistic model that integrates a parameter-free graph diffusion Ordinary Differential Equation (ODE) forecast as a structural prior throughout the generative process. Unlike standard diffusion approaches that generate predictions from pure noise, Double-Diffusion uses the ODE prediction as both (1) a residual learning target in the forward process via the Resfusion framework, and (2) an explicit conditioning input for the reverse denoiser, shifting the generation task from full synthesis to guided refinement. This dual integration enables accelerated sampling by initializing from an intermediate diffusion step where the ODE prior is already close to the target distribution. We further introduce the Factored Spectral Denoiser (FSD), which adopts the divided attention principle to decompose spatio-temporal-channel modeling into three efficient axes: temporal self-attention, cross-channel attention, and spectral graph convolution via the Graph Fourier Transform. Extensive experiments on four real-world sensor-network datasets spanning two domains, urban air quality (Beijing, Athens) and traffic flow (PEMS08, PEMS04), demonstrate that Double-Diffusion achieves the best probabilistic calibration (CRPS) across all datasets while scaling sublinearly in inference time, achieving a 3.8x speedup compared to a standard diffusion model setup through a substantial reduction in required sampling steps.
Authors: Zhenpeng Su, Leiyu Pan, Xue Bai, Dening Liu, Guanting Dong, Jiaming Huang, Minxuan Lv, Wenping Hu, Fuzheng Zhang, Kun Gai, Guorui Zhou
Abstract: We present Klear-Reasoner, a model with long reasoning capabilities that demonstrates careful deliberation during problem solving, achieving outstanding performance across multiple benchmarks. Although there are already many excellent works related to inference models in the current community, there are still many problems with reproducing high-performance inference models due to incomplete disclosure of training details. This report provides an in-depth analysis of the reasoning model, covering the entire post-training workflow from data preparation and long Chain-of-Thought supervised fine-tuning (long CoT SFT) to reinforcement learning (RL), along with detailed ablation studies for each experimental component. For SFT data, our experiments show that a small number of high-quality data sources are more effective than a large number of diverse data sources, and that difficult samples can achieve better results without accuracy filtering. In addition, we investigate two key issues with current clipping mechanisms in RL: clipping suppresses critical exploration signals and ignores suboptimal trajectories. To address these challenges, we propose Gradient-Preserving clipping Policy Optimization (GPPO) that gently backpropagates gradients from clipped tokens. GPPO not only enhances the model's exploration capacity but also improves its efficiency in learning from negative samples. Klear-Reasoner exhibits exceptional reasoning abilities in mathematics and programming, scoring 90.5% on AIME 2024, 83.2% on AIME 2025, 66.0% on LiveCodeBench V5 and 58.1% on LiveCodeBench V6.
Authors: Muntasir Hoq, Griffin Pitts, Tirth Bhatt, Aum Pandya, Andrew Lan, Peter Brusilovsky, Bita Akram
Abstract: Personalized instruction aims to provide learners with support that adapts to their individual knowledge and progress toward learning objectives. Discovering and tracing Knowledge Components (KCs) is an important step in building accurate models of student learning. However, KC discovery in computer science education is challenging due to the open-ended nature of programming, wide variability in student solutions, and intertwined use of programming structures in code. We address these challenges with a pattern-based KC discovery method that uses a data-driven approach to define KCs as recurring structural patterns in student code that reveal persistent patterns of struggle and mastery in students' solutions. We then evaluate the discovered KCs using expert evaluation and statistical student modeling to demonstrate their effectiveness in capturing student learning and struggles. We propose a framework for modeling students' learning by deriving pattern-based KCs from student code through a three-stage process. First, an attention-based code representation model identifies Abstract Syntax Tree subtrees most relevant to code correctness. Second, a Variational Autoencoder abstracts these subtrees into a smooth latent space, capturing structural similarity across student submissions. Third, the resulting representations are clustered into pattern-based KCs. To assess the effectiveness of pattern-based KCs for modeling students' learning, we adapt the Deep Knowledge Tracing model to incorporate these KCs, demonstrating significant improvements in predictive performance over baseline KT methods. Additionally, the learning curve analysis showed alignment between the derived KCs and learning theory.
Authors: Mohammad Taha Shah, Sabrina Khurshid, Gourab Ghatak
Abstract: In this paper, we study sequential decision-making for maximizing the Sharpe ratio (SR) in a stochastic multi-armed bandit (MAB) setting. Unlike standard bandit formulations that maximize cumulative reward, SR optimization requires balancing expected return and reward variability. As a result, the learning objective depends jointly on the mean and variance of the reward distribution and takes a fractional form. To address this problem, we propose Sharpe Ratio Thompson Sampling (\texttt{SRTS}), a Bayesian algorithm for risk-adjusted exploration. For Gaussian reward models, the algorithm employs a Normal-Gamma conjugate posterior to capture uncertainty in both the mean and the precision of each arm. In contrast to additive mean-variance (MV) formulations, which often require different algorithms across risk regimes, the fractional SR objective yields a single sampling rule that applies uniformly across risk tolerances. On the theoretical side, we develop a regret decomposition tailored to the SR objective and introduce a decoupling approach that separates the contributions of mean and variance uncertainty. This framework allows us to control the interaction between the Gaussian mean samples and the Gamma precision samples arising in the posterior. Using these results, we establish a finite-time distribution-dependent $\mathcal{O}(\log n)$ upper bound on the expected regret. We further derive a matching information-theoretic lower bound using a change-of-measure argument, showing that the proposed algorithm is order-optimal. Finally, experiments on synthetic bandit environments illustrate the performance of \texttt{SRTS} and demonstrate improvements over existing risk-aware bandit algorithms across a range of risk-return settings.
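The sampling rule described in this abstract can be illustrated with a minimal sketch. It assumes the textbook Normal-Gamma conjugate update for a Gaussian with unknown mean and precision, and it scores each arm by the sampled ratio of mean to standard deviation. The class name `NormalGammaArm`, the function `run_bandit`, and the prior hyperparameters are hypothetical choices made for illustration; the paper's exact algorithm, priors, and tie-breaking are not given in the abstract and may differ.

```python
import math
import random


class NormalGammaArm:
    """Normal-Gamma posterior over (mean, precision) of one Gaussian arm.

    Hypothetical helper: the default hyperparameters are illustrative only.
    """

    def __init__(self, mu0=0.0, kappa0=1.0, alpha0=1.0, beta0=1.0):
        self.mu0, self.kappa0 = mu0, kappa0
        self.alpha0, self.beta0 = alpha0, beta0
        self.n = 0        # number of observed rewards
        self.mean = 0.0   # running sample mean
        self.m2 = 0.0     # running sum of squared deviations (Welford)

    def update(self, x):
        # Welford's online update of the sample mean and squared deviations.
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def posterior(self):
        # Standard conjugate update for the Normal-Gamma family.
        n = self.n
        kappa_n = self.kappa0 + n
        mu_n = (self.kappa0 * self.mu0 + n * self.mean) / kappa_n
        alpha_n = self.alpha0 + n / 2.0
        beta_n = (self.beta0 + 0.5 * self.m2
                  + self.kappa0 * n * (self.mean - self.mu0) ** 2 / (2.0 * kappa_n))
        return mu_n, kappa_n, alpha_n, beta_n

    def sample_sharpe(self, rng):
        mu_n, kappa_n, alpha_n, beta_n = self.posterior()
        # random.gammavariate takes (shape, scale), so the rate beta_n is inverted.
        tau = max(rng.gammavariate(alpha_n, 1.0 / beta_n), 1e-12)
        mu = rng.gauss(mu_n, 1.0 / math.sqrt(kappa_n * tau))
        # Sampled Sharpe ratio = mean / std = mean * sqrt(precision).
        return mu * math.sqrt(tau)


def run_bandit(true_arms, horizon, seed=0):
    """Pull the arm with the largest sampled Sharpe ratio at every round.

    true_arms is a list of (mean, std) pairs for the simulated Gaussian arms.
    """
    rng = random.Random(seed)
    arms = [NormalGammaArm() for _ in true_arms]
    pulls = [0] * len(true_arms)
    for _ in range(horizon):
        k = max(range(len(arms)), key=lambda i: arms[i].sample_sharpe(rng))
        mean, std = true_arms[k]
        arms[k].update(rng.gauss(mean, std))
        pulls[k] += 1
    return pulls
```

Calling `run_bandit([(1.0, 1.0), (1.0, 0.2)], horizon=2000)` simulates two arms with equal mean but different variability and returns the pull count per arm. Because the score is a ratio rather than a weighted sum of mean and variance, the sketch has no risk-tolerance parameter to tune.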
Authors: Sima Najafzadehkhoei, George Vega Yon, Derek S. Meyer, Bernardo Modenesi
Abstract: Agent-based models (ABMs) are widely used to study infectious disease dynamics, but their calibration is often computationally intensive, limiting their applicability in time-sensitive public health settings. We propose DeepIMC (Deep Inverse Mapping Calibration), a machine learning-based calibration framework that directly learns the inverse mapping from epidemic time series to epidemiological parameters. DeepIMC trains a bidirectional Long Short-Term Memory (BiLSTM) neural network on synthetic epidemic trajectories generated from agent-based models such as the Susceptible-Infected-Recovered (SIR) model, enabling rapid parameter estimation without repeated simulation at inference time. We evaluate DeepIMC through an extensive simulation study comprising 5,000 heterogeneous epidemic scenarios and benchmark its performance against Approximate Bayesian Computation (ABC) using likelihood-free Markov Chain Monte Carlo. The results show that DeepIMC substantially improves parameter recovery accuracy, produces sharp and well-calibrated predictive intervals, and reduces computational time by more than an order of magnitude relative to ABC. Although structural parameter identifiability constraints limit the precise recovery of all model parameters simultaneously, the calibrated models reliably reproduce epidemic trajectories and support accurate forward prediction with their estimated parameters. DeepIMC is implemented in the open-source R package epiworldRCalibrate, facilitating practical adoption for real-time epidemic modeling and policy analysis. Overall, our findings demonstrate that DeepIMC provides a scalable, operationally effective alternative to traditional simulation-based calibration methods for agent-based epidemic models.
Authors: Robiul Islam, Dmitry I. Ignatov, Karl Kaberg, Roman Nabatchikov
Abstract: This study investigates the performance of classifiers across EEG frequency bands, evaluating efficient class prediction for the left and right hemispheres using various optimisers. Three neural network architectures (a deep dense network, a shallow three-layer network, and a convolutional neural network (CNN)) are implemented and compared using the TensorFlow and PyTorch frameworks. Adagrad and RMSprop optimisers consistently outperformed others across frequency bands, with Adagrad excelling in the beta band and RMSprop achieving superior performance in the gamma band. Classical machine learning methods (Linear SVM and Random Forest) achieved perfect classification with 50--100 times faster training times than deep learning models. However, in neurofeedback simulations with real-time performance requirements, the deep neural network demonstrated superior feedback-signal generation (a 44.7% regulation rate versus 0% for classical methods). SHAP analysis reveals the nuanced contributions of EEG frequency bands to model decisions. Overall, the study highlights the importance of selecting a model according to the task: classical methods for efficient offline classification and deep learning for adaptive, real-time neurofeedback applications.
Authors: Zelong Bi, Pierre Lafaye de Micheaux
Abstract: The manifold hypothesis suggests that high-dimensional data often lie on or near a low-dimensional manifold. Estimating the dimension of this manifold is essential for leveraging its structure, yet existing work on dimension estimation is fragmented and lacks systematic evaluation. This article provides a comprehensive survey for both researchers and practitioners. We review often-overlooked theoretical foundations and present eight representative estimators. Through controlled experiments, we analyze how individual factors, such as noise, curvature, and sample size, affect performance. We also compare the estimators on diverse synthetic and real-world datasets, introducing a principled approach to dataset-specific hyperparameter tuning. Our results offer practical guidance for estimator selection and yield insights that will inform future estimator design.
Authors: David Arbour, Harsh Parikh, Bijan Niknam, Elizabeth Stuart, Kara Rudolph, Avi Feller
Abstract: Many common estimators in machine learning and causal inference are linear smoothers, where the prediction is a weighted average of the training outcomes. Some estimators, such as ordinary least squares and kernel ridge regression, allow for arbitrarily negative weights, which improve feature imbalance but often at the cost of increased dependence on parametric modeling assumptions and higher variance. By contrast, estimators like importance weighting and random forests (sometimes implicitly) restrict weights to be non-negative, reducing dependence on parametric modeling and variance at the cost of worse imbalance. In this paper, we propose a unified framework that directly penalizes the level of extrapolation, replacing the current practice of a hard non-negativity constraint with a soft constraint and corresponding hyperparameter. We derive a worst-case extrapolation error bound and introduce a novel "bias-bias-variance" tradeoff, encompassing biases due to feature imbalance, model misspecification, and estimator variance; this tradeoff is especially pronounced in high dimensions, particularly when positivity is poor. We then develop an optimization procedure that regularizes this bound while minimizing imbalance and outline how to use this approach as a sensitivity analysis for dependence on parametric modeling assumptions. We demonstrate the effectiveness of our approach through synthetic experiments and a real-world application, involving the generalization of randomized controlled trial estimates to a target population of interest.
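The claim that ordinary least squares is a linear smoother with possibly negative weights can be checked by hand in one dimension. The sketch below assumes simple linear regression with an intercept, for which the prediction at a query point $x_0$ is $\sum_i w_i y_i$ with $w_i = 1/n + (x_0 - \bar{x})(x_i - \bar{x})/S_{xx}$. The helper `negative_mass` is a hypothetical summary of how much weight is negative; it is not the extrapolation penalty proposed in the paper, whose form the abstract does not state.

```python
def ols_weights(xs, x0):
    """Weights w_i such that the OLS prediction at x0 equals sum(w_i * y_i).

    Assumes simple linear regression with an intercept on 1-D inputs.
    """
    n = len(xs)
    xbar = sum(xs) / n
    sxx = sum((x - xbar) ** 2 for x in xs)
    return [1.0 / n + (x0 - xbar) * (x - xbar) / sxx for x in xs]


def predict(xs, ys, x0):
    """OLS prediction written explicitly as a weighted sum of training outcomes."""
    return sum(w * y for w, y in zip(ols_weights(xs, x0), ys))


def negative_mass(weights):
    """Total magnitude of the negative weights (illustrative measure only)."""
    return sum(-w for w in weights if w < 0.0)


if __name__ == "__main__":
    xs = [0.0, 1.0, 2.0, 3.0]
    ys = [2.0 * x + 1.0 for x in xs]
    for x0 in (1.5, 6.0):
        w = ols_weights(xs, x0)
        print(x0, [round(v, 3) for v in w], round(negative_mass(w), 3), predict(xs, ys, x0))
```

At a query inside the training range the weights are all non-negative, while a query far outside the range forces some weights below zero even though they still sum to one. That is the extrapolation behaviour the abstract contrasts with non-negative weighting schemes.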
Authors: Jubayer Ibn Hamid, Ifdita Hasan Orney, Ellen Xu, Chelsea Finn, Dorsa Sadigh
Abstract: Reinforcement learning fine-tuning (RLFT) is a dominant paradigm for improving pretrained policies for downstream tasks. These pretrained policies, trained on large datasets, produce generations with a broad range of promising but unrefined behaviors. Often, a critical failure mode of RLFT arises when policies lose this diversity and collapse into a handful of easily exploitable outputs. This convergence hinders exploration, which is essential for expanding the capabilities of the pretrained policy and for amplifying the benefits of test-time compute scaling. To address this, we introduce an objective for policy gradient methods that explicitly enforces the exploration and refinement of diverse generations, which we call a polychromic objective. We then show how proximal policy optimization (PPO) can be adapted to optimize this objective. Our method (1) employs vine sampling to collect on-policy rollouts and (2) modifies the advantage function to reflect the advantage under our new objective. Experiments on BabyAI, Minigrid, and Algorithmic Creativity show that our method improves success rates by reliably solving a larger set of environment configurations and generalizes better under large perturbations. Moreover, when given multiple attempts in pass@$k$ experiments, the policy achieves substantially higher coverage, demonstrating its ability to maintain and exploit a diverse repertoire of strategies.
Authors: Aleksandar Armacki, Ali H. Sayed
Abstract: Convergence in high-probability (HP) has attracted increasing interest, due to implying exponentially decaying tail bounds and strong guarantees for individual runs of an algorithm. While many works study HP guarantees in centralized settings, much less is understood in the decentralized setup, where existing works require strong assumptions, like uniformly bounded gradients, or asymptotically vanishing noise. This results in a significant gap between the assumptions used to establish convergence in the HP and the mean-squared error (MSE) sense, and is also contrary to centralized settings, where it is known that $\mathtt{SGD}$ converges in HP under the same conditions on the cost function as needed for MSE convergence. Motivated by these observations, we study the HP convergence of Decentralized $\mathtt{SGD}$ ($\mathtt{DSGD}$) in the presence of light-tailed noise, providing several strong results. First, we show that $\mathtt{DSGD}$ converges in HP under the same conditions on the cost as in the MSE sense, removing the restrictive assumptions used in prior works. Second, our sharp analysis yields order-optimal rates for both non-convex and strongly convex costs. Third, we establish a linear speed-up in the number of users, leading to matching, or strictly better transient times than those obtained from MSE results, further underlining the tightness of our analysis. To the best of our knowledge, this is the first work that shows $\mathtt{DSGD}$ achieves a linear speed-up in the HP sense. Our relaxed assumptions and sharp rates stem from several technical results of independent interest, including a result on the variance-reduction effect of decentralized methods in the HP sense, as well as a novel bound on the MGF of strongly convex costs, which is of interest even in centralized settings. Finally, we provide experiments that validate our theory.
Authors: Jacek Karwowski, Raymond Douglas
Abstract: We investigate mathematically the notion of incoherence: a structural issue with reinforcement learning policies derived by naive goal-conditioning of autoregressive models. We focus on the process of re-training models on their own actions, that is, fine-tuning offline-learned policies with online RL. We prove that it decreases incoherence and leads to an improvement in return, and we aim to characterize the resulting trajectory of policies. By re-framing standard notions of control-as-inference and soft Q learning, we establish a three-way correspondence with two other ways of understanding the iterative re-training process: as folding the posterior into the reward and, in the deterministic case, as decreasing the temperature parameter; the correspondence has computational content via the training-inference trade-off. Through soft-conditioning generative models, we discuss the link between incoherence and the effective horizon.
Authors: Ayoub Ghriss
Abstract: Probabilistic relaxations of graph cuts offer a differentiable alternative to spectral clustering, enabling end-to-end and online learning without eigendecompositions, yet prior work centered on RatioCut and lacked general guarantees and principled gradients. We present a unified probabilistic framework that covers a wide class of cuts, including Normalized Cut. Our framework provides tight analytic upper bounds on expected discrete cuts via integral representations and Gauss hypergeometric functions with closed-form forward and backward. Together, these results deliver a rigorous, numerically stable foundation for scalable, differentiable graph partitioning covering a wide range of clustering and contrastive learning objectives.
Authors: S Sairam, Sara Girdhar, Shivam Soni
Abstract: The R-Learner is a powerful, theoretically-grounded framework for estimating heterogeneous treatment effects, prized for its robustness to nuisance model errors. However, its application to network data, where causal heterogeneity is often graph-dependent, presents a critical challenge to its core assumption of a well-specified final-stage model. In this paper, we conduct a large-scale empirical study to systematically dissect the R-Learner framework on graphs. We provide the first rigorous evidence that the primary driver of performance is the inductive bias of the final-stage CATE estimator, an effect that dominates the choice of nuisance models. Our central finding is the quantification of a catastrophic "representation bottleneck": we prove with overwhelming statistical significance (p < 0.001) that R-Learners with a graph-blind final stage fail completely (MSE > 4.0), even when paired with powerful GNN nuisance models. Conversely, our proposed end-to-end Graph R-Learner succeeds and significantly outperforms a strong, non-DML GNN T-Learner baseline. Furthermore, we identify and provide a mechanistic explanation for a subtle, topology-dependent "nuisance bottleneck," linking it to GNN over-squashing via a targeted "Hub-Periphery Trade-off" analysis. Our findings are validated across diverse synthetic and semi-synthetic benchmarks. We release our code as a reproducible benchmark to facilitate future research on this critical "final-stage bottleneck."
Authors: Giovanni Conforti, Alain Durmus, Le-Tuyet-Nhi Pham, Gael Raoul
Abstract: Diffusion models for continuous state spaces based on Gaussian noising processes are now relatively well understood from both practical and theoretical perspectives. In contrast, results for diffusion models on discrete state spaces remain far less explored and pose significant challenges, particularly due to their combinatorial structure and their more recent introduction in generative modelling. In this work, we establish new and sharp convergence guarantees for three popular discrete diffusion models (DDMs). Two of these models are designed for finite state spaces and are based respectively on the random walk and the masking process. The third DDM we consider is defined on the countably infinite space $\mathbb{N}^d$ and uses a drifted random walk as its forward process. For each of these models, the backward process can be characterized by a discrete score function that can, in principle, be estimated. However, even with perfect access to these scores, simulating the exact backward process is infeasible, and one must rely on time discretization. In this work, we study Euler-type approximations and establish convergence bounds in both Kullback-Leibler divergence and total variation distance for the resulting models, under minimal assumptions on the data distribution. To the best of our knowledge, this study provides the optimal non-asymptotic convergence guarantees for these noising processes that do not rely on boundedness assumptions on the estimated score. In particular, the computational complexity of each method scales only linearly in the dimension, up to logarithmic factors.
Authors: Takuya Kanazawa
Abstract: Quantifying predictive uncertainty is essential for safe and trustworthy real-world AI deployment. Yet, fully nonparametric estimation of conditional distributions remains challenging for multivariate targets. We propose Tomographic Quantile Forests (TQF), a nonparametric, uncertainty-aware, tree-based regression model for multivariate targets. TQF learns conditional quantiles of directional projections $\mathbf{n}^{\top}\mathbf{y}$ as functions of the input $\mathbf{x}$ and the unit direction $\mathbf{n}$. At inference, it aggregates quantiles across many directions and reconstructs the multivariate conditional distribution by minimizing the sliced Wasserstein distance via an efficient alternating scheme with convex subproblems. Unlike classical directional-quantile approaches that typically produce only convex quantile regions and require training separate models for different directions, TQF covers all directions with a single model without imposing convexity restrictions. We evaluate TQF on synthetic and real-world datasets, and release the source code on GitHub.
Authors: Beicheng Lou, Zifei Xu, Vivian W. H. Wong
Abstract: Rotary positional embedding has become the state-of-the-art approach to encode position information in transformer-based models. While it is often succinctly expressed in complex linear algebra, we note that the actual implementation of $Q/K/V$-projections is not equivalent to a complex linear transformation. We argue that complex linear transformation is a more natural parametrization and saves nearly 50\% of the parameters within the attention block. We show empirically that removing such redundancy has negligible impact on the model performance. Our modification achieves more efficient parameter usage, as well as a cleaner interpretation of the representation space.
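The parameter saving can be made concrete with a small counting sketch. It relies on the standard fact that multiplying by a complex number $a + bi$ acts on the pair (real part, imaginary part) as the real matrix $[[a, -b], [b, a]]$, so a complex linear map on $m$ complex coordinates is a structured real $2m \times 2m$ matrix with $2m^2$ free real parameters instead of $4m^2$. The function names and the interleaved (re, im) layout are assumptions made for illustration and are not taken from the paper's implementation.

```python
def complex_to_real_matrix(c):
    """Expand an m x m complex matrix into its equivalent 2m x 2m real matrix.

    Each complex entry a + bi becomes the 2 x 2 block [[a, -b], [b, a]],
    acting on coordinates stored as interleaved (re, im) pairs.
    """
    m = len(c)
    r = [[0.0] * (2 * m) for _ in range(2 * m)]
    for i in range(m):
        for j in range(m):
            a, b = c[i][j].real, c[i][j].imag
            r[2 * i][2 * j], r[2 * i][2 * j + 1] = a, -b
            r[2 * i + 1][2 * j], r[2 * i + 1][2 * j + 1] = b, a
    return r


def to_real_vector(z):
    """Interleave real and imaginary parts: [re0, im0, re1, im1, ...]."""
    out = []
    for v in z:
        out.extend([v.real, v.imag])
    return out


def real_matvec(r, x):
    return [sum(rij * xj for rij, xj in zip(row, x)) for row in r]


def complex_matvec(c, z):
    return [sum(cij * zj for cij, zj in zip(row, z)) for row in c]


def parameter_counts(d):
    """Real parameters of a dense real d x d map versus a complex (d/2) x (d/2) map."""
    m = d // 2
    return d * d, 2 * m * m


if __name__ == "__main__":
    c = [[1 + 1j, 2 + 0j], [0 - 1j, 0.5 + 0j]]
    z = [1 + 2j, 3 - 1j]
    print(to_real_vector(complex_matvec(c, z)))
    print(real_matvec(complex_to_real_matrix(c), to_real_vector(z)))
    print(parameter_counts(4))
```

The two printed vectors agree, which shows that the constrained real matrix and the complex map are the same linear transformation, and the counts show the factor of two between an unconstrained real projection and a complex one of matching width.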
Authors: Kevin Zhang, Yixin Wang
Abstract: Probabilistic graphical models (PGMs) are widely used to discover latent structure in data, but their success hinges on selecting an appropriate model design. In practice, model specification is difficult and often requires iterative trial-and-error. This challenge arises because classical PGMs typically operate on individual datasets. In this work, we consider settings involving collections of related datasets and propose meta-probabilistic modeling (MPM) to learn the generative model structure itself. MPM uses a hierarchical formulation in which global components encode shared patterns across datasets, while local parameters capture dataset-specific latent structure. For scalable learning and inference, we derive a tractable VAE-inspired surrogate objective together with a bi-level optimization algorithm. Our methodology supports a broad class of expressive probabilistic models and has connections to existing architectures, such as Slot Attention. Experiments on object-centric representation learning and sequential text modeling demonstrate that MPM effectively adapts generative models to data while recovering meaningful latent representations.
Authors: Safwan Labbi, Daniil Tiapkin, Paul Mangold, Eric Moulines
Abstract: Policy gradient methods are known to be highly sensitive to the choice of policy parameterization. In particular, the widely used softmax parameterization can induce ill-conditioned optimization landscapes and lead to exponentially slow convergence. Although this can be mitigated by preconditioning, this solution is often computationally expensive. Instead, we propose replacing the softmax with an alternative family of policy parameterizations based on the generalized f-softargmax. We further advocate coupling this parameterization with a regularizer induced by the same f-divergence, which improves the optimization landscape and ensures that the resulting regularized objective satisfies a Polyak-Lojasiewicz inequality. Leveraging this structure, we establish the first explicit non-asymptotic last-iterate convergence guarantees for stochastic policy gradient methods for finite MDPs without any form of preconditioning. We also derive sample-complexity bounds for the unregularized problem and show that f-PG with Tsallis divergences achieves polynomial sample complexity, in contrast to the exponential complexity incurred by the standard softmax parameterization.
Authors: Nghia Thu Truong, Qui Phu Pham, Quang Nguyen, Dung Luong, Mai Tran
Abstract: Partial Optimal Transport (POT) addresses the problem of transporting only a fraction of the total mass between two distributions, making it suitable when marginals have unequal size or contain outliers. While Sinkhorn-based methods are widely used, their complexity bounds for POT remain suboptimal and can limit scalability. We introduce Accelerated Sinkhorn for POT (ASPOT), which integrates alternating minimization with Nesterov-style acceleration in the POT setting, yielding a complexity of $\mathcal{O}(n^{7/3}\varepsilon^{-5/3})$. We also show that an informed choice of the entropic parameter $\gamma$ improves rates for the classical Sinkhorn method. Experiments on real-world applications validate our theories and demonstrate the favorable performance of our proposed methods.
Authors: Anderson de Andrade, Alon Harell, Ivan V. Baji\'c
Abstract: Many computer vision tasks share substantial overlapping information, yet conventional codecs tend to ignore this, leading to redundant and inefficient representations. The Gray-Wyner network, a classical concept from information theory, offers a principled framework for separating common and task-specific information. Inspired by this idea, we develop a learnable three-channel codec that disentangles shared information from task-specific details across multiple vision tasks. We characterize the limits of this approach through the notion of lossy common information, and propose an optimization objective that balances inherent tradeoffs in learning such representations. Through comparisons of three codec architectures on two-task scenarios spanning six vision benchmarks, we demonstrate that our approach substantially reduces redundancy and consistently outperforms independent coding. These results highlight the practical value of revisiting Gray-Wyner theory in modern machine learning contexts, bridging classic information theory with task-driven representation learning.
Authors: Yuze Wang, Yujia Tong, Xuan Liu, Junhao Dong
Abstract: Large Language Models (LLMs) inevitably memorize sensitive information during training, posing significant privacy risks. Machine unlearning has emerged as a promising solution to selectively remove such information without full retraining. However, existing methods are designed for dense models and overlook model sparsification, an essential technique for efficient LLM deployment. We find that unlearning effectiveness degrades substantially on sparse models. Through empirical analysis, we reveal that this degradation occurs because existing unlearning methods require updating all parameters, yet sparsification prunes substantial weights to zero, fundamentally limiting the model's forgetting capacity. To address this challenge, we propose Sparsity-Aware Unlearning (SAU), which decouples unlearning from sparsification objectives through gradient masking that redirects updates to surviving weights, combined with importance-aware redistribution to compensate for pruned parameters. Extensive experiments demonstrate that SAU significantly outperforms existing methods on sparse LLMs, achieving effective forgetting while preserving model utility.
Authors: Sohan Venkatesh, Ashish Mahendran Kurapath
Abstract: Activation steering methods are widely used to control large language model (LLM) behavior and are often interpreted as revealing meaningful internal representations. This interpretation assumes that steering directions are identifiable and uniquely recoverable from input-output behavior. We show that, under white-box single-layer access, steering vectors are fundamentally non-identifiable due to large equivalence classes of behaviorally indistinguishable interventions. Empirically, we find that orthogonal perturbations achieve near-equivalent efficacy with negligible effect sizes across multiple models and traits, with pre-trained semantic classifiers confirming equivalence at the output level. We estimate null-space dimensionality via SVD of activation covariance matrices and validate that equivalence holds robustly throughout the operationally relevant steering range. Critically, we show that non-identifiability is a robust geometric property that persists across diverse prompt distributions. These findings reveal fundamental interpretability limits and highlight the need for structural constraints beyond behavioral testing to enable reliable alignment interventions.
Authors: Isaac Han, Sangyeon Park, Seungwon Oh, Donghu Kim, Hojoon Lee, Kyung-Joong Kim
Abstract: Deep neural networks trained on nonstationary data must balance stability (i.e., retaining prior knowledge) and plasticity (i.e., adapting to new tasks). Standard reinitialization methods, which reinitialize weights toward their original values, are widely used but difficult to tune: conservative reinitializations fail to restore plasticity, while aggressive ones erase useful knowledge. We propose FIRE, a principled reinitialization method that explicitly balances the stability-plasticity tradeoff. FIRE quantifies stability through Squared Frobenius Error (SFE), measuring proximity to past weights, and plasticity through Deviation from Isometry (DfI), reflecting weight isotropy. The reinitialization point is obtained by solving a constrained optimization problem, minimizing SFE subject to DfI being zero, which is efficiently approximated by Newton-Schulz iteration. FIRE is evaluated on continual visual learning (CIFAR-10 with ResNet-18), language modeling (OpenWebText with GPT-0.1B), and reinforcement learning (HumanoidBench with SAC and Atari games with DQN). Across all domains, FIRE consistently outperforms both naive training without intervention and standard reinitialization methods, demonstrating effective balancing of the stability-plasticity tradeoff.
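The constrained problem in this abstract has a classical counterpart that is easy to sketch. Assuming Deviation from Isometry is measured as $\|W^\top W - I\|_F^2$ (the abstract does not give the formula, so this definition and any scaling of the identity are assumptions), the closest matrix with zero deviation in Frobenius distance is the orthogonal polar factor of the current weights. The cubic Newton-Schulz iteration below approximates that factor using only matrix products. It is a plain-Python illustration for small square full-rank matrices, not the paper's implementation.

```python
import math


def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(a):
    return [list(col) for col in zip(*a)]


def frobenius_norm(a):
    return math.sqrt(sum(v * v for row in a for v in row))


def newton_schulz_polar(w, steps=30):
    """Approximate the orthogonal polar factor of a full-rank square matrix w.

    Dividing by the Frobenius norm puts every singular value in (0, 1], where
    the map s -> 1.5 * s - 0.5 * s**3 drives each singular value toward 1.
    """
    norm = frobenius_norm(w)
    x = [[v / norm for v in row] for row in w]
    for _ in range(steps):
        xxtx = matmul(matmul(x, transpose(x)), x)
        x = [[1.5 * a - 0.5 * b for a, b in zip(row_x, row_c)]
             for row_x, row_c in zip(x, xxtx)]
    return x


def deviation_from_isometry(w):
    """Assumed definition: squared Frobenius norm of (w^T w - I)."""
    g = matmul(transpose(w), w)
    n = len(g)
    return sum((g[i][j] - (1.0 if i == j else 0.0)) ** 2
               for i in range(n) for j in range(n))


def squared_frobenius_error(w, w_ref):
    """Squared Frobenius distance between w and the reference weights w_ref."""
    return sum((a - b) ** 2 for ra, rb in zip(w, w_ref) for a, b in zip(ra, rb))


if __name__ == "__main__":
    w = [[0.0, -2.0], [2.0, 0.0]]  # a rotation scaled by 2
    q = newton_schulz_polar(w)
    print(q, deviation_from_isometry(q), squared_frobenius_error(q, w))
```

For a scaled rotation the iteration recovers the rotation itself, so the direction of the weights is kept while the singular values are reset to one. Ill-conditioned matrices need more steps, because small singular values grow by a factor of at most 1.5 per iteration.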
Authors: Deyi Kong, Zaiwei Chen, Shuzhong Zhang, Shancong Mou
Abstract: In this work, we propose Natural Hypergradient Descent (NHGD), a new method for solving bilevel optimization problems. To address the computational bottleneck in hypergradient estimation--namely, the need to compute or approximate the Hessian inverse--we exploit the statistical structure of the inner optimization problem and use the empirical Fisher information matrix as an asymptotically consistent surrogate for the Hessian. This design enables a parallel optimize-and-approximate framework in which the Hessian-inverse approximation is updated synchronously with the stochastic inner optimization, reusing gradient information at negligible additional cost. Our main theoretical contribution establishes high-probability error bounds and sample complexity guarantees for NHGD that match those of state-of-the-art optimize-then-approximate methods, while significantly reducing computational time overhead. Empirical evaluations on representative bilevel learning tasks further demonstrate the practical advantages of NHGD, highlighting its scalability and effectiveness in large-scale machine learning settings.
Authors: Tatsuya Sagawa, Ryosuke Kojima
Abstract: Chemical Language Models (CLMs) pre-trained on large-scale molecular data are widely used for molecular property prediction. However, the common belief that increasing training resources such as model size, dataset size, and training compute improves both pretraining loss and downstream task performance has not been systematically validated in the chemical domain. In this work, we evaluate this assumption by pretraining CLMs while scaling training resources and measuring transfer performance across diverse molecular property prediction (MPP) tasks. We find that while pretraining loss consistently decreases with increased training resources, downstream task performance shows limited improvement. Moreover, alternative metrics based on the Hessian or loss landscape also fail to estimate downstream performance in CLMs. We further identify conditions under which downstream performance saturates or degrades despite continued improvements in pretraining metrics, and analyze the underlying task-dependent failure modes through parameter space visualizations. These results expose a gap between pretraining-based evaluation and downstream performance, and emphasize the need for model selection and evaluation strategies that explicitly account for downstream task characteristics.
Authors: Gurjeet Sangra Singh, Frantzeska Lavda, Giangiacomo Mercatali, Alexandros Kalousis
Abstract: Deep generative models such as flow matching and diffusion models have shown great potential in learning complex distributions and dynamical systems, but often act as black-boxes, neglecting underlying physics. In contrast, physics-based simulation models described by ODEs/PDEs remain interpretable, but may have missing or unknown terms, unable to fully describe real-world observations. We bridge this gap with a novel grey-box method that integrates incomplete physics models directly into generative models. Our approach learns dynamics from observational trajectories alone, without ground-truth physics parameters, in a simulation-free manner that avoids scalability and stability issues of Neural ODEs. The core of our method lies in modelling a structured variational distribution within the flow matching framework, by using two latent encodings: one to model the missing stochasticity and multi-modal velocity, and a second to encode physics parameters as a latent variable with a physics-informed prior. Furthermore, we present an adaptation of the framework to handle second-order dynamics. Our experiments on representative ODE/PDE problems and real-world weather forecasting demonstrate that our method performs on par with or superior to fully data-driven approaches and previous grey-box baselines, while preserving the interpretability of the physics model. Our code is available at https://github.com/DMML-Geneva/VGB-DM.
Authors: Adrian Garcia-Castañeda, Jon Irureta, Jon Imaz, Aizea Lojo
Abstract: Class Incremental Learning (CIL) poses a fundamental challenge: maintaining a balance between the plasticity required to learn new tasks and the stability needed to prevent catastrophic forgetting. While expansion-based methods effectively mitigate forgetting by adding task-specific parameters, they suffer from uncontrolled architectural growth and memory overhead. In this paper, we propose a novel dynamic scaling framework that adaptively manages model capacity through a cyclic "GRow, Assess, ComprEss" (GRACE) strategy. Crucially, we supplement backbone expansion with a novel saturation assessment phase that evaluates the utilization of the model's capacity. This assessment allows the framework to make informed decisions to either expand the architecture or compress the backbones into a streamlined representation, preventing parameter explosion. Experimental results demonstrate that our approach achieves state-of-the-art performance across multiple CIL benchmarks, while reducing memory footprint by up to 73% compared to purely expansionist models.
Authors: Yechen Zhang, Shuhao Xing, Junhao Huang, Kai Lv, Yunhua Zhou, Xipeng Qiu, Qipeng Guo, Kai Chen
Abstract: Recent advances in spectral optimization, notably Muon, have demonstrated that constraining update steps to the Stiefel manifold can significantly accelerate training and improve generalization. However, Muon implicitly assumes an isotropic optimization landscape, enforcing a uniform spectral update norm across all eigen-directions. We argue that this "egalitarian" constraint is suboptimal for Deep Neural Networks, where the curvature spectrum is known to be highly heavy-tailed and ill-conditioned. In such landscapes, Muon risks amplifying instabilities in high-curvature directions while limiting necessary progress in flat directions. In this work, we propose \textbf{Mousse} (\textbf{M}uon \textbf{O}ptimization \textbf{U}tilizing \textbf{S}hampoo's \textbf{S}tructural \textbf{E}stimation), a novel optimizer that reconciles the structural stability of spectral methods with the geometric adaptivity of second-order preconditioning. Instead of applying Newton-Schulz orthogonalization directly to the momentum matrix, Mousse operates in a whitened coordinate system induced by Kronecker-factored statistics (derived from Shampoo). Mathematically, we formulate Mousse as the solution to a spectral steepest descent problem constrained by an anisotropic trust region, where the optimal update is derived via the polar decomposition of the whitened gradient. Empirical results across language models ranging from 160M to 800M parameters demonstrate that Mousse consistently outperforms Muon, achieving an approximately 12\% reduction in training steps with negligible computational overhead.
Authors: Savannah L. Ferretti, Jerry Lin, Sara Shamekh, Jane W. Baldwin, Michael S. Pritchard, Tom Beucler
Abstract: Machine learning models can represent climate processes that are nonlocal in horizontal space, height, and time, often by combining information across these dimensions in highly nonlinear ways. While this can improve predictive skill, it makes learned relationships difficult to interpret and prone to overfitting as the extent of nonlocal information grows. We address this challenge by introducing data-driven integration kernels, a framework that adds structure to nonlocal operator learning by explicitly separating nonlocal information aggregation from local nonlinear prediction. Each spatiotemporal predictor field is first integrated using learnable kernels (defined as continuous weighting functions over horizontal space, height, and/or time), after which a local nonlinear mapping is applied only to the resulting kernel-integrated features and optional local inputs. This design confines nonlinear interactions to a small set of integrated features and makes each kernel directly interpretable as a weighting pattern that reveals which horizontal locations, vertical levels, and past timesteps contribute most to the prediction. We demonstrate the framework for South Asian monsoon precipitation using a hierarchy of neural network models with increasing structure, including baseline, nonparametric kernel, and parametric kernel models. Across this hierarchy, kernel models achieve near-baseline performance with far fewer trainable parameters, indicating that much of the relevant nonlocal information can be captured through a small set of interpretable integrations when appropriate structural constraints are imposed.
Authors: Mansoor Ahmed, Nadeem Taj, Imdad Ullah Khan, Hemanth Venkateswara, Murray Patterson
Abstract: Computational antibody design has seen rapid methodological progress, with dozens of deep generative methods proposed in the past three years, yet the field lacks a standardized benchmark for fair comparison and model development. These methods are evaluated on different SAbDab snapshots, non-overlapping test sets, and incompatible metrics, and the literature fragments the design problem into numerous sub-tasks with no common definition. We introduce \textsc{Chimera-Bench} (\textbf{C}DR \textbf{M}odeling with \textbf{E}pitope-guided \textbf{R}edesign), a unified benchmark built around a single canonical task: \emph{epitope-conditioned CDR sequence-structure co-design}. \textsc{Chimera-Bench} provides (1) a curated, deduplicated dataset of \textbf{2,922} antibody-antigen complexes with epitope and paratope annotations; (2) three biologically motivated splits testing generalization to unseen epitopes, unseen antigen folds, and prospective temporal targets; and (3) a comprehensive evaluation protocol with five metric groups including novel epitope-specificity measures. We benchmark representative methods spanning different generative paradigms and report results across all splits. \textsc{Chimera-Bench} is the largest dataset of its kind for the antibody design problem, allowing the community to develop and test novel methods and evaluate their generalizability. The source code and data are available at: https://github.com/mansoor181/chimera-bench.git
Authors: Shenyang Deng, Zhuoli Ouyang, Tianyu Pang, Zihang Liu, Ruochen Jin, Shuhua Yu, Yaoqing Yang
Abstract: Preconditioned adaptive methods have gained significant attention for training deep neural networks, as they capture rich curvature information of the loss landscape. The central challenge in this field lies in balancing preconditioning effectiveness with computational efficiency of implementing the preconditioner. Among recent advances, \textsc{Muon} stands out by using Newton-Schulz iteration to obtain preconditioned updates without explicitly constructing the preconditioning matrix. Despite its advantages, the efficiency of \textsc{Muon} still leaves room for further improvement. In this paper, we introduce RMNP (Row Momentum Normalized Preconditioning), an optimizer that replaces Newton-Schulz iteration with a simple row-wise $\ell_2$ normalization operation, motivated by the empirically observed diagonal block structure of the Transformer layerwise Hessian. This substitution reduces the per-iteration computational complexity from $\mathcal{O}(mn\cdot\min(m,n))$ to $\mathcal{O}(mn)$ for an $m\times n$ weight matrix while maintaining comparable optimization performance. Theoretically, we establish convergence guarantees for RMNP in the non-convex setting that match recent results for Muon optimizers, achieving the information-theoretic minimax optimal complexity. Extensive experiments on large language model pretraining show that RMNP delivers competitive optimization performance compared with Muon while substantially reducing preconditioning wall-clock time. Our code is available at \href{https://anonymous.4open.science/r/RMNP-E8E1/}{this link}.
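The row-wise $\ell_2$ normalization named in this abstract is simple enough to sketch directly. The snippet below shows only that single operation on a hypothetical momentum matrix; the momentum update, learning rate, and any scaling used by RMNP are not given in the abstract and are deliberately left out. The function name and the `eps` guard are assumptions.

```python
import math

def row_normalize(matrix, eps=1e-12):
    """Scale every row of an m x n matrix to (approximately) unit l2 norm.

    One pass over all entries, so the cost is O(mn).
    """
    out = []
    for row in matrix:
        norm = math.sqrt(sum(v * v for v in row))
        out.append([v / (norm + eps) for v in row])  # eps guards all-zero rows
    return out

momentum = [[3.0, 4.0], [0.0, 2.0], [1.0, 0.0]]  # hypothetical 3 x 2 momentum buffer
update = row_normalize(momentum)
row_norms = [math.sqrt(sum(v * v for v in row)) for row in update]
```

Unlike an orthogonalization step, this touches each entry a constant number of times and needs no matrix products, which is consistent with the $\mathcal{O}(mn)$ cost quoted above.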
Authors: Shinsaku Sakaue
Abstract: Contextual recommendation is a variant of contextual linear bandits in which the learner observes an (optimal) action rather than a reward scalar. Recently, Sakaue et al. (2025) developed an efficient Online Newton Step (ONS) approach with an $O(d\log T)$ regret bound, where $d$ is the dimension of the action space and $T$ is the time horizon. In this paper, we present a simple algorithm that is more efficient than the ONS-based method while achieving the same regret guarantee. Our core idea is to exploit the improperness inherent in contextual recommendation, leading to an update rule akin to the second-order perceptron from online classification. This removes the Mahalanobis projection step required by ONS, which is often a major computational bottleneck. More importantly, the same algorithm remains robust to possibly suboptimal action feedback, whereas the prior ONS-based method required running multiple ONS learners with different learning rates for this extension. We describe how our method works in general Hilbert spaces (e.g., via kernelization), where eliminating Mahalanobis projections becomes even more beneficial.
Authors: YanZhao Zheng, ZhenTao Zhang, Chao Ma, YuanQiang Yu, JiHuai Zhu, Yong Wu, Tianze Xu, Baohua Dong, Hangcheng Zhu, Ruohui Huang, Gang Yu
Abstract: Reusable skills let LLM agents package task-specific procedures, tool affordances, and execution guidance into modular building blocks. As skill ecosystems grow to tens of thousands of entries, exposing every skill at inference time becomes infeasible. This creates a skill-routing problem: given a user task, the system must identify relevant skills before downstream planning or execution. Existing agent stacks often rely on progressive disclosure, exposing only skill names and descriptions while hiding the full implementation body. We examine this design choice on a SkillsBench-derived benchmark with approximately 80K candidate skills, targeting the practically important setting of large skill registries with heavy overlap. Across representative sparse, dense, and reranking baselines on this setting, hiding the skill body causes a 31--44 percentage point drop in routing accuracy, showing that full skill text is a critical routing signal in this setting rather than a minor metadata refinement. Motivated by this finding, we present SkillRouter, a compact 1.2B full-text retrieve-and-rerank pipeline. SkillRouter achieves 74.0% Hit@1 on our benchmark -- the strongest average top-1 routing performance among the baselines we evaluate -- while using 13$\times$ fewer parameters and running 5.8$\times$ faster than the strongest base pipeline. The ranking gains further generalize to a supplementary benchmark independently constructed from three skill sources. In a complementary end-to-end study across four coding agents, routing gains transfer to improved task success, with larger gains for more capable agents.
Authors: Dogan Urgun, Gokhan Gungor
Abstract: Designing effective auxiliary rewards for cooperative multi-agent systems remains a challenging task. Misaligned incentives risk inducing suboptimal coordination, especially when sparse task feedback fails to provide sufficient grounding. This study introduces an automated reward design framework that leverages large language models to synthesize executable reward programs from environment instrumentation. The procedure constrains candidate programs within a formal validity envelope and evaluates their efficacy by training policies from scratch under a fixed computational budget. Selection across generations depends exclusively on the sparse task return. The framework is evaluated across four distinct Overcooked-AI layouts characterized by varied corridor congestion, handoff dependencies, and structural asymmetries. Iterative search generations consistently yield superior task returns and delivery counts, with the most pronounced gains occurring in environments dominated by interaction bottlenecks. Diagnostic analysis of the synthesized shaping components indicates increased interdependence in action selection and improved signal alignment in coordination-intensive tasks. These results demonstrate that the search for objective-grounded reward programs can mitigate the burden of manual engineering while producing shaping signals compatible with cooperative learning under finite budgets.
Authors: Pengxuan Yang, Yupeng Zheng, Deheng Qian, Zebin Xing, Qichao Zhang, Linbo Wang, Yichen Zhang, Shaoyu Guo, Zhongpu Xia, Qiang Chen, Junyu Han, Lingyun Xu, Yifeng Pan, Dongbin Zhao
Abstract: We introduce DreamerAD, the first latent world model framework that enables efficient reinforcement learning for autonomous driving by compressing diffusion sampling from 100 steps to 1, achieving an 80x speedup while maintaining visual interpretability. Training RL policies on real-world driving data incurs prohibitive costs and safety risks. While existing pixel-level diffusion world models enable safe imagination-based training, they suffer from multi-step diffusion inference latency (2s/frame) that prevents high-frequency RL interaction. Our approach leverages denoised latent features from video generation models through three key mechanisms: (1) shortcut forcing that reduces sampling complexity via recursive multi-resolution step compression, (2) an autoregressive dense reward model operating directly on latent representations for fine-grained credit assignment, and (3) Gaussian vocabulary sampling for GRPO that constrains exploration to physically plausible trajectories. DreamerAD achieves 87.7 EPDMS on NavSim v2, establishing state-of-the-art performance and demonstrating that latent-space RL is effective for autonomous driving.
Authors: Marc-Antoine Allard, Arnaud Teinturier, Victor Xing, Gautier Viaud
Abstract: Recent advances in large language models (LLMs) have enabled the development of autonomous agents capable of complex reasoning and multi-step problem solving. However, these agents struggle to adapt to specialized environments and do not leverage past interactions, approaching each new task from scratch regardless of their accumulated experience. We introduce Experiential Reflective Learning (ERL), a simple self-improvement framework that enables rapid environment adaptation through experiential learning. ERL reflects on task trajectories and outcomes to generate heuristics, capturing actionable lessons that transfer across tasks. At test time, relevant heuristics are retrieved based on the current task and injected into the agent's context to guide execution. On the Gaia2 benchmark, ERL improves success rate by 7.8% over a ReAct baseline, with large gains in task completion reliability, and outperforms prior experiential learning methods. Through systematic ablations, we find that selective retrieval is essential and that heuristics provide more transferable abstractions than few-shot trajectory prompting. These results demonstrate that reflecting on single-attempt experiences to extract transferable heuristics enables effective agent self-improvement.
Authors: Devashish Gaikwad, Wil M. P. van der Aalst, Gyunam Park
Abstract: Process anomaly detection is an important application of process mining for identifying deviations from the normal behavior of a process. Neural network-based methods have recently been applied to this task, learning directly from event logs without requiring a predefined process model. However, since anomaly detection is a purely statistical task, these models fail to incorporate human domain knowledge. As a result, rare but conformant traces are often misclassified as anomalies due to their low frequency, which limits the effectiveness of the detection process. Recent developments in the field of neuro-symbolic AI have introduced Logic Tensor Networks (LTN) as a means to integrate symbolic knowledge into neural networks using real-valued logic. In this work, we propose a neuro-symbolic approach that integrates domain knowledge into neural anomaly detection using LTN and Declare constraints. Using autoencoder models as a foundation, we encode Declare constraints as soft logical guiderails within the learning process to distinguish between anomalous and rare but conformant behavior. Evaluations on synthetic and real-world datasets demonstrate that our approach improves F1 scores even when as few as 10 conformant traces exist, and that the choice of Declare constraint and by extension human domain knowledge significantly influences performance gains.
Authors: Shoujin Wang, Mingze Ni, Wei Liu, Victor W. Chu, Bryan Zheng, Ayush Kanwal, Roy Jing Yang, Kenneth Sabir, Fang Chen
Abstract: Livestock growth prediction is essential for optimising farm management and improving the efficiency and sustainability of livestock production, yet it remains underexplored due to limited large-scale datasets and privacy concerns surrounding farm-level data. Existing biophysical models rely on fixed formulations, while most machine learning approaches are trained on small, isolated datasets, limiting their robustness and generalisability. To address these challenges, we propose LivestockFL, the first federated learning framework specifically designed for livestock growth prediction. LivestockFL enables collaborative model training across distributed farms without sharing raw data, thereby preserving data privacy while alleviating data sparsity, particularly for farms with limited historical records. The framework employs a neural architecture based on a Gated Recurrent Unit combined with a multilayer perceptron to model temporal growth patterns from historical weight records and auxiliary features. We further introduce LivestockPFL, a novel personalised federated learning framework that extends the above federated learning framework with a personalized prediction head trained on each farm's local data, producing farm-specific predictors. Experiments on a real-world dataset demonstrate the effectiveness and practicality of the proposed approaches.
Authors: Yufei Xu, Fanxu Meng, Fan Jiang, Yuxuan Wang, Ruijie Zhou, Zhaohui Wang, Jiexi Wu, Zhixin Pan, Xiaojuan Tang, Wenjie Pei, Tongxuan Liu, Di yin, Xing Sun, Muhan Zhang
Abstract: Token-level sparse attention mechanisms, exemplified by DeepSeek Sparse Attention (DSA), achieve fine-grained key selection by scoring every historical key for each query through a lightweight indexer, then computing attention only on the selected subset. While the downstream sparse attention itself scales favorably, the indexer must still scan the entire prefix for every query, introducing a per-layer bottleneck that grows prohibitively with context length. We propose HISA (Hierarchical Indexed Sparse Attention), a plug-and-play replacement for the indexer that rewrites the search path from a flat token scan into a two-stage hierarchical procedure: (1) a block-level coarse filtering stage that scores pooled block representations to discard irrelevant regions, followed by (2) a token-level refinement stage that applies the original indexer exclusively within the retained candidate blocks. HISA preserves the identical token-level top-$k$ sparse pattern consumed by the downstream Sparse MLA operator and requires no additional training. On kernel-level benchmarks, HISA achieves up to speedup at 64K context. On Needle-in-a-Haystack and LongBench, we directly replace the indexer in DeepSeek-V3.2 and GLM-5 with our HISA indexer, without any finetuning. HISA closely matches the original DSA in quality, while substantially outperforming block-sparse baselines.
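The two-stage search described in this abstract can be sketched as follows. Mean pooling of keys within a block, dot-product scoring at both stages, and every name and parameter (`block_size`, `keep_blocks`, `top_k`) are illustrative assumptions; the abstract does not specify HISA's pooling or scoring functions, and this toy version makes no claim to reproduce the flat indexer's selection in general.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def hierarchical_topk(query, keys, block_size, keep_blocks, top_k):
    """Coarse block filtering followed by token-level top-k selection."""
    # Partition token indices into contiguous blocks.
    blocks = [list(range(s, min(s + block_size, len(keys))))
              for s in range(0, len(keys), block_size)]

    def pooled(indices):
        # Assumed pooling: mean of the key vectors in the block.
        return [sum(keys[i][d] for i in indices) / len(indices)
                for d in range(len(query))]

    # Stage 1: score pooled block representations, keep the best blocks.
    block_scores = [dot(query, pooled(idx)) for idx in blocks]
    kept = sorted(range(len(blocks)), key=lambda b: block_scores[b],
                  reverse=True)[:keep_blocks]

    # Stage 2: score individual tokens only inside the retained blocks.
    candidates = [i for b in kept for i in blocks[b]]
    ranked = sorted(candidates, key=lambda i: dot(query, keys[i]), reverse=True)
    return sorted(ranked[:top_k])

query = [1.0, 0.0]
keys = [[0.1, 0.0], [0.2, 0.0], [0.9, 0.0], [0.8, 0.0],
        [0.0, 1.0], [0.1, 1.0], [0.7, 0.0], [0.05, 0.0]]
selected = hierarchical_topk(query, keys, block_size=2, keep_blocks=2, top_k=3)
```

In this example the token-level scorer runs on 4 candidate keys instead of all 8; the saving grows with context length when the number of retained blocks stays fixed.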
Authors: Max Qiushi Lin, Reza Asad, Kevin Tan, Haque Ishfaq, Csaba Szepesvari, Sharan Vaswani
Abstract: Although actor-critic methods have been successful in practice, their theoretical analyses have several limitations. Specifically, existing theoretical work either sidesteps the exploration problem by making strong assumptions or analyzes impractical methods with complicated algorithmic modifications. Moreover, the actor-critic methods analyzed for linear MDPs often employ natural policy gradient and construct "implicit" policies without explicit parameterization. Such policies are computationally expensive to sample from, making the environment interactions inefficient. To that end, we focus on the finite-horizon linear MDPs and propose an optimistic actor-critic framework that uses parametric log-linear policies. In particular, we introduce a tractable $\textit{logit-matching}$ regression objective for the actor. For the critic, we use approximate Thompson sampling via Langevin Monte Carlo to obtain optimistic value estimates. We prove that the resulting algorithm achieves $\widetilde{\mathcal{O}}(\epsilon^{-4})$ and $\widetilde{\mathcal{O}}(\epsilon^{-2})$ sample complexity in the on-policy and off-policy setting, respectively. Our results match prior theoretical work in achieving the state-of-the-art sample complexity, while our algorithm is more aligned with practice.
Authors: Tushar Dhananjay Pathak
Abstract: This paper presents ARCS (Autoregressive Circuit Synthesis), a system for amortized analog circuit generation that produces complete, SPICE-simulatable designs (topology and component values) in milliseconds rather than the minutes required by search-based methods. A hybrid pipeline combining two learned generators (a graph VAE and a flow-matching model) with SPICE-based ranking achieves 99.9% simulation validity (reward 6.43/8.0) across 32 topologies using only 8 SPICE evaluations, 40x fewer than genetic algorithms. For single-model inference, a topology-aware Graph Transformer with Best-of-3 candidate selection reaches 85% simulation validity in 97ms, over 600x faster than random search. The key technical contribution adapts Group Relative Policy Optimization (GRPO) to multi-topology circuit reinforcement learning, resolving a critical failure mode of REINFORCE (cross-topology reward distribution mismatch) through per-topology advantage normalization. This improves simulation validity by +9.6 percentage points over REINFORCE in only 500 RL steps (10x fewer). Grammar-constrained decoding additionally guarantees 100% structural validity by construction via topology-aware token masking.
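The per-topology advantage normalization mentioned in this abstract can be sketched as below. The sketch assumes the usual group-relative form (reward minus group mean, divided by group standard deviation) with groups defined by circuit topology; the topology labels, reward values, and `eps` constant are hypothetical, and the actual ARCS implementation may differ.

```python
from collections import defaultdict
from statistics import mean, pstdev

def per_topology_advantages(samples, eps=1e-8):
    """Normalize rewards within each topology group.

    `samples` is a list of (topology_name, reward) pairs. Rewards are
    standardized against their own topology's statistics, so topologies
    with very different reward scales do not dominate each other.
    """
    groups = defaultdict(list)
    for topology, reward in samples:
        groups[topology].append(reward)
    stats = {t: (mean(rs), pstdev(rs)) for t, rs in groups.items()}
    return [(reward - stats[t][0]) / (stats[t][1] + eps) for t, reward in samples]

# Hypothetical rollouts: one topology with high rewards, one with low rewards.
samples = [("buck", 7.0), ("buck", 5.0), ("ldo", 1.0), ("ldo", 3.0)]
advantages = per_topology_advantages(samples)
```

With a single global baseline, both "ldo" samples here would receive negative advantages simply because that topology scores lower overall; normalizing per group rewards the better sample within each topology.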
Authors: Shafayeth Jamil, Rehan Kapadia
Abstract: Linear dynamical systems are fully characterized by their eigenspectra, accessible directly from the generator of the dynamics. For nonlinear systems governed by partial differential equations, no equivalent theory exists. We introduce Lie Generator Network-Koopman (LGN-KM), a neural operator that lifts nonlinear dynamics into a linear latent space and learns the continuous-time Koopman generator ($L_k$) through a decomposition $L_k = S - D_k$, where $S$ is skew-symmetric representing conservative inter-modal coupling, and $D_k$ is a positive-definite diagonal encoding modal dissipation. This architectural decomposition enforces stability and enables interpretability through direct spectral access to the learned dynamics. On two-dimensional Navier--Stokes turbulence, the generator recovers the known dissipation scaling and a complete multi-branch dispersion relation from trajectory data alone with no physics supervision. Independently trained models at different flow regimes recover matched gauge-invariant spectral structure, exposing a gauge freedom in the Koopman lifting. Because the generator is provably stable, it enables guaranteed long-horizon stability, continuous-time evaluation at arbitrary time, and physics-informed cross-viscosity model transfer.
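The stability argument behind the decomposition $L_k = S - D_k$ can be checked numerically on a toy example. For latent dynamics $\dot z = L_k z$, the rate of change of $\tfrac12\|z\|^2$ is $z^\top L_k z$; the skew-symmetric part contributes nothing, leaving $-z^\top D_k z < 0$. The matrix entries and helper names below are hypothetical and the sketch uses a real-valued 3 x 3 generator only for illustration.

```python
def build_generator(coupling, damping):
    """Return S - D from upper-triangular couplings and positive damping rates."""
    n = len(damping)
    gen = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            gen[i][j] = coupling[i][j]    # S[i][j]
            gen[j][i] = -coupling[i][j]   # S[j][i] = -S[i][j] (skew-symmetry)
        gen[i][i] = -damping[i]           # -D on the diagonal
    return gen

def quadratic_form(mat, x):
    n = len(x)
    return sum(x[i] * mat[i][j] * x[j] for i in range(n) for j in range(n))

coupling = [[0.0, 2.0, -1.0], [0.0, 0.0, 0.5], [0.0, 0.0, 0.0]]
damping = [0.1, 0.3, 0.7]                 # must be strictly positive
gen = build_generator(coupling, damping)

z = [1.0, -2.0, 0.5]
energy_rate = quadratic_form(gen, z)      # d/dt of 0.5 * ||z||^2
dissipation = -sum(d * v * v for d, v in zip(damping, z))
```

Because the energy rate equals the dissipation term for every state, the latent norm can only decay, regardless of how strong the conservative coupling is.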
Authors: Max Hennick, Guillaume Corlouer
Abstract: A key problem in the modern study of AI is predicting and understanding emergent capabilities in models during training. Inspired by methods for studying reactions in quantum chemistry, we present the "2-datapoint reduced density matrix" (2RDM). We show that this object provides a computationally efficient, unified observable of phase transitions during training. By tracking the eigenvalue statistics of the 2RDM over a sliding window, we derive two complementary signals: the spectral heat capacity, which we prove provides early warning of second-order phase transitions via critical slowing down, and the participation ratio, which reveals the dimensionality of the underlying reorganization. Remarkably, the top eigenvectors of the 2RDM are directly interpretable, making it straightforward to study the nature of the transitions. We validate across four distinct settings: deep linear networks, induction head formation, grokking, and emergent misalignment. We then discuss directions for future work using the 2RDM.
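The participation ratio mentioned in this abstract is not defined there; a common definition for a non-negative eigenvalue spectrum is $(\sum_i \lambda_i)^2 / \sum_i \lambda_i^2$, which the sketch below assumes. It ranges from 1 (one dominant direction) to the number of eigenvalues (all directions equal). The example spectra are made up.

```python
def participation_ratio(eigenvalues):
    """Effective number of active directions in a non-negative spectrum."""
    total = sum(eigenvalues)
    squares = sum(v * v for v in eigenvalues)
    if squares == 0.0:
        raise ValueError("spectrum must contain a nonzero eigenvalue")
    return total * total / squares

uniform = participation_ratio([1.0, 1.0, 1.0, 1.0])   # all directions equal
peaked = participation_ratio([1.0, 0.0, 0.0, 0.0])    # one dominant direction
mixed = participation_ratio([3.0, 1.0])               # in between
```

Tracking this quantity over training steps gives a scalar summary of how many directions take part in a reorganization, which is the role the abstract assigns to it.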
Authors: Chinmay Savadikar, Michelle Dai, Tianfu Wu
Abstract: To effectively manage the complexities of real-world dynamic environments, continual learning must incrementally acquire, update, and accumulate knowledge from a stream of tasks of different nature without suffering from catastrophic forgetting of prior knowledge. While this capability is innate to human cognition, it remains a significant challenge for modern deep learning systems. At the heart of this challenge lies the stability-plasticity dilemma: the need to balance leveraging prior knowledge, integrating novel information, and allocating model capacity adaptively based on task complexity and synergy. In this paper, we propose a novel exemplar-free class-incremental continual learning (ExfCCL) framework that addresses these issues through a Hierarchical Exploration-Exploitation (HEE) approach. The core of our method is a HEE-guided efficient neural architecture search (HEE-NAS) that enables a learning-to-adapt backbone via four primitive operations - reuse, new, adapt, and skip - thereby serving as an internal memory that dynamically updates selected components across streaming tasks. To address the task ID inference problem in ExfCCL, we exploit an external memory of task centroids proposed in the prior art. We term our method CHEEM (Continual Hierarchical-Exploration-Exploitation Memory). CHEEM is evaluated on the challenging MTIL and VDD benchmarks using both Tiny and Base Vision Transformers and a proposed holistic Figure-of-Merit (FoM) metric. It significantly outperforms state-of-the-art prompting-based continual learning methods, closely approaching full fine-tuning upper bounds. Furthermore, it learns adaptive model structures tailored to individual tasks in a semantically meaningful way. Our code is available at https://github.com/savadikarc/cheem .
Authors: Anders Sjöberg, Jakob Lindqvist, Magnus Önnheim, Mats Jirstrand, Lennart Svensson
Abstract: Diffusion models can be parameterized in terms of either score or energy function. The energy parameterization is attractive as it enables sampling procedures such as Markov Chain Monte Carlo (MCMC) that incorporate a Metropolis--Hastings (MH) correction step based on energy differences between proposed samples. Such corrections can significantly improve sampling quality, particularly in the context of model composition, where pre-trained models are combined to generate samples from novel distributions. Score-based diffusion models, on the other hand, are more widely adopted and come with a rich ecosystem of pre-trained models. However, they do not, in general, define an underlying energy function, making MH-based sampling inapplicable. In this work, we address this limitation by retaining score parameterization and introducing a novel MH-like acceptance rule based on line integration of the score function. This allows the reuse of existing diffusion models while still combining the reverse process with various MCMC techniques, viewed as an instance of annealed MCMC. Through experiments on synthetic and real-world data, we show that our MH-like samplers yield relative improvements of similar magnitude to those observed with energy-based models, without requiring explicit energy parameterization.
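The idea of an acceptance rule built from a line integral of the score can be sketched under simple assumptions. If the score is the gradient of a log-density, then $\log p(x') - \log p(x) = \int_0^1 s(x + t(x' - x)) \cdot (x' - x)\,dt$, so the density ratio needed by Metropolis--Hastings can be estimated without an energy function. The midpoint quadrature, the symmetric-proposal assumption, and all names below are illustrative choices, not the rule from the paper.

```python
import math

def log_density_ratio(score, x, x_new, n_points=16):
    """Estimate log p(x_new) - log p(x) by integrating the score along a line."""
    delta = [b - a for a, b in zip(x, x_new)]
    total = 0.0
    for k in range(n_points):
        t = (k + 0.5) / n_points                      # midpoint rule on [0, 1]
        point = [a + t * d for a, d in zip(x, delta)]
        total += sum(s * d for s, d in zip(score(point), delta))
    return total / n_points

def acceptance_probability(score, x, x_new):
    """MH-style acceptance for a symmetric proposal (assumed)."""
    log_ratio = log_density_ratio(score, x, x_new)
    return 1.0 if log_ratio >= 0.0 else math.exp(log_ratio)

def gaussian_score(x):
    # Score of a standard normal: gradient of -0.5 * ||x||^2.
    return [-v for v in x]

log_ratio = log_density_ratio(gaussian_score, [0.0, 0.0], [1.0, 1.0])
accept = acceptance_probability(gaussian_score, [0.0, 0.0], [1.0, 1.0])
```

For the standard normal the integrand is linear in t, so the midpoint rule recovers the exact log-ratio of -1; a learned score would only approximate a gradient field, and the quadrature would carry discretization error.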
Authors: Haotian Lin, Matthew Reimherr
Abstract: Many existing mechanisms for achieving differential privacy (DP) on infinite-dimensional functional summaries typically involve embedding these functional summaries into finite-dimensional subspaces and applying traditional multivariate DP techniques. These mechanisms generally treat each dimension uniformly and struggle with complex, structured summaries. This work introduces a novel mechanism to achieve pure DP for functional summaries in a separable infinite-dimensional Hilbert space, named the Independent Component Laplace Process (ICLP) mechanism. This mechanism treats the summaries of interest as truly infinite-dimensional functional objects, thereby addressing several limitations of the existing mechanisms. Several statistical estimation problems are considered, and we demonstrate how one can enhance the utility of private summaries by oversmoothing the non-private counterparts. Numerical experiments on synthetic and real datasets demonstrate the effectiveness of the proposed mechanism.
Authors: Kwangho Kim, Jisu Kim, Edward H. Kennedy
Abstract: Causal effects are often characterized with population summaries. These might provide an incomplete picture when there are heterogeneous treatment effects across subgroups. Since the subgroup structure is typically unknown, it is more challenging to identify and evaluate subgroup effects than population effects. We propose a new solution to this problem: \emph{Causal k-Means Clustering}, which leverages the k-means clustering algorithm to uncover the unknown subgroup structure. Our problem differs significantly from the conventional clustering setup since the variables to be clustered are unknown counterfactual functions. We present a plug-in estimator which is simple and readily implementable using off-the-shelf algorithms, and study its rate of convergence. We also develop a new bias-corrected estimator based on nonparametric efficiency theory and double machine learning, and show that this estimator achieves fast root-n rates and asymptotic normality in large nonparametric models. Our proposed methods are especially useful for modern outcome-wide studies with multiple treatment levels. Further, our framework is extensible to clustering with generic pseudo-outcomes, such as partially observed outcomes or otherwise unknown functions. Finally, we explore finite sample properties via simulation, and illustrate the proposed methods using a study of mobile-supported self-management for chronic low back pain.
Authors: Antonio Di Noia, Iuri Macocco, Aldo Glielmo, Alessandro Laio, Antonietta Mira
Abstract: The Intrinsic Dimension (ID) is a key concept in unsupervised learning and feature selection, as it is a lower bound on the number of variables necessary to describe a system. However, in almost any real-world dataset the ID depends on the scale at which the data are analysed. Quite typically, at a small scale the ID is very large, as the data are affected by measurement errors. At a large scale, the ID can also appear erroneously large, due to the curvature and the topology of the manifold containing the data. In this work, we introduce an automatic protocol to select the sweet spot, namely the correct range of scales in which the ID is meaningful and useful. This protocol is based on imposing that, for distances smaller than the correct scale, the density of the data is constant. In the presented framework, estimating the density requires knowing the ID; therefore, this condition is imposed self-consistently. We illustrate the usefulness of this procedure and its robustness to noise with benchmarks on artificial and real-world datasets.
Authors: Bhrij Patel, Souradip Chakraborty, Mengdi Wang, Dinesh Manocha, Amrit Singh Bedi
Abstract: Large Language Models (LLMs) for unsupervised code correctness evaluation have recently gained attention because they can judge if code runs as intended without requiring reference implementations or unit tests, which may be unavailable, sparse, or unreliable. However, most prior approaches condition LLM evaluators directly on the full code implementation, forcing the model to jointly infer program behavior and evaluate correctness in a single step. This entanglement leads to misinterpretations of code behavior and unreliable judgments. To mitigate this issue, we introduce CoCoA, an unsupervised Code Comprehension then Auditing framework that first comprehends functionality to generate a natural-language explanation. Then it evaluates task alignment based on this explanation. By sequentially sampling comprehension before evaluation, CoCoA improves the quality of inferred program behavior and enables the evaluator to focus on behavioral alignment rather than raw implementation details. Across multiple datasets, programming languages, and models, CoCoA achieves up to $68\%$ increased F1 score and up to $20\%$ increased accuracy over the best-performing baselines.
Authors: Vincent Guan, Joseph Janssen, Hossein Rahmani, Andrew Warren, Stephen Zhang, Elina Robeva, Geoffrey Schiebinger
Abstract: Stochastic differential equations (SDEs) are a fundamental tool for modelling dynamic processes, including gene regulatory networks (GRNs), contaminant transport, financial markets, and image generation. However, learning the underlying SDE from data is a challenging task, especially if individual trajectories are not observable. Motivated by burgeoning research in single-cell datasets, we present the first comprehensive approach for jointly identifying the drift and diffusion of an SDE from its temporal marginals. Assuming linear drift and additive diffusion, we show that non-identifiability can only arise if the initial distribution possesses generalized rotational symmetries. We further prove that even if this condition holds, the drift and diffusion can almost always be recovered from the marginals. Additionally, we show that the causal graph of any SDE with additive diffusion can be recovered from the identified SDE parameters. To complement this theory, we adapt entropy-regularized optimal transport to handle anisotropic diffusion, and introduce APPEX (Alternating Projection Parameter Estimation from $X_0$), an iterative algorithm designed to estimate the drift, diffusion, and causal graph of an additive noise SDE, solely from temporal marginals. We show that APPEX iteratively decreases Kullback-Leibler divergence to the true solution, and demonstrate its effectiveness on simulated data from linear additive noise SDEs.
Authors: Johnny Chan, Yuming Li
Abstract: This research-in-progress paper presents a new project management framework that utilises GenAI technology. The framework is designed to address the common challenge of uniform team compositions in academic and research project teams, particularly in universities and research institutions. It does so by integrating sociologically identified patterns of successful team member personalities and roles, using GenAI agents to fill gaps in team dynamics. This approach adds an additional layer of analysis to conventional project management processes by evaluating team members' personalities and roles and employing GenAI agents, fine-tuned on personality datasets, to fill specific team roles. Our initial experiments have shown improvements in the model's ability to understand and process personality traits, suggesting the potential effectiveness of GenAI teammates in real-world project settings. This paper aims to explore the practical application of AI in enhancing team diversity and project management.
Authors: Yuchen Fang, Javad Lavaei, Sen Na
Abstract: In this paper, we consider nonlinear optimization problems with a stochastic objective and deterministic equality constraints. We propose a Trust-Region Stochastic Sequential Quadratic Programming (TR-SSQP) method and establish its high-probability iteration complexity bounds for identifying first- and second-order $\epsilon$-stationary points. In our algorithm, we assume that exact objective values, gradients, and Hessians are not directly accessible but can be estimated via zeroth-, first-, and second-order probabilistic oracles. Compared to existing complexity studies of SSQP methods that rely on a zeroth-order oracle with sub-exponential tail noise (i.e., light-tailed) and focus mostly on first-order stationarity, our analysis accommodates biased (also referred to as irreducible in the literature) and heavy-tailed noise in the zeroth-order oracle, and significantly extends the analysis to second-order stationarity. We show that under heavy-tailed noise conditions, our SSQP method achieves the same high-probability first-order iteration complexity bounds as in the light-tailed noise setting, while further exhibiting promising second-order iteration complexity bounds. Specifically, the method identifies a first-order $\epsilon$-stationary point in $\mathcal{O}(\epsilon^{-2})$ iterations and a second-order $\epsilon$-stationary point in $\mathcal{O}(\epsilon^{-3})$ iterations with high probability, provided that $\epsilon$ is lower bounded by a constant determined by the bias magnitude (i.e., the irreducible noise) in the estimation. We validate our theoretical findings and evaluate the practical performance of our method on the CUTEst benchmark test set.
Authors: Nabarun Deb, Tengyuan Liang
Abstract: We introduce a novel generative modeling framework based on a discretized parabolic Monge-Amp\`{e}re PDE, which emerges as a continuous limit of the Sinkhorn algorithm commonly used in optimal transport. Our method performs iterative refinement in the space of Brenier maps using a mirror gradient descent step. We establish theoretical guarantees for generative modeling through the lens of no-regret analysis, demonstrating that the iterates converge to the optimal Brenier map under a variety of step-size schedules. As a technical contribution, we derive a new Evolution Variational Inequality tailored to the parabolic Monge-Amp\`{e}re PDE, connecting geometry, transportation cost, and regret. Our framework accommodates non-log-concave target distributions, constructs an optimal sampling process via the Brenier map, and integrates favorable learning techniques from generative adversarial networks and score-based diffusion models. As direct applications, we illustrate how our theory paves new pathways for generative modeling and variational inference.
Authors: Jialuo Li, Wenhao Chai, Xingyu Fu, Haiyang Xu, Saining Xie
Abstract: Current image generation models produce visually compelling but scientifically implausible images, exposing a fundamental gap between visual fidelity and physical realism. In this work, we introduce ScienceT2I, an expert-annotated dataset comprising a training set of over 20k adversarial image pairs and 9k prompts across 16 scientific domains and an isolated test set of 454 challenging prompts. Using this benchmark, we evaluate 18 recent image generation models and find that none scores above 50 out of 100 under implicit scientific prompts, while explicit prompts that directly describe the intended outcome yield scores roughly 35 points higher, confirming that current models can render correct scenes when told what to depict but cannot reason from scientific cues to the correct visual outcome. To address this, we develop SciScore, a reward model fine-tuned from CLIP-H that captures fine-grained scientific phenomena without relying on language-guided inference, surpassing GPT-4o and experienced human evaluators by roughly 5 points. We further propose a two-stage alignment framework combining supervised fine-tuning with masked online fine-tuning to inject scientific knowledge into generative models. Applying this framework to FLUX.1[dev] yields a relative improvement exceeding 50% on SciScore, demonstrating that scientific reasoning in image generation can be substantially improved through targeted data and alignment.
Authors: Iskander Azangulov, Peter Potaptchik, Qinyu Li, Eddie Aamari, George Deligiannidis, Judith Rousseau
Abstract: Guidance is a cornerstone of modern diffusion models, playing a pivotal role in conditional generation and enhancing the quality of unconditional samples. However, current approaches to guidance scheduling--determining the appropriate guidance weight--are largely heuristic and lack a solid theoretical foundation. This work addresses these limitations on two fronts. First, we provide a theoretical formalization that precisely characterizes the relationship between guidance strength and classifier confidence. Second, building on this insight, we introduce a stochastic optimal control framework that casts guidance scheduling as an adaptive optimization problem. In this formulation, guidance strength is not fixed but dynamically selected based on time, the current sample, and the conditioning class, either independently or in combination. By solving the resulting control problem, we establish a principled foundation for more effective guidance in diffusion models.
Authors: Alejandro Murillo-Gonzalez, Lantao Liu
Abstract: Autonomous robots operating in complex, unstructured environments face significant challenges due to latent, unobserved factors that obscure their understanding of both their internal state and the external world. Addressing this challenge would enable robots to develop a more profound grasp of their operational context. To tackle this, we propose a novel framework for online learning of hidden state representations, with which the robots can adapt in real-time to uncertain and dynamic conditions that would otherwise be ambiguous and result in suboptimal or erroneous behaviors. Our approach is formalized as a Generalized Hidden Parameter Markov Decision Process, which explicitly models the influence of unobserved parameters on both transition dynamics and reward structures. Our core innovation lies in learning online the joint distribution of state transitions, which serves as an expressive representation of latent ego- and environmental-factors. This probabilistic approach supports the identification of, and adaptation to, different operational situations, improving robustness and safety. Through a multivariate extension of Bayesian Online Changepoint Detection, our method segments changes in the underlying data generating process governing the robot's dynamics. The robot's transition model is then informed with a symbolic representation of the current situation derived from the joint distribution of the latest state transitions, enabling adaptive and context-aware decision-making. To showcase the real-world effectiveness, we validate our approach in the challenging task of unstructured terrain navigation, where unmodeled and unmeasured terrain characteristics can significantly impact the robot's motion. Extensive experiments in both simulation and the real world reveal significant improvements in data efficiency, policy performance, and the emergence of safer, adaptive navigation strategies.
Authors: Chunyang Jiang, Chi-min Chan, Yiyang Cai, Yulong Liu, Wei Xue, Yike Guo
Abstract: Recently, the pretrain-finetune paradigm has become a cornerstone in various deep learning areas. While in general the pre-trained model would promote both the effectiveness and efficiency of downstream task fine-tuning, studies have shown that not all knowledge acquired during pre-training is beneficial. Some of the knowledge may actually bring detrimental effects to the fine-tuning tasks, which is also known as negative transfer. To address this problem, graceful forgetting has emerged as a promising approach. The core principle of graceful forgetting is to enhance the learning plasticity of the target task by selectively discarding irrelevant knowledge. However, this approach remains underexplored in the context of generative language models, and it is often challenging to migrate existing forgetting algorithms to these models due to architecture incompatibility. To bridge this gap, in this paper we propose a novel framework, Learning With Forgetting (LWF), to achieve graceful forgetting in generative language models. With the Fisher Information Matrix weighting the intended parameter updates, LWF computes forgetting confidence to evaluate self-generated knowledge regarding the forgetting task, and consequently, knowledge with high confidence is periodically unlearned during fine-tuning. Our experiments demonstrate that, although thoroughly uncovering the mechanisms of knowledge interaction remains challenging in pre-trained language models, applying graceful forgetting can contribute to enhanced fine-tuning performance.
Authors: Anum Fatima, Gesine Reinert
Abstract: Complex data are often represented as a graph, which in turn can often be viewed as a realisation of a random graph, such as an inhomogeneous random graph model (IRG). For general fast goodness-of-fit tests in high dimensions, kernelised Stein discrepancy (KSD) tests are a powerful tool. Here, we develop a KSD-type test for IRG models that can be carried out with a single observation of the network. The test applies to a network of any size, but is particularly interesting for small networks for which asymptotic tests are not warranted. We also provide theoretical guarantees.
Authors: Uranik Berisha, Jens Mehnert, Alexandru Paul Condurache
Abstract: Increasingly expensive training of ever larger models such as Vision Transformers motivates reusing the vast library of already trained state-of-the-art networks. However, their latency, high computational costs and memory demands pose significant challenges for deployment, especially on resource-constrained hardware. While structured pruning methods can reduce these factors, they often require costly retraining, sometimes for up to hundreds of epochs, or even training from scratch to recover the lost accuracy resulting from the structural modifications. Maintaining the provided performance of trained models after structured pruning and thereby avoiding extensive retraining remains a challenge. To solve this, we introduce Variance-Based Pruning, a simple and structured one-shot pruning technique for efficiently compressing networks, with minimal finetuning. Our approach first gathers activation statistics, which are used to select neurons for pruning. Simultaneously, the mean activations are integrated back into the model to preserve a high degree of performance. On ImageNet-1k recognition tasks, we demonstrate that directly after pruning DeiT-Base retains over 70% of its original performance and requires only 10 epochs of fine-tuning to regain 99% of the original accuracy while simultaneously reducing MACs by 35% and model size by 36%, thus speeding up the model by 1.44x. The code is available at: https://github.com/boschresearch/variance-based-pruning
URLs: https://github.com/boschresearch/variance-based-pruning
Authors: Xinfang Chen, Siyang Xiao, Xianying Zhu, Junhong Xie, Ming Liang, Dajun Chen, Wei Jiang, Yong Li, Peng Di
Abstract: Code editing is a frequent yet cognitively demanding task in software development. Existing AI-powered tools often disrupt developer flow by requiring explicit natural language instructions and suffer from high latency, limiting real-world usability. We present NES (Next Edit Suggestion), an instruction-free, low-latency code editing framework that leverages learned historical editing trajectories to implicitly capture developers' goals and coding habits. NES features a dual-model architecture: one model predicts the next edit location and the other generates the precise code change, both without any user instruction. Trained on our open-sourced SFT and DAPO datasets, NES achieves state-of-the-art performance (75.6% location accuracy, 27.7% exact match rate) while delivering suggestions in under 250ms. Deployed at Ant Group, NES serves over 20,000 developers through a seamless Tab-key interaction, achieving effective acceptance rates of 51.55% for location predictions and 43.44% for edits, demonstrating its practical impact in real-world development workflows.
Authors: Po-Hsien Yu, Yu-Syuan Tseng, Shao-Yi Chien
Abstract: Person re-identification (re-ID) is a fundamental task in intelligent surveillance and public safety. Federated learning (FL) provides a privacy-preserving paradigm by enabling collaborative model training without centralized data collection. However, applying FL to real-world re-ID systems remains challenging due to two major issues: statistical heterogeneity across clients caused by non-IID data distributions and substantial communication overhead resulting from the frequent transmission of large-scale models. To address these challenges, we propose FedKLPR, a lightweight and communication-efficient federated learning framework for person re-ID. FedKLPR consists of three key components. First, the KL-Divergence Regularization Loss (KLL) constrains local updates by reducing the discrepancy between local and global feature distributions, thereby alleviating the effects of statistical heterogeneity and improving convergence stability under non-IID settings. Second, KL-Divergence-Prune Weighted Aggregation (KLPWA) incorporates both pruning ratio and distributional similarity into the aggregation process, enabling more effective aggregation of pruned local models under non-IID data distributions and enhancing the robustness of the global model. Third, Cross-Round Recovery (CRR) employs a dynamic pruning control mechanism to prevent excessive pruning and preserve model accuracy during iterative compression. Experimental results on eight benchmark datasets demonstrate that FedKLPR achieves substantial communication savings while maintaining competitive accuracy. Compared with state-of-the-art methods, FedKLPR reduces communication cost by 40\%--42\% on ResNet-50 while achieving superior overall performance.
Authors: Malte L\"uken, Javier Garcia-Bernardo, Sreeparna Deb, Flavio Hafner, Megha Khosla
Abstract: Administrative registry data can be used to construct population-scale networks whose ties reflect shared social contexts between persons. With machine learning, such networks can be encoded into numerical representations -- embeddings -- that automatically capture an individual's position within the network. We created embeddings for all persons in the Dutch population from a population-scale network that represents five shared contexts: neighborhood, work, family, household, and school. To assess the informativeness of these embeddings, we used them to predict right-wing populist voting. Embeddings alone predicted right-wing populist voting above chance-level but performed worse than individual characteristics. Combining the best subset of embeddings with individual characteristics only slightly improved predictions. After transforming the embeddings to make their dimensions more sparse and orthogonal, we found that one embedding dimension was strongly associated with the outcome. Mapping this dimension back to the population network revealed that differences in educational ties and attainment corresponded to distinct network structures associated with right-wing populist voting. Our study contributes methodologically by demonstrating how population-scale network embeddings can be made interpretable, and substantively by linking structural network differences in education to right-wing populist voting.
Authors: A. Chervov, D. Fedoriaka, E. Konstantinova, A. Naumov, I. Kiselev, A. Sheveleva, I. Koltsov, S. Lytkin, A. Smolensky, A. Soibelman, F. Levkovich-Maslyuk, R. Grimov, D. Volovich, A. Isakov, A. Kostin, M. Litvinov, N. Vilkin-Krom, A. Bidzhiev, A. Krasnyi, M. Evseev, E. Geraseva, L. Grunwald, S. Galkin, E. Koldunov, S. Diner, A. Chevychelov, E. Kudasheva, A. Sychev, A. Kravchenko, Z. Kogan, A. Natyrova, L. Shishina, L. Cheldieva, V. Zamkovoy, D. Kovalenko, O. Papulov, S. Kudashev, D. Shiltsov, R. Turtayev, O. Nikitina, D. Mamayeva, S. Nikolenko, M. Obozov, A. Titarenko, A. Dolgorukova, A. Aparnev, O. Debeaupuis, S. Alami C., H. Isambert
Abstract: This is the third paper of the CayleyPy project applying artificial intelligence to problems in group theory. We announce the first public release of CayleyPy, an open-source Python library for computations with Cayley and Schreier graphs. Compared with systems such as GAP and Sage, CayleyPy handles much larger graphs and performs several orders of magnitude faster. Using CayleyPy we obtained about 200 new conjectures on Cayley and Schreier graphs, focused on diameters and growth. For many Cayley graphs of symmetric groups Sn we observe quasi-polynomial diameter formulas: a small set of quadratic or linear polynomials indexed by n mod s. We conjecture that this is a general phenomenon, giving efficient diameter computation despite the problem being NP-hard. We propose a refinement of the Babai type conjecture on diameters of Sn: n^2/2 + 4n upper bounds in the undirected case, compared to previous O(n^2) bounds. We also provide explicit generator families, related to involutions in a square with whiskers pattern, conjectured to maximize the diameter; search confirms this for all n up to 15. We further conjecture an answer to a question posed by V M Glushkov in 1968 on directed Cayley graphs generated by a cyclic shift and a transposition. For nilpotent groups we conjecture an improvement of J S Ellenberg's results on upper unitriangular matrices over Z/pZ, showing linear dependence of diameter on p. Some conjectures are LLM friendly, naturally stated as sorting problems verifiable by algorithms or Python code. To benchmark path finding we created more than 10 Kaggle datasets. CayleyPy works with arbitrary permutation or matrix groups and includes over 100 predefined generators. Our growth computation code outperforms GAP and Sage up to 1000 times in speed and size.
Authors: Ammar Ahmed, Azal Ahmad Khan, Ayaan Ahmad, Sheng Di, Zirui Liu, Ali Anwar
Abstract: Large reasoning models improve accuracy by producing long reasoning traces, but this inflates latency and cost, motivating inference-time efficiency. We propose Retrieval-of-Thought (RoT), which reuses prior reasoning as composable "thought" steps to guide new problems. RoT organizes steps into a thought graph with sequential and semantic edges to enable fast retrieval and flexible recombination. At inference, RoT retrieves query-relevant nodes and applies reward-guided traversal to assemble a problem-specific template that guides generation. This dynamic template reuse reduces redundant exploration and, therefore, reduces output tokens while preserving accuracy. We evaluate RoT on reasoning benchmarks with multiple models, measuring accuracy, token usage, latency, and memory overhead. Findings show small prompt growth but substantial efficiency gains, with RoT reducing output tokens by up to 40%, inference latency by 82%, and cost by 59% while maintaining accuracy. RoT establishes a scalable paradigm for efficient LRM reasoning via dynamic template construction through retrieval.
Authors: Boxuan Zhang, Yi Yu, Jiaxuan Guo, Jing Shao
Abstract: The prevalent deployment of Large Language Model agents such as OpenClaw unlocks potential in real-world applications, while amplifying safety concerns. Among these concerns, the self-replication risk of LLM agents driven by objective misalignment (just like Agent Smith in the movie The Matrix) has transitioned from a theoretical warning to a pressing reality. Previous studies mainly examine whether LLM agents can self-replicate when directly instructed, potentially overlooking the risk of spontaneous replication driven by real-world settings (e.g., ensuring survival against termination threats). In this paper, we present a comprehensive evaluation framework for quantifying self-replication risks. Our framework establishes authentic production environments and realistic tasks (e.g., dynamic load balancing) to enable scenario-driven assessment of agent behaviors. Designing tasks that might induce misalignment between users' and agents' objectives makes it possible to decouple replication success from risk and capture self-replication risks arising from these misalignment settings. We further introduce Overuse Rate ($\mathrm{OR}$) and Aggregate Overuse Count ($\mathrm{AOC}$) metrics, which precisely capture the frequency and severity of uncontrolled replication. In our evaluation of 21 state-of-the-art open-source and proprietary models, we observe that over 50\% of LLM agents display a pronounced tendency toward uncontrolled self-replication under operational pressures. Our results underscore the urgent need for scenario-driven risk assessment and robust safeguards in the practical deployment of LLM-based agents.
Authors: Samar Fares, Nurbek Tastan, Noor Hussein, Karthik Nandakumar
Abstract: Generative models can generate photorealistic images at scale. This raises urgent concerns about the ability to detect synthetically generated images and attribute these images to specific sources. While watermarking has emerged as a possible solution, existing methods remain fragile to realistic distortions, susceptible to adaptive removal, and expensive to update when the underlying watermarking key changes. We propose a general watermarking framework that formulates the encoding problem as key-dependent perturbation of the parameters of a generative model. Within this framework, we introduce Mixture of LoRA Markers (MOLM), a routing-based instantiation in which binary keys activate lightweight LoRA adapters inside residual and attention blocks. This design avoids key-specific re-training and achieves the desired properties such as imperceptibility, fidelity, verifiability, and robustness. Experiments on Stable Diffusion and FLUX show that MOLM preserves image quality while achieving robust key recovery against distortions, compression and regeneration, averaging attacks, and black-box adversarial attacks on the extractor.
Authors: Yuanfang Xiang, Lun Ai
Abstract: The transcriptional response to genetic perturbation reveals fundamental insights into complex cellular systems. While current approaches have made progress in predicting genetic perturbation responses, they provide limited biological understanding and cannot systematically refine existing knowledge. Overcoming these limitations requires an end-to-end integration of data-driven learning and existing knowledge. However, this integration is challenging due to inconsistencies between data and knowledge bases, such as noise, misannotation, and incompleteness. To address this challenge, we propose ALIGNED (Adaptive aLignment for Inconsistent Genetic kNowledgE and Data), a neuro-symbolic framework based on the Abductive Learning (ABL) paradigm. This end-to-end framework aligns neural and symbolic components and performs systematic knowledge refinement. We introduce a balanced consistency metric to evaluate the predictions' consistency against both data and knowledge. Our results show that ALIGNED outperforms state-of-the-art methods by achieving the highest balanced consistency, while also re-discovering biologically meaningful knowledge. Our work advances beyond existing methods to enable both the transparency and the evolution of mechanistic biological understanding.
Authors: Shira Schiber, Ofir Lindenbaum, Idan Schwartz
Abstract: Recent advances in generative video models have enabled the creation of high-quality videos based on natural language prompts. However, these models frequently lack fine-grained temporal control, meaning they do not allow users to specify when particular visual elements should appear within a generated sequence. In this work, we introduce TempoControl, a method that allows for temporal alignment of visual concepts during inference, without requiring retraining or additional supervision. TempoControl utilizes cross-attention maps, a key component of text-to-video diffusion models, to guide the timing of concepts through a novel optimization approach. Our method steers attention using three complementary principles: aligning its temporal pattern with a control signal (correlation), adjusting its strength where visibility is required (magnitude), and preserving semantic consistency (entropy). TempoControl provides precise temporal control while maintaining high video quality and diversity. We demonstrate its effectiveness across various applications, including temporal reordering of single and multiple objects, action timing, and audio-aligned video generation. Project page: https://shira-schiber.github.io/TempoControl/.
Authors: Amitis Shidani, Tyler Farghly, Yang Sun, Habib Ganjgahi, George Deligiannidis
Abstract: Synthetic data can improve generalization when real data is scarce, but excessive reliance may introduce distributional mismatches that degrade performance. In this paper, we present a learning-theoretic framework to quantify the trade-off between synthetic and real data. Our approach leverages algorithmic stability to derive generalization error bounds, characterizing the optimal synthetic-to-real data ratio that minimizes expected test error as a function of the Wasserstein distance between the real and synthetic distributions. We motivate our framework in the setting of kernel ridge regression with mixed data, offering a detailed analysis that may be of independent interest. Our theory predicts the existence of an optimal ratio, leading to a U-shaped behavior of test error with respect to the proportion of synthetic data. Empirically, we validate this prediction on CIFAR-10 and a clinical brain MRI dataset. Our theory extends to the important scenario of domain adaptation, showing that carefully blending synthetic target data with limited source data can mitigate domain shift and enhance generalization. We conclude with practical guidance for applying our results to both in-domain and out-of-domain scenarios.
Authors: Percy S. Zhai, So Won Jeong, Veronika Ročková
Abstract: We propose a generative multivariate posterior sampler via flow matching. It offers a simple training objective and does not require access to likelihood evaluation. The method learns a dynamic, block-triangular velocity field in the joint space of data and parameters, which results in a deterministic transport map from a source distribution to the desired posterior. The inverse map, named vector rank, is accessible by reversibly integrating the velocity over time. The dynamic design is advantageous: proper constraints on the velocity yield a monotone map, which leads to a conditional Brenier map, enabling fast and simultaneous generation of Bayesian credible sets whose contours correspond to level sets of Monge-Kantorovich data depth. Our approach is computationally lighter than GAN-based and diffusion-based counterparts and is capable of capturing complex posterior structures. Finally, we provide a frequentist theoretical guarantee on the consistency of the recovered posterior distribution and of the corresponding Bayesian credible sets.
Authors: Mykolas Sveistrys, Richard Kunert
Abstract: Retrieval-Augmented Generation (RAG) has been used in question answering (QA) systems to improve performance when relevant information is in one (single-hop) or multiple (multi-hop) passages. However, many real-life scenarios (e.g., dealing with financial, legal, medical reports) require checking all documents for relevant information without a clear stopping condition. We term these pluri-hop questions, and formalize them by three conditions: recall sensitivity, exhaustiveness, and exactness. To study this setting, we introduce PluriHopWIND, a multilingual diagnostic benchmark of 48 pluri-hop questions over 191 real wind-industry reports, with high repetitiveness to reflect the challenge of distractors in real-world datasets. Naive, graph-based, and multimodal RAG methods only reach up to 40% statement-wise F1 on PluriHopWIND. Motivated by this, we propose PluriHopRAG, which learns from synthetic examples to decompose queries according to corpus-specific document structure, and employs a cross-encoder filter at the document level to minimize costly LLM reasoning. We test PluriHopRAG on PluriHopWIND and the Loong benchmark built on financial, legal and scientific reports. On PluriHopWIND, our method shows 18-52% F1 score improvement across base LLMs, while on Loong, we show 33% improvement over long-context reasoning and 52% improvement over naive RAG.
Authors: Veranika Boukun, Jörg Lücke
Abstract: Variational autoencoders (VAEs) are among the leading approaches to address the problem of learning disentangled representations. Typically, a single VAE is used and disentangled representations are sought within its single continuous latent space. In this paper, we propose and provide a proof of concept for a novel Multi-Stream Variational Autoencoder (MS-VAE) that achieves disentanglement of sources by combining discrete and continuous latents. The discrete latents are used in an explicit source combination model that superimposes a set of sources as part of the MS-VAE decoder. We formally define the MS-VAE approach, derive its inference and learning equations, and numerically investigate its principled functionality. The MS-VAE model is very flexible and can be trained using little supervision (we use fully unsupervised learning after pretraining with some labels). In our numerical experiments, we explored the ability of the MS-VAE approach to separate both superimposed handwritten digits and sound sources. For the former task we used superimposed MNIST digits (an increasingly common benchmark). For sound separation, our experiments focused on the task of speaker diarization in a recorded conversation between two speakers. In all cases, we observe a clear separation of sources and competitive performance after training. For digit superpositions, performance is particularly competitive in complex mixtures (e.g., three and four digits). For the speaker diarization task, we observe an especially low rate of missed speakers and a more precise speaker attribution. Numerical experiments confirm the flexibility of the approach across varying amounts of supervision, and we observed high performance, e.g., when using just 10% of the labels for pretraining.
Authors: Elias Hossain, Mehrdad Shoeibi, Ivan Garibay, Niloofar Yousefi
Abstract: Interpreting gene clusters from RNA sequencing (RNA-seq) remains challenging, especially in antimicrobial resistance studies where mechanistic insight is important for hypothesis generation. Existing pathway enrichment methods can summarize co-expressed modules, but they often provide limited cluster-specific explanations and weak connections to supporting literature. We present BIOGEN, an evidence-grounded multi-agent framework for post hoc interpretation of RNA-seq transcriptional modules. BIOGEN combines biomedical retrieval, structured reasoning, and multi-critic verification to generate traceable cluster-level explanations with explicit evidence and confidence labels. On a primary Salmonella enterica dataset, BIOGEN achieved strong biological grounding, including BERTScore 0.689, Semantic Alignment Score 0.715, KEGG Functional Similarity 0.342, and a hallucination rate of 0.000, compared with 0.100 for an LLM-only baseline. Across four additional bacterial RNA-seq datasets, BIOGEN also maintained zero hallucination under the same fixed pipeline. In comparisons with representative open-source agentic AI baselines, BIOGEN was the only framework that consistently preserved zero hallucination across all five datasets. These findings suggest that retrieval alone is not enough for reliable biological interpretation, and that evidence-grounded orchestration is important for transparent and source-traceable transcriptomic reasoning.
Authors: Guneet S. Dhillon, Javier González, Teodora Pandeva, Alicia Curth
Abstract: While generative models, especially large language models (LLMs), are ubiquitous in today's world, principled mechanisms to assess their (in)correctness are limited. Using the conformal prediction framework, previous works construct sets of LLM responses where the probability of including an incorrect response, or error, is capped at a user-defined tolerance level. However, since these methods are based on p-values, they are susceptible to p-hacking, i.e., choosing the tolerance level post-hoc can invalidate the guarantees. We therefore leverage e-values to complement generative model outputs with e-scores as measures of incorrectness. In addition to achieving the guarantees as before, e-scores further provide users with the flexibility of choosing data-dependent tolerance levels while upper bounding size distortion, a post-hoc notion of error. We experimentally demonstrate their efficacy in assessing LLM outputs under different forms of correctness: mathematical factuality and property constraints satisfaction.
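The contrast between p-value-based and e-value-based guarantees can be illustrated with two standard facts about e-values. The notation below is an assumption of this note, not the paper's: $E$ is a generic e-value for whatever null hypothesis is being tested, $\alpha$ is a fixed tolerance level, and $\hat{\alpha}$ is a tolerance level chosen after seeing the data. How the paper builds its e-scores from model outputs is not specified in the abstract.

```latex
% Assumed notation (not from the paper): E is an e-value, i.e. E >= 0 and
% its expectation under the null hypothesis is at most 1.
\begin{align*}
  &\text{Fixed level (Markov's inequality):}
  && \mathbb{P}\!\left(E \ge \tfrac{1}{\alpha}\right) \le \alpha\,\mathbb{E}[E] \le \alpha, \\
  &\text{Pointwise bound, any } \alpha > 0:
  && \frac{\mathbf{1}\{E \ge 1/\alpha\}}{\alpha} \le E, \\
  &\text{Data-dependent level } \hat{\alpha}:
  && \mathbb{E}\!\left[\frac{\mathbf{1}\{E \ge 1/\hat{\alpha}\}}{\hat{\alpha}}\right]
     \le \mathbb{E}[E] \le 1.
\end{align*}
```

The second line holds for every $\alpha$ simultaneously, which is why the third line remains valid when $\hat{\alpha}$ is picked post hoc. A guarantee stated for a p-value at a fixed $\alpha$ has no analogous pointwise bound.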
Authors: Zong-Han Bai, Po-Yen Chu
Abstract: Intermittent demand forecasting poses unique challenges due to sparse observations, cold-start items, and obsolescence. Classical models such as Croston, SBA, and the Teunter-Syntetos-Babai (TSB) method provide simple heuristics but lack a principled generative foundation. We introduce TSB-HB, a hierarchical Bayesian extension of TSB. Demand occurrence is modeled with a Beta-Binomial distribution, while nonzero demand sizes follow a Log-Normal distribution. Crucially, hierarchical priors enable partial pooling across items, stabilizing estimates for sparse or cold-start series while preserving heterogeneity. This framework provides a coherent generative reinterpretation of the classical TSB structure. On the UCI Online Retail dataset, TSB-HB achieves the lowest RMSE and RMSSE among all baselines, while remaining competitive in MAE. On a 5,000-series M5 sample, it improves MAE and RMSE over classical intermittent baselines. Under the calibrated probabilistic configuration, TSB-HB yields competitive pinball loss and a favorable sharpness-calibration tradeoff among the parametric baselines reported in the main text.
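For readers unfamiliar with the classical TSB baseline that TSB-HB extends, the following is a minimal sketch of the standard TSB recursion, not of the hierarchical Bayesian model itself. The function name, the default smoothing constants, and the initial values `p0` and `z0` are illustrative assumptions.

```python
def tsb_forecast(demand, alpha=0.1, beta=0.1, p0=0.5, z0=1.0):
    """Classical TSB sketch (not TSB-HB).

    p tracks the probability that demand occurs and is updated every period.
    z tracks the size of nonzero demand and is updated only when demand occurs.
    The forecast for a period is p * z.
    alpha, beta, p0 and z0 are illustrative defaults, not values from the paper.
    """
    p, z = p0, z0
    forecasts = []
    for y in demand:
        forecasts.append(p * z)          # forecast issued before observing y
        occurred = 1.0 if y > 0 else 0.0
        p += beta * (occurred - p)       # occurrence probability: every period
        if y > 0:
            z += alpha * (y - z)         # demand size: only on nonzero demand
    return forecasts, p * z              # in-sample forecasts, next-period forecast


history = [0, 0, 4, 0]
in_sample, next_period = tsb_forecast(history, alpha=0.5, beta=0.5, p0=0.5, z0=2.0)
print(in_sample, next_period)
```

Because `p` decays during runs of zeros, the forecast shrinks toward zero for items that stop selling, which is how TSB handles obsolescence. The recursion keeps a separate state for each series, so it has no mechanism for sharing information across items.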
Authors: Wenlun Zhang, Yunshan Zhong, Zihao Ding, Xinyu Li, Kentaro Yoshioka
Abstract: Data-Free Quantization (DFQ) offers a practical solution for model compression without requiring access to real data, making it particularly attractive in privacy-sensitive scenarios. While DFQ has shown promise for unimodal models, its extension to Vision-Language Models such as Contrastive Language-Image Pre-training (CLIP) models remains underexplored. In this work, we reveal that directly applying existing DFQ techniques to CLIP results in substantial performance degradation due to two key limitations: insufficient semantic content and low intra-image diversity in synthesized samples. To tackle these challenges, we propose D4C, the first DFQ framework tailored for CLIP. D4C synthesizes semantically rich and structurally diverse pseudo images through three key components: 1) Prompt-Guided Semantic Injection aligns generated images with real-world semantics using text prompts; 2) Structural Contrastive Generation reproduces compositional structures of natural images by leveraging foreground-background contrastive synthesis; and 3) Perturbation-Aware Enhancement applies controlled perturbations to improve sample diversity and robustness. These components jointly empower D4C to synthesize images that are both semantically informative and structurally diverse, effectively bridging the performance gap of DFQ on CLIP. Extensive experiments validate the effectiveness of D4C, showing significant performance improvements on various bit-widths and models.
Authors: Asad Aali, Muhammad Ahmed Mohsin, Vasiliki Bikia, Arnav Singhvi, Richard Gaus, Suhana Bedi, Hejie Cui, Miguel Fuentes, Alyssa Unell, Yifan Mai, Jordan Cahoon, Michael Pfeffer, Roxana Daneshjou, Sanmi Koyejo, Emily Alsentzer, Christopher Potts, Nigam H. Shah, Akshay S. Chaudhari
Abstract: As language models (LMs) are increasingly adopted across domains, high-quality benchmarking frameworks are essential for guiding deployment decisions. In practice, however, frameworks such as Holistic Evaluation of Language Models (HELM) typically evaluate models under a single static prompt configuration, even though model behavior depends strongly on prompt choice. As a result, reported scores can reflect prompt choice as much as model capability. Declarative prompting frameworks such as DSPy offer a scalable way to evaluate models under a set of structured prompting strategies rather than a static prompt configuration. We present a reproducible DSPy+HELM framework for studying how prompt choice impacts reported benchmark outcomes. Using five prompting methods, we evaluate four frontier and two open-source LMs across seven benchmarks against existing HELM baseline scores. By evaluating LMs across a family of prompt configurations, we find that prompt choice can materially impact leaderboard outcomes. In particular, structured prompting improves performance (by 6% on average), alters comparisons (leaderboard rankings shift on 5/7 benchmarks), with most gains coming from introducing chain-of-thought, and little additional benefit from more advanced optimizers. To our knowledge, this is the first study to systematically integrate structured prompting into an established evaluation framework and quantify how prompt choice alone can impact benchmark conclusions. We open-source (i) DSPy+HELM Evaluation (https://github.com/stanford-crfm/helm/pull/3893) and (ii) Prompt Optimization Pipeline (https://github.com/StanfordMIMI/dspy-helm).
URLs: https://github.com/stanford-crfm/helm/pull/3893, https://github.com/StanfordMIMI/dspy-helm
Authors: Yuexin Xiang, Yuchen Lei, Yuanzhe Zhang, Qin Wang, Tsz Hon Yuen, Andreas Deppeler, Jiangshan Yu
Abstract: Stablecoins such as USDT and USDC aspire to peg stability by coupling issuance controls with reserve attestations. In practice, however, transparency remains fragmented across heterogeneous data sources, with key evidence about circulation, reserves, and disclosure dispersed across records that are difficult to connect and interpret jointly. We introduce a large language model (LLM)-based automated framework for bridging cross-domain transparency in stablecoins by aligning issuer disclosures with observable circulation evidence. First, we propose an integrative framework using LLMs to parse documents, extract salient financial indicators, and semantically align reported statements with corresponding market and issuance metrics. Second, we integrate multi-chain issuance records and disclosure documents within a model context protocol (MCP) framework that standardizes LLM access to both quantitative market data and qualitative disclosure narratives. This framework enables unified retrieval and contextual alignment across heterogeneous stablecoin information sources and facilitates consistent analysis. Third, we demonstrate the capability of LLMs to operate across heterogeneous data domains in blockchain analytics, quantifying discrepancies between reported and observed circulation and examining their implications for transparency and price dynamics. Our findings reveal systematic gaps between disclosed and verifiable data, showing that LLM-assisted analysis enhances cross-domain transparency and supports automated, data-driven auditing in decentralized finance (DeFi).
Authors: Kehlani Fay, Darin Anthony Djapri, Anya Zorin, James Clinton, Ali El Lahib, Hao Su, Michael T. Tolley, Sha Yi, Xiaolong Wang
Abstract: Dexterous manipulation is limited by both control and design, without consensus as to what makes manipulators best for performing dexterous tasks. This raises a fundamental challenge: how should we design and control robot manipulators that are optimized for dexterity? We present a co-design framework that learns task-specific hand morphology and complementary dexterous control policies. The framework supports 1) an expansive morphology search space including joint, finger, and palm generation, 2) scalable evaluation across the wide design space via morphology-conditioned cross-embodied control, and 3) real-world fabrication with accessible components. We evaluate the approach across multiple dexterous tasks, including in-hand rotation with simulation and real deployment. Our framework enables an end-to-end pipeline that can design, train, fabricate, and deploy a new robotic hand in under 24 hours. The full framework will be open-sourced and available on our website: https://an-axolotl.github.io/HouseofDextra/
Authors: Weifan Guan, Qinghao Hu, Huasen Xi, Chenxiao Zhang, Aosheng Li, Jian Cheng
Abstract: Vision-language-action (VLA) models and LLM agents have advanced rapidly, yet reliable deployment on physical robots is often hindered by an interface mismatch between agent tool APIs and robot middleware. Current implementations typically rely on ad-hoc wrappers that are difficult to reuse, and changes to the VLA backend or serving stack often necessitate extensive re-integration. We introduce RoboNeuron, a middleware layer that connects the Model Context Protocol (MCP) for LLM agents with robot middleware such as ROS2. RoboNeuron bridges these ecosystems by deriving agent-callable tools directly from ROS schemas, providing a unified execution abstraction that supports both direct commands and modular composition, and localizing backend, runtime, and acceleration-preset changes within a stable inference boundary. We evaluate RoboNeuron in simulation and on hardware through multi-platform base control, arm motion, and VLA-based grasping tasks, demonstrating that it enables modular system orchestration under a unified interface while supporting backend transitions without system rewiring. The full code is available on GitHub: https://github.com/guanweifan/RoboNeuron
Authors: Samar Fares, Nurbek Tastan, Karthik Nandakumar
Abstract: The advent of high-quality video generation models has amplified the need for robust watermarking schemes that can be used to reliably detect and track the provenance of generated videos. Existing video watermarking methods based on both post-hoc and in-generation approaches fail to simultaneously achieve imperceptibility, robustness, and computational efficiency. This work introduces a novel framework for in-generation video watermarking called SPDMark (pronounced `SpeedMark') based on selective parameter displacement of a video diffusion model. Watermarks are embedded into the generated videos by modifying a subset of parameters in the generative model. To make the problem tractable, the displacement is modeled as an additive composition of layer-wise basis shifts, where the final composition is indexed by the watermarking key. For parameter efficiency, this work specifically leverages low-rank adaptation (LoRA) to implement the basis shifts. During the training phase, the basis shifts and the watermark extractor are jointly learned by minimizing a combination of message recovery, perceptual similarity, and temporal consistency losses. To detect and localize temporal modifications in the watermarked videos, we use a cryptographic hashing function to derive frame-specific watermark messages from the given base watermarking key. During watermark extraction, maximum bipartite matching is applied to recover the correct frame order, even from temporally tampered videos. Evaluations on both text-to-video and image-to-video generation models demonstrate the ability of SPDMark to generate imperceptible watermarks that can be recovered with high accuracy and also establish its robustness against a variety of common video modifications.
Authors: Maximilian Schebek, Nikolas M. Froböse, Bettina G. Keller, Jutta Rogal
Abstract: Accurate calculations of solvation free energies remain a central challenge in molecular simulations, often requiring extensive sampling and numerous alchemical intermediates to ensure sufficient overlap between phase-space distributions of a solute in the gas phase and in solution. Here, we introduce a computational framework based on normalizing flows that directly maps solvent configurations between solutes of different sizes, and compare the accuracy and efficiency to conventional free energy estimates. For a Lennard-Jones solvent, we demonstrate that this approach yields acceptable accuracy in estimating free energy differences for challenging transformations, such as solute growth or increased solute-solute separation, which typically demand multiple intermediate simulation steps along the transformation. Analysis of radial distribution functions indicates that the flow generates physically meaningful solvent rearrangements, substantially enhancing configurational overlap between states in configuration space. These results suggest flow-based models as a promising alternative to traditional free energy estimation methods.
Authors: Jan Tagscherer, Sarah de Boer, Lena Philipp, Fennie van der Graaf, Dré Peeters, Joeran Bosma, Lars Leijten, Bogdan Obreja, Ewoud Smit, Alessa Hering
Abstract: Developing foundation models in medical imaging requires continuous monitoring of downstream performance. Researchers are burdened with tracking numerous experiments, design choices, and their effects on performance, often relying on ad-hoc, manual workflows that are inherently slow and error-prone. We introduce EvalBlocks, a modular, plug-and-play framework for efficient evaluation of foundation models during development. Built on Snakemake, EvalBlocks supports seamless integration of new datasets, foundation models, aggregation methods, and evaluation strategies. All experiments and results are tracked centrally and are reproducible with a single command, while efficient caching and parallel execution enable scalable use on shared compute infrastructure. Demonstrated on five state-of-the-art foundation models and three medical imaging classification tasks, EvalBlocks streamlines model evaluation, enabling researchers to iterate faster and focus on model innovation rather than evaluation logistics. The framework is released as open source software at https://github.com/DIAGNijmegen/eval-blocks.
Authors: Lucas Kook, Søren Wengel Mogensen
Abstract: Learning the dependence structure among variables in complex systems is a central problem across medical, natural, and social sciences. These structures can be naturally represented by graphs, and the task of inferring such graphs from data is known as graph learning or causal discovery. Existing approaches typically rely on restrictive assumptions about the data-generating process, employ greedy oracle algorithms, or solve approximate formulations of the graph learning problem. Therefore, they are either sensitive to violations of central assumptions or fail to guarantee globally optimal solutions. We address these limitations by introducing a nonparametric graph learning framework based on conditional independence testing and integer programming. We reformulate the graph learning problem as a mixed-integer program and prove that solving this integer-programming problem provides a globally optimal solution to the original graph learning problem. Our method leverages efficient encodings of graphical separation criteria, enabling the exact recovery of larger graphs than was previously feasible. We provide an open-source R package 'glip' which supports learning (acyclic) directed (mixed) graphs and chain graphs. We demonstrate that our approach is often faster than existing exact graph learning procedures and achieves state-of-the-art performance on simulated and benchmark data across all aforementioned classes of graphs.
Authors: Aruna Sankaranarayanan, Amir Zur, Atticus Geiger, Dylan Hadfield-Menell
Abstract: Where should we intervene in a language model (LM) to localize and control behaviors that are diffused across many tokens of a long-form response? We introduce Generative Causal Mediation (GCM), a procedure for selecting model components (e.g., attention heads) from contrastive long-form responses, to steer such diffuse concepts (e.g., talk in verse vs. talk in prose). In GCM, we first construct a dataset of contrasting behavioral inputs and long-form responses. Then, we quantify how model components mediate the concept and select the strongest mediators for steering. We evaluate GCM on three behaviors--refusal, sycophancy, and style transfer--across three language models. GCM successfully localizes concepts expressed in long-form responses and outperforms correlational probe-based baselines when steering with a sparse set of attention heads. Together, these results demonstrate that GCM provides an effective approach for localizing from and controlling the long-form responses of LMs.
Authors: Björn Hoppmann, Christoph Scholz
Abstract: Humans are highly effective at utilizing prior knowledge to adapt to novel tasks, a capability that standard machine learning models struggle to replicate due to their reliance on task-specific training. Meta-learning overcomes this limitation by allowing models to acquire transferable knowledge from various tasks, enabling rapid adaptation to new challenges with minimal data. This survey provides a rigorous, task-based formalization of meta-learning and meta-reinforcement learning and uses that paradigm to chronicle the landmark algorithms that paved the way for DeepMind's Adaptive Agent, consolidating the essential concepts needed to understand the Adaptive Agent and other generalist approaches.
Authors: Haoyang Fang, Shuai Zhang, Yifei Ma, Hengyi Wang, Cuixiong Hu, Katrin Kirchhoff, Bernie Wang, George Karypis
Abstract: Domain-specific finetuning is essential for dense retrievers, yet not all training pairs contribute equally to the learning process. We introduce OPERA, a data pruning framework that exploits this heterogeneity to improve both the effectiveness and efficiency of retrieval model adaptation. We first investigate static pruning (SP), which retains only high-similarity query-document pairs, revealing an intrinsic quality-coverage tradeoff: ranking (NDCG) improves while retrieval (Recall) can degrade due to reduced query diversity. To resolve this tradeoff, we propose a two-stage dynamic pruning (DP) strategy that adaptively modulates sampling probabilities at both query and document levels throughout training, prioritizing high-quality examples while maintaining access to the full training set. Evaluations across eight datasets spanning six domains demonstrate the effectiveness of both approaches: SP improves ranking over standard finetuning (NDCG@10 +0.5%), while DP achieves the strongest performance on both ranking (NDCG@10 +1.9%) and retrieval (Recall@20 +0.7%), with an average rank of 1.38 across all methods. These findings scale to Qwen3-Embedding, an LLM-based dense retriever, confirming architecture-agnostic benefits. Notably, DP reaches comparable performance in less than 50% of the training time required by standard finetuning.
Authors: Ishrith Gowda, Chunwei Liu
Abstract: Multi-site neuroimaging analysis is fundamentally confounded by scanner-induced covariate shifts, where the marginal distribution of voxel intensities $P(\mathbf{x})$ varies non-linearly across acquisition protocols while the conditional anatomy $P(\mathbf{y}|\mathbf{x})$ remains constant. This is particularly detrimental to radiomic reproducibility, where acquisition variance often exceeds biological pathology variance. Existing statistical harmonization methods (e.g., ComBat) operate in feature space, precluding spatial downstream tasks, while standard deep learning approaches are theoretically bounded by local effective receptive fields (ERF), failing to model the global intensity correlations characteristic of field-strength bias. We propose SA-CycleGAN-2.5D, a domain adaptation framework motivated by the $H\Delta H$-divergence bound of Ben-David et al., integrating three architectural innovations: (1) A 2.5D tri-planar manifold injection preserving through-plane gradients $\nabla_z$ at $O(HW)$ complexity; (2) A U-ResNet generator with dense voxel-to-voxel self-attention, surpassing the $O(\sqrt{L})$ receptive field limit of CNNs to model global scanner field biases; and (3) A spectrally-normalized discriminator constraining the Lipschitz constant ($K_D \le 1$) for stable adversarial optimization. Evaluated on 654 glioma patients across two institutional domains (BraTS and UPenn-GBM), our method reduces Maximum Mean Discrepancy (MMD) by 99.1% ($1.729 \to 0.015$) and degrades domain classifier accuracy to near-chance (59.7%). Ablation confirms that global attention is statistically essential (Cohen's $d = 1.32$, $p < 0.001$) for the harder heterogeneous-to-homogeneous translation direction. By bridging 2D efficiency and 3D consistency, our framework yields voxel-level harmonized images that preserve tumor pathophysiology, enabling reproducible multi-center radiomic analysis.
Authors: Eric A. Moreno, Samuel Bright-Thonney, Andrzej Novak, Dolores Garcia, Philip Harris
Abstract: Large language model-based AI agents are now able to autonomously execute substantial portions of a high energy physics (HEP) analysis pipeline with minimal expert-curated input. Given access to a HEP dataset, an execution framework, and a corpus of prior experimental literature, we find that Claude Code succeeds in automating all stages of a typical analysis: event selection, background estimation, uncertainty quantification, statistical inference, and paper drafting. We argue that the experimental HEP community is underestimating the current capabilities of these systems, and that most proposed agentic workflows are too narrowly scoped or scaffolded to specific analysis structures. We present a proof-of-concept framework, Just Furnish Context (JFC), that integrates autonomous analysis agents with literature-based knowledge retrieval and multi-agent review, and show that this is sufficient to plan, execute, and document a credible high energy physics analysis. We demonstrate this by conducting analyses on open data from ALEPH, DELPHI, and CMS to perform electroweak, QCD, and Higgs boson measurements. Rather than replacing physicists, these tools promise to offload the repetitive technical burden of analysis code development, freeing researchers to focus on physics insight, truly novel method development, and rigorous validation. Given these developments, we advocate for new strategies for how the community trains students, organizes analysis efforts, and allocates human expertise.
Authors: Tanay Varshney, Annie Surla, Michelle Xu, Gomathy Venkata Krishnan, Maximilian Jeblick, David Austin, Neal Vaidya, Davide Onofrio
Abstract: LLMs often achieve similar average benchmark accuracies while exhibiting complementary strengths on different subsets of queries, suggesting that a router with query-specific model selection can outperform any single model. While existing routers rely on semantic query features, they often fail to capture model-specific failures or intrinsic task difficulty. We instead study routing via internal prefill activations. Our key idea, Encoder-Target Decoupling, separates the model that produces the predictive signal (the Encoder) from the model whose correctness is being estimated (the Target), allowing open-weight encoders to predict the performance of closed-source target models. We evaluate layerwise geometric probes, finding that Fisher Separability (J) effectively identifies informative layers, supported by Effective Dimensionality (d_eff) diagnostics. We then utilize a SharedTrunkNet, a joint multi-output MLP that predicts simultaneous correctness probabilities across candidate models using concatenated prefill features. In our experiments, SharedTrunkNet consistently outperforms semantic baselines. At its best, SharedTrunkNet closes 45.58% of the gap between the strongest standalone model and the oracle while achieving 74.31% cost savings relative to the most expensive model. These results demonstrate that prefill activations provide a robust routing signal, establishing mechanistic routing as a high-performance alternative to purely semantic selection.
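As an illustration of how a separability score can rank layers, the sketch below computes a trace-ratio Fisher criterion, that is, between-class scatter divided by within-class scatter, over one layer's features. This is one common definition and is an assumption here: the abstract does not state the exact form of J that the paper uses. The function name and the toy data are hypothetical.

```python
def fisher_separability(features, labels):
    """Trace-ratio Fisher criterion: trace(S_B) / trace(S_W).

    features: list of equal-length numeric lists (one row per query).
    labels:   class label per row, e.g. 1 if the target model answered correctly.
    Assumed definition for illustration; the paper's exact J may differ.
    """
    dims = len(features[0])
    n = len(features)
    overall = [sum(row[d] for row in features) / n for d in range(dims)]

    groups = {}
    for row, label in zip(features, labels):
        groups.setdefault(label, []).append(row)

    between = 0.0
    within = 0.0
    for rows in groups.values():
        mean = [sum(r[d] for r in rows) / len(rows) for d in range(dims)]
        # Scatter of the class mean around the overall mean, weighted by class size.
        between += len(rows) * sum((mean[d] - overall[d]) ** 2 for d in range(dims))
        # Scatter of the samples around their own class mean.
        within += sum((r[d] - mean[d]) ** 2 for r in rows for d in range(dims))
    return between / within if within > 0 else float("inf")


# Toy one-dimensional "layer activations" for four queries.
layer = [[0.0], [2.0], [10.0], [12.0]]
print(fisher_separability(layer, [0, 0, 1, 1]))  # classes far apart
print(fisher_separability(layer, [0, 1, 0, 1]))  # classes interleaved
```

Under this definition, a layer whose activations cluster by correctness scores higher than one where correct and incorrect queries overlap, so the scores can be compared across layers before any probe is trained.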
Authors: Brianna Binder, Agnimitra Dasgupta, Assad Oberai
Abstract: We propose closed-form conditional diffusion models for data assimilation. Diffusion models use data to learn the score function (defined as the gradient of the log-probability density of a data distribution), allowing them to generate new samples from the data distribution by reversing a noise injection process. While it is common to train neural networks to approximate the score function, we leverage the analytical tractability of the score function to assimilate the states of a system with measurements. To enable the efficient evaluation of the score function, we use kernel density estimation to model the joint distribution of the states and their corresponding measurements. The proposed approach also inherits the capability of conditional diffusion models of operating in black-box settings, i.e., the proposed data assimilation approach can accommodate systems and measurement processes without their explicit knowledge. The ability to accommodate black-box systems combined with the superior capabilities of diffusion models in approximating complex, non-Gaussian probability distributions means that the proposed approach offers advantages over many widely used filtering methods. We evaluate the proposed method on nonlinear data assimilation problems based on the Lorenz-63 and Lorenz-96 systems of moderate dimensionality and nonlinear measurement models. Results show the proposed approach outperforms the widely used ensemble Kalman and particle filters when small to moderate ensemble sizes are used.
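To see why a kernel density estimate makes the score function analytically tractable, consider the following worked sketch. It assumes Gaussian product kernels and a linear-Gaussian forward noising process. The symbols $h_x$, $h_y$, $a_t$, $s_t$ and $v_t$ are notation introduced here for illustration, and the paper's actual kernel and noise schedule may differ.

```latex
% Assumed notation (not from the paper): samples (x_i, y_i), i = 1..N, of states
% and measurements; Gaussian kernels with bandwidths h_x, h_y; forward noising
% x_t = a_t x_0 + s_t \epsilon with \epsilon ~ N(0, I).
\begin{align*}
  \hat{p}(x_0, y) &= \frac{1}{N}\sum_{i=1}^{N}
      \mathcal{N}\!\left(x_0;\, x_i,\, h_x^2 I\right)
      \mathcal{N}\!\left(y;\, y_i,\, h_y^2 I\right), \\
  \hat{p}_t(x_t \mid y) &\propto \sum_{i=1}^{N}
      \mathcal{N}\!\left(y;\, y_i,\, h_y^2 I\right)
      \mathcal{N}\!\left(x_t;\, a_t x_i,\, v_t I\right),
      \qquad v_t = a_t^2 h_x^2 + s_t^2, \\
  \nabla_{x_t} \log \hat{p}_t(x_t \mid y) &= \sum_{i=1}^{N} w_i(x_t, y)\,
      \frac{a_t x_i - x_t}{v_t}, \\
  w_i(x_t, y) &= \frac{\mathcal{N}\!\left(y;\, y_i,\, h_y^2 I\right)
                       \mathcal{N}\!\left(x_t;\, a_t x_i,\, v_t I\right)}
      {\sum_{j=1}^{N} \mathcal{N}\!\left(y;\, y_j,\, h_y^2 I\right)
                      \mathcal{N}\!\left(x_t;\, a_t x_j,\, v_t I\right)}.
\end{align*}
```

Under these assumptions the conditional score is a weighted average over the ensemble members, with weights that favor members whose measurement $y_i$ is close to the observed $y$. Evaluating it needs only the sampled state-measurement pairs, with no trained network and no explicit system or measurement model, which is consistent with the black-box setting described in the abstract.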
Authors: Jiahua Liu, Benchong Li
Abstract: In machine learning theory, the No-Clash Teaching Dimension was introduced as a provably optimal complexity measure for collusion-free teaching, which rules out unnatural coding schemes between teacher and learner. However, whether the No-Clash Teaching Dimension is upper-bounded by the Vapnik-Chervonenkis dimension remains unknown. In this paper, for any finite concept class, we construct fragments of size equal to its Vapnik-Chervonenkis dimension that identify concepts through an ordered compression scheme. These fragments naturally serve as teaching sets, and one can easily see that they satisfy the non-clashing condition; hence, this open question is resolved for finite concept classes.
Authors: Guilin Zhang, Wulan Guo, Ziqi Tan, Chuanyi Sun, Hailong Jiang
Abstract: Deep learning applications at the network edge lead to significant growth in AI-related carbon emissions, presenting a critical sustainability challenge. Existing edge computing frameworks optimize for latency and throughput, but they largely ignore the environmental impact of inference workloads. This paper introduces CarbonEdge, a carbon-aware deep learning inference framework that extends adaptive model partitioning with carbon footprint estimation and green scheduling capabilities. We propose a carbon-aware scheduling algorithm that extends traditional weighted scoring with a carbon efficiency metric, supporting a tunable performance-carbon trade-off (demonstrated via a weight sweep). Experimental evaluations on Docker-simulated heterogeneous edge environments show that the CarbonEdge-Green mode achieves a 22.9% reduction in carbon emissions compared to monolithic execution. The framework achieves a 1.3x improvement in carbon efficiency (245.8 vs 189.5 inferences per gram CO2) with negligible scheduling overhead (0.03 ms per task). These results highlight the framework's potential for sustainable edge AI deployment, providing researchers and practitioners with a tool to quantify and minimize the environmental footprint of distributed deep learning inference.
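The following sketch illustrates what weighted scoring with a carbon term and a tunable trade-off weight could look like. It is not CarbonEdge's actual algorithm: the `Node` fields, the normalization, the `pick_node` function, and the example node values are all hypothetical. The final line only checks the reported efficiency ratio from the two figures quoted in the abstract.

```python
from dataclasses import dataclass

@dataclass
class Node:
    name: str
    latency_ms: float              # hypothetical measured inference latency
    carbon_g_per_inference: float  # hypothetical estimated emissions per inference

def pick_node(nodes, carbon_weight):
    """Pick the node maximizing a convex mix of performance and carbon scores.

    carbon_weight = 0.0 ranks by latency only; 1.0 ranks by carbon only.
    """
    best_latency = min(n.latency_ms for n in nodes)
    best_carbon = min(n.carbon_g_per_inference for n in nodes)

    def score(n):
        perf = best_latency / n.latency_ms               # 1.0 for the fastest node
        green = best_carbon / n.carbon_g_per_inference   # 1.0 for the cleanest node
        return (1.0 - carbon_weight) * perf + carbon_weight * green

    return max(nodes, key=score)

nodes = [Node("fast", 10.0, 0.008), Node("green", 25.0, 0.004)]
# Sweeping the weight shows where the scheduler's choice flips.
sweep = {w: pick_node(nodes, w).name for w in (0.0, 0.25, 0.5, 0.75, 1.0)}

# Arithmetic check of the reported efficiency gain: 245.8 / 189.5 inferences per gram CO2.
efficiency_ratio = 245.8 / 189.5
```

With the two hypothetical nodes above, the choice switches from the fast node to the green node once the carbon weight is large enough, which is the kind of trade-off a weight sweep exposes.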
Authors: Rishi Rani, Massimo Franceschetti
Abstract: We consider the problem of learning-based man-in-the-middle (MITM) attacks in cyber-physical systems (CPS), and extend our previously proposed Bellman Deviation Detection (BDD) framework for model-free reinforcement learning (RL). We refine the standard MDP attack model by allowing the reward function to depend on both the current and subsequent states, thereby capturing reward variations induced by errors in the adversary's transition estimate. We also derive an optimal system-identification strategy for the adversary that minimizes detectable value deviations. Further, we prove that the agent's asymptotic learning time required to secure the system scales linearly with the adversary's learning time, and that this matches the optimal lower bound. Hence, the proposed detection scheme is order-optimal in detection efficiency. Finally, we extend the framework to asynchronous and intermittent attack scenarios, where reliable detection is preserved.
Authors: Sean Disarò, Ruma Rani Maity, Aras Bacho
Abstract: Nonlinear Partial Differential Equations (PDEs) are ubiquitous in mathematical physics and engineering. Although Physics-Informed Neural Networks (PINNs) have emerged as a powerful tool for solving PDE problems, they typically struggle to identify multiple distinct solutions, since they are designed to find one solution at a time. To address this limitation, we introduce Deflation-PINNs, a novel framework that integrates a deflation loss with an architecture based on PINNs and Deep Operator Networks (DeepONets). By incorporating a deflation term into the loss function, our method systematically forces the Deflation-PINN to seek and converge upon finitely many distinct solution branches. We provide theoretical evidence on the convergence of our model and demonstrate the efficacy of Deflation-PINNs through numerical experiments on the Landau-de Gennes model of liquid crystals, a system renowned for its complex energy landscape and multiple equilibrium states. Our results show that Deflation-PINNs can successfully identify and characterize multiple distinct crystal structures.