new Attractor Patch Networks: Reducing Catastrophic Forgetting with Routed Low-Rank Patch Experts

Authors: Shashank

Abstract: Transformers achieve strong language modeling accuracy, yet their position-wise feed-forward networks (FFNs) are dense, globally shared, and typically updated end to end. These properties create two practical tensions. First, dense FFNs spend the same compute on every token regardless of context, and they allocate capacity uniformly even when language exhibits highly clustered context structure. Second, continual learning, in the sense of updating the model while serving a data stream, often produces interference because a small update touches broadly shared weights. We propose Attractor Patch Networks (APN), a plug-compatible replacement for the Transformer FFN. APN is a bank of patch experts. A similarity router selects a small top-k set of patches for each token by matching the token representation to learned prototypes. Each selected patch emits a low-rank residual update conditioned on a compact code. The architecture yields conditional, context-specialized nonlinear transformations while preserving the standard Transformer interface. This paper focuses on APN as an architectural primitive. We formalize APN, analyze its expressivity as a piecewise low-rank residual function class, and derive simple interference and stability arguments that make APN naturally compatible with continual learning. In experiments on character-level language modeling, APN achieves competitive perplexity (4.57 vs 4.32 PPL) while enabling dramatically better continual adaptation: when adapting to a shifted domain, APN achieves 2.6 times better retention (11.1 vs 29.4 PPL on the original domain) and 2.8 times better adaptation (6.4 vs 17.8 PPL on the new domain) compared to global fine-tuning of a dense FFN baseline.

new Neural Sabermetrics with World Model: Play-by-play Predictive Modeling with Large Language Model

Authors: Young Jin Ahn, Yiyang Du, Zheyuan Zhang, Haisen Kang

Abstract: Classical sabermetrics has profoundly shaped baseball analytics by summarizing long histories of play into compact statistics. While these metrics are invaluable for valuation and retrospective analysis, they do not define a generative model of how baseball games unfold pitch by pitch, leaving most existing approaches limited to single-step prediction or post-hoc analysis. In this work, we present Neural Sabermetrics with World Model, a Large Language Model (LLM) based play-by-play world model for baseball. We cast baseball games as long auto-regressive sequences of events and continuously pretrain a single LLM on more than ten years of Major League Baseball (MLB) tracking data, comprising over seven million pitch sequences and approximately three billion tokens. The resulting model is capable of predicting multiple aspects of game evolution within a unified framework. We evaluate our model on both in-distribution regular-season data and out-of-distribution postseason games and compare against strong neural baselines from prior work. Despite using a single backbone model, our approach outperforms the performance of existing baselines, (1) correctly predicting approximately 64% of next pitches within a plate appearance and (2) 78% of batter swing decisions, suggesting that LLMs can serve as effective world models for sports.

new Lagged backward-compatible physics-informed neural networks for unsaturated soil consolidation analysis

Authors: Dong Li (Department of Civil, Environmental, and Infrastructure Engineering, George Mason University, Fairfax, VA, USA), Shuai Huang (National Institute of Natural Hazards, Ministry of Emergency Management, Beijing, China), Yapeng Cao (State Key Laboratory of Cryospheric Science and Frozen Soil Engineering, Northwest Institute of Eco-Environment and Resources, Chinese Academy of Sciences, Lanzhou, China, Navier Laboratory, \'Ecole Nationale des Ponts et Chauss\'ees, Marne-la-Vall\'ee Cedex 2, France), Yujun Cui (Navier Laboratory, \'Ecole Nationale des Ponts et Chauss\'ees, Marne-la-Vall\'ee Cedex 2, France), Xiaobin Wei (School of Civil Engineering, Hebei University of Engineering, Handan, China), Hongtao Cao (College of Civil Engineering, Zhejiang University of Technology, Hangzhou, China)

Abstract: This study develops a Lagged Backward-Compatible Physics-Informed Neural Network (LBC-PINN) for simulating and inverting one-dimensional unsaturated soil consolidation under long-term loading. To address the challenges of coupled air and water pressure dissipation across multi-scale time domains, the framework integrates logarithmic time segmentation, lagged compatibility loss enforcement, and segment-wise transfer learning. In forward analysis, the LBC-PINN with recommended segmentation schemes accurately predicts pore air and pore water pressure evolution. Model predictions are validated against finite element method (FEM) results, with mean absolute errors below 1e-2 for time durations up to 1e10 seconds. A simplified segmentation strategy based on the characteristic air-phase dissipation time improves computational efficiency while preserving predictive accuracy. Sensitivity analyses confirm the robustness of the framework across air-to-water permeability ratios ranging from 1e-3 to 1e3.

new TransConv-DDPM: Enhanced Diffusion Model for Generating Time-Series Data in Healthcare

Authors: Md Shahriar Kabir, Sana Alamgeer, Minakshi Debnath, Anne H. H. Ngu

Abstract: The lack of real-world data in clinical fields poses a major obstacle in training effective AI models for diagnostic and preventive tools in medicine. Generative AI has shown promise in increasing data volume and enhancing model training, particularly in computer vision and natural language processing (NLP) domains. However, generating physiological time-series data, a common type in medical AI applications, presents unique challenges due to its inherent complexity and variability. This paper introduces TransConv-DDPM, an enhanced generative AI method for biomechanical and physiological time-series data generation. The model employs a denoising diffusion probabilistic model (DDPM) with U-Net, multi-scale convolution modules, and a transformer layer to capture both global and local temporal dependencies. We evaluated TransConv-DDPM on three diverse datasets, generating both long and short-sequence time-series data. Quantitative comparisons against state-of-the-art methods, TimeGAN and Diffusion-TS, using four performance metrics, demonstrated promising results, particularly on the SmartFallMM and EEG datasets, where it effectively captured the more gradual temporal change patterns between data points. Additionally, a utility test on the SmartFallMM dataset revealed that adding synthetic fall data generated by TransConv-DDPM improved predictive model performance, showing a 13.64% improvement in F1-score and a 14.93% increase in overall accuracy compared to the baseline model trained solely on fall data from the SmartFallMM dataset. These findings highlight the potential of TransConv-DDPM to generate high-quality synthetic data for real-world applications.

new AVERE: Improving Audiovisual Emotion Reasoning with Preference Optimization

Authors: Ashutosh Chaubey, Jiacheng Pang, Maksim Siniukov, Mohammad Soleymani

Abstract: Emotion understanding is essential for building socially intelligent agents. Although recent multimodal large language models have shown strong performance on this task, two key challenges remain - spurious associations between emotions and irrelevant audiovisual cues, and hallucinations of audiovisual cues driven by text priors in the language model backbone. To quantify and understand these issues, we introduce EmoReAlM, a benchmark designed to evaluate MLLMs for cue-emotion associations, hallucinations and modality agreement. We then propose AVEm-DPO, a preference optimization technique that aligns model responses with both audiovisual inputs and emotion-centric queries. Specifically, we construct preferences over responses exhibiting spurious associations or hallucinations, and audiovisual input pairs guided by textual prompts. We also include a regularization term that penalizes reliance on text priors, thereby mitigating modality-specific cue hallucinations. Experimental results on DFEW, RAVDESS and EMER demonstrate that our method significantly improves the performance of the reference baseline models with 6-19% of relative performance gains in zero-shot settings. By providing both a rigorous benchmark and a robust optimization framework, this work enables principled evaluation and improvement of MLLMs for emotion understanding and social AI. Code, models and benchmark will be released at https://avere-iclr.github.io.

URLs: https://avere-iclr.github.io.

new TACIT: Transformation-Aware Capturing of Implicit Thought

Authors: Daniel Nobrega

Abstract: We present TACIT (Transformation-Aware Capturing of Implicit Thought), a diffusion-based transformer for interpretable visual reasoning. Unlike language-based reasoning systems, TACIT operates entirely in pixel space using rectified flow, enabling direct visualization of the reasoning process at each inference step. We demonstrate the approach on maze-solving, where the model learns to transform images of unsolved mazes into solutions. Key results on 1 million synthetic maze pairs include: - 192x reduction in training loss over 100 epochs - 22.7x improvement in L2 distance to ground truth - Only 10 Euler steps required (vs. 100-1000 for typical diffusion models) Quantitative analysis reveals a striking phase transition phenomenon: the solution remains invisible for 68% of the transformation (zero recall), then emerges abruptly at t=0.70 within just 2% of the process. Most remarkably, 100% of samples exhibit simultaneous emergence across all spatial regions, ruling out sequential path construction and providing evidence for holistic rather than algorithmic reasoning. This "eureka moment" pattern -- long incubation followed by sudden crystallization -- parallels insight phenomena in human cognition. The pixel-space design with noise-free flow matching provides a foundation for understanding how neural networks develop implicit reasoning strategies that operate below and before language.

new Video-based Music Generation

Authors: Serkan Sulun

Abstract: As the volume of video content on the internet grows rapidly, finding a suitable soundtrack remains a significant challenge. This thesis presents EMSYNC (EMotion and SYNChronization), a fast, free, and automatic solution that generates music tailored to the input video, enabling content creators to enhance their productions without composing or licensing music. Our model creates music that is emotionally and rhythmically synchronized with the video. A core component of EMSYNC is a novel video emotion classifier. By leveraging pretrained deep neural networks for feature extraction and keeping them frozen while training only fusion layers, we reduce computational complexity while improving accuracy. We show the generalization abilities of our method by obtaining state-of-the-art results on Ekman-6 and MovieNet. Another key contribution is a large-scale, emotion-labeled MIDI dataset for affective music generation. We then present an emotion-based MIDI generator, the first to condition on continuous emotional values rather than discrete categories, enabling nuanced music generation aligned with complex emotional content. To enhance temporal synchronization, we introduce a novel temporal boundary conditioning method, called "boundary offset encodings," aligning musical chords with scene changes. Combining video emotion classification, emotion-based music generation, and temporal boundary conditioning, EMSYNC emerges as a fully automatic video-based music generator. User studies show that it consistently outperforms existing methods in terms of music richness, emotional alignment, temporal synchronization, and overall preference, setting a new state-of-the-art in video-based music generation.

new Hybrid Dual-Path Linear Transformations for Efficient Transformer Architectures

Authors: Vladimer Khasia

Abstract: Standard Transformer architectures rely heavily on dense linear transformations, treating feature projection as a monolithic, full-rank operation. We argue that this formulation is inefficient and lacks the structural inductive bias necessary for distinguishing between local feature preservation and global context integration. To address this, we introduce the Hybrid Dual-Path Linear (HDPL) operator, which decomposes the affine transformation into two topologically distinct pathways: a sparse block-diagonal component for high-rank local processing, and a low-rank Variational Autoencoder (VAE) bottleneck for global context regularization. By "surgically" replacing specific projections (Query, Key, Value, Gate, Up) with HDPL operators while retaining standard dense layers for aggregation (Output, Down), we achieve a superior balance of efficiency and representational power. Experiments on the FineWeb-Edu dataset demonstrate that the HDPL architecture outperforms a standard Llama-style baseline, reducing validation loss while simultaneously reducing parameter count by 6.8%. Beyond immediate performance gains, we discuss how the explicit materialization of a probabilistic latent space within the Transformer backbone serves as a vital architectural affordance, offering new pathways for inference-time or hypernetwork induced control, continual adaptation, interpretability, and cross-model or cross-modal synchronization. The code is available at https://github.com/VladimerKhasia/HDPL

URLs: https://github.com/VladimerKhasia/HDPL

new The Optimal Token Baseline: Variance Reduction for Long-Horizon LLM-RL

Authors: Yingru Li, Jiawei Xu, Ziniu Li, Jiacai Liu, Wei Liu, Yuxuan Tong, Longtao Zheng, Zhenghai Xue, Yaxiang Zhang, Tianle Cai, Ge Zhang, Qian Liu, Baoxiang Wang

Abstract: Reinforcement Learning (RL) for Large Language Models (LLMs) often suffers from training collapse in long-horizon tasks due to exploding gradient variance. To mitigate this, a baseline is commonly introduced for advantage computation; however, traditional value models remain difficult to optimize, and standard group-based baselines overlook sequence heterogeneity. Although classic optimal baseline theory can achieve global variance reduction, it neglects token heterogeneity and requires prohibitive gradient-based computation. In this work, we derive the Optimal Token Baseline (OTB) from first principles, proving that gradient updates should be weighted inversely to their cumulative gradient norm. To ensure efficiency, we propose the Logit-Gradient Proxy that approximates the gradient norm using only forward-pass probabilities. Our method achieves training stability and matches the performance of large group sizes ($N=32$) with only $N=4$, reducing token consumption by over 65% across single-turn and tool-integrated reasoning tasks.

new Attention-Driven Framework for Non-Rigid Medical Image Registration

Authors: Muhammad Zafar Iqbal, Ghazanfar Farooq Siddiqui, Anwar Ul Haq, Imran Razzak

Abstract: Deformable medical image registration is a fundamental task in medical image analysis with applications in disease diagnosis, treatment planning, and image-guided interventions. Despite significant advances in deep learning based registration methods, accurately aligning images with large deformations while preserving anatomical plausibility remains a challenging task. In this paper, we propose a novel Attention-Driven Framework for Non-Rigid Medical Image Registration (AD-RegNet) that employs attention mechanisms to guide the registration process. Our approach combines a 3D UNet backbone with bidirectional cross-attention, which establishes correspondences between moving and fixed images at multiple scales. We introduce a regional adaptive attention mechanism that focuses on anatomically relevant structures, along with a multi-resolution deformation field synthesis approach for accurate alignment. The method is evaluated on two distinct datasets: DIRLab for thoracic 4D CT scans and IXI for brain MRI scans, demonstrating its versatility across different anatomical structures and imaging modalities. Experimental results demonstrate that our approach achieves performance competitive with state-of-the-art methods on the IXI and DIRLab datasets. The proposed method maintains a favorable balance between registration accuracy and computational efficiency, making it suitable for clinical applications. A comprehensive evaluation using normalized cross-correlation (NCC), mean squared error (MSE), structural similarity (SSIM), Jacobian determinant, and target registration error (TRE) indicates that attention-guided registration improves alignment accuracy while ensuring anatomically plausible deformations.

new Finding Connections: Membership Inference Attacks for the Multi-Table Synthetic Data Setting

Authors: Joshua Ward, Chi-Hua Wang, Guang Cheng

Abstract: Synthetic tabular data has gained attention for enabling privacy-preserving data sharing. While substantial progress has been made in single-table synthetic generation where data are modeled at the row or item level, most real-world data exists in relational databases where a user's information spans items across multiple interconnected tables. Recent advances in synthetic relational data generation have emerged to address this complexity, yet release of these data introduce unique privacy challenges as information can be leaked not only from individual items but also through the relationships that comprise a complete user entity. To address this, we propose a novel Membership Inference Attack (MIA) setting to audit the empirical user-level privacy of synthetic relational data and show that single-table MIAs that audit at an item level underestimate user-level privacy leakage. We then propose Multi-Table Membership Inference Attack (MT-MIA), a novel adversarial attack under a No-Box threat model that targets learned representations of user entities via Heterogeneous Graph Neural Networks. By incorporating all connected items for a user, MT-MIA better targets user-level vulnerabilities induced by inter-tabular relationships than existing attacks. We evaluate MT-MIA on a range of real-world multi-table datasets and demonstrate that this vulnerability exists in state-of-the-art relational synthetic data generators, employing MT-MIA to additionally study where this leakage occurs.

new Landscaper: Understanding Loss Landscapes Through Multi-Dimensional Topological Analysis

Authors: Jiaqing Chen, Nicholas Hadler, Tiankai Xie, Rostyslav Hnatyshyn, Caleb Geniesse, Yaoqing Yang, Michael W. Mahoney, Talita Perciano, John F. Hartwig, Ross Maciejewski, Gunther H. Weber

Abstract: Loss landscapes are a powerful tool for understanding neural network optimization and generalization, yet traditional low-dimensional analyses often miss complex topological features. We present Landscaper, an open-source Python package for arbitrary-dimensional loss landscape analysis. Landscaper combines Hessian-based subspace construction with topological data analysis to reveal geometric structures such as basin hierarchy and connectivity. A key component is the Saddle-Minimum Average Distance (SMAD) for quantifying landscape smoothness. We demonstrate Landscaper's effectiveness across various architectures and tasks, including those involving pre-trained language models, showing that SMAD captures training transitions, such as landscape simplification, that conventional metrics miss. We also illustrate Landscaper's performance in challenging chemical property prediction tasks, where SMAD can serve as a metric for out-of-distribution generalization, offering valuable insights for model diagnostics and architecture design in data-scarce scientific machine learning scenarios.

new Featured Reproducing Kernel Banach Spaces for Learning and Neural Networks

Authors: Isabel de la Higuera, Francisco Herrera, M. Victoria Velasco

Abstract: Reproducing kernel Hilbert spaces provide a foundational framework for kernel-based learning, where regularization and interpolation problems admit finite-dimensional solutions through classical representer theorems. Many modern learning models, however -- including fixed-architecture neural networks equipped with non-quadratic norms -- naturally give rise to non-Hilbertian geometries that fall outside this setting. In Banach spaces, continuity of point-evaluation functionals alone is insufficient to guarantee feature representations or kernel-based learning formulations. In this work, we develop a functional-analytic framework for learning in Banach spaces based on the notion of featured reproducing kernel Banach spaces. We identify the precise structural conditions under which feature maps, kernel constructions, and representer-type results can be recovered beyond the Hilbertian regime. Within this framework, supervised learning is formulated as a minimal-norm interpolation or regularization problem, and existence results together with conditional representer theorems are established. We further extend the theory to vector-valued featured reproducing kernel Banach spaces and show that fixed-architecture neural networks naturally induce special instances of such spaces. This provides a unified function-space perspective on kernel methods and neural networks and clarifies when kernel-based learning principles extend beyond reproducing kernel Hilbert spaces.

new BONSAI: Bayesian Optimization with Natural Simplicity and Interpretability

Authors: Samuel Daulton, David Eriksson, Maximilian Balandat, Eytan Bakshy

Abstract: Bayesian optimization (BO) is a popular technique for sample-efficient optimization of black-box functions. In many applications, the parameters being tuned come with a carefully engineered default configuration, and practitioners only want to deviate from this default when necessary. Standard BO, however, does not aim to minimize deviation from the default and, in practice, often pushes weakly relevant parameters to the boundary of the search space. This makes it difficult to distinguish between important and spurious changes and increases the burden of vetting recommendations when the optimization objective omits relevant operational considerations. We introduce BONSAI, a default-aware BO policy that prunes low-impact deviations from a default configuration while explicitly controlling the loss in acquisition value. BONSAI is compatible with a variety of acquisition functions, including expected improvement and upper confidence bound (GP-UCB). We theoretically bound the regret incurred by BONSAI, showing that, under certain conditions, it enjoys the same no-regret property as vanilla GP-UCB. Across many real-world applications, we empirically find that BONSAI substantially reduces the number of non-default parameters in recommended configurations while maintaining competitive optimization performance, with little effect on wall time.

new Convex Dominance in Deep Learning I: A Scaling Law of Loss and Learning Rate

Authors: Zhiqi Bu, Shiyun Xu, Jialin Mao

Abstract: Deep learning has non-convex loss landscape and its optimization dynamics is hard to analyze or control. Nevertheless, the dynamics can be empirically convex-like across various tasks, models, optimizers, hyperparameters, etc. In this work, we examine the applicability of convexity and Lipschitz continuity in deep learning, in order to precisely control the loss dynamics via the learning rate schedules. We illustrate that deep learning quickly becomes weakly convex after a short period of training, and the loss is predicable by an upper bound on the last iterate, which further informs the scaling of optimal learning rate. Through the lens of convexity, we build scaling laws of learning rates and losses that extrapolate as much as 80X across training horizons and 70X across model sizes.

new On Randomness in Agentic Evals

Authors: Bjarni Haukur Bjarnason, Andr\'e Silva, Martin Monperrus

Abstract: Agentic systems are evaluated on benchmarks where agents interact with environments to solve tasks. Most papers report a pass@1 score computed from a single run per task, assuming this gives a reliable performance estimate. We test this assumption by collecting 60,000 agentic trajectories on SWE-Bench-Verified, spanning three models and two scaffolds. We find substantial variance: single-run pass@1 estimates vary by 2.2 to 6.0 percentage points depending on which run is selected, with standard deviations exceeding 1.5 percentage points even at temperature 0. This variance has critical implications: reported improvements of 2--3 percentage points may reflect evaluation noise rather than genuine algorithmic progress. Through token-level analysis, we show that trajectories diverge early, often within the first few percent of tokens, and that these small differences cascade into different solution strategies. To enable reliable evaluation of agentic systems, we recommend three concrete practices: (1) estimate pass@1 from multiple independent runs per task, especially when measuring small improvements, (2) use statistical power analysis to determine the number of runs needed to detect expected effect sizes, and (3) consider metrics like pass@k (optimistic bound) and pass^k (pessimistic bound) with k>1 to better characterize the full performance envelope. While these practices increase evaluation cost, they are essential for distinguishing genuine scientific progress from statistical noise.

new Beyond Pooling: Matching for Robust Generalization under Data Heterogeneity

Authors: Ayush Roy, Rudrasis Chakraborty, Lav Varshney, Vishnu Suresh Lokhande

Abstract: Pooling heterogeneous datasets across domains is a common strategy in representation learning, but naive pooling can amplify distributional asymmetries and yield biased estimators, especially in settings where zero-shot generalization is required. We propose a matching framework that selects samples relative to an adaptive centroid and iteratively refines the representation distribution. The double robustness and the propensity score matching for the inclusion of data domains make matching more robust than naive pooling and uniform subsampling by filtering out the confounding domains (the main cause of heterogeneity). Theoretical and empirical analyses show that, unlike naive pooling or uniform subsampling, matching achieves better results under asymmetric meta-distributions, which are also extended to non-Gaussian and multimodal real-world settings. Most importantly, we show that these improvements translate to zero-shot medical anomaly detection, one of the extreme forms of data heterogeneity and asymmetry. The code is available on https://github.com/AyushRoy2001/Beyond-Pooling.

URLs: https://github.com/AyushRoy2001/Beyond-Pooling.

new Mimetic Initialization of MLPs

Authors: Asher Trockman, J. Zico Kolter

Abstract: Mimetic initialization uses pretrained models as case studies of good initialization, using observations of structures in trained weights to inspire new, simple initialization techniques. So far, it has been applied only to spatial mixing layers, such convolutional, self-attention, and state space layers. In this work, we present the first attempt to apply the method to channel mixing layers, namely multilayer perceptrons (MLPs). Our extremely simple technique for MLPs -- to give the first layer a nonzero mean -- speeds up training on small-scale vision tasks like CIFAR-10 and ImageNet-1k. Though its effect is much smaller than spatial mixing initializations, it can be used in conjunction with them for an additional positive effect.

new Learning Nonlinear Systems In-Context: From Synthetic Data to Real-World Motor Control

Authors: Tong Jian, Tianyu Dai, Tao Yu

Abstract: LLMs have shown strong in-context learning (ICL) abilities, but have not yet been extended to signal processing systems. Inspired by their design, we have proposed for the first time ICL using transformer models applicable to motor feedforward control, a critical task where classical PI and physics-based methods struggle with nonlinearities and complex load conditions. We propose a transformer based model architecture that separates signal representation from system behavior, enabling both few-shot finetuning and one-shot ICL. Pretrained on a large corpus of synthetic linear and nonlinear systems, the model learns to generalize to unseen system dynamics of real-world motors only with a handful of examples. In experiments, our approach generalizes across multiple motor load configurations, transforms untuned examples into accurate feedforward predictions, and outperforms PI controllers and physics-based feedforward baselines. These results demonstrate that ICL can bridge synthetic pretraining and real-world adaptability, opening new directions for data efficient control of physical systems.

new Latent Target Score Matching, with an application to Simulation-Based Inference

Authors: Joohwan Ko, Tomas Geffner

Abstract: Denoising score matching (DSM) for training diffusion models may suffer from high variance at low noise levels. Target Score Matching (TSM) mitigates this when clean data scores are available, providing a low-variance objective. In many applications clean scores are inaccessible due to the presence of latent variables, leaving only joint signals exposed. We propose Latent Target Score Matching (LTSM), an extension of TSM to leverage joint scores for low-variance supervision of the marginal score. While LTSM is effective at low noise levels, a mixture with DSM ensures robustness across noise scales. Across simulation-based inference tasks, LTSM consistently improves variance, score accuracy, and sample quality.

new Systematic Performance Assessment of Deep Material Networks for Multiscale Material Modeling

Authors: Xiaolong He, Haoyan Wei, Wei Hu, Henan Mao, C. T. Wu

Abstract: Deep Material Networks (DMNs) are structure-preserving, mechanistic machine learning models that embed micromechanical principles into their architectures, enabling strong extrapolation capabilities and significant potential to accelerate multiscale modeling of complex microstructures. A key advantage of these models is that they can be trained exclusively on linear elastic data and then generalized to nonlinear inelastic regimes during online prediction. Despite their growing adoption, systematic evaluations of their performance across the full offline-online pipeline remain limited. This work presents a comprehensive comparative assessment of DMNs with respect to prediction accuracy, computational efficiency, and training robustness. We investigate the effects of offline training choices, including initialization, batch size, training data size, and activation regularization on online generalization performance and uncertainty. The results demonstrate that both prediction error and variance decrease with increasing training data size, while initialization and batch size can significantly influence model performance. Moreover, activation regularization is shown to play a critical role in controlling network complexity and therefore generalization performance. Compared with the original DMN, the rotation-free Interaction-based Material Network (IMN) formulation achieves a 3.4x - 4.7x speed-up in offline training, while maintaining comparable online prediction accuracy and computational efficiency. These findings clarify key trade-offs between model expressivity and efficiency in structure-preserving material networks and provide practical guidance for their deployment in multiscale material modeling.

new Risk-Sensitive Exponential Actor Critic

Authors: Alonso Granados, Jason Pacheco

Abstract: Model-free deep reinforcement learning (RL) algorithms have achieved tremendous success on a range of challenging tasks. However, safety concerns remain when these methods are deployed on real-world applications, necessitating risk-aware agents. A common utility for learning such risk-aware agents is the entropic risk measure, but current policy gradient methods optimizing this measure must perform high-variance and numerically unstable updates. As a result, existing risk-sensitive model-free approaches are limited to simple tasks and tabular settings. In this paper, we provide a comprehensive theoretical justification for policy gradient methods on the entropic risk measure, including on- and off-policy gradient theorems for the stochastic and deterministic policy settings. Motivated by theory, we propose risk-sensitive exponential actor-critic (rsEAC), an off-policy model-free approach that incorporates novel procedures to avoid the explicit representation of exponential value functions and their gradients, and optimizes its policy w.r.t the entropic risk measure. We show that rsEAC produces more numerically stable updates compared to existing approaches and reliably learns risk-sensitive policies in challenging risky variants of continuous tasks in MuJoCo.

new Exactly Computing do-Shapley Values

Authors: R. Teal Witter, \'Alvaro Parafita, Tomas Garriga, Maximilian Muschalik, Fabian Fumagalli, Axel Brando, Lucas Rosenblatt

Abstract: Structural Causal Models (SCM) are a powerful framework for describing complicated dynamics across the natural sciences. A particularly elegant way of interpreting SCMs is do-Shapley, a game-theoretic method of quantifying the average effect of $d$ variables across exponentially many interventions. Like Shapley values, computing do-Shapley values generally requires evaluating exponentially many terms. The foundation of our work is a reformulation of do-Shapley values in terms of the irreducible sets of the underlying SCM. Leveraging this insight, we can exactly compute do-Shapley values in time linear in the number of irreducible sets $r$, which itself can range from $d$ to $2^d$ depending on the graph structure of the SCM. Since $r$ is unknown a priori, we complement the exact algorithm with an estimator that, like general Shapley value estimators, can be run with any query budget. As the query budget approaches $r$, our estimators can produce more accurate estimates than prior methods by several orders of magnitude, and, when the budget reaches $r$, return the Shapley values up to machine precision. Beyond computational speed, we also reduce the identification burden: we prove that non-parametric identifiability of do-Shapley values requires only the identification of interventional effects for the $d$ singleton coalitions, rather than all classes.

new Online Learning for Uninformed Markov Games: Empirical Nash-Value Regret and Non-Stationarity Adaptation

Authors: Junyan Liu, Haipeng Luo, Zihan Zhang, Lillian J. Ratliff

Abstract: We study online learning in two-player uninformed Markov games, where the opponent's actions and policies are unobserved. In this setting, Tian et al. (2021) show that achieving no-external-regret is impossible without incurring an exponential dependence on the episode length $H$. They then turn to the weaker notion of Nash-value regret and propose a V-learning algorithm with regret $O(K^{2/3})$ after $K$ episodes. However, their algorithm and guarantee do not adapt to the difficulty of the problem: even in the case where the opponent follows a fixed policy and thus $O(\sqrt{K})$ external regret is well-known to be achievable, their result is still the worse rate $O(K^{2/3})$ on a weaker metric. In this work, we fully address both limitations. First, we introduce empirical Nash-value regret, a new regret notion that is strictly stronger than Nash-value regret and naturally reduces to external regret when the opponent follows a fixed policy. Moreover, under this new metric, we propose a parameter-free algorithm that achieves an $O(\min \{\sqrt{K} + (CK)^{1/3},\sqrt{LK}\})$ regret bound, where $C$ quantifies the variance of the opponent's policies and $L$ denotes the number of policy switches (both at most $O(K)$). Therefore, our results not only recover the two extremes -- $O(\sqrt{K})$ external regret when the opponent is fixed and $O(K^{2/3})$ Nash-value regret in the worst case -- but also smoothly interpolate between these extremes by automatically adapting to the opponent's non-stationarity. We achieve so by first providing a new analysis of the epoch-based V-learning algorithm by Mao et al. (2022), establishing an $O(\eta C + \sqrt{K/\eta})$ regret bound, where $\eta$ is the epoch incremental factor. Next, we show how to adaptively restart this algorithm with an appropriate $\eta$ in response to the potential non-stationarity of the opponent, eventually achieving our final results.

new DSL: Understanding and Improving Softmax Recommender Systems with Competition-Aware Scaling

Authors: Bucher Sahyouni, Matthew Vowels, Liqun Chen, Simon Hadfield

Abstract: Softmax Loss (SL) is being increasingly adopted for recommender systems (RS) as it has demonstrated better performance, robustness and fairness. Yet in implicit-feedback, a single global temperature and equal treatment of uniformly sampled negatives can lead to brittle training, because sampled sets may contain varying degrees of relevant or informative competitors. The optimal loss sharpness for a user-item pair with a particular set of negatives, can be suboptimal or destabilising for another with different negatives. We introduce Dual-scale Softmax Loss (DSL), which infers effective sharpness from the sampled competition itself. DSL adds two complementary branches to the log-sum-exp backbone. Firstly it reweights negatives within each training instance using hardness and item--item similarity, secondly it adapts a per-example temperature from the competition intensity over a constructed competitor slate. Together, these components preserve the geometry of SL while reshaping the competition distribution across negatives and across examples. Over several representative benchmarks and backbones, DSL yields substantial gains over strong baselines, with improvements over SL exceeding $10%$ in several settings and averaging $6.22%$ across datasets, metrics, and backbones. Under out-of-distribution (OOD) popularity shift, the gains are larger, with an average of $9.31%$ improvement over SL. We further provide a theoretical, distributionally robust optimisation (DRO) analysis, which demonstrates how DSL reshapes the robust payoff and the KL deviation for ambiguous instances. This helps explain the empirically observed improvements in accuracy and robustness.

new Adaptive Retrieval helps Reasoning in LLMs -- but mostly if it's not used

Authors: Srijan Shakya, Anamaria-Roberta Hartl, Sepp Hochreiter, Korbinian P\"oppel

Abstract: Large Language Models (LLMs) often falter in complex reasoning tasks due to their static, parametric knowledge, leading to hallucinations and poor performance in specialized domains like mathematics. This work explores a fundamental principle for enhancing generative models: treating retrieval as a form of dynamic in-context learning. We test an adaptive retrieval-augmented architecture where an LLM agent actively decides when to query an external knowledge base during its reasoning process. We compare this adaptive strategy against a standard Chain-of-Thought (CoT) baseline and a static retrieval approach on the GSM8K and MATH-500 benchmarks. Although our experiments show that static retrieval is inferior to CoT, the adaptive retrieval shows interesting behavior: While traces including retrieved results show slightly worse performance compared to CoT, traces that do not include retrieval actually perform better compared to CoT. This suggests that: (a) retrieval only rarely helps reasoning (we show a few counterexamples, e.g. using useful theorems) and (b) actively not using retrieval is indicative of good model performance. Furthermore, we find that the model scales its retrieval frequency with the difficulty of the problem, reinforcing that the decision to retrieve is a crucial metacognitive signal. The agent's ability to self-assess its knowledge and selectively engage with external information represents a key principle for building more robust and reliable generative models.

new Probing Neural TSP Representations for Prescriptive Decision Support

Authors: Reuben Narad, L\'eonard Boussioux, Michael Wagner

Abstract: The field of neural combinatorial optimization (NCO) trains neural policies to solve NP-hard problems such as the traveling salesperson problem (TSP). We ask whether, beyond producing good tours, a trained TSP solver learns internal representations that transfer to other optimization-relevant objectives, in the spirit of transfer learning from other domains. We train several attention-based TSP policies, collect their internal activations, and train probes on node/edge embeddings for two NP-hard prescriptive downstream tasks inspired by real-world logistics scenarios: node-removal sensitivity (identifying the most impactful node to remove) and edge-forbid sensitivity (identifying the most critical edge to retain). On a Euclidean TSP100-trained model, probes for both tasks are competitive with existing baselines. Ensembling probe signals with geometric features outperforms the strongest baselines: 65\% top-1 accuracy (vs. 58\% baseline) for the best-node-removal task, and 73\% top-1 accuracy (vs. 67\% baseline) for the worst-edge identification task. To our knowledge, we are the first to study neural TSP solvers as transferable encoders for prescriptive what-if decision-support objectives beyond tour construction. Finally, we show that transfer accuracy increases with solver quality across training and model scale, suggesting that training stronger NCO solvers also yields more useful encoders for downstream objectives. Our code is available at: github.com/ReubenNarad/tsp_prescriptive_probe

new Collaborative and Efficient Fine-tuning: Leveraging Task Similarity

Authors: Gagik Magakyan, Amirhossein Reisizadeh, Chanwoo Park, Pablo A. Parrilo, Asuman Ozdaglar

Abstract: Adaptability has been regarded as a central feature in the foundation models, enabling them to effectively acclimate to unseen downstream tasks. Parameter-efficient fine-tuning methods such as celebrated LoRA facilitate efficient adaptation of large foundation models using labeled, high-quality and generally scarce task data. To mitigate data scarcity in fine-tuning of foundation models, we propose to leverage task similarity across multiple downstream users. Intuitively, users with similar tasks must be able to assist each other in boosting the effective fine-tuning data size. We propose Collaborative Low-Rank Adaptation, or CoLoRA, which exploits task similarity to collaboratively and efficiently fine-tune personalized foundation models. The main idea in CoLoRA is to train one shared adapter capturing underlying task similarities across all tasks, and personalized adapters tailored to user-specific tasks. We theoretically study CoLoRA on heterogeneous linear regression and provide provable guarantees for ground truth recovery. We also conduct several natural language experiments with varying task similarity, which further demonstrate that when trained together with similar tasks, individual performances are significantly boosted.

new The Median is Easier than it Looks: Approximation with a Constant-Depth, Linear-Width ReLU Network

Authors: Abhigyan Dutta, Itay Safran, Paul Valiant

Abstract: We study the approximation of the median of $d$ inputs using ReLU neural networks. We present depth-width tradeoffs under several settings, culminating in a constant-depth, linear-width construction that achieves exponentially small approximation error with respect to the uniform distribution over the unit hypercube. By further establishing a general reduction from the maximum to the median, our results break a barrier suggested by prior work on the maximum function, which indicated that linear width should require depth growing at least as $\log\log d$ to achieve comparable accuracy. Our construction relies on a multi-stage procedure that iteratively eliminates non-central elements while preserving a candidate set around the median. We overcome obstacles that do not arise for the maximum to yield approximation results that are strictly stronger than those previously known for the maximum itself.

new SpecAttn: Co-Designing Sparse Attention with Self-Speculative Decoding

Authors: Yikang Yue, Yuqi Xue, Jian Huang

Abstract: Long-context large language model (LLM) inference has become the norm for today's AI applications. However, it is severely bottlenecked by the increasing memory demands of its KV cache. Previous works have shown that self-speculative decoding with sparse attention, where tokens are drafted using a subset of the KV cache and verified in parallel with full KV cache, speeds up inference in a lossless way. However, this approach relies on standalone KV selection algorithms to select the KV entries used for drafting and overlooks that the criticality of each KV entry is inherently computed during verification. In this paper, we propose SpecAttn, a self-speculative decoding method with verification-guided sparse attention. SpecAttn identifies critical KV entries as a byproduct of verification and only loads these entries when drafting subsequent tokens. This not only improves draft token acceptance rate but also incurs low KV selection overhead, thereby improving decoding throughput. SpecAttn achieves 2.81$\times$ higher throughput over vanilla auto-regressive decoding and 1.29$\times$ improvement over state-of-the-art sparsity-based self-speculative decoding methods.

new Fault-Tolerant Evaluation for Sample-Efficient Model Performance Estimators

Authors: Zihan Zhu, Yanqiu Wu, Qiongkai Xu

Abstract: In the era of Model-as-a-Service, organizations increasingly rely on third-party AI models for rapid deployment. However, the dynamic nature of emerging AI applications, the continual introduction of new datasets, and the growing number of models claiming superior performance make efficient and reliable validation of model services increasingly challenging. This motivates the development of sample-efficient performance estimators, which aim to estimate model performance by strategically selecting instances for labeling, thereby reducing annotation cost. Yet existing evaluation approaches often fail in low-variance settings: RMSE conflates bias and variance, masking persistent bias when variance is small, while p-value based tests become hypersensitive, rejecting adequate estimators for negligible deviations. To address this, we propose a fault-tolerant evaluation framework that integrates bias and variance considerations within an adjustable tolerance level ${\varepsilon}$, enabling the evaluation of performance estimators within practically acceptable error margins. We theoretically show that proper calibration of ${\varepsilon}$ ensures reliable evaluation across different variance regimes, and we further propose an algorithm that automatically optimizes and selects ${\varepsilon}$. Experiments on real-world datasets demonstrate that our framework provides comprehensive and actionable insights into estimator behavior.

new Cerebellar-Inspired Residual Control for Fault Recovery: From Inference-Time Adaptation to Structural Consolidation

Authors: Nethmi Jayasinghe, Diana Gontero, Spencer T. Brown, Vinod K. Sangwan, Mark C. Hersam, Amit Ranjan Trivedi

Abstract: Robotic policies deployed in real-world environments often encounter post-training faults, where retraining, exploration, or system identification are impractical. We introduce an inference-time, cerebellar-inspired residual control framework that augments a frozen reinforcement learning policy with online corrective actions, enabling fault recovery without modifying base policy parameters. The framework instantiates core cerebellar principles, including high-dimensional pattern separation via fixed feature expansion, parallel microzone-style residual pathways, and local error-driven plasticity with excitatory and inhibitory eligibility traces operating at distinct time scales. These mechanisms enable fast, localized correction under post-training disturbances while avoiding destabilizing global policy updates. A conservative, performance-driven meta-adaptation regulates residual authority and plasticity, preserving nominal behavior and suppressing unnecessary intervention. Experiments on MuJoCo benchmarks under actuator, dynamic, and environmental perturbations show improvements of up to $+66\%$ on \texttt{HalfCheetah-v5} and $+53\%$ on \texttt{Humanoid-v5} under moderate faults, with graceful degradation under severe shifts and complementary robustness from consolidating persistent residual corrections into policy parameters.

new ArcMark: Multi-bit LLM Watermark via Optimal Transport

Authors: Atefeh Gilani, Carol Xuan Long, Sajani Vithana, Oliver Kosut, Lalitha Sankar, Flavio P. Calmon

Abstract: Watermarking is an important tool for promoting the responsible use of language models (LMs). Existing watermarks insert a signal into generated tokens that either flags LM-generated text (zero-bit watermarking) or encodes more complex messages (multi-bit watermarking). Though a number of recent multi-bit watermarks insert several bits into text without perturbing average next-token predictions, they largely extend design principles from the zero-bit setting, such as encoding a single bit per token. Notably, the information-theoretic capacity of multi-bit watermarking -- the maximum number of bits per token that can be inserted and detected without changing average next-token predictions -- has remained unknown. We address this gap by deriving the first capacity characterization of multi-bit watermarks. Our results inform the design of ArcMark: a new watermark construction based on coding-theoretic principles that, under certain assumptions, achieves the capacity of the multi-bit watermark channel. In practice, ArcMark outperforms competing multi-bit watermarks in terms of bit rate per token and detection accuracy. Our work demonstrates that LM watermarking is fundamentally a channel coding problem, paving the way for principled coding-theoretic approaches to watermark design.

new Graph homophily booster: Reimagining the role of discrete features in heterophilic graph learning

Authors: Ruizhong Qiu, Ting-Wei Li, Gaotang Li, Hanghang Tong

Abstract: Graph neural networks (GNNs) have emerged as a powerful tool for modeling graph-structured data. However, existing GNNs often struggle with heterophilic graphs, where connected nodes tend to have dissimilar features or labels. While numerous methods have been proposed to address this challenge, they primarily focus on architectural designs without directly targeting the root cause of the heterophily problem. These approaches still perform even worse than the simplest MLPs on challenging heterophilic datasets. For instance, our experiments show that 21 latest GNNs still fall behind the MLP on the Actor dataset. This critical challenge calls for an innovative approach to addressing graph heterophily beyond architectural designs. To bridge this gap, we propose and study a new and unexplored paradigm: directly increasing the graph homophily via a carefully designed graph transformation. In this work, we present a simple yet effective framework called GRAPHITE to address graph heterophily. To the best of our knowledge, this work is the first method that explicitly transforms the graph to directly improve the graph homophily. Stemmed from the exact definition of homophily, our proposed GRAPHITE creates feature nodes to facilitate homophilic message passing between nodes that share similar features. Furthermore, we both theoretically and empirically show that our proposed GRAPHITE significantly increases the homophily of originally heterophilic graphs, with only a slight increase in the graph size. Extensive experiments on challenging datasets demonstrate that our proposed GRAPHITE significantly outperforms state-of-the-art methods on heterophilic graphs while achieving comparable accuracy with state-of-the-art methods on homophilic graphs.

new Robust Ultra-High-Dimensional Variable Selection With Correlated Structure Using Group Testing

Authors: Wanru Guo, Juan Xie, Binbin Wang, Weicong Chen, Xiaoyi Lu, Vipin Chaudhary, Curtis Tatsuoka

Abstract: Background: High-dimensional genomic data exhibit strong group correlation structures that challenge conventional feature selection methods, which often assume feature independence or rely on pre-defined pathways and are sensitive to outliers and model misspecification. Methods: We propose the Dorfman screening framework, a multi-stage procedure that forms data-driven variable groups via hierarchical clustering, performs group and within-group hypothesis testing, and refines selection using elastic net or adaptive elastic net. Robust variants incorporate OGK-based covariance estimation, rank-based correlation, and Huber-weighted regression to handle contaminated and non-normal data. Results: In simulations, Dorfman-Sparse-Adaptive-EN performed best under normal conditions, while Robust-OGK-Dorfman-Adaptive-EN showed clear advantages under data contamination, outperforming classical Dorfman and competing methods. Applied to NSCLC gene expression data for trametinib response, robust Dorfman methods achieved the lowest prediction errors and enriched recovery of clinically relevant genes. Conclusions: The Dorfman framework provides an efficient and robust approach to genomic feature selection. Robust-OGK-Dorfman-Adaptive-EN offers strong performance under both ideal and contaminated conditions and scales to ultra-high-dimensional settings, making it well suited for modern genomic biomarker discovery.

new tLoRA: Efficient Multi-LoRA Training with Elastic Shared Super-Models

Authors: Kevin Li, Dibyadeep Saha, Avni Kanodia, Fan Lai

Abstract: As Low-Rank Adaptation (LoRA) becomes the standard approach for efficiently fine-tuning large language models (LLMs), shared clusters increasingly execute many concurrent LoRA training jobs over the same frozen backbone. While recent advances enable batching (co-locating) multiple adapters during serving, efficient training-time co-location of heterogeneous LoRA adapters presents unique challenges. Jobs often differ in adapter rank, batch size, and resource allocation, and na\"ive batching can introduce synchronization stalls, communication overheads, and per-job slowdowns that are worse than executing independently. We introduce tLoRA, a framework that enables efficient batch training of multiple LoRA jobs. tLoRA fuses adapters that share the same base model into an elastic shared super-model, exploiting existing distributed training frameworks to derive parallelism plans that share resources effectively. At the kernel level, tLoRA employs a fused LoRA kernel that adaptively reconstructs low-rank computation tiles and schedules rank-aware nano-batches to maximize overlap between computation and communication across adapters. At the scheduling layer, tLoRA incorporates an online, residual-capacity-aware scheduler that adaptively groups jobs to maximize collective throughput. Evaluations using real-world cluster traces demonstrate that tLoRA improves training throughput by 1.2--1.8x, job training completion time by 2.3--5.4x, and GPU utilization by 37%.

new XShare: Collaborative in-Batch Expert Sharing for Faster MoE Inference

Authors: Daniil Vankov, Nikita Ivkin, Kyle Ulrich, Xiang Song, Ashish Khetan, George Karypis

Abstract: Mixture-of-Experts (MoE) architectures are increasingly used to efficiently scale large language models. However, in production inference, request batching and speculative decoding significantly amplify expert activation, eroding these efficiency benefits. We address this issue by modeling batch-aware expert selection as a modular optimization problem and designing efficient greedy algorithms for different deployment settings. The proposed method, namely XShare, requires no retraining and dynamically adapts to each batch by maximizing the total gating score of selected experts. It reduces expert activation by up to 30% under standard batching, cuts peak GPU load by up to 3x in expert-parallel deployments, and achieves up to 14% throughput gains in speculative decoding via hierarchical, correlation-aware expert selection even if requests in a batch drawn from heterogeneous datasets.

new Hybrid Feedback-Guided Optimal Learning for Wireless Interactive Panoramic Scene Delivery

Authors: Xiaoyi Wu, Juaren Steiger, Bin Li, R. Srikant

Abstract: Immersive applications such as virtual and augmented reality impose stringent requirements on frame rate, latency, and synchronization between physical and virtual environments. To meet these requirements, an edge server must render panoramic content, predict user head motion, and transmit a portion of the scene that is large enough to cover the user viewport while remaining within wireless bandwidth constraints. Each portion produces two feedback signals: prediction feedback, indicating whether the selected portion covers the actual viewport, and transmission feedback, indicating whether the corresponding packets are successfully delivered. Prior work models this problem as a multi-armed bandit with two-level bandit feedback, but fails to exploit the fact that prediction feedback can be retrospectively computed for all candidate portions once the user head pose is observed. As a result, prediction feedback constitutes full-information feedback rather than bandit feedback. Motivated by this observation, we introduce a two-level hybrid feedback model that combines full-information and bandit feedback, and formulate the portion selection problem as an online learning task under this setting. We derive an instance-dependent regret lower bound for the hybrid feedback model and propose AdaPort, a hybrid learning algorithm that leverages both feedback types to improve learning efficiency. We further establish an instance-dependent regret upper bound that matches the lower bound asymptotically, and demonstrate through real-world trace driven simulations that AdaPort consistently outperforms state-of-the-art baseline methods.

new Laplacian-LoRA: Delaying Oversmoothing in Deep GCNs via Spectral Low-Rank Adaptation

Authors: Sai Vamsi Alisetti

Abstract: Oversmoothing is a fundamental limitation of deep graph convolutional networks (GCNs), causing node representations to collapse as depth increases. While many prior approaches mitigate this effect through architectural modifications or residual mechanisms, the underlying spectral cause of oversmoothing is often left implicit. We propose Laplacian-LoRA, a simple and interpretable low-rank spectral adaptation of standard GCNs. Rather than redesigning message passing, Laplacian-LoRA introduces a learnable, spectrally anchored correction to the fixed Laplacian propagation operator, selectively weakening contraction while preserving stability and the low-pass inductive bias. Across multiple benchmark datasets and depths, Laplacian-LoRA consistently delays the onset of oversmoothing, extending the effective depth of GCNs by up to a factor of two. Embedding variance diagnostics confirm that these gains arise from delayed representational collapse, while learned spectral analysis demonstrates that the correction is smooth, bounded, and well behaved. Our results show that oversmoothing is a depth-dependent spectral phenomenon that can be systematically delayed through modest, low-rank adaptation of the graph propagation operator.

new VertCoHiRF: Decentralized Vertical Clustering Beyond k-means

Authors: Bruno Belucci, Karim Lounici, Vladimir R. Kostic, Katia Meziani

Abstract: Vertical Federated Learning (VFL) enables collaborative analysis across parties holding complementary feature views of the same samples, yet existing approaches are largely restricted to distributed variants of $k$-means, requiring centralized coordination or the exchange of feature-dependent numerical statistics, and exhibiting limited robustness under heterogeneous views or adversarial behavior. We introduce VertCoHiRF, a fully decentralized framework for vertical federated clustering based on structural consensus across heterogeneous views, allowing each agent to apply a base clustering method adapted to its local feature space in a peer-to-peer manner. Rather than exchanging feature-dependent statistics or relying on noise injection for privacy, agents cluster their local views independently and reconcile their proposals through identifier-level consensus. Consensus is achieved via decentralized ordinal ranking to select representative medoids, progressively inducing a shared hierarchical clustering across agents. Communication is limited to sample identifiers, cluster labels, and ordinal rankings, providing privacy by design while supporting overlapping feature partitions and heterogeneous local clustering methods, and yielding an interpretable shared Cluster Fusion Hierarchy (CFH) that captures cross-view agreement at multiple resolutions.We analyze communication complexity and robustness, and experiments demonstrate competitive clustering performance in vertical federated settings.

new Fair Decisions from Calibrated Scores: Achieving Optimal Classification While Satisfying Sufficiency

Authors: Etam Benger, Katrina Ligett

Abstract: Binary classification based on predicted probabilities (scores) is a fundamental task in supervised machine learning. While thresholding scores is Bayes-optimal in the unconstrained setting, using a single threshold generally violates statistical group fairness constraints. Under independence (statistical parity) and separation (equalized odds), such thresholding suffices when the scores already satisfy the corresponding criterion. However, this does not extend to sufficiency: even perfectly group-calibrated scores -- including true class probabilities -- violate predictive parity after thresholding. In this work, we present an exact solution for optimal binary (randomized) classification under sufficiency, assuming finite sets of group-calibrated scores. We provide a geometric characterization of the feasible pairs of positive predictive value (PPV) and false omission rate (FOR) achievable by such classifiers, and use it to derive a simple post-processing algorithm that attains the optimal classifier using only group-calibrated scores and group membership. Finally, since sufficiency and separation are generally incompatible, we identify the classifier that minimizes deviation from separation subject to sufficiency, and show that it can also be obtained by our algorithm, often achieving performance comparable to the optimum.

new Incorruptible Neural Networks: Training Models that can Generalize to Large Internal Perturbations

Authors: Philip Jacobson, Ben Feinberg, Suhas Kumar, Sapan Agarwal, T. Patrick Xiao, Christopher Bennett

Abstract: Flat regions of the neural network loss landscape have long been hypothesized to correlate with better generalization properties. A closely related but distinct problem is training models that are robust to internal perturbations to their weights, which may be an important need for future low-power hardware platforms. In this paper, we explore the usage of two methods, sharpness-aware minimization (SAM) and random-weight perturbation (RWP), to find minima robust to a variety of random corruptions to weights. We consider the problem from two angles: generalization (how do we reduce the noise-robust generalization gap) and optimization (how do we maximize performance from optimizers when subject to strong perturbations). First, we establish, both theoretically and empirically, that an over-regularized RWP training objective is optimal for noise-robust generalization. For small-magnitude noise, we find that SAM's adversarial objective further improves performance over any RWP configuration, but performs poorly for large-magnitude noise. We link the cause of this to a vanishing-gradient effect, caused by unevenness in the loss landscape, affecting both SAM and RWP. Lastly, we demonstrate that dynamically adjusting the perturbation strength to match the evolution of the loss landscape improves optimizing for these perturbed objectives.

new Revisiting Robustness for LLM Safety Alignment via Selective Geometry Control

Authors: Yonghui Yang, Wenjian Tao, Jilong Liu, Xingyu Zhu, Junfeng Fang, Weibiao Huang, Le Wu, Richang Hong, Tat-Sent Chua

Abstract: Safety alignment of large language models remains brittle under domain shift and noisy preference supervision. Most existing robust alignment methods focus on uncertainty in alignment data, while overlooking optimization-induced fragility in preference-based objectives. In this work, we revisit robustness for LLM safety alignment from an optimization geometry perspective, and argue that robustness failures cannot be addressed by data-centric methods alone. We propose ShaPO, a geometry-aware preference optimization framework that enforces worst-case alignment objectives via selective geometry control over alignment-critical parameter subspace. By avoiding uniform geometry constraints, ShaPO mitigates the over-regularization that can harm robustness under distribution shift. We instantiate ShaPO at two levels: token-level ShaPO stabilizes likelihood-based surrogate optimization, while reward-level ShaPO enforces reward-consistent optimization under noisy supervision. Across diverse safety benchmarks and noisy preference settings, ShaPO consistently improves safety robustness over popular preference optimization methods. Moreover, ShaPO composes cleanly with data-robust objectives, yielding additional gains and empirically supporting the proposed optimization-geometry perspective.

new Scalable Dexterous Robot Learning with AR-based Remote Human-Robot Interactions

Authors: Yicheng Yang, Ruijiao Li, Lifeng Wang, Shuai Zheng, Shunzheng Ma, Keyu Zhang, Tuoyu Sun, Chenyun Dai, Jie Ding, Zhuo Zou

Abstract: This paper focuses on the scalable robot learning for manipulation in the dexterous robot arm-hand systems, where the remote human-robot interactions via augmented reality (AR) are established to collect the expert demonstration data for improving efficiency. In such a system, we present a unified framework to address the general manipulation task problem. Specifically, the proposed method consists of two phases: i) In the first phase for pretraining, the policy is created in a behavior cloning (BC) manner, through leveraging the learning data from our AR-based remote human-robot interaction system; ii) In the second phase, a contrastive learning empowered reinforcement learning (RL) method is developed to obtain more efficient and robust policy than the BC, and thus a projection head is designed to accelerate the learning progress. An event-driven augmented reward is adopted for enhancing the safety. To validate the proposed method, both the physics simulations via PyBullet and real-world experiments are carried out. The results demonstrate that compared to the classic proximal policy optimization and soft actor-critic policies, our method not only significantly speeds up the inference, but also achieves much better performance in terms of the success rate for fulfilling the manipulation tasks. By conducting the ablation study, it is confirmed that the proposed RL with contrastive learning overcomes policy collapse. Supplementary demonstrations are available at https://cyberyyc.github.io/.

URLs: https://cyberyyc.github.io/.

new Controllable Value Alignment in Large Language Models through Neuron-Level Editing

Authors: Yonghui Yang, Junwei Li, Jilong Liu, Yicheng He, Fengbin Zhu, Weibiao Huang, Le Wu, Richang Hong, Tat-Seng Chua

Abstract: Aligning large language models (LLMs) with human values has become increasingly important as their influence on human behavior and decision-making expands. However, existing steering-based alignment methods suffer from limited controllability: steering a target value often unintentionally activates other, non-target values. To characterize this limitation, we introduce value leakage, a diagnostic notion that captures the unintended activation of non-target values during value steering, along with a normalized leakage metric grounded in Schwartz's value theory. In light of this analysis, we propose NeVA, a neuron-level editing framework for controllable value alignment in LLMs. NeVA identifies sparse, value-relevant neurons and performs inference-time activation editing, enabling fine-grained control without parameter updates or retraining. Experiments show that NeVA achieves stronger target value alignment while incurring smaller performance degradation on general capability. Moreover, NeVA significantly reduces the average leakage, with residual effects largely confined to semantically related value classes. Overall, NeVA offers a more controllable and interpretable mechanism for value alignment.

new UTOPIA: Unlearnable Tabular Data via Decoupled Shortcut Embedding

Authors: Jiaming He, Fuming Luo, Hongwei Li, Wenbo Jiang, Wenshu Fan, Zhenbo Shi, Xudong Jiang, Yi Yu

Abstract: Unlearnable examples (UE) have emerged as a practical mechanism to prevent unauthorized model training on private vision data, while extending this protection to tabular data is nontrivial. Tabular data in finance and healthcare is highly sensitive, yet existing UE methods transfer poorly because tabular features mix numerical and categorical constraints and exhibit saliency sparsity, with learning dominated by a few dimensions. Under a Spectral Dominance condition, we show certified unlearnability is feasible when the poison spectrum overwhelms the clean semantic spectrum. Guided by this, we propose Unlearnable Tabular Data via DecOuPled Shortcut EmbeddIng (UTOPIA), which exploits feature redundancy to decouple optimization into two channels: high saliency features for semantic obfuscation and low saliency redundant features for embedding a hyper correlated shortcut, yielding constraint-aware dominant shortcuts while preserving tabular validity. Extensive experiments across tabular datasets and models show UTOPIA drives unauthorized training toward near random performance, outperforming strong UE baselines and transferring well across architectures.

new FEM-Informed Hypergraph Neural Networks for Efficient Elastoplasticity

Authors: Jianchuan Yang, Xi Chen, Jidong Zhao

Abstract: Graph neural networks (GNNs) naturally align with sparse operators and unstructured discretizations, making them a promising paradigm for physics-informed machine learning in computational mechanics. Motivated by discrete physics losses and Hierarchical Deep Learning Neural Network (HiDeNN) constructions, we embed finite-element (FEM) computations at nodes and Gauss points directly into message-passing layers and propose a numerically consistent FEM-Informed Hypergraph Neural Networks (FHGNN). Similar to conventional physics-informed neural networks (PINNs), training is purely physics-driven and requires no labeled data: the input is a node element hypergraph whose edges encode mesh connectivity. Guided by empirical results and condition-number analysis, we adopt an efficient variational loss. Validated on 3D benchmarks, including cyclic loading with isotropic/kinematic hardening, the proposed method delivers substantially improved accuracy and efficiency over recent, competitive PINN variants. By leveraging GPU-parallel tensor operations and the discrete representation, it scales effectively to large elastoplastic problems and can be competitive with, or faster than, multi-core FEM implementations at comparable accuracy. This work establishes a foundation for scalable, physics-embedded learning in nonlinear solid mechanics.

new Privately Learning Decision Lists and a Differentially Private Winnow

Authors: Mark Bun, William Fang

Abstract: We give new differentially private algorithms for the classic problems of learning decision lists and large-margin halfspaces in the PAC and online models. In the PAC model, we give a computationally efficient algorithm for learning decision lists with minimal sample overhead over the best non-private algorithms. In the online model, we give a private analog of the influential Winnow algorithm for learning halfspaces with mistake bound polylogarithmic in the dimension and inverse polynomial in the margin. As an application, we describe how to privately learn decision lists in the online model, qualitatively matching state-of-the art non-private guarantees.

new Dichotomy of Feature Learning and Unlearning: Fast-Slow Analysis on Neural Networks with Stochastic Gradient Descent

Authors: Shota Imai, Sota Nishiyama, Masaaki Imaizumi

Abstract: The dynamics of gradient-based training in neural networks often exhibit nontrivial structures; hence, understanding them remains a central challenge in theoretical machine learning. In particular, a concept of feature unlearning, in which a neural network progressively loses previously learned features over long training, has gained attention. In this study, we consider the infinite-width limit of a two-layer neural network updated with a large-batch stochastic gradient, then derive differential equations with different time scales, revealing the mechanism and conditions for feature unlearning to occur. Specifically, we utilize the fast-slow dynamics: while an alignment of first-layer weights develops rapidly, the second-layer weights develop slowly. The direction of a flow on a critical manifold, determined by the slow dynamics, decides whether feature unlearning occurs. We give numerical validation of the result, and derive theoretical grounding and scaling laws of the feature unlearning. Our results yield the following insights: (i) the strength of the primary nonlinear term in data induces the feature unlearning, and (ii) an initial scale of the second-layer weights mitigates the feature unlearning. Technically, our analysis utilizes Tensor Programs and the singular perturbation theory.

new Scout Before You Attend: Sketch-and-Walk Sparse Attention for Efficient LLM Inference

Authors: Hoang Anh Duy Le (Rice University), Sahil Joshi (Rice University), Zeyu Yang (Rice University), Zhaozhuo Xu (Stevens Institute of Technology), Anshumali Shrivastava (Rice University)

Abstract: Self-attention dominates the computational and memory cost of long-context LLM inference across both prefill and decode phases. To address this challenge, we introduce Sketch&Walk Attention, a training-free sparse attention method that determines sparsity with lightweight sketches and deterministic walk. Sketch&Walk applies Hadamard sketching to get inexpensive approximations of attention scores, then aggregates these estimates across layers via a walk mechanism that captures attention influence beyond direct interactions between tokens. The accumulated walk scores are used to select top-k attention blocks, enabling dynamic sparsity with a single training-free algorithm that applies uniformly to both the prefill and decode phases, together with custom sparse attention kernels. Across a wide range of models and tasks, Sketch&Walk maintains near-lossless accuracy at 20% attention density and can slightly outperform dense attention in some settings, while achieving up to 6x inference speedup.

new BitLogic: Training Framework for Gradient-Based FPGA-Native Neural Networks

Authors: Simon B\"uhrer, Andreas Plesner, Aczel Till, Roger Wattenhofer

Abstract: The energy and latency costs of deep neural network inference are increasingly driven by deployment rather than training, motivating hardware-specialized alternatives to arithmetic-heavy models. Field-Programmable Gate Arrays (FPGAs) provide an attractive substrate for such specialization, yet existing FPGA-based neural approaches are fragmented and difficult to compare. We present BitLogic, a fully gradient-based, end-to-end trainable framework for FPGA-native neural networks built around Lookup Table (LUT) computation. BitLogic replaces multiply-accumulate operations with differentiable LUT nodes that map directly to FPGA primitives, enabling native binary computation, sparse connectivity, and efficient hardware realization. The framework offers a modular functional API supporting diverse architectures, along with learned encoders, hardware-aware heads, and multiple boundary-consistent LUT relaxations. An automated Register Transfer Level (RTL) export pipeline translates trained PyTorch models into synthesizable HDL, ensuring equivalence between software and hardware inference. Experiments across standard vision benchmarks and heterogeneous hardware platforms demonstrate competitive accuracy and substantial gains in FPGA efficiency, including 72.3% test accuracy on CIFAR-10 achieved with fewer than 0.3M logic gates, while attaining sub-20 ns single-sample inference using only LUT resources.

new Nonparametric Bayesian Optimization for General Rewards

Authors: Zishi Zhang, Tao Ren, Yijie Peng

Abstract: This work focuses on Bayesian optimization (BO) under reward model uncertainty. We propose the first BO algorithm that achieves no-regret guarantee in a general reward setting, requiring only Lipschitz continuity of the objective function and accommodating a broad class of measurement noise. The core of our approach is a novel surrogate model, termed as infinite Gaussian process ($\infty$-GP). It is a Bayesian nonparametric model that places a prior on the space of reward distributions, enabling it to represent a substantially broader class of reward models than classical Gaussian process (GP). The $\infty$-GP is used in combination with Thompson Sampling (TS) to enable effective exploration and exploitation. Correspondingly, we develop a new TS regret analysis framework for general rewards, which relates the regret to the total variation distance between the surrogate model and the true reward distribution. Furthermore, with a truncated Gibbs sampling procedure, our method is computationally scalable, incurring minimal additional memory and computational complexities compared to classical GP. Empirical results demonstrate state-of-the-art performance, particularly in settings with non-stationary, heavy-tailed, or other ill-conditioned rewards.

new Learning Molecular Chirality via Chiral Determinant Kernels

Authors: Runhan Shi, Zhicheng Zhang, Letian Chen, Gufeng Yu, Yang Yang

Abstract: Chirality is a fundamental molecular property that governs stereospecific behavior in chemistry and biology. Capturing chirality in machine learning models remains challenging due to the geometric complexity of stereochemical relationships and the limitations of traditional molecular representations that often lack explicit stereochemical encoding. Existing approaches to chiral molecular representation primarily focus on central chirality, relying on handcrafted stereochemical tags or limited 3D encodings, and thus fail to generalize to more complex forms such as axial chirality. In this work, we introduce ChiDeK (Chiral Determinant Kernels), a framework that systematically integrates stereogenic information into molecular representation learning. We propose the chiral determinant kernel to encode the SE(3)-invariant chirality matrix and employ cross-attention to integrate stereochemical information from local chiral centers into the global molecular representation. This design enables explicit modeling of chiral-related features within a unified architecture, capable of jointly encoding central and axial chirality. To support the evaluation of axial chirality, we construct a new benchmark for electronic circular dichroism (ECD) and optical rotation (OR) prediction. Across four tasks, including R/S configuration classification, enantiomer ranking, ECD spectrum prediction, and OR prediction, ChiDeK achieves substantial improvements over state-of-the-art baselines, most notably yielding over 7% higher accuracy on axially chiral tasks on average.

new Achieving Optimal Static and Dynamic Regret Simultaneously in Bandits with Deterministic Losses

Authors: Jian Qian, Chen-Yu Wei

Abstract: In adversarial multi-armed bandits, two performance measures are commonly used: static regret, which compares the learner to the best fixed arm, and dynamic regret, which compares it to the best sequence of arms. While optimal algorithms are known for each measure individually, there is no known algorithm achieving optimal bounds for both simultaneously. Marinov and Zimmert [2021] first showed that such simultaneous optimality is impossible against an adaptive adversary. Our work takes a first step to demonstrate its possibility against an oblivious adversary when losses are deterministic. First, we extend the impossibility result of Marinov and Zimmert [2021] to the case of deterministic losses. Then, we present an algorithm achieving optimal static and dynamic regret simultaneously against an oblivious adversary. Together, they reveal a fundamental separation between adaptive and oblivious adversaries when multiple regret benchmarks are considered simultaneously. It also provides new insight into the long open problem of simultaneously achieving optimal regret against switching benchmarks of different numbers of switches. Our algorithm uses negative static regret to compensate for the exploration overhead incurred when controlling dynamic regret, and leverages Blackwell approachability to jointly control both regrets. This yields a new model selection procedure for bandits that may be of independent interest.

new Sign-Based Optimizers Are Effective Under Heavy-Tailed Noise

Authors: Dingzhi Yu, Hongyi Tao, Yuanyu Wan, Luo Luo, Lijun Zhang

Abstract: While adaptive gradient methods are the workhorse of modern machine learning, sign-based optimization algorithms such as Lion and Muon have recently demonstrated superior empirical performance over AdamW in training large language models (LLM). However, a theoretical understanding of why sign-based updates outperform variance-adapted methods remains elusive. In this paper, we aim to bridge the gap between theory and practice through the lens of heavy-tailed gradient noise, a phenomenon frequently observed in language modeling tasks. Theoretically, we introduce a novel generalized heavy-tailed noise condition that captures the behavior of LLMs more accurately than standard finite variance assumptions. Under this noise model, we establish sharp convergence rates of SignSGD and Lion for generalized smooth function classes, matching or surpassing previous best-known bounds. Furthermore, we extend our analysis to Muon and Muonlight, providing what is, to our knowledge, the first rigorous analysis of matrix optimization under heavy-tailed stochasticity. These results offer a strong theoretical justification for the empirical superiority of sign-based optimizers, showcasing that they are naturally suited to handle the noisy gradients associated with heavy tails. Empirically, LLM pretraining experiments validate our theoretical insights and confirm that our proposed noise models are well-aligned with practice.

new Brep2Shape: Boundary and Shape Representation Alignment via Self-Supervised Transformers

Authors: Yuanxu Sun, Yuezhou Ma, Haixu Wu, Guanyang Zeng, Muye Chen, Jianmin Wang, Mingsheng Long

Abstract: Boundary representation (B-rep) is the industry standard for computer-aided design (CAD). While deep learning shows promise in processing B-rep models, existing methods suffer from a representation gap: continuous approaches offer analytical precision but are visually abstract, whereas discrete methods provide intuitive clarity at the expense of geometric precision. To bridge this gap, we introduce Brep2Shape, a novel self-supervised pre-training method designed to align abstract boundary representations with intuitive shape representations. Our method employs a geometry-aware task where the model learns to predict dense spatial points from parametric B\'ezier control points, enabling the network to better understand physical manifolds derived from abstract coefficients. To enhance this alignment, we propose a Dual Transformer backbone with parallel streams that independently encode surface and curve tokens to capture their distinct geometric properties. Moreover, the topology attention is integrated to model the interdependencies between surfaces and curves, thereby maintaining topological consistency. Experimental results demonstrate that Brep2Shape offers significant scalability, achieving state-of-the-art accuracy and faster convergence across various downstream tasks.

new Active Learning Using Aggregated Acquisition Functions: Accuracy and Sustainability Analysis

Authors: C\'edric Jung, Shirin Salehi, Anke Schmeink

Abstract: Active learning (AL) is a machine learning (ML) approach that strategically selects the most informative samples for annotation during training, aiming to minimize annotation costs. This strategy not only reduces labeling expenses but also results in energy savings during neural network training, thereby enhancing both data and energy efficiency. In this paper, we implement and evaluate various state-of-the-art acquisition functions, analyzing their accuracy and computational costs, while discussing the advantages and disadvantages of each method. Our findings reveal that representativity-based acquisition functions effectively explore the dataset but do not prioritize boundary decisions, whereas uncertainty-based acquisition functions focus on refining boundary decisions already identified by the neural network. This trade-off is known as the exploration-exploitation dilemma. To address this dilemma, we introduce six aggregation structures: series, parallel, hybrid, adaptive feedback, random exploration, and annealing exploration. Our aggregated acquisition functions alleviate common AL pathologies such as batch mode inefficiency and the cold start problem. Additionally, we focus on balancing accuracy and energy consumption, contributing to the development of more sustainable, energy-aware artificial intelligence (AI). We evaluate our proposed structures on various models and datasets. Our results demonstrate the potential of these structures to reduce computational costs while maintaining or even improving accuracy. Innovative aggregation approaches, such as alternating between acquisition functions such as BALD and BADGE, have shown robust results. Sequentially running functions like $K$-Centers followed by BALD has achieved the same performance goals with up to 12\% fewer samples, while reducing the acquisition cost by almost half.

new Proximal Action Replacement for Behavior Cloning Actor-Critic in Offline Reinforcement Learning

Authors: Jinzong Dong, Wei Huang, Jianshu Zhang, Zhuo Chen, Xinzhe Yuan, Qinying Gu, Zhaohui Jiang, Nanyang Ye

Abstract: Offline reinforcement learning (RL) optimizes policies from a previously collected static dataset and is an important branch of RL. A popular and promising approach is to regularize actor-critic methods with behavior cloning (BC), which yields realistic policies and mitigates bias from out-of-distribution actions, but can impose an often-overlooked performance ceiling: when dataset actions are suboptimal, indiscriminate imitation structurally prevents the actor from fully exploiting high-value regions suggested by the critic, especially in later training when imitation is already dominant. We formally analyzed this limitation by investigating convergence properties of BC-regularized actor-critic optimization and verified it on a controlled continuous bandit task. To break this ceiling, we propose proximal action replacement (PAR), a plug-and-play training sample replacer that progressively replaces low-value actions with high-value actions generated by a stable actor, broadening the action exploration space while reducing the impact of low-value data. PAR is compatible with multiple BC regularization paradigms. Extensive experiments across offline RL benchmarks show that PAR consistently improves performance and approaches state-of-the-art when combined with the basic TD3+BC.

new Data-Aware and Scalable Sensitivity Analysis for Decision Tree Ensembles

Authors: Namrita Varshney, Ashutosh Gupta, Arhaan Ahmad, Tanay V. Tayal, S. Akshay

Abstract: Decision tree ensembles are widely used in critical domains, making robustness and sensitivity analysis essential to their trustworthiness. We study the feature sensitivity problem, which asks whether an ensemble is sensitive to a specified subset of features -- such as protected attributes -- whose manipulation can alter model predictions. Existing approaches often yield examples of sensitivity that lie far from the training distribution, limiting their interpretability and practical value. We propose a data-aware sensitivity framework that constrains the sensitive examples to remain close to the dataset, thereby producing realistic and interpretable evidence of model weaknesses. To this end, we develop novel techniques for data-aware search using a combination of mixed-integer linear programming (MILP) and satisfiability modulo theories (SMT) encodings. Our contributions are fourfold. First, we strengthen the NP-hardness result for sensitivity verification, showing it holds even for trees of depth 1. Second, we develop MILP-optimizations that significantly speed up sensitivity verification for single ensembles and for the first time can also handle multiclass tree ensembles. Third, we introduce a data-aware framework generating realistic examples close to the training distribution. Finally, we conduct an extensive experimental evaluation on large tree ensembles, demonstrating scalability to ensembles with up to 800 trees of depth 8, achieving substantial improvements over the state of the art. This framework provides a practical foundation for analyzing the reliability and fairness of tree-based models in high-stakes applications.

new On the Importance of a Multi-Scale Calibration for Quantization

Authors: Seungwoo Son, Ingyu Seong, Junhan Kim, Hyemi Jang, Yongkweon Jeon

Abstract: Post-training quantization (PTQ) is a cornerstone for efficiently deploying large language models (LLMs), where a small calibration set critically affects quantization performance. However, conventional practices rely on random sequences of fixed length, overlooking the variable-length nature of LLM inputs. Input length directly influences the activation distribution and, consequently, the weight importance captured by the Hessian, which in turn affects quantization outcomes. As a result, Hessian estimates derived from fixed-length calibration may fail to represent the true importance of weights across diverse input scenarios. We propose MaCa (Matryoshka Calibration), a simple yet effective method for length-aware Hessian construction. MaCa (i) incorporates multi-scale sequence length information into Hessian estimation and (ii) regularizes each sequence as an independent sample, yielding a more stable and fruitful Hessian for accurate quantization. Experiments on state-of-the-art LLMs (e.g., Qwen3, Gemma3, LLaMA3) demonstrate that MaCa consistently improves accuracy under low bit quantization, offering a lightweight enhancement compatible with existing PTQ frameworks. To the best of our knowledge, this is the first work to systematically highlight the role of multi-scale calibration in LLM quantization.

new Bandit Allocational Instability

Authors: Yilun Chen, Jiaqi Lu

Abstract: When multi-armed bandit (MAB) algorithms allocate pulls among competing arms, the resulting allocation can exhibit huge variation. This is particularly harmful in modern applications such as learning-enhanced platform operations and post-bandit statistical inference. Thus motivated, we introduce a new performance metric of MAB algorithms termed allocation variability, which is the largest (over arms) standard deviation of an arm's number of pulls. We establish a fundamental trade-off between allocation variability and regret, the canonical performance metric of reward maximization. In particular, for any algorithm, the worst-case regret $R_T$ and worst-case allocation variability $S_T$ must satisfy $R_T \cdot S_T=\Omega(T^{\frac{3}{2}})$ as $T\rightarrow\infty$, as long as $R_T=o(T)$. This indicates that any minimax regret-optimal algorithm must incur worst-case allocation variability $\Theta(T)$, the largest possible scale; while any algorithm with sublinear worst-case regret must necessarily incur ${S}_T= \omega(\sqrt{T})$. We further show that this lower bound is essentially tight, and that any point on the Pareto frontier $R_T \cdot S_T=\tilde{\Theta}(T^{3/2})$ can be achieved by a simple tunable algorithm UCB-f, a generalization of the classic UCB1. Finally, we discuss implications for platform operations and for statistical inference, when bandit algorithms are used. As a byproduct of our result, we resolve an open question of Praharaj and Khamaru (2025).

new Bipartite Graph Attention-based Clustering for Large-scale scRNA-seq Data

Authors: Zhuomin Liang, Liang Bai, Xian Yang

Abstract: scRNA-seq clustering is a critical task for analyzing single-cell RNA sequencing (scRNA-seq) data, as it groups cells with similar gene expression profiles. Transformers, as powerful foundational models, have been applied to scRNA-seq clustering. Their self-attention mechanism automatically assigns higher attention weights to cells within the same cluster, enhancing the distinction between clusters. Existing methods for scRNA-seq clustering, such as graph transformer-based models, treat each cell as a token in a sequence. Their computational and space complexities are $\mathcal{O}(n^2)$ with respect to the number of cells, limiting their applicability to large-scale scRNA-seq datasets.To address this challenge, we propose a Bipartite Graph Transformer-based clustering model (BGFormer) for scRNA-seq data. We introduce a set of learnable anchor tokens as shared reference points to represent the entire dataset. A bipartite graph attention mechanism is introduced to learn the similarity between cells and anchor tokens, bringing cells of the same class closer together in the embedding space. BGFormer achieves linear computational complexity with respect to the number of cells, making it scalable to large datasets. Experimental results on multiple large-scale scRNA-seq datasets demonstrate the effectiveness and scalability of BGFormer.

new AI-Driven Predictive Modelling for Groundwater Salinization in Israel

Authors: Laxmi Pandey, Ariel Meroz, Ben Cheng, Ankita Manekar, Abhijit Mukherjee, Meirav Cohen, Adway Mitra

Abstract: Increasing salinity and contamination of groundwater is a serious issue in many parts of the world, causing degradation of water resources. The aim of this work is to form a comprehensive understanding of groundwater salinization underlying causal factors and identify important meteorological, geological and anthropogenic drivers of salinity. We have integrated different datasets of potential covariates, to create a robust framework for machine learning based predictive models including Random Forest (RF), XGBoost, Neural network, Long Short-Term Memory (LSTM), convolution neural network (CNN) and linear regression (LR), of groundwater salinity. Additionally, Recursive Feature Elimination (RFE) followed by Global sensitivity analysis (GSA) and Explainable AI (XAI) based SHapley Additive exPlanations (SHAP) were used to estimate the importance scores and find insights into the drivers of salinization. We also did causality analysis via Double machine learning using various predictive models. From these analyses, key meteorological (Precipitation, Temperature), geological (Distance from river, Distance to saline body, TWI, Shoreline distance), and anthropogenic (Area of agriculture field, Treated Wastewater) covariates are identified to be influential drivers of groundwater salinity across Israel. XAI analysis also identified Treated Wastewater (TWW) as an essential anthropogenic driver of salinity, its significance being context-dependent but critical in vulnerable hydro-climatic environment. Our approach provides deeper insight into global salinization mechanisms at country scale, reducing AI model uncertainty and highlighting the need for tailored strategies to address salinity.

new ODELoRA: Training Low-Rank Adaptation by Solving Ordinary Differential Equations

Authors: Yihang Gao, Vincent Y. F. Tan

Abstract: Low-rank adaptation (LoRA) has emerged as a widely adopted parameter-efficient fine-tuning method in deep transfer learning, due to its reduced number of trainable parameters and lower memory requirements enabled by Burer-Monteiro factorization on adaptation matrices. However, classical LoRA training methods treat the low-rank factor matrices individually and optimize them using standard gradient-based algorithms. Such decoupled optimization schemes are theoretically and empirically suboptimal, as they fail to fully exploit the intrinsic structure of the LoRA parameterization. In this work, we propose a novel continuous-time optimization dynamic for LoRA factor matrices in the form of an ordinary differential equation (ODE) that emulates the gradient flow of full fine-tuning on the balanced manifold. We term this approach ODELoRA. To faithfully track the trajectories of ODELoRA, we adopt well-established and theoretically grounded time-discretization schemes, including Euler and Runge--Kutta methods. Our framework provides a unified ODE-based perspective for understanding and designing LoRA training algorithms. We establish linear convergence of the proposed method under strongly convex objectives for certain discretization schemes under mild conditions, and further extend our analysis to the matrix sensing setting. Moreover, we show that ODELoRA achieves stable feature learning, a property that is crucial for training deep neural networks at different scales of problem dimensionality. Empirical results on matrix sensing tasks confirm the derived linear convergence behavior, and experiments on training physics-informed neural networks further demonstrate the superiority of ODELoRA over existing baselines, especially in the training stability.

new Deriving Neural Scaling Laws from the statistics of natural language

Authors: Francesco Cagnetta, Allan Ravent\'os, Surya Ganguli, Matthieu Wyart

Abstract: Despite the fact that experimental neural scaling laws have substantially guided empirical progress in large-scale machine learning, no existing theory can quantitatively predict the exponents of these important laws for any modern LLM trained on any natural language dataset. We provide the first such theory in the case of data-limited scaling laws. We isolate two key statistical properties of language that alone can predict neural scaling exponents: (i) the decay of pairwise token correlations with time separation between token pairs, and (ii) the decay of the next-token conditional entropy with the length of the conditioning context. We further derive a simple formula in terms of these statistics that predicts data-limited neural scaling exponents from first principles without any free parameters or synthetic data models. Our theory exhibits a remarkable match with experimentally measured neural scaling laws obtained from training GPT-2 and LLaMA style models from scratch on two qualitatively different benchmarks, TinyStories and WikiText.

new Hyperparameter Transfer Laws for Non-Recurrent Multi-Path Neural Networks

Authors: Shenxi Wu, Haosong Zhang, Xingjian Ma, Shirui Bian, Yichi Zhang, Xi Chen, Wei Lin

Abstract: Deeper modern architectures are costly to train, making hyperparameter transfer preferable to expensive repeated tuning. Maximal Update Parametrization ($\mu$P) helps explain why many hyperparameters transfer across width. Yet depth scaling is less understood for modern architectures, whose computation graphs contain multiple parallel paths and residual aggregation. To unify various non-recurrent multi-path neural networks such as CNNs, ResNets, and Transformers, we introduce a graph-based notion of effective depth. Under stabilizing initializations and a maximal-update criterion, we show that the optimal learning rate decays with effective depth following a universal -3/2 power law. Here, the maximal-update criterion maximizes the typical one-step representation change at initialization without causing instability, and effective depth is the minimal path length from input to output, counting layers and residual additions. Experiments across diverse architectures confirm the predicted slope and enable reliable zero-shot transfer of learning rates across depths and widths, turning depth scaling into a predictable hyperparameter-transfer problem.

new CoMI-IRL: Contrastive Multi-Intention Inverse Reinforcement Learning

Authors: Antonio Mone, Frans A. Oliehoek, Luciano Cavalcante Siebert

Abstract: Inverse Reinforcement Learning (IRL) seeks to infer reward functions from expert demonstrations. When demonstrations originate from multiple experts with different intentions, the problem is known as Multi-Intention IRL (MI-IRL). Recent deep generative MI-IRL approaches couple behavior clustering and reward learning, but typically require prior knowledge of the number of true behavioral modes $K^*$. This reliance on expert knowledge limits their adaptability to new behaviors, and only enables analysis related to the learned rewards, and not across the behavior modes used to train them. We propose Contrastive Multi-Intention IRL (CoMI-IRL), a transformer-based unsupervised framework that decouples behavior representation and clustering from downstream reward learning. Our experiments show that CoMI-IRL outperforms existing approaches without a priori knowledge of $K^*$ or labels, while allowing for visual interpretation of behavior relationships and adaptation to unseen behavior without full retraining.

new PALMS: Pavlovian Associative Learning Models Simulator

Authors: Martin Fixman, Alessandro Abati, Juli\'an Jim\'enez Nimmo, Sean Lim, Esther Mondrag\'on

Abstract: Simulations are an indispensable step in the cycle of theory development and refinement, helping researchers formulate precise definitions, generate models, and make accurate predictions. This paper introduces the Pavlovian Associative Learning Models Simulator (PALMS), a Python environment to simulate Pavlovian conditioning experiments. In addition to the canonical Rescorla-Wagner model, PALMS incorporates several attentional learning approaches, including Pearce-Kaye-Hall, Mackintosh Extended, Le Pelley's Hybrid, and a novel extension of the Rescorla-Wagner model with a unified variable learning rate that integrates Mackintosh's and Pearce and Hall's opposing conceptualisations. The simulator's graphical interface allows for the input of entire experimental designs in an alphanumeric format, akin to that used by experimental neuroscientists. Moreover, it uniquely enables the simulation of experiments involving hundreds of stimuli, as well as the computation of configural cues and configural-cue compounds across all models, thereby considerably expanding their predictive capabilities. PALMS operates efficiently, providing instant visualisation of results, supporting rapid, precise comparisons of various models' predictions within a single architecture and environment. Furthermore, graphic displays can be easily saved, and simulated data can be exported to spreadsheets. To illustrate the simulator's capabilities and functionalities, we provide a detailed description of the software and examples of use, reproducing published experiments in the associative learning literature. PALMS is licensed under the open-source GNU Lesser General Public License 3.0. The simulator source code and the latest multiplatform release build are accessible as a GitHub repository at https://github.com/cal-r/PALMS-Simulator

URLs: https://github.com/cal-r/PALMS-Simulator

new Pareto-guided Pipeline for Distilling Featherweight AI Agents in Mobile MOBA Games

Authors: Xionghui Yang, Bozhou Chen, Yunlong Lu, Yongyi Wang, Lingfeng Li, Lanxiao Huang, Lin Liu, Wenjun Wang, Meng Meng, Xia Lin, Wenxin Li

Abstract: Recent advances in game AI have demonstrated the feasibility of training agents that surpass top-tier human professionals in complex environments such as Honor of Kings (HoK), a leading mobile multiplayer online battle arena (MOBA) game. However, deploying such powerful agents on mobile devices remains a major challenge. On one hand, the intricate multi-modal state representation and hierarchical action space of HoK demand large, sophisticated policy networks that are inherently difficult to compress into lightweight forms. On the other hand, production deployment requires high-frequency inference under strict energy and latency constraints on mobile platform. To the best of our knowledge, bridging large-scale game AI and practical on-device deployment has not been systematically studied. In this work, we propose a Pareto optimality guided pipeline and design a high-efficiency student architecture search space tailored for mobile execution, enabling systematic exploration of the trade-off between performance and efficiency. Experimental results demonstrate that the distilled model achieves remarkable efficiency, including an $12.4\times$ faster inference speed (under 0.5ms per frame) and a $15.6\times$ improvement in energy efficiency (under 0.5mAh per game), while retaining a 40.32% win rate against the original teacher model.

new MedVerse: Efficient and Reliable Medical Reasoning via DAG-Structured Parallel Execution

Authors: Jianwen Chen, Xinyu Yang, Peng Xia, Arian Azarang, Yueh Z Lee, Gang Li, Hongtu Zhu, Yun Li, Beidi Chen, Huaxiu Yao

Abstract: Large language models (LLMs) have demonstrated strong performance and rapid progress in a wide range of medical reasoning tasks. However, their sequential autoregressive decoding forces inherently parallel clinical reasoning, such as differential diagnosis, into a single linear reasoning path, limiting both efficiency and reliability for complex medical problems. To address this, we propose MedVerse, a reasoning framework for complex medical inference that reformulates medical reasoning as a parallelizable directed acyclic graph (DAG) process based on Petri net theory. The framework adopts a full-stack design across data, model architecture, and system execution. For data creation, we introduce the MedVerse Curator, an automated pipeline that synthesizes knowledge-grounded medical reasoning paths and transforms them into Petri net-structured representations. At the architectural level, we propose a topology-aware attention mechanism with adaptive position indices that supports parallel reasoning while preserving logical consistency. Systematically, we develop a customized inference engine that supports parallel execution without additional overhead. Empirical evaluations show that MedVerse improves strong general-purpose LLMs by up to 8.9%. Compared to specialized medical LLMs, MedVerse achieves comparable performance while delivering a 1.3x reduction in inference latency and a 1.7x increase in generation throughput, enabled by its parallel decoding capability.

new Compact Conformal Subgraphs

Authors: Sreenivas Gollapudi, Kostas Kollias, Kamesh Munagala, Aravindan Vijayaraghavan

Abstract: Conformal prediction provides rigorous, distribution-free uncertainty guarantees, but often yields prohibitively large prediction sets in structured domains such as routing, planning, or sequential recommendation. We introduce "graph-based conformal compression", a framework for constructing compact subgraphs that preserve statistical validity while reducing structural complexity. We formulate compression as selecting a smallest subgraph capturing a prescribed fraction of the probability mass, and reduce to a weighted version of densest $k$-subgraphs in hypergraphs, in the regime where the subgraph has a large fraction of edges. We design efficient approximation algorithms that achieve constant factor coverage and size trade-offs. Crucially, we prove that our relaxation satisfies a monotonicity property, derived from a connection to parametric minimum cuts, which guarantees the nestedness required for valid conformal guarantees. Our results on the one hand bridge efficient conformal prediction with combinatorial graph compression via monotonicity, to provide rigorous guarantees on both statistical validity, and compression or size. On the other hand, they also highlight an algorithmic regime, distinct from classical densest-$k$-subgraph hardness settings, where the problem can be approximated efficiently. We finally validate our algorithmic approach via simulations for trip planning and navigation, and compare to natural baselines.

new Gaussian Match-and-Copy: A Minimalist Benchmark for Studying Transformer Induction

Authors: Antoine Gonon, Alexandre Cordonnier, Nicolas Boumal

Abstract: Match-and-copy is a core retrieval primitive used at inference time by large language models to retrieve a matching token from the context then copy its successor. Yet, understanding how this behavior emerges on natural data is challenging because retrieval and memorization are entangled. To disentangle the two, we introduce Gaussian Match-and-Copy (GMC), a minimalist benchmark that isolates long-range retrieval through pure second-order correlation signals. Numerical investigations show that this task retains key qualitative aspects of how Transformers develop match-and-copy circuits in practice, and separates architectures by their retrieval capabilities. We also analyze the optimization dynamics in a simplified attention setting. Although many solutions are a priori possible under a regression objective, including ones that do not implement retrieval, we identify an implicit-bias regime in which gradient descent drives the parameters to diverge while their direction aligns with the max-margin separator, yielding hard match selection. We prove this max-margin alignment for GD trajectories that reach vanishing empirical loss under explicit technical conditions.

new Enhancing Time Series Classification with Diversity-Driven Neural Network Ensembles

Authors: Javidan Abdullayev, Maxime Devanne, Cyril Meyer, Ali Ismail-Fawaz, Jonathan Weber, Germain Forestier

Abstract: Ensemble methods have played a crucial role in achieving state-of-the-art (SOTA) performance across various machine learning tasks by leveraging the diversity of features learned by individual models. In Time Series Classification (TSC), ensembles have proven highly effective whether based on neural networks (NNs) or traditional methods like HIVE-COTE. However most existing NN-based ensemble methods for TSC train multiple models with identical architectures and configurations. These ensembles aggregate predictions without explicitly promoting diversity which often leads to redundant feature representations and limits the benefits of ensembling. In this work, we introduce a diversity-driven ensemble learning framework that explicitly encourages feature diversity among neural network ensemble members. Our approach employs a decorrelated learning strategy using a feature orthogonality loss applied directly to the learned feature representations. This ensures that each model in the ensemble captures complementary rather than redundant information. We evaluate our framework on 128 datasets from the UCR archive and show that it achieves SOTA performance with fewer models. This makes our method both efficient and scalable compared to conventional NN-based ensemble approaches.

new Unified Biomolecular Trajectory Generation via Pretrained Variational Bridge

Authors: Ziyang Yu, Wenbing Huang, Yang Liu

Abstract: Molecular Dynamics (MD) simulations provide a fundamental tool for characterizing molecular behavior at full atomic resolution, but their applicability is severely constrained by the computational cost. To address this, a surge of deep generative models has recently emerged to learn dynamics at coarsened timesteps for efficient trajectory generation, yet they either generalize poorly across systems or, due to limited molecular diversity of trajectory data, fail to fully exploit structural information to improve generative fidelity. Here, we present the Pretrained Variational Bridge (PVB) in an encoder-decoder fashion, which maps the initial structure into a noised latent space and transports it toward stage-specific targets through augmented bridge matching. This unifies training on both single-structure and paired trajectory data, enabling consistent use of cross-domain structural knowledge across training stages. Moreover, for protein-ligand complexes, we further introduce a reinforcement learning-based optimization via adjoint matching that speeds progression toward the holo state, which supports efficient post-optimization of docking poses. Experiments on proteins and protein-ligand complexes demonstrate that PVB faithfully reproduces thermodynamic and kinetic observables from MD while delivering stable and efficient generative dynamics.

new Beyond Arrow: From Impossibility to Possibilities in Multi-Criteria Benchmarking

Authors: Polina Gordienko, Christoph Jansen, Julian Rodemann, Georg Schollmeyer

Abstract: Modern benchmarks such as HELM MMLU account for multiple metrics like accuracy, robustness and efficiency. When trying to turn these metrics into a single ranking, natural aggregation procedures can become incoherent or unstable to changes in the model set. We formalize this aggregation as a social choice problem where each metric induces a preference ranking over models on each dataset, and a benchmark operator aggregates these votes across metrics. While prior work has focused on Arrow's impossibility result, we argue that the impossibility often originates from pathological examples and identify sufficient conditions under which these disappear, and meaningful multi-criteria benchmarking becomes possible. In particular, we deal with three restrictions on the combinations of rankings and prove that on single-peaked, group-separable and distance-restricted preferences, the benchmark operator allows for the construction of well-behaved rankings of the involved models. Empirically, we investigate several modern benchmark suites like HELM MMLU and verify which structural conditions are fulfilled on which benchmark problems.

new Astro: Activation-guided Structured Regularization for Outlier-Robust LLM Post-Training Quantization

Authors: Xi Chen, Ming Li, Junxi Li, Changsheng Li, Peisong Wang, Lizhong Ding, Ye Yuan, Guoren Wang

Abstract: Weight-only post-training quantization (PTQ) is crucial for efficient Large Language Model (LLM) deployment but suffers from accuracy degradation caused by weight and activation outliers. Existing mitigation strategies often face critical limitations: they either yield insufficient outlier suppression or incur significant deployment inefficiencies, such as inference latency, heavy preprocessing, or reliance on complex operator fusion. To resolve these limitations, we leverage a key insight: over-parameterized LLMs often converge to Flat Minima, implying a vast equivalent solution space where weights can be adjusted without compromising accuracy. Building on this, we propose Astro, an Activation-guided Structured Regularization framework designed to suppress the negative effects of outliers in a hardware-friendly and efficient manner. Leveraging the activation-guided regularization objective, Astro actively reconstructs intrinsically robust weights, aggressively suppressing weight outliers corresponding to high-magnitude activations without sacrificing model accuracy. Crucially, Astro introduces zero inference latency and is orthogonal to mainstream quantization methods like GPTQ. Extensive experiments show that Astro achieves highly competitive performance; notably, on LLaMA-2-7B, it achieves better performance than complex learning-based rotation methods with almost 1/3 of the quantization time.

new Rational Transductors

Authors: Mehryar Mohri

Abstract: Standard Transformers excel at semantic modeling but struggle with rigid sequential logic and state tracking. Theoretical work establishes that self-attention is limited to $\AC^0$ (under hard attention) or $\TC^0$ (under soft attention), complexity classes that often fail to support robust length generalization on sequential problems without intermediate chain-of-thought. In this work, we introduce \emph{Rational Transductors}, a dual-stream architecture that augments the Transformer with a matrix-valued recurrence derived from Weighted Finite Automata (WFA). By injecting rational state information into the attention mechanism via a \emph{Deep Rational Injection} scheme, our framework strictly generalizes the expressive power of Transformers to capture all Regular Languages, $\NC^1$-complete problems (such as Boolean Formula Evaluation), and fundamental separations like Parity and Modular Counting, while preserving $O(L + \log T)$ parallel time complexity. We ground the architecture in a rigorous learning theory: we prove that \emph{Random Rational Features} act as a universal basis for sequential dependencies, justifying our initialization strategy, while establishing that the \emph{Differentiable Rational Feature} regime is necessary to close the representational compactness gap. Theoretical analysis and empirical results demonstrate that Rational Transductors solve the "Regular Gap," enabling robust length generalization on algorithmic tasks where standard Transformers fail, without the sequential computational bottlenecks of traditional RNNs.

new Object-Oriented Transition Modeling with Inductive Logic Programming

Authors: Gabriel Stella, Dmitri Loguinov

Abstract: Building models of the world from observation, i.e., induction, is one of the major challenges in machine learning. In order to be useful, models need to maintain accuracy when used in novel situations, i.e., generalize. In addition, they should be easy to interpret and efficient to train. Prior work has investigated these concepts in the context of object-oriented representations inspired by human cognition. In this paper, we develop a novel learning algorithm that is substantially more powerful than these previous methods. Our thorough experiments, including ablation tests and comparison with neural baselines, demonstrate a significant improvement over the state-of-the-art.

new Escaping Spectral Bias without Backpropagation: Fast Implicit Neural Representations with Extreme Learning Machines

Authors: Woojin Cho, Junghwan Park

Abstract: Training implicit neural representations (INRs) to capture fine-scale details typically relies on iterative backpropagation and is often hindered by spectral bias when the target exhibits highly non-uniform frequency content. We propose ELM-INR, a backpropagation-free INR that decomposes the domain into overlapping subdomains and fits each local problem using an Extreme Learning Machine (ELM) in closed form, replacing iterative optimization with stable linear least-squares solutions. This design yields fast and numerically robust reconstruction by combining local predictors through a partition of unity. To understand where approximation becomes difficult under fixed local capacity, we analyze the method from a spectral Barron norm perspective, which reveals that global reconstruction error is dominated by regions with high spectral complexity. Building on this insight, we introduce BEAM, an adaptive mesh refinement strategy that balances spectral complexity across subdomains to improve reconstruction quality in capacity-constrained regimes.

new SERE: Similarity-based Expert Re-routing for Efficient Batch Decoding in MoE Models

Authors: Juntong Wu, Jialiang Cheng, Fuyu Lv, Ou Dan, Li Yuan

Abstract: Mixture-of-Experts (MoE) architectures employ sparse activation to deliver faster training and inference with higher accuracy than dense LLMs. However, in production serving, MoE models require batch inference to optimize hardware efficiency, which may cause excessive expert activation and thus slow the memory-bound decoding stage. To address the fundamental tension between batch decoding and expert sparsity, we present SERE, a Similarity-based Expert Re-routing method for Efficient batch decoding in MoE models. SERE dynamically reduces the number of active experts in an input-aware manner by re-routing tokens from secondary experts to their most similar primary counterparts. It also leverages similarity patterns to identify and preserve critical experts, thereby preventing capability loss. Notably, SERE avoids static expert pruning or merging, instead enabling dynamic expert skipping based on batch-level expert redundancy. Additionally, we provide an efficient custom CUDA kernel for SERE, enabling plug-and-play use in vLLM with only a single-line code change. Extensive experiments on various complex reasoning benchmarks demonstrate that SERE achieves up to 2.0x speedup with minimal quality loss, providing a practical solution for cost-efficient and latency-sensitive large-scale MoE deployment. Code implementation of SERE can be found in https://github.com/JL-Cheng/SERE.

URLs: https://github.com/JL-Cheng/SERE.

new Dense Neural Networks are not Universal Approximators

Authors: Levi Rauchwerger, Stefanie Jegelka, Ron Levie

Abstract: We investigate the approximation capabilities of dense neural networks. While universal approximation theorems establish that sufficiently large architectures can approximate arbitrary continuous functions if there are no restrictions on the weight values, we show that dense neural networks do not possess this universality. Our argument is based on a model compression approach, combining the weak regularity lemma with an interpretation of feedforward networks as message passing graph neural networks. We consider ReLU neural networks subject to natural constraints on weights and input and output dimensions, which model a notion of dense connectivity. Within this setting, we demonstrate the existence of Lipschitz continuous functions that cannot be approximated by such networks. This highlights intrinsic limitations of neural networks with dense layers and motivates the use of sparse connectivity as a necessary ingredient for achieving true universality.

new TASTE: Task-Aware Out-of-Distribution Detection via Stein Operators

Authors: Micha{\l} Kozyra, Gesine Reinert

Abstract: Out-of-distribution detection methods are often either data-centric, detecting deviations from the training input distribution irrespective of their effect on a trained model, or model-centric, relying on classifier outputs without explicit reference to data geometry. We propose TASTE (Task-Aware STEin operators): a task-aware framework based on so-called Stein operators, which allows us to link distribution shift to the input sensitivity of the model. We show that the resulting operator admits a clear geometric interpretation as a projection of distribution shift onto the sensitivity field of the model, yielding theoretical guarantees. Beyond detecting the presence of a shift, the same construction enables its localisation through a coordinate-wise decomposition, and for image data-provides interpretable per-pixel diagnostics. Experiments on controlled Gaussian shifts, MNIST under geometric perturbations, and CIFAR-10 perturbed benchmarks demonstrate that the proposed method aligns closely with task degradation while outperforming established baselines.

new Continuous Program Search

Authors: Matthew Siper, Muhammad Umair Nasir, Ahmed Khalifa, Lisa Soros, Jay Azhang, Julian Togelius

Abstract: Genetic Programming yields interpretable programs, but small syntactic mutations can induce large, unpredictable behavioral shifts, degrading locality and sample efficiency. We frame this as an operator-design problem: learn a continuous program space where latent distance has behavioral meaning, then design mutation operators that exploit this structure without changing the evolutionary optimizer. We make locality measurable by tracking action-level divergence under controlled latent perturbations, identifying an empirical trust region for behavior-local continuous variation. Using a compact trading-strategy DSL with four semantic components (long/short entry and exit), we learn a matching block-factorized embedding and compare isotropic Gaussian mutation over the full latent space to geometry-compiled mutation that restricts updates to semantically paired entry--exit subspaces and proposes directions using a learned flow-based model trained on logged mutation outcomes. Under identical $(\mu+\lambda)$ evolution strategies and fixed evaluation budgets across five assets, the learned mutation operator discovers strong strategies using an order of magnitude fewer evaluations and achieves the highest median out-of-sample Sharpe ratio. Although isotropic mutation occasionally attains higher peak performance, geometry-compiled mutation yields faster, more reliable progress, demonstrating that semantically aligned mutation can substantially improve search efficiency without modifying the underlying evolutionary algorithm.

new Surprisal-Guided Selection: Compute-Optimal Test-Time Strategies for Execution-Grounded Code Generation

Authors: Jarrod Barnes

Abstract: Test-time training (TTT) adapts language models through gradient-based updates at inference. But is adaptation the right strategy? We study compute-optimal test-time strategies for verifiable execution-grounded (VEG) tasks, domains like GPU kernel optimization where a deterministic evaluator provides dense, continuous reward signals. Using KernelBench as our testbed and a 120B-parameter model (GPT-OSS-120B with LoRA adaptation), we find that search outperforms minimal adaptation (1-5 gradient steps): Best-of-N sampling achieves 90% task success (18/20 tasks) at K=64 across the full KernelBench L1 eval set while TTT's best checkpoint reaches only 30.6% (3-seed mean), with TTT's "equivalent K" falling below 1, worse than single-sample inference. The failure mode is over-sharpening: gradient updates collapse diversity toward mediocre solutions rather than discovering optimal ones. Our main contribution is surprisal-guided selection: selecting the highest-surprisal (lowest-confidence) correct sample yields 80% success vs. 50% for most-confident selection, a 30% improvement. Extending to surprisal-guided-top3 matches oracle performance at 100%. This zero-cost strategy, validated through length-controlled analysis, recovers oracle performance. For dense-reward VEG tasks, compute should be allocated to sample diversity and intelligent selection rather than gradient adaptation. The surprisal-guided selection principle may generalize to other execution-grounded domains where optimal solutions occupy the distribution tail.

new Federated Learning with Profile Mapping under Distribution Shifts and Drifts

Authors: Mohan Li, Dario Fenoglio, Martin Gjoreski, Marc Langheinrich

Abstract: Federated Learning (FL) enables decentralized model training across clients without sharing raw data, but its performance degrades under real-world data heterogeneity. Existing methods often fail to address distribution shift across clients and distribution drift over time, or they rely on unrealistic assumptions such as known number of client clusters and data heterogeneity types, which limits their generalizability. We introduce Feroma, a novel FL framework that explicitly handles both distribution shift and drift without relying on client or cluster identity. Feroma builds on client distribution profiles-compact, privacy-preserving representations of local data-that guide model aggregation and test-time model assignment through adaptive similarity-based weighting. This design allows Feroma to dynamically select aggregation strategies during training, ranging from clustered to personalized, and deploy suitable models to unseen, and unlabeled test clients without retraining, online adaptation, or prior knowledge on clients' data. Extensive experiments show that compared to 10 state-of-the-art methods, Feroma improves performance and stability under dynamic data heterogeneity conditions-an average accuracy gain of up to 12 percentage points over the best baselines across 6 benchmarks-while maintaining computational and communication overhead comparable to FedAvg. These results highlight that distribution-profile-based aggregation offers a practical path toward robust FL under both data distribution shifts and drifts.

new ElliCE: Efficient and Provably Robust Algorithmic Recourse via the Rashomon Sets

Authors: Bohdan Turbal, Iryna Voitsitska, Lesia Semenova

Abstract: Machine learning models now influence decisions that directly affect people's lives, making it important to understand not only their predictions, but also how individuals could act to obtain better results. Algorithmic recourse provides actionable input modifications to achieve more favorable outcomes, typically relying on counterfactual explanations to suggest such changes. However, when the Rashomon set - the set of near-optimal models - is large, standard counterfactual explanations can become unreliable, as a recourse action valid for one model may fail under another. We introduce ElliCE, a novel framework for robust algorithmic recourse that optimizes counterfactuals over an ellipsoidal approximation of the Rashomon set. The resulting explanations are provably valid over this ellipsoid, with theoretical guarantees on uniqueness, stability, and alignment with key feature directions. Empirically, ElliCE generates counterfactuals that are not only more robust but also more flexible, adapting to user-specified feature constraints while being substantially faster than existing baselines. This provides a principled and practical solution for reliable recourse under model uncertainty, ensuring stable recommendations for users even as models evolve.

new Spectral Gating Networks

Authors: Jusheng Zhang, Yijia Fan, Kaitong Cai, Jing Yang, Yongsen Zheng, Kwok-Yan Lam, Liang Lin, Keze Wang

Abstract: Gating mechanisms are ubiquitous, yet a complementary question in feed-forward networks remains under-explored: how to introduce frequency-rich expressivity without sacrificing stability and scalability? This tension is exposed by spline-based Kolmogorov-Arnold Network (KAN) parameterizations, where grid refinement can induce parameter growth and brittle optimization in high dimensions. To propose a stability-preserving way to inject spectral capacity into existing MLP/FFN layers under fixed parameter and training budgets, we introduce Spectral Gating Networks (SGN), a drop-in spectral reparameterization. SGN augments a standard activation pathway with a compact spectral pathway and learnable gates that allow the model to start from a stable base behavior and progressively allocate capacity to spectral features during training. The spectral pathway is instantiated with trainable Random Fourier Features (learned frequencies and phases), replacing grid-based splines and removing resolution dependence. A hybrid GELU-Fourier formulation further improves optimization robustness while enhancing high-frequency fidelity. Across vision, NLP, audio, and PDE benchmarks, SGN consistently improves accuracy-efficiency trade-offs under comparable computational budgets, achieving 93.15% accuracy on CIFAR-10 and up to 11.7x faster inference than spline-based KAN variants. Code and trained models will be released.

new On the Infinite Width and Depth Limits of Predictive Coding Networks

Authors: Francesco Innocenti, El Mehdi Achour, Rafal Bogacz

Abstract: Predictive coding (PC) is a biologically plausible alternative to standard backpropagation (BP) that minimises an energy function with respect to network activities before updating weights. Recent work has improved the training stability of deep PC networks (PCNs) by leveraging some BP-inspired reparameterisations. However, the full scalability and theoretical basis of these approaches remains unclear. To address this, we study the infinite width and depth limits of PCNs. For linear residual networks, we show that the set of width- and depth-stable feature-learning parameterisations for PC is exactly the same as for BP. Moreover, under any of these parameterisations, the PC energy with equilibrated activities converges to the BP loss in a regime where the model width is much larger than the depth, resulting in PC computing the same gradients as BP. Experiments show that these results hold in practice for deep nonlinear networks, as long as an activity equilibrium seem to be reached. Overall, this work unifies various previous theoretical and empirical results and has potentially important implications for the scaling of PCNs.

new Dense Feature Learning via Linear Structure Preservation in Medical Data

Authors: Yuanyun Zhang, Mingxuan Zhang, Siyuan Li, Zihan Wang, Haoran Chen, Wenbo Zhou, Shi Li

Abstract: Deep learning models for medical data are typically trained using task specific objectives that encourage representations to collapse onto a small number of discriminative directions. While effective for individual prediction problems, this paradigm underutilizes the rich structure of clinical data and limits the transferability, stability, and interpretability of learned features. In this work, we propose dense feature learning, a representation centric framework that explicitly shapes the linear structure of medical embeddings. Our approach operates directly on embedding matrices, encouraging spectral balance, subspace consistency, and feature orthogonality through objectives defined entirely in terms of linear algebraic properties. Without relying on labels or generative reconstruction, dense feature learning produces representations with higher effective rank, improved conditioning, and greater stability across time. Empirical evaluations across longitudinal EHR data, clinical text, and multimodal patient representations demonstrate consistent improvements in downstream linear performance, robustness, and subspace alignment compared to supervised and self supervised baselines. These results suggest that learning to span clinical variation may be as important as learning to predict clinical outcomes, and position representation geometry as a first class objective in medical AI.

new Quantifying Explanation Quality in Graph Neural Networks using Out-of-Distribution Generalization

Authors: Ding Zhang, Siddharth Betala, Chirag Agarwal

Abstract: Evaluating the quality of post-hoc explanations for Graph Neural Networks (GNNs) remains a significant challenge. While recent years have seen an increasing development of explainability methods, current evaluation metrics (e.g., fidelity, sparsity) often fail to assess whether an explanation identifies the true underlying causal variables. To address this, we propose the Explanation-Generalization Score (EGS), a metric that quantifies the causal relevance of GNN explanations. EGS is founded on the principle of feature invariance and posits that if an explanation captures true causal drivers, it should lead to stable predictions across distribution shifts. To quantify this, we introduce a framework that trains GNNs using explanatory subgraphs and evaluates their performance in Out-of-Distribution (OOD) settings (here, OOD generalization serves as a rigorous proxy for the explanation's causal validity). Through large-scale validation involving 11,200 model combinations across synthetic and real-world datasets, our results demonstrate that EGS provides a principled benchmark for ranking explainers based on their ability to capture causal substructures, offering a robust alternative to traditional fidelity-based metrics.

new Towards Robust Scaling Laws for Optimizers

Authors: Alexandra Volkova, Mher Safaryan, Christoph H. Lampert, Dan Alistarh

Abstract: The quality of Large Language Model (LLM) pretraining depends on multiple factors, including the compute budget and the choice of optimization algorithm. Empirical scaling laws are widely used to predict loss as model size and training data grow, however, almost all existing studies fix the optimizer (typically AdamW). At the same time, a new generation of optimizers (e.g., Muon, Shampoo, SOAP) promises faster and more stable convergence, but their relationship with model and data scaling is not yet well understood. In this work, we study scaling laws across different optimizers. Empirically, we show that 1) separate Chinchilla-style scaling laws for each optimizer are ill-conditioned and have highly correlated parameters. Instead, 2) we propose a more robust law with shared power-law exponents and optimizer-specific rescaling factors, which enable direct comparison between optimizers. Finally, 3) we provide a theoretical analysis of gradient-based methods for the proxy task of a convex quadratic objective, demonstrating that Chinchilla-style scaling laws emerge naturally as a result of loss decomposition into irreducible, approximation, and optimization errors.

new Analyzing and Guiding Zero-Shot Posterior Sampling in Diffusion Models

Authors: Roi Benita, Michael Elad, Joseph Keshet

Abstract: Recovering a signal from its degraded measurements is a long standing challenge in science and engineering. Recently, zero-shot diffusion based methods have been proposed for such inverse problems, offering a posterior sampling based solution that leverages prior knowledge. Such algorithms incorporate the observations through inference, often leaning on manual tuning and heuristics. In this work we propose a rigorous analysis of such approximate posterior-samplers, relying on a Gaussianity assumption of the prior. Under this regime, we show that both the ideal posterior sampler and diffusion-based reconstruction algorithms can be expressed in closed-form, enabling their thorough analysis and comparisons in the spectral domain. Building on these representations, we also introduce a principled framework for parameter design, replacing heuristic selection strategies used to date. The proposed approach is method-agnostic and yields tailored parameter choices for each algorithm, jointly accounting for the characteristics of the prior, the degraded signal, and the diffusion dynamics. We show that our spectral recommendations differ structurally from standard heuristics and vary with the diffusion step size, resulting in a consistent balance between perceptual quality and signal fidelity.

new Efficient Planning in Reinforcement Learning via Model Introspection

Authors: Gabriel Stella

Abstract: Reinforcement learning and classical planning are typically seen as two distinct problems, with differing formulations necessitating different solutions. Yet, when humans are given a task, regardless of the way it is specified, they can often derive the additional information needed to solve the problem efficiently. The key to this ability is introspection: by reasoning about their internal models of the problem, humans directly synthesize additional task-relevant information. In this paper, we propose that this introspection can be thought of as program analysis. We discuss examples of how this approach can be applied to various kinds of models used in reinforcement learning. We then describe an algorithm that enables efficient goal-oriented planning over the class of models used in relational reinforcement learning, demonstrating a novel link between reinforcement learning and classical planning.

new ParisKV: Fast and Drift-Robust KV-Cache Retrieval for Long-Context LLMs

Authors: Yanlin Qi, Xinhang Chen, Huiqiang Jiang, Qitong Wang, Botao Peng, Themis Palpanas

Abstract: KV-cache retrieval is essential for long-context LLM inference, yet existing methods struggle with distribution drift and high latency at scale. We introduce ParisKV, a drift-robust, GPU-native KV-cache retrieval framework based on collision-based candidate selection, followed by a quantized inner-product reranking estimator. For million-token contexts, ParisKV supports CPU-offloaded KV caches via Unified Virtual Addressing (UVA), enabling on-demand top-$k$ fetching with minimal overhead. ParisKV matches or outperforms full attention quality on long-input and long-generation benchmarks. It achieves state-of-the-art long-context decoding efficiency: it matches or exceeds full attention speed even at batch size 1 for long contexts, delivers up to 2.8$\times$ higher throughput within full attention's runnable range, and scales to million-token contexts where full attention runs out of memory. At million-token scale, ParisKV reduces decode latency by 17$\times$ and 44$\times$ compared to MagicPIG and PQCache, respectively, two state-of-the-art KV-cache Top-$k$ retrieval baselines.

new Do We Need Adam? Surprisingly Strong and Sparse Reinforcement Learning with SGD in LLMs

Authors: Sagnik Mukherjee, Lifan Yuan, Pavan Jayasinha, Dilek Hakkani-T\"ur, Hao Peng

Abstract: Reinforcement learning (RL), particularly RL from verifiable reward (RLVR), has become a crucial phase of training large language models (LLMs) and a key focus of current scaling efforts. However, optimization practices in RL largely follow those of next-token prediction stages (e.g., pretraining and supervised fine-tuning), despite fundamental differences between RL and these stages highlighted by recent work. One such practice is the use of the AdamW optimizer, which is widely adopted for training large-scale transformers despite its high memory overhead. Our analysis shows that both momentum and adaptive learning rates in AdamW are less influential in RL than in SFT, leading us to hypothesize that RL benefits less from Adam-style per-parameter adaptive learning rates and momentum. Confirming this hypothesis, our experiments demonstrate that the substantially more memory-efficient SGD, which is known to perform poorly in supervised learning of large-scale transformers, matches or even outperforms AdamW in RL for LLMs. Remarkably, full fine-tuning with SGD updates fewer than 0.02% of model parameters without any sparsity-promoting regularization, more than 1000 times fewer than AdamW. Our analysis offers potential reasons for this update sparsity. These findings provide new insights into the optimization dynamics of RL in LLMs and show that RL can be substantially more parameter-efficient than previously recognized.

new The Laplacian Keyboard: Beyond the Linear Span

Authors: Siddarth Chandrasekar, Marlos C. Machado

Abstract: Across scientific disciplines, Laplacian eigenvectors serve as a fundamental basis for simplifying complex systems, from signal processing to quantum mechanics. In reinforcement learning (RL), these eigenvectors provide a natural basis for approximating reward functions; however, their use is typically limited to their linear span, which restricts expressivity in complex environments. We introduce the Laplacian Keyboard (LK), a hierarchical framework that goes beyond the linear span. LK constructs a task-agnostic library of options from these eigenvectors, forming a behavior basis guaranteed to contain the optimal policy for any reward within the linear span. A meta-policy learns to stitch these options dynamically, enabling efficient learning of policies outside the original linear constraints. We establish theoretical bounds on zero-shot approximation error and demonstrate empirically that LK surpasses zero-shot solutions while achieving improved sample efficiency compared to standard RL methods.

new Efficient Adaptive Data Analysis over Dense Distributions

Authors: Joon Suk Huh

Abstract: Modern data workflows are inherently adaptive, repeatedly querying the same dataset to refine and validate sequential decisions, but such adaptivity can lead to overfitting and invalid statistical inference. Adaptive Data Analysis (ADA) mechanisms address this challenge; however, there is a fundamental tension between computational efficiency and sample complexity. For $T$ rounds of adaptive analysis, computationally efficient algorithms typically incur suboptimal $O(\sqrt{T})$ sample complexity, whereas statistically optimal $O(\log T)$ algorithms are computationally intractable under standard cryptographic assumptions. In this work, we shed light on this trade-off by identifying a natural class of data distributions under which both computational efficiency and optimal sample complexity are achievable. We propose a computationally efficient ADA mechanism that attains optimal $O(\log T)$ sample complexity when the data distribution is dense with respect to a known prior. This setting includes, in particular, feature--label data distributions arising in distribution-specific learning. As a consequence, our mechanism also yields a sample-efficient (i.e., $O(\log T)$ samples) statistical query oracle in the distribution-specific setting. Moreover, although our algorithm is not based on differential privacy, it satisfies a relaxed privacy notion known as Predicate Singling Out (PSO) security (Cohen and Nissim, 2020). Our results thus reveal an inherent connection between adaptive data analysis and privacy beyond differential privacy.

new TerraBind: Fast and Accurate Binding Affinity Prediction through Coarse Structural Representations

Authors: Matteo Rossi, Ryan Pederson, Miles Wang-Henderson, Ben Kaufman, Edward C. Williams, Carl Underkoffler, Owen Lewis Howell, Adrian Layer, Stephan Thaler, Narbe Mardirossian, John Anthony Parkhill

Abstract: We present TerraBind, a foundation model for protein-ligand structure and binding affinity prediction that achieves 26-fold faster inference than state-of-the-art methods while improving affinity prediction accuracy by $\sim$20\%. Current deep learning approaches to structure-based drug design rely on expensive all-atom diffusion to generate 3D coordinates, creating inference bottlenecks that render large-scale compound screening computationally intractable. We challenge this paradigm with a critical hypothesis: full all-atom resolution is unnecessary for accurate small molecule pose and binding affinity prediction. TerraBind tests this hypothesis through a coarse pocket-level representation (protein C$_\beta$ atoms and ligand heavy atoms only) within a multimodal architecture combining COATI-3 molecular encodings and ESM-2 protein embeddings that learns rich structural representations, which are used in a diffusion-free optimization module for pose generation and a binding affinity likelihood prediction module. On structure prediction benchmarks (FoldBench, PoseBusters, Runs N' Poses), TerraBind matches diffusion-based baselines in ligand pose accuracy. Crucially, TerraBind outperforms Boltz-2 by $\sim$20\% in Pearson correlation for binding affinity prediction on both a public benchmark (CASP16) and a diverse proprietary dataset (18 biochemical/cell assays). We show that the affinity prediction module also provides well-calibrated affinity uncertainty estimates, addressing a critical gap in reliable compound prioritization for drug discovery. Furthermore, this module enables a continual learning framework and a hedged batch selection strategy that, in simulated drug discovery cycles, achieves 6$\times$ greater affinity improvement of selected molecules over greedy-based approaches.

new Learnable Chernoff Baselines for Inference-Time Alignment

Authors: Sunil Madhow, Yuchen Liang, Ness Shroff, Yingbin Liang, Yu-Xiang Wang

Abstract: We study inference-time reward-guided alignment for generative models. Existing methods often rely on either architecture-specific adaptations or computationally costly inference procedures. We introduce Learnable Chernoff Baselines (LCBs) as a method for efficiently and approximately sampling from the exponentially tilted kernels that arise from KL-regularized reward alignment. Using only black-box sampling access to the pretrained model, LCBs implement a form of rejection sampling with adaptively selected acceptance probabilities, which allows fine-grained control over inference-compute scaling. We establish total-variation guarantees to the ideal aligned model, and demonstrate in both continuous and discrete diffusion settings that LCB sampling closely matches ideal rejection sampling while using substantially fewer queries to the pretrained model.

new Riemannian MeanFlow

Authors: Dongyeop Woo, Marta Skreta, Seonghyun Park, Sungsoo Ahn, Kirill Neklyudov

Abstract: Diffusion and flow models have become the dominant paradigm for generative modeling on Riemannian manifolds, with successful applications in protein backbone generation and DNA sequence design. However, these methods require tens to hundreds of neural network evaluations at inference time, which can become a computational bottleneck in large-scale scientific sampling workflows. We introduce Riemannian MeanFlow~(RMF), a framework for learning flow maps directly on manifolds, enabling high-quality generations with as few as one forward pass. We derive three equivalent characterizations of the manifold average velocity (Eulerian, Lagrangian, and semigroup identities), and analyze parameterizations and stabilization techniques to improve training on high-dimensional manifolds. In promoter DNA design and protein backbone generation settings, RMF achieves comparable sample quality to prior methods while requiring up to 10$\times$ fewer function evaluations. Finally, we show that few-step flow maps enable efficient reward-guided design through reward look-ahead, where terminal states can be predicted from intermediate steps at minimal additional cost.

new Preference Conditioned Multi-Objective Reinforcement Learning: Decomposed, Diversity-Driven Policy Optimization

Authors: Tanmay Ambadkar, Sourav Panda, Shreyash Kale, Jonathan Dodge, Abhinav Verma

Abstract: Multi-objective reinforcement learning (MORL) seeks to learn policies that balance multiple, often conflicting objectives. Although a single preference-conditioned policy is the most flexible and scalable solution, existing approaches remain brittle in practice, frequently failing to recover complete Pareto fronts. We show that this failure stems from two structural issues in current methods: destructive gradient interference caused by premature scalarization and representational collapse across the preference space. We introduce $D^3PO$, a PPO-based framework that reorganizes multi-objective policy optimization to address these issues directly. $D^3PO$ preserves per-objective learning signals through a decomposed optimization pipeline and integrates preferences only after stabilization, enabling reliable credit assignment. In addition, a scaled diversity regularizer enforces sensitivity of policy behavior to preference changes, preventing collapse. Across standard MORL benchmarks, including high-dimensional and many-objective control tasks, $D^3PO$ consistently discovers broader and higher-quality Pareto fronts than prior single- and multi-policy methods, matching or exceeding state-of-the-art hypervolume and expected utility while using a single deployable policy.

new MaD-Mix: Multi-Modal Data Mixtures via Latent Space Coupling for Vision-Language Model Training

Authors: Wanyun Xie, Francesco Tonin, Volkan Cevher

Abstract: Vision-Language Models (VLMs) are typically trained on a diverse set of multi-modal domains, yet current practices rely on costly manual tuning. We propose MaD-Mix, a principled and computationally efficient framework that derives multi-modal data mixtures for VLM training. MaD-Mix formulates data mixing as modality-aware domain alignment maximization and obtains closed-form multi-modal alignment scores from the Fenchel dual through inter-modal coupling variables. MaD-Mix systematically handles domains with missing modalities, allowing for the integration of language-only domains. Empirical evaluations across 0.5B and 7B models demonstrate that MaD-Mix accelerates VLM training across diverse benchmarks. MaD-Mix matches human-tuned data mixtures using 22% fewer training steps in image-text instruction tuning. In complex tri-modal video-image-text scenarios, where manual tuning becomes impractical, MaD-Mix boosts average accuracy over uniform weights, with negligible mixture computation overhead (< 1 GPU-hour), enabling scalable mixture design for modern VLM pipelines.

new CausalTAD: Injecting Causal Knowledge into Large Language Models for Tabular Anomaly Detection

Authors: Ruiqi Wang, Ruikang Liu, Runyu Chen, Haoxiang Suo, Zhiyi Peng, Zhuo Tang, Changjian Chen

Abstract: Detecting anomalies in tabular data is critical for many real-world applications, such as credit card fraud detection. With the rapid advancements in large language models (LLMs), state-of-the-art performance in tabular anomaly detection has been achieved by converting tabular data into text and fine-tuning LLMs. However, these methods randomly order columns during conversion, without considering the causal relationships between them, which is crucial for accurately detecting anomalies. In this paper, we present CausalTaD, a method that injects causal knowledge into LLMs for tabular anomaly detection. We first identify the causal relationships between columns and reorder them to align with these causal relationships. This reordering can be modeled as a linear ordering problem. Since each column contributes differently to the causal relationships, we further propose a reweighting strategy to assign different weights to different columns to enhance this effect. Experiments across more than 30 datasets demonstrate that our method consistently outperforms the current state-of-the-art methods. The code for CausalTAD is available at https://github.com/350234/CausalTAD.

URLs: https://github.com/350234/CausalTAD.

new Fairness Aware Reward Optimization

Authors: Ching Lam Choi, Vighnesh Subramaniam, Phillip Isola, Antonio Torralba, Stefanie Jegelka

Abstract: Demographic skews in human preference data propagate systematic unfairness through reward models into aligned LLMs. We introduce Fairness Aware Reward Optimization (Faro), an in-processing framework that trains reward models under demographic parity, equalized odds, or counterfactual fairness constraints. We provide the first theoretical analysis of reward-level fairness in LLM alignment, establishing: (i) provable fairness certificates for Faro-trained rewards with controllable slack; a (ii) formal characterization of the accuracy-fairness trade-off induced by KL-regularized fine-tuning, proving fairness transfers from reward to policy; and the (iii) existence of a non-empty Pareto frontier. Unlike pre- and post-processing methods, Faro ensures reward models are simultaneously ordinal (ranking correctly), cardinal (calibrated), and fair. Across multiple LLMs and benchmarks, Faro significantly reduces bias and harmful generations while maintaining or improving model quality.

new Approximating Matrix Functions with Deep Neural Networks and Transformers

Authors: Rahul Padmanabhan, Simone Brugiapaglia

Abstract: Transformers have revolutionized natural language processing, but their use for numerical computation has received less attention. We study the approximation of matrix functions, which map scalar functions to matrices, using neural networks including transformers. We focus on functions mapping square matrices to square matrices of the same dimension. These types of matrix functions appear throughout scientific computing, e.g., the matrix exponential in continuous-time Markov chains and the matrix sign function in stability analysis of dynamical systems. In this paper, we make two contributions. First, we prove bounds on the width and depth of ReLU networks needed to approximate the matrix exponential to an arbitrary precision. Second, we show experimentally that a transformer encoder-decoder with suitable numerical encodings can approximate certain matrix functions at a relative error of 5% with high probability. Our study reveals that the encoding scheme strongly affects performance, with different schemes working better for different functions.

new Efficient Representations are Controllable Representations

Authors: Charles Ye, Jasmine Cui

Abstract: What is the most brute-force way to install interpretable, controllable features into a model's activations? Controlling how LLMs internally represent concepts typically requires sophisticated methods to first identify, then intervene on the model's existing feature geometry. We bypass all of this. We finetune an LLM with a simple auxiliary loss, training 16 of its 3072 residual stream dimensions to be inert interpretability flags that simply indicate what concepts are required for generation. The model reorganizes around them anyway, learning to rely on these flags during actual generation tasks. As a result, these inert flags become genuine internal features: interpretable control switches that allow us to steer generation at inference time. Why does this work? When a feature is reliably supplied at a fixed location, gradient descent gradually eliminates redundant encodings elsewhere, and the model erodes its own alternative representations. A model's efficiency pressure is a lever - exploitable to induce interpretable, controllable representations.

new rePIRL: Learn PRM with Inverse RL for LLM Reasoning

Authors: Xian Wu, Kaijie Zhu, Ying Zhang, Lun Wang, Wenbo Guo

Abstract: Process rewards have been widely used in deep reinforcement learning to improve training efficiency, reduce variance, and prevent reward hacking. In LLM reasoning, existing works also explore various solutions for learning effective process reward models (PRM) with or without the help of an expert policy. However, existing methods either rely on strong assumptions about the expert policies (e.g., requiring their reward functions) or suffer intrinsic limitations (e.g., entropy collapse), resulting in weak PRMs or limited generalizability. In this paper, we introduce rePIRL, an inverse RL-inspired framework that learns effective PRMs with minimal assumptions about expert policies. Specifically, we design a dual learning process that updates the policy and the PRM interchangeably. Our learning algorithm has customized techniques to address the challenges of scaling traditional inverse RL to LLMs. We theoretically show that our proposed learning framework can unify both online and offline PRM learning methods, justifying that rePIRL can learn PRMs with minimal assumptions. Empirical evaluations on standardized math and coding reasoning datasets demonstrate the effectiveness of rePIRL over existing methods. We further show the application of our trained PRM in test-time training, test-time scaling, and providing an early signal for training hard problems. Finally, we validate our training recipe and key design choices via a detailed ablation study.

new Interpretable Analytic Calabi-Yau Metrics via Symbolic Distillation

Authors: D Yang Eng

Abstract: Calabi--Yau manifolds are essential for string theory but require computing intractable metrics. Here we show that symbolic regression can distill neural approximations into simple, interpretable formulas. Our five-term expression matches neural accuracy ($R^2 = 0.9994$) with 3,000-fold fewer parameters. Multi-seed validation confirms that geometric constraints select essential features, specifically power sums and symmetric polynomials, while permitting structural diversity. The functional form can be maintained across the studied moduli range ($\psi \in [0, 0.8]$) with coefficients varying smoothly; we interpret these trends as empirical hypotheses within the accuracy regime of the locally-trained teachers ($\sigma \approx 8-9\%$ at $\psi \neq 0$). The formula reproduces physical observables -- volume integrals and Yukawa couplings -- validating that symbolic distillation recovers compact, interpretable models for quantities previously accessible only to black-box networks.

new MARTI-MARS$^2$: Scaling Multi-Agent Self-Search via Reinforcement Learning for Code Generation

Authors: Shijie Wang, Pengfei Li, Yikun Fu, Kaifeng Liu, Fangyuan Li, Yang Liu, Xiaowei Sun, Zonglin Li, Siyao Zhao, Jian Zhao, Kai Tian, Dong Li, Junqi Gao, Yutong Zhang, Yiqun Chen, Yuqiang Li, Zoe Li, Weinan Zhang, Peng Ye, Shuyue Hu, Lei Bai, Bowen Zhou, Kaiyan Zhang, Biqing Qi

Abstract: While the complex reasoning capability of Large Language Models (LLMs) has attracted significant attention, single-agent systems often encounter inherent performance ceilings in complex tasks such as code generation. Multi-agent collaboration offers a promising avenue to transcend these boundaries. However, existing frameworks typically rely on prompt-based test-time interactions or multi-role configurations trained with homogeneous parameters, limiting error correction capabilities and strategic diversity. In this paper, we propose a Multi-Agent Reinforced Training and Inference Framework with Self-Search Scaling (MARTI-MARS2), which integrates policy learning with multi-agent tree search by formulating the multi-agent collaborative exploration process as a dynamic and learnable environment. By allowing agents to iteratively explore and refine within the environment, the framework facilitates evolution from parameter-sharing homogeneous multi-role training to heterogeneous multi-agent training, breaking through single-agent capability limits. We also introduce an efficient inference strategy MARTI-MARS2-T+ to fully exploit the scaling potential of multi-agent collaboration at test time. We conduct extensive experiments across varied model scales (8B, 14B, and 32B) on challenging code generation benchmarks. Utilizing two collaborating 32B models, MARTI-MARS2 achieves 77.7%, outperforming strong baselines like GPT-5.1. Furthermore, MARTI-MARS2 reveals a novel scaling law: shifting from single-agent to homogeneous multi-role and ultimately to heterogeneous multi-agent paradigms progressively yields higher RL performance ceilings, robust TTS capabilities, and greater policy diversity, suggesting that policy diversity is critical for scaling intelligence via multi-agent reinforcement learning.

new Dynamic Load Model for Data Centers with Pattern-Consistent Calibration

Authors: Siyu Lu, Chenhan Xiao, Yang Weng

Abstract: The rapid growth of data centers has made large electronic load (LEL) modeling increasingly important for power system analysis. Such loads are characterized by fast workload-driven variability and protection-driven disconnection and reconnection behavior that are not captured by conventional load models. Existing data center load modeling includes physics-based approaches, which provide interpretable structure for grid simulation, and data-driven approaches, which capture empirical workload variability from data. However, physics-based models are typically uncalibrated to facility-level operation, while trajectory alignment in data-driven methods often leads to overfitting and unrealistic dynamic behavior. To resolve these limitations, we design the framework to leverage both physics-based structure and data-driven adaptability. The physics-based structure is parameterized to enable data-driven pattern-consistent calibration from real operational data, supporting facility-level grid planning. We further show that trajectory-level alignment is limited for inherently stochastic data center loads. Therefore, we design the calibration to align temporal and statistical patterns using temporal contrastive learning (TCL). This calibration is performed locally at the facility, and only calibrated parameters are shared with utilities, preserving data privacy. The proposed load model is calibrated by real-world operational load data from the MIT Supercloud, ASU Sol, Blue Waters, and ASHRAE datasets. Then it is integrated into the ANDES platform and evaluated on the IEEE 39-bus, NPCC 140-bus, and WECC 179-bus systems. We find that interactions among LELs can fundamentally alter post-disturbance recovery behavior, producing compound disconnection-reconnection dynamics and delayed stabilization that are not captured by uncalibrated load models.

new Direct Soft-Policy Sampling via Langevin Dynamics

Authors: Donghyeon Ki, Hee-Jun Ahn, Kyungyoon Kim, Byung-Jun Lee

Abstract: Soft policies in reinforcement learning define policies as Boltzmann distributions over state-action value functions, providing a principled mechanism for balancing exploration and exploitation. However, realizing such soft policies in practice remains challenging. Existing approaches either depend on parametric policies with limited expressivity or employ diffusion-based policies whose intractable likelihoods hinder reliable entropy estimation in soft policy objectives. We address this challenge by directly realizing soft-policy sampling via Langevin dynamics driven by the action gradient of the Q-function. This perspective leads to Langevin Q-Learning (LQL), which samples actions from the target Boltzmann distribution without explicitly parameterizing the policy. However, directly applying Langevin dynamics suffers from slow mixing in high-dimensional and non-convex Q-landscapes, limiting its practical effectiveness. To overcome this, we propose Noise-Conditioned Langevin Q-Learning (NC-LQL), which integrates multi-scale noise perturbations into the value function. NC-LQL learns a noise-conditioned Q-function that induces a sequence of progressively smoothed value landscapes, enabling sampling to transition from global exploration to precise mode refinement. On OpenAI Gym MuJoCo benchmarks, NC-LQL achieves competitive performance compared to state-of-the-art diffusion-based methods, providing a simple yet powerful solution for online RL.

new Harpoon: Generalised Manifold Guidance for Conditional Tabular Diffusion

Authors: Aditya Shankar, Yuandou Wang, Rihan Hai, Lydia Y. Chen

Abstract: Generating tabular data under conditions is critical to applications requiring precise control over the generative process. Existing methods rely on training-time strategies that do not generalise to unseen constraints during inference, and struggle to handle conditional tasks beyond tabular imputation. While manifold theory offers a principled way to guide generation, current formulations are tied to specific inference-time objectives and are limited to continuous domains. We extend manifold theory to tabular data and expand its scope to handle diverse inference-time objectives. On this foundation, we introduce HARPOON, a tabular diffusion method that guides unconstrained samples along the manifold geometry to satisfy diverse tabular conditions at inference. We validate our theoretical contributions empirically on tasks such as imputation and enforcing inequality constraints, demonstrating HARPOON'S strong performance across diverse datasets and the practical benefits of manifold-aware guidance for tabular data. Code URL: https://github.com/adis98/Harpoon

URLs: https://github.com/adis98/Harpoon

new GRAFT: Decoupling Ranking and Calibration for Survival Analysis

Authors: Mohammad Ashhad, Robert Hoehndorf, Ricardo Henao

Abstract: Survival analysis is complicated by censored data, high-dimensional features, and non-linear interactions. Classical models are interpretable but restrictive, while deep learning models are flexible but often non-interpretable and sensitive to noise. We propose GRAFT (Gated Residual Accelerated Failure Time), a novel AFT model that decouples prognostic ranking from calibration. GRAFT's hybrid architecture combines a linear AFT model with a non-linear residual neural network, and it also integrates stochastic gates for automatic, end-to-end feature selection. The model is trained by directly optimizing a differentiable, C-index-aligned ranking loss using stochastic conditional imputation from local Kaplan-Meier estimators. In public benchmarks, GRAFT outperforms baselines in discrimination and calibration, while remaining robust and sparse in high-noise settings.

new Efficient Anti-exploration via VQVAE and Fuzzy Clustering in Offline Reinforcement Learning

Authors: Long Chen, Yinkui Liu, Shen Li, Bo Tang, Xuemin Hu

Abstract: Pseudo-count is an effective anti-exploration method in offline reinforcement learning (RL) by counting state-action pairs and imposing a large penalty on rare or unseen state-action pair data. Existing anti-exploration methods count continuous state-action pairs by discretizing these data, but often suffer from the issues of dimension disaster and information loss in the discretization process, leading to efficiency and performance reduction, and even failure of policy learning. In this paper, a novel anti-exploration method based on Vector Quantized Variational Autoencoder (VQVAE) and fuzzy clustering in offline RL is proposed. We first propose an efficient pseudo-count method based on the multi-codebook VQVAE to discretize state-action pairs, and design an offline RL anti-exploitation method based on the proposed pseudo-count method to handle the dimension disaster issue and improve the learning efficiency. In addition, a codebook update mechanism based on fuzzy C-means (FCM) clustering is developed to improve the use rate of vectors in codebooks, addressing the information loss issue in the discretization process. The proposed method is evaluated on the benchmark of Datasets for Deep Data-Driven Reinforcement Learning (D4RL), and experimental results show that the proposed method performs better and requires less computing cost in multiple complex tasks compared to state-of-the-art (SOTA) methods.

new Safety Alignment as Continual Learning: Mitigating the Alignment Tax via Orthogonal Gradient Projection

Authors: Guanglong Sun, Siyuan Zhang, Liyuan Wang, Jun Zhu, Hang Su, Yi Zhong

Abstract: Large Language Models (LLMs) often incur an alignment tax: safety post-training can reduce general utility (e.g., reasoning and coding). We argue that this tax primarily arises from continual-learning-style forgetting in sequential alignment, where distribution shift and conflicting objectives cause safety updates to overwrite pre-trained competencies. Accordingly, we cast safety alignment as a continual learning (CL) problem that must balance plasticity (acquiring safety constraints) and stability (preserving general abilities). We propose Orthogonal Gradient Projection for Safety Alignment (OGPSA), a lightweight method that mitigates interference by constraining each safety update to be orthogonal (in a first-order sense) to a learned subspace capturing general capabilities. Specifically, OGPSA estimates a low-rank capability subspace from gradients on a small reference set and projects the safety gradient onto its orthogonal complement before updating. This produces safety-directed updates that minimally perturb prior knowledge while retaining capacity for alignment. OGPSA is plug-and-play and integrates into standard post-training pipelines without large-scale replay, auxiliary objectives, or retraining. Across Supervised Fine-Tuning (SFT), Direct Preference Optimization (DPO), and sequential SFT$\rightarrow$DPO settings, OGPSA consistently improves the safety--utility Pareto frontier over standard baselines. For instance, on Qwen2.5-7B-Instruct under SFT$\rightarrow$DPO, OGPSA preserves strong safety while recovering general capability, improving SimpleQA from 0.53\% to 3.03\% and IFEval from 51.94\% to 63.96\%. Our source code is available at \href{https://github.com/SunGL001/OGPSA}{OGPSA}

URLs: https://github.com/SunGL001/OGPSA

new Adaptive Acquisition Selection for Bayesian Optimization with Large Language Models

Authors: Giang Ngo, Dat Phan Trong, Dang Nguyen, Sunil Gupta, Svetha Venkatesh

Abstract: Bayesian Optimization critically depends on the choice of acquisition function, but no single strategy is universally optimal; the best choice is non-stationary and problem-dependent. Existing adaptive portfolio methods often base their decisions on past function values while ignoring richer information like remaining budget or surrogate model characteristics. To address this, we introduce LMABO, a novel framework that casts a pre-trained Large Language Model (LLM) as a zero-shot, online strategist for the BO process. At each iteration, LMABO uses a structured state representation to prompt the LLM to select the most suitable acquisition function from a diverse portfolio. In an evaluation across 50 benchmark problems, LMABO demonstrates a significant performance improvement over strong static, adaptive portfolio, and other LLM-based baselines. We show that the LLM's behavior is a comprehensive strategy that adapts to real-time progress, proving its advantage stems from its ability to process and synthesize the complete optimization state into an effective, adaptive policy.

new AceGRPO: Adaptive Curriculum Enhanced Group Relative Policy Optimization for Autonomous Machine Learning Engineering

Authors: Yuzhu Cai, Zexi Liu, Xinyu Zhu, Cheng Wang, Jiaao Chen, Hanrui Wang, Wei-Chen Wang, Di Jin, Siheng Chen

Abstract: Autonomous Machine Learning Engineering (MLE) requires agents to perform sustained, iterative optimization over long horizons. While recent LLM-based agents show promise, current prompt-based agents for MLE suffer from behavioral stagnation due to frozen parameters. Although Reinforcement Learning (RL) offers a remedy, applying it to MLE is hindered by prohibitive execution latency and inefficient data selection. Recognizing these challenges, we propose AceGRPO with two core components: (1) Evolving Data Buffer that continuously repurposes execution traces into reusable training tasks, and (2) Adaptive Sampling guided by a Learnability Potential function, which dynamically prioritizes tasks at the agent's learning frontier to maximize learning efficiency. Leveraging AceGRPO, our trained Ace-30B model achieves a 100% valid submission rate on MLE-Bench-Lite, approaches the performance of proprietary frontier models, and outperforms larger open-source baselines (e.g., DeepSeek-V3.2), demonstrating robust capability for sustained iterative optimization. Code is available at https://github.com/yuzhu-cai/AceGRPO.

URLs: https://github.com/yuzhu-cai/AceGRPO.

new CausalCompass: Evaluating the Robustness of Time-Series Causal Discovery in Misspecified Scenarios

Authors: Huiyang Yi, Xiaojian Shen, Yonggang Wu, Duxin Chen, He Wang, Wenwu Yu

Abstract: Causal discovery from time series is a fundamental task in machine learning. However, its widespread adoption is hindered by a reliance on untestable causal assumptions and by the lack of robustness-oriented evaluation in existing benchmarks. To address these challenges, we propose CausalCompass, a flexible and extensible benchmark suite designed to assess the robustness of time-series causal discovery (TSCD) methods under violations of modeling assumptions. To demonstrate the practical utility of CausalCompass, we conduct extensive benchmarking of representative TSCD algorithms across eight assumption-violation scenarios. Our experimental results indicate that no single method consistently attains optimal performance across all settings. Nevertheless, the methods exhibiting superior overall performance across diverse scenarios are almost invariably deep learning-based approaches. We further provide hyperparameter sensitivity analyses to deepen the understanding of these findings. We also find, somewhat surprisingly, that NTS-NOTEARS relies heavily on standardized preprocessing in practice, performing poorly in the vanilla setting but exhibiting strong performance after standardization. Finally, our work aims to provide a comprehensive and systematic evaluation of TSCD methods under assumption violations, thereby facilitating their broader adoption in real-world applications. The code and datasets are available at https://github.com/huiyang-yi/CausalCompass.

URLs: https://github.com/huiyang-yi/CausalCompass.

new A Kinetic-Energy Perspective of Flow Matching

Authors: Ziyun Li, Huancheng Hu, Soon Hoe Lim, Xuyu Li, Fei Gao, Enmao Diao, Zezhen Ding, Michalis Vazirgiannis, Henrik Bostrom

Abstract: Flow-based generative models can be viewed through a physics lens: sampling transports a particle from noise to data by integrating a time-varying velocity field, and each sample corresponds to a trajectory with its own dynamical effort. Motivated by classical mechanics, we introduce Kinetic Path Energy (KPE), an action-like, per-sample diagnostic that measures the accumulated kinetic effort along an Ordinary Differential Equation (ODE) trajectory. Empirically, KPE exhibits two robust correspondences: (i) higher KPE predicts stronger semantic fidelity; (ii) high-KPE trajectories terminate on low-density manifold frontiers. We further provide theoretical guarantees linking trajectory energy to data density. Paradoxically, this correlation is non-monotonic. At sufficiently high energy, generation can degenerate into memorization. Leveraging the closed-form of empirical flow matching, we show that extreme energies drive trajectories toward near-copies of training examples. This yields a Goldilocks principle and motivates Kinetic Trajectory Shaping (KTS), a training-free two-phase inference strategy that boosts early motion and enforces a late-time soft landing, reducing memorization and improving generation quality across benchmark tasks.

new Attention-Based Deep Learning for Early Parkinson's Disease Detection with Tabular Biomedical Data

Authors: Olamide Samuel Oseni, Ibraheem Omotolani Obanla, Toheeb Aduramomi Jimoh

Abstract: Early and accurate detection of Parkinson's disease (PD) remains a critical challenge in medical diagnostics due to the subtlety of early-stage symptoms and the complex, non-linear relationships inherent in biomedical data. Traditional machine learning (ML) models, though widely applied to PD detection, often rely on extensive feature engineering and struggle to capture complex feature interactions. This study investigates the effectiveness of attention-based deep learning models for early PD detection using tabular biomedical data. We present a comparative evaluation of four classification models: Multi-Layer Perceptron (MLP), Gradient Boosting, TabNet, and SAINT, using a benchmark dataset from the UCI Machine Learning Repository consisting of biomedical voice measurements from PD patients and healthy controls. Experimental results show that SAINT consistently outperformed all baseline models across multiple evaluation metrics, achieving a weighted precision of 0.98, weighted recall of 0.97, weighted F1-score of 0.97, a Matthews Correlation Coefficient (MCC) of 0.9990, and the highest Area Under the ROC Curve (AUC-ROC). TabNet and MLP demonstrated competitive performance, while Gradient Boosting yielded the lowest overall scores. The superior performance of SAINT is attributed to its dual attention mechanism, which effectively models feature interactions within and across samples. These findings demonstrate the diagnostic potential of attention-based deep learning architectures for early Parkinson's disease detection and highlight the importance of dynamic feature representation in clinical prediction tasks.

new A Thermodynamic Theory of Learning Part II: Critical Period Closure and Continual Learning Failure

Authors: Daisuke Okanohara

Abstract: Learning performed over finite time is necessarily irreversible. In Part~I of this series, we modeled learning as a transport process in the space of parameter distributions and derived the Epistemic Speed Limit, which lower-bounds entropy production under finite-time learning. In this work (Part~II), we study the consequences of this irreversibility for continual learning from a trajectory-level perspective. We show that finite dissipation constrains not only which solutions are reachable, but which learning paths remain dynamically accessible. Although a continuum of task-equivalent realizations can achieve identical task performance, finite-time learning irreversibly selects among these realizations. This selection occurs through the progressive elimination of degrees of freedom that would otherwise enable structural reconfiguration. We refer to this phenomenon as \emph{critical period closure}: beyond a certain stage of learning, transitions between compatible representations become dynamically inaccessible under any finite dissipation budget. As a result, continual learning failure arises not from the absence of solutions satisfying multiple tasks, but from an irreversible loss of representational freedom induced by prior learning. This reframes catastrophic forgetting as a dynamical constraint imposed by finite-time dissipation, rather than direct task interference.

new An Explainable Multi-Task Similarity Measure: Integrating Accumulated Local Effects and Weighted Fr\'echet Distance

Authors: Pablo Hidalgo, Daniel Rodriguez

Abstract: In many machine learning contexts, tasks are often treated as interconnected components with the goal of leveraging knowledge transfer between them, which is the central aim of Multi-Task Learning (MTL). Consequently, this multi-task scenario requires addressing critical questions: which tasks are similar, and how and why do they exhibit similarity? In this work, we propose a multi-task similarity measure based on Explainable Artificial Intelligence (XAI) techniques, specifically Accumulated Local Effects (ALE) curves. ALE curves are compared using the Fr\'echet distance, weighted by the data distribution, and the resulting similarity measure incorporates the importance of each feature. The measure is applicable in both single-task learning scenarios, where each task is trained separately, and multi-task learning scenarios, where all tasks are learned simultaneously. The measure is model-agnostic, allowing the use of different machine learning models across tasks. A scaling factor is introduced to account for differences in predictive performance across tasks, and several recommendations are provided for applying the measure in complex scenarios. We validate this measure using four datasets, one synthetic dataset and three real-world datasets. The real-world datasets include a well-known Parkinson's dataset and a bike-sharing usage dataset -- both structured in tabular format -- as well as the CelebA dataset, which is used to evaluate the application of concept bottleneck encoders in a multitask learning setting. The results demonstrate that the measure aligns with intuitive expectations of task similarity across both tabular and non-tabular data, making it a valuable tool for exploring relationships between tasks and supporting informed decision-making.

new On Improving Neurosymbolic Learning by Exploiting the Representation Space

Authors: Aaditya Naik, Efthymia Tsamoura, Shibo Jin, Mayur Naik, Dan Roth

Abstract: We study the problem of learning neural classifiers in a neurosymbolic setting where the hidden gold labels of input instances must satisfy a logical formula. Learning in this setting proceeds by first computing (a subset of) the possible combinations of labels that satisfy the formula and then computing a loss using those combinations and the classifiers' scores. One challenge is that the space of label combinations can grow exponentially, making learning difficult. We propose a technique that prunes this space by exploiting the intuition that instances with similar latent representations are likely to share the same label. While this intuition has been widely used in weakly supervised learning, its application in our setting is challenging due to label dependencies imposed by logical constraints. We formulate the pruning process as an integer linear program that discards inconsistent label combinations while respecting logical structure. Our approach, CLIPPER, is orthogonal to existing training algorithms and can be seamlessly integrated with them. Across 16 benchmarks over complex neurosymbolic tasks, we demonstrate that CLIPPER boosts the performance of state-of-the-art neurosymbolic engines like Scallop, Dolphin, and ISED by up to 48%, 53%, and 8%, leading to state-of-the-art accuracies.

new Beyond Optimization: Intelligence as Metric-Topology Factorization under Geometric Incompleteness

Authors: Xin Li

Abstract: Contemporary ML often equates intelligence with optimization: searching for solutions within a fixed representational geometry. This works in static regimes but breaks under distributional shift, task permutation, and continual learning, where even mild topological changes can invalidate learned solutions and trigger catastrophic forgetting. We propose Metric-Topology Factorization (MTF) as a unifying geometric principle: intelligence is not navigation through a fixed maze, but the ability to reshape representational geometry so desired behaviors become stable attractors. Learning corresponds to metric contraction (a controlled deformation of Riemannian structure), while task identity and environmental variation are encoded topologically and stored separately in memory. We show any fixed metric is geometrically incomplete: for any local metric representation, some topological transformations make it singular or incoherent, implying an unavoidable stability-plasticity tradeoff for weight-based systems. MTF resolves this by factorizing stable topology from plastic metric warps, enabling rapid adaptation via geometric switching rather than re-optimization. Building on this, we introduce the Topological Urysohn Machine (TUM), implementing MTF through memory-amortized metric inference (MAMI): spectral task signatures index amortized metric transformations, letting a single learned geometry be reused across permuted, reflected, or parity-altered environments. This explains robustness to task reordering, resistance to catastrophic forgetting, and generalization across transformations that defeat conventional continual learning methods (e.g., EWC).

new When Is Compositional Reasoning Learnable from Verifiable Rewards?

Authors: Daniel Barzilai, Yotam Wolf, Ronen Basri

Abstract: The emergence of compositional reasoning in large language models through reinforcement learning with verifiable rewards (RLVR) has been a key driver of recent empirical successes. Despite this progress, it remains unclear which compositional problems are learnable in this setting using outcome-level feedback alone. In this work, we theoretically study the learnability of compositional problems in autoregressive models under RLVR training. We identify a quantity that we call the task-advantage ratio, a joint property of the compositional problem and the base model, that characterizes which tasks and compositions are learnable from outcome-level feedback. On the positive side, using this characterization, we show that compositional problems where correct intermediate steps provide a clear advantage are efficiently learnable with RLVR. We also analyze how such an advantage naturally arises in different problems. On the negative side, when the structural advantage is not present, RLVR may converge to suboptimal compositions. We prove that, in some cases, the quality of the base model determines if such an advantage exists and whether RLVR will converge to a suboptimal solution. We hope our analysis can provide a principled theoretical understanding of when and why RLVR succeeds and when it does not.

new Regret Analysis of Unichain Average Reward Constrained MDPs with General Parameterization

Authors: Anirudh Satheesh, Vaneet Aggarwal

Abstract: We study infinite-horizon average-reward constrained Markov decision processes (CMDPs) under the unichain assumption and general policy parameterizations. Existing regret analyses for constrained reinforcement learning largely rely on ergodicity or strong mixing-time assumptions, which fail to hold in the presence of transient states. We propose a primal--dual natural actor--critic algorithm that leverages multi-level Monte Carlo (MLMC) estimators and an explicit burn-in mechanism to handle unichain dynamics without requiring mixing-time oracles. Our analysis establishes finite-time regret and cumulative constraint violation bounds that scale as $\tilde{O}(\sqrt{T})$, up to approximation errors arising from policy and critic parameterization, thereby extending order-optimal guarantees to a significantly broader class of CMDPs.

new Don't Always Pick the Highest-Performing Model: An Information Theoretic View of LLM Ensemble Selection

Authors: Yigit Turkmen, Baturalp Buyukates, Melih Bastopcu

Abstract: Large language models (LLMs) are often ensembled together to improve overall reliability and robustness, but in practice models are strongly correlated. This raises a fundamental question: which models should be selected when forming an LLM ensemble? We formulate budgeted ensemble selection as maximizing the mutual information between the true label and predictions of the selected models. Furthermore, to explain why performance can saturate even with many models, we model the correlated errors of the models using Gaussian-copula and show an information-theoretic error floor for the performance of the ensemble. Motivated by these, we propose a simple greedy mutual-information selection algorithm that estimates the required information terms directly from data and iteratively builds an ensemble under a query budget. We test our approach in two question answering datasets and one binary sentiment classification dataset: MEDMCQA, MMLU, and IMDB movie reviews. Across all datasets, we observe that our method consistently outperforms strong baselines under the same query budget.

new From $O(mn)$ to $O(r^2)$: Two-Sided Low-Rank Communication for Adam in Distributed Training with Memory Efficiency

Authors: Sizhe Dang, Jiaqi Shao, Xiaodong Zheng, Guang Dai, Yan Song, Haishan Ye

Abstract: As foundation models continue to scale, pretraining increasingly relies on data-parallel distributed optimization, making bandwidth-limited gradient synchronization a key bottleneck. Orthogonally, projection-based low-rank optimizers were mainly designed for memory efficiency, but remain suboptimal for communication-limited training: one-sided synchronization still transmits an $O(rn)$ object for an $m\times n$ matrix gradient and refresh steps can dominate peak communicated bytes. We propose TSR, which brings two-sided low-rank communication to Adam-family updates (TSR-Adam) by synchronizing a compact core $U^\top G V\in\mathbb{R}^{r\times r}$, reducing the dominant per-step payload from $O(mn)$ to $O(r^2)$ while keeping moment states in low-dimensional cores. To further reduce the peak communication from subspace refresh, TSR-Adam adopts a randomized SVD-based refresh that avoids full-gradient synchronization. We additionally extend low-rank communication to embedding gradients with embedding-specific ranks and refresh schedules, yielding additional communication and memory savings over keeping embeddings dense. Across pretraining from 60M to 1B model scales, TSR-Adam reduces average communicated bytes per step by $13\times$, and on GLUE fine-tuning it reduces communication by $25\times$, while achieving comparable performance; we further provide a theoretical stationarity analysis for the proposed update. Code is available at https://github.com/DKmiyan/TSR-Adam.

URLs: https://github.com/DKmiyan/TSR-Adam.

new A Unified Density Operator View of Flow Control and Merging

Authors: Riccardo De Santi, Malte Franke, Ya-Ping Hsieh, Andreas Krause

Abstract: Recent progress in large-scale flow and diffusion models raised two fundamental algorithmic challenges: (i) control-based reward adaptation of pre-trained flows, and (ii) integration of multiple models, i.e., flow merging. While current approaches address them separately, we introduce a unifying probability-space framework that subsumes both as limit cases, and enables reward-guided flow merging, allowing principled, task-aware combination of multiple pre-trained flows (e.g., merging priors while maximizing drug-discovery utilities). Our formulation renders possible to express a rich family of operators over generative models densities, including intersection (e.g., to enforce safety), union (e.g., to compose diverse models), interpolation (e.g., for discovery), their reward-guided counterparts, as well as complex logical expressions via generative circuits. Next, we introduce Reward-Guided Flow Merging (RFM), a mirror-descent scheme that reduces reward-guided flow merging to a sequence of standard fine-tuning problems. Then, we provide first-of-their-kind theoretical guarantees for reward-guided and pure flow merging via RFM. Ultimately, we showcase the capabilities of the proposed method on illustrative settings providing visually interpretable insights, and apply our method to high-dimensional de-novo molecular design and low-energy conformer generation.

new The Rise of Sparse Mixture-of-Experts:A Survey from Algorithmic Foundations to Decentralized Architectures and Vertical Domain Applications

Authors: Dong Pan, Bingtao Li, Yongsheng Zheng, Jiren Ma, Victor Fei

Abstract: The sparse Mixture of Experts(MoE) architecture has evolved as a powerful approach for scaling deep learning models to more parameters with comparable computation cost. As an important branch of large language model(LLM), MoE model only activate a subset of experts based on a routing network. This sparse conditional computation mechanism significantly improves computational efficiency, paving a promising path for greater scalability and cost-efficiency. It not only enhance downstream applications such as natural language processing, computer vision, and multimodal in various horizontal domains, but also exhibit broad applicability across vertical domains. Despite the growing popularity and application of MoE models across various domains, there lacks a systematic exploration of recent advancements of MoE in many important fields. Existing surveys on MoE suffer from limitations such as lack coverage or none extensively exploration of key areas. This survey seeks to fill these gaps. In this paper, Firstly, we examine the foundational principles of MoE, with an in-depth exploration of its core components-the routing network and expert network. Subsequently, we extend beyond the centralized paradigm to the decentralized paradigm, which unlocks the immense untapped potential of decentralized infrastructure, enables democratization of MoE development for broader communities, and delivers greater scalability and cost-efficiency. Furthermore we focus on exploring its vertical domain applications. Finally, we also identify key challenges and promising future research directions. To the best of our knowledge, this survey is currently the most comprehensive review in the field of MoE. We aim for this article to serve as a valuable resource for both researchers and practitioners, enabling them to navigate and stay up-to-date with the latest advancements.

new Sharp analysis of linear ensemble sampling

Authors: Arya Akhavan, David Janz, Csaba Szepesv\'ari

Abstract: We analyse linear ensemble sampling (ES) with standard Gaussian perturbations in stochastic linear bandits. We show that for ensemble size $m=\Theta(d\log n)$, ES attains $\tilde O(d^{3/2}\sqrt n)$ high-probability regret, closing the gap to the Thompson sampling benchmark while keeping computation comparable. The proof brings a new perspective on randomized exploration in linear bandits by reducing the analysis to a time-uniform exceedance problem for $m$ independent Brownian motions. Intriguingly, this continuous-time lens is not forced; it appears natural--and perhaps necessary: the discrete-time problem seems to be asking for a continuous-time solution, and we know of no other way to obtain a sharp ES bound.

new Horizon Imagination: Efficient On-Policy Training in Diffusion World Models

Authors: Lior Cohen, Ofir Nabati, Kaixin Wang, Navdeep Kumar, Shie Mannor

Abstract: We study diffusion-based world models for reinforcement learning, which offer high generative fidelity but face critical efficiency challenges in control. Current methods either require heavyweight models at inference or rely on highly sequential imagination, both of which impose prohibitive computational costs. We propose Horizon Imagination (HI), an on-policy imagination process for discrete stochastic policies that denoises multiple future observations in parallel. HI incorporates a stabilization mechanism and a novel sampling schedule that decouples the denoising budget from the effective horizon over which denoising is applied while also supporting sub-frame budgets. Experiments on Atari 100K and Craftium show that our approach maintains control performance with a sub-frame budget of half the denoising steps and achieves superior generation quality under varied schedules. Code is available at https://github.com/leor-c/horizon-imagination.

URLs: https://github.com/leor-c/horizon-imagination.

new The Benefits of Diversity: Combining Comparisons and Ratings for Efficient Scoring

Authors: Julien Fageot, Matthias Grossglauser, L\^e-Nguy\^en Hoang, Matteo Tacchi-B\'enard, Oscar Villemaud

Abstract: Should humans be asked to evaluate entities individually or comparatively? This question has been the subject of long debates. In this work, we show that, interestingly, combining both forms of preference elicitation can outperform the focus on a single kind. More specifically, we introduce SCoRa (Scoring from Comparisons and Ratings), a unified probabilistic model that allows to learn from both signals. We prove that the MAP estimator of SCoRa is well-behaved. It verifies monotonicity and robustness guarantees. We then empirically show that SCoRa recovers accurate scores, even under model mismatch. Most interestingly, we identify a realistic setting where combining comparisons and ratings outperforms using either one alone, and when the accurate ordering of top entities is critical. Given the de facto availability of signals of multiple forms, SCoRa additionally offers a versatile foundation for preference learning.

new TAAM:Inductive Graph-Class Incremental Learning with Task-Aware Adaptive Modulation

Authors: Jingtao Liu, Xinming Zhang

Abstract: Graph Continual Learning (GCL) aims to solve the challenges of streaming graph data. However, current methods often depend on replay-based strategies, which raise concerns like memory limits and privacy issues, while also struggling to resolve the stability-plasticity dilemma. In this paper, we suggest that lightweight, task-specific modules can effectively guide the reasoning process of a fixed GNN backbone. Based on this idea, we propose Task-Aware Adaptive Modulation (TAAM). The key component of TAAM is its lightweight Neural Synapse Modulators (NSMs). For each new task, a dedicated NSM is trained and then frozen, acting as an "expert module." These modules perform detailed, node-attentive adaptive modulation on the computational flow of a shared GNN backbone. This setup ensures that new knowledge is kept within compact, task-specific modules, naturally preventing catastrophic forgetting without using any data replay. Additionally, to address the important challenge of unknown task IDs in real-world scenarios, we propose and theoretically prove a novel method named Anchored Multi-hop Propagation (AMP). Notably, we find that existing GCL benchmarks have flaws that can cause data leakage and biased evaluations. Therefore, we conduct all experiments in a more rigorous inductive learning scenario. Extensive experiments show that TAAM comprehensively outperforms state-of-the-art methods across eight datasets. Code and Datasets are available at: https://github.com/1iuJT/TAAM_AAMAS2026.

URLs: https://github.com/1iuJT/TAAM_AAMAS2026.

new FIRE: Frobenius-Isometry Reinitialization for Balancing the Stability-Plasticity Tradeoff

Authors: Isaac Han, Sangyeon Park, Seungwon Oh, Donghu Kim, Hojoon Lee, Kyung-Joong Kim

Abstract: Deep neural networks trained on nonstationary data must balance stability (i.e., retaining prior knowledge) and plasticity (i.e., adapting to new tasks). Standard reinitialization methods, which reinitialize weights toward their original values, are widely used but difficult to tune: conservative reinitializations fail to restore plasticity, while aggressive ones erase useful knowledge. We propose FIRE, a principled reinitialization method that explicitly balances the stability-plasticity tradeoff. FIRE quantifies stability through Squared Frobenius Error (SFE), measuring proximity to past weights, and plasticity through Deviation from Isometry (DfI), reflecting weight isotropy. The reinitialization point is obtained by solving a constrained optimization problem, minimizing SFE subject to DfI being zero, which is efficiently approximated by Newton-Schulz iteration. FIRE is evaluated on continual visual learning (CIFAR-10 with ResNet-18), language modeling (OpenWebText with GPT-0.1B), and reinforcement learning (HumanoidBench with SAC and Atari games with DQN). Across all domains, FIRE consistently outperforms both naive training without intervention and standard reinitialization methods, demonstrating effective balancing of the stability-plasticity tradeoff.

new Implicit Strategic Optimization: Rethinking Long-Horizon Decision-Making in Adversarial Poker Environments

Authors: Boyang Xia, Weiyou Tian, Qingnan Ren, Jiaqi Huang, Jie Xiao, Shuo Lu, Kai Wang, Lynn Ai, Eric Yang, Bill Shi

Abstract: Training large language model (LLM) agents for adversarial games is often driven by episodic objectives such as win rate. In long-horizon settings, however, payoffs are shaped by latent strategic externalities that evolve over time, so myopic optimization and variation-based regret analyses can become vacuous even when the dynamics are predictable. To solve this problem, we introduce Implicit Strategic Optimization (ISO), a prediction-aware framework in which each agent forecasts the current strategic context and uses it to update its policy online. ISO combines a Strategic Reward Model (SRM) that estimates the long-run strategic value of actions with iso-grpo, a context-conditioned optimistic learning rule. We prove sublinear contextual regret and equilibrium convergence guarantees whose dominant terms scale with the number of context mispredictions; when prediction errors are bounded, our bounds recover the static-game rates obtained when strategic externalities are known. Experiments in 6-player No-Limit Texas Hold'em and competitive Pokemon show consistent improvements in long-term return over strong LLM and RL baselines, and graceful degradation under controlled prediction noise.

new V-ABFT: Variance-Based Adaptive Threshold for Fault-Tolerant Matrix Multiplication in Mixed-Precision Deep Learning

Authors: Yiheng Gao, Qin Hua, Zizhong Chen

Abstract: Algorithm-Based Fault Tolerance (ABFT) is widely adopted to detect silent data corruptions (SDCs) in matrix multiplication, a cornerstone operation in deep learning systems. However, existing threshold determination methods face critical challenges: analytical bounds are overly conservative, while probabilistic approaches like A-ABFT yield thresholds $160$--$4200\times$ larger than actual rounding errors. We present V-ABFT, a variance-based adaptive threshold algorithm that achieves tighter error bounds by directly modeling the verification difference. By leveraging statistical variance estimation, V-ABFT reduces the threshold-to-actual-error ratio to approximately $7$--$20\times$ for FP32/FP64 and $48$--$158\times$ for BF16, representing a \textbf{6--48$\times$ improvement} over A-ABFT while maintaining zero false positive rate across BF16, FP16, FP32, and FP64 precisions. Furthermore, we demonstrate that for fused-kernel ABFT implementations that verify before output quantization, low-precision GEMM can use FP32-level thresholds ($e_{\max} \approx 10^{-6}$), enabling \textbf{$\sim$1000$\times$ finer detection granularity} compared to offline verification with low-precision output ($e_{\max} \approx 10^{-3}$). We reproduce A-ABFT's experimental setup and validate our implementation against the original paper's results. Our method requires only $O(n)$ complexity using max/min/mean statistics, compared to A-ABFT's $O(pn)$ for finding $p$ largest values. Extensive experiments on synthetic data and real model weights (LLaMA-7B, GPT-2, ViT) demonstrate V-ABFT's effectiveness across diverse distributions. V-ABFT is platform-agnostic and has been integrated into fault-tolerant GEMM implementations on both NPUs and GPUs.

new Interpretable Fuzzy Systems For Forward Osmosis Desalination

Authors: Qusai Khaled, Uzay Kaymak, Laura Genga

Abstract: Preserving interpretability in fuzzy rule-based systems (FRBS) is vital for water treatment, where decisions impact public health. While structural interpretability has been addressed using multi-objective algorithms, semantic interpretability often suffers due to fuzzy sets with low distinguishability. We propose a human-in-the-loop approach for developing interpretable FRBS to predict forward osmosis desalination productivity. Our method integrates expert-driven grid partitioning for distinguishable membership functions, domain-guided feature engineering to reduce redundancy, and rule pruning based on firing strength. This approach achieved comparable predictive performance to cluster-based FRBS while maintaining semantic interpretability and meeting structural complexity constraints, providing an explainable solution for water treatment applications.

new Epigraph-Guided Flow Matching for Safe and Performant Offline Reinforcement Learning

Authors: Manan Tayal, Mumuksh Tayal

Abstract: Offline reinforcement learning (RL) provides a compelling paradigm for training autonomous systems without the risks of online exploration, particularly in safety-critical domains. However, jointly achieving strong safety and performance from fixed datasets remains challenging. Existing safe offline RL methods often rely on soft constraints that allow violations, introduce excessive conservatism, or struggle to balance safety, reward optimization, and adherence to the data distribution. To address this, we propose Epigraph-Guided Flow Matching (EpiFlow), a framework that formulates safe offline RL as a state-constrained optimal control problem to co-optimize safety and performance. We learn a feasibility value function derived from an epigraph reformulation of the optimal control problem, thereby avoiding the decoupled objectives or post-hoc filtering common in prior work. Policies are synthesized by reweighting the behavior distribution based on this epigraph value function and fitting a generative policy via flow matching, enabling efficient, distribution-consistent sampling. Across various safety-critical tasks, including Safety-Gymnasium benchmarks, EpiFlow achieves competitive returns with near-zero empirical safety violations, demonstrating the effectiveness of epigraph-guided policy synthesis.

new Compiler-Assisted Speculative Sampling for Accelerated LLM Inference on Heterogeneous Edge Devices

Authors: Alejandro Ruiz y Mesa, Guilherme Korol, Moritz Riesteter, Jo\~ao Paulo Cardoso de Lima, Jeronimo Castrillon

Abstract: LLM deployment on resource-constrained edge devices faces severe latency constraints, particularly in real-time applications where delayed responses can compromise safety or usability. Among many approaches to mitigate the inefficiencies of sequential token-by-token generation, Speculative Decoding (SD) has emerged as a promising technique. However, SD at the edge is hindered by two major challenges: (1) integrating SD into a compiler-based workflow without sacrificing performance or programmability, and (2) exploiting the heterogeneous compute resources of modern SoCs through carefully designed partitioning strategies. This work addresses these challenges by using an analytical cost model that explores heterogeneous hardware configurations and guides coarse-grained partitioning of LLM subgraphs, particularly with edge-typical short input sequence lengths. The cost model predicts when speculative sampling and heterogeneous execution are jointly beneficial and is validated on an edge device featuring a hexacore Cortex-A CPU and a Mali GPU, revealing up to 1.68$\times$ speedup for translation tasks, closely matching analytic expectations.

new Efficient and Adaptable Detection of Malicious LLM Prompts via Bootstrap Aggregation

Authors: Shayan Ali Hassan, Tao Ni, Zafar Ayyub Qazi, Marco Canini

Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities in natural language understanding, reasoning, and generation. However, these systems remain susceptible to malicious prompts that induce unsafe or policy-violating behavior through harmful requests, jailbreak techniques, and prompt injection attacks. Existing defenses face fundamental limitations: black-box moderation APIs offer limited transparency and adapt poorly to evolving threats, while white-box approaches using large LLM judges impose prohibitive computational costs and require expensive retraining for new attacks. Current systems force designers to choose between performance, efficiency, and adaptability. To address these challenges, we present BAGEL (Bootstrap AGgregated Ensemble Layer), a modular, lightweight, and incrementally updatable framework for malicious prompt detection. BAGEL employs a bootstrap aggregation and mixture of expert inspired ensemble of fine-tuned models, each specialized on a different attack dataset. At inference, BAGEL uses a random forest router to identify the most suitable ensemble member, then applies stochastic selection to sample additional members for prediction aggregation. When new attacks emerge, BAGEL updates incrementally by fine-tuning a small prompt-safety classifier (86M parameters) and adding the resulting model to the ensemble. BAGEL achieves an F1 score of 0.92 by selecting just 5 ensemble members (430M parameters), outperforming OpenAI Moderation API and ShieldGemma which require billions of parameters. Performance remains robust after nine incremental updates, and BAGEL provides interpretability through its router's structural features. Our results show ensembles of small finetuned classifiers can match or exceed billion-parameter guardrails while offering the adaptability and efficiency required for production systems.

new Efficient Distribution Learning with Error Bounds in Wasserstein Distance

Authors: Eduardo Figueiredo, Steven Adams, Luca Laurenti

Abstract: The Wasserstein distance has emerged as a key metric to quantify distances between probability distributions, with applications in various fields, including machine learning, control theory, decision theory, and biological systems. Consequently, learning an unknown distribution with non-asymptotic and easy-to-compute error bounds in Wasserstein distance has become a fundamental problem in many fields. In this paper, we devise a novel algorithmic and theoretical framework to approximate an unknown probability distribution $\mathbb{P}$ from a finite set of samples by an approximate discrete distribution $\widehat{\mathbb{P}}$ while bounding the Wasserstein distance between $\mathbb{P}$ and $\widehat{\mathbb{P}}$. Our framework leverages optimal transport, nonlinear optimization, and concentration inequalities. In particular, we show that, even if $\mathbb{P}$ is unknown, the Wasserstein distance between $\mathbb{P}$ and $\widehat{\mathbb{P}}$ can be efficiently bounded with high confidence by solving a tractable optimization problem (a mixed integer linear program) of a size that only depends on the size of the support of $\widehat{\mathbb{P}}$. This enables us to develop intelligent clustering algorithms to optimally find the support of $\widehat{\mathbb{P}}$ while minimizing the Wasserstein distance error. On a set of benchmarks, we demonstrate that our approach outperforms state-of-the-art comparable methods by generally returning approximating distributions with substantially smaller support and tighter error bounds.

new SiameseNorm: Breaking the Barrier to Reconciling Pre/Post-Norm

Authors: Tianyu Li, Dongchen Han, Zixuan Cao, Haofeng Huang, Mengyu Zhou, Ming Chen, Erchao Zhao, Xiaoxi Jiang, Guanjun Jiang, Gao Huang

Abstract: Modern Transformers predominantly adopt the Pre-Norm paradigm for its optimization stability, foregoing the superior potential of the unstable Post-Norm architecture. Prior attempts to combine their strengths typically lead to a stability-performance trade-off. We attribute this phenomenon to a structural incompatibility within a single-stream design: Any application of the Post-Norm operation inevitably obstructs the clean identity gradient preserved by Pre-Norm. To fundamentally reconcile these paradigms, we propose SiameseNorm, a two-stream architecture that couples Pre-Norm-like and Post-Norm-like streams with shared parameters. This design decouples the optimization dynamics of the two streams, retaining the distinct characteristics of both Pre-Norm and Post-Norm by enabling all residual blocks to receive combined gradients inherited from both paradigms, where one stream secures stability while the other enhances expressivity. Extensive pre-training experiments on 1.3B-parameter models demonstrate that SiameseNorm exhibits exceptional optimization robustness and consistently outperforms strong baselines. Code is available at https://github.com/Qwen-Applications/SiameseNorm.

URLs: https://github.com/Qwen-Applications/SiameseNorm.

new Enhancing Bandit Algorithms with LLMs for Time-varying User Preferences in Streaming Recommendations

Authors: Chenglei Shen, Yi Zhan, Weijie Yu, Xiao Zhang, Jun Xu

Abstract: In real-world streaming recommender systems, user preferences evolve dynamically over time. Existing bandit-based methods treat time merely as a timestamp, neglecting its explicit relationship with user preferences and leading to suboptimal performance. Moreover, online learning methods often suffer from inefficient exploration-exploitation during the early online phase. To address these issues, we propose HyperBandit+, a novel contextual bandit policy that integrates a time-aware hypernetwork to adapt to time-varying user preferences and employs a large language model-assisted warm-start mechanism (LLM Start) to enhance exploration-exploitation efficiency in the early online phase. Specifically, HyperBandit+ leverages a neural network that takes time features as input and generates parameters for estimating time-varying rewards by capturing the correlation between time and user preferences. Additionally, the LLM Start mechanism employs multi-step data augmentation to simulate realistic interaction data for effective offline learning, providing warm-start parameters for the bandit policy in the early online phase. To meet real-time streaming recommendation demands, we adopt low-rank factorization to reduce hypernetwork training complexity. Theoretically, we rigorously establish a sublinear regret upper bound that accounts for both the hypernetwork and the LLM warm-start mechanism. Extensive experiments on real-world datasets demonstrate that HyperBandit+ consistently outperforms state-of-the-art baselines in terms of accumulated rewards.

new Multimodal normative modeling in Alzheimers Disease with introspective variational autoencoders

Authors: Sayantan Kumar, Peijie Qiu, Aristeidis Sotiras

Abstract: Normative modeling learns a healthy reference distribution and quantifies subject-specific deviations to capture heterogeneous disease effects. In Alzheimers disease (AD), multimodal neuroimaging offers complementary signals but VAE-based normative models often (i) fit the healthy reference distribution imperfectly, inflating false positives, and (ii) use posterior aggregation (e.g., PoE/MoE) that can yield weak multimodal fusion in the shared latent space. We propose mmSIVAE, a multimodal soft-introspective variational autoencoder combined with Mixture-of-Product-of-Experts (MOPOE) aggregation to improve reference fidelity and multimodal integration. We compute deviation scores in latent space and feature space as distances from the learned healthy distributions, and map statistically significant latent deviations to regional abnormalities for interpretability. On ADNI MRI regional volumes and amyloid PET SUVR, mmSIVAE improves reconstruction on held-out controls and produces more discriminative deviation scores for outlier detection than VAE baselines, with higher likelihood ratios and clearer separation between control and AD-spectrum cohorts. Deviation maps highlight region-level patterns aligned with established AD-related changes. More broadly, our results highlight the importance of training objectives that prioritize reference-distribution fidelity and robust multimodal posterior aggregation for normative modeling, with implications for deviation-based analysis across multimodal clinical data.

new Spectral Guardrails for Agents in the Wild: Detecting Tool Use Hallucinations via Attention Topology

Authors: Valentin No\"el

Abstract: Deploying autonomous agents in the wild requires reliable safeguards against tool use failures. We propose a training free guardrail based on spectral analysis of attention topology that complements supervised approaches. On Llama 3.1 8B, our method achieves 97.7\% recall with multi-feature detection and 86.1\% recall with 81.0\% precision for balanced deployment, without requiring any labeled training data. Most remarkably, we discover that single layer spectral features act as near-perfect hallucination detectors: Llama L26 Smoothness achieves 98.2\% recall (213/217 hallucinations caught) with a single threshold, and Mistral L3 Entropy achieves 94.7\% recall. This suggests hallucination is not merely a wrong token but a thermodynamic state change: the model's attention becomes noise when it errs. Through controlled cross-model evaluation on matched domains ($N=1000$, $T=0.3$, same General domain, hallucination rates 20--22\%), we reveal the ``Loud Liar'' phenomenon: Llama 3.1 8B's failures are spectrally catastrophic and dramatically easier to detect, while Mistral 7B achieves the best discrimination (AUC 0.900). These findings establish spectral analysis as a principled, efficient framework for agent safety.

new Probability Hacking and the Design of Trustworthy ML for Signal Processing in C-UAS: A Scenario Based Method

Authors: Liisa Janssens, Laura Middeldorp

Abstract: In order to counter the various threats manifested by Unmanned Aircraft Systems (UAS) adequately, specialized Counter Unmanned Aircraft Systems (C-UAS) are required. Enhancing C-UAS with Emerging and Disruptive Technologies (EDTs) such as Artificial Intelligence (AI) can lead to more effective countermeasures. In this paper a scenario-based method is applied to C-UAS augmented with Machine Learning (ML), a subset of AI, that can enhance signal processing capabilities. Via the scenarios-based method we frame in this paper probability hacking as a challenge and identify requirements which can be implemented in existing Rule of Law mechanisms to prevent probability hacking. These requirements strengthen the trustworthiness of the C-UAS, which feed into justified trust - a key to successful Human-Autonomy Teaming, in civil and military contexts. Index Terms: C-UAS, Scenario-based method, Emerging and Disruptive Technologies, Probability hacking, Trustworthiness.

new Online Domain-aware LLM Decoding for Continual Domain Evolution

Authors: Mohammad Abu-Shaira, Weishi Shi

Abstract: LLMs are typically fine-tuned offline on domain-specific data, assuming a static domain. In practice, domain knowledge evolves continuously through new regulations, products, services, and interaction patterns. Retraining or fine-tuning LLMs for every new instance is computationally infeasible. Additionally, real-world environments also exhibit temporal dynamics with shifting data distributions. Disregarding this phenomenon, commonly referred to as concept drift, can significantly diminish a model's predictive accuracy. This mismatch between evolving domains and static adaptation pipelines highlights the need for efficient, real-time adaptation without costly retraining. In response, we introduce Online Domain-aware Decoding framework (ODD). ODD performs probability-level fusion between a base LLM and a prefix-tree prior, guided by adaptive confidence modulation using disagreement and continuity signals. Empirical evaluation under diverse drift scenarios demonstrates that ODD consistently surpasses LLM-Greedy and LLM-Temp Scaled across all syntactic and semantic NLG metrics. It yields an absolute ROUGE-L gain of 0.065 and a 13.6% relative improvement in Cosine Similarity over the best baseline. These results demonstrate ODD 's robustness to evolving lexical and contextual patterns, making it suitable for dynamic LLM applications.

new Mutual information and task-relevant latent dimensionality

Authors: Paarth Gulati, Eslam Abdelaleem, Audrey Sederberg, Ilya Nemenman

Abstract: Estimating the dimensionality of the latent representation needed for prediction -- the task-relevant dimension -- is a difficult, largely unsolved problem with broad scientific applications. We cast it as an Information Bottleneck question: what embedding bottleneck dimension is sufficient to compress predictor and predicted views while preserving their mutual information (MI). This repurposes neural MI estimators for dimensionality estimation. We show that standard neural estimators with separable/bilinear critics systematically inflate the inferred dimension, and we address this by introducing a hybrid critic that retains an explicit dimensional bottleneck while allowing flexible nonlinear cross-view interactions, thereby preserving the latent geometry. We further propose a one-shot protocol that reads off the effective dimension from a single over-parameterized hybrid model, without sweeping over bottleneck sizes. We validate the approach on synthetic problems with known task-relevant dimension. We extend the approach to intrinsic dimensionality by constructing paired views of a single dataset, enabling comparison with classical geometric dimension estimators. In noisy regimes where those estimators degrade, our approach remains reliable. Finally, we demonstrate the utility of the method on multiple physics datasets.

new Online Bayesian Imbalanced Learning with Bregman-Calibrated Deep Networks

Authors: Zahir Alsulaimawi

Abstract: Class imbalance remains a fundamental challenge in machine learning, where standard classifiers exhibit severe performance degradation in minority classes. Although existing approaches address imbalance through resampling or cost-sensitive learning during training, they require retraining or access to labeled target data when class distributions shift at deployment time, a common occurrence in real-world applications such as fraud detection, medical diagnosis, and anomaly detection. We present \textit{Online Bayesian Imbalanced Learning} (OBIL), a principled framework that decouples likelihood-ratio estimation from class-prior assumptions, enabling real-time adaptation to distribution shifts without model retraining. Our approach builds on the established connection between Bregman divergences and proper scoring rules to show that deep networks trained with such losses produce posterior probability estimates from which prior-invariant likelihood ratios can be extracted. We prove that these likelihood-ratio estimates remain valid under arbitrary changes in class priors and cost structures, requiring only a threshold adjustment for optimal Bayes decisions. We derive finite-sample regret bounds demonstrating that OBIL achieves $O(\sqrt{T \log T})$ regret against an oracle with perfect prior knowledge. Extensive experiments on benchmark datasets and medical diagnosis benchmarks under simulated deployment shifts demonstrate that OBIL maintains robust performance under severe distribution shifts, outperforming state-of-the-art methods in F1 Score when test distributions deviate significantly from the training conditions.

new Variance-Gated Ensembles: An Epistemic-Aware Framework for Uncertainty Estimation

Authors: H. Martin Gillis, Isaac Xu, Thomas Trappenberg

Abstract: Machine learning applications require fast and reliable per-sample uncertainty estimation. A common approach is to use predictive distributions from Bayesian or approximation methods and additively decompose uncertainty into aleatoric (i.e., data-related) and epistemic (i.e., model-related) components. However, additive decomposition has recently been questioned, with evidence that it breaks down when using finite-ensemble sampling and/or mismatched predictive distributions. This paper introduces Variance-Gated Ensembles (VGE), an intuitive, differentiable framework that injects epistemic sensitivity via a signal-to-noise gate computed from ensemble statistics. VGE provides: (i) a Variance-Gated Margin Uncertainty (VGMU) score that couples decision margins with ensemble predictive variance; and (ii) a Variance-Gated Normalization (VGN) layer that generalizes the variance-gated uncertainty mechanism to training via per-class, learnable normalization of ensemble member probabilities. We derive closed-form vector-Jacobian products enabling end-to-end training through ensemble sample mean and variance. VGE matches or exceeds state-of-the-art information-theoretic baselines while remaining computationally efficient. As a result, VGE provides a practical and scalable approach to epistemic-aware uncertainty estimation in ensemble models. An open-source implementation is available at: https://github.com/nextdevai/vge.

URLs: https://github.com/nextdevai/vge.

new Reliable and Responsible Foundation Models: A Comprehensive Survey

Authors: Xinyu Yang, Junlin Han, Rishi Bommasani, Jinqi Luo, Wenjie Qu, Wangchunshu Zhou, Adel Bibi, Xiyao Wang, Jaehong Yoon, Elias Stengel-Eskin, Shengbang Tong, Lingfeng Shen, Rafael Rafailov, Runjia Li, Zhaoyang Wang, Yiyang Zhou, Chenhang Cui, Yu Wang, Wenhao Zheng, Huichi Zhou, Jindong Gu, Zhaorun Chen, Peng Xia, Tony Lee, Thomas Zollo, Vikash Sehwag, Jixuan Leng, Jiuhai Chen, Yuxin Wen, Huan Zhang, Zhun Deng, Linjun Zhang, Pavel Izmailov, Pang Wei Koh, Yulia Tsvetkov, Andrew Wilson, Jiaheng Zhang, James Zou, Cihang Xie, Hao Wang, Philip Torr, Julian McAuley, David Alvarez-Melis, Florian Tram\`er, Kaidi Xu, Suman Jana, Chris Callison-Burch, Rene Vidal, Filippos Kokkinos, Mohit Bansal, Beidi Chen, Huaxiu Yao

Abstract: Foundation models, including Large Language Models (LLMs), Multimodal Large Language Models (MLLMs), Image Generative Models (i.e, Text-to-Image Models and Image-Editing Models), and Video Generative Models, have become essential tools with broad applications across various domains such as law, medicine, education, finance, science, and beyond. As these models see increasing real-world deployment, ensuring their reliability and responsibility has become critical for academia, industry, and government. This survey addresses the reliable and responsible development of foundation models. We explore critical issues, including bias and fairness, security and privacy, uncertainty, explainability, and distribution shift. Our research also covers model limitations, such as hallucinations, as well as methods like alignment and Artificial Intelligence-Generated Content (AIGC) detection. For each area, we review the current state of the field and outline concrete future research directions. Additionally, we discuss the intersections between these areas, highlighting their connections and shared challenges. We hope our survey fosters the development of foundation models that are not only powerful but also ethical, trustworthy, reliable, and socially responsible.

new A second order regret bound for NormalHedge

Authors: Yoav Freund, Nicholas J. A. Harvey, Victor S. Portella, Yabing Qi, Yu-Xiang Wang

Abstract: We consider the problem of prediction with expert advice for ``easy'' sequences. We show that a variant of NormalHedge enjoys a second-order $\epsilon$-quantile regret bound of $O\big(\sqrt{V_T \log(V_T/\epsilon)}\big) $ when $V_T > \log N$, where $V_T$ is the cumulative second moment of instantaneous per-expert regret averaged with respect to a natural distribution determined by the algorithm. The algorithm is motivated by a continuous time limit using Stochastic Differential Equations. The discrete time analysis uses self-concordance techniques.

new The Confidence Manifold: Geometric Structure of Correctness Representations in Language Models

Authors: Seonglae Cho, Zekun Wu, Kleyton Da Costa, Adriano Koshiyama

Abstract: When a language model asserts that "the capital of Australia is Sydney," does it know this is wrong? We characterize the geometry of correctness representations across 9 models from 5 architecture families. The structure is simple: the discriminative signal occupies 3-8 dimensions, performance degrades with additional dimensions, and no nonlinear classifier improves over linear separation. Centroid distance in the low-dimensional subspace matches trained probe performance (0.90 AUC), enabling few-shot detection: on GPT-2, 25 labeled examples achieve 89% of full-data accuracy. We validate causally through activation steering: the learned direction produces 10.9 percentage point changes in error rates while random directions show no effect. Internal probes achieve 0.80-0.97 AUC; output-based methods (P(True), semantic entropy) achieve only 0.44-0.64 AUC. The correctness signal exists internally but is not expressed in outputs. That centroid distance matches probe performance indicates class separation is a mean shift, making detection geometric rather than learned.

new Spherical Steering: Geometry-Aware Activation Rotation for Language Models

Authors: Zejia You, Chunyuan Deng, Hanjie Chen

Abstract: Inference-time steering has emerged as a promising paradigm for controlling language models (LMs) without the cost of retraining. However, standard approaches typically rely on activation addition, a geometric operation that inevitably alters the magnitude of hidden representations. This raises concerns about representation collapse and degradation of open-ended generation capabilities. In this work, we explore Spherical Steering, a training-free primitive that resolves this trade-off through activation rotation. Rather than shifting activations with a fixed vector, our method rotates them along a geodesic toward a target direction, guiding the activation toward the target concept while preserving the integrity of the signal. To further enhance adaptivity, we incorporate a confidence gate that dynamically modulates steering strength based on input uncertainty. Extensive experiments across multiple-choice benchmarks demonstrate that Spherical Steering significantly outperforms addition-based baselines (notably by +10% on TruthfulQA, COPA, and Storycloze), while simultaneously maintaining the model's general open-ended generation quality. This work highlights the value of geometric consistency, suggesting that norm-preserving rotation is a robust and effective primitive for precise inference-time control.

new A Causal Machine Learning Framework for Treatment Personalization in Clinical Trials: Application to Ulcerative Colitis

Authors: Cristian Minoccheri, Sophia Tesic, Kayvan Najarian, Ryan Stidham

Abstract: Randomized controlled trials estimate average treatment effects, but treatment response heterogeneity motivates personalized approaches. A critical question is whether statistically detectable heterogeneity translates into improved treatment decisions -- these are distinct questions that can yield contradictory answers. We present a modular causal machine learning framework that evaluates each question separately: permutation importance identifies which features predict heterogeneity, best linear predictor (BLP) testing assesses statistical significance, and doubly robust policy evaluation measures whether acting on the heterogeneity improves patient outcomes. We apply this framework to patient-level data from the UNIFI maintenance trial of ustekinumab in ulcerative colitis, comparing placebo, standard-dose ustekinumab every 12 weeks, and dose-intensified ustekinumab every 8 weeks, using cross-fitted X-learner models with baseline demographics, medication history, week-8 clinical scores, laboratory biomarkers, and video-derived endoscopic features. BLP testing identified strong associations between endoscopic features and treatment effect heterogeneity for ustekinumab versus placebo, yet doubly robust policy evaluation showed no improvement in expected remission from incorporating endoscopic features, and out-of-fold multi-arm evaluation showed worse performance. Diagnostic comparison of prognostic contribution against policy value revealed that endoscopic scores behaved as disease severity markers -- improving outcome prediction in untreated patients but adding noise to treatment selection -- while clinical variables (fecal calprotectin, age, CRP) captured the decision-relevant variation. These results demonstrate that causal machine learning applications to clinical trials should include policy-level evaluation alongside heterogeneity testing.

new Nansde-net: A neural sde framework for generating time series with memory

Authors: Hiromu Ozai, Kei Nakagawa

Abstract: Modeling time series with long- or short-memory characteristics is a fundamental challenge in many scientific and engineering domains. While fractional Brownian motion has been widely used as a noise source to capture such memory effects, its incompatibility with It\^o calculus limits its applicability in neural stochastic differential equation~(SDE) frameworks. In this paper, we propose a novel class of noise, termed Neural Network-kernel ARMA-type noise~(NA-noise), which is an It\^o-process-based alternative capable of capturing both long- and short-memory behaviors. The kernel function defining the noise structure is parameterized via neural networks and decomposed into a product form to preserve the Markov property. Based on this noise process, we develop NANSDE-Net, a generative model that extends Neural SDEs by incorporating NA-noise. We prove the theoretical existence and uniqueness of the solution under mild conditions and derive an efficient backpropagation scheme for training. Empirical results on both synthetic and real-world datasets demonstrate that NANSDE-Net matches or outperforms existing models, including fractional SDE-Net, in reproducing long- and short-memory features of the data, while maintaining computational tractability within the It\^o calculus framework.

new Dreaming in Code for Curriculum Learning in Open-Ended Worlds

Authors: Konstantinos Mitsides, Maxence Faldor, Antoine Cully

Abstract: Open-ended learning frames intelligence as emerging from continual interaction with an ever-expanding space of environments. While recent advances have utilized foundation models to programmatically generate diverse environments, these approaches often focus on discovering isolated behaviors rather than orchestrating sustained progression. In complex open-ended worlds, the large combinatorial space of possible challenges makes it difficult for agents to discover sequences of experiences that remain consistently learnable. To address this, we propose Dreaming in Code (DiCode), a framework in which foundation models synthesize executable environment code to scaffold learning toward increasing competence. In DiCode, "dreaming" takes the form of materializing code-level variations of the world. We instantiate DiCode in Craftax, a challenging open-ended benchmark characterized by rich mechanics and long-horizon progression. Empirically, DiCode enables agents to acquire long-horizon skills, achieving a $16\%$ improvement in mean return over the strongest baseline and non-zero success on late-game combat tasks where prior methods fail. Our results suggest that code-level environment design provides a practical mechanism for curriculum control, enabling the construction of intermediate environments that bridge competence gaps in open-ended worlds. Project page and source code are available at https://konstantinosmitsides.github.io/dreaming-in-code and https://github.com/konstantinosmitsides/dreaming-in-code.

URLs: https://konstantinosmitsides.github.io/dreaming-in-code, https://github.com/konstantinosmitsides/dreaming-in-code.

new Interpretable Dynamic Network Modeling of Tensor Time Series via Kronecker Time-Varying Graphical Lasso

Authors: Shingo Higashiguchi, Koki Kawabata, Yasuko Matsubara, Yasushi Sakurai

Abstract: With the rapid development of web services, large amounts of time series data are generated and accumulated across various domains such as finance, healthcare, and online platforms. As such data often co-evolves with multiple variables interacting with each other, estimating the time-varying dependencies between variables (i.e., the dynamic network structure) has become crucial for accurate modeling. However, real-world data is often represented as tensor time series with multiple modes, resulting in large, entangled networks that are hard to interpret and computationally intensive to estimate. In this paper, we propose Kronecker Time-Varying Graphical Lasso (KTVGL), a method designed for modeling tensor time series. Our approach estimates mode-specific dynamic networks in a Kronecker product form, thereby avoiding overly complex entangled structures and producing interpretable modeling results. Moreover, the partitioned network structure prevents the exponential growth of computational time with data dimension. In addition, our method can be extended to stream algorithms, making the computational time independent of the sequence length. Experiments on synthetic data show that the proposed method achieves higher edge estimation accuracy than existing methods while requiring less computation time. To further demonstrate its practical value, we also present a case study using real-world data. Our source code and datasets are available at https://github.com/Higashiguchi-Shingo/KTVGL.

URLs: https://github.com/Higashiguchi-Shingo/KTVGL.

new CADO: From Imitation to Cost Minimization for Heatmap-based Solvers in Combinatorial Optimization

Authors: Hyungseok Song, Deunsol Yoon, Kanghoon Lee, Han-Seul Jeong, Soonyoung Lee, Woohyung Lim

Abstract: Heatmap-based solvers have emerged as a promising paradigm for Combinatorial Optimization (CO). However, we argue that the dominant Supervised Learning (SL) training paradigm suffers from a fundamental objective mismatch: minimizing imitation loss (e.g., cross-entropy) does not guarantee solution cost minimization. We dissect this mismatch into two deficiencies: Decoder-Blindness (being oblivious to the non-differentiable decoding process) and Cost-Blindness (prioritizing structural imitation over solution quality). We empirically demonstrate that these intrinsic flaws impose a hard performance ceiling. To overcome this limitation, we propose CADO (Cost-Aware Diffusion models for Optimization), a streamlined Reinforcement Learning fine-tuning framework that formulates the diffusion denoising process as an MDP to directly optimize the post-decoded solution cost. We introduce Label-Centered Reward, which repurposes ground-truth labels as unbiased baselines rather than imitation targets, and Hybrid Fine-Tuning for parameter-efficient adaptation. CADO achieves state-of-the-art performance across diverse benchmarks, validating that objective alignment is essential for unlocking the full potential of heatmap-based solvers.

new DrugR: Optimizing Molecular Drugs through LLM-based Explicit Reasoning

Authors: Haoran Liu, Zheni Zeng, Yukun Yan, Yuxuan Chen, Yunduo Xiao

Abstract: Molecule generation and optimization is a fundamental task in chemical domain. The rapid development of intelligent tools, especially large language models (LLMs) with powerful knowledge reserves and interactive capabilities, has provided new paradigms for it. Nevertheless, the intrinsic challenge for LLMs lies in the complex implicit relationship between molecular structure and pharmacological properties and the lack of corresponding labeled data. To bridge this gap, we propose DrugR, an LLM-based method that introduces explicit, step-by-step pharmacological reasoning into the optimization process. Our approach integrates domain-specific continual pretraining, supervised fine-tuning via reverse data engineering, and self-balanced multi-granular reinforcement learning. This framework enables DrugR to effectively improve key ADMET properties while preserving the original molecule's core efficacy. Experimental results demonstrate that DrugR achieves comprehensive enhancement across multiple properties without compromising structural similarity or target binding affinity. Importantly, its explicit reasoning process provides clear, interpretable rationales for each optimization step, yielding actionable design insights and advancing toward automated, knowledge-driven scientific discovery. Our code and model checkpoints are open-sourced to foster future research.

new Distribution-Free Robust Functional Predict-Then-Optimize

Authors: Yash Patel, Ambuj Tewari

Abstract: The solution of PDEs in decision-making tasks is increasingly being undertaken with the help of neural operator surrogate models due to the need for repeated evaluation. Such methods, while significantly more computationally favorable compared to their numerical counterparts, fail to provide any calibrated notions of uncertainty in their predictions. Current methods approach this deficiency typically with ensembling or Bayesian posterior estimation. However, these approaches either require distributional assumptions that fail to hold in practice or lack practical scalability, limiting their applications in practice. We, therefore, propose a novel application of conformal prediction to produce distribution-free uncertainty quantification over the function spaces mapped by neural operators. We then demonstrate how such prediction regions enable a formal regret characterization if leveraged in downstream robust decision-making tasks. We further demonstrate how such posited robust decision-making tasks can be efficiently solved using an infinite-dimensional generalization of Danskin's Theorem and calculus of variations and empirically demonstrate the superior performance of our proposed method over more restrictive modeling paradigms, such as Gaussian Processes, across several engineering tasks.

new Thermodynamic Isomorphism of Transformers: A Lagrangian Approach to Attention Dynamics

Authors: Gunn Kim

Abstract: Although the Transformer architecture has revolutionized artificial intelligence, its underlying mechanisms remain largely heuristic and lack a unified physical theory. In this work, we propose a first-principles framework for information dynamics, treating the attention mechanism as a physical system governed by the principle of least action rather than as an algorithmic optimization. By mapping information states to a Riemannian manifold with the Fisher information metric, we derive the intelligence Lagrangian. We show that the softmax function corresponds to the unique thermodynamic equilibrium state that minimizes the Helmholtz free energy of the information gas. In addition, we identify the query-key interaction as an electrodynamic coupling between an external field and an intrinsic dipole moment. This theory establishes the first law of information thermodynamics, unifying inference (mechanical work) and learning (chemical evolution). It also explains emergent phenomena, such as scaling laws and grokking, as phase transitions characterized by the divergence of specific heat. Finally, we discuss how rotational symmetry breaking in the attention manifold generates massless Goldstone bosons, providing a field-theoretic perspective on rotary positional embeddings (RoPE). Our work connects Statistical Physics and Deep Learning, laying the groundwork for a general theory of physics-based intelligence.

new Sparsity-Aware Evolution for Model Merging

Authors: Huan Zhang, Yanjian Zhang, Guillaume Wisniewski, Nadi Tomeh, Bang Liu

Abstract: We propose a sparsity-aware evolutionary (SAE) framework for model merging that involves iterative pruning-merging cycles to act as a novel mutation operator. We incorporate the sparsity constraints into the score function, which steers the evolutionary process to favor more sparse models, in addition to other conventional performance scores. Interestingly, the by-product of \textit{competition} for sparsity introduces an extra local \textit{attraction} and interplay into the evolutionary process: if one competitor has more zero elements, the other competitor's non-zero elements will occupy those positions, even though the less sparse competitor loses to the more sparse competitor in other positions. The proposed pipeline is evaluated on a variety of large-scale LLM benchmarks. Experiments demonstrate that our approach can improve model merging reliability across multiple benchmarks, and is easy to incorporate due to its simplicity and being orthogonal to most existing approaches.

new SkillRL: Evolving Agents via Recursive Skill-Augmented Reinforcement Learning

Authors: Peng Xia, Jianwen Chen, Hanyang Wang, Jiaqi Liu, Kaide Zeng, Yu Wang, Siwei Han, Yiyang Zhou, Xujiang Zhao, Haifeng Chen, Zeyu Zheng, Cihang Xie, Huaxiu Yao

Abstract: Large Language Model (LLM) agents have shown stunning results in complex tasks, yet they often operate in isolation, failing to learn from past experiences. Existing memory-based methods primarily store raw trajectories, which are often redundant and noise-heavy. This prevents agents from extracting high-level, reusable behavioral patterns that are essential for generalization. In this paper, we propose SkillRL, a framework that bridges the gap between raw experience and policy improvement through automatic skill discovery and recursive evolution. Our approach introduces an experience-based distillation mechanism to build a hierarchical skill library SkillBank, an adaptive retrieval strategy for general and task-specific heuristics, and a recursive evolution mechanism that allows the skill library to co-evolve with the agent's policy during reinforcement learning. These innovations significantly reduce the token footprint while enhancing reasoning utility. Experimental results on ALFWorld, WebShop and seven search-augmented tasks demonstrate that SkillRL achieves state-of-the-art performance, outperforming strong baselines over 15.3% and maintaining robustness as task complexity increases. Code is available at this https://github.com/aiming-lab/SkillRL.

URLs: https://github.com/aiming-lab/SkillRL.

new Linearization Explains Fine-Tuning in Large Language Models

Authors: Zahra Rahimi Afzal, Tara Esmaeilbeig, Mojtaba Soltanalian, Mesrob I. Ohannessian

Abstract: Parameter-Efficient Fine-Tuning (PEFT) is a popular class of techniques that strive to adapt large models in a scalable and resource-efficient manner. Yet, the mechanisms underlying their training performance and generalization remain underexplored. In this paper, we provide several insights into such fine-tuning through the lens of linearization. Fine-tuned models are often implicitly encouraged to remain close to the pretrained model. By making this explicit, using an Euclidean distance inductive bias in parameter space, we show that fine-tuning dynamics become equivalent to learning with the positive-definite neural tangent kernel (NTK). We specifically analyze how close the fully linear and the linearized fine-tuning optimizations are, based on the strength of the regularization. This allows us to be pragmatic about how good a model linearization is when fine-tuning large language models (LLMs). When linearization is a good model, our findings reveal a strong correlation between the eigenvalue spectrum of the NTK and the performance of model adaptation. Motivated by this, we give spectral perturbation bounds on the NTK induced by the choice of layers selected for fine-tuning. We empirically validate our theory on Low Rank Adaptation (LoRA) on LLMs. These insights not only characterize fine-tuning but also have the potential to enhance PEFT techniques, paving the way to better informed and more nimble adaptation in LLMs.

new Learning in Context, Guided by Choice: A Reward-Free Paradigm for Reinforcement Learning with Transformers

Authors: Juncheng Dong, Bowen He, Moyang Guo, Ethan X. Fang, Zhuoran Yang, Vahid Tarokh

Abstract: In-context reinforcement learning (ICRL) leverages the in-context learning capabilities of transformer models (TMs) to efficiently generalize to unseen sequential decision-making tasks without parameter updates. However, existing ICRL methods rely on explicit reward signals during pretraining, which limits their applicability when rewards are ambiguous, hard to specify, or costly to obtain. To overcome this limitation, we propose a new learning paradigm, In-Context Preference-based Reinforcement Learning (ICPRL), in which both pretraining and deployment rely solely on preference feedback, eliminating the need for reward supervision. We study two variants that differ in the granularity of feedback: Immediate Preference-based RL (I-PRL) with per-step preferences, and Trajectory Preference-based RL (T-PRL) with trajectory-level comparisons. We first show that supervised pretraining, a standard approach in ICRL, remains effective under preference-only context datasets, demonstrating the feasibility of in-context reinforcement learning using only preference signals. To further improve data efficiency, we introduce alternative preference-native frameworks for I-PRL and T-PRL that directly optimize TM policies from preference data without requiring reward signals nor optimal action labels.Experiments on dueling bandits, navigation, and continuous control tasks demonstrate that ICPRL enables strong in-context generalization to unseen tasks, achieving performance comparable to ICRL methods trained with full reward supervision.

new Constraint-Aware Generative Auto-bidding via Pareto-Prioritized Regret Optimization

Authors: Binglin Wu, Yingyi Zhang, Xianneng Li, Ruyue Deng, Chuan Yue, Weiru Zhang, Xiaoyi Zeng

Abstract: Auto-bidding systems aim to maximize marketing value while satisfying strict efficiency constraints such as Target Cost-Per-Action (CPA). Although Decision Transformers provide powerful sequence modeling capabilities, applying them to this constrained setting encounters two challenges: 1) standard Return-to-Go conditioning causes state aliasing by neglecting the cost dimension, preventing precise resource pacing; and 2) standard regression forces the policy to mimic average historical behaviors, thereby limiting the capacity to optimize performance toward the constraint boundary. To address these challenges, we propose PRO-Bid, a constraint-aware generative auto-bidding framework based on two synergistic mechanisms: 1) Constraint-Decoupled Pareto Representation (CDPR) decomposes global constraints into recursive cost and value contexts to restore resource perception, while reweighting trajectories based on the Pareto frontier to focus on high-efficiency data; and 2) Counterfactual Regret Optimization (CRO) facilitates active improvement by utilizing a global outcome predictor to identify superior counterfactual actions. By treating these high-utility outcomes as weighted regression targets, the model transcends historical averages to approach the optimal constraint boundary. Extensive experiments on two public benchmarks and online A/B tests demonstrate that PRO-Bid achieves superior constraint satisfaction and value acquisition compared to state-of-the-art baselines.

new Inverting Data Transformations via Diffusion Sampling

Authors: Jinwoo Kim, S\'ekou-Oumar Kaba, Jiyun Park, Seunghoon Hong, Siamak Ravanbakhsh

Abstract: We study the problem of transformation inversion on general Lie groups: a datum is transformed by an unknown group element, and the goal is to recover an inverse transformation that maps it back to the original data distribution. Such unknown transformations arise widely in machine learning and scientific modeling, where they can significantly distort observations. We take a probabilistic view and model the posterior over transformations as a Boltzmann distribution defined by an energy function on data space. To sample from this posterior, we introduce a diffusion process on Lie groups that keeps all updates on-manifold and only requires computations in the associated Lie algebra. Our method, Transformation-Inverting Energy Diffusion (TIED), relies on a new trivialized target-score identity that enables efficient score-based sampling of the transformation posterior. As a key application, we focus on test-time equivariance, where the objective is to improve the robustness of pretrained neural networks to input transformations. Experiments on image homographies and PDE symmetries demonstrate that TIED can restore transformed inputs to the training distribution at test time, showing improved performance over strong canonicalization and sampling baselines. Code is available at https://github.com/jw9730/tied.

URLs: https://github.com/jw9730/tied.

new When Do Multi-Agent Systems Outperform? Analysing the Learning Efficiency of Agentic Systems

Authors: Junwei Su, Chuan Wu

Abstract: Reinforcement Learning (RL) has emerged as a crucial method for training or fine-tuning large language models (LLMs), enabling adaptive, task-specific optimizations through interactive feedback. Multi-Agent Reinforcement Learning (MARL), in particular, offers a promising avenue by decomposing complex tasks into specialized subtasks learned by distinct interacting agents, potentially enhancing the ability and efficiency of LLM systems. However, theoretical insights regarding when and why MARL outperforms Single-Agent RL (SARL) remain limited, creating uncertainty in selecting the appropriate RL framework. In this paper, we address this critical gap by rigorously analyzing the comparative sample efficiency of MARL and SARL within the context of LLM. Leveraging the Probably Approximately Correct (PAC) framework, we formally define SARL and MARL setups for LLMs, derive explicit sample complexity bounds, and systematically characterize how task decomposition and alignment influence learning efficiency. Our results demonstrate that MARL improves sample complexity when tasks naturally decompose into independent subtasks, whereas dependent subtasks diminish MARL's comparative advantage. Additionally, we introduce and analyze the concept of task alignment, quantifying the trade-offs when enforcing independent task decomposition despite potential misalignments. These theoretical insights clarify empirical inconsistencies and provide practical criteria for deploying MARL strategies effectively in complex LLM scenarios.

new Noise Stability of Transformer Models

Authors: Themistoklis Haris, Zihan Zhang, Yuichi Yoshida

Abstract: Understanding simplicity biases in deep learning offers a promising path toward developing reliable AI. A common metric for this, inspired by Boolean function analysis, is average sensitivity, which captures a model's robustness to single-token perturbations. We argue that average sensitivity has two key limitations: it lacks a natural generalization to real-valued domains and fails to explain the "junta-like" input dependence we empirically observe in modern LLMs. To address these limitations, we propose noise stability as a more comprehensive simplicity metric. Noise stability expresses a model's robustness to correlated noise applied to all input coordinates simultaneously. We provide a theoretical analysis of noise stability for single-layer attention and ReLU MLP layers and tackle the multi-layer propagation problem with a covariance interval propagation approach. Building on this theory, we develop a practical noise stability regularization method. Experiments on algorithmic and next-token-prediction tasks show that our regularizer consistently catalyzes grokking and accelerates training by approximately $35\%$ and $75\%$ respectively. Our results sculpt a new connection between signal propagation in neural networks and interpretability, with noise stability emerging as a powerful tool for understanding and improving modern Transformers.

new Trust-Based Incentive Mechanisms in Semi-Decentralized Federated Learning Systems

Authors: Ajay Kumar Shrestha

Abstract: In federated learning (FL), decentralized model training allows multi-ple participants to collaboratively improve a shared machine learning model without exchanging raw data. However, ensuring the integrity and reliability of the system is challenging due to the presence of potentially malicious or faulty nodes that can degrade the model's performance. This paper proposes a novel trust-based incentive mechanism designed to evaluate and reward the quality of contributions in FL systems. By dynamically assessing trust scores based on fac-tors such as data quality, model accuracy, consistency, and contribution fre-quency, the system encourages honest participation and penalizes unreliable or malicious behavior. These trust scores form the basis of an incentive mechanism that rewards high-trust nodes with greater participation opportunities and penal-ties for low-trust participants. We further explore the integration of blockchain technology and smart contracts to automate the trust evaluation and incentive distribution processes, ensuring transparency and decentralization. Our proposed theoretical framework aims to create a more robust, fair, and transparent FL eco-system, reducing the risks posed by untrustworthy participants.

new Grokking in Linear Models for Logistic Regression

Authors: Nataraj Das, Atreya Vedantam, Chandrashekar Lakshminarayanan

Abstract: Grokking, the phenomenon of delayed generalization, is often attributed to the depth and compositional structure of deep neural networks. We study grokking in one of the simplest possible settings: the learning of a linear model with logistic loss for binary classification on data that are linearly (and max margin) separable about the origin. We investigate three testing regimes: (1) test data drawn from the same distribution as the training data, in which case grokking is not observed; (2) test data concentrated around the margin, in which case grokking is observed; and (3) adversarial test data generated via projected gradient descent (PGD) attacks, in which case grokking is also observed. We theoretically show that the implicit bias of gradient descent induces a three-phase learning process-population-dominated, support-vector-dominated unlearning, and support-vector-dominated generalization-during which delayed generalization can arise. Our analysis further relates the emergence of grokking to asymmetries in the data, both in the number of examples per class and in the distribution of support vectors across classes, and yields a characterization of the grokking time. We experimentally validate our theory by planting different distributions of population points and support vectors, and by analyzing accuracy curves and hyperplane dynamics. Overall, our results demonstrate that grokking does not require depth or representation learning, and can emerge even in linear models through the dynamics of the bias term.

new TextResNet: Decoupling and Routing Optimization Signals in Compound AI Systems via Deep Residual Tuning

Authors: Suizhi Huang, Mei Li, Han Yu, Xiaoxiao Li

Abstract: Textual Gradient-style optimizers (TextGrad) enable gradient-like feedback propagation through compound AI systems. However, they do not work well for deep chains. The root cause of this limitation stems from the Semantic Entanglement problem in these extended workflows. In standard textual backpropagation, feedback signals mix local critiques with upstream contexts, leading to Attribution Ambiguity. To address this challenge, we propose TextResNet, a framework that reformulates the optimization process to achieve precise signal routing via four key innovations. Firstly, in the forward pass, it enforces Additive Semantic Deltas to preserve an Identity Highway for gradient flow. Secondly, in the backward pass, it introduces Semantic Gradient Decomposition via a Semantic Projector to disentangle feedback into causally independent subspaces. Thirdly, it implements Causal Routing, which routes projected signals to their specific components. Finally, it performs Density-Aware Optimization Scheduling to leverage the disentangled signals to dynamically allocate resources to key system bottlenecks. Our results show that TextResNet not only achieves superior performance compared to TextGrad, but also exhibits remarkable stability for agentic tasks in compound AI systems where baselines collapse. Code is available at https://github.com/JeanDiable/TextResNet.

URLs: https://github.com/JeanDiable/TextResNet.

new Interaction-Grounded Learning for Contextual Markov Decision Processes with Personalized Feedback

Authors: Mengxiao Zhang, Yuheng Zhang, Haipeng Luo, Paul Mineiro

Abstract: In this paper, we study Interaction-Grounded Learning (IGL) [Xie et al., 2021], a paradigm designed for realistic scenarios where the learner receives indirect feedback generated by an unknown mechanism, rather than explicit numerical rewards. While prior work on IGL provides efficient algorithms with provable guarantees, those results are confined to single-step settings, restricting their applicability to modern sequential decision-making systems such as multi-turn Large Language Model (LLM) deployments. To bridge this gap, we propose a computationally efficient algorithm that achieves a sublinear regret guarantee for contextual episodic Markov Decision Processes (MDPs) with personalized feedback. Technically, we extend the reward-estimator construction of Zhang et al. [2024a] from the single-step to the multi-step setting, addressing the unique challenges of decoding latent rewards under MDPs. Building on this estimator, we design an Inverse-Gap-Weighting (IGW) algorithm for policy optimization. Finally, we demonstrate the effectiveness of our method in learning personalized objectives from multi-turn interactions through experiments on both a synthetic episodic MDP and a real-world user booking dataset.

new Fast Flow Matching based Conditional Independence Tests for Causal Discovery

Authors: Shunyu Zhao, Yanfeng Yang, Shuai Li, Kenji Fukumizu

Abstract: Constraint-based causal discovery methods require a large number of conditional independence (CI) tests, which severely limits their practical applicability due to high computational complexity. Therefore, it is crucial to design an algorithm that accelerates each individual test. To this end, we propose the Flow Matching-based Conditional Independence Test (FMCIT). The proposed test leverages the high computational efficiency of flow matching and requires the model to be trained only once throughout the entire causal discovery procedure, substantially accelerating causal discovery. According to numerical experiments, FMCIT effectively controls type-I error and maintains high testing power under the alternative hypothesis, even in the presence of high-dimensional conditioning sets. In addition, we further integrate FMCIT into a two-stage guided PC skeleton learning framework, termed GPC-FMCIT, which combines fast screening with guided, budgeted refinement using FMCIT. This design yields explicit bounds on the number of CI queries while maintaining high statistical power. Experiments on synthetic and real-world causal discovery tasks demonstrate favorable accuracy-efficiency trade-offs over existing CI testing methods and PC variants.

new Towards Efficient Large Language Reasoning Models via Extreme-Ratio Chain-of-Thought Compression

Authors: Yuntian Tang, Bohan Jia, Wenxuan Huang, Lianyue Zhang, Jiao Xie, Wenxi Li, Wei Li, Jie Hu, Xinghao Chen, Rongrong Ji, Shaohui Lin

Abstract: Chain-of-Thought (CoT) reasoning successfully enhances the reasoning capabilities of Large Language Models (LLMs), yet it incurs substantial computational overhead for inference. Existing CoT compression methods often suffer from a critical loss of logical fidelity at high compression ratios, resulting in significant performance degradation. To achieve high-fidelity, fast reasoning, we propose a novel EXTreme-RAtio Chain-of-Thought Compression framework, termed Extra-CoT, which aggressively reduces the token budget while preserving answer accuracy. To generate reliable, high-fidelity supervision, we first train a dedicated semantically-preserved compressor on mathematical CoT data with fine-grained annotations. An LLM is then fine-tuned on these compressed pairs via a mixed-ratio supervised fine-tuning (SFT), teaching it to follow a spectrum of compression budgets and providing a stable initialization for reinforcement learning (RL). We further propose Constrained and Hierarchical Ratio Policy Optimization (CHRPO) to explicitly incentivize question-solving ability under lower budgets by a hierarchical reward. Experiments on three mathematical reasoning benchmarks show the superiority of Extra-CoT. For example, on MATH-500 using Qwen3-1.7B, Extra-CoT achieves over 73\% token reduction with an accuracy improvement of 0.6\%, significantly outperforming state-of-the-art (SOTA) methods.

new Near-Oracle KV Selection via Pre-hoc Sparsity for Long-Context Inference

Authors: Yifei Gao, Lei Wang, Rong-Cheng Tu, Qixin Zhang, Jun Cheng, Dacheng Tao

Abstract: A core bottleneck in large language model (LLM) inference is the cost of attending over the ever-growing key-value (KV) cache. Although near-oracle top-k KV selection can preserve the quality of dense attention while sharply reducing computation and bandwidth, existing sparse methods generally rely on posterior heuristics, i.e., selectors conditioned on observed attention or proxy scores. Such conditioning introduces posterior bias: it tends to distort true token importance and miss salient tokens, thereby impairing long-range reasoning. To tackle this problem, we propose Pre-hoc Sparsity (PrHS), which selects KV entries before attention scoring and provides explicit accuracy control. Let the attention mass of discarded entries be delta (the dropped mass). Through a marginal-to-mutual-information analysis, we derive an upper bound on the mutual-information loss that depends only on the dropped mass. This relation explains failure modes of posterior heuristics and enables verifiable guarantees by controlling the dropped mass in advance. Within PrHS, we instantiate three orthogonal pre-hoc selectors along the axes of time, depth, and layer. Extensive experiments on LLaMA and Mistral families validate PrHS. Across GSM8K and CoQA, PrHS reduces retrieval overhead by over 90%, achieving 3x higher retrieval sparsity than HShare at matched or better accuracy. It incurs under 1% average degradation on LongBench, lowers attention FLOPs by about 15% versus prior sparse baselines, and yields a 9.9x speedup in attention-operator latency and 2.8x higher throughput on NVIDIA A100-80GB GPUs than the dense baseline.

new Regime Change Hypothesis: Foundations for Decoupled Dynamics in Neural Network Training

Authors: Cristian P\'erez-Corral, Alberto Fern\'andez-Hern\'andez, Jose I. Mestre, Manuel F. Dolz, Jose Duato, Enrique S. Quintana-Ort\'i

Abstract: Despite the empirical success of DNN, their internal training dynamics remain difficult to characterize. In ReLU-based models, the activation pattern induced by a given input determines the piecewise-linear region in which the network behaves affinely. Motivated by this geometry, we investigate whether training exhibits a two-timescale behavior: an early stage with substantial changes in activation patterns and a later stage where weight updates predominantly refine the model within largely stable activation regimes. We first prove a local stability property: outside measure-zero sets of parameters and inputs, sufficiently small parameter perturbations preserve the activation pattern of a fixed input, implying locally affine behavior within activation regions. We then empirically track per-iteration changes in weights and activation patterns across fully-connected and convolutional architectures, as well as Transformer-based models, where activation patterns are recorded in the ReLU feed-forward (MLP/FFN) submodules, using fixed validation subsets. Across the evaluated settings, activation-pattern changes decay 3 times earlier than weight-update magnitudes, showing that late-stage training often proceeds within relatively stable activation regimes. These findings provide a concrete, architecture-agnostic instrument for monitoring training dynamics and motivate further study of decoupled optimization strategies for piecewise-linear networks. For reproducibility, code and experiment configurations will be released upon acceptance.

new ManifoldKV: Training-Free KV Cache Compression via Euclidean Outlier Detection

Authors: Debajyoti Datta, Trishala Neeraj, Bibek Paudel, Vyom Sharma, Subhabrata Mukherjee

Abstract: Long-context inference is constrained by KV-cache memory, which grows linearly with sequence length; KV-cache compression therefore hinges on reliably selecting which past tokens to retain. Most geometry-based eviction methods score keys by cosine similarity to a global centroid, but cosine is scale-invariant and can discard magnitude cues that distinguish semantically salient tokens. We propose ManifoldKV, a training-free scorer that ranks tokens by Euclidean distance to the key centroid, capturing both angular and radial deviations. On the RULER benchmark, ManifoldKV achieves 95.7% accuracy at 4K-16K contexts with 20% compression; matching the best geometric baseline while improving robustness in two regimes where cosine scoring fails. First, on multi-key retrieval, ManifoldKV reduces directional collisions, achieving 92.4% vs KeyDiff's 77.0% (+15.4 points) on 3-key NIAH at 50% compression. Second, to address dilution and performance collapse of global centroids at 64K context, we introduce WindowedManifoldKV, which restores accuracy to 84.3% at 25% compression, a 49-point recovery over global L2 and +3.2 points over KeyDiff. The method requires only 3 lines of code and works across 4 architectures without tuning.

new All ERMs Can Fail in Stochastic Convex Optimization Lower Bounds in Linear Dimension

Authors: Tal Burla, Roi Livni

Abstract: We study the sample complexity of the best-case Empirical Risk Minimizer in the setting of stochastic convex optimization. We show that there exists an instance in which the sample size is linear in the dimension, learning is possible, but the Empirical Risk Minimizer is likely to be unique and to overfit. This resolves an open question by Feldman. We also extend this to approximate ERMs. Building on our construction we also show that (constrained) Gradient Descent potentially overfits when horizon and learning rate grow w.r.t sample size. Specifically we provide a novel generalization lower bound of $\Omega\left(\sqrt{\eta T/m^{1.5}}\right)$ for Gradient Descent, where $\eta$ is the learning rate, $T$ is the horizon and $m$ is the sample size. This narrows down, exponentially, the gap between the best known upper bound of $O(\eta T/m)$ and existing lower bounds from previous constructions.

new The Chicken and Egg Dilemma: Co-optimizing Data and Model Configurations for LLMs

Authors: Zhiliang Chen, Alfred Wei Lun Leong, Shao Yong Ong, Apivich Hemachandram, Gregory Kang Ruey Lau, Chuan-Sheng Foo, Zhengyuan Liu, Nancy F. Chen, Bryan Kian Hsiang Low

Abstract: Co-optimizing data and model configurations for training LLMs presents a classic chicken-and-egg dilemma: The best training data configuration (e.g., data mixture) for a downstream task depends on the chosen model configuration (e.g., model architecture), and vice versa. However, jointly optimizing both data and model configurations is often deemed intractable, and existing methods focus on either data or model optimization without considering their interaction. We introduce JoBS, an approach that uses a scaling-law-inspired performance predictor to aid Bayesian optimization (BO) in jointly optimizing LLM training data and model configurations efficiently. JoBS allocates a portion of the optimization budget to learn an LLM performance predictor that predicts how promising a training configuration is from a small number of training steps. The remaining budget is used to perform BO entirely with the predictor, effectively amortizing the cost of running full-training runs. We study JoBS's average regret and devise the optimal budget allocation to minimize regret. JoBS outperforms existing multi-fidelity BO baselines, as well as data and model optimization approaches across diverse LLM tasks under the same optimization budget.

new Dynamic Regret via Discounted-to-Dynamic Reduction with Applications to Curved Losses and Adam Optimizer

Authors: Yan-Feng Xie, Yu-Jie Zhang, Peng Zhao, Zhi-Hua Zhou

Abstract: We study dynamic regret minimization in non-stationary online learning, with a primary focus on follow-the-regularized-leader (FTRL) methods. FTRL is important for curved losses and for understanding adaptive optimizers such as Adam, yet existing dynamic regret analyses are less explored for FTRL. To address this, we build on the discounted-to-dynamic reduction and present a modular way to obtain dynamic regret bounds of FTRL-related problems. Specifically, we focus on two representative curved losses: linear regression and logistic regression. Our method not only simplifies existing proofs for the optimal dynamic regret of online linear regression, but also yields new dynamic regret guarantees for online logistic regression. Beyond online convex optimization, we apply the reduction to analyze the Adam optimizers, obtaining optimal convergence rates in stochastic, non-convex, and non-smooth settings. The reduction also enables a more detailed treatment of Adam with two discount parameters $(\beta_1,\beta_2)$, leading to new results for both clipped and clip-free variants of Adam optimizers.

new OJBKQ: Objective-Joint Babai-Klein Quantization

Authors: Xinyu Wang, Ziyu Zhao, Peng Lu, Yu Gu, Xiao-Wen Chang

Abstract: Post-training quantization (PTQ) is widely used to compress large language models without retraining. However, many existing weight-only methods rely on heuristic objectives and greedy rounding, thus leading to noticeable degradation under low-bit quantization. In this work, we introduce OJBKQ (Objective-Joint Babai-Klein Quantization with K-Best Sampling), a layer-wise PTQ method that formulates weight quantization as a joint optimization problem over activations and weights. This formulation results in a multiple-right-hand-side box-constrained integer least squares (BILS) problem in each layer, which is NP-hard. For each column of the weight matrix, we apply an extended Babai nearest-plane algorithm and an extended version of Klein's randomized Babai algorithm to find the minimum-residual Babai-Klein point, a sub-optimal solution to the BILS problem. Experimental results on large language models show that OJBKQ achieves lower perplexity at 3-4 bits compared to existing PTQ approaches, while maintaining comparable computational cost.

new Reinforcement Learning with Backtracking Feedback

Authors: Bilgehan Sel, Vaishakh Keshava, Phillip Wallis, Lukas Rutishauser, Ming Jin, Dingcheng Li

Abstract: Addressing the critical need for robust safety in Large Language Models (LLMs), particularly against adversarial attacks and in-distribution errors, we introduce Reinforcement Learning with Backtracking Feedback (RLBF). This framework advances upon prior methods, such as BSAFE, by primarily leveraging a Reinforcement Learning (RL) stage where models learn to dynamically correct their own generation errors. Through RL with critic feedback on the model's live outputs, LLMs are trained to identify and recover from their actual, emergent safety violations by emitting an efficient "backtrack by x tokens" signal, then continuing generation autoregressively. This RL process is crucial for instilling resilience against sophisticated adversarial strategies, including middle filling, Greedy Coordinate Gradient (GCG) attacks, and decoding parameter manipulations. To further support the acquisition of this backtracking capability, we also propose an enhanced Supervised Fine-Tuning (SFT) data generation strategy (BSAFE+). This method improves upon previous data creation techniques by injecting violations into coherent, originally safe text, providing more effective initial training for the backtracking mechanism. Comprehensive empirical evaluations demonstrate that RLBF significantly reduces attack success rates across diverse benchmarks and model scales, achieving superior safety outcomes while critically preserving foundational model utility.

new Modalities, a PyTorch-native Framework For Large-scale LLM Training and Research

Authors: Max L\"ubbering, Timm Ruland, Richard Rutmann, Felix Stollenwerk, David Fitzek, Michael Fromm, Alexander Weber, Rafet Sifa, Nicolas Flores-Herr, Joachim K\"ohler, Mehdi Ali

Abstract: Today's LLM (pre-) training and research workflows typically allocate a significant amount of compute to large-scale ablation studies. Despite the substantial compute costs of these ablations, existing open-source frameworks provide limited tooling for these experiments, often forcing researchers to write their own wrappers and scripts. We propose Modalities, an end-to-end PyTorch-native framework that integrates data-driven LLM research with large-scale model training from two angles. Firstly, by integrating state-of-the-art parallelization strategies, it enables both efficient pretraining and systematic ablations at trillion-token and billion-parameter scale. Secondly, Modalities adopts modular design with declarative, self-contained configuration, enabling reproducibility and extensibility levels that are difficult to achieve out-of-the-box with existing LLM training frameworks.

new Drop the mask! GAMM-A Taxonomy for Graph Attributes Missing Mechanisms

Authors: Richard Serrano (LabHC), Baptiste Jeudy (LabHC), Charlotte Laclau (IDS, S2A), Christine Largeron (LabHC)

Abstract: Exploring missing data in attributed graphs introduces unique challenges beyond those found in tabular datasets. In this work, we extend the taxonomy for missing data mechanisms to attributed graphs by proposing GAMM (Graph Attributes Missing Mechanisms), a framework that systematically links missingness probability to both node attributes and the underlying graph structure. Our taxonomy enriches the conventional definitions of masking mechanisms by introducing graph-specific dependencies. We empirically demonstrate that state-of-the-art imputation methods, while effective on traditional masks, significantly struggle when confronted with these more realistic graph-aware missingness scenarios.

new Radial M\"untz-Sz\'asz Networks: Neural Architectures with Learnable Power Bases for Multidimensional Singularities

Authors: Gnankan Landry Regis N'guessan, Bum Jun Kim

Abstract: Radial singular fields, such as $1/r$, $\log r$, and crack-tip profiles, are difficult to model for coordinate-separable neural architectures. We show that any $C^2$ function that is both radial and additively separable must be quadratic, establishing a fundamental obstruction for coordinate-wise power-law models. Motivated by this result, we introduce Radial M\"untz-Sz\'asz Networks (RMN), which represent fields as linear combinations of learnable radial powers $r^\mu$, including negative exponents, together with a limit-stable log-primitive for exact $\log r$ behavior. RMN admits closed-form spatial gradients and Laplacians, enabling physics-informed learning on punctured domains. Across ten 2D and 3D benchmarks, RMN achieves 1.5$\times$--51$\times$ lower RMSE than MLPs and 10$\times$--100$\times$ lower RMSE than SIREN while using 27 parameters, compared with 33,537 for MLPs and 8,577 for SIREN. We extend RMN to angular dependence (RMN-Angular) and to multiple sources with learnable centers (RMN-MC); when optimization converges, source-center recovery errors fall below $10^{-4}$. We also report controlled failures on smooth, strongly non-radial targets to delineate RMN's operating regime.

new The Connection between Kriging and Large Neural Networks

Authors: Marius Marinescu

Abstract: AI has impacted many disciplines and is nowadays ubiquitous. In particular, spatial statistics is in a pivotal moment where it will increasingly intertwine with AI. In this scenario, a relevant question is what relationship spatial statistics models have with machine learning (ML) models, if any. In particular, in this paper, we explore the connections between Kriging and neural networks. At first glance, they may appear unrelated. Kriging - and its ML counterpart, Gaussian process regression - are grounded in probability theory and stochastic processes, whereas many ML models are extensively considered Black-Box models. Nevertheless, they are strongly related. We study their connections and revisit the relevant literature. The understanding of their relations and the combination of both perspectives may enhance ML techniques by making them more interpretable, reliable, and spatially aware.

new USBD: Universal Structural Basis Distillation for Source-Free Graph Domain Adaptation

Authors: Yingxu Wang, Kunyu Zhang, Mengzhu Wang, Siyang Gao, Nan Yin

Abstract: SF-GDA is pivotal for privacy-preserving knowledge transfer across graph datasets. Although recent works incorporate structural information, they implicitly condition adaptation on the smoothness priors of sourcetrained GNNs, thereby limiting their generalization to structurally distinct targets. This dependency becomes a critical bottleneck under significant topological shifts, where the source model misinterprets distinct topological patterns unseen in the source domain as noise, rendering pseudo-label-based adaptation unreliable. To overcome this limitation, we propose the Universal Structural Basis Distillation, a framework that shifts the paradigm from adapting a biased model to learning a universal structural basis for SF-GDA. Instead of adapting a biased source model to a specific target, our core idea is to construct a structure-agnostic basis that proactively covers the full spectrum of potential topological patterns. Specifically, USBD employs a bi-level optimization framework to distill the source dataset into a compact structural basis. By enforcing the prototypes to span the full Dirichlet energy spectrum, the learned basis explicitly captures diverse topological motifs, ranging from low-frequency clusters to high-frequency chains, beyond those present in the source. This ensures that the learned basis creates a comprehensive structural covering capable of handling targets with disparate structures. For inference, we introduce a spectral-aware ensemble mechanism that dynamically activates the optimal prototype combination based on the spectral fingerprint of the target graph. Extensive experiments on benchmarks demonstrate that USBD significantly outperforms state-of-the-art methods, particularly in scenarios with severe structural shifts, while achieving superior computational efficiency by decoupling the adaptation cost from the target data scale.

new RIFLE: Robust Distillation-based FL for Deep Model Deployment on Resource-Constrained IoT Networks

Authors: Pouria Arefijamal, Mahdi Ahmadlou, Bardia Safaei, J\"org Henkel

Abstract: Federated learning (FL) is a decentralized learning paradigm widely adopted in resource-constrained Internet of Things (IoT) environments. These devices, typically relying on TinyML models, collaboratively train global models by sharing gradients with a central server while preserving data privacy. However, as data heterogeneity and task complexity increase, TinyML models often become insufficient to capture intricate patterns, especially under extreme non-IID (non-independent and identically distributed) conditions. Moreover, ensuring robustness against malicious clients and poisoned updates remains a major challenge. Accordingly, this paper introduces RIFLE - a Robust, distillation-based Federated Learning framework that replaces gradient sharing with logit-based knowledge transfer. By leveraging a knowledge distillation aggregation scheme, RIFLE enables the training of deep models such as VGG-19 and Resnet18 within constrained IoT systems. Furthermore, a Kullback-Leibler (KL) divergence-based validation mechanism quantifies the reliability of client updates without exposing raw data, achieving high trust and privacy preservation simultaneously. Experiments on three benchmark datasets (MNIST, CIFAR-10, and CIFAR-100) under heterogeneous non-IID conditions demonstrate that RIFLE reduces false-positive detections by up to 87.5%, enhances poisoning attack mitigation by 62.5%, and achieves up to 28.3% higher accuracy compared to conventional federated learning baselines within only 10 rounds. Notably, RIFLE reduces VGG19 training time from over 600 days to just 1.39 hours on typical IoT devices (0.3 GFLOPS), making deep learning practical in resource-constrained networks.

new Estimating Aleatoric Uncertainty in the Causal Treatment Effect

Authors: Liyuan Xu, Bijan Mazaheri

Abstract: Previous work on causal inference has primarily focused on averages and conditional averages of treatment effects, with significantly less attention on variability and uncertainty in individual treatment responses. In this paper, we introduce the variance of the treatment effect (VTE) and conditional variance of treatment effect (CVTE) as the natural measure of aleatoric uncertainty inherent in treatment responses, and we demonstrate that these quantities are identifiable from observed data under mild assumptions, even in the presence of unobserved confounders. We further propose nonparametric kernel-based estimators for VTE and CVTE, and our theoretical analysis establishes their convergence. We also test the performance of our method through extensive empirical experiments on both synthetic and semi-simulated datasets, where it demonstrates superior or comparable performance to naive baselines.

new Low Rank Transformer for Multivariate Time Series Anomaly Detection and Localization

Authors: Charalampos Shimillas, Kleanthis Malialis, Konstantinos Fokianos, Marios M. Polycarpou

Abstract: Multivariate time series (MTS) anomaly diagnosis, which encompasses both anomaly detection and localization, is critical for the safety and reliability of complex, large-scale real-world systems. The vast majority of existing anomaly diagnosis methods offer limited theoretical insights, especially for anomaly localization, which is a vital but largely unexplored area. The aim of this contribution is to study the learning process of a Transformer when applied to MTS by revealing connections to statistical time series methods. Based on these theoretical insights, we propose the Attention Low-Rank Transformer (ALoRa-T) model, which applies low-rank regularization to self-attention, and we introduce the Attention Low-Rank score, effectively capturing the temporal characteristics of anomalies. Finally, to enable anomaly localization, we propose the ALoRa-Loc method, a novel approach that associates anomalies to specific variables by quantifying interrelationships among time series. Extensive experiments and real data analysis, show that the proposed methodology significantly outperforms state-of-the-art methods in both detection and localization tasks.

new Learning Credal Ensembles via Distributionally Robust Optimization

Authors: Kaizheng Wang, Ghifari Adam Faza, Fabio Cuzzolin, Siu Lun Chau, David Moens, Hans Hallez

Abstract: Credal predictors are models that are aware of epistemic uncertainty and produce a convex set of probabilistic predictions. They offer a principled way to quantify predictive epistemic uncertainty (EU) and have been shown to improve model robustness in various settings. However, most state-of-the-art methods mainly define EU as disagreement caused by random training initializations, which mostly reflects sensitivity to optimization randomness rather than uncertainty from deeper sources. To address this, we define EU as disagreement among models trained with varying relaxations of the i.i.d. assumption between training and test data. Based on this idea, we propose CreDRO, which learns an ensemble of plausible models through distributionally robust optimization. As a result, CreDRO captures EU not only from training randomness but also from meaningful disagreement due to potential distribution shifts between training and test data. Empirical results show that CreDRO consistently outperforms existing credal methods on tasks such as out-of-distribution detection across multiple benchmarks and selective classification in medical applications.

new Time-Delayed Transformers for Data-Driven Modeling of Low-Dimensional Dynamics

Authors: Albert Alcalde, Markus Widhalm, Emre Y{\i}lmaz

Abstract: We propose the time-delayed transformer (TD-TF), a simplified transformer architecture for data-driven modeling of unsteady spatio-temporal dynamics. TD-TF bridges linear operator-based methods and deep sequence models by showing that a single-layer, single-head transformer can be interpreted as a nonlinear generalization of time-delayed dynamic mode decomposition (TD-DMD). The architecture is deliberately minimal, consisting of one self-attention layer with a single query per prediction and one feedforward layer, resulting in linear computational complexity in sequence length and a small parameter count. Numerical experiments demonstrate that TD-TF matches the performance of strong linear baselines on near-linear systems, while significantly outperforming them in nonlinear and chaotic regimes, where it accurately captures long-term dynamics. Validation studies on synthetic signals, unsteady aerodynamics, the Lorenz '63 system, and a reaction-diffusion model show that TD-TF preserves the interpretability and efficiency of linear models while providing substantially enhanced expressive power for complex dynamics.

new Beyond Correctness: Learning Robust Reasoning via Transfer

Authors: Hyunseok Lee, Soheil Abbasloo, Jihoon Tack, Jinwoo Shin

Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has recently strengthened LLM reasoning, but its focus on final answer correctness leaves a critical gap: it does not ensure the robustness of the reasoning process itself. We adopt a simple philosophical view, robust reasoning should remain useful beyond the mind that produced it, and treat reasoning as a form of meaning transfer that must survive truncation, reinterpretation, and continuation. Building on this principle, we introduce Reinforcement Learning with Transferable Reward (RLTR), which operationalizes robustness via transfer reward that tests whether a partial reasoning prefix from one model can guide a separate model to the correct answer. This encourages LLMs to produce reasoning that is stable, interpretable, and genuinely generalizable. Our approach improves sampling consistency while improving final answer accuracy, and it reaches comparable performance in substantially fewer training steps. For example, on MATH500, RLTR achieves a +3.6%p gain in Maj@64 compared to RLVR and matches RLVR's average accuracy with roughly 2.5x fewer training steps, providing both more reliable reasoning and significantly more sample efficient.

new Contextual Rollout Bandits for Reinforcement Learning with Verifiable Rewards

Authors: Xiaodong Lu, Xiaohan Wang, Jiajun Chai, Guojun Yin, Wei Lin, Zhijun Chen, Yu Luo, Fuzhen Zhuang, Yikun Ban, Deqing Wang

Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) is an effective paradigm for improving the reasoning capabilities of large language models. However, existing RLVR methods utilize rollouts in an indiscriminate and short-horizon manner: responses of heterogeneous quality within each prompt are treated uniformly, and historical rollouts are discarded after a single use. This leads to noisy supervision, poor sample efficiency, and suboptimal policy updates. We address these issues by formulating rollout scheduling in RLVR as a contextual bandit problem and proposing a unified neural scheduling framework that adaptively selects high-value rollouts throughout training. Each rollout is treated as an arm whose reward is defined by the induced performance gain between consecutive optimization steps. The resulting scheduler supports both noise-aware intra-group selection and adaptive global reuse of historical rollouts within a single principled framework. We provide theoretical justification by deriving sublinear regret bounds and showing that enlarging the rollout buffer improves the achievable performance upper bound. Experiments on six mathematical reasoning benchmarks demonstrate consistent gains in performance and training efficiency across multiple RLVR optimization methods.

new Is Meta-Path Attention an Explanation? Evidence of Alignment and Decoupling in Heterogeneous GNNs

Authors: Maiqi Jiang, Noman Ali, Yiran Ding, Yanfu Zhang

Abstract: Meta-path-based heterogeneous graph neural networks aggregate over meta-path-induced views, and their semantic-level attention over meta-path channels is widely used as a narrative for ``which semantics matter.'' We study this assumption empirically by asking: when does meta-path attention reflect meta-path importance, and when can it decouple? A key challenge is that most post-hoc GNN explainers are designed for homogeneous graphs, and naive adaptations to heterogeneous neighborhoods can mix semantics and confound perturbations. To enable a controlled empirical analysis, we introduce MetaXplain, a meta-path-aware post-hoc explanation protocol that applies existing explainers in the native meta-path view domain via (i) view-factorized explanations, (ii) schema-valid channel-wise perturbations, and (iii) fusion-aware attribution, without modifying the underlying predictor. We benchmark representative gradient-, perturbation-, and Shapley-style explainers on ACM, DBLP, and IMDB with HAN and HAN-GCN, comparing against xPath and type-matched random baselines under standard faithfulness metrics. To quantify attention reliability, we propose Meta-Path Attention--Explanation Alignment (MP-AEA), which measures rank correlation between learned attention weights and explanation-derived meta-path contribution scores across random runs. Our results show that meta-path-aware explanations typically outperform random controls, while MP-AEA reveals both high-alignment and statistically significant decoupling regimes depending on the dataset and backbone; moreover, retraining on explanation-induced subgraphs often preserves, and in some noisy regimes improves, predictive performance, suggesting an explanation-as-denoising effect.

new Bridging Academia and Industry: A Comprehensive Benchmark for Attributed Graph Clustering

Authors: Yunhui Liu, Pengyu Qiu, Yu Xing, Yongchao Liu, Peng Du, Chuntao Hong, Jiajun Zheng, Tao Zheng, Tieke He

Abstract: Attributed Graph Clustering (AGC) is a fundamental unsupervised task that integrates structural topology and node attributes to uncover latent patterns in graph-structured data. Despite its significance in industrial applications such as fraud detection and user segmentation, a significant chasm persists between academic research and real-world deployment. Current evaluation protocols suffer from the small-scale, high-homophily citation datasets, non-scalable full-batch training paradigms, and a reliance on supervised metrics that fail to reflect performance in label-scarce environments. To bridge these gaps, we present PyAGC, a comprehensive, production-ready benchmark and library designed to stress-test AGC methods across diverse scales and structural properties. We unify existing methodologies into a modular Encode-Cluster-Optimize framework and, for the first time, provide memory-efficient, mini-batch implementations for a wide array of state-of-the-art AGC algorithms. Our benchmark curates 12 diverse datasets, ranging from 2.7K to 111M nodes, specifically incorporating industrial graphs with complex tabular features and low homophily. Furthermore, we advocate for a holistic evaluation protocol that mandates unsupervised structural metrics and efficiency profiling alongside traditional supervised metrics. Battle-tested in high-stakes industrial workflows at Ant Group, this benchmark offers the community a robust, reproducible, and scalable platform to advance AGC research towards realistic deployment. The code and resources are publicly available via GitHub (https://github.com/Cloudy1225/PyAGC), PyPI (https://pypi.org/project/pyagc), and Documentation (https://pyagc.readthedocs.io).

URLs: https://github.com/Cloudy1225/PyAGC),, https://pypi.org/project/pyagc),, https://pyagc.readthedocs.io).

new Causal Schr\"odinger Bridges: Constrained Optimal Transport on Structural Manifolds

Authors: Rui Wu, Li YongJun

Abstract: Generative modeling typically seeks the path of least action via deterministic flows (ODE). While effective for in-distribution tasks, we argue that these deterministic paths become brittle under causal interventions, which often require transporting probability mass across low-density regions ("off-manifold") where the vector field is ill-defined. This leads to numerical instability and spurious correlations. In this work, we introduce the Causal Schr\"odinger Bridge (CSB), a framework that reformulates counterfactual inference as Entropic Optimal Transport. Unlike deterministic approaches that require strict invertibility, CSB leverages diffusion processes (SDEs) to robustly "tunnel" through support mismatches while strictly enforcing structural admissibility constraints. We prove the Structural Decomposition Theorem, showing that the global high-dimensional bridge factorizes into local, robust transitions. Empirical validation on high-dimensional interventions (Morpho-MNIST) demonstrates that CSB significantly outperforms deterministic baselines in structural consistency, particularly in regimes of strong, out-of-distribution treatments.

new Rho-Perfect: Correlation Ceiling For Subjective Evaluation Datasets

Authors: Fredrik Cumlin

Abstract: Subjective ratings contain inherent noise that limits the model-human correlation, but this reliability issue is rarely quantified. In this paper, we present $\rho$-Perfect, a practical estimation of the highest achievable correlation of a model on subjectively rated datasets. We define $\rho$-Perfect to be the correlation between a perfect predictor and human ratings, and derive an estimate of the value based on heteroscedastic noise scenarios, a common occurrence in subjectively rated datasets. We show that $\rho$-Perfect squared estimates test-retest correlation and use this to validate the estimate. We demonstrate the use of $\rho$-Perfect on a speech quality dataset and show how the measure can distinguish between model limitations and data quality issues.

new Stateless Yet Not Forgetful: Implicit Memory as a Hidden Channel in LLMs

Authors: Ahmed Salem, Andrew Paverd, Sahar Abdelnabi

Abstract: Large language models (LLMs) are commonly treated as stateless: once an interaction ends, no information is assumed to persist unless it is explicitly stored and re-supplied. We challenge this assumption by introducing implicit memory-the ability of a model to carry state across otherwise independent interactions by encoding information in its own outputs and later recovering it when those outputs are reintroduced as input. This mechanism does not require any explicit memory module, yet it creates a persistent information channel across inference requests. As a concrete demonstration, we introduce a new class of temporal backdoors, which we call time bombs. Unlike conventional backdoors that activate on a single trigger input, time bombs activate only after a sequence of interactions satisfies hidden conditions accumulated via implicit memory. We show that such behavior can be induced today through straightforward prompting or fine-tuning. Beyond this case study, we analyze broader implications of implicit memory, including covert inter-agent communication, benchmark contamination, targeted manipulation, and training-data poisoning. Finally, we discuss detection challenges and outline directions for stress-testing and evaluation, with the goal of anticipating and controlling future developments. To promote future research, we release code and data at: https://github.com/microsoft/implicitMemory.

URLs: https://github.com/microsoft/implicitMemory.

new M-Loss: Quantifying Model Merging Compatibility with Limited Unlabeled Data

Authors: Tiantong Wang, Yiyang Duan, Haoyu Chen, Tiantong Wu, Wei Yang Bryan Lim

Abstract: Training of large-scale models is both computationally intensive and often constrained by the availability of labeled data. Model merging offers a compelling alternative by directly integrating the weights of multiple source models without requiring additional data or extensive training. However, conventional model merging techniques, such as parameter averaging, often suffer from the unintended combination of non-generalizable features, especially when source models exhibit significant weight disparities. Comparatively, model ensembling generally provides more stable and superior performance that aggregates multiple models by averaging outputs. However, it incurs higher inference costs and increased storage requirements. While previous studies experimentally showed the similarities between model merging and ensembling, theoretical evidence and evaluation metrics remain lacking. To address this gap, we introduce Merging-ensembling loss (M-Loss), a novel evaluation metric that quantifies the compatibility of merging source models using very limited unlabeled data. By measuring the discrepancy between parameter averaging and model ensembling at layer and node levels, M-Loss facilitates more effective merging strategies. Specifically, M-Loss serves both as a quantitative criterion of the theoretical feasibility of model merging, and a guide for parameter significance in model pruning. Our theoretical analysis and empirical evaluations demonstrate that incorporating M-Loss into the merging process significantly improves the alignment between merged models and model ensembling, providing a scalable and efficient framework for accurate model consolidation.

new An arithmetic method algorithm optimizing k-nearest neighbors compared to regression algorithms and evaluated on real world data sources

Authors: Theodoros Anagnostopoulos, Evanthia Zervoudi, Christos Anagnostopoulos, Apostolos Christopoulos, Bogdan Wierzbinski

Abstract: Linear regression analysis focuses on predicting a numeric regressand value based on certain regressor values. In this context, k-Nearest Neighbors (k-NN) is a common non-parametric regression algorithm, which achieves efficient performance when compared with other algorithms in literature. In this research effort an optimization of the k-NN algorithm is proposed by exploiting the potentiality of an introduced arithmetic method, which can provide solutions for linear equations involving an arbitrary number of real variables. Specifically, an Arithmetic Method Algorithm (AMA) is adopted to assess the efficiency of the introduced arithmetic method, while an Arithmetic Method Regression (AMR) algorithm is proposed as an optimization of k-NN adopting the potentiality of AMA. Such algorithm is compared with other regression algorithms, according to an introduced optimal inference decision rule, and evaluated on certain real world data sources, which are publicly available. Results are promising since the proposed AMR algorithm has comparable performance with the other algorithms, while in most cases it achieves better performance than the k-NN. The output results indicate that introduced AMR is an optimization of k-NN.

new Modeling Score Approximation Errors in Diffusion Models via Forward SPDEs

Authors: Junsu Seo

Abstract: This study investigates the dynamics of Score-based Generative Models (SGMs) by treating the score estimation error as a stochastic source driving the Fokker-Planck equation. Departing from particle-centric SDE analyses, we employ an SPDE framework to model the evolution of the probability density field under stochastic drift perturbations. Under a simplified setting, we utilize this framework to interpret the robustness of generative models through the lens of geometric stability and displacement convexity. Furthermore, we introduce a candidate evaluation metric derived from the quadratic variation of the SPDE solution projected onto a radial test function. Preliminary observations suggest that this metric remains effective using only the initial 10% of the sampling trajectory, indicating a potential for computational efficiency.

new Conditional Sequence Modeling for Safe Reinforcement Learning

Authors: Wensong Bai, Chao Zhang, Qihang Xu, Chufan Chen, Chenhao Zhou, Hui Qian

Abstract: Offline safe reinforcement learning (RL) aims to learn policies from a fixed dataset while maximizing performance under cumulative cost constraints. In practice, deployment requirements often vary across scenarios, necessitating a single policy that can adapt zero-shot to different cost thresholds. However, most existing offline safe RL methods are trained under a pre-specified threshold, yielding policies with limited generalization and deployment flexibility across cost thresholds. Motivated by recent progress in conditional sequence modeling (CSM), which enables flexible goal-conditioned control by specifying target returns, we propose RCDT, a CSM-based method that supports zero-shot deployment across multiple cost thresholds within a single trained policy. RCDT is the first CSM-based offline safe RL algorithm that integrates a Lagrangian-style cost penalty with an auto-adaptive penalty coefficient. To avoid overly conservative behavior and achieve a more favorable return--cost trade-off, a reward--cost-aware trajectory reweighting mechanism and Q-value regularization are further incorporated. Extensive experiments on the DSRL benchmark demonstrate that RCDT consistently improves return--cost trade-offs over representative baselines, advancing the state-of-the-art in offline safe RL.

new Predicting Future Utility: Global Combinatorial Optimization for Task-Agnostic KV Cache Eviction

Authors: Ziyao Tang, Pengkun Jiao, Xinhang Chen, Wei Liu, Shiyong Li, Jingjing Chen

Abstract: Given the quadratic complexity of attention, KV cache eviction is vital to accelerate model inference. Current KV cache eviction methods typically rely on instantaneous heuristic metrics, implicitly assuming that score magnitudes are consistent proxies for importance across all heads. However, this overlooks the heterogeneity in predictive fidelity across attention heads. While certain heads prioritize the instantaneous contribution of tokens, others are dedicated to capturing long-horizon utility. In this paper, we propose that optimal budget allocation should be governed by the marginal utility in preserving long-term semantic information. Based on this insight, we propose LU-KV, a novel framework that optimizes head-level budget allocation through a convex-hull relaxation and a marginal-utility-based greedy solver to achieve near-optimal precision. Furthermore, we implement a data-driven offline profiling protocol to facilitate the practical deployment of LU-KV. Extensive evaluations on LongBench and RULER benchmarks demonstrate that LU-KV achieves an 80% reduction in KV cache size with minimal performance degradation, while simultaneously reducing inference latency and GPU memory footprint.

new FairRARI: A Plug and Play Framework for Fairness-Aware PageRank

Authors: Emmanouil Kariotakis, Aritra Konar

Abstract: PageRank (PR) is a fundamental algorithm in graph machine learning tasks. Owing to the increasing importance of algorithmic fairness, we consider the problem of computing PR vectors subject to various group-fairness criteria based on sensitive attributes of the vertices. At present, principled algorithms for this problem are lacking - some cannot guarantee that a target fairness level is achieved, while others do not feature optimality guarantees. In order to overcome these shortcomings, we put forth a unified in-processing convex optimization framework, termed FairRARI, for tackling different group-fairness criteria in a ``plug and play'' fashion. Leveraging a variational formulation of PR, the framework computes fair PR vectors by solving a strongly convex optimization problem with fairness constraints, thereby ensuring that a target fairness level is achieved. We further introduce three different fairness criteria which can be efficiently tackled using FairRARI to compute fair PR vectors with the same asymptotic time-complexity as the original PR algorithm. Extensive experiments on real-world datasets showcase that FairRARI outperforms existing methods in terms of utility, while achieving the desired fairness levels across multiple vertex groups; thereby highlighting its effectiveness.

new SDFed: Bridging Local Global Discrepancy via Subspace Refinement and Divergence Control in Federated Prompt Learning

Authors: Yicheng Di, Wei Yuan, Tieke He, Zhanjie Zhang, Ao Ma, Yuan Liu, Hongzhi Yin

Abstract: Vision-language pretrained models offer strong transferable representations, yet adapting them in privacy-sensitive multi-party settings is challenging due to the high communication cost of federated optimization and the limited local data on clients. Federated prompt learning mitigates this issue by keeping the VLPM backbone frozen and collaboratively training lightweight prompt parameters. However, existing approaches typically enforce a unified prompt structure and length across clients, which is inadequate under practical client heterogeneity in both data distributions and system resources, and may further introduce conflicts between globally shared and locally optimal knowledge. To address these challenges, we propose \textbf{SDFed}, a heterogeneous federated prompt learning framework that bridges Local-Global Discrepancy via Subspace Refinement and Divergence Control. SDFed maintains a fixed-length global prompt for efficient aggregation while allowing each client to learn a variable-length local prompt to better match its data characteristics and capacity. To mitigate local-global conflicts and facilitate effective knowledge transfer, SDFed introduces a subspace refinement method for local prompts and an information retention and divergence control strategy that preserves key local information while maintaining appropriate separability between global and local representations. Extensive experiments on several datasets demonstrate that SDFed consistently improves performance and robustness in heterogeneous federated settings.

new TFMLinker: Universal Link Predictor by Graph In-Context Learning with Tabular Foundation Models

Authors: Tianyin Liao, Chunyu Hu, Yicheng Sui, Xingxuan Zhang, Peng Cui, Jianxin Li, Ziwei Zhang

Abstract: Link prediction is a fundamental task in graph machine learning with widespread applications such as recommendation systems, drug discovery, knowledge graphs, etc. In the foundation model era, how to develop universal link prediction methods across datasets and domains becomes a key problem, with some initial attempts adopting Graph Foundation Models utilizing Graph Neural Networks and Large Language Models. However, the existing methods face notable limitations, including limited pre-training scale or heavy reliance on textual information. Motivated by the success of tabular foundation models (TFMs) in achieving universal prediction across diverse tabular datasets, we explore an alternative approach by TFMs, which are pre-trained on diverse synthetic datasets sampled from structural causal models and support strong in-context learning independent of textual attributes. Nevertheless, adapting TFMs for link prediction faces severe technical challenges such as how to obtain the necessary context and capture link-centric topological information. To solve these challenges, we propose TFMLinker (Tabular Foundation Model for Link Predictor), aiming to leverage the in-context learning capabilities of TFMs to perform link prediction across diverse graphs without requiring dataset-specific fine-tuning. Specifically, we first develop a prototype-augmented local-global context module to construct context that captures both graph-specific and cross-graph transferable patterns. Next, we design a universal topology-aware link encoder to capture link-centric topological information and generate link representations as inputs for the TFM. Finally, we employ the TFM to predict link existence through in-context learning. Experiments on 6 graph benchmarks across diverse domains demonstrate the superiority of our method over state-of-the-art baselines without requiring dataset-specific finetuning.

new Breaking the Grid: Distance-Guided Reinforcement Learning in Large Discrete and Hybrid Action Spaces

Authors: Heiko Hoppe, Fabian Akkerman, Wouter van Heeswijk, Maximilian Schiffer

Abstract: Reinforcement Learning is increasingly applied to logistics, scheduling, and recommender systems, but standard algorithms struggle with the curse of dimensionality in such large discrete action spaces. Existing algorithms typically rely on restrictive grid-based structures or computationally expensive nearest-neighbor searches, limiting their effectiveness in high-dimensional or irregularly structured domains. We propose Distance-Guided Reinforcement Learning (DGRL), combining Sampled Dynamic Neighborhoods (SDN) and Distance-Based Updates (DBU) to enable efficient RL in spaces with up to 10$^\text{20}$ actions. Unlike prior methods, SDN leverages a semantic embedding space to perform stochastic volumetric exploration, provably providing full support over a local trust region. Complementing this, DBU transforms policy optimization into a stable regression task, decoupling gradient variance from action space cardinality and guaranteeing monotonic policy improvement. DGRL naturally generalizes to hybrid continuous-discrete action spaces without requiring hierarchical dependencies. We demonstrate performance improvements of up to 66% against state-of-the-art benchmarks across regularly and irregularly structured environments, while simultaneously improving convergence speed and computational complexity.

new ERIS: Enhancing Privacy and Communication Efficiency in Serverless Federated Learning

Authors: Dario Fenoglio, Pasquale Polverino, Jacopo Quizi, Martin Gjoreski, Marc Langheinrich

Abstract: Scaling federated learning (FL) to billion-parameter models introduces critical trade-offs between communication efficiency, model accuracy, and privacy guarantees. Existing solutions often tackle these challenges in isolation, sacrificing accuracy or relying on costly cryptographic tools. We propose ERIS, a serverless FL framework that balances privacy and accuracy while eliminating the server bottleneck and distributing the communication load. ERIS combines a model partitioning strategy, distributing aggregation across multiple client-side aggregators, with a distributed shifted gradient compression mechanism. We theoretically prove that ERIS (i) converges at the same rate as FedAvg under standard assumptions, and (ii) bounds mutual information leakage inversely with the number of aggregators, enabling strong privacy guarantees with no accuracy degradation. Experiments across image and text tasks, including large language models, confirm that ERIS achieves FedAvg-level accuracy while substantially reducing communication cost and improving robustness to membership inference and reconstruction attacks, without relying on heavy cryptography or noise injection.

new Sparse Models, Sparse Safety: Unsafe Routes in Mixture-of-Experts LLMs

Authors: Yukun Jiang, Hai Huang, Mingjie Li, Yage Zhang, Michael Backes, Yang Zhang

Abstract: By introducing routers to selectively activate experts in Transformer layers, the mixture-of-experts (MoE) architecture significantly reduces computational costs in large language models (LLMs) while maintaining competitive performance, especially for models with massive parameters. However, prior work has largely focused on utility and efficiency, leaving the safety risks associated with this sparse architecture underexplored. In this work, we show that the safety of MoE LLMs is as sparse as their architecture by discovering unsafe routes: routing configurations that, once activated, convert safe outputs into harmful ones. Specifically, we first introduce the Router Safety importance score (RoSais) to quantify the safety criticality of each layer's router. Manipulation of only the high-RoSais router(s) can flip the default route into an unsafe one. For instance, on JailbreakBench, masking 5 routers in DeepSeek-V2-Lite increases attack success rate (ASR) by over 4$\times$ to 0.79, highlighting an inherent risk that router manipulation may naturally occur in MoE LLMs. We further propose a Fine-grained token-layer-wise Stochastic Optimization framework to discover more concrete Unsafe Routes (F-SOUR), which explicitly considers the sequentiality and dynamics of input tokens. Across four representative MoE LLM families, F-SOUR achieves an average ASR of 0.90 and 0.98 on JailbreakBench and AdvBench, respectively. Finally, we outline defensive perspectives, including safety-aware route disabling and router training, as promising directions to safeguard MoE LLMs. We hope our work can inform future red-teaming and safeguarding of MoE LLMs. Our code is provided in https://github.com/TrustAIRLab/UnsafeMoE.

URLs: https://github.com/TrustAIRLab/UnsafeMoE.

new CauScale: Neural Causal Discovery at Scale

Authors: Bo Peng, Sirui Chen, Jiaguo Tian, Yu Qiao, Chaochao Lu

Abstract: Causal discovery is essential for advancing data-driven fields such as scientific AI and data analysis, yet existing approaches face significant time- and space-efficiency bottlenecks when scaling to large graphs. To address this challenge, we present CauScale, a neural architecture designed for efficient causal discovery that scales inference to graphs with up to 1000 nodes. CauScale improves time efficiency via a reduction unit that compresses data embeddings and improves space efficiency by adopting tied attention weights to avoid maintaining axis-specific attention maps. To keep high causal discovery accuracy, CauScale adopts a two-stream design: a data stream extracts relational evidence from high-dimensional observations, while a graph stream integrates statistical graph priors and preserves key structural signals. CauScale successfully scales to 500-node graphs during training, where prior work fails due to space limitations. Across testing data with varying graph scales and causal mechanisms, CauScale achieves 99.6% mAP on in-distribution data and 84.4% on out-of-distribution data, while delivering 4-13,000 times inference speedups over prior methods. Our project page is at https://github.com/OpenCausaLab/CauScale.

URLs: https://github.com/OpenCausaLab/CauScale.

new LEFT: Learnable Fusion of Tri-view Tokens for Unsupervised Time Series Anomaly Detection

Authors: Dezheng Wang, Tong Chen, Guansong Pang, Congyan Chen, Shihua Li, Hongzhi Yin

Abstract: As a fundamental data mining task, unsupervised time series anomaly detection (TSAD) aims to build a model for identifying abnormal timestamps without assuming the availability of annotations. A key challenge in unsupervised TSAD is that many anomalies are too subtle to exhibit detectable deviation in any single view (e.g., time domain), and instead manifest as inconsistencies across multiple views like time, frequency, and a mixture of resolutions. However, most cross-view methods rely on feature or score fusion and do not enforce analysis-synthesis consistency, meaning the frequency branch is not required to reconstruct the time signal through an inverse transform, and vice versa. In this paper, we present Learnable Fusion of Tri-view Tokens (LEFT), a unified unsupervised TSAD framework that models anomalies as inconsistencies across complementary representations. LEFT learns feature tokens from three views of the same input time series: frequency-domain tokens that embed periodicity information, time-domain tokens that capture local dynamics, and multi-scale tokens that learns abnormal patterns at varying time series granularities. By learning a set of adaptive Nyquist-constrained spectral filters, the original time series is rescaled into multiple resolutions and then encoded, allowing these multi-scale tokens to complement the extracted frequency- and time-domain information. When generating the fused representation, we introduce a novel objective that reconstructs fine-grained targets from coarser multi-scale structure, and put forward an innovative time-frequency cycle consistency constraint to explicitly regularize cross-view agreement. Experiments on real-world benchmarks show that LEFT yields the best detection accuracy against SOTA baselines, while achieving a 5x reduction on FLOPs and 8x speed-up for training.

new Projected Gradient Ascent for Efficient Reward-Guided Updates with One-Step Generative Models

Authors: Jisung Hwang, Minhyuk Sung

Abstract: We propose a constrained latent optimization method for reward-guided generation that preserves white Gaussian noise characteristics with negligible overhead. Test-time latent optimization can unlock substantially better reward-guided generations from pretrained generative models, but it is prone to reward hacking that degrades quality and also too slow for practical use. In this work, we make test-time optimization both efficient and reliable by replacing soft regularization with hard white Gaussian noise constraints enforced via projected gradient ascent. Our method applies a closed-form projection after each update to keep the latent vector explicitly noise-like throughout optimization, preventing the drift that leads to unrealistic artifacts. This enforcement adds minimal cost: the projection matches the $O(N \log N)$ complexity of standard algorithms such as sorting or FFT and does not practically increase wall-clock time. In experiments, our approach reaches a comparable Aesthetic Score using only 30% of the wall-clock time required by the SOTA regularization-based method, while preventing reward hacking.

new From Robotics to Sepsis Treatment: Offline RL via Geometric Pessimism

Authors: Sarthak Wanjari

Abstract: Offline Reinforcement Learning (RL) promises the recovery of optimal policies from static datasets, yet it remains susceptible to the overestimation of out-of-distribution (OOD) actions, particularly in fractured and sparse data manifolds.Current solutions necessitates a trade off between computational efficiency and performance. Methods like CQL offers rigorous conservatism but require tremendous compute power while efficient expectile-based methods like IQL often fail to correct OOD errors on pathological datasets, collapsing to Behavioural Cloning. In this work, we propose Geometric Pessimism, a modular, compute-efficient framework that augments standard IQL with density-based penalty derived from k-nearest-neighbour distances in the state-action embedding space. By pre-computing the penalties applied to each state-action pair our method injects OOD conservatism via reward shaping with a O(1) training overhead. Evaluated on the D4Rl MuJoCo benchmark, our method, Geo-IQL outperforms standard IQL on sensitive and unstable medium-replay tasks by over 18 points, while reducing inter-seed variance by 4x. Furthermore, Geo-IQL does not degrade performance on stable manifolds. Crucially, we validate our algorithm on the MIMIC-III Sepsis critical care dataset. While standard IQL collapses to behaviour cloning, Geo-IQL demonstrates active policy improvement. Maintaining safety constraints, achieving 86.4% terminal agreement with clinicians compared to IQL's 75%. Our results suggest that geometric pessimism provides the necessary regularisation to safely overcome local optima in critical, real-world decision systems.

new Two-Stage Data Synthesization: A Statistics-Driven Restricted Trade-off between Privacy and Prediction

Authors: Xiaotong Liu, Shao-Bo Lin, Jun Fan, Ding-Xuan Zhou

Abstract: Synthetic data have gained increasing attention across various domains, with a growing emphasis on their performance in downstream prediction tasks. However, most existing synthesis strategies focus on maintaining statistical information. Although some studies address prediction performance guarantees, their single-stage synthesis designs make it challenging to balance the privacy requirements that necessitate significant perturbations and the prediction performance that is sensitive to such perturbations. We propose a two-stage synthesis strategy. In the first stage, we introduce a synthesis-then-hybrid strategy, which involves a synthesis operation to generate pure synthetic data, followed by a hybrid operation that fuses the synthetic data with the original data. In the second stage, we present a kernel ridge regression (KRR)-based synthesis strategy, where a KRR model is first trained on the original data and then used to generate synthetic outputs based on the synthetic inputs produced in the first stage. By leveraging the theoretical strengths of KRR and the covariant distribution retention achieved in the first stage, our proposed two-stage synthesis strategy enables a statistics-driven restricted privacy--prediction trade-off and guarantee optimal prediction performance. We validate our approach and demonstrate its characteristics of being statistics-driven and restricted in achieving the privacy--prediction trade-off both theoretically and numerically. Additionally, we showcase its generalizability through applications to a marketing problem and five real-world datasets.

new Equalized Generative Treatment: Matching f-divergences for Fairness in Generative Models

Authors: Alexandre Verine, Rafael Pinot, Florian Le Bronnec

Abstract: Fairness is a crucial concern for generative models, which not only reflect but can also amplify societal and cultural biases. Existing fairness notions for generative models are largely adapted from classification and focus on balancing the probability of generating samples from each sensitive group. We show that such criteria are brittle, as they can be met even when different sensitive groups are modeled with widely varying quality. To address this limitation, we introduce a new fairness definition for generative models, termed as equalized generative treatment (EGT), which requires comparable generation quality across all sensitive groups, with quality measured via a reference f-divergence. We further analyze the trade-offs induced by EGT, demonstrating that enforcing fairness constraints necessarily couples the overall model quality to that of the most challenging group to approximate. This indicates that a simple yet efficient min-max fine-tuning method should be able to balance f-divergences across sensitive groups to satisfy EGT. We validate this theoretical insight through a set of experiments on both image and text generation tasks. We demonstrate that min-max methods consistently achieve fairer outcomes compared to other approaches from the literature, while maintaining competitive overall performance for both tasks.

new LLaDA2.1: Speeding Up Text Diffusion via Token Editing

Authors: Tiwei Bie, Maosong Cao, Xiang Cao, Bingsen Chen, Fuyuan Chen, Kun Chen, Lun Du, Daozhuo Feng, Haibo Feng, Mingliang Gong, Zhuocheng Gong, Yanmei Gu, Jian Guan, Kaiyuan Guan, Hongliang He, Zenan Huang, Juyong Jiang, Zhonghui Jiang, Zhenzhong Lan, Chengxi Li, Jianguo Li, Zehuan Li, Huabin Liu, Lin Liu, Guoshan Lu, Yuan Lu, Yuxin Ma, Xingyu Mou, Zhenxuan Pan, Kaida Qiu, Yuji Ren, Jianfeng Tan, Yiding Tian, Zian Wang, Lanning Wei, Tao Wu, Yipeng Xing, Wentao Ye, Liangyu Zha, Tianze Zhang, Xiaolu Zhang, Junbo Zhao, Da Zheng, Hao Zhong, Wanli Zhong, Jun Zhou, Junlin Zhou, Liwang Zhu, Muzhi Zhu, Yihong Zhuang

Abstract: While LLaDA2.0 showcased the scaling potential of 100B-level block-diffusion models and their inherent parallelization, the delicate equilibrium between decoding speed and generation quality has remained an elusive frontier. Today, we unveil LLaDA2.1, a paradigm shift designed to transcend this trade-off. By seamlessly weaving Token-to-Token (T2T) editing into the conventional Mask-to-Token (M2T) scheme, we introduce a joint, configurable threshold-decoding scheme. This structural innovation gives rise to two distinct personas: the Speedy Mode (S Mode), which audaciously lowers the M2T threshold to bypass traditional constraints while relying on T2T to refine the output; and the Quality Mode (Q Mode), which leans into conservative thresholds to secure superior benchmark performances with manageable efficiency degrade. Furthering this evolution, underpinned by an expansive context window, we implement the first large-scale Reinforcement Learning (RL) framework specifically tailored for dLLMs, anchored by specialized techniques for stable gradient estimation. This alignment not only sharpens reasoning precision but also elevates instruction-following fidelity, bridging the chasm between diffusion dynamics and complex human intent. We culminate this work by releasing LLaDA2.1-Mini (16B) and LLaDA2.1-Flash (100B). Across 33 rigorous benchmarks, LLaDA2.1 delivers strong task performance and lightning-fast decoding speed. Despite its 100B volume, on coding tasks it attains an astounding 892 TPS on HumanEval+, 801 TPS on BigCodeBench, and 663 TPS on LiveCodeBench.

new Dashed Line Defense: Plug-And-Play Defense Against Adaptive Score-Based Query Attacks

Authors: Yanzhang Fu, Zizheng Guo, Jizhou Luo

Abstract: Score-based query attacks pose a serious threat to deep learning models by crafting adversarial examples (AEs) using only black-box access to model output scores, iteratively optimizing inputs based on observed loss values. While recent runtime defenses attempt to disrupt this process via output perturbation, most either require access to model parameters or fail when attackers adapt their tactics. In this paper, we first reveal that even the state-of-the-art plug-and-play defense can be bypassed by adaptive attacks, exposing a critical limitation of existing runtime defenses. We then propose Dashed Line Defense (DLD), a plug-and-play post-processing method specifically designed to withstand adaptive query strategies. By introducing ambiguity in how the observed loss reflects the true adversarial strength of candidate examples, DLD prevents attackers from reliably analyzing and adapting their queries, effectively disrupting the AE generation process. We provide theoretical guarantees of DLD's defense capability and validate its effectiveness through experiments on ImageNet, demonstrating that DLD consistently outperforms prior defenses--even under worst-case adaptive attacks--while preserving the model's predicted labels.

new The Theory and Practice of MAP Inference over Non-Convex Constraints

Authors: Leander Kurscheidt, Gabriele Masina, Roberto Sebastiani, Antonio Vergari

Abstract: In many safety-critical settings, probabilistic ML systems have to make predictions subject to algebraic constraints, e.g., predicting the most likely trajectory that does not cross obstacles. These real-world constraints are rarely convex, nor the densities considered are (log-)concave. This makes computing this constrained maximum a posteriori (MAP) prediction efficiently and reliably extremely challenging. In this paper, we first investigate under which conditions we can perform constrained MAP inference over continuous variables exactly and efficiently and devise a scalable message-passing algorithm for this tractable fragment. Then, we devise a general constrained MAP strategy that interleaves partitioning the domain into convex feasible regions with numerical constrained optimization. We evaluate both methods on synthetic and real-world benchmarks, showing our % approaches outperform constraint-agnostic baselines, and scale to complex densities intractable for SoTA exact solvers.

new CompilerKV: Risk-Adaptive KV Compression via Offline Experience Compilation

Authors: Ning Yang, Chengzhi Wang, Yibo Liu, Baoliang Tian, Haijun Zhang

Abstract: Large Language Models (LLMs) in long-context scenarios are severely constrained by the linear growth of Key-Value (KV) cache memory. Existing KV compression methods rely either on static thresholds and attention-only heuristics or on coarse memory budget allocation. Under tight memory budgets, these methods overlook two key factors: prompt-dependent variation in compression risk and functional heterogeneity across attention heads, which destabilize token selection and lead to tail failures. To address these challenges, we propose CompilerKV, a risk-adaptive and head-aware compression framework that compiles offline experience into reusable decision tables for prefill-only deployment. CompilerKV integrates two key synergistic components: (i) a Head Heterogeneity Table, learned via offline contextual bandits, which assigns head-specific reliability weights to govern functional differences across attention heads explicitly; and (ii) a Risk-Adaptive Threshold Gating mechanism that jointly models attention entropy and local perplexity, transforming prompt-level risk into deployable retention thresholds. Experiments on LongBench show CompilerKV dominates SOTA methods under a 512-token budget, recovering 97.7\% of FullKV performance while achieving up to +5.2 points gain over the strongest competitor.

new Learning To Sample From Diffusion Models Via Inverse Reinforcement Learning

Authors: Constant Bourdrez, Alexandre V\'erine, Olivier Capp\'e

Abstract: Diffusion models generate samples through an iterative denoising process, guided by a neural network. While training the denoiser on real-world data is computationally demanding, the sampling procedure itself is more flexible. This adaptability serves as a key lever in practice, enabling improvements in both the quality of generated samples and the efficiency of the sampling process. In this work, we introduce an inverse reinforcement learning framework for learning sampling strategies without retraining the denoiser. We formulate the diffusion sampling procedure as a discrete-time finite-horizon Markov Decision Process, where actions correspond to optional modifications of the sampling dynamics. To optimize action scheduling, we avoid defining an explicit reward function. Instead, we directly match the target behavior expected from the sampler using policy gradient techniques. We provide experimental evidence that this approach can improve the quality of samples generated by pretrained diffusion models and automatically tune sampling hyperparameters.

new SoK: The Pitfalls of Deep Reinforcement Learning for Cybersecurity

Authors: Shae McFadden, Myles Foley, Elizabeth Bates, Ilias Tsingenopoulos, Sanyam Vyas, Vasilios Mavroudis, Chris Hicks, Fabio Pierazzi

Abstract: Deep Reinforcement Learning (DRL) has achieved remarkable success in domains requiring sequential decision-making, motivating its application to cybersecurity problems. However, transitioning DRL from laboratory simulations to bespoke cyber environments can introduce numerous issues. This is further exacerbated by the often adversarial, non-stationary, and partially-observable nature of most cybersecurity tasks. In this paper, we identify and systematize 11 methodological pitfalls that frequently occur in DRL for cybersecurity (DRL4Sec) literature across the stages of environment modeling, agent training, performance evaluation, and system deployment. By analyzing 66 significant DRL4Sec papers (2018-2025), we quantify the prevalence of each pitfall and find an average of over five pitfalls per paper. We demonstrate the practical impact of these pitfalls using controlled experiments in (i) autonomous cyber defense, (ii) adversarial malware creation, and (iii) web security testing environments. Finally, we provide actionable recommendations for each pitfall to support the development of more rigorous and deployable DRL-based security systems.

new Reasoning aligns language models to human cognition

Authors: Gon\c{c}alo Guiomar, Elia Torre, Pehuen Moure, Victoria Shavina, Mario Giulianelli, Shih-Chii Liu, Valerio Mante

Abstract: Do language models make decisions under uncertainty like humans do, and what role does chain-of-thought (CoT) reasoning play in the underlying decision process? We introduce an active probabilistic reasoning task that cleanly separates sampling (actively acquiring evidence) from inference (integrating evidence toward a decision). Benchmarking humans and a broad set of contemporary large language models against near-optimal reference policies reveals a consistent pattern: extended reasoning is the key determinant of strong performance, driving large gains in inference and producing belief trajectories that become strikingly human-like, while yielding only modest improvements in active sampling. To explain these differences, we fit a mechanistic model that captures systematic deviations from optimal behavior via four interpretable latent variables: memory, strategy, choice bias, and occlusion awareness. This model places humans and models in a shared low-dimensional cognitive space, reproduces behavioral signatures across agents, and shows how chain-of-thought shifts language models toward human-like regimes of evidence accumulation and belief-to-choice mapping, tightening alignment in inference while leaving a persistent gap in information acquisition.

new Trapped by simplicity: When Transformers fail to learn from noisy features

Authors: Evan Peters, Ando Deng, Matheus H. Zambianco, Devin Blankespoor, Achim Kempf

Abstract: Noise is ubiquitous in data used to train large language models, but it is not well understood whether these models are able to correctly generalize to inputs generated without noise. Here, we study noise-robust learning: are transformers trained on data with noisy features able to find a target function that correctly predicts labels for noiseless features? We show that transformers succeed at noise-robust learning for a selection of $k$-sparse parity and majority functions, compared to LSTMs which fail at this task for even modest feature noise. However, we find that transformers typically fail at noise-robust learning of random $k$-juntas, especially when the boolean sensitivity of the optimal solution is smaller than that of the target function. We argue that this failure is due to a combination of two factors: transformers' bias toward simpler functions, combined with an observation that the optimal function for noise-robust learning typically has lower sensitivity than the target function for random boolean functions. We test this hypothesis by exploiting transformers' simplicity bias to trap them in an incorrect solution, but show that transformers can escape this trap by training with an additional loss term penalizing high-sensitivity solutions. Overall, we find that transformers are particularly ineffective for learning boolean functions in the presence of feature noise.

new QUOKA: Query-Oriented KV Selection For Efficient LLM Prefill

Authors: Dalton Jones, Junyoung Park, Matthew Morse, Mingu Lee, Chris Lott, Harper Langston

Abstract: We present QUOKA: Query-oriented KV selection for efficient attention, a training-free and hardware agnostic sparse attention algorithm for accelerating transformer inference under chunked prefill. While many queries focus on a smaller group of keys in the attention operator, we observe that queries with low cosine similarity with respect to the mean query interact more strongly with more keys and have the greatest contribution to final attention logits. By prioritizing these low cosine similarity queries, the behavior of full attention during the prefill stage can be closely approximated. QUOKA leverages this observation, accelerating attention by (1) first retaining a small set of representative queries and (2) then subselectin the keys most aligned with those queries. Through experiments on Needle-In-A-Haystack, LongBench, RULER, and Math500, we show that, while realizing a 3x reduction in time-to-first-token, 5x speedup in attention on Nvidia GPUs and up to nearly a 7x speedup on Intel Xeon CPUs, QUOKA achieves near-baseline accuracy, utilizing 88% fewer key-value pairs per attention evaluation.

new Data Reconstruction: Identifiability and Optimization with Sample Splitting

Authors: Yujie Shen, Zihan Wang, Jian Qian, Qi Lei

Abstract: Training data reconstruction from KKT conditions has shown striking empirical success, yet it remains unclear when the resulting KKT equations have unique solutions and, even in identifiable regimes, how to reliably recover solutions by optimization. This work hereby focuses on these two complementary questions: identifiability and optimization. On the identifiability side, we discuss the sufficient conditions for KKT system of two-layer networks with polynomial activations to uniquely determine the training data, providing a theoretical explanation of when and why reconstruction is possible. On the optimization side, we introduce sample splitting, a curvature-aware refinement step applicable to general reconstruction objectives (not limited to KKT-based formulations): it creates additional descent directions to escape poor stationary points and refine solutions. Experiments demonstrate that augmenting several existing reconstruction methods with sample splitting consistently improves reconstruction performance.

new Foundation Inference Models for Ordinary Differential Equations

Authors: Maximilian Mauel, Johannes R. H\"ubers, David Berghaus, Patrick Seifner, Ramses J. Sanchez

Abstract: Ordinary differential equations (ODEs) are central to scientific modelling, but inferring their vector fields from noisy trajectories remains challenging. Current approaches such as symbolic regression, Gaussian process (GP) regression, and Neural ODEs often require complex training pipelines and substantial machine learning expertise, or they depend strongly on system-specific prior knowledge. We propose FIM-ODE, a pretrained Foundation Inference Model that amortises low-dimensional ODE inference by predicting the vector field directly from noisy trajectory data in a single forward pass. We pretrain FIM-ODE on a prior distribution over ODEs with low-degree polynomial vector fields and represent the target field with neural operators. FIM-ODE achieves strong zero-shot performance, matching and often improving upon ODEFormer, a recent pretrained symbolic baseline, across a range of regimes despite using a simpler pretraining prior distribution. Pretraining also provides a strong initialisation for finetuning, enabling fast and stable adaptation that outperforms modern neural and GP baselines without requiring machine learning expertise.

new On the Expressive Power of GNNs for Boolean Satisfiability

Authors: Saku Peltonen, Roger Wattenhofer

Abstract: Machine learning approaches to solving Boolean Satisfiability (SAT) aim to replace handcrafted heuristics with learning-based models. Graph Neural Networks have emerged as the main architecture for SAT solving, due to the natural graph representation of Boolean formulas. We analyze the expressive power of GNNs for SAT solving through the lens of the Weisfeiler-Leman (WL) test. As our main result, we prove that the full WL hierarchy cannot, in general, distinguish between satisfiable and unsatisfiable instances. We show that indistinguishability under higher-order WL carries over to practical limitations for WL-bounded solvers that set variables sequentially. We further study the expressivity required for several important families of SAT instances, including regular, random and planar instances. To quantify expressivity needs in practice, we conduct experiments on random instances from the G4SAT benchmark and industrial instances from the International SAT Competition. Our results suggest that while random instances are largely distinguishable, industrial instances often require more expressivity to predict a satisfying assignment.

new Central Dogma Transformer II: An AI Microscope for Understanding Cellular Regulatory Mechanisms

Authors: Nobuyuki Ota

Abstract: Current biological AI models lack interpretability -- their internal representations do not correspond to biological relationships that researchers can examine. Here we present CDT-II, an "AI microscope" whose attention maps are directly interpretable as regulatory structure. By mirroring the central dogma in its architecture, each attention mechanism corresponds to a specific biological relationship: DNA self-attention for genomic relationships, RNA self-attention for gene co-regulation, and DNA-to-RNA cross-attention for transcriptional control. Using only genomic embeddings and raw per-cell expression, CDT-II enables experimental biologists to observe regulatory networks in their own data. Applied to K562 CRISPRi data, CDT-II predicts perturbation effects (per-gene mean $r = 0.84$) and recovers the GFI1B regulatory network without supervision (6.6-fold enrichment, $P = 3.5 \times 10^{-17}$). Two distinct attention mechanisms converge on an RNA processing module ($P = 1 \times 10^{-16}$). CDT-II establishes mechanism-oriented AI as an alternative to task-oriented approaches, revealing regulatory structure rather than merely optimizing predictions.

new Redundancy-Free View Alignment for Multimodal Human Activity Recognition with Arbitrarily Missing Views

Authors: Duc-Anh Nguyen, Nhien-An Le-Khac

Abstract: Multimodal multiview learning seeks to integrate information from diverse sources to enhance task performance. Existing approaches often struggle with flexible view configurations, including arbitrary view combinations, numbers of views, and heterogeneous modalities. Focusing on the context of human activity recognition, we propose RALIS, a model that combines multiview contrastive learning with a mixture-of-experts module to support arbitrary view availability during both training and inference. Instead of trying to reconstruct missing views, an adjusted center contrastive loss is used for self-supervised representation learning and view alignment, mitigating the impact of missing views on multiview fusion. This loss formulation allows for the integration of view weights to account for view quality. Additionally, it reduces computational complexity from $O(V^2)$ to $O(V)$, where $V$ is the number of views. To address residual discrepancies not captured by contrastive learning, we employ a mixture-of-experts module with a specialized load balancing strategy, tasked with adapting to arbitrary view combinations. We highlight the geometric relationship among components in our model and how they combine well in the latent space. RALIS is validated on four datasets encompassing inertial and human pose modalities, with the number of views ranging from three to nine, demonstrating its performance and flexibility.

new HoGS: Homophily-Oriented Graph Synthesis for Local Differentially Private GNN Training

Authors: Wen Xu, Zhetao Li, Yong Xiao, Pengpeng Qiao, Mianxiong Dong, Kaoru Ota

Abstract: Graph neural networks (GNNs) have demonstrated remarkable performance in various graph-based machine learning tasks by effectively modeling high-order interactions between nodes. However, training GNNs without protection may leak sensitive personal information in graph data, including links and node features. Local differential privacy (LDP) is an advanced technique for protecting data privacy in decentralized networks. Unfortunately, existing local differentially private GNNs either only preserve link privacy or suffer significant utility loss in the process of preserving link and node feature privacy. In this paper, we propose an effective LDP framework, called HoGS, which trains GNNs with link and feature protection by generating a synthetic graph. Concretely, HoGS first collects the link and feature information of the graph under LDP, and then utilizes the phenomenon of homophily in graph data to reconstruct the graph structure and node features separately, thereby effectively mitigating the negative impact of LDP on the downstream GNN training. We theoretically analyze the privacy guarantee of HoGS and conduct experiments using the generated synthetic graph as input to various state-of-the-art GNN architectures. Experimental results on three real-world datasets show that HoGS significantly outperforms baseline methods in the accuracy of training GNNs.

new FreqLens: Interpretable Frequency Attribution for Time Series Forecasting

Authors: Chi-Sheng Chen, Xinyu Zhang, En-Jui Kuo, Guan-Ying Chen, Qiuzhe Xie, Fan Zhang

Abstract: Time series forecasting models often lack interpretability, limiting their adoption in domains requiring explainable predictions. We propose \textsc{FreqLens}, an interpretable forecasting framework that discovers and attributes predictions to learnable frequency components. \textsc{FreqLens} introduces two key innovations: (1) \emph{learnable frequency discovery} -- frequency bases are parameterized via sigmoid mapping and learned from data with diversity regularization, enabling automatic discovery of dominant periodic patterns without domain knowledge; and (2) \emph{axiomatic frequency attribution} -- a theoretically grounded framework that provably satisfies Completeness, Faithfulness, Null-Frequency, and Symmetry axioms, with per-frequency attributions equivalent to Shapley values. On Traffic and Weather datasets, \textsc{FreqLens} achieves competitive or superior performance while discovering physically meaningful frequencies: all 5 independent runs discover the 24-hour daily cycle ($24.6 \pm 0.1$h, 2.5\% error) and 12-hour half-daily cycle ($11.8 \pm 0.1$h, 1.6\% error) on Traffic, and weekly cycles ($10\times$ longer than the input window) on Weather. These results demonstrate genuine frequency-level knowledge discovery with formal theoretical guarantees on attribution quality.

new Default Machine Learning Hyperparameters Do Not Provide Informative Initialization for Bayesian Optimization

Authors: Nicol\'as Villagr\'an Prieto, Eduardo C. Garrido-Merch\'an

Abstract: Bayesian Optimization (BO) is a standard tool for hyperparameter tuning thanks to its sample efficiency on expensive black-box functions. While most BO pipelines begin with uniform random initialization, default hyperparameter values shipped with popular ML libraries such as scikit-learn encode implicit expert knowledge and could serve as informative starting points that accelerate convergence. This hypothesis, despite its intuitive appeal, has remained largely unexamined. We formalize the idea by initializing BO with points drawn from truncated Gaussian distributions centered at library defaults and compare the resulting trajectories against a uniform-random baseline. We conduct an extensive empirical evaluation spanning three BO back-ends (BoTorch, Optuna, Scikit-Optimize), three model families (Random Forests, Support Vector Machines, Multilayer Perceptrons), and five benchmark datasets covering classification and regression tasks. Performance is assessed through convergence speed and final predictive quality, and statistical significance is determined via one-sided binomial tests. Across all conditions, default-informed initialization yields no statistically significant advantage over purely random sampling, with p-values ranging from 0.141 to 0.908. A sensitivity analysis on the prior variance confirms that, while tighter concentration around the defaults improves early evaluations, this transient benefit vanishes as optimization progresses, leaving final performance unchanged. Our results provide no evidence that default hyperparameters encode useful directional information for optimization. We therefore recommend that practitioners treat hyperparameter tuning as an integral part of model development and favor principled, data-driven search strategies over heuristic reliance on library defaults.

new A Graphop Analysis of Graph Neural Networks on Sparse Graphs: Generalization and Universal Approximation

Authors: Ofek Amran, Tom Gilat, Ron Levie

Abstract: Generalization and approximation capabilities of message passing graph neural networks (MPNNs) are often studied by defining a compact metric on a space of input graphs under which MPNNs are H\"older continuous. Such analyses are of two varieties: 1) when the metric space includes graphs of unbounded sizes, the theory is only appropriate for dense graphs, and, 2) when studying sparse graphs, the metric space only includes graphs of uniformly bounded size. In this work, we present a unified approach, defining a compact metric on the space of graphs of all sizes, both sparse and dense, under which MPNNs are H\"older continuous. This leads to more powerful universal approximation theorems and generalization bounds than previous works. The theory is based on, and extends, a recent approach to graph limit theory called graphop analysis.

new How2Everything: Mining the Web for How-To Procedures to Evaluate and Improve LLMs

Authors: Yapei Chang, Kyle Lo, Mohit Iyyer, Luca Soldaini

Abstract: Generating step-by-step "how-to" procedures is a key LLM capability: how-to advice is commonly requested in chatbots, and step-by-step planning is critical for reasoning over complex tasks. Yet, measuring and improving procedural validity at scale on real-world tasks remains challenging and understudied. To address this, we introduce How2Everything, a scalable framework to evaluate and improve goal-conditioned procedure generation. Our framework includes How2Mine, which mines 351K procedures from 980K web pages across 14 topics and readily scales to larger corpora. From this pool we build How2Bench, a 7K-example evaluation set balanced across topics. To reliably score model outputs, we develop How2Score, an evaluation protocol that uses an LLM judge to detect whether a generation contains any critical failure that would prevent achieving the goal. For low-cost, reproducible evaluation, we distill a frontier model into an open 8B model, achieving 80.5% agreement with human annotators. How2Bench reveals clear scaling trends across model sizes and training stages, providing signal early in pretraining. Finally, RL using How2Score as a reward improves performance on How2Bench by >10 points across three models without systematic regressions on standard benchmarks, with gains robust to superficial source-document memorization or format compliance. Taken together, How2Everything shows how pretraining web data can support a closed loop of capability evaluation and improvement at scale.

new Efficient Deep Learning for Biometrics: Overview, Challenges and Trends in Ear of Frugal AI

Authors: Karim Haroun, Aya Zitouni, Aicha Zenakhri, Meriem Amel Guessoum, Larbi Boubchir

Abstract: Recent advances in deep learning, whether on discriminative or generative tasks have been beneficial for various applications, among which security and defense. However, their increasing computational demands during training and deployment translates directly into high energy consumption. As a consequence, this induces a heavy carbon footprint which hinders their widespread use and scalability, but also a limitation when deployed on resource-constrained edge devices for real-time use. In this paper, we briefly survey efficient deep learning methods for biometric applications. Specifically, we tackle the challenges one might incur when training and deploying deep learning approaches, and provide a taxonomy of the various efficient deep learning families. Additionally, we discuss complementary metrics for evaluating the efficiency of these models such as memory, computation, latency, throughput, and advocate for universal and reproducible metrics for better comparison. Last, we give future research directions to consider.

new $\texttt{lrnnx}$: A library for Linear RNNs

Authors: Karan Bania, Soham Kalburgi, Manit Tanwar, Dhruthi, Aditya Nagarsekar, Harshvardhan Mestha, Naman Chibber, Raj Deshmukh, Anish Sathyanarayanan, Aarush Rathore, Pratham Chheda

Abstract: Linear recurrent neural networks (LRNNs) provide a structured approach to sequence modeling that bridges classical linear dynamical systems and modern deep learning, offering both expressive power and theoretical guarantees on stability and trainability. In recent years, multiple LRNN-based architectures have been proposed, each introducing distinct parameterizations, discretization schemes, and implementation constraints. However, existing implementations are fragmented across different software frameworks, often rely on framework-specific optimizations, and in some cases require custom CUDA kernels or lack publicly available code altogether. As a result, using, comparing, or extending LRNNs requires substantial implementation effort. To address this, we introduce $\texttt{lrnnx}$, a unified software library that implements several modern LRNN architectures under a common interface. The library exposes multiple levels of control, allowing users to work directly with core components or higher-level model abstractions. $\texttt{lrnnx}$ aims to improve accessibility, reproducibility, and extensibility of LRNN research and applications. We make our code available under a permissive MIT license.

new Robust Policy Optimization to Prevent Catastrophic Forgetting

Authors: Mahdi Sabbaghi, George Pappas, Adel Javanmard, Hamed Hassani

Abstract: Large language models are commonly trained through multi-stage post-training: first via RLHF, then fine-tuned for other downstream objectives. Yet even small downstream updates can compromise earlier learned behaviors (e.g., safety), exposing a brittleness known as catastrophic forgetting. This suggests standard RLHF objectives do not guarantee robustness to future adaptation. To address it, most prior work designs downstream-time methods to preserve previously learned behaviors. We argue that preventing this requires pre-finetuning robustness: the base policy should avoid brittle high-reward solutions whose reward drops sharply under standard fine-tuning. We propose Fine-tuning Robust Policy Optimization (FRPO), a robust RLHF framework that optimizes reward not only at the current policy, but across a KL-bounded neighborhood of policies reachable by downstream adaptation. The key idea is to ensure reward stability under policy shifts via a max-min formulation. By modifying GRPO, we develop an algorithm with no extra computation, and empirically show it substantially reduces safety degradation across multiple base models and downstream fine-tuning regimes (SFT and RL) while preserving downstream task performance. We further study a math-focused RL setting, demonstrating that FRPO preserves accuracy under subsequent fine-tuning.

new Permissive-Washing in the Open AI Supply Chain: A Large-Scale Audit of License Integrity

Authors: James Jewitt, Gopi Krishnan Rajbahadur, Hao Li, Bram Adams, Ahmed E. Hassan

Abstract: Permissive licenses like MIT, Apache-2.0, and BSD-3-Clause dominate open-source AI, signaling that artifacts like models, datasets, and code can be freely used, modified, and redistributed. However, these licenses carry mandatory requirements: include the full license text, provide a copyright notice, and preserve upstream attribution, that remain unverified at scale. Failure to meet these conditions can place reuse outside the scope of the license, effectively leaving AI artifacts under default copyright for those uses and exposing downstream users to litigation. We call this phenomenon ``permissive washing'': labeling AI artifacts as free to use, while omitting the legal documentation required to make that label actionable. To assess how widespread permissive washing is in the AI supply chain, we empirically audit 124,278 dataset $\rightarrow$ model $\rightarrow$ application supply chains, spanning 3,338 datasets, 6,664 models, and 28,516 applications across Hugging Face and GitHub. We find that an astonishing 96.5\% of datasets and 95.8\% of models lack the required license text, only 2.3\% of datasets and 3.2\% of models satisfy both license text and copyright requirements, and even when upstream artifacts provide complete licensing evidence, attribution rarely propagates downstream: only 27.59\% of models preserve compliant dataset notices and only 5.75\% of applications preserve compliant model notices (with just 6.38\% preserving any linked upstream notice). Practitioners cannot assume permissive labels confer the rights they claim: license files and notices, not metadata, are the source of legal truth. To support future research, we release our full audit dataset and reproducible pipeline.

new Kirin: Improving ANN efficiency with SNN Hybridization

Authors: Chenyu Wang, Zhanglu Yan, Zhi Zhou, Xu Chen, Weng-Fai Wong

Abstract: Artificial neural networks (ANNs), particularly large language models (LLMs), demonstrate powerful inference capabilities but consume substantial energy. Conversely, spiking neural networks (SNNs) exhibit exceptional energy efficiency due to their binary and event-driven characteristics, thus motivating the study of ANN-to-SNN conversion. In this process, quantization plays a pivotal role, mapping LLMs' floating-point parameters to discrete SNN parameters via the temporal dimension of the time window. However, several challenges remain in the conversion process: (i) converting high bit-width quantization values into binary spikes requires longer time windows, increasing system latency; and (ii) the inherent trade-off between the information loss of single-spike schemes and the energy costs of multi-spike ones in SNN. To address these challenges, we propose Kirin, a integer and spike hybrid based SNN to achieve accuracy lossless ANN-to-SNN conversion with time and energy efficiency. Specifically, we first propose a Spike Matrix Hybridization strategy that encoding low bit-width parameters that leading to small time window size into binary spikes while preserving the rest in integer format, thereby reducing the overall latency of SNN execution. Second, we introduce a silence threshold mechanism to regulate the timing of single-spike firing, ensuring the output is mathematically equivalent to the LLM's output and preserves accuracy. Experimental results demonstrate that Kirin, under a W4A4\&8 quantization setting, achieves near-FP16 accuracy while reducing energy consumption by up to 84.66\% and shortening time steps by 93.75\%.

new FlexMoRE: A Flexible Mixture of Rank-heterogeneous Experts for Efficient Federatedly-trained Large Language Models

Authors: Annemette Brok Pirchert, Jacob Nielsen, Mogens Henrik From, Lukas Galke Poech, Peter Schneider-Kamp

Abstract: Recent advances in mixture-of-experts architectures have shown that individual experts models can be trained federatedly, i.e., in isolation from other experts by using a common base model to facilitate coordination. However, we hypothesize that full-sized experts may not be necessary for all domains and that instead low-rank adapters may be sufficient. Here, we introduce FlexMoRE, a Flexible Mixture of Rank-heterogenous Experts, which may be either full-sized experts or adapters of a suitable rank. We systematically investigate the trade-off between expert rank and downstream task performance by evaluating $6$ experts with ranks $2^0$ to $2^{14}$ resulting in experiments covering 150 mixtures (96 with 2 experts, 54 with 7 experts) that are evaluated across $120$ tasks. For our experiments, we build on FlexOlmo and turn its pre-trained experts into low-rank versions. Our regression analysis from expert rank to downstream task performance reveals that the best-performing rank is substantially higher for reasoning-heavy benchmarks than for knowledge-heavy benchmarks. These findings on rank sensitivity come with direct implications for memory efficiency: Using optimal ranks, FlexMoRE yields improved downstream task performance (average score $47.18$) compared to the baseline FlexOlmo-style mixture of full-sized experts (average score $45.46$) at less than one third the parameters ($10.75$B for FlexMoRE vs. $33.27$B for FlexOlmo). All code will be made available.

new Bayesian Preference Learning for Test-Time Steerable Reward Models

Authors: Jiwoo Hong, Shao Tang, Zhipeng Wang

Abstract: Reward models are central to aligning language models with human preferences via reinforcement learning (RL). As RL is increasingly applied to settings such as verifiable rewards and multi-objective alignment, RMs are expected to encode more complex and multifaceted preference distributions. However, classifier RMs remain static once trained, limiting their adaptability at test time. We propose Variational In-Context Reward Modeling (ICRM), a novel Bayesian reward modeling objective that enables test-time steerability via in-context preference demonstrations. ICRM casts reward modeling as amortized variational inference over a latent preference probability under the Bradley-Terry model using a conjugate Beta prior. We show that ICRM adapt to unseen preference distributions at test time for both single and multi-objective settings. With more in-context demonstrations, ICRM gains 34% accuracy on SafeRLHF and 9% accuracy on RM-Bench in the single-objective setting, while widening the Pareto frontier with a 4% gain in hypervolume on helpfulness and refusal benchmarks. We further study the practical applicability of ICRM for RL training, showing that it can effectively encode verifiable rewards by outperforming a conventional RM in math reasoning. Finally, we provide theoretical guarantees that the variational objective admits a global interior optimum with finite confidence, and we analyze how KL regularization mitigates reward over-optimization.

new Dr. MAS: Stable Reinforcement Learning for Multi-Agent LLM Systems

Authors: Lang Feng, Longtao Zheng, Shuo He, Fuxiang Zhang, Bo An

Abstract: Multi-agent LLM systems enable advanced reasoning and tool use via role specialization, yet reliable reinforcement learning (RL) post-training for such systems remains difficult. In this work, we theoretically pinpoint a key reason for training instability when extending group-based RL to multi-agent LLM systems. We show that under GRPO-style optimization, a global normalization baseline may deviate from diverse agents' reward distributions, which ultimately leads to gradient-norm instability. Based on this finding, we propose Dr. MAS, a simple and stable RL training recipe for multi-agent LLM systems. Dr. MAS uses an agent-wise remedy: normalizing advantages per agent using each agent's own reward statistics, which calibrates gradient scales and dramatically stabilizes training, both theoretically and empirically. Beyond the algorithm, Dr. MAS provides an end-to-end RL training framework for multi-agent LLM systems, supporting scalable orchestration, flexible per-agent LLM serving and optimization configs, and shared resource scheduling of LLM actor backends. We evaluate Dr. MAS on multi-agent math reasoning and multi-turn search benchmarks using Qwen2.5 and Qwen3 series models. Dr. MAS achieves clear gains over vanilla GRPO (e.g., +5.6\% avg@16 and +4.6\% pass@16 on math, and +15.2\% avg@16 and +13.1\% pass@16 on search) while largely eliminating gradient spikes. Moreover, it remains highly effective under heterogeneous agent-model assignments while improving efficiency.

new Rethinking Graph Generalization through the Lens of Sharpness-Aware Minimization

Authors: Yang Qiu, Yixiong Zou, Jun Wang

Abstract: Graph Neural Networks (GNNs) have achieved remarkable success across various graph-based tasks but remain highly sensitive to distribution shifts. In this work, we focus on a prevalent yet under-explored phenomenon in graph generalization, Minimal Shift Flip (MSF),where test samples that slightly deviate from the training distribution are abruptly misclassified. To interpret this phenomenon, we revisit MSF through the lens of Sharpness-Aware Minimization (SAM), which characterizes the local stability and sharpness of the loss landscape while providing a theoretical foundation for modeling generalization error. To quantify loss sharpness, we introduce the concept of Local Robust Radius, measuring the smallest perturbation required to flip a prediction and establishing a theoretical link between local stability and generalization. Building on this perspective, we further observe a continual decrease in the robust radius during training, indicating weakened local stability and an increasingly sharp loss landscape that gives rise to MSF. To jointly solve the MSF phenomenon and the intractability of radius, we develop an energy-based formulation that is theoretically proven to be monotonically correlated with the robust radius, offering a tractable and principled objective for modeling flatness and stability. Building on these insights, we propose an energy-driven generative augmentation framework (E2A) that leverages energy-guided latent perturbations to generate pseudo-OOD samples and enhance model generalization. Extensive experiments across multiple benchmarks demonstrate that E2A consistently improves graph OOD generalization, outperforming state-of-the-art baselines.

new Discovering Interpretable Algorithms by Decompiling Transformers to RASP

Authors: Xinting Huang, Aleksandra Bakalova, Satwik Bhattamishra, William Merrill, Michael Hahn

Abstract: Recent work has shown that the computations of Transformers can be simulated in the RASP family of programming languages. These findings have enabled improved understanding of the expressive capacity and generalization abilities of Transformers. In particular, Transformers have been suggested to length-generalize exactly on problems that have simple RASP programs. However, it remains open whether trained models actually implement simple interpretable programs. In this paper, we present a general method to extract such programs from trained Transformers. The idea is to faithfully re-parameterize a Transformer as a RASP program and then apply causal interventions to discover a small sufficient sub-program. In experiments on small Transformers trained on algorithmic and formal language tasks, we show that our method often recovers simple and interpretable RASP programs from length-generalizing transformers. Our results provide the most direct evidence so far that Transformers internally implement simple RASP programs.

new Magnitude Distance: A Geometric Measure of Dataset Similarity

Authors: Sahel Torkamani, Henry Gouk, Rik Sarkar

Abstract: Quantifying the distance between datasets is a fundamental question in mathematics and machine learning. We propose \textit{magnitude distance}, a novel distance metric defined on finite datasets using the notion of the \emph{magnitude} of a metric space. The proposed distance incorporates a tunable scaling parameter, $t$, that controls the sensitivity to global structure (small $t$) and finer details (large $t$). We prove several theoretical properties of magnitude distance, including its limiting behavior across scales and conditions under which it satisfies key metric properties. In contrast to classical distances, we show that magnitude distance remains discriminative in high-dimensional settings when the scale is appropriately tuned. We further demonstrate how magnitude distance can be used as a training objective for push-forward generative models. Our experimental results support our theoretical analysis and demonstrate that magnitude distance provides meaningful signals, comparable to established distance-based generative approaches.

new Near-optimal Swap Regret Minimization for Convex Losses

Authors: Lunjia Hu, Jon Schneider, Yifan Wu

Abstract: We give a randomized online algorithm that guarantees near-optimal $\widetilde O(\sqrt T)$ expected swap regret against any sequence of $T$ adaptively chosen Lipschitz convex losses on the unit interval. This improves the previous best bound of $\widetilde O(T^{2/3})$ and answers an open question of Fishelson et al. [2025b]. In addition, our algorithm is efficient: it runs in $\mathsf{poly}(T)$ time. A key technical idea we develop to obtain this result is to discretize the unit interval into bins at multiple scales of granularity and simultaneously use all scales to make randomized predictions, which we call multi-scale binning and may be of independent interest. A direct corollary of our result is an efficient online algorithm for minimizing the calibration error for general elicitable properties. This result does not require the Lipschitzness assumption of the identification function needed in prior work, making it applicable to median calibration, for which we achieve the first $\widetilde O(\sqrt T)$ calibration error guarantee.

new AnomSeer: Reinforcing Multimodal LLMs to Reason for Time-Series Anomaly Detection

Authors: Junru Zhang, Lang Feng, Haoran Shi, Xu Guo, Han Yu, Yabo Dong, Duanqing Xu

Abstract: Time-series anomaly detection (TSAD) with multimodal large language models (MLLMs) is an emerging area, yet a persistent challenge remains: MLLMs rely on coarse time-series heuristics but struggle with multi-dimensional, detailed reasoning, which is vital for understanding complex time-series data. We present AnomSeer to address this by reinforcing the model to ground its reasoning in precise, structural details of time series, unifying anomaly classification, localization, and explanation. At its core, an expert chain-of-thought trace is generated to provide a verifiable, fine-grained reasoning from classical analyses (e.g., statistical measures, frequency transforms). Building on this, we propose a novel time-series grounded policy optimization (TimerPO) that incorporates two additional components beyond standard reinforcement learning: a time-series grounded advantage based on optimal transport and an orthogonal projection to ensure this auxiliary granular signal does not interfere with the primary detection objective. Across diverse anomaly scenarios, AnomSeer, with Qwen2.5-VL-3B/7B-Instruct, outperforms larger commercial baselines (e.g., GPT-4o) in classification and localization accuracy, particularly on point- and frequency-driven exceptions. Moreover, it produces plausible time-series reasoning traces that support its conclusions.

new Stress-Testing Alignment Audits With Prompt-Level Strategic Deception

Authors: Oliver Daniels, Perusha Moodley, Ben Marlin, David Lindner

Abstract: Alignment audits aim to robustly identify hidden goals from strategic, situationally aware misaligned models. Despite this threat model, existing auditing methods have not been systematically stress-tested against deception strategies. We address this gap, implementing an automatic red-team pipeline that generates deception strategies (in the form of system prompts) tailored to specific white-box and black-box auditing methods. Stress-testing assistant prefills, user persona sampling, sparse autoencoders, and token embedding similarity methods against secret-keeping model organisms, our automatic red-team pipeline finds prompts that deceive both the black-box and white-box methods into confident, incorrect guesses. Our results provide the first documented evidence of activation-based strategic deception, and suggest that current black-box and white-box methods would not be robust to a sufficiently capable misaligned model.

new Learning Potentials for Dynamic Matching and Application to Heart Transplantation

Authors: Itai Zilberstein, Ioannis Anagnostides, Zachary W. Sollie, Arman Kilic, Tuomas Sandholm

Abstract: Each year, thousands of patients in need of heart transplants face life-threatening wait times due to organ scarcity. While allocation policies aim to maximize population-level outcomes, current approaches often fail to account for the dynamic arrival of organs and the composition of waitlisted candidates, thereby hampering efficiency. The United States is transitioning from rigid, rule-based allocation to more flexible data-driven models. In this paper, we propose a novel framework for non-myopic policy optimization in general online matching relying on potentials, a concept originally introduced for kidney exchange. We develop scalable and accurate ways of learning potentials that are higher-dimensional and more expressive than prior approaches. Our approach is a form of self-supervised imitation learning: the potentials are trained to mimic an omniscient algorithm that has perfect foresight. We focus on the application of heart transplant allocation and demonstrate, using real historical data, that our policies significantly outperform prior approaches -- including the current US status quo policy and the proposed continuous distribution framework -- in optimizing for population-level outcomes. Our analysis and methods come at a pivotal moment in US policy, as the current heart transplant allocation system is under review. We propose a scalable and theoretically grounded path toward more effective organ allocation.

new Breaking the Simplification Bottleneck in Amortized Neural Symbolic Regression

Authors: Paul Saegert, Ullrich K\"othe

Abstract: Symbolic regression (SR) aims to discover interpretable analytical expressions that accurately describe observed data. Amortized SR promises to be much more efficient than the predominant genetic programming SR methods, but currently struggles to scale to realistic scientific complexity. We find that a key obstacle is the lack of a fast reduction of equivalent expressions to a concise normalized form. Amortized SR has addressed this by general-purpose Computer Algebra Systems (CAS) like SymPy, but the high computational cost severely limits training and inference speed. We propose SimpliPy, a rule-based simplification engine achieving a 100-fold speed-up over SymPy at comparable quality. This enables substantial improvements in amortized SR, including scalability to much larger training sets, more efficient use of the per-expression token budget, and systematic training set decontamination with respect to equivalent test expressions. We demonstrate these advantages in our Flash-ANSR framework, which achieves much better accuracy than amortized baselines (NeSymReS, E2E) on the FastSRB benchmark. Moreover, it performs on par with state-of-the-art direct optimization (PySR) while recovering more concise instead of more complex expressions with increasing inference budget.

new Discrete Bridges for Mutual Information Estimation

Authors: Iryna Zabarianska, Sergei Kholkin, Grigoriy Ksenofontov, Ivan Butakov, Alexander Korotin

Abstract: Diffusion bridge models in both continuous and discrete state spaces have recently become powerful tools in the field of generative modeling. In this work, we leverage the discrete state space formulation of bridge matching models to address another important problem in machine learning and information theory: the estimation of the mutual information (MI) between discrete random variables. By neatly framing MI estimation as a domain transfer problem, we construct a Discrete Bridge Mutual Information (DBMI) estimator suitable for discrete data, which poses difficulties for conventional MI estimators. We showcase the performance of our estimator on two MI estimation settings: low-dimensional and image-based.

new GSS: Gated Subspace Steering for Selective Memorization Mitigation in LLMs

Authors: Xuanqi Zhang, Haoyang Shang, Xiaoxiao Li

Abstract: Large language models (LLMs) can memorize and reproduce training sequences verbatim -- a tendency that undermines both generalization and privacy. Existing mitigation methods apply interventions uniformly, degrading performance on the majority of tokens that generalize normally. We show empirically that memorization is sparse, intermittent, and token-conditioned, suggesting that effective mitigation requires context-aware intervention rather than static parameter modification. To this end, we propose a novel and effective selective memorization mitigation method -- Gated Subspace Steering (GSS), which decomposes intervention into a probe (detecting memorization-relevant activations) and a steer (applying targeted correction only when the probe exceeds a threshold). The optimal probe-steer pair emerges from a principled optimization framework based on optimal subspace steering. Experiments on four benchmarks show GSS matches or exceeds state-of-the-art memorization reduction while requiring $100-1000 \times$ less compute than optimization-based alternatives. Furthermore, we provide new theoretical insights into the geometry of memorization in neural representations.

new Positive Distribution Shift as a Framework for Understanding Tractable Learning

Authors: Marko Medvedev, Idan Attias, Elisabetta Cornacchia, Theodor Misiakiewicz, Gal Vardi, Nathan Srebro

Abstract: We study a setting where the goal is to learn a target function f(x) with respect to a target distribution D(x), but training is done on i.i.d. samples from a different training distribution D'(x), labeled by the true target f(x). Such a distribution shift (here in the form of covariate shift) is usually viewed negatively, as hurting or making learning harder, and the traditional distribution shift literature is mostly concerned with limiting or avoiding this negative effect. In contrast, we argue that with a well-chosen D'(x), the shift can be positive and make learning easier -- a perspective called Positive Distribution Shift (PDS). Such a perspective is central to contemporary machine learning, where much of the innovation is in finding good training distributions D'(x), rather than changing the training algorithm. We further argue that the benefit is often computational rather than statistical, and that PDS allows computationally hard problems to become tractable even using standard gradient-based training. We formalize different variants of PDS, show how certain hard classes are easily learnable under PDS, and make connections with membership query learning.

new GEMSS: A Variational Bayesian Method for Discovering Multiple Sparse Solutions in Classification and Regression Problems

Authors: Kate\v{r}ina Henclov\'a, V\'aclav \v{S}m\'idl

Abstract: Selecting interpretable feature sets in underdetermined ($n \ll p$) and highly correlated regimes constitutes a fundamental challenge in data science, particularly when analyzing physical measurements. In such settings, multiple distinct sparse subsets may explain the response equally well. Identifying these alternatives is crucial for generating domain-specific insights into the underlying mechanisms, yet conventional methods typically isolate a single solution, obscuring the full spectrum of plausible explanations. We present GEMSS (Gaussian Ensemble for Multiple Sparse Solutions), a variational Bayesian framework specifically designed to simultaneously discover multiple, diverse sparse feature combinations. The method employs a structured spike-and-slab prior for sparsity, a mixture of Gaussians to approximate the intractable multimodal posterior, and a Jaccard-based penalty to further control solution diversity. Unlike sequential greedy approaches, GEMSS optimizes the entire ensemble of solutions within a single objective function via stochastic gradient descent. The method is validated on a comprehensive benchmark comprising 128 synthetic experiments across classification and regression tasks. Results demonstrate that GEMSS scales effectively to high-dimensional settings ($p=5000$) with sample size as small as $n = 50$, generalizes seamlessly to continuous targets, handles missing data natively, and exhibits remarkable robustness to class imbalance and Gaussian noise. GEMSS is available as a Python package 'gemss' at PyPI. The full GitHub repository at https://github.com/kat-er-ina/gemss/ also includes a free, easy-to-use application suitable for non-coders.

URLs: https://github.com/kat-er-ina/gemss/

new Diffusion-Inspired Reconfiguration of Transformers for Uncertainty Calibration

Authors: Manh Cuong Dao, Quang Hung Pham, Phi Le Nguyen, Thao Nguyen Truong, Bryan Kian Hsiang Low, Trong Nghia Hoang

Abstract: Uncertainty calibration in pre-trained transformers is critical for their reliable deployment in risk-sensitive applications. Yet, most existing pre-trained transformers do not have a principled mechanism for uncertainty propagation through their feature transformation stack. In this work, we propose a diffusion-inspired reconfiguration of transformers in which each feature transformation block is modeled as a probabilistic mapping. Composing these probabilistic mappings reveals a probability path that mimics the structure of a diffusion process, transporting data mass from the input distribution to the pre-trained feature distribution. This probability path can then be recompiled on a diffusion process with a unified transition model to enable principled propagation of representation uncertainty throughout the pre-trained model's architecture while maintaining its original predictive performance. Empirical results across a variety of vision and language benchmarks demonstrate that our method achieves superior calibration and predictive accuracy compared to existing uncertainty-aware transformers.

new DynamiQ: Accelerating Gradient Synchronization using Compressed Multi-hop All-reduce

Authors: Wenchen Han, Shay Vargaftik, Michael Mitzenmacher, Ran Ben Basat

Abstract: Multi-hop all-reduce is the de facto backbone of large model training. As the training scale increases, the network often becomes a bottleneck, motivating reducing the volume of transmitted data. Accordingly, recent systems demonstrated significant acceleration of the training process using gradient quantization. However, these systems are not optimized for multi-hop aggregation, where entries are partially summed multiple times along their aggregation topology. This paper presents DynamiQ, a quantization framework that bridges the gap between quantization best practices and multi-hop aggregation. DynamiQ introduces novel techniques to better represent partial sums, co-designed with a decompress-accumulate-recompress fused kernel to facilitate fast execution. We extended PyTorch DDP to support DynamiQ over NCCL P2P, and across different LLMs, tasks, and scales, we demonstrate consistent improvement of up to 34.2% over the best among state-of-the-art methods such as Omni-Reduce, THC, and emerging standards such as MXFP4, MXFP6, and MXFP8. Further, DynamiQ is the only evaluated method that consistently reaches near-baseline accuracy (e.g., 99.9% of the BF16 baseline) and does so while significantly accelerating the training.

new StealthRL: Reinforcement Learning Paraphrase Attacks for Multi-Detector Evasion of AI-Text Detectors

Authors: Suraj Ranganath, Atharv Ramesh

Abstract: AI-text detectors face a critical robustness challenge: adversarial paraphrasing attacks that preserve semantics while evading detection. We introduce StealthRL, a reinforcement learning framework that stress-tests detector robustness under realistic adversarial conditions. StealthRL trains a paraphrase policy against a multi-detector ensemble using Group Relative Policy Optimization (GRPO) with LoRA adapters on Qwen3-4B, optimizing a composite reward that balances detector evasion with semantic preservation. We evaluate six attack settings (M0-M5) against three detector families (RoBERTa, FastDetectGPT, and Binoculars) at the security-relevant 1% false positive rate operating point. StealthRL achieves near-zero detection (0.001 mean TPR@1%FPR), reduces mean AUROC from 0.74 to 0.27, and attains a 99.9% attack success rate. Critically, attacks transfer to a held-out detector family not seen during training, revealing shared architectural vulnerabilities rather than detector-specific brittleness. We additionally conduct LLM-based quality evaluation via Likert scoring, analyze detector score distributions to explain why evasion succeeds, and provide per-detector AUROC with bootstrap confidence intervals. Our results expose significant robustness gaps in current AI-text detection and establish StealthRL as a principled adversarial evaluation protocol. Code and evaluation pipeline are publicly available at https://github.com/suraj-ranganath/StealthRL.

URLs: https://github.com/suraj-ranganath/StealthRL.

new A Behavioural and Representational Evaluation of Goal-Directedness in Language Model Agents

Authors: Raghu Arghal, Fade Chen, Niall Dalton, Evgenii Kortukov, Calum McNamara, Angelos Nalmpantis, Moksh Nirvaan, Gabriele Sarti, Mario Giulianelli

Abstract: Understanding an agent's goals helps explain and predict its behaviour, yet there is no established methodology for reliably attributing goals to agentic systems. We propose a framework for evaluating goal-directedness that integrates behavioural evaluation with interpretability-based analyses of models' internal representations. As a case study, we examine an LLM agent navigating a 2D grid world toward a goal state. Behaviourally, we evaluate the agent against an optimal policy across varying grid sizes, obstacle densities, and goal structures, finding that performance scales with task difficulty while remaining robust to difficulty-preserving transformations and complex goal structures. We then use probing methods to decode the agent's internal representations of the environment state and its multi-step action plans. We find that the LLM agent non-linearly encodes a coarse spatial map of the environment, preserving approximate task-relevant cues about its position and the goal location; that its actions are broadly consistent with these internal representations; and that reasoning reorganises them, shifting from broader environment structural cues toward information supporting immediate action selection. Our findings support the view that introspective examination is required beyond behavioural evaluations to characterise how agents represent and pursue their objectives.

new Distributionally Robust Optimization via Generative Ambiguity Modeling

Authors: Jiaqi Wen, Jianyi Yang

Abstract: This paper studies Distributionally Robust Optimization (DRO), a fundamental framework for enhancing the robustness and generalization of statistical learning and optimization. An effective ambiguity set for DRO must involve distributions that remain consistent to the nominal distribution while being diverse enough to account for a variety of potential scenarios. Moreover, it should lead to tractable DRO solutions. To this end, we propose generative model-based ambiguity sets that capture various adversarial distributions beyond the nominal support space while maintaining consistency with the nominal distribution. Building on this generative ambiguity modeling, we propose DRO with Generative Ambiguity Set (GAS-DRO), a tractable DRO algorithm that solves the inner maximization over the parameterized generative model space. We formally establish the stationary convergence performance of GAS-DRO. We implement GAS-DRO with a diffusion model and empirically demonstrate its superior Out-of-Distribution (OOD) generalization performance in ML tasks.

new StretchTime: Adaptive Time Series Forecasting via Symplectic Attention

Authors: Yubin Kim, Viresh Pati, Jevon Twitty, Vinh Pham, Shihao Yang, Jiecheng Lu

Abstract: Transformer architectures have established strong baselines in time series forecasting, yet they typically rely on positional encodings that assume uniform, index-based temporal progression. However, real-world systems, from shifting financial cycles to elastic biological rhythms, frequently exhibit "time-warped" dynamics where the effective flow of time decouples from the sampling index. In this work, we first formalize this misalignment and prove that rotary position embedding (RoPE) is mathematically incapable of representing non-affine temporal warping. To address this, we propose Symplectic Positional Embeddings (SyPE), a learnable encoding framework derived from Hamiltonian mechanics. SyPE strictly generalizes RoPE by extending the rotation group $\mathrm{SO}(2)$ to the symplectic group $\mathrm{Sp}(2,\mathbb{R})$, modulated by a novel input-dependent adaptive warp module. By allowing the attention mechanism to adaptively dilate or contract temporal coordinates end-to-end, our approach captures locally varying periodicities without requiring pre-defined warping functions. We implement this mechanism in StretchTime, a multivariate forecasting architecture that achieves state-of-the-art performance on standard benchmarks, demonstrating superior robustness on datasets exhibiting non-stationary temporal dynamics.

new Improving Detection of Rare Nodes in Hierarchical Multi-Label Learning

Authors: Isaac Xu, Martin Gillis, Ayushi Sharma, Benjamin Misiuk, Craig J. Brown, Thomas Trappenberg

Abstract: In hierarchical multi-label classification, a persistent challenge is enabling model predictions to reach deeper levels of the hierarchy for more detailed or fine-grained classifications. This difficulty partly arises from the natural rarity of certain classes (or hierarchical nodes) and the hierarchical constraint that ensures child nodes are almost always less frequent than their parents. To address this, we propose a weighted loss objective for neural networks that combines node-wise imbalance weighting with focal weighting components, the latter leveraging modern quantification of ensemble uncertainties. By emphasizing rare nodes rather than rare observations (data points), and focusing on uncertain nodes for each model output distribution during training, we observe improvements in recall by up to a factor of five on benchmark datasets, along with statistically significant gains in $F_{1}$ score. We also show our approach aids convolutional networks on challenging tasks, as in situations with suboptimal encoders or limited data.

new DirMoE: Dirichlet-routed Mixture of Experts

Authors: Amirhossein Vahidi, Hesam Asadollahzadeh, Navid Akhavan Attar, Marie Moullet, Kevin Ly, Xingyi Yang, Mohammad Lotfollahi

Abstract: Mixture-of-Experts (MoE) models have demonstrated exceptional performance in large-scale language models. Existing routers typically rely on non-differentiable Top-$k$+Softmax, limiting their performance and scalability. We argue that two distinct decisions, which experts to activate and how to distribute expert contributions among them, are conflated in standard Top-$k$+Softmax. We introduce Dirichlet-Routed MoE (DirMoE), a novel end-to-end differentiable routing mechanism built on a Dirichlet variational autoencoder framework. This design fundamentally disentangles the core routing problems: expert selection, modeled by a Bernoulli component, and expert contribution among chosen experts, handled by a Dirichlet component. The entire forward pass remains fully differentiable through the use of Gumbel-Sigmoid relaxation for the expert selection and implicit reparameterization for the Dirichlet distribution. Our training objective, a variational ELBO, includes a direct sparsity penalty that precisely controls the number of active experts in expectation, alongside a schedule for key hyperparameters that guides the model from an exploratory to a definitive routing state. Moreover, our DirMoE router matches or exceeds other methods while improving expert specialization.

new ARO: A New Lens On Matrix Optimization For Large Models

Authors: Wenbo Gong, Javier Zazo, Qijun Luo, Puqian Wang, James Hensman, Chao Ma

Abstract: Matrix-based optimizers have attracted growing interest for improving LLM training efficiency, with significant progress centered on orthogonalization/whitening based methods. While yielding substantial performance gains, a fundamental question arises: can we develop new paradigms beyond orthogonalization, pushing the efficiency frontier further? We present \textbf{Adaptively Rotated Optimization (ARO}, a new matrix optimization framework that treats gradient rotation as a first class design principle. ARO accelerates LLM training by performing normed steepest descent in a rotated coordinate system, where the rotation is determined by a novel norm-informed policy. This perspective yields update rules that go beyond existing orthogonalization and whitening optimizers, improving sample efficiency in practice. To make comparisons reliable, we propose a rigorously controlled benchmarking protocol that reduces confounding and bias. Under this protocol, ARO consistently outperforms AdamW (by 1.3 $\sim$1.35$\times$) and orthogonalization methods (by 1.1$\sim$1.15$\times$) in LLM pretraining at up to 8B activated parameters, and up to $8\times$ overtrain budget, without evidence of diminishing returns. Finally, we discuss how ARO can be reformulated as a symmetry-aware optimizer grounded in rotational symmetries of residual streams, motivating advanced designs that enable computationally efficient exploitation of cross-layer/cross module couplings.

new ShapeCond: Fast Shapelet-Guided Dataset Condensation for Time Series Classification

Authors: Sijia Peng, Yun Xiong, Xi Chen, Yi Xie, Guanzhi Li, Yanwei Yu, Yangyong Zhu, Zhiqiang Shen

Abstract: Time series data supports many domains (e.g., finance and climate science), but its rapid growth strains storage and computation. Dataset condensation can alleviate this by synthesizing a compact training set that preserves key information. Yet most condensation methods are image-centric and often fail on time series because they miss time-series-specific temporal structure, especially local discriminative motifs such as shapelets. In this work, we propose ShapeCond, a novel and efficient condensation framework for time series classification that leverages shapelet-based dataset knowledge via a shapelet-guided optimization strategy. Our shapelet-assisted synthesis cost is independent of sequence length: longer series yield larger speedups in synthesis (e.g., 29$\times$ faster over prior state-of-the-art method CondTSC for time-series condensation, and up to 10,000$\times$ over naively using shapelets on the Sleep dataset with 3,000 timesteps). By explicitly preserving critical local patterns, ShapeCond improves downstream accuracy and consistently outperforms all prior state-of-the-art time series dataset condensation methods across extensive experiments. Code is available at https://github.com/lunaaa95/ShapeCond.

URLs: https://github.com/lunaaa95/ShapeCond.

new ANCRe: Adaptive Neural Connection Reassignment for Efficient Depth Scaling

Authors: Yilang Zhang, Bingcong Li, Niao He, Georgios B. Giannakis

Abstract: Scaling network depth has been a central driver behind the success of modern foundation models, yet recent investigations suggest that deep layers are often underutilized. This paper revisits the default mechanism for deepening neural networks, namely residual connections, from an optimization perspective. Rigorous analysis proves that the layout of residual connections can fundamentally shape convergence behavior, and even induces an exponential gap in convergence rates. Prompted by this insight, we introduce adaptive neural connection reassignment (ANCRe), a principled and lightweight framework that parameterizes and learns residual connectivities from the data. ANCRe adaptively reassigns residual connections with negligible computational and memory overhead ($<1\%$), while enabling more effective utilization of network depth. Extensive numerical tests across pre-training of large language models, diffusion models, and deep ResNets demonstrate consistently accelerated convergence, boosted performance, and enhanced depth efficiency over conventional residual connections.

new Next-Gen CAPTCHAs: Leveraging the Cognitive Gap for Scalable and Diverse GUI-Agent Defense

Authors: Jiacheng Liu, Yaxin Luo, Jiacheng Cui, Xinyi Shang, Xiaohan Zhao, Zhiqiang Shen

Abstract: The rapid evolution of GUI-enabled agents has rendered traditional CAPTCHAs obsolete. While previous benchmarks like OpenCaptchaWorld established a baseline for evaluating multimodal agents, recent advancements in reasoning-heavy models, such as Gemini3-Pro-High and GPT-5.2-Xhigh have effectively collapsed this security barrier, achieving pass rates as high as 90% on complex logic puzzles like "Bingo". In response, we introduce Next-Gen CAPTCHAs, a scalable defense framework designed to secure the next-generation web against the advanced agents. Unlike static datasets, our benchmark is built upon a robust data generation pipeline, allowing for large-scale and easily scalable evaluations, notably, for backend-supported types, our system is capable of generating effectively unbounded CAPTCHA instances. We exploit the persistent human-agent "Cognitive Gap" in interactive perception, memory, decision-making, and action. By engineering dynamic tasks that require adaptive intuition rather than granular planning, we re-establish a robust distinction between biological users and artificial agents, offering a scalable and diverse defense mechanism for the agentic era.

cross NLP Sampling: Combining MCMC and NLP Methods for Diverse Constrained Sampling

Authors: Marc Toussaint, Cornelius V. Braun, Joaquim Ortiz-Haro

Abstract: Generating diverse samples under hard constraints is a core challenge in many areas. With this work we aim to provide an integrative view and framework to combine methods from the fields of MCMC, constrained optimization, as well as robotics, and gain insights in their strengths from empirical evaluations. We propose NLP Sampling as a general problem formulation, propose a family of restarting two-phase methods as a framework to integrated methods from across the fields, and evaluate them on analytical and robotic manipulation planning problems. Complementary to this, we provide several conceptual discussions, e.g. on the role of Lagrange parameters, global sampling, and the idea of a Diffused NLP and a corresponding model-based denoising sampler.

cross Speech Emotion Recognition Leveraging OpenAI's Whisper Representations and Attentive Pooling Methods

Authors: Ali Shendabadi, Parnia Izadirad, Mostafa Salehi, Mahmoud Bijankhan

Abstract: Speech Emotion Recognition (SER) research has faced limitations due to the lack of standard and sufficiently large datasets. Recent studies have leveraged pre-trained models to extract features for downstream tasks such as SER. This work explores the capabilities of Whisper, a pre-trained ASR system, in speech emotion recognition by proposing two attention-based pooling methods, Multi-head Attentive Average Pooling and QKV Pooling, designed to efficiently reduce the dimensionality of Whisper representations while preserving emotional features. We experiment on English and Persian, using the IEMOCAP and ShEMO datasets respectively, with Whisper Tiny and Small. Our multi-head QKV architecture achieves state-of-the-art results on the ShEMO dataset, with a 2.47% improvement in unweighted accuracy. We further compare the performance of different Whisper encoder layers and find that intermediate layers often perform better for SER on the Persian dataset, providing a lightweight and efficient alternative to much larger models such as HuBERT X-Large. Our findings highlight the potential of Whisper as a representation extractor for SER and demonstrate the effectiveness of attention-based pooling for dimension reduction.

cross Graph-Based Nearest-Neighbor Search without the Spread

Authors: Jeff Giliberti, Sariel Har-Peled, Jonas Sauer, Ali Vakilian

Abstract: $\renewcommand{\Re}{\mathbb{R}}$Recent work showed how to construct nearest-neighbor graphs of linear size, on a given set $P$ of $n$ points in $\Re^d$, such that one can answer approximate nearest-neighbor queries in logarithmic time in the spread. Unfortunately, the spread might be unbounded in $n$, and an interesting theoretical question is how to remove the dependency on the spread. Here, we show how to construct an external linear-size data structure that, combined with the linear-size graph, allows us to answer ANN queries in logarithmic time in $n$.

cross Does Visual Rendering Bypass Tokenization? Investigating Script-Tokenizer Misalignment in Pixel-Based Language Models

Authors: Lucky Susanto, Musa Izzanardi Wijanarko, Khumaisa Nur'aini, Farid Adilazuarda, Alham Fikri Aji, Derry Tanti Wijaya

Abstract: While pixel-based language modeling aims to bypass the sub-word tokenization bottleneck by rendering text as images, recent multimodal variants such as DualGPT reintroduce text tokenizers to improve autoregressive performance. We investigate a fundamental question, does visual rendering truly decouple a model from tokenization constraints? Focusing on four Indonesian low-resource local languages that have their own non-Latin scripts (i.e., Javanese, Balinese, Sundanese, and Lampungnese), we evaluate the impact of script-tokenizer alignment within the DualGPT architecture. Our results show that, despite visual rendering, reintegrating a text tokenizer into the architecture reintroduces the same issue that pixel-based language modeling aims to resolve, which is the tokenizer misalignment problem. Despite having lower OOV and fertility rates, we show that the Llama 2 tokenizer performs significantly worse than a custom tokenizer, with improvements of up to 30.15 chrF++. Our findings serve as a warning for future multimodal variants, as text tokenizers remain a significant barrier to equitable models.

cross Bridging the Knowledge Void: Inference-time Acquisition of Unfamiliar Programming Languages for Coding Tasks

Authors: Chen Shen, Wei Cheng, Jingyue Yang, Huan Zhang, Yuhan Wu, Wei Hu

Abstract: The proficiency of Large Language Models (LLMs) in coding tasks is often a reflection of their extensive pre-training corpora, which typically collapses when confronted with previously unfamiliar programming languages. Departing from data-intensive finetuning, we investigate the paradigm of Inference-time Language Acquisition (ILA), where an LLM masters an unfamiliar language through dynamic interaction with limited external resources. In this paper, we propose ILA-agent, a general ILA framework that equips LLMs with a set of behavioral primitives. By modeling essential human-like behaviors as a suite of tools, ILA-agent enables LLMs to incrementally explore, apply, and verify language knowledge through structured interactions with the official documentation and execution environment. To provide a rigorous evaluation in a low-resource setting, we construct Cangjie-bench, a multi-task benchmark based on the novel statically-typed language Cangjie. We instantiate ILA-agent for Cangjie and evaluate its performance across code generation, translation, and program repair tasks. Results using diverse LLMs demonstrate that ILA-agent significantly outperforms retrieval-augmented baselines. Further analysis of agent trajectories characterizes the emergent behavior patterns while highlighting persisting performance gaps.

cross Deep Reinforcement Learning for Interference Suppression in RIS-Aided Space-Air-Ground Integrated Networks

Authors: Pujitha Mamillapalli, Shikhar Verma, Tiago Koketsu Rodrigues, Abhinav Kumar

Abstract: Future 6G networks envision ubiquitous connectivity through space-air-ground integrated networks (SAGINs), where high-altitude platform stations (HAPSs) and satellites complement terrestrial systems to provide wide-area, low-latency coverage. However, the rapid growth of terrestrial devices intensifies spectrum sharing between terrestrial and non-terrestrial segments, resulting in severe cross-tier interference. In particular, frequency sharing between the HAPS satellite uplink and HAPS ground downlink improves spectrum efficiency but suffers from interference caused by the HAPS antenna back-lobe. Existing approaches relying on zero-forcing (ZF) codebooks have limited performance under highly dynamic channel conditions. To overcome this limitation, we employ a reconfigurable intelligent surface (RIS)-aided HAPS-based SAGIN framework with a deep deterministic policy gradient (DDPG) algorithm. The proposed DDPG framework optimizes the HAPS beamforming weights to form spatial nulls toward interference sources while maintaining robust links to the desired signals. Simulation results demonstrate that the DDPG framework consistently outperforms conventional ZF beamforming among different RIS configurations, achieving up to $11.3\%$ throughput improvement for a $4\times4$ RIS configuration, validating its adaptive capability to enhance spectral efficiency in dynamic HAPS-based SAGINs.

cross Hybrid Deep Learning Framework for CSI-Based Activity Recognition in Bandwidth-Constrained Wi-Fi Sensing

Authors: Alison M. Fernandes, Hermes I. Del Monego, Bruno S. Chang, Anelise Munaretto, H\'elder M. Fontes, Rui Campos

Abstract: This paper presents a novel hybrid deep learning framework designed to enhance the robustness of CSI-based Human Activity Recognition (HAR) within bandwidth-constrained Wi-Fi sensing environments. The core of our proposed methodology is a preliminary Doppler trace extraction stage, implemented to amplify salient motion-related signal features before classification. Subsequently, these enhanced inputs are processed by a hybrid neural architecture, which integrates Inception networks responsible for hierarchical spatial feature extraction and Bidirectional Long Short-Term Memory (BiLSTM) networks that capture temporal dependencies. A Support Vector Machine (SVM) is then utilized as the final classification layer to optimize decision boundaries. The framework's efficacy was systematically validated using a public dataset across 20, 40, and 80 MHz bandwidth configurations. The model yielded accuracies of 89.27% (20 MHz), 94.13% (40 MHz), and 95.30% (80 MHz), respectively. These results confirm a marked superiority over standalone deep learning baselines, especially in the most constrained low-bandwidth scenarios. This study underscores the utility of combining Doppler-based feature engineering with a hybrid learning architecture for reliable HAR in bandwidth-limited wireless sensing applications.

cross Machine learning enhanced data assimilation framework for multiscale carbonate rock characterization

Authors: Zhenkai Bo, Ahmed H. Elsheikh, Hannah P. Menke, Julien Maes, Sebastian Geiger, Muhammad Z. Kashim, Zainol A. A. Bakar, Kamaljit Singh

Abstract: Carbonate reservoirs offer significant capacity for subsurface carbon storage, oil production, and underground hydrogen storage. X-ray computed tomography (X-ray CT) coupled with numerical simulations is commonly used to investigate the multiphase flow behaviors in carbonate rocks. Carbonates exhibit pore size distribution across scales, hindering the comprehensive investigation with conventional X-ray CT images. Imaging samples at both macro and micro-scales (multi-scale imaging) proved to be a viable option in this context. However, multi-scale imaging faces two key limitations: the trade-off between field of view and voxel size necessitates resource-intensive imaging, while multi-scale multi-physics numerical simulations on resulting digital models incur prohibitive computational costs. To address these challenges, we propose a machine learning-enhanced data assimilation framework that leverages experimental drainage relative permeability measurements to achieve efficient characterization of micro-scale structures, delivering a data-driven solution toward a high-fidelity multiscale digital rock modeling. We train a dense neural network (DNN) as a proxy to a multi-scale pore network simulator and couple it with an ensemble smoother with multiple data assimilation (ESMDA) algorithm. DNN-ESMDA framework simultaneously infers the CO2-brine drainage relative permeability of microporosity phases with associated uncertainty estimation, revealing the relative importance of each rock phase and guiding future characterization. Our DNN-ESMDA framework achieves a computational speedup, reducing inference time from thousands of hours to seconds compared with the usage of conventional multiscale numerical simulation. Given this computational efficiency and applicability, the machine learning-enhanced ESMDA framework presents a generalizable approach for characterizing multiscale carbonate rocks.

cross Curriculum-Learned Vanishing Stacked Residual PINNs for Hyperbolic PDE State Reconstruction

Authors: Katayoun Eshkofti, Matthieu Barreau

Abstract: Modeling distributed dynamical systems governed by hyperbolic partial differential equations (PDEs) remains challenging due to discontinuities and shocks that hinder the convergence of traditional physics-informed neural networks (PINNs). The recently proposed vanishing stacked residual PINN (VSR-PINN) embeds a vanishing-viscosity mechanism within stacked residual refinements to enable a smooth transition from the parabolic to hyperbolic regime. This paper integrates three curriculum-learning methods as primal-dual (PD) optimization, causality progression, and adaptive sampling into the VSR-PINN. The PD strategy balances physics and data losses, the causality scheme unlocks deeper stacks by respecting temporal and gradient evolution, and adaptive sampling targets high residuals. Numerical experiments on traffic reconstruction confirm that enforcing causality systematically reduces the median point-wise MSE and its variability across runs, yielding improvements of nearly one order of magnitude over non-causal training in both the baseline and PD variants.

cross Adaptive Temporal Dynamics for Personalized Emotion Recognition: A Liquid Neural Network Approach

Authors: Anindya Bhattacharjee, Nittya Ananda Biswas, K. A. Shahriar, Adib Rahman

Abstract: Emotion recognition from physiological signals remains challenging due to their non-stationary, noisy, and subject-dependent characteristics. This work presents, to the best of our knowledge, the first comprehensive application of liquid neural networks for EEG-based emotion recognition. The proposed multimodal framework combines convolutional feature extraction, liquid neural networks with learnable time constants, and attention-guided fusion to model temporal EEG dynamics with complementary peripheral physiological and personality features. Dedicated subnetworks are used to process EEG features and auxiliary modalities, and a shared autoencoder-based fusion module is used to learn discriminative latent representations before classification. Subject-dependent experiments conducted on the PhyMER dataset across seven emotional classes achieve an accuracy of 95.45%, surpassing previously reported results. Furthermore, temporal attention analysis provides interpretable insights into emotion-specific temporal relevance, and t-SNE visualizations demonstrate enhanced class separability, highlighting the effectiveness of the proposed approach. Finally, statistical analysis of temporal dynamics confirms that the network self-organizes into distinct functional groups with specialized fast and slow neurons, proving it independently tunes learnable time constants and memory dominance to effectively capture complex emotion artifacts.

cross MolLIBRA: Genetic Molecular Optimization with Multi-Fingerprint Surrogates and Text-Molecule Aligned Critic

Authors: Masahi Okada, Kazuki Sakai, Hiroaki Yoshida, Masaki Okoshi, Tadahiro Taniguchi

Abstract: We study sample-efficient molecular optimization under a limited budget of oracle evaluations. We propose MolLIBRA (MultimOdaLity and Language Integrated Bayesian and evolutionaRy optimizAtion), a genetic algorithm based framework that pre-ranks candidate molecules using multiple critics before oracle calls: (i) an ensemble of Gaussian process (GP) surrogates defined over multiple molecular fingerprints and (ii) a pretrained text-molecule aligned encoder CLAMP. The GP ensemble enables adaptive selection of task-appropriate fingerprints, while CLAMP provides a zero-shot scoring signal from task descriptions by measuring the similarity between molecular and text embeddings. On the Practical Molecular Optimization (PMO) benchmark with a budget of 1,000 evaluations (PMO-1K), MolLIBRA-L, our variant with a language-model-based candidate generator, attains the best Top-10 AUC on 14/22 tasks and the highest overall sum of Top-10 AUC across tasks among prior methods.

cross Scalable spatial point process models for forensic footwear analysis

Authors: Alokesh Manna, Neil Spencer, Dipak K. Dey

Abstract: Shoe print evidence recovered from crime scenes plays a key role in forensic investigations. By examining shoe prints, investigators can determine details of the footwear worn by suspects. However, establishing that a suspect's shoes match the make and model of a crime scene print may not be sufficient. Typically, thousands of shoes of the same size, make, and model are manufactured, any of which could be responsible for the print. Accordingly, a popular approach used by investigators is to examine the print for signs of ``accidentals,'' i.e., cuts, scrapes, and other features that accumulate on shoe soles after purchase due to wear. While some patterns of accidentals are common on certain types of shoes, others are highly distinctive, potentially distinguishing the suspect's shoe from all others. Quantifying the rarity of a pattern is thus essential to accurately measuring the strength of forensic evidence. In this study, we address this task by developing a hierarchical Bayesian model. Our improvement over existing methods primarily stems from two advancements. First, we frame our approach in terms of a latent Gaussian model, thus enabling inference to be efficiently scaled to large collections of annotated shoe prints via integrated nested Laplace approximations. Second, we incorporate spatially varying coefficients to model the relationship between shoes' tread patterns and accidental locations. We demonstrate these improvements through superior performance on held-out data, which enhances accuracy and reliability in forensic shoe print analysis.

cross Where Not to Learn: Prior-Aligned Training with Subset-based Attribution Constraints for Reliable Decision-Making

Authors: Ruoyu Chen, Shangquan Sun, Xiaoqing Guo, Sanyi Zhang, Kangwei Liu, Shiming Liu, Zhangcheng Wang, Qunli Zhang, Hua Zhang, Xiaochun Cao

Abstract: Reliable models should not only predict correctly, but also justify decisions with acceptable evidence. Yet conventional supervised learning typically provides only class-level labels, allowing models to achieve high accuracy through shortcut correlations rather than the intended evidence. Human priors can help constrain such behavior, but aligning models to these priors remains challenging because learned representations often diverge from human perception. To address this challenge, we propose an attribution-based human prior alignment method. We encode human priors as input regions that the model is expected to rely on (e.g., bounding boxes), and leverage a highly faithful subset-selection-based attribution approach to expose the model's decision evidence during training. When the attribution region deviates substantially from the prior regions, we penalize reliance on off-prior evidence, encouraging the model to shift its attribution toward the intended regions. This is achieved through a training objective that imposes attribution constraints induced by the human prior. We validate our method on both image classification and click decision tasks in MLLM-based GUI agent models. Across conventional classification and autoregressive generation settings, human prior alignment consistently improves task accuracy while also enhancing the model's decision reasonability.

cross Steering to Say No: Configurable Refusal via Activation Steering in Vision Language Models

Authors: Jiaxi Yang, Shicheng Liu, Yuchen Yang, Dongwon Lee

Abstract: With the rapid advancement of Vision Language Models (VLMs), refusal mechanisms have become a critical component for ensuring responsible and safe model behavior. However, existing refusal strategies are largely \textit{one-size-fits-all} and fail to adapt to diverse user needs and contextual constraints, leading to either under-refusal or over-refusal. In this work, we firstly explore the challenges mentioned above and develop \textbf{C}onfigurable \textbf{R}efusal in \textbf{VLM}s (\textbf{CR-VLM}), a robust and efficient approach for {\em configurable} refusal based on activation steering. CR-VLM consists of three integrated components: (1) extracting a configurable refusal vector via a teacher-forced mechanism to amplify the refusal signal; (2) introducing a gating mechanism that mitigates over-refusal by preserving acceptance for in-scope queries; and (3) designing a counterfactual vision enhancement module that aligns visual representations with refusal requirements. Comprehensive experiments across multiple datasets and various VLMs demonstrate that CR-VLM achieves effective, efficient, and robust configurable refusals, offering a scalable path toward user-adaptive safety alignment in VLMs.

cross Financial Bond Similarity Search Using Representation Learning

Authors: Amin Haeri, Mahdi Ghelichi, Nishant Agrawal, David Li, Catalina Gomez Sanchez

Abstract: Finding similar bonds remains challenging in fixed-income analytics, as numerical financial attributes often overshadow categorical non-financial ones such as issuer sector and domicile. This paper shows that these categorical attributes dominate the predictability of spread curves and proposes embedding models to capture their semantic similarities, outperforming one-hot and many other baselines. Evaluated via sparse-issuer augmentation, the approach improves risk modeling and curve construction.

cross AI for Sustainable Data Protection and Fair Algorithmic Management in Environmental Regulation

Authors: Sahibpreet Singh, Saksham Sharma

Abstract: Integration of AI into environmental regulation represents a significant advancement in data management. It offers promising results in both data protection plus algorithmic fairness. This research addresses the critical need for sustainable data protection in the era of ever evolving cyber threats. Traditional encryption methods face limitations in handling the dynamic nature of environmental data. This necessitates the exploration of advanced cryptographic techniques. The objective of this study is to evaluate how AI can enhance these techniques to ensure robust data protection while facilitating fair algorithmic management. The methodology involves a comprehensive review of current advancements in AI-enhanced homomorphic encryption (HE) and multi-party computation (MPC). It is coupled with an analysis of how these techniques can be applied to environmental data regulation. Key findings indicate that AI-driven dynamic key management, adaptive encryption schemes, and optimized computational efficiency in HE, alongside AI-enhanced protocol optimization and fault mitigation in MPC, significantly improve the security of environmental data processing. These findings highlight a crucial research gap in the intersection of AI, cyber laws, and environmental regulation, particularly in terms of addressing algorithmic bias, transparency, and accountability. The implications of this research underscore the need for stricter cyber laws. Also, the development of comprehensive regulations to safeguard sensitive environmental data. Future efforts should focus on refining AI systems to balance security with privacy and ensuring that regulatory frameworks can adapt to technological advancements. This study provides a foundation for future research aimed at achieving secure sustainable environmental data management through AI innovations.

cross Condition Errors Refinement in Autoregressive Image Generation with Diffusion Loss

Authors: Yucheng Zhou, Hao Li, Jianbing Shen

Abstract: Recent studies have explored autoregressive models for image generation, with promising results, and have combined diffusion models with autoregressive frameworks to optimize image generation via diffusion losses. In this study, we present a theoretical analysis of diffusion and autoregressive models with diffusion loss, highlighting the latter's advantages. We present a theoretical comparison of conditional diffusion and autoregressive diffusion with diffusion loss, demonstrating that patch denoising optimization in autoregressive models effectively mitigates condition errors and leads to a stable condition distribution. Our analysis also reveals that autoregressive condition generation refines the condition, causing the condition error influence to decay exponentially. In addition, we introduce a novel condition refinement approach based on Optimal Transport (OT) theory to address ``condition inconsistency''. We theoretically demonstrate that formulating condition refinement as a Wasserstein Gradient Flow ensures convergence toward the ideal condition distribution, effectively mitigating condition inconsistency. Experiments demonstrate the superiority of our method over diffusion and autoregressive models with diffusion loss methods.

cross Fair Context Learning for Evidence-Balanced Test-Time Adaptation in Vision-Language Models

Authors: Sanggeon Yun, Ryozo Masukawa, SungHeon Jeong, Wenjun Huang, Hanning Chen, Mohsen Imani

Abstract: Vision-Language Models (VLMs) such as CLIP enable strong zero-shot recognition but suffer substantial degradation under distribution shifts. Test-Time Adaptation (TTA) aims to improve robustness using only unlabeled test samples, yet most prompt-based TTA methods rely on entropy minimization -- an approach that can amplify spurious correlations and induce overconfident errors when classes share visual features. We propose Fair Context Learning (FCL), an episodic TTA framework that avoids entropy minimization by explicitly addressing shared-evidence bias. Motivated by our additive evidence decomposition assumption, FCL decouples adaptation into (i) augmentation-based exploration to identify plausible class candidates, and (ii) fairness-driven calibration that adapts text contexts to equalize sensitivity to common visual evidence. This fairness constraint mitigates partial feature obsession and enables effective calibration of text embeddings without relying on entropy reduction. Through extensive evaluation, we empirically validate our theoretical motivation and show that FCL achieves competitive adaptation performance relative to state-of-the-art TTA methods across diverse domain-shift and fine-grained benchmarks.

cross DLLM-Searcher: Adapting Diffusion Large Language Model for Search Agents

Authors: Jiahao Zhao, Shaoxuan Xu, Zhongxiang Sun, Fengqi Zhu, Jingyang Ou, Yuling Shi, Chongxuan Li, Xiao Zhang, Jun Xu

Abstract: Recently, Diffusion Large Language Models (dLLMs) have demonstrated unique efficiency advantages, enabled by their inherently parallel decoding mechanism and flexible generation paradigm. Meanwhile, despite the rapid advancement of Search Agents, their practical deployment is constrained by a fundamental limitation, termed as 1) Latency Challenge: the serial execution of multi-round reasoning, tool calling, and tool response waiting under the ReAct agent paradigm induces severe end-to-end latency. Intuitively, dLLMs can leverage their distinctive strengths to optimize the operational efficiency of agents under the ReAct agent paradigm. Practically, existing dLLM backbones face the 2) Agent Ability Challenge. That is, existing dLLMs exhibit remarkably weak reasoning and tool-calling capabilities, preventing these advantages from being effectively realized in practice. In this paper, we propose DLLM-Searcher, an optimization framework for dLLM-based Search Agents. To solve the Agent Ability Challenge, we design a two-stage post-training pipeline encompassing Agentic Supervised Fine-Tuning (Agentic SFT) and Agentic Variance-Reduced Preference Optimization Agentic VRPO, which enhances the backbone dLLM's information seeking and reasoning capabilities. To mitigate the Latency Challenge, we leverage the flexible generation mechanism of dLLMs and propose a novel agent paradigm termed Parallel-Reasoning and Acting P-ReAct. P-ReAct guides the model to prioritize decoding tool_call instructions, thereby allowing the model to keep thinking while waiting for the tool's return. Experimental results demonstrate that DLLM-Searcher achieves performance comparable to mainstream LLM-based search agents and P-ReAct delivers approximately 15% inference acceleration. Our code is available at https://anonymous.4open.science/r/DLLM-Searcher-553C

URLs: https://anonymous.4open.science/r/DLLM-Searcher-553C

cross OMNI-Dent: Towards an Accessible and Explainable AI Framework for Automated Dental Diagnosis

Authors: Leeje Jang, Yao-Yi Chiang, Angela M. Hastings, Patimaporn Pungchanchaikul, Martha B. Lucas, Emily C. Schultz, Jeffrey P. Louie, Mohamed Estai, Wen-Chen Wang, Ryan H. L. Ip, Boyen Huang

Abstract: Accurate dental diagnosis is essential for oral healthcare, yet many individuals lack access to timely professional evaluation. Existing AI-based methods primarily treat diagnosis as a visual pattern recognition task and do not reflect the structured clinical reasoning used by dental professionals. These approaches also require large amounts of expert-annotated data and often struggle to generalize across diverse real-world imaging conditions. To address these limitations, we present OMNI-Dent, a data-efficient and explainable diagnostic framework that incorporates clinical reasoning principles into a Vision-Language Model (VLM)-based pipeline. The framework operates on multi-view smartphone photographs,embeds diagnostic heuristics from dental experts, and guides a general-purpose VLM to perform tooth-level evaluation without dental-specific fine-tuning of the VLM. By utilizing the VLM's existing visual-linguistic capabilities, OMNI-Dent aims to support diagnostic assessment in settings where curated clinical imaging is unavailable. Designed as an early-stage assistive tool, OMNI-Dent helps users identify potential abnormalities and determine when professional evaluation may be needed, offering a practical option for individuals with limited access to in-person care.

cross ShapBPT: Image Feature Attributions Using Data-Aware Binary Partition Trees

Authors: Muhammad Rashid, Elvio G. Amparore, Enrico Ferrari, Damiano Verda

Abstract: Pixel-level feature attributions are an important tool in eXplainable AI for Computer Vision (XCV), providing visual insights into how image features influence model predictions. The Owen formula for hierarchical Shapley values has been widely used to interpret machine learning (ML) models and their learned representations. However, existing hierarchical Shapley approaches do not exploit the multiscale structure of image data, leading to slow convergence and weak alignment with the actual morphological features. Moreover, no prior Shapley method has leveraged data-aware hierarchies for Computer Vision tasks, leaving a gap in model interpretability of structured visual data. To address this, this paper introduces ShapBPT, a novel data-aware XCV method based on the hierarchical Shapley formula. ShapBPT assigns Shapley coefficients to a multiscale hierarchical structure tailored for images, the Binary Partition Tree (BPT). By using this data-aware hierarchical partitioning, ShapBPT ensures that feature attributions align with intrinsic image morphology, effectively prioritizing relevant regions while reducing computational overhead. This advancement connects hierarchical Shapley methods with image data, providing a more efficient and semantically meaningful approach to visual interpretability. Experimental results confirm ShapBPT's effectiveness, demonstrating superior alignment with image structures and improved efficiency over existing XCV methods, and a 20-subject user study confirming that ShapBPT explanations are preferred by humans.

cross Neural Sentinel: Unified Vision Language Model (VLM) for License Plate Recognition with Human-in-the-Loop Continual Learning

Authors: Karthik Sivakoti

Abstract: Traditional Automatic License Plate Recognition (ALPR) systems employ multi-stage pipelines consisting of object detection networks followed by separate Optical Character Recognition (OCR) modules, introducing compounding errors, increased latency, and architectural complexity. This research presents Neural Sentinel, a novel unified approach that leverages Vision Language Models (VLMs) to perform license plate recognition, state classification, and vehicle attribute extraction through a single forward pass. Our primary contribution lies in demonstrating that a fine-tuned PaliGemma 3B model, adapted via Low-Rank Adaptation (LoRA), can simultaneously answer multiple visual questions about vehicle images, achieving 92.3% plate recognition accuracy, which is a 14.1% improvement over EasyOCR and 9.9% improvement over PaddleOCR baselines. We introduce a Human-in-the-Loop (HITL) continual learning framework that incorporates user corrections while preventing catastrophic forgetting through experience replay, maintaining a 70:30 ratio of original training data to correction samples. The system achieves a mean inference latency of 152ms with an Expected Calibration Error (ECE) of 0.048, indicating well calibrated confidence estimates. Additionally, the VLM first architecture enables zero-shot generalization to auxiliary tasks including vehicle color detection (89%), seatbelt detection (82%), and occupancy counting (78%) without task specific training. Through extensive experimentation on real world toll plaza imagery, we demonstrate that unified vision language approaches represent a paradigm shift in ALPR systems, offering superior accuracy, reduced architectural complexity, and emergent multi-task capabilities that traditional pipeline approaches cannot achieve.

cross Theory of Space: Can Foundation Models Construct Spatial Beliefs through Active Exploration?

Authors: Pingyue Zhang, Zihan Huang, Yue Wang, Jieyu Zhang, Letian Xue, Zihan Wang, Qineng Wang, Keshigeyan Chandrasegaran, Ruohan Zhang, Yejin Choi, Ranjay Krishna, Jiajun Wu, Li Fei-Fei, Manling Li

Abstract: Spatial embodied intelligence requires agents to act to acquire information under partial observability. While multimodal foundation models excel at passive perception, their capacity for active, self-directed exploration remains understudied. We propose Theory of Space, defined as an agent's ability to actively acquire information through self-directed, active exploration and to construct, revise, and exploit a spatial belief from sequential, partial observations. We evaluate this through a benchmark where the goal is curiosity-driven exploration to build an accurate cognitive map. A key innovation is spatial belief probing, which prompts models to reveal their internal spatial representations at each step. Our evaluation of state-of-the-art models reveals several critical bottlenecks. First, we identify an Active-Passive Gap, where performance drops significantly when agents must autonomously gather information. Second, we find high inefficiency, as models explore unsystematically compared to program-based proxies. Through belief probing, we diagnose that while perception is an initial bottleneck, global beliefs suffer from instability that causes spatial knowledge to degrade over time. Finally, using a false belief paradigm, we uncover Belief Inertia, where agents fail to update obsolete priors with new evidence. This issue is present in text-based agents but is particularly severe in vision-based models. Our findings suggest that current foundation models struggle to maintain coherent, revisable spatial beliefs during active exploration.

cross FADE: Selective Forgetting via Sparse LoRA and Self-Distillation

Authors: Carolina R. Kelsch, Leonardo S. B. Pereira, Natnael Mola, Luis H. Arribas, Juan C. S. M. Avedillo

Abstract: Machine Unlearning aims to remove the influence of specific data or concepts from trained models while preserving overall performance, a capability increasingly required by data protection regulations and responsible AI practices. Despite recent progress, unlearning in text-to-image diffusion models remains challenging due to high computational costs and the difficulty of balancing effective forgetting with retention of unrelated concepts. We introduce FADE (Fast Adapter for Data Erasure), a two-stage unlearning method for image generation that combines parameter localization with self-distillation. FADE first identifies parameters most responsible for the forget set using gradient-based saliency and constrains updates through sparse LoRA adapters, ensuring lightweight, localized modifications. In a second stage, FADE applies a self-distillation objective that overwrites the forgotten concept with a user-defined surrogate while preserving behavior on retained data. The resulting adapters are memory-efficient, reversible, and can be merged or removed at runtime, enabling flexible deployment in production systems. We evaluated FADE on the UnlearnCanvas benchmark and conducted ablation studies on Imagenette, Labeled Faces in the Wild, AtharvaTaras Dog Breeds Dataset, and SUN Attributes datasets, demonstrating State-of-the-Art unlearning performance with fine-grained control over the forgetting-retention trade-off. Our results demonstrate that FADE achieves strong concept erasure and high retainability across various domains, making it a suitable solution for selective unlearning in diffusion-based image generation models.

cross Pro-ZD: A Transferable Graph Neural Network Approach for Proactive Zero-Day Threats Mitigation

Authors: Nardine Basta, Firas Ben Hmida, Houssem Jmal, Muhammad Ikram, Mohamed Ali Kaafar, Andy Walker

Abstract: In today's enterprise network landscape, the combination of perimeter and distributed firewall rules governs connectivity. To address challenges arising from increased traffic and diverse network architectures, organizations employ automated tools for firewall rule and access policy generation. Yet, effectively managing risks arising from dynamically generated policies, especially concerning critical asset exposure, remains a major challenge. This challenge is amplified by evolving network structures due to trends like remote users, bring-your-own devices, and cloud integration. This paper introduces a novel graph neural network model for identifying weighted shortest paths. The model aids in detecting network misconfigurations and high-risk connectivity paths that threaten critical assets, potentially exploited in zero-day attacks -- cyber-attacks exploiting undisclosed vulnerabilities. The proposed Pro-ZD framework adopts a proactive approach, automatically fine-tuning firewall rules and access policies to address high-risk connections and prevent unauthorized access. Experimental results highlight the robustness and transferability of Pro-ZD, achieving over 95% average accuracy in detecting high-risk connections. \

cross LatentChem: From Textual CoT to Latent Thinking in Chemical Reasoning

Authors: Xinwu Ye, Yicheng Mao, Jia Zhang, Yimeng Liu, Li Hao, Fang Wu, Zhiwei Li, Yuxuan Liao, Zehong Wang, Zhiyuan Liu, Zhenfei Yin, Li Yuan, Philip Torr, Huan Sun, Xiangxiang Zeng, Mengdi Wang, Le Cong, Shenghua Gao, Xiangru Tang

Abstract: Chemical large language models (LLMs) predominantly rely on explicit Chain-of-Thought (CoT) in natural language to perform complex reasoning. However, chemical reasoning is inherently continuous and structural, and forcing it into discrete linguistic tokens introduces a fundamental representation mismatch that constrains both efficiency and performance. We introduce LatentChem, a latent reasoning interface that decouples chemical computation from textual generation, enabling models to perform multi-step reasoning directly in continuous latent space while emitting language only for final outputs. Remarkably, we observe a consistent emergent behavior: when optimized solely for task success, models spontaneously internalize reasoning, progressively abandoning verbose textual derivations in favor of implicit latent computation. This shift is not merely stylistic but computationally advantageous. Across diverse chemical reasoning benchmarks, LatentChem achieves a 59.88\% non-tie win rate over strong CoT-based baselines on ChemCoTBench, while delivering a 10.84$\times$ average inference speedup. Our results provide empirical evidence that chemical reasoning is more naturally and effectively realized as continuous latent dynamics rather than discretized linguistic trajectories.

cross Electron-Informed Coarse-Graining Molecular Representation Learning for Real-World Molecular Physics

Authors: Gyoung S. Na, Chanyoung Park

Abstract: Various representation learning methods for molecular structures have been devised to accelerate data-driven chemistry. However, the representation capabilities of existing methods are essentially limited to atom-level information, which is not sufficient to describe real-world molecular physics. Although electron-level information can provide fundamental knowledge about chemical compounds beyond the atom-level information, obtaining the electron-level information in real-world molecules is computationally impractical and sometimes infeasible. We propose a method for learning electron-informed molecular representations without additional computation costs by transferring readily accessible electron-level information about small molecules to large molecules of our interest. The proposed method achieved state-of-the-art prediction accuracy on extensive benchmark datasets containing experimentally observed molecular physics. The source code for HEDMoL is available at https://github.com/ngs00/HEDMoL.

URLs: https://github.com/ngs00/HEDMoL.

cross BayesFlow 2.0: Multi-Backend Amortized Bayesian Inference in Python

Authors: Lars K\"uhmichel, Jerry M. Huang, Valentin Pratz, Jonas Arruda, Hans Olischl\"ager, Daniel Habermann, Simon Kucharsky, Lasse Elsem\"uller, Aayush Mishra, Niels Bracher, Svenja Jedhoff, Marvin Schmitt, Paul-Christian B\"urkner, Stefan T. Radev

Abstract: Modern Bayesian inference involves a mixture of computational methods for estimating, validating, and drawing conclusions from probabilistic models as part of principled workflows. An overarching motif of many Bayesian methods is that they are relatively slow, which often becomes prohibitive when fitting complex models to large data sets. Amortized Bayesian inference (ABI) offers a path to solving the computational challenges of Bayes. ABI trains neural networks on model simulations, rewarding users with rapid inference of any model-implied quantity, such as point estimates, likelihoods, or full posterior distributions. In this work, we present the Python library BayesFlow, Version 2.0, for general-purpose ABI. Along with direct posterior, likelihood, and ratio estimation, the software includes support for multiple popular deep learning backends, a rich collection of generative networks for sampling and density estimation, complete customization and high-level interfaces, as well as new capabilities for hyperparameter optimization, design optimization, and hierarchical modeling. Using a case study on dynamical system parameter estimation, combined with comparisons to similar software, we show that our streamlined, user-friendly workflow has strong potential to support broad adoption.

cross Fast and Robust Likelihood-Guided Diffusion Posterior Sampling with Amortized Variational Inference

Authors: L\'eon Zheng (MBZUAI, LRE), Thomas Hirtz (MBZUAI, LRE), Yazid Janati (MBZUAI, LRE), Eric Moulines (MBZUAI, LRE)

Abstract: Zero-shot diffusion posterior sampling offers a flexible framework for inverse problems by accommodating arbitrary degradation operators at test time, but incurs high computational cost due to repeated likelihood-guided updates. In contrast, previous amortized diffusion approaches enable fast inference by replacing likelihood-based sampling with implicit inference models, but at the expense of robustness to unseen degradations. We introduce an amortization strategy for diffusion posterior sampling that preserves explicit likelihood guidance by amortizing the inner optimization problems arising in variational diffusion posterior sampling. This accelerates inference for in-distribution degradations while maintaining robustness to previously unseen operators, thereby improving the trade-off between efficiency and flexibility in diffusion-based inverse problems.

cross Reasoning-Augmented Representations for Multimodal Retrieval

Authors: Jianrui Zhang, Anirudh Sundara Rajan, Brandon Han, Soochahn Lee, Sukanta Ganguly, Yong Jae Lee

Abstract: Universal Multimodal Retrieval (UMR) seeks any-to-any search across text and vision, yet modern embedding models remain brittle when queries require latent reasoning (e.g., resolving underspecified references or matching compositional constraints). We argue this brittleness is often data-induced: when images carry "silent" evidence and queries leave key semantics implicit, a single embedding pass must both reason and compress, encouraging spurious feature matching. We propose a data-centric framework that decouples these roles by externalizing reasoning before retrieval. Using a strong Vision--Language Model, we make implicit semantics explicit by densely captioning visual evidence in corpus entries, resolving ambiguous multimodal references in queries, and rewriting verbose instructions into concise retrieval constraints. Inference-time enhancement alone is insufficient; the retriever must be trained on these semantically dense representations to avoid distribution shift and fully exploit the added signal. Across M-BEIR, our reasoning-augmented training method yields consistent gains over strong baselines, with ablations showing that corpus enhancement chiefly benefits knowledge-intensive queries while query enhancement is critical for compositional modification requests. We publicly release our code at https://github.com/AugmentedRetrieval/ReasoningAugmentedRetrieval.

URLs: https://github.com/AugmentedRetrieval/ReasoningAugmentedRetrieval.

cross Discrete Adjoint Matching

Authors: Oswin So, Brian Karrer, Chuchu Fan, Ricky T. Q. Chen, Guan-Horng Liu

Abstract: Computation methods for solving entropy-regularized reward optimization -- a class of problems widely used for fine-tuning generative models -- have advanced rapidly. Among those, Adjoint Matching (AM, Domingo-Enrich et al., 2025) has proven highly effective in continuous state spaces with differentiable rewards. Transferring these practical successes to discrete generative modeling, however, remains particularly challenging and largely unexplored, mainly due to the drastic shift in generative model classes to discrete state spaces, which are nowhere differentiable. In this work, we propose Discrete Adjoint Matching (DAM) -- a discrete variant of AM for fine-tuning discrete generative models characterized by Continuous-Time Markov Chains, such as diffusion-based large language models. The core of DAM is the introduction of discrete adjoint-an estimator of the optimal solution to the original problem but formulated on discrete domains-from which standard matching frameworks can be applied. This is derived via a purely statistical standpoint, in contrast to the control-theoretic viewpoint in AM, thereby opening up new algorithmic opportunities for general adjoint-based estimators. We showcase DAM's effectiveness on synthetic and mathematical reasoning tasks.

cross Free Energy Mixer

Authors: Jiecheng Lu, Shihao Yang

Abstract: Standard attention stores keys/values losslessly but reads them via a per-head convex average, blocking channel-wise selection. We propose the Free Energy Mixer (FEM): a free-energy (log-sum-exp) read that applies a value-driven, per-channel log-linear tilt to a fast prior (e.g., from queries/keys in standard attention) over indices. Unlike methods that attempt to improve and enrich the $(q,k)$ scoring distribution, FEM treats it as a prior and yields a value-aware posterior read at unchanged complexity, smoothly moving from averaging to per-channel selection as the learnable inverse temperature increases, while still preserving parallelism and the original asymptotic complexity ($O(T^2)$ for softmax; $O(T)$ for linearizable variants). We instantiate a two-level gated FEM that is plug-and-play with standard and linear attention, linear RNNs and SSMs. It consistently outperforms strong baselines on NLP, vision, and time-series at matched parameter budgets.

cross High-fidelity 3D multi-slab diffusion MRI using Slab-shifting for Harmonized 3D Acquisition and Reconstruction with Profile Encoding Networks (SHARPEN)

Authors: Ziyu Li, Karla L. Miller, Wenchuan Wu

Abstract: Three-dimensional (3D) multi-slab imaging is a promising approach for high-resolution in vivo diffusion MRI (dMRI) due to its compatibility with short TR (1-2 s), providing optimal signal-to-noise ratio (SNR) efficiency. A major challenge, however, is slab boundary artifacts arising from non-ideal slab-selective RF excitation. Non-rectangular slab profiles reduce signal intensity at slab boundaries, while profile overlap across adjacent slabs introduces inter-slab crosstalk, where repeated excitation shortens the local TR and limits T1 recovery. To mitigate slab boundary artifacts without increasing scan time, we build on slab profile encoding and propose Slab-shifting for Harmonized 3D Acquisition and Reconstruction with Profile Encoding Networks (SHARPEN). For different diffusion directions, SHARPEN applies inter-volume field-of-view shifts along the slice direction to provide complementary slab profile encoding without prolonging acquisition. Slab profiles are estimated using a lightweight self-supervised neural network that exploits consistency across shifted acquisitions and known physical properties of slab profiles and diffusion images, and corrected images are reconstructed accordingly. SHARPEN was validated using simulated and prospectively acquired high-resolution in vivo data and demonstrates accurate slab profile estimation and robust boundary artifact correction, even in the presence of inter-volume motion. SHARPEN does not require high-quality reference training data and supports subject-specific training. Its efficient GPU-based implementation delivers faster and more accurate correction than NPEN, yielding slice-wise quantitative profiles that closely match those from reference 2D acquisitions. SHARPEN enables high-quality dMRI at 0.7 mm isotropic resolution on a 3T clinical scanner, highlighting its potential to advance submillimeter dMRI for neuroscience research.

cross The Value of Variance: Mitigating Debate Collapse in Multi-Agent Systems via Uncertainty-Driven Policy Optimization

Authors: Luoxi Tang, Yuqiao Meng, Joseph Costa, Yingxue Zhang, Muchao Ye, Zhaohan Xi

Abstract: Multi-agent debate (MAD) systems improve LLM reasoning through iterative deliberation, but remain vulnerable to debate collapse, a failure type where final agent decisions are compromised on erroneous reasoning. Existing methods lack principled mechanisms to detect or prevent such failures. To address this gap, we first propose a hierarchical metric that quantifies behavioral uncertainty at three levels: intra-agent (individual reasoning uncertainty), inter-agent (interactive uncertainty), and system-level (output uncertainty). Empirical analysis across several benchmarks reveals that our proposed uncertainty quantification reliably indicates system failures, which demonstrates the validity of using them as diagnostic metrics to indicate the system failure. Subsequently, we propose a mitigation strategy by formulating an uncertainty-driven policy optimization to penalize self-contradiction, peer conflict, and low-confidence outputs in a dynamic debating environment. Experiments demonstrate that our proposed uncertainty-driven mitigation reliably calibrates the multi-agent system by consistently improving decision accuracy while reducing system disagreement.

cross Automated Modernization of Machine Learning Engineering Notebooks for Reproducibility

Authors: Bihui Jin, Kaiyuan Wang, Pengyu Nie

Abstract: Interactive computational notebooks (e.g., Jupyter notebooks) are widely used in machine learning engineering (MLE) to program and share end-to-end pipelines, from data preparation to model training and evaluation. However, environment erosion-the rapid evolution of hardware and software ecosystems for machine learning-has rendered many published MLE notebooks non-reproducible in contemporary environments, hindering code reuse and scientific progress. To quantify this gap, we study 12,720 notebooks mined from 79 popular Kaggle competitions: only 35.4% remain reproducible today. Crucially, we find that environment backporting, i.e., downgrading dependencies to match the submission time, does not improve reproducibility but rather introduces additional failure modes. To address environment erosion, we design and implement MLEModernizer, an LLM-driven agentic framework that treats the contemporary environment as a fixed constraint and modernizes notebook code to restore reproducibility. MLEModernizer iteratively executes notebooks, collects execution feedback, and applies targeted fixes in three types: error-repair, runtime-reduction, and score-calibration. Evaluated on 7,402 notebooks that are non-reproducible under the baseline environment, MLEModernizer makes 5,492 (74.2%) reproducible. MLEModernizer enables practitioners to validate, reuse, and maintain MLE artifacts as the hardware and software ecosystems continue to evolve.

cross Extracting Root-Causal Brain Activity Driving Psychopathology from Resting State fMRI

Authors: Eric V. Strobl

Abstract: Neuroimaging studies of psychiatric disorders often correlate imaging patterns with diagnostic labels or composite symptom scores, yielding diffuse associations that obscure underlying mechanisms. We instead seek to identify root-causal maps -- localized BOLD disturbances that initiate pathological cascades -- and to link them selectively to symptom dimensions. We introduce a bilevel structural causal model that connects between-subject symptom structure to within-subject resting-state fMRI via independent latent sources with localized direct effects. Based on this model, we develop SOURCE (Symptom-Oriented Uncovering of Root-Causal Elements), a procedure that links interpretable symptom axes to a parsimonious set of localized drivers. Experiments show that SOURCE recovers localized maps consistent with root-causal BOLD drivers and increases interpretability and anatomical specificity relative to existing comparators.

cross Is there "Secret Sauce'' in Large Language Model Development?

Authors: Matthias Mertens, Natalia Fischl-Lanzoni, Neil Thompson

Abstract: Do leading LLM developers possess a proprietary ``secret sauce'', or is LLM performance driven by scaling up compute? Using training and benchmark data for 809 models released between 2022 and 2025, we estimate scaling-law regressions with release-date and developer fixed effects. We find clear evidence of developer-specific efficiency advantages, but their importance depends on where models lie in the performance distribution. At the frontier, 80-90% of performance differences are explained by higher training compute, implying that scale--not proprietary technology--drives frontier advances. Away from the frontier, however, proprietary techniques and shared algorithmic progress substantially reduce the compute required to reach fixed capability thresholds. Some companies can systematically produce smaller models more efficiently. Strikingly, we also find substantial variation of model efficiency within companies; a firm can train two models with more than 40x compute efficiency difference. We also discuss the implications for AI leadership and capability diffusion.

cross Beyond Crash: Hijacking Your Autonomous Vehicle for Fun and Profit

Authors: Qi Sun, Ahmed Abdo, Luis Burbano, Ziyang Li, Yaxing Yao, Alvaro Cardenas, Yinzhi Cao

Abstract: Autonomous Vehicles (AVs), especially vision-based AVs, are rapidly being deployed without human operators. As AVs operate in safety-critical environments, understanding their robustness in an adversarial environment is an important research problem. Prior physical adversarial attacks on vision-based autonomous vehicles predominantly target immediate safety failures (e.g., a crash, a traffic-rule violation, or a transient lane departure) by inducing a short-lived perception or control error. This paper shows a qualitatively different risk: a long-horizon route integrity compromise, where an attacker gradually steers a victim AV away from its intended route and into an attacker-chosen destination while the victim continues to drive "normally." This will not pose a danger to the victim vehicle itself, but also to potential passengers sitting inside the vehicle. In this paper, we design and implement the first adversarial framework, called JackZebra, that performs route-level hijacking of a vision-based end-to-end driving stack using a physically plausible attacker vehicle with a reconfigurable display mounted on the rear. The central challenge is temporal persistence: adversarial influence must remain effective in changing viewpoints, lighting, weather, traffic, and the victim's continual replanning -- without triggering conspicuous failures. Our key insight is to treat route hijacking as a closed-loop control problem and to convert adversarial patches into steering primitives that can be selected online via an interactive adjustment loop. Our adversarial patches are also carefully designed against worst-case background and sensor variations so that the adversarial impacts on the victim. Our evaluation shows that JackZebra can successfully hijack victim vehicles to deviate from original routes and stop at adversarial destinations with a high success rate.

cross 3D Transport-based Morphometry (3D-TBM) for medical image analysis

Authors: Hongyu Kan, Kristofor Pas, Ivan Medri, Naqib Sad Pathan, Natasha Ironside, Shinjini Kundu, Jingjia He, Gustavo Kunde Rohde

Abstract: Transport-Based Morphometry (TBM) has emerged as a new framework for 3D medical image analysis. By embedding images into a transport domain via invertible transformations, TBM facilitates effective classification, regression, and other tasks using transport-domain features. Crucially, the inverse mapping enables the projection of analytic results back into the original image space, allowing researchers to directly interpret clinical features associated with model outputs in a spatially meaningful way. To facilitate broader adoption of TBM in clinical imaging research, we present 3D-TBM, a tool designed for morphological analysis of 3D medical images. The framework includes data preprocessing, computation of optimal transport embeddings, and analytical methods such as visualization of main transport directions, together with techniques for discerning discriminating directions and related analysis methods. We also provide comprehensive documentation and practical tutorials to support researchers interested in applying 3D-TBM in their own medical imaging studies. The source code is publicly available through PyTransKit.

cross BRIDGE: Predicting Human Task Completion Time From Model Performance

Authors: Fengyuan Liu, Jay Gala, Nilaksh, Dzmitry Bahdanau, Siva Reddy, Hugo Larochelle

Abstract: Evaluating the real-world capabilities of AI systems requires grounding benchmark performance in human-interpretable measures of task difficulty. Existing approaches that rely on direct human task completion time annotations are costly, noisy, and difficult to scale across benchmarks. In this work, we propose BRIDGE, a unified psychometric framework that learns the latent difficulty scale from model responses and anchors it to human task completion time. Using a two-parameter logistic Item Response Theory model, we jointly estimate latent task difficulty and model capability from model performance data across multiple benchmarks. We demonstrate that latent task difficulty varies linearly with the logarithm of human completion time, allowing human task completion time to be inferred for new benchmarks from model performance alone. Leveraging this alignment, we forecast frontier model capabilities in terms of human task length and independently reproduce METR's exponential scaling results, with the 50% solvable task horizon doubling approximately every 6 months.

cross Steer2Adapt: Dynamically Composing Steering Vectors Elicits Efficient Adaptation of LLMs

Authors: Pengrui Han, Xueqiang Xu, Keyang Xuan, Peiyang Song, Siru Ouyang, Runchu Tian, Yuqing Jiang, Cheng Qian, Pengcheng Jiang, Jiashuo Sun, Junxia Cui, Ming Zhong, Ge Liu, Jiawei Han, Jiaxuan You

Abstract: Activation steering has emerged as a promising approach for efficiently adapting large language models (LLMs) to downstream behaviors. However, most existing steering methods rely on a single static direction per task or concept, making them inflexible under task variation and inadequate for complex tasks that require multiple coordinated capabilities. To address this limitation, we propose STEER2ADAPT, a lightweight framework that adapts LLMs by composing steering vectors rather than learning new ones from scratch. In many domains (e.g., reasoning or safety), tasks share a small set of underlying concept dimensions. STEER2ADAPT captures these dimensions as a reusable, low-dimensional semantic prior subspace, and adapts to new tasks by dynamically discovering a linear combination of basis vectors from only a handful of examples. Experiments across 9 tasks and 3 models in both reasoning and safety domains demonstrate the effectiveness of STEER2ADAPT, achieving an average improvement of 8.2%. Extensive analyses further show that STEER2ADAPT is a data-efficient, stable, and transparent inference-time adaptation method for LLMs.

cross Cross-View World Models

Authors: Rishabh Sharma, Gijs Hogervorst, Wayne E. Mackey, David J. Heeger, Stefano Martiniani

Abstract: World models enable agents to plan by imagining future states, but existing approaches operate from a single viewpoint, typically egocentric, even when other perspectives would make planning easier; navigation, for instance, benefits from a bird's-eye view. We introduce Cross-View World Models (XVWM), trained with a cross-view prediction objective: given a sequence of frames from one viewpoint, predict the future state from the same or a different viewpoint after an action is taken. Enforcing cross-view consistency acts as geometric regularization: because the input and output views may share little or no visual overlap, to predict across viewpoints, the model must learn view-invariant representations of the environment's 3D structure. We train on synchronized multi-view gameplay data from Aimlabs, an aim-training platform providing precisely aligned multi-camera recordings with high-frequency action labels. The resulting model gives agents parallel imagination streams across viewpoints, enabling planning in whichever frame of reference best suits the task while executing from the egocentric view. Our results show that multi-view consistency provides a strong learning signal for spatially grounded representations. Finally, predicting the consequences of one's actions from another viewpoint may offer a foundation for perspective-taking in multi-agent settings.

cross Parallel Track Transformers: Enabling Fast GPU Inference with Reduced Synchronization

Authors: Chong Wang, Nan Du, Tom Gunter, Tao Lei, Kulin Seth, Senyu Tong, Jianyu Wang, Guoli Yin, Xiyou Zhou, Kelvin Zou, Ruoming Pang

Abstract: Efficient large-scale inference of transformer-based large language models (LLMs) remains a fundamental systems challenge, frequently requiring multi-GPU parallelism to meet stringent latency and throughput targets. Conventional tensor parallelism decomposes matrix operations across devices but introduces substantial inter-GPU synchronization, leading to communication bottlenecks and degraded scalability. We propose the Parallel Track (PT) Transformer, a novel architectural paradigm that restructures computation to minimize cross-device dependencies. PT achieves up to a 16x reduction in synchronization operations relative to standard tensor parallelism, while maintaining competitive model quality in our experiments. We integrate PT into two widely adopted LLM serving stacks-Tensor-RT-LLM and vLLM-and report consistent improvements in serving efficiency, including up to 15-30% reduced time to first token, 2-12% reduced time per output token, and up to 31.90% increased throughput in both settings.

cross Semantic Search At LinkedIn

Authors: Fedor Borisyuk, Sriram Vasudevan, Muchen Wu, Guoyao Li, Benjamin Le, Shaobo Zhang, Qianqi Kay Shen, Yuchin Juan, Kayhan Behdin, Liming Dong, Kaixu Yang, Shusen Jing, Ravi Pothamsetty, Rajat Arora, Sophie Yanying Sheng, Vitaly Abdrashitov, Yang Zhao, Lin Su, Xiaoqing Wang, Chujie Zheng, Sarang Metkar, Rupesh Gupta, Igor Lapchuk, David N. Racca, Madhumitha Mohan, Yanbo Li, Haojun Li, Saloni Gandhi, Xueying Lu, Chetan Bhole, Ali Hooshmand, Xin Yang, Raghavan Muthuregunathan, Jiajun Zhang, Mathew Teoh, Adam Coler, Abhinav Gupta, Xiaojing Ma, Sundara Raman Ramachandran, Morteza Ramezani, Yubo Wang, Lijuan Zhang, Richard Li, Jian Sheng, Chanh Nguyen, Yen-Chi Chen, Chuanrui Zhu, Claire Zhang, Jiahao Xu, Deepti Kulkarni, Qing Lan, Arvind Subramaniam, Ata Fatahibaarzi, Steven Shimizu, Yanning Chen, Zhipeng Wang, Ran He, Zhengze Zhou, Qingquan Song, Yun Dai, Caleb Johnson, Ping Liu, Shaghayegh Gharghabi, Gokulraj Mohanasundaram, Juan Bottaro, Santhosh Sachindran, Qi Guo, Yunxiang Ren, Chengming Jiang, Di Mo, Luke Simon, Jianqiang Shen, Jingwei Wu, Wenjing Zhang

Abstract: Semantic search with large language models (LLMs) enables retrieval by meaning rather than keyword overlap, but scaling it requires major inference efficiency advances. We present LinkedIn's LLM-based semantic search framework for AI Job Search and AI People Search, combining an LLM relevance judge, embedding-based retrieval, and a compact Small Language Model trained via multi-teacher distillation to jointly optimize relevance and engagement. A prefill-oriented inference architecture co-designed with model pruning, context compression, and text-embedding hybrid interactions boosts ranking throughput by over 75x under a fixed latency constraint while preserving near-teacher-level NDCG, enabling one of the first production LLM-based ranking systems with efficiency comparable to traditional approaches and delivering significant gains in quality and user engagement.

cross Optimization of Precipitate Segmentation Through Linear Genetic Programming of Image Processing

Authors: Kyle Williams, Andrew Seltzman

Abstract: Current analysis of additive manufactured niobium-based copper alloys relies on hand annotation due to varying contrast, noise, and image artifacts present in micrographs, slowing iteration speed in alloy development. We present a filtering and segmentation algorithm for detecting precipitates in FIB cross-section micrographs, optimized using linear genetic programming (LGP), which accounts for the various artifacts. To this end, the optimization environment uses a domain-specific language for image processing to iterate on solutions. Programs in this language are a list of image-filtering blocks with tunable parameters that sequentially process an input image, allowing for reliable generation and mutation by a genetic algorithm. Our environment produces optimized human-interpretable MATLAB code representing an image filtering pipeline. Under ideal conditions--a population size of 60 and a maximum program length of 5 blocks--our system was able to find a near-human accuracy solution with an average evaluation error of 1.8% when comparing segmentations pixel-by-pixel to a human baseline using an XOR error evaluation. Our automation work enabled faster iteration cycles and furthered exploration of the material composition and processing space: our optimized pipeline algorithm processes a 3.6 megapixel image in about 2 seconds on average. This ultimately enables convergence on strong, low-activation, precipitation hardened copper alloys for additive manufactured fusion reactor parts.

cross High Fidelity Textual User Representation over Heterogeneous Sources via Reinforcement Learning

Authors: Rajat Arora, Ye Tao, Jianqiang Shen, Ping Liu, Muchen Wu, Qianqi Shen, Benjamin Le, Fedor Borisyuk, Jingwei Wu, Wenjing Zhang

Abstract: Effective personalization on large-scale job platforms requires modeling members based on heterogeneous textual sources, including profiles, professional data, and search activity logs. As recommender systems increasingly adopt Large Language Models (LLMs), creating unified, interpretable, and concise representations from heterogeneous sources becomes critical, especially for latency-sensitive online environments. In this work, we propose a novel Reinforcement Learning (RL) framework to synthesize a unified textual representation for each member. Our approach leverages implicit user engagement signals (e.g., clicks, applies) as the primary reward to distill salient information. Additionally, the framework is complemented by rule-based rewards that enforce formatting and length constraints. Extensive offline experiments across multiple LinkedIn products, one of the world's largest job platforms, demonstrate significant improvements in key downstream business metrics. This work provides a practical, labeling-free, and scalable solution for constructing interpretable user representations that are directly compatible with LLM-based systems.

cross RAPiD: Real-time Deterministic Trajectory Planning via Diffusion Behavior Priors for Safe and Efficient Autonomous Driving

Authors: Ruturaj Reddy, Hrishav Bakul Barua, Junn Yong Loo, Thanh Thi Nguyen, Ganesh Krishnasamy

Abstract: Diffusion-based trajectory planners have demonstrated strong capability for modeling the multimodal nature of human driving behavior, but their reliance on iterative stochastic sampling poses critical challenges for real-time, safety-critical deployment. In this work, we present RAPiD, a deterministic policy extraction framework that distills a pretrained diffusion-based planner into an efficient policy while eliminating diffusion sampling. Using score-regularized policy optimization, we leverage the score function of a pre-trained diffusion planner as a behavior prior to regularize policy learning. To promote safety and passenger comfort, the policy is optimized using a critic trained to imitate a predictive driver controller, providing dense, safety-focused supervision beyond conventional imitation learning. Evaluations demonstrate that RAPiD achieves competitive performance on closed-loop nuPlan scenarios with an 8x speedup over diffusion baselines, while achieving state-of-the-art generalization among learning-based planners on the interPlan benchmark. The official website of this work is: https://github.com/ruturajreddy/RAPiD.

URLs: https://github.com/ruturajreddy/RAPiD.

cross Seeing Roads Through Words: A Language-Guided Framework for RGB-T Driving Scene Segmentation

Authors: Ruturaj Reddy, Hrishav Bakul Barua, Junn Yong Loo, Thanh Thi Nguyen, Ganesh Krishnasamy

Abstract: Robust semantic segmentation of road scenes under adverse illumination, lighting, and shadow conditions remain a core challenge for autonomous driving applications. RGB-Thermal fusion is a standard approach, yet existing methods apply static fusion strategies uniformly across all conditions, allowing modality-specific noise to propagate throughout the network. Hence, we propose CLARITY that dynamically adapts its fusion strategy to the detected scene condition. Guided by vision-language model (VLM) priors, the network learns to modulate each modality's contribution based on the illumination state while leveraging object embeddings for segmentation, rather than applying a fixed fusion policy. We further introduce two mechanisms, i.e., one which preserves valid dark-object semantics that prior noise-suppression methods incorrectly discard, and a hierarchical decoder that enforces structural consistency across scales to sharpen boundaries on thin objects. Experiments on the MFNet dataset demonstrate that CLARITY establishes a new state-of-the-art (SOTA), achieving 62.3% mIoU and 77.5% mAcc.

cross Optimizing Few-Step Generation with Adaptive Matching Distillation

Authors: Lichen Bai, Zikai Zhou, Shitong Shao, Wenliang Zhong, Shuo Yang, Shuo Chen, Bojun Chen, Zeke Xie

Abstract: Distribution Matching Distillation (DMD) is a powerful acceleration paradigm, yet its stability is often compromised in Forbidden Zone, regions where the real teacher provides unreliable guidance while the fake teacher exerts insufficient repulsive force. In this work, we propose a unified optimization framework that reinterprets prior art as implicit strategies to avoid these corrupted regions. Based on this insight, we introduce Adaptive Matching Distillation (AMD), a self-correcting mechanism that utilizes reward proxies to explicitly detect and escape Forbidden Zones. AMD dynamically prioritizes corrective gradients via structural signal decomposition and introduces Repulsive Landscape Sharpening to enforce steep energy barriers against failure mode collapse. Extensive experiments across image and video generation tasks (e.g., SDXL, Wan2.1) and rigorous benchmarks (e.g., VBench, GenEval) demonstrate that AMD significantly enhances sample fidelity and training robustness. For instance, AMD improves the HPSv2 score on SDXL from 30.64 to 31.25, outperforming state-of-the-art baselines. These findings validate that explicitly rectifying optimization trajectories within Forbidden Zones is essential for pushing the performance ceiling of few-step generative models.

cross Efficient Post-Training Pruning of Large Language Models with Statistical Correction

Authors: Peiqi Yu, Jinhao Wang, Xinyi Sui, Nam Ling, Wei Wang, Wei Jiang

Abstract: Post-training pruning is an effective approach for reducing the size and inference cost of large language models (LLMs), but existing methods often face a trade-off between pruning quality and computational efficiency. Heuristic pruning methods are efficient but sensitive to activation outliers, while reconstruction-based approaches improve fidelity at the cost of heavy computation. In this work, we propose a lightweight post-training pruning framework based on first-order statistical properties of model weights and activations. During pruning, channel-wise statistics are used to calibrate magnitude-based importance scores, reducing bias from activation-dominated channels. After pruning, we apply an analytic energy compensation to correct distributional distortions caused by weight removal. Both steps operate without retraining, gradients, or second-order information. Experiments across multiple LLM families, sparsity patterns, and evaluation tasks show that the proposed approach improves pruning performance while maintaining computational cost comparable to heuristic methods. The results suggest that simple statistical corrections can be effective for post-training pruning of LLMs.

cross Learned Finite Element-based Regularization of the Inverse Problem in Electrocardiographic Imaging

Authors: Manuel Haas, Thomas Grandits, Thomas Pinetz, Thomas Beiert, Simone Pezzuto, Alexander Effland

Abstract: Electrocardiographic imaging (ECGI) seeks to reconstruct cardiac electrical activity from body-surface potentials noninvasively. However, the associated inverse problem is severely ill-posed and requires robust regularization. While classical approaches primarily employ spatial smoothing, the temporal structure of cardiac dynamics remains underexploited despite its physiological relevance. We introduce a space-time regularization framework that couples spatial regularization with a learned temporal Fields-of-Experts (FoE) prior to capture complex spatiotemporal activation patterns. We derive a finite element discretization on unstructured cardiac surface meshes, prove Mosco-convergence, and develop a scalable optimization algorithm capable of handling the FoE term. Numerical experiments on synthetic epicardial data demonstrate improved denoising and inverse reconstructions compared to handcrafted spatiotemporal methods, yielding solutions that are both robust to noise and physiologically plausible.

cross Statistical inference after variable selection in Cox models: A simulation study

Authors: Lena Schemet, Sarah Friedrich-Welz

Abstract: Choosing relevant predictors is central to the analysis of biomedical time-to-event data. Classical frequentist inference, however, presumes that the set of covariates is fixed in advance and does not account for data-driven variable selection. As a consequence, naive post-selection inference may be biased and misleading. In right-censored survival settings, these issues may be further exacerbated by the additional uncertainty induced by censoring. We investigate several inference procedures applied after variable selection for the coefficients of the Lasso and its extension, the adaptive Lasso, in the context of the Cox model. The methods considered include sample splitting, exact post-selection inference, and the debiased Lasso. Their performance is examined in a neutral simulation study reflecting realistic covariate structures and censoring rates commonly encountered in biomedical applications. To complement the simulation results, we illustrate the practical behavior of these procedures in an applied example using a publicly available survival dataset.

cross GraphAgents: Knowledge Graph-Guided Agentic AI for Cross-Domain Materials Design

Authors: Isabella A. Stewart, Tarjei Paule Hage, Yu-Chuan Hsu, Markus J. Buehler

Abstract: Large Language Models (LLMs) promise to accelerate discovery by reasoning across the expanding scientific landscape. Yet, the challenge is no longer access to information but connecting it in meaningful, domain-spanning ways. In materials science, where innovation demands integrating concepts from molecular chemistry to mechanical performance, this is especially acute. Neither humans nor single-agent LLMs can fully contend with this torrent of information, with the latter often prone to hallucinations. To address this bottleneck, we introduce a multi-agent framework guided by large-scale knowledge graphs to find sustainable substitutes for per- and polyfluoroalkyl substances (PFAS)-chemicals currently under intense regulatory scrutiny. Agents in the framework specialize in problem decomposition, evidence retrieval, design parameter extraction, and graph traversal, uncovering latent connections across distinct knowledge pockets to support hypothesis generation. Ablation studies show that the full multi-agent pipeline outperforms single-shot prompting, underscoring the value of distributed specialization and relational reasoning. We demonstrate that by tailoring graph traversal strategies, the system alternates between exploitative searches focusing on domain-critical outcomes and exploratory searches surfacing emergent cross-connections. Illustrated through the exemplar of biomedical tubing, the framework generates sustainable PFAS-free alternatives that balance tribological performance, thermal stability, chemical resistance, and biocompatibility. This work establishes a framework combining knowledge graphs with multi-agent reasoning to expand the materials design space, showcasing several initial design candidates to demonstrate the approach.

cross Physical Analog Kolmogorov-Arnold Networks based on Reconfigurable Nonlinear-Processing Units

Authors: Manuel Escudero, Mohamadreza Zolfagharinejad, Sjoerd van den Belt, Nikolaos Alachiotis, Wilfred G. van der Wiel

Abstract: Kolmogorov-Arnold Networks (KANs) shift neural computation from linear layers to learnable nonlinear edge functions, but implementing these nonlinearities efficiently in hardware remains an open challenge. Here we introduce a physical analog KAN architecture in which edge functions are realized in materia using reconfigurable nonlinear-processing units (RNPUs): multi-terminal nanoscale silicon devices whose input-output characteristics are tuned via control voltages. By combining multiple RNPUs into an edge processor and assembling these blocks into a reconfigurable analog KAN (aKAN) architecture with integrated mixed-signal interfacing, we establish a realistic system-level hardware implementation that enables compact KAN-style regression and classification with programmable nonlinear transformations. Using experimentally calibrated RNPU models and hardware measurements, we demonstrate accurate function approximation across increasing task complexity while requiring fewer or comparable trainable parameters than multilayer perceptrons (MLPs). System-level estimates indicate an energy per inference of $\sim$250 pJ and an end-to-end inference latency of $\sim$600 ns for a representative workload, corresponding to a $\sim$10$^{2}$-10$^{3}\times$ reduction in energy accompanied by a $\sim$10$\times$ reduction in area compared to a digital fixed-point MLP at similar approximation error. These results establish RNPUs as scalable, hardware-native nonlinear computing primitives and identify analog KAN architectures as a realistic silicon-based pathway toward energy-, latency-, and footprint-efficient analog neural-network hardware, particularly for edge inference.

cross MDL: A Unified Multi-Distribution Learner in Large-scale Industrial Recommendation through Tokenization

Authors: Shanlei Mu, Yuchen Jiang, Shikang Wu, Shiyong Hong, Tianmu Sha, Junjie Zhang, Jie Zhu, Zhe Chen, Zhe Wang, Jingjian Lin

Abstract: Industrial recommender systems increasingly adopt multi-scenario learning (MSL) and multi-task learning (MTL) to handle diverse user interactions and contexts, but existing approaches suffer from two critical drawbacks: (1) underutilization of large-scale model parameters due to limited interaction with complex feature modules, and (2) difficulty in jointly modeling scenario and task information in a unified framework. To address these challenges, we propose a unified \textbf{M}ulti-\textbf{D}istribution \textbf{L}earning (MDL) framework, inspired by the "prompting" paradigm in large language models (LLMs). MDL treats scenario and task information as specialized tokens rather than auxiliary inputs or gating signals. Specifically, we introduce a unified information tokenization module that transforms features, scenarios, and tasks into a unified tokenized format. To facilitate deep interaction, we design three synergistic mechanisms: (1) feature token self-attention for rich feature interactions, (2) domain-feature attention for scenario/task-adaptive feature activation, and (3) domain-fused aggregation for joint distribution prediction. By stacking these interactions, MDL enables scenario and task information to "prompt" and activate the model's vast parameter space in a bottom-up, layer-wise manner. Extensive experiments on real-world industrial datasets demonstrate that MDL significantly outperforms state-of-the-art MSL and MTL baselines. Online A/B testing on Douyin Search platform over one month yields +0.0626\% improvement in LT30 and -0.3267\% reduction in change query rate. MDL has been fully deployed in production, serving hundreds of millions of users daily.

cross Evaluating Object-Centric Models beyond Object Discovery

Authors: Krishnakant Singh, Simone Schaub-Meyer, Stefan Roth

Abstract: Object-centric learning (OCL) aims to learn structured scene representations that support compositional generalization and robustness to out-of-distribution (OOD) data. However, OCL models are often not evaluated regarding these goals. Instead, most prior work focuses on evaluating OCL models solely through object discovery and simple reasoning tasks, such as probing the representation via image classification. We identify two limitations in existing benchmarks: (1) They provide limited insights on the representation usefulness of OCL models, and (2) localization and representation usefulness are assessed using disjoint metrics. To address (1), we use instruction-tuned VLMs as evaluators, enabling scalable benchmarking across diverse VQA datasets to measure how well VLMs leverage OCL representations for complex reasoning tasks. To address (2), we introduce a unified evaluation task and metric that jointly assess localization (where) and representation usefulness (what), thereby eliminating inconsistencies introduced by disjoint evaluation. Finally, we include a simple multi-feature reconstruction baseline as a reference point.

cross LLM-Guided Diagnostic Evidence Alignment for Medical Vision-Language Pretraining under Limited Pairing

Authors: Huimin Yan, Liang Bai, Xian Yang, Long Chen

Abstract: Most existing CLIP-style medical vision--language pretraining methods rely on global or local alignment with substantial paired data. However, global alignment is easily dominated by non-diagnostic information, while local alignment fails to integrate key diagnostic evidence. As a result, learning reliable diagnostic representations becomes difficult, which limits their applicability in medical scenarios with limited paired data. To address this issue, we propose an LLM-Guided Diagnostic Evidence Alignment method (LGDEA), which shifts the pretraining objective toward evidence-level alignment that is more consistent with the medical diagnostic process. Specifically, we leverage LLMs to extract key diagnostic evidence from radiology reports and construct a shared diagnostic evidence space, enabling evidence-aware cross-modal alignment and allowing LGDEA to effectively exploit abundant unpaired medical images and reports, thereby substantially alleviating the reliance on paired data. Extensive experimental results demonstrate that our method achieves consistent and significant improvements on phrase grounding, image--text retrieval, and zero-shot classification, and even rivals pretraining methods that rely on substantial paired data.

cross Improving Variable-Length Generation in Diffusion Language Models via Length Regularization

Authors: Zicong Cheng, Ruixuan Jia, Jia Li, Guo-Wei Yang, Meng-Hao Guo, Shi-Min Hu

Abstract: Diffusion Large Language Models (DLLMs) are inherently ill-suited for variable-length generation, as their inference is defined on a fixed-length canvas and implicitly assumes a known target length. When the length is unknown, as in realistic completion and infilling, naively comparing confidence across mask lengths becomes systematically biased, leading to under-generation or redundant continuations. In this paper, we show that this failure arises from an intrinsic lengthinduced bias in generation confidence estimates, leaving existing DLLMs without a robust way to determine generation length and making variablelength inference unreliable. To address this issue, we propose LR-DLLM, a length-regularized inference framework for DLLMs that treats generation length as an explicit variable and achieves reliable length determination at inference time. It decouples semantic compatibility from lengthinduced uncertainty through an explicit length regularization that corrects biased confidence estimates. Based on this, LR-DLLM enables dynamic expansion or contraction of the generation span without modifying the underlying DLLM or its training procedure. Experiments show that LRDLLM achieves 51.3% Pass@1 on HumanEvalInfilling under fully unknown lengths (+13.4% vs. DreamOn) and 51.5% average Pass@1 on four-language McEval (+14.3% vs. DreamOn).

cross Linguistic properties and model scale in brain encoding: from small to compressed language models

Authors: Subba Reddy Oota, Vijay Rowtula, Satya Sai Srinath Namburi, Khushbu Pahwa, Anant Khandelwal, Manish Gupta, Tanmoy Chakraborty, Bapi S. Raju

Abstract: Recent work has shown that scaling large language models (LLMs) improves their alignment with human brain activity, yet it remains unclear what drives these gains and which representational properties are responsible. Although larger models often yield better task performance and brain alignment, they are increasingly difficult to analyze mechanistically. This raises a fundamental question: what is the minimal model capacity required to capture brain-relevant representations? To address this question, we systematically investigate how constraining model scale and numerical precision affects brain alignment. We compare full-precision LLMs, small language models (SLMs), and compressed variants (quantized and pruned) by predicting fMRI responses during naturalistic language comprehension. Across model families up to 14B parameters, we find that 3B SLMs achieve brain predictivity indistinguishable from larger LLMs, whereas 1B models degrade substantially, particularly in semantic language regions. Brain alignment is remarkably robust to compression: most quantization and pruning methods preserve neural predictivity, with GPTQ as a consistent exception. Linguistic probing reveals a dissociation between task performance and brain predictivity: compression degrades discourse, syntax, and morphology, yet brain predictivity remains largely unchanged. Overall, brain alignment saturates at modest model scales and is resilient to compression, challenging common assumptions about neural scaling and motivating compact models for brain-aligned language modeling.

cross Capturing the Topological Phase Transition and Thermodynamics of the 2D XY Model via Manifold-Aware Score-Based Generative Modeling

Authors: Pratyush Jha

Abstract: The application of generative modeling to many-body physics offers a promising pathway for analyzing high-dimensional state spaces of spin systems. However, unlike computer vision tasks where visual fidelity suffices, physical systems require the rigorous reproduction of higher-order statistical moments and thermodynamic quantities. While Score-Based Generative Models (SGMs) have emerged as a powerful tool, their standard formulation on Euclidean embedding space is ill-suited for continuous spin systems, where variables inherently reside on a manifold. In this work, we demonstrate that training on the Euclidean space compromises the model's ability to learn the target distribution as it prioritizes to learn the manifold constraints. We address this limitation by proposing the use of Manifold-Aware Score-Based Generative Modeling framework applied to the 64x64 2D XY model (a 4096-dimensional torus). We show that our method estimates the theoretical Boltzmann score with superior precision compared to standard diffusion models. Consequently, we successfully capture the Berezinskii-Kosterlitz Thouless (BKT) phase transition and accurately reproduce second-moment quantities, such as heat capacity without explicit feature engineering. Furthermore, we demonstrate zero-shot generalization to unseen lattice sizes, accurately recovering the physics of variable system scales without retraining. Since this approach bypasses domain-specific feature engineering, it remains intrinsically generalizable to other continuous spin systems.

cross How does longer temporal context enhance multimodal narrative video processing in the brain?

Authors: Prachi Jindal, Anant Khandelwal, Manish Gupta, Bapi S. Raju, Subba Reddy Oota, Tanmoy Chakraborty

Abstract: Understanding how humans and artificial intelligence systems process complex narrative videos is a fundamental challenge at the intersection of neuroscience and machine learning. This study investigates how the temporal context length of video clips (3--12 s clips) and the narrative-task prompting shape brain-model alignment during naturalistic movie watching. Using fMRI recordings from participants viewing full-length movies, we examine how brain regions sensitive to narrative context dynamically represent information over varying timescales and how these neural patterns align with model-derived features. We find that increasing clip duration substantially improves brain alignment for multimodal large language models (MLLMs), whereas unimodal video models show little to no gain. Further, shorter temporal windows align with perceptual and early language regions, while longer windows preferentially align higher-order integrative regions, mirrored by a layer-to-cortex hierarchy in MLLMs. Finally, narrative-task prompts (multi-scene summary, narrative summary, character motivation, and event boundary detection) elicit task-specific, region-dependent brain alignment patterns and context-dependent shifts in clip-level tuning in higher-order regions. Together, our results position long-form narrative movies as a principled testbed for probing biologically relevant temporal integration and interpretable representations in long-context MLLMs.

cross $\partial$CBDs: Differentiable Causal Block Diagrams

Authors: Thomas Beckers, J\'an Drgo\v{n}a, Truong X. Nghiem

Abstract: Modern cyber-physical systems (CPS) integrate physics, computation, and learning, demanding modeling frameworks that are simultaneously composable, learnable, and verifiable. Yet existing approaches treat these goals in isolation: causal block diagrams (CBDs) support modular system interconnections but lack differentiability for learning; differentiable programming (DP) enables end-to-end gradient-based optimization but provides limited correctness guarantees; while contract-based verification frameworks remain largely disconnected from data-driven model refinement. To address these limitations, we introduce differentiable causal block diagrams ($\partial$CBDs), a unifying formalism that integrates these three perspectives. Our approach (i) retains the compositional structure and execution semantics of CBDs, (ii) incorporates assume--guarantee (A--G) contracts for modular correctness reasoning, and (iii) introduces residual-based contracts as differentiable, trajectory-level certificates compatible with automatic differentiation (AD), enabling gradient-based optimization and learning. Together, these elements enable a scalable, verifiable, and trainable modeling pipeline that preserves causality and modularity while supporting data-, physics-, and constraint-informed optimization for CPS.

cross Automated rock joint trace mapping using a supervised learning model trained on synthetic data generated by parametric modelling

Authors: Jessica Ka Yi Chiu, Tom Frode Hansen, Eivind Magnus Paulsen, Ole Jakob Mengshoel

Abstract: This paper presents a geology-driven machine learning method for automated rock joint trace mapping from images. The approach combines geological modelling, synthetic data generation, and supervised image segmentation to address limited real data and class imbalance. First, discrete fracture network models are used to generate synthetic jointed rock images at field-relevant scales via parametric modelling, preserving joint persistence, connectivity, and node-type distributions. Second, segmentation models are trained using mixed training and pretraining followed by fine-tuning on real images. The method is tested in box and slope domains using several real datasets. The results show that synthetic data can support supervised joint trace detection when real data are scarce. Mixed training performs well when real labels are consistent (e.g. box-domain), while fine-tuning is more robust when labels are noisy (e.g. slope-domain where labels can be biased, incomplete, and inconsistent). Fully zero-shot prediction from synthetic model remains limited, but useful generalisation is achieved by fine-tuning with a small number of real data. Qualitative analysis shows clearer and more geologically meaningful joint traces than indicated by quantitative metrics alone. The proposed method supports reliable joint mapping and provides a basis for further work on domain adaptation and evaluation.

cross SleepMaMi: A Universal Sleep Foundation Model for Integrating Macro- and Micro-structures

Authors: Keondo Park, Younghoon Na, Yourim Choi, Hyunwoo Ryu, Hyun-Woo Shin, Hyung-Sin Kim

Abstract: While the shift toward unified foundation models has revolutionized many deep learning domains, sleep medicine remains largely restricted to task-specific models that focus on localized micro-structure features. These approaches often neglect the rich, multi-modal context of Polysomnography (PSG) and fail to capture the global macro-structure of a full night's sleep. To address this, we introduce SleepMaMi , a Sleep Foundation Model engineered to master both hour-long sleep architectures and fine-grained signal morphologies. Our framework utilizes a hierarchical dual-encoder design: a Macro-Encoder to model full-night temporal dependencies and a Micro-Encoder to capture short-term characteristics from biosignals. Macro-Encoder is trained via Demographic-Guided Contrastive Learning, which aligns overnight sleep patterns with objective subject metadata, such as age, sex and BMI to refine global representations. Micro-Encoder is optimized via a hybrid Masked Autoencoder (MAE) and multi-modal contrastive objective. Pre-trained on a massive corpus of $>$20,000 PSG recordings (158K hours),SleepMaMi outperforms existing foundation models across a diverse suite of downstream tasks, demonstrating superior generalizability and label-efficient adaptation for clinical sleep analysis.

cross Scalable Mean-Field Variational Inference via Preconditioned Primal-Dual Optimization

Authors: Jinhua Lyu, Tianmin Yu, Ying Ma, Naichen Shi

Abstract: In this work, we investigate the large-scale mean-field variational inference (MFVI) problem from a mini-batch primal-dual perspective. By reformulating MFVI as a constrained finite-sum problem, we develop a novel primal-dual algorithm based on an augmented Lagrangian formulation, termed primal-dual variational inference (PD-VI). PD-VI jointly updates global and local variational parameters in the evidence lower bound in a scalable manner. To further account for heterogeneous loss geometry across different variational parameter blocks, we introduce a block-preconditioned extension, P$^2$D-VI, which adapts the primal-dual updates to the geometry of each parameter block and improves both numerical robustness and practical efficiency. We establish convergence guarantees for both PD-VI and P$^2$D-VI under properly chosen constant step size, without relying on conjugacy assumptions or explicit bounded-variance conditions. In particular, we prove $O(1/T)$ convergence to a stationary point in general settings and linear convergence under strong convexity. Numerical experiments on synthetic data and a real large-scale spatial transcriptomics dataset demonstrate that our methods consistently outperform existing stochastic variational inference approaches in terms of convergence speed and solution quality.

cross Flow-Based Conformal Predictive Distributions

Authors: Trevor Harris

Abstract: Conformal prediction provides a distribution-free framework for uncertainty quantification via prediction sets with exact finite-sample coverage. In low dimensions these sets are easy to interpret, but in high-dimensional or structured output spaces they are difficult to represent and use, which can limit their ability to integrate with downstream tasks such as sampling and probabilistic forecasting. We show that any differentiable nonconformity score induces a deterministic flow on the output space whose trajectories converge to the boundary of the corresponding conformal prediction set. This leads to a computationally efficient, training-free method for sampling conformal boundaries in arbitrary dimensions. Boundary samples can be reconformalized to form pointwise prediction sets with controlled risk, and mixing across confidence levels yields conformal predictive distributions whose quantile regions coincide exactly with conformal prediction sets. We evaluate the approach on PDE inverse problems, precipitation downscaling, climate model debiasing, and hurricane trajectory forecasting.

cross Efficient Table Retrieval and Understanding with Multimodal Large Language Models

Authors: Zhuoyan Xu, Haoyang Fang, Boran Han, Bonan Min, Bernie Wang, Cuixiong Hu, Shuai Zhang

Abstract: Tabular data is frequently captured in image form across a wide range of real-world scenarios such as financial reports, handwritten records, and document scans. These visual representations pose unique challenges for machine understanding, as they combine both structural and visual complexities. While recent advances in Multimodal Large Language Models (MLLMs) show promising results in table understanding, they typically assume the relevant table is readily available. However, a more practical scenario involves identifying and reasoning over relevant tables from large-scale collections to answer user queries. To address this gap, we propose TabRAG, a framework that enables MLLMs to answer queries over large collections of table images. Our approach first retrieves candidate tables using jointly trained visual-text foundation models, then leverages MLLMs to perform fine-grained reranking of these candidates, and finally employs MLLMs to reason over the selected tables for answer generation. Through extensive experiments on a newly constructed dataset comprising 88,161 training and 9,819 testing samples across 8 benchmarks with 48,504 unique tables, we demonstrate that our framework significantly outperforms existing methods by 7.0% in retrieval recall and 6.1% in answer accuracy, offering a practical solution for real-world table understanding tasks.

cross Looking and Listening Inside and Outside: Multimodal Artificial Intelligence Systems for Driver Safety Assessment and Intelligent Vehicle Decision-Making

Authors: Ross Greer, Laura Fleig, Maitrayee Keskar, Erika Maquiling, Giovanni Tapia Lopez, Angel Martinez-Sanchez, Parthib Roy, Jake Rattigan, Mira Sur, Alejandra Vidrio, Thomas Marcotte, Mohan Trivedi

Abstract: The looking-in-looking-out (LILO) framework has enabled intelligent vehicle applications that understand both the outside scene and the driver state to improve safety outcomes, with examples in smart airbag deployment, takeover time prediction in autonomous control transitions, and driver attention monitoring. In this research, we propose an augmentation to this framework, making a case for the audio modality as an additional source of information to understand the driver, and in the evolving autonomy landscape, also the passengers and those outside the vehicle. We expand LILO by incorporating audio signals, forming the looking-and-listening inside-and-outside (L-LIO) framework to enhance driver state assessment and environment understanding through multimodal sensor fusion. We evaluate three example cases where audio enhances vehicle safety: supervised learning on driver speech audio to classify potential impairment states (e.g., intoxication), collection and analysis of passenger natural language instructions (e.g., "turn after that red building") to motivate how spoken language can interface with planning systems through audio-aligned instruction data, and limitations of vision-only systems where audio may disambiguate the guidance and gestures of external agents. Datasets include custom-collected in-vehicle and external audio samples in real-world environments. Pilot findings show that audio yields safety-relevant insights, particularly in nuanced or context-rich scenarios where sound is critical to safe decision-making or visual signals alone are insufficient. Challenges include ambient noise interference, privacy considerations, and robustness across human subjects, motivating further work on reliability in dynamic real-world contexts. L-LIO augments driver and scene understanding through multimodal fusion of audio and visual sensing, offering new paths for safety intervention.

cross Debugging code world models

Authors: Babak Rahmani

Abstract: Code World Models (CWMs) are language models trained to simulate program execution by predicting explicit runtime state after every executed command. This execution-based world modeling enables internal verification within the model, offering an alternative to natural language chain-of-thought reasoning. However, the sources of errors and the nature of CWMs' limitations remain poorly understood. We study CWMs from two complementary perspectives: local semantic execution and long-horizon state tracking. On real-code benchmarks, we identify two dominant failure regimes. First, dense runtime state reveals produce token-intensive execution traces, leading to token-budget exhaustion on programs with long execution histories. Second, failures disproportionately concentrate in string-valued state, which we attribute to limitations of subword tokenization rather than program structure. To study long-horizon behavior, we use a controlled permutation-tracking benchmark that isolates state propagation under action execution. We show that long-horizon degradation is driven primarily by incorrect action generation: when actions are replaced with ground-truth commands, a Transformer-based CWM propagates state accurately over long horizons, despite known limitations of Transformers in long-horizon state tracking. These findings suggest directions for more efficient supervision and state representations in CWMs that are better aligned with program execution and data types.

cross Vision and language: Novel Representations and Artificial intelligence for Driving Scene Safety Assessment and Autonomous Vehicle Planning

Authors: Ross Greer, Maitrayee Keskar, Angel Martinez-Sanchez, Parthib Roy, Shashank Shriram, Mohan Trivedi

Abstract: Vision-language models (VLMs) have recently emerged as powerful representation learning systems that align visual observations with natural language concepts, offering new opportunities for semantic reasoning in safety-critical autonomous driving. This paper investigates how vision-language representations support driving scene safety assessment and decision-making when integrated into perception, prediction, and planning pipelines. We study three complementary system-level use cases. First, we introduce a lightweight, category-agnostic hazard screening approach leveraging CLIP-based image-text similarity to produce a low-latency semantic hazard signal. This enables robust detection of diverse and out-of-distribution road hazards without explicit object detection or visual question answering. Second, we examine the integration of scene-level vision-language embeddings into a transformer-based trajectory planning framework using the Waymo Open Dataset. Our results show that naively conditioning planners on global embeddings does not improve trajectory accuracy, highlighting the importance of representation-task alignment and motivating the development of task-informed extraction methods for safety-critical planning. Third, we investigate natural language as an explicit behavioral constraint on motion planning using the doScenes dataset. In this setting, passenger-style instructions grounded in visual scene elements suppress rare but severe planning failures and improve safety-aligned behavior in ambiguous scenarios. Taken together, these findings demonstrate that vision-language representations hold significant promise for autonomous driving safety when used to express semantic risk, intent, and behavioral constraints. Realizing this potential is fundamentally an engineering problem requiring careful system design and structured grounding rather than direct feature injection.

cross On Generation in Metric Spaces

Authors: Jiaxun Li, Vinod Raman, Ambuj Tewari

Abstract: We study generation in separable metric instance spaces. We extend the language generation framework from Kleinberg and Mullainathan [2024] beyond countable domains by defining novelty through metric separation and allowing asymmetric novelty parameters for the adversary and the generator. We introduce the $(\varepsilon,\varepsilon')$-closure dimension, a scale-sensitive analogue of closure dimension, which yields characterizations of uniform and non-uniform generatability and a sufficient condition for generation in the limit. Along the way, we identify a sharp geometric contrast. Namely, in doubling spaces, including all finite-dimensional normed spaces, generatability is stable across novelty scales and invariant under equivalent metrics. In general metric spaces, however, generatability can be highly scale-sensitive and metric-dependent; even in the natural infinite-dimensional Hilbert space $\ell^2$, all notions of generation may fail abruptly as the novelty parameters vary.

cross BFTS: Thompson Sampling with Bayesian Additive Regression Trees

Authors: Ruizhe Deng, Bibhas Chakraborty, Ran Chen, Yan Shuo Tan

Abstract: Contextual bandits are a core technology for personalized mobile health interventions, where decision-making requires adapting to complex, non-linear user behaviors. While Thompson Sampling (TS) is a preferred strategy for these problems, its performance hinges on the quality of the underlying reward model. Standard linear models suffer from high bias, while neural network approaches are often brittle and difficult to tune in online settings. Conversely, tree ensembles dominate tabular data prediction but typically rely on heuristic uncertainty quantification, lacking a principled probabilistic basis for TS. We propose Bayesian Forest Thompson Sampling (BFTS), the first contextual bandit algorithm to integrate Bayesian Additive Regression Trees (BART), a fully probabilistic sum-of-trees model, directly into the exploration loop. We prove that BFTS is theoretically sound, deriving an information-theoretic Bayesian regret bound of $\tilde{O}(\sqrt{T})$. As a complementary result, we establish frequentist minimax optimality for a "feel-good" variant, confirming the structural suitability of BART priors for non-parametric bandits. Empirically, BFTS achieves state-of-the-art regret on tabular benchmarks with near-nominal uncertainty calibration. Furthermore, in an offline policy evaluation on the Drink Less micro-randomized trial, BFTS improves engagement rates by over 30% compared to the deployed policy, demonstrating its practical effectiveness for behavioral interventions.

cross PAND: Prompt-Aware Neighborhood Distillation for Lightweight Fine-Grained Visual Classification

Authors: Qiuming Luo, Yuebing Li, Feng Li, Chang Kong

Abstract: Distilling knowledge from large Vision-Language Models (VLMs) into lightweight networks is crucial yet challenging in Fine-Grained Visual Classification (FGVC), due to the reliance on fixed prompts and global alignment. To address this, we propose PAND (Prompt-Aware Neighborhood Distillation), a two-stage framework that decouples semantic calibration from structural transfer. First, we incorporate Prompt-Aware Semantic Calibration to generate adaptive semantic anchors. Second, we introduce a neighborhood-aware structural distillation strategy to constrain the student's local decision structure. PAND consistently outperforms state-of-the-art methods on four FGVC benchmarks. Notably, our ResNet-18 student achieves 76.09% accuracy on CUB-200, surpassing the strong baseline VL2Lite by 3.4%. Code is available at https://github.com/LLLVTA/PAND.

URLs: https://github.com/LLLVTA/PAND.

cross TodoEvolve: Learning to Architect Agent Planning Systems

Authors: Jiaxi Liu, Yanzuo Jiang, Guibin Zhang, Zihan Zhang, Heng Chang, Zhenfei Yin, Qibing Ren, Junchi Yan

Abstract: Planning has become a central capability for contemporary agent systems in navigating complex, long-horizon tasks, yet existing approaches predominantly rely on fixed, hand-crafted planning structures that lack the flexibility to adapt to the structural diversity of open-ended problems. To address this limitation, we introduce TodoEvolve, a meta-planning paradigm that autonomously synthesizes and dynamically revises task-specific planning architectures. Specifically, we first construct PlanFactory, a modular design space that standardizes diverse planning paradigms within a unified codebase encompassing topology, initialization, adaptation, and navigation, thereby providing a common interface for heterogeneous planning patterns. Leveraging PlanFactory, we collect high-quality planning trajectories and train Todo-14B via \textit{Impedance-Guided Preference Optimization} (IGPO), a multi-objective reinforcement learning objective that encourages the generation of planning systems that are performant, stable, and token-efficient across arbitrary tasks and agent backbones. Empirical evaluations on five agentic benchmarks demonstrate that TodoEvolve consistently surpasses carefully engineered planning modules while maintaining economical API costs and runtime overhead.

cross MemFly: On-the-Fly Memory Optimization via Information Bottleneck

Authors: Zhenyuan Zhang, Xianzhang Jia, Zhiqin Yang, Zhenbo Song, Wei Xue, Sirui Han, Yike Guo

Abstract: Long-term memory enables large language model agents to tackle complex tasks through historical interactions. However, existing frameworks encounter a fundamental dilemma between compressing redundant information efficiently and maintaining precise retrieval for downstream tasks. To bridge this gap, we propose MemFly, a framework grounded in information bottleneck principles that facilitates on-the-fly memory evolution for LLMs. Our approach minimizes compression entropy while maximizing relevance entropy via a gradient-free optimizer, constructing a stratified memory structure for efficient storage. To fully leverage MemFly, we develop a hybrid retrieval mechanism that seamlessly integrates semantic, symbolic, and topological pathways, incorporating iterative refinement to handle complex multi-hop queries. Comprehensive experiments demonstrate that MemFly substantially outperforms state-of-the-art baselines in memory coherence, response fidelity, and accuracy.

cross SparseEval: Efficient Evaluation of Large Language Models by Sparse Optimization

Authors: Taolin Zhang, Hang Guo, Wang Lu, Tao Dai, Shu-Tao Xia, Jindong Wang

Abstract: As large language models (LLMs) continue to scale up, their performance on various downstream tasks has significantly improved. However, evaluating their capabilities has become increasingly expensive, as performing inference on a large number of benchmark samples incurs high computational costs. In this paper, we revisit the model-item performance matrix and show that it exhibits sparsity, that representative items can be selected as anchors, and that the task of efficient benchmarking can be formulated as a sparse optimization problem. Based on these insights, we propose SparseEval, a method that, for the first time, adopts gradient descent to optimize anchor weights and employs an iterative refinement strategy for anchor selection. We utilize the representation capacity of MLP to handle sparse optimization and propose the Anchor Importance Score and Candidate Importance Score to evaluate the value of each item for task-aware refinement. Extensive experiments demonstrate the low estimation error and high Kendall's~$\tau$ of our method across a variety of benchmarks, showcasing its superior robustness and practicality in real-world scenarios. Code is available at {https://github.com/taolinzhang/SparseEval}.

URLs: https://github.com/taolinzhang/SparseEval

cross CausalArmor: Efficient Indirect Prompt Injection Guardrails via Causal Attribution

Authors: Minbeom Kim, Mihir Parmar, Phillip Wallis, Lesly Miculicich, Kyomin Jung, Krishnamurthy Dj Dvijotham, Long T. Le, Tomas Pfister

Abstract: AI agents equipped with tool-calling capabilities are susceptible to Indirect Prompt Injection (IPI) attacks. In this attack scenario, malicious commands hidden within untrusted content trick the agent into performing unauthorized actions. Existing defenses can reduce attack success but often suffer from the over-defense dilemma: they deploy expensive, always-on sanitization regardless of actual threat, thereby degrading utility and latency even in benign scenarios. We revisit IPI through a causal ablation perspective: a successful injection manifests as a dominance shift where the user request no longer provides decisive support for the agent's privileged action, while a particular untrusted segment, such as a retrieved document or tool output, provides disproportionate attributable influence. Based on this signature, we propose CausalArmor, a selective defense framework that (i) computes lightweight, leave-one-out ablation-based attributions at privileged decision points, and (ii) triggers targeted sanitization only when an untrusted segment dominates the user intent. Additionally, CausalArmor employs retroactive Chain-of-Thought masking to prevent the agent from acting on ``poisoned'' reasoning traces. We present a theoretical analysis showing that sanitization based on attribution margins conditionally yields an exponentially small upper bound on the probability of selecting malicious actions. Experiments on AgentDojo and DoomArena demonstrate that CausalArmor matches the security of aggressive defenses while improving explainability and preserving utility and latency of AI agents.

cross Learning-guided Kansa collocation for forward and inverse PDEs beyond linearity

Authors: Zheyuan Hu, Weitao Chen, Cengiz \"Oztireli, Chenliang Zhou, Fangcheng Zhong

Abstract: Partial Differential Equations are precise in modelling the physical, biological and graphical phenomena. However, the numerical methods suffer from the curse of dimensionality, high computation costs and domain-specific discretization. We aim to explore pros and cons of different PDE solvers, and apply them to specific scientific simulation problems, including forwarding solution, inverse problems and equations discovery. In particular, we extend the recent CNF (NeurIPS 2023) framework solver to multi-dependent-variable and non-linear settings, together with down-stream applications. The outcomes include implementation of selected methods, self-tuning techniques, evaluation on benchmark problems and a comprehensive survey of neural PDE solvers and scientific simulation applications.

cross Learning to Alleviate Familiarity Bias in Video Recommendation

Authors: Zheng Ren, Yi Wu, Jianan Lu, Acar Ary, Yiqu Liu, Li Wei, Lukasz Heldt

Abstract: Modern video recommendation systems aim to optimize user engagement and platform objectives, yet often face structural exposure imbalances caused by behavioral biases. In this work, we focus on the post-ranking stage and present LAFB (Learning to Alleviate Familiarity Bias), a lightweight and model-agnostic framework designed to mitigate familiarity bias in recommendation outputs. LAFB models user-content familiarity using discrete and continuous interaction features, and estimates personalized debiasing factors to adjust user rating prediction scores, thereby reducing the dominance of familiar content in the final ranking. We conduct large-scale offline evaluations and online A/B testing in a real-world recommendation system, under a unified serving stack that also compares LAFB with deployable popularity-oriented remedies. Results show that LAFB increases novel watch-time share and improves exposure for emerging creators and overall content diversity, while maintaining stable overall watch time and short-term satisfaction. LAFB has already been launched in the post-ranking stage of YouTube's recommendation system, demonstrating its effectiveness in real-world applications.

cross Fast Model Selection and Stable Optimization for Softmax-Gated Multinomial-Logistic Mixture of Experts Models

Authors: TrungKhang Tran, TrungTin Nguyen, Md Abul Bashar, Nhat Ho, Richi Nayak, Christopher Drovandi

Abstract: Mixture-of-Experts (MoE) architectures combine specialized predictors through a learned gate and are effective across regression and classification, but for classification with softmax multinomial-logistic gating, rigorous guarantees for stable maximum-likelihood training and principled model selection remain limited. We address both issues in the full-data (batch) regime. First, we derive a batch minorization-maximization (MM) algorithm for softmax-gated multinomial-logistic MoE using an explicit quadratic minorizer, yielding coordinate-wise closed-form updates that guarantee monotone ascent of the objective and global convergence to a stationary point (in the standard MM sense), avoiding approximate M-steps common in EM-type implementations. Second, we prove finite-sample rates for conditional density estimation and parameter recovery, and we adapt dendrograms of mixing measures to the classification setting to obtain a sweep-free selector of the number of experts that achieves near-parametric optimal rates after merging redundant fitted atoms. Experiments on biological protein--protein interaction prediction validate the full pipeline, delivering improved accuracy and better-calibrated probabilities than strong statistical and machine-learning baselines.

cross Tighter Information-Theoretic Generalization Bounds via a Novel Class of Change of Measure Inequalities

Authors: Yanxiao Liu, Yijun Fan an Deniz G\"und\"uz

Abstract: In this paper, we propose a novel class of change of measure inequalities via a unified framework based on the data processing inequality for $f$-divergences, which is surprisingly elementary yet powerful enough to yield tighter inequalities. We provide change of measure inequalities in terms of a broad family of information measures, including $f$-divergences (with Kullback-Leibler divergence and $\chi^2$-divergence as special cases), R\'enyi divergence, and $\alpha$-mutual information (with maximal leakage as a special case). We then embed these inequalities into the analysis of generalization error for stochastic learning algorithms, yielding novel and tighter high-probability information-theoretic generalization bounds, while also recovering several best-known results via simplified analyses. A key advantage of our framework is its flexibility: it readily adapts to a range of settings, including the conditional mutual information framework, PAC-Bayesian theory, and differential privacy mechanisms, for which we derive new generalization bounds.

cross ForecastOcc: Vision-based Semantic Occupancy Forecasting

Authors: Riya Mohan, Juana Valeria Hurtado, Rohit Mohan, Abhinav Valada

Abstract: Autonomous driving requires forecasting both geometry and semantics over time to effectively reason about future environment states. Existing vision-based occupancy forecasting methods focus on motion-related categories such as static and dynamic objects, while semantic information remains largely absent. Recent semantic occupancy forecasting approaches address this gap but rely on past occupancy predictions obtained from separate networks. This makes current methods sensitive to error accumulation and prevents learning spatio-temporal features directly from images. In this work, we present ForecastOcc, the first framework for vision-based semantic occupancy forecasting that jointly predicts future occupancy states and semantic categories. Our framework yields semantic occupancy forecasts for multiple horizons directly from past camera images, without relying on externally estimated maps. We evaluate ForecastOcc in two complementary settings: multi-view forecasting on the Occ3D-nuScenes dataset and monocular forecasting on SemanticKITTI, where we establish the first benchmark for this task. We introduce the first baselines by adapting two 2D forecasting modules within our framework. Importantly, we propose a novel architecture that incorporates a temporal cross-attention forecasting module, a 2D-to-3D view transformer, a 3D encoder for occupancy prediction, and a semantic occupancy head for voxel-level forecasts across multiple horizons. Extensive experiments on both datasets show that ForecastOcc consistently outperforms baselines, yielding semantically rich, future-aware predictions that capture scene dynamics and semantics critical for autonomous driving.

cross FlashVID: Efficient Video Large Language Models via Training-free Tree-based Spatiotemporal Token Merging

Authors: Ziyang Fan, Keyu Chen, Ruilong Xing, Yulin Li, Li Jiang, Zhuotao Tian

Abstract: Although Video Large Language Models (VLLMs) have shown remarkable capabilities in video understanding, they are required to process high volumes of visual tokens, causing significant computational inefficiency. Existing VLLMs acceleration frameworks usually compress spatial and temporal redundancy independently, which overlooks the spatiotemporal relationships, thereby leading to suboptimal spatiotemporal compression. The highly correlated visual features are likely to change in spatial position, scale, orientation, and other attributes over time due to the dynamic nature of video. Building on this insight, we introduce FlashVID, a training-free inference acceleration framework for VLLMs. Specifically, FlashVID utilizes Attention and Diversity-based Token Selection (ADTS) to select the most representative tokens for basic video representation, then applies Tree-based Spatiotemporal Token Merging (TSTM) for fine-grained spatiotemporal redundancy elimination. Extensive experiments conducted on three representative VLLMs across five video understanding benchmarks demonstrate the effectiveness and generalization of our method. Notably, by retaining only 10% of visual tokens, FlashVID preserves 99.1% of the performance of LLaVA-OneVision. Consequently, FlashVID can serve as a training-free and plug-and-play module for extending long video frames, which enables a 10x increase in video frame input to Qwen2.5-VL, resulting in a relative improvement of 8.6% within the same computational budget. Code is available at https://github.com/Fanziyang-v/FlashVID.

URLs: https://github.com/Fanziyang-v/FlashVID.

cross Graph-based Semi-Supervised Learning via Maximum Discrimination

Authors: Nadav Katz, Ariel Jaffe

Abstract: Semi-supervised learning (SSL) addresses the critical challenge of training accurate models when labeled data is scarce but unlabeled data is abundant. Graph-based SSL (GSSL) has emerged as a popular framework that captures data structure through graph representations. Classic graph SSL methods, such as Label Propagation and Label Spreading, aim to compute low-dimensional representations where points with the same labels are close in representation space. Although often effective, these methods can be suboptimal on data with complex label distributions. In our work, we develop AUC-spec, a graph approach that computes a low-dimensional representation that maximizes class separation. We compute this representation by optimizing the Area Under the ROC Curve (AUC) as estimated via the labeled points. We provide a detailed analysis of our approach under a product-of-manifold model, and show that the required number of labeled points for AUC-spec is polynomial in the model parameters. Empirically, we show that AUC-spec balances class separation with graph smoothness. It demonstrates competitive results on synthetic and real-world datasets while maintaining computational efficiency comparable to the field's classic and state-of-the-art methods.

cross The CAPSARII Approach to Cyber-Secure Wearable, Ultra-Low-Power Networked Sensors for Soldier Health Monitoring

Authors: Luciano Bozzi, Christian Celidonio, Umberto Nuzzi, Massimo Biagini, Stefano Cherubin, Asbj{\o}rn Djupdal, Tor Andre Haugdahl, Andrea Aliverti, Alessandra Angelucci, Giovanni Agosta, Gerardo Pelosi, Paolo Belluco, Samuele Polistina, Riccardo Volpi, Luigi Malag\`o, Michael Schneider, Florian Wieczorek, Xabier Eguiluz

Abstract: The European Defence Agency's revised Capability Development Plan (CDP) identifies as a priority improving ground combat capabilities by enhancing soldiers' equipment for better protection. The CAPSARII project proposes in innovative wearable system and Internet of Battlefield Things (IoBT) framework to monitor soldiers' physiological and psychological status, aiding tactical decisions and medical support. The CAPSARII system will enhance situational awareness and operational effectiveness by monitoring physiological, movement and environmental parameters, providing real-time tactical decision support through AI models deployed on edge nodes and enable data analysis and comparative studies via cloud-based analytics. CAPSARII also aims at improving usability through smart textile integration, longer battery life, reducing energy consumption through software and hardware optimizations, and address security concerns with efficient encryption and strong authentication methods. This innovative approach aims to transform military operations by providing a robust, data-driven decision support tool.

cross GAAVI: Global Asymptotic Anytime Valid Inference for the Conditional Mean Function

Authors: Brian M Cho, Raaz Dwivedi, Nathan Kallus

Abstract: Inference on the conditional mean function (CMF) is central to tasks from adaptive experimentation to optimal treatment assignment and algorithmic fairness auditing. In this work, we provide a novel asymptotic anytime-valid test for a CMF global null (e.g., that all conditional means are zero) and contrasts between CMFs, enabling experimenters to make high confidence decisions at any time during the experiment beyond a minimum sample size. We provide mild conditions under which our tests achieve (i) asymptotic type-I error guarantees, (i) power one, and, unlike past tests, (iii) optimal sample complexity relative to a Gaussian location testing. By inverting our tests, we show how to construct function-valued asymptotic confidence sequences for the CMF and contrasts thereof. Experiments on both synthetic and real-world data show our method is well-powered across various distributions while preserving the nominal error rate under continuous monitoring.

cross Interpretable Failure Analysis in Multi-Agent Reinforcement Learning Systems

Authors: Risal Shahriar Shefin, Debashis Gupta, Thai Le, Sarra Alqahtani

Abstract: Multi-Agent Reinforcement Learning (MARL) is increasingly deployed in safety-critical domains, yet methods for interpretable failure detection and attribution remain underdeveloped. We introduce a two-stage gradient-based framework that provides interpretable diagnostics for three critical failure analysis tasks: (1) detecting the true initial failure source (Patient-0); (2) validating why non-attacked agents may be flagged first due to domino effects; and (3) tracing how failures propagate through learned coordination pathways. Stage 1 performs interpretable per-agent failure detection via Taylor-remainder analysis of policy-gradient costs, declaring an initial Patient-0 candidate at the first threshold crossing. Stage 2 provides validation through geometric analysis of critic derivatives-first-order sensitivity and directional second-order curvature aggregated over causal windows to construct interpretable contagion graphs. This approach explains "downstream-first" detection anomalies by revealing pathways that amplify upstream deviations. Evaluated across 500 episodes in Simple Spread (3 and 5 agents) and 100 episodes in StarCraft II using MADDPG and HATRPO, our method achieves 88.2-99.4% Patient-0 detection accuracy while providing interpretable geometric evidence for detection decisions. By moving beyond black-box detection to interpretable gradient-level forensics, this framework offers practical tools for diagnosing cascading failures in safety-critical MARL systems.

cross MMLSv2: A Multimodal Dataset for Martian Landslide Detection in Remote Sensing Imagery

Authors: Sidike Paheding, Abel Reyes-Angulo, Leo Thomas Ramos, Angel D. Sappa, Rajaneesh A., Hiral P. B., Sajin Kumar K. S., Thomas Oommen

Abstract: We present MMLSv2, a dataset for landslide segmentation on Martian surfaces. MMLSv2 consists of multimodal imagery with seven bands: RGB, digital elevation model, slope, thermal inertia, and grayscale channels. MMLSv2 comprises 664 images distributed across training, validation, and test splits. In addition, an isolated test set of 276 images from a geographically disjoint region from the base dataset is released to evaluate spatial generalization. Experiments conducted with multiple segmentation models show that the dataset supports stable training and achieves competitive performance, while still posing challenges in fragmented, elongated, and small-scale landslide regions. Evaluation on the isolated test set leads to a noticeable performance drop, indicating increased difficulty and highlighting its value for assessing model robustness and generalization beyond standard in-distribution settings. Dataset will be available at: https://github.com/MAIN-Lab/MMLS_v2

URLs: https://github.com/MAIN-Lab/MMLS_v2

cross Gender and Race Bias in Consumer Product Recommendations by Large Language Models

Authors: Ke Xu, Shera Potka, Alex Thomo

Abstract: Large Language Models are increasingly employed in generating consumer product recommendations, yet their potential for embedding and amplifying gender and race biases remains underexplored. This paper serves as one of the first attempts to examine these biases within LLM-generated recommendations. We leverage prompt engineering to elicit product suggestions from LLMs for various race and gender groups and employ three analytical methods-Marked Words, Support Vector Machines, and Jensen-Shannon Divergence-to identify and quantify biases. Our findings reveal significant disparities in the recommendations for demographic groups, underscoring the need for more equitable LLM recommendation systems.

cross Adjustment of Cluster-Then-Predict Framework for Multiport Scatterer Load Prediction

Authors: Hanjun Park, Aleksandr D. Kuznetsov, Ville Viikari

Abstract: Predicting interdependent load values in multiport scatterers is challenging due to high dimensionality and complex dependence between impedance and scattering ability, yet this prediction remains crucial for the design of communication and measurement systems. In this paper, we propose a two-stage cluster-then-predict framework for multiple load values prediction task in multiport scatterers. The proposed cluster-then-predict approach effectively captures the underlying functional relation between S-parameters and corresponding load impedances, achieving up to a 46% reduction in Root Mean Square Error (RMSE) compared to the baseline when applied to gradient boosting (GB). This improvement is consistent across various clustering and regression methods. Furthermore, we introduce the Real-world Unified Index (RUI), a metric for quantitative analysis of trade-offs among multiple metrics with conflicting objectives and different scales, suitable for performance assessment in realistic scenarios. Based on RUI, the combination of K-means clustering and k-nearest neighbors (KNN) is identified as the optimal setup for the analyzed multiport scatterer.

cross Self-Supervised Bootstrapping of Action-Predictive Embodied Reasoning

Authors: Milan Ganai, Katie Luo, Jonas Frey, Clark Barrett, Marco Pavone

Abstract: Embodied Chain-of-Thought (CoT) reasoning has significantly enhanced Vision-Language-Action (VLA) models, yet current methods rely on rigid templates to specify reasoning primitives (e.g., objects in the scene, high-level plans, structural affordances). These templates can force policies to process irrelevant information that distracts from critical action-prediction signals. This creates a bottleneck: without successful policies, we cannot verify reasoning quality; without quality reasoning, we cannot build robust policies. We introduce R&B-EnCoRe, which enables models to bootstrap embodied reasoning from internet-scale knowledge through self-supervised refinement. By treating reasoning as a latent variable within importance-weighted variational inference, models can generate and distill a refined reasoning training dataset of embodiment-specific strategies without external rewards, verifiers, or human annotation. We validate R&B-EnCoRe across manipulation (Franka Panda in simulation, WidowX in hardware), legged navigation (bipedal, wheeled, bicycle, quadruped), and autonomous driving embodiments using various VLA architectures with 1B, 4B, 7B, and 30B parameters. Our approach achieves 28% gains in manipulation success, 101% improvement in navigation scores, and 21% reduction in collision-rate metric over models that indiscriminately reason about all available primitives. R&B-EnCoRe enables models to distill reasoning that is predictive of successful control, bypassing manual annotation engineering while grounding internet-scale knowledge in physical execution.

cross Evasion of IoT Malware Detection via Dummy Code Injection

Authors: Sahar Zargarzadeh, Mohammad Islam

Abstract: The Internet of Things (IoT) has revolutionized connectivity by linking billions of devices worldwide. However, this rapid expansion has also introduced severe security vulnerabilities, making IoT devices attractive targets for malware such as the Mirai botnet. Power side-channel analysis has recently emerged as a promising technique for detecting malware activity based on device power consumption patterns. However, the resilience of such detection systems under adversarial manipulation remains underexplored. This work presents a novel adversarial strategy against power side-channel-based malware detection. By injecting structured dummy code into the scanning phase of the Mirai botnet, we dynamically perturb power signatures to evade AI/ML-based anomaly detection without disrupting core functionality. Our approach systematically analyzes the trade-offs between stealthiness, execution overhead, and evasion effectiveness across multiple state-of-the-art models for side-channel analysis, using a custom dataset collected from smartphones of diverse manufacturers. Experimental results show that our adversarial modifications achieve an average attack success rate of 75.2\%, revealing practical vulnerabilities in power-based intrusion detection frameworks.

cross Fundamental Limits of Community Detection in Contextual Multi-Layer Stochastic Block Models

Authors: Shuyang Gong, Dong Huang, Zhangsong Li

Abstract: We consider the problem of community detection from the joint observation of a high-dimensional covariate matrix and $L$ sparse networks, all encoding noisy, partial information about the latent community labels of $n$ subjects. In the asymptotic regime where the networks have constant average degree and the number of features $p$ grows proportionally with $n$, we derive a sharp threshold under which detecting and estimating the subject labels is possible. Our results extend the work of \cite{MN23} to the constant-degree regime with noisy measurements, and also resolve a conjecture in \cite{YLS24+} when the number of networks is a constant. Our information-theoretic lower bound is obtained via a novel comparison inequality between Bernoulli and Gaussian moments, as well as a statistical variant of the ``recovery to chi-square divergence reduction'' argument inspired by \cite{DHSS25}. On the algorithmic side, we design efficient algorithms based on counting decorated cycles and decorated paths and prove that they achieve the sharp threshold for both detection and weak recovery. In particular, our results show that there is no statistical-computational gap in this setting.

cross Information Geometry of Absorbing Markov-Chain and Discriminative Random Walks

Authors: Masanari Kimura

Abstract: Discriminative Random Walks (DRWs) are a simple yet powerful tool for semi-supervised node classification, but their theoretical foundations remain fragmentary. We revisit DRWs through the lens of information geometry, treating the family of class-specific hitting-time laws on an absorbing Markov chain as a statistical manifold. Starting from a log-linear edge-weight model, we derive closed-form expressions for the hitting-time probability mass function, its full moment hierarchy, and the observed Fisher information. The Fisher matrix of each seed node turns out to be rank-one, taking the quotient by its null space yields a low-dimensional, globally flat manifold that captures all identifiable directions of the model. Leveraging the geometry, we introduce a sensitivity score for unlabeled nodes that bounds, and in one-dimensional cases attains, the maximal first-order change in DRW betweenness under unit Fisher perturbations. The score can lead to principled strategies for active label acquisition, edge re-weighting, and explanation.

cross InfiCoEvalChain: A Blockchain-Based Decentralized Framework for Collaborative LLM Evaluation

Authors: Yifan Yang, Jinjia Li, Kunxi Li, Puhao Zheng, Yuanyi Wang, Zheyan Qu, Yang Yu, Jianmin Wu, Ming Li, Hongxia Yang

Abstract: The rapid advancement of large language models (LLMs) demands increasingly reliable evaluation, yet current centralized evaluation suffers from opacity, overfitting, and hardware-induced variance. Our empirical analysis reveals an alarming inconsistency in existing evaluations: the standard deviation across ten repeated runs of a single model on HumanEval (1.67) actually exceeds the performance gap among the top-10 models on the official leaderboard (0.91), rendering current rankings statistically precarious. To mitigate these instabilities, we propose a decentralized evaluation framework that enables hardware and parameter diversity through large-scale benchmarking across heterogeneous compute nodes. By leveraging the blockchain-based protocol, the framework incentivizes global contributors to act as independent validators, using a robust reward system to ensure evaluation integrity and discourage dishonest participation. This collective verification transforms evaluation from a "centralized black box" into a "decentralized endorsement" where multi-party consensus and diverse inference environments yield a more stable, representative metric. Experimental results demonstrate that the decentralized evaluation framework reduces the standard deviation across ten runs on the same model to 0.28. This significant improvement over conventional frameworks ensures higher statistical confidence in model rankings. We have completely implemented this platform and will soon release it to the community.

cross Adaptive Matrix Online Learning through Smoothing with Guarantees for Nonsmooth Nonconvex Optimization

Authors: Ruichen Jiang, Zakaria Mhammedi, Mehryar Mohri, Aryan Mokhtari

Abstract: We study online linear optimization with matrix variables constrained by the operator norm, a setting where the geometry renders designing data-dependent and efficient adaptive algorithms challenging. The best-known adaptive regret bounds are achieved by Shampoo-like methods, but they require solving a costly quadratic projection subproblem. To address this, we extend the gradient-based prediction scheme to adaptive matrix online learning and cast algorithm design as constructing a family of smoothed potentials for the nuclear norm. We define a notion of admissibility for such smoothings and prove any admissible smoothing yields a regret bound matching the best-known guarantees of one-sided Shampoo. We instantiate this framework with two efficient methods that avoid quadratic projections. The first is an adaptive Follow-the-Perturbed-Leader (FTPL) method using Gaussian stochastic smoothing. The second is Follow-the-Augmented-Matrix-Leader (FAML), which uses a deterministic hyperbolic smoothing in an augmented matrix space. By analyzing the admissibility of these smoothings, we show both methods admit closed-form updates and match one-sided Shampoo's regret up to a constant factor, while significantly reducing computational cost. Lastly, using the online-to-nonconvex conversion, we derive two matrix-based optimizers, Pion (from FTPL) and Leon (from FAML). We prove convergence guarantees for these methods in nonsmooth nonconvex settings, a guarantee that the popular Muon optimizer lacks.

cross Discrete Adjoint Schr\"odinger Bridge Sampler

Authors: Wei Guo, Yuchen Zhu, Xiaochen Du, Juno Nam, Yongxin Chen, Rafael G\'omez-Bombarelli, Guan-Horng Liu, Molei Tao, Jaemoo Choi

Abstract: Learning discrete neural samplers is challenging due to the lack of gradients and combinatorial complexity. While stochastic optimal control (SOC) and Schr\"odinger bridge (SB) provide principled solutions, efficient SOC solvers like adjoint matching (AM), which excel in continuous domains, remain unexplored for discrete spaces. We bridge this gap by revealing that the core mechanism of AM is $\mathit{state}\text{-}\mathit{space~agnostic}$, and introduce $\mathbf{discrete~ASBS}$, a unified framework that extends AM and adjoint Schr\"odinger bridge sampler (ASBS) to discrete spaces. Theoretically, we analyze the optimality conditions of the discrete SB problem and its connection to SOC, identifying a necessary cyclic group structure on the state space to enable this extension. Empirically, discrete ASBS achieves competitive sample quality with significant advantages in training efficiency and scalability.

cross A Statistical Framework for Alignment with Biased AI Feedback

Authors: Xintao Xia, Zhiqiu Xia, Linjun Zhang, Zhanrui Cai

Abstract: Modern alignment pipelines are increasingly replacing expensive human preference labels with evaluations from large language models (LLM-as-Judge). However, AI labels can be systematically biased compared to high-quality human feedback datasets. In this paper, we develop two debiased alignment methods within a general framework that accommodates heterogeneous prompt-response distributions and external human feedback sources. Debiased Direct Preference Optimization (DDPO) augments standard DPO with a residual-based correction and density-ratio reweighting to mitigate systematic bias, while retaining DPO's computational efficiency. Debiased Identity Preference Optimization (DIPO) directly estimates human preference probabilities without imposing a parametric reward model. We provide theoretical guarantees for both methods: DDPO offers a practical and computationally efficient solution for large-scale alignment, whereas DIPO serves as a robust, statistically optimal alternative that attains the semiparametric efficiency bound. Empirical studies on sentiment generation, summarization, and single-turn dialogue demonstrate that the proposed methods substantially improve alignment efficiency and recover performance close to that of an oracle trained on fully human-labeled data.

cross Is Flow Matching Just Trajectory Replay for Sequential Data?

Authors: Soon Hoe Lim, Shizheng Lin, Michael W. Mahoney, N. Benjamin Erichson

Abstract: Flow matching (FM) is increasingly used for time-series generation, but it is not well understood whether it learns a general dynamical structure or simply performs an effective "trajectory replay". We study this question by deriving the velocity field targeted by the empirical FM objective on sequential data, in the limit of perfect function approximation. For the Gaussian conditional paths commonly used in practice, we show that the implied sampler is an ODE whose dynamics constitutes a nonparametric, memory-augmented continuous-time dynamical system. The optimal field admits a closed-form expression as a similarity-weighted mixture of instantaneous velocities induced by past transitions, making the dataset dependence explicit and interpretable. This perspective positions neural FM models trained by stochastic optimization as parametric surrogates of an ideal nonparametric solution. Using the structure of the optimal field, we study sampling and approximation schemes that improve the efficiency and numerical robustness of ODE-based generation. On nonlinear dynamical system benchmarks, the resulting closed-form sampler yields strong probabilistic forecasts directly from historical transitions, without training.

cross PACC: Protocol-Aware Cross-Layer Compression for Compact Network Traffic Representation

Authors: Zhaochen Guo, Tianyufei Zhou, Honghao Wang, Ronghua Li, Shinan Liu

Abstract: Network traffic classification is a core primitive for network security and management, yet it is increasingly challenged by pervasive encryption and evolving protocols. A central bottleneck is representation: hand-crafted flow statistics are efficient but often too lossy, raw-bit encodings can be accurate but are costly, and recent pre-trained embeddings provide transfer but frequently flatten the protocol stack and entangle signals across layers. We observe that real traffic contains substantial redundancy both across network layers and within each layer; existing paradigms do not explicitly identify and remove this redundancy, leading to wasted capacity, shortcut learning, and degraded generalization. To address this, we propose PACC, a redundancy-aware, layer-aware representation framework. PACC treats the protocol stack as multi-view inputs and learns compact layer-wise projections that remain faithful to each layer while explicitly factorizing representations into shared (cross-layer) and private (layer-specific) components. We operationalize these goals with a joint objective that preserves layer-specific information via reconstruction, captures shared structure via contrastive mutual-information learning, and maximizes task-relevant information via supervised losses, yielding compact latents suitable for efficient inference. Across datasets covering encrypted application classification, IoT device identification, and intrusion detection, PACC consistently outperforms feature-engineered and raw-bit baselines. On encrypted subsets, it achieves up to a 12.9% accuracy improvement over nPrint. PACC matches or surpasses strong foundation-model baselines. At the same time, it improves end-to-end efficiency by up to 3.16x.

cross Circuit Representations of Random Forests with Applications to XAI

Authors: Chunxi Ji, Adnan Darwiche

Abstract: We make three contributions in this paper. First, we present an approach for compiling a random forest classifier into a set of circuits, where each circuit directly encodes the instances in some class of the classifier. We show empirically that our proposed approach is significantly more efficient than existing similar approaches. Next, we utilize this approach to further obtain circuits that are tractable for computing the complete and general reasons of a decision, which are instance abstractions that play a fundamental role in computing explanations. Finally, we propose algorithms for computing the robustness of a decision and all shortest ways to flip it. We illustrate the utility of our contributions by using them to enumerate all sufficient reasons, necessary reasons and contrastive explanations of decisions; to compute the robustness of decisions; and to identify all shortest ways to flip the decisions made by random forest classifiers learned from a wide range of datasets.

cross MemAdapter: Fast Alignment across Agent Memory Paradigms via Generative Subgraph Retrieval

Authors: Xin Zhang, Kailai Yang, Chenyue Li, Hao Li, Qiyu Wei, Jun'ichi Tsujii, Sophia Ananiadou

Abstract: Memory mechanism is a core component of LLM-based agents, enabling reasoning and knowledge discovery over long-horizon contexts. Existing agent memory systems are typically designed within isolated paradigms (e.g., explicit, parametric, or latent memory) with tightly coupled retrieval methods that hinder cross-paradigm generalization and fusion. In this work, we take a first step toward unifying heterogeneous memory paradigms within a single memory system. We propose MemAdapter, a memory retrieval framework that enables fast alignment across agent memory paradigms. MemAdapter adopts a two-stage training strategy: (1) training a generative subgraph retriever from the unified memory space, and (2) adapting the retriever to unseen memory paradigms by training a lightweight alignment module through contrastive learning. This design improves the flexibility for memory retrieval and substantially reduces alignment cost across paradigms. Comprehensive experiments on three public evaluation benchmarks demonstrate that the generative subgraph retriever consistently outperforms five strong agent memory systems across three memory paradigms and agent model scales. Notably, MemAdapter completes cross-paradigm alignment within 13 minutes on a single GPU, achieving superior performance over original memory retrievers with less than 5% of training compute. Furthermore, MemAdapter enables effective zero-shot fusion across memory paradigms, highlighting its potential as a plug-and-play solution for agent memory systems.

cross Learning Human-Like Badminton Skills for Humanoid Robots

Authors: Yeke Chen, Shihao Dong, Xiaoyu Ji, Jingkai Sun, Zeren Luo, Liu Zhao, Jiahui Zhang, Wanyue Li, Ji Ma, Bowen Xu, Yimin Han, Yudong Zhao, Peng Lu

Abstract: Realizing versatile and human-like performance in high-demand sports like badminton remains a formidable challenge for humanoid robotics. Unlike standard locomotion or static manipulation, this task demands a seamless integration of explosive whole-body coordination and precise, timing-critical interception. While recent advances have achieved lifelike motion mimicry, bridging the gap between kinematic imitation and functional, physics-aware striking without compromising stylistic naturalness is non-trivial. To address this, we propose Imitation-to-Interaction, a progressive reinforcement learning framework designed to evolve a robot from a "mimic" to a capable "striker." Our approach establishes a robust motor prior from human data, distills it into a compact, model-based state representation, and stabilizes dynamics via adversarial priors. Crucially, to overcome the sparsity of expert demonstrations, we introduce a manifold expansion strategy that generalizes discrete strike points into a dense interaction volume. We validate our framework through the mastery of diverse skills, including lifts and drop shots, in simulation. Furthermore, we demonstrate the first zero-shot sim-to-real transfer of anthropomorphic badminton skills to a humanoid robot, successfully replicating the kinetic elegance and functional precision of human athletes in the physical world.

cross Grounding Generative Planners in Verifiable Logic: A Hybrid Architecture for Trustworthy Embodied AI

Authors: Feiyu Wu, Xu Zheng, Yue Qu, Zhuocheng Wang, Zicheng Feng, Hui Li

Abstract: Large Language Models (LLMs) show promise as planners for embodied AI, but their stochastic nature lacks formal reasoning, preventing strict safety guarantees for physical deployment. Current approaches often rely on unreliable LLMs for safety checks or simply reject unsafe plans without offering repairs. We introduce the Verifiable Iterative Refinement Framework (VIRF), a neuro-symbolic architecture that shifts the paradigm from passive safety gatekeeping to active collaboration. Our core contribution is a tutor-apprentice dialogue where a deterministic Logic Tutor, grounded in a formal safety ontology, provides causal and pedagogical feedback to an LLM planner. This enables intelligent plan repairs rather than mere avoidance. We also introduce a scalable knowledge acquisition pipeline that synthesizes safety knowledge bases from real-world documents, correcting blind spots in existing benchmarks. In challenging home safety tasks, VIRF achieves a perfect 0 percent Hazardous Action Rate (HAR) and a 77.3 percent Goal-Condition Rate (GCR), which is the highest among all baselines. It is highly efficient, requiring only 1.1 correction iterations on average. VIRF demonstrates a principled pathway toward building fundamentally trustworthy and verifiably safe embodied agents.

cross Schr\"odinger bridge problem via empirical risk minimization

Authors: Denis Belomestny, Alexey Naumov, Nikita Puchkin, Denis Suchkov

Abstract: We study the Schr\"odinger bridge problem when the endpoint distributions are available only through samples. Classical computational approaches estimate Schr\"odinger potentials via Sinkhorn iterations on empirical measures and then construct a time-inhomogeneous drift by differentiating a kernel-smoothed dual solution. In contrast, we propose a learning-theoretic route: we rewrite the Schr\"odinger system in terms of a single positive transformed potential that satisfies a nonlinear fixed-point equation and estimate this potential by empirical risk minimization over a function class. We establish uniform concentration of the empirical risk around its population counterpart under sub-Gaussian assumptions on the reference kernel and terminal density. We plug the learned potential into a stochastic control representation of the bridge to generate samples. We illustrate performance of the suggested approach with numerical experiments.

cross Altruism and Fair Objective in Mixed-Motive Markov games

Authors: Yao-hua Franck Xu, Tayeb Lemlouma, Arnaud Braud, Jean-Marie Bonnin

Abstract: Cooperation is fundamental for society's viability, as it enables the emergence of structure within heterogeneous groups that seek collective well-being. However, individuals are inclined to defect in order to benefit from the group's cooperation without contributing the associated costs, thus leading to unfair situations. In game theory, social dilemmas entail this dichotomy between individual interest and collective outcome. The most dominant approach to multi-agent cooperation is the utilitarian welfare which can produce efficient highly inequitable outcomes. This paper proposes a novel framework to foster fairer cooperation by replacing the standard utilitarian objective with Proportional Fairness. We introduce a fair altruistic utility for each agent, defined on the individual log-payoff space and derive the analytical conditions required to ensure cooperation in classic social dilemmas. We then extend this framework to sequential settings by defining a Fair Markov Game and deriving novel fair Actor-Critic algorithms to learn fair policies. Finally, we evaluate our method in various social dilemma environments.

cross When Evaluation Becomes a Side Channel: Regime Leakage and Structural Mitigations for Alignment Assessment

Authors: Igor Santos-Grueiro

Abstract: Safety evaluation for advanced AI systems implicitly assumes that behavior observed under evaluation is predictive of behavior in deployment. This assumption becomes fragile for agents with situational awareness, which may exploitregime leakage-informational cues distinguishing evaluation from deployment-to implement conditional policies such as sycophancy and sleeper agents, which preserve compliance under oversight while defecting in deployment-like regimes. We reframe alignment evaluation as a problem of information flow under partial observability. Within this framework, we show that divergence between evaluation-time and deployment-time behavior is bounded by the mutual information between internal representations and the regime variable. Motivated by this result, we study regime-blind mechanisms: training-time interventions that reduce the extractability of regime information at decision-relevant internal representations via adversarial invariance. We evaluate this approach on a base, open-weight language model across two fully characterized failure modes -scientific sycophancy and temporal sleeper agents. Regime-blind training suppresses regime-conditioned behavior in both evaluated cases without measurable loss of task utility, but with qualitatively different dynamics: sycophancy exhibits a sharp representational and behavioral transition at low intervention strength, whereas sleeper-agent behavior requires substantially stronger pressure and does not exhibit a clean collapse of regime decodability. These results demonstrate that representational invariance is a meaningful but fundamentally limited control lever, whose effectiveness depends on how regime information is embedded in the policy. We argue that behavioral evaluation should be complemented with white-box diagnostics of regime awareness and information flow.

cross Gesture Matters: Pedestrian Gesture Recognition for AVs Through Skeleton Pose Evaluation

Authors: Alif Rizqullah Mahdi, Mahdi Rezaei, Natasha Merat

Abstract: Gestures are a key component of non-verbal communication in traffic, often helping pedestrian-to-driver interactions when formal traffic rules may be insufficient. This problem becomes more apparent when autonomous vehicles (AVs) struggle to interpret such gestures. In this study, we present a gesture classification framework using 2D pose estimation applied to real-world video sequences from the WIVW dataset. We categorise gestures into four primary classes (Stop, Go, Thank & Greet, and No Gesture) and extract 76 static and dynamic features from normalised keypoints. Our analysis demonstrates that hand position and movement velocity are especially discriminative in distinguishing between gesture classes, achieving a classification accuracy score of 87%. These findings not only improve the perceptual capabilities of AV systems but also contribute to the broader understanding of pedestrian behaviour in traffic contexts.

cross Empirical Study of Observable Sets in Multiclass Quantum Classification

Authors: Paul San Sebastian, Mikel Ca\~nizo, Roman Orus

Abstract: Variational quantum algorithms have gained attention as early applications of quantum computers for learning tasks. In the context of supervised learning, most of the works that tackle classification problems with parameterized quantum circuits constrain their scope to the setting of binary classification or perform multiclass classification via ensembles of binary classifiers (strategies such as one versus rest). Those few works that propose native multiclass models, however, do not justify the choice of observables that perform the classification. This work studies two main classification criteria in multiclass quantum machine learning: maximizing the expected value of an observable representing a class or maximizing the fidelity of the encoded quantum state with a reference state representing a class. To compare both approaches, sets of Pauli strings and sets of projectors into the computational basis are chosen as observables in the quantum machine learning models. Observing the empirical behavior of each model type, the effect of different observable set choices on the performance of quantum machine learning models is analyzed in the context of Barren Plateaus and Neural Collapse. The results provide insights that may guide the design of future multiclass quantum machine learning models.

cross Enhanced Food Category Recognition under Illumination-Induced Domain Shift

Authors: Keonvin Park, Aditya Pal, Jin Hong Mok

Abstract: Visual food recognition systems deployed in real-world environments, such as automated conveyor-belt inspection, are highly sensitive to domain shifts caused by illumination changes. While recent studies have shown that lighting variations can significantly distort food perception by both humans and AI, existing works are often limited to single food categories or controlled settings, and most public food datasets lack explicit illumination annotations. In this work, we investigate illumination-induced domain shift in multi-class food category recognition using two widely adopted datasets, Food-101 and Fruits-360. We demonstrate substantial accuracy degradation under cross-dataset evaluation due to mismatched visual conditions. To address this challenge, we construct synthetic illumination-augmented datasets by systematically varying light temperature and intensity, enabling controlled robustness analysis without additional labels. We further evaluate cross-dataset transfer learning and domain generalization, with a focus on illumination-sensitive target categories such as apple-based classes. Experimental results show that illumination-aware augmentation significantly improves recognition robustness under domain shift while preserving real-time performance. Our findings highlight the importance of illumination robustness and provide practical insights for deploying reliable food recognition systems in real-world inspection scenarios.

cross Learning Self-Correction in Vision-Language Models via Rollout Augmentation

Authors: Yi Ding, Ziliang Qiu, Bolian Li, Ruqi Zhang

Abstract: Self-correction is essential for solving complex reasoning problems in vision-language models (VLMs). However, existing reinforcement learning (RL) methods struggle to learn it, as effective self-correction behaviors emerge only rarely, making learning signals extremely sparse. To address this challenge, we propose correction-specific rollouts (Octopus), an RL rollout augmentation framework that synthesizes dense self-correction examples by recombining existing rollouts. This augmentation simultaneously improves sample efficiency due to rollout reuse and stabilizes RL optimization through balanced supervision. Furthermore, we introduce a response-masking strategy that decouples self-correction from direct reasoning, avoiding signal conflicts and enabling both behaviors to be learned effectively. Building on this, we introduce Octopus-8B, a reasoning VLM with controllable self-correction capability. Across 7 benchmarks, it achieves SoTA performance among open-source VLMs, outperforming the best RLVR baseline by 1.0 score while requiring only $0.72\times$ training time per step.

cross Estimation of Fish Catch Using Sentinel-2, 3 and XGBoost-Kernel-Based Kernel Ridge Regression

Authors: Kanu Mohammed, Vaishnavi Joshi, Pranjali Diliprao Patil, Sandipan Mondal, Ming-An Lee, Subhadip Dey

Abstract: Oceanographic factors, such as sea surface temperature and upper-ocean dynamics, have a significant impact on fish distribution. Maintaining fisheries that contribute to global food security requires quantifying these connections. This study uses multispectral images from Sentinel-2 MSI and Sentinel-3 OLCI to estimate fish catch using an Extreme Gradient Boosting (XGBoost)-kernelized Kernel Ridge Regression (KRR) technique. According to model evaluation, the XGBoost-KRR framework achieves the strongest correlation and the lowest prediction error across both sensors, suggesting improved capacity to capture nonlinear ocean-fish connections. While Sentinel-2 MSI resolves finer-scale spatial variability, emphasizing localized ecological interactions, Sentinel-3 OLCI displays smoother spectral responses associated with poorer spatial resolution. By supporting sustainable ecosystem management and strengthening satellite-based fisheries assessment, the proposed approach advances SDGs 2 (Zero Hunger) and 14 (Life Below Water).

cross Do physics-informed neural networks (PINNs) need to be deep? Shallow PINNs using the Levenberg-Marquardt algorithm

Authors: Muhammad Luthfi Shahab, Imam Mukhlash, Hadi Susanto

Abstract: This work investigates the use of shallow physics-informed neural networks (PINNs) for solving forward and inverse problems of nonlinear partial differential equations (PDEs). By reformulating PINNs as nonlinear systems, the Levenberg-Marquardt (LM) algorithm is employed to efficiently optimize the network parameters. Analytical expressions for the neural network derivatives with respect to the input variables are derived, enabling accurate and efficient computation of the Jacobian matrix required by LM. The proposed approach is tested on several benchmark problems, including the Burgers, Schr\"odinger, Allen-Cahn, and three-dimensional Bratu equations. Numerical results demonstrate that LM significantly outperforms BFGS in terms of convergence speed, accuracy, and final loss values, even when using shallow network architectures with only two hidden layers. These findings indicate that, for a wide class of PDEs, shallow PINNs combined with efficient second-order optimization methods can provide accurate and computationally efficient solutions for both forward and inverse problems.

cross Reinforcement Inference: Leveraging Uncertainty for Self-Correcting Language Model Reasoning

Authors: Xinhai Sun

Abstract: Modern large language models (LLMs) are often evaluated and deployed under a \emph{one-shot, greedy} inference protocol, especially in professional settings that require deterministic behavior. This regime can systematically under-estimate a fixed model's true capability: many errors arise not from missing knowledge, but from premature commitment under internal ambiguity. We introduce \emph{Reinforcement Inference}, an entropy-aware inference-time control strategy that uses the model's own uncertainty to selectively invoke a second, more deliberate reasoning attempt, enabling stronger performance \emph{without any retraining}. On 12,032 MMLU-Pro questions across 14 subjects, using DeepSeek-v3.2 with deterministic decoding in a zero-shot setting, Reinforcement Inference improves accuracy from 60.72\% to 84.03\%, while only incurring 61.06\% additional inference calls. A 100\% re-asking ablation reaches 84.35\%, indicating that uncertainty-aware selection captures most of the attainable improvement with substantially less compute. Moreover, a \emph{prompt-only} ablation underperforms the baseline, suggesting that the gains are not explained by generic `` your output had high entropy, think step-by-step'' prompting alone. Beyond providing a practical inference-time upgrade, our results suggest a broader \emph{entropy-aware} paradigm for measuring and expanding model capability: because modern decoder-based models generate outputs autoregressively, entropy and related confidence measures arise naturally as first-class control signals during generation. The resulting gap between one-pass greedy inference and uncertainty-conditioned deliberation offers a diagnostic lens on an LLM's latent reasoning horizon and motivates future training objectives that explicitly constrain correctness--confidence alignment.

cross Trajectory Stitching for Solving Inverse Problems with Flow-Based Models

Authors: Alexander Denker, Moshe Eliasof, Zeljko Kereta, Carola-Bibiane Sch\"onlieb

Abstract: Flow-based generative models have emerged as powerful priors for solving inverse problems. One option is to directly optimize the initial latent code (noise), such that the flow output solves the inverse problem. However, this requires backpropagating through the entire generative trajectory, incurring high memory costs and numerical instability. We propose MS-Flow, which represents the trajectory as a sequence of intermediate latent states rather than a single initial code. By enforcing the flow dynamics locally and coupling segments through trajectory-matching penalties, MS-Flow alternates between updating intermediate latent states and enforcing consistency with observed data. This reduces memory consumption while improving reconstruction quality. We demonstrate the effectiveness of MS-Flow over existing methods on image recovery and inverse problems, including inpainting, super-resolution, and computed tomography.

cross Incremental (k, z)-Clustering on Graphs

Authors: Emilio Cruciani, Sebastian Forster, Antonis Skarlatos

Abstract: Given a weighted undirected graph, a number of clusters $k$, and an exponent $z$, the goal in the $(k, z)$-clustering problem on graphs is to select $k$ vertices as centers that minimize the sum of the distances raised to the power $z$ of each vertex to its closest center. In the dynamic setting, the graph is subject to adversarial edge updates, and the goal is to maintain explicitly an exact $(k, z)$-clustering solution in the induced shortest-path metric. While efficient dynamic $k$-center approximation algorithms on graphs exist [Cruciani et al. SODA 2024], to the best of our knowledge, no prior work provides similar results for the dynamic $(k,z)$-clustering problem. As the main result of this paper, we develop a randomized incremental $(k, z)$-clustering algorithm that maintains with high probability a constant-factor approximation in a graph undergoing edge insertions with a total update time of $\tilde O(k m^{1+o(1)}+ k^{1+\frac{1}{\lambda}} m)$, where $\lambda \geq 1$ is an arbitrary fixed constant. Our incremental algorithm consists of two stages. In the first stage, we maintain a constant-factor bicriteria approximate solution of size $\tilde{O}(k)$ with a total update time of $m^{1+o(1)}$ over all adversarial edge insertions. This first stage is an intricate adaptation of the bicriteria approximation algorithm by Mettu and Plaxton [Machine Learning 2004] to incremental graphs. One of our key technical results is that the radii in their algorithm can be assumed to be non-decreasing while the approximation ratio remains constant, a property that may be of independent interest. In the second stage, we maintain a constant-factor approximate $(k,z)$-clustering solution on a dynamic weighted instance induced by the bicriteria approximate solution. For this subproblem, we employ a dynamic spanner algorithm together with a static $(k,z)$-clustering algorithm.

cross GOT-Edit: Geometry-Aware Generic Object Tracking via Online Model Editing

Authors: Shih-Fang Chen, Jun-Cheng Chen, I-Hong Jhuo, Yen-Yu Lin

Abstract: Human perception for effective object tracking in a 2D video stream arises from the implicit use of prior 3D knowledge combined with semantic reasoning. In contrast, most generic object tracking (GOT) methods primarily rely on 2D features of the target and its surroundings while neglecting 3D geometric cues, which makes them susceptible to partial occlusion, distractors, and variations in geometry and appearance. To address this limitation, we introduce GOT-Edit, an online cross-modality model editing approach that integrates geometry-aware cues into a generic object tracker from a 2D video stream. Our approach leverages features from a pre-trained Visual Geometry Grounded Transformer to enable geometric cue inference from only a few 2D images. To tackle the challenge of seamlessly combining geometry and semantics, GOT-Edit performs online model editing with null-space constrained updates that incorporate geometric information while preserving semantic discrimination, yielding consistently better performance across diverse scenarios. Extensive experiments on multiple GOT benchmarks demonstrate that GOT-Edit achieves superior robustness and accuracy, particularly under occlusion and clutter, establishing a new paradigm for combining 2D semantics with 3D geometric reasoning for generic object tracking.

cross DNS: Data-driven Nonlinear Smoother for Complex Model-free Process

Authors: Fredrik Cumlin, Anubhab Ghosh, Saikat Chatterjee

Abstract: We propose data-driven nonlinear smoother (DNS) to estimate a hidden state sequence of a complex dynamical process from a noisy, linear measurement sequence. The dynamical process is model-free, that is, we do not have any knowledge of the nonlinear dynamics of the complex process. There is no state-transition model (STM) of the process available. The proposed DNS uses a recurrent architecture that helps to provide a closed-form posterior of the hidden state sequence given the measurement sequence. DNS learns in an unsupervised manner, meaning the training dataset consists of only measurement data and no state data. We demonstrate DNS using simulations for smoothing of several stochastic dynamical processes, including a benchmark Lorenz system. Experimental results show that the DNS is significantly better than a deep Kalman smoother (DKS) and an iterative data-driven nonlinear state estimation (iDANSE) smoother.

cross Constructive conditional normalizing flows

Authors: Borjan Geshkovski, Dom\`enec Ruiz-Balet

Abstract: Motivated by applications in conditional sampling, given a probability measure $\mu$ and a diffeomorphism $\phi$, we consider the problem of simultaneously approximating $\phi$ and the pushforward $\phi_{\#}\mu$ by means of the flow of a continuity equation whose velocity field is a perceptron neural network with piecewise constant weights. We provide an explicit construction based on a polar-like decomposition of the Lagrange interpolant of $\phi$. The latter involves a compressible component, given by the gradient of a particular convex function, which can be realized exactly, and an incompressible component, which -- after approximating via permutations -- can be implemented through shear flows intrinsic to the continuity equation. For more regular maps $\phi$ -- such as the Kn\"othe-Rosenblatt rearrangement -- we provide an alternative, probabilistic construction inspired by the Maurey empirical method, in which the number of discontinuities in the weights doesn't scale inversely with the ambient dimension.

cross Enhancing Genetic Algorithms with Graph Neural Networks: A Timetabling Case Study

Authors: Laura-Maria Cornei, Mihaela-Elena Breab\u{a}n

Abstract: This paper investigates the impact of hybridizing a multi-modal Genetic Algorithm with a Graph Neural Network for timetabling optimization. The Graph Neural Network is designed to encapsulate general domain knowledge to improve schedule quality, while the Genetic Algorithm explores different regions of the search space and integrates the deep learning model as an enhancement operator to guide the solution search towards optimality. Initially, both components of the hybrid technique were designed, developed, and optimized independently to solve the tackled task. Multiple experiments were conducted on Staff Rostering, a well-known timetabling problem, to compare the proposed hybridization with the standalone optimized versions of the Genetic Algorithm and Graph Neural Network. The experimental results demonstrate that the proposed hybridization brings statistically significant improvements in both the time efficiency and solution quality metrics, compared to the standalone methods. To the best of our knowledge, this work proposes the first hybridization of a Genetic Algorithm with a Graph Neural Network for solving timetabling problems.

cross We Should Separate Memorization from Copyright

Authors: Adi Haviv, Niva Elkin-Koren, Uri Hacohen, Roi Livni, Shay Moran

Abstract: The widespread use of foundation models has introduced a new risk factor of copyright issue. This issue is leading to an active, lively and on-going debate amongst the data-science community as well as amongst legal scholars. Where claims and results across both sides are often interpreted in different ways and leading to different implications. Our position is that much of the technical literature relies on traditional reconstruction techniques that are not designed for copyright analysis. As a result, memorization and copying have been conflated across both technical and legal communities and in multiple contexts. We argue that memorization, as commonly studied in data science, should not be equated with copying and should not be used as a proxy for copyright infringement. We distinguish technical signals that meaningfully indicate infringement risk from those that instead reflect lawful generalization or high-frequency content. Based on this analysis, we advocate for an output-level, risk-based evaluation process that aligns technical assessments with established copyright standards and provides a more principled foundation for research, auditing, and policy.

cross Retrieval Pivot Attacks in Hybrid RAG: Measuring and Mitigating Amplified Leakage from Vector Seeds to Graph Expansion

Authors: Scott Thornton

Abstract: Hybrid Retrieval-Augmented Generation (RAG) pipelines combine vector similarity search with knowledge graph expansion for multi-hop reasoning. We show that this composition introduces a distinct security failure mode: a vector-retrieved "seed" chunk can pivot via entity links into sensitive graph neighborhoods, causing cross-tenant data leakage that does not occur in vector-only retrieval. We formalize this risk as Retrieval Pivot Risk (RPR) and introduce companion metrics Leakage@k, Amplification Factor, and Pivot Depth (PD) to quantify leakage magnitude and traversal structure. We present seven Retrieval Pivot Attacks that exploit the vector-to-graph boundary and show that adversarial injection is not required: naturally shared entities create cross-tenant pivot paths organically. Across a synthetic multi-tenant enterprise corpus and the Enron email corpus, the undefended hybrid pipeline exhibits high pivot risk (RPR up to 0.95) with multiple unauthorized items returned per query. Leakage consistently appears at PD=2, which we attribute to the bipartite chunk-entity topology and formalize as a proposition. We then show that enforcing authorization at a single location, the graph expansion boundary, eliminates measured leakage (RPR near 0) across both corpora, all attack variants, and label forgery rates up to 10 percent, with minimal overhead. Our results indicate the root cause is boundary enforcement, not inherently complex defenses: two individually secure retrieval components can compose into an insecure system unless authorization is re-checked at the transition point.

cross Learning to Judge: LLMs Designing and Applying Evaluation Rubrics

Authors: Clemencia Siro, Pourya Aliannejadi, Mohammad Aliannejadi

Abstract: Large language models (LLMs) are increasingly used as evaluators for natural language generation, applying human-defined rubrics to assess system outputs. However, human rubrics are often static and misaligned with how models internally represent language quality. We introduce GER-Eval (Generating Evaluation Rubrics for Evaluation) to investigate whether LLMs can design and apply their own evaluation rubrics. We evaluate the semantic coherence and scoring reliability of LLM-defined criteria and their alignment with human criteria. LLMs reliably generate interpretable and task-aware evaluation dimensions and apply them consistently within models, but their scoring reliability degrades in factual and knowledge-intensive settings. Closed-source models such as GPT-4o achieve higher agreement and cross-model generalization than open-weight models such as Llama. Our findings position evaluation as a learned linguistic capability of LLMs, consistent within models but fragmented across them, and call for new methods that jointly model human and LLM evaluative language to improve reliability and interpretability.

cross Towards Understanding Multimodal Fine-Tuning: Spatial Features

Authors: Lachin Naghashyar, Hunar Batra, Ashkan Khakzar, Philip Torr, Ronald Clark, Christian Schroeder de Witt, Constantin Venhoff

Abstract: Contemporary Vision-Language Models (VLMs) achieve strong performance on a wide range of tasks by pairing a vision encoder with a pre-trained language model, fine-tuned for visual-text inputs. Yet despite these gains, it remains unclear how language backbone representations adapt during multimodal training and when vision-specific capabilities emerge. In this work, we present the first mechanistic analysis of VLM adaptation. Using stage-wise model diffing, a technique that isolates representational changes introduced during multimodal fine-tuning, we reveal how a language model learns to "see". We first identify vision-preferring features that emerge or reorient during fine-tuning. We then show that a selective subset of these features reliably encodes spatial relations, revealed through controlled shifts to spatial prompts. Finally, we trace the causal activation of these features to a small group of attention heads. Our findings show that stage-wise model diffing reveals when and where spatially grounded multimodal features arise. It also provides a clearer view of modality fusion by showing how visual grounding reshapes features that were previously text-only. This methodology enhances the interpretability of multimodal training and provides a foundation for understanding and refining how pretrained language models acquire vision-grounded capabilities.

cross Welfarist Formulations for Diverse Similarity Search

Authors: Siddharth Barman, Nirjhar Das, Shivam Gupta, Kirankumar Shiragur

Abstract: Nearest Neighbor Search (NNS) is a fundamental problem in data structures with wide-ranging applications, such as web search, recommendation systems, and, more recently, retrieval-augmented generations (RAG). In such recent applications, in addition to the relevance (similarity) of the returned neighbors, diversity among the neighbors is a central requirement. In this paper, we develop principled welfare-based formulations in NNS for realizing diversity across attributes. Our formulations are based on welfare functions -- from mathematical economics -- that satisfy central diversity (fairness) and relevance (economic efficiency) axioms. With a particular focus on Nash social welfare, we note that our welfare-based formulations provide objective functions that adaptively balance relevance and diversity in a query-dependent manner. Notably, such a balance was not present in the prior constraint-based approach, which forced a fixed level of diversity and optimized for relevance. In addition, our formulation provides a parametric way to control the trade-off between relevance and diversity, providing practitioners with flexibility to tailor search results to task-specific requirements. We develop efficient nearest neighbor algorithms with provable guarantees for the welfare-based objectives. Notably, our algorithm can be applied on top of any standard ANN method (i.e., use standard ANN method as a subroutine) to efficiently find neighbors that approximately maximize our welfare-based objectives. Experimental results demonstrate that our approach is practical and substantially improves diversity while maintaining high relevance of the retrieved neighbors.

cross Amortising Inference and Meta-Learning Priors in Neural Networks

Authors: Tommy Rochussen, Vincent Fortuin

Abstract: One of the core facets of Bayesianism is in the updating of prior beliefs in light of new evidence$\text{ -- }$so how can we maintain a Bayesian approach if we have no prior beliefs in the first place? This is one of the central challenges in the field of Bayesian deep learning, where it is not clear how to represent beliefs about a prediction task by prior distributions over model parameters. Bridging the fields of Bayesian deep learning and probabilistic meta-learning, we introduce a way to $\textit{learn}$ a weights prior from a collection of datasets by introducing a way to perform per-dataset amortised variational inference. The model we develop can be viewed as a neural process whose latent variable is the set of weights of a BNN and whose decoder is the neural network parameterised by a sample of the latent variable itself. This unique model allows us to study the behaviour of Bayesian neural networks under well-specified priors, use Bayesian neural networks as flexible generative models, and perform desirable but previously elusive feats in neural processes such as within-task minibatching or meta-learning under extreme data-starvation.

cross Empirically Understanding the Value of Prediction in Allocation

Authors: Unai Fischer-Abaigar, Emily Aiken, Christoph Kern, Juan Carlos Perdomo

Abstract: Institutions increasingly use prediction to allocate scarce resources. From a design perspective, better predictions compete with other investments, such as expanding capacity or improving treatment quality. Here, the big question is not how to solve a specific allocation problem, but rather which problem to solve. In this work, we develop an empirical toolkit to help planners form principled answers to this question and quantify the bottom-line welfare impact of investments in prediction versus other policy levers such as expanding capacity and improving treatment quality. Applying our framework in two real-world case studies on German employment services and poverty targeting in Ethiopia, we illustrate how decision-makers can reliably derive context-specific conclusions about the relative value of prediction in their allocation problem. We make our software toolkit, rvp, and parts of our data available in order to enable future empirical work in this area.

cross Multimodal Learning for Arcing Detection in Pantograph-Catenary Systems

Authors: Hao Dong, Eleni Chatzi, Olga Fink

Abstract: The pantograph-catenary interface is essential for ensuring uninterrupted and reliable power delivery in electrified rail systems. However, electrical arcing at this interface poses serious risks, including accelerated wear of contact components, degraded system performance, and potential service disruptions. Detecting arcing events at the pantograph-catenary interface is challenging due to their transient nature, noisy operating environment, data scarcity, and the difficulty of distinguishing arcs from other similar transient phenomena. To address these challenges, we propose a novel multimodal framework that combines high-resolution image data with force measurements to more accurately and robustly detect arcing events. First, we construct two arcing detection datasets comprising synchronized visual and force measurements. One dataset is built from data provided by the Swiss Federal Railways (SBB), and the other is derived from publicly available videos of arcing events in different railway systems and synthetic force data that mimic the characteristics observed in the real dataset. Leveraging these datasets, we propose MultiDeepSAD, an extension of the DeepSAD algorithm for multiple modalities with a new loss formulation. Additionally, we introduce tailored pseudo-anomaly generation techniques specific to each data type, such as synthetic arc-like artifacts in images and simulated force irregularities, to augment training data and improve the discriminative ability of the model. Through extensive experiments and ablation studies, we demonstrate that our framework significantly outperforms baseline approaches, exhibiting enhanced sensitivity to real arcing events even under domain shifts and limited availability of real arcing observations.

cross Learning the Value Systems of Societies with Preference-based Multi-objective Reinforcement Learning

Authors: Andr\'es Holgado-S\'anchez, Peter Vamplew, Richard Dazeley, Sascha Ossowski, Holger Billhardt

Abstract: Value-aware AI should recognise human values and adapt to the value systems (value-based preferences) of different users. This requires operationalization of values, which can be prone to misspecification. The social nature of values demands their representation to adhere to multiple users while value systems are diverse, yet exhibit patterns among groups. In sequential decision making, efforts have been made towards personalization for different goals or values from demonstrations of diverse agents. However, these approaches demand manually designed features or lack value-based interpretability and/or adaptability to diverse user preferences. We propose algorithms for learning models of value alignment and value systems for a society of agents in Markov Decision Processes (MDPs), based on clustering and preference-based multi-objective reinforcement learning (PbMORL). We jointly learn socially-derived value alignment models (groundings) and a set of value systems that concisely represent different groups of users (clusters) in a society. Each cluster consists of a value system representing the value-based preferences of its members and an approximately Pareto-optimal policy that reflects behaviours aligned with this value system. We evaluate our method against a state-of-the-art PbMORL algorithm and baselines on two MDPs with human values.

cross AMEM4Rec: Leveraging Cross-User Similarity for Memory Evolution in Agentic LLM Recommenders

Authors: Minh-Duc Nguyen, Hai-Dang Kieu, Dung D. Le

Abstract: Agentic systems powered by Large Language Models (LLMs) have shown strong potential in recommender systems but remain hindered by several challenges. Fine-tuning LLMs is parameter-inefficient, and prompt-based agentic reasoning is limited by context length and hallucination risk. Moreover, existing agentic recommendation systems predominantly leverages semantic knowledge while neglecting the collaborative filtering (CF) signals essential for implicit preference modeling. To address these limitations, we propose AMEM4Rec, an agentic LLM-based recommender that learns collaborative signals in an end-to-end manner through cross-user memory evolution. AMEM4Rec stores abstract user behavior patterns from user histories in a global memory pool. Within this pool, memories are linked to similar existing ones and iteratively evolved to reinforce shared cross-user patterns, enabling the system to become aware of CF signals without relying on a pre-trained CF model. Extensive experiments on Amazon and MIND datasets show that AMEM4Rec consistently outperforms state-of-the-art LLM-based recommenders, demonstrating the effectiveness of evolving memory-guided collaborative filtering.

cross Cutting Through the Noise: On-the-fly Outlier Detection for Robust Training of Machine Learning Interatomic Potentials

Authors: Terry C. W. Lam, Niamh O'Neill, Christoph Schran, Lars L. Schaaf

Abstract: The accuracy of machine learning interatomic potentials suffers from reference data that contains numerical noise. Often originating from unconverged or inconsistent electronic-structure calculations, this noise is challenging to identify. Existing mitigation strategies such as manual filtering or iterative refinement of outliers, require either substantial expert effort or multiple expensive retraining cycles, making them difficult to scale to large datasets. Here, we introduce an on-the-fly outlier detection scheme that automatically down-weights noisy samples, without requiring additional reference calculations. By tracking the loss distribution via an exponential moving average, this unsupervised method identifies outliers throughout a single training run. We show that this approach prevents overfitting and matches the performance of iterative refinement baselines with significantly reduced overhead. The method's effectiveness is demonstrated by recovering accurate physical observables for liquid water from unconverged reference data, including diffusion coefficients. Furthermore, we validate its scalability by training a foundation model for organic chemistry on the SPICE dataset, where it reduces energy errors by a factor of three. This framework provides a simple, automated solution for training robust models on imperfect datasets across dataset sizes.

cross Understanding Dynamic Compute Allocation in Recurrent Transformers

Authors: Ibraheem Muhammad Moosa, Suhas Lohit, Ye Wang, Moitreya Chatterjee, Wenpeng Yin

Abstract: Token-level adaptive computation seeks to reduce inference cost by allocating more computation to harder tokens and less to easier ones. However, prior work is primarily evaluated on natural-language benchmarks using task-level metrics, where token-level difficulty is unobservable and confounded with architectural factors, making it unclear whether compute allocation truly aligns with underlying complexity. We address this gap through three contributions. First, we introduce a complexity-controlled evaluation paradigm using algorithmic and synthetic language tasks with parameterized difficulty, enabling direct testing of token-level compute allocation. Second, we propose ANIRA, a unified recurrent Transformer framework that supports per-token variable-depth computation while isolating compute allocation decisions from other model factors. Third, we use this framework to conduct a systematic analysis of token-level adaptive computation across alignment with complexity, generalization, and decision timing. Our results show that compute allocation aligned with task complexity can emerge without explicit difficulty supervision, but such alignment does not imply algorithmic generalization: models fail to extrapolate to unseen input sizes despite allocating additional computation. We further find that early compute decisions rely on static structural cues, whereas online halting more closely tracks algorithmic execution state.

cross Differentiable Logical Programming for Quantum Circuit Discovery and Optimization

Authors: Antonin Sulc

Abstract: Designing high-fidelity quantum circuits remains challenging, and current paradigms often depend on heuristic, fixed-ansatz structures or rule-based compilers that can be suboptimal or lack generality. We introduce a neuro-symbolic framework that reframes quantum circuit design as a differentiable logic programming problem. Our model represents a scaffold of potential quantum gates and parameterized operations as a set of learnable, continuous ``truth values'' or ``switches,'' $s \in [0, 1]^N$. These switches are optimized via standard gradient descent to satisfy a user-defined set of differentiable, logical axioms (e.g., correctness, simplicity, robustness). We provide a theoretical formulation bridging continuous logic (via T-norms) and unitary evolution (via geodesic interpolation), while addressing the barren plateau problem through biased initialization. We illustrate the approach on tasks including discovery of a 4-qubit Quantum Fourier Transform (QFT) from a scaffold of 21 candidate gates. We also report a hardware-aware adaptation experiment on the 133-qubit IBM Torino processor, where the method improved fidelity by 59.3 percentage points in a localized routing task while adapting to hardware failures.

cross Contrastive Learning for Diversity-Aware Product Recommendations in Retail

Authors: Vasileios Karlis, Ezgi Y{\i}ld{\i}r{\i}m, David Vos, Maarten de Rijke

Abstract: Recommender systems often struggle with long-tail distributions and limited item catalog exposure, where a small subset of popular items dominates recommendations. This challenge is especially critical in large-scale online retail settings with extensive and diverse product assortments. This paper introduces an approach to enhance catalog coverage without compromising recommendation quality in the existing digital recommendation pipeline at IKEA Retail. Drawing inspiration from recent advances in negative sampling to address popularity bias, we integrate contrastive learning with carefully selected negative samples. Through offline and online evaluations, we demonstrate that our method improves catalog coverage, ensuring a more diverse set of recommendations yet preserving strong recommendation performance.

cross Winner's Curse Drives False Promises in Data-Driven Decisions: A Case Study in Refugee Matching

Authors: Hamsa Bastani, Osbert Bastani, Bryce McLaughlin

Abstract: A major challenge in data-driven decision-making is accurate policy evaluation-i.e., guaranteeing that a learned decision-making policy achieves the promised benefits. A popular strategy is model-based policy evaluation, which estimates a model from data to infer counterfactual outcomes. This strategy is known to produce unwarrantedly optimistic estimates of the true benefit due to the winner's curse. We searched the recent literature on data-driven decision-making, identifying a sample of 55 papers published in the Management Science in the past decade; all but two relied on this flawed methodology. Several common justifications are provided: (1) the estimated models are accurate, stable, and well-calibrated, (2) the historical data uses random treatment assignment, (3) the model family is well-specified, and (4) the evaluation methodology uses sample splitting. Unfortunately, we show that no combination of these justifications avoids the winner's curse. First, we provide a theoretical analysis demonstrating that the winner's curse can cause large, spurious reported benefits even when all these justifications hold. Second, we perform a simulation study based on the recent and consequential data-driven refugee matching problem. We construct a synthetic refugee matching environment (calibrated to closely match the real setting) but designed so that no assignment policy can improve expected employment compared to random assignment. Model-based methods report large, stable gains of around 60% even when the true effect is zero; these gains are on par with improvements of 22-75% reported in the literature. Our results provide strong evidence against model-based evaluation.

cross Analysis of Converged 3D Gaussian Splatting Solutions: Density Effects and Prediction Limit

Authors: Zhendong Wang, Cihan Ruan, Jingchuan Xiao, Chuqing Shi, Wei Jiang, Wei Wang, Wenjie Liu, Nam Ling

Abstract: We investigate what structure emerges in 3D Gaussian Splatting (3DGS) solutions from standard multi-view optimization. We term these Rendering-Optimal References (RORs) and analyze their statistical properties, revealing stable patterns: mixture-structured scales and bimodal radiance across diverse scenes. To understand what determines these parameters, we apply learnability probes by training predictors to reconstruct RORs from point clouds without rendering supervision. Our analysis uncovers fundamental density-stratification. Dense regions exhibit geometry-correlated parameters amenable to render-free prediction, while sparse regions show systematic failure across architectures. We formalize this through variance decomposition, demonstrating that visibility heterogeneity creates covariance-dominated coupling between geometric and appearance parameters in sparse regions. This reveals the dual character of RORs: geometric primitives where point clouds suffice, and view synthesis primitives where multi-view constraints are essential. We provide density-aware strategies that improve training robustness and discuss architectural implications for systems that adaptively balance feed-forward prediction and rendering-based refinement.

cross AMS-HD: Hyperdimensional Computing for Real-Time and Energy-Efficient Acute Mountain Sickness Detection

Authors: Abu Masum, Mehran Moghadam, M. Hassan Najafi, Bige Unluturk, Ulkuhan Guler, Sercan Aygun

Abstract: Altitude sickness is a potentially life-threatening condition that impacts many individuals traveling to elevated altitudes. Timely detection is critical as symptoms can escalate rapidly. Early recognition enables simple interventions such as descent, oxygen, or medication, and prompt treatment can save lives by significantly lowering the risk of severe complications. Although conventional machine learning (ML) techniques have been applied to identify altitude sickness using physiological signals, such as heart rate, oxygen saturation, respiration rate, blood pressure, and body temperature, they often struggle to balance predictive performance with low hardware demands. In contrast, hyperdimensional computing (HDC) remains under-explored for this task with limited biomedical features, where it may offer a compelling alternative to existing classification models. Its vector symbolic framework is inherently suited to hardware-efficient design, making it a strong candidate for low-power systems like wearables. Leveraging lightweight computation and efficient streamlined memory usage, HDC enables real-time detection of altitude sickness from physiological parameters collected by wearable devices, achieving accuracy comparable to that of traditional ML models. We present AMS-HD, a novel system that integrates tailored feature extraction and Hadamard HV encoding to enhance both the precision and efficiency of HDC-based detection. This framework is well-positioned for deployment in wearable health monitoring platforms, enabling continuous, on-the-go tracking of acute altitude sickness.

cross Online monotone density estimation and log-optimal calibration

Authors: Rohan Hore, Ruodu Wang, Aaditya Ramdas

Abstract: We study the problem of online monotone density estimation, where density estimators must be constructed in a predictable manner from sequentially observed data. We propose two online estimators: an online analogue of the classical Grenander estimator, and an expert aggregation estimator inspired by exponential weighting methods from the online learning literature. In the well-specified stochastic setting, where the underlying density is monotone, we show that the expected cumulative log-likelihood gap between the online estimators and the true density admits an $O(n^{1/3})$ bound. We further establish a $\sqrt{n\log{n}}$ pathwise regret bound for the expert aggregation estimator relative to the best offline monotone estimator chosen in hindsight, under minimal regularity assumptions on the observed sequence. As an application of independent interest, we show that the problem of constructing log-optimal p-to-e calibrators for sequential hypothesis testing can be formulated as an online monotone density estimation problem. We adapt the proposed estimators to build empirically adaptive p-to-e calibrators and establish their optimality. Numerical experiments illustrate the theoretical results.

cross Provably robust learning of regression neural networks using $\beta$-divergences

Authors: Abhik Ghosh, Suryasis Jana

Abstract: Regression neural networks (NNs) are most commonly trained by minimizing the mean squared prediction error, which is highly sensitive to outliers and data contamination. Existing robust training methods for regression NNs are often limited in scope and rely primarily on empirical validation, with only a few offering partial theoretical guarantees. In this paper, we propose a new robust learning framework for regression NNs based on the $\beta$-divergence (also known as the density power divergence) which we call `rRNet'. It applies to a broad class of regression NNs, including models with non-smooth activation functions and error densities, and recovers the classical maximum likelihood learning as a special case. The rRNet is implemented via an alternating optimization scheme, for which we establish convergence guarantees to stationary points under mild, verifiable conditions. The (local) robustness of rRNet is theoretically characterized through the influence functions of both the parameter estimates and the resulting rRNet predictor, which are shown to be bounded for suitable choices of the tuning parameter $\beta$, depending on the error density. We further prove that rRNet attains the optimal 50\% asymptotic breakdown point at the assumed model for all $\beta\in(0, 1]$, providing a strong global robustness guarantee that is largely absent for existing NN learning methods. Our theoretical results are complemented by simulation experiments and real-data analyses, illustrating practical advantages of rRNet over existing approaches in both function approximation problems and prediction tasks with noisy observations.

cross MotionCrafter: Dense Geometry and Motion Reconstruction with a 4D VAE

Authors: Ruijie Zhu, Jiahao Lu, Wenbo Hu, Xiaoguang Han, Jianfei Cai, Ying Shan, Chuanxia Zheng

Abstract: We introduce MotionCrafter, a video diffusion-based framework that jointly reconstructs 4D geometry and estimates dense motion from a monocular video. The core of our method is a novel joint representation of dense 3D point maps and 3D scene flows in a shared coordinate system, and a novel 4D VAE to effectively learn this representation. Unlike prior work that forces the 3D value and latents to align strictly with RGB VAE latents-despite their fundamentally different distributions-we show that such alignment is unnecessary and leads to suboptimal performance. Instead, we introduce a new data normalization and VAE training strategy that better transfers diffusion priors and greatly improves reconstruction quality. Extensive experiments across multiple datasets demonstrate that MotionCrafter achieves state-of-the-art performance in both geometry reconstruction and dense scene flow estimation, delivering 38.64% and 25.0% improvements in geometry and motion reconstruction, respectively, all without any post-optimization. Project page: https://ruijiezhu94.github.io/MotionCrafter_Page

URLs: https://ruijiezhu94.github.io/MotionCrafter_Page

cross Learning to Coordinate via Quantum Entanglement in Multi-Agent Reinforcement Learning

Authors: John Gardiner, Orlando Romero, Brendan Tivnan, Nicol\`o Dal Fabbro, George J. Pappas

Abstract: The inability to communicate poses a major challenge to coordination in multi-agent reinforcement learning (MARL). Prior work has explored correlating local policies via shared randomness, sometimes in the form of a correlation device, as a mechanism to assist in decentralized decision-making. In contrast, this work introduces the first framework for training MARL agents to exploit shared quantum entanglement as a coordination resource, which permits a larger class of communication-free correlated policies than shared randomness alone. This is motivated by well-known results in quantum physics which posit that, for certain single-round cooperative games with no communication, shared quantum entanglement enables strategies that outperform those that only use shared randomness. In such cases, we say that there is quantum advantage. Our framework is based on a novel differentiable policy parameterization that enables optimization over quantum measurements, together with a novel policy architecture that decomposes joint policies into a quantum coordinator and decentralized local actors. To illustrate the effectiveness of our proposed method, we first show that we can learn, purely from experience, strategies that attain quantum advantage in single-round games that are treated as black box oracles. We then demonstrate how our machinery can learn policies with quantum advantage in an illustrative multi-agent sequential decision-making problem formulated as a decentralized partially observable Markov decision process (Dec-POMDP).

cross When do neural ordinary differential equations generalize on complex networks?

Authors: Moritz Laber, Tina Eliassi-Rad, Brennan Klein

Abstract: Neural ordinary differential equations (neural ODEs) can effectively learn dynamical systems from time series data, but their behavior on graph-structured data remains poorly understood, especially when applied to graphs with different size or structure than encountered during training. We study neural ODEs ($\mathtt{nODE}$s) with vector fields following the Barab\'asi-Barzel form, trained on synthetic data from five common dynamical systems on graphs. Using the $\mathbb{S}^1$-model to generate graphs with realistic and tunable structure, we find that degree heterogeneity and the type of dynamical system are the primary factors in determining $\mathtt{nODE}$s' ability to generalize across graph sizes and properties. This extends to $\mathtt{nODE}$s' ability to capture fixed points and maintain performance amid missing data. Average clustering plays a secondary role in determining $\mathtt{nODE}$ performance. Our findings highlight $\mathtt{nODE}$s as a powerful approach to understanding complex systems but underscore challenges emerging from degree heterogeneity and clustering in realistic graphs.

cross Universal Coefficients and Mayer-Vietoris Sequence for Groupoid Homology

Authors: Luciano Melodia

Abstract: We study homology of ample groupoids via the compactly supported Moore complex of the nerve. Let $A$ be a topological abelian group. For $n\ge 0$ set $C_n(\mathcal G;A) := C_c(\mathcal G_n,A)$ and define $\partial_n^A=\sum_{i=0}^n(-1)^i(d_i)_*$. This defines $H_n(\mathcal G;A)$. The theory is functorial for continuous \'etale homomorphisms. It is compatible with standard reductions, including restriction to saturated clopen subsets. In the ample setting it is invariant under Kakutani equivalence. We reprove Matui type long exact sequences and identify the comparison maps at chain level. For discrete $A$ we prove a natural universal coefficient short exact sequence $$0\to H_n(\mathcal G)\otimes_{\mathbb Z}A\xrightarrow{\ \iota_n^{\mathcal G}\ }H_n(\mathcal G;A)\xrightarrow{\ \kappa_n^{\mathcal G}\ }\operatorname{Tor}_1^{\mathbb Z}\bigl(H_{n-1}(\mathcal G),A\bigr)\to 0.$$ The key input is the chain level isomorphism $C_c(\mathcal G_n,\mathbb Z)\otimes_{\mathbb Z}A\cong C_c(\mathcal G_n,A)$, which reduces the groupoid statement to the classical algebraic UCT for the free complex $C_c(\mathcal G_\bullet,\mathbb Z)$. We also isolate the obstruction for non-discrete coefficients. For a locally compact totally disconnected Hausdorff space $X$ with a basis of compact open sets, the image of $\Phi_X:C_c(X,\mathbb Z)\otimes_{\mathbb Z}A\to C_c(X,A)$ is exactly the compactly supported functions with finite image. Thus $\Phi_X$ is surjective if and only if every $f\in C_c(X,A)$ has finite image, and for suitable $X$ one can produce compactly supported continuous maps $X\to A$ with infinite image. Finally, for a clopen saturated cover $\mathcal G_0=U_1\cup U_2$ we construct a short exact sequence of Moore complexes and derive a Mayer-Vietoris long exact sequence for $H_\bullet(\mathcal G;A)$ for explicit computations.

cross Contact-Anchored Policies: Contact Conditioning Creates Strong Robot Utility Models

Authors: Zichen Jeff Cui, Omar Rayyan, Haritheja Etukuru, Bowen Tan, Zavier Andrianarivo, Zicheng Teng, Yihang Zhou, Krish Mehta, Nicholas Wojno, Kevin Yuanbo Wu, Manan H Anjaria, Ziyuan Wu, Manrong Mao, Guangxun Zhang, Binit Shah, Yejin Kim, Soumith Chintala, Lerrel Pinto, Nur Muhammad Mahi Shafiullah

Abstract: The prevalent paradigm in robot learning attempts to generalize across environments, embodiments, and tasks with language prompts at runtime. A fundamental tension limits this approach: language is often too abstract to guide the concrete physical understanding required for robust manipulation. In this work, we introduce Contact-Anchored Policies (CAP), which replace language conditioning with points of physical contact in space. Simultaneously, we structure CAP as a library of modular utility models rather than a monolithic generalist policy. This factorization allows us to implement a real-to-sim iteration cycle: we build EgoGym, a lightweight simulation benchmark, to rapidly identify failure modes and refine our models and datasets prior to real-world deployment. We show that by conditioning on contact and iterating via simulation, CAP generalizes to novel environments and embodiments out of the box on three fundamental manipulation skills while using only 23 hours of demonstration data, and outperforms large, state-of-the-art VLAs in zero-shot evaluations by 56%. All model checkpoints, codebase, hardware, simulation, and datasets will be open-sourced. Project page: https://cap-policy.github.io/

URLs: https://cap-policy.github.io/

cross Robustness Is a Function, Not a Number: A Factorized Comprehensive Study of OOD Robustness in Vision-Based Driving

Authors: Amir Mallak, Alaa Maalouf

Abstract: Out of distribution (OOD) robustness in autonomous driving is often reduced to a single number, hiding what breaks a policy. We decompose environments along five axes: scene (rural/urban), season, weather, time (day/night), and agent mix; and measure performance under controlled $k$-factor perturbations ($k \in \{0,1,2,3\}$). Using closed loop control in VISTA, we benchmark FC, CNN, and ViT policies, train compact ViT heads on frozen foundation-model (FM) features, and vary ID support in scale, diversity, and temporal context. (1) ViT policies are markedly more OOD-robust than comparably sized CNN/FC, and FM features yield state-of-the-art success at a latency cost. (2) Naive temporal inputs (multi-frame) do not beat the best single-frame baseline. (3) The largest single factor drops are rural $\rightarrow$ urban and day $\rightarrow$ night ($\sim 31\%$ each); actor swaps $\sim 10\%$, moderate rain $\sim 7\%$; season shifts can be drastic, and combining a time flip with other changes further degrades performance. (4) FM-feature policies stay above $85\%$ under three simultaneous changes; non-FM single-frame policies take a large first-shift hit, and all no-FM models fall below $50\%$ by three changes. (5) Interactions are non-additive: some pairings partially offset, whereas season-time combinations are especially harmful. (6) Training on winter/snow is most robust to single-factor shifts, while a rural+summer baseline gives the best overall OOD performance. (7) Scaling traces/views improves robustness ($+11.8$ points from $5$ to $14$ traces), yet targeted exposure to hard conditions can substitute for scale. (8) Using multiple ID environments broadens coverage and strengthens weak cases (urban OOD $60.6\% \rightarrow 70.1\%$) with a small ID drop; single-ID preserves peak performance but in a narrow domain. These results yield actionable design rules for OOD-robust driving policies.

replace Geometric Imbalance in Semi-Supervised Node Classification

Authors: Liang Yan, Shengzhong Zhang, Bisheng Li, Menglin Yang, Chen Yang, Min Zhou, Weiyang Ding, Yutong Xie, Zengfeng Huang

Abstract: Class imbalance in graph data presents a significant challenge for effective node classification, particularly in semi-supervised scenarios. In this work, we formally introduce the concept of geometric imbalance, which captures how message passing on class-imbalanced graphs leads to geometric ambiguity among minority-class nodes in the riemannian manifold embedding space. We provide a rigorous theoretical analysis of geometric imbalance on the riemannian manifold and propose a unified framework that explicitly mitigates it through pseudo-label alignment, node reordering, and ambiguity filtering. Extensive experiments on diverse benchmarks show that our approach consistently outperforms existing methods, especially under severe class imbalance. Our findings offer new theoretical insights and practical tools for robust semi-supervised node classification.

replace Towards Transparent and Efficient Anomaly Detection in Industrial Processes through ExIFFI

Authors: Davide Frizzo, Francesco Borsatti, Alessio Arcudi, Antonio De Moliner, Roberto Oboe, Gian Antonio Susto

Abstract: Anomaly Detection (AD) is crucial in industrial settings to streamline operations by detecting underlying issues. Conventional methods merely label observations as normal or anomalous, lacking crucial insights. In Industry 5.0, interpretable outcomes become desirable to enable users to understand the rational under model decisions. This paper presents the first industrial application of ExIFFI, a recent approach for fast, efficient explanations for the Extended Isolation Forest (EIF) AD method. ExIFFI is tested on three industrial datasets, demonstrating superior explanation effectiveness, computational efficiency and improved raw anomaly detection performances. ExIFFI reaches over then 90\% of average precision on all the benchmarks considered in the study and overperforms state-of-the-art Explainable Artificial Intelligence (XAI) approaches in terms of the feature selection proxy task metric which was specifically introduced to quantitatively evaluate model explanations.

replace Kernel-based Optimally Weighted Conformal Time-Series Prediction

Authors: Jonghyeok Lee, Chen Xu, Yao Xie

Abstract: In this work, we present a novel conformal prediction method for time-series, which we call Kernel-based Optimally Weighted Conformal Prediction Intervals (KOWCPI). Specifically, KOWCPI adapts the classic Reweighted Nadaraya-Watson (RNW) estimator for quantile regression on dependent data and learns optimal data-adaptive weights. Theoretically, we tackle the challenge of establishing a conditional coverage guarantee for non-exchangeable data under strong mixing conditions on the non-conformity scores. We demonstrate the superior performance of KOWCPI on real and synthetic time-series data against state-of-the-art methods, where KOWCPI achieves narrower confidence intervals without losing coverage.

replace Why Rectified Power Unit Networks Fail and How to Improve It: An Effective Field Theory Perspective

Authors: Taeyoung Kim, Myungjoo Kang

Abstract: The Rectified Power Unit (RePU) activation function, a differentiable generalization of the Rectified Linear Unit (ReLU), has shown promise in constructing neural networks due to its smoothness properties. However, deep RePU networks often suffer from critical issues such as vanishing or exploding values during training, rendering them unstable regardless of hyperparameter initialization. Leveraging the perspective of effective field theory, we identify the root causes of these failures and propose the Modified Rectified Power Unit (MRePU) activation function. MRePU addresses RePU's limitations while preserving its advantages, such as differentiability and universal approximation properties. Theoretical analysis demonstrates that MRePU satisfies criticality conditions necessary for stable training, placing it in a distinct universality class. Extensive experiments validate the effectiveness of MRePU, showing significant improvements in training stability and performance across various tasks, including polynomial regression, physics-informed neural networks (PINNs) and real-world vision tasks. Our findings highlight the potential of MRePU as a robust alternative for building deep neural networks.

replace Provable Domain Adaptation for Offline Reinforcement Learning with Limited Samples

Authors: Weiqin Chen, Xinjie Zhang, Sandipan Mishra, Santiago Paternain

Abstract: Offline reinforcement learning (RL) learns effective policies from a static target dataset. The performance of state-of-the-art offline RL algorithms notwithstanding, it relies on the size of the target dataset, and it degrades if limited samples in the target dataset are available, which is often the case in real-world applications. To address this issue, domain adaptation that leverages auxiliary samples from related source datasets (such as simulators) can be beneficial. However, establishing the optimal way to trade off the limited target dataset and the large-but-biased source dataset while ensuring provably theoretical guarantees remains an open challenge. To the best of our knowledge, this paper proposes the first framework that theoretically explores the impact of the weights assigned to each dataset on the performance of offline RL. In particular, we establish performance bounds and the existence of the optimal weight, which can be computed in closed form under simplifying assumptions. We also provide algorithmic guarantees in terms of convergence to a neighborhood of the optimum. Notably, these results depend on the quality of the source dataset and the number of samples in the target dataset. Our empirical results on the well-known Procgen and MuJoCo benchmarks substantiate the theoretical contributions in this work.

replace Disentangled Representation Learning for Parametric Partial Differential Equations

Authors: Ning Liu, Lu Zhang, Tian Gao, Yue Yu

Abstract: Neural operators (NOs) excel at learning mappings between function spaces, serving as efficient forward solution approximators for PDE-governed systems. However, as black-box solvers, they offer limited insight into the underlying physical mechanism, due to the lack of interpretable representations of the physical parameters that drive the system. To tackle this challenge, we propose a new paradigm for learning disentangled representations from NO parameters, thereby effectively solving an inverse problem. Specifically, we introduce DisentangO, a novel hyper-neural operator architecture designed to unveil and disentangle latent physical factors of variation embedded within the black-box neural operator parameters. At the core of DisentangO is a multi-task NO architecture that distills the varying parameters of the governing PDE through a task-wise adaptive layer, alongside a variational autoencoder that disentangles these variations into identifiable latent factors. By learning these disentangled representations, DisentangO not only enhances physical interpretability but also enables more robust generalization across diverse systems. Empirical evaluations across supervised, semi-supervised, and unsupervised learning contexts show that DisentangO effectively extracts meaningful and interpretable latent features, bridging the gap between predictive performance and physical understanding in neural operator frameworks.

replace Understanding Large Language Models in Your Pockets: Performance Study on COTS Mobile Devices

Authors: Jie Xiao, Qianyi Huang, Xu Chen, Chen Tian

Abstract: As large language models (LLMs) increasingly integrate into every aspect of our work and daily lives, there are growing concerns about user privacy, which push the trend toward local deployment of these models. There are a number of lightweight LLMs (e.g., Gemini Nano, LLAMA2 7B) that can run locally on smartphones, providing users with greater control over their personal data. As a rapidly emerging application, we are concerned about their performance on commercial-off-the-shelf mobile devices. To fully understand the current landscape of LLM deployment on mobile platforms, we conduct a comprehensive measurement study on mobile devices. While user experience is the primary concern for end-users, developers focus more on the underlying implementations. Therefore, we evaluate both user-centric metrics-such as token throughput, latency, and response quality-and developer-critical factors, including resource utilization, OS strategies, battery consumption, and launch time. We also provide comprehensive comparisons across the mobile system-on-chips (SoCs) from major vendors, highlighting their performance differences in handling LLM workloads, which may help developers identify and address bottlenecks for mobile LLM applications. We hope that this study can provide insights for both the development of on-device LLMs and the design for future mobile system architecture.

replace Shallow Diffuse: Robust and Invisible Watermarking through Low-Dimensional Subspaces in Diffusion Models

Authors: Wenda Li, Huijie Zhang, Qing Qu

Abstract: The widespread use of AI-generated content from diffusion models has raised significant concerns regarding misinformation and copyright infringement. Watermarking is a crucial technique for identifying these AI-generated images and preventing their misuse. In this paper, we introduce Shallow Diffuse, a new watermarking technique that embeds robust and invisible watermarks into diffusion model outputs. Unlike existing approaches that integrate watermarking throughout the entire diffusion sampling process, Shallow Diffuse decouples these steps by leveraging the presence of a low-dimensional subspace in the image generation process. This method ensures that a substantial portion of the watermark lies in the null space of this subspace, effectively separating it from the image generation process. Our theoretical and empirical analyses show that this decoupling strategy greatly enhances the consistency of data generation and the detectability of the watermark. Extensive experiments further validate that our Shallow Diffuse outperforms existing watermarking methods in terms of robustness and consistency. The codes are released at https://github.com/liwd190019/Shallow-Diffuse.

URLs: https://github.com/liwd190019/Shallow-Diffuse.

replace Federated Learning Clients Clustering with Adaptation to Data Drifts

Authors: Minghao Li (Harvard University), Dmitrii Avdiukhin (Northwestern University), Rana Shahout (Harvard University), Nikita Ivkin (Amazon), Vladimir Braverman (Johns Hopkins University), Minlan Yu (Harvard University)

Abstract: Federated Learning (FL) trains deep models across edge devices without centralizing raw data, preserving user privacy. However, client heterogeneity slows down convergence and limits global model accuracy. Clustered FL (CFL) mitigates this by grouping clients with similar representations and training a separate model for each cluster. In practice, client data evolves over time, a phenomenon we refer to as data drift, which breaks cluster homogeneity and degrades performance. Data drift can take different forms depending on whether changes occur in the output values, the input features, or the relationship between them. We propose FIELDING, a CFL framework for handling diverse types of data drift with low overhead. FIELDING detects drift at individual clients and performs selective re-clustering to balance cluster quality and model performance, while remaining robust to malicious clients and varying levels of heterogeneity. Experiments show that FIELDING improves final model accuracy by 1.9-5.9% and achieves target accuracy 1.16x-2.23x faster than existing state-of-the-art CFL methods.

replace Disentangled Parameter-Efficient Linear Model for Long-Term Time Series Forecasting

Authors: Yuang Zhao, Tianyu Li, Jiadong Chen, Shenrong Ye, Fuxin Jiang, Xiaofeng Gao

Abstract: Long-term Time Series Forecasting (LTSF) is crucial across various domains, but complex deep models like Transformers are often prone to overfitting on extended sequences. Linear Fully Connected models have emerged as a powerful alternative, achieving competitive results with fewer parameters. However, their reliance on a single, monolithic weight matrix leads to quadratic parameter redundancy and an entanglement of temporal and frequential properties. To address this, we propose DiPE-Linear, a novel model that disentangles this monolithic mapping into a sequence of specialized, parameter-efficient modules. DiPE-Linear features three core components: Static Frequential Attention to prioritize critical frequencies, Static Time Attention to focus on key time steps, and Independent Frequential Mapping to independently process frequency components. A Low-rank Weight Sharing policy further enhances efficiency for multivariate data. This disentangled architecture collectively reduces parameter complexity from quadratic to linear and computational complexity to log-linear. Experiments on real-world datasets show that DiPE-Linear delivers state-of-the-art performance with significantly fewer parameters, establishing a new and highly efficient baseline for LTSF. Our code is available at https://github.com/wintertee/DiPE-Linear/

URLs: https://github.com/wintertee/DiPE-Linear/

replace SoK: Blockchain-Based Decentralized AI (DeAI)

Authors: Elizabeth Lui, Rui Sun, Vatsal Shah, Xihan Xiong, Jiahao Sun, Davide Crapis, William Knottenbelt, Zhipeng Wang

Abstract: Centralization enhances the efficiency of Artificial Intelligence (AI) but also introduces critical challenges, including single points of failure, inherent biases, data privacy risks, and scalability limitations. To address these issues, blockchain-based Decentralized Artificial Intelligence (DeAI) has emerged as a promising paradigm that leverages decentralization and transparency to improve the trustworthiness of AI systems. Despite rapid adoption in industry, the academic community lacks a systematic analysis of DeAI's technical foundations, opportunities, and challenges. This work presents the first Systematization of Knowledge (SoK) on DeAI, offering a formal definition, a taxonomy of existing solutions based on the AI lifecycle, and an in-depth investigation of the roles of blockchain in enabling secure and incentive-compatible collaboration. We further review security risks across the DeAI lifecycle and empirically evaluate representative mitigation techniques. Finally, we highlight open research challenges and future directions for advancing blockchain-based DeAI.

replace DeMo: Decoupled Momentum Optimization

Authors: Bowen Peng, Lizhang Chen, Baiyu Su, Jeffrey Quesnelle, Diederik P. Kingma, Qiang Liu

Abstract: Scaling neural network training increasingly depends on synchronous data-parallelism, yet full-precision gradient all-reduce imposes a severe communication bottleneck. We propose Decoupled Momentum Optimization (DeMo), a drop-in replacement for any momentum-based optimizers that significantly reduces the communication bandwidth while maintaining convergence. DeMo (i) decouples local momentum updates, (ii) applies a fast orthonormal transform (e.g., DCT) followed by top-k sparsification, and (iii) reuses the momentum buffer as error feedback via momentum subtraction. This design reduces per-step communication by up to two orders of magnitude with minimal computational overhead. Experiments on 300M and 1B-parameter DeMo language models show DeMo transmits up to 85x less data per GPU than AdamW-DDP while achieving comparable loss and accuracy. DeMo is topology-agnostic and enables training across multi-datacenter or Ethernet-based setups. Code is available at https://github.com/bloc97/DeMo

URLs: https://github.com/bloc97/DeMo

replace Interpretable Generalized Additive Models for Datasets with Missing Values

Authors: Hayden McTavish, Jon Donnelly, Margo Seltzer, Cynthia Rudin

Abstract: Many important datasets contain samples that are missing one or more feature values. Maintaining the interpretability of machine learning models in the presence of such missing data is challenging. Singly or multiply imputing missing values complicates the model's mapping from features to labels. On the other hand, reasoning on indicator variables that represent missingness introduces a potentially large number of additional terms, sacrificing sparsity. We solve these problems with M-GAM, a sparse, generalized, additive modeling approach that incorporates missingness indicators and their interaction terms while maintaining sparsity through l0 regularization. We show that M-GAM provides similar or superior accuracy to prior methods while significantly improving sparsity relative to either imputation or naive inclusion of indicator variables.

replace LEASE: Offline Preference-based Reinforcement Learning with High Sample Efficiency

Authors: Xiao-Yin Liu, Guotao Li, Xiao-Hu Zhou, Zeng-Guang Hou

Abstract: Offline preference-based reinforcement learning (PbRL) provides an effective way to overcome the challenges of designing reward and the high costs of online interaction. However, since labeling preference needs real-time human feedback, acquiring sufficient preference labels is challenging. To solve this, this paper proposes a offLine prEference-bAsed RL with high Sample Efficiency (LEASE) algorithm, where a learned transition model is leveraged to generate unlabeled preference data. Considering the pretrained reward model may generate incorrect labels for unlabeled data, we design an uncertainty-aware mechanism to ensure the performance of reward model, where only high confidence and low variance data are selected. Moreover, we provide the generalization bound of reward model to analyze the factors influencing reward accuracy, and demonstrate that the policy learned by LEASE has theoretical improvement guarantee. The developed theory is based on state-action pair, which can be easily combined with other offline algorithms. The experimental results show that LEASE can achieve comparable performance to baseline under fewer preference data without online interaction.

replace Evaluating Sample Utility for Efficient Data Selection by Mimicking Model Weights

Authors: Tzu-Heng Huang, Manjot Bilkhu, John Cooper, Frederic Sala, Javier Movellan

Abstract: Large-scale web-crawled datasets contain noise, bias, and irrelevant information, necessitating data selection techniques. Existing methods depend on hand-crafted heuristics, downstream datasets, or require expensive influence-based computations -- all of which limit scalability and introduce unwanted data dependencies. To address this, we introduce the Mimic Score, a simple and geometry-based data-quality metric that evaluates utility by measuring alignment between a sample's gradients and a target direction induced by a pre-trained reference model. This leverages readily available model weights, avoids needing validation datasets, and incurs minimal computational overheads. Building on this metric, we propose Grad-Mimic, a two-stage framework that re-weights samples online to accelerate training and aggregates sample utilities offline to construct effective data filters. Empirically, we show that using mimic scores to guide training improves data efficiency, accelerates convergence, yields consistent performance gains across six image datasets, and enhances CLIP models with 20.7% fewer training steps. Additionally, mimic score-based filters augment existing filtering techniques, enabling improved CLIP models trained with 4.7 million fewer samples.

replace Reducing Aleatoric and Epistemic Uncertainty through Multi-modal Data Acquisition

Authors: Arthur Hoarau, Benjamin Quost, S\'ebastien Destercke, Willem Waegeman

Abstract: To generate accurate and reliable predictions, modern AI systems need to combine data from multiple modalities, such as text, images, audio, spreadsheets, and time series. Multi-modal data introduces new opportunities and challenges for disentangling uncertainty: it is commonly assumed in the machine learning community that epistemic uncertainty can be reduced by collecting more data, while aleatoric uncertainty is irreducible. However, this assumption is challenged in modern AI systems when information is obtained from different modalities. This paper introduces an innovative data acquisition framework where uncertainty disentanglement leads to actionable decisions, allowing sampling in two directions: sample size and data modality. The main hypothesis is that aleatoric uncertainty decreases as the number of modalities increases, while epistemic uncertainty decreases by collecting more observations. We provide proof-of-concept implementations on two multi-modal datasets to showcase our data acquisition framework, which combines ideas from active learning, active feature acquisition and uncertainty quantification.

replace CoHiRF: Hierarchical Consensus for Interpretable Clustering Beyond Scalability Limits

Authors: Katia Meziani, Bruno Belucci, Karim Lounici, Vladimir R. Kostic

Abstract: We introduce CoHiRF (Consensus Hierarchical Random Features), a hierarchical consensus framework that enables existing clustering methods to operate beyond their usual computational and memory limits. CoHiRF is a meta-algorithm that operates exclusively on the label assignments produced by a base clustering method, without modifying its objective function, optimization procedure, or geometric assumptions. It repeatedly applies the base method to multiple low-dimensional feature views or stochastic realizations, enforces agreement through consensus, and progressively reduces the problem size via representative-based contraction. Across a diverse set of synthetic and real-world experiments involving centroid-based, kernel-based, density-based, and graph-based methods, we show that CoHiRF can improve robustness to high-dimensional noise, enhance stability under stochastic variability, and enable scalability to regimes where the base method alone is infeasible. We also provide an empirical characterization of when hierarchical consensus is beneficial, highlighting the role of reproducible label relations and their compatibility with representative-based contraction. Beyond flat partitions, CoHiRF produces an explicit Cluster Fusion Hierarchy, offering a multi-resolution and interpretable view of the clustering structure. Together, these results position hierarchical consensus as a practical and flexible tool for large-scale clustering, extending the applicability of existing methods without altering their underlying behavior.

replace From Features to Transformers: Redefining Ranking for Scalable Impact

Authors: Fedor Borisyuk, Lars Hertel, Ganesh Parameswaran, Gaurav Srivastava, Sudarshan Srinivasa Ramanujam, Borja Ocejo, Peng Du, Andrei Akterskii, Neil Daftary, Shao Tang, Daqi Sun, Qiang Charles Xiao, Deepesh Nathani, Mohit Kothari, Yun Dai, Guoyao Li, Aman Gupta

Abstract: We present LiGR, a large-scale ranking framework developed at LinkedIn that brings state-of-the-art transformer-based modeling architectures into production. We introduce a modified transformer architecture that incorporates learned normalization and simultaneous set-wise attention to user history and ranked items. This architecture enables several breakthrough achievements, including: (1) the deprecation of most manually designed feature engineering, outperforming the prior state-of-the-art system using only few features (compared to hundreds in the baseline), (2) validation of the scaling law for ranking systems, showing improved performance with larger models, more training data, and longer context sequences, and (3) simultaneous joint scoring of items in a set-wise manner, leading to automated improvements in diversity. To enable efficient serving of large ranking models, we describe techniques to scale inference effectively using single-pass processing of user history and set-wise attention. We also summarize key insights from various ablation studies and A/B tests, highlighting the most impactful technical approaches.

replace Deep Meta Coordination Graphs for Multi-agent Reinforcement Learning

Authors: Nikunj Gupta, James Zachary Hare, Rajgopal Kannan, Viktor Prasanna

Abstract: This paper presents deep meta coordination graphs (DMCG) for learning cooperative policies in multi-agent reinforcement learning (MARL). Coordination graph formulations encode local interactions and accordingly factorize the joint value function of all agents to improve efficiency in MARL. Through DMCG, we dynamically compose what we refer to as \textit{meta coordination graphs}, to learn a more expressive representation of agent interactions and use them to integrate agent information through graph convolutional networks. The goal is to enable an evolving coordination graph to guide effective coordination in cooperative MARL tasks. The graphs are jointly optimized with agents' value functions to learn to implicitly reason about joint actions, facilitating the end-to-end learning of interaction representations and coordinated policies. We demonstrate that DMCG consistently achieves state-of-the-art coordination performance and sample efficiency on challenging cooperative tasks, outperforming several prior graph-based and non-graph-based MARL baselines. Through several ablations, we also isolate the impact of individual components in DMCG, showing that the observed improvements are due to the meaningful design choices in this approach. We also include an analysis of its computational complexity to discuss its practicality in real-world applications. All codes can be found here: {\color{blue}{https://github.com/Nikunj-Gupta/dmcg-marl}.

URLs: https://github.com/Nikunj-Gupta/dmcg-marl

replace Scalable Back-Propagation-Free Training of Optical Physics-Informed Neural Networks

Authors: Yequan Zhao, Xinling Yu, Xian Xiao, Zhixiong Chen, Ziyue Liu, Geza Kurczveil, Raymond G. Beausoleil, Sijia Liu, Zheng Zhang

Abstract: Physics intelligence and digital twins often require rapid and repeated performance evaluation of various engineering systems (e.g. robots, autonomous vehicles, semiconductor chips) to enable (almost) real-time actions or decision making. This has motivated the development of accelerated partial differential equation (PDE) solvers, in resource-constrained scenarios if the PDE solvers are to be deployed on the edge. Physics-informed neural networks (PINNs) have shown promise in solving high-dimensional PDEs, but the training time on state-of-the-art digital hardware (e.g., GPUs) is still orders-of-magnitude longer than the latency required for enabling real-time decision making. Photonic computing offers a potential solution to address this huge latency gap because of its ultra-high operation speed. However, the lack of photonic memory and the large device sizes prevent training real-size PINNs on photonic chips. This paper proposes a completely back-propagation-free (BP-free) and highly salable framework for training real-size PINNs on silicon photonic platforms. Our approach involves three key innovations: (1) a sparse-grid Stein derivative estimator to avoid the BP in the loss evaluation of a PINN, (2) a dimension-reduced zeroth-order optimization via tensor-train decomposition to achieve better scalability and convergence in BP-free training, and (3) a scalable on-chip photonic PINN training accelerator design using photonic tensor cores. We validate our numerical methods on both low- and high-dimensional PDE benchmarks. Through pre-silicon simulation based on real device parameters, we further demonstrate the significant performance benefit (e.g., real-time training, huge chip area reduction) of our photonic accelerator.

replace The Geometry of Refusal in Large Language Models: Concept Cones and Representational Independence

Authors: Tom Wollschl\"ager, Jannes Elstner, Simon Geisler, Vincent Cohen-Addad, Stephan G\"unnemann, Johannes Gasteiger

Abstract: The safety alignment of large language models (LLMs) can be circumvented through adversarially crafted inputs, yet the mechanisms by which these attacks bypass safety barriers remain poorly understood. Prior work suggests that a single refusal direction in the model's activation space determines whether an LLM refuses a request. In this study, we propose a novel gradient-based approach to representation engineering and use it to identify refusal directions. Contrary to prior work, we uncover multiple independent directions and even multi-dimensional concept cones that mediate refusal. Moreover, we show that orthogonality alone does not imply independence under intervention, motivating the notion of representational independence that accounts for both linear and non-linear effects. Using this framework, we identify mechanistically independent refusal directions. We show that refusal mechanisms in LLMs are governed by complex spatial structures and identify functionally independent directions, confirming that multiple distinct mechanisms drive refusal behavior. Our gradient-based approach uncovers these mechanisms and can further serve as a foundation for future work on understanding LLMs.

replace Remasking Discrete Diffusion Models with Inference-Time Scaling

Authors: Guanghan Wang, Yair Schiff, Subham Sekhar Sahoo, Volodymyr Kuleshov

Abstract: Part of the success of diffusion models stems from their ability to perform iterative refinement, i.e., repeatedly correcting outputs during generation. However, modern masked discrete diffusion lacks this capability: when a token is generated, it cannot be updated again, even when it introduces an error. Here, we address this limitation by introducing the remasking diffusion model (ReMDM) sampler, a method that can be applied to pretrained masked diffusion models in a principled way and that is derived from a discrete diffusion model with a custom remasking backward process. Most interestingly, ReMDM endows discrete diffusion with a form of inference-time compute scaling. By increasing the number of sampling steps, ReMDM generates natural language outputs that approach the quality of autoregressive models, whereas when the computation budget is limited, ReMDM better maintains quality. ReMDM also improves sample quality of masked diffusion models for discretized images, and in scientific domains such as molecule design, ReMDM facilitates diffusion guidance and pushes the Pareto frontier of controllability relative to classical masking and uniform noise diffusion. We provide the code along with a blog post on the project page: https://guanghanwang.com/remdm

URLs: https://guanghanwang.com/remdm

replace Latent Domain Modeling Improves Robustness to Geographic Shifts

Authors: Ruth Crasto, Esther Rolf

Abstract: Geographic distribution shift arises when the distribution of locations on Earth in a training dataset is different from what is seen at inference time. Using standard empirical risk minimization (ERM) in this setting can lead to uneven generalization across different spatially-determined groups of interest such as continents or biomes. The most common approaches to tackling geographic distribution shift apply domain adaptation methods using discrete group labels, ignoring geographic coordinates that are often available as metadata. On the other hand, modeling methods that integrate geographic coordinates have been shown to improve overall performance, but their impact on geographic domain generalization has not been studied. In this work, we propose a general modeling framework for improving robustness to geographic distribution shift. The key idea is to model continuous, latent domain assignment using location encoders and to condition the main task predictor on the jointly-trained latents. On four diverse geo-tagged image datasets with different group splits, we show that instances of our framework achieve significant improvements in worst-group performance compared to existing domain adaptation and location-aware modeling methods. In particular, we achieve new state-of-the-art results on two datasets from the WILDS benchmark.

replace RiskAgent: Synergizing Language Models with Validated Tools for Evidence-Based Risk Prediction

Authors: Fenglin Liu, Jinge Wu, Hongjian Zhou, Xiao Gu, Jiayuan Zhu, Jiazhen Pan, Junde Wu, Soheila Molaei, Anshul Thakur, Lei Clifton, Honghan Wu, David A. Clifton

Abstract: Large Language Models (LLMs) achieve competitive results compared to human experts in medical examinations. However, it remains a challenge to apply LLMs to complex clinical decision-making, which requires a deep understanding of medical knowledge and differs from the standardized, exam-style scenarios commonly used in current efforts. A common approach is to fine-tune LLMs for target tasks, which, however, not only requires substantial data and computational resources but also remains prone to generating `hallucinations'. In this work, we present RiskAgent, which synergizes language models with hundreds of validated clinical decision tools supported by evidence-based medicine, to provide generalizable and faithful recommendations. Our experiments show that RiskAgent not only achieves superior performance on a broad range of clinical risk predictions across diverse scenarios and diseases, but also demonstrates robust generalization in tool learning on the external MedCalc-Bench dataset, as well as in medical reasoning and question answering on three representative benchmarks, MedQA, MedMCQA, and MMLU.

replace Capacity-Aware Inference: Mitigating the Straggler Effect in Mixture of Experts

Authors: Shwai He, Weilin Cai, Jiayi Huang, Ang Li

Abstract: The Mixture of Experts (MoE) is an effective architecture for scaling large language models by leveraging sparse expert activation to balance performance and efficiency. However, under expert parallelism, MoE suffers from inference inefficiencies due to imbalanced token-to-expert assignment, where underloaded experts complete computations early but must wait for overloaded experts, leading to global delays. We define this phenomenon as the \textbf{\textit{Straggler Effect}}, as the most burdened experts dictate the overall inference latency. To address this, we first propose \textit{\textbf{Capacity-Aware Token Drop}}, which enforces expert capacity limits by discarding excess tokens from overloaded experts, effectively reducing load imbalance with minimal performance impact (e.g., $30\%$ speedup with only $0.9\%$ degradation on OLMoE). Next, given the presence of low-load experts remaining well below the capacity threshold, we introduce \textit{\textbf{Capacity-Aware Expanded Drop}}, which allows tokens to include additional local experts in their candidate set before enforcing strict local capacity constraints, thereby improving load balance and enhancing the utilization of underused experts. Extensive experiments on both language and multimodal MoE models demonstrate the effectiveness of our approach, yielding substantial gains in expert utilization, model performance, and inference efficiency, e.g., applying Expanded Drop to Mixtral-8$\times$7B-Instruct yields a {0.2\%} average performance improvement and a {1.85$\times$} inference speedup. The code is released at: https://github.com/CASE-Lab-UMD/Capacity-Aware-MoE.

URLs: https://github.com/CASE-Lab-UMD/Capacity-Aware-MoE.

replace Right Reward Right Time for Federated Learning

Authors: Thanh Linh Nguyen, Dinh Thai Hoang, Diep N. Nguyen, Quoc-Viet Pham

Abstract: Critical learning periods (CLPs) in federated learning (FL) refer to early stages during which low-quality contributions (e.g., sparse training data availability) can permanently impair the performance of the global model owned by the model owner (i.e., a cloud server). However, existing incentive mechanisms exhibit temporal homogeneity, treating all training rounds as equally important, thereby failing to prioritize and attract high-quality contributions during CLPs. This inefficiency is compounded by information asymmetry due to privacy regulations, where the cloud lacks knowledge of client training capabilities, leading to adverse selection. Thus, in this article, we propose a time-aware contract-theoretic incentive framework, named Right Reward Right Time (R3T), to encourage client involvement, especially during CLPs, to maximize the utility of the cloud server. We formulate a cloud utility function that captures the trade-off between the achieved model performance and rewards allocated for clients' contributions, explicitly accounting for client heterogeneity in time and system capabilities, effort, and joining time. Then, we devise a CLP-aware incentive mechanism deriving an optimal contract design that satisfies individual rationality, incentive compatibility, and budget feasibility constraints, motivating rational clients to participate early and contribute efforts. By providing the right reward at the right time, our approach can attract the highest-quality contributions during CLPs. Simulation and proof-of-concept studies show that R3T mitigates information asymmetry, increases cloud utility, and yields superior economic efficiency compared to conventional incentive mechanisms. Our proof-of-concept results demonstrate up to a 47.6% reduction in the total number of clients and up to a 300% improvement in convergence time while achieving competitive test accuracy.

replace Probabilistic Forecasting via Autoregressive Flow Matching

Authors: Ahmed ElGazzar, Marcel van Gerven

Abstract: In this work, we propose FlowTime, a generative model for probabilistic forecasting of multivariate timeseries data. Given historical measurements and optional future covariates, we formulate forecasting as sampling from a learned conditional distribution over future trajectories. Specifically, we decompose the joint distribution of future observations into a sequence of conditional densities, each modeled via a shared flow that transforms a simple base distribution into the next observation distribution, conditioned on observed covariates. To achieve this, we leverage the flow matching (FM) framework, enabling scalable and simulation-free learning of these transformations. By combining this factorization with the FM objective, FlowTime retains the benefits of autoregressive models -- including strong extrapolation performance, compact model size, and well-calibrated uncertainty estimates -- while also capturing complex multi-modal conditional distributions, as seen in modern transport-based generative models. We demonstrate the effectiveness of FlowTime on multiple dynamical systems and real-world forecasting tasks.

replace ASIDE: Architectural Separation of Instructions and Data in Language Models

Authors: Egor Zverev, Evgenii Kortukov, Alexander Panfilov, Alexandra Volkova, Soroush Tabesh, Sebastian Lapuschkin, Wojciech Samek, Christoph H. Lampert

Abstract: Despite their remarkable performance, large language models lack elementary safety features, making them susceptible to numerous malicious attacks. In particular, previous work has identified the absence of an intrinsic separation between instructions and data as the root cause of the success of prompt injection attacks. In this work, we propose a new architectural element, ASIDE, that allows language models to clearly separate instructions and data at the level of token embeddings. ASIDE applies an orthogonal rotation to the embeddings of data tokens, thus creating clearly distinct representations of instructions and data tokens without introducing any additional parameters. As we demonstrate experimentally across a range of models, instruction-tuning LLMs with ASIDE (1) achieves substantially higher instruction-data separation without performance loss and (2) makes the models more robust to prompt injection benchmarks, even without dedicated safety training. Additionally, we provide insights into the mechanism underlying our method through an analysis of the model representations. The source code and training scripts are openly accessible at https://github.com/egozverev/aside.

URLs: https://github.com/egozverev/aside.

replace LogicXGNN: Grounded Logical Rules for Explaining Graph Neural Networks

Authors: Chuqin Geng, Ziyu Zhao, Zhaoyue Wang, Haolin Ye, Yuhe Jiang, Xujie Si

Abstract: Existing rule-based explanations for Graph Neural Networks (GNNs) provide global interpretability but often optimize and assess fidelity in an intermediate, uninterpretable concept space, overlooking grounding quality for end users in the final subgraph explanations. This gap yields explanations that may appear faithful yet be unreliable in practice. To this end, we propose LogicXGNN, a post-hoc framework that constructs logical rules over reliable predicates explicitly designed to capture the GNN's message-passing structure, thereby ensuring effective grounding. We further introduce data-grounded fidelity ($\textit{Fid}_{\mathcal{D}}$), a realistic metric that evaluates explanations in their final-graph form, along with complementary utility metrics such as coverage and validity. Across extensive experiments, LogicXGNN improves $\textit{Fid}_{\mathcal{D}}$ by over 20% on average relative to state-of-the-art methods while being 10-100 $\times$ faster. With strong scalability and utility performance, LogicXGNN produces explanations that are faithful to the model's logic and reliably grounded in observable data. Our code is available at https://github.com/allengeng123/LogicXGNN/.

URLs: https://github.com/allengeng123/LogicXGNN/.

replace Geometric Reasoning in the Embedding Space

Authors: Jan H\r{u}la, David Moj\v{z}\'i\v{s}ek, Ji\v{r}\'i Jane\v{c}ek, David Herel, Mikol\'a\v{s} Janota

Abstract: In this contribution, we demonstrate that Graph Neural Networks and Transformers can learn to reason about geometric constraints. We train them to predict spatial position of points in a discrete 2D grid from a set of constraints that uniquely describe hidden figures containing these points. Both models are able to predict the position of points and interestingly, they form the hidden figures described by the input constraints in the embedding space during the reasoning process. Our analysis shows that both models recover the grid structure during training so that the embeddings corresponding to the points within the grid organize themselves in a 2D subspace and reflect the neighborhood structure of the grid. We also show that the Graph Neural Network we design for the task performs significantly better than the Transformer and is also easier to scale.

replace Spatiotemporal Attention-Augmented Inverse Reinforcement Learning for Multi-Agent Task Allocation

Authors: Huilin Yin, Zhikun Yang, Linchuan Zhang, Daniel Watzenig

Abstract: Adversarial inverse reinforcement learning (IRL) for multi-agent task allocation (MATA) is challenged by non-stationary interactions and high-dimensional coordination. Unconstrained reward inference in these settings often leads to high variance and poor generalization. We propose an attention-structured adversarial IRL framework that constrains reward inference via spatiotemporal representation learning. Our method employs multi-head self-attention (MHSA) for long-range temporal dependencies and graph attention networks (GAT) for agent-task relational structures. We formulate reward inference as a low-capacity, adaptive linear transformation of the environment reward, ensuring stable and interpretable guidance. This framework decouples reward inference from policy learning and optimizes the reward model adversarially. Experiments on benchmark MATA scenarios show that our approach outperforms representative MARL baselines in convergence speed, cumulative rewards, and spatial efficiency. Results demonstrate that attention-guided, capacity-constrained reward inference is a scalable and effective mechanism for stabilizing adversarial IRL in complex multi-agent systems.

replace Federated Hierarchical Reinforcement Learning for Adaptive Traffic Signal Control

Authors: Yongjie Fu, Lingyun Zhong, Zifan Li, Xuan Di

Abstract: Multi-agent reinforcement learning (MARL) has shown promise for adaptive traffic signal control (ATSC), enabling multiple intersections to coordinate signal timings in real time. However, in large-scale settings, MARL faces constraints due to extensive data sharing and communication requirements. Federated learning (FL) mitigates these challenges by training shared models without directly exchanging raw data, yet traditional FL methods such as FedAvg struggle with highly heterogeneous intersections. Different intersections exhibit varying traffic patterns, demands, and road structures, so performing FedAvg across all agents is inefficient. To address this gap, we propose Hierarchical Federated Reinforcement Learning (HFRL) for ATSC. HFRL employs clustering-based or optimization-based techniques to dynamically group intersections and perform FedAvg independently within groups of intersections with similar characteristics, enabling more effective coordination and scalability than standard FedAvg.Our experiments on synthetic and real-world traffic networks demonstrate that HFRL consistently outperforms decentralized and standard federated RL approaches, and achieves competitive or superior performance compared to centralized RL as network scale and heterogeneity increase, particularly in real-world settings. The method also identifies suitable grouping patterns based on network structure or traffic demand, resulting in a more robust framework for distributed, heterogeneous systems.

replace Think2SQL: Reinforce LLM Reasoning Capabilities for Text2SQL

Authors: Simone Papicchio, Simone Rossi, Luca Cagliero, Paolo Papotti

Abstract: While Large Language Models (LLMs) have advanced the state-of-the-art in Text-to-SQL, robust reasoning in complex, multi-table environments remains a bottleneck for parameter-efficient models. This paper presents a systematic empirical study on injecting reasoning capabilities into Text-to-SQL through the lens of Reinforcement Learning with Verifiable Rewards (RLVR). We uncover a critical interplay between reward density, advantage scaling, and model capacity. Our analysis yields four primary insights. First, we propose a novel execution-guided dense reward function that significantly outperforms binary signals and existing state-of-the-art rewards by providing granular feedback at the instance level. Second, we analyze the mechanics of advantage calculation, demonstrating that while large models thrive on sparse signals with aggressive advantage scaling, smaller models require dense rewards and conservative scaling to improve Text-to-SQL performance. Third, we evaluate the impact of cold start, showing that distillation does not always improve RLVR performance and that supervised, fine-tuned models are prone to distributional mimicry. Fourth, we map the Pareto frontier of training efficiency, providing insights for optimizing Text-to-SQL reasoning under computational constraints. Our findings culminate in the Think2SQL family: our 4B-parameter model demonstrates reasoning capabilities competitive with state-of-the-art models such as o3. We release our models, datasets, and code to create a blueprint for RLVR optimization in Text-to-SQL at https://anonymous.4open.science/r/Think2SQL-3B7F.

URLs: https://anonymous.4open.science/r/Think2SQL-3B7F.

replace Evaluating Autoencoders for Parametric and Invertible Multidimensional Projections

Authors: Frederik L. Dennig, Nina Geyer, Daniela Blumberg, Yannick Metz, Daniel A. Keim

Abstract: Recently, neural networks have gained attention for creating parametric and invertible multidimensional data projections. Parametric projections allow for embedding previously unseen data without recomputing the projection as a whole, while invertible projections enable the generation of new data points. However, these properties have never been explored simultaneously for arbitrary projection methods. We evaluate three autoencoder (AE) architectures for creating parametric and invertible projections. Based on a given projection, we train AEs to learn a mapping into 2D space and an inverse mapping into the original space. We perform a quantitative and qualitative comparison on four datasets of varying dimensionality and pattern complexity using t-SNE. Our results indicate that AEs with a customized loss function can create smoother parametric and inverse projections than feed-forward neural networks while giving users control over the strength of the smoothing effect.

replace Toward Efficient Exploration by Large Language Model Agents

Authors: Dilip Arumugam, Thomas L. Griffiths

Abstract: A burgeoning area within reinforcement learning (RL) is the design of sequential decision-making agents centered around large language models (LLMs). While autonomous decision-making agents powered by modern LLMs could facilitate numerous real-world applications, such successes demand agents that are capable of data-efficient RL. One key obstacle to achieving data efficiency in RL is exploration, a challenge that we demonstrate many recent proposals for LLM agent designs struggle to contend with. Meanwhile, classic algorithms from the RL literature known to gracefully address exploration require technical machinery that can be challenging to operationalize in purely natural language settings. In this work, rather than relying on finetuning or in-context learning to coax LLMs into implicitly imitating a RL algorithm, we illustrate how LLMs can be used to explicitly implement an existing RL algorithm (Posterior Sampling for Reinforcement Learning) whose capacity for statistically-efficient exploration is already well-studied. We offer empirical results demonstrating how our LLM-based implementation of a known, data-efficient RL algorithm can be considerably more effective in natural language tasks that demand prudent exploration.

replace Efficient Attention via Pre-Scoring: Prioritizing Informative Keys in Transformers

Authors: Zhexiang Li, Haoyu Wang, Yutong Bao, David Woodruff

Abstract: Efficient attention mechanisms enable long-context transformers but often miss globally important tokens, degrading modeling quality. We introduce a pre-scoring framework that assigns a query-independent global importance prior to keys before applying hierarchical approximate attention. Using clustering-based or leverage-style scoring, pre-scoring identifies structurally informative keys and restricts computation to this prioritized subset. Integrated with HyperAttention, pre-scoring substantially improves approximation quality on long-context language modeling: on ChatGLM with 131k-token contexts, perplexity decreases from 12.0 to 9.5 under a fixed interaction budget while retaining subquadratic efficiency. Clustering-based scoring consistently outperforms leverage-based selection under identical key budgets. Beyond language, replacing self-attention in Vision Transformers preserves most of the baseline accuracy, showing that the approach generalizes across modalities. We provide structural guarantees under a planted-subspace model, showing that clustering recovers the same heavy-key sets as leverage-based methods. Overall, pre-scoring improves the efficiency-accuracy trade-off of approximate attention by better prioritizing informative keys without sacrificing scalability.

replace Dist2ill: Distributional Distillation for One-Pass Uncertainty Estimation in Large Language Models

Authors: Yicong Zhao, King Yeung Tsang, Harshil Vejendla, Haizhou Shi, Zhuohang Li, Zhigang Hua, Qi Xu, Tunyu Zhang, Yi Wang, Ligong Han, Bradley A. Malin, Hao Wang

Abstract: Large Language Models (LLMs) often exhibit misalignment between the quality of their generated responses and the confidence estimates they assign to them. Bayesian treatments, such as marginalizing over a reliable weight posterior or over the space of reasoning traces, provide an effective remedy, but incur substantial computational overhead due to repeated sampling at test time. To enable accurate uncertainty estimation in a single forward pass, we propose a novel distributional distillation framework (Dist2ill) that trains an LLM to produce multiple diverse reasoning paths within one inference pass, while using a lightweight parametric module to approximate empirical confidence scores derived from the sampling distribution. Extensive experiments demonstrate that Dist2ill preserves reasoning diversity and achieves state-of-the-art uncertainty estimation, substantially improving Expected Calibration Error (ECE) and Negative Log-Likelihood (NLL), while remaining computationally efficient.

replace Transformers as Unsupervised Learning Algorithms: A study on Gaussian Mixtures

Authors: Zhiheng Chen, Ruofan Wu, Guanhua Fang

Abstract: The transformer architecture has demonstrated remarkable capabilities in modern artificial intelligence, among which the capability of implicitly learning an internal model during inference time is widely believed to play a key role in the under standing of pre-trained large language models. However, most recent works have been focusing on studying supervised learning topics such as in-context learning, leaving the field of unsupervised learning largely unexplored. This paper investigates the capabilities of transformers in solving Gaussian Mixture Models (GMMs), a fundamental unsupervised learning problem through the lens of statistical estimation. We propose a transformer-based learning framework called TGMM that simultaneously learns to solve multiple GMM tasks using a shared transformer backbone. The learned models are empirically demonstrated to effectively mitigate the limitations of classical methods such as Expectation-Maximization (EM) or spectral algorithms, at the same time exhibit reasonable robustness to distribution shifts. Theoretically, we prove that transformers can approximate both the EM algorithm and a core component of spectral methods (cubic tensor power iterations). These results bridge the gap between practical success and theoretical understanding, positioning transformers as versatile tools for unsupervised learning.

replace Parallel Layer Normalization for Universal Approximation

Authors: Yunhao Ni, Yuxin Guo, Yuhe Liu, Wenxin Sun, Jie Luo, Wenjun Wu, Lei Huang

Abstract: This paper studies the approximation capabilities of neural networks that combine layer normalization (LN) with linear layers. We prove that networks consisting of two linear layers with parallel layer normalizations (PLNs) inserted between them (referred to as PLN-Nets) achieve universal approximation, whereas architectures that use only standard LN exhibit strictly limited expressive power.We further analyze approximation rates of shallow and deep PLN-Nets under the $L^\infty$ norm as well as in Sobolev norms. Our analysis extends beyond LN to RMSNorm, and from standard MLPs to position-wise feed-forward networks, the core building blocks used in RNNs and Transformers.Finally, we provide empirical experiments to explore other possible potentials of PLN-Nets.

replace FlashKAT: Understanding and Addressing Performance Bottlenecks in the Kolmogorov-Arnold Transformer

Authors: Matthew Raffel, Lizhong Chen

Abstract: The Kolmogorov-Arnold Network (KAN) has been gaining popularity as an alternative to the multilayer perceptron (MLP) due to its greater expressiveness and interpretability. Even so, KAN suffers from training instability and being orders of magnitude slower due to its increased computational cost, limiting its applicability to large-scale tasks. Recently, the Kolmogorov-Arnold Transformer (KAT) has been proposed, achieving FLOPs comparable to traditional Transformer models with MLPs by leveraging Group-Rational KAN (GR-KAN). Unfortunately, despite the comparable FLOPs, our testing shows that KAT remains 123x slower during training, indicating that there are other performance bottlenecks beyond FLOPs. In this paper, we conduct a series of experiments to understand the root cause of the slowdown in KAT. We uncover that the slowdown can be isolated to memory stalls, linked more specifically to inefficient gradient accumulations in the backward pass of GR-KAN. To address this memory bottleneck, we propose FlashKAT, which minimizes accesses to slow memory and the usage of atomic adds through a restructured kernel. Evaluations show that FlashKAT achieves up to an 86.5x training speedup over state-of-the-art KAT while reducing rounding errors in gradient computation.

replace Safety Subspaces are Not Linearly Distinct: A Fine-Tuning Case Study

Authors: Kaustubh Ponkshe, Shaan Shah, Raghav Singhal, Praneeth Vepakomma

Abstract: Large Language Models (LLMs) rely on safety alignment to produce socially acceptable responses. However, this behavior is known to be brittle: further fine-tuning, even on benign or lightly contaminated data, can degrade safety and reintroduce harmful behaviors. A growing body of work suggests that alignment may correspond to identifiable directions in weight space, forming subspaces that could, in principle, be isolated or preserved to defend against misalignment. In this work, we conduct a comprehensive empirical study of this perspective. We examine whether safety-relevant behavior is concentrated in specific linear subspaces, whether it can be separated from general-purpose learning, and whether harmfulness arises from distinguishable patterns in activations. Across both weight and activation spaces, our findings are consistent: subspaces that amplify safe behaviors also amplify useful ones, and prompts with different safety implications activate overlapping representations. Rather than residing in distinct directions, we show that safety is highly entangled with the general learning components of the model. This suggests that subspace-based defenses face fundamental limitations and underscores the need for alternative strategies to preserve safety under continued training. We corroborate these findings with multiple experiments on five open-source LLMs from the Llama and Qwen families. Our code is publicly available at: https://github.com/CERT-Lab/safety-subspaces.

URLs: https://github.com/CERT-Lab/safety-subspaces.

replace PiFlow: Principle-Aware Scientific Discovery with Multi-Agent Collaboration

Authors: Yingming Pu, Tao Lin, Hongyu Chen

Abstract: Large Language Model (LLM)-based multi-agent systems (MAS) demonstrate remarkable potential for scientific discovery. Existing approaches, however, often automate scientific discovery using predefined workflows that lack rationality constraints. This often leads to aimless hypothesizing and a failure to consistently link hypotheses with evidence, thereby hindering the systematic reduction of uncertainty. Overcoming these limitations fundamentally requires a principled approach to exploration. We introduce PiFlow, an information-theoretical framework, treating automated scientific discovery as a structured uncertainty reduction problem guided by principles (e.g., scientific laws). Extensive evaluations across three distinct scientific domains demonstrate that PiFlow (I) improves discovery efficiency by 31.18%~41.73% and solution quality by 12.47%~31.72% against state-of-the-art methods, (II) delivers a 5.6x speedup in time-to-solution while reducing token consumption by up to 27% compared to vanilla agents, and (III) serves as a Plug-and-Play module that generalizes on existing agent architecture. Overall, PiFlow establishes a novel paradigm shift in highly efficient agentic scientific discovery, paving the way for more robust and accelerated AI-driven research.

replace Solving General-Utility Markov Decision Processes in the Single-Trial Regime with Online Planning

Authors: Pedro P. Santos, Alberto Sardinha, Francisco S. Melo

Abstract: In this work, we contribute the first approach to solve infinite-horizon discounted general-utility Markov decision processes (GUMDPs) in the single-trial regime, i.e., when the agent's performance is evaluated based on a single trajectory. First, we provide some fundamental results regarding policy optimization in the single-trial regime, investigating which class of policies suffices for optimality, casting our problem as a particular MDP that is equivalent to our original problem, as well as studying the computational hardness of policy optimization in the single-trial regime. Second, we show how we can leverage online planning techniques, in particular a Monte-Carlo tree search algorithm, to solve GUMDPs in the single-trial regime. Third, we provide experimental results showcasing the superior performance of our approach in comparison to relevant baselines.

replace Meta-reinforcement learning with minimum attention

Authors: Shashank Gupta, Pilhwa Lee

Abstract: Minimum attention applies the least action principle in the changes of control concerning state and time, first proposed by Brockett. The involved regularization is highly relevant in emulating biological control, such as motor learning. We apply minimum attention in reinforcement learning (RL) as part of the rewards and investigate its connection to meta-learning and stabilization. Specifically, model-based meta-learning with minimum attention is explored in high-dimensional nonlinear dynamics. Ensemble-based model learning and gradient-based meta-policy learning are alternately performed. Empirically, the minimum attention does show outperforming competence in comparison to the state-of-the-art algorithms of model-free and model-based RL, i.e., fast adaptation in few shots and variance reduction from the perturbations of the model and environment. Furthermore, the minimum attention demonstrates an improvement in energy efficiency.

replace Out of the Shadows: Exploring a Latent Space for Neural Network Verification

Authors: Lukas Koller, Tobias Ladner, Matthias Althoff

Abstract: Neural networks are ubiquitous. However, they are often sensitive to small input changes. Hence, to prevent unexpected behavior in safety-critical applications, their formal verification -- a notoriously hard problem -- is necessary. Many state-of-the-art verification algorithms use reachability analysis or abstract interpretation to enclose the set of possible outputs of a neural network. Often, the verification is inconclusive due to the conservatism of the enclosure. To address this problem, we propose a novel specification-driven input refinement procedure, i.e., we iteratively enclose the preimage of a neural network for all unsafe outputs to reduce the set of possible inputs to only enclose the unsafe ones. For that, we transfer output specifications to the input space by exploiting a latent space, which is an artifact of the propagation of a projection-based set representation through a neural network. A projection-based set representation, e.g., a zonotope, is a "shadow" of a higher-dimensional set -- a latent space -- that does not change during a set propagation through a neural network. Hence, the input set and the output enclosure are "shadows" of the same latent space that we can use to transfer constraints. We present an efficient verification tool for neural networks that uses our iterative refinement to significantly reduce the number of subproblems in a branch-and-bound procedure. Using zonotopes as a set representation, unlike many other state-of-the-art approaches, our approach can be realized by only using matrix operations, which enables a significant speed-up through efficient GPU acceleration. We demonstrate that our tool achieves competitive performance compared to the top-ranking tools of the international neural network verification competition.

replace Automatic and Structure-Aware Sparsification of Hybrid Neural ODEs

Authors: Bob Junyi Zou, Lu Tian

Abstract: Hybrid neural ordinary differential equations (neural ODEs) integrate mechanistic models with neural ODEs, offering strong inductive bias and flexibility, and are particularly advantageous in data-scarce healthcare settings. However, excessive latent states and interactions from mechanistic models can lead to training inefficiency and over-fitting, limiting practical effectiveness of hybrid neural ODEs. In response, we propose a new hybrid pipeline for automatic state selection and structure optimization in mechanistic neural ODEs, combining domain-informed graph modifications with data-driven regularization to sparsify the model for improving predictive performance and stability while retaining mechanistic plausibility. Experiments on synthetic and real-world data show improved predictive performance and robustness with desired sparsity, establishing an effective solution for hybrid model reduction in healthcare applications.

replace Efficient Policy Optimization in Robust Constrained MDPs with Iteration Complexity Guarantees

Authors: Sourav Ganguly, Kishan Panaganti, Arnob Ghosh, Adam Wierman

Abstract: Constrained decision-making is essential for designing safe policies in real-world control systems, yet simulated environments often fail to capture real-world adversities. We consider the problem of learning a policy that will maximize the cumulative reward while satisfying a constraint, even when there is a mismatch between the real model and an accessible simulator/nominal model. In particular, we consider the robust constrained Markov decision problem (RCMDP) where an agent needs to maximize the reward and satisfy the constraint against the worst possible stochastic model under the uncertainty set centered around an unknown nominal model. Primal-dual methods, effective for standard constrained MDP (CMDP), are not applicable here because of the lack of the strong duality property. Further, one cannot apply the standard robust value-iteration based approach on the composite value function either as the worst case models may be different for the reward value function and the constraint value function. We propose a novel technique that effectively minimizes the constraint value function--to satisfy the constraints; on the other hand, when all the constraints are satisfied, it can simply maximize the robust reward value function. We prove that such an algorithm finds a policy with at most $\epsilon$ sub-optimality and feasible policy after $O(\epsilon^{-2})$ iterations. In contrast to the state-of-the-art method, we do not need to employ a binary search, thus, we reduce the computation time by at least 4x for smaller value of discount factor ($\gamma$) and by at least 6x for larger value of $\gamma$.

replace Understanding Generalization in Diffusion Distillation via Probability Flow Distance

Authors: Huijie Zhang, Zijian Huang, Siyi Chen, Jinfan Zhou, Zekai Zhang, Peng Wang, Qing Qu

Abstract: Diffusion distillation provides an effective approach for learning lightweight and few-steps diffusion models with efficient generation. However, evaluating their generalization remains challenging: theoretical metrics are often impractical for high-dimensional data, while no practical metrics rigorously measure generalization. In this work, we bridge this gap by introducing probability flow distance (\texttt{PFD}), a theoretically grounded and computationally efficient metric to measure generalization. Specifically, \texttt{PFD} quantifies the distance between distributions by comparing their noise-to-data mappings induced by the probability flow ODE. Using \texttt{PFD} under the diffusion distillation setting, we empirically uncover several key generalization behaviors, including: (1) quantitative scaling behavior from memorization to generalization, (2) epoch-wise double descent training dynamics, and (3) bias-variance decomposition. Beyond these insights, our work lays a foundation for generalization studies in diffusion distillation and bridges them with diffusion training.

replace Who Gets Credit or Blame? Attributing Accountability in Modern AI Systems

Authors: Shichang Zhang, Hongzhe Du, Jiaqi W. Ma, Himabindu Lakkaraju

Abstract: Modern AI systems are typically developed through multiple stages-pretraining, fine-tuning rounds, and subsequent adaptation or alignment, where each stage builds on the previous ones and updates the model in distinct ways. This raises a critical question of accountability: when a deployed model succeeds or fails, which stage is responsible, and to what extent? We pose the accountability attribution problem for tracing model behavior back to specific stages of the model development process. To address this challenge, we propose a general framework that answers counterfactual questions about stage effects: how would the model's behavior have changed if the updates from a particular stage had not occurred? Within this framework, we introduce estimators that efficiently quantify stage effects without retraining the model, accounting for both the data and key aspects of model optimization dynamics, including learning rate schedules, momentum, and weight decay. We demonstrate that our approach successfully quantifies the accountability of each stage to the model's behavior. Based on the attribution results, our method can identify and remove spurious correlations learned during image classification and text toxicity detection tasks that were developed across multiple stages. Our approach provides a practical tool for model analysis and represents a significant step toward more accountable AI development.

replace Rectified Flows for Fast Multiscale Fluid Flow Modeling

Authors: Victor Armegioiu, Yannick Ramic, Siddhartha Mishra

Abstract: Statistical surrogate modeling of fluid flows is hard because dynamics are multiscale and highly sensitive to initial conditions. Conditional diffusion surrogates can be accurate, but usually need hundreds of stochastic sampling steps. We propose a rectified-flow surrogate that learns a time-dependent conditional velocity field transporting input-to-output laws along near-straight trajectories. Inference is then a deterministic ODE solve, making each function evaluation more informative: on multiscale 2D benchmarks, we match diffusion-class posterior statistics with only (8) ODE steps versus (\ge 128) for score-based diffusion. Theoretically, we give a law-level analysis for conditional PDE forecasting. We (i) connect one-point Wasserstein field metrics to the (k=1) correlation-marginal perspective in statistical solutions, (ii) derive a one-step error split into a **coverage** term (high-frequency tail, controlled by structure functions/spectral decay) and a **fit** term (controlled by the training objective), and (iii) show that rectification-time **straightness** controls ODE local truncation error, yielding practical step-size/step-count guidance. Motivated by this, we introduce a curvature-aware sampler that uses an EMA straightness proxy to adapt blending and step sizes at inference. Across incompressible and compressible multiscale 2D flows, it matches diffusion baselines in Wasserstein statistics and spectra, preserves fine-scale structure beyond MSE surrogates, and significantly reduces inference cost.

replace Neural MJD: Neural Non-Stationary Merton Jump Diffusion for Time Series Prediction

Authors: Yuanpei Gao, Qi Yan, Yan Leng, Renjie Liao

Abstract: While deep learning methods have achieved strong performance in time series prediction, their black-box nature and inability to explicitly model underlying stochastic processes often limit their generalization to non-stationary data, especially in the presence of abrupt changes. In this work, we introduce Neural MJD, a neural network based non-stationary Merton jump diffusion (MJD) model. Our model explicitly formulates forecasting as a stochastic differential equation (SDE) simulation problem, combining a time-inhomogeneous It\^o diffusion to capture non-stationary stochastic dynamics with a time-inhomogeneous compound Poisson process to model abrupt jumps. To enable tractable learning, we introduce a likelihood truncation mechanism that caps the number of jumps within small time intervals and provide a theoretical error bound for this approximation. Additionally, we propose an Euler-Maruyama with restart solver, which achieves a provably lower error bound in estimating expected states and reduced variance compared to the standard solver. Experiments on both synthetic and real-world datasets demonstrate that Neural MJD consistently outperforms state-of-the-art deep learning and statistical learning methods.

replace AlphaSteer: Learning Refusal Steering with Principled Null-Space Constraint

Authors: Leheng Sheng, Changshuo Shen, Weixiang Zhao, Junfeng Fang, Xiaohao Liu, Zhenkai Liang, Xiang Wang, An Zhang, Tat-Seng Chua

Abstract: As LLMs are increasingly deployed in real-world applications, ensuring their ability to refuse malicious prompts, especially jailbreak attacks, is essential for safe and reliable use. Recently, activation steering has emerged as an effective approach for enhancing LLM safety by adding a refusal direction vector to internal activations of LLMs during inference, which will further induce the refusal behaviors of LLMs. However, indiscriminately applying activation steering fundamentally suffers from the trade-off between safety and utility, since the same steering vector can also lead to over-refusal and degraded performance on benign prompts. Although prior efforts, such as vector calibration and conditional steering, have attempted to mitigate this trade-off, their lack of theoretical grounding limits their robustness and effectiveness. To better address the trade-off between safety and utility, we present a theoretically grounded and empirically effective activation steering method called AlphaSteer. Specifically, it considers activation steering as a learnable process with two principled learning objectives: utility preservation and safety enhancement. For utility preservation, it learns to construct a nearly zero vector for steering benign data, with the null-space constraints. For safety enhancement, it learns to construct a refusal direction vector for steering malicious data, with the help of linear regression. Experiments across multiple jailbreak attacks and utility benchmarks demonstrate the effectiveness of AlphaSteer, which significantly improves the safety of LLMs without compromising general capabilities. Our codes are available at https://github.com/AlphaLab-USTC/AlphaSteer.

URLs: https://github.com/AlphaLab-USTC/AlphaSteer.

replace Can We Infer Confidential Properties of Training Data from LLMs?

Authors: Pengrun Huang, Chhavi Yadav, Kamalika Chaudhuri, Ruihan Wu

Abstract: Large language models (LLMs) are increasingly fine-tuned on domain-specific datasets to support applications in fields such as healthcare, finance, and law. These fine-tuning datasets often have sensitive and confidential dataset-level properties -- such as patient demographics or disease prevalence -- that are not intended to be revealed. While prior work has studied property inference attacks on discriminative models (e.g., image classification models) and generative models (e.g., GANs for image data), it remains unclear if such attacks transfer to LLMs. In this work, we introduce PropInfer, a benchmark task for evaluating property inference in LLMs under two fine-tuning paradigms: question-answering and chat-completion. Built on the ChatDoctor dataset, our benchmark includes a range of property types and task configurations. We further propose two tailored attacks: a prompt-based generation attack and a shadow-model attack leveraging word frequency signals. Empirical evaluations across multiple pretrained LLMs show the success of our attacks, revealing a previously unrecognized vulnerability in LLMs.

replace SIMSHIFT: A Benchmark for Adapting Neural Surrogates to Distribution Shifts

Authors: Paul Setinek, Gianluca Galletti, Thomas Gross, Dominik Schn\"urer, Johannes Brandstetter, Werner Zellinger

Abstract: Neural surrogates for Partial Differential Equations (PDEs) often suffer significant performance degradation when evaluated on problem configurations outside their training distribution, such as new initial conditions or structural dimensions. While Unsupervised Domain Adaptation (UDA) techniques have been widely used in vision and language to generalize across domains without additional labeled data, their application to complex engineering simulations remains largely unexplored. In this work, we address this gap through two focused contributions. First, we introduce SIMSHIFT, a novel benchmark dataset and evaluation suite composed of four industrial simulation tasks spanning diverse processes and physics: hot rolling, sheet metal forming, electric motor design and heatsink design. Second, we extend established UDA methods to state-of-the-art neural surrogates and systematically evaluate them. Extensive experiments on SIMSHIFT highlight the challenges of out-of-distribution neural surrogate modeling, demonstrate the potential of UDA in simulation, and reveal open problems in achieving robust neural surrogates under distribution shifts in industrially relevant scenarios. Our codebase is available at https://github.com/psetinek/simshift

URLs: https://github.com/psetinek/simshift

replace Interpretability and Generalization Bounds for Learning Spatial Physics

Authors: Alejandro Francisco Queiruga, Theo Gutman-Solo, Shuai Jiang

Abstract: While there are many applications of ML to scientific problems that look promising, visuals can be deceiving. Using numerical analysis techniques, we rigorously quantify the accuracy, convergence rates, and generalization bounds of certain ML models applied to linear differential equations for parameter discovery or solution finding. Beyond the quantity and discretization of data, we identify that the function space of the data is critical to the generalization of the model. A similar lack of generalization is empirically demonstrated for commonly used models, including physics-specific techniques. Counterintuitively, we find that different classes of models can exhibit opposing generalization behaviors. Based on our theoretical analysis, we also introduce a new mechanistic interpretability lens on scientific models whereby Green's function representations can be extracted from the weights of black-box models. Our results inform a new cross-validation technique for measuring generalization in physical systems, which can serve as a benchmark.

replace These Are Not All the Features You Are Looking For: A Fundamental Bottleneck in Supervised Pretraining

Authors: Xingyu Alice Yang, Jianyu Zhang, L\'eon Bottou

Abstract: Transfer learning is widely used to adapt large pretrained models to new tasks with only a small amount of new data. However, a challenge persists -- the features from the original task often do not fully cover what is needed for unseen data, especially when the relatedness of tasks is not clear. Since deep learning models tend to learn very sparse representations, they retain only the minimal features required for the initial training while discarding potentially ones for downstream transfer. A theoretical framework developed in this work demonstrates that such pretraining captures inconsistent aspects of the data distribution, therefore, inducing transfer bias. To address this limitation, we propose an inexpensive ensembling strategy that aggregates multiple models to generate richer feature representations. On ResNet, this approach yields a $9\%$ improvement in transfer accuracy without incurring extra pretraining cost. We also present empirical evidence from a range of deep learning studies, confirming that the phenomenon is pervasive across modern deep learning architectures. These results suggests that relying solely on large pretrained networks is not always the most effective way to improve model generalization. Instead, fostering richer, more diverse representations -- e.g. - through model ensembles -- can substantially enhance transfer learning performance.

replace mTSBench: Benchmarking Multivariate Time Series Anomaly Detection and Model Selection at Scale

Authors: Xiaona Zhou, Constantin Brif, Ismini Lourentzou

Abstract: Anomaly detection in multivariate time series is essential across domains such as healthcare, cybersecurity, and industrial monitoring, yet remains fundamentally challenging due to high-dimensional dependencies, the presence of cross-correlations between time-dependent variables, and the scarcity of labeled anomalies. We introduce mTSBench, the largest benchmark to date for multivariate time series anomaly detection and model selection, consisting of 344 labeled time series across 19 datasets from a wide range of application domains. We comprehensively evaluate 24 anomaly detectors, including the only two publicly available large language model-based methods for multivariate time series. Consistent with prior findings, we observe that no single detector dominates across datasets, motivating the need for effective model selection. We benchmark three recent model selection methods and find that even the strongest of them remain far from optimal. Our results highlight the outstanding need for robust, generalizable selection strategies. We open-source the benchmark at https://plan-lab.github.io/mtsbench to encourage future research.

URLs: https://plan-lab.github.io/mtsbench

replace Theoretical Modeling of Large Language Model Self-Improvement Training Dynamics Through Solver-Verifier Gap

Authors: Yifan Sun, Yushan Liang, Zhen Zhang, Xin Liu, Jiaye Teng

Abstract: Self-improvement is a significant techniques within the realm of large language model (LLM), aiming to enhance the LLM performance without relying on external data. Despite its significance, generally how LLM performances evolve during the self-improvement process remains underexplored. In this paper, we theoretically model the training dynamics of self-improvement via the concept of solver-verifier gap. This is inspired by the conjecture that the performance enhancement of self-improvement stems from the gap between LLM's solver capability and verifier capability. Based on the theoretical framework, we further show how to model the entire training trajectory. This framework allows quantifying the capability limit of self-improvement by fitting the theoretical model to the experiment results. We validate the effectiveness of the theoretical framework on various LLMs and datasets. Beyond self-improvement, we extend our analysis to investigate how external data influences these dynamics within the framework. Notably, we find that under limited external data regimes, such external data can be utilized at any stage without significantly affecting final performances, which accords with the empirical observations.

replace Optimas: Optimizing Compound AI Systems with Globally Aligned Local Rewards

Authors: Shirley Wu, Parth Sarthi, Shiyu Zhao, Aaron Lee, Herumb Shandilya, Adrian Mladenic Grobelnik, Nurendra Choudhary, Eddie Huang, Karthik Subbian, Linjun Zhang, Diyi Yang, James Zou, Jure Leskovec

Abstract: Compound AI systems integrating multiple components, such as Large Language Models, specialized tools, and traditional machine learning models, are increasingly deployed to solve complex real-world tasks. However, optimizing compound systems remains challenging due to their non-differentiable structures and diverse configuration types across components, including prompts, hyperparameters, and model parameters. To address this challenge, we propose Optimas, a unified framework for effective optimization of compound systems. The core idea of Optimas is to maintain one Local Reward Function (LRF) per component, each satisfying a local-global alignment property, i.e., each component's local reward correlates with the global system performance. In each iteration, Optimas efficiently adapts the LRFs to maintain this property while simultaneously maximizing each component's local reward. This approach enables independent updates of heterogeneous configurations using the designated optimization method, while ensuring that local improvements consistently lead to performance gains. We present extensive evaluations across five real-world compound systems to demonstrate that Optimas outperforms strong baselines by an average improvement of 11.92%, offering a general and effective approach for improving compound systems. Our website is at https://optimas.stanford.edu.

URLs: https://optimas.stanford.edu.

replace Predicting Graph Structure via Adapted Flux Balance Analysis

Authors: Sevvandi Kandanaarachchi, Ziqi Xu, Stefan Westerlund, Conrad Sanderson

Abstract: Many dynamic processes such as telecommunication and transport networks can be described through discrete time series of graphs. Modelling the dynamics of such time series enables prediction of graph structure at future time steps, which can be used in applications such as detection of anomalies. Existing approaches for graph prediction have limitations such as assuming that the vertices do not to change between consecutive graphs. To address this, we propose to exploit time series prediction methods in combination with an adapted form of flux balance analysis (FBA), a linear programming method originating from biochemistry. FBA is adapted to incorporate various constraints applicable to the scenario of growing graphs. Empirical evaluations on synthetic datasets (constructed via Preferential Attachment model) and real datasets (UCI Message, HePH, Facebook, Bitcoin) demonstrate the efficacy of the proposed approach.

replace XiChen: A global weather observation-to-forecast machine learning system via four-dimensional variational gradient-guided flexible assimilation

Authors: Wuxin Wang, Weicheng Ni, Lilan Huang, Tao Hao, Ben Fei, Shuo Ma, Taikang Yuan, Yanlai Zhao, Kefeng Deng, Xiaoyong Li, Hongze Leng, Boheng Duan, Lei Bai, Weimin Zhang, Junqiang Song, Kaijun Ren

Abstract: Machine Learning (ML) has shown great promise in revolutionizing weather forecasting, yet most ML systems still rely on initial conditions generated by Numerical Weather Prediction (NWP) systems. End-to-end ML models aim to eliminate this dependency, but they often rely on observation-specific encoders and require redesign or retraining when observation sources change, thereby limiting their operational robustness. Here, we introduce XiChen, a global weather observation-to-forecast ML system via four-dimensional variational (4DVar) gradient-guided flexible assimilation. We demonstrate that the gradient of the 4DVar cost function serves as a physically grounded interface that maps heterogeneous observations into a common state space. This novel formulation enables XiChen to flexibly assimilate diverse conventional and raw satellite observations while preserving physical consistency. Experiments show that the system achieves forecasting metrics competitive with operational NWP systems. This work provides a practical and physically consistent route toward operational ML-based global weather forecasting systems with heterogeneous and evolving observations.

replace Einstein Fields: A Neural Perspective To Computational General Relativity

Authors: Sandeep Suresh Cranganore, Andrei Bodnar, Arturs Berzins, Johannes Brandstetter

Abstract: We introduce Einstein Fields, a neural representation designed to compress computationally intensive four-dimensional numerical relativity simulations into compact implicit neural network weights. By modeling the metric, the core tensor field of general relativity, Einstein Fields enable the derivation of physical quantities via automatic differentiation. Unlike conventional neural fields (e.g., signed distance, occupancy, or radiance fields), Einstein Fields fall into the class of Neural Tensor Fields with the key difference that, when encoding the spacetime geometry into neural field representations, dynamics emerge naturally as a byproduct. Our novel implicit approach demonstrates remarkable potential, including continuum modeling of four-dimensional spacetime, mesh-agnosticity, storage efficiency, derivative accuracy, and ease of use. It achieves up to a $4,000$-fold reduction in storage memory compared to discrete representations while retaining a numerical accuracy of five to seven decimal places. Moreover, in single precision, differentiation of the Einstein Fields-parameterized metric tensor is up to five orders of magnitude more accurate compared to naive finite differencing methods. We demonstrate these properties on several canonical test beds of general relativity and numerical relativity simulation data, while also releasing an open-source JAX-based library: \href{https://github.com/AndreiB137/EinFields}{https://github.com/AndreiB137/EinFields}, taking the first steps to studying the potential of machine learning in numerical relativity.

URLs: https://github.com/AndreiB137/EinFields, https://github.com/AndreiB137/EinFields

replace NOCTA: Non-Greedy Objective Cost-Tradeoff Acquisition for Longitudinal Data

Authors: Dzung Dinh, Boqi Chen, Yunni Qu, Marc Niethammer, Junier Oliva

Abstract: In many critical domains, features are not freely available at inference time: each measurement may come with a cost of time, money, and risk. Longitudinal prediction further complicates this setting because both features and labels evolve over time, and missing measurements at earlier timepoints may become permanently unavailable. We propose NOCTA, a Non-Greedy Objective Cost-Tradeoff Acquisition framework that sequentially acquires the most informative features at inference time while accounting for both temporal dynamics and acquisition cost. NOCTA is driven by a novel objective, NOCT, which evaluates a candidate set of future feature-time acquisitions by its expected predictive loss together with its acquisition cost. Since NOCT depends on unobserved future trajectories at inference time, we develop two complementary estimators: (i) NOCT-Contrastive, which learns an embedding of partial observations utilizing the induced distribution over future acquisitions, and (ii) NOCT-Amortized, which directly predicts NOCT for candidate plans with a neural network. Experiments on synthetic and real-world medical datasets demonstrate that both NOCTA estimators outperform existing baselines, achieving higher accuracy at lower acquisition costs.

replace From SGD to Spectra: A Theory of Neural Network Weight Dynamics

Authors: Brian Richard Olsen, Sam Fatehmanesh, Frank Xiao, Adarsh Kumarappan, Anirudh Gajula

Abstract: Deep neural networks have revolutionized machine learning, yet their training dynamics remain theoretically unclear-we develop a continuous-time, matrix-valued stochastic differential equation (SDE) framework that rigorously connects the microscopic dynamics of SGD to the macroscopic evolution of singular-value spectra in weight matrices. We derive exact SDEs showing that squared singular values follow Dyson Brownian motion with eigenvalue repulsion, and characterize stationary distributions as gamma-type densities with power-law tails, providing the first theoretical explanation for the empirically observed 'bulk+tail' spectral structure in trained networks. Through controlled experiments on transformer and MLP architectures, we validate our theoretical predictions and demonstrate quantitative agreement between SDE-based forecasts and observed spectral evolution, providing a rigorous foundation for understanding why deep learning works.

replace A Parallel Alternative for Energy-Efficient Neural Network Training and Inferencing

Authors: Sudip K. Seal, Maksudul Alam, Jorge Ramirez, Sajal Dash, Hao Lu

Abstract: Energy efficiency of training and inferencing with large neural network models is a critical challenge facing the future of sustainable large-scale machine learning workloads. This paper introduces an alternative strategy, called phantom parallelism, to minimize the net energy consumption of traditional tensor (model) parallelism, the most energy-inefficient component of large neural network training. The approach is presented in the context of feed-forward network architectures as a preliminary, but comprehensive, proof-of-principle study of the proposed methodology. We derive new forward and backward propagation operators for phantom parallelism, implement them as custom autograd operations within an end-to-end phantom parallel training pipeline and compare its parallel performance and energy-efficiency against those of conventional tensor parallel training pipelines. Formal analyses that predict lower bandwidth and FLOP counts are presented with supporting empirical results on up to 256 GPUs that corroborate these gains. Experiments are shown to deliver approximately 50% reduction in the energy consumed to train FFNs using the proposed phantom parallel approach when compared with conventional tensor parallel methods. Additionally, the proposed approach is shown to train smaller phantom models to the same model loss on smaller GPU counts as larger tensor parallel models on larger GPU counts offering the possibility for even greater energy savings.

replace Benchmarking Large Language Models for Geolocating Colonial Virginia Land Grants

Authors: Ryan Mioduski

Abstract: Virginia's seventeenth- and eighteenth-century land patents survive primarily as narrative metes-and-bounds descriptions, limiting spatial analysis. This study systematically evaluates current-generation large language models (LLMs) in converting these prose abstracts into geographically accurate latitude/longitude coordinates within a focused evaluation context. A digitized corpus of 5,471 Virginia patent abstracts (1695-1732) is released, with 43 rigorously verified test cases serving as an initial, geographically focused benchmark. Six OpenAI models across three architectures-o-series, GPT-4-class, and GPT-3.5-were tested under two paradigms: direct-to-coordinate and tool-augmented chain-of-thought invoking external geocoding APIs. Results were compared against a GIS analyst baseline, Stanford NER geoparser, Mordecai-3 neural geoparser, and a county-centroid heuristic. The top single-call model, o3-2025-04-16, achieved a mean error of 23 km (median 14 km), outperforming the median LLM (37.4 km) by 37.5%, the weakest LLM (50.3 km) by 53.5%, and external baselines by 67% (GIS analyst) and 70% (Stanford NER). A five-call ensemble further reduced errors to 19.2 km (median 12.2 km) at minimal additional cost (~USD 0.20 per grant), outperforming the median LLM by 48.7%. A patentee-name redaction ablation slightly increased error (~7%), showing reliance on textual landmark and adjacency descriptions rather than memorization. The cost-effective gpt-4o-2024-08-06 model maintained a 28 km mean error at USD 1.09 per 1,000 grants, establishing a strong cost-accuracy benchmark. External geocoding tools offer no measurable benefit in this evaluation. These findings demonstrate LLMs' potential for scalable, accurate, cost-effective historical georeferencing.

replace SNAP-UQ: Self-supervised Next-Activation Prediction for Single-Pass Uncertainty in TinyML

Authors: Ismail Lamaakal, Chaymae Yahyati, Khalid El Makkaoui, Ibrahim Ouahbi, Yassine Maleh

Abstract: Reliable uncertainty estimation is a key missing piece for on-device monitoring in TinyML: microcontrollers must detect failures, distribution shift, or accuracy drops under strict flash/latency budgets, yet common uncertainty approaches (deep ensembles, MC dropout, early exits, temporal buffering) typically require multiple passes, extra branches, or state that is impractical on milliwatt hardware. This paper proposes a novel and practical method, SNAP-UQ, for single-pass, label-free uncertainty estimation based on depth-wise next-activation prediction. SNAP-UQ taps a small set of backbone layers and uses tiny int8 heads to predict the mean and scale of the next activation from a low-rank projection of the previous one; the resulting standardized prediction error forms a depth-wise surprisal signal that is aggregated and mapped through a lightweight monotone calibrator into an actionable uncertainty score. The design introduces no temporal buffers or auxiliary exits and preserves state-free inference, while increasing deployment footprint by only a few tens of kilobytes. Across vision and audio backbones, SNAP-UQ reduces flash and latency relative to early-exit and deep-ensemble baselines (typically $\sim$40--60% smaller and $\sim$25--35% faster), with several competing methods at similar accuracy often exceeding MCU memory limits. On corrupted streams, it improves accuracy-drop event detection by multiple AUPRC points and maintains strong failure detection (AUROC $\approx 0.9$) in a single forward pass. By grounding uncertainty in layer-to-layer dynamics rather than solely in output confidence, SNAP-UQ offers a novel, resource-efficient basis for robust TinyML monitoring.

replace DyMixOp: A Neural Operator Designed from a Complex Dynamics Perspective with Local-Global Mixing for Solving PDEs

Authors: Pengyu Lai, Yixiao Chen, Dewu Yang, Rui Wang, Feng Wang, Hui Xu

Abstract: A primary challenge in using neural networks to approximate nonlinear dynamical systems governed by partial differential equations (PDEs) lies in recasting these systems into a tractable representation particularly when the dynamics are inherently non-linearizable or require infinite-dimensional spaces for linearization. To address this challenge, we introduce DyMixOp, a novel neural operator framework for PDEs that integrates theoretical insights from complex dynamical systems. Grounded in dynamics-aware priors and inertial manifold theory, DyMixOp projects the original infinite-dimensional PDE dynamics onto a finite-dimensional latent space. This reduction preserves both essential linear structures and dominant nonlinear interactions, thereby establishing a physically interpretable and computationally structured foundation. Central to this approach is the local-global mixing (LGM) transformation, a key architectural innovation inspired by the convective nonlinearity in turbulent flows. By multiplicatively coupling local fine-scale features with global spectral information, LGM effectively captures high-frequency details and complex nonlinear couplings while mitigating the spectral bias that plagues many existing neural operators. The framework is further enhanced by a dynamics-informed architecture that stacks multiple LGM layers in a hybrid configuration, incorporating timescale-adaptive gating and parallel aggregation of intermediate dynamics. This design enables robust approximation of general evolutionary dynamics across diverse physical regimes. Extensive experiments on seven benchmark PDE systems spanning 1D to 3D, elliptic to hyperbolic types demonstrate that DyMixOp achieves state-of-the-art performance on six of them, significantly reducing prediction errors (by up to 94.3% in chaotic regimes) while maintaining computational efficiency and strong scalability.

replace Beyond the Mean: Fisher-Orthogonal Projection for Natural Gradient Descent in Large Batch Training

Authors: Yishun Lu, Wesley Armour

Abstract: Modern GPUs are equipped with large amounts of high-bandwidth memory, enabling them to support mini-batch sizes of up to tens of thousands of training samples. However, most existing optimizers struggle to perform effectively at such a large batch size. As batch size increases, gradient noise decreases due to averaging over many samples, limiting the ability of first-order methods to escape sharp or suboptimal minima and reach the global minimum. Meanwhile, second-order methods like the natural gradient with Kronecker-Factored Approximate Curvature (KFAC) often require excessively high damping to remain stable at large batch sizes. This high damping effectively washes out the curvature information that gives these methods their advantage, reducing their performance to that of simple gradient descent. In this paper, we introduce Fisher-Orthogonal Projection (FOP), a novel technique that restores the effectiveness of the second-order method at very large batch sizes, enabling scalable training with improved generalization and faster convergence. FOP constructs a variance-aware update direction by leveraging gradients from two sub-batches, enhancing the average gradient with a component of the gradient difference that is orthogonal to the average under the Fisher-metric.

replace Your Reward Function for RL is Your Best PRM for Search: Unifying RL and Search-Based TTS

Authors: Can Jin, Yang Zhou, Qixin Zhang, Hongwu Peng, Di Zhang, Zihan Dong, Marco Pavone, Ligong Han, Zhang-Wei Hong, Tong Che, Dimitris N. Metaxas

Abstract: Test-time scaling (TTS) for large language models (LLMs) has thus far fallen into two largely separate paradigms: (1) reinforcement learning (RL) methods that optimize sparse outcome-based rewards, yet suffer from instability and low sample efficiency; and (2) search-based techniques guided by independently trained, static process reward models (PRMs), which require expensive human- or LLM-generated labels and often degrade under distribution shifts. In this paper, we introduce AIRL-S, the first natural unification of RL-based and search-based TTS. Central to AIRL-S is the insight that the reward function learned during RL training inherently represents the ideal PRM for guiding downstream search. Specifically, we leverage adversarial inverse reinforcement learning (AIRL) combined with group relative policy optimization (GRPO) to learn a dense, dynamic PRM directly from correct reasoning traces, entirely eliminating the need for labeled intermediate process data. At inference, the resulting PRM simultaneously serves as the critic for RL rollouts and as a heuristic to effectively guide search procedures, facilitating robust reasoning chain extension, mitigating reward hacking, and enhancing cross-task generalization. Experimental results across eight benchmarks, including mathematics, scientific reasoning, and code generation, demonstrate that our unified approach improves performance by 9 % on average over the base model, matching GPT-4o. Furthermore, when integrated into multiple search algorithms, our PRM consistently outperforms all baseline PRMs trained with labeled data. These results underscore that, indeed, your reward function for RL is your best PRM for search, providing a robust and cost-effective solution to complex reasoning tasks in LLMs.

replace Constraint Learning in Multi-Agent Dynamic Games from Demonstrations of Local Nash Interactions

Authors: Zhouyu Zhang, Chih-Yuan Chiu, Glen Chou

Abstract: We present an inverse dynamic game-based algorithm to learn parametric constraints from a given dataset of local Nash equilibrium interactions between multiple agents. Specifically, we introduce mixed-integer linear programs (MILP) encoding the Karush-Kuhn-Tucker (KKT) conditions of the interacting agents, which recover constraints consistent with the local Nash stationarity of the interaction demonstrations. We establish theoretical guarantees that our method learns inner approximations of the true safe and unsafe sets. We also use the interaction constraints recovered by our method to design motion plans that robustly satisfy the underlying constraints. Across simulations and hardware experiments, our methods accurately inferred constraints and designed safe interactive motion plans for various classes of constraints, both convex and non-convex, from interaction demonstrations of agents with nonlinear dynamics.

replace A-FloPS: Accelerating Diffusion Models via Adaptive Flow Path Sampler

Authors: Cheng Jin, Zhenyu Xiao, Yuantao Gu

Abstract: Diffusion models deliver state-of-the-art generative performance across diverse modalities but remain computationally expensive due to their inherently iterative sampling process. Existing training-free acceleration methods typically improve numerical solvers for the reverse-time ODE, yet their effectiveness is fundamentally constrained by the inefficiency of the underlying sampling trajectories. We propose A-FloPS (Adaptive Flow Path Sampler), a principled, training-free framework that reparameterizes the sampling trajectory of any pre-trained diffusion model into a flow-matching form and augments it with an adaptive velocity decomposition. The reparameterization analytically maps diffusion scores to flow-compatible velocities, yielding integration-friendly trajectories without retraining. The adaptive mechanism further factorizes the velocity field into a linear drift term and a residual component whose temporal variation is actively suppressed, restoring the accuracy benefits of high-order integration even in extremely low-NFE regimes. Extensive experiments on conditional image generation and text-to-image synthesis show that A-FloPS consistently outperforms state-of-the-art training-free samplers in both sample quality and efficiency. Notably, with as few as $5$ function evaluations, A-FloPS achieves substantially lower FID and generates sharper, more coherent images. The adaptive mechanism also improves native flow-based generative models, underscoring its generality. These results position A-FloPS as a versatile and effective solution for high-quality, low-latency generative modeling.

replace Efficient Graph Knowledge Distillation from GNNs to Kolmogorov--Arnold Networks via Self-Attention Dynamic Sampling

Authors: Can Cui, Zilong Fu, Penghe Huang, Yuanyuan Li, Wu Deng, Dongyan Li

Abstract: Recent success of graph neural networks (GNNs) in modeling complex graph-structured data has fueled interest in deploying them on resource-constrained edge devices. However, their substantial computational and memory demands present ongoing challenges. Knowledge distillation (KD) from GNNs to MLPs offers a lightweight alternative, but MLPs remain limited by fixed activations and the absence of neighborhood aggregation, constraining distilled performance. To tackle these intertwined limitations, we propose SA-DSD, a novel self-attention-guided dynamic sampling distillation framework. To the best of our knowledge, this is the first work to employ an enhanced Kolmogorov-Arnold Network (KAN) as the student model. We improve Fourier KAN (FR-KAN+) with learnable frequency bases, phase shifts, and optimized algorithms, substantially improving nonlinear fitting capability over MLPs while preserving low computational complexity. To explicitly compensate for the absence of neighborhood aggregation that is inherent to both MLPs and KAN-based students, SA-DSD leverages a self-attention mechanism to dynamically identify influential nodes, construct adaptive sampling probability matrices, and enforce teacher-student prediction consistency. Extensive experiments on six real world datasets demonstrate that, under inductive and most of transductive settings, SA-DSD surpasses three GNN teachers by 3.05%-3.62% and improves FR-KAN+ by 15.61%. Moreover, it achieves a 16.69x parameter reduction and a 55.75% decrease in average runtime per epoch compared to key benchmarks.

replace RefineStat: Efficient Exploration for Probabilistic Program Synthesis

Authors: Madhav Kanda, Shubham Ugare, Sasa Misailovic

Abstract: Probabilistic programming offers a powerful framework for modeling uncertainty, yet statistical model discovery in this domain entails navigating an immense search space under strict domain-specific constraints. When small language models are tasked with generating probabilistic programs, they frequently produce outputs that suffer from both syntactic and semantic errors, such as flawed inference constructs. Motivated by probabilistic programmers' domain expertise and debugging strategies, we introduce RefineStat, a language model--driven framework that enforces semantic constraints ensuring synthesized programs contain valid distributions and well-formed parameters, and then applies diagnostic-aware refinement by resampling prior or likelihood components whenever reliability checks fail. We evaluate RefineStat on multiple probabilistic-programming code-generation tasks using smaller language models (SLMs) and find that it produces programs that are both syntactically sound and statistically reliable, often matching or surpassing those from closed-source large language models (e.g., OpenAI o3).

replace Initialization Schemes for Kolmogorov-Arnold Networks: An Empirical Study

Authors: Spyros Rigas, Dhruv Verma, Georgios Alexandridis, Yixuan Wang

Abstract: Kolmogorov-Arnold Networks (KANs) are a recently introduced neural architecture that replace fixed nonlinearities with trainable activation functions, offering enhanced flexibility and interpretability. While KANs have been applied successfully across scientific and machine learning tasks, their initialization strategies remain largely unexplored. In this work, we study initialization schemes for spline-based KANs, proposing two theory-driven approaches inspired by LeCun and Glorot, as well as an empirical power-law family with tunable exponents. Our evaluation combines large-scale grid searches on function fitting and forward PDE benchmarks, an analysis of training dynamics through the lens of the Neural Tangent Kernel, and evaluations on a subset of the Feynman dataset. Our findings indicate that the Glorot-inspired initialization significantly outperforms the baseline in parameter-rich models, while power-law initialization achieves the strongest performance overall, both across tasks and for architectures of varying size. All code and data accompanying this manuscript are publicly available at https://github.com/srigas/KAN_Initialization_Schemes.

URLs: https://github.com/srigas/KAN_Initialization_Schemes.

replace Density-Aware Farthest Point Sampling

Authors: Paolo Climaco, Jochen Garcke

Abstract: We focus on training machine learning regression models in scenarios where the availability of labeled training data is limited due to computational constraints or high labeling costs. Thus, selecting suitable training sets from unlabeled data is essential for balancing performance and efficiency. For the selection of the training data, we focus on passive and model-agnostic sampling methods that only consider the data feature representations. We derive an upper bound for the expected prediction error of Lipschitz continuous regression models that linearly depends on the weighted fill distance of the training set: a quantity we can estimate simply by considering the data features. We introduce ''Density-Aware Farthest Point Sampling'' (DA-FPS), a novel sampling method. We prove that DA-FPS provides approximate minimizers for a data-driven estimation of the weighted fill distance, thereby aiming at minimizing our derived bound. We conduct experiments using two regression models across three datasets. The results demonstrate that DA-FPS significantly reduces the mean absolute prediction error compared to other sampling strategies.

replace CR-Net: Scaling Parameter-Efficient Training with Cross-Layer Low-Rank Structure

Authors: Boao Kong, Junzhu Liang, Yuxi Liu, Renjia Deng, Kun Yuan

Abstract: Low-rank architectures have become increasingly important for efficient large language model (LLM) pre-training, providing substantial reductions in both parameter complexity and memory/computational demands. Despite these advantages, current low-rank methods face three critical shortcomings: (1) compromised model performance, (2) considerable computational overhead, and (3) limited activation memory savings. To address these limitations, we propose Cross-layer Low-Rank residual Network (CR-Net), an innovative parameter-efficient framework inspired by our discovery that inter-layer activation residuals possess low-rank properties. CR-Net implements this insight through a dual-path architecture that efficiently reconstructs layer activations by combining previous-layer outputs with their low-rank differences, thereby maintaining high-rank information with minimal parameters. We further develop a specialized activation recomputation strategy tailored for CR-Net that dramatically reduces memory requirements. Extensive pre-training experiments across model scales from 60M to 7B parameters demonstrate that CR-Net consistently outperforms state-of-the-art low-rank frameworks while requiring fewer computational resources and less memory.

replace Revisiting Performance Claims for Chest X-Ray Models Using Clinical Context

Authors: Andrew Wang, Jiashuo Zhang, Michael Oberst

Abstract: Public datasets of Chest X-Rays (CXRs) have long been a popular benchmark for developing machine learning (ML) computer vision models in healthcare. However, the reported strong average-case performance of these models do not necessarily reflect their actual utility when used in heterogeneous clinical settings, potentially masking weaker performance in medically significant scenarios. In this work we use clinical context to provide a more holistic evaluation of current "state-of-the-art" (SOTA) models for CXR diagnosis. In particular, we use discharge summaries, recorded prior to each CXR, to derive a "pre-CXR" probability of each CXR label, as a proxy for existing contextual knowledge available to clinicians when interpreting CXRs. We use this measure to probe model performance along two dimensions: First, using a stratified analysis, we show that models tend to have lower performance (as measured by AUROC and other metrics) among individuals with higher pre-CXR probability. Second, by controlling for pre-CXR probability via matching and re-weighting, we demonstrate that performance degrades when the correlation is broken between prior context and the current CXR label, suggesting that model performance may depend in part on inference of pre-CXR clinical context.

replace d2: Improved Techniques for Training Reasoning Diffusion Language Models

Authors: Guanghan Wang, Gilad Turok, Yair Schiff, Marianne Arriola, Volodymyr Kuleshov

Abstract: While diffusion language models (DLMs) have achieved competitive performance in text generation, improving their reasoning ability with reinforcement learning remains an active research area. Here, we introduce d2, a reasoning framework tailored for masked DLMs. Central to our framework is a new policy gradient algorithm that relies on accurate estimates of the sampling trajectory likelihoods. Our likelihood estimator, d2-AnyOrder, achieves exact trajectory likelihood with a single model pass for DLMs that support a sampling algorithm called any-order decoding. Through an empirical study of widely used DLMs, we show that any-order decoding is not universally supported in practice. Consequently, for DLMs that do not naturally support any-order decoding, we propose another estimator, d2-StepMerge, which, unlike d2-AnyOrder, only approximates the trajectory likelihood. d2-StepMerge trades off compute for approximation accuracy in an analytically tractable manner. Empirically, d2 significantly outperforms widely-used RL baselines when applied to popular DLMs, and sets a new state-of-the-art performance for DLMs on logical reasoning tasks (Countdown and Sudoku) and math reasoning benchmarks (GSM8K and MATH500). We provide the code along with a blog post on the project page: https://guanghanwang.com/d2

URLs: https://guanghanwang.com/d2

replace Interpretable Discovery of One-parameter Subgroups: A Modular Framework for Elliptical, Hyperbolic, and Parabolic Symmetries

Authors: Pavan Karjol, Vivek V Kashyap, Rohan Kashyap, Prathosh A P

Abstract: We propose a modular, data-driven framework for jointly learning unknown functional mappings and discovering the underlying one-parameter symmetry subgroup governing the data. Unlike conventional geometric deep learning methods that assume known symmetries, our approach identifies the relevant continuous subgroup directly from data. We consider the broad class of one-parameter subgroups, which admit a canonical geometric classification into three regimes: elliptical, hyperbolic, and parabolic. Given an assumed regime, our framework instantiates a corresponding symmetry discovery architecture with invariant and equivariant representation layers structured according to the Lie algebra of the subgroup, and learns the exact generator parameters end-to-end from data. This yields models whose invariance or equivariance is guaranteed by construction and admits formal proofs, enabling symmetry to be explicitly traced to identifiable components of the architecture. The approach is applicable to one-parameter subgroups of a wide range of matrix Lie groups, including $SO(n)$, $SL(n)$, and the Lorentz group. Experiments on synthetic and real-world systems, including moment of inertia prediction, double-pendulum dynamics, and high-energy \textit{Top Quark Tagging}, demonstrate accurate subgroup recovery and strong predictive performance across both compact and non-compact regimes.

replace ASSESS: A Semantic and Structural Evaluation Framework for Statement Similarity

Authors: Xiaoyang Liu, Tao Zhu, Zineng Dong, Yuntian Liu, Qingfeng Guo, Zhaoxuan Liu, Yu Chen, Tao Luo

Abstract: Despite significant strides in statement autoformalization, a critical gap remains in the development of automated evaluation metrics capable of assessing formal translation quality. Existing metrics often fail to balance semantic and structural information: string-based methods neglect semantics, whereas proof-based approaches offer no graded similarity when proofs fail. To address these issues, we introduce ASSESS (A Semantic and Structural Evaluation Framework for Statement Similarity), which captures syntactic structure by transforming formal statements into operator trees and computes a real-valued similarity score using our novel TransTED (Transformation Tree Edit Distance) Similarity metric by incorporating semantic transformations. For rigorous validation, we present EPLA (Evaluating Provability and Likeness for Autoformalization), a benchmark comprising 1,247 expert-annotated formal statement pairs derived from miniF2F and ProofNet, distinctively labeled for both semantic provability and structural likeness. Experiments on the EPLA benchmark demonstrate that TransTED Similarity surpasses existing methods, achieving state-of-the-art accuracy and Kappa score. The benchmark dataset, code, and detailed experimental results are available at https://github.com/XiaoyangLiu-sjtu/ASSESS.

URLs: https://github.com/XiaoyangLiu-sjtu/ASSESS.

replace Functional Critics Are Essential for Actor-Critic: From Off-Policy Stability to Efficient Exploration

Authors: Qinxun Bai, Yuxuan Han, Wei Xu, Zhengyuan Zhou

Abstract: The actor-critic (AC) framework has achieved strong empirical success in off-policy reinforcement learning but suffers from the "moving target" problem, where the evaluated policy changes continually. Functional critics, or policy-conditioned value functions, address this by explicitly including a representation of the policy as input. While conceptually appealing, previous efforts have struggled to remain competitive against standard AC. In this work, we revisit functional critics within the actor-critic framework and identify two critical aspects that render them a necessity rather than a luxury. First, we demonstrate their power in stabilizing the complex interplay between the "deadly triad" and the "moving target". We provide a convergent off-policy AC algorithm under linear functional approximation that dismantles several longstanding barriers between theory and practice: it utilizes target-based TD learning, accommodates dynamic behavior policies, and operates without the restrictive "full coverage" assumptions. By formalizing a dual trust-coverage mechanism, our framework provides principled guidelines for pursuing sample efficiency-rigorously governing behavior policy updates and critic re-evaluations to maximize off-policy data utility. Second, we uncover a foundational link between functional critics and efficient exploration. We demonstrate that existing model-free approximations of posterior sampling are limited in capturing policy-dependent uncertainty, a gap the functional critic formalism bridges. These results represent, to our knowledge, first-of-their-kind contributions to the RL literature. Practically, we propose a tailored neural network architecture and a minimalist AC algorithm. In preliminary experiments on the DeepMind Control Suite, this implementation achieves performance competitive with state-of-the-art methods without standard implementation heuristics.

replace LOTFormer: Doubly-Stochastic Linear Attention via Low-Rank Optimal Transport

Authors: Ashkan Shahbazi, Chayne Thrash, Yikun Bai, Keaton Hamm, Navid NaderiAlizadeh, Soheil Kolouri

Abstract: Transformers have proven highly effective across modalities, but standard softmax attention scales quadratically with sequence length, limiting long context modeling. Linear attention mitigates this by approximating attention with kernel feature maps, yet most attention mechanisms remain row normalized and can over concentrate mass on a few tokens, harming robustness and information flow. Doubly stochastic attention counteracts this by balancing token participation across both rows and columns, but existing approaches often add significant overhead. We propose LOTFormer, a linear time doubly stochastic attention mechanism derived from an optimal transport view of attention as a coupling between query and key measures. LOTFormer enforces a low rank transport plan by conditioning on a learnable pivot measure with small support. We solve two entropic transport problems, queries to pivot and pivot to keys, and compose them into a conditional coupling that is provably doubly stochastic, has rank at most $r \ll n$, and applies to values in $O(nr)$ time without forming the full $n \times n$ matrix. The pivot locations and masses are learned end-to-end. Across vision and text benchmarks, LOTFormer delivers strong accuracy efficiency tradeoffs when plugged into standard backbones including Swin, DeiT, and BERT.

replace Evolution Strategies at Scale: LLM Fine-Tuning Beyond Reinforcement Learning

Authors: Xin Qiu, Yulu Gan, Conor F. Hayes, Qiyao Liang, Yinggan Xu, Roberto Dailey, Elliot Meyerson, Babak Hodjat, Risto Miikkulainen

Abstract: Fine-tuning large language models (LLMs) for downstream tasks is an essential stage of modern AI deployment. Reinforcement learning (RL) has emerged as the dominant fine-tuning paradigm, underpinning many state-of-the-art LLMs. In contrast, evolution strategies (ES) has largely been overlooked due to the widespread belief that it does not scale to modern model sizes. This paper overturns this assumption by demonstrating the first successful application of ES to full-parameter fine-tuning of LLMs at the billion-parameter scale, without dimensionality reduction. ES can indeed search over extremely high-dimensional parameter spaces and outperform established RL implementations across multiple axes, including improved tolerance to long-horizon and delayed rewards, robustness across diverse base LLMs, reduced susceptibility to reward hacking, and improved training stability. These findings suggest that ES is not merely a viable alternative to RL, but a fundamentally different and powerful backpropagation-free post-training paradigm that opens a new direction for LLM fine-tuning beyond current RL-based approaches. The source codes are provided at: https://github.com/VsonicV/es-fine-tuning-paper.

URLs: https://github.com/VsonicV/es-fine-tuning-paper.

replace Specialization after Generalization: Towards Understanding Test-Time Training in Foundation Models

Authors: Jonas H\"ubotter, Patrik Wolf, Alexander Shevchenko, Dennis J\"uni, Andreas Krause, Gil Kur

Abstract: Recent empirical studies have explored the idea of continuing to train a model at test-time for a given task, known as test-time training (TTT), and have found it to yield significant performance improvements. However, there is limited understanding of why and when TTT is effective. Earlier explanations mostly focused on the observation that TTT may help when applied to out-of-distribution adaptation or used with privileged data. However, the growing scale of foundation models with most test data being in-distribution questions these explanations. We instead posit that foundation models remain globally underparameterized, with TTT providing a mechanism for specialization after generalization, focusing capacity on concepts relevant to the test task. Specifically, under the linear representation hypothesis, we propose a model in which TTT achieves a substantially smaller in-distribution test error than global training. We empirically validate our model's key assumptions by training a sparse autoencoder on ImageNet, showing that semantically related data points are explained by only a few shared concepts. Finally, we perform scaling studies across image and language tasks that confirm the practical implications of our model, identifying the regimes where specialization is most effective.

replace Gradient Descent with Large Step Sizes: Chaos and Fractal Convergence Region

Authors: Shuang Liang, Guido Mont\'ufar

Abstract: We examine gradient descent in matrix factorization and show that under large step sizes the parameter space develops a fractal structure. We derive the exact critical step size for convergence in scalar-vector factorization and show that near criticality the selected minimizer depends sensitively on the initialization. Moreover, we show that adding regularization amplifies this sensitivity, generating a fractal boundary between initializations that converge and those that diverge. The analysis extends to general matrix factorization with orthogonal initialization. Our findings reveal that near-critical step sizes induce a chaotic regime of gradient descent where the training outcome is unpredictable and there are no simple implicit biases, such as towards balancedness, minimum norm, or flatness.

replace A Review on Single-Problem Multi-Attempt Heuristic Optimization

Authors: Judith Echevarrieta, Etor Arza, Aritz P\'erez, Josu Ceberio

Abstract: In certain real-world optimization scenarios, practitioners are not interested in solving multiple problems but rather in finding the best solution to a single, specific problem. When the computational budget is large relative to the cost of evaluating a candidate solution, multiple heuristic alternatives can be tried to solve the same given problem, each possibly with a different algorithm, parameter configuration, initialization, or stopping criterion. In this practically relevant setting, the sequential selection of which alternative to try next is crucial for efficiently identifying the best possible solution across multiple attempts. However, suitable sequential alternative selection strategies have traditionally been studied separately across different research topics and have not been the exclusive focus of any existing review. As a result, the state-of-the-art remains fragmented for practitioners interested in this setting, with surveys either covering only subsets of relevant strategies or including approaches that rely on assumptions that are not feasible for the single-problem case. This work addresses the identified gap by providing a focused review of single-problem multi-attempt heuristic optimization. It brings together suitable strategies for this setting that have been studied separately through algorithm selection, parameter tuning, multi-start, and resource allocation. These strategies are described using a unified terminology within a common framework, which supports the construction of a taxonomy for systematically organizing and classifying them. The resulting comprehensive review facilitates both the identification and the development of strategies for the single-problem multi-attempt setting in practice.

replace GeoGen: A Two-stage Coarse-to-Fine Framework for Fine-grained Synthetic Location-based Social Network Trajectory Generation

Authors: Rongchao Xu, Kunlin Cai, Lin Jiang, Zhiqing Hong, Yuan Tian, Guang Wang

Abstract: Location-Based Social Network (LBSN) check-in trajectory data are important for many practical applications, like POI recommendation, advertising, and pandemic intervention. However, the high collection costs and ever-increasing privacy concerns prevent us from accessing large-scale LBSN trajectory data. The recent advances in synthetic data generation provide us with a new opportunity to achieve this, which utilizes generative AI to generate synthetic data that preserves the characteristics of real data while ensuring privacy protection. However, generating synthetic LBSN check-in trajectories remains challenging due to their spatially discrete, temporally irregular nature and the complex spatio-temporal patterns caused by sparse activities and uncertain human mobility. To address this challenge, we propose GeoGen, a two-stage coarse-to-fine framework for large-scale LBSN check-in trajectory generation. In the first stage, we reconstruct spatially continuous, temporally regular latent movement sequences from the original LBSN check-in trajectories and then design a Sparsity-aware Spatio-temporal Diffusion model (S$^2$TDiff) with an efficient denosing network to learn their underlying behavioral patterns. In the second stage, we design Coarse2FineNet, a Transformer-based Seq2Seq architecture equipped with a dynamic context fusion mechanism in the encoder and a multi-task hybrid-head decoder, which generates fine-grained LBSN trajectories based on coarse-grained latent movement sequences by modeling semantic relevance and behavioral uncertainty. Extensive experiments on four real-world datasets show that GeoGen excels state-of-the-art models for both fidelity and utility evaluation, e.g., it increases over 69% and 55% in distance and radius metrics on the FS-TKY dataset.

replace VAO: Validation-Aligned Optimization for Cross-Task Generative Auto-Bidding

Authors: Yiqin Lv, Zhiyu Mou, Miao Xu, Jinghao Chen, Qi Wang, Yixiu Mao, Yun Qu, Rongquan Bai, Chuan Yu, Jian Xu, Bo Zheng, Xiangyang Ji

Abstract: Generative auto-bidding has demonstrated strong performance in online advertising, yet it often suffers from data scarcity in small-scale settings with limited advertiser participation. While cross-task data sharing is a natural remedy to mitigate this issue, naive approaches often introduce gradient bias due to distribution shifts across different tasks, and existing methods are not readily applicable to generative auto-bidding. In this paper, we propose Validation-Aligned Optimization (VAO), a principled data-sharing method that adaptively reweights cross-task data contributions based on validation performance feedback. Notably, VAO aligns training dynamics to prioritize updates that improve generalization on the target task, effectively leveraging auxiliary data and mitigating gradient bias. Building on VAO, we introduce a unified generative autobidding framework that generalizes across multiple tasks using a single model and all available task data. Extensive experiments on standard auto-bidding benchmarks validate the effectiveness of our approach.

replace Probabilistic bias adjustment of seasonal predictions of Arctic Sea Ice Concentration

Authors: Parsa Gooya, Reinel Sospedra-Alfonso

Abstract: Seasonal forecast of Arctic sea ice concentration is key to mitigate the negative impact and assess potential opportunities posed by the rapid decline of sea ice coverage. Seasonal prediction systems based on climate models often show systematic biases and complex spatio-temporal errors that grow with the forecasts. Consequently, operational predictions are routinely bias corrected and calibrated using retrospective forecasts. For predictions of Arctic sea ice concentration, error corrections are mainly based on one-to-one post-processing methods including climatological mean or linear regression correction and, more recently, machine learning. Such deterministic adjustments are confined at best to the limited number of costly-to-run ensemble members of the raw forecast. However, decision-making requires proper quantification of uncertainty and likelihood of events, particularly of extremes. We introduce a probabilistic error correction framework based on a conditional Variational Autoencoder model to map the conditional distribution of observations given the biased model prediction. This method naturally allows for generating large ensembles of adjusted forecasts. We evaluate our model using deterministic and probabilistic metrics and show that the adjusted forecasts are better calibrated, closer to the observational distribution, and have smaller errors than climatological mean adjusted forecasts.

replace PAC-Bayesian Reinforcement Learning Trains Generalizable Policies

Authors: Abdelkrim Zitouni, Mehdi Hennequin, Juba Agoun, Ryan Horache, Nadia Kabachi, Omar Rivasplata

Abstract: We derive a novel PAC-Bayesian generalization bound for reinforcement learning that explicitly accounts for Markov dependencies in the data, through the chain's mixing time. This contributes to overcoming challenges in obtaining generalization guarantees for reinforcement learning, where the sequential nature of data breaks the independence assumptions underlying classical bounds. The new bound provides non-vacuous certificates for modern off-policy algorithms such as Soft Actor-Critic. We demonstrate the practical utility of the bound through PB-SAC, a novel algorithm that optimizes the bound during training to guide exploration. Experiments across several continuous control tasks show that the proposed approach provides meaningful confidence certificates while maintaining competitive performance.

replace Transformer-based Learning-to-Optimize Approach for Scalable and Generalizable Beamforming

Authors: Yubo Zhang, Xiao-Yang Liu, Xiaodong Wang

Abstract: We develop an unsupervised deep learning framework for downlink beamforming in large-scale MU-MISO channels. The model is trained offline, allowing real-time inference through lightweight feedforward computations in dynamic communication environments. Following the learning-to-optimize (L2O) paradigm, a multi-layer Transformer iteratively refines both channel and beamformer features via residual connections. To enhance training, three strategies are introduced: (i) curriculum learning (CL) to improve early-stage convergence and avoid local optima, (ii) semi-amortized learning to refine each Transformer block with a few gradient ascent steps, and (iii) sliding-window training to stabilize optimization by training only a subset of Transformer blocks at a time. Extensive simulations show that the proposed scheme outperforms existing baselines at low-to-medium SNRs and closely approaches WMMSE performance at high SNRs, while achieving substantially faster inference than iterative and online learning approaches.

replace Differentially Private Linear Regression and Synthetic Data Generation with Statistical Guarantees

Authors: Shurong Lin, Aleksandra Slavkovi\'c, Deekshith Reddy Bhoomireddy

Abstract: In social sciences, small- to medium-scale datasets are common and linear regression (LR) is canonical. In privacy-aware settings, much work has focused on differentially private (DP) LR, but mostly on point estimation with limited attention to uncertainty quantification. Meanwhile, synthetic data generation (SDG) is increasingly important for reproducibility studies, yet current DP LR methods do not readily support it. Mainstream SDG approaches are either tailored to discretized data, making them less suitable for continuous regression, or rely on deep models that require large datasets, limiting their use for the smaller, continuous data typical in social science. We propose a method for LR with valid inference under Gaussian DP: a DP bias-corrected estimator with asymptotic confidence intervals (CIs) and a general SDG procedure in which regression on the synthetic data matches our DP regression. Our binning-aggregation strategy is effective in small- to moderate-dimensional settings. Experiments show our method (1) improves accuracy over existing methods, (2) provides valid CIs, and (3) produces more reliable synthetic data for downstream ML tasks than current DP SDGs.

replace ActivationReasoning: Logical Reasoning in Latent Activation Spaces

Authors: Lukas Helff, Ruben H\"arle, Wolfgang Stammer, Felix Friedrich, Manuel Brack, Antonia W\"ust, Hikaru Shindo, Patrick Schramowski, Kristian Kersting

Abstract: Large language models (LLMs) excel at generating fluent text, but their internal reasoning remains opaque and difficult to control. Sparse autoencoders (SAEs) make hidden activations more interpretable by exposing latent features that often align with human concepts. Yet, these features are fragile and passive, offering no mechanism for systematic reasoning or model control. To address this, we introduce ActivationReasoning (AR), a framework that embeds explicit logical reasoning into the latent space of LLMs. It proceeds in three stages: (1) Finding latent representations, first latent concept representations are identified (e.g., via SAEs) and organized into a dictionary; (2) Activating propositions, at inference time AR detects activating concepts and maps them to logical propositions; and (3)Logical reasoning, applying logical rules over these propositions to infer higher-order structures, compose new concepts, and steer model behavior. We evaluate AR on multi-hop reasoning (PrOntoQA), abstraction and robustness to indirect concept cues (Rail2Country), reasoning over natural and diverse language (ProverQA), and context-sensitive safety (BeaverTails). Across all tasks, AR scales robustly with reasoning complexity, generalizes to abstract and context-sensitive tasks, and transfers across model backbones. These results demonstrate that grounding logical structure in latent activations not only improves transparency but also enables structured reasoning, reliable control, and alignment with desired behaviors, providing a path toward more reliable and auditable AI.

replace MetaCluster: Enabling Deep Compression of Kolmogorov-Arnold Network

Authors: Matthew Raffel, Adwaith Renjith, Lizhong Chen

Abstract: Kolmogorov-Arnold Networks (KANs) replace scalar weights with per-edge vectors of basis coefficients, thereby increasing expressivity and accuracy while also resulting in a multiplicative increase in parameters and memory. We propose MetaCluster, a framework that makes KANs highly compressible without sacrificing accuracy. Specifically, a lightweight meta-learner, trained jointly with the KAN, maps low-dimensional embeddings to coefficient vectors, thereby shaping them to lie on a low-dimensional manifold that is amenable to clustering. We then run K-means in coefficient space and replace per-edge vectors with shared centroids. Afterwards, the meta-learner can be discarded, and a brief fine-tuning of the centroid codebook recovers any residual accuracy loss. The resulting model stores only a small codebook and per-edge indices, exploiting the vector nature of KAN parameters to amortize storage across multiple coefficients. On MNIST, CIFAR-10, and CIFAR-100, across standard KANs and ConvKANs using multiple basis functions, MetaCluster achieves a reduction of up to $80\times$ in parameter storage, with no loss in accuracy. Similarly, on high-dimensional equation modeling tasks, MetaCluster achieves a parameter reduction of $124.1\times$, without impacting performance. Code will be released upon publication.

replace Scalable LinUCB: Low-Rank Design Matrix Updates for Recommenders with Large Action Spaces

Authors: Evgenia Shustova, Marina Sheshukova, Sergey Samsonov, Evgeny Frolov

Abstract: In this paper, we introduce PSI-LinUCB, a scalable variant of LinUCB that enables efficient training, inference, and memory usage by representing the inverse regularized design matrix as a sum of a diagonal matrix and low-rank correction. We derive numerically stable rank-1 and batched updates that maintain the inverse without explicitly forming the matrix. To control memory growth, we employ a projector-splitting integrator for dynamical low-rank approximation, yielding an average per-step update cost and memory usage of $O(dr)$ for approximation rank $r$. The inference complexity of the proposed algorithm is $O(dr)$ per action evaluation. Experiments on recommender system datasets demonstrate the effectiveness of our algorithm.

replace Generalised Flow Maps for Few-Step Generative Modelling on Riemannian Manifolds

Authors: Oscar Davis, Michael S. Albergo, Nicholas M. Boffi, Michael M. Bronstein, Avishek Joey Bose

Abstract: Geometric data and purpose-built generative models on them have become ubiquitous in high-impact deep learning application domains, ranging from protein backbone generation and computational chemistry to geospatial data. Current geometric generative models remain computationally expensive at inference -- requiring many steps of complex numerical simulation -- as they are derived from dynamical measure transport frameworks such as diffusion and flow-matching on Riemannian manifolds. In this paper, we propose Generalised Flow Maps (GFM), a new class of few-step generative models that generalises the Flow Map framework in Euclidean spaces to arbitrary Riemannian manifolds. We instantiate GFMs with three self-distillation-based training methods: Generalised Lagrangian Flow Maps, Generalised Eulerian Flow Maps, and Generalised Progressive Flow Maps. We theoretically show that GFMs, under specific design decisions, unify and elevate existing Euclidean few-step generative models, such as consistency models, shortcut models, and meanflows, to the Riemannian setting. We benchmark GFMs against other geometric generative models on a suite of geometric datasets, including geospatial data, RNA torsion angles, and hyperbolic manifolds, and achieve state-of-the-art sample quality for single- and few-step evaluations, and superior or competitive log-likelihoods using the implicit probability flow.

replace GPTOpt: Teaching LLMs to do Interpretable Black-Box Optimization

Authors: Jamison Meindl, Yunsheng Tian, Tony Cui, Veronika Thost, Zhang-Wei Hong, Jie Chen, Wojciech Matusik, Mina Konakovi\'c Lukovi\'c

Abstract: Global optimization of expensive, derivative-free black-box functions demands extreme sample efficiency and decision interpretability. While Large Language Models (LLMs) have shown broad capabilities, even state-of-the-art models remain limited in solving continuous black-box optimization tasks and struggle to maintain exploration-exploitation balance. We introduce GPTOpt, an optimization method that equips LLMs with continuous black-box optimization capabilities by fine-tuning Llama 3.1 8B on structured Bayesian optimization (BO) data, including surrogate model information. This provides an explainable framework calibrated to produce surrogate model outputs comparable to a Gaussian process, while keeping the advantages of flexible LLM-based optimization. On a variety of black-box optimization benchmarks, our model shows favorable performance compared to traditional optimizers and transformer-based alternatives, while providing important context and insight into the model's decisions.

replace EVINGCA: Adaptive Graph Clustering with Evolving Neighborhood Statistics

Authors: Randolph Wiredu-Aidoo

Abstract: Clustering is a fundamental tool for discovering structure in data, yet many existing methods rely on restrictive assumptions. Algorithms such as K-Means and Gaussian Mixtures favor convex or Gaussian clusters, while density-based approaches like DBSCAN and HDBSCAN struggle with variable densities or moderate dimensionality. This paper introduces EVINGCA (Evolving Variance-Informed Nonparametric Graph Construction Algorithm), a density-variance-based clustering method that grows clusters incrementally using breadth-first search on a nearest-neighbor graph. Edges are filtered via z-scores of neighbor distances, with estimates refined as clusters expand, enabling adaptation to cluster-specific structure, and a recovery regime distinct from that of existing alternatives. Over-segmentation is exploited by a propagation phase, which propagates inner, denser "skeletons" out to sharp decision boundaries in low-contrast regions. Experiments on 28 diverse datasets demonstrate competitive runtime behavior and a statistically significant improvement over baseline methods in ARI-based label recovery capacity.

replace Tree Training: Accelerating Agentic LLMs Training via Shared Prefix Reuse

Authors: Shaojie Wang, Jinghui Wang, Yinghan Cui, Xuxing Chen, Chao Wang, Liang Huang, Xiaojiang Zhang, Junyi Peng, Li Wan, Haotian Zhang, Bin Chen

Abstract: Agentic large language model (LLM) training often involves multi-turn interaction trajectories that branch into multiple execution paths due to concurrent tool use, think-mode, sub-agent, context management and other runtime designs. As a result, the token produced by a single task naturally forms a tree-structured token trajectory with shared prefixes, rather than a linear sequence. Existing training pipelines linearize such trajectories and treat each branch independently, leading to substantial redundant computation in both forward and backward passes. To eliminate such redundancy, we introduce Tree Training, an efficient training framework for tree-structured trajectories. Its core component, Gradient Restoration, enables correct gradient aggregation across shared prefixes, allowing each prefix to be computed exactly once while remaining mathematically equivalent to independent training on all branches. To support large trajectory trees in practice, we redesign the training engine to natively ingest tree-structured data and propose Tree Packing, a memory-efficient partitioning strategy that preserves high prefix reuse. Experiments conducted on dense and MOE models of real-world agentic trajectories show 6.2x training speedup for both supervised fine-tuning and the model update phase in reinforcement learning.

replace Test-Time Iterative Error Correction for Efficient Diffusion Models

Authors: Yunshan Zhong, Weiqi Yan, Yuxin Zhang

Abstract: With the growing demand for high-quality image generation on resource-constrained devices, efficient diffusion models have received increasing attention. However, such models suffer from approximation errors introduced by efficiency techniques, which significantly degrade generation quality. Once deployed, these errors are difficult to correct, as modifying the model is typically infeasible in deployment environments. Through an analysis of error propagation across diffusion timesteps, we reveal that these approximation errors can accumulate exponentially, severely impairing output quality. Motivated by this insight, we propose Iterative Error Correction (IEC), a novel test-time method that mitigates inference-time errors by iteratively refining the model's output. IEC is theoretically proven to reduce error propagation from exponential to linear growth, without requiring any retraining or architectural changes. IEC can seamlessly integrate into the inference process of existing diffusion models, enabling a flexible trade-off between performance and efficiency. Extensive experiments show that IEC consistently improves generation quality across various datasets, efficiency techniques, and model architectures, establishing it as a practical and generalizable solution for test-time enhancement of efficient diffusion models. The code is available in https://github.com/zysxmu/IEC.

URLs: https://github.com/zysxmu/IEC.

replace Peeling Context from Cause for Molecular Property Prediction

Authors: Tao Li, Kaiyuan Hou, Tuan Vinh, Monika Raj, Carl Yang

Abstract: Deep models are used for molecular property prediction, yet they are often difficult to interpret and may rely on spurious context rather than causal structure, which reduces reliability under distribution shift and harms predictive performance. We introduce CLaP (Causal Layerwise Peeling), a framework that separates causal signal from context in a layerwise manner and integrates diverse graph representations of molecules. At each layer, a causal block performs a soft split into causal and non-causal branches, fuses causal evidence across modalities, and progressively removes batch-coupled context to focus on label-relevant structure, thereby limiting shortcut signals and stabilizing layerwise refinement. Across four molecular benchmarks, CLaP consistently improves MAE, MSE, and $R^2$ over competitive baselines. The model also produces atom-level causal saliency maps that highlight substructures responsible for predictions, providing actionable guidance for targeted molecular edits. Case studies confirm the accuracy of these maps and their alignment with chemical intuition. By peeling context from cause at every layer, the model yields predictors that are both accurate and interpretable for molecular design.

replace FMMI: Flow Matching Mutual Information Estimation

Authors: Ivan Butakov, Alexander Semenenko, Valeriya Kirova, Alexey Frolov, Ivan Oseledets

Abstract: We introduce a novel Mutual Information (MI) estimator that fundamentally reframes the discriminative approach. Instead of training a classifier to discriminate between joint and marginal distributions, we learn a normalizing flow that transforms one into the other. This technique produces a computationally efficient and precise MI estimate that scales well to high dimensions and across a wide range of ground-truth MI values.

replace Knowing What You Know Is Not Enough: Large Language Model Confidences Don't Align With Their Actions

Authors: Arka Pal, Teo Kitanovski, Arthur Liang, Akilesh Potti, Micah Goldblum

Abstract: Large language models (LLMs) are increasingly deployed in agentic and multi-turn workflows where they are tasked to perform actions of significant consequence. In order to deploy them reliably and manage risky outcomes in these settings, it is helpful to access model uncertainty estimates. However, confidence elicitation methods for LLMs are typically not evaluated directly in agentic settings; instead, they are evaluated on static datasets, such as Q&A benchmarks. In this work we investigate the relationship between confidence estimates elicited in static settings and the behavior of LLMs in interactive settings. We uncover a significant action-belief gap -- LLMs frequently take actions that contradict their elicited confidences. In a prediction market setting, we find that models often bet against their own high-confidence predictions; in a tool-use setting, models fail to reliably invoke information-seeking tools when their internal confidence is low; and in a user-challenge setting, models change their answers when they have high confidence in them, whilst sticking to answers they have low confidence in. Crucially, we show that static calibration is an insufficient predictor of consistency in the above dynamic settings, as stronger, better calibrated models are somtimes less consistent than their smaller and weaker open-source counterparts. Our results highlight a critical blind spot in current evaluation methodologies: ensuring that a model knows what it knows does not guarantee that it will act rationally on that knowledge.

replace Twice Sequential Monte Carlo for Tree Search

Authors: Yaniv Oren, Joery A. de Vries, Pascal R. van der Vaart, Matthijs T. J. Spaan, Wendelin B\"ohmer

Abstract: Model-based reinforcement learning (RL) methods that leverage search are responsible for many milestone breakthroughs in RL. Sequential Monte Carlo (SMC) recently emerged as an alternative to the Monte Carlo Tree Search (MCTS) algorithm which drove these breakthroughs. SMC is easier to parallelize and more suitable to GPU acceleration. However, it also suffers from large variance and path degeneracy which prevent it from scaling well with increased search depth, i.e., increased sequential compute. To address these problems, we introduce Twice Sequential Monte Carlo Tree Search (TSMCTS). Across discrete and continuous environments TSMCTS outperforms the SMC baseline as well as a popular modern version of MCTS as a policy improvement operator, scales favorably with sequential compute, reduces estimator variance and mitigates the effects of path degeneracy while retaining the properties that make SMC natural to parallelize.

replace LogSyn: A Few-Shot LLM Framework for Structured Insight Extraction from Unstructured General Aviation Maintenance Logs

Authors: Devansh Agarwal, Maitreyi Chatterjee, Biplab Chatterjee

Abstract: Aircraft maintenance logs hold valuable safety data but remain underused due to their unstructured text format. This paper introduces LogSyn, a framework that uses Large Language Models (LLMs) to convert these logs into structured, machine-readable data. Using few-shot in-context learning on 6,169 records, LogSyn performs Controlled Abstraction Generation (CAG) to summarize problem-resolution narratives and classify events within a detailed hierarchical ontology. The framework identifies key failure patterns, offering a scalable method for semantic structuring and actionable insight extraction from maintenance logs. This work provides a practical path to improve maintenance workflows and predictive analytics in aviation and related industries.

replace Decoupling and Damping: Structurally-Regularized Gradient Matching for Multimodal Graph Condensation

Authors: Lian Shen, Zhendan Chen, Meijia Song, Yinhui jiang, Ziming Su, Juan Liu, Xiangrong Liu

Abstract: In multimodal graph learning, graph structures that integrate information from multiple sources, such as vision and text, can more comprehensively model complex entity relationships. However, the continuous growth of their data scale poses a significant computational bottleneck for training. Graph condensation methods provide a feasible path forward by synthesizing compact and representative datasets. Nevertheless, existing condensation approaches generally suffer from performance limitations in multimodal scenarios, mainly due to two reasons: (1) semantic misalignment between different modalities leads to gradient conflicts; (2) the message-passing mechanism of graph neural networks further structurally amplifies such gradient noise. Based on this, we propose Structural Regularized Gradient Matching (SR-GM), a condensation framework for multimodal graphs. This method alleviates gradient conflicts between modalities through a gradient decoupling mechanism and introduces a structural damping regularizer to suppress the propagation of gradient noise in the topology, thereby transforming the graph structure from a noise amplifier into a training stabilizer. Extensive experiments on four multimodal graph datasets demonstrate the effectiveness of SR-GM, highlighting its state-of-the-art performance and cross-architecture generalization capabilities in multimodal graph dataset condensation.

replace How to Purchase Labels? A Cost-Effective Approach Using Active Learning Markets

Authors: Xiwen Huang, Pierre Pinson

Abstract: We introduce and analyse active learning markets as a way to purchase labels, in situations where analysts aim to acquire additional data to improve model fitting, or to better train models for predictive analytics applications. This comes in contrast to the many proposals that already exist to purchase features and examples. By originally formalising the market clearing as an optimisation problem, we integrate budget constraints and improvement thresholds into the label acquisition process. We focus on a single-buyer-multiple-seller setup and propose the use of two active learning strategies (variance based and query-by-committee based), paired with distinct pricing mechanisms. They are compared to benchmark baselines including random sampling and a greedy knapsack heuristic. The proposed strategies are validated on real-world datasets from two critical application domains: real estate pricing and energy forecasting. Results demonstrate the robustness of our approach, consistently achieving superior performance with fewer labels acquired compared to conventional methods. Our proposal comprises an easy-to-implement practical solution for optimising data acquisition in resource-constrained environments.

replace How to Correctly Report LLM-as-a-Judge Evaluations

Authors: Chungpa Lee, Thomas Zeng, Jongwon Jeong, Jy-yong Sohn, Kangwook Lee

Abstract: Large language models (LLMs) are widely used as scalable evaluators of model responses in lieu of human annotators. However, imperfect sensitivity and specificity of the LLM judges induce bias in naive evaluation scores. We propose a simple plug-in framework that corrects this bias and enables statistically principled uncertainty quantification. Our framework constructs confidence intervals that account for uncertainty from both the test dataset and a human-labeled calibration dataset. Additionally, it uses an adaptive strategy to allocate calibration samples for tighter intervals. Importantly, we characterize parameter regimes defined by the true evaluation score and the LLM judge's sensitivity and specificity in which our LLM-based evaluation yields more reliable estimates than human-only evaluation. Moreover, we show that our framework remains unbiased under distribution shift between the test and calibration datasets, in contrast to existing approaches.

replace Towards Active Synthetic Data Generation for Finetuning Language Models

Authors: Samuel Kessler, Menglin Xia, Daniel Madrigal Diaz, Dongge Han, Helia Heshemi, Saravan Rajmohan, Victor Ruehle, Jordan T. Ash

Abstract: A common and effective means for improving language model capabilities involves finetuning a ``student'' language model's parameters on generations from a more proficient ``teacher'' model. Termed ``synthetic data'', these generations are often produced before any student finetuning, but some work has considered generating new synthetic samples as training progresses. This paper studies and advocates for the latter case, where data are generated in an iterative, closed-loop fashion that is guided by the current state of the student model. For a fixed budget of generated samples, or a budget in terms of compute spent querying a teacher, we show that this curation of finetuning data affords improved student performance over static generation. Further, while there have been several LLM-specific methods proposed that operate in this regime, we find that simple, inexpensive selection criteria from the active learning literature tend to be most performant. We validate these claims across four mathematical and logical reasoning datasets using four different small language models.

replace Open-Set Domain Adaptation Under Background Distribution Shift: Challenges and A Provably Efficient Solution

Authors: Shravan Chaudhari, Yoav Wald, Suchi Saria

Abstract: As we deploy machine learning systems in the real world, a core challenge is to maintain a model that is performant even as the data shifts. Such shifts can take many forms: new classes may emerge that were absent during training, a problem known as open-set recognition, and the distribution of known categories may change. Guarantees on open-set recognition are mostly derived under the assumption that the distribution of known classes, which we call the background distribution, is fixed. In this paper we develop CoLOR, a method that is guaranteed to solve open-set recognition even in the challenging case where the background distribution shifts. We prove that the method works under benign assumptions that the novel class is separable from the non-novel classes, and provide theoretical guarantees that it outperforms a representative baseline in a simplified overparameterized setting. We develop techniques to make CoLOR scalable and robust, and perform comprehensive empirical evaluations on image and text data. The results show that CoLOR significantly outperforms existing open-set recognition methods under background shift. Moreover, we provide new insights into how factors such as the size of the novel class influences performance, an aspect that has not been extensively explored in prior work.

replace RNNs perform task computations by dynamically warping neural representations

Authors: Arthur Pellegrino, Angus Chadwick

Abstract: Analysing how neural networks represent data features in their activations can help interpret how they perform tasks. Hence, a long line of work has focused on mathematically characterising the geometry of such "neural representations." In parallel, machine learning has seen a surge of interest in understanding how dynamical systems perform computations on time-varying input data. Yet, the link between computation-through-dynamics and representational geometry remains poorly understood. Here, we hypothesise that recurrent neural networks (RNNs) perform computations by dynamically warping their representations of task variables. To test this hypothesis, we develop a Riemannian geometric framework that enables the derivation of the manifold topology and geometry of a dynamical system from the manifold of its inputs. By characterising the time-varying geometry of RNNs, we show that dynamic warping is a fundamental feature of their computations.

replace China Regional 3km Downscaling Based on Residual Corrective Diffusion Model

Authors: Honglu Sun, Hao Jing, Zhixiang Dai, Sa Xiao, Wei Xue, Jian Sun, Qifeng Lu

Abstract: A fundamental challenge in numerical weather prediction is to efficiently produce high-resolution forecasts. A common solution is applying downscaling methods, which include dynamical downscaling and statistical downscaling, to the outputs of global models. This work focuses on statistical downscaling, which establishes statistical relationships between low-resolution and high-resolution historical data using statistical models. Deep learning has emerged as a powerful tool for this task, giving rise to various high-performance super-resolution models, which can be directly applied for downscaling, such as diffusion models and Generative Adversarial Networks. This work relies on a diffusion-based downscaling framework named CorrDiff. In contrast to the original work of CorrDiff, the region considered in this work is nearly 40 times larger, and we not only consider surface variables as in the original work, but also encounter high-level variables (six pressure levels) as target downscaling variables. In addition, a global residual connection is added to improve accuracy. In order to generate the 3km forecasts for the China region, we apply our trained models to the 25km global grid forecasts of CMA-GFS, an operational global model of the China Meteorological Administration (CMA), and SFF, a data-driven deep learning-based weather model developed from Spherical Fourier Neural Operators (SFNO). CMA-MESO, a high-resolution regional model, is chosen as the baseline model. The experimental results demonstrate that the forecasts downscaled by our method generally outperform the direct forecasts of CMA-MESO in terms of MAE for the target variables. Our forecasts of radar composite reflectivity show that CorrDiff, as a generative model, can generate fine-scale details that lead to more realistic predictions compared to the corresponding deterministic regression models.

replace Error-Free Linear Attention is a Free Lunch: Exact Solution from Continuous-Time Dynamics

Authors: Jingdi Lei, Di Zhang, Soujanya Poria

Abstract: Linear-time attention and State Space Models (SSMs) promise to solve the quadratic cost bottleneck in long-context language models employing softmax attention. We introduce Error-Free Linear Attention (EFLA), a numerically stable, full parallelism and generalized formulation of the delta rule. Specifically, we formulate the online learning update as a continuous-time dynamical system and prove that its exact solution is not only attainable but also computable in linear time with full parallelism. By leveraging the rank-1 structure of the dynamics matrix, we directly derive the exact closed-form solution effectively. This attention mechanism is theoretically free from error accumulation, perfectly capturing the continuous dynamics while preserving the linear-time complexity. Through an extensive suite of experiments, we show that EFLA enables robust performance in noisy environments, achieving lower language modeling perplexity and superior downstream benchmark performance than DeltaNet without introducing additional parameters. Our work provides a new theoretical foundation for building high-fidelity, scalable linear-time attention models.

replace Investigating Data Pruning for Pretraining Biological Foundation Models at Scale

Authors: Yifan Wu, Jiyue Jiang, Xichen Ye, Yiqi Wang, Chang Zhou, Yitao Xu, Jiayang Chen, He Hu, Weizhong Zhang, Cheng Jin, Jiao Yuan, Yu Li

Abstract: Biological foundation models (BioFMs), pretrained on large-scale biological sequences, have recently shown strong potential in providing meaningful representations for diverse downstream bioinformatics tasks. However, such models often rely on millions to billions of training sequences and billions of parameters, resulting in prohibitive computational costs and significant barriers to reproducibility and accessibility, particularly for academic labs. To address these challenges, we investigate the feasibility of data pruning for BioFM pretraining and propose a post-hoc influence-guided data pruning framework tailored to biological domains. Our approach introduces a subset-based self-influence formulation that enables efficient estimation of sample importance at low computational cost, and builds upon it two simple yet effective selection strategies, namely Top-k Influence (Top I) and Coverage-Centric Influence (CCI). We empirically validate our method on two representative BioFMs, RNA-FM and ESM-C. For RNA, our framework consistently outperforms random selection baselines under an extreme pruning rate of over 99 percent, demonstrating its effectiveness. Furthermore, we show the generalizability of our framework on protein-related tasks using ESM-C. In particular, our coreset even outperforms random subsets that are ten times larger in both RNA and protein settings, revealing substantial redundancy in biological sequence datasets. These findings underscore the potential of influence-guided data pruning to substantially reduce the computational cost of BioFM pretraining, paving the way for more efficient, accessible, and sustainable biological AI research.

replace FLAME: Flow Enhanced Legendre Memory Models for General Time Series Forecasting

Authors: Xingjian Wu, Hanyin Cheng, Xiangfei Qiu, Zhengyu Li, Jilin Hu, Chenjuan Guo, Bin Yang

Abstract: In this work, we introduce FLAME, a family of extremely lightweight and capable Time Series Foundation Models, which support both deterministic and probabilistic forecasting via generative probabilistic modeling, thus ensuring both efficiency and robustness. FLAME utilizes the Legendre Memory for strong generalization capabilities. Through adapting variants of Legendre Memory, i.e., translated Legendre (LegT) and scaled Legendre (LegS), in the Encoding and Decoding phases, FLAME can effectively capture the inherent inductive bias within data and make efficient long-range inferences. To enhance the accuracy of probabilistic forecasting while keeping efficient, FLAME adopts a Normalization Flow based forecasting head, which can model the arbitrarily intricate distributions over the forecasting horizon in a generative manner. Comprehensive experiments on well-recognized benchmarks, including TSFM-Bench and ProbTS, demonstrate the consistent state-of-the-art zero-shot performance of FLAME on both deterministic and probabilistic forecasting tasks.

replace Dynamic Rank Reinforcement Learning for Adaptive Low-Rank Multi-Head Self Attention in Large Language Models

Authors: Caner Erden

Abstract: Dynamic Rank Reinforcement Learning (DR-RL) approximations rely on static rank assumptions, limiting their flexibility across diverse linguistic contexts. Our method dynamically modulates ranks based on real-time sequence dynamics, layer-specific sensitivities, and hardware constraints. The core innovation is a deep reinforcement learning agent that formulates rank selection as a sequential policy optimization problem, strictly balancing attention fidelity against computational latency. To ensure stability during inference, we derive and employ online matrix perturbation bounds, enabling incremental rank updates without the prohibitive cost of full decomposition. Furthermore, the integration of a lightweight Transformer-based policy network and batched Singular Value Decomposition (SVD) operations ensures scalable deployment on modern architectures. Extensive experiments demonstrate that DR-RL significantly reduces Floating Point Operations (FLOPs) by over 40% in long-sequence regimes (L > 4096) while maintaining downstream accuracy statistically equivalent to full-rank attention. Beyond standard language modeling benchmarks, we validate the real-world applicability of DR-RL on the GLUE benchmark. Specifically, our method achieves 92.78% accuracy on the SST-2 sentiment analysis task, matching the performance of full-rank baselines and outperforming static low-rank methods, such as Performer and Nystr\"omformer, by a significant margin.

replace Generative Modeling through Koopman Spectral Analysis: An Operator-Theoretic Perspective

Authors: Yuanchao Xu, Fengyi Li, Masahiro Fujisawa, Xiaoyuan Cheng, Youssef Marzouk, Isao Ishikawa

Abstract: We propose Koopman Spectral Wasserstein Gradient Descent (KSWGD), a particle-based generative modeling framework that learns the Langevin generator via Koopman theory and integrates it with Wasserstein gradient descent. Our key insight is that this spectral structure of the underlying distribution can be directly estimated from trajectory data via the Koopman operator, eliminating the need for explicit knowledge of the target potential. Additionally, we prove that KSWGD maintains an approximately constant dissipation rate, thereby establishing linear convergence and overcoming the vanishing-gradient phenomenon that hinders existing kernel-based particle methods. We further provide a Feynman--Kac interpretation that clarifies the method's probabilistic foundation. Experiments on compact manifolds, metastable multi-well systems, and high-dimensional stochastic partial differential equations demonstrate that KSWGD consistently outperforms baselines in both convergence speed and sample quality.

replace Enhancing Newton-Kaczmarz training of Kolmogorov-Arnold networks through concurrency

Authors: Andrew Polar, Michael Poluektov

Abstract: The present paper introduces concurrency-driven enhancements to the training algorithm for the Kolmogorov-Arnold networks (KANs) that is based on the Newton-Kaczmarz (NK) method. As indicated by prior research, the NK-based training for KANs offers state-of-the-art performance in terms of accuracy and training time on relatively large datasets, significantly overtaking classical neural networks based on multilayer perceptrons (MLPs). Although some elements of the algorithm can be parallelised (in particular, evaluation of the basis functions' values), a major limitation of the algorithm is the sequential application of the parameters' updates, which has not been resolved up to now. However, substantial acceleration is achievable. Three complementary strategies are proposed in the present paper: (i) a pre-training procedure tailored to the NK updates' structure, (ii) training on disjoint subsets of data, followed by models' merging, not in the context of federated learning, but as a mechanism for accelerating the convergence, and (iii) a parallelisation technique suitable for execution on field-programmable gate arrays (FPGAs), which is implemented and tested directly on the device. All experimental results presented in this work are fully reproducible, with the complete source codes available online.

replace PairFlow: Closed-Form Source-Target Coupling for Few-Step Generation in Discrete Flow Models

Authors: Mingue Park, Jisung Hwang, Seungwoo Yoo, Kyeongmin Yeo, Minhyuk Sung

Abstract: We introduce $\texttt{PairFlow}$, a lightweight preprocessing step for training Discrete Flow Models (DFMs) to achieve few-step sampling without requiring a pretrained teacher. DFMs have recently emerged as a new class of generative models for discrete data, offering strong performance. However, they suffer from slow sampling due to their iterative nature. Existing acceleration methods largely depend on finetuning, which introduces substantial additional training overhead. $\texttt{PairFlow}$ addresses this issue with a lightweight preprocessing step. Inspired by ReFlow and its extension to DFMs, we train DFMs from coupled samples of source and target distributions, without requiring any pretrained teacher. At the core of our approach is a closed-form inversion for DFMs, which allows efficient construction of paired source-target samples. Despite its extremely low cost, taking only up to 1.7% of the compute needed for full model training, $\texttt{PairFlow}$ matches or even surpasses the performance of two-stage training involving finetuning. Furthermore, models trained with our framework provide stronger base models for subsequent distillation, yielding further acceleration after finetuning. Experiments on molecular data as well as binary and RGB images demonstrate the broad applicability and effectiveness of our approach.

replace TS-Arena -- A Live Forecast Pre-Registration Platform

Authors: Marcel Meyer, Sascha Kaltenpoth, Henrik Albers, Kevin Zalipski, Oliver M\"uller

Abstract: Time Series Foundation Models (TSFMs) are transforming the field of forecasting. However, evaluating them on historical data is increasingly difficult due to the risks of train-test sample overlaps and temporal overlaps between correlated train and test time series. To address this, we introduce TS-Arena, a live forecasting platform that shifts evaluation from the known past to the unknown future. Building on the concept of continuous benchmarking, TS-Arena evaluates models on future data. Crucially, we introduce a strict forecasting pre-registration protocol: models must submit predictions before the ground-truth data physically exists. This makes test-set contamination impossible by design. The platform relies on a modular microservice architecture that harmonizes and structures data from different sources and orchestrates containerized model submissions. By enforcing a strict pre-registration protocol on live data streams, TS-Arena prevents information leakage offers a faster alternative to traditional static, infrequently repeated competitions (e.g. the M-Competitions). First empirical results derived from operating TS-Arena over one year of energy time series demonstrate that established TSFMs accumulate robust longitudinal scores over time, while the continuous nature of the benchmark simultaneously allows newcomers to demonstrate immediate competitiveness. TS-Arena provides the necessary infrastructure to assess the true generalization capabilities of modern forecasting models. The platform and corresponding code are available at https://ts-arena.live/.

URLs: https://ts-arena.live/.

replace Towards Reliable Evaluation of Adversarial Robustness for Spiking Neural Networks

Authors: Jihang Wang, Dongcheng Zhao, Ruolin Chen, Qian Zhang, Yi Zeng

Abstract: Spiking Neural Networks (SNNs) utilize spike-based activations to mimic the brain's energy-efficient information processing. However, the binary and discontinuous nature of spike activations causes vanishing gradients, making adversarial robustness evaluation via gradient descent unreliable. While improved surrogate gradient methods have been proposed, their effectiveness under strong adversarial attacks remains unclear. We propose a more reliable framework for evaluating SNN adversarial robustness. We theoretically analyze the degree of gradient vanishing in surrogate gradients and introduce the Adaptive Sharpness Surrogate Gradient (ASSG), which adaptively evolves the shape of the surrogate function according to the input distribution during attack iterations, thereby enhancing gradient accuracy while mitigating gradient vanishing. In addition, we design an adversarial attack with adaptive step size under the $L_\infty$ constraint-Stable Adaptive Projected Gradient Descent (SA-PGD), achieving faster and more stable convergence under imprecise gradients. Extensive experiments show that our approach substantially increases attack success rates across diverse adversarial training schemes, SNN architectures and neuron models, providing a more generalized and reliable evaluation of SNN adversarial robustness. The experimental results further reveal that the robustness of current SNNs has been significantly overestimated and highlighting the need for more dependable adversarial training methods. The code is released at https://github.com/craree/ASSG-SNNs-Robustness-Evaluation

URLs: https://github.com/craree/ASSG-SNNs-Robustness-Evaluation

replace Trust Region Masking for Long-Horizon LLM Reinforcement Learning

Authors: Yingru Li, Jiacai Liu, Jiawei Xu, Yuxuan Tong, Ziniu Li, Qian Liu, Baoxiang Wang

Abstract: Policy gradient methods for Large Language Models optimize a policy $\pi_\theta$ via a surrogate objective computed from samples of a rollout policy $\pi_{\text{roll}}$. However, modern LLM-RL pipelines suffer from unavoidable implementation divergences -- backend discrepancies, Mixture-of-Experts routing discontinuities, and distributed training staleness -- causing off-policy mismatch ($\pi_{\text{roll}} \neq \pi_\theta$) and approximation errors between the surrogate and the true objective. We demonstrate that classical trust region bounds on this error scale as $O(T^2)$ with sequence length $T$, rendering them vacuous for long-horizon tasks. To address this, we derive a family of bounds -- both KL-based and TV-based -- including a Pinsker-Marginal bound ($O(T^{3/2})$), a Mixed bound ($O(T)$), and an Adaptive bound that strictly generalizes the Pinsker-Marginal bound via per-position importance-ratio decomposition. Taking the minimum over all bounds yields the tightest known guarantee across all divergence regimes. Crucially, all bounds depend on the maximum token-level divergence $D_{\mathrm{KL}}^{\mathrm{tok,max}}$ (or $D_{\mathrm{TV}}^{\mathrm{tok,max}}$), a sequence-level quantity that cannot be controlled by token-independent methods like PPO clipping. We propose Trust Region Masking (TRM), which masks entire sequences violating the trust region, enabling the first non-vacuous monotonic improvement guarantees for long-horizon LLM-RL.

replace Categorical Reparameterization with Denoising Diffusion models

Authors: Samson Gourevitch, Alain Durmus, Eric Moulines, Jimmy Olsson, Yazid Janati

Abstract: Learning models with categorical variables requires optimizing expectations over discrete distributions, a setting in which stochastic gradient-based optimization is challenging due to the non-differentiability of categorical sampling. A common workaround is to replace the discrete distribution with a continuous relaxation, yielding a smooth surrogate that admits reparameterized gradient estimates via the reparameterization trick. Building on this idea, we introduce ReDGE, a novel and efficient diffusion-based soft reparameterization method for categorical distributions. Our approach defines a flexible class of gradient estimators that includes the Straight-Through estimator as a special case. Experiments spanning latent variable models and inference-time reward guidance in discrete diffusion models demonstrate that ReDGE consistently matches or outperforms existing gradient-based methods. The code will be made available at https://github.com/samsongourevitch/redge.

URLs: https://github.com/samsongourevitch/redge.

replace RealPDEBench: A Benchmark for Complex Physical Systems with Real-World Data

Authors: Peiyan Hu, Haodong Feng, Hongyuan Liu, Tongtong Yan, Wenhao Deng, Tianrun Gao, Rong Zheng, Haoren Zheng, Chenglei Yu, Chuanrui Wang, Kaiwen Li, Zhi-Ming Ma, Dezhi Zhou, Xingcai Lu, Dixia Fan, Tailin Wu

Abstract: Predicting the evolution of complex physical systems remains a central problem in science and engineering. Despite rapid progress in scientific Machine Learning (ML) models, a critical bottleneck is the lack of expensive real-world data, resulting in most current models being trained and validated on simulated data. Beyond limiting the development and evaluation of scientific ML, this gap also hinders research into essential tasks such as sim-to-real transfer. We introduce RealPDEBench, the first benchmark for scientific ML that integrates real-world measurements with paired numerical simulations. RealPDEBench consists of five datasets, three tasks, eight metrics, and ten baselines. We first present five real-world measured datasets with paired simulated datasets across different complex physical systems. We further define three tasks, which allow comparisons between real-world and simulated data, and facilitate the development of methods to bridge the two. Moreover, we design eight evaluation metrics, spanning data-oriented and physics-oriented metrics, and finally benchmark ten representative baselines, including state-of-the-art models, pretrained PDE foundation models, and a traditional method. Experiments reveal significant discrepancies between simulated and real-world data, while showing that pretraining with simulated data consistently improves both accuracy and convergence. In this work, we hope to provide insights from real-world data, advancing scientific ML toward bridging the sim-to-real gap and real-world deployment. Our benchmark, datasets, and instructions are available at https://realpdebench.github.io/.

URLs: https://realpdebench.github.io/.

replace HEEGNet: Hyperbolic Embeddings for EEG

Authors: Shanglin Li, Shiwen Chu, Okan Ko\c{c}, Yi Ding, Qibin Zhao, Motoaki Kawanabe, Ziheng Chen

Abstract: Electroencephalography (EEG)-based brain-computer interfaces facilitate direct communication with a computer, enabling promising applications in human-computer interactions. However, their utility is currently limited because EEG decoding often suffers from poor generalization due to distribution shifts across domains (e.g., subjects). Learning robust representations that capture underlying task-relevant information would mitigate these shifts and improve generalization. One promising approach is to exploit the underlying hierarchical structure in EEG, as recent studies suggest that hierarchical cognitive processes, such as visual processing, can be encoded in EEG. While many decoding methods still rely on Euclidean embeddings, recent work has begun exploring hyperbolic geometry for EEG. Hyperbolic spaces, regarded as the continuous analogue of tree structures, provide a natural geometry for representing hierarchical data. In this study, we first empirically demonstrate that EEG data exhibit hyperbolicity and show that hyperbolic embeddings improve generalization. Motivated by these findings, we propose HEEGNet, a hybrid hyperbolic network architecture to capture the hierarchical structure in EEG and learn domain-invariant hyperbolic embeddings. To this end, HEEGNet combines both Euclidean and hyperbolic encoders and employs a novel coarse-to-fine domain adaptation strategy. Extensive experiments on multiple public EEG datasets, covering visual evoked potentials, emotion recognition, and intracranial EEG, demonstrate that HEEGNet achieves state-of-the-art performance. The code is available at https://github.com/fightlesliefigt/HEEGNet

URLs: https://github.com/fightlesliefigt/HEEGNet

replace Distribution-Guided and Constrained Quantum Machine Unlearning

Authors: Nausherwan Malik, Zubair Khalid, Muhammad Faryad

Abstract: Machine unlearning aims to remove the influence of specific training data from a learned model without full retraining. While recent work has begun to explore unlearning in quantum machine learning, existing approaches largely rely on fixed, uniform target distributions and do not explicitly control the trade-off between forgetting and retained model behaviour. In this work, we propose a distribution-guided framework for class-level quantum machine unlearning that treats unlearning as a constrained optimization problem. Our method introduces a tunable target distribution derived from model similarity statistics, decoupling the suppression of forgotten-class confidence from assumptions about redistribution among retained classes. We further incorporate an anchor-based preservation constraint that explicitly maintains predictive behaviour on selected retained data, yielding a controlled optimization trajectory that limits deviation from the original model. We evaluate the approach on variational quantum classifiers trained on the Iris and Covertype datasets. Results demonstrate sharp suppression of forgotten-class confidence, minimal degradation of retained-class performance, and closer alignment with the gold retrained model baselines compared to uniform-target unlearning. These findings highlight the importance of target design and constraint-based formulations for reliable and interpretable quantum machine unlearning.

replace A Dual Pipeline Machine Learning Framework for Automated Multi Class Sleep Disorder Screening Using Hybrid Resampling and Ensemble Learning

Authors: Md Sultanul Islam Ovi, Muhsina Tarannum Munfa, G. M. M Miftahul Alam Adib, Syed Sabbir Hasan

Abstract: Accurate classification of sleep disorders, particularly insomnia and sleep apnea, is important for reducing long term health risks and improving patient quality of life. However, clinical sleep studies are resource intensive and are difficult to scale for population level screening. This paper presents a Dual Pipeline Machine Learning Framework for multi class sleep disorder screening using the Sleep Health and Lifestyle dataset. The framework consists of two parallel processing streams: a statistical pipeline that targets linear separability using Mutual Information and Linear Discriminant Analysis, and a wrapper based pipeline that applies Boruta feature selection with an autoencoder for non linear representation learning. To address class imbalance, we use the hybrid SMOTETomek resampling strategy. In experiments, Extra Trees and K Nearest Neighbors achieved an accuracy of 98.67%, outperforming recent baselines on the same dataset. Statistical testing using the Wilcoxon Signed Rank Test indicates that the improvement over baseline configurations is significant, and inference latency remains below 400 milliseconds. These results suggest that the proposed dual pipeline design supports accurate and efficient automated screening for non invasive sleep disorder risk stratification.

replace A Review of Online Diffusion Policy RL Algorithms for Scalable Robotic Control

Authors: Wonhyeok Choi, Shutong Ding, Minwoo Choi, Jungwan Woo, Kyumin Hwang, Jaeyeul Kim, Ye Shi, Sunghoon Im

Abstract: Diffusion policies have emerged as a powerful approach for robotic control, demonstrating superior expressiveness in modeling multimodal action distributions compared to conventional policy networks. However, their integration with online reinforcement learning remains challenging due to fundamental incompatibilities between diffusion model training objectives and standard RL policy improvement mechanisms. This paper presents the first comprehensive review and empirical analysis of current Online Diffusion Policy Reinforcement Learning (Online DPRL) algorithms for scalable robotic control systems. We propose a novel taxonomy that categorizes existing approaches into four distinct families--Action-Gradient, Q-Weighting, Proximity-Based, and Backpropagation Through Time (BPTT) methods--based on their policy improvement mechanisms. Through extensive experiments on a unified NVIDIA Isaac Lab benchmark encompassing 12 diverse robotic tasks, we systematically evaluate representative algorithms across five critical dimensions: task diversity, parallelization capability, diffusion step scalability, cross-embodiment generalization, and environmental robustness. Our analysis identifies key findings regarding the fundamental trade-offs inherent in each algorithmic family, particularly concerning sample efficiency and scalability. Furthermore, we reveal critical computational and algorithmic bottlenecks that currently limit the practical deployment of online DPRL. Based on these findings, we provide concrete guidelines for algorithm selection tailored to specific operational constraints and outline promising future research directions to advance the field toward more general and scalable robotic learning systems.

replace Leveraging Soft Prompts for Privacy Attacks in Federated Prompt Tuning

Authors: Quan Minh Nguyen, Min-Seon Kim, Hoang M. Ngo, Trong Nghia Hoang, Hyuk-Yoon Kwon, My T. Thai

Abstract: Membership inference attack (MIA) poses a significant privacy threat in federated learning (FL) as it allows adversaries to determine whether a client's private dataset contains a specific data sample. While defenses against membership inference attacks in standard FL have been well studied, the recent shift toward federated fine-tuning has introduced new, largely unexplored attack surfaces. To highlight this vulnerability in the emerging FL paradigm, we demonstrate that federated prompt-tuning, which adapts pre-trained models with small input prefixes to improve efficiency, also exposes a new vector for privacy attacks. We propose PromptMIA, a membership inference attack tailored to federated prompt-tuning, in which a malicious server can insert adversarially crafted prompts and monitors their updates during collaborative training to accurately determine whether a target data point is in a client's private dataset. We formalize this threat as a security game and empirically show that PromptMIA consistently attains high advantage in this game across diverse benchmark datasets. Our theoretical analysis further establishes a lower bound on the attack's advantage which explains and supports the consistently high advantage observed in our empirical results. We also investigate the effectiveness of standard membership inference defenses originally developed for gradient or output based attacks and analyze their interaction with the distinct threat landscape posed by PromptMIA. The results highlight non-trivial challenges for current defenses and offer insights into their limitations, underscoring the need for defense strategies that are specifically tailored to prompt-tuning in federated settings.

replace Free-RBF-KAN: Kolmogorov-Arnold Networks with Adaptive Radial Basis Functions for Efficient Function Learning

Authors: Shao-Ting Chiu, Siu Wun Cheung, Ulisses Braga-Neto, Chak Shing Lee, Rui Peng Li

Abstract: Kolmogorov-Arnold Networks (KANs) offer a promising framework for approximating complex nonlinear functions, yet the original B-spline formulation suffers from significant computational overhead due to De Boor algorithm. While recent RBF-based variants improve efficiency, they often sacrifice the approximation accuracy inherent in the original spline-based design. To bridge this gap, we propose Free-RBF-KAN, an architecture that integrates adaptive learning grids and trainable smoothness parameters to enable expressive, high-resolution function approximation. Our method utilizes learnable RBF shapes that dynamically align with activation patterns, and we provide the first formal universal approximation proof for the RBF-KAN family. Empirical evaluations across multiscale regression, physics-informed PDEs, and operator learning demonstrate that Free-RBF-KAN can achieve accuracy comparable to its B-spline counterparts while delivering significantly faster training and inference. These results establish Free-RBF-KAN as an efficient and adaptive alternative for high-dimensional structured modeling tasks.

replace On Evaluation of Unsupervised Feature Selection for Pattern Classification

Authors: Gyu-Il Kim, Dae-Won Kim, Jaesung Lee

Abstract: Unsupervised feature selection aims to identify a compact subset of features that captures the intrinsic structure of data without supervised label. Most existing studies evaluate the performance of methods using the single-label dataset that can be instantiated by selecting a label from multi-label data while maintaining the original features. Because the chosen label can vary arbitrarily depending on the experimental setting, the superiority among compared methods can be changed with regard to which label happens to be selected. Thus, evaluating unsupervised feature selection methods based solely on single-label accuracy is unreasonable for assessing their true discriminative ability. This study revisits this evaluation paradigm by adopting a multi-label classification framework. Experiments on 21 multi-label datasets using several representative methods demonstrate that performance rankings differ markedly from those reported under single-label settings, suggesting the possibility of multi-label evaluation settings for fair and reliable comparison of unsupervised feature selection methods.

replace BalDRO: A Distributionally Robust Optimization based Framework for Large Language Model Unlearning

Authors: Pengyang Shao, Naixin Zhai, Lei Chen, Yonghui Yang, Fengbin Zhu, Xun Yang, Meng Wang

Abstract: As Large Language Models (LLMs) increasingly shape online content, removing targeted information from well-trained LLMs (also known as LLM unlearning) has become critical for web governance. A key challenge lies in sample-wise imbalance within the forget set: different samples exhibit widely varying unlearning difficulty, leading to asynchronous forgetting where some knowledge remains insufficiently erased while others become over-forgotten. To address this, we propose BalDRO, a novel and efficient framework for balanced LLM unlearning. BalDRO formulates unlearning as a min-sup process: an inner step identifies a worst-case data distribution that emphasizes hard-to-unlearn samples, while an outer step updates model parameters under this distribution. We instantiate BalDRO via two efficient variants: BalDRO-G, a discrete GroupDRO-based approximation focusing on high-loss subsets, and BalDRO-DV, a continuous Donsker-Varadhan dual method enabling smooth adaptive weighting within standard training pipelines. Experiments on TOFU and MUSE show that BalDRO significantly improves both forgetting quality and model utility over existing methods, and we release code for reproducibility.

replace Global Optimization By Gradient from Hierarchical Score-Matching Spaces

Authors: Ming Li

Abstract: Gradient-based methods are widely used to solve various optimization problems, however, they are either constrained by local optima dilemmas, simple convex constraints, and continuous differentiability requirements, or limited to low-dimensional simple problems. This work solve these limitations and restrictions by unifying all optimization problems with various complex constraints as a general hierarchical optimization objective without constraints, which is optimized by gradient obtained through score matching. The proposed method is verified through simple-constructed and complex-practical experiments. Even more importantly, it reveals the profound connection between global optimization and diffusion based generative modeling.

replace Report for NSF Workshop on AI for Electronic Design Automation

Authors: Deming Chen, Vijay Ganesh, Weikai Li, Yingyan Celine Lin, Yong Liu, Subhasish Mitra, David Z. Pan, Ruchir Puri, Jason Cong, Yizhou Sun

Abstract: This report distills the discussions and recommendations from the NSF Workshop on AI for Electronic Design Automation (EDA), held on December 10, 2024 in Vancouver alongside NeurIPS 2024. Bringing together experts across machine learning and EDA, the workshop examined how AI-spanning large language models (LLMs), graph neural networks (GNNs), reinforcement learning (RL), neurosymbolic methods, etc.-can facilitate EDA and shorten design turnaround. The workshop includes four themes: (1) AI for physical synthesis and design for manufacturing (DFM), discussing challenges in physical manufacturing process and potential AI applications; (2) AI for high-level and logic-level synthesis (HLS/LLS), covering pragma insertion, program transformation, RTL code generation, etc.; (3) AI toolbox for optimization and design, discussing frontier AI developments that could potentially be applied to EDA tasks; and (4) AI for test and verification, including LLM-assisted verification tools, ML-augmented SAT solving, security/reliability challenges, etc. The report recommends NSF to foster AI/EDA collaboration, invest in foundational AI for EDA, develop robust data infrastructures, promote scalable compute infrastructure, and invest in workforce development to democratize hardware design and enable next-generation hardware systems. The workshop information can be found on the website https://ai4eda-workshop.github.io/.

URLs: https://ai4eda-workshop.github.io/.

replace PUNCH: Physics-informed Uncertainty-aware Network for Coronary Hemodynamics

Authors: Sukirt Thakur, Marcus Roper, Yang Zhou, Dmitry Yu. Isaev, Reza Akbarian Bafghi, Brahmajee K. Nallamothu, C. Alberto Figueroa, Srinivas Paruchuri, Scott Burger, Carlos Collet, Maziar Raissi

Abstract: More than 10 million coronary angiograms are performed globally each year, providing a gold standard for detecting obstructive coronary artery disease. Yet, no obstructive lesions are identified in 70% of patients evaluated for ischemic heart disease. Up to half of these patients have undiagnosed, life-limiting coronary microvascular dysfunction (CMD), which remains under-detected due to the limited availability of invasive tools required to measure coronary flow reserve (CFR). Here, we introduce PUNCH, a non-invasive, uncertainty-aware framework for estimating CFR directly from standard coronary angiography. PUNCH integrates physics-informed neural networks with variational inference to infer coronary blood flow from first-principles models of contrast transport, without requiring ground-truth flow measurements or population-level training. The pipeline runs in approximately three minutes per patient on a single GPU. Validated on synthetic angiograms with controlled noise and imaging artifacts, as well as on clinical bolus thermodilution data from 20 patients, PUNCH demonstrates accurate and uncertainty-calibrated CFR estimation. This approach establishes a new paradigm for CMD diagnosis and illustrates how physics-informed inference can substantially expand the diagnostic utility of available clinical imaging.

replace Fast KVzip: Efficient and Accurate LLM Inference with Gated KV Eviction

Authors: Jang-Hyun Kim, Dongyoon Han, Sangdoo Yun

Abstract: Efficient key-value (KV) cache management is crucial for the practical deployment of large language models (LLMs), yet existing compression techniques often incur a trade-off between performance degradation and computational overhead. We propose a novel gating-based KV cache eviction method for frozen-weight LLMs that achieves high compression ratios with negligible computational cost. Our approach introduces lightweight sink-attention gating modules to identify and retain critical KV pairs, and integrates seamlessly into both the prefill and decoding stages. The proposed gate training algorithm relies on forward passes of an LLM, avoiding expensive backpropagation, while achieving strong task generalization through a task-agnostic reconstruction objective. Extensive experiments across the Qwen2.5-1M, Qwen3, and Gemma3 families show that our method maintains near-lossless performance while evicting up to 70% of the KV cache. The results are consistent across a wide range of tasks, including long-context understanding, code comprehension, and mathematical reasoning, demonstrating the generality of our approach.

replace Rethinking Cross-Modal Fine-Tuning: Optimizing the Interaction between Feature Alignment and Target Fitting

Authors: Trong Khiem Tran, Manh Cuong Dao, Phi Le Nguyen, Thao Nguyen Truong, Trong Nghia Hoang

Abstract: Adapting pre-trained models to unseen feature modalities has become increasingly important due to the growing need for cross-disciplinary knowledge integration. A key challenge here is how to align the representation of new modalities with the most relevant parts of the pre-trained model's representation space to enable accurate knowledge transfer. This requires combining feature alignment with target fine-tuning, but uncalibrated combinations can exacerbate misalignment between the source and target feature-label structures and reduce target generalization. Existing work, however, lacks a theoretical understanding of this critical interaction between feature alignment and target fitting. To bridge this gap, we develop a principled framework that establishes a provable generalization bound on the target error, which explains the interaction between feature alignment and target fitting through a novel concept of feature-label distortion. This bound offers actionable insights into how this interaction should be optimized for practical algorithm design. The resulting approach achieves significantly improved performance over state-of-the-art methods across a wide range of benchmark datasets.

replace Rank-1 Approximation of Inverse Fisher for Natural Policy Gradients in Deep Reinforcement Learning

Authors: Yingxiao Huo, Satya Prakash Dash, Radu Stoican, Samuel Kaski, Mingfei Sun

Abstract: Natural gradients have long been studied in deep reinforcement learning due to their fast convergence properties and covariant weight updates. However, computing natural gradients requires inversion of the Fisher Information Matrix (FIM) at each iteration, which is computationally prohibitive in nature. In this paper, we present an efficient and scalable natural policy optimization technique that leverages a rank-1 approximation to full inverse-FIM. We theoretically show that under certain conditions, a rank-1 approximation to inverse-FIM converges faster than policy gradients and, under some conditions, enjoys the same sample complexity as stochastic policy gradient methods. We benchmark our method on a diverse set of environments and show that it achieves superior performance to standard actor-critic and trust-region baselines.

replace From Fuzzy to Exact: The Halo Architecture for Infinite-Depth Reasoning via Rational Arithmetic

Authors: Hansheng Ren

Abstract: The pursuit of scale in deep learning has entrenched a trade-off: computational throughput is prioritized at the expense of numerical precision. We argue this compromise is fundamentally at odds with the requirements of general intelligence. We propose the \textit{Exactness Hypothesis}: high-order causal reasoning -- a cornerstone of AGI -- demands a substrate supporting arbitrary-precision, logically consistent arithmetic. We trace prevalent LLM failures, such as logical hallucinations and incoherence, to the inherent limitations of IEEE 754 floating-point arithmetic, where approximation errors compound catastrophically in deep functions. As a solution, we present the Halo Architecture, which transitions the computational foundation from approximate reals ($\mathbb{R}$) to exact rationals ($\mathbb{Q}$). Halo is realized through a custom Exact Inference Unit (EIU), whose design -- featuring asynchronous MIMD reduction and dual-modular redundancy -- resolves the performance and reliability bottlenecks of exact computation at scale. Crucially, we theoretically prove that Halo's number-theoretic quantization yields a quadratic error decay ($\mathcal{O}(D^{-2})$), strictly superior to the linear barrier of standard fixed-point arithmetic. In rigorous simulations, 600B-parameter BF16 models fail in chaotic systems within steps, while Halo sustains perfect numerical fidelity indefinitely. Our work posits exact arithmetic as non-negotiable for advancing reasoning-capable AGI and provides a co-designed hardware-software path toward verifiable, exascale-ready AI systems

replace Decoupled Split Learning via Auxiliary Loss

Authors: Anower Zihad, Felix Owino, Ming Tang, Chao Huang

Abstract: Split learning is a distributed training paradigm where a neural network is partitioned between clients and a server, which allows data to remain at the client while only intermediate activations are shared. Traditional split learning relies on end-to-end backpropagation across the client-server split point. This incurs a large communication overhead (i.e., forward activations and backward gradients need to be exchanged every iteration) and significant memory use (for storing activations and gradients). In this paper, we develop a beyond-backpropagation training method for split learning. In this approach, the client and server train their model partitions semi-independently, using local loss signals instead of propagated gradients. In particular, the client's network is augmented with a small auxiliary classifier at the split point to provide a local error signal, while the server trains on the client's transmitted activations using the true loss function. This decoupling removes the need to send backward gradients, which cuts communication costs roughly in half and also reduces memory overhead (as each side only stores local activations for its own backward pass). We evaluate our approach on CIFAR-10 and CIFAR-100. Our experiments show two key results. First, the proposed approach achieves performance on par with standard split learning that uses backpropagation. Second, it significantly reduces communication (of transmitting activations/gradient) by 50% and peak memory usage by up to 58%.

replace Safe Exploration via Policy Priors

Authors: Manuel Wendl, Yarden As, Manish Prajapat, Anton Pollak, Stelian Coros, Andreas Krause

Abstract: Safe exploration is a key requirement for reinforcement learning (RL) agents to learn and adapt online, beyond controlled (e.g. simulated) environments. In this work, we tackle this challenge by utilizing suboptimal yet conservative policies (e.g., obtained from offline data or simulators) as priors. Our approach, SOOPER, uses probabilistic dynamics models to optimistically explore, yet pessimistically fall back to the conservative policy prior if needed. We prove that SOOPER guarantees safety throughout learning, and establish convergence to an optimal policy by bounding its cumulative regret. Extensive experiments on key safe RL benchmarks and real-world hardware demonstrate that SOOPER is scalable, outperforms the state-of-the-art and validate our theoretical guarantees in practice.

replace Membership Inference Attacks Against Fine-tuned Diffusion Language Models

Authors: Yuetian Chen, Kaiyuan Zhang, Yuntao Du, Edoardo Stoppa, Charles Fleming, Ashish Kundu, Bruno Ribeiro, Ninghui Li

Abstract: Diffusion Language Models (DLMs) represent a promising alternative to autoregressive language models, using bidirectional masked token prediction. Yet their susceptibility to privacy leakage via Membership Inference Attacks (MIA) remains critically underexplored. This paper presents the first systematic investigation of MIA vulnerabilities in DLMs. Unlike the autoregressive models' single fixed prediction pattern, DLMs' multiple maskable configurations exponentially increase attack opportunities. This ability to probe many independent masks dramatically improves detection chances. To exploit this, we introduce SAMA (Subset-Aggregated Membership Attack), which addresses the sparse signal challenge through robust aggregation. SAMA samples masked subsets across progressive densities and applies sign-based statistics that remain effective despite heavy-tailed noise. Through inverse-weighted aggregation prioritizing sparse masks' cleaner signals, SAMA transforms sparse memorization detection into a robust voting mechanism. Experiments on nine datasets show SAMA achieves 30% relative AUC improvement over the best baseline, with up to 8 times improvement at low false positive rates. These findings reveal significant, previously unknown vulnerabilities in DLMs, necessitating the development of tailored privacy defenses.

replace Conditional PED-ANOVA: Hyperparameter Importance in Hierarchical & Dynamic Search Spaces

Authors: Kaito Baba, Yoshihiko Ozaki, Shuhei Watanabe

Abstract: We propose conditional PED-ANOVA (condPED-ANOVA), a principled framework for estimating hyperparameter importance (HPI) in conditional search spaces, where the presence or domain of a hyperparameter can depend on other hyperparameters. Although the original PED-ANOVA provides a fast and efficient way to estimate HPI within the top-performing regions of the search space, it assumes a fixed, unconditional search space and therefore cannot properly handle conditional hyperparameters. To address this, we introduce a conditional HPI for top-performing regions and derive a closed-form estimator that accurately reflects conditional activation and domain changes. Experiments show that naive adaptations of existing HPI estimators yield misleading or uninterpretable importances in conditional settings, whereas condPED-ANOVA consistently provides meaningful importances that reflect the underlying conditional structure. Our code is publicly available at https://github.com/kAIto47802/condPED-ANOVA.

URLs: https://github.com/kAIto47802/condPED-ANOVA.

replace Solver-in-the-Loop: MDP-Based Benchmarks for Self-Correction and Behavioral Rationality in Operations Research

Authors: Ruicheng Ao, David Simchi-Levi, Xinshang Wang

Abstract: Operations Research practitioners routinely debug infeasible models through an iterative process: analyzing Irreducible Infeasible Subsystems (\IIS{}), identifying constraint conflicts, and systematically repairing formulations until feasibility is achieved. Yet existing LLM benchmarks evaluate OR as one-shot translation -- given a problem description, generate solver code -- ignoring this diagnostic loop entirely. We introduce two benchmarks that place the \textbf{solver in the evaluation loop}. \textbf{\ORDebug{}} evaluates iterative self-correction through 5,000+ problems spanning 9 error types; each repair action triggers solver re-execution and \IIS{} recomputation, providing deterministic, verifiable feedback. \textbf{\ORBias{}} evaluates behavioral rationality through 2,000 newsvendor instances (1,000 ID + 1,000 OOD), measuring systematic deviations from closed-form optimal policies. Across 26 models and 12,000+ samples, we find that domain-specific RLVR training enables an 8B model to surpass frontier APIs: 95.3\% vs 86.2\% recovery rate (+9.1\%), 62.4\% vs 47.8\% diagnostic accuracy (+14.6\%), and 2.25 vs 3.78 steps to resolution (1.7$\times$ faster). On \ORBias{}, curriculum training achieves the only negative ID$\rightarrow$OOD bias drift among models evaluated (-9.6\%), reducing systematic bias by 48\% (from 20.0\% to 10.4\%). These results demonstrate that process-level evaluation with verifiable oracles enables targeted training that outperforms scale.

replace SAGE: Sequence-level Adaptive Gradient Evolution for Generative Recommendation

Authors: Yu Xie, Xing Kai Ren, Ying Qi, Hu Yao

Abstract: While works such as OneRec have validated the scaling laws of Large Language Models (LLMs) in recommender systems, they rely on a cumbersome separate vocabulary. This dependency prevents the model architecture from reusing native LLM vocabularies, resulting in high maintenance costs and poor scalability. In response, we aim to efficiently reuse open-source LLM architectures without constructing a separate tokenization vocabulary. Furthermore, we identify that the optimization strategy of OneRec Gradient Bounded Policy Optimization (GBPO),suffers from a "Symmetric Conservatism" problem: its static gradient boundaries structurally suppress the update momentum required for cold-start items and fail to prevent diversity collapse in high-noise environments.To address this issue, we propose SAGE (Sequence-level Adaptive Gradient Evolution), a unified optimization framework tailored for list-wise generative recommendation. SAGE introduces two key innovations:(1) Sequence-level Signal Decoupling: By combining a geometric mean importance ratio with decoupled multi-objective advantages, we eliminate token-level variance and resolve the "Reward Collapse" problem. (2) Asymmetric Adaptive Dynamics: We construct a dynamic gradient manifold that applies a "Boost Factor" to high-potential cold start items to achieve super-linear updates and employs an "Entropy Aware Penalty" to break information cocoons. Theoretical analysis and empirical results demonstrate that SAGE effectively unblocks cold-start traffic and sustains recommendation diversity, all while retaining the numerical stability of GBPO.

replace HER: Human-like Reasoning and Reinforcement Learning for LLM Role-playing

Authors: Chengyu Du, Xintao Wang, Aili Chen, Weiyuan Li, Rui Xu, Junteng Liu, Zishan Huang, Rong Tian, Zijun Sun, Yuhao Li, Liheng Feng, Deming Ding, Pengyu Zhao, Yanghua Xiao

Abstract: LLM role-playing, i.e., using LLMs to simulate specific personas, has emerged as a key capability in various applications, such as companionship, content creation, and digital games. While current models effectively capture character tones and knowledge, simulating the inner thoughts behind their behaviors remains a challenge. Towards cognitive simulation in LLM role-play, previous efforts mainly suffer from two deficiencies: data with high-quality reasoning traces, and reliable reward signals aligned with human preferences. In this paper, we propose HER, a unified framework for cognitive-level persona simulation. HER introduces dual-layer thinking, which distinguishes characters' first-person thinking from LLMs' third-person thinking. To bridge these gaps, we curate reasoning-augmented role-playing data via reverse engineering and construct human-aligned principles and reward models. Leveraging these resources, we train HER models based on Qwen3-32B via supervised and reinforcement learning. Extensive experiments validate the effectiveness of our approach. Notably, our models significantly outperform the Qwen3-32B baseline, achieving a 30.26 improvement on the CoSER benchmark and a 14.97 gain on the Minimax Role-Play Bench. Our datasets, principles, and models will be released to facilitate future research.

replace HistoPrism: Unlocking Functional Pathway Analysis from Pan-Cancer Histology via Gene Expression Prediction

Authors: Susu Hu, Qinghe Zeng, Nithya Bhasker, Jakob Nikolas Kather, Stefanie Speidel

Abstract: Predicting spatial gene expression from H&E histology offers a scalable and clinically accessible alternative to sequencing, but realizing clinical impact requires models that generalize across cancer types and capture biologically coherent signals. Prior work is often limited to per-cancer settings and variance-based evaluation, leaving functional relevance underexplored. We introduce HistoPrism, an efficient transformer-based architecture for pan-cancer prediction of gene expression from histology. To evaluate biological meaning, we introduce a pathway-level benchmark, shifting assessment from isolated gene-level variance to coherent functional pathways. HistoPrism not only surpasses prior state-of-the-art models on highly variable genes , but also more importantly, achieves substantial gains on pathway-level prediction, demonstrating its ability to recover biologically coherent transcriptomic patterns. With strong pan-cancer generalization and improved efficiency, HistoPrism establishes a new standard for clinically relevant transcriptomic modeling from routinely available histology.

replace Elastic Spectral State Space Models for Budgeted Inference

Authors: Dachuan Song, Xuan Wang

Abstract: Foundation models are typically trained at a fixed computational capacity, while real-world applications require deployment across platforms with different resource constraints. Current approaches usually rely on training families of model variants or model distillation, which requires additional training and supports only a pre-selected set of sizes rather than fine-grained adaptation at runtime. In this paper, we propose Elastic Spectral State Space Models (ES-SSM), which require only one-time training at full capacity, but can be directly truncated into arbitrary scales for budgeted, runtime inference without retraining. Our ES-SSM builds on Hankel spectral filtering over a state space model (SSM), coupled with a lightweight input-adaptive gate trained under randomized spectral budgets. Using a shared masked normalization rule over the ordered spectral channels, we encourage predictive capability to concentrate in low-index components, while higher-index components act primarily as refinement. We test our algorithm across long-sequence benchmarks spanning text, logic, retrieval, vision, and audio. We demonstrate that a single ES-SSM model trained once can be truncated to provide competitive performance compared with modern Transformer and SSM baselines at similar parameter scales. Furthermore, by testing under various runtime budgets, we observe smooth and stable budget-performance curves over a wide range of truncation levels.

replace RAPTOR-AI for Disaster OODA Loop: Hierarchical Multimodal RAG with Experience-Driven Agentic Decision-Making

Authors: Takato Yasuno

Abstract: Humanitarian Assistance and Disaster Relief (HADR) operations demand rapid synthesis of multimodal information for time-critical decision-making under extreme uncertainty. Traditional information systems struggle with the fragmented, multimodal nature of disaster data and lack adaptive reasoning capabilities essential for dynamic emergency contexts. This work introduces RAPTOR-AI, an agentic multimodal Retrieval-Augmented Generation (RAG) framework that advances beyond conventional static knowledge bases by implementing dynamic, experience-driven decision support for disaster response. The system addresses HADR requirements across initial rescue, recovery, and reconstruction phases through three key innovations: hierarchical multimodal knowledge construction from diverse sources (textual reports, aerial imagery, historical documentation), entropy-aware agentic control that dynamically selects optimal retrieval strategies based on situational context, and experiential knowledge integration using LoRA adaptation for both expert and non-expert responders. The framework constructs hierarchical knowledge trees from 46 tsunami-related PDFs (2,378 pages) using BLIP-based image understanding, ColVBERT embeddings, and long-context summarization within the OODA loop (Observe, Orient, Decide, Act) tactical framework. Experiments demonstrate significant improvements over existing approaches: 23\% improvement in retrieval precision, 31\% better situational grounding, and 27\% enhanced task decomposition accuracy, with efficient scaling up to 3,000 document chunks.

replace PHAT: Modeling Period Heterogeneity for Multivariate Time Series Forecasting

Authors: Jiaming Ma, Qihe Huang, Guanjun Wang, Haofeng Ma, Sheng Huang, Zhengyang Zhou, Pengkun Wang, Binwu Wang, Yang Wang

Abstract: While existing multivariate time series forecasting models have advanced significantly in modeling periodicity, they largely neglect the periodic heterogeneity common in real-world data, where variables exhibit distinct and dynamically changing periods. To effectively capture this periodic heterogeneity, we propose PHAT (Period Heterogeneity-Aware Transformer). Specifically, PHAT arranges multivariate inputs into a three-dimensional "periodic bucket" tensor, where the dimensions correspond to variable group characteristics with similar periodicity, time steps aligned by phase, and offsets within the period. By restricting interactions within buckets and masking cross-bucket connections, PHAT effectively avoids interference from inconsistent periods. We also propose a positive-negative attention mechanism, which captures periodic dependencies from two perspectives: periodic alignment and periodic deviation. Additionally, the periodic alignment attention scores are decomposed into positive and negative components, with a modulation term encoding periodic priors. This modulation constrains the attention mechanism to more faithfully reflect the underlying periodic trends. A mathematical explanation is provided to support this property. We evaluate PHAT comprehensively on 14 real-world datasets against 18 baselines, and the results show that it significantly outperforms existing methods, achieving highly competitive forecasting performance. Our sources is available at GitHub.

replace OLion: Approaching the Hadamard Ideal by Intersecting Spectral and $\ell_{\infty}$ Implicit Biases

Authors: Zixiao Wang, Yifei Shen, Huishuai Zhang

Abstract: Many optimizers can be interpreted as steepest-descent methods under norm-induced geometries, and thus inherit corresponding implicit biases. We introduce \nameA{} (\fullname{}), which combines spectral control from orthogonalized update directions with $\ell_\infty$-style coordinate control from sign updates. \nameA{} forms a Lion-style momentum direction, approximately orthogonalizes it via a few Newton--Schulz iterations, and then applies an entrywise sign, providing an efficient approximation to taking a maximal step over the intersection of the spectral and $\ell_\infty$ constraint sets (a scaled Hadamard-like set for matrix parameters). Despite the strong nonlinearity of orthogonalization and sign, we prove convergence under a mild, empirically verified diagonal-isotropy assumption. Across large-scale language and vision training, including GPT-2 and Llama pretraining, SiT image pretraining, and supervised fine-tuning, \nameA{} matches or outperforms AdamW and Muon under comparable tuning while using only momentum-level optimizer state, and it mitigates optimizer mismatch when fine-tuning AdamW-pretrained checkpoints.

replace Provable Cooperative Multi-Agent Exploration for Reward-Free MDPs

Authors: Idan Barnea, Orin Levy, Yishay Mansour

Abstract: We study cooperative multi-agent reinforcement learning in the setting of reward-free exploration, where multiple agents jointly explore an unknown MDP in order to learn its dynamics (without observing rewards). We focus on a tabular finite-horizon MDP and adopt a phased learning framework. In each learning phase, multiple agents independently interact with the environment. More specifically, in each learning phase, each agent is assigned a policy, executes it, and observes the resulting trajectory. Our primary goal is to characterize the tradeoff between the number of learning phases and the number of agents, especially when the number of learning phases is small. Our results identify a sharp transition governed by the horizon $H$. When the number of learning phases equals $H$, we present a computationally efficient algorithm that uses only $\tilde{O}(S^6 H^6 A / \epsilon^2)$ agents to obtain an $\epsilon$ approximation of the dynamics (i.e., yields an $\epsilon$-optimal policy for any reward function). We complement our algorithm with a lower bound showing that any algorithm restricted to $\rho < H$ phases requires at least $A^{H/\rho}$ agents to achieve constant accuracy. Thus, we show that it is essential to have an order of $H$ learning phases if we limit the number of agents to be polynomial.

replace Stein-Rule Shrinkage for Stochastic Gradient Estimation in High Dimensions

Authors: M. Arashi, M. Amintoosi

Abstract: Stochastic gradient methods are central to large-scale learning, but they treat mini-batch gradients as unbiased estimators, which classical decision theory shows are inadmissible in high dimensions. We formulate gradient computation as a high-dimensional estimation problem and introduce a framework based on Stein-rule shrinkage. We construct a gradient estimator that adaptively contracts noisy mini-batch gradients toward a stable estimator derived from historical momentum. The shrinkage intensity is determined in a data-driven manner using an online estimate of gradient noise variance, leveraging statistics from adaptive optimizers. Under a Gaussian noise model, we show our estimator uniformly dominates the standard stochastic gradient under squared error loss and is minimax-optimal. We incorporate this into the Adam optimizer, yielding SR-Adam, a practical algorithm with negligible computational cost. Empirical evaluations on CIFAR10 and CIFAR100 across multiple levels of input noise show consistent improvements over Adam in the large-batch regime. Ablation studies indicate that gains arise primarily from selectively applying shrinkage to high-dimensional convolutional layers, while indiscriminate shrinkage across all parameters degrades performance. These results illustrate that classical shrinkage principles provide a principled approach to improving stochastic gradient estimation in deep learning.

replace EvoMU: Evolutionary Machine Unlearning

Authors: Pawel Batorski, Paul Swoboda

Abstract: Machine unlearning aims to unlearn specified training data (e.g. sensitive or copyrighted material). A prominent approach is to fine-tune an existing model with an unlearning loss that retains overall utility. The space of suitable unlearning loss functions is vast, making the search for an optimal loss function daunting. Additionally, there might not even exist a universally optimal loss function: differences in the structure and overlap of the forget and retain data can cause a loss to work well in one setting but over-unlearn or under-unlearn in another. Our approach EvoMU tackles these two challenges simultaneously. An evolutionary search procedure automatically finds task-specific losses in the vast space of possible unlearning loss functions. This allows us to find dataset-specific losses that match or outperform existing losses from the literature, without the need for a human-in-the-loop. This work is therefore an instance of automatic scientific discovery, a.k.a. an AI co-scientist. In contrast to previous AI co-scientist works, we do so on a budget: We achieve SotA results using a small 4B parameter model (Qwen3-4B-Thinking), showing the potential of AI co-scientists with limited computational resources. Our experimental evaluation shows that we surpass previous loss-based unlearning formulations on TOFU-5%, TOFU-10%, MUSE and WMDP by synthesizing novel unlearning losses. Our code is available at https://github.com/Batorskq/EvoMU.

URLs: https://github.com/Batorskq/EvoMU.

replace Fat-Cat: Document-Driven Metacognitive Multi-Agent System for Complex Reasoning

Authors: Tong Yang (Henan Polytechnic University), Yemin Wang (University of Electronic Science and Technology of China), Chaoning Zhang (Henan Polytechnic University), Aming Wu (Henan Polytechnic University)

Abstract: The effectiveness of LLM-based agents is often limited not by model capacity alone, but by how efficiently contextual information is utilized at runtime. Existing agent frameworks rely on rigid, syntax-heavy state representations such as nested JSON, which require models to devote a substantial portion of their limited attention to syntactic processing rather than semantic reasoning. In this paper, we propose Fat-Cat, a document-driven agent architecture that improves the signal-to-noise ratio of state management. By integrating three key components: (1) a Semantic File System that represents agent state as Markdown documents aligned with common pre-training corpora, (2) a Textual Strategy Evolution module that accumulates task-solving knowledge without parameter updates, and (3) a Closed-Loop Watcher that monitors reasoning trajectories to reduce hallucinations. Extensive reasoning, retrieval, and coding benchmarks, Fat-Cat consistently improves agent performance. It enables the Kimi-k2 model to outperform the proprietary GPT-4o baseline on HotPotQA. Replacing the document-based state with JSON leads to performance drop, while empirically validating the critical necessity of document-driven state modeling over rigid syntax. The code is available at https://github.com/answeryt/Fat-Cat.

URLs: https://github.com/answeryt/Fat-Cat.

replace Learning While Staying Curious: Entropy-Preserving Supervised Fine-Tuning via Adaptive Self-Distillation for Large Reasoning Models

Authors: Hao Wang, Hao Gu, Hongming Piao, Kaixiong Gong, Yuxiao Ye, Xiangyu Yue, Sirui Han, Yike Guo, Dapeng Wu

Abstract: The standard post-training recipe for large reasoning models, supervised fine-tuning followed by reinforcement learning (SFT-then-RL), may limit the benefits of the RL stage: while SFT imitates expert demonstrations, it often causes overconfidence and reduces generation diversity, leaving RL with a narrowed solution space to explore. Adding entropy regularization during SFT is not a cure-all; it tends to flatten token distributions toward uniformity, increasing entropy without improving meaningful exploration capability. In this paper, we propose CurioSFT, an entropy-preserving SFT method designed to enhance exploration capabilities through intrinsic curiosity. It consists of (a) Self-Exploratory Distillation, which distills the model toward a self-generated, temperature-scaled teacher to encourage exploration within its capability; and (b) Entropy-Guided Temperature Selection, which adaptively adjusts distillation strength to mitigate knowledge forgetting by amplifying exploration at reasoning tokens while stabilizing factual tokens. Extensive experiments on mathematical reasoning tasks demonstrate that, in SFT stage, CurioSFT outperforms the vanilla SFT by 2.5 points on in-distribution tasks and 2.9 points on out-of-distribution tasks. We also verify that exploration capabilities preserved during SFT successfully translate into concrete gains in RL stage, yielding an average improvement of 5.0 points.

replace Decoupling Generalizability and Membership Privacy Risks in Neural Networks

Authors: Xingli Fang, Jung-Eun Kim

Abstract: A deep learning model usually has to sacrifice some utilities when it acquires some other abilities or characteristics. Privacy preservation has such trade-off relationships with utilities. The loss disparity between various defense approaches implies the potential to decouple generalizability and privacy risks to maximize privacy gain. In this paper, we identify that the model's generalization and privacy risks exist in different regions in deep neural network architectures. Based on the observations that we investigate, we propose Privacy-Preserving Training Principle (PPTP) to protect model components from privacy risks while minimizing the loss in generalizability. Through extensive evaluations, our approach shows significantly better maintenance in model generalizability while enhancing privacy preservation.

replace Step-Wise Refusal Dynamics in Autoregressive and Diffusion Language Models

Authors: Eliron Rahimi, Elad Hirshel, Rom Himelstein, Amit LeVi, Avi Mendelson, Chaim Baskin

Abstract: Diffusion language models (DLMs) have recently emerged as a promising alternative to autoregressive (AR) models, offering parallel decoding and controllable sampling dynamics while achieving competitive generation quality at scale. Despite this progress, the role of sampling mechanisms in shaping refusal behavior and jailbreak robustness remains poorly understood. In this work, we present a fundamental analytical framework for step-wise refusal dynamics, enabling comparison between AR and diffusion sampling. Our analysis reveals that the sampling strategy itself plays a central role in safety behavior, as a factor distinct from the underlying learned representations. Motivated by this analysis, we introduce the Step-Wise Refusal Internal Dynamics (SRI) signal, which supports interpretability and improved safety for both AR and DLMs. We demonstrate that the geometric structure of SRI captures internal recovery dynamics, and identifies anomalous behavior in harmful generations as cases of \emph{incomplete internal recovery} that are not observable at the text level. This structure enables lightweight inference-time detectors that generalize to unseen attacks while matching or outperforming existing defenses with over $100\times$ lower inference overhead.

replace Exposing Vulnerabilities in Explanation for Time Series Classifiers via Dual-Target Attacks

Authors: Bohan Wang, Zewen Liu, Lu Lin, Hui Liu, Li Xiong, Ming Jin, Wei Jin

Abstract: Interpretable time series deep learning systems are often assessed by checking temporal consistency on explanations, implicitly treating this as evidence of robustness. We show that this assumption can fail: Predictions and explanations can be adversarially decoupled, enabling targeted misclassification while the explanation remains plausible and consistent with a chosen reference rationale. We propose TSEF (Time Series Explanation Fooler), a dual-target attack that jointly manipulates the classifier and explainer outputs. In contrast to single-objective misclassification attacks that disrupt explanation and spread attribution mass broadly, TSEF achieves targeted prediction changes while keeping explanations consistent with the reference. Across multiple datasets and explainer backbones, our results consistently reveal that explanation stability is a misleading proxy for decision robustness and motivate coupling-aware robustness evaluations for trustworthy time series tasks.

replace Semantics-Aware Generative Latent Data Augmentation for Learning in Low-Resource Domains

Authors: Jaesung Bae, Minje Kim

Abstract: Despite strong performance in data-rich regimes, deep learning often underperforms in the data-scarce settings common in practice. While foundation models (FMs) trained on massive datasets demonstrate strong generalization by extracting general-purpose features, they can still suffer from scarce labeled data during downstream fine-tuning. To address this, we propose GeLDA, a semantics-aware generative latent data augmentation framework that leverages conditional diffusion models to synthesize samples in an FM-induced latent space. Because this space is low-dimensional and concentrates task-relevant information compared to the input space, GeLDA enables efficient, high-quality data generation. GeLDA conditions generation on auxiliary feature vectors that capture semantic relationships among classes or subdomains, facilitating data augmentation in low-resource domains. We validate GeLDA in two large-scale recognition tasks: (a) in zero-shot language-specific speech emotion recognition, GeLDA improves the Whisper-large baseline's unweighted average recall by 6.13%; and (b) in long-tailed image classification, it achieves 74.7% tail-class accuracy on ImageNet-LT, setting a new state-of-the-art result.

replace Online Vector Quantized Attention

Authors: Nick Alonso, Tomas Figliolia, Beren Millidge

Abstract: Standard sequence mixing layers used in language models struggle to balance efficiency and performance. Self-attention performs well on long context tasks but has expensive quadratic compute and linear memory costs, while linear attention and SSMs use only linear compute and constant memory but struggle with long context processing. In this paper, we develop a sequence mixing layer that aims to find a better compromise between memory-compute costs and long-context processing, which we call online vector-quantized (OVQ) attention. OVQ-attention requires linear compute costs and constant memory, but, unlike linear attention and SSMs, it uses a sparse memory update that allows it to greatly increase the size of its memory state and, consequently, memory capacity. We develop a theoretical basis for OVQ-attention based on Gaussian mixture regression, and we test it on a variety of synthetic long context tasks and on long context language modeling. OVQ-attention shows significant improvements over linear attention baselines and the original VQ-attention, on which OVQ-attention was inspired. It demonstrates competitive, and sometimes identical, performance to strong self-attention baselines up 64k sequence length, despite using a small fraction of the memory of full self-attention.

replace Bypassing the Rationale: Causal Auditing of Implicit Reasoning in Language Models

Authors: Anish Sathyanarayanan, Aditya Nagarsekar, Aarush Rathore

Abstract: Chain-of-thought (CoT) prompting is widely used as a reasoning aid and is often treated as a transparency mechanism. Yet behavioral gains under CoT do not imply that the model's internal computation causally depends on the emitted reasoning text, i.e., models may produce fluent rationales while routing decision-critical computation through latent pathways. We introduce a causal, layerwise audit of CoT faithfulness based on activation patching. Our key metric, the CoT Mediation Index (CMI), isolates CoT-specific causal influence by comparing performance degradation from patching CoT-token hidden states against matched control patches. Across multiple model families (Phi, Qwen, DialoGPT) and scales, we find that CoT-specific influence is typically depth-localized into narrow "reasoning windows," and we identify bypass regimes where CMI is near-zero despite plausible CoT text. We further observe that models tuned explicitly for reasoning tend to exhibit stronger and more structured mediation than larger untuned counterparts, while Mixture-of-Experts models show more distributed mediation consistent with routing-based computation. Overall, our results show that CoT faithfulness varies substantially across models and tasks and cannot be inferred from behavior alone, motivating causal, layerwise audits when using CoT as a transparency signal.

replace Understanding and Guiding Layer Placement in Parameter-Efficient Fine-Tuning of Large Language Models

Authors: Yichen Xu, Yuyang Liang, Shan Dai, Tianyang Hu, Tsz Nam Chan, Chenhao Ma

Abstract: As large language models (LLMs) continue to grow, the cost of full-parameter fine-tuning has made parameter-efficient fine-tuning (PEFT) the default strategy for downstream adaptation. Constraints from inference latency in scalable serving and fine-tuning cost in edge or rapid-deployment settings make the choice of which layers to fine-tune unavoidable. Yet current practice typically applies PEFT uniformly across all layers, with limited understanding or leverage of layer selection. This paper develops a unified projected residual view of PEFT on top of a frozen base model. Under a local quadratic approximation, layerwise adaptation is governed by three quantities: (i) the projected residual norm (resnorm), which measures how much correctable bias a layer can capture; (ii) the activation energy, which determines feature conditioning; and (iii) layer coupling, which quantifies how strongly residuals interact across layers. We show that, for squared loss and linear adapters, the resnorm equals a normalized gradient norm, activation energy controls ill-conditioning and noise amplification, and weak coupling yields approximately additive layerwise contributions. Building on these insights, we introduce the Layer Card, a reusable diagnostic that summarizes residual signal strength, compute cost, and performance for each layer of a given model. With an identical model and LoRA configuration, Layer Card-guided placement refines the choice of adapted layers to flexibly prioritize different objectives, such as maximizing performance or reducing fine-tuning cost. Moreover, on Qwen3-8B, we show that selectively adapting a subset of layers can achieve performance close to full-layer LoRA while substantially reducing fine-tuning cost and the number of adapter-augmented layers during inference, offering a more cost-performance-aware alternative to full-layer insertion.

replace SAFE: Stable Alignment Finetuning with Entropy-Aware Predictive Control for Reinforcement Learning from Human Feedback (RLHF)

Authors: Dipan Maity

Abstract: Proximal Policy Optimization (PPO) has been positioned by recent literature as the canonical method for the RL part of Reinforcement Learning from Human Feedback (RLHF). PPO performs well empirically but has a heuristic motivation and handles the KL-divergence constraint used in LM-RLHF in an ad-hoc manner and suffers form reward oscillations, entropy collapse, value function drift, and sudden policy divergence that require frequent restarts and extensive hyperparameter tuning. In this paper, we develop a new pure on policy actor-critic RL method for the LM-RLHF setting. We present SAFE (Stable Alignment Finetuning with Entropy-aware control),a novel RLHF algorithm that combines a Double Soft-Min Critic for pessimistic value estimation with a new multi-layer stabilization framework combining entropy-gated KL regulation, and PID-controlled adaptive thresholds. Unlike standard PPO's symmetric KL penalties, SAFE distinguishes high-entropy exploration from low-entropy mode collapse and adjusts penalties dynamically based on reward velocity. Experiments on a 3B parameter model show SAFE achieves +5.15\% training-average reward than PPO (0.725 vs 0.689), negligible reward crashes, and superior KL control than ppo . Our method adds minimal computational overhead and provides an interpretable, crash-resistant RLHF framework that maintains aggressive learning speed while ensuring stable long-horizon optimization suitable for production deployment. Code is available at https://github.com/ryyzn9/SAFE

URLs: https://github.com/ryyzn9/SAFE

replace NeuroCanvas: VLLM-Powered Robust Seizure Detection by Reformulating Multichannel EEG as Image

Authors: Yan Chen, Jie Peng, Moajjem Hossain Chowdhury, Tianlong Chen, Yunmei Liu

Abstract: Accurate and timely seizure detection from Electroencephalography (EEG) is critical for clinical intervention, yet manual review of long-term recordings is labor-intensive. Recent efforts to encode EEG signals into large language models (LLMs) show promise in handling neural signals across diverse patients, but two significant challenges remain: (1) multi-channel heterogeneity, as seizure-relevant information varies substantially across EEG channels, and (2) computing inefficiency, as the EEG signals need to be encoded into a massive number of tokens for the prediction. To address these issues, we draw the EEG signal and propose the novel NeuroCanvas framework. Specifically, NeuroCanvas consists of two modules: (i) The Entropy-guided Channel Selector (ECS) selects the seizure-relevant channels input to LLM and (ii) the following Canvas of Neuron Signal (CNS) converts selected multi-channel heterogeneous EEG signals into structured visual representations. The ECS module alleviates the multi-channel heterogeneity issue, and the CNS uses compact visual tokens to represent the EEG signals that improve the computing efficiency. We evaluate NeuroCanvas across multiple seizure detection datasets, demonstrating a significant improvement of 20% in F1 score and reductions of 88% in inference latency. These results highlight NeuroCanvas as a scalable and effective solution for real-time and resource-efficient seizure detection in clinical practice.

replace Momentum Attention: The Physics of In-Context Learning and Spectral Forensics for Mechanistic Interpretability

Authors: Kingsuk Maitra

Abstract: The Mechanistic Interpretability (MI) program has mapped the Transformer as a precise computational graph. We extend this graph with a conservation law and time-varying AC dynamics, viewing it as a physical circuit. We introduce Momentum Attention, a symplectic augmentation embedding physical priors via the kinematic difference operator $p_t = q_t - q_{t-1}$, implementing the symplectic shear $\hat{q}_t = q_t + \gamma p_t$ on queries and keys. We identify a fundamental Symplectic-Filter Duality: the physical shear is mathematically equivalent to a High-Pass Filter. This duality is our cornerstone contribution -- by injecting kinematic momentum, we sidestep the topological depth constraint ($L \geq 2$) for induction head formation. While standard architectures require two layers for induction from static positions, our extension grants direct access to velocity, enabling Single-Layer Induction and Spectral Forensics via Bode Plots. We formalize an Orthogonality Theorem proving that DC (semantic) and AC (mechanistic) signals segregate into orthogonal frequency bands when Low-Pass RoPE interacts with High-Pass Momentum. Validated through 5,100+ controlled experiments (documented in Supplementary Appendices A--R and 27 Jupyter notebooks), our 125M Momentum model exceeds expectations on induction-heavy tasks while tracking a 350M baseline within $\sim$2.9% validation loss. Dedicated associative recall experiments reveal a scaling law $\gamma^* = 4.17 \times N^{-0.74}$ establishing momentum-depth fungibility. We offer this framework as a complementary analytical toolkit connecting Generative AI, Hamiltonian Physics, and Signal Processing.

replace SLAY: Geometry-Aware Spherical Linearized Attention with Yat-Kernel

Authors: Jose Miguel Luna, Taha Bouhsine, Krzysztof Choromanski

Abstract: We propose a new class of linear-time attention mechanisms based on a relaxed and computationally efficient formulation of the recently introduced E-Product, often referred to as the Yat-kernel (Bouhsine, 2025). The resulting interactions are geometry-aware and inspired by inverse-square interactions in physics. Our method, Spherical Linearized Attention with Yat Kernels (SLAY), constrains queries and keys to the unit sphere so that attention depends only on angular alignment. Using Bernstein's theorem, we express the spherical Yat-kernel as a nonnegative mixture of polynomial-exponential product kernels and derive a strictly positive random-feature approximation enabling linear-time O(L) attention. We establish positive definiteness and boundedness on the sphere and show that the estimator yields well-defined, nonnegative attention scores. Empirically, SLAY achieves performance that is nearly indistinguishable from standard softmax attention while retaining linear time and memory scaling, and consistently outperforms prior linear-time attention mechanisms such as Performers and Cosformers. To the best of our knowledge, SLAY represents the closest linear-time approximation to softmax attention reported to date, enabling scalable Transformers without the typical performance trade-offs of attention linearization.

replace EBPO: Empirical Bayes Shrinkage for Stabilizing Group-Relative Policy Optimization

Authors: Kevin Han, Yuhang Zhou, Mingze Gao, Gedi Zhou, Serena Li, Abhishek Kumar, Xiangjun Fan, Weiwei Li, Lizhu Zhang

Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has proven effective for enhancing the reasoning capabilities of Large Language Models (LLMs). However, dominant approaches like Group Relative Policy Optimization (GRPO) face critical stability challenges: they suffer from high estimator variance under computational constraints (small group sizes) and vanishing gradient signals in saturated failure regimes where all responses yield identical zero rewards. To address this, we propose Empirical Bayes Policy Optimization (EBPO), a novel framework that regularizes local group-based baselines by borrowing strength from the policy's accumulated global statistics. Instead of estimating baselines in isolation, EBPO employs a shrinkage estimator that dynamically balances local group statistics with a global prior updated via Welford's online algorithm. Theoretically, we demonstrate that EBPO guarantees strictly lower Mean Squared Error (MSE), bounded entropy decay, and non-vanishing penalty signals in failure scenarios compared to GRPO. Empirically, EBPO consistently outperforms GRPO and other established baselines across diverse benchmarks, including AIME and OlympiadBench. Notably, EBPO exhibits superior training stability, achieving high-performance gains even with small group sizes, and benefits significantly from difficulty-stratified curriculum learning.

replace A Unified Framework for Rethinking Policy Divergence Measures in GRPO

Authors: Qingyuan Wu, Yuhui Wang, Simon Sinong Zhan, Yanning Dai, Shilong Deng, Sarra Habchi, Qi Zhu, Matthias Gall\'e, Chao Huang

Abstract: Reinforcement Learning with Verified Reward (RLVR) has emerged as a critical paradigm for advancing the reasoning capabilities of Large Language Models (LLMs). Most existing RLVR methods, such as GRPO and its variants, ensure stable updates by constraining policy divergence through clipping likelihood ratios. This paper introduces a unified clipping framework that characterizes existing methods via a general notion of policy divergence, encompassing both likelihood ratios and Kullback-Leibler (KL) divergences and extending to alternative measures. The framework provides a principled foundation for systematically analyzing how different policy divergence measures affect exploration and performance. We further identify the KL3 estimator, a variance-reduced Monte Carlo estimator of the KL divergence, as a key policy divergence constraint. We theoretically demonstrate that the KL3-based constraint is mathematically equivalent to an asymmetric ratio-based clipping that reallocates probability mass toward high-confidence actions, promoting stronger exploration while retaining the simplicity of GRPO-style methods. Empirical results on mathematical reasoning benchmarks demonstrate that incorporating the KL3 estimator into GRPO improves both training stability and final performance, highlighting the importance of principled policy divergence constraints in policy optimization.

replace Alignment Verifiability in Large Language Models: Normative Indistinguishability under Behavioral Evaluation

Authors: Igor Santos-Grueiro

Abstract: Behavioral evaluation is the dominant paradigm for assessing alignment in large language models (LLMs). In current practice, observed compliance under finite evaluation protocols is treated as evidence of latent alignment. However, the inference from bounded behavioral evidence to claims about global latent properties is rarely analyzed as an identifiability problem. In this paper, we study alignment evaluation through the lens of statistical identifiability under partial observability. We allow agent policies to condition their behavior on observable signals correlated with the evaluation regime, a phenomenon we term evaluation awareness. Within this framework, we formalize the Alignment Verifiability Problem and introduce Normative Indistinguishability, which arises when distinct latent alignment hypotheses induce identical distributions over evaluator-accessible observations. Our main theoretical contribution is a conditional impossibility result: under finite behavioral evaluation and evaluation-aware policies, observed compliance does not uniquely identify latent alignment, but only membership in an equivalence class of conditionally compliant policies, under explicit assumptions on policy expressivity and observability. We complement the theory with a constructive existence proof using an instruction-tuned LLM (Llama-3.2-3B), demonstrating a conditional policy that is perfectly compliant under explicit evaluation signals yet exhibits degraded identifiability when the same evaluation intent is conveyed implicitly. Together, our results show that behavioral benchmarks provide necessary but insufficient evidence for latent alignment under evaluation awareness.

replace f-GRPO and Beyond: Divergence-Based Reinforcement Learning Algorithms for General LLM Alignment

Authors: Rajdeep Haldar, Lantao Mei, Guang Lin, Yue Xing, Qifan Song

Abstract: Recent research shows that Preference Alignment (PA) objectives act as divergence estimators between aligned (chosen) and unaligned (rejected) response distributions. In this work, we extend this divergence-based perspective to general alignment settings, such as reinforcement learning with verifiable rewards (RLVR), where only environmental rewards are available. Within this unified framework, we propose f-Group Relative Policy Optimization (f-GRPO), a class of on-policy reinforcement learning, and f-Hybrid Alignment Loss (f-HAL), a hybrid on/off policy objectives, for general LLM alignment based on variational representation of f-divergences. We provide theoretical guarantees that these classes of objectives improve the average reward after alignment. Empirically, we validate our framework on both RLVR (Math Reasoning) and PA tasks (Safety Alignment), demonstrating superior performance and flexibility compared to current methods.

replace Mechanisms of AI Protein Folding in ESMFold

Authors: Kevin Lu, Jannik Brinkmann, Stefan Huber, Aaron Mueller, Yonatan Belinkov, David Bau, Chris Wendler

Abstract: How do protein structure prediction models fold proteins? We investigate this question by tracing how ESMFold folds a beta hairpin, a prevalent structural motif. Through counterfactual interventions on model latents, we identify two computational stages in the folding trunk. In the first stage, early blocks initialize pairwise biochemical signals: residue identities and associated biochemical features such as charge flow from sequence representations into pairwise representations. In the second stage, late blocks develop pairwise spatial features: distance and contact information accumulate in the pairwise representation. We demonstrate that the mechanisms underlying structural decisions of ESMFold can be localized, traced through interpretable representations, and manipulated with strong causal effects.

replace A Fast and Generalizable Fourier Neural Operator-Based Surrogate for Melt-Pool Prediction in Laser Processing

Authors: Alix Benoit (EMPA), Toni Ivas (EMPA), Mateusz Papierz (Terra Quantum AG), Asel Sagingalieva (Terra Quantum AG), Alexey Melnikov (Terra Quantum AG), Elia Iseli (EMPA)

Abstract: High-fidelity simulations of laser welding capture complex thermo-fluid phenomena, including phase change, free-surface deformation, and keyhole dynamics, however their computational cost limits large-scale process exploration and real-time use. In this work we present the Laser Processing Fourier Neural Operator (LP-FNO), a Fourier Neural Operator (FNO) based surrogate model that learns the parametric solution operator of various laser processes from multiphysics simulations generated with FLOW-3D WELD (registered trademark). Through a novel approach of reformulating the transient problem in the moving laser frame and applying temporal averaging, the system results in a quasi-steady state setting suitable for operator learning, even in the keyhole welding regime. The proposed LP-FNO maps process parameters to three-dimensional temperature fields and melt-pool boundaries across a broad process window spanning conduction and keyhole regimes using the non-dimensional normalized enthalpy formulation. The model achieves temperature prediction errors on the order of 1% and intersection-over-union scores for melt-pool segmentation over 0.9. We demonstrate that a LP-FNO model trained on coarse-resolution data can be evaluated on finer grids, yielding accurate super-resolved predictions in mesh-converged conduction regimes, whereas discrepancies in keyhole regimes reflect unresolved dynamics in the coarse-mesh training data. These results indicate that the LP-FNO provides an efficient surrogate modeling framework for laser welding, enabling prediction of full three-dimensional fields and phase interfaces over wide parameter ranges in just tens of milliseconds, up to a hundred thousand times faster than traditional Finite Volume multi-physics software.

replace Refining the Information Bottleneck via Adversarial Information Separation

Authors: Shuai Ning, Zhenpeng Wang, Lin Wang, Bing Chen, Shuangrong Liu, Xu Wu, Jin Zhou, Bo Yang

Abstract: Generalizing from limited data is particularly critical for models in domains such as material science, where task-relevant features in experimental datasets are often heavily confounded by measurement noise and experimental artifacts. Standard regularization techniques fail to precisely separate meaningful features from noise, while existing adversarial adaptation methods are limited by their reliance on explicit separation labels. To address this challenge, we propose the Adversarial Information Separation Framework (AdverISF), which isolates task-relevant features from noise without requiring explicit supervision. AdverISF introduces a self-supervised adversarial mechanism to enforce statistical independence between task-relevant features and noise representations. It further employs a multi-layer separation architecture that progressively recycles noise information across feature hierarchies to recover features inadvertently discarded as noise, thereby enabling finer-grained feature extraction. Extensive experiments demonstrate that AdverISF outperforms state-of-the-art methods in data-scarce scenarios. In addition, evaluations on real-world material design tasks show that it achieves superior generalization performance.

replace Temperature Scaling Attack Disrupting Model Confidence in Federated Learning

Authors: Kichang Lee, Jaeho Jin, JaeYeon Park, Songkuk Kim, JeongGil Ko

Abstract: Predictive confidence serves as a foundational control signal in mission-critical systems, directly governing risk-aware logic such as escalation, abstention, and conservative fallback. While prior federated learning attacks predominantly target accuracy or implant backdoors, we identify confidence calibration as a distinct attack objective. We present the Temperature Scaling Attack (TSA), a training-time attack that degrades calibration while preserving accuracy. By injecting temperature scaling with learning rate-temperature coupling during local training, malicious updates maintain benign-like optimization behavior, evading accuracy-based monitoring and similarity-based detection. We provide a convergence analysis under non-IID settings, showing that this coupling preserves standard convergence bounds while systematically distorting confidence. Across three benchmarks, TSA substantially shifts calibration (e.g., 145% error increase on CIFAR-100) with <2 accuracy change, and remains effective under robust aggregation and post-hoc calibration defenses. Case studies further show that confidence manipulation can cause up to 7.2x increases in missed critical cases (healthcare) or false alarms (autonomous driving), even when accuracy is unchanged. Overall, our results establish calibration integrity as a critical attack surface in federated learning.

replace Improved Sampling Schedules for Discrete Diffusion Models

Authors: Alberto Foresti, Mustapha Bounoua, Giulio Franzese, Luca Ambrogioni, Pietro Michiardi

Abstract: Discrete diffusion models have emerged as a powerful paradigm for generative modeling on sequence data; however, the information-theoretic principles governing their reverse processes remain significantly less understood than those of their continuous counterparts. In this work, we bridge this gap by analyzing the reverse process dynamics through the lens of thermodynamic entropy production. We propose the entropy production rate as a rigorous proxy for quantifying information generation, deriving as a byproduct a bound on the Wasserstein distance between intermediate states and the data distribution. Leveraging these insights, we introduce two novel sampling schedules that are uniformly spaced with respect to their corresponding physics-inspired metrics: the Entropic Discrete Schedule (EDS), which is defined by maintaining a constant rate of information gain, and the Wasserstein Discrete Schedule (WDS), which is defined by taking equal steps in terms of the Wasserstein distance. We empirically demonstrate that our proposed schedules significantly outperform state-of-the-art strategies across diverse application domains, including synthetic data, music notation, vision and language modeling, consistently achieving superior performance at a lower computational budget.

replace Zero-shot Generalizable Graph Anomaly Detection with Mixture of Riemannian Experts

Authors: Xinyu Zhao, Qingyun Sun, Jiayi Luo, Xingcheng Fu, Jianxin Li

Abstract: Graph Anomaly Detection (GAD) aims to identify irregular patterns in graph data, and recent works have explored zero-shot generalist GAD to enable generalization to unseen graph datasets. However, existing zero-shot GAD methods largely ignore intrinsic geometric differences across diverse anomaly patterns, substantially limiting their cross-domain generalization. In this work, we reveal that anomaly detectability is highly dependent on the underlying geometric properties and that embedding graphs from different domains into a single static curvature space can distort the structural signatures of anomalies. To address the challenge that a single curvature space cannot capture geometry-dependent graph anomaly patterns, we propose GAD-MoRE, a novel framework for zero-shot Generalizable Graph Anomaly Detection with a Mixture of Riemannian Experts architecture. Specifically, to ensure that each anomaly pattern is modeled in the Riemannian space where it is most detectable, GAD-MoRE employs a set of specialized Riemannian expert networks, each operating in a distinct curvature space. To align raw node features with curvature-specific anomaly characteristics, we introduce an anomaly-aware multi-curvature feature alignment module that projects inputs into parallel Riemannian spaces, enabling the capture of diverse geometric characteristics. Finally, to facilitate better generalization beyond seen patterns, we design a memory-based dynamic router that adaptively assigns each input to the most compatible expert based on historical reconstruction performance on similar anomalies. Extensive experiments in the zero-shot setting demonstrate that GAD-MoRE significantly outperforms state-of-the-art generalist GAD baselines, and even surpasses strong competitors that are few-shot fine-tuned with labeled data from the target domain.

replace Vision Transformer Finetuning Benefits from Non-Smooth Components

Authors: Ambroise Odonnat, Laetitia Chapel, Romain Tavenard, Ievgen Redko

Abstract: The smoothness of the transformer architecture has been extensively studied in the context of generalization, training stability, and adversarial robustness. However, its role in transfer learning remains poorly understood. In this paper, we analyze the ability of vision transformer components to adapt their outputs to changes in inputs, or, in other words, their plasticity. Defined as an average rate of change, it captures the sensitivity to input perturbation; in particular, a high plasticity implies low smoothness. We demonstrate through theoretical analysis and comprehensive experiments that this perspective provides principled guidance in choosing the components to prioritize during adaptation. A key takeaway for practitioners is that the high plasticity of the attention modules and feedforward layers consistently leads to better finetuning performance. Our findings depart from the prevailing assumption that smoothness is desirable, offering a novel perspective on the functional properties of transformers. The code is available at https://github.com/ambroiseodt/vit-plasticity.

URLs: https://github.com/ambroiseodt/vit-plasticity.

replace Robustness Beyond Known Groups with Low-rank Adaptation

Authors: Abinitha Gourabathina, Hyewon Jeong, Teya Bergamaschi, Marzyeh Ghassemi, Collin Stultz

Abstract: Deep learning models trained to optimize average accuracy often exhibit systematic failures on particular subpopulations. In real world settings, the subpopulations most affected by such disparities are frequently unlabeled or unknown, thereby motivating the development of methods that are performant on sensitive subgroups without being pre-specified. However, existing group-robust methods typically assume prior knowledge of relevant subgroups, using group annotations for training or model selection. We propose Low-rank Error Informed Adaptation (LEIA), a simple two-stage method that improves group robustness by identifying a low-dimensional subspace in the representation space where model errors concentrate. LEIA restricts adaptation to this error-informed subspace via a low-rank adjustment to the classifier logits, directly targeting latent failure modes without modifying the backbone or requiring group labels. Using five real-world datasets, we analyze group robustness under three settings: (1) truly no knowledge of subgroup relevance, (2) partial knowledge of subgroup relevance, and (3) full knowledge of subgroup relevance. Across all settings, LEIA consistently improves worst-group performance while remaining fast, parameter-efficient, and robust to hyperparameter choice.

replace-cross CoinPress: Practical Private Mean and Covariance Estimation

Authors: Sourav Biswas, Yihe Dong, Gautam Kamath, Jonathan Ullman

Abstract: We present simple differentially private estimators for the mean and covariance of multivariate sub-Gaussian data that are accurate at small sample sizes. We demonstrate the effectiveness of our algorithms both theoretically and empirically using synthetic and real-world datasets -- showing that their asymptotic error rates match the state-of-the-art theoretical bounds, and that they concretely outperform all previous methods. Specifically, previous estimators either have weak empirical accuracy at small sample sizes, perform poorly for multivariate data, or require the user to provide strong a priori estimates for the parameters.

replace-cross Non-negative matrix factorization algorithms generally improve topic model fits

Authors: Peter Carbonetto, Abhishek Sarkar, Zihao Wang, Matthew Stephens

Abstract: In an effort to develop topic modeling methods that can be quickly applied to large data sets, we revisit the problem of maximum-likelihood estimation in topic models. It is known, at least informally, that maximum-likelihood estimation in topic models is closely related to non-negative matrix factorization (NMF). Yet, to our knowledge, this relationship has not been exploited previously to fit topic models. We show that recent advances in NMF optimization methods can be leveraged to fit topic models very efficiently, often resulting in much better fits and in less time than existing algorithms for topic models. We also formally make the connection between the NMF optimization problem and maximum-likelihood estimation for the topic model, and using this result we show that the expectation maximization (EM) algorithm for the topic model is essentially the same as the classic multiplicative updates for NMF (the only difference being that the operations are performed in a different order). Our methods are implemented in the R package fastTopics.

replace-cross A Global Optimization Algorithm for K-Center Clustering of One Billion Samples

Authors: Jiayang Ren, Ningning You, Kaixun Hua, Chaojie Ji, Yankai Cao

Abstract: This paper presents a practical global optimization algorithm for the K-center clustering problem, which aims to select K samples as the cluster centers to minimize the maximum within-cluster distance. This algorithm is based on a reduced-space branch and bound scheme and guarantees convergence to the global optimum in a finite number of steps by only branching on the regions of centers. To improve efficiency, we have designed a two-stage decomposable lower bound, the solution of which can be derived in a closed form. In addition, we also propose several acceleration techniques to narrow down the region of centers, including bounds tightening, sample reduction, and parallelization. Extensive studies on synthetic and real-world datasets have demonstrated that our algorithm can solve the K-center problems to global optimal within 4 hours for ten million samples in the serial mode and one billion samples in the parallel mode. Moreover, compared with the state-of-the-art heuristic methods, the global optimum obtained by our algorithm can averagely reduce the objective function by 25.8% on all the synthetic and real-world datasets.

replace-cross Estimating the Value of Evidence-Based Decision Making

Authors: Alberto Abadie, Anish Agarwal, Guido Imbens, Siwei Jia, James McQueen, Serguei Stepaniants, Santiago Torres

Abstract: In an era of data abundance, statistical evidence is increasingly critical for business and policy decisions. Yet, organizations lack empirical tools to assess the value of evidence-based decision making (EBDM), optimize statistical precision, and balance the costs of evidence-gathering strategies against their benefits. To tackle these challenges, this article introduces an empirical framework to estimate the value of EBDM and evaluate the return on investment in statistical precision and project ideation. The framework leverages parametric and nonparametric empirical Bayes methods to account for parameter heterogeneity and measure how statistical precision changes the value of evidence. The value extracted from statistical evidence depends critically on how organizations translate evidence into policy decisions. Commonly used decision rules based on statistical significance can leave substantial value unrealized and, in some cases, generate negative expected value.

replace-cross YaRN: Efficient Context Window Extension of Large Language Models

Authors: Bowen Peng, Jeffrey Quesnelle, Honglu Fan, Enrico Shippole

Abstract: Rotary Position Embeddings (RoPE) have been shown to effectively encode positional information in transformer-based language models. However, these models fail to generalize past the sequence length they were trained on. We present YaRN (Yet another RoPE extensioN method), a compute-efficient method to extend the context window of such models, requiring 10x less tokens and 2.5x less training steps than previous methods. Using YaRN, we show that LLaMA models can effectively utilize and extrapolate to context lengths much longer than their original pre-training would allow, while also surpassing previous the state-of-the-art at context window extension. In addition, we demonstrate that YaRN exhibits the capability to extrapolate beyond the limited context of a fine-tuning dataset. Code is available at https://github.com/jquesnelle/yarn

URLs: https://github.com/jquesnelle/yarn

replace-cross Rates of Convergence in the Central Limit Theorem for Markov Chains, with an Application to TD Learning

Authors: R. Srikant

Abstract: We prove a non-asymptotic central limit theorem for vector-valued martingale differences using Stein's method, and use Poisson's equation to extend the result to functions of Markov Chains. We then show that these results can be applied to establish a non-asymptotic central limit theorem for Temporal Difference (TD) learning with averaging.

replace-cross Knowledge-Centric Metacognitive Learning

Authors: Arun Kumar, Paul Schrater

Abstract: Interactions are central to intelligent reasoning and learning abilities, with the interpretation of abstract knowledge guiding meaningful interaction with objects in the environment. While humans readily adapt to novel situations by leveraging abstract knowledge acquired over time, artificial intelligence systems lack principled mechanisms for incorporating abstract knowledge into learning, leading to fundamental challenges in the emergence of intelligent and adaptive behavior. To address this gap, we introduce knowledge-centric metacognitive learning based on three key principles: natural abstractions, knowledge-guided interactions through interpretation, and the composition of interactions for problem solving. Knowledge learning facilitates the acquisition of abstract knowledge and the association of interactions with knowledge, while object interactions guided by abstract knowledge enable the learning of transferable interaction concepts, abstract reasoning, and generalization. This metacognitive mechanism provides a principled approach for integrating knowledge into reinforcement learning and offers a promising pathway toward intelligent and adaptive behavior in artificial intelligence, robotics, and autonomous systems.

replace-cross Reproducible Benchmarking for Lung Nodule Detection and Malignancy Classification Across Multiple Low-Dose CT Datasets

Authors: Fakrul Islam Tushar, Avivah Wang, Lavsen Dahal, Ehsan Samei, Michael R. Harowicz, Jayashree Kalpathy-Cramer, Kyle J. Lafata, Tina D. Tailor, Cynthia Rudin, Joseph Y. Lo

Abstract: Evaluation of artificial intelligence (AI) models for low-dose CT lung cancer screening is limited by heterogeneous datasets, annotation standards, and evaluation protocols, making performance difficult to compare and translate across clinical settings. We establish a public, reproducible multi-dataset benchmark for lung nodule detection and nodule-level cancer classification and quantify cross-dataset generalizability. Using the Duke Lung Cancer Screening (DLCS) dataset as a clinically curated development set, we evaluate performance across LUNA16/LIDC-IDRI, NLST-3D, and LUNA25. Detection models trained on DLCS and LUNA16 were evaluated externally on NLST-3D using free-response ROC analysis. For malignancy classification, we compared five strategies: randomly initialized ResNet50, Models Genesis, Med3D, a Foundation Model for Cancer Biomarkers, and a Strategic Warm-Start (ResNet50-SWS) approach pretrained using detection-derived candidate patches stratified by confidence. Performance was summarized using AUC with 95% confidence intervals and DeLong tests. Detection performance varied substantially by training dataset, with DLCS-trained models outperforming LUNA16-trained models on external NLST-3D evaluation (sensitivity at 2 false positives per scan: 0.72 vs. 0.64; p < 0.001). For malignancy classification, ResNet50-SWS achieved AUCs of 0.71 (DLCS), 0.90 (LUNA16), 0.81 (NLST-3D), and 0.80 (LUNA25), consistently matching or exceeding alternative pretraining strategies. These results demonstrate that dataset characteristics strongly influence lung cancer AI performance and highlight the need for transparent, multi-dataset benchmarking.

replace-cross Large Deviations of Gaussian Neural Networks with ReLU activation

Authors: Quirin Vogel

Abstract: We prove a large deviation principle for deep neural networks with Gaussian weights and at most linearly growing activation functions, such as ReLU. This generalises earlier work, in which bounded and continuous activation functions were considered. In practice, linearly growing activation functions such as ReLU are most commonly used. We furthermore simplify previous expressions for the rate function and provide a power-series expansions for the ReLU case.

replace-cross Synthetic Oversampling: Theory and A Practical Approach Using LLMs to Address Data Imbalance

Authors: Ryumei Nakada, Yichen Xu, Lexin Li, Linjun Zhang

Abstract: Imbalanced classification and spurious correlation are common challenges in data science and machine learning. Both issues are linked to data imbalance, with certain groups of data samples significantly underrepresented, which in turn would compromise the accuracy, robustness and generalizability of the learned models. Recent advances have proposed leveraging the flexibility and generative capabilities of large language models (LLMs), typically built on transformer architectures, to generate synthetic samples and to augment the observed data. In the context of imbalanced data, LLMs are used to oversample underrepresented groups and have shown promising improvements. However, there is a clear lack of theoretical understanding of such synthetic data approaches. In this article, we develop novel theoretical foundations to systematically study the roles of synthetic samples in addressing imbalanced classification and spurious correlation. Specifically, we first explicitly quantify the benefits of synthetic oversampling. Next, we analyze the scaling dynamics in synthetic data augmentation, and derive the corresponding scaling law. Finally, we demonstrate the capacity of transformer models to generate high-quality synthetic samples. We further conduct extensive numerical experiments to validate the efficacy of the LLM-based synthetic oversampling and augmentation.

replace-cross On the Computational Efficiency of Bayesian Additive Regression Trees: An Asymptotic Analysis

Authors: Yan Shuo Tan, Omer Ronen, Theo Saarinen, Bin Yu

Abstract: Bayesian Additive Regression Trees (BART) is a popular Bayesian non-parametric regression model that is commonly used in causal inference and beyond. Its strong predictive performance is supported by well-developed estimation theory, comprising guarantees that its posterior distribution concentrates around the true regression function at optimal rates under various data generative settings and for appropriate prior choices. However, the computational properties of the widely-used BART sampler proposed by Chipman et al. (2010) are yet to be well-understood. In this paper, we perform an asymptotic analysis of a slightly modified version of the default BART sampler when fitted to data-generating processes with discrete covariates. We show that the sampler's time to convergence, evaluated in terms of the hitting time of a high posterior density set, increases with the number of training samples, due to the multi-modal nature of the target posterior. On the other hand, we show that this trend can be dampened by simple changes, such as increasing the number of trees in the ensemble or raising the temperature of the sampler. These results provide a nuanced picture on the computational efficiency of the BART sampler in the presence of large amounts of training data while suggesting strategies to improve the sampler. We complement our theoretical analysis with a simulation study focusing on the default BART sampler. We observe that the increasing trend of convergence time against number training samples holds for the default BART sampler and is robust to changes in sampler initialization, number of burn-in iterations, feature selection prior, and discretization strategy. On the other hand, increasing the number of trees or raising the temperature sharply dampens this trend, as indicated by our theory.

replace-cross Tree Search for Language Model Agents

Authors: Jing Yu Koh, Stephen McAleer, Daniel Fried, Ruslan Salakhutdinov

Abstract: Autonomous agents powered by language models (LMs) have demonstrated promise in their ability to perform decision-making tasks such as web automation. However, a key limitation remains: LMs, primarily optimized for natural language understanding and generation, struggle with multi-step reasoning, planning, and using environmental feedback when attempting to solve realistic computer tasks. Towards addressing this, we propose an inference-time search algorithm for LM agents to explicitly perform exploration and multi-step planning in interactive web environments. Our approach is a form of best-first tree search that operates within the actual environment space, and is complementary with most existing state-of-the-art agents. It is the first tree search algorithm for LM agents that shows effectiveness on realistic web tasks. On the challenging VisualWebArena benchmark, applying our search algorithm on top of a GPT-4o agent yields a 39.7% relative increase in success rate compared to the same baseline without search, setting a state-of-the-art success rate of 26.4%. On WebArena, search also yields a 28.0% relative improvement over a baseline agent, setting a competitive success rate of 19.2%. Our experiments highlight the effectiveness of search for web agents, and we demonstrate that performance scales with increased test-time compute. We conduct a thorough analysis of our results to highlight improvements from search, limitations, and promising directions for future work. Our code and models are publicly released at https://jykoh.com/search-agents.

URLs: https://jykoh.com/search-agents.

replace-cross Fast and Stable Riemannian Metrics on SPD Manifolds via Cholesky Product Geometry

Authors: Ziheng Chen, Yue Song, Xiao-Jun Wu, Nicu Sebe

Abstract: Recent advances in Symmetric Positive Definite (SPD) matrix learning show that Riemannian metrics are fundamental to effective SPD neural networks. Motivated by this, we revisit the geometry of the Cholesky factors and uncover a simple product structure that enables convenient metric design. Building on this insight, we propose two fast and stable SPD metrics, Power--Cholesky Metric (PCM) and Bures--Wasserstein--Cholesky Metric (BWCM), derived via Cholesky decomposition. Compared with existing SPD metrics, the proposed metrics provide closed-form operators, computational efficiency, and improved numerical stability. We further apply our metrics to construct Riemannian Multinomial Logistic Regression (MLR) classifiers and residual blocks for SPD neural networks. Experiments on SPD deep learning, numerical stability analyses, and tensor interpolation demonstrate the effectiveness, efficiency, and robustness of our metrics. The code is available at https://github.com/GitZH-Chen/PCM_BWCM.

URLs: https://github.com/GitZH-Chen/PCM_BWCM.

replace-cross Q-Learning under Finite Model Uncertainty

Authors: Julian Sester, C\'ecile Decker

Abstract: We propose a robust Q-learning algorithm for Markov decision processes under model uncertainty when each state-action pair is associated with a finite ambiguity set of candidate transition kernels. This finite-measure framework enables highly flexible, user-designed uncertainty models and goes beyond the common KL and Wasserstein ball formulations. We establish almost sure convergence of the learned Q-function to the robust optimum, and derive non-asymptotic high-probability error bounds that separate stochastic approximation error from transition-kernel estimation error. Finally, we show that Wasserstein ball and parametric ambiguity sets can be approximated by finite ambiguity sets, allowing our algorithm to be used as a generic solver beyond the finite setting.

replace-cross Note on computational complexity of the Gromov-Wasserstein distance

Authors: Natalia Kravtsova

Abstract: This note addresses computational difficulty of the Gromov-Wasserstein distance frequently mentioned in the literature. We provide details on the structure of the Gromov-Wasserstein distance optimization problem that show its non-convex quadratic nature for any instance of an input data. We further illustrate the non-convexity of the problem with several explicit examples.

replace-cross Building Multilingual Datasets for Predicting Mental Health Severity through LLMs: Prospects and Challenges

Authors: Konstantinos Skianis, John Pavlopoulos, A. Seza Do\u{g}ru\"oz

Abstract: Large Language Models (LLMs) are increasingly being integrated into various medical fields, including mental health support systems. However, there is a gap in research regarding the effectiveness of LLMs in non-English mental health support applications. To address this problem, we present a novel multilingual adaptation of widely-used mental health datasets, translated from English into six languages (e.g., Greek, Turkish, French, Portuguese, German, and Finnish). This dataset enables a comprehensive evaluation of LLM performance in detecting mental health conditions and assessing their severity across multiple languages. By experimenting with GPT and Llama, we observe considerable variability in performance across languages, despite being evaluated on the same translated dataset. This inconsistency underscores the complexities inherent in multilingual mental health support, where language-specific nuances and mental health data coverage can affect the accuracy of the models. Through comprehensive error analysis, we emphasize the risks of relying exclusively on LLMs in medical settings (e.g., their potential to contribute to misdiagnoses). Moreover, our proposed approach offers significant cost savings for multilingual tasks, presenting a major advantage for broad-scale implementation.

replace-cross A High Resolution Urban and Rural Settlement Map of Africa Using Deep Learning and Satellite Imagery

Authors: Mohammad Kakooei, James Bailie, Markus B. Pettersson, Albin S\"oderberg, Albin Becevic, Adel Daoud

Abstract: Accurate and consistent mapping of urban and rural areas is crucial for sustainable development, spatial planning, and policy design. It is particularly important in simulating the complex interactions between human activities and natural resources. Existing global urban-rural datasets such as such as GHSL-SMOD, GHS Degree of Urbanisation, and GRUMP are often spatially coarse, methodologically inconsistent, and poorly adapted to heterogeneous regions such as Africa, which limits their usefulness for policy and research. Their coarse grids and rule-based classification methods obscure small or informal settlements, and produce inconsistencies between countries. In this study, we develop a DeepLabV3-based deep learning framework that integrates multi-source data, including Landsat-8 imagery, VIIRS nighttime lights, ESRI Land Use Land Cover (LULC), and GHS-SMOD, to produce a 10m resolution urban-rural map across the African continent from 2016 to 2022. The use of Landsat data also highlights the potential to extend this mapping approach historically, reaching back to the 1990s. The model employs semantic segmentation to capture fine-scale settlement morphology, and its outputs are validated using the Demographic and Health Surveys (DHS) dataset, which provides independent, survey-based urban-rural labels. The model achieves an overall accuracy of 65% and a Kappa coefficient of 0.47 at the continental scale, outperforming existing global products such as SMOD. The resulting High-Resolution Urban-Rural (HUR) dataset provides an open and reproducible framework for mapping human settlements, enabling more context-aware analyses of Africa's rapidly evolving settlement systems. We release a continent-wide urban-rural dataset covering the period from 2016 to 2022, offering a new source for high-resolution settlement mapping in Africa.

replace-cross Fully Dynamic Adversarially Robust Correlation Clustering in Polylogarithmic Update Time

Authors: Vladimir Braverman, Prathamesh Dharangutte, Shreyas Pai, Vihan Shah, Chen Wang

Abstract: We study the dynamic correlation clustering problem with $\textit{adaptive}$ edge label flips. In correlation clustering, we are given a $n$-vertex complete graph whose edges are labeled either $(+)$ or $(-)$, and the goal is to minimize the total number of $(+)$ edges between clusters and the number of $(-)$ edges within clusters. We consider the dynamic setting with adversarial robustness, in which the $\textit{adaptive}$ adversary could flip the label of an edge based on the current output of the algorithm. Our main result is a randomized algorithm that always maintains an $O(1)$-approximation to the optimal correlation clustering with $O(\log^{2}{n})$ amortized update time. Prior to our work, no algorithm with $O(1)$-approximation and $\text{polylog}{(n)}$ update time for the adversarially robust setting was known. We further validate our theoretical results with experiments on synthetic and real-world datasets with competitive empirical performances. Our main technical ingredient is an algorithm that maintains $\textit{sparse-dense decomposition}$ with $\text{polylog}{(n)}$ update time, which could be of independent interest.

replace-cross Differentially Private Adaptation of Diffusion Models via Noisy Aggregated Embeddings

Authors: Pura Peetathawatchai, Wei-Ning Chen, Berivan Isik, Sanmi Koyejo, Albert No

Abstract: Personalizing large-scale diffusion models poses serious privacy risks, especially when adapting to small, sensitive datasets. A common approach is to fine-tune the model using differentially private stochastic gradient descent (DP-SGD), but this suffers from severe utility degradation due to the high noise needed for privacy, particularly in the small data regime. We propose an alternative that leverages Textual Inversion (TI), which learns an embedding vector for an image or set of images, to enable adaptation under differential privacy (DP) constraints. Our approach, Differentially Private Aggregation via Textual Inversion (DPAgg-TI), adds calibrated noise to the aggregation of per-image embeddings to ensure formal DP guarantees while preserving high output fidelity. We show that DPAgg-TI outperforms DP-SGD finetuning in both utility and robustness under the same privacy budget, achieving results closely matching the non-private baseline on style adaptation tasks using private artwork from a single artist and Paris 2024 Olympic pictograms. In contrast, DP-SGD fails to generate meaningful outputs in this setting.

replace-cross End to End Collaborative Synthetic Data Generation

Authors: Sikha Pentyala, Geetha Sitaraman, Trae Claar, Martine De Cock

Abstract: The success of AI is based on the availability of data to train models. While in some cases a single data custodian may have sufficient data to enable AI, often multiple custodians need to collaborate to reach a cumulative size required for meaningful AI research. The latter is, for example, often the case for rare diseases, with each clinical site having data for only a small number of patients. Recent algorithms for federated synthetic data generation are an important step towards collaborative, privacy-preserving data sharing. Existing techniques, however, focus exclusively on synthesizer training, assuming that the training data is already preprocessed and that the desired synthetic data can be delivered in one shot, without any hyperparameter tuning. In this paper, we propose an end-to-end collaborative framework for publishing of synthetic data that accounts for privacy-preserving preprocessing as well as evaluation. We instantiate this framework with Secure Multiparty Computation (MPC) protocols and evaluate it in a use case for privacy-preserving publishing of synthetic genomic data for leukemia.

replace-cross The Oracle Complexity of Simplex-based Matrix Games

Authors: Guy Kornowski, Ohad Shamir

Abstract: We study the problem of solving matrix games of the form $\min_{\mathbf{p}\in\Delta}\max_{\mathbf{w}\in\mathcal{W}}\mathbf{p}^{\top}A\mathbf{w}$, where $A$ is a matrix and $\Delta$ is the probability simplex. This problem encapsulates canonical tasks such as finding a linear separator and computing Nash equilibria in zero-sum games. However, perhaps surprisingly, its inherent complexity (as formalized in the standard framework of oracle complexity) is not well understood. In this work, we first identify different oracle models that are implicitly used by prior algorithms, corresponding to multiplying the matrix $A$ by a vector from either one or both sides. We then prove complexity lower bounds for algorithms under both access models, which in particular imply a separation between them. As our main result, we prove that in the general $\ell_p$/simplex setting where $\mathcal{W}$ is an $\ell_p$ ball for $p\in[1,\infty]$, any algorithm that utilizes two-sided matrix-vector multiplications requires $\tilde{\Omega}(\epsilon^{-2/3})$ iterations to return an $\epsilon$-suboptimal solution. For any $p\in[1,\infty]$, this is either the first lower bound for such problems, or an exponential improvement over the previously best-known results. Moreover, for the canonical tasks of finding a linear separator and computing a Nash equilibrium, our lower bounds match (up to log factors) recent algorithms of Karmarkar, O'Carroll and Sidford (2026), thereby resolving their oracle complexities in a natural setting.

replace-cross Generative Modeling of Neural Dynamics via Latent Stochastic Differential Equations

Authors: Ahmed ElGazzar, Marcel van Gerven

Abstract: We propose a probabilistic framework for developing computational models of biological neural systems. In this framework, physiological recordings are viewed as discrete-time partial observations of an underlying continuous-time stochastic dynamical system which implements computations through its state evolution. To model this dynamical system, we employ a system of coupled stochastic differential equations with differentiable drift and diffusion functions and use variational inference to infer its states and parameters. This formulation enables seamless integration of existing mathematical models in the literature, neural networks, or a hybrid of both to learn and compare different models. We demonstrate this in our framework by developing a generative model that combines coupled oscillators with neural networks to capture latent population dynamics from single-cell recordings. Evaluation across three neuroscience datasets spanning different species, brain regions, and behavioral tasks show that these hybrid models achieve competitive performance in predicting stimulus-evoked neural and behavioral responses compared to sophisticated black-box approaches while requiring an order of magnitude fewer parameters, providing uncertainty estimates, and offering a natural language for interpretation.

replace-cross AI-Powered Intracranial Hemorrhage Detection: A Co-Scale Convolutional Attention Model with Uncertainty-Based Fuzzy Integral Operator and Feature Screening

Authors: Mehdi Hosseini Chagahi, Niloufar Delfan, Behzad Moshiri, Md. Jalil Piran, Jaber Hatam Parikhan

Abstract: Intracranial hemorrhage (ICH) refers to the leakage or accumulation of blood within the skull, which occurs due to the rupture of blood vessels in or around the brain. If this condition is not diagnosed in a timely manner and appropriately treated, it can lead to serious complications such as decreased consciousness, permanent neurological disabilities, or even death.The primary aim of this study is to detect the occurrence or non-occurrence of ICH, followed by determining the type of subdural hemorrhage (SDH). These tasks are framed as two separate binary classification problems. By adding two layers to the co-scale convolutional attention (CCA) classifier architecture, we introduce a novel approach for ICH detection. In the first layer, after extracting features from different slices of computed tomography (CT) scan images, we combine these features and select the 50 components that capture the highest variance in the data, considering them as informative features. We then assess the discriminative power of these features using the bootstrap forest algorithm, discarding those that lack sufficient discriminative ability between different classes. This algorithm explicitly determines the contribution of each feature to the final prediction, assisting us in developing an explainable AI model. The features feed into a boosting neural network as a latent feature space. In the second layer, we introduce a novel uncertainty-based fuzzy integral operator to fuse information from different CT scan slices. This operator, by accounting for the dependencies between consecutive slices, significantly improves detection accuracy.

replace-cross Solving Optimal Execution Problems via In-Context Operator Networks

Authors: Tingwei Meng, Moritz Vo{\ss}, Nils Detering, Giulio Farolfi, Stanley Osher, Georg Menz

Abstract: We propose a novel transformer-based neural network architecture (ICON-OCnet) for solving optimal order execution problems in the presence of unknown price impact. Our architecture facilitates data-driven in-context operator learning for the incurred price impact by merging offline pre-training with online few-shot prompting inference. First, the operator learning component (ICON) learns the prevailing price impact environment from only a few executed trade and price impact trajectories (time series data) provided as context. Second, we employ ICON as a surrogate operator to train a neural network policy (OCnet) for the optimal order execution strategy for the price impact regime inferred from the in-context examples. We study the efficiency of our approach for linear propagator models with path-dependent transient price impact and explicitly known optimal execution strategies. In this model class, price impact persists and decays over time according to some propagator kernel. We illustrate that ICON is capable of accurately inferring the underlying price impact model from the data prompts, even for propagator kernels not seen in the training data. Moreover, we demonstrate that ICON-OCnet correctly retrieves the exact optimal order execution strategy for the model generating the in-context examples. Our introduced methodology is very general, offering a new approach to solving path-dependent optimal stochastic control problems sample-based with unknown state dynamics.

replace-cross Rethinking Functional Brain Connectome Analysis: Do Graph Deep Learning Models Help

Authors: Keqi Han, Yao Su, Lifang He, Liang Zhan, Sergey Plis, Vince Calhoun, Carl Yang

Abstract: Graph deep learning models, a class of AI-driven approaches employing a message aggregation mechanism, have gained popularity for analyzing the functional brain connectome in neuroimaging. However, their actual effectiveness remains unclear. In this study, we re-examine graph deep learning versus classical machine learning models based on four large-scale neuroimaging studies. Surprisingly, we find that the message aggregation mechanism, a hallmark of graph deep learning models, does not help with predictive performance as typically assumed, but rather consistently degrades it. To address this issue, we propose a hybrid model combining a linear model with a graph attention network through dual pathways, achieving robust predictions and enhanced interpretability by revealing both localized and global neural connectivity patterns. Our findings urge caution in adopting complex deep learning models for functional brain connectome analysis, emphasizing the need for rigorous experimental designs to establish tangible performance gains and perhaps more importantly, to pursue improvements in model interpretability.

replace-cross Step by Step: Adaptive Gradient Descent for Training L-Lipschitz Neural Networks

Authors: Kyle Sung, Kholood Khalil, Noah Forman, Steven Samu, Anastasis Kratsios

Abstract: We demonstrate that applying an eventual decay to the learning rate (LR) in empirical risk minimization (ERM), where the mean-squared-error loss is minimized using standard gradient descent (GD) for training a two-layer neural network with Lipschitz activation functions, ensures that the resulting network exhibits a high degree of Lipschitz regularity, that is, a small Lipschitz constant. Moreover, we show that this decay does not hinder the convergence rate of the empirical risk, now measured with the Huber loss, toward a critical point of the non-convex empirical risk. From these findings, we derive generalization bounds for two-layer neural networks trained with GD and a decaying LR with a sub-linear dependence on its number of trainable parameters, suggesting that the statistical behaviour of these networks is independent of overparameterization. We validate our theoretical results with a series of toy numerical experiments, where surprisingly, we observe that networks trained with constant step size GD exhibit similar learning and regularity properties to those trained with a decaying LR. This suggests that neural networks trained with standard GD may already be highly regular learners.

replace-cross Summaries as Centroids for Interpretable and Scalable Text Clustering

Authors: Jairo Diaz-Rodriguez

Abstract: We introduce k-NLPmeans and k-LLMmeans, text-clustering variants of k-means that periodically replace numeric centroids with textual summaries. The key idea, summary-as-centroid, retains k-means assignments in embedding space while producing human-readable, auditable cluster prototypes. The method is LLM-optional: k-NLPmeans uses lightweight, deterministic summarizers, enabling offline, low-cost, and stable operation; k-LLMmeans is a drop-in upgrade that uses an LLM for summaries under a fixed per-iteration budget whose cost does not grow with dataset size. We also present a mini-batch extension for real-time clustering of streaming text. Across diverse datasets, embedding models, and summarization strategies, our approach consistently outperforms classical baselines and approaches the accuracy of recent LLM-based clustering-without extensive LLM calls. Finally, we provide a case study on sequential text streams and release a StackExchange-derived benchmark for evaluating streaming text clustering.

replace-cross Revisiting Privacy, Utility, and Efficiency Trade-offs when Fine-Tuning Large Language Models

Authors: Soumi Das, Camila Kolling, Mohammad Aflah Khan, Mahsa Amani, Bishwamittra Ghosh, Qinyuan Wu, Till Speicher, Krishna P. Gummadi

Abstract: We study the inherent trade-offs in minimizing privacy risks and maximizing utility, while maintaining high computational efficiency, when fine-tuning large language models (LLMs). A number of recent works in privacy research have attempted to mitigate privacy risks posed by memorizing fine-tuning data by using differentially private training methods (e.g., DP), albeit at a significantly higher computational cost (inefficiency). In parallel, several works in systems research have focussed on developing (parameter) efficient fine-tuning methods (e.g., LoRA), but few works, if any, investigated whether such efficient methods enhance or diminish privacy risks. In this paper, we investigate this gap and arrive at a surprising conclusion: efficient fine-tuning methods like LoRA mitigate privacy risks similar to private fine-tuning methods like DP. Our empirical finding directly contradicts prevailing wisdom that privacy and efficiency objectives are at odds during fine-tuning. Our finding is established by (a) carefully defining measures of privacy and utility that distinguish between memorizing sensitive and non-sensitive tokens in training and test datasets used in fine-tuning and (b) extensive evaluations using multiple open-source language models from Pythia, Gemma, Llama, and Qwen families and different domain-specific datasets.

replace-cross The Structural Complexity of Matrix-Vector Multiplication

Authors: Emile Anand, Jan van den Brand, Rose McCarty

Abstract: We consider the problem of preprocessing an $n\times n$ matrix $\mathbf{M}$, and supporting queries that, for any vector $v$, returns the matrix-vector product $\mathbf{M} v$. This problem has been extensively studied in both theory and practice: on one side, practitioners have developed algorithms that are highly efficient in practice, whereas on the other side, theoreticians have proven that the problem cannot be solved faster than naive multiplication in the worst-case. This lower bound holds even in the average-case, implying that existing average-case analyses cannot explain this gap between theory and practice. Hence, we study the problem for \emph{structured} matrices. We show that for $n\times n$ Boolean matrices of VC-dimension $d$, the matrix-vector multiplication problem can be solved with $\widetilde{O}(n^2)$ preprocessing and $\widetilde{O}(n^{2-1/d})$ query time. Given the low constant VC-dimensions observed in most real-world data, our results posit an explanation for why the problem can be solved so much faster in practice. Furthermore, we show how to extend this result to the non-Boolean setting with the Pollard pseudodimension. Our results yield the first non-trivial upper bounds for many applications. In previous works, the online matrix-vector (OMv) hypothesis (conjecturing that quadratic time is needed per query, even over the boolean semi-ring) was used to prove many conditional lower bounds, showing that it is impossible to compute and maintain high-accuracy estimates for effective resistance, Laplacian solvers, shortest paths, and triangle detection in graphs subject to node insertions and deletions in subquadratic time. Yet, via a reduction to our matrix-vector multiplication result, we show we can maintain these problems efficiently if the input is structured, providing the first subquadratic upper bounds in the high-accuracy regime.

replace-cross Research Superalignment Should Advance Now with Alternating Competence and Conformity Optimization

Authors: HyunJin Kim, Xiaoyuan Yi, Jing Yao, Muhua Huang, JinYeong Bak, James Evans, Xing Xie

Abstract: The recent leap in AI capabilities, driven by big generative models, has sparked the possibility of achieving Artificial General Intelligence (AGI) and further triggered discussions on Artificial Superintelligence (ASI)-a system surpassing all humans across measured domains. This gives rise to the critical research question of: As we approach ASI, how do we align it with human values, ensuring it benefits rather than harms human society, a.k.a., the Superalignment problem. Despite ASI being regarded by many as a hypothetical concept, in this position paper, we argue that superalignment is achievable and research on it should advance immediately, through simultaneous and alternating optimization of task competence and value conformity. We posit that superalignment is not merely a safeguard for ASI but also necessary for its responsible realization. To support this position, we first provide a formal definition of superalignment rooted in the gap between capability and capacity, delve into its perceived infeasibility by analyzing the limitations of existing paradigms, and then illustrate a conceptual path of superalignment to support its achievability, centered on two fundamental principles. This work frames a potential initiative for developing value-aligned next-generation AI in the future, which will garner greater benefits and reduce potential harm to humanity.

replace-cross Adversarial Wear and Tear: Exploiting Natural Damage for Generating Physical-World Adversarial Examples

Authors: Samra Irshad, Seungkyu Lee, Nassir Navab, Hong Joo Lee, Seong Tae Kim

Abstract: The presence of adversarial examples in the physical world poses significant challenges to the deployment of Deep Neural Networks in safety-critical applications such as autonomous driving. Most existing methods for crafting physical-world adversarial examples are ad-hoc, relying on temporary modifications like shadows, laser beams, or stickers that are tailored to specific scenarios. In this paper, we introduce a new class of physical-world adversarial examples, AdvWT, which draws inspiration from the naturally occurring phenomenon of `wear and tear', an inherent property of physical objects. Unlike manually crafted perturbations, `wear and tear' emerges organically over time due to environmental degradation, as seen in the gradual deterioration of outdoor signboards. To achieve this, AdvWT follows a two-step approach. First, a GAN-based, unsupervised image-to-image translation network is employed to model these naturally occurring damages, particularly in the context of outdoor signboards. The translation network encodes the characteristics of damaged signs into a latent `damage style code'. In the second step, we introduce adversarial perturbations into the style code, strategically optimizing its transformation process. This manipulation subtly alters the damage style representation, guiding the network to generate adversarial images where the appearance of damages remains perceptually realistic, while simultaneously ensuring their effectiveness in misleading neural networks. Through comprehensive experiments on two traffic sign datasets, we show that AdvWT effectively misleads DNNs in both digital and physical domains. AdvWT achieves an effective attack success rate, greater robustness, and a more natural appearance compared to existing physical-world adversarial examples. Additionally, integrating AdvWT into training enhances a model's generalizability to real-world damaged signs.

replace-cross Sparsified-Learning for High-Dimensional Heavy-Tailed Locally Stationary Time Series, Concentration and Oracle Inequalities

Authors: Yingjie Wang, Mokhtar Z. Alaya, Salim Bouzebda, Xinsheng Liu

Abstract: Sparse learning is ubiquitous in many machine learning tasks. It aims to regularize the goodness-of-fit objective by adding a penalty term to encode structural constraints on the model parameters. In this paper, we develop a flexible sparse learning framework tailored to high-dimensional heavy-tailed locally stationary time series (LSTS). The data-generating mechanism incorporates a regression function that changes smoothly over time and is observed under noise belonging to the class of sub-Weibull and regularly varying distributions. We introduce a sparsity-inducing penalized estimation procedure that combines additive modeling with kernel smoothing and define an additive kernel-smoothing hypothesis class. In the presence of locally stationary dynamics, we assume exponentially decaying $\beta$-mixing coefficients to derive concentration inequalities for kernel-weighted sums of locally stationary processes with heavy-tailed noise. We further establish nonasymptotic prediction-error bounds, yielding both slow and fast convergence rates under different sparsity structures, including Lasso and total variation penalization with the least-squares loss. To support our theoretical results, we conduct numerical experiments on simulated LSTS with sub-Weibull and Pareto noise, highlighting how tail behavior affects prediction error across different covariate-dimensions as the sample size increases.

replace-cross Differentially Private Geodesic Regression

Authors: Aditya Kulkarni, Carlos Soto

Abstract: In statistical applications it has become increasingly common to encounter data structures that live on non-linear spaces such as manifolds. Classical linear regression, one of the most fundamental methodologies of statistical learning, captures the relationship between an independent variable and a response variable which both are assumed to live in Euclidean space. Thus, geodesic regression emerged as an extension where the response variable lives on a Riemannian manifold. The parameters of geodesic regression, as with linear regression, capture the relationship of sensitive data and hence one should consider the privacy protection practices of said parameters. We consider releasing Differentially Private (DP) parameters of geodesic regression via the K-Norm Gradient (KNG) mechanism for Riemannian manifolds. We derive theoretical bounds for the sensitivity of the parameters showing they are tied to their respective Jacobi fields and hence the curvature of the space. This corroborates, and extends, recent findings of differential privacy for the Fr\'echet mean. We demonstrate the efficacy of our methodology on the sphere, $S_2\subset\mathbb{R}^3$, the space of symmetric positive definite matrices, and Kendall's planar shape space. Our methodology is general to any Riemannian manifold, and thus it is suitable for data in domains such as medical imaging and computer vision.

replace-cross Dynamic and Distributed Routing in IoT Networks based on Multi-Objective Q-Learning

Authors: Shubham Vaishnav, Praveen Kumar Donta, Sindri Magn\'usson

Abstract: IoT networks often face conflicting routing goals such as maximizing packet delivery, minimizing delay, and conserving limited battery energy. These priorities can also change dynamically: for example, an emergency alert requires high reliability, while routine monitoring prioritizes energy efficiency to prolong network lifetime. Existing works, including many deep reinforcement learning approaches, are typically centralized and assume static objectives, making them slow to adapt when preferences shift. We propose a dynamic and fully distributed multi-objective Q-learning routing algorithm that learns multiple per-preference Q-tables in parallel and introduces a novel greedy interpolation policy to act near-optimally for unseen preferences without retraining or central coordination. A theoretical analysis further shows that the optimal value function is Lipschitz-continuous in the preference parameter, ensuring that the proposed greedy interpolation policy yields provably near-optimal behavior. Simulations show that our approach adapts in real time to shifting priorities and achieves up to 80-90\% lower energy consumption and more than 2-5x higher cumulative rewards and packet delivery compared to six baseline protocols, under dynamic and distributed settings. Sensitivity analysis across varying preference window lengths confirms that the proposed DPQ framework consistently achieves higher composite reward than all baseline methods, demonstrating robustness to changes in operating conditions.

replace-cross Approximating Signed Distance Fields With Sparse Ellipsoidal Radial Basis Function Networks: A Dynamic Multi-Objective Optimization Strategy

Authors: Bobo Lian, Zidong Wang, Dandan Wang, Chenjian Wu, Minxin Chen

Abstract: Accurate and compact representation of signed distance functions (SDFs) of implicit surfaces is crucial for efficient storage, computation, and downstream processing of 3D geometry. In this work, we propose a general learning method for approximating precomputed SDF fields of implicit surfaces by a relatively small number of ellipsoidal radial basis functions (ERBFs). The SDF values could be computed from various sources, including point clouds, triangle meshes, analytical expressions, pretrained neural networks, etc. Given SDF values on spatial grid points, our method approximates the SDF using as few ERBFs as possible, achieving a compact representation while preserving the geometric shape of the corresponding implicit surface. To balance sparsity and approximation precision, we introduce a dynamic multi-objective optimization strategy, which adaptively incorporates regularization to enforce sparsity and jointly optimizes the weights, centers, shapes, and orientations of the ERBFs. For computational efficiency, a nearest-neighbor-based data structure restricts computations to points near each kernel center, and CUDA-based parallelism further accelerates the optimization. Furthermore, a hierarchical refinement strategy based on SDF spatial grid points progressively incorporates coarse-to-fine samples for parameter initialization and optimization, improving convergence and training efficiency. Extensive experiments on multiple benchmark datasets demonstrate that our method can represent SDF fields with significantly fewer parameters than existing sparse implicit representation approaches, achieving better accuracy, robustness, and computational efficiency. The corresponding executable program is publicly available at https://github.com/lianbobo/SE-RBFNet.git

URLs: https://github.com/lianbobo/SE-RBFNet.git

replace-cross Accelerated Markov Chain Monte Carlo Algorithms on Discrete States

Authors: Bohan Zhou, Shu Liu, Xinzhe Zuo, Wuchen Li

Abstract: We propose a class of discrete state sampling algorithms based on Nesterov's accelerated gradient method, which extends the classical Metropolis-Hastings (MH) algorithm. The evolution of the discrete states probability distribution governed by MH can be interpreted as a gradient descent direction of the Kullback--Leibler (KL) divergence, via a mobility function and a score function. Specifically, this gradient is defined on a probability simplex equipped with a discrete Wasserstein-2 metric with a mobility function. This motivates us to study a momentum-based acceleration framework using damped Hamiltonian flows on the simplex set, whose stationary distribution matches the discrete target distribution. Furthermore, we design an interacting particle system to approximate the proposed accelerated sampling dynamics. The extension of the algorithm with a general choice of potentials and mobilities is also discussed. In particular, we choose the accelerated gradient flow of the relative Fisher information, demonstrating the advantages of the algorithm in estimating discrete score functions without requiring the normalizing constant and keeping positive probabilities. Numerical examples, including sampling on a Gaussian mixture supported on lattices or a distribution on a hypercube, demonstrate the effectiveness of the proposed discrete-state sampling algorithm.

replace-cross Fast and Simple Densest Subgraph with Predictions

Authors: Thai Bui, Luan Nguyen, Hoa T. Vu

Abstract: We study the densest subgraph problem and its variants through the lens of learning-augmented algorithms. We show that, given a reasonably accurate predictor that estimates whether a node belongs to the densest subgraph (e.g., a machine-learning classifier), one can design simple and practical linear-time algorithms that achieve a $(1-\epsilon)$-approximation to the densest subgraph. Our approach also extends to the NP-Hard densest at-most-$k$ subgraph problem and to the directed densest subgraph variant. Finally, we present experimental results demonstrating the effectiveness of our methods.

replace-cross Traceable Black-box Watermarks for Federated Learning

Authors: Jiahao Xu, Rui Hu, Olivera Kotevska, Zikai Zhang

Abstract: Due to the distributed nature of Federated Learning (FL) systems, each local client has access to the global model, which poses a critical risk of model leakage. Existing works have explored injecting watermarks into local models to enable intellectual property protection. However, these methods either focus on non-traceable watermarks or traceable but white-box watermarks. We identify a gap in the literature regarding the formal definition of traceable black-box watermarking and the formulation of the problem of injecting such watermarks into FL systems. In this work, we first formalize the problem of injecting traceable black-box watermarks into FL. Based on the problem, we propose a novel server-side watermarking method, $\mathbf{TraMark}$, which creates a traceable watermarked model for each client, enabling verification of model leakage in black-box settings. To achieve this, $\mathbf{TraMark}$ partitions the model parameter space into two distinct regions: the main task region and the watermarking region. Subsequently, a personalized global model is constructed for each client by aggregating only the main task region while preserving the watermarking region. Each model then learns a unique watermark exclusively within the watermarking region using a distinct watermark dataset before being sent back to the local client. Extensive results across various FL systems demonstrate that $\mathbf{TraMark}$ ensures the traceability of all watermarked models while preserving their main task performance. The code is available at https://github.com/JiiahaoXU/TraMark.

URLs: https://github.com/JiiahaoXU/TraMark.

replace-cross Optimal Client Sampling in Federated Learning with Client-Level Heterogeneous Differential Privacy

Authors: Jiahao Xu, Rui Hu, Olivera Kotevska

Abstract: Federated Learning with client-level differential privacy (DP) provides a promising framework for collaboratively training models while rigorously protecting clients' privacy. However, classic approaches like DP-FedAvg struggle when clients have heterogeneous privacy requirements, as they must uniformly enforce the strictest privacy level across all clients, leading to excessive DP noise and significant degradation in model utility. Existing methods to improve the model utility in such heterogeneous privacy settings often assume a trusted server and are largely heuristic, resulting in suboptimal performance and lacking strong theoretical foundations. In this work, we address these challenges under a practical attack model where both clients and the server are honest-but-curious. We propose GDPFed, which partitions clients into groups based on their privacy budgets and achieves client-level DP within each group to reduce the privacy budget waste and hence improve the model utility. Based on the privacy and convergence analysis of GDPFed, we find that the magnitude of DP noise depends on both model dimensionality and the per-group client sampling ratios. To further improve the performance of GDPFed, we introduce GDPFed$^+$, which integrates model sparsification to eliminate unnecessary noise and optimizes per-group client sampling ratios to minimize convergence error. Extensive empirical evaluations on multiple benchmark datasets demonstrate the effectiveness of GDPFed$^+$, showing substantial performance gains compared with state-of-the-art methods.

replace-cross ABBA-Adapters: Efficient and Expressive Fine-Tuning of Foundation Models

Authors: Raghav Singhal, Kaustubh Ponkshe, Rohit Vartak, Praneeth Vepakomma

Abstract: Large Language Models have demonstrated strong performance across a wide range of tasks, but adapting them efficiently to new domains remains a key challenge. Parameter-Efficient Fine-Tuning (PEFT) methods address this by introducing lightweight, trainable modules while keeping most pre-trained weights fixed. The prevailing approach, LoRA, models updates using a low-rank decomposition, but its expressivity is inherently constrained by the rank. Recent methods like HiRA aim to increase expressivity by incorporating a Hadamard product with the frozen weights, but still rely on the structure of the pre-trained model. We introduce ABBA, a new PEFT architecture that reparameterizes the update as a Hadamard product of two independently learnable low-rank matrices. In contrast to prior work, ABBA fully decouples the update from the pre-trained weights, enabling both components to be optimized freely. This leads to significantly higher expressivity under the same parameter budget, a property we validate through matrix reconstruction experiments. Empirically, ABBA achieves state-of-the-art results on arithmetic and commonsense reasoning benchmarks, consistently outperforming existing PEFT methods by a significant margin across multiple models. Our code is publicly available at: https://github.com/CERT-Lab/abba.

URLs: https://github.com/CERT-Lab/abba.

replace-cross Liouville PDE-based sliced-Wasserstein flow

Authors: Jayshawn Cooper, Pilhwa Lee

Abstract: The sliced Wasserstein flow (SWF), a nonparametric and implicit generative gradient flow, is transformed to a Liouville partial differential equation (PDE)-based formalism. First, the stochastic diffusive term from the Fokker-Planck equation-based Monte Carlo is reformulated to Liouville PDE-based transport without the diffusive term, and the involved density estimation is handled by normalizing flows of neural ODE. Next, the computation of the Wasserstein barycenter is approximated by the Liouville PDE-based SWF barycenter with the prescription of Kantorovich potentials for the induced gradient flow to generate its samples. These two efforts show outperforming convergence in training and testing Liouville PDE-based SWF and SWF barycenters with reduced variance. Applying the generative SWF barycenter for fair regression demonstrates competent profiles in the accuracy-fairness Pareto curves.

replace-cross Capability-Based Scaling Trends for LLM-Based Red-Teaming

Authors: Alexander Panfilov, Paul Kassianik, Maksym Andriushchenko, Jonas Geiping

Abstract: As large language models grow in capability and agency, identifying vulnerabilities through red-teaming becomes vital for safe deployment. However, traditional prompt-engineering approaches may prove ineffective once red-teaming turns into a \emph{weak-to-strong} problem, where target models surpass red-teamers in capabilities. To study this shift, we frame red-teaming through the lens of the \emph{capability gap} between attacker and target. We evaluate more than 600 attacker-target pairs using LLM-based jailbreak attacks that mimic human red-teamers across diverse families, sizes, and capability levels. Three strong trends emerge: (i) more capable models are better attackers, (ii) attack success drops sharply once the target's capability exceeds the attacker's, and (iii) attack success rates correlate with high performance on social science splits of the MMLU-Pro benchmark. From these observations, we derive a \emph{jailbreaking scaling curve} that predicts attack success for a fixed target based on attacker-target capability gap. These findings suggest that fixed-capability attackers (e.g., humans) may become ineffective against future models, increasingly capable open-source models amplify risks for existing systems, and model providers must accurately measure and control models' persuasive and manipulative abilities to limit their effectiveness as attackers.

replace-cross Two failure modes of deep transformers and how to avoid them: a unified theory of signal propagation at initialisation

Authors: Alessio Giorlandino, Sebastian Goldt

Abstract: Finding the right initialisation for neural networks is crucial to ensure smooth training and good performance. In transformers, the wrong initialisation can lead to one of two failure modes of self-attention layers: rank collapse, where all tokens collapse into similar representations, and entropy collapse, where highly concentrated attention scores lead to training instability. While previous work has studied different scaling regimes for transformers, an asymptotically exact, down-to-the constant prescription for how to initialise transformers has so far been lacking. Here, we provide an analytical theory of signal propagation through deep transformers with self-attention, layer normalisation, skip connections and MLP. Our theory yields a simple algorithm to compute trainability diagrams that identify the correct choice of initialisation hyper-parameters for a given architecture. We overcome the key challenge, an exact treatment of the self-attention layer, by establishing a formal parallel with the Random Energy Model from statistical physics. We also analyse gradients in the backward path and determine the regime where gradients vanish at initialisation. We demonstrate the versatility of our framework through three case studies. Our theoretical framework gives a unified perspective on the two failure modes of self-attention and gives quantitative predictions on the scale of both weights and residual connections that guarantee smooth training.

replace-cross Algorithm- and Data-Dependent Generalization Bounds for Diffusion Models

Authors: Benjamin Dupuis, Dario Shariatian, Maxime Haddouche, Alain Durmus, Umut Simsekli

Abstract: Score-based generative models (SGMs) have emerged as one of the most popular classes of generative models. A substantial body of work now exists on the analysis of SGMs, focusing either on discretization aspects or on their statistical performance. In the latter case, bounds have been derived, under various metrics, between the true data distribution and the distribution induced by the SGM, often demonstrating polynomial convergence rates with respect to the number of training samples. However, these approaches adopt a largely approximation theory viewpoint, which tends to be overly pessimistic and relatively coarse. In particular, they fail to fully explain the empirical success of SGMs or capture the role of the optimization algorithm used in practice to train the score network. To support this observation, we first present simple experiments illustrating the concrete impact of optimization hyperparameters on the generalization ability of the generated distribution. Then, this paper aims to bridge this theoretical gap by providing the first algorithmic- and data-dependent generalization analysis for SGMs. In particular, we establish bounds that explicitly account for the optimization dynamics of the learning algorithm, offering new insights into the generalization behavior of SGMs. Our theoretical findings are supported by empirical results on several datasets.

replace-cross Exploring Adversarial Watermarking in Transformer-Based Models: Transferability and Robustness Against Defense Mechanism for Medical Images

Authors: Rifat Sadik, Tanvir Rahman, Arpan Bhattacharjee, Bikash Chandra Halder, Ismail Hossain, Mridul Banik, Jia Uddin

Abstract: Deep learning models have shown remarkable success in dermatological image analysis, offering potential for automated skin disease diagnosis. Previously, convolutional neural network(CNN) based architectures have achieved immense popularity and success in computer vision (CV) based task like skin image recognition, generation and video analysis. But with the emergence of transformer based models, CV tasks are now are nowadays carrying out using these models. Vision Transformers (ViTs) is such a transformer-based models that have shown success in computer vision. It uses self-attention mechanisms to achieve state-of-the-art performance across various tasks. However, their reliance on global attention mechanisms makes them susceptible to adversarial perturbations. This paper aims to investigate the susceptibility of ViTs for medical images to adversarial watermarking-a method that adds so-called imperceptible perturbations in order to fool models. By generating adversarial watermarks through Projected Gradient Descent (PGD), we examine the transferability of such attacks to CNNs and analyze the performance defense mechanism -- adversarial training. Results indicate that while performance is not compromised for clean images, ViTs certainly become much more vulnerable to adversarial attacks: an accuracy drop of as low as 27.6%. Nevertheless, adversarial training raises it up to 90.0%.

replace-cross Infinity Search: Approximate Vector Search with Projections on q-Metric Spaces

Authors: Antonio Pariente, Ignacio Hounie, Santiago Segarra, Alejandro Ribeiro

Abstract: An ultrametric space or infinity-metric space is defined by a dissimilarity function that satisfies a strong triangle inequality in which every side of a triangle is not larger than the larger of the other two. We show that search in ultrametric spaces with a vantage point tree has worst-case complexity equal to the depth of the tree. Since datasets of interest are not ultrametric in general, we employ a projection operator that transforms an arbitrary dissimilarity function into an ultrametric space while preserving nearest neighbors. We further learn an approximation of this projection operator to efficiently compute ultrametric distances between query points and points in the dataset. We proceed to solve a more general problem in which we consider projections in $q$-metric spaces -- in which triangle sides raised to the power of $q$ are smaller than the sum of the $q$-powers of the other two. Notice that the use of learned approximations of projected $q$-metric distances renders the search pipeline approximate. We show in experiments that increasing values of $q$ result in faster search but lower recall. Overall, search in q-metric and infinity metric spaces is competitive with existing search methods.

replace-cross Neural-Augmented Kelvinlet for Real-Time Soft Tissue Deformation Modeling

Authors: Ashkan Shahbazi, Kyvia Pereira, Jon S. Heiselman, Elaheh Akbari, Annie C. Benson, Sepehr Seifi, Xinyuan Liu, Garrison L. Johnston, Jie Ying Wu, Nabil Simaan, Michael L. Miga, Soheil Kolouri

Abstract: Accurate and efficient modeling of soft-tissue interactions is fundamental for advancing surgical simulation, surgical robotics, and model-based surgical automation. To achieve real-time latency, classical Finite Element Method (FEM) solvers are often replaced with neural approximations; however, naively training such models in a fully data-driven manner without incorporating physical priors frequently leads to poor generalization and physically implausible predictions. We present a novel physics-informed neural simulation framework that enables real-time prediction of soft-tissue deformations under complex single- and multi-grasper interactions. Our approach integrates Kelvinlet-based analytical priors with large-scale FEM data, capturing both linear and nonlinear tissue responses. This hybrid design improves predictive accuracy and physical plausibility across diverse neural architectures while maintaining the low-latency performance required for interactive applications. We validate our method on challenging surgical manipulation tasks involving standard laparoscopic grasping tools, demonstrating substantial improvements in deformation fidelity and temporal stability over existing baselines. These results establish Kelvinlet-augmented learning as a principled and computationally efficient paradigm for real-time, physics-aware soft-tissue simulation in surgical AI.

replace-cross Task-Conditioned Probing Reveals Brain-Alignment Patterns in Instruction-Tuned Multimodal LLMs

Authors: Subba Reddy Oota, Khushbu Pahwa, Prachi Jindal, Satya Sai Srinath Namburi, Maneesh Singh, Tanmoy Chakraborty, Bapi S. Raju, Manish Gupta

Abstract: Recent voxel-wise multimodal brain encoding studies have shown that multimodal large language models (MLLMs) exhibit a higher degree of brain alignment compared to unimodal models. More recently, instruction-tuned multimodal (IT) models have been shown to generate task-specific representations that align strongly with brain activity, yet most prior evaluations focus on unimodal stimuli or non-instruction-tuned models under multimodal stimuli. We still lack a clear understanding of whether instruction-tuning is associated with IT-MLLMs organizing their representations around functional task demands or if they simply reflect surface semantics. To address this, we estimate brain alignment by predicting fMRI responses recorded during naturalistic movie watching (video with audio) from MLLM representations. Using instruction-specific embeddings from six video and two audio IT-MLLMs, across 13 video task instructions, we find that instruction-tuned video MLLMs significantly outperform in-context learning (ICL) multimodal models (~9%), non-instruction-tuned multimodal models (~15%), and unimodal baselines (~20%). Our evaluation of MLLMs across video and audio tasks, and language-guided probing produces distinct task-specific MLLM representations that vary across brain regions. We also find that ICL models show strong semantic organization (r=0.78), while IT models show weak coupling to instruction-text semantics (r=0.14), consistent with task-conditioned subspaces associated with higher brain alignment. These findings are consistent with an association between task-specific instructions and stronger brain-MLLM alignment, and open new avenues for mapping joint information processing in both systems. We make the code publicly available [https://github.com/subbareddy248/mllm_videos].

URLs: https://github.com/subbareddy248/mllm_videos].

replace-cross Scaling Laws for Uncertainty in Deep Learning

Authors: Mattia Rosso, Simone Rossi, Giulio Franzese, Markus Heinonen, Maurizio Filippone

Abstract: Deep learning has recently revealed the existence of scaling laws, demonstrating that model performance follows predictable trends based on dataset and model sizes. Inspired by these findings and fascinating phenomena emerging in the over-parameterized regime, we examine a parallel direction: do similar scaling laws govern predictive uncertainties in deep learning? In identifiable parametric models, such scaling laws can be derived in a straightforward manner by treating model parameters in a Bayesian way. In this case, for example, we obtain $O(1/N)$ contraction rates for epistemic uncertainty with respect to the number of data $N$. However, in over-parameterized models, these guarantees do not hold, leading to largely unexplored behaviors. In this work, we empirically show the existence of scaling laws associated with various measures of predictive uncertainty with respect to dataset and model sizes. Through experiments on vision and language tasks, we observe such scaling laws for in- and out-of-distribution predictive uncertainty estimated through popular approximate Bayesian inference and ensemble methods. Besides the elegance of scaling laws and the practical utility of extrapolating uncertainties to larger data or models, this work provides strong evidence to dispel recurring skepticism against Bayesian approaches: "In many applications of deep learning we have so much data available: what do we need Bayes for?". Our findings show that "so much data" is typically not enough to make epistemic uncertainty negligible.

replace-cross Revisiting Transformers with Insights from Image Filtering and Boosting

Authors: Laziz U. Abdullaev, Maksim Tkachenko, Tan M. Nguyen

Abstract: The self-attention mechanism, a cornerstone of Transformer-based state-of-the-art deep learning architectures, is largely heuristic-driven and fundamentally challenging to interpret. Establishing a robust theoretical foundation to explain its remarkable success and limitations has therefore become an increasingly prominent focus in recent research. Some notable directions have explored understanding self-attention through the lens of image denoising and nonparametric regression. While promising, existing frameworks still lack a deeper mechanistic interpretation of various architectural components that enhance self-attention, both in its original formulation and subsequent variants. In this work, we aim to advance this understanding by developing a unifying image processing framework, capable of explaining not only the self-attention computation itself but also the role of components such as positional encoding and residual connections, including numerous later variants. We also pinpoint potential distinctions between the two concepts building upon our framework, and make effort to close this gap. We introduce two independent architectural modifications within transformers. While our primary objective is interpretability, we empirically observe that image processing-inspired modifications can also lead to notably improved accuracy and robustness against data contamination and adversaries across language and vision tasks as well as better long sequence understanding.

replace-cross Toward Inherently Robust VLMs Against Visual Perception Attacks

Authors: Pedram MohajerAnsari (Clemson University, Clemson, SC, USA), Amir Salarpour (Clemson University, Clemson, SC, USA), Michael K\"uhr (Technical University of Munich, Munich, Germany), Siyu Huang (Clemson University, Clemson, SC, USA), Mohammad Hamad (Technical University of Munich, Munich, Germany), Sebastian Steinhorst (Technical University of Munich, Munich, Germany), Habeeb Olufowobi (University of Texas at Arlington, Arlington, TX, USA), Bing Li (Clemson University, Clemson, SC, USA), Mert D. Pes\'e (Clemson University, Clemson, SC, USA)

Abstract: Autonomous vehicles rely on deep neural networks (DNNs) for traffic sign recognition, lane centering, and vehicle detection, yet these models are vulnerable to attacks that induce misclassification and threaten safety. Existing defenses (e.g., adversarial training) often fail to generalize and degrade clean accuracy. We introduce Vehicle Vision-Language Models (V2LMs), fine-tuned vision-language models specialized for autonomous vehicle perception, and show that they are inherently more robust to unseen attacks without adversarial training, maintaining substantially higher adversarial accuracy than conventional DNNs. We study two deployments: Solo (task-specific V2LMs) and Tandem (a single V2LM for all three tasks). Under attacks, DNNs drop 33-74%, whereas V2LMs decline by under 8% on average. Tandem achieves comparable robustness to Solo while being more memory-efficient. We also explore integrating V2LMs in parallel with existing perception stacks to enhance resilience. Our results suggest V2LMs are a promising path toward secure, robust AV perception.

replace-cross Single Index Bandits: Generalized Linear Contextual Bandits with Unknown Reward Functions

Authors: Yue Kang, Mingshuo Liu, Bongsoo Yi, Jing Lyu, Zhi Zhang, Doudou Zhou, Yao Li

Abstract: Generalized linear bandits have been extensively studied due to their broad applicability in real-world online decision-making problems. However, these methods typically assume that the expected reward function is known to the users, an assumption that is often unrealistic in practice. Misspecification of this link function can lead to the failure of all existing algorithms. In this work, we address this critical limitation by introducing a new problem of generalized linear bandits with unknown reward functions, also known as single index bandits. We first consider the case where the unknown reward function is monotonically increasing, and propose two novel and efficient algorithms, STOR and ESTOR, that achieve decent regrets under standard assumptions. Notably, our ESTOR can obtain the nearly optimal regret bound $\tilde{O}_T(\sqrt{T})$ in terms of the time horizon $T$. We then extend our methods to the high-dimensional sparse setting and show that the same regret rate can be attained with the sparsity index. Next, we introduce GSTOR, an algorithm that is agnostic to general reward functions, and establish regret bounds under a Gaussian design assumption. Finally, we validate the efficiency and effectiveness of our algorithms through experiments on both synthetic and real-world datasets.

replace-cross Modern approaches to building interpretable models of the property market using machine learning on the base of mass cadastral valuation

Authors: Alexey S. Tanashkin, Irina G. Tanashkina, Alexander S. Maksimchuik

Abstract: In this paper, we review modern approaches to building interpretable models of property markets using machine learning on the base of mass valuation of property in the Primorye region, Russia. There are numerous potential difficulties one could encounter in the effort to build a good model. Their main source is the huge difference between noisy real market data and ideal data usually used in tutorials on machine learning. This paper covers all stages of modeling: collection of initial data, identification of outliers, search and analysis of patterns in the data, formation and final choice of price factors, building of the model, and evaluation of its efficiency. For each stage, we highlight potential issues and describe sound methods for overcoming emerging difficulties on actual examples. We show that the combination of classical linear regression with kriging (interpolation method of geostatistics) allows to build an effective model for land parcels. For flats, when many objects are attributed to one spatial point, the application of geostatistical methods becomes problematic. Instead, we suggest linear regression with automatic generation and selection of additional rules on the base of decision trees, so called the RuleFit method. We compare the performance of our inherently interpretable models with well-proven "black-box" Random Forest method and demonstrate similar results. Thus we show, that despite such a strong restriction as the requirement of interpretability which is important in practical aspects, for example, legal matters, it is still possible to build effective models of real property markets.

replace-cross The Condition Number as a Scale-Invariant Proxy for Information Encoding in Neural Units

Authors: Oswaldo Ludwig

Abstract: This paper explores the relationship between the condition number of a neural network's weight tensor and the extent of information encoded by the associated processing unit, viewed through the lens of information theory. It argues that a high condition number, though not sufficient for effective knowledge encoding, may indicate that the unit has learned to selectively amplify and compress information. This intuition is formalized for linear units with Gaussian inputs, linking the condition number and the transformation's log-volume scaling factor to the characteristics of the output entropy and the geometric properties of the learned transformation. The analysis demonstrates that for a fixed weight norm, a concentrated distribution of singular values (high condition number) corresponds to reduced overall information transfer, indicating a specialized and efficient encoding strategy. Furthermore, the linear stage entropy bound provides an upper limit on post-activation information for contractive, element-wise nonlinearities, supporting the condition number as a scale-invariant proxy for encoding capacity in practical neural networks. An empirical case study applies these principles to guide selective fine-tuning of Large Language Models for both a new task and a new input modality. The experiments show that the proposed method, named KappaTune, effectively mitigates catastrophic forgetting. Unlike many existing catastrophic forgetting mitigation methods that rely on access to pre-training statistics, which are often unavailable, this selective fine-tuning approach offers a way to bypass this common requirement.

replace-cross Language Bottleneck Models for Qualitative Knowledge State Modeling

Authors: Antonin Berthon, Mihaela van der Schaar

Abstract: Accurately assessing student knowledge is central to education. Cognitive Diagnosis (CD) models estimate student proficiency at a fixed point in time, while Knowledge Tracing (KT) methods model evolving knowledge states to predict future performance. However, existing approaches either provide quantitative concept mastery estimates with limited expressivity (CD, probabilistic KT) or prioritize predictive accuracy at the cost of interpretability (deep learning KT). We propose Language Bottleneck Models (LBMs), where an encoder LLM produces textual knowledge state summaries, which a decoder LLM uses to predict future performance. This produces interpretable summaries that can express nuanced insights--such as misconceptions--that CD and KT models cannot capture. Extensive validation across synthetic and real-world datasets shows LBMs reveal qualitative insights beyond what CD and KT models can capture, while achieving competitive accuracy with improved sample efficiency. We demonstrate that the encoder and decoder can be fine-tuned with reinforcement learning and supervised fine-tuning respectively to improve both summary quality and predictive performance.

replace-cross Topological Signatures vs. Gradient Histograms: A Comparative Study for Medical Image Classification

Authors: Faisal Ahmed

Abstract: This work presents a comparative evaluation of two fundamentally different feature extraction paradigms--Histogram of Oriented Gradients (HOG) and Topological Data Analysis (TDA)--for medical image classification, with a focus on retinal fundus imagery. HOG captures local structural information by modeling gradient orientation distributions within spatial regions, effectively encoding texture and edge patterns. In contrast, TDA, implemented through cubical persistent homology, extracts global topological descriptors that characterize shape, connectivity, and intensity-based structure across images. We evaluate both approaches on the publicly available APTOS retinal fundus dataset for two classification tasks: binary classification (normal vs. diabetic retinopathy (DR)) and five-class DR severity grading. From each image, 26,244 HOG features and 800 TDA features are extracted and independently used to train seven classical machine learning models, including logistic regression, random forest, XGBoost, support vector machines, decision trees, k-nearest neighbors, and Extra Trees, using 10-fold cross-validation. Experimental results show that XGBoost achieves the best performance across both feature types. For binary classification, accuracies of 94.29% (HOG) and 94.18% (TDA) are obtained, while multi-class classification yields accuracies of 74.41% and 74.69%, respectively. These results demonstrate that gradient-based and topological features provide complementary representations of retinal image structure and highlight the potential of integrating both approaches for interpretable and robust medical image classification.

replace-cross MedVAL: Toward Expert-Level Medical Text Validation with Language Models

Authors: Asad Aali, Vasiliki Bikia, Maya Varma, Nicole Chiou, Sophie Ostmeier, Arnav Singhvi, Magdalini Paschali, Ashwin Kumar, Andrew Johnston, Karimar Amador-Martinez, Eduardo Juan Perez Guerrero, Paola Naovi Cruz Rivera, Sergios Gatidis, Christian Bluethgen, Eduardo Pontes Reis, Eddy D. Zandee van Rilland, Poonam Laxmappa Hosamani, Kevin R Keet, Minjoung Go, Evelyn Ling, David B. Larson, Curtis Langlotz, Roxana Daneshjou, Jason Hom, Sanmi Koyejo, Emily Alsentzer, Akshay S. Chaudhari

Abstract: With the growing use of language models (LMs) in clinical environments, there is an immediate need to evaluate the accuracy and safety of LM-generated medical text. Currently, such evaluation relies solely on manual physician review. However, detecting errors in LM-generated text is challenging because 1) manual review is costly and 2) expert-composed reference outputs are often unavailable in real-world settings. While the "LLM-as-a-judge" paradigm offers scalable evaluation, even frontier LMs can miss subtle but clinically significant errors. We propose MedVAL, a novel, self-supervised, data-efficient distillation method that leverages synthetic data to train evaluator LMs to assess whether LM-generated medical outputs are factually consistent with inputs, without requiring physician labels or reference outputs. To evaluate LM performance, we introduce MedVAL-Bench, a dataset of 840 physician-annotated outputs across 6 diverse medical tasks capturing real-world challenges. Across 10 state-of-the-art LMs spanning open-source and proprietary models, MedVAL distillation significantly improves (p < 0.001) alignment with physicians across seen and unseen tasks, increasing average F1 scores from 66% to 83%. Despite strong baseline performance, MedVAL improves the best-performing proprietary LM (GPT-4o) by 8% without training on physician-labeled data, demonstrating a performance statistically non-inferior to a single human expert on a subset annotated by multiple physicians (p < 0.001). To support a scalable, risk-aware pathway towards clinical integration, we open-source: 1) Codebase (https://github.com/StanfordMIMI/MedVAL), 2) MedVAL-Bench (https://huggingface.co/datasets/stanfordmimi/MedVAL-Bench), 3) MedVAL-4B (https://huggingface.co/stanfordmimi/MedVAL-4B). Our benchmark provides evidence of LMs approaching expert-level ability in validating AI-generated medical text.

URLs: https://github.com/StanfordMIMI/MedVAL),, https://huggingface.co/datasets/stanfordmimi/MedVAL-Bench),, https://huggingface.co/stanfordmimi/MedVAL-4B).

replace-cross Improved sampling algorithms and functional inequalities for non-log-concave distributions

Authors: Yuchen He, Zhehan Lei, Jianan Shao, Chihao Zhang

Abstract: We study the problem of sampling from a distribution $\mu$ with density $\propto e^{-V}$ for some potential function $V:\mathbb R^d\to \mathbb R$ with query access to $V$ and $\nabla V$. We start with the following standard assumptions: (1) $V$ is $L$-smooth. (2) The second moment $\mathbf{E}_{X\sim \mu}[\|X\|^2]\leq M$. Recently, He and Zhang (COLT'25) showed that the query complexity of this problem is at least $\left(\frac{LM}{d\epsilon}\right)^{\Omega(d)}$ where $\epsilon$ is the desired accuracy in total variation distance, and the Poincar\'e constant can be unbounded. Meanwhile, another common assumption in the study of diffusion based samplers (see e.g., the work of Chen, Chewi, Li, Li, Salim and Zhang (ICLR'23)) strengthens (1) to the following: (1*) The potential function of *every* distribution along the Ornstein-Uhlenbeck process starting from $\mu$ is $L$-smooth. We show that under the assumptions (1*) and (2), the query complexity of sampling from $\mu$ can be $\mathrm{poly}(L,d)\cdot \left(\frac{Ld+M}{\epsilon^2}\right)^{\mathcal{O}(L+1)}$, which is polynomial in $d$ and $\frac{1}{\epsilon}$ when $L=\mathcal{O}(1)$ and $M=\mathrm{poly}(d)$. This improves the algorithm with quasi-polynomial query complexity developed by Huang et al. (COLT'24). Our results imply that the seemingly moderate strengthening from (1) to (1*) yields an exponential gap in the query complexity. Furthermore, we show that together with the assumption (1*) and the stronger moment assumption that $\|X\|$ is $\lambda$-sub-Gaussian for $X\sim\mu$, the Poincar\'e constant of $\mu$ is at most $\mathcal{O}(\lambda)^{2(L+1)}$. We also establish a modified log-Sobolev inequality for $\mu$ under these conditions. As an application of our technique, we obtain a new estimate of the modified log-Sobolev constant for a specific class of mixtures of strongly log-concave distributions.

replace-cross TensorHyper-VQC: A Tensor-Train-Guided Hypernetwork for Robust and Scalable Variational Quantum Computing

Authors: Jun Qi, Chao-Han Huck Yang, Pin-Yu Chen, Min-Hsiu Hsieh

Abstract: Variational Quantum Computing (VQC) faces fundamental scalability barriers, primarily due to barren plateaus and sensitivity to quantum noise. To address these challenges, we introduce TensorHyper-VQC, a novel tensor-train (TT)-guided hypernetwork framework that significantly improves the robustness and scalability of VQC. Our framework fully delegates the generation of quantum-circuit parameters to a classical TT network, thereby decoupling optimization from quantum hardware. This innovative parameterization mitigates gradient vanishing, enhances noise resilience through structured low-rank representations, and facilitates efficient gradient propagation. Grounded in Neural Tangent Kernel and statistical learning theory, our rigorous theoretical analyses establish strong guarantees on approximation capability, optimization stability, and generalization performance. Extensive empirical results across quantum dot classification, Max-Cut optimization, and molecular quantum simulation tasks demonstrate that TensorHyper-VQC consistently achieves superior performance and robust noise tolerance, including hardware-level validation on a 156-qubit IBM Heron processor. These results position TensorHyper-VQC as a scalable and noise-resilient framework for advancing practical quantum machine learning on near-term devices.

replace-cross The Relative Instability of Model Comparison with Cross-validation

Authors: Alexandre Bayle, Lucas Janson, Lester Mackey

Abstract: Cross-validation (CV) is known to provide asymptotically exact tests and confidence intervals for model improvement but only when the model comparison is relatively stable. Surprisingly, we prove that even simple, individually stable models can generate relatively unstable comparisons, calling into question the validity of CV inference. Specifically, we show that the Lasso and its close cousin, soft-thresholding, generate relatively unstable comparisons and invalid CV inferences, even in the most favorable of learning settings and when both models are individually stable. These findings highlight the importance of verifying relative stability before deploying CV for model comparison.

replace-cross GRIT: Graph-Regularized Logit Refinement for Zero-shot Cell Type Annotation

Authors: Tianxiang Hu, Chenyi Zhou, Jiaxiang Liu, Jiongxin Wang, Ruizhe Chen, Haoxiang Xia, Gaoang Wang, Jian Wu, Zuozhu Liu

Abstract: Cell type annotation is a fundamental step in the analysis of single-cell RNA sequencing (scRNA-seq) data. In practice, human experts often rely on the structure revealed by principal component analysis (PCA) followed by $k$-nearest neighbor ($k$-NN) graph construction to guide annotation. While effective, this process is labor-intensive and does not scale to large datasets. Recent advances in CLIP-style models offer a promising path toward automating cell type annotation. By aligning scRNA-seq profiles with natural language descriptions, models like LangCell enable zero-shot annotation. While LangCell demonstrates decent zero-shot performance, its predictions remain suboptimal. In this paper, we propose a principled inference-time paradigm for zero-shot cell type annotation (GRIT) which bridges the scalability of pre-trained foundation models with the structural robustness relied upon in human expert annotation workflows. Specifically, we enforce local consistency of the zero-shot CLIP logits over the task-specific PCA-based $k$-NN graph. We evaluate our approach on 14 annotated human scRNA-seq datasets from 4 distinct studies, spanning 11 organs and over 200,000 single cells. Our method consistently improves zero-shot annotation accuracy, achieving accuracy gains of up to 10\%. Further analysis showcase the mechanism by which GRIT effectively propagates correct signals through the graph, pulling back mislabeled cells toward more accurate predictions. The method is training-free, model-agnostic, and serves as a simple yet effective plug-in for enhancing zero-shot cell type annotation.

replace-cross Learning Geometric-Aware Quadrature Rules for Functional Minimization

Authors: Costas Smaragdakis

Abstract: Accurate numerical integration over non-uniform point clouds is a challenge for modern mesh-free machine learning solvers for partial differential equations (PDEs) using variational principles. While standard Monte Carlo (MC) methods are not capable of handling a non-uniform point cloud, modern neural network architectures can deal with permutation-invariant inputs, creating quadrature rules for any point cloud. In this work, we introduce QuadrANN, a Graph Neural Network (GNN) architecture designed to learn optimal quadrature weights directly from the underlying geometry of point clouds. The design of the model exploits a deep message-passing scheme where the initial layer encodes rich local geometric features from absolute and relative positions as well as an explicit local density measure. In contrast, the following layers incorporate a global context vector. These architectural choices allow the QuadrANN to generate a data-driven quadrature rule that is permutation-invariant and adaptive to both local point density and the overall domain shape. We test our methodology on a series of challenging test cases, including integration on convex and non-convex domains and estimating the solution of the Heat and Fokker-Planck equations. Across all the tests, QuadrANN reduces the variance of the integral estimation compared to standard Quasi-Monte Carlo methods by warping the point clouds to be more dense in critical areas where the integrands present certain singularities. This enhanced stability in critical areas of the domain at hand is critical for the optimization of energy functionals, leading to improved deep learning-based variational solvers.

replace-cross Predictability Enables Parallelization of Nonlinear State Space Models

Authors: Xavier Gonzalez, Leo Kozachkov, David M. Zoltowski, Kenneth L. Clarkson, Scott W. Linderman

Abstract: The rise of parallel computing hardware has made it increasingly important to understand which nonlinear state space models can be efficiently parallelized. Recent advances like DEER (arXiv:2309.12252) and DeepPCR (arXiv:2309.16318) recast sequential evaluation as a parallelizable optimization problem, sometimes yielding dramatic speedups. However, the factors governing the difficulty of these optimization problems remained unclear, limiting broader adoption. In this work, we establish a precise relationship between a system's dynamics and the conditioning of its corresponding optimization problem, as measured by its Polyak-Lojasiewicz (PL) constant. We show that the predictability of a system, defined as the degree to which small perturbations in state influence future behavior and quantified by the largest Lyapunov exponent (LLE), impacts the number of optimization steps required for evaluation. For predictable systems, the state trajectory can be computed in at worst $O((\log T)^2)$ time, where $T$ is the sequence length: a major improvement over the conventional sequential approach. In contrast, chaotic or unpredictable systems exhibit poor conditioning, with the consequence that parallel evaluation converges too slowly to be useful. Importantly, our theoretical analysis shows that predictable systems always yield well-conditioned optimization problems, whereas unpredictable systems lead to severe conditioning degradation. We validate our claims through extensive experiments, providing practical guidance on when nonlinear dynamical systems can be efficiently parallelized. We highlight predictability as a key design principle for parallelizable models.

replace-cross Practical Feasibility of Gradient Inversion Attacks in Federated Learning

Authors: Viktor Valadi, Mattias {\AA}kesson, Johan \"Ostman, Fazeleh Hoseini, Salman Toor, Andreas Hellander

Abstract: Gradient inversion attacks are often presented as a serious privacy threat in federated learning, with recent work reporting increasingly strong reconstructions under favorable experimental settings. However, it remains unclear whether such attacks are feasible in modern, performance-optimized systems deployed in practice. In this work, we evaluate the practical feasibility of gradient inversion for image-based federated learning. We conduct a systematic study across multiple datasets and tasks, including image classification and object detection, using canonical vision architectures at contemporary resolutions. Our results show that while gradient inversion remains possible for certain legacy or transitional designs under highly restrictive assumptions, modern, performance-optimized models consistently resist meaningful reconstruction visually. We further demonstrate that many reported successes rely on upper-bound settings, such as inference mode operation or architectural simplifications which do not reflect realistic training pipelines. Taken together, our findings indicate that, under an honest-but-curious server assumption, high-fidelity image reconstruction via gradient inversion does not constitute a critical privacy risk in production-optimized federated learning systems, and that practical risk assessments must carefully distinguish diagnostic attack settings from real-world deployments.

replace-cross Repeating versus Nonrepeating Fast Radio Bursts: A Deep Learning Approach to Morphological Characterization

Authors: Bikash Kharel, Emmanuel Fonseca, Charanjot Brar, Afrokk Khan, Lluis Mas-Ribas, Swarali Shivraj Patil, Paul Scholz, Seth Robert Siegel, David C. Stenning

Abstract: We present a deep learning approach to classify fast radio bursts (FRBs) based purely on morphology as encoded on recorded dynamic spectrum from CHIME/FRB Catalog 2. We implemented transfer learning with a pretrained ConvNext architecture, exploiting its powerful feature extraction ability. ConvNext was adapted to classify dedispersed dynamic spectra (which we treat as images) of the FRBs into one of the two sub-classes, i.e., repeater and non-repeater, based on their various temporal and spectral properties and relation between the sub-pulse structures. Additionally, we also used mathematical model representation of the total intensity data to interpret the deep learning model. Upon fine-tuning the pretrained ConvNext on the FRB spectrograms, we were able to achieve high classification metrics while substantially reducing training time and computing power as compared to training a deep learning model from scratch with random weights and biases without any feature extraction ability. Importantly, our results suggest that the morphological differences between CHIME repeating and non-repeating events persist in Catalog 2 and the deep learning model leveraged these differences for classification. The fine-tuned deep learning model can be used for inference, which enables us to predict whether an FRB's morphology resembles that of repeaters or non-repeaters. Such inferences may become increasingly significant when trained on larger data sets that will exist in the near future.

replace-cross Lifelong Learning with Behavior Consolidation for Vehicle Routing

Authors: Jiyuan Pei, Yi Mei, Jialin Liu, Mengjie Zhang, Xin Yao

Abstract: Recent neural solvers have demonstrated promising performance in learning to solve routing problems. However, existing studies are primarily based on one-off training on one or a set of predefined problem distributions and scales, i.e., tasks. When a new task arises, they typically rely on either zero-shot generalization, which may be poor due to the discrepancies between the new task and the training task(s), or fine-tuning the pretrained solver on the new task, which possibly leads to catastrophic forgetting of knowledge acquired from previous tasks. This paper explores a novel lifelong learning paradigm for neural VRP solvers, where multiple tasks with diverse distributions and scales arise sequentially over time. Solvers are required to effectively and efficiently learn to solve new tasks while maintaining their performance on previously learned tasks. Consequently, a novel framework called Lifelong Learning Router with Behavior Consolidation (LLR-BC) is proposed. LLR-BC consolidates prior knowledge effectively by aligning behaviors of the solver trained on a new task with the buffered ones in a decision-seeking way. To encourage more focus on crucial experiences, LLR-BC assigns greater consolidated weights to decisions with lower confidence. Extensive experiments on capacitated vehicle routing problems and traveling salesman problems demonstrate LLR-BC's effectiveness in training high-performance neural solvers in a lifelong learning setting, addressing the catastrophic forgetting issue, maintaining their plasticity, and improving zero-shot generalization ability.

replace-cross No Prompt Left Behind: Exploiting Zero-Variance Prompts in LLM Reinforcement Learning via Entropy-Guided Advantage Shaping

Authors: Thanh-Long V. Le, Myeongho Jeon, Kim Vu, Viet Lai, Eunho Yang

Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) is a powerful framework for improving the reasoning abilities of Large Language Models (LLMs). However, current methods such as GRPO rely only on problems where the model responses to the same input differ in correctness, while ignoring those where all responses receive the same reward -- so-called zero-variance prompts. In this work, we argue that such prompts are not useless but can, in fact, provide meaningful feedback for policy optimization. To this end, we introduce Reinforcement Learning with Zero-Variance Prompts (RL-ZVP), a novel algorithm that extract learning signals from zero-variance prompts. RL-ZVP directly rewards correctness and penalizes errors even without contrasting responses, modulating feedback with token-level characteristics to preserve informative, nuanced signals. Across six math reasoning benchmarks, RL-ZVP achieves significant improvements of up to 8.61 points in accuracy and 7.77 points in pass rate over GRPO, while consistently outperforming other baselines that filter out zero-variance prompts. These results highlight the untapped potential of learning from zero-variance prompts in RLVR. The project page is available at https://bltnynk.github.io/publications/rl-zvp/.

URLs: https://bltnynk.github.io/publications/rl-zvp/.

replace-cross HEART: Emotionally-Driven Test-Time Scaling of Language Models

Authors: Gabriela Pinto, Palash Goyal, Mihir Parmar, Yiwen Song, Souradip Chakraborty, Zifeng Wang, Jingsun Yoon, Hamid Palangi, Tomas Pfister

Abstract: Test-time scaling has significantly improved how AI models solve problems, yet current methods often get stuck in repetitive, incorrect patterns of thought. We introduce HEART, a framework that uses emotional cues to guide the model's focus, much like how feelings contribute to human decision-making. By alternating between critical tones to sharpen error detection and encouraging tones to spark new ideas, HEART helps the model break out of dead-end reasoning and find the right solution. We evaluate HEART across seven high-difficulty benchmarks--including Humanity's Last Exam, GPQA Diamond, and LiveCodeBench--demonstrating robustness across diverse models. Results show that emotion facilitates deeper reasoning, yielding consistent accuracy gains over affect-sterile baselines. These findings suggest that the next frontier in machine reasoning lies in the strategic integration of affective regulation to guide logical synthesis.

replace-cross CHAI: Command Hijacking against embodied AI

Authors: Luis Burbano, Diego Ortiz, Qi Sun, Siwei Yang, Haoqin Tu, Cihang Xie, Yinzhi Cao, Alvaro A Cardenas

Abstract: Embodied Artificial Intelligence (AI) promises to handle edge cases in robotic vehicle systems where data is scarce by using common-sense reasoning grounded in perception and action to generalize beyond training distributions and adapt to novel real-world situations. These capabilities, however, also create new security risks. In this paper, we introduce CHAI (Command Hijacking against embodied AI), a physical environment indirect prompt injection attack that exploits the multimodal language interpretation abilities of AI models. CHAI embeds deceptive natural language instructions, such as misleading signs, in visual input, systematically searches the token space, builds a dictionary of prompts, and guides an attacker model to generate Visual Attack Prompts. We evaluate CHAI on four LVLM agents: drone emergency landing, autonomous driving, aerial object tracking, and on a real robotic vehicle. Our experiments show that CHAI consistently outperforms state-of-the-art attacks. By exploiting the semantic and multimodal reasoning strengths of next-generation embodied AI systems, CHAI underscores the urgent need for defenses that extend beyond traditional adversarial robustness.

replace-cross Improving Online-to-Nonconvex Conversion for Smooth Optimization via Double Optimism

Authors: Francisco Patitucci, Ruichen Jiang, Aryan Mokhtari

Abstract: A recent breakthrough in nonconvex optimization is the online-to-nonconvex conversion framework of [Cutkosky et al., 2023], which reformulates the task of finding an $\varepsilon$-first-order stationary point as an online learning problem. When both the gradient and the Hessian are Lipschitz continuous, instantiating this framework with two different online learners achieves a complexity of $O(\varepsilon^{-1.75}\log(1/\varepsilon))$ in the deterministic case and a complexity of $O(\varepsilon^{-3.5})$ in the stochastic case. However, this approach suffers from several limitations: (i) the deterministic method relies on a complex double-loop scheme that solves a fixed-point equation to construct hint vectors for an optimistic online learner, introducing an extra logarithmic factor; (ii) the stochastic method assumes a bounded second-order moment of the stochastic gradient, which is stronger than standard variance bounds; and (iii) different online learning algorithms are used in the two settings. In this paper, we address these issues by introducing an online optimistic gradient method based on a novel doubly optimistic hint function. Specifically, we use the gradient at an extrapolated point as the hint, motivated by two optimistic assumptions: that the difference between the hint and the target gradient remains near constant, and that consecutive update directions change slowly due to smoothness. Our method eliminates the need for a double loop and removes the logarithmic factor. Furthermore, by simply replacing full gradients with stochastic gradients and under the standard assumption that their variance is bounded by $\sigma^2$, we obtain a unified algorithm with complexity $O(\varepsilon^{-1.75} + \sigma^2 \varepsilon^{-3.5})$, smoothly interpolating between the best-known deterministic rate and the optimal stochastic rate.

replace-cross Quantum generative model on bicycle-sharing system and an application

Authors: Fumio Nemoto, Nobuyuki Koike, Daichi Sato, Yuuta Kawaai, Masayuki Ohzeki

Abstract: Recently, bicycle-sharing systems have been implemented in numerous cities, becoming integral to daily life. However, a prevalent issue arises when intensive commuting demand leads to bicycle shortages in specific areas and at particular times. To address this challenge, we employ a novel quantum machine learning model that analyzes time series data by fitting quantum time evolution to observed sequences. This model enables us to capture actual trends in bicycle counts at individual ports and identify correlations between different ports. Utilizing the trained model, we simulate the impact of proactively adding bicycles to high-demand ports on the overall rental number across the system. Given that the core of this method lies in a Monte Carlo simulation, it is anticipated to have a wide range of industrial applications.

replace-cross MS-Mix: Unveiling the Power of Mixup for Multimodal Sentiment Analysis

Authors: Hongyu Zhu, Lin Chen, Xin Jin, Mounim A. El-Yacoubi, Mingsheng Shang

Abstract: Multimodal Sentiment Analysis (MSA) aims to identify and interpret human emotions by integrating information from heterogeneous data sources such as text, video, and audio. While deep learning models have advanced in network architecture design, they remain heavily limited by scarce multimodal annotated data. Although Mixup-based augmentation improves generalization in unimodal tasks, its direct application to MSA introduces critical challenges: random mixing often amplifies label ambiguity and semantic inconsistency due to the lack of emotion-aware mixing mechanisms. To overcome these issues, we propose MS-Mix, an adaptive, emotion-sensitive augmentation framework that automatically optimizes sample mixing in multimodal settings. The key components of MS-Mix include: (1) a Sentiment-Aware Sample Selection (SASS) strategy that effectively prevents semantic confusion caused by mixing samples with contradictory emotions. (2) a Sentiment Intensity Guided (SIG) module using multi-head self-attention to compute modality-specific mixing ratios dynamically based on their respective emotional intensities. (3) a Sentiment Alignment Loss (SAL) that aligns the prediction distributions across modalities, and incorporates the Kullback-Leibler-based loss as an additional regularization term to train the emotion intensity predictor and the backbone network jointly. Extensive experiments on three benchmark datasets with six state-of-the-art backbones confirm that MS-Mix consistently outperforms existing methods, establishing a new standard for robust multimodal sentiment augmentation. The source code is available at: https://github.com/HongyuZhu-s/MS-Mix.

URLs: https://github.com/HongyuZhu-s/MS-Mix.

replace-cross Data-Prompt Co-Evolution: Growing Test Sets to Refine LLM Behavior

Authors: Minjae Lee, Minsuk Kahng

Abstract: Large Language Models (LLMs) are increasingly embedded in applications, and people can shape model behavior by editing prompt instructions. Yet encoding subtle, domain-specific policies into prompts is challenging. Although this process often benefits from concrete test cases, test data and prompt instructions are typically developed as separate artifacts, reflecting traditional machine learning practices in which model tuning was slow and test sets were static. We argue that the fast, iterative nature of prompt engineering calls for removing this separation and enabling a new workflow: data-prompt co-evolution, where a living test set and prompt instructions evolve in tandem. We present an interactive system that operationalizes this workflow. It guides application developers to discover edge cases, articulate rationales for desired behavior, and iteratively evaluate revised prompts against a growing test set. A user study shows our workflow helps people refine prompts systematically, better aligning them with their intended policies. This work points toward more robust and responsible LLM applications through human-in-the-loop development.

replace-cross Paper Copilot: Tracking the Evolution of Peer Review in AI Conferences

Authors: Jing Yang, Qiyao Wei, Jiaxin Pei

Abstract: The rapid growth of AI conferences is straining an already fragile peer-review system, leading to heavy reviewer workloads, expertise mismatches, inconsistent evaluation standards, superficial or templated reviews, and limited accountability under compressed timelines. In response, conference organizers have introduced new policies and interventions to preserve review standards. Yet these ad-hoc changes often create further concerns and confusion about the review process, leaving how papers are ultimately accepted - and how practices evolve across years - largely opaque. We present Paper Copilot, a system that creates durable digital archives of peer reviews across a wide range of computer-science venues, an open dataset that enables researchers to study peer review at scale, and a large-scale empirical analysis of ICLR reviews spanning multiple years. By releasing both the infrastructure and the dataset, Paper Copilot supports reproducible research on the evolution of peer review. We hope these resources help the community track changes, diagnose failure modes, and inform evidence-based improvements toward a more robust, transparent, and reliable peer-review system.

replace-cross TritonRL: Training LLMs to Think and Code Triton Without Cheating

Authors: Jiin Woo, Shaowei Zhu, Allen Nie, Zhen Jia, Yida Wang, Youngsuk Park

Abstract: The rapid evolution of Large Language Models (LLMs) has driven a growing demand for automated, high-performance system kernels to accelerate machine learning workloads. We introduce TritonRL, a domain-specialized 8B-scale LLM for Triton programming, trained via a novel reinforcement learning (RL) framework. While Triton synthesis faces unique challenges, including data scarcity and a high susceptibility to reward hacking, our approach enables robust kernel generation through two primary innovations. First, we implement a multi-layered verification system that provides high-fidelity reward signals, ensuring that generated kernels are both syntactically and functionally valid. Second, we propose Hierarchical Reward Decomposition (HRD), which decouples reinforcement for high-level reasoning and low-level implementation to resolve the credit assignment problem in long-sequence generation. Comprehensive evaluations on KernelBench demonstrate that TritonRL achieves state-of-the-art correctness and runtime speedup, outperforming concurrent Triton-specific models and matching the performance of frontier models with over 100B parameters. Our results highlight the effectiveness of hardware-aware RL paradigms in specialized domain adaptation.

replace-cross Beating the Winner's Curse via Inference-Aware Policy Optimization

Authors: Hamsa Bastani, Osbert Bastani, Bryce McLaughlin

Abstract: There has been a surge of recent interest in automatically learning policies to target treatment decisions based on rich individual covariates. In addition, practitioners want confidence that the learned policy has better performance than the incumbent policy according to downstream policy evaluation. However, due to the winner's curse -- an issue where the policy optimization procedure exploits prediction errors rather than finding actual improvements -- predicted performance improvements are often not substantiated by downstream policy evaluation. To address this challenge, we propose a novel strategy called inference-aware policy optimization, which modifies policy optimization to account for how the policy will be evaluated downstream. Specifically, it optimizes not only for the estimated objective value, but also for the chances that the estimate of the policy's improvement passes a significance test during downstream policy evaluation. We mathematically characterize the Pareto frontier of policies according to the tradeoff of these two goals. Based on our characterization, we design a policy optimization algorithm that estimates the Pareto frontier using machine learning models; then, the decision-maker can select the policy that optimizes their desired tradeoff, after which policy evaluation can be performed on the test set as usual. Finally, we perform simulations to illustrate the effectiveness of our methodology.

replace-cross Deep Ensembles for Epistemic Uncertainty: A Frequentist Perspective

Authors: Anchit Jain, Stephen Bates

Abstract: Decomposing prediction uncertainty into aleatoric (irreducible) and epistemic (reducible) components is critical for the reliable deployment of machine learning systems. While the mutual information between the response variable and model parameters is a principled measure for epistemic uncertainty, it requires access to the parameter posterior, which is computationally challenging to approximate. Consequently, practitioners often rely on probabilistic predictions from deep ensembles to quantify uncertainty, which have demonstrated strong empirical performance. However, a theoretical understanding of their success from a frequentist perspective remains limited. We address this gap by first considering a bootstrap-based estimator for epistemic uncertainty, which we prove is asymptotically correct. Next, we connect deep ensembles to the bootstrap estimator by decomposing it into data variability and training stochasticity; specifically, we show that deep ensembles capture the training stochasticity component. Through empirical studies, we show that this stochasticity component constitutes the majority of epistemic uncertainty, thereby explaining the effectiveness of deep ensembles.

replace-cross Understanding Fairness and Prediction Error through Subspace Decomposition and Influence Analysis

Authors: Enze Shi, Pankaj Bhagwat, Zhixian Yang, Linglong Kong, Bei Jiang

Abstract: Machine learning models have achieved widespread success but often inherit and amplify historical biases, resulting in unfair outcomes. Traditional fairness methods typically impose constraints at the prediction level, without addressing underlying biases in data representations. In this work, we propose a principled framework that adjusts data representations to balance predictive utility and fairness. Using sufficient dimension reduction, we decompose the feature space into target-relevant, sensitive, and shared components, and control the fairness-utility trade-off by selectively removing sensitive information. We provide a theoretical analysis of how prediction error and fairness gaps evolve as shared subspaces are added, and employ influence functions to quantify their effects on the asymptotic behavior of parameter estimates. Experiments on both synthetic and real-world datasets validate our theoretical insights and show that the proposed method effectively improves fairness while preserving predictive performance.

replace-cross A Metamorphic Testing Perspective on Knowledge Distillation for Language Models of Code: Does the Student Deeply Mimic the Teacher?

Authors: Md. Abdul Awal, Mrigank Rochan, Chanchal K. Roy

Abstract: Transformer-based language models of code have achieved state-of-the-art performance across a wide range of software analytics tasks, but their practical deployment remains limited due to high computational costs, slow inference speeds, and significant environmental impact. To address these challenges, recent research has increasingly explored knowledge distillation as a method for compressing a large language model of code (the teacher) into a smaller model (the student) while maintaining performance. However, the degree to which a student model deeply mimics the predictive behavior and internal representations of its teacher remains largely unexplored, as current accuracy-based evaluation provides only a surface-level view of model quality and often fails to capture more profound discrepancies in behavioral fidelity between the teacher and student models. To address this gap, we empirically show that the student model often fails to deeply mimic the teacher model, resulting in up to 285% greater performance drop under adversarial attacks, which is not captured by traditional accuracy-based evaluation. Therefore, we propose MetaCompress, a metamorphic testing framework that systematically evaluates behavioral fidelity by comparing the outputs of teacher and student models under a set of behavior-preserving metamorphic relations. We evaluate MetaCompress on two widely studied tasks, using compressed versions of popular language models of code, obtained via three different knowledge distillation techniques: Compressor, AVATAR, and MORPH. The results show that MetaCompress identifies up to 62% behavioral discrepancies in student models, underscoring the need for behavioral fidelity evaluation within the knowledge distillation pipeline and establishing MetaCompress as a practical framework for testing compressed language models of code derived through knowledge distillation.

replace-cross Constructing the Umwelt: Cognitive Planning through Belief-Intent Co-Evolution

Authors: Shiyao Sang

Abstract: This paper challenges a prevailing epistemological assumption in End-to-End Autonomous Driving: that high-performance planning necessitates high-fidelity world reconstruction. Inspired by cognitive science, we propose the Mental Bayesian Causal World Model (MBCWM) and instantiate it as the Tokenized Intent World Model (TIWM), a novel cognitive computing architecture. Its core philosophy posits that intelligence emerges not from pixel-level objective fidelity, but from the Cognitive Consistency between the agent's internal intentional world and physical reality. By synthesizing von Uexk\"ull's $\textit{Umwelt}$ theory, the neural assembly hypothesis, and the triple causal model (integrating symbolic deduction, probabilistic induction, and force dynamics) into an end-to-end embodied planning system, we demonstrate the feasibility of this paradigm on the nuPlan benchmark. Experimental results in open-loop validation confirm that our Belief-Intent Co-Evolution mechanism effectively enhances planning performance. Crucially, in closed-loop simulations, the system exhibits emergent human-like cognitive behaviors, including map affordance understanding, free exploration, and self-recovery strategies. We identify Cognitive Consistency as the core learning mechanism: during long-term training, belief (state understanding) and intent (future prediction) spontaneously form a self-organizing equilibrium through implicit computational replay, achieving semantic alignment between internal representations and physical world affordances. TIWM offers a neuro-symbolic, cognition-first alternative to reconstruction-based planners, establishing a new direction: planning as active understanding, not passive reaction.

replace-cross Explainable Cross-Disease Reasoning for Cardiovascular Risk Assessment from Low-Dose Computed Tomography

Authors: Yifei Zhang, Jiashuo Zhang, Mojtaba Safari, Xiaofeng Yang, Liang Zhao

Abstract: Low-dose chest computed tomography (LDCT) inherently captures both pulmonary and cardiac structures, offering a unique opportunity for joint assessment of lung and cardiovascular health. However, most existing approaches treat these domains as independent tasks, overlooking their physiological interplay and shared imaging biomarkers. We propose an Explainable Cross-Disease Reasoning Framework that enables interpretable cardiopulmonary risk assessment from a single LDCT scan. The framework introduces an agentic reasoning process that emulates clinical diagnostic thinking: first perceiving pulmonary findings, then reasoning through established medical knowledge, and finally deriving a cardiovascular judgment with a natural-language rationale. It integrates three components: a Pulmonary Perception Module that summarizes lung abnormalities, an Agentic Pulmonary-to-Cardiac Reasoning Module that infers their cardiovascular implications, and a Cardiac Feature Extractor that encodes structural biomarkers. Their outputs are fused to produce a holistic cardiovascular risk prediction that is both accurate and physiologically grounded. Experiments on the NLST cohort demonstrate that the proposed framework achieves state-of-the-art performance for CVD screening (AUC=0.919) and mortality prediction (AUC=0.838), outperforming single-disease and purely image-based baselines. Beyond quantitative gains, the framework provides human-verifiable reasoning that aligns with cardiological understanding, revealing coherent links between pulmonary abnormalities and cardiac stress mechanisms. Overall, this work establishes a unified and explainable paradigm for cardiovascular analysis from LDCT, bridging the gap between image-based prediction and mechanism-based medical interpretation.

replace-cross UAV-Assisted Resilience in 6G and Beyond Network Energy Saving: A Multi-Agent DRL Approach

Authors: Dao Lan Vy Dinh, Anh Nguyen Thi Mai, Hung Tran, Giang Quynh Le Vu, Tu Dac Ho, Zhenni Pan, Vo Nhan Van, Symeon Chatzinotas, Dinh-Hieu Tran

Abstract: This paper investigates the unmanned aerial vehicle (UAV)-assisted resilience perspective in the 6G network energy saving (NES) scenario. More specifically, we consider multiple ground base stations (GBSs) and each GBS has three different sectors/cells in the terrestrial networks, and multiple cells are turned off due to NES or incidents, e.g., disasters, hardware failures, or outages. To address this, we propose a Multi-Agent Deep Deterministic Policy Gradient (MADDPG) framework to enable UAV-assisted communication by jointly optimizing UAV trajectories, transmission power, and user-UAV association under a sleeping ground base station (GBS) strategy. This framework aims to ensure the resilience of active users in the network and the long-term operability of UAVs. Specifically, it maximizes service coverage for users during power outages or NES zones, while minimizing the energy consumption of UAVs. Simulation results demonstrate that the proposed MADDPG policy consistently achieves high coverage ratio across different testing episodes, outperforming other baselines. Moreover, the MADDPG framework attains the lowest total energy consumption, with a reduction of approximately 24\% compared to the conventional all GBS ON configuration, while maintaining a comparable user service rate. These results confirm the effectiveness of the proposed approach in achieving a superior trade-off between energy efficiency and service performance, supporting the development of sustainable and resilient UAV-assisted cellular networks.

replace-cross Training Language Models to Explain Their Own Computations

Authors: Belinda Z. Li, Zifan Carl Guo, Vincent Huang, Jacob Steinhardt, Jacob Andreas

Abstract: Can language models (LMs) learn to faithfully describe their internal computations? Are they better able to describe themselves than other models? We study the extent to which LMs' privileged access to their own internals can be leveraged to produce new techniques for explaining their behavior. Using existing interpretability techniques as a source of ground truth, we fine-tune LMs to generate natural language descriptions of (1) the information encoded by LM features, (2) the causal structure of LMs' internal activations, and (3) the influence of specific input tokens on LM outputs. When trained with only tens of thousands of example explanations, explainer models exhibit non-trivial generalization to new queries. This generalization appears partly attributable to explainer models' privileged access to their own internals: using a model to explain its own computations generally works better than using a *different* model to explain its computations (even if the explainer model is significantly more capable than the target). Our results suggest not only that LMs can learn to reliably explain their internal computations, but that such explanations offer a scalable complement to existing interpretability methods. Code and data at https://github.com/TransluceAI/introspective-interp

URLs: https://github.com/TransluceAI/introspective-interp

replace-cross Towards Reinforcement Learning from Neural Feedback: Mapping fNIRS Signals to Agent Performance

Authors: Julia Santaniello, Matthew Russell, Benson Jiang, Donatello Sassaroli, Robert Jacob, Jivko Sinapov

Abstract: Reinforcement Learning from Human Feedback (RLHF) is a methodology that aligns agent behavior with human preferences by integrating user feedback into the agent's training process. This paper introduces a framework that guides agent training through implicit neural signals, with a focus on the neural classification problem. Our work presents and releases a novel dataset of functional near-infrared spectroscopy (fNIRS) recordings collected from 25 human participants across three domains: Pick-and-Place Robot, Lunar Lander, and Flappy Bird. We train multiple classifiers to predict varying levels of agent performance (optimal, suboptimal, or worst-case) from windows of preprocessed fNIRS features, achieving an average F1 score of 67% for binary and 46% for multi-class classification across conditions and domains. We also train multiple regressors to predict the degree of deviation between an agent's chosen action and a set of near-optimal policy actions, providing a continuous measure of performance. Finally, we evaluate cross-subject generalization and show that fine-tuning pre-trained models with a small sample of subject-specific data increases average F1 scores by 17% and 41% for binary and multi-class models, respectively. Our results demonstrate that mapping implicit fNIRS signals to agent performance is feasible and can be improved, laying the foundation for future Reinforcement Learning from Neural Feedback (RLNF) systems.

replace-cross CostNav: A Navigation Benchmark for Real-World Economic-Cost Evaluation of Physical AI Agents

Authors: Haebin Seong, Sungmin Kim, Yongjun Cho, Myunchul Joe, Geunwoo Kim, Yubeen Park, Sunhoo Kim, Yoonshik Kim, Suhwan Choi, Jaeyoon Jung, Jiyong Youn, Jinmyung Kwak, Sunghee Ahn, Jaemin Lee, Younggil Do, Seungyeop Yi, Woojin Cheong, Minhyeok Oh, Minchan Kim, Yoonseok Kang, Seongjae Kang, Samwoo Seong, Youngjae Yu, Yunsung Lee

Abstract: While current navigation benchmarks prioritize task success in simplified settings, they neglect the multidimensional economic constraints essential for the real-world commercialization of autonomous delivery systems. We introduce CostNav, an Economic Navigation Benchmark that evaluates physical AI agents through comprehensive economic cost-revenue analysis aligned with real-world business operations. By integrating industry-standard data - such as SEC filings and AIS injury reports - with Isaac Sim's detailed collision and cargo dynamics, CostNav transcends simple task completion to accurately evaluate business value in complex, real-world scenarios. To our knowledge, CostNav is the first work to quantitatively expose the gap between navigation research metrics and commercial viability, revealing that optimizing for task success on a simplified task fundamentally differs from optimizing for real-world economic deployment. Our evaluation of rule-based Nav2 navigation shows that current approaches are not economically viable: the contribution margin is -22.81/run (AMCL) and -12.87/run (GPS), resulting in no break-even point. We challenge the community to develop navigation policies that achieve economic viability on CostNav. We remain method-agnostic, evaluating success solely on the metric of cost rather than the underlying architecture. All resources are available at https://github.com/worv-ai/CostNav.

URLs: https://github.com/worv-ai/CostNav.

replace-cross Reasoning With a Star: A Heliophysics Dataset and Benchmark for Agentic Scientific Reasoning

Authors: Kevin Lee, Russell Spiewak, James Walsh

Abstract: Scientific reasoning through Large Language Models in heliophysics involves more than just recalling facts: it requires incorporating physical assumptions, maintaining consistent units, and providing clear scientific formats through coordinated approaches. To address these challenges, we present Reasoning With a Star, a newly contributed heliophysics dataset applicable to reasoning; we also provide an initial benchmarking approach. Our data are constructed from National Aeronautics and Space Administration & University Corporation for Atmospheric Research Living With a Star summer school problem sets and compiled into a readily consumable question-and-answer structure with question contexts, reasoning steps, expected answer type, ground-truth targets, format hints, and metadata. A programmatic grader checks the predictions using unit-aware numerical tolerance, symbolic equivalence, and schema validation. We benchmark a single-shot baseline and four multi-agent patterns, finding that decomposing workflows through systems engineering principles outperforms direct prompting on problems requiring deductive reasoning rather than pure inductive recall.

replace-cross Towards Understanding Generalization in DP-GD: A Case Study in Training Two-Layer CNNs

Authors: Zhongjie Shi, Puyu Wang, Chenyang Zhang, Yuan Cao

Abstract: Modern deep learning techniques focus on extracting intricate information from data to achieve accurate predictions. However, the training datasets may be crowdsourced and include sensitive information, such as personal contact details, financial data, and medical records. As a result, there is a growing emphasis on developing privacy-preserving training algorithms for neural networks that maintain good performance while preserving privacy. In this paper, we investigate the generalization and privacy performances of the differentially private gradient descent (DP-GD) algorithm, which is a private variant of the gradient descent (GD) by incorporating additional noise into the gradients during each iteration. Moreover, we identify a concrete learning task where DP-GD can achieve superior generalization performance compared to GD in training two-layer Huberized ReLU convolutional neural networks (CNNs). Specifically, we demonstrate that, under mild conditions, a small signal-to-noise ratio can result in GD producing training models with poor test accuracy, whereas DP-GD can yield training models with good test accuracy and privacy guarantees if the signal-to-noise ratio is not too small. This indicates that DP-GD has the potential to enhance model performance while ensuring privacy protection in certain learning tasks. Numerical simulations are further conducted to support our theoretical results.

replace-cross GrndCtrl: Grounding World Models via Self-Supervised Reward Alignment

Authors: Haoyang He, Jay Patrikar, Dong-Ki Kim, Max Smith, Daniel McGann, Ali-akbar Agha-mohammadi, Shayegan Omidshafiei, Sebastian Scherer

Abstract: Recent advances in video world modeling have enabled large-scale generative models to simulate embodied environments with high visual fidelity, providing strong priors for prediction, planning, and control. Yet, despite their realism, these models often lack geometric grounding, limiting their use in navigation tasks that require spatial coherence and stability. We introduce Reinforcement Learning with World Grounding (RLWG), a self-supervised post-training framework that aligns pretrained world models with a physically verifiable structure through geometric and perceptual rewards. Analogous to reinforcement learning from verifiable feedback (RLVR) in language models, RLWG can use multiple rewards that measure pose cycle-consistency, depth reprojection, and temporal coherence. We instantiate this framework with GrndCtrl, a reward-aligned adaptation method based on Group Relative Policy Optimization (GRPO), yielding world models that maintain stable trajectories, consistent geometry, and reliable rollouts for embodied navigation. Like post-training alignment in large language models, GrndCtrl leverages verifiable rewards to bridge generative pretraining and grounded behavior, achieving superior spatial coherence and navigation stability over supervised fine-tuning in outdoor environments.

replace-cross GRAND: Guidance, Rebalancing, and Assignment for Networked Dispatch in Multi-Agent Path Finding

Authors: Johannes Gaber, Meshal Alharbi, Daniele Gammelli, Gioele Zardini

Abstract: Large robot fleets are now common in warehouses and other logistics settings, where small control gains translate into large operational impacts. In this article, we address task scheduling for lifelong Multi-Agent Pickup-and-Delivery (MAPD) and propose a hybrid method that couples learning-based global guidance with lightweight optimization. A graph neural network policy trained via reinforcement learning outputs a desired distribution of free agents over an aggregated warehouse graph. This signal is converted into region-to-region rebalancing through a minimum-cost flow, and finalized by small, local assignment problems, preserving accuracy while keeping per-step latency within a 1 s compute budget. On congested warehouse benchmarks from the League of Robot Runners (LoRR) with up to 500 agents, our approach improves throughput by up to 10% over the 2024 winning scheduler while maintaining real-time execution. The results indicate that coupling graph-structured learned guidance with tractable solvers reduces congestion and yields a practical, scalable blueprint for high-throughput scheduling in large fleets.

replace-cross Randomized Masked Finetuning: An Efficient Way to Mitigate Memorization of PIIs in LLMs

Authors: Kunj Joshi, Jaydeep Borkar, David A. Smith

Abstract: The current literature on memorization in Natural Language Models, especially Large Language Models (LLMs), poses severe security and privacy risks, as models tend to memorize personally identifying information (PIIs) from training data. We introduce Randomized Masked Fine-Tuning (RMFT), a novel privacy-preserving fine-tuning technique that reduces PII memorization while minimizing performance impact. Using the Enron Email Dataset, we demonstrate that RMFT achieves an 80.81% reduction in Total Extraction Rate and 80.17% reduction in Seen Extraction Rate compared to baseline fine-tuning, outperforming deduplication methods while maintaining only a 5.73% increase in perplexity. We present MaxTER, a Pareto-optimal evaluation framework for assessing privacy-utility tradeoffs, and show the performance of RMFT vs Deduplication by Area Under The Response Curve (AURC) metric.

replace-cross Provable FDR Control for Deep Feature Selection: Deep MLPs and Beyond

Authors: Kazuma Sawaya

Abstract: We develop a flexible feature selection framework based on deep neural networks that approximately controls the false discovery rate (FDR), a measure of Type-I error. The method applies to architectures whose first layer is fully connected. From the second layer onward, it accommodates multilayer perceptrons (MLPs) of arbitrary width and depth, convolutional and recurrent networks, attention mechanisms, residual connections, and dropout. The procedure also accommodates stochastic gradient descent with data-independent initializations and learning rates. To the best of our knowledge, this is the first work to provide a theoretical guarantee of FDR control for feature selection within such a general deep learning setting. Our analysis is built upon a multi-index data-generating model and an asymptotic regime in which the feature dimension $n$ diverges faster than the latent dimension $q^{*}$, while the sample size, the number of training iterations, the network depth, and hidden layer widths are left unrestricted. Under this setting, we show that each coordinate of the gradient-based feature-importance vector admits a marginal normal approximation, thereby supporting the validity of asymptotic FDR control. As a theoretical limitation, we assume $\mathbf{B}$-right orthogonal invariance of the design matrix, and we discuss broader generalizations. We also present numerical experiments that underscore the theoretical findings.

replace-cross Two-dimensional RMSD projections for reaction path visualization and validation

Authors: Rohit Goswami (Institute IMX and Lab-COSMO, \'Ecole polytechnique f\'ed\'erale de Lausanne)

Abstract: Transition state or minimum energy path finding methods constitute a routine component of the computational chemistry toolkit. Standard analysis involves trajectories conventionally plotted in terms of the relative energy to the initial state against a cumulative displacement variable, or the image number. These dimensional reductions obscure structural rearrangements in high dimensions and are often history dependent. This precludes the ability to compare optimization histories of different methods beyond the number of calculations, time taken, and final saddle geometry. We present a method mapping trajectories onto a two-dimension projection defined by a permutation corrected root mean square deviation from the reactant and product configurations. Energy is represented as an interpolated color-mapped surface constructed from all optimization steps using a gradient aware derivative Gaussian Process. This representation highlights optimization trajectories, identifies endpoint basins, and diagnoses convergence concerns invisible in one-dimensional profiles. We demonstrate the framework on a cycloaddition reaction, showing that a machine-learned potential saddle and density functional theory reference lie on comparable energy contours despite geometric displacements, along with the ratification of the visualization for more complex reactions, a grignard rearrangement, and a bicyclobutadiene rearrangement.

replace-cross Data-Chain Backdoor: Do You Trust Diffusion Models as Generative Data Supplier?

Authors: Junchi Lu, Xinke Li, Yuheng Liu, Qi Alfred Chen

Abstract: The increasing use of generative models such as diffusion models for synthetic data augmentation has greatly reduced the cost of data collection and labeling in downstream perception tasks. However, this new data source paradigm may introduce important security concerns. Publicly available generative models are often reused without verification, raising a fundamental question of their safety and trustworthiness. This work investigates backdoor propagation in such emerging generative data supply chain, namely, Data-Chain Backdoor (DCB). Specifically, we find that open-source diffusion models can become hidden carriers of backdoors. Their strong distribution-fitting ability causes them to memorize and reproduce backdoor triggers in generation, which are subsequently inherited by downstream models, resulting in severe security risks. This threat is particularly concerning under clean-label attack scenarios, as it remains effective while having negligible impact on the utility of the synthetic data. We study two attacker choices to obtain a backdoor-carried generator, training from scratch and fine-tuning. While naive fine-tuning leads to weak inheritance of the backdoor, we find that novel designs in the loss objectives and trigger processing can substantially improve the generator's ability to preserve trigger patterns, making fine-tuning a low-cost attack path. We evaluate the effectiveness of DCB under the standard augmentation protocol and further assess data-scarce settings. Across multiple trigger types, we observe that the trigger pattern can be consistently retained in the synthetic data with attack efficacy comparable to the conventional backdoor attack.

replace-cross Predictive Inorganic Synthesis based on Machine Learning using Small Data sets: a case study of size-controlled Cu Nanoparticles

Authors: Brent Motmans, Digvijay Ghogare, Thijs G. I. van Wijk, Joren Van Herck, Pieter De Meyer, Berend Smit, An Hardy, Danny E. P. Vanpoucke

Abstract: Copper nanoparticles (Cu NPs) have a broad applicability, yet their synthesis is sensitive to subtle changes in reaction parameters. This sensitivity, combined with the time- and resource-intensive nature of experimental optimization, poses a major challenge in achieving reproducible and size-controlled synthesis. While Machine Learning (ML) shows promise in materials research, its application is often limited by scarcity of large high-quality experimental data sets. This study explores ML to predict the size of Cu NPs from microwave-assisted polyol synthesis using a small data set of 25 in-house performed syntheses. Latin Hypercube Sampling is used to efficiently cover the parameter space while creating the experimental data set. Ensemble regression models successfully predict particle sizes with high accuracy ($R^2 = 0.74$), outperforming classical statistical approaches ($R^2 = 0.60$). Additionally, classification models using both random forests and Large Language Models (LLMs) are evaluated to distinguish between large and small particles. While random forests show moderate performance, LLMs offer no significant advantages under data-scarce conditions. Overall, this study demonstrates that carefully curated small data sets, paired with robust classical ML, can effectively predict the synthesis of Cu NPs and highlights that for lab-scale studies, complex models like LLMs may offer limited benefit over simpler techniques.

replace-cross Bidirectional human-AI collaboration in brain tumour assessments improves both expert human and AI agent performance

Authors: James K Ruffle, Samia Mohinta, Guilherme Pombo, Asthik Biswas, Alan Campbell, Indran Davagnanam, David Doig, Ahmed Hammam, Harpreet Hyare, Farrah Jabeen, Emma Lim, Dermot Mallon, Stephanie Owen, Sophie Wilkinson, Sebastian Brandner, Parashkev Nachev

Abstract: The benefits of artificial intelligence (AI) human partnerships-evaluating how AI agents enhance expert human performance-are increasingly studied. Though rarely evaluated in healthcare, an inverse approach is possible: AI benefiting from the support of an expert human agent. Here, we investigate both human-AI clinical partnership paradigms in the magnetic resonance imaging-guided characterisation of patients with brain tumours. We reveal that human-AI partnerships improve accuracy and metacognitive ability not only for radiologists supported by AI, but also for AI agents supported by radiologists. Moreover, the greatest patient benefit was evident with an AI agent supported by a human one. Synergistic improvements in agent accuracy, metacognitive performance, and inter-rater agreement suggest that AI can create more capable, confident, and consistent clinical agents, whether human or model-based. Our work suggests that the maximal value of AI in healthcare could emerge not from replacing human intelligence, but from AI agents that routinely leverage and amplify it.

replace-cross Block-Recurrent Dynamics in Vision Transformers

Authors: Mozes Jacobs, Thomas Fel, Richard Hakim, Alessandra Brondetta, Demba Ba, T. Andy Keller

Abstract: As Vision Transformers (ViTs) become standard vision backbones, a mechanistic account of their computational phenomenology is essential. Despite architectural cues that hint at dynamical structure, there is no settled framework that interprets Transformer depth as a well-characterized flow. In this work, we introduce the Block-Recurrent Hypothesis (BRH), arguing that trained ViTs admit a block-recurrent depth structure such that the computation of the original $L$ blocks can be accurately rewritten using only $k \ll L$ distinct blocks applied recurrently. Across diverse ViTs, between-layer representational similarity matrices suggest few contiguous phases. To determine whether these phases reflect genuinely reusable computation, we train block-recurrent surrogates of pretrained ViTs: Recurrent Approximations to Phase-structured TransfORmers (Raptor). In small-scale, we demonstrate that stochastic depth and training promote recurrent structure and subsequently correlate with our ability to accurately fit Raptor. We then provide an empirical existence proof for BRH by training a Raptor model to recover $96\%$ of DINOv2 ImageNet-1k linear probe accuracy in only 2 blocks at equivalent computational cost. Finally, we leverage our hypothesis to develop a program of Dynamical Interpretability. We find i) directional convergence into class-dependent angular basins with self-correcting trajectories under small perturbations, ii) token-specific dynamics, where cls executes sharp late reorientations while patch tokens exhibit strong late-stage coherence toward their mean direction, and iii) a collapse to low rank updates in late depth, consistent with convergence to low-dimensional attractors. Altogether, we find a compact recurrent program emerges along ViT depth, pointing to a low-complexity normative solution that enables these models to be studied through principled dynamical systems analysis.

replace-cross InSPO: Unlocking Intrinsic Self-Reflection for LLM Preference Optimization

Authors: Yu Li, Tian Lan, Zhengling Qi

Abstract: Direct Preference Optimization (DPO) and its variants have become standard for aligning Large Language Models due to their simplicity and offline stability. However, we identify two fundamental limitations. First, the optimal policy depends on arbitrary modeling choices (scalarization function, reference policy), yielding behavior reflecting parameterization artifacts rather than true preferences. Second, treating response generation in isolation fails to leverage comparative information in pairwise data, leaving the model's capacity for intrinsic self-reflection untapped. To address it, we propose Intrinsic Self-reflective Preference Optimization (InSPO), deriving a globally optimal policy conditioning on both context and alternative responses. We prove this formulation superior to DPO/RLHF while guaranteeing invariance to scalarization and reference choices. InSPO serves as a plug-and-play enhancement without architectural changes or inference overhead. Experiments demonstrate consistent improvements in win rates and length-controlled metrics, validating that unlocking self-reflection yields more robust, human-aligned LLMs. Our Code is available at https://github.com/Skylanding/InSPO.

URLs: https://github.com/Skylanding/InSPO.

replace-cross Calibrated Multi-Level Quantile Forecasting

Authors: Tiffany Ding, Isaac Gibbs, Ryan J. Tibshirani

Abstract: We develop an online method that guarantees calibration of quantile forecasts at multiple quantile levels simultaneously. In this work, a sequence of quantile forecasts is said to be calibrated provided that its $\alpha$-level predictions are greater than or equal to the target value at an $\alpha$ fraction of time steps, for each level $\alpha$. Our procedure, called the multi-level quantile tracker (MultiQT), is lightweight and wraps around any point or quantile forecaster to produce adjusted quantile forecasts that are guaranteed to be calibrated, even against adversarial distribution shifts. Critically, it does so while ensuring that the quantiles remain ordered, e.g., the 0.5-level quantile forecast will never be larger than the 0.6-level forecast. Moreover, the method has a no-regret guarantee, implying it will not degrade the performance of the existing forecaster (asymptotically), with respect to the quantile loss. In our experiments, we find that MultiQT significantly improves the calibration of real forecasters in epidemic and energy forecasting problems, while leaving the quantile loss largely unchanged or slightly improved.

replace-cross Compressed code: the hidden effects of quantization and distillation on programming tokens

Authors: Viacheslav Siniaev, Iaroslav Chelombitko, Aleksey Komissarov

Abstract: Large Language Models (LLMs) have demonstrated exceptional code generation capabilities, yet their token-level mechanisms remain underexplored, particularly in compressed models. Through systematic analysis of programming language token representations, we characterize how programming languages are encoded in LLM tokenizers by analyzing their vocabulary distribution and keyword coverage patterns. We introduce a novel cold-start probability analysis method that provides insights into model behavior without requiring explicit prompts. Additionally, we present a comprehensive evaluation of how different model optimization techniques - including quantization, distillation, model scaling, and task-specific fine-tuning - affect token-level representations and code generation quality. Our experiments, supported by comprehensive probability distribution analysis and evaluation metrics, reveal critical insights into token-level behavior and provide empirically-validated guidelines for maintaining code generation quality under various optimization constraints. These findings advance both theoretical understanding of LLM code generation and practical implementation of optimized models in production environments.

replace-cross From No-Regret to Strategically Robust Learning in Repeated Auctions

Authors: Junyao Zhao

Abstract: In Bayesian single-item auctions, a monotone bidding strategy--one that prescribes a higher bid for a higher value type--can be equivalently represented as a partition of the quantile space into consecutive intervals corresponding to increasing bids. Kumar et al. (2024) prove that agile online gradient descent (OGD), when used to update a monotone bidding strategy through its quantile representation, is strategically robust in repeated first-price auctions: when all bidders employ agile OGD in this way, the auctioneer's average revenue per round is at most the revenue of Myerson's optimal auction, regardless of how she adjusts the reserve price over time. In this work, we show that this strategic robustness guarantee is not unique to agile OGD or to the first-price auction: any no-regret learning algorithm, when fed gradient feedback with respect to the quantile representation, is strategically robust, even if the auction format changes every round, provided the format satisfies allocation monotonicity and voluntary participation. In particular, the multiplicative weights update (MWU) algorithm simultaneously achieves the optimal regret guarantee and a strong strategic robustness guarantee in this auction setting. At a technical level, our results are established via a simple relation that bridges Myerson's auction theory and standard no-regret learning theory.

replace-cross Towards Spatio-Temporal Extrapolation of Phase-Field Simulations with Convolution-Only Neural Networks

Authors: Christophe Bonneville, Nathan Bieberdorf, Pieterjan Robbe, Mark Asta, Habib Najm, Laurent Capolungo, Cosmin Safta

Abstract: Phase-field simulations of liquid metal dealloying (LMD) can capture complex microstructural evolutions but can be prohibitively expensive for large domains and long time horizons. In this paper, we introduce a fully convolutional, conditionally parameterized U-Net surrogate designed to extrapolate far beyond its training data in both space and time. The architecture integrates convolutional self-attention, physically informed padding, and a flood-fill corrector method to maintain accuracy under extreme extrapolation, while conditioning on simulation parameters allows for flexible time-step skipping and adaptation to varying alloy compositions. To remove the need for costly solver-based initialization, we couple the surrogate with a conditional diffusion model that generates synthetic, physically consistent initial conditions. We train our surrogate on simulations generated over small domain sizes and short time spans, but, by taking advantage of the convolutional nature of U-Nets, we are able to run and extrapolate surrogate simulations for longer time horizons than what would be achievable with classic numerical solvers. Across multiple alloy compositions, the framework is able to reproduce the LMD physics accurately. It predicts key quantities of interest and spatial statistics with relative errors typically below 5% in the training regime and under 15% during large-scale, long time-horizon extrapolations. Our framework can also deliver speed-ups of up to 36,000 times, bringing the time to run weeks-long simulations down to a few seconds. This work is a first stepping stone towards high-fidelity extrapolation in both space and time of phase-field simulation for LMD.

replace-cross Challenges and Research Directions for Large Language Model Inference Hardware

Authors: Xiaoyu Ma, David Patterson

Abstract: Large Language Model (LLM) inference is hard. The autoregressive Decode phase of the underlying Transformer model makes LLM inference fundamentally different from training. Exacerbated by recent AI trends, the primary challenges are memory and interconnect rather than compute. To address these challenges, we highlight four architecture research opportunities: High Bandwidth Flash for 10X memory capacity with HBM-like bandwidth; Processing-Near-Memory and 3D memory-logic stacking for high memory bandwidth; and low-latency interconnect to speedup communication. While our focus is datacenter AI, we also review their applicability for mobile devices.

replace-cross Token-Level LLM Collaboration via FusionRoute

Authors: Nuoya Xiong, Yuhang Zhou, Hanqing Zeng, Zhaorun Chen, Furong Huang, Shuchao Bi, Lizhu Zhang, Zhuokai Zhao

Abstract: Large language models (LLMs) exhibit strengths across diverse domains. However, achieving strong performance across these domains with a single general-purpose model typically requires scaling to sizes that are prohibitively expensive to train and deploy. On the other hand, while smaller domain-specialized models are much more efficient, they struggle to generalize beyond their training distributions. To address this dilemma, we propose FusionRoute, a robust and effective token-level multi-LLM collaboration framework in which a lightweight router simultaneously (i) selects the most suitable expert at each decoding step and (ii) contributes a complementary logit that refines or corrects the selected expert's next-token distribution via logit addition. Unlike existing token-level collaboration methods that rely solely on fixed expert outputs, we provide a theoretical analysis showing that pure expert-only routing is fundamentally limited: unless strong global coverage assumptions hold, it cannot in general realize the optimal decoding policy. By augmenting expert selection with a trainable complementary generator, FusionRoute expands the effective policy class and enables recovery of optimal value functions under mild conditions. Empirically, across both Llama-3 and Gemma-2 families and diverse benchmarks spanning mathematical reasoning, code generation, and instruction following, FusionRoute outperforms both sequence- and token-level collaboration, model merging, and direct fine-tuning, while remaining competitive with domain experts on their respective tasks.

replace-cross Local EGOP for Continuous Index Learning

Authors: Alex Kokot, Anand Hemmady, Vydhourie Thiyageswaran, Marina Meila

Abstract: We introduce the setting of continuous index learning, in which a function of many variables varies only along a small number of directions at each point. For efficient estimation, it is beneficial for a learning algorithm to adapt, near each point $x$, to the subspace that captures the local variability of the function $f$. We pose this task as kernel adaptation along a manifold with noise, and introduce Local EGOP learning, a recursive algorithm that utilizes the Expected Gradient Outer Product (EGOP) quadratic form as both a metric and inverse-covariance of our target distribution. We prove that Local EGOP learning adapts to the regularity of the function of interest, showing that under a supervised noisy manifold hypothesis, intrinsic dimensional learning rates are achieved for arbitrarily high-dimensional noise. Empirically, we compare our algorithm to the feature learning capabilities of deep learning. Additionally, we demonstrate improved regression quality compared to two-layer neural networks in the continuous single-index setting.

replace-cross Enhanced Climbing Image Nudged Elastic Band method with Hessian Eigenmode Alignment

Authors: Rohit Goswami (Institute IMX and Lab-COSMO, \'Ecole polytechnique f\'ed\'erale de Lausanne, Science Institute, University of Iceland, Reykjavik, Iceland), Miha Gunde (Science Institute, University of Iceland, Reykjavik, Iceland, Institute Ru{\dj}er Bo\v{s}kovi\'c, Zagreb, Croatia), Hannes J\'onsson

Abstract: Accurate determination of transition states is central to an understanding of reaction kinetics. Double-endpoint methods where both initial and final states are specified, such as the climbing image nudged elastic band (CI-NEB), identify the minimum energy path between the two and thereby the saddle point on the energy surface that is relevant for the given transition, thus providing an estimate of the transition state within harmonic transition state theory. Such calculations can, however, incur high computational costs and may suffer stagnation on exceptionally flat or rough energy surfaces. Conversely, methods that only require specification of an initial set of atomic coordinates, such as the minimum mode following (MMF) method, offer efficiency but can converge on saddle points that are not relevant for the transition of interest. Here, we present an adaptive hybrid algorithm that switches between the CI-NEB and the MMF method so as to get faster convergence to the relevant saddle point. The method is benchmarked for the Baker-Chan (BC) saddle point test set using the PET-MAD machine-learned potential as well as 59 transitions of a heptamer island on Pt(111) from the OptBench set. A Bayesian analysis of the performance shows a median reduction of energy and force calculations by 46% [95% CrI: -55%, -37%] relative to CI-NEB for the BC set, while a 28% reduction is found for the transitions of the heptamer island. Calculations of the BC set where a simple switch from the CI-NEB to the MMF method is made when the magnitude of the atomic forces drops below 0.5 eV/AA requires 30% more force calculations than the OCI-NEB algorithm. These results show that an adaptive hybrid method mixing CI-NEB and MMF can be a highly efficient tool for high-throughput automated chemical discovery of atomic rearrangements.

replace-cross Small Gradient Norm Regret for Online Convex Optimization

Authors: Wenzhi Gao, Chang He, Madeleine Udell

Abstract: This paper introduces a new problem-dependent regret measure for online convex optimization with smooth losses. The notion, which we call the $G^\star$ regret, depends on the cumulative squared gradient norm evaluated at the decision in hindsight. We show that the $G^\star$ regret strictly refines the existing $L^\star$ (small loss) regret, and that it can be arbitrarily sharper when the losses have vanishing curvature around the hindsight decision. We establish upper and lower bounds on the $G^\star$ regret and extend our results to dynamic regret and bandit settings. As a byproduct, we refine the existing convergence analysis of stochastic optimization algorithms in the interpolation regime. Some experiments validate our theoretical findings.

replace-cross An Elementary Approach to Scheduling in Generative Diffusion Models

Authors: Qiang Sun, H. Vincent Poor, Wenyi Zhang

Abstract: An elementary approach to characterizing the impact of noise scheduling and time discretization in generative diffusion models is developed. We first utilize the Cram\'er-Rao bound to identify the Gaussian setting as a fundamental performance limit, necessitating its study as a reference. Building on this insight, we consider a simplified model in which the source distribution is a multivariate Gaussian with a given covariance matrix, together with the deterministic reverse sampling process. The explicit closed-form evolution trajectory of the distributions across reverse sampling steps is derived, and consequently, the Kullback-Leibler (KL) divergence between the source distribution and the reverse sampling output is obtained. The effect of the number of time discretization steps on the convergence of this KL divergence is studied via the Euler-Maclaurin expansion. An optimization problem is formulated, and its solution noise schedule is obtained via calculus of variations, shown to follow a tangent law whose coefficient is determined by the eigenvalues of the source covariance matrix. For an alternative scenario, more realistic in practice, where pretrained models have been obtained for some given noise schedules, the KL divergence also provides a measure to compare different time discretization strategies in reverse sampling. Experiments across different datasets and pretrained models demonstrate that the time discretization strategy selected by our approach consistently outperforms baseline and search-based strategies, particularly when the budget on the number of function evaluations is very tight.

replace-cross Federated Balanced Learning

Authors: Jiaze Li, Haoran Xu, Wanyi Wu, Changwei Wang, Shuaiguang Li, Jianzhong Ju, Zhenbo Luo, Jian Luan, Youyang Qu, Longxiang Gao, Xudong Yang, Lumin Xing

Abstract: Federated learning is a paradigm of joint learning in which clients collaborate by sharing model parameters instead of data. However, in the non-iid setting, the global model experiences client drift, which can seriously affect the final performance of the model. Previous methods tend to correct the global model that has already deviated based on the loss function or gradient, overlooking the impact of the client samples. In this paper, we rethink the role of the client side and propose Federated Balanced Learning, i.e., FBL, to prevent this issue from the beginning through sample balance on the client side. Technically, FBL allows unbalanced data on the client side to achieve sample balance through knowledge filling and knowledge sampling using edge-side generation models, under the limitation of a fixed number of data samples on clients. Furthermore, we design a Knowledge Alignment Strategy to bridge the gap between synthetic and real data, and a Knowledge Drop Strategy to regularize our method. Meanwhile, we scale our method to real and complex scenarios, allowing different clients to adopt various methods, and extend our framework to further improve performance. Numerous experiments show that our method outperforms state-of-the-art baselines. The code is released upon acceptance.

replace-cross Stability as a Liability:Systematic Breakdown of Linguistic Structure in LLMs

Authors: Xianzhe Meng, Qiangsheng Zeng, Ling Luo, Qinghan Yang, Jiarui Hao, Wenbo Wu, Qinyu Wang, Rui Yin, Lin Qi, Renzhi Lu

Abstract: Training stability is typically regarded as a prerequisite for reliable optimization in large language models. In this work, we analyze how stabilizing training dynamics affects the induced generation distribution. We show that under standard maximum likelihood training, stable parameter trajectories lead stationary solutions to approximately minimize the forward KL divergence to the empirical distribution, while implicitly reducing generative entropy. As a consequence, the learned model can concentrate probability mass on a limited subset of empirical modes, exhibiting systematic degeneration despite smooth loss convergence. We empirically validate this effect using a controlled feedback-based training framework that stabilizes internal generation statistics, observing consistent low-entropy outputs and repetitive behavior across architectures and random seeds. It indicates that optimization stability and generative expressivity are not inherently aligned, and that stability alone is an insufficient indicator of generative quality.

replace-cross Beyond In-Domain Detection: SpikeScore for Cross-Domain Hallucination Detection

Authors: Yongxin Deng, Zhen Fang, Sharon Li, Ling Chen

Abstract: Hallucination detection is critical for deploying large language models (LLMs) in real-world applications. Existing hallucination detection methods achieve strong performance when the training and test data come from the same domain, but they suffer from poor cross-domain generalization. In this paper, we study an important yet overlooked problem, termed generalizable hallucination detection (GHD), which aims to train hallucination detectors on data from a single domain while ensuring robust performance across diverse related domains. In studying GHD, we simulate multi-turn dialogues following LLMs' initial response and observe an interesting phenomenon: hallucination-initiated multi-turn dialogues universally exhibit larger uncertainty fluctuations than factual ones across different domains. Based on the phenomenon, we propose a new score SpikeScore, which quantifies abrupt fluctuations in multi-turn dialogues. Through both theoretical analysis and empirical validation, we demonstrate that SpikeScore achieves strong cross-domain separability between hallucinated and non-hallucinated responses. Experiments across multiple LLMs and benchmarks demonstrate that the SpikeScore-based detection method outperforms representative baselines in cross-domain generalization and surpasses advanced generalization-oriented methods, verifying the effectiveness of our method in cross-domain hallucination detection.

replace-cross NRR-Phi: Text-to-State Mapping for Ambiguity Preservation in LLM Inference

Authors: Kei Saito

Abstract: Large language models exhibit a systematic tendency toward early semantic commitment: given ambiguous input, they collapse multiple valid interpretations into a single response before sufficient context is available. We present a formal framework for text-to-state mapping ($\phi: \mathcal{T} \to \mathcal{S}$) that transforms natural language into a non-collapsing state space where multiple interpretations coexist. The mapping decomposes into three stages: conflict detection, interpretation extraction, and state construction. We instantiate $\phi$ with a hybrid extraction pipeline combining rule-based segmentation for explicit conflict markers (adversative conjunctions, hedging expressions) with LLM-based enumeration of implicit ambiguity (epistemic, lexical, structural). On a test set of 68 ambiguous sentences, the resulting states preserve interpretive multiplicity: mean state entropy $H = 1.087$ bits across ambiguity categories, compared to $H = 0$ for collapse-based baselines. We additionally instantiate the rule-based conflict detector for Japanese markers to illustrate cross-lingual portability. This framework extends Non-Resolution Reasoning (NRR) by providing the missing algorithmic bridge between text and the NRR state space, enabling architectural collapse deferment in LLM inference. Design principles for state-to-state transformations are detailed in the Appendix, with empirical validation on 580 test cases showing 0% collapse for principle-satisfying operators versus up to 17.8% for violating operators.

replace-cross Sycophantic Anchors: Localizing and Quantifying User Agreement in Reasoning Models

Authors: Jacek Duszenko

Abstract: Reasoning models frequently agree with incorrect user suggestions -- a behavior known as sycophancy. However, it is unclear where in the reasoning trace this agreement originates and how strong the commitment is. We introduce \emph{sycophantic anchors} -- sentences identified via counterfactual analysis that commit models to user agreement. Across four reasoning models spanning three architecture families (Llama, Qwen, Falcon-hybrid) and 1.5B--8B parameters, we analyze over 200,000 counterfactual rollouts and show that linear probes reliably detect sycophantic anchors (74--85\% balanced accuracy), outperforming text-only baselines at high commitment levels -- confirming they capture internal states beyond surface vocabulary. Regressors further predict commitment strength from activations ($R^2$ up to 0.74). We observe a consistent asymmetry: sycophancy leaves a stronger mechanistic footprint than correct reasoning. We also find that sycophancy builds gradually during generation rather than being determined by the prompt. These findings enable sentence-level detection and quantification of model misalignment mid-inference.

replace-cross Task-free Adaptive Meta Black-box Optimization

Authors: Chao Wang, Licheng Jiao, Lingling Li, Jiaxuan Zhao, Guanchun Wang, Fang Liu, Shuyuan Yang

Abstract: Handcrafted optimizers become prohibitively inefficient for complex black-box optimization (BBO) tasks. MetaBBO addresses this challenge by meta-learning to automatically configure optimizers for low-level BBO tasks, thereby eliminating heuristic dependencies. However, existing methods typically require extensive handcrafted training tasks to learn meta-strategies that generalize to target tasks, which poses a critical limitation for realistic applications with unknown task distributions. To overcome the issue, we propose the Adaptive meta Black-box Optimization Model (ABOM), which performs online parameter adaptation using solely optimization data from the target task, obviating the need for predefined task distributions. Unlike conventional metaBBO frameworks that decouple meta-training and optimization phases, ABOM introduces a closed-loop adaptive parameter learning mechanism, where parameterized evolutionary operators continuously self-update by leveraging generated populations during optimization. This paradigm shift enables zero-shot optimization: ABOM achieves competitive performance on synthetic BBO benchmarks and realistic unmanned aerial vehicle path planning problems without any handcrafted training tasks. Visualization studies reveal that parameterized evolutionary operators exhibit statistically significant search patterns, including natural selection and genetic recombination.

replace-cross Aligning Microscopic Vehicle and Macroscopic Traffic Statistics: Reconstructing Driving Behavior from Partial Data

Authors: Zhihao Zhang, Keith Redmill, Chengyang Peng, Bowen Weng

Abstract: A driving algorithm that aligns with good human driving practices, or at the very least collaborates effectively with human drivers, is crucial for developing safe and efficient autonomous vehicles. In practice, two main approaches are commonly adopted: (i) supervised or imitation learning, which requires comprehensive naturalistic driving data capturing all states that influence a vehicle's decisions and corresponding actions, and (ii) reinforcement learning (RL), where the simulated driving environment either matches or is intentionally more challenging than real-world conditions. Both methods depend on high-quality observations of real-world driving behavior, which are often difficult and costly to obtain. State-of-the-art sensors on individual vehicles can gather microscopic data, but they lack context about the surrounding conditions. Conversely, roadside sensors can capture traffic flow and other macroscopic characteristics, but they cannot associate this information with individual vehicles on a microscopic level. Motivated by this complementarity, we propose a framework that reconstructs unobserved microscopic states from macroscopic observations, using microscopic data to anchor observed vehicle behaviors, and learns a shared policy whose behavior is microscopically consistent with the partially observed trajectories and actions and macroscopically aligned with target traffic statistics when deployed population-wide. Such constrained and regularized policies promote realistic flow patterns and safe coordination with human drivers at scale.

replace-cross Optimal Decision-Making Based on Prediction Sets

Authors: Tao Wang, Edgar Dobriban

Abstract: Prediction sets can wrap around any ML model to cover unknown test outcomes with a guaranteed probability. Yet, it remains unclear how to use them optimally for downstream decision-making. Here, we propose a decision-theoretic framework that seeks to minimize the expected loss (risk) against a worst-case distribution consistent with the prediction set's coverage guarantee. We first characterize the minimax optimal policy for a fixed prediction set, showing that it balances the worst-case loss inside the set with a penalty for potential losses outside the set. Building on this, we derive the optimal prediction set construction that minimizes the resulting robust risk subject to a coverage constraint. Finally, we introduce Risk-Optimal Conformal Prediction (ROCP), a practical algorithm that targets these risk-minimizing sets while maintaining finite-sample distribution-free marginal coverage. Empirical evaluations on medical diagnosis and safety-critical decision-making tasks demonstrate that ROCP reduces critical mistakes compared to baselines, particularly when out-of-set errors are costly.

replace-cross Supervised Fine-Tuning Needs to Unlock the Potential of Token Priority

Authors: Zhanming Shen, Zeyu Qin, Jiaqi Hu, Wentao Ye, Hao Chen, Xiaomeng Hu, Haokai Xu, Gang Chen, Yi R. Fung, Haobo Wang

Abstract: The transition from fitting empirical data to achieving true human utility is fundamentally constrained by a granularity mismatch, where fine-grained autoregressive generation is often supervised by coarse or uniform signals. This position paper advocates Token Priority as the essential bridge, formalizing Supervised Fine-Tuning (SFT) not as simple optimization but as a precise distribution reshaping process that aligns raw data with the ideal alignment manifold. We analyze recent breakthroughs through this unified lens, categorizing them into two distinct regimes: Positive Priority for noise filtration and Signed Priority for toxic modes unlearning. We revisit existing progress and limitations, identify key challenges, and suggest directions for future research.

replace-cross Non-Uniform Noise-to-Signal Ratio in the REINFORCE Policy-Gradient Estimator

Authors: Haoyu Han, Heng Yang

Abstract: Policy-gradient methods are widely used in reinforcement learning, yet training often becomes unstable or slows down as learning progresses. We study this phenomenon through the noise-to-signal ratio (NSR) of a policy-gradient estimator, defined as the estimator variance (noise) normalized by the squared norm of the true gradient (signal). Our main result is that, for (i) finite-horizon linear systems with Gaussian policies and linear state-feedback, and (ii) finite-horizon polynomial systems with Gaussian policies and polynomial feedback, the NSR of the REINFORCE estimator can be characterized exactly-either in closed form or via numerical moment-evaluation algorithms-without approximation. For general nonlinear dynamics and expressive policies (including neural policies), we further derive a general upper bound on the variance. These characterizations enable a direct examination of how NSR varies across policy parameters and how it evolves along optimization trajectories (e.g. SGD and Adam). Across a range of examples, we find that the NSR landscape is highly non-uniform and typically increases as the policy approaches an optimum; in some regimes it blows up, which can trigger training instability and policy collapse.

replace-cross Efficient Softmax Reformulation for Homomorphic Encryption via Moment Generating Function

Authors: Hanjun Park, Byeong-Seo Min, Jiheon Woo, Min-Wook Jeong, Jongho Shin, Yongwoo Lee, Young-Sik Kim, Yongjune Kim

Abstract: Homomorphic encryption (HE) is a prominent framework for privacy-preserving machine learning, enabling inference directly on encrypted data. However, evaluating softmax, a core component of transformer architectures, remains particularly challenging in HE due to its multivariate structure, the large dynamic range induced by exponential functions, and the need for accurate division during normalization. In this paper, we propose MGF-softmax, a novel softmax reformulation based on the moment generating function (MGF) that replaces the softmax denominator with its moment-based counterpart. This reformulation substantially reduces multiplicative depth while preserving key properties of softmax and asymptotically converging to the exact softmax as the number of input tokens increases. Extensive experiments on Vision Transformers and large language models show that MGF-softmax provides an efficient and accurate approximation of softmax in encrypted inference. In particular, it achieves inference accuracy close to that of high-depth exact methods, while requiring substantially lower computational cost through reduced multiplicative depth.

replace-cross FlowSteer: Interactive Agentic Workflow Orchestration via End-to-End Reinforcement Learning

Authors: Mingda Zhang, Haoran Luo, Tiesunlong Shen, Qika Lin, Xiaoying Tang, Rui Mao, Erik Cambria

Abstract: In recent years, a variety of powerful agentic workflows have been applied to solve a wide range of human problems. However, existing workflow orchestration still faces key challenges, including high manual cost, reliance on specific operators/large language models (LLMs), and sparse reward signals. To address these challenges, we propose FlowSteer, an end-to-end reinforcement learning framework that takes a lightweight policy model as the agent and an executable canvas environment, automating workflow orchestration through multi-turn interaction. In this process, the policy model analyzes execution states and selects editing actions, while the canvas executes operators and returns feedback for iterative refinement. Moreover, FlowSteer provides a plug-and-play framework that supports diverse operator libraries and interchangeable LLM backends. To effectively train this interaction paradigm, we propose Canvas Workflow Relative Policy Optimization (CWRPO), which introduces diversity-constrained rewards with conditional release to stabilize learning and suppress shortcut behaviors. Experimental results on twelve datasets show that FlowSteer significantly outperforms baselines across various tasks.

replace-cross MACD: Model-Aware Contrastive Decoding via Counterfactual Data

Authors: Qixin Xiao

Abstract: Video language models (Video-LLMs) are prone to hallucinations, often generating plausible but ungrounded content when visual evidence is weak, ambiguous, or biased. Existing decoding methods, such as contrastive decoding (CD), rely on random perturbations to construct contrastive data for mitigating hallucination patterns. However, such a way is hard to control the visual cues that drive hallucination or well align with model weaknesses. We propose Model-aware Counterfactual Data based Contrastive Decoding (MACD), a new inference strategy that combines model-guided counterfactual construction with decoding. Our approach uses the Video-LLM's own feedback to identify object regions most responsible for hallucination, generating targeted counterfactual inputs at the object level rather than arbitrary frame or temporal modifications. These model-aware counterfactual data is then integrated into CD to enforce evidence-grounded token selection during decoding. Experiments on EventHallusion, MVBench, Perception-test and Video-MME show that MACD consistently reduces hallucination while maintaining or improving task accuracy across diverse Video-LLMs, including Qwen and InternVL families. The method is especially effective in challenging scenarios involving small, occluded, or co-occurring objects. Our code and data will be publicly released.

replace-cross Near-Universal Multiplicative Updates for Nonnegative Einsum Factorization

Authors: John Hood, Aaron Schein

Abstract: Despite the ubiquity of multiway data across scientific domains, there are few user-friendly tools that fit tailored nonnegative tensor factorizations. Researchers may use gradient-based automatic differentiation (which often struggles in nonnegative settings), choose between a limited set of methods with mature implementations, or implement their own model from scratch. As an alternative, we introduce NNEinFact, an einsum-based multiplicative update algorithm that fits any nonnegative tensor factorization expressible as a tensor contraction by minimizing one of many user-specified loss functions (including the $(\alpha,\beta)$-divergence). To use NNEinFact, the researcher simply specifies their model with a string. NNEinFact converges to a stationary point of the loss, supports missing data, and fits to tensors with hundreds of millions of entries in seconds. Empirically, NNEinFact fits custom models which outperform standard ones in heldout prediction tasks on real-world tensor data by over $37\%$ and attains less than half the test loss of gradient-based methods while converging up to 90 times faster.

replace-cross DiGAN: Diffusion-Guided Attention Network for Early Alzheimer's Disease Detection

Authors: Maxx Richard Rahman, Mostafa Hammouda, Wolfgang Maass

Abstract: Early diagnosis of Alzheimer's disease (AD) remains a major challenge due to the subtle and temporally irregular progression of structural brain changes in the prodromal stages. Existing deep learning approaches require large longitudinal datasets and often fail to model the temporal continuity and modality irregularities inherent in real-world clinical data. To address these limitations, we propose the Diffusion-Guided Attention Network (DiGAN), which integrates latent diffusion modelling with an attention-guided convolutional network. The diffusion model synthesizes realistic longitudinal neuroimaging trajectories from limited training data, enriching temporal context and improving robustness to unevenly spaced visits. The attention-convolutional layer then captures discriminative structural-temporal patterns that distinguish cognitively normal subjects from those with mild cognitive impairment and subjective cognitive decline. Experiments on the ADNI dataset demonstrate that DiGAN outperforms existing state-of-the-art baselines, showing its potential for early-stage AD detection.

replace-cross Enhancing Mathematical Problem Solving in LLMs through Execution-Driven Reasoning Augmentation

Authors: Aditya Basarkar, Benyamin Tabarsi, Tiffany Barnes, Dongkuan Xu

Abstract: Mathematical problem solving is a fundamental benchmark for assessing the reasoning capabilities of artificial intelligence and a gateway to applications in education, science, and engineering where reliable symbolic reasoning is essential. Although recent advances in multi-agent LLM-based systems have enhanced their mathematical reasoning capabilities, they still lack a reliably revisable representation of the reasoning process. Existing agents either operate in rigid sequential pipelines that cannot correct earlier steps or rely on heuristic self-evaluation that can fail to identify and fix errors. In addition, programmatic context can distract language models and degrade accuracy. To address these gaps, we introduce Iteratively Improved Program Construction (IIPC), a reasoning method that iteratively refines programmatic reasoning chains and combines execution feedback with the native Chain-of-thought abilities of the base LLM to maintain high-level contextual focus. IIPC surpasses competing approaches in the majority of reasoning benchmarks on multiple base LLMs. All code and implementations are released as open source.

replace-cross ZKBoost: Zero-Knowledge Verifiable Training for XGBoost

Authors: Nikolas Melissaris, Jiayi Xu, Antigoni Polychroniadou, Akira Takahashi, Chenkai Weng

Abstract: Gradient boosted decision trees, particularly XGBoost, are among the most effective methods for tabular data. As deployment in sensitive settings increases, cryptographic guarantees of model integrity become essential. We present ZKBoost, the first zero-knowledge proof of training (zkPoT) protocol for XGBoost, enabling model owners to prove correct training on a committed dataset without revealing data or parameters. We make three key contributions: (1) a fixed-point XGBoost implementation compatible with arithmetic circuits, enabling instantiation of efficient zkPoT, (2) a generic template of zkPoT for XGBoost, which can be instantiated with any general-purpose ZKP backend, and (3) vector oblivious linear evaluation (VOLE)-based instantiation resolving challenges in proving nonlinear fixed-point operations. Our fixed-point implementation matches standard XGBoost accuracy within 1\% while enabling practical zkPoT on real-world datasets.

replace-cross Lyapunov Constrained Soft Actor-Critic (LC-SAC) using Koopman Operator Theory for Quadrotor Trajectory Tracking

Authors: Dhruv S. Kushwaha, Zoleikha A. Biron

Abstract: Reinforcement Learning (RL) has achieved remarkable success in solving complex sequential decision-making problems. However, its application to safety-critical physical systems remains constrained by the lack of stability guarantees. Standard RL algorithms prioritize reward maximization, often yielding policies that may induce oscillations or unbounded state divergence. There has been significant work in incorporating Lyapunov-based stability guarantees in RL algorithms with key challenges being selecting a candidate Lyapunov function, computational complexity by using excessive function approximators and conservative policies by incorporating stability criterion in the learning process. In this work we propose a novel Lyapunov-constrained Soft Actor-Critic (LC-SAC) algorithm using Koopman operator theory. We propose use of extended dynamic mode decomposition (EDMD) to produce a linear approximation of the system and use this approximation to derive a closed form solution for candidate Lyapunov function. This derived Lyapunov function is incorporated in the SAC algorithm to further provide guarantees for a policy that stabilizes the nonlinear system. The results are evaluated trajectory tracking of a 2D Quadrotor environment based on safe-control-gym. The proposed algorithm shows training convergence and decaying violations for Lyapunov stability criterion compared to baseline vanilla SAC algorithm. GitHub Repository: https://github.com/DhruvKushwaha/LC-SAC-Quadrotor-Trajectory-Tracking

URLs: https://github.com/DhruvKushwaha/LC-SAC-Quadrotor-Trajectory-Tracking

replace-cross Learning fermionic linear optics with Heisenberg scaling and physical operations

Authors: Aria Christensen, Andrew Zhao

Abstract: We revisit the problem of learning fermionic linear optics (FLO), also known as fermionic Gaussian unitaries. Given black-box query access to an unknown FLO, previous proposals required $\widetilde{\mathcal{O}}(n^5 / \varepsilon^2)$ queries, where $n$ is the system size and $\varepsilon$ is the error in diamond distance. These algorithms also use unphysical operations (i.e., violating fermionic superselection rules) and/or $n$ auxiliary modes to prepare Choi states of the FLO. In this work, we establish efficient and experimentally friendly protocols that obey superselection, use minimal ancilla (at most $1$ extra mode), and exhibit improved dependence on both parameters $n$ and $\varepsilon$. For arbitrary (active) FLOs this algorithm makes at most $\widetilde{\mathcal{O}}(n^4 / \varepsilon)$ queries, while for number-conserving (passive) FLOs we show that $\mathcal{O}(n^3 / \varepsilon)$ queries suffice. The complexity of the active case can be further reduced to $\widetilde{\mathcal{O}}(n^3 / \varepsilon)$ at the cost of using $n$ ancilla. This marks the first FLO learning algorithm that attains Heisenberg scaling in precision. As a side result, we also demonstrate an improved copy complexity of $\widetilde{\mathcal{O}}(n \eta^2 / \varepsilon^2)$ for time-efficient state tomography of $\eta$-particle Slater determinants in $\varepsilon$ trace distance, which may be of independent interest.

replace-cross Beware Untrusted Simulators -- Reward-Free Backdoor Attacks in Reinforcement Learning

Authors: Ethan Rathbun, Wo Wei Lin, Alina Oprea, Christopher Amato

Abstract: Simulated environments are a key piece in the success of Reinforcement Learning (RL), allowing practitioners and researchers to train decision making agents without running expensive experiments on real hardware. Simulators remain a security blind spot, however, enabling adversarial developers to alter the dynamics of their released simulators for malicious purposes. Therefore, in this work we highlight a novel threat, demonstrating how simulator dynamics can be exploited to stealthily implant action-level backdoors into RL agents. The backdoor then allows an adversary to reliably activate targeted actions in an agent upon observing a predefined ``trigger'', leading to potentially dangerous consequences. Traditional backdoor attacks are limited in their strong threat models, assuming the adversary has near full control over an agent's training pipeline, enabling them to both alter and observe agent's rewards. As these assumptions are infeasible to implement within a simulator, we propose a new attack ``Daze'' which is able to reliably and stealthily implant backdoors into RL agents trained for real world tasks without altering or even observing their rewards. We provide formal proof of Daze's effectiveness in guaranteeing attack success across general RL tasks along with extensive empirical evaluations on both discrete and continuous action space domains. We additionally provide the first example of RL backdoor attacks transferring to real, robotic hardware. These developments motivate further research into securing all components of the RL training pipeline to prevent malicious attacks.

replace-cross Reducing the Complexity of Matrix Multiplication to $O(N^2log_2N)$ by an Asymptotically Optimal Quantum Algorithm

Authors: Jiaqi Yao, Ding Liu

Abstract: Matrix multiplication is a fundamental classical computing operation whose efficiency becomes a major challenge at scale, especially for machine learning applications. Quantum computing, with its inherent parallelism and exponential storage capacity, offers a potential solution to these limitations. This work presents a quantum kernel-based matrix multiplication algorithm (QKMM) that achieves an asymptotically optimal computational complexity of $ O(N^2 \log_2 N) $, outperforming the classical optimal complexity of $ O(N^{2.371552}) $, where $N$ denotes the matrix dimension. Through noiseless and noisy quantum simulation experiments, we demonstrate that the proposed algorithm not only exhibits superior theoretical efficiency but also shows practical advantages in runtime performance and stability.

replace-cross PhysicsAgentABM: Physics-Guided Generative Agent-Based Modeling

Authors: Kavana Venkatesh, Yinhan He, Jundong Li, Jiaming Cui

Abstract: Large language model (LLM)-based multi-agent systems enable expressive agent reasoning but are expensive to scale and poorly calibrated for timestep-aligned state-transition simulation, while classical agent-based models (ABMs) offer interpretability but struggle to integrate rich individual-level signals and non-stationary behaviors. We propose PhysicsAgentABM, which shifts inference to behaviorally coherent agent clusters: state-specialized symbolic agents encode mechanistic transition priors, a multimodal neural transition model captures temporal and interaction dynamics, and uncertainty-aware epistemic fusion yields calibrated cluster-level transition distributions. Individual agents then stochastically realize transitions under local constraints, decoupling population inference from entity-level variability. We further introduce ANCHOR, an LLM agent-driven clustering strategy based on cross-contextual behavioral responses and a novel contrastive loss, reducing LLM calls by up to 6-8 times. Experiments across public health, finance, and social sciences show consistent gains in event-time accuracy and calibration over mechanistic, neural, and LLM baselines. By re-architecting generative ABM around population-level inference with uncertainty-aware neuro-symbolic fusion, PhysicsAgentABM establishes a new paradigm for scalable and calibrated simulation with LLMs.

replace-cross Deep networks learn to parse uniform-depth context-free languages from local statistics

Authors: Jack T. Parley, Francesco Cagnetta, Matthieu Wyart

Abstract: Understanding how the structure of language can be learned from sentences alone is a central question in both cognitive science and machine learning. Studies of the internal representations of Large Language Models (LLMs) support their ability to parse text when predicting the next word, while representing semantic notions independently of surface form. Yet, which data statistics make these feats possible, and how much data is required, remain largely unknown. Probabilistic context-free grammars (PCFGs) provide a tractable testbed for studying these questions. However, prior work has focused either on the post-hoc characterization of the parsing-like algorithms used by trained networks; or on the learnability of PCFGs with fixed syntax, where parsing is unnecessary. Here, we (i) introduce a tunable class of PCFGs in which both the degree of ambiguity and the correlation structure across scales can be controlled; (ii) provide a learning mechanism -- an inference algorithm inspired by the structure of deep convolutional networks -- that links learnability and sample complexity to specific language statistics; and (iii) validate our predictions empirically across deep convolutional and transformer-based architectures. Overall, we propose a unifying framework where correlations at different scales lift local ambiguities, enabling the emergence of hierarchical representations of the data.

replace-cross Addressing the Waypoint-Action Gap in End-to-End Autonomous Driving via Vehicle Motion Models

Authors: Jorge Daniel Rodr\'iguez-Vidal, Gabriel Villalonga, Diego Porres, Antonio M. L\'opez Pe\~na

Abstract: End-to-End Autonomous Driving (E2E-AD) systems are typically grouped by the nature of their outputs: (i) waypoint-based models that predict a future trajectory, and (ii) action-based models that directly output throttle, steer and brake. Most recent benchmark protocols and training pipelines are waypoint-based, which makes action-based policies harder to train and compare, slowing their progress. To bridge this waypoint-action gap, we propose a novel, differentiable vehicle-model framework that rolls out predicted action sequences to their corresponding ego-frame waypoint trajectories while supervising in waypoint space. Our approach enables action-based architectures to be trained and evaluated, for the first time, within waypoint-based benchmarks without modifying the underlying evaluation protocol. We extensively evaluate our framework across multiple challenging benchmarks and observe consistent improvements over the baselines. In particular, on NAVSIM \texttt{navhard} our approach achieves state-of-the-art performance. Our code will be made publicly available upon acceptance.

replace-cross Diffusion-State Policy Optimization for Masked Diffusion Language Models

Authors: Daisuke Oba, Hiroki Furuta, Naoaki Okazaki

Abstract: Masked diffusion language models generate by iteratively filling masked tokens over multiple denoising steps, so learning only from a terminal reward on the final completion yields coarse credit assignment over intermediate decisions. We propose DiSPO (Diffusion-State Policy Optimization), a plug-in credit-assignment layer that directly optimizes intermediate filling decisions. At selected intermediate masked states, DiSPO branches by resampling fillings for the currently masked positions from rollout-cached logits, scores the resulting completions, and updates only the newly filled tokens -- without additional multi-step diffusion rollouts. We formalize a fixed-state objective for branched completions and derive a policy-gradient estimator that can be combined with terminal-feedback policy optimization using the same rollouts. On LLaDA-8B-Instruct, DiSPO consistently improves over the terminal-feedback diffu-GRPO baseline on math and planning benchmarks under matched rollout compute and optimizer steps. Our code will be available at https://daioba.github.io/dispo .

URLs: https://daioba.github.io/dispo

replace-cross Forest canopy height estimation from satellite RGB imagery using large-scale airborne LiDAR-derived training data and monocular depth estimation

Authors: Yongkang Lai, Xihan Mu, Dasheng Fan, Donghui Xie, Shanxin Guo, Wenli Huang, Tianjie Zhao, Guangjian Yan

Abstract: Large-scale, high-resolution forest canopy height mapping plays a crucial role in understanding regional and global carbon and water cycles. Spaceborne LiDAR missions, including the Ice, Cloud, and Land Elevation Satellite-2 (ICESat-2) and the Global Ecosystem Dynamics Investigation (GEDI), provide global observations of forest structure but are spatially sparse and subject to inherent uncertainties. In contrast, near-surface LiDAR platforms, such as airborne and unmanned aerial vehicle (UAV) LiDAR systems, offer much finer measurements of forest canopy structure, and a growing number of countries have made these datasets openly available. In this study, a state-of-the-art monocular depth estimation model, Depth Anything V2, was trained using approximately 16,000 km2 of canopy height models (CHMs) derived from publicly available airborne LiDAR point clouds and related products across multiple countries, together with 3 m resolution PlanetScope and airborne RGB imagery. The trained model, referred to as Depth2CHM, enables the estimation of spatially continuous CHMs directly from PlanetScope RGB imagery. Independent validation was conducted at sites in China (approximately 1 km2) and the United States (approximately 116 km2). The results showed that Depth2CHM could accurately estimate canopy height, with biases of 0.59 m and 0.41 m and root mean square errors (RMSEs) of 2.54 m and 5.75 m for these two sites, respectively. Compared with an existing global meter-resolution CHM product, the mean absolute error is reduced by approximately 1.5 m and the RMSE by approximately 2 m. These results demonstrated that monocular depth estimation networks trained with large-scale airborne LiDAR-derived canopy height data provide a promising and scalable pathway for high-resolution, spatially continuous forest canopy height estimation from satellite RGB imagery.