new A Theory of Multi-Agent Generative Flow Networks

Authors: Leo Maxime Brunswic, Haozhi Wang, Shuang Luo, Jianye Hao, Amir Rasouli, Yinchuan Li

Abstract: Generative flow networks use a flow-matching loss to learn a stochastic policy that generates objects through a sequence of actions, such that the probability of generating a pattern is proportional to its given reward. However, a theoretical framework for multi-agent generative flow networks (MA-GFlowNets) has not yet been proposed. In this paper, we propose a theoretical framework for MA-GFlowNets that allows multiple agents to generate objects collaboratively through a series of joint actions. We further propose four algorithms: a centralized flow network for centralized training of MA-GFlowNets, an independent flow network for decentralized execution, a joint flow network for centralized training with decentralized execution, and a conditional variant of the latter. Joint Flow training is based on a local-global principle that allows a collection of (local) GFNs to be trained as a single (global) GFN. This principle yields a loss of reasonable complexity and lets us leverage standard results on GFNs to guarantee that the independent policies generate samples with probability proportional to the reward function. Experimental results demonstrate the superiority of the proposed framework over reinforcement learning and MCMC-based methods.
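
The flow-matching condition at the heart of this family of methods can be made concrete with a small sketch. Below is a minimal, single-agent PyTorch version of the log-space flow-matching residual at one state; the paper's local-global construction trains local per-agent flows so that their combination satisfies this same kind of balance globally. All names are illustrative, not the authors' code.

    import torch

    def flow_matching_loss(log_inflows, log_outflows, log_reward=None):
        """Squared log-space flow-matching residual at one state s:
        total inflow F(s' -> s) must match total outflow F(s -> s''),
        with the terminal reward R(s) counted as an extra outflow."""
        lhs = torch.logsumexp(log_inflows, dim=0)
        if log_reward is not None:  # terminal state
            log_outflows = torch.cat([log_outflows, log_reward.view(1)])
        rhs = torch.logsumexp(log_outflows, dim=0)
        return (lhs - rhs).pow(2)

    # Toy usage: a state with two incoming and three outgoing edge flows,
    # each log-flow predicted by a network in practice.
    loss = flow_matching_loss(torch.randn(2), torch.randn(3))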

new FastEagle: Cascaded Drafting for Accelerating Speculative Decoding

Authors: Haiduo Huang, Jiangcheng Song, Wenzhe Zhao, Pengju Ren

Abstract: Speculative decoding accelerates generation by drafting candidates and verifying them in parallel, yet state-of-the-art drafters (e.g., EAGLE) still require N sequential passes to propose N tokens. We present FastEagle, a non-autoregressive cascaded drafter that emits an entire draft in a single forward pass. FastEagle replaces temporal steps with a lightweight layer cascade and trains with layer-wise supervision to mitigate error accumulation. Coupled with a constrained draft tree that preserves lossless verification cost, FastEagle delivers substantial wall-clock speedups over strong autoregressive drafters while maintaining competitive acceptance behavior. Across multiple LLMs (Vicuna-13B, LLaMA-Instruct 3.x, and DeepSeek-R1-Distill-LLaMA) and tasks (MT-Bench, HumanEval, GSM8K, CNN/DM, Alpaca), FastEagle consistently outperforms EAGLE-3 in speedup under both greedy and stochastic decoding, with comparable average acceptance lengths. These results indicate that removing sequential dependencies in drafting is a practical path toward lossless LLM inference acceleration.

new mloz: A Highly Efficient Machine Learning-Based Ozone Parameterization for Climate Sensitivity Simulations

Authors: Yiling Ma, Nathan Luke Abraham, Stefan Versick, Roland Ruhnke, Andrea Schneidereit, Ulrike Niemeier, Felix Back, Peter Braesicke, Peer Nowack

Abstract: Atmospheric ozone is a crucial absorber of solar radiation and an important greenhouse gas. However, most climate models participating in the Coupled Model Intercomparison Project (CMIP) still lack an interactive representation of ozone due to the high computational costs of atmospheric chemistry schemes. Here, we introduce a machine learning parameterization (mloz) to interactively model daily ozone variability and trends across the troposphere and stratosphere in standard climate sensitivity simulations, including two-way interactions of ozone with the Quasi-Biennial Oscillation. We demonstrate its high fidelity on decadal timescales and its flexible use online across two different climate models -- the UK Earth System Model (UKESM) and the German ICOsahedral Nonhydrostatic (ICON) model. With atmospheric temperature profile information as the only input, mloz produces stable ozone predictions around 31 times faster than the chemistry scheme in UKESM, contributing less than 4 percent of the respective total climate model runtimes. In particular, we also demonstrate its transferability to different climate models without chemistry schemes by transferring the parameterization from UKESM to ICON. This highlights the potential for widespread adoption in CMIP-level climate models that lack interactive chemistry for future climate change assessments, particularly when focusing on climate sensitivity simulations, where ozone trends and variability are known to significantly modulate atmospheric feedback processes.

new Bridging Privacy and Utility: Synthesizing anonymized EEG with constraining utility functions

Authors: Kay Fuhrmeister, Arne Pelzer, Fabian Radke, Julia Lechinger, Mahzad Gharleghi, Thomas Köllmer, Insa Wolf

Abstract: Electroencephalography (EEG) is widely used for recording brain activity and has seen numerous applications in machine learning, such as detecting sleep stages and neurological disorders. Several studies have successfully shown the potential of EEG data for re-identification and leakage of other personal information. Therefore, the increasing availability of EEG consumer devices raises concerns about user privacy, motivating us to investigate how to safeguard this sensitive data while retaining its utility for EEG applications. To address this challenge, we propose a transformer-based autoencoder to create EEG data that does not allow for subject re-identification while still retaining its utility for specific machine learning tasks. We apply our approach to automatic sleep staging by evaluating the re-identification and utility potential of EEG data before and after anonymization. The results show that the re-identifiability of the EEG signal can be substantially reduced while preserving its utility for machine learning.

new Efficiently Attacking Memorization Scores

Authors: Tue Do, Varun Chandrasekaran, Daniel Alabi

Abstract: Influence estimation tools -- such as memorization scores -- are widely used to understand model behavior, attribute training data, and inform dataset curation. However, recent applications in data valuation and responsible machine learning raise the question: can these scores themselves be adversarially manipulated? In this work, we present a systematic study of the feasibility of attacking memorization-based influence estimators. We characterize attacks that produce highly memorized samples as highly sensitive queries in the regime where the trained algorithm is accurate. Our attack (computing the pseudoinverse of the input) is practical, requiring only black-box access to model outputs and incurring modest computational overhead. We empirically validate our attack across a wide suite of image classification tasks, showing that even state-of-the-art proxies are vulnerable to targeted score manipulations. In addition, we provide a theoretical analysis of the stability of memorization scores under adversarial perturbations, revealing conditions under which influence estimates are inherently fragile. Our findings highlight critical vulnerabilities in influence-based attribution and suggest the need for robust defenses. All code can be found at https://anonymous.4open.science/r/MemAttack-5413/

URLs: https://anonymous.4open.science/r/MemAttack-5413/

new Offline Goal-conditioned Reinforcement Learning with Quasimetric Representations

Authors: Vivek Myers, Bill Chunyuan Zheng, Benjamin Eysenbach, Sergey Levine

Abstract: Approaches for goal-conditioned reinforcement learning (GCRL) often use learned state representations to extract goal-reaching policies. Two frameworks for representation structure have yielded particularly effective GCRL algorithms: (1) *contrastive representations*, in which methods learn "successor features" with a contrastive objective that performs inference over future outcomes, and (2) *temporal distances*, which link the (quasimetric) distance in representation space to the transit time from states to goals. We propose an approach that unifies these two frameworks, using the structure of a quasimetric representation space (triangle inequality) with the right additional constraints to learn successor representations that enable optimal goal-reaching. Unlike past work, our approach is able to exploit a **quasimetric** distance parameterization to learn **optimal** goal-reaching distances, even with **suboptimal** data and in **stochastic** environments. This gives us the best of both worlds: we retain the stability and long-horizon capabilities of Monte Carlo contrastive RL methods, while getting the free stitching capabilities of quasimetric network parameterizations. On existing offline GCRL benchmarks, our representation learning objective improves performance on stitching tasks where methods based on contrastive learning struggle, and on noisy, high-dimensional environments where methods based on quasimetric networks struggle.

new CoSupFormer: A Contrastive Supervised Learning Approach for EEG Signal Classification

Authors: D. Darankoum, C. Habermacher, J. Volle, S. Grudinin

Abstract: Electroencephalography (EEG) signals contain rich multi-scale information crucial for understanding brain states, with potential applications in diagnosis and in advancing the drug-development landscape. However, extracting meaningful features from raw EEG signals while handling noise and channel variability remains a major challenge. This work proposes a novel end-to-end deep-learning framework that addresses these issues through several key innovations. First, we designed an encoder capable of explicitly capturing multi-scale frequency oscillations, covering a wide range of features for different EEG-related tasks. Second, to model complex dependencies and handle the high temporal resolution of EEGs, we introduced an attention-based encoder that simultaneously learns interactions across EEG channels and within localized patches of individual channels. We integrated a dedicated gating network on top of the attention encoder to dynamically filter out noisy and non-informative channels, enhancing the reliability of EEG data. The entire encoding process is guided by a novel loss function that leverages supervised and contrastive learning, significantly improving model generalization. We validated our approach in multiple applications, ranging from classifying treatment effects across multiple Central Nervous System (CNS) disorders to diagnosing Parkinson's and Alzheimer's disease. Our results demonstrate that the proposed learning paradigm can extract biologically meaningful patterns from raw EEG signals across different species, autonomously select high-quality channels, and achieve robust generalization through innovative architectural and loss design.

new Beyond Visual Similarity: Rule-Guided Multimodal Clustering with explicit domain rules

Authors: Kishor Datta Gupta, Mohd Ariful Haque, Marufa Kamal, Ahmed Rafi Hasan, Md. Mahfuzur Rahman, Roy George

Abstract: Traditional clustering techniques often rely solely on similarity in the input data, limiting their ability to capture structural or semantic constraints that are critical in many domains. We introduce the Domain-Aware Rule-Triggered Variational Autoencoder (DARTVAE), a rule-guided multimodal clustering framework that incorporates domain-specific constraints directly into the representation learning process. DARTVAE extends the VAE architecture by embedding explicit rules, semantic representations, and data-driven features into a unified latent space, while enforcing constraint compliance through rule-consistency and violation penalties in the loss function. Unlike conventional clustering methods that rely only on visual similarity or apply rules as post hoc filters, DARTVAE treats rules as first-class learning signals. The rules are generated by LLMs, structured into knowledge graphs, and enforced through a loss function combining reconstruction, KL divergence, consistency, and violation penalties. Experiments on aircraft and automotive datasets demonstrate that rule-guided clustering produces more operationally meaningful and interpretable clusters (for example, isolating UAVs, unifying stealth aircraft, or separating SUVs from sedans) while improving traditional clustering metrics. However, the framework faces challenges: LLM-generated rules may hallucinate or conflict, excessive rules risk overfitting, and scaling to complex domains increases computational and consistency difficulties. By combining rule encodings with learned representations, DARTVAE achieves more meaningful and consistent clustering outcomes than purely data-driven models, highlighting the utility of constraint-guided multimodal clustering for complex, knowledge-intensive settings.

new Myosotis: structured computation for attention-like layers

Authors: Evgenii Egorov, Hanno Ackermann, Markus Nagel, Hong Cai

Abstract: Attention layers apply a sequence-to-sequence mapping whose parameters depend on the pairwise interactions of the input elements. Without structural assumptions, however, memory and compute scale quadratically with the sequence length. The two main ways to mitigate this are to introduce sparsity by ignoring a sufficient number of pairwise interactions, or to introduce recurrent dependence along them, as SSMs do. Both approaches are reasonable, but each has disadvantages. We propose a novel algorithm that combines the advantages of both concepts. Our idea is based on the efficient inversion of tree-structured matrices.

new Auto-Regressive U-Net for Full-Field Prediction of Shrinkage-Induced Damage in Concrete

Authors: Liya Gaynutdinova, Petr Havlásek, Ondřej Rokoš, Fleur Hendriks, Martin Doškář

Abstract: This paper introduces a deep learning approach for predicting time-dependent full-field damage in concrete. The study uses an auto-regressive U-Net model to predict the evolution of the scalar damage field in a unit cell given microstructural geometry and evolution of an imposed shrinkage profile. By sequentially using the predicted damage output as input for subsequent predictions, the model facilitates the continuous assessment of damage progression. Complementarily, a convolutional neural network (CNN) utilises the damage estimations to forecast key mechanical properties, including observed shrinkage and residual stiffness. The proposed dual-network architecture demonstrates high computational efficiency and robust predictive performance on the synthesised datasets. The approach reduces the computational load traditionally associated with full-field damage evaluations and is used to gain insights into the relationship between aggregate properties, such as shape, size, and distribution, and the effective shrinkage and reduction in stiffness. Ultimately, this can help to optimize concrete mix designs, leading to improved durability and reduced internal damage.

new Complexity-Driven Policy Optimization

Authors: Luca Serfilippi, Giorgio Franceschelli, Antonio Corradi, Mirco Musolesi

Abstract: Policy gradient methods often balance exploitation and exploration via entropy maximization. However, maximizing entropy pushes the policy towards a uniform random distribution, which represents an unstructured and sometimes inefficient exploration strategy. In this work, we propose replacing the entropy bonus with a more robust complexity bonus. In particular, we adopt a measure of complexity, defined as the product of Shannon entropy and disequilibrium, where the latter quantifies the distance from the uniform distribution. This regularizer encourages policies that balance stochasticity (high entropy) with structure (high disequilibrium), guiding agents toward regimes where useful, non-trivial behaviors can emerge. Such behaviors arise because the regularizer suppresses both extremes, i.e., maximal disorder and complete order, creating pressure for agents to discover structured yet adaptable strategies. Starting from Proximal Policy Optimization (PPO), we introduce Complexity-Driven Policy Optimization (CDPO), a new learning algorithm that replaces entropy with complexity. We show empirically across a range of discrete action space tasks that CDPO is more robust to the choice of the complexity coefficient than PPO is with the entropy coefficient, especially in environments requiring greater exploration.
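
As a concrete sketch of this regularizer: the complexity measure (often called LMC complexity) is the entropy times the distance to the uniform distribution. The snippet below uses the common squared-Euclidean-distance choice for disequilibrium, which may differ from the paper's exact normalization; names are illustrative.

    import torch

    def complexity_bonus(logits):
        """Complexity C(pi) = H(pi) * D(pi): Shannon entropy times
        disequilibrium (squared distance to the uniform distribution).
        C vanishes at both extremes: a delta (H = 0) and uniform (D = 0)."""
        p = torch.softmax(logits, dim=-1)
        log_p = torch.log_softmax(logits, dim=-1)
        H = -(p * log_p).sum(-1)                    # 0 at delta, max at uniform
        D = ((p - 1.0 / p.shape[-1]) ** 2).sum(-1)  # 0 at uniform
        return H * D

    # In a PPO-style objective one would add beta * complexity_bonus(logits).mean()
    # in place of the usual entropy bonus.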

new A Recovery Theory for Diffusion Priors: Deterministic Analysis of the Implicit Prior Algorithm

Authors: Oscar Leong, Yann Traonmilin

Abstract: Recovering high-dimensional signals from corrupted measurements is a central challenge in inverse problems. Recent advances in generative diffusion models have shown remarkable empirical success in providing strong data-driven priors, but rigorous recovery guarantees remain limited. In this work, we develop a theoretical framework for analyzing deterministic diffusion-based algorithms for inverse problems, focusing on a deterministic version of the algorithm proposed by Kadkhodaie & Simoncelli (2021). First, we show that when the underlying data distribution concentrates on a low-dimensional model set, the associated noise-convolved scores can be interpreted as time-varying projections onto such a set. This leads to interpreting previous algorithms using diffusion priors for inverse problems as generalized projected gradient descent methods with varying projections. When the sensing matrix satisfies a restricted isometry property over the model set, we can derive quantitative convergence rates that depend explicitly on the noise schedule. We apply our framework to two instructive data distributions: uniform distributions over low-dimensional compact, convex sets and low-rank Gaussian mixture models. In the latter setting, we can establish global convergence guarantees despite the nonconvexity of the underlying model set.
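
The "generalized projected gradient descent with varying projections" view admits a compact sketch: a gradient step on the data-fidelity term followed by a noise-level-dependent projection. Everything here is illustrative; `project` stands in for the denoiser-induced projection onto the noise-convolved model set, and the toy usage projects onto 1-sparse vectors.

    import numpy as np

    def pgd_with_varying_projections(A, y, project, sigmas, eta):
        """x_{k+1} = P_{sigma_k}(x_k - eta * A^T (A x_k - y)), where the
        projection P varies with the noise level sigma_k of the schedule."""
        x = np.zeros(A.shape[1])
        for sigma in sigmas:
            grad = A.T @ (A @ x - y)          # gradient of 0.5 * ||A x - y||^2
            x = project(x - eta * grad, sigma)
        return x

    # Toy usage with a known 1-sparse model set (projection ignores sigma).
    rng = np.random.default_rng(0)
    A = rng.normal(size=(10, 5))
    x_true = np.zeros(5); x_true[2] = 1.0
    y = A @ x_true

    def hard_threshold(x, sigma):
        z = np.zeros_like(x); i = np.argmax(np.abs(x)); z[i] = x[i]; return z

    eta = 1.0 / np.linalg.norm(A, 2) ** 2     # step size below 1/L
    x_hat = pgd_with_varying_projections(A, y, hard_threshold,
                                         np.linspace(1.0, 0.0, 50), eta)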

new MDBench: Benchmarking Data-Driven Methods for Model Discovery

Authors: Amirmohammad Ziaei Bideh, Aleksandra Georgievska, Jonathan Gryak

Abstract: Model discovery aims to uncover governing differential equations of dynamical systems directly from experimental data. Benchmarking such methods is essential for tracking progress and understanding trade-offs in the field. While prior efforts have focused mostly on identifying single equations, typically framed as symbolic regression, there remains a lack of comprehensive benchmarks for discovering dynamical models. To address this, we introduce MDBench, an open-source benchmarking framework for evaluating model discovery methods on dynamical systems. MDBench assesses 12 algorithms on 14 partial differential equations (PDEs) and 63 ordinary differential equations (ODEs) under varying levels of noise. Evaluation metrics include derivative prediction accuracy, model complexity, and equation fidelity. We also introduce seven challenging PDE systems from fluid dynamics and thermodynamics, revealing key limitations in current methods. Our findings illustrate that linear methods and genetic programming methods achieve the lowest prediction error for PDEs and ODEs, respectively. Moreover, linear models are in general more robust against noise. MDBench accelerates the advancement of model discovery methods by offering a rigorous, extensible benchmarking framework and a rich, diverse collection of dynamical system datasets, enabling systematic evaluation, comparison, and improvement of equation accuracy and robustness.

new Understanding and Improving Adversarial Robustness of Neural Probabilistic Circuits

Authors: Weixin Chen, Han Zhao

Abstract: Neural Probabilistic Circuits (NPCs), a new class of concept bottleneck models, comprise an attribute recognition model and a probabilistic circuit for reasoning. By integrating the outputs from these two modules, NPCs produce compositional and interpretable predictions. While offering enhanced interpretability and high performance on downstream tasks, the neural-network-based attribute recognition model remains a black box. This vulnerability allows adversarial attacks to manipulate attribute predictions by introducing carefully crafted subtle perturbations to input images, potentially compromising the final predictions. In this paper, we theoretically analyze the adversarial robustness of NPC and demonstrate that it only depends on the robustness of the attribute recognition model and is independent of the robustness of the probabilistic circuit. Moreover, we propose RNPC, the first robust neural probabilistic circuit against adversarial attacks on the recognition module. RNPC introduces a novel class-wise integration for inference, ensuring a robust combination of outputs from the two modules. Our theoretical analysis demonstrates that RNPC exhibits provably improved adversarial robustness compared to NPC. Empirical results on image classification tasks show that RNPC achieves superior adversarial robustness compared to existing concept bottleneck models while maintaining high accuracy on benign inputs.

new Generalizable Diabetes Risk Stratification via Hybrid Machine Learning Models

Authors: Athar Parvez, Muhammad Jawad Mufti

Abstract: Background/Purpose: Diabetes affects over 537 million people worldwide and is projected to reach 783 million by 2045. Early risk stratification can benefit from machine learning. We compare two hybrid classifiers and assess their generalizability on an external cohort. Methods: Two hybrids were built: (i) XGBoost + Random Forest (XGB-RF) and (ii) Support Vector Machine + Logistic Regression (SVM-LR). A leakage-safe, standardized pipeline (encoding, imputation, min-max scaling; SMOTE on training folds only; probability calibration for SVM) was fit on the primary dataset and frozen. Evaluation prioritized threshold-independent discrimination (AUROC/AUPRC) and calibration (Brier, slope/intercept). External validation used the PIMA cohort (N=768) with the frozen pipeline; any thresholded metrics on PIMA were computed at the default rule tau = 0.5. Results: On the primary dataset (PR baseline = 0.50), XGB-RF achieved AUROC ~0.995 and AUPRC ~0.998, outperforming SVM-LR (AUROC ~0.978; AUPRC ~0.947). On PIMA (PR baseline ~0.349), XGB-RF retained strong performance (AUROC ~0.990; AUPRC ~0.959); SVM-LR was lower (AUROC ~0.963; AUPRC ~0.875). Thresholded metrics on PIMA at tau = 0.5 were XGB-RF (Accuracy 0.960; Precision 0.941; Recall 0.944; F1 0.942) and SVM-LR (Accuracy 0.900; Precision 0.855; Recall 0.858; F1 0.857). Conclusions: Across internal and external cohorts, XGB-RF consistently dominated SVM-LR and exhibited smaller external attenuation on ROC/PR with acceptable calibration. These results support gradient-boosting-based hybridization as a robust, transferable approach for diabetes risk stratification and motivate prospective, multi-site validation with deployment-time threshold selection based on clinical trade-offs.

new PIRF: Physics-Informed Reward Fine-Tuning for Diffusion Models

Authors: Mingze Yuan, Pengfei Jin, Na Li, Quanzheng Li

Abstract: Diffusion models have demonstrated strong generative capabilities across scientific domains, but often produce outputs that violate physical laws. We propose a new perspective by framing physics-informed generation as a sparse reward optimization problem, where adherence to physical constraints is treated as a reward signal. This formulation unifies prior approaches under a reward-based paradigm and reveals a shared bottleneck: reliance on diffusion posterior sampling (DPS)-style value function approximations, which introduce non-negligible errors and lead to training instability and inference inefficiency. To overcome this, we introduce Physics-Informed Reward Fine-tuning (PIRF), a method that bypasses value approximation by computing trajectory-level rewards and backpropagating their gradients directly. However, a naive implementation suffers from low sample efficiency and compromised data fidelity. PIRF mitigates these issues through two key strategies: (1) a layer-wise truncated backpropagation method that leverages the spatiotemporally localized nature of physics-based rewards, and (2) a weight-based regularization scheme that improves efficiency over traditional distillation-based methods. Across five PDE benchmarks, PIRF consistently achieves superior physical enforcement under efficient sampling regimes, highlighting the potential of reward fine-tuning for advancing scientific generative modeling.

new The Sensitivity of Variational Bayesian Neural Network Performance to Hyperparameters

Authors: Scott Koermer, Natalie Klein

Abstract: In scientific applications, predictive modeling is often of limited use without accurate uncertainty quantification (UQ) to indicate when a model may be extrapolating or when more data needs to be collected. Bayesian Neural Networks (BNNs) produce predictive uncertainty by propagating uncertainty in neural network (NN) weights and offer the promise of obtaining not only an accurate predictive model but also accurate UQ. However, in practice, obtaining accurate UQ with BNNs is difficult due in part to the approximations used for practical model training and in part to the need to choose a suitable set of hyperparameters; these hyperparameters outnumber those needed for traditional NNs and often have opaque effects on the results. We aim to shed light on the effects of hyperparameter choices for BNNs by performing a global sensitivity analysis of BNN performance under varying hyperparameter settings. Our results indicate that many of the hyperparameters interact with each other to affect both predictive accuracy and UQ. For improved usage of BNNs in real-world applications, we suggest that global sensitivity analysis, or related methods such as Bayesian optimization, should be used to aid in dimensionality reduction and selection of hyperparameters to ensure accurate UQ in BNNs.

new Learning Green's Operators through Hierarchical Neural Networks Inspired by the Fast Multipole Method

Authors: Emilio McAllister Fognini, Marta M. Betcke, Ben T. Cox

Abstract: The Fast Multipole Method (FMM) is an efficient numerical algorithm for computing long-ranged forces in N-body problems within gravitational and electrostatic fields. This method utilizes multipole expansions of the Green's function inherent to the underlying dynamical systems. Despite its widespread application in physics and engineering, the integration of the FMM with modern machine learning architectures remains underexplored. In this work, we propose a novel neural network architecture, the Neural FMM, which integrates the information flow of the FMM into a hierarchical machine learning framework for learning the Green's operator of an elliptic PDE. Our Neural FMM architecture leverages the hierarchical computation flow of the FMM to split up the local and far-field interactions and efficiently learn their respective representations.

new TSKAN: Interpretable Machine Learning for QoE modeling over Time Series Data

Authors: Kamal Singh, Priyanka Rawat, Sami Marouani, Baptiste Jeudy

Abstract: Quality of Experience (QoE) modeling is crucial for optimizing video streaming services to capture the complex relationships between different features and user experience. We propose a novel approach to QoE modeling in video streaming applications using interpretable Machine Learning (ML) techniques over raw time series data. Unlike traditional black-box approaches, our method combines Kolmogorov-Arnold Networks (KANs) as an interpretable readout on top of compact frequency-domain features, allowing us to capture temporal information while retaining a transparent and explainable model. We evaluate our method on popular datasets and demonstrate its enhanced accuracy in QoE prediction, while offering transparency and interpretability.

new Explicit and Effectively Symmetric Schemes for Neural SDEs

Authors: Daniil Shmelev, Cristopher Salvi

Abstract: Backpropagation through (neural) SDE solvers is traditionally approached in two ways: discretise-then-optimise, which offers accurate gradients but incurs prohibitive memory costs due to storing the full computational graph (even when mitigated by checkpointing); and optimise-then-discretise, which achieves constant memory cost by solving an auxiliary backward SDE, but suffers from slower evaluation and gradient approximation errors. Algebraically reversible solvers promise both memory efficiency and gradient accuracy, yet existing methods such as the Reversible Heun scheme are often unstable under complex models and large step sizes. We address these limitations by introducing a novel class of stable, near-reversible Runge--Kutta schemes for neural SDEs. These Explicit and Effectively Symmetric (EES) schemes retain the benefits of reversible solvers while overcoming their instability, enabling memory-efficient training without severe restrictions on step size or model complexity. Through numerical experiments, we demonstrate the superior stability and reliability of our schemes, establishing them as a practical foundation for scalable and accurate training of neural SDEs.

new Function Spaces Without Kernels: Learning Compact Hilbert Space Representations

Authors: Su Ann Low, Quentin Rommel, Kevin S. Miller, Adam J. Thorpe, Ufuk Topcu

Abstract: Function encoders are a recent technique that learn neural network basis functions to form compact, adaptive representations of Hilbert spaces of functions. We show that function encoders provide a principled connection to feature learning and kernel methods by defining a kernel through an inner product of the learned feature map. This kernel-theoretic perspective explains their ability to scale independently of dataset size while adapting to the intrinsic structure of data, and it enables kernel-style analysis of neural models. Building on this foundation, we develop two training algorithms that learn compact bases: a progressive training approach that constructively grows bases, and a train-then-prune approach that offers a computationally efficient alternative after training. Both approaches use principles from PCA to reveal the intrinsic dimension of the learned space. In parallel, we derive finite-sample generalization bounds using Rademacher complexity and PAC-Bayes techniques, providing inference time guarantees. We validate our approach on a polynomial benchmark with a known intrinsic dimension, and on nonlinear dynamical systems including a Van der Pol oscillator and a two-body orbital model, demonstrating that the same accuracy can be achieved with substantially fewer basis functions. This work suggests a path toward neural predictors with kernel-level guarantees, enabling adaptable models that are both efficient and principled at scale.

new MMG: Mutual Information Estimation via the MMSE Gap in Diffusion

Authors: Longxuan Yu, Xing Shi, Xianghao Kong, Tong Jia, Greg Ver Steeg

Abstract: Mutual information (MI) is one of the most general ways to measure relationships between random variables, but estimating this quantity for complex systems is challenging. Denoising diffusion models have recently set a new bar for density estimation, so it is natural to consider whether these methods could also be used to improve MI estimation. Using the recently introduced information-theoretic formulation of denoising diffusion models, we show that diffusion models can be used in a straightforward way to estimate MI. In particular, the MI corresponds to half the gap in the Minimum Mean Square Error (MMSE) between conditional and unconditional diffusion, integrated over all Signal-to-Noise Ratios (SNRs) in the noising process. Our approach not only passes self-consistency tests but also outperforms traditional and score-based diffusion MI estimators. Furthermore, our method leverages adaptive importance sampling to achieve scalable MI estimation, while maintaining strong performance even when the MI is high.
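
The stated identity (MI equals half the conditional/unconditional MMSE gap integrated over SNR) turns into an estimator once the two MMSE curves are in hand, e.g., from the mean squared error of a conditional and an unconditional denoiser. The sketch below is illustrative and checks the identity on a closed-form Gaussian example.

    import numpy as np

    def mi_from_mmse_gap(mmse_uncond, mmse_cond, snr_grid):
        """MI(x; c) = 0.5 * integral over SNR of
        [ mmse(x | noisy x) - mmse(x | noisy x, c) ].
        Inputs are Monte Carlo MMSE estimates evaluated on snr_grid."""
        gap = np.asarray(mmse_uncond) - np.asarray(mmse_cond)
        return 0.5 * np.trapz(gap, snr_grid)

    # Sanity check with closed forms: x ~ N(0,1) and c = x (fully
    # informative), so mmse_cond = 0 and mmse_uncond(snr) = 1/(1+snr);
    # the truncated integral gives 0.5 * log(1 + SNR_max), diverging as
    # SNR_max grows, as expected since I(x; x) is infinite.
    snr = np.linspace(0.0, 100.0, 10_000)
    mi = mi_from_mmse_gap(1.0 / (1.0 + snr), np.zeros_like(snr), snr)
    print(mi, 0.5 * np.log1p(100.0))  # ~2.31 vs ~2.31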

new Policy Compatible Skill Incremental Learning via Lazy Learning Interface

Authors: Daehee Lee, Dongsu Lee, TaeYoon Kwack, Wonje Choi, Honguk Woo

Abstract: Skill Incremental Learning (SIL) is the process by which an embodied agent expands and refines its skill set over time by leveraging experience gained through interaction with its environment or by the integration of additional data. SIL facilitates efficient acquisition of hierarchical policies grounded in reusable skills for downstream tasks. However, as the skill repertoire evolves, it can disrupt compatibility with existing skill-based policies, limiting their reusability and generalization. In this work, we propose SIL-C, a novel framework that ensures skill-policy compatibility, allowing improvements in incrementally learned skills to enhance the performance of downstream policies without requiring policy re-training or structural adaptation. SIL-C employs a bilateral lazy learning-based mapping technique to dynamically align the subtask space referenced by policies with the skill space decoded into agent behaviors. This enables each subtask, derived from the policy's decomposition of a complex task, to be executed by selecting an appropriate skill based on trajectory distribution similarity. We evaluate SIL-C across diverse SIL scenarios and demonstrate that it maintains compatibility between evolving skills and downstream policies while ensuring efficiency throughout the learning process.

new Latent Twins

Authors: Matthias Chung, Deepanshu Verma, Max Collins, Amit N. Subrahmanya, Varuni Katti Sastry, Vishwas Rao

Abstract: Over the past decade, scientific machine learning has transformed the development of mathematical and computational frameworks for analyzing, modeling, and predicting complex systems. From inverse problems to numerical PDEs, dynamical systems, and model reduction, these advances have pushed the boundaries of what can be simulated. Yet they have often progressed in parallel, with representation learning and algorithmic solution methods evolving largely as separate pipelines. With Latent Twins, we propose a unifying mathematical framework that creates a hidden surrogate in latent space for the underlying equations. Whereas digital twins mirror physical systems in the digital world, Latent Twins mirror mathematical systems in a learned latent space governed by operators. Through this lens, classical modeling, inversion, model reduction, and operator approximation all emerge as special cases of a single principle. We establish the fundamental approximation properties of Latent Twins for both ODEs and PDEs and demonstrate the framework across three representative settings: (i) canonical ODEs, capturing diverse dynamical regimes; (ii) a PDE benchmark using the shallow-water equations, contrasting Latent Twin simulations with DeepONet and forecasts with a 4D-Var baseline; and (iii) a challenging real-data geopotential reanalysis dataset, reconstructing and forecasting from sparse, noisy observations. Latent Twins provide a compact, interpretable surrogate for solution operators that evaluates across arbitrary time gaps in a single shot, while remaining compatible with scientific pipelines such as assimilation, control, and uncertainty quantification. Looking forward, this framework offers scalable, theory-grounded surrogates that bridge data-driven representation learning and classical scientific modeling across disciplines.

new Training Task Reasoning LLM Agents for Multi-turn Task Planning via Single-turn Reinforcement Learning

Authors: Hanjiang Hu, Changliu Liu, Na Li, Yebin Wang

Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities in knowledge acquisition, reasoning, and tool use, making them promising candidates for autonomous agent applications. However, training LLM agents for complex multi-turn task planning faces significant challenges, including sparse episode-wise rewards, credit assignment across long horizons, and the computational overhead of reinforcement learning in multi-turn interaction settings. To this end, this paper introduces a novel approach that transforms multi-turn task planning into single-turn task reasoning problems, enabling efficient policy optimization through Group Relative Policy Optimization (GRPO) with dense and verifiable rewards from expert trajectories. Our theoretical analysis shows that GRPO improvement on single-turn task reasoning results in a higher multi-turn success probability with the minimal number of turns, as well as generalization to subtasks with shorter horizons. Experimental evaluation on a complex task planning benchmark demonstrates that our 1.5B-parameter model trained with single-turn GRPO achieves superior performance compared to larger baseline models of up to 14B parameters, with success rates of 70% for long-horizon planning tasks with over 30 steps. We also theoretically and empirically validate strong cross-task generalizability: models trained on complex tasks can successfully complete all of their simpler subtasks.
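
For reference, the group-relative advantage at the core of GRPO takes a few lines: rewards for a group of responses to the same prompt are standardized within the group, removing the need for a learned value function. A minimal sketch, with illustrative names:

    import torch

    def grpo_advantages(rewards, eps=1e-6):
        """Group Relative Policy Optimization advantages: for a group of
        G responses to the same prompt, standardize rewards within the
        group; no critic is required.
        rewards: tensor of shape (G,), one verifiable reward per response."""
        return (rewards - rewards.mean()) / (rewards.std() + eps)

    # Example: dense rewards from matching an expert single-turn plan.
    adv = grpo_advantages(torch.tensor([1.0, 0.0, 0.5, 1.0]))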

new Personalized Federated Dictionary Learning for Modeling Heterogeneity in Multi-site fMRI Data

Authors: Yipu Zhang, Chengshuo Zhang, Ziyu Zhou, Gang Qu, Hao Zheng, Yuping Wang, Hui Shen, Hongwen Deng

Abstract: Data privacy constraints pose significant challenges for large-scale neuroimaging analysis, especially in multi-site functional magnetic resonance imaging (fMRI) studies, where site-specific heterogeneity leads to non-independent and identically distributed (non-IID) data. These factors hinder the development of generalizable models. To address these challenges, we propose Personalized Federated Dictionary Learning (PFedDL), a novel federated learning framework that enables collaborative modeling across sites without sharing raw data. PFedDL performs independent dictionary learning at each site, decomposing each site-specific dictionary into a shared global component and a personalized local component. The global atoms are updated via federated aggregation to promote cross-site consistency, while the local atoms are refined independently to capture site-specific variability, thereby enhancing downstream analysis. Experiments on the ABIDE dataset demonstrate that PFedDL outperforms existing methods in accuracy and robustness across non-IID datasets.

new Investigating Modality Contribution in Audio LLMs for Music

Authors: Giovana Morais, Magdalena Fuentes

Abstract: Audio Large Language Models (Audio LLMs) enable human-like conversation about music, yet it is unclear if they are truly listening to the audio or just using textual reasoning, as recent benchmarks suggest. This paper investigates this issue by quantifying the contribution of each modality to a model's output. We adapt the MM-SHAP framework, a performance-agnostic score based on Shapley values that quantifies the relative contribution of each modality to a model's prediction. We evaluate two models on the MuChoMusic benchmark and find that the model with higher accuracy relies more on text to answer questions, but further inspection shows that even if the overall audio contribution is low, models can successfully localize key sound events, suggesting that audio is not entirely ignored. Our study is the first application of MM-SHAP to Audio LLMs and we hope it will serve as a foundational step for future research in explainable AI and audio.
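
For intuition, Shapley values have a closed form when the "players" are whole modalities rather than individual tokens (MM-SHAP itself operates at the token level). A toy two-modality sketch, with an invented value function:

    def modality_shapley(v):
        """Exact Shapley values for two players (audio, text). `v` maps a
        set of present modalities to a model score, e.g. the probability
        assigned to the chosen answer."""
        a = 0.5 * ((v({"audio"}) - v(set())) + (v({"audio", "text"}) - v({"text"})))
        t = 0.5 * ((v({"text"}) - v(set())) + (v({"audio", "text"}) - v({"audio"})))
        return {"audio": a, "text": t}

    # Toy value function: text carries most of the score, audio a little.
    scores = {frozenset(): 0.25, frozenset({"audio"}): 0.35,
              frozenset({"text"}): 0.70, frozenset({"audio", "text"}): 0.80}
    print(modality_shapley(lambda s: scores[frozenset(s)]))
    # {'audio': 0.10, 'text': 0.45}: relative contributions ~18% vs ~82%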

new Wonder Wins Ways: Curiosity-Driven Exploration through Multi-Agent Contextual Calibration

Authors: Yiyuan Pan, Zhe Liu, Hesheng Wang

Abstract: Autonomous exploration in complex multi-agent reinforcement learning (MARL) with sparse rewards critically depends on providing agents with effective intrinsic motivation. While artificial curiosity offers a powerful self-supervised signal, it often confuses environmental stochasticity with meaningful novelty. Moreover, existing curiosity mechanisms exhibit a uniform novelty bias, treating all unexpected observations equally. However, peer behavior novelty, which encodes latent task dynamics, is often overlooked, resulting in suboptimal exploration in decentralized, communication-free MARL settings. To this end, inspired by how human children adaptively calibrate their own exploratory behaviors by observing peers, we propose a novel approach to enhance multi-agent exploration. We introduce CERMIC, a principled framework that empowers agents to robustly filter noisy surprise signals and guide exploration by dynamically calibrating their intrinsic curiosity with inferred multi-agent context. Additionally, CERMIC generates theoretically grounded intrinsic rewards, encouraging agents to explore state transitions with high information gain. We evaluate CERMIC on benchmark suites including VMAS, Meltingpot, and SMACv2. Empirical results demonstrate that exploration with CERMIC significantly outperforms SoTA algorithms in sparse-reward environments.

new Guiding Application Users via Estimation of Computational Resources for Massively Parallel Chemistry Computations

Authors: Tanzila Tabassum, Omer Subasi, Ajay Panyala, Epiya Ebiapia, Gerald Baumgartner, Erdal Mutlu, P. Sadayappan, Karol Kowalski

Abstract: In this work, we develop machine learning (ML) based strategies to predict resources (costs) required for massively parallel chemistry computations, such as coupled-cluster methods, to guide application users before they commit to running expensive experiments on a supercomputer. By predicting application execution time, we determine the optimal runtime parameter values such as number of nodes and tile sizes. Two key questions of interest to users are addressed. The first is the shortest-time question, where the user is interested in knowing the parameter configurations (number of nodes and tile sizes) to achieve the shortest execution time for a given problem size and a target supercomputer. The second is the cheapest-run question in which the user is interested in minimizing resource usage, i.e., finding the number of nodes and tile size that minimizes the number of node-hours for a given problem size. We evaluate a rich family of ML models and strategies, developed based on the collections of runtime parameter values for the CCSD (Coupled Cluster with Singles and Doubles) application executed on the Department of Energy (DOE) Frontier and Aurora supercomputers. Our experiments show that when predicting the total execution time of a CCSD iteration, a Gradient Boosting (GB) ML model achieves a Mean Absolute Percentage Error (MAPE) of 0.023 and 0.073 for Aurora and Frontier, respectively. In the case where it is expensive to run experiments just to collect data points, we show that active learning can achieve a MAPE of about 0.2 with just around 450 experiments collected from Aurora and Frontier.
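
A minimal version of this prediction pipeline fits in a few lines of scikit-learn; the synthetic data below merely stands in for the collected CCSD measurements, and the grid search at the end illustrates the shortest-time question.

    import numpy as np
    from sklearn.ensemble import GradientBoostingRegressor
    from sklearn.metrics import mean_absolute_percentage_error
    from sklearn.model_selection import train_test_split

    # Toy stand-in for the collected runs: features are (problem size,
    # node count, tile size), target is iteration time; the real study
    # fits on measurements from Frontier and Aurora.
    rng = np.random.default_rng(0)
    X = rng.uniform([100, 8, 16], [500, 512, 256], size=(1000, 3))
    y = X[:, 0] ** 3 / (X[:, 1] * X[:, 2]) * (1 + 0.05 * rng.normal(size=1000))

    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
    model = GradientBoostingRegressor().fit(X_tr, y_tr)
    print(mean_absolute_percentage_error(y_te, model.predict(X_te)))

    # The shortest-time question is then a lookup over candidate configs:
    nodes, tiles = np.meshgrid(2 ** np.arange(3, 10), [16, 32, 64, 128, 256])
    grid = np.column_stack([np.full(nodes.size, 300), nodes.ravel(), tiles.ravel()])
    best = grid[np.argmin(model.predict(grid))]  # config minimizing predicted time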

new Theoretical Bounds for Stable In-Context Learning

Authors: Tongxi Wang, Zhuoyang Xia

Abstract: In-context learning (ICL) is flexible but its reliability is highly sensitive to prompt length. This paper establishes a non-asymptotic lower bound that links the minimal number of demonstrations to ICL stability under fixed high-dimensional sub-Gaussian representations. The bound gives explicit sufficient conditions in terms of spectral properties of the covariance, providing a computable criterion for practice. Building on this analysis, we propose a two-stage observable estimator with a one-shot calibration that produces practitioner-ready prompt-length estimates without distributional priors. Experiments across diverse datasets, encoders, and generators show close alignment between the predicted thresholds and empirical knee-points, with the theory acting as a conservative but reliable upper bound; the calibrated variant further tightens this gap. These results connect spectral coverage to stable ICL, bridge theory and deployment, and improve the interpretability and reliability of large-scale prompting in realistic finite-sample regimes.

new Bispectral OT: Dataset Comparison using Symmetry-Aware Optimal Transport

Authors: Annabel Ma, Kaiying Hou, David Alvarez-Melis, Melanie Weber

Abstract: Optimal transport (OT) is a widely used technique in machine learning, graphics, and vision that aligns two distributions or datasets using their relative geometry. In symmetry-rich settings, however, OT alignments based solely on pairwise geometric distances between raw features can ignore the intrinsic coherence structure of the data. We introduce Bispectral Optimal Transport, a symmetry-aware extension of discrete OT that compares elements via the bispectrum, a group Fourier invariant that preserves all signal structure while removing only the variation due to group actions. Empirically, we demonstrate that the transport plans computed with Bispectral OT achieve greater class preservation accuracy than naive feature OT on benchmark datasets transformed with visual symmetries, improving the quality of meaningful correspondences that capture the underlying semantic label structure in the dataset while removing nuisance variation not affecting class or content.
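
For a 1D signal under the cyclic (translation) group, the bispectrum has the closed form B(k1, k2) = F(k1) F(k2) conj(F(k1 + k2)), which is exactly shift-invariant. The sketch below pairs it with a hand-rolled Sinkhorn solver purely for illustration (a library such as POT would normally be used) and matches two datasets that differ only by circular shifts.

    import numpy as np

    def bispectrum(x):
        """B(k1, k2) = F(k1) F(k2) conj(F(k1 + k2)): shifting x multiplies
        F(k) by a phase exp(-i theta k), and the phases cancel in B."""
        F = np.fft.fft(x)
        k = np.arange(len(F))
        return F[:, None] * F[None, :] * np.conj(F[(k[:, None] + k[None, :]) % len(F)])

    def sinkhorn(C, reg=0.05, iters=200):
        """Tiny entropic-OT solver with uniform marginals (illustration only)."""
        n, m = C.shape
        K = np.exp(-C / reg)
        u, v = np.full(n, 1.0 / n), np.full(m, 1.0 / m)
        for _ in range(iters):
            u = (1.0 / n) / (K @ v)
            v = (1.0 / m) / (K.T @ u)
        return u[:, None] * K * v[None, :]        # transport plan

    # Datasets differing only by circular shifts match perfectly in
    # bispectrum space, since the invariant removes the shift.
    rng = np.random.default_rng(0)
    base = rng.normal(size=(4, 32))
    shifted = np.stack([np.roll(s, rng.integers(32)) for s in base])

    def feats(D):
        return np.array([bispectrum(s).ravel() for s in D])

    Fa, Fb = feats(base), feats(shifted)
    C = (np.abs(Fa[:, None, :] - Fb[None, :, :]) ** 2).sum(-1)
    plan = sinkhorn(C / C.max())                  # mass concentrates on i <-> i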

new Can Federated Learning Safeguard Private Data in LLM Training? Vulnerabilities, Attacks, and Defense Evaluation

Authors: Wenkai Guo, Xuefeng Liu, Haolin Wang, Jianwei Niu, Shaojie Tang, Jing Yuan

Abstract: Fine-tuning large language models (LLMs) with local data is a widely adopted approach for organizations seeking to adapt LLMs to their specific domains. Given the shared characteristics of data across different organizations, the idea of collaboratively fine-tuning an LLM using data from multiple sources presents an appealing opportunity. However, organizations are often reluctant to share local data, making centralized fine-tuning impractical. Federated learning (FL), a privacy-preserving framework, enables clients to retain local data while sharing only model parameters for collaborative training, offering a potential solution. While fine-tuning LLMs on centralized datasets risks data leakage through next-token prediction, the iterative aggregation process in FL results in a global model that encapsulates generalized knowledge, which some believe protects client privacy. In this paper, however, we present contradictory findings through extensive experiments. We show that attackers can still extract training data from the global model, even using straightforward generation methods, with leakage increasing as the model size grows. Moreover, we introduce an enhanced attack strategy tailored to FL, which tracks global model updates during training to intensify privacy leakage. To mitigate these risks, we evaluate privacy-preserving techniques in FL, including differential privacy, regularization-constrained updates, and the adoption of LLMs with safety alignment. Our results provide valuable insights and practical guidelines for reducing privacy risks when training LLMs with FL.

new Learning to Align Molecules and Proteins: A Geometry-Aware Approach to Binding Affinity

Authors: Mohammadsaleh Refahi, Bahrad A. Sokhansanj, James R. Brown, Gail Rosen

Abstract: Accurate prediction of drug-target binding affinity can accelerate drug discovery by prioritizing promising compounds before costly wet-lab screening. While deep learning has advanced this task, most models fuse ligand and protein representations via simple concatenation and lack explicit geometric regularization, resulting in poor generalization across chemical space and time. We introduce FIRM-DTI, a lightweight framework that conditions molecular embeddings on protein embeddings through a feature-wise linear modulation (FiLM) layer and enforces metric structure with a triplet loss. An RBF regression head operating on embedding distances yields smooth, interpretable affinity predictions. Despite its modest size, FIRM-DTI achieves state-of-the-art performance on the Therapeutics Data Commons DTI-DG benchmark, as demonstrated by an extensive ablation study and out-of-domain evaluation. Our results underscore the value of conditioning and metric learning for robust drug-target affinity prediction.
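
A minimal PyTorch rendering of the described pieces (FiLM conditioning, a distance-based RBF head, and a triplet loss for metric structure) might look as follows. Dimensions, the anchor projection, and the affine output map are assumptions for illustration, not the authors' architecture.

    import torch
    import torch.nn as nn

    class FiLMAffinityHead(nn.Module):
        def __init__(self, lig_dim=256, prot_dim=1024, d=128):
            super().__init__()
            self.lig = nn.Linear(lig_dim, d)
            self.film = nn.Linear(prot_dim, 2 * d)  # emits (gamma, beta)
            self.anchor = nn.Linear(prot_dim, d)    # protein's point in the metric space
            self.log_sigma = nn.Parameter(torch.zeros(1))
            self.out = nn.Linear(1, 1)              # affine map from RBF value to affinity

        def forward(self, ligand, protein):
            gamma, beta = self.film(protein).chunk(2, dim=-1)
            z = gamma * self.lig(ligand) + beta     # FiLM: feature-wise scale and shift
            dist = (z - self.anchor(protein)).norm(dim=-1, keepdim=True)
            rbf = torch.exp(-dist ** 2 / (2 * self.log_sigma.exp() ** 2))
            return self.out(rbf).squeeze(-1), z     # smooth affinity from distance

    model = FiLMAffinityHead()
    affinity, z = model(torch.randn(8, 256), torch.randn(8, 1024))
    # Metric structure on z is enforced with a standard triplet loss:
    triplet = nn.TripletMarginLoss(margin=1.0)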

new CE-GPPO: Controlling Entropy via Gradient-Preserving Clipping Policy Optimization in Reinforcement Learning

Authors: Zhenpeng Su, Leiyu Pan, Minxuan Lv, Yuntao Li, Wenping Hu, Fuzheng Zhang, Kun Gai, Guorui Zhou

Abstract: Reinforcement learning (RL) has become a powerful paradigm for optimizing large language models (LLMs) to handle complex reasoning tasks. A core challenge in this process lies in managing policy entropy, which reflects the balance between exploration and exploitation during training. Existing methods, such as proximal policy optimization (PPO) and its variants, discard valuable gradient signals from low-probability tokens due to the clipping mechanism. We systematically analyze the entropy dynamics and reveal that these clipped tokens play a critical yet overlooked role in regulating entropy evolution. We propose Controlling Entropy via Gradient-Preserving Policy Optimization (CE-GPPO), a novel algorithm that reintroduces gradients from clipped tokens in native PPO in a gentle and bounded manner. By controlling the magnitude of gradients from tokens outside the clipping interval, CE-GPPO is able to achieve an exploration-exploitation trade-off. We provide theoretical justification and empirical evidence showing that CE-GPPO effectively mitigates entropy instability. Extensive experiments on mathematical reasoning benchmarks show that CE-GPPO consistently outperforms strong baselines across different model scales.
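
One plausible way to realize gradient-preserving clipping is a detach trick: keep the hard-clipped value in the forward pass but let a scaled-down gradient flow for ratios outside the clipping interval instead of zeroing it. This sketch is our reading of the description, not the paper's exact formulation, and it omits PPO's pessimistic min for brevity.

    import torch

    def gradient_preserving_clip(ratio, lo, hi, grad_scale=0.1):
        """Forward value equals clamp(ratio, lo, hi); backward passes the
        full gradient inside [lo, hi] and grad_scale times it outside,
        rather than the zero gradient of a hard clip."""
        inside = (ratio > lo) & (ratio < hi)
        clipped = ratio.clamp(lo, hi)
        soft = clipped.detach() + grad_scale * (ratio - ratio.detach())
        return torch.where(inside, ratio, soft)

    def surrogate(logp_new, logp_old, adv, eps=0.2):
        ratio = (logp_new - logp_old).exp()
        r = gradient_preserving_clip(ratio, 1.0 - eps, 1.0 + eps)
        return -(r * adv).mean()                  # simplified PPO-style objective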

new A Genetic Algorithm for Navigating Synthesizable Molecular Spaces

Authors: Alston Lo, Connor W. Coley, Wojciech Matusik

Abstract: Inspired by the effectiveness of genetic algorithms and the importance of synthesizability in molecular design, we present SynGA, a simple genetic algorithm that operates directly over synthesis routes. Our method features custom crossover and mutation operators that explicitly constrain it to synthesizable molecular space. By modifying the fitness function, we demonstrate the effectiveness of SynGA on a variety of design tasks, including synthesizable analog search and sample-efficient property optimization, for both 2D and 3D objectives. Furthermore, by coupling SynGA with a machine learning-based filter that focuses the building block set, we boost SynGA to state-of-the-art performance. For property optimization, this manifests as a model-based variant SynGBO, which employs SynGA and block filtering in the inner loop of Bayesian optimization. Since SynGA is lightweight and enforces synthesizability by construction, our hope is that SynGA can not only serve as a strong standalone baseline but also as a versatile module that can be incorporated into larger synthesis-aware workflows in the future.

new Scaling Laws are Redundancy Laws

Authors: Yuda Bi, Vince D Calhoun

Abstract: Scaling laws, a defining feature of deep learning, reveal a striking power-law improvement in model performance with increasing dataset and model size. Yet, their mathematical origins, especially the scaling exponent, have remained elusive. In this work, we show that scaling laws can be formally explained as redundancy laws. Using kernel regression, we show that a polynomial tail in the data covariance spectrum yields an excess risk power law with exponent alpha = 2s / (2s + 1/beta), where beta controls the spectral tail and 1/beta measures redundancy. This reveals that the learning curve's slope is not universal but depends on data redundancy, with steeper spectra accelerating returns to scale. We establish the law's universality across boundedly invertible transformations, multi-modal mixtures, finite-width approximations, and Transformer architectures in both linearized (NTK) and feature-learning regimes. This work delivers the first rigorous mathematical explanation of scaling laws as finite-sample redundancy laws, unifying empirical observations with theoretical foundations.
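
To make the exponent formula concrete: a heavier spectral tail (smaller beta, hence larger redundancy 1/beta) flattens the learning curve, while steeper spectra push alpha toward 1. Illustrative values:

    # alpha = 2s / (2s + 1/beta): s = source smoothness, beta = spectral
    # tail parameter, 1/beta = redundancy (values here are illustrative).
    for s, beta in [(1.0, 0.5), (1.0, 1.0), (1.0, 2.0)]:
        alpha = 2 * s / (2 * s + 1 / beta)
        print(f"s={s}, beta={beta}: alpha={alpha:.3f}")
    # -> 0.500, 0.667, 0.800: steeper spectra accelerate returns to scale.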

new The Impact of Audio Watermarking on Audio Anti-Spoofing Countermeasures

Authors: Zhenshan Zhang, Xueping Zhang, Yechen Wang, Liwei Jin, Ming Li

Abstract: This paper presents the first study on the impact of audio watermarking on spoofing countermeasures. While anti-spoofing systems are essential for securing speech-based applications, the influence of widely used audio watermarking, originally designed for copyright protection, remains largely unexplored. We construct watermark-augmented training and evaluation datasets, named the Watermark-Spoofing dataset, by applying diverse handcrafted and neural watermarking methods to existing anti-spoofing datasets. Experiments show that watermarking consistently degrades anti-spoofing performance, with higher watermark density correlating with higher Equal Error Rates (EERs). To mitigate this, we propose the Knowledge-Preserving Watermark Learning (KPWL) framework, enabling models to adapt to watermark-induced shifts while preserving their original-domain spoofing detection capability. These findings reveal audio watermarking as a previously overlooked domain shift and establish the first benchmark for developing watermark-resilient anti-spoofing systems. All related protocols are publicly available at https://github.com/Alphawarheads/Watermark_Spoofing.git

URLs: https://github.com/Alphawarheads/Watermark_Spoofing.git

new Measuring LLM Sensitivity in Transformer-based Tabular Data Synthesis

Authors: Maria F. Davila R, Azizjon Turaev, Wolfram Wingerath

Abstract: Synthetic tabular data is used for privacy-preserving data sharing and data-driven model development. Its effectiveness, however, depends heavily on the Tabular Data Synthesis (TDS) tool used. Recent studies have shown that Transformer-based models outperform other state-of-the-art models, such as Generative Adversarial Networks (GANs) and Diffusion models, in terms of data quality. However, Transformer-based models also come with high computational costs, sometimes making them infeasible for end users with prosumer hardware. This study presents a sensitivity assessment of how the choice of hyperparameters, such as the number of layers or the hidden dimension, affects the quality of the resultant synthetic data and the computational performance. It is performed across two tools, GReaT and REaLTabFormer, evaluating 10 model setups that vary in architecture type and depth. We assess the sensitivity along three dimensions: runtime, machine learning (ML) utility, and similarity to real data distributions. Experiments were conducted on four real-world datasets. Our findings reveal that runtime is proportional to the number of hyperparameters, with shallower configurations completing faster. GReaT consistently achieves lower runtimes than REaLTabFormer, and only on the largest dataset do the two have comparable runtimes. For small datasets, both tools achieve synthetic data with high utility and optimal similarity, but on larger datasets only REaLTabFormer sustains strong utility and similarity. As a result, REaLTabFormer with lightweight LLMs provides the best balance, since it preserves data quality while reducing computational requirements. Nonetheless, its runtime remains higher than that of GReaT and other TDS tools, suggesting that efficiency gains are possible but only up to a certain level.

new Sig2Model: A Boosting-Driven Model for Updatable Learned Indexes

Authors: Alireza Heidari, Amirhossein Ahmad, Wei Zhang, Ying Xiong

Abstract: Learned Indexes (LIs) represent a paradigm shift from traditional index structures by employing machine learning models to approximate the cumulative distribution function (CDF) of sorted data. While LIs achieve remarkable efficiency for static datasets, their performance degrades under dynamic updates: maintaining the CDF invariant (sum of F(k) equals 1) requires global model retraining, which blocks queries and limits the queries-per-second (QPS) metric. Current approaches fail to address these retraining costs effectively, rendering them unsuitable for real-world workloads with frequent updates. In this paper, we present Sig2Model, an efficient and adaptive learned index that minimizes retraining cost through three key techniques: (1) a sigmoid boosting approximation technique that dynamically adjusts the index model by approximating update-induced shifts in data distribution with localized sigmoid functions while preserving bounded error guarantees and deferring full retraining; (2) proactive update training via Gaussian mixture models (GMMs) that identifies high-update-probability regions for strategic placeholder allocation to speed up updates; and (3) a neural joint optimization framework that continuously refines both the sigmoid ensemble and GMM parameters via gradient-based learning. We evaluate Sig2Model against state-of-the-art updatable learned indexes on real-world and synthetic workloads, and show that Sig2Model reduces retraining cost by up to 20x, achieves up to 3x higher QPS, and uses up to 1000x less memory.
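
The sigmoid-boosting idea can be sketched in a few lines: a frozen base CDF model plus localized sigmoid corrections appended at update time, deferring full retraining. The linear base model, the fixed steepness, and the insert logic below are all illustrative assumptions, not the paper's implementation.

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    class SigmoidBoostedIndex:
        """Frozen base position model plus localized sigmoid corrections:
        an insertion near key k0 adds a steep sigmoid that shifts the
        predicted positions of keys above k0 by ~1 slot, no retraining."""
        def __init__(self, keys):
            self.n = len(keys)
            A = np.vstack([keys, np.ones_like(keys)]).T
            self.w, self.b = np.linalg.lstsq(A, np.arange(self.n), rcond=None)[0]
            self.corrections = []                 # list of (k0, steepness)

        def predict(self, key):
            pos = self.w * key + self.b
            for k0, s in self.corrections:
                pos += sigmoid(s * (key - k0))    # +~1 for keys past k0
            return int(round(pos))

        def insert(self, key):
            self.corrections.append((key, 50.0))  # localized shift, no retrain
            self.n += 1

    keys = np.sort(np.random.default_rng(0).uniform(0, 1, 100))
    idx = SigmoidBoostedIndex(keys)
    idx.insert(0.5)  # later queries for keys > 0.5 see positions shifted by ~1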

new IConv: Focusing on Local Variation with Channel Independent Convolution for Multivariate Time Series Forecasting

Authors: Gawon Lee, Hanbyeol Park, Minseop Kim, Dohee Kim, Hyerim Bae

Abstract: Real-world time-series data often exhibit non-stationarity, including changing trends, irregular seasonality, and residuals. For changing trends, recently proposed multi-layer perceptron (MLP)-based models have shown excellent performance owing to their computational efficiency and ability to capture long-term dependencies. However, the linear nature of MLP architectures poses limitations when applied to channels with diverse distributions, causing local variations such as seasonal patterns and residual components to be ignored. In contrast, convolutional neural networks (CNNs) can effectively incorporate such variations. To resolve the limitations of MLPs, we propose combining them with CNNs. The overall trend is modeled using an MLP to account for long-term dependencies. The CNN uses diverse kernels to model fine-grained local patterns in conjunction with MLP trend predictions. To focus on modeling local variation, we propose IConv, a novel convolutional architecture that processes temporal dependencies channel-independently and considers inter-channel relationships through distinct layers. Independent channel processing enables the modeling of diverse local temporal dependencies and the adoption of a large kernel size. Distinct inter-channel consideration reduces computational cost. The proposed model is evaluated through extensive experiments on time-series datasets. The results reveal the superiority of the proposed method for multivariate time-series forecasting.
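
The channel-independent temporal convolution at the heart of this design can be approximated with a grouped convolution. The PyTorch sketch below is an illustrative reading of the abstract (the authors' exact layer design may differ): `groups=channels` makes the temporal kernel per-channel, affording a large kernel size, while a separate 1x1 convolution supplies the distinct inter-channel layer.

```python
import torch
import torch.nn as nn

class ChannelIndependentConv(nn.Module):
    """Sketch (not the authors' exact IConv): a depthwise Conv1d processes
    each channel's temporal pattern independently, and a separate 1x1 conv
    models inter-channel relationships at low cost."""

    def __init__(self, channels: int, kernel_size: int = 51):
        super().__init__()
        self.temporal = nn.Conv1d(
            channels, channels, kernel_size,
            padding=kernel_size // 2, groups=channels)  # channel-independent
        self.inter_channel = nn.Conv1d(channels, channels, kernel_size=1)

    def forward(self, x):                  # x: (batch, channels, time)
        local = self.temporal(x)           # per-channel local variation
        return self.inter_channel(local)   # distinct inter-channel mixing

x = torch.randn(8, 7, 96)                  # e.g., 7 variables, 96 time steps
print(ChannelIndependentConv(7)(x).shape)  # torch.Size([8, 7, 96])
```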

new LiLAW: Lightweight Learnable Adaptive Weighting to Meta-Learn Sample Difficulty and Improve Noisy Training

Authors: Abhishek Moturu, Anna Goldenberg, Babak Taati

Abstract: Training deep neural networks in the presence of noisy labels and data heterogeneity is a major challenge. We introduce Lightweight Learnable Adaptive Weighting (LiLAW), a novel method that dynamically adjusts the loss weight of each training sample based on its evolving difficulty level, categorized as easy, moderate, or hard. Using only three learnable parameters, LiLAW adaptively prioritizes informative samples throughout training by updating these weights using a single mini-batch gradient descent step on the validation set after each training mini-batch, without requiring excessive hyperparameter tuning or a clean validation set. Extensive experiments across multiple general and medical imaging datasets, noise levels and types, loss functions, and architectures with and without pretraining demonstrate that LiLAW consistently enhances performance, even in high-noise environments. It is effective without heavy reliance on data augmentation or advanced regularization, highlighting its practicality. It offers a computationally efficient solution to boost model generalization and robustness in any neural network training setup.
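
The three-parameter weighting scheme is simple enough to sketch. The fragment below is a hedged reconstruction from the abstract: the tercile bucketing and the softplus parameterization are assumptions, and `model`, `opt`, and `criterion_none` are illustrative placeholders rather than the authors' code.

```python
import torch

# Three learnable weights, one per difficulty level (easy/moderate/hard).
weights = torch.zeros(3, requires_grad=True)
w_opt = torch.optim.SGD([weights], lr=0.01)

def difficulty_bucket(losses):
    # Assign 0/1/2 by per-batch loss terciles (an assumption of this sketch).
    lo, hi = torch.quantile(losses, torch.tensor([1 / 3, 2 / 3]))
    return (losses > lo).long() + (losses > hi).long()

def weighted_loss(per_sample_losses):
    w = torch.nn.functional.softplus(weights)       # keep weights positive
    bucket = difficulty_bucket(per_sample_losses.detach())
    return (w[bucket] * per_sample_losses).mean()

# Per training mini-batch:
#   loss = weighted_loss(criterion_none(model(x), y))   # reduction='none'
#   opt.zero_grad(); loss.backward(); opt.step()
# Then one gradient step for the three weights on a validation mini-batch:
#   val_loss = weighted_loss(criterion_none(model(xv), yv))
#   w_opt.zero_grad(); val_loss.backward(); w_opt.step()
```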

new Aligning Inductive Bias for Data-Efficient Generalization in State Space Models

Authors: Qiyu Chen, Guozhang Chen

Abstract: The remarkable success of large-scale models is fundamentally tied to scaling laws, yet the finite nature of high-quality data presents a looming challenge. One of the next frontiers in modeling is data efficiency: the ability to learn more from less. A model's inductive bias is a critical lever for this, but foundational sequence models like State Space Models (SSMs) rely on a fixed bias. This fixed prior is sample-inefficient when a task's underlying structure does not match. In this work, we introduce a principled framework to solve this problem. We first formalize the inductive bias of linear time-invariant SSMs through an SSM-induced kernel, mathematically and empirically proving its spectrum is directly governed by the model's frequency response. Further, we propose a method of Task-Dependent Initialization (TDI): power spectrum matching, a fast and efficient method that aligns the model's inductive bias with the task's spectral characteristics before large-scale training. Our experiments on a diverse set of real-world benchmarks show that TDI significantly improves generalization and sample efficiency, particularly in low-data regimes. This work provides a theoretical and practical tool to create more data-efficient models, a crucial step towards sustainable scaling.

new FERD: Fairness-Enhanced Data-Free Robustness Distillation

Authors: Zhengxiao Li, Liming Lu, Xu Zheng, Siyuan Liang, Zhenghan Chen, Yongbin Zhou, Shuchao Pang

Abstract: Data-Free Robustness Distillation (DFRD) aims to transfer robustness from a teacher to a student without accessing the training data. While existing methods focus on overall robustness, they overlook robust fairness, leading to severe disparities in robustness across different categories. In this paper, we identify two key problems: (1) a student model distilled with equal class proportions behaves significantly differently across distinct categories; and (2) the robustness of the student model is not stable across different attack targets. To bridge these gaps, we present the first Fairness-Enhanced data-free Robustness Distillation (FERD) framework, which adjusts the proportion and distribution of adversarial examples. For the proportion, FERD adopts a robustness-guided class reweighting strategy that synthesizes more samples for the less robust categories, thereby improving their robustness. For the distribution, FERD generates complementary data samples for advanced robustness distillation. It generates Fairness-Aware Examples (FAEs) by enforcing a uniformity constraint on feature-level predictions, which suppresses the dominance of class-specific non-robust features and provides a more balanced representation across all categories. FERD then constructs Uniform-Target Adversarial Examples (UTAEs) from FAEs by applying a uniform target-class constraint to avoid biased attack directions, which distributes the attack targets across all categories and prevents overfitting to specific vulnerable categories. Extensive experiments on three public datasets show that FERD achieves state-of-the-art worst-class robustness under all adversarial attacks (e.g., worst-class robustness under FGSM and AutoAttack improves by 15.1\% and 6.4\% using MobileNet-V2 on CIFAR-10), demonstrating superior performance in both robustness and fairness.

new T2I-Diff: fMRI Signal Generation via Time-Frequency Image Transform and Classifier-Free Denoising Diffusion Models

Authors: Hwa Hui Tew, Junn Yong Loo, Yee-Fan Tan, Xinyu Tang, Hernando Ombao, Fuad Noman, Raphael C. -W. Phan, Chee-Ming Ting

Abstract: Functional Magnetic Resonance Imaging (fMRI) is an advanced neuroimaging method that enables in-depth analysis of brain activity by measuring dynamic changes in the blood oxygenation level-dependent (BOLD) signals. However, the resource-intensive nature of fMRI data acquisition limits the availability of high-fidelity samples required for data-driven brain analysis models. While modern generative models can synthesize fMRI data, they often underperform because they overlook the complex non-stationarity and nonlinear BOLD dynamics. To address these challenges, we introduce T2I-Diff, an fMRI generation framework that leverages time-frequency representation of BOLD signals and classifier-free denoising diffusion. Specifically, our framework first converts BOLD signals into windowed spectrograms via a time-dependent Fourier transform, capturing both the underlying temporal dynamics and spectral evolution. Subsequently, a classifier-free diffusion model is trained to generate class-conditioned frequency spectrograms, which are then reverted to BOLD signals via inverse Fourier transforms. Finally, we validate the efficacy of our approach by demonstrating improved accuracy and generalization in downstream fMRI-based brain network classification.
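
The time-frequency round trip the abstract describes maps directly onto SciPy's short-time Fourier transform. A minimal sketch with illustrative window sizes and sampling rate (the paper's actual settings are not given here):

```python
import numpy as np
from scipy.signal import stft, istft

# BOLD signal -> windowed spectrogram -> (diffusion model) -> BOLD signal.
fs = 0.5                                  # e.g., TR = 2 s (an assumption)
bold = np.random.randn(512)               # placeholder BOLD time series

f, t, Z = stft(bold, fs=fs, nperseg=64, noverlap=48)  # complex spectrogram
spectrogram = np.abs(Z)                   # representation fed to the model

# After the diffusion model generates a spectrogram, revert to a signal:
_, reconstructed = istft(Z, fs=fs, nperseg=64, noverlap=48)
print(spectrogram.shape, reconstructed.shape)
```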

new CaTS-Bench: Can Language Models Describe Numeric Time Series?

Authors: Luca Zhou, Pratham Yashwante, Marshall Fisher, Alessio Sampieri, Zihao Zhou, Fabio Galasso, Rose Yu

Abstract: Time series captioning, the task of describing numeric time series in natural language, requires numerical reasoning, trend interpretation, and contextual understanding. Existing benchmarks, however, often rely on synthetic data or overly simplistic captions, and typically neglect metadata and visual representations. To close this gap, we introduce CaTS-Bench, the first large-scale, real-world benchmark for Context-aware Time Series captioning. CaTS-Bench is derived from 11 diverse datasets reframed as captioning and Q&A tasks, comprising roughly 465k training and 105k test timestamps. Each sample includes a numeric series segment, contextual metadata, a line-chart image, and a caption. A key contribution of this work is the scalable pipeline used to generate reference captions: while most references are produced by an oracle LLM and verified through factual checks, human indistinguishability studies, and diversity analyses, we also provide a human-revisited subset of 579 test captions, refined from LLM outputs to ensure accuracy and human-like style. Beyond captioning, CaTS-Bench offers 460 multiple-choice questions targeting deeper aspects of time series reasoning. We further propose new tailored evaluation metrics and benchmark leading VLMs, highlighting both their strengths and persistent limitations. Together, these contributions establish CaTS-Bench and its captioning pipeline as a reliable and extensible foundation for future research at the intersection of time series analysis and foundation models.

new Explaining Grokking and Information Bottleneck through Neural Collapse Emergence

Authors: Keitaro Sakamoto, Issei Sato

Abstract: The training dynamics of deep neural networks often defy expectations, even as these models form the foundation of modern machine learning. Two prominent examples are grokking, where test performance improves abruptly long after the training loss has plateaued, and the information bottleneck principle, where models progressively discard input information irrelevant to the prediction task as training proceeds. However, the mechanisms underlying these phenomena and their relations remain poorly understood. In this work, we present a unified explanation of such late-phase phenomena through the lens of neural collapse, which characterizes the geometry of learned representations. We show that the contraction of population within-class variance is a key factor underlying both grokking and information bottleneck, and relate this measure to the neural collapse measure defined on the training set. By analyzing the dynamics of neural collapse, we show that distinct time scales between fitting the training set and the progression of neural collapse account for the behavior of the late-phase phenomena. Finally, we validate our theoretical findings on multiple datasets and architectures.
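
The key quantity, within-class variance of learned representations, is straightforward to track. A minimal NumPy sketch of the training-set version of the measure (the population version requires held-out data):

```python
import numpy as np

def within_class_variance(features, labels):
    """Average within-class variance of penultimate-layer features; the
    contraction of this quantity is the neural-collapse signal tied to
    grokking and the information bottleneck in the abstract."""
    classes = np.unique(labels)
    total = 0.0
    for c in classes:
        fc = features[labels == c]
        total += ((fc - fc.mean(axis=0)) ** 2).sum(axis=1).mean()
    return total / len(classes)

feats = np.random.randn(1000, 128)         # placeholder representations
labs = np.random.randint(0, 10, 1000)
print(within_class_variance(feats, labs))  # track this across training
```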

new Shaping Initial State Prevents Modality Competition in Multi-modal Fusion: A Two-stage Scheduling Framework via Fast Partial Information Decomposition

Authors: Jiaqi Tang, Yinsong Xu, Yang Liu, Qingchao Chen

Abstract: Multi-modal fusion often suffers from modality competition during joint training, where one modality dominates the learning process, leaving others under-optimized. Overlooking the critical impact of the model's initial state, most existing methods address this issue during the joint learning stage. In this study, we introduce a two-stage training framework that shapes the initial state through unimodal training before joint training. First, we propose the concept of Effective Competitive Strength (ECS) to quantify a modality's competitive strength. Our theoretical analysis further reveals that properly shaping the initial ECS via unimodal training achieves a provably tighter error bound. However, ECS is computationally intractable in deep neural networks. To bridge this gap, we develop a framework comprising two core components: a fine-grained computable diagnostic metric and an asynchronous training controller. For the metric, we first prove that mutual information (MI) is a principled proxy for ECS. Since MI is induced by per-modality marginals and thus treats each modality in isolation, we further propose FastPID, a computationally efficient and differentiable solver for partial information decomposition, which decomposes the joint distribution's information into fine-grained measurements: modality-specific uniqueness, redundancy, and synergy. Guided by these measurements, our asynchronous controller dynamically balances modalities by monitoring uniqueness and locates the ideal initial state at which to start joint training by tracking peak synergy. Experiments on diverse benchmarks demonstrate that our method achieves state-of-the-art performance. Our work establishes that shaping the pre-fusion models' initial state is a powerful strategy that eases competition before it starts, reliably unlocking synergistic multi-modal fusion.

new Robust Multi-Omics Integration from Incomplete Modalities Significantly Improves Prediction of Alzheimer's Disease

Authors: Sungjoon Park, Kyungwook Lee, Soorin Yim, Doyeong Hwang, Dongyun Kim, Soonyoung Lee, Amy Dunn, Daniel Gatti, Elissa Chesler, Kristen O'Connell, Kiyoung Kim

Abstract: Multi-omics data capture complex biomolecular interactions and provide insights into metabolism and disease. However, missing modalities hinder integrative analysis across heterogeneous omics. To address this, we present MOIRA (Multi-Omics Integration with Robustness to Absent modalities), an early integration method enabling robust learning from incomplete omics data via representation alignment and adaptive aggregation. MOIRA leverages all samples, including those with missing modalities, by projecting each omics dataset onto a shared embedding space where a learnable weighting mechanism fuses them. Evaluated on the Religious Order Study and Memory and Aging Project (ROSMAP) dataset for Alzheimer's Disease (AD), MOIRA outperformed existing approaches, and further ablation studies confirmed modality-wise contributions. Feature importance analysis revealed AD-related biomarkers consistent with prior literature, highlighting the biological relevance of our approach.

new Causal Time Series Generation via Diffusion Models

Authors: Yutong Xia, Chang Xu, Yuxuan Liang, Qingsong Wen, Roger Zimmermann, Jiang Bian

Abstract: Time series generation (TSG) synthesizes realistic sequences and has achieved remarkable success. Among TSG approaches, conditional models generate sequences given observed covariates; however, such models learn observational correlations without accounting for unobserved confounding. In this work, we propose a causal perspective on conditional TSG and introduce causal time series generation as a new TSG task family, formalized within Pearl's causal ladder, extending beyond observational generation to include interventional and counterfactual settings. To instantiate these tasks, we develop CaTSG, a unified diffusion-based framework with backdoor-adjusted guidance that causally steers sampling toward desired interventions and individual counterfactuals while preserving observational fidelity. Specifically, our method derives causal score functions via backdoor adjustment and the abduction-action-prediction procedure, thus enabling principled support for all three levels of TSG. Extensive experiments on both synthetic and real-world datasets show that CaTSG achieves superior fidelity while also supporting interventional and counterfactual generation that existing baselines cannot handle. Overall, we propose the causal TSG family and instantiate it with CaTSG, providing an initial proof-of-concept and opening a promising direction toward more reliable simulation under interventions and counterfactual generation.

new FHRFormer: A Self-supervised Transformer Approach for Fetal Heart Rate Inpainting and Forecasting

Authors: Kjersti Engan, Neel Kanwal, Anita Yeconia, Ladislaus Blacy, Yuda Munyaw, Estomih Mduma, Hege Ersdal

Abstract: Approximately 10\% of newborns require assistance to initiate breathing at birth, and around 5\% need ventilation support. Fetal heart rate (FHR) monitoring plays a crucial role in assessing fetal well-being during prenatal care, enabling the detection of abnormal patterns and supporting timely obstetric interventions to mitigate fetal risks during labor. Applying artificial intelligence (AI) methods to analyze large datasets of continuous FHR monitoring episodes with diverse outcomes may offer novel insights into predicting the risk of needing breathing assistance or interventions. Recent advances in wearable FHR monitors have enabled continuous fetal monitoring without compromising maternal mobility. However, sensor displacement during maternal movement, as well as changes in fetal or maternal position, often lead to signal dropouts, resulting in gaps in the recorded FHR data. Such missing data limits the extraction of meaningful insights and complicates automated (AI-based) analysis. Traditional approaches to handle missing data, such as simple interpolation techniques, often fail to preserve the spectral characteristics of the signals. In this paper, we propose a masked transformer-based autoencoder approach to reconstruct missing FHR signals by capturing both spatial and frequency components of the data. The proposed method demonstrates robustness across varying durations of missing data and can be used for signal inpainting and forecasting. The proposed approach can be applied retrospectively to research datasets to support the development of AI-based risk algorithms. In the future, the proposed method could be integrated into wearable FHR monitoring devices to achieve earlier and more robust risk detection.

new Federated Markov Imputation: Privacy-Preserving Temporal Imputation in Multi-Centric ICU Environments

Authors: Christoph Düsing, Philipp Cimiano

Abstract: Missing data is a persistent challenge in federated learning on electronic health records, particularly when institutions collect time-series data at varying temporal granularities. To address this, we propose Federated Markov Imputation (FMI), a privacy-preserving method that enables Intensive Care Units (ICUs) to collaboratively build global transition models for temporal imputation. We evaluate FMI on a real-world sepsis onset prediction task using the MIMIC-IV dataset and show that it outperforms local imputation baselines, especially in scenarios with irregular sampling intervals across ICUs.
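
The mechanics of FMI can be sketched in a few lines: each ICU shares only transition counts over discretized states, and the aggregated transition matrix imputes missing steps. This is a hedged reconstruction from the abstract; the state discretization, smoothing, and handling of differing temporal granularities are assumptions of this sketch.

```python
import numpy as np

N_STATES = 10                                # discretized vital-sign states

def local_counts(series):
    """Per-ICU transition counts; only these counts leave the institution,
    never the raw patient series (the privacy-preserving step)."""
    counts = np.zeros((N_STATES, N_STATES))
    for a, b in zip(series[:-1], series[1:]):
        if a >= 0 and b >= 0:                # -1 marks a missing value
            counts[a, b] += 1
    return counts

def aggregate(all_counts):
    # Server side: sum counts, smooth, and row-normalize into a global
    # Markov transition matrix.
    total = sum(all_counts) + 1e-6
    return total / total.sum(axis=1, keepdims=True)

def impute(series, P):
    # Fill each gap with the most likely successor of the previous state
    # (assumes the first value is observed).
    series = series.copy()
    for i in range(1, len(series)):
        if series[i] < 0:
            series[i] = int(np.argmax(P[series[i - 1]]))
    return series

icu_a = np.random.randint(0, N_STATES, 200)
icu_b = np.random.randint(0, N_STATES, 200)
P = aggregate([local_counts(icu_a), local_counts(icu_b)])
print(impute(np.array([3, -1, -1, 5]), P))
```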

new StyleBench: Evaluating thinking styles in Large Language Models

Authors: Junyu Guo, Shangding Gu, Ming Jin, Costas Spanos, Javad Lavaei

Abstract: The effectiveness of Large Language Models (LLMs) is heavily influenced by the reasoning strategies, or styles of thought, employed in their prompts. However, the interplay between these reasoning styles, model architecture, and task type remains poorly understood. To address this, we introduce StyleBench, a comprehensive benchmark for systematically evaluating reasoning styles across diverse tasks and models. We assess five representative reasoning styles: Chain of Thought (CoT), Tree of Thought (ToT), Algorithm of Thought (AoT), Sketch of Thought (SoT), and Chain-of-Draft (CoD), on five reasoning tasks, using 15 open-source models from major families (LLaMA, Qwen, Mistral, Gemma, GPT-OSS, Phi, and DeepSeek) ranging from 270M to 120B parameters. Our large-scale analysis reveals that no single style is universally optimal. We demonstrate that strategy efficacy is highly contingent on both model scale and task type: search-based methods (AoT, ToT) excel in open-ended problems but require large-scale models, while concise styles (SoT, CoD) achieve radical efficiency gains on well-defined tasks. Furthermore, we identify key behavioral patterns: smaller models frequently fail to follow output instructions and default to guessing, while reasoning robustness emerges as a function of scale. Our findings offer a crucial roadmap for selecting optimal reasoning strategies based on specific constraints. We open-source the benchmark at https://github.com/JamesJunyuGuo/Style_Bench.

URLs: https://github.com/JamesJunyuGuo/Style_Bench.

new Model-Based Reinforcement Learning under Random Observation Delays

Authors: Armin Karamzade, Kyungmin Kim, JB Lanier, Davide Corsi, Roy Fox

Abstract: Delays frequently occur in real-world environments, yet standard reinforcement learning (RL) algorithms often assume instantaneous perception of the environment. We study random sensor delays in POMDPs, where observations may arrive out-of-sequence, a setting that has not been previously addressed in RL. We analyze the structure of such delays and demonstrate that naive approaches, such as stacking past observations, are insufficient for reliable performance. To address this, we propose a model-based filtering process that sequentially updates the belief state based on an incoming stream of observations. We then introduce a simple delay-aware framework that incorporates this idea into model-based RL, enabling agents to effectively handle random delays. Applying this framework to Dreamer, we compare our approach to delay-aware baselines developed for MDPs. Our method consistently outperforms these baselines and demonstrates robustness to delay distribution shifts during deployment. Additionally, we present experiments on simulated robotic tasks, comparing our method to common practical heuristics and emphasizing the importance of explicitly modeling observation delays.

new Distribution-Controlled Client Selection to Improve Federated Learning Strategies

Authors: Christoph Düsing, Philipp Cimiano

Abstract: Federated learning (FL) is a distributed learning paradigm that allows multiple clients to jointly train a shared model while maintaining data privacy. Despite its great potential for domains with strict data privacy requirements, the presence of data imbalance among clients is a threat to the success of FL, as it causes the performance of the shared model to degrade. To address this, various studies have proposed enhancements to existing FL strategies, particularly through client selection methods that mitigate the detrimental effects of data imbalance. In this paper, we propose an extension to existing FL strategies that selects active clients so as to best align the current label distribution with one of two target distributions: a balanced distribution or the federation's combined label distribution. We then empirically verify the improvements achieved by our distribution-controlled client selection on three common FL strategies and two datasets. Our results show that while aligning the label distribution with a balanced distribution yields the greatest improvements in the face of local imbalance, alignment with the federation's combined label distribution is superior for global imbalance.
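
A greedy form of such distribution-controlled selection is easy to sketch. The snippet below is an illustrative reading of the abstract; the greedy criterion and L1 distance are assumptions, not necessarily the authors' exact rule.

```python
import numpy as np

def select_clients(client_dists, target, k):
    """Greedy sketch: repeatedly add the client whose inclusion moves the
    aggregated label distribution closest (L1) to the target distribution."""
    selected, remaining = [], list(range(len(client_dists)))
    agg = np.zeros_like(target)
    for _ in range(k):
        best = min(remaining, key=lambda c: np.abs(
            (agg + client_dists[c]) / (agg + client_dists[c]).sum() - target
        ).sum())
        selected.append(best)
        remaining.remove(best)
        agg = agg + client_dists[best]
    return selected

# Target 1: a balanced distribution; Target 2: the federation's combined one.
dists = np.random.dirichlet(np.ones(5), size=20)   # 20 clients, 5 labels
balanced = np.full(5, 0.2)
combined = dists.mean(axis=0)
print(select_clients(dists, balanced, k=5))
print(select_clients(dists, combined, k=5))
```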

new Improving Early Sepsis Onset Prediction Through Federated Learning

Authors: Christoph Düsing, Philipp Cimiano

Abstract: Early and accurate prediction of sepsis onset remains a major challenge in intensive care, where timely detection and subsequent intervention can significantly improve patient outcomes. While machine learning models have shown promise in this domain, their success is often limited by the amount and diversity of training data available to individual hospitals and Intensive Care Units (ICUs). Federated Learning (FL) addresses this issue by enabling collaborative model training across institutions without requiring data sharing, thus preserving patient privacy. In this work, we propose a federated, attention-enhanced Long Short-Term Memory model for sepsis onset prediction, trained on multi-centric ICU data. Unlike existing approaches that rely on fixed prediction windows, our model supports variable prediction horizons, enabling both short- and long-term forecasting in a single unified model. In our analysis, we place particular emphasis on the improvements our approach brings to early sepsis detection, i.e., predictions with large prediction windows, by conducting an in-depth temporal analysis. Our results show that using FL does not merely improve overall prediction performance (with performance approaching that of a centralized model), but is particularly beneficial for early sepsis onset prediction. Finally, we show that our choice of a variable rather than a fixed prediction window does not hurt performance significantly, while reducing computational, communication, and organizational overhead.

new Deterministic Discrete Denoising

Authors: Hideyuki Suzuki, Hiroshi Yamashita

Abstract: We propose a deterministic denoising algorithm for discrete-state diffusion models based on Markov chains. The generative reverse process is derandomized by introducing a variant of the herding algorithm with weakly chaotic dynamics, which induces deterministic discrete state transitions. Our approach is a direct replacement for the stochastic denoising process, requiring neither retraining nor continuous state embeddings. We demonstrate consistent improvements in both efficiency and sample quality on text and image generation tasks. Thus, this simple derandomization approach is expected to enhance the significance of discrete diffusion in generative modeling. Furthermore, our results reveal that deterministic reverse processes, well established in continuous diffusion, can also be effective in discrete state spaces.
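
The underlying herding dynamics can be shown as a drop-in replacement for categorical sampling. The sketch below implements plain herding; the paper's variant additionally introduces weakly chaotic dynamics, which is omitted here.

```python
import numpy as np

class HerdedSampler:
    """Deterministic stand-in for categorical sampling: an internal weight
    vector accumulates the target probabilities and subtracts one for each
    emitted symbol, so empirical frequencies track the target distribution
    with no randomness."""

    def __init__(self, n_states):
        self.w = np.zeros(n_states)

    def sample(self, p):                  # p: categorical distribution
        self.w += p
        i = int(np.argmax(self.w))        # deterministic state transition
        self.w[i] -= 1.0
        return i

# Drop-in replacement for np.random.choice in a discrete reverse process:
sampler = HerdedSampler(n_states=4)
p = np.array([0.50, 0.25, 0.15, 0.10])
draws = [sampler.sample(p) for _ in range(1000)]
print(np.bincount(draws) / 1000)          # approaches p, deterministically
```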

new Deep Learning for Crime Forecasting: The Role of Mobility at Fine-grained Spatiotemporal Scales

Authors: Ariadna Albors Zumel, Michele Tizzoni, Gian Maria Campedelli

Abstract: Objectives: To develop a deep learning framework to evaluate if and how incorporating micro-level mobility features, alongside historical crime and sociodemographic data, enhances predictive performance in crime forecasting at fine-grained spatial and temporal resolutions. Methods: We advance the literature on computational methods and crime forecasting by focusing on four U.S. cities (i.e., Baltimore, Chicago, Los Angeles, and Philadelphia). We employ crime incident data obtained from each city's police department, combined with sociodemographic data from the American Community Survey and human mobility data from Advan, collected from 2019 to 2023. This data is aggregated into grids with equally sized cells of 0.077 sq. miles (0.2 sq. kms) and used to train our deep learning forecasting model, a Convolutional Long Short-Term Memory (ConvLSTM) network, which predicts crime occurrences 12 hours ahead using 14-day and 2-day input sequences. We also compare its performance against three baseline models: logistic regression, random forest, and standard LSTM. Results: Incorporating mobility features improves predictive performance, especially when using shorter input sequences. Notably, however, the best results are obtained when both mobility and sociodemographic features are used together, with our deep learning model achieving the highest recall, precision, and F1 score in all four cities, outperforming alternative methods. With this configuration, longer input sequences enhance predictions for violent crimes, while shorter sequences are more effective for property crimes. Conclusion: These findings underscore the importance of integrating diverse data sources for spatiotemporal crime forecasting, mobility included. They also highlight the advantages (and limits) of deep learning when dealing with fine-grained spatial and temporal scales.

new Energy saving in off-road vehicles using leakage compensation technique

Authors: Gyan Wrat, J. Das

Abstract: The article focuses on enhancing the energy efficiency of linear actuators used in heavy earth moving equipment, particularly in the booms of excavation equipment. Two hydraulic circuits are compared in terms of energy efficiency, with one using a conventional proportional directional control valve (PDCV) and the other using an innovative solution of a proportional flow control valve (PFCV) with artificial leakage between the two ends of the actuator. The PFCV reduces energy loss in the form of heat by bypassing the extra flow from the pump during position control, unlike the PDCV, which uses a pressure relief valve. The hydraulic circuit using PFCV is found to be 8.5% more energy efficient than the conventional circuit using PDCV. The article also discusses the position control of the actuator, which is achieved using a PID controller tuned by a fuzzy controller. The simulation of the hydraulic circuit is carried out using MATLAB/Simulink, and the results are compared with experiments. Overall, the proposed approach could lead to significant improvements in the energy efficiency of linear actuators used in heavy earth moving equipment, thereby reducing their environmental impact and operating costs.

new GenFacts: Generative Counterfactual Explanations for Multi-Variate Time Series

Authors: Sarah Seifi, Anass Ibrahimi, Tobias Sukianto, Cecilia Carbonelli, Lorenzo Servadei, Robert Wille

Abstract: Counterfactual explanations aim to enhance model transparency by showing how inputs can be minimally altered to change predictions. For multivariate time series, existing methods often generate counterfactuals that are invalid, implausible, or unintuitive. We introduce GenFacts, a generative framework based on a class-discriminative variational autoencoder. It integrates contrastive and classification-consistency objectives, prototype-based initialization, and realism-constrained optimization. We evaluate GenFacts on radar gesture data as an industrial use case and handwritten letter trajectories as an intuitive benchmark. Across both datasets, GenFacts outperforms state-of-the-art baselines in plausibility (+18.7%) and achieves the highest interpretability scores in a human study. These results highlight that plausibility and user-centered interpretability, rather than sparsity alone, are key to actionable counterfactuals in time series data.

new Why Attention Fails: The Degeneration of Transformers into MLPs in Time Series Forecasting

Authors: Zida Liang, Jiayi Zhu, Weiqiang Sun

Abstract: Transformer-based architectures have achieved high performance in natural language processing and computer vision, yet many studies have shown that they do not demonstrate a clear advantage in time series forecasting, and in some cases even underperform simple linear baselines. However, most of these studies have not thoroughly explored the reasons behind this failure. To better understand time-series transformers (TSTs), we designed a series of experiments that progressively modify transformers into MLPs in order to investigate the impact of the attention mechanism. Surprisingly, transformer blocks often degenerate into simple MLPs in existing time-series transformers. We designed an interpretable dataset to investigate the reasons behind the failure of the attention mechanism and revealed that it does not work in the expected way. We theoretically analyzed the reasons behind this phenomenon, demonstrating that current embedding methods fail to allow transformers to operate in a well-structured latent space, and further analyzed the deeper underlying causes of this embedding failure.

new Decoupled-Value Attention for Prior-Data Fitted Networks: GP Inference for Physical Equations

Authors: Kaustubh Sharma, Simardeep Singh, Parikshit Pareek

Abstract: Prior-data fitted networks (PFNs) are a promising alternative to time-consuming Gaussian Process (GP) inference for creating fast surrogates of physical systems. PFNs reduce the computational burden of GP training by replacing Bayesian inference in the GP with a single forward pass of a learned prediction model. However, with standard Transformer attention, PFNs show limited effectiveness on high-dimensional regression tasks. We introduce Decoupled-Value Attention (DVA), motivated by the GP property that the function space is fully characterized by the kernel over inputs and the predictive mean is a weighted sum of training targets. DVA computes similarities from inputs only and propagates labels solely through values. Thus, the proposed DVA mirrors the Gaussian-process update while remaining kernel-free. We demonstrate that the crucial factor for scaling PFNs is the attention rule rather than the architecture itself. Specifically, our results demonstrate that (a) localized attention consistently reduces out-of-sample validation loss in PFNs across different dimensional settings, with validation loss reduced by more than 50% in five- and ten-dimensional cases, and (b) the role of attention is more decisive than the choice of backbone architecture, showing that CNN-based PFNs can perform on par with their Transformer-based counterparts. The proposed PFNs provide 64-dimensional power-flow-equation approximations with a mean absolute error on the order of 1E-3, while being over 80x faster than exact GP inference.
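
The core attention rule is compact. A minimal sketch of the decoupled-value idea, with learned projections omitted for clarity (an assumption of this sketch): similarities come from inputs only, and labels enter solely through the values, mirroring the GP predictive mean as a weighted sum of training targets.

```python
import torch
import torch.nn.functional as F

def decoupled_value_attention(x_train, y_train, x_query, temperature=1.0):
    """Input-only similarity weights; labels flow only through the values,
    so the output is a weighted sum of training targets, as in a GP mean."""
    d = x_train.shape[-1]
    scores = x_query @ x_train.T / (temperature * d ** 0.5)
    attn = F.softmax(scores, dim=-1)    # kernel-free similarity over inputs
    return attn @ y_train               # labels only appear as values

x_tr, y_tr = torch.randn(100, 10), torch.randn(100, 1)
x_q = torch.randn(5, 10)
print(decoupled_value_attention(x_tr, y_tr, x_q).shape)   # torch.Size([5, 1])
```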

new Flow Matching in the Low-Noise Regime: Pathologies and a Contrastive Remedy

Authors: Weili Zeng, Yichao Yan

Abstract: Flow matching has recently emerged as a powerful alternative to diffusion models, providing a continuous-time formulation for generative modeling and representation learning. Yet, we show that this framework suffers from a fundamental instability in the low-noise regime. As noise levels approach zero, arbitrarily small perturbations in the input can induce large variations in the velocity target, causing the condition number of the learning problem to diverge. This ill-conditioning not only slows optimization but also forces the encoder to reallocate its limited Jacobian capacity toward noise directions, thereby degrading semantic representations. We provide the first theoretical analysis of this phenomenon, which we term the low-noise pathology, establishing its intrinsic link to the structure of the flow matching objective. Building on these insights, we propose Local Contrastive Flow (LCF), a hybrid training protocol that replaces direct velocity regression with contrastive feature alignment at small noise levels, while retaining standard flow matching at moderate and high noise. Empirically, LCF not only improves convergence speed but also stabilizes representation quality. Our findings highlight the critical importance of addressing low-noise pathologies to unlock the full potential of flow matching for both generation and representation learning.

new Alignment Unlocks Complementarity: A Framework for Multiview Circuit Representation Learning

Authors: Zhengyuan Shi, Jingxin Wang, Wentao Jiang, Chengyu Ma, Ziyang Zheng, Zhufei Chu, Weikang Qian, Qiang Xu

Abstract: Multiview learning on Boolean circuits holds immense promise, as different graph-based representations offer complementary structural and semantic information. However, the vast structural heterogeneity between views, such as an And-Inverter Graph (AIG) versus an XOR-Majority Graph (XMG), poses a critical barrier to effective fusion, especially for self-supervised techniques like masked modeling. Naively applying such methods fails, as the cross-view context is perceived as noise. Our key insight is that functional alignment is a necessary precondition to unlock the power of multiview self-supervision. We introduce MixGate, a framework built on a principled training curriculum that first teaches the model a shared, function-aware representation space via an Equivalence Alignment Loss. Only then do we introduce a multiview masked modeling objective, which can now leverage the aligned views as a rich, complementary signal. Extensive experiments, including a crucial ablation study, demonstrate that our alignment-first strategy transforms masked modeling from an ineffective technique into a powerful performance driver.

new Knowledgeable Language Models as Black-Box Optimizers for Personalized Medicine

Authors: Michael S. Yao, Osbert Bastani, Alma Andersson, Tommaso Biancalani, Aïcha Bentaieb, Claudia Iriondo

Abstract: The goal of personalized medicine is to discover a treatment regimen that optimizes a patient's clinical outcome based on their personal genetic and environmental factors. However, candidate treatments cannot be arbitrarily administered to the patient to assess their efficacy; we often instead have access to an in silico surrogate model that approximates the true fitness of a proposed treatment. Unfortunately, such surrogate models have been shown to fail to generalize to previously unseen patient-treatment combinations. We hypothesize that domain-specific prior knowledge - such as medical textbooks and biomedical knowledge graphs - can provide a meaningful alternative signal of the fitness of proposed treatments. To this end, we introduce LLM-based Entropy-guided Optimization with kNowledgeable priors (LEON), a mathematically principled approach to leverage large language models (LLMs) as black-box optimizers without any task-specific fine-tuning, taking advantage of their ability to contextualize unstructured domain knowledge to propose personalized treatment plans in natural language. In practice, we implement LEON via 'optimization by prompting,' which uses LLMs as stochastic engines for proposing treatment designs. Experiments on real-world optimization tasks show LEON outperforms both traditional and LLM-based methods in proposing individualized treatments for patients.

new CLUE: Conflict-guided Localization for LLM Unlearning Framework

Authors: Hang Chen, Jiaying Zhu, Xinyu Yang, Wenya Wang

Abstract: LLM unlearning aims to eliminate the influence of undesirable data without affecting causally unrelated information. This process typically involves a forget set to remove target information alongside a retain set to maintain non-target capabilities. While recent localization-based methods show promise in identifying important neurons to be unlearned, they fail to disentangle neurons responsible for forgetting undesirable knowledge from those responsible for retaining essential skills, often treating them as a single entangled group. As a result, these methods apply uniform interventions, risking catastrophic over-forgetting or incomplete erasure of the target knowledge. To address this, we turn to circuit discovery, a mechanistic interpretability technique, and propose the Conflict-guided Localization for LLM Unlearning framEwork (CLUE). This framework identifies the forget and retain circuits composed of important neurons; the circuits are then transformed into conjunctive normal form (CNF). The assignment of each neuron in the CNF satisfiability solution reveals whether it should be forgotten or retained. We then provide targeted fine-tuning strategies for the different categories of neurons. Extensive experiments demonstrate that, compared to existing localization methods, CLUE achieves superior forget efficacy and retain utility through precise neural localization.

new FracAug: Fractional Augmentation Boosts Graph-level Anomaly Detection under Limited Supervision

Authors: Xiangyu Dong, Xingyi Zhang, Sibo Wang

Abstract: Graph-level anomaly detection (GAD) is critical in diverse domains such as drug discovery, yet high labeling costs and dataset imbalance hamper the performance of Graph Neural Networks (GNNs). To address these issues, we propose FracAug, an innovative plug-in augmentation framework that enhances GNNs by generating semantically consistent graph variants and pseudo-labeling with mutual verification. Unlike previous heuristic methods, FracAug learns semantics within given graphs and synthesizes fractional variants, guided by a novel weighted distance-aware margin loss. This captures multi-scale topology to generate diverse, semantic-preserving graphs unaffected by data imbalance. Then, FracAug utilizes predictions from both original and augmented graphs to pseudo-label unlabeled data, iteratively expanding the training set. As a model-agnostic module compatible with various GNNs, FracAug demonstrates remarkable universality and efficacy: experiments across 14 GNNs on 12 real-world datasets show consistent gains, boosting average AUROC, AUPRC, and F1-score by up to 5.72%, 7.23%, and 4.18%, respectively.

new Toward Robust and Efficient ML-Based GPU Caching for Modern Inference

Authors: Peng Chen, Jiaji Zhang, Hailiang Zhao, Yirong Zhang, Jiahong Yu, Xueyan Tang, Yixuan Wang, Hao Li, Jianping Zou, Gang Xiong, Kingsum Chow, Shuibing He, Shuiguang Deng

Abstract: In modern GPU inference, cache efficiency remains a major bottleneck. In recommendation models, embedding hit rates largely determine throughput, while in large language models, KV-cache misses substantially increase time-to-first-token (TTFT). Heuristic policies such as LRU often struggle under structured access patterns. Learning-based approaches are promising, but in practice face two major limitations: they degrade sharply when predictions are inaccurate, or they gain little even with accurate predictions due to conservative designs. Some also incur high overhead, further limiting practicality. We present LCR, a practical framework for learning-based GPU caching that delivers performance gains while ensuring robustness and efficiency. Its core algorithm, LARU, enhances LRU with machine-learned predictions and dynamically adapts to prediction accuracy through online error estimation. When predictions are accurate, LARU achieves near-optimal performance. With inaccurate predictions, it degrades gracefully to near-LRU performance. With LCR, we bridge the gap between empirical progress and theoretical advances in learning-based caching. Experiments show that LCR delivers consistent gains under realistic conditions. In DLRM and LLM scenarios, it improves throughput by up to 24.2\% and reduces P99 TTFT by up to 28.3\%, outperforming widely used inference systems. Even under poor predictions, its performance remains stable, demonstrating practical robustness.

new Learning Ising Models under Hard Constraints using One Sample

Authors: Rohan Chauhan, Ioannis Panageas

Abstract: We consider the problem of estimating inverse temperature parameter $\beta$ of an $n$-dimensional truncated Ising model using a single sample. Given a graph $G = (V,E)$ with $n$ vertices, a truncated Ising model is a probability distribution over the $n$-dimensional hypercube $\{-1,1\}^n$ where each configuration $\mathbf{\sigma}$ is constrained to lie in a truncation set $S \subseteq \{-1,1\}^n$ and has probability $\Pr(\mathbf{\sigma}) \propto \exp(\beta\mathbf{\sigma}^\top A\mathbf{\sigma})$ with $A$ being the adjacency matrix of $G$. We adopt the recent setting of [Galanis et al. SODA'24], where the truncation set $S$ can be expressed as the set of satisfying assignments of a $k$-SAT formula. Given a single sample $\mathbf{\sigma}$ from a truncated Ising model, with inverse parameter $\beta^*$, underlying graph $G$ of bounded degree $\Delta$ and $S$ being expressed as the set of satisfying assignments of a $k$-SAT formula, we design in nearly $O(n)$ time an estimator $\hat{\beta}$ that is $O(\Delta^3/\sqrt{n})$-consistent with the true parameter $\beta^*$ for $k \gtrsim \log(d^2k)\Delta^3.$ Our estimator is based on the maximization of the pseudolikelihood, a notion that has received extensive analysis for various probabilistic models without [Chatterjee, Annals of Statistics '07] or with truncation [Galanis et al. SODA '24]. Our approach generalizes recent techniques from [Daskalakis et al. STOC '19, Galanis et al. SODA '24], to confront the more challenging setting of the truncated Ising model.
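
The pseudolikelihood objective here is a one-dimensional optimization over $\beta$. A minimal NumPy sketch for the untruncated case (the paper's handling of the k-SAT truncation set is the hard part and is deliberately omitted): for each site $i$, the conditional law of $\sigma_i$ given the rest is logistic in $2\beta m_i$ with $m_i = (A\sigma)_i$.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def neg_log_pseudolikelihood(beta, sigma, A):
    """Negative log-pseudolikelihood of one Ising sample sigma in {-1,1}^n
    under P(sigma) ∝ exp(beta * sigma^T A sigma), A symmetric:
    sum_i [2*beta*sigma_i*m_i - log(2*cosh(2*beta*m_i))], negated."""
    m = A @ sigma
    return -np.sum(2 * beta * sigma * m - np.log(2 * np.cosh(2 * beta * m)))

# Single-sample MPLE demo on a random sparse graph (placeholder "sample";
# truncation via the satisfying-assignment set S is not modeled here).
n = 200
A = (np.random.rand(n, n) < 0.02).astype(float)
A = np.triu(A, 1); A = A + A.T
sigma = np.sign(np.random.randn(n))
res = minimize_scalar(neg_log_pseudolikelihood, bounds=(0.0, 2.0),
                      args=(sigma, A), method="bounded")
print(res.x)   # estimated beta
```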

new Binary Autoencoder for Mechanistic Interpretability of Large Language Models

Authors: Hakaze Cho, Haolin Yang, Brian M. Kurkoski, Naoya Inoue

Abstract: Existing works are dedicated to untangling atomized numerical components (features) from the hidden states of Large Language Models (LLMs) to interpret their mechanisms. However, they typically rely on autoencoders constrained by implicit training-time regularization on single training instances (e.g., $L_1$ normalization, top-k functions), without an explicit guarantee of global sparsity among instances, causing a large number of dense (simultaneously active) features that harm feature sparsity and atomization. In this paper, we propose a novel autoencoder variant that enforces minimal entropy on minibatches of hidden activations, thereby promoting feature independence and sparsity across instances. For efficient entropy calculation, we discretize the hidden activations to 1-bit via a step function and apply gradient estimation to enable backpropagation, hence the name Binary Autoencoder (BAE). We empirically demonstrate two major applications: (1) Feature set entropy calculation. Entropy can be reliably estimated on binary hidden activations, which we empirically evaluate and leverage to characterize the inference dynamics of LLMs and In-context Learning. (2) Feature untangling. Similar to typical methods, BAE can extract atomized features from an LLM's hidden states. To robustly evaluate this feature extraction capability, we refine traditional feature-interpretation methods to avoid unreliable handling of numerical tokens, and show that BAE avoids dense features while producing the largest number of interpretable features among baselines, confirming BAE's effectiveness as a feature extractor.
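
The 1-bit discretization with a straight-through gradient estimator is the standard way to make such a step function trainable. The sketch below pairs it with a factorized per-feature Bernoulli entropy on the minibatch; the paper's exact entropy estimator may differ (an assumption of this sketch).

```python
import torch

class BinarizeSTE(torch.autograd.Function):
    """Step function with a straight-through estimator: the forward pass
    emits 1-bit codes, the backward pass lets gradients through unchanged."""

    @staticmethod
    def forward(ctx, x):
        return (x > 0).float()

    @staticmethod
    def backward(ctx, grad_output):
        return grad_output

def batch_entropy(bits, eps=1e-8):
    # Factorized Bernoulli entropy across the minibatch; minimizing it
    # pushes each feature toward sparse, decisive activation.
    p = bits.mean(dim=0).clamp(eps, 1 - eps)
    return -(p * p.log() + (1 - p) * (1 - p).log()).sum()

h = torch.randn(32, 256, requires_grad=True)   # placeholder hidden states
bits = BinarizeSTE.apply(h)
loss = batch_entropy(bits)
loss.backward()                                 # gradients flow via the STE
print(bits.mean().item(), h.grad.abs().mean().item())
```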

new Feature Augmentation of GNNs for ILPs: Local Uniqueness Suffices

Authors: Qingyu Han, Qian Li, Linxin Yang, Qian Chen, Qingjiang Shi, Ruoyu Sun

Abstract: Integer Linear Programs (ILPs) are central to real-world optimizations but notoriously difficult to solve. Learning to Optimize (L2O) has emerged as a promising paradigm, with Graph Neural Networks (GNNs) serving as the standard backbone. However, standard anonymous GNNs are limited in expressiveness for ILPs, and the common enhancement of augmenting nodes with globally unique identifiers (UIDs) typically introduces spurious correlations that severely harm generalization. To address this tradeoff, we propose a parsimonious Local-UID scheme based on d-hop uniqueness coloring, which ensures identifiers are unique only within each node's d-hop neighborhood. Building on this scheme, we introduce ColorGNN, which incorporates color information via color-conditioned embeddings, and ColorUID, a lightweight feature-level variant. We prove that for d-layer networks, Local-UIDs achieve the expressive power of Global-UIDs while offering stronger generalization. Extensive experiments show that our approach (i) yields substantial gains on three ILP benchmarks, (ii) exhibits strong OOD generalization on linear programming datasets, and (iii) further improves a general graph-level task when paired with a state-of-the-art method.

new Lossless Compression: A New Benchmark for Time Series Model Evaluation

Authors: Meng Wan, Benxi Tian, Jue Wang, Cui Hui, Ningming Nie, Tiantian Liu, Zongguo Wang, Cao Rongqiang, Peng Shi, Yangang Wang

Abstract: The evaluation of time series models has traditionally focused on four canonical tasks: forecasting, imputation, anomaly detection, and classification. While these tasks have driven significant progress, they primarily assess task-specific performance and do not rigorously measure whether a model captures the full generative distribution of the data. We introduce lossless compression as a new paradigm for evaluating time series models, grounded in Shannon's source coding theorem. This perspective establishes a direct equivalence between optimal compression length and the negative log-likelihood, providing a strict and unified information-theoretic criterion for modeling capacity. We then define a standardized evaluation protocol and metrics, and propose and open-source a comprehensive evaluation framework, TSCom-Bench, which enables the rapid adaptation of time series models as backbones for lossless compression. Experiments across diverse datasets on state-of-the-art models, including TimeXer, iTransformer, and PatchTST, demonstrate that compression reveals distributional weaknesses overlooked by classic benchmarks. These findings position lossless compression as a principled task that complements and extends existing evaluations for time series modeling.
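
The equivalence the benchmark is built on fits in a few lines: by Shannon's source coding theorem, a model's optimal code length for a sequence is its negative log-likelihood in bits, so any model with per-step predictive probabilities can be scored as a lossless compressor.

```python
import numpy as np

def code_length_bits(step_probs):
    """step_probs[t] = model probability of the t-th (discretized) value;
    the optimal code length equals the negative log-likelihood in bits."""
    return -np.sum(np.log2(step_probs))

probs_model_a = np.full(100, 0.9)   # confident, well-calibrated model
probs_model_b = np.full(100, 0.5)   # uninformative model
print(code_length_bits(probs_model_a))   # ~15.2 bits for 100 steps
print(code_length_bits(probs_model_b))   # 100 bits for 100 steps
```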

new MAIFormer: Multi-Agent Inverted Transformer for Flight Trajectory Prediction

Authors: Seokbin Yoon, Keumjin Lee

Abstract: Flight trajectory prediction for multiple aircraft is essential and provides critical insights into how aircraft navigate within current air traffic flows. However, predicting multi-agent flight trajectories is inherently challenging. One of the major difficulties is modeling both the individual aircraft behaviors over time and the complex interactions between flights. Generating explainable prediction outcomes is also a challenge. Therefore, we propose a Multi-Agent Inverted Transformer, MAIFormer, as a novel neural architecture that predicts multi-agent flight trajectories. The proposed framework features two key attention modules: (i) masked multivariate attention, which captures spatio-temporal patterns of individual aircraft, and (ii) agent attention, which models the social patterns among multiple agents in complex air traffic scenes. We evaluated MAIFormer using a real-world automatic dependent surveillance-broadcast flight trajectory dataset from the terminal airspace of Incheon International Airport in South Korea. The experimental results show that MAIFormer achieves the best performance across multiple metrics and outperforms other methods. In addition, MAIFormer produces prediction outcomes that are interpretable from a human perspective, which improves both the transparency of the model and its practical utility in air traffic control.

new ExMolRL: Phenotype-Target Joint Generation of De Novo Molecules via Multi-Objective Reinforcement Learning

Authors: Haotian Guo, Hui Liu

Abstract: The generation of high-quality candidate molecules remains a central challenge in AI-driven drug design. Current phenotype-based and target-based strategies each suffer limitations, either incurring high experimental costs or overlooking system-level cellular responses. To bridge this gap, we propose ExMolRL, a novel generative framework that synergistically integrates phenotypic and target-specific cues for de novo molecular generation. The phenotype-guided generator is first pretrained on expansive drug-induced transcriptional profiles and subsequently fine-tuned via multi-objective reinforcement learning (RL). Crucially, the reward function fuses docking affinity and drug-likeness scores, augmented with a ranking loss, prior-likelihood regularization, and entropy maximization. The multi-objective RL steers the model toward chemotypes that are simultaneously potent, diverse, and aligned with the specified phenotypic effects. Extensive experiments demonstrate ExMolRL's superior performance over state-of-the-art phenotype-based and target-based models across multiple well-characterized targets. The generated molecules exhibit favorable drug-like properties, high target affinity, and inhibitory potency (IC50) against cancer cells. This unified framework showcases the synergistic potential of combining phenotype-guided and target-aware strategies, offering a more effective solution for de novo drug discovery.

new Mechanism of Task-oriented Information Removal in In-context Learning

Authors: Hakaze Cho, Haolin Yang, Gouki Minegishi, Naoya Inoue

Abstract: In-context Learning (ICL) is an emerging few-shot learning paradigm based on modern Language Models (LMs), yet its inner mechanism remains unclear. In this paper, we investigate the mechanism through a novel perspective of information removal. Specifically, we demonstrate that in the zero-shot scenario, LMs encode queries into non-selective representations in hidden states that contain information for all possible tasks, leading to arbitrary outputs without focus on the intended task and resulting in near-zero accuracy. Meanwhile, we find that selectively removing specific information from the hidden states with a low-rank filter effectively steers LMs toward the intended task. Building on these findings, and by measuring the hidden states with carefully designed metrics, we observe that few-shot ICL effectively simulates such a task-oriented information removal process, selectively removing redundant information from entangled non-selective representations and improving the output based on the demonstrations, which constitutes a key mechanism underlying ICL. Moreover, we identify essential attention heads that induce the removal operation, termed Denoising Heads. These heads enable ablation experiments that block the information removal operation during inference, under which ICL accuracy degrades significantly, especially when the correct label is absent from the few-shot demonstrations, confirming both the critical role of the information removal mechanism and of the Denoising Heads.
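
The low-rank filtering operation described here has a simple linear-algebra form. A sketch, assuming the removed directions are supplied as an orthonormal basis (how the paper learns the filter is not reproduced here):

```python
import torch

def remove_directions(h, U):
    """Project hidden states away from a set of directions U (d x k,
    orthonormal columns), removing the information those directions carry."""
    return h - (h @ U) @ U.T

d, k = 768, 4
U, _ = torch.linalg.qr(torch.randn(d, k))   # placeholder directions
h = torch.randn(16, d)                      # placeholder hidden states
h_filtered = remove_directions(h, U)
print((h_filtered @ U).abs().max())         # ~0: those components are gone
```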

new Predicting LLM Reasoning Performance with Small Proxy Model

Authors: Woosung Koh, Juyoung Suk, Sungjun Han, Se-Young Yun, Jay Shin

Abstract: Given the prohibitive cost of pre-training large language models, it is essential to leverage smaller proxy models to optimize datasets before scaling up. However, this approach becomes challenging for reasoning capabilities, which exhibit emergent behavior that appears reliably only at larger model sizes, often exceeding 7B parameters. To address this, we introduce rBridge, showing that small proxies ($\leq$1B) can effectively predict large-model reasoning by aligning more closely with (1) the pre-training objective and (2) the target task. rBridge achieves this by weighting negative log-likelihood with task alignment, using reasoning traces from frontier models as gold labels. In our experiments, rBridge (i) reduces dataset ranking costs by over 100x relative to the best baseline, (ii) achieves the strongest correlation across six reasoning benchmarks at 1B to 32B scale, and (iii) zero-shot transfers predictive relationships across pre-training datasets at 1B to 7B scale. These findings indicate that rBridge offers a practical path for exploring reasoning-oriented pre-training at lower cost.

new DELTA-Code: How Does RL Unlock and Transfer New Programming Algorithms in LLMs?

Authors: Yiyou Sun, Yuhan Cao, Pohao Huang, Haoyue Bai, Hannaneh Hajishirzi, Nouha Dziri, Dawn Song

Abstract: It remains an open question whether LLMs can acquire or generalize genuinely new reasoning strategies, beyond the sharpened skills encoded in their parameters during pre-training or post-training. To address this question, we introduce DELTA-Code--Distributional Evaluation of Learnability and Transferrability in Algorithmic Coding--a controlled benchmark of synthetic coding problem families designed to probe two fundamental aspects: learnability -- can LLMs, through reinforcement learning (RL), solve problem families where pretrained models fail even with large enough attempts (pass@K=0)? -- and transferrability -- if learnability happens, can such skills transfer systematically to out-of-distribution (OOD) test sets? Unlike prior public coding datasets, DELTA isolates reasoning skills through templated problem generators and introduces fully OOD problem families that demand novel strategies rather than tool invocation or memorized patterns. Our experiments reveal a striking grokking phase transition: after an extended period with near-zero reward, RL-trained models abruptly climb to near-perfect accuracy. To enable learnability on previously unsolvable problem families, we explore key training ingredients such as staged warm-up with dense rewards, experience replay, curriculum training, and verification-in-the-loop. Beyond learnability, we use DELTA to evaluate transferability, or generalization, along exploratory, compositional, and transformative axes, as well as cross-family transfer. Results show solid gains within families and for recomposed skills, but persistent weaknesses in transformative cases. DELTA thus offers a clean testbed for probing the limits of RL-driven reasoning and for understanding how models can move beyond existing priors to acquire new algorithmic skills.

new Efficient Ensemble Conditional Independence Test Framework for Causal Discovery

Authors: Zhengkang Guan, Kun Kuang

Abstract: Constraint-based causal discovery relies on numerous conditional independence tests (CITs), but its practical applicability is severely constrained by the prohibitive computational cost, especially as CITs themselves have high time complexity with respect to the sample size. To address this key bottleneck, we introduce the Ensemble Conditional Independence Test (E-CIT), a general and plug-and-play framework. E-CIT operates on an intuitive divide-and-aggregate strategy: it partitions the data into subsets, applies a given base CIT independently to each subset, and aggregates the resulting p-values using a novel method grounded in the properties of stable distributions. This framework reduces the computational complexity of a base CIT to linear in the sample size when the subset size is fixed. Moreover, our tailored p-value combination method offers theoretical consistency guarantees under mild conditions on the subtests. Experimental results demonstrate that E-CIT not only significantly reduces the computational burden of CITs and causal discovery but also achieves competitive performance. Notably, it exhibits an improvement in complex testing scenarios, particularly on real-world datasets.
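
The divide-and-aggregate skeleton is easy to sketch. Note that the combiner below is the classic Cauchy combination test, used as a stand-in because it also rests on stable-distribution properties; it is not the paper's novel aggregation rule. The base test is a toy partial-correlation CIT (an assumption of this sketch).

```python
import numpy as np
from scipy import stats

def cauchy_combine(pvals):
    """Stand-in aggregator: the Cauchy combination test."""
    t = np.mean(np.tan((0.5 - np.asarray(pvals)) * np.pi))
    return 1 - stats.cauchy.cdf(t)

def ensemble_cit(x, y, z, base_cit, n_subsets=10):
    """E-CIT skeleton: partition the data, run the base CIT per subset
    (linear total cost for a fixed subset size), combine the p-values."""
    idx = np.array_split(np.random.permutation(len(x)), n_subsets)
    return cauchy_combine([base_cit(x[i], y[i], z[i]) for i in idx])

def pcorr_cit(x, y, z):
    # Toy base test: partial correlation via linear residuals.
    rx = x - z @ np.linalg.lstsq(z, x, rcond=None)[0]
    ry = y - z @ np.linalg.lstsq(z, y, rcond=None)[0]
    r = np.corrcoef(rx, ry)[0, 1]
    t = r * np.sqrt((len(x) - 3) / (1 - r ** 2))
    return 2 * (1 - stats.t.cdf(abs(t), df=len(x) - 3))

n = 5000
z = np.random.randn(n, 2)
x = z @ np.array([1.0, -0.5]) + np.random.randn(n)
y = z @ np.array([0.3, 0.8]) + np.random.randn(n)
print(ensemble_cit(x, y, z, pcorr_cit))   # large p-value: X indep. Y given Z
```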

new Actor-Critic without Actor

Authors: Donghyeon Ki, Hee-Jun Ahn, Kyungyoon Kim, Byung-Jun Lee

Abstract: Actor-critic methods constitute a central paradigm in reinforcement learning (RL), coupling policy evaluation with policy improvement. While effective across many domains, these methods rely on separate actor and critic networks, which makes training vulnerable to architectural decisions and hyperparameter tuning. Such complexity limits their scalability in settings that require large function approximators. Diffusion models have recently been proposed as expressive policies that capture multi-modal behaviors and improve exploration, but they introduce additional design choices and computational burdens, hindering efficient deployment. We introduce Actor-Critic without Actor (ACA), a lightweight framework that eliminates the explicit actor network and instead generates actions directly from the gradient field of a noise-level critic. This design removes the algorithmic and computational overhead of actor training while keeping policy improvement tightly aligned with the critic's latest value estimates. Moreover, ACA retains the ability to capture diverse, multi-modal behaviors without relying on diffusion-based actors, combining simplicity with expressiveness. Through extensive experiments on standard online RL benchmarks, ACA achieves more favorable learning curves and competitive performance compared to both standard actor-critic and state-of-the-art diffusion-based methods, providing a simple yet powerful solution for online RL.
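
As a rough illustration of acting without an actor, the sketch below draws an action by Langevin-style ascent on a critic's gradient field starting from pure noise. The step count, step size, and noise scale are our assumptions, not details from the paper.

```python
import torch

def sample_action(critic, state, action_dim, n_steps=10, step=0.1, noise=0.05):
    """Draw an action by gradient ascent on the critic's gradient field,
    starting from random noise -- no actor network involved."""
    a = torch.randn(state.shape[0], action_dim, requires_grad=True)
    for _ in range(n_steps):
        q = critic(state, a).sum()               # scalar so autograd can run
        (grad,) = torch.autograd.grad(q, a)
        with torch.no_grad():                    # plain ascent + exploration noise
            a += step * grad + noise * torch.randn_like(a)
    return a.detach().clamp(-1.0, 1.0)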

new FORCE: Transferable Visual Jailbreaking Attacks via Feature Over-Reliance CorrEction

Authors: Runqi Lin, Alasdair Paren, Suqin Yuan, Muyang Li, Philip Torr, Adel Bibi, Tongliang Liu

Abstract: The integration of new modalities enhances the capabilities of multimodal large language models (MLLMs) but also introduces additional vulnerabilities. In particular, simple visual jailbreaking attacks can manipulate open-source MLLMs more readily than sophisticated textual attacks. However, these underdeveloped attacks exhibit extremely limited cross-model transferability, failing to reliably identify vulnerabilities in closed-source MLLMs. In this work, we analyse the loss landscape of these jailbreaking attacks and find that the generated attacks tend to reside in high-sharpness regions, whose effectiveness is highly sensitive to even minor parameter changes during transfer. To further explain the high-sharpness localisations, we analyse their feature representations in both the intermediate layers and the spectral domain, revealing an improper reliance on narrow layer representations and semantically poor frequency components. Building on this, we propose a Feature Over-Reliance CorrEction (FORCE) method, which guides the attack to explore broader feasible regions across layer features and rescales the influence of frequency features according to their semantic content. By eliminating non-generalizable reliance on both layer and spectral features, our method discovers flattened feasible regions for visual jailbreaking attacks, thereby improving cross-model transferability. Extensive experiments demonstrate that our approach effectively facilitates visual red-teaming evaluations against closed-source MLLMs.

new Reinforcement Learning Fine-Tuning Enhances Activation Intensity and Diversity in the Internal Circuitry of LLMs

Authors: Honglin Zhang, Qianyue Hao, Fengli Xu, Yong Li

Abstract: Large language models (LLMs) acquire extensive prior knowledge through large-scale pretraining and can be further enhanced via supervised fine-tuning (SFT) or reinforcement learning (RL)-based post-training. A growing body of evidence has shown that RL fine-tuning improves the capability of LLMs beyond what SFT alone achieves. However, the mechanisms by which RL fine-tuning enhances the capabilities of LLMs with distinct intrinsic characteristics remain underexplored. In this study, we draw inspiration from prior work on edge attribution patching (EAP) to investigate the internal differences of LLMs before and after RL fine-tuning. Our analysis across multiple model families shows two robust effects of online RL post-training: (i) an overall increase in activation intensity, indicating that more internal pathways are engaged and their signals become stronger, and (ii) greater diversity in activation patterns, reflected by higher entropy and less concentrated edge distributions. These changes suggest that RL reshapes information flow to be both more redundant and more flexible, which may explain its advantage in generalization. Notably, models fine-tuned with Direct Preference Optimization (DPO) deviate from these trends, exhibiting substantially weaker or inconsistent internal changes compared to PPO- and GRPO-based training. Together, our findings provide a unified view of how RL fine-tuning systematically alters the internal circuitry of LLMs and highlight the methodological distinctions between online RL and preference-based approaches. Our code is open source at https://anonymous.4open.science/r/llm_rl_probing_analysis-F673.

URLs: https://anonymous.4open.science/r/llm_rl_probing_analysis-F673.

new Physics of Learning: A Lagrangian perspective to different learning paradigms

Authors: Siyuan Guo, Bernhard Sch\"olkopf

Abstract: We study the problem of building an efficient learning system. An efficient learner processes information in the least time, i.e., it reaches a desired error threshold with the fewest observations. Building upon least-action principles from physics, we derive classic learning algorithms, Bellman's optimality equation in reinforcement learning, and the Adam optimizer in generative models from first principles, i.e., from the Learning $\textit{Lagrangian}$. We postulate that learning searches for stationary paths of the Lagrangian, and that learning algorithms are derivable by seeking these stationary trajectories.
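
For readers unfamiliar with least-action principles, the generic stationary-path condition is stated below; the specific form of the Learning Lagrangian is not given in the abstract, so only the general statement from classical mechanics is shown.

```latex
% Stationary-path condition for a generic Lagrangian L(w, \dot{w}, t)
% over a learning trajectory w(t); the paper's specific Lagrangian is
% not stated in the abstract.
\delta \int_0^T \mathcal{L}\bigl(w(t), \dot{w}(t), t\bigr)\,dt = 0
\;\Longrightarrow\;
\frac{d}{dt}\frac{\partial \mathcal{L}}{\partial \dot{w}}
- \frac{\partial \mathcal{L}}{\partial w} = 0.
```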

new GeoRef: Referring Expressions in Geometry via Task Formulation, Synthetic Supervision, and Reinforced MLLM-based Solutions

Authors: Bing Liu, Wenqiang Yv, Xuzheng Yang, Shichang Wang, Junzhuo Liu, Peng Wang, Guoqing Wang, Yang Yang, Heng Tao Shen

Abstract: AI-driven geometric problem solving is a complex vision-language task that requires accurate diagram interpretation, mathematical reasoning, and robust cross-modal grounding. A foundational yet underexplored capability for this task is the ability to identify and interpret geometric elements based on natural language queries. To address this, we introduce the task of Referring Expression Comprehension (REC) for geometric problems, which evaluates whether models can localize points, shapes, and spatial relations in diagrams in response to textual prompts. We present GeoRef, a benchmark dataset constructed from existing geometric problem corpora, featuring diverse, high-quality annotations and queries. Due to the lack of annotated data for this task, we generate a large-scale synthetic training dataset using a structured geometric formal language, enabling broad coverage of geometric concepts and facilitating model adaptation. We explore two fine-tuning approaches: Supervised Fine-Tuning (SFT) and Group Relative Policy Optimization (GRPO). Our results show that GRPO significantly outperforms SFT by better aligning model behavior with task-specific rewards. Furthermore, we propose a verify-and-regenerate mechanism that detects incorrect predictions and re-infers answers using contextual reasoning history, further boosting accuracy. Notably, even state-of-the-art Multimodal Large Language Models (MLLMs) struggle with this task, underscoring the necessity of explicitly evaluating and strengthening geometric grounding as a prerequisite for robust geometric problem solving. Moreover, models trained on GeoRef demonstrate measurable improvements on downstream geometric reasoning tasks, highlighting the broader value of REC as a foundation for multimodal mathematical understanding.

new SPREAD: Sampling-based Pareto front Refinement via Efficient Adaptive Diffusion

Authors: Sedjro Salomon Hotegni, Sebastian Peitz

Abstract: Developing efficient multi-objective optimization methods to compute the Pareto set of optimal compromises between conflicting objectives remains a key challenge, especially for large-scale and expensive problems. To bridge this gap, we introduce SPREAD, a generative framework based on Denoising Diffusion Probabilistic Models (DDPMs). SPREAD first learns a conditional diffusion process over points sampled from the decision space and then, at each reverse diffusion step, refines candidates via a sampling scheme that uses an adaptive multiple gradient descent-inspired update for fast convergence alongside a Gaussian RBF-based repulsion term for diversity. Empirical results on multi-objective optimization benchmarks, including offline and Bayesian surrogate-based settings, show that SPREAD matches or exceeds leading baselines in efficiency, scalability, and Pareto front coverage.
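
A minimal sketch of one refinement step, under stated assumptions: the abstract specifies an adaptive multiple-gradient-descent (MGD) update plus a Gaussian-RBF repulsion term, but not the adaptive weights, so the sketch substitutes an equal-weight average of per-objective gradients and a fixed bandwidth.

```python
import numpy as np

def refine(x, objective_grads, lr=0.05, bandwidth=0.5):
    """One refinement step after a reverse-diffusion update: a common
    descent move built from the objective gradients plus a Gaussian-RBF
    repulsion that spreads candidates along the front.
    x: (n, d) candidates; objective_grads: (m, n, d), one field per objective."""
    descent = objective_grads.mean(axis=0)         # equal-weight stand-in for MGD
    diff = x[:, None, :] - x[None, :, :]           # (n, n, d) pairwise offsets
    w = np.exp(-np.sum(diff**2, axis=-1, keepdims=True) / bandwidth**2)
    repulsion = (w * diff).sum(axis=1)             # push away from close neighbours
    return x - lr * descent + lr * repulsion
```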

new Structure-Attribute Transformations with Markov Chain Boost Graph Domain Adaptation

Authors: Zhen Liu, Yongtao Zhang, Shaobo Ren, Yuxin You

Abstract: Graph domain adaptation has gained significant attention in label-scarce scenarios across different graph domains. Traditional approaches to graph domain adaptation primarily focus on transforming node attributes over raw graph structures and aligning the distributions of the transformed node features across networks. However, these methods often struggle with the underlying structural heterogeneity between distinct graph domains, which leads to suboptimal distribution alignment. To address this limitation, we propose Structure-Attribute Transformation with Markov Chain (SATMC), a novel framework that sequentially aligns distributions across networks via both graph structure and attribute transformations. To mitigate the negative influence of domain-private information and further enhance the model's generalization, SATMC introduces a private domain information reduction mechanism and an empirical Wasserstein distance. Theoretical analysis suggests that SATMC can achieve a tighter error bound for cross-network node classification than existing graph domain adaptation methods. Extensive experiments on nine pairs of publicly available cross-domain datasets show that SATMC outperforms state-of-the-art methods in the cross-network node classification task. The code is available at https://github.com/GiantZhangYT/SATMC.

URLs: https://github.com/GiantZhangYT/SATMC.

new ScaleDiff: Scaling Difficult Problems for Advanced Mathematical Reasoning

Authors: Qizhi Pei, Zhuoshi Pan, Honglin Lin, Xin Gao, Yu Li, Zinan Tang, Conghui He, Rui Yan, Lijun Wu

Abstract: Large Reasoning Models (LRMs) have shown impressive capabilities in complex problem-solving, often benefiting from training on difficult mathematical problems that stimulate intricate reasoning. Recent efforts have explored automated synthesis of mathematical problems by prompting proprietary models or large-scale open-source models from seed data or inherent mathematical concepts. However, scaling up these methods remains challenging due to their high computational/API cost, the complexity of prompting, and the limited difficulty of the generated problems. To overcome these limitations, we propose ScaleDiff, a simple yet effective pipeline designed to scale the creation of difficult problems. We efficiently identify difficult problems from existing datasets with only a single forward pass using an adaptive thinking model, which can perceive problem difficulty and automatically switch between "Thinking" and "NoThinking" modes. We then train a specialized difficult problem generator (DiffGen-8B) on this filtered difficult data, which can produce new difficult problems at scale, eliminating the need for complex, per-instance prompting and its associated high API costs. Fine-tuning Qwen2.5-Math-7B-Instruct on the ScaleDiff-Math dataset yields a substantial performance increase of 11.3% compared to the original dataset and achieves a 65.9% average accuracy on AIME'24, AIME'25, HMMT-Feb'25, BRUMO'25, and MATH500, outperforming recent strong LRMs like OpenThinker3. Notably, this performance is achieved using the cost-efficient Qwen3-8B model as a teacher, demonstrating that our pipeline can effectively transfer advanced reasoning capabilities without relying on larger, more expensive teacher models. Furthermore, we observe a clear scaling phenomenon in model performance on difficult benchmarks as the quantity of difficult problems increases. Code: https://github.com/QizhiPei/ScaleDiff.

URLs: https://github.com/QizhiPei/ScaleDiff.

new TyphoonMLA: A Mixed Naive-Absorb MLA Kernel For Shared Prefix

Authors: Ahmet Caner Y\"uz\"ug\"uler, Ahmet \c{C}elik, Jiawei Zhuang, Lukas Cavigelli

Abstract: Multi-Head Latent Attention (MLA) is a recent attention mechanism adopted in state-of-the-art LLMs such as DeepSeek-v3 and Kimi K2. Thanks to its novel formulation, MLA allows two functionally equivalent but computationally distinct kernel implementations: naive and absorb. While the naive kernels (e.g., FlashAttention) are typically preferred in training and prefill for their computational efficiency, existing decoding kernels (e.g., FlashMLA) rely on the absorb method to minimize HBM bandwidth usage. However, the compute-bound nature of the absorb implementations prohibits performance benefits from data reuse opportunities in attention calculations, such as shared prefixes. In this work, we introduce TyphoonMLA, a hybrid approach that combines the naive and absorb formulations to harness the strengths of both. TyphoonMLA effectively leverages the shared prefix by applying the naive formulation to the compute-bound parts of attention calculations, while reducing the bandwidth requirements for non-shared parts by using the absorb formulation. As a result, TyphoonMLA improves the throughput of attention calculations in MLA architectures by up to 3x on NPUs and 3.24x on GPUs, with only a 3% overhead in HBM size.

new GraphUniverse: Enabling Systematic Evaluation of Inductive Generalization

Authors: Louis Van Langendonck, Guillermo Bern\'ardez, Nina Miolane, Pere Barlet-Ros

Abstract: A fundamental challenge in graph learning is understanding how models generalize to new, unseen graphs. While synthetic benchmarks offer controlled settings for analysis, existing approaches are confined to single-graph, transductive settings where models train and test on the same graph structure. Addressing this gap, we introduce GraphUniverse, a framework for generating entire families of graphs to enable the first systematic evaluation of inductive generalization at scale. Our core innovation is the generation of graphs with persistent semantic communities, ensuring conceptual consistency while allowing fine-grained control over structural properties like homophily and degree distributions. This enables crucial but underexplored robustness tests, such as performance under controlled distribution shifts. Benchmarking a wide range of architectures -- from GNNs to graph transformers and topological architectures -- reveals that strong transductive performance is a poor predictor of inductive generalization. Furthermore, we find that robustness to distribution shift is highly sensitive not only to model architecture choice but also to the initial graph regime (e.g., high vs. low homophily). Beyond benchmarking, GraphUniverse's flexibility and scalability can facilitate the development of robust and truly generalizable architectures -- including next-generation graph foundation models. An interactive demo is available at https://graphuniverse.streamlit.app.

URLs: https://graphuniverse.streamlit.app.

new Teaching RL Agents to Act Better: VLM as Action Advisor for Online Reinforcement Learning

Authors: Xiefeng Wu, Jing Zhao, Shu Zhang, Mingyu Hu

Abstract: Online reinforcement learning in complex tasks is time-consuming, as massive numbers of interaction steps are needed to learn the optimal Q-function. Vision-language action (VLA) policies represent a promising direction for solving diverse tasks; however, their performance on low-level control remains limited, and effective deployment often requires task-specific expert demonstrations for fine-tuning. In this paper, we propose \textbf{VARL} (\textbf{V}LM as \textbf{A}ction advisor for online \textbf{R}einforcement \textbf{L}earning), a framework that leverages the domain knowledge of vision-language models (VLMs) to provide action suggestions for reinforcement learning agents. Unlike previous methods, VARL provides action suggestions rather than designing heuristic rewards, thereby guaranteeing unchanged optimality and convergence. The suggested actions increase sample diversity and ultimately improve sample efficiency, especially in sparse-reward tasks. To validate the effectiveness of VARL, we evaluate it across diverse environments and agent settings. Results show that VARL greatly improves sample efficiency without introducing significant computational overhead. These advantages make VARL a general framework for online reinforcement learning and make it feasible to directly apply reinforcement learning from scratch in real-world environments.
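
The advisor mechanism can be pictured as a thin wrapper around action selection; the mixing probability and the `vlm_advisor.suggest` interface below are hypothetical, not the paper's API.

```python
import random

def select_action(state, policy, vlm_advisor, advise_prob=0.2):
    """With probability advise_prob, follow the VLM's suggestion instead
    of the agent's own exploratory action. Only the behaviour policy's
    exploration changes, so the RL update itself is untouched."""
    if random.random() < advise_prob:
        return vlm_advisor.suggest(state)   # domain-knowledge proposal
    return policy.sample(state)             # usual exploratory action
```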

new EvoMail: Self-Evolving Cognitive Agents for Adaptive Spam and Phishing Email Defense

Authors: Wei Huang, De-Tian Chu, Lin-Yuan Bai, Wei Kang, Hai-Tao Zhang, Bo Li, Zhi-Mo Han, Jing Ge, Hai-Feng Lin

Abstract: Modern email spam and phishing attacks have evolved far beyond keyword blacklists or simple heuristics. Adversaries now craft multi-modal campaigns that combine natural-language text with obfuscated URLs, forged headers, and malicious attachments, adapting their strategies within days to bypass filters. Traditional spam detection systems, which rely on static rules or single-modality models, struggle to integrate heterogeneous signals or to continuously adapt, leading to rapid performance degradation. We propose EvoMail, a self-evolving cognitive agent framework for robust detection of spam and phishing. EvoMail first constructs a unified heterogeneous email graph that fuses textual content, metadata (headers, senders, domains), and embedded resources (URLs, attachments). A Cognitive Graph Neural Network enhanced by a Large Language Model (LLM) performs context-aware reasoning across these sources to identify coordinated spam campaigns. Most critically, EvoMail engages in an adversarial self-evolution loop: a ''red-team'' agent generates novel evasion tactics -- such as character obfuscation or AI-generated phishing text -- while the ''blue-team'' detector learns from failures, compresses experiences into a memory module, and reuses them for future reasoning. Extensive experiments on real-world datasets (Enron-Spam, Ling-Spam, SpamAssassin, and TREC) and synthetic adversarial variants demonstrate that EvoMail consistently outperforms state-of-the-art baselines in detection accuracy, adaptability to evolving spam tactics, and interpretability of reasoning traces. These results highlight EvoMail's potential as a resilient and explainable defense framework against next-generation spam and phishing threats.

new Sparse Representations Improve Adversarial Robustness of Neural Network Classifiers

Authors: Killian Steunou, Sigurd Saue, Th\'eo Druilhe

Abstract: Deep neural networks perform remarkably well on image classification tasks but remain vulnerable to carefully crafted adversarial perturbations. This work revisits linear dimensionality reduction as a simple, data-adapted defense. We empirically compare standard Principal Component Analysis (PCA) with its sparse variant (SPCA) as front-end feature extractors for downstream classifiers, and we complement these experiments with a theoretical analysis. On the theory side, we derive exact robustness certificates for linear heads applied to SPCA features: for both $\ell_\infty$ and $\ell_2$ threat models (binary and multiclass), the certified radius grows as the dual norms of $W^\top u$ shrink, where $W$ is the projection and $u$ the head weights. We further show that for general (non-linear) heads, sparsity reduces operator-norm bounds through a Lipschitz composition argument, predicting lower input sensitivity. Empirically, with a small non-linear network after the projection, SPCA consistently degrades more gracefully than PCA under strong white-box and black-box attacks while maintaining competitive clean accuracy. Taken together, the theory identifies the mechanism (sparser projections reduce adversarial leverage) and the experiments verify that this benefit persists beyond the linear setting. Our code is available at https://github.com/killian31/SPCARobustness.

URLs: https://github.com/killian31/SPCARobustness.
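
The linear-head certificate described in the abstract is simple enough to state in code. Under our reading of the abstract's convention (features $z = Wx$, head weights $u$), the effective weight on raw inputs is $v = W^\top u$, and no $\ell_p$ perturbation of radius smaller than $|{\rm score}|/\lVert v\rVert_q$ (with $1/p + 1/q = 1$) can flip the sign of the score.

```python
import numpy as np

def certified_radius(W, u, b, x, p=np.inf):
    """Robustness certificate for a binary linear head on projected
    features z = W @ x: the certified radius under an l_p threat model
    is |score| / ||W.T @ u||_q, with q the dual exponent of p."""
    v = W.T @ u                          # effective weight on raw inputs
    score = float(v @ x + b)             # signed margin of the head
    q = 1.0 if p == np.inf else (np.inf if p == 1 else p / (p - 1.0))
    return abs(score) / np.linalg.norm(v, ord=q)
```

Sparser projections shrink the dual norm of `v`, which is exactly the mechanism the abstract credits for SPCA's improved robustness.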

new LAVA: Explainability for Unsupervised Latent Embeddings

Authors: Ivan Stresec, Joana P. Gon\c{c}alves

Abstract: Unsupervised black-box models can be drivers of scientific discovery, but remain difficult to interpret. Crucially, discovery hinges on understanding the model output, which is often a multi-dimensional latent embedding rather than a well-defined target. While explainability for supervised learning usually seeks to uncover how input features are used to predict a target, its unsupervised counterpart should relate input features to the structure of the learned latent space. Adaptations of supervised model explainability for unsupervised learning provide either single-sample or dataset-wide summary explanations. However, without automated strategies of relating similar samples to one another guided by their latent proximity, explanations remain either too fine-grained or too reductive to be meaningful. This is especially relevant for manifold learning methods that produce no mapping function, leaving us only with the relative spatial organization of their embeddings. We introduce Locality-Aware Variable Associations (LAVA), a post-hoc model-agnostic method designed to explain local embedding organization through its relationship with the input features. To achieve this, LAVA represents the latent space as a series of localities (neighborhoods) described in terms of correlations between the original features, and then reveals reoccurring patterns of correlations across the entire latent space. Based on UMAP embeddings of MNIST and a single-cell kidney dataset, we show that LAVA captures relevant feature associations, with visually and biologically relevant local patterns shared among seemingly distant regions of the latent spaces.

new CAD-Tokenizer: Towards Text-based CAD Prototyping via Modality-Specific Tokenization

Authors: Ruiyu Wang, Shizhao Sun, Weijian Ma, Jiang Bian

Abstract: Computer-Aided Design (CAD) is a foundational component of industrial prototyping, where models are defined not by raw coordinates but by construction sequences such as sketches and extrusions. This sequential structure enables both efficient prototype initialization and subsequent editing. Text-guided CAD prototyping, which unifies Text-to-CAD generation and CAD editing, has the potential to streamline the entire design pipeline. However, prior work has not explored this setting, largely because standard large language model (LLM) tokenizers decompose CAD sequences into natural-language word pieces, failing to capture primitive-level CAD semantics and hindering attention modules from modeling geometric structure. We conjecture that a multimodal tokenization strategy, aligned with CAD's primitive and structural nature, can provide more effective representations. To this end, we propose CAD-Tokenizer, a framework that represents CAD data with modality-specific tokens using a sequence-based VQ-VAE with primitive-level pooling and constrained decoding. This design produces compact, primitive-aware representations that align with CAD's structural nature. Applied to unified text-guided CAD prototyping, CAD-Tokenizer significantly improves instruction following and generation quality, achieving better quantitative and qualitative performance over both general-purpose LLMs and task-specific baselines.

new GRPO is Secretly a Process Reward Model

Authors: Michael Sullivan

Abstract: We prove theoretically that the GRPO RL algorithm induces a non-trivial process reward model (PRM), under certain assumptions regarding within-group overlap of token sequences across completions. We then show empirically that these assumptions are met under real-world conditions: GRPO does in fact induce a non-trivial PRM. Leveraging the framework of GRPO-as-a-PRM, we identify a flaw in the GRPO objective: non-uniformly distributed process steps hinder both exploration and exploitation (under different conditions). We propose a simple modification to the algorithm to mitigate this defect ($\lambda$-GRPO), and show that LLMs trained with $\lambda$-GRPO achieve higher validation accuracy and performance on downstream reasoning tasks, and reach peak performance more rapidly, than LLMs trained with standard GRPO. Our results call into question the advantage of costly, explicitly-defined PRMs for GRPO: we show that it is possible to instead leverage the hidden, built-in PRM structure within the vanilla GRPO algorithm to boost model performance with a negligible impact on training time and cost.

new DATS: Distance-Aware Temperature Scaling for Calibrated Class-Incremental Learning

Authors: Giuseppe Serra, Florian Buettner

Abstract: Continual Learning (CL) has recently been gaining attention for its ability to enable a single model to learn incrementally from a sequence of new classes. In this scenario, it is important to keep consistent predictive performance across all the classes and prevent the so-called Catastrophic Forgetting (CF). However, in safety-critical applications, predictive performance alone is insufficient. Predictive models should also be able to reliably communicate their uncertainty in a calibrated manner - that is, with confidence scores aligned to the true frequencies of target events. Existing approaches in CL address calibration primarily from a data-centric perspective, relying on a single temperature shared across all tasks. Such solutions overlook task-specific differences, leading to large fluctuations in calibration error across tasks. For this reason, we argue that a more principled approach should adapt the temperature according to the distance to the current task. However, the unavailability of task information at test time/during deployment poses a major challenge to achieving this objective. To address it, we propose Distance-Aware Temperature Scaling (DATS), which combines prototype-based distance estimation with distance-aware calibration to infer task proximity and assign adaptive temperatures without prior task information. Through extensive empirical evaluation on both standard benchmarks and real-world, imbalanced datasets from the biomedical domain, our approach proves stable, reliable, and consistent in reducing calibration error across tasks compared to state-of-the-art approaches.
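
A minimal sketch of the idea, assuming a linear interpolation between a "near-task" and a "far-task" temperature; the paper's actual mapping from distance to temperature is not specified in the abstract.

```python
import torch
import torch.nn.functional as F

def dats_predict(logits, feats, prototypes, t_near=1.0, t_far=2.0):
    """Scale each sample's logits by a temperature that grows with its
    feature distance to the nearest task prototype, so inputs far from
    the current task get softer, better-calibrated confidences."""
    d = torch.cdist(feats, prototypes).min(dim=1).values   # nearest-prototype distance
    w = (d / (d.max() + 1e-8)).unsqueeze(1)                # 0 = near, 1 = far
    temperature = (1 - w) * t_near + w * t_far
    return F.softmax(logits / temperature, dim=1)
```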

new Mixture of Thoughts: Learning to Aggregate What Experts Think, Not Just What They Say

Authors: Jacob Fein-Ashley, Dhruv Parikh, Rajgopal Kannan, Viktor Prasanna

Abstract: Open-source Large Language Models (LLMs) increasingly specialize by domain (e.g., math, code, general reasoning), motivating systems that leverage complementary strengths across models. Prior multi-LLM approaches either (i) route a query to one or a few experts and generate independently, (ii) aggregate outputs from each model via costly multi-turn exchanges, or (iii) fuse weights into a single model, typically requiring architectural homogeneity. We introduce Mixture of Thoughts (MoT), a simple method for latent-level collaboration among heterogeneous experts under a global routing scheme. For each query, a lightweight router selects top-$K$ experts and designates a primary expert; uniformly placed interaction layers project hidden states into a shared latent space where the primary expert performs cross-attention over its active (selected) peers. Pre-trained experts remain frozen; only the router and the lightweight interaction layers are trained with a novel joint training objective that improves both the expert selection and inter-expert collaboration. Across five in-distribution (ID) and three out-of-distribution (OOD) benchmarks, MoT surpasses the current routing and aggregation-based state-of-the-art, Avengers, by $+0.38\%$ and $+2.92\%$, respectively. Further, MoT significantly outperforms the best-performing single model. It achieves this with single-pass inference, runtime comparable to routing baselines, and none of the overheads of iterative aggregation. MoT offers a simple latent-space mechanism for combining heterogeneous LLMs, a practical step toward broader multi-LLM collaboration. Our code is publicly available at https://github.com/jacobfa/mot.

URLs: https://github.com/jacobfa/mot.
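
One way to picture the interaction layers: the primary expert's hidden states query a shared latent space holding the peers' states. The dimensions, head count, and residual placement below are assumptions for illustration, not the repository's implementation.

```python
import torch
import torch.nn as nn

class InteractionLayer(nn.Module):
    """Sketch of an MoT-style interaction layer: project the frozen
    experts' hidden states into a shared latent space and let the
    primary expert cross-attend over its selected peers."""
    def __init__(self, d_primary, d_peer, d_latent, n_heads=8):
        super().__init__()
        self.q = nn.Linear(d_primary, d_latent)
        self.kv = nn.Linear(d_peer, d_latent)
        self.attn = nn.MultiheadAttention(d_latent, n_heads, batch_first=True)
        self.out = nn.Linear(d_latent, d_primary)

    def forward(self, h_primary, h_peers):
        q = self.q(h_primary)              # (B, T, d_latent)
        kv = self.kv(h_peers)              # (B, T_peer, d_latent)
        mixed, _ = self.attn(q, kv, kv)    # primary attends over peers
        return h_primary + self.out(mixed)  # residual update
```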

new A Unified Framework for Diffusion Model Unlearning with f-Divergence

Authors: Nicola Novello, Federico Fontana, Luigi Cinque, Deniz Gunduz, Andrea M. Tonello

Abstract: Machine unlearning aims to remove specific knowledge from a trained model. While diffusion models (DMs) have shown remarkable generative capabilities, existing unlearning methods for text-to-image (T2I) models often rely on minimizing the mean squared error (MSE) between the output distribution of a target and an anchor concept. We show that this MSE-based approach is a special case of a unified $f$-divergence-based framework, in which any $f$-divergence can be utilized. We analyze the benefits of using different $f$-divergences, which mainly affect the convergence properties of the algorithm and the quality of unlearning. The proposed unified framework offers a flexible paradigm that allows selecting the optimal divergence for a specific application, balancing the trade-offs between aggressive unlearning and concept preservation.
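
For reference, the family over which the framework unifies unlearning objectives is the standard $f$-divergence; per the abstract, the MSE objective of prior work arises as one particular choice within this family.

```latex
% Standard f-divergence between distributions P (density p) and
% Q (density q), with f convex and f(1) = 0.
D_f(P \,\|\, Q) \;=\; \int q(x)\, f\!\left(\frac{p(x)}{q(x)}\right) dx,
\qquad f \ \text{convex},\ f(1) = 0.
```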

new Inverse Reinforcement Learning Using Just Classification and a Few Regressions

Authors: Lars van der Laan, Nathan Kallus, Aur\'elien Bibaut

Abstract: Inverse reinforcement learning (IRL) aims to explain observed behavior by uncovering an underlying reward. In the maximum-entropy or Gumbel-shocks-to-reward frameworks, this amounts to fitting a reward function and a soft value function that together satisfy the soft Bellman consistency condition and maximize the likelihood of observed actions. While this perspective has had enormous impact in imitation learning for robotics and understanding dynamic choices in economics, practical learning algorithms often involve delicate inner-loop optimization, repeated dynamic programming, or adversarial training, all of which complicate the use of modern, highly expressive function approximators like neural nets and boosting. We revisit softmax IRL and show that the population maximum-likelihood solution is characterized by a linear fixed-point equation involving the behavior policy. This observation reduces IRL to two off-the-shelf supervised learning problems: probabilistic classification to estimate the behavior policy, and iterative regression to solve the fixed point. The resulting method is simple and modular across function approximation classes and algorithms. We provide a precise characterization of the optimal solution, a generic oracle-based algorithm, finite-sample error bounds, and empirical results showing competitive or superior performance to MaxEnt IRL.
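
A tabular sketch of the two-step recipe, under an identification assumption of our own choosing (anchoring the reward of a reference action to zero), which makes the value fixed point linear in the estimated behavior policy via the standard soft-Bellman identities; the paper's population-level equation may differ in detail.

```python
import numpy as np

def soft_irl_tabular(pi_b, P, gamma=0.95, a0=0, n_iters=500):
    """Classify-then-regress in tabular form. pi_b (S, A) is the behavior
    policy, e.g. fit by probabilistic classification; P is the (S, A, S)
    transition tensor. Rewards are identified only up to normalization,
    so we anchor r(s, a0) = 0 (an illustrative choice). The soft-Bellman
    identity log pi_b(a|s) = r(s,a) + gamma*E[V(s')] - V(s) then yields a
    linear fixed point for V; with function approximation, each update
    below becomes a regression."""
    log_pib = np.log(np.clip(pi_b, 1e-12, 1.0))
    V = np.zeros(pi_b.shape[0])
    for _ in range(n_iters):
        # From r(s, a0) = 0:  V(s) = gamma * E[V' | s, a0] - log pi_b(a0 | s)
        V = gamma * P[:, a0, :] @ V - log_pib[:, a0]
    r = log_pib + V[:, None] - gamma * np.einsum("sap,p->sa", P, V)
    return r, V
```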

new Closed-form $\ell_r$ norm scaling with data for overparameterized linear regression and diagonal linear networks under $\ell_p$ bias

Authors: Shuofeng Zhang, Ard Louis

Abstract: For overparameterized linear regression with isotropic Gaussian design and minimum-$\ell_p$ interpolator, $p\in(1,2]$, we give a unified, high-probability characterization for the scaling of the family of parameter norms $\{\lVert \widehat{w_p} \rVert_r\}_{r \in [1,p]}$ with sample size. We solve this basic but unresolved question through a simple dual-ray analysis, which reveals a competition between a signal *spike* and a *bulk* of null coordinates in $X^\top Y$, yielding closed-form predictions for (i) a data-dependent transition $n_\star$ (the "elbow"), and (ii) a universal threshold $r_\star=2(p-1)$ that separates the $\lVert \widehat{w_p} \rVert_r$'s which plateau from those that continue to grow with an explicit exponent. This unified solution resolves the scaling of *all* $\ell_r$ norms within the family $r\in [1,p]$ under $\ell_p$-biased interpolation, and explains in one picture which norms saturate and which increase as $n$ grows. We then study diagonal linear networks (DLNs) trained by gradient descent. By calibrating the initialization scale $\alpha$ to an effective $p_{\mathrm{eff}}(\alpha)$ via the DLN separable potential, we show empirically that DLNs inherit the same elbow/threshold laws, providing a predictive bridge between explicit and implicit bias. Given that many generalization proxies depend on $\lVert \widehat{w_p} \rVert_r$, our results suggest that their predictive power will depend sensitively on which $\ell_r$ norm is used.

new Towards Foundation Models for Zero-Shot Time Series Anomaly Detection: Leveraging Synthetic Data and Relative Context Discrepancy

Authors: Tian Lan, Hao Duong Le, Jinbo Li, Wenjun He, Meng Wang, Chenghao Liu, Chen Zhang

Abstract: Time series anomaly detection (TSAD) is a critical task, but developing models that generalize to unseen data in a zero-shot manner remains a major challenge. Prevailing foundation models for TSAD predominantly rely on reconstruction-based objectives, which suffer from a fundamental objective mismatch: they struggle to identify subtle anomalies while often misinterpreting complex normal patterns, leading to high rates of false negatives and positives. To overcome these limitations, we introduce \texttt{TimeRCD}, a novel foundation model for TSAD built upon a new pre-training paradigm: Relative Context Discrepancy (RCD). Instead of learning to reconstruct inputs, \texttt{TimeRCD} is explicitly trained to identify anomalies by detecting significant discrepancies between adjacent time windows. This relational approach, implemented with a standard Transformer architecture, enables the model to capture contextual shifts indicative of anomalies that reconstruction-based methods often miss. To facilitate this paradigm, we develop a large-scale, diverse synthetic corpus with token-level anomaly labels, providing the rich supervisory signal necessary for effective pre-training. Extensive experiments demonstrate that \texttt{TimeRCD} significantly outperforms existing general-purpose and anomaly-specific foundation models in zero-shot TSAD across diverse datasets. Our results validate the superiority of the RCD paradigm and establish a new, effective path toward building robust and generalizable foundation models for time series anomaly detection.

new Differential-Integral Neural Operator for Long-Term Turbulence Forecasting

Authors: Hao Wu, Yuan Gao, Fan Xu, Fan Zhang, Qingsong Wen, Kun Wang, Xiaomeng Huang, Xian Wu

Abstract: Accurately forecasting the long-term evolution of turbulence represents a grand challenge in scientific computing and is crucial for applications ranging from climate modeling to aerospace engineering. Existing deep learning methods, particularly neural operators, often fail in long-term autoregressive predictions, suffering from catastrophic error accumulation and a loss of physical fidelity. This failure stems from their inability to simultaneously capture the distinct mathematical structures that govern turbulent dynamics: local, dissipative effects and global, non-local interactions. In this paper, we propose the Differential-Integral Neural Operator (DINO), a novel framework designed from a first-principles approach of operator decomposition. DINO explicitly models the turbulent evolution through parallel branches that learn distinct physical operators: a local differential operator, realized by a constrained convolutional network that provably converges to a derivative, and a global integral operator, captured by a Transformer architecture that learns a data-driven global kernel. This physics-based decomposition endows DINO with exceptional stability and robustness. Through extensive experiments on the challenging 2D Kolmogorov flow benchmark, we demonstrate that DINO significantly outperforms state-of-the-art models in long-term forecasting. It successfully suppresses error accumulation over hundreds of timesteps, maintains high fidelity in both the vorticity fields and energy spectra, and establishes a new benchmark for physically consistent, long-range turbulence forecasting.

new From Physics to Machine Learning and Back: Part II - Learning and Observational Bias in PHM

Authors: Olga Fink, Ismail Nejjar, Vinay Sharma, Keivan Faghih Niresi, Han Sun, Hao Dong, Chenghao Xu, Amaury Wei, Arthur Bizzi, Raffael Theiler, Yuan Tian, Leandro Von Krannichfeldt, Zhan Ma, Sergei Garmaev, Zepeng Zhang, Mengjie Zhao

Abstract: Prognostics and Health Management ensures the reliability, safety, and efficiency of complex engineered systems by enabling fault detection, anticipating equipment failures, and optimizing maintenance activities throughout an asset's lifecycle. However, real-world PHM presents persistent challenges: sensor data is often noisy or incomplete, available labels are limited, and degradation behaviors and system interdependencies can be highly complex and nonlinear. Physics-informed machine learning has emerged as a promising approach to address these limitations by embedding physical knowledge into data-driven models. This review examines how incorporating learning and observational biases through physics-informed modeling and data strategies can guide models toward physically consistent and reliable predictions. Learning biases embed physical constraints into model training through physics-informed loss functions and governing equations, or by incorporating properties like monotonicity. Observational biases influence data selection and synthesis to ensure models capture realistic system behavior through virtual sensing for estimating unmeasured states, physics-based simulation for data augmentation, and multi-sensor fusion strategies. The review then examines how these approaches enable the transition from passive prediction to active decision-making through reinforcement learning, which allows agents to learn maintenance policies that respect physical constraints while optimizing operational objectives. This closes the loop between model-based predictions, simulation, and actual system operation, empowering adaptive decision-making. Finally, the review addresses the critical challenge of scaling PHM solutions from individual assets to fleet-wide deployment. Fast adaptation methods including meta-learning and few-shot learning are reviewed alongside domain generalization techniques ...

new Go With The Flow: Churn-Tolerant Decentralized Training of Large Language Models

Authors: Nikolay Blagoev, Bart Cox, J\'er\'emie Decouchant, Lydia Y. Chen

Abstract: Motivated by the emergence of large language models (LLMs) and the importance of democratizing their training, we propose GWTF, the first practical crash-tolerant decentralized training framework for LLMs. Unlike existing distributed and federated training frameworks, GWTF enables the efficient collaborative training of an LLM on heterogeneous clients that volunteer their resources. In addition, GWTF addresses node churn, i.e., clients joining or leaving the system at any time, and network instabilities, i.e., network links becoming unstable or unreliable. The core of GWTF is a novel decentralized flow algorithm that finds the most effective routing that maximizes the number of microbatches trained with the lowest possible delay. We extensively evaluate GWTF on GPT-like and LLaMa-like models and compare it against the prior art. Our results indicate that GWTF reduces the training time by up to 45% in realistic and challenging scenarios that involve heterogeneous client nodes distributed over 10 different geographic locations with a high node churn rate.

new AbideGym: Turning Static RL Worlds into Adaptive Challenges

Authors: Abi Aryan, Zac Liu, Aaron Childress

Abstract: Agents trained with reinforcement learning often develop brittle policies that fail when dynamics shift, a problem amplified by static benchmarks. AbideGym, a dynamic MiniGrid wrapper, introduces agent-aware perturbations and scalable complexity to enforce intra-episode adaptation. By exposing weaknesses in static policies and promoting resilience, AbideGym provides a modular, reproducible evaluation framework for advancing research in curriculum learning, continual learning, and robust generalization.

new Tree Search for LLM Agent Reinforcement Learning

Authors: Yuxiang Ji, Ziyu Ma, Yong Wang, Guanhua Chen, Xiangxiang Chu, Liaoni Wu

Abstract: Recent advances in reinforcement learning (RL) have significantly enhanced the agentic capabilities of large language models (LLMs). In long-term and multi-turn agent tasks, existing approaches driven solely by outcome rewards often suffer from the problem of sparse supervision. To address the challenge, we propose Tree-based Group Relative Policy Optimization (Tree-GRPO), a grouped agent RL method based on tree search, where each tree node represents a complete agent interaction step. By sharing common prefixes, the tree search sampling increases the number of rollouts achievable within a fixed budget of tokens or tool calls. Moreover, we find that the tree-structured trajectory naturally allows the construction of step-wise process supervised signals even using only the outcome reward. Based on this, Tree-GRPO estimates the grouped relative advantages at both intra-tree and inter-tree levels. Through theoretical analysis, we demonstrate that the objective of intra-tree level group relative policy optimization is equivalent to that of step-level direct preference learning. Experiments across 11 datasets and 3 types of QA tasks demonstrate the superiority of the proposed tree-based RL over the chain-based RL method.
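
The tree bookkeeping can be sketched compactly: leaves carry outcome rewards, each subtree is scored by the mean reward of its leaves, and siblings are compared against their group mean to obtain step-wise advantages. Details beyond the abstract (e.g., normalization) are our assumptions.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    """One agent interaction step; children extend this node's shared
    prefix, so its tokens/tool calls are sampled once and reused."""
    children: list = field(default_factory=list)
    reward: float = 0.0     # outcome reward, meaningful at leaves

def leaf_rewards(node):
    if not node.children:
        return [node.reward]
    return [r for child in node.children for r in leaf_rewards(child)]

def intra_tree_advantages(root, adv=None):
    """Step-wise advantages from outcome rewards alone: each child is
    scored by the mean reward of its leaves, minus the sibling-group
    mean as a baseline."""
    adv = {} if adv is None else adv
    if root.children:
        means = []
        for child in root.children:
            rs = leaf_rewards(child)
            means.append(sum(rs) / len(rs))
        baseline = sum(means) / len(means)
        for child, m in zip(root.children, means):
            adv[id(child)] = m - baseline
            intra_tree_advantages(child, adv)
    return adv
```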

new Explaining Fine-Tuned LLMs via Counterfactuals: A Knowledge Graph-Driven Framework

Authors: Yucheng Wang, Ziyang Chen, Md Faisal Kabir

Abstract: The widespread adoption of Low-Rank Adaptation (LoRA) has enabled large language models (LLMs) to acquire domain-specific knowledge with remarkable efficiency. However, understanding how such a fine-tuning mechanism alters a model's structural reasoning and semantic behavior remains an open challenge. This work introduces a novel framework that explains fine-tuned LLMs via counterfactuals grounded in knowledge graphs. Specifically, we construct BioToolKG, a domain-specific heterogeneous knowledge graph of bioinformatics tools, and design CFFTLLMExplainer, a counterfactual-based explainer for fine-tuned LLMs that learns soft masks over graph nodes and edges to generate minimal structural perturbations that induce maximum semantic divergence. Our method jointly optimizes structural sparsity and semantic divergence while enforcing interpretability-preserving constraints such as entropy regularization and edge smoothness. We apply this framework to a fine-tuned LLaMA-based LLM and reveal that counterfactual masking exposes the model's structural dependencies and aligns with LoRA-induced parameter shifts. This work provides new insights into the internal mechanisms of fine-tuned LLMs and highlights counterfactual graphs as a potential tool for interpretable AI.

new Federated Flow Matching

Authors: Zifan Wang, Anqi Dong, Mahmoud Selim, Michael M. Zavlanos, Karl H. Johansson

Abstract: Data today is decentralized, generated and stored across devices and institutions where privacy, ownership, and regulation prevent centralization. This motivates the need to train generative models locally on distributed data, without central aggregation. In this paper, we introduce Federated Flow Matching (FFM), a framework for training flow matching models under privacy constraints. Specifically, we first examine FFM-vanilla, where each client trains locally with independent source and target couplings, preserving privacy but yielding curved flows that slow inference. We then develop FFM-LOT, which employs local optimal transport couplings to improve straightness within each client but lacks global consistency under heterogeneous data. Finally, we propose FFM-GOT, a federated strategy based on the semi-dual formulation of optimal transport, where a shared global potential function coordinates couplings across clients. Experiments on synthetic and image datasets show that FFM enables privacy-preserving training while enhancing both the flow straightness and sample quality in federated settings, with performance comparable to the centralized baseline.
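
The FFM-vanilla client objective is plain conditional flow matching with independent couplings, which is easy to sketch for flat, vector-valued samples; `v_theta` is any network taking `(x_t, t)`, and only its gradients, never raw data, would leave the client.

```python
import torch

def client_fm_loss(v_theta, x1):
    """FFM-vanilla client step: regress a shared velocity field onto
    straight-line displacements between locally drawn noise and the
    client's own data. Assumes (batch, dim) samples; image tensors
    would need the time variable reshaped for broadcasting."""
    x0 = torch.randn_like(x1)           # independent local source coupling
    t = torch.rand(x1.shape[0], 1)      # uniform interpolation times
    xt = (1 - t) * x0 + t * x1          # linear probability path
    target = x1 - x0                    # its constant velocity
    return ((v_theta(xt, t) - target) ** 2).mean()
```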

new humancompatible.train: Implementing Optimization Algorithms for Stochastically-Constrained Stochastic Optimization Problems

Authors: Andrii Kliachkin, Jana Lep\v{s}ov\'a, Gilles Bareilles, Jakub Mare\v{c}ek

Abstract: There has been considerable interest in constrained training of deep neural networks (DNNs) recently for applications such as fairness and safety. Several toolkits have been proposed for this task, yet there is still no industry standard. We present humancompatible.train (https://github.com/humancompatible/train), an easily extendable PyTorch-based Python package for training DNNs with stochastic constraints. We implement multiple previously unimplemented algorithms for stochastically constrained stochastic optimization. We demonstrate the toolkit's use by comparing two algorithms on a deep learning task with fairness constraints.

URLs: https://github.com/humancompatible/train

new A Causality-Aware Spatiotemporal Model for Multi-Region and Multi-Pollutant Air Quality Forecasting

Authors: Junxin Lu, Shiliang Sun

Abstract: Air pollution, a pressing global problem, threatens public health, environmental sustainability, and climate stability. Achieving accurate and scalable forecasting across spatially distributed monitoring stations is challenging due to intricate multi-pollutant interactions, evolving meteorological conditions, and region-specific spatial heterogeneity. To address this challenge, we propose AirPCM, a novel deep spatiotemporal forecasting model that integrates multi-region, multi-pollutant dynamics with explicit meteorology-pollutant causality modeling. Unlike existing methods limited to single pollutants or localized regions, AirPCM employs a unified architecture to jointly capture cross-station spatial correlations, temporal auto-correlations, and meteorology-pollutant dynamic causality. This empowers fine-grained, interpretable multi-pollutant forecasting across varying geographic and temporal scales, including sudden pollution episodes. Extensive evaluations on multi-scale real-world datasets demonstrate that AirPCM consistently surpasses state-of-the-art baselines in both predictive accuracy and generalization capability. Moreover, the long-term forecasting capability of AirPCM provides actionable insights into future air quality trends and potential high-risk windows, offering timely support for evidence-based environmental governance and carbon mitigation planning.

new SuperOffload: Unleashing the Power of Large-Scale LLM Training on Superchips

Authors: Xinyu Lian, Masahiro Tanaka, Olatunji Ruwase, Minjia Zhang

Abstract: The emergence of Superchips represents a significant advancement in next-generation AI hardware. These Superchips employ a tightly coupled heterogeneous architecture that integrates GPU and CPU on the same package, which offers unprecedented computational power. However, there has been scant research investigating how LLM training benefits from this new architecture. In this work, for the first time, we study LLM training solutions based on offloading for Superchips. We observe important differences between Superchips and traditional loosely-coupled GPU-CPU architectures, which necessitate revisiting prevailing assumptions about offloading. Based on that, we present SuperOffload, a Superchip-centric offloading system that simultaneously uses the Hopper GPU, Grace CPU, and NVLink-C2C interconnect more efficiently. SuperOffload accomplishes this via a combination of techniques, such as adaptive weight offloading, bucketization repartitioning, Superchip-aware casting, speculative execution, and a highly optimized Adam optimizer for Grace CPUs. Our evaluation of SuperOffload on NVIDIA GH200 demonstrates up to 2.5x throughput improvement compared to state-of-the-art offloading-based systems, enabling training of models of up to 25B parameters on a single Superchip while achieving high training throughput. We also extend SuperOffload with ZeRO-style data parallelism and DeepSpeed-Ulysses sequence parallelism, enabling training of a 13B model with sequence lengths up to 1 million tokens on 8 GH200 while achieving 55% MFU.

new It's Not You, It's Clipping: A Soft Trust-Region via Probability Smoothing for LLM RL

Authors: Madeleine Dwyer, Adam Sobey, Adriane Chapman

Abstract: Training large language models (LLMs) with reinforcement learning (RL) methods such as PPO and GRPO commonly relies on ratio clipping to stabilise updates. While effective at preventing instability, clipping discards information and introduces gradient discontinuities. We propose Probability Smoothing Policy Optimisation (PSPO), which smooths the current policy's probabilities toward the old (behaviour) policy before computing the importance ratio, analogous to label smoothing. Unlike clipping, PSPO preserves the gradient signal, while interpolation toward the old policy creates a soft trust region that discourages large, destabilising updates, with formal guarantees. We instantiate PSPO within GRPO (GR-PSPO) and fine-tune Qwen2.5-0.5B and Qwen2.5-1.5B on GSM8K, evaluating on the GSM8K test set and on cross-dataset generalisation to SVAMP, ASDiv, and MATH-500. Relative to unclipped GRPO (single iteration; no data reuse, ratio always = 1), GR-PSPO achieves similar performance but improves reasoning, yielding clearer, more concise, and more logical responses. Compared to clipped GRPO, GR-PSPO substantially improves performance on both the 0.5B and 1.5B models, with a boost of over 20% on GSM8K (39.7% vs. 17.6% for 0.5B, 59.4% vs. 37.8% for 1.5B).
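
The core ratio change is essentially a one-liner; `alpha` below is a hypothetical smoothing weight, as the abstract does not state its value or schedule.

```python
import torch

def pspo_ratio(logp_new, logp_old, alpha=0.1):
    """Smooth the current policy's probabilities toward the behaviour
    policy before forming the importance ratio. Larger alpha pins the
    ratio nearer 1 (a soft trust region), yet the ratio stays smooth
    and differentiable everywhere, unlike a clipped ratio."""
    p_new, p_old = logp_new.exp(), logp_old.exp()
    return ((1 - alpha) * p_new + alpha * p_old) / p_old
```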

new Optimal Robust Recourse with $L^p$-Bounded Model Change

Authors: Phone Kyaw, Kshitij Kayastha, Shahin Jabbari

Abstract: Recourse provides individuals who received undesirable labels (e.g., denied a loan) from algorithmic decision-making systems with a minimum-cost improvement suggestion to achieve the desired outcome. However, in practice, models often get updated to reflect changes in the data distribution or environment, invalidating the recourse recommendations (i.e., following the recourse will not lead to the desirable outcome). The robust recourse literature addresses this issue by providing a framework for computing recourses whose validity is resilient to slight changes in the model. However, since the optimization problem of computing robust recourse is non-convex (even for linear models), most of the current approaches do not have any theoretical guarantee on the optimality of the recourse. Recent work by Kayastha et al. provides the first provably optimal algorithm for robust recourse with respect to generalized linear models when the model changes are measured using the $L^{\infty}$ norm. However, using the $L^{\infty}$ norm can lead to recourse solutions with a high price. To address this shortcoming, we consider more constrained model changes defined by the $L^p$ norm, where $p\geq 1$ but $p\neq \infty$, and provide a new algorithm that provably computes the optimal robust recourse for generalized linear models. Empirically, for both linear and non-linear models, we demonstrate that our algorithm achieves a significantly lower price of recourse (up to several orders of magnitude) compared to prior work and also exhibits a better trade-off between the implementation cost of recourse and its validity. Our empirical analysis also illustrates that our approach provides more sparse recourses compared to prior work and remains resilient to post-processing approaches that guarantee feasibility.

new No Prior, No Leakage: Revisiting Reconstruction Attacks in Trained Neural Networks

Authors: Yehonatan Refael, Guy Smorodinsky, Ofir Lindenbaum, Itay Safran

Abstract: The memorization of training data by neural networks raises pressing concerns for privacy and security. Recent work has shown that, under certain conditions, portions of the training set can be reconstructed directly from model parameters. Some of these methods exploit implicit bias toward margin maximization, suggesting that properties often regarded as beneficial for generalization may actually compromise privacy. Yet despite striking empirical demonstrations, the reliability of these attacks remains poorly understood and lacks a solid theoretical foundation. In this work, we take a complementary perspective: rather than designing stronger attacks, we analyze the inherent weaknesses and limitations of existing reconstruction methods and identify conditions under which they fail. We rigorously prove that, without incorporating prior knowledge about the data, there exist infinitely many alternative solutions that may lie arbitrarily far from the true training set, rendering reconstruction fundamentally unreliable. Empirically, we further demonstrate that exact duplication of training examples occurs only by chance. Our results refine the theoretical understanding of when training set leakage is possible and offer new insights into mitigating reconstruction attacks. Remarkably, we demonstrate that networks trained more extensively -- and therefore satisfying implicit bias conditions more strongly -- are, in fact, less susceptible to reconstruction attacks, reconciling privacy with the need for strong generalization in this setting.

cross Copycats: the many lives of a publicly available medical imaging dataset

Authors: Amelia Jim\'enez-S\'anchez, Natalia-Rozalia Avlona, Dovile Juodelyte, Th\'eo Sourget, Caroline Vang-Larsen, Anna Rogers, Hubert Dariusz Zaj\k{a}c, Veronika Cheplygina

Abstract: Medical Imaging (MI) datasets are fundamental to artificial intelligence in healthcare. The accuracy, robustness, and fairness of diagnostic algorithms depend on the data (and its quality) used to train and evaluate the models. MI datasets used to be proprietary, but have become increasingly available to the public, including on community-contributed platforms (CCPs) like Kaggle or HuggingFace. While open data is important to enhance the redistribution of data's public value, we find that the current CCP governance model fails to uphold the quality needed and recommended practices for sharing, documenting, and evaluating datasets. In this paper, we conduct an analysis of publicly available machine learning datasets on CCPs, discussing datasets' context, and identifying limitations and gaps in the current CCP landscape. We highlight differences between MI and computer vision datasets, particularly in the potentially harmful downstream effects from poor adoption of recommended dataset management practices. We compare the analyzed datasets across several dimensions, including data sharing, data documentation, and maintenance. We find vague licenses, lack of persistent identifiers and storage, duplicates, and missing metadata, with differences between the platforms. Our research contributes to efforts in responsible data curation and AI algorithms for healthcare.

cross An Analytical and AI-discovered Stable, Accurate, and Generalizable Subgrid-scale Closure for Geophysical Turbulence

Authors: Karan Jakhar, Yifei Guan, Pedram Hassanzadeh

Abstract: By combining AI and fluid physics, we discover a closed-form closure for 2D turbulence from small direct numerical simulation (DNS) data. Large-eddy simulation (LES) with this closure is accurate and stable, reproducing DNS statistics including those of extremes. We also show that the new closure could be derived from a 4th-order truncated Taylor expansion. Prior analytical and AI-based work only found the 2nd-order expansion, which led to unstable LES. The additional terms emerge only when inter-scale energy transfer is considered alongside the standard reconstruction criterion in the sparse-equation discovery.

cross Philosophy-informed Machine Learning

Authors: MZ Naser

Abstract: Philosophy-informed machine learning (PhIML) directly infuses core ideas from analytic philosophy into ML model architectures, objectives, and evaluation protocols. Therefore, PhIML promises new capabilities through models that respect philosophical concepts and values by design. From this lens, this paper reviews conceptual foundations to demonstrate philosophical gains and alignment. In addition, we present case studies on how ML users/designers can adopt PhIML as an agnostic post-hoc tool or intrinsically build it into ML model architectures. Finally, this paper sheds light on open technical barriers alongside philosophical, practical, and governance challenges and outlines a research roadmap toward safe, philosophy-aware, and ethically responsible PhIML.

cross Speaker Style-Aware Phoneme Anchoring for Improved Cross-Lingual Speech Emotion Recognition

Authors: Shreya G. Upadhyay, Carlos Busso, Chi-Chun Lee

Abstract: Cross-lingual speech emotion recognition (SER) remains a challenging task due to differences in phonetic variability and speaker-specific expressive styles across languages. Effectively capturing emotion under such diverse conditions requires a framework that can align the externalization of emotions across different speakers and languages. To address this problem, we propose a speaker-style aware phoneme anchoring framework that aligns emotional expression at the phonetic and speaker levels. Our method builds emotion-specific speaker communities via graph-based clustering to capture shared speaker traits. Using these groups, we apply dual-space anchoring in speaker and phonetic spaces to enable better emotion transfer across languages. Evaluations on the MSP-Podcast (English) and BIIC-Podcast (Taiwanese Mandarin) corpora demonstrate improved generalization over competitive baselines and provide valuable insights into the commonalities in cross-lingual emotion representation.

cross Leveraging NTPs for Efficient Hallucination Detection in VLMs

Authors: Ofir Azachi, Kfir Eliyahu, Eyal El Ani, Rom Himelstein, Roi Reichart, Yuval Pinter, Nitay Calderon

Abstract: Hallucinations of vision-language models (VLMs), which are misalignments between visual content and generated text, undermine the reliability of VLMs. One common approach for detecting them employs the same VLM, or a different one, to assess generated outputs. This process is computationally intensive and increases model latency. In this paper, we explore an efficient on-the-fly method for hallucination detection by training traditional ML models over signals based on the VLM's next-token probabilities (NTPs). NTPs provide a direct quantification of model uncertainty. We hypothesize that high uncertainty (i.e., a low NTP value) is strongly associated with hallucinations. To test this, we introduce a dataset of 1,400 human-annotated statements derived from VLM-generated content, each labeled as hallucinated or not, and use it to test our NTP-based lightweight method. Our results demonstrate that NTP-based features are valuable predictors of hallucinations, enabling fast and simple ML models to achieve performance comparable to that of strong VLMs. Furthermore, augmenting these NTPs with linguistic NTPs, computed by feeding only the generated text back into the VLM, enhances hallucination detection performance. Finally, integrating hallucination prediction scores from VLMs into the NTP-based models led to better performance than using either VLMs or NTPs alone. We hope this study paves the way for simple, lightweight solutions that enhance the reliability of VLMs.
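
As a concrete illustration of the abstract's core idea (not the authors' exact pipeline), the sketch below summarizes a statement's next-token probabilities into simple features and fits a lightweight classifier; the feature choices and toy data are assumptions for illustration.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def ntp_features(token_logprobs):
        # Summarize a statement's next-token probabilities into fixed-size features.
        p = np.exp(np.asarray(token_logprobs, dtype=float))
        return np.array([p.min(), p.mean(), p.std(), np.median(p), float(len(p))])

    # One feature row per VLM-generated statement; label 1 = hallucinated (toy data).
    traces = [[-0.1, -2.3, -0.4], [-0.05, -0.2, -0.1], [-3.0, -1.1, -0.9], [-0.2, -0.3, -0.1]]
    labels = np.array([1, 0, 1, 0])
    X = np.stack([ntp_features(t) for t in traces])

    clf = LogisticRegression(max_iter=1000).fit(X, labels)
    print(clf.predict_proba(X)[:, 1])   # hallucination scores in [0, 1]

The key design point is that the detector never re-runs a VLM at inference: it reuses probabilities the model already emits, so detection cost is near zero.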

cross Lightweight MobileNetV1+GRU for ECG Biometric Authentication: Federated and Adversarial Evaluation

Authors: Dilli Hang Rai, Sabin Kafley

Abstract: ECG biometrics offer a unique, secure authentication method, yet their deployment on wearable devices faces challenges in real-time processing, privacy, and spoofing vulnerability. This paper proposes a lightweight deep learning model (MobileNetV1+GRU) for ECG-based authentication, with 20 dB Gaussian noise injection and custom preprocessing. We simulate wearable conditions and edge deployment using the ECGID, MIT-BIH, CYBHi, and PTB datasets, achieving accuracies of 99.34%, 99.31%, 91.74%, and 98.49%; F1-scores of 0.9869, 0.9923, 0.9125, and 0.9771; precision of 0.9866, 0.9924, 0.9180, and 0.9845; recall of 0.9878, 0.9923, 0.9129, and 0.9756; equal error rates (EER) of 0.0009, 0.00013, 0.0091, and 0.0009; and ROC-AUC values of 0.9999, 0.9999, 0.9985, and 0.9998. Under FGSM adversarial attacks, accuracy drops from 96.82% to as low as 0.80%. This paper highlights federated learning, adversarial testing, and the need for diverse wearable physiological datasets to ensure secure and scalable biometrics.
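
A minimal PyTorch sketch of the MobileNet-style CNN + GRU combination named in the abstract; all layer sizes, window length, and subject count are illustrative assumptions, not the paper's exact architecture.

    import torch
    import torch.nn as nn

    class DepthwiseSeparable1d(nn.Module):
        # MobileNetV1 building block adapted to 1D ECG signals.
        def __init__(self, c_in, c_out):
            super().__init__()
            self.dw = nn.Conv1d(c_in, c_in, 3, padding=1, groups=c_in)
            self.pw = nn.Conv1d(c_in, c_out, 1)
            self.act = nn.ReLU()
        def forward(self, x):
            return self.act(self.pw(self.dw(x)))

    class ECGNet(nn.Module):
        def __init__(self, n_subjects):
            super().__init__()
            self.features = nn.Sequential(
                nn.Conv1d(1, 32, 3, padding=1), nn.ReLU(),
                DepthwiseSeparable1d(32, 64), nn.MaxPool1d(2),
                DepthwiseSeparable1d(64, 128), nn.MaxPool1d(2),
            )
            self.gru = nn.GRU(128, 64, batch_first=True)
            self.head = nn.Linear(64, n_subjects)
        def forward(self, x):                        # x: (batch, 1, samples)
            h = self.features(x).transpose(1, 2)     # (batch, time, channels)
            _, hn = self.gru(h)
            return self.head(hn[-1])                 # per-subject logits

    logits = ECGNet(n_subjects=90)(torch.randn(4, 1, 256))
    print(logits.shape)   # torch.Size([4, 90])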

cross A Comparative Analysis of Ensemble-Based Machine Learning Approaches with Explainable AI for Multi-Class Intrusion Detection in Drone Networks

Authors: Md. Alamgir Hossain, Waqas Ishtiaq, Md. Samiul Islam

Abstract: The growing integration of drones into civilian, commercial, and defense sectors introduces significant cybersecurity concerns, particularly with the increased risk of network-based intrusions targeting drone communication protocols. Detecting and classifying these intrusions is inherently challenging due to the dynamic nature of drone traffic and the presence of multiple sophisticated attack vectors such as spoofing, injection, replay, and man-in-the-middle (MITM) attacks. This research aims to develop a robust and interpretable intrusion detection framework tailored for drone networks, with a focus on handling multi-class classification and model explainability. We present a comparative analysis of ensemble-based machine learning models, namely Random Forest, Extra Trees, AdaBoost, CatBoost, and XGBoost, trained on a labeled dataset comprising benign traffic and nine distinct intrusion types. Comprehensive data preprocessing was performed, including missing value imputation, scaling, and categorical encoding, followed by model training and extensive evaluation using metrics such as macro F1-score, ROC AUC, Matthews Correlation Coefficient, and Log Loss. Random Forest achieved the highest performance with a macro F1-score of 0.9998 and ROC AUC of 1.0000. To validate the superiority of the models, statistical tests, including Friedman's test, the Wilcoxon signed-rank test with Holm correction, and bootstrapped confidence intervals, were applied. Furthermore, explainable AI methods, SHAP and LIME, were integrated to interpret both global and local feature importance, enhancing model transparency and decision trustworthiness. The proposed approach not only delivers near-perfect accuracy but also ensures interpretability, making it highly suitable for real-time and safety-critical drone operations.

cross The Secret Agenda: LLMs Strategically Lie and Our Current Safety Tools Are Blind

Authors: Caleb DeLeeuw, Gaurav Chawla, Aniket Sharma, Vanessa Dietze

Abstract: We investigate strategic deception in large language models using two complementary testbeds: Secret Agenda (across 38 models) and Insider Trading compliance (via SAE architectures). Secret Agenda reliably induced lying when deception advantaged goal achievement across all model families. Analysis revealed that autolabeled SAE features for "deception" rarely activated during strategic dishonesty, and feature steering experiments across 100+ deception-related features failed to prevent lying. Conversely, insider trading analysis using unlabeled SAE activations separated deceptive versus compliant responses through discriminative patterns in heatmaps and t-SNE visualizations. These findings suggest autolabel-driven interpretability approaches fail to detect or control behavioral deception, while aggregate unlabeled activations provide population-level structure for risk assessment. Results span Llama 8B/70B SAE implementations and GemmaScope under resource constraints, representing preliminary findings that motivate larger studies on feature discovery, labeling methodology, and causal interventions in realistic deception contexts.

cross Sample completion, structured correlation, and Netflix problems

Authors: Leonardo N. Coregliano, Maryanthe Malliaris

Abstract: We develop a new high-dimensional statistical learning model which can take advantage of structured correlation in data even in the presence of randomness. We completely characterize learnability in this model in terms of VCN${}_{k,k}$-dimension (essentially $k$-dependence from Shelah's classification theory). This model suggests a theoretical explanation for the success of certain algorithms in the 2006 Netflix Prize competition.

cross Structuring Collective Action with LLM-Guided Evolution: From Ill-Structured Problems to Executable Heuristics

Authors: Kevin Bradley Dsouza, Graham Alexander Watt, Yuri Leonenko, Juan Moreno-Cruz

Abstract: Collective action problems, which require aligning individual incentives with collective goals, are classic examples of Ill-Structured Problems (ISPs). For an individual agent, the causal links between local actions and global outcomes are unclear, stakeholder objectives often conflict, and no single, clear algorithm can bridge micro-level choices with macro-level welfare. We present ECHO-MIMIC, a computational framework that converts this global complexity into a tractable, Well-Structured Problem (WSP) for each agent by discovering compact, executable heuristics and persuasive rationales. The framework operates in two stages: ECHO (Evolutionary Crafting of Heuristics from Outcomes) evolves snippets of Python code that encode candidate behavioral policies, while MIMIC (Mechanism Inference & Messaging for Individual-to-Collective Alignment) evolves companion natural language messages that motivate agents to adopt those policies. Both phases employ a large-language-model-driven evolutionary search: the LLM proposes diverse and context-aware code or text variants, while population-level selection retains those that maximize collective performance in a simulated environment. We demonstrate this framework on a canonical ISP in agricultural landscape management, where local farming decisions impact global ecological connectivity. Results show that ECHO-MIMIC discovers high-performing heuristics compared to baselines and crafts tailored messages that successfully align simulated farmer behavior with landscape-level ecological goals. By coupling algorithmic rule discovery with tailored communication, ECHO-MIMIC transforms the cognitive burden of collective action into a simple set of agent-level instructions, making previously ill-structured problems solvable in practice and opening a new path toward scalable, adaptive policy design.

cross SceneWeaver: All-in-One 3D Scene Synthesis with an Extensible and Self-Reflective Agent

Authors: Yandan Yang, Baoxiong Jia, Shujie Zhang, Siyuan Huang

Abstract: Indoor scene synthesis has become increasingly important with the rise of Embodied AI, which requires 3D environments that are not only visually realistic but also physically plausible and functionally diverse. While recent approaches have advanced visual fidelity, they often remain constrained to fixed scene categories, lack sufficient object-level detail and physical consistency, and struggle to align with complex user instructions. In this work, we present SceneWeaver, a reflective agentic framework that unifies diverse scene synthesis paradigms through tool-based iterative refinement. At its core, SceneWeaver employs a language model-based planner to select from a suite of extensible scene generation tools, ranging from data-driven generative models to visual- and LLM-based methods, guided by self-evaluation of physical plausibility, visual realism, and semantic alignment with user input. This closed-loop reason-act-reflect design enables the agent to identify semantic inconsistencies, invoke targeted tools, and update the environment over successive iterations. Extensive experiments on both common and open-vocabulary room types demonstrate that SceneWeaver not only outperforms prior methods on physical, visual, and semantic metrics, but also generalizes effectively to complex scenes with diverse instructions, marking a step toward general-purpose 3D environment generation. Project website: https://scene-weaver.github.io/.

URLs: https://scene-weaver.github.io/.

cross Neural Networks as Surrogate Solvers for Time-Dependent Accretion Disk Dynamics

Authors: Shunyuan Mao, Weiqi Wang, Sifan Wang, Ruobing Dong, Lu Lu, Kwang Moo Yi, Paris Perdikaris, Andrea Isella, Sébastien Fabbro, Lile Wang

Abstract: Accretion disks are ubiquitous in astrophysics, appearing in diverse environments from planet-forming systems to X-ray binaries and active galactic nuclei. Traditionally, modeling their dynamics requires computationally intensive (magneto)hydrodynamic simulations. Recently, Physics-Informed Neural Networks (PINNs) have emerged as a promising alternative. This approach trains neural networks directly on physical laws without requiring data. We demonstrate, for the first time, PINNs for solving the two-dimensional, time-dependent hydrodynamics of non-self-gravitating accretion disks. Our models provide solutions at arbitrary times and locations within the training domain, and successfully reproduce key physical phenomena, including the excitation and propagation of spiral density waves and gap formation from disk-companion interactions. Notably, the boundary-free approach enabled by PINNs naturally eliminates the spurious wave reflections at disk edges, which are challenging to suppress in numerical simulations. These results highlight how advanced machine learning techniques can enable physics-driven, data-free modeling of complex astrophysical systems, potentially offering an alternative to traditional numerical simulations in the future.

cross Document Summarization with Conformal Importance Guarantees

Authors: Bruce Kuwahara, Chen-Yuan Lin, Xiao Shi Huang, Kin Kwan Leung, Jullian Arta Yapeter, Ilya Stanevich, Felipe Perez, Jesse C. Cresswell

Abstract: Automatic summarization systems have advanced rapidly with large language models (LLMs), yet they still lack reliable guarantees on inclusion of critical content in high-stakes domains like healthcare, law, and finance. In this work, we introduce Conformal Importance Summarization, the first framework for importance-preserving summary generation which uses conformal prediction to provide rigorous, distribution-free coverage guarantees. By calibrating thresholds on sentence-level importance scores, we enable extractive document summarization with user-specified coverage and recall rates over critical content. Our method is model-agnostic, requires only a small calibration set, and seamlessly integrates with existing black-box LLMs. Experiments on established summarization benchmarks demonstrate that Conformal Importance Summarization achieves the theoretically assured information coverage rate. Our work suggests that Conformal Importance Summarization can be combined with existing techniques to achieve reliable, controllable automatic summarization, paving the way for safer deployment of AI summarization tools in critical applications. Code is available at https://github.com/layer6ai-labs/conformal-importance-summarization.

URLs: https://github.com/layer6ai-labs/conformal-importance-summarization.
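
A minimal split-conformal sketch of the thresholding step the abstract describes: calibrate a cutoff on sentence-level importance scores so critical sentences are retained at the target rate. Function names and the toy calibration data are assumptions; any black-box importance scorer can supply the scores.

    import numpy as np

    def calibrate_threshold(critical_scores, alpha=0.1):
        # Conformal lower quantile: with probability >= 1 - alpha, a new
        # critical sentence scores above tau, so keeping scores >= tau covers it.
        s = np.sort(np.asarray(critical_scores))
        k = int(np.floor(alpha * (len(s) + 1))) - 1
        return s[k] if k >= 0 else -np.inf

    def summarize(sentences, scores, tau):
        return [sent for sent, sc in zip(sentences, scores) if sc >= tau]

    rng = np.random.default_rng(0)
    tau = calibrate_threshold(rng.random(500), alpha=0.1)   # toy calibration scores
    doc = ["Dose is 5 mg daily.", "The clinic has parking."]
    print(summarize(doc, [0.9, 0.2], tau))

Because the guarantee comes from the calibration quantile rather than from the scorer itself, the method stays model-agnostic and needs only a small calibration set.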

cross Objective Evaluation of Prosody and Intelligibility in Speech Synthesis via Conditional Prediction of Discrete Tokens

Authors: Ismail Rasim Ulgen, Zongyang Du, Junchen Lu, Philipp Koehn, Berrak Sisman

Abstract: Objective evaluation of synthesized speech is critical for advancing speech generation systems, yet existing metrics for intelligibility and prosody remain limited in scope and weakly correlated with human perception. Word Error Rate (WER) provides only a coarse text-based measure of intelligibility, while F0-RMSE and related pitch-based metrics offer a narrow, reference-dependent view of prosody. To address these limitations, we propose TTScore, a targeted and reference-free evaluation framework based on conditional prediction of discrete speech tokens. TTScore employs two sequence-to-sequence predictors conditioned on input text: TTScore-int, which measures intelligibility through content tokens, and TTScore-pro, which evaluates prosody through prosody tokens. For each synthesized utterance, the predictors compute the likelihood of the corresponding token sequences, yielding interpretable scores that capture alignment with intended linguistic content and prosodic structure. Experiments on the SOMOS, VoiceMOS, and TTSArena benchmarks demonstrate that TTScore-int and TTScore-pro provide reliable, aspect-specific evaluation and achieve stronger correlations with human judgments of overall quality than existing intelligibility and prosody-focused metrics.

cross Fast Estimation of Wasserstein Distances via Regression on Sliced Wasserstein Distances

Authors: Khai Nguyen, Hai Nguyen, Nhat Ho

Abstract: We address the problem of efficiently computing Wasserstein distances for multiple pairs of distributions drawn from a meta-distribution. To this end, we propose a fast estimation method based on regressing Wasserstein distance on sliced Wasserstein (SW) distances. Specifically, we leverage both standard SW distances, which provide lower bounds, and lifted SW distances, which provide upper bounds, as predictors of the true Wasserstein distance. To ensure parsimony, we introduce two linear models: an unconstrained model with a closed-form least-squares solution, and a constrained model that uses only half as many parameters. We show that accurate models can be learned from a small number of distribution pairs. Once estimated, the model can predict the Wasserstein distance for any pair of distributions via a linear combination of SW distances, making it highly efficient. Empirically, we validate our approach on diverse tasks, including Gaussian mixtures, point-cloud classification, and Wasserstein-space visualizations for 3D point clouds. Across various datasets such as MNIST point clouds, ShapeNetV2, MERFISH Cell Niches, and scRNA-seq, our method consistently provides a better approximation of Wasserstein distance than the state-of-the-art Wasserstein embedding model, Wasserstein Wormhole, particularly in low-data regimes. Finally, we demonstrate that our estimator can also accelerate Wormhole training, yielding RG-Wormhole.
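
A simplified sketch of the regression idea under stated assumptions: it uses only plain sliced W2 as the predictor (the paper's lifted-SW upper bounds and constrained model are omitted) and synthetic Gaussian shifts as ground truth.

    import numpy as np

    def sliced_w2(X, Y, n_proj=64, rng=np.random.default_rng(0)):
        # Monte-Carlo sliced 2-Wasserstein between equal-size point clouds.
        theta = rng.normal(size=(n_proj, X.shape[1]))
        theta /= np.linalg.norm(theta, axis=1, keepdims=True)
        qx, qy = np.sort(X @ theta.T, axis=0), np.sort(Y @ theta.T, axis=0)
        return float(np.sqrt(((qx - qy) ** 2).mean()))

    rng = np.random.default_rng(1)
    pairs = [(rng.normal(size=(100, 3)), rng.normal(size=(100, 3)) + i) for i in range(5)]
    sw = np.array([sliced_w2(X, Y) for X, Y in pairs])
    w2 = np.array([i * np.sqrt(3) for i in range(5)])   # known shift distances

    A = np.c_[sw, np.ones_like(sw)]                     # closed-form least squares
    coef, *_ = np.linalg.lstsq(A, w2, rcond=None)
    predict = lambda X, Y: coef[0] * sliced_w2(X, Y) + coef[1]
    print(predict(*pairs[3]))                           # approx 3 * sqrt(3)

Once the two coefficients are fitted, each new prediction costs only a handful of 1D sorts, which is where the claimed efficiency comes from.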

cross Adaptive Approach to Enhance Machine Learning Scheduling Algorithms During Runtime Using Reinforcement Learning in Metascheduling Applications

Authors: Samer Alshaer, Ala Khalifeh, Roman Obermaisser

Abstract: Metascheduling in time-triggered architectures has been crucial in adapting to dynamic and unpredictable environments, ensuring the reliability and efficiency of task execution. However, traditional approaches face significant challenges when training Artificial Intelligence (AI) scheduling inferences offline, particularly due to the complexities involved in constructing a comprehensive Multi-Schedule Graph (MSG) that accounts for all possible scenarios. The process of generating an MSG that captures the vast probability space, especially when considering context events like hardware failures, slack variations, or mode changes, is resource-intensive and often infeasible. To address these challenges, we propose an adaptive online learning unit integrated within the metascheduler to enhance performance in real-time. The primary motivation for developing this unit stems from the limitations of offline training, where the MSG created is inherently a subset of the complete space, focusing only on the most probable and critical context events. In the online mode, Reinforcement Learning (RL) plays a pivotal role by continuously exploring and discovering new scheduling solutions, thus expanding the MSG and enhancing system performance over time. This dynamic adaptation allows the system to handle unexpected events and complex scheduling scenarios more effectively. Several RL models were implemented within the online learning unit, each designed to address specific challenges in scheduling. These models not only facilitate the discovery of new solutions but also optimize existing schedulers, particularly when stricter deadlines or new performance criteria are introduced. By continuously refining the AI inferences through real-time training, the system remains flexible and capable of meeting evolving demands, thus ensuring robustness and efficiency in large-scale, safety-critical environments.

cross A Compound Classification System Based on Fuzzy Relations Applied to the Noise-Tolerant Control of a Bionic Hand via EMG Signal Recognition

Authors: Pawel Trajdos, Marek Kurzynski

Abstract: Modern anthropomorphic upper limb bioprostheses are typically controlled by electromyographic (EMG) biosignals using a pattern recognition scheme. Unfortunately, there are many factors originating from the human source of objects to be classified and from the human-prosthesis interface that make it difficult to obtain an acceptable classification quality. One of these factors is the high susceptibility of biosignals to contamination, which can considerably reduce the quality of classification of a recognition system. In the paper, the authors propose a new recognition system intended for EMG-based control of the hand prosthesis with detection of contaminated biosignals in order to mitigate the adverse effect of contaminations. The system consists of two ensembles: a set of one-class classifiers (OCC) to assess the degree of contamination of individual channels, and an ensemble of K-nearest neighbour (KNN) classifiers to recognise the patient's intent. For all recognition systems, an original, coherent fuzzy model was developed, which allows the use of a uniform soft (fuzzy) decision scheme throughout the recognition process. The experimental evaluation was conducted using real biosignals from a public repository. The goal was to provide an experimental comparative analysis of the parameters and procedures of the developed method on which the quality of the recognition system depends. The proposed fuzzy recognition system was also compared with similar systems described in the literature.

cross Innovative Deep Learning Architecture for Enhanced Altered Fingerprint Recognition

Authors: Dana A Abdullah, Dana Rasul Hamad, Bishar Rasheed Ibrahim, Sirwan Abdulwahid Aula, Aso Khaleel Ameen, Sabat Salih Hamadamin

Abstract: Altered fingerprint recognition (AFR) is challenging for biometric verification in applications such as border control, forensics, and fiscal admission. Adversaries can deliberately modify ridge patterns to evade detection, so robust recognition of altered prints is essential. We present DeepAFRNet, a deep learning recognition model that matches and recognizes distorted fingerprint samples. The approach uses a VGG16 backbone to extract high-dimensional features and cosine similarity to compare embeddings. We evaluate on the SOCOFing Real-Altered subset with three difficulty levels (Easy, Medium, Hard). With strict thresholds, DeepAFRNet achieves accuracies of 96.7 percent, 98.76 percent, and 99.54 percent for the three levels. A threshold-sensitivity study shows that relaxing the threshold from 0.92 to 0.72 sharply degrades accuracy to 7.86 percent, 27.05 percent, and 29.51 percent, underscoring the importance of threshold selection in biometric systems. By using real altered samples and reporting per-level metrics, DeepAFRNet addresses limitations of prior work based on synthetic alterations or limited verification protocols, and indicates readiness for real-world deployments where both security and recognition resilience are critical.
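
A hedged sketch of the embed-and-compare pattern the abstract describes (VGG16 features plus cosine similarity). The untrained backbone, preprocessing, and threshold here are placeholders; in practice one would load the paper's fine-tuned weights.

    import torch
    import torch.nn.functional as F
    from torchvision import models

    # Backbone: VGG16 convolutional features -> flattened embedding
    # (weights=None keeps the sketch offline; load trained weights in practice).
    vgg = models.vgg16(weights=None)
    backbone = torch.nn.Sequential(vgg.features, vgg.avgpool, torch.nn.Flatten()).eval()

    @torch.no_grad()
    def embed(x):                   # x: (batch, 3, 224, 224) preprocessed prints
        return F.normalize(backbone(x), dim=1)

    enrolled = torch.randn(1, 3, 224, 224)
    probe = torch.randn(1, 3, 224, 224)
    score = float((embed(enrolled) * embed(probe)).sum())   # cosine similarity
    print("match" if score > 0.92 else "reject")            # strict threshold per the abstract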

cross Large Pre-Trained Models for Bimanual Manipulation in 3D

Authors: Hanna Yurchyk, Wei-Di Chang, Gregory Dudek, David Meger

Abstract: We investigate the integration of attention maps from a pre-trained Vision Transformer into voxel representations to enhance bimanual robotic manipulation. Specifically, we extract attention maps from DINOv2, a self-supervised ViT model, and interpret them as pixel-level saliency scores over RGB images. These maps are lifted into a 3D voxel grid, resulting in voxel-level semantic cues that are incorporated into a behavior cloning policy. When integrated into a state-of-the-art voxel-based policy, our attention-guided featurization yields an average absolute improvement of 8.2% and a relative gain of 21.9% across all tasks in the RLBench bimanual benchmark.

cross Region-of-Interest Augmentation for Mammography Classification under Patient-Level Cross-Validation

Authors: Farbod Bigdeli, Mohsen Mohammadagha, Ali Bigdeli

Abstract: Breast cancer screening with mammography remains central to early detection and mortality reduction. Deep learning has shown strong potential for automating mammogram interpretation, yet limited-resolution datasets and small sample sizes continue to restrict performance. We revisit the Mini-DDSM dataset (9,684 images; 2,414 patients) and introduce a lightweight region-of-interest (ROI) augmentation strategy. During training, full images are probabilistically replaced with random ROI crops sampled from a precomputed, label-free bounding-box bank, with optional jitter to increase variability. We evaluate under strict patient-level cross-validation and report ROC-AUC, PR-AUC, and training-time efficiency metrics (throughput and GPU memory). Because ROI augmentation is training-only, inference-time cost remains unchanged. On Mini-DDSM, ROI augmentation (best: p_roi = 0.10, alpha = 0.10) yields modest average ROC-AUC gains, with performance varying across folds; PR-AUC is flat to slightly lower. These results demonstrate that simple, data-centric ROI strategies can enhance mammography classification in constrained settings without requiring additional labels or architectural modifications.
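
An illustrative, training-only version of the described augmentation: with probability p_roi, replace the full mammogram by a random crop from a precomputed bounding-box bank, jittered by alpha. Variable names follow the abstract; the jitter scheme is an assumption.

    import random
    from PIL import Image

    def roi_augment(image, roi_bank, p_roi=0.10, alpha=0.10, rng=random):
        # roi_bank: list of (left, top, right, bottom) boxes; label-free.
        if not roi_bank or rng.random() >= p_roi:
            return image                              # keep the full image
        left, top, right, bottom = rng.choice(roi_bank)
        w, h = right - left, bottom - top
        jx = int(alpha * w * (2 * rng.random() - 1))  # horizontal jitter
        jy = int(alpha * h * (2 * rng.random() - 1))  # vertical jitter
        left = min(max(left + jx, 0), image.width - 1)
        top = min(max(top + jy, 0), image.height - 1)
        return image.crop((left, top,
                           min(left + w, image.width), min(top + h, image.height)))

    img = Image.new("L", (512, 512))
    print(roi_augment(img, [(100, 120, 260, 300)], p_roi=1.0).size)

Since the replacement happens only inside the training loader, inference-time cost is untouched, matching the abstract's claim.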

cross Unsupervised Domain Adaptation with an Unobservable Source Subpopulation

Authors: Chao Ying, Jun Jin, Haotian Zhang, Qinglong Tian, Yanyuan Ma, Yixuan Li, Jiwei Zhao

Abstract: We study an unsupervised domain adaptation problem where the source domain consists of subpopulations defined by the binary label $Y$ and a binary background (or environment) $A$. We focus on a challenging setting in which one such subpopulation in the source domain is unobservable. Naively ignoring this unobserved group can result in biased estimates and degraded predictive performance. Despite this structured missingness, we show that the prediction in the target domain can still be recovered. Specifically, we rigorously derive both background-specific and overall prediction models for the target domain. For practical implementation, we propose the distribution matching method to estimate the subpopulation proportions. We provide theoretical guarantees for the asymptotic behavior of our estimator, and establish an upper bound on the prediction error. Experiments on both synthetic and real-world datasets show that our method outperforms the naive benchmark that does not account for this unobservable source subpopulation.

cross Every Character Counts: From Vulnerability to Defense in Phishing Detection

Authors: Maria Chiper, Radu Tudor Ionescu

Abstract: Phishing attacks targeting both organizations and individuals are becoming an increasingly significant threat as technology advances. Current automatic detection methods often lack explainability and robustness in detecting new phishing attacks. In this work, we investigate the effectiveness of character-level deep learning models for phishing detection, which can provide both robustness and interpretability. We evaluate three neural architectures adapted to operate at the character level, namely CharCNN, CharGRU, and CharBiLSTM, on a custom-built email dataset, which combines data from multiple sources. Their performance is analyzed under three scenarios: (i) standard training and testing, (ii) standard training and testing under adversarial attacks, and (iii) training and testing with adversarial examples. Aiming to develop a tool that operates as a browser extension, we test all models under limited computational resources. In this constrained setup, CharGRU proves to be the best-performing model across all scenarios. All models show vulnerability to adversarial attacks, but adversarial training substantially improves their robustness. In addition, by adapting the Gradient-weighted Class Activation Mapping (Grad-CAM) technique to character-level inputs, we are able to visualize which parts of each email influence the decision of each model. Our open-source code and data are released at https://github.com/chipermaria/every-character-counts.

URLs: https://github.com/chipermaria/every-character-counts.
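
A minimal sketch of the CharGRU variant named in the abstract; vocabulary, embedding, and hidden sizes are illustrative assumptions rather than the paper's configuration.

    import torch
    import torch.nn as nn

    class CharGRU(nn.Module):
        def __init__(self, vocab=128, emb=32, hidden=64):
            super().__init__()
            self.emb = nn.Embedding(vocab, emb)       # one embedding per character
            self.gru = nn.GRU(emb, hidden, batch_first=True)
            self.head = nn.Linear(hidden, 2)          # phishing vs. legitimate
        def forward(self, ids):                       # ids: (batch, chars)
            _, hn = self.gru(self.emb(ids))
            return self.head(hn[-1])

    text = "Verify your account now at http://example.com"
    ids = torch.tensor([[min(ord(c), 127) for c in text]])   # byte-level encoding
    print(CharGRU()(ids).softmax(-1))

Operating on raw characters sidesteps tokenization, which is what makes character-level saliency maps like the adapted Grad-CAM directly interpretable.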

cross An LLM-based Agentic Framework for Accessible Network Control

Authors: Samuel Lin, Jiawei Zhou, Minlan Yu

Abstract: Traditional approaches to network management have been accessible only to a handful of highly-trained network operators with significant expert knowledge. This creates barriers for lay users to easily manage their networks without resorting to experts. With recent development of powerful large language models (LLMs) for language comprehension, we design a system to make network management accessible to a broader audience of non-experts by allowing users to converse with networks in natural language. To effectively leverage advancements in LLMs, we propose an agentic framework that uses an intermediate representation to streamline configuration across diverse vendor equipment, retrieves the network state from memory in real-time, and provides an interface for external feedback. We also conduct pilot studies to collect real user data of natural language utterances for network control, and present a visualization interface to facilitate dialogue-driven user interaction and enable large-scale data collection for future development. Preliminary experiments validate the effectiveness of our proposed system components with LLM integration on both synthetic and real user utterances. Through our data collection and visualization efforts, we pave the way for more effective use of LLMs and democratize network control for everyday users.

cross Experience Deploying Containerized GenAI Services at an HPC Center

Authors: Angel M. Beltre, Jeff Ogden, Kevin Pedretti

Abstract: Generative Artificial Intelligence (GenAI) applications are built from specialized components -- inference servers, object storage, vector and graph databases, and user interfaces -- interconnected via web-based APIs. While these components are often containerized and deployed in cloud environments, such capabilities are still emerging at High-Performance Computing (HPC) centers. In this paper, we share our experience deploying GenAI workloads within an established HPC center, discussing the integration of HPC and cloud computing environments. We describe our converged computing architecture that integrates HPC and Kubernetes platforms running containerized GenAI workloads, helping with reproducibility. A case study illustrates the deployment of the Llama Large Language Model (LLM) using a containerized inference server (vLLM) across both Kubernetes and HPC platforms using multiple container runtimes. Our experience highlights practical considerations and opportunities for the HPC container community, guiding future research and tool development.

cross A Gapped Scale-Sensitive Dimension and Lower Bounds for Offset Rademacher Complexity

Authors: Zeyu Jia, Yury Polyanskiy, Alexander Rakhlin

Abstract: We study gapped scale-sensitive dimensions of a function class in both sequential and non-sequential settings. We demonstrate that covering numbers for any uniformly bounded class are controlled above by these gapped dimensions, generalizing the results of \cite{anthony2000function,alon1997scale}. Moreover, we show that the gapped dimensions lead to lower bounds on offset Rademacher averages, thereby strengthening existing approaches for proving lower bounds on rates of convergence in statistical and online learning.

cross FS-DFM: Fast and Accurate Long Text Generation with Few-Step Diffusion Language Models

Authors: Amin Karimi Monsefi, Nikhil Bhendawade, Manuel Rafael Ciosici, Dominic Culver, Yizhe Zhang, Irina Belousova

Abstract: Autoregressive language models (ARMs) deliver strong likelihoods, but are inherently serial: they generate one token per forward pass, which limits throughput and inflates latency for long sequences. Diffusion Language Models (DLMs) parallelize across positions and thus appear promising for language generation, yet standard discrete diffusion typically needs hundreds to thousands of model evaluations to reach high quality, trading serial depth for iterative breadth. We introduce FS-DFM (Few-Step Discrete Flow-Matching), a discrete flow-matching model designed for speed without sacrificing quality. The core idea is simple: make the number of sampling steps an explicit parameter and train the model to be consistent across step budgets, so one big move lands where many small moves would. We pair this with a reliable update rule that moves probability in the right direction without overshooting, and with strong teacher guidance distilled from long-run trajectories. Together, these choices make few-step sampling stable, accurate, and easy to control. On language modeling benchmarks, FS-DFM with 8 sampling steps achieves perplexity parity with a 1,024-step discrete-flow baseline for generating 1,024 tokens using a similar-size model, delivering up to 128 times faster sampling and corresponding latency/throughput gains.

cross Design, Implementation and Evaluation of a Novel Programming Language Topic Classification Workflow

Authors: Michael Zhang, Yuan Tian, Mariam Guizani

Abstract: As software systems grow in scale and complexity, understanding the distribution of programming language topics within source code becomes increasingly important for guiding technical decisions, improving onboarding, and informing tooling and education. This paper presents the design, implementation, and evaluation of a novel programming language topic classification workflow. Our approach combines a multi-label Support Vector Machine (SVM) with a sliding window and voting strategy to enable fine-grained localization of core language concepts such as operator overloading, virtual functions, inheritance, and templates. Trained on the IBM Project CodeNet dataset, our model achieves an average F1 score of 0.90 across topics and 0.75 for code-topic highlighting. Our findings contribute empirical insights and a reusable pipeline for researchers and practitioners interested in code analysis and data-driven software engineering.
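
An illustrative sketch of the sliding-window voting strategy: classify overlapping windows of code lines, then let each line take the majority vote of the windows covering it. The stand-in keyword classifier below is a placeholder for the trained multi-label SVM.

    from collections import Counter

    def classify_window(snippet):
        # Stand-in for the trained multi-label SVM: topic set for one window.
        labels = set()
        if "virtual" in snippet:
            labels.add("virtual_functions")
        if "operator" in snippet:
            labels.add("operator_overloading")
        return labels

    def localize(lines, win=8, stride=2):
        votes = [Counter() for _ in lines]
        for start in range(0, max(len(lines) - win, 0) + 1, stride):
            for label in classify_window("\n".join(lines[start:start + win])):
                for i in range(start, min(start + win, len(lines))):
                    votes[i][label] += 1      # each window votes for its lines
        return [v.most_common(1)[0][0] if v else None for v in votes]

    code = ["struct V {", "  virtual void f();", "};", "V operator+(V, V);"]
    print(localize(code, win=2, stride=1))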

cross A Hierarchical Variational Graph Fused Lasso for Recovering Relative Rates in Spatial Compositional Data

Authors: Joaquim Valerio Teixeira, Ed Reznik, Sudipto Banerjee, Wesley Tansey

Abstract: The analysis of spatial data from biological imaging technology, such as imaging mass spectrometry (IMS) or imaging mass cytometry (IMC), is challenging because of a competitive sampling process which convolves signals from molecules in a single pixel. To address this, we develop a scalable Bayesian framework that leverages natural sparsity in spatial signal patterns to recover relative rates for each molecule across the entire image. Our method relies on the use of a heavy-tailed variant of the graphical lasso prior and a novel hierarchical variational family, enabling efficient inference via automatic differentiation variational inference. Simulation results show that our approach outperforms state-of-the-practice point estimate methodologies in IMS, and has superior posterior coverage than mean-field variational inference techniques. Results on real IMS data demonstrate that our approach better recovers the true anatomical structure of known tissue, removes artifacts, and detects active regions missed by the standard analysis approach.

cross Look Before you Leap: Estimating LLM Benchmark Scores from Descriptions

Authors: Jungsoo Park, Ethan Mendes, Gabriel Stanovsky, Alan Ritter

Abstract: Progress in large language models is constrained by an evaluation bottleneck: build a benchmark, evaluate models and settings, then iterate. We therefore ask a simple question: can we forecast outcomes before running any experiments? We study text-only performance forecasting: estimating a model's score from a redacted task description and intended configuration, with no access to dataset instances. To support systematic study, we curate PRECOG, a corpus of redacted description-performance pairs spanning diverse tasks, domains, and metrics. Experiments show the task is challenging but feasible: models equipped with a retrieval module that excludes source papers achieve moderate prediction performance with well-calibrated uncertainty, reaching mean absolute error as low as 8.7 on the Accuracy subset at high-confidence thresholds. Our analysis indicates that stronger reasoning models engage in diverse, iterative querying, whereas current open-source models lag and often skip retrieval or gather evidence with limited diversity. We further test a zero-leakage setting, forecasting on newly released datasets or experiments before their papers are indexed, where GPT-5 with built-in web search still attains nontrivial prediction accuracy. Overall, our corpus and analyses offer an initial step toward open-ended anticipatory evaluation, supporting difficulty estimation and smarter experiment prioritization.

cross Implicit Augmentation from Distributional Symmetry in Turbulence Super-Resolution

Authors: Julia Balla, Jeremiah Bailey, Ali Backour, Elyssa Hofgard, Tommi Jaakkola, Tess Smidt, Ryley McConkey

Abstract: The immense computational cost of simulating turbulence has motivated the use of machine learning approaches for super-resolving turbulent flows. A central challenge is ensuring that learned models respect physical symmetries, such as rotational equivariance. We show that standard convolutional neural networks (CNNs) can partially acquire this symmetry without explicit augmentation or specialized architectures, as turbulence itself provides implicit rotational augmentation in both time and space. Using 3D channel-flow subdomains with differing anisotropy, we find that models trained on more isotropic mid-plane data achieve lower equivariance error than those trained on boundary layer data, and that greater temporal or spatial sampling further reduces this error. We show a distinct scale-dependence of equivariance error that occurs regardless of dataset anisotropy that is consistent with Kolmogorov's local isotropy hypothesis. These results clarify when rotational symmetry must be explicitly incorporated into learning algorithms and when it can be obtained directly from turbulence, enabling more efficient and symmetry-aware super-resolution.

cross Cryptographic Backdoor for Neural Networks: Boon and Bane

Authors: Anh Tu Ngo, Anupam Chattopadhyay, Subhamoy Maitra

Abstract: In this paper we show that cryptographic backdoors in a neural network (NN) can be highly effective in two directions: mounting attacks and presenting defenses. On the attack side, a carefully planted cryptographic backdoor enables a powerful and invisible attack on the NN. Considering the defense, we present applications: first, a provably robust NN watermarking scheme; second, a protocol for guaranteeing user authentication; and third, a protocol for tracking unauthorized sharing of the NN intellectual property (IP). From a broader theoretical perspective, borrowing the ideas from Goldwasser et al. [FOCS 2022], our main contribution is to show that all these instantiated practical protocol implementations are provably robust. The protocols for watermarking, authentication and IP tracking resist an adversary with black-box access to the NN, whereas the backdoor-enabled adversarial attack is impossible to prevent under the standard assumptions. While the theoretical tools used for our attack are mostly in line with the Goldwasser et al. ideas, the proofs related to the defense need further study. Finally, all these protocols are implemented on state-of-the-art NN architectures with empirical results corroborating the theoretical claims. Further, one can utilize post-quantum primitives for implementing the cryptographic backdoors, laying out foundations for quantum-era applications in machine learning (ML).

cross PALQO: Physics-informed Model for Accelerating Large-scale Quantum Optimization

Authors: Yiming Huang, Yajie Hao, Jing Zhou, Xiao Yuan, Xiaoting Wang, Yuxuan Du

Abstract: Variational quantum algorithms (VQAs) are leading strategies to reach practical utilities of near-term quantum devices. However, the no-cloning theorem in quantum mechanics precludes standard backpropagation, leading to prohibitive quantum resource costs when applying VQAs to large-scale tasks. To address this challenge, we reformulate the training dynamics of VQAs as a nonlinear partial differential equation and propose a novel protocol that leverages physics-informed neural networks (PINNs) to model this dynamical system efficiently. Given a small amount of training trajectory data collected from quantum devices, our protocol predicts the parameter updates of VQAs over multiple iterations on the classical side, dramatically reducing quantum resource costs. Through systematic numerical experiments, we demonstrate that our method achieves up to a 30x speedup compared to conventional methods and reduces quantum resource costs by as much as 90% for tasks involving up to 40 qubits, including ground state preparation of different quantum systems, while maintaining competitive accuracy. Our approach complements existing techniques aimed at improving the efficiency of VQAs and further strengthens their potential for practical applications.

cross Real-Time System for Audio-Visual Target Speech Enhancement

Authors: T. Aleksandra Ma, Sile Yin, Li-Chia Yang, Shuo Zhang

Abstract: We present a live demonstration for RAVEN, a real-time audio-visual speech enhancement system designed to run entirely on a CPU. In single-channel, audio-only settings, speech enhancement is traditionally approached as the task of extracting clean speech from environmental noise. More recent work has explored the use of visual cues, such as lip movements, to improve robustness, particularly in the presence of interfering speakers. However, to our knowledge, no prior work has demonstrated an interactive system for real-time audio-visual speech enhancement operating on CPU hardware. RAVEN fills this gap by using pretrained visual embeddings from an audio-visual speech recognition model to encode lip movement information. The system generalizes across environmental noise, interfering speakers, transient sounds, and even singing voices. In this demonstration, attendees will be able to experience live audio-visual target speech enhancement using a microphone and webcam setup, with clean speech playback through headphones.

cross RAPTOR-GEN: RApid PosTeriOR GENerator for Bayesian Learning in Biomanufacturing

Authors: Wandi Xu, Wei Xie

Abstract: Biopharmaceutical manufacturing is vital to public health but lacks the agility for rapid, on-demand production of biotherapeutics due to the complexity and variability of bioprocesses. To overcome this, we introduce RApid PosTeriOR GENerator (RAPTOR-GEN), a mechanism-informed Bayesian learning framework designed to accelerate intelligent digital twin development from sparse and heterogeneous experimental data. This framework is built on a multi-scale probabilistic knowledge graph (pKG), formulated as a stochastic differential equation (SDE)-based foundational model that captures the nonlinear dynamics of bioprocesses. RAPTOR-GEN consists of two ingredients: (i) an interpretable metamodel integrating linear noise approximation (LNA) that exploits the structural information of bioprocessing mechanisms and a sequential learning strategy to fuse heterogeneous and sparse data, enabling inference of latent state variables and explicit approximation of the intractable likelihood function; and (ii) an efficient Bayesian posterior sampling method that utilizes Langevin diffusion (LD) to accelerate posterior exploration by exploiting the gradients of the derived likelihood. It generalizes the LNA approach to circumvent the challenge of step size selection, facilitating robust learning of mechanistic parameters with provable finite-sample performance guarantees. We develop a fast and robust RAPTOR-GEN algorithm with controllable error. Numerical experiments demonstrate its effectiveness in uncovering the underlying regulatory mechanisms of biomanufacturing processes.

cross Identifying Group Anchors in Real-World Group Interactions Under Label Scarcity

Authors: Fanchen Bu, Geon Lee, Minyoung Choe, Kijung Shin

Abstract: Group interactions occur in various real-world contexts, e.g., co-authorship, email communication, and online Q&A. In each group, there is often a particularly significant member, around whom the group is formed. Examples include the first or last author of a paper, the sender of an email, and the questioner in a Q&A session. In this work, we discuss the existence of such individuals in real-world group interactions. We call such individuals group anchors and study the problem of identifying them. First, we introduce the concept of group anchors and the identification problem. Then, we discuss our observations on group anchors in real-world group interactions. Based on our observations, we develop AnchorRadar, a fast and effective method for group anchor identification under realistic settings with label scarcity, i.e., when only a few groups have known anchors. AnchorRadar is a semi-supervised method using information from groups both with and without known group anchors. Finally, through extensive experiments on thirteen real-world datasets, we demonstrate the empirical superiority of AnchorRadar over various baselines w.r.t. accuracy and efficiency. In most cases, AnchorRadar achieves higher accuracy in group anchor identification than all the baselines, while using 10.2$\times$ less training time than the fastest baseline and 43.6$\times$ fewer learnable parameters than the most lightweight baseline on average.

cross Leveraging Temporally Extended Behavior Sharing for Multi-task Reinforcement Learning

Authors: Gawon Lee (Seoul National University), Daesol Cho (Seoul National University), H. Jin Kim (Seoul National University)

Abstract: Multi-task reinforcement learning (MTRL) offers a promising approach to improve sample efficiency and generalization by training agents across multiple tasks, enabling knowledge sharing between them. However, applying MTRL to robotics remains challenging due to the high cost of collecting diverse task data. To address this, we propose MT-Lévy, a novel exploration strategy that enhances sample efficiency in MTRL environments by combining behavior sharing across tasks with temporally extended exploration inspired by Lévy flight. MT-Lévy leverages policies trained on related tasks to guide exploration towards key states, while dynamically adjusting exploration levels based on task success ratios. This approach enables more efficient state-space coverage, even in complex robotics environments. Empirical results demonstrate that MT-Lévy significantly improves exploration and sample efficiency, supported by quantitative and qualitative analyses. Ablation studies further highlight the contribution of each component, showing that combining behavior sharing with adaptive exploration strategies can significantly improve the practicality of MTRL in robotics applications.
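
A toy sketch of the two mechanisms the abstract combines: heavy-tailed (Lévy-style) hold times for temporally extended exploration, and an exploration level that decays with the task success ratio. The Pareto tail and the linear success-to-epsilon mapping are illustrative assumptions.

    import numpy as np

    def levy_hold(rng, alpha=1.5):
        # Heavy-tailed number of steps to commit to one behavior.
        return int(rng.pareto(alpha)) + 1

    def exploration_schedule(n_steps, success_ratio, seed=0):
        rng = np.random.default_rng(seed)
        eps = 1.0 - success_ratio        # explore less as the task succeeds more
        plan = []
        while len(plan) < n_steps:
            mode = "explore" if rng.random() < eps else "exploit"
            plan += [mode] * levy_hold(rng)   # commit for a heavy-tailed duration
        return plan[:n_steps]

    print(exploration_schedule(20, success_ratio=0.7))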

cross Extrapolating Phase-Field Simulations in Space and Time with Purely Convolutional Architectures

Authors: Christophe Bonneville, Nathan Bieberdorf, Pieterjan Robbe, Mark Asta, Habib N. Najm, Laurent Capolungo, Cosmin Safta

Abstract: Phase-field models of liquid metal dealloying (LMD) can resolve rich microstructural dynamics but become intractable for large domains or long time horizons. We present a conditionally parameterized, fully convolutional U-Net surrogate that generalizes far beyond its training window in both space and time. The design integrates convolutional self-attention and physics-aware padding, while parameter conditioning enables variable time-step skipping and adaptation to diverse alloy systems. Although trained only on short, small-scale simulations, the surrogate exploits the translational invariance of convolutions to extend predictions to much longer horizons than traditional solvers. It accurately reproduces key LMD physics, with relative errors typically under 5% within the training regime and below 10% when extrapolating to larger domains and later times. The method accelerates computations by up to 16,000 times, cutting weeks of simulation down to seconds, and marks an early step toward scalable, high-fidelity extrapolation of LMD phase-field models.

cross DAC-LoRA: Dynamic Adversarial Curriculum for Efficient and Robust Few-Shot Adaptation

Authors: Ved Umrajkar

Abstract: Vision-Language Models (VLMs) are foundational to critical applications like autonomous driving, medical diagnosis, and content moderation. While Parameter-Efficient Fine-Tuning (PEFT) methods like LoRA enable their efficient adaptation to specialized tasks, these models remain vulnerable to adversarial attacks that can compromise safety-critical decisions. CLIP, the backbone for numerous downstream VLMs, is a high-value target whose vulnerabilities can cascade across the multimodal AI ecosystem. We propose Dynamic Adversarial Curriculum LoRA (DAC-LoRA), a novel framework that integrates adversarial training into PEFT. The core principle of our method, i.e., an intelligent curriculum of progressively challenging attacks, is general and can potentially be applied to any iterative attack method. Guided by the First-Order Stationary Condition (FOSC) and a TRADES-inspired loss, DAC-LoRA achieves substantial improvements in adversarial robustness without significantly compromising clean accuracy. Our work presents an effective, lightweight, and broadly applicable method to demonstrate that the DAC-LoRA framework can be easily integrated into a standard PEFT pipeline to significantly enhance robustness.
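
A hedged sketch of a progressive adversarial curriculum: plain L_inf PGD whose strength ramps with training progress. The paper's FOSC-based gating is simplified here to a linear epsilon schedule, and the toy model is a placeholder for a LoRA-adapted VLM.

    import torch
    import torch.nn.functional as F

    def pgd(model, x, y, eps, steps=5):
        # L_inf PGD; attack strength eps is supplied by the curriculum below.
        delta = torch.zeros_like(x, requires_grad=True)
        for _ in range(steps):
            loss = F.cross_entropy(model(x + delta), y)
            loss.backward()
            with torch.no_grad():
                delta += (eps / steps) * delta.grad.sign()
                delta.clamp_(-eps, eps)
            delta.grad.zero_()
        return (x + delta).detach()

    def curriculum_eps(step, total_steps, eps_max=8 / 255):
        # Progressively harder attacks as fine-tuning advances.
        return eps_max * min(step / max(total_steps, 1), 1.0)

    model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(12, 2))
    x, y = torch.randn(4, 3, 2, 2), torch.tensor([0, 1, 0, 1])
    x_adv = pgd(model, x, y, eps=curriculum_eps(step=100, total_steps=1000))
    print((x_adv - x).abs().max())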

cross ImaginationPolicy: Towards Generalizable, Precise and Reliable End-to-End Policy for Robotic Manipulation

Authors: Dekun Lu, Wei Gao, Kui Jia

Abstract: End-to-end robot manipulation policies offer significant potential for enabling embodied agents to understand and interact with the world. Unlike traditional modular pipelines, end-to-end learning mitigates key limitations such as information loss between modules and feature misalignment caused by isolated optimization targets. Despite these advantages, existing end-to-end neural networks for robotic manipulation--including those based on large VLM/VLA models--remain insufficiently performant for large-scale practical deployment. In this paper, we take a step towards an end-to-end manipulation policy that is generalizable, accurate and reliable. To achieve this goal, we propose a novel Chain of Moving Oriented Keypoints (CoMOK) formulation for robotic manipulation. Our formulation is used as the action representation of a neural policy, which can be trained in an end-to-end fashion. Such an action representation is general, as it extends the standard end-effector pose action representation and supports a diverse set of manipulation tasks in a unified manner. The oriented keypoint in our method enables natural generalization to objects with different shapes and sizes, while achieving sub-centimeter accuracy. Moreover, our formulation can easily handle multi-stage tasks, multi-modal robot behaviors, and deformable objects. Extensive simulated and hardware experiments demonstrate the effectiveness of our method.

cross Actively Learning Halfspaces without Synthetic Data

Authors: Hadley Black, Kasper Green Larsen, Arya Mazumdar, Barna Saha, Geelon So

Abstract: In the classic point location problem, one is given an arbitrary dataset $X \subset \mathbb{R}^d$ of $n$ points with query access to an unknown halfspace $f : \mathbb{R}^d \to \{0,1\}$, and the goal is to learn the label of every point in $X$. This problem is extremely well-studied and a nearly-optimal $\widetilde{O}(d \log n)$ query algorithm is known due to Hopkins-Kane-Lovett-Mahajan (FOCS 2020). However, their algorithm is granted the power to query arbitrary points outside of $X$ (point synthesis), and in fact without this power there is an $\Omega(n)$ query lower bound due to Dasgupta (NeurIPS 2004). In this work our goal is to design efficient algorithms for learning halfspaces without point synthesis. To circumvent the $\Omega(n)$ lower bound, we consider learning halfspaces whose normal vectors come from a set of size $D$, and show tight bounds of $\Theta(D + \log n)$. As a corollary, we obtain an optimal $O(d + \log n)$ query deterministic learner for axis-aligned halfspaces, closing a previous gap of $O(d \log n)$ vs. $\Omega(d + \log n)$. In fact, our algorithm solves the more general problem of learning a Boolean function $f$ over $n$ elements which is monotone under at least one of $D$ provided orderings. Our technical insight is to exploit the structure in these orderings to perform a binary search in parallel rather than considering each ordering sequentially, and we believe our approach may be of broader interest. Furthermore, we use our exact learning algorithm to obtain nearly optimal algorithms for PAC-learning. We show that $O(\min(D + \log(1/\varepsilon), 1/\varepsilon) \cdot \log D)$ queries suffice to learn $f$ within error $\varepsilon$, even in a setting when $f$ can be adversarially corrupted on a $c\varepsilon$-fraction of points, for a sufficiently small constant $c$. This bound is optimal up to a $\log D$ factor, including in the realizable setting.
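
A toy illustration of the core primitive behind the $\Theta(D + \log n)$ bound: when $f$ is monotone under a known ordering, every label follows from one $O(\log n)$ binary search for the 0/1 boundary (the paper's contribution is running such searches over $D$ candidate orderings in parallel, which this sketch omits).

    def learn_monotone_labels(ordered_points, query):
        # Binary search for the boundary index where labels flip from 0 to 1.
        lo, hi = 0, len(ordered_points)
        while lo < hi:
            mid = (lo + hi) // 2
            if query(ordered_points[mid]):   # label 1 => boundary at or before mid
                hi = mid
            else:
                lo = mid + 1
        return [0] * lo + [1] * (len(ordered_points) - lo)

    pts = list(range(100))
    labels = learn_monotone_labels(pts, lambda p: p >= 37)
    print(sum(labels))   # 63 points labeled 1, found with ~7 queries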

cross Single Answer is Not Enough: On Generating Ranked Lists with Medical Reasoning Models

Authors: Pittawat Taveekitworachai, Natpatchara Pongjirapat, Krittaphas Chaisutyakorn, Piyalitt Ittichaiwong, Tossaporn Saengja, Kunat Pipatanakul

Abstract: This paper presents a systematic study on enabling medical reasoning models (MRMs) to generate ranked lists of answers for open-ended questions. Clinical decision-making rarely relies on a single answer but instead considers multiple options, reducing the risks of narrow perspectives. Yet current MRMs are typically trained to produce only one answer, even in open-ended settings. We propose an alternative format, ranked lists, and investigate two approaches: prompting and fine-tuning. While prompting is a cost-effective way to steer an MRM's response, not all MRMs generalize well across different answer formats: choice, short text, and list answers. Based on our prompting findings, we train and evaluate MRMs using supervised fine-tuning (SFT) and reinforcement fine-tuning (RFT). SFT teaches a model to imitate annotated responses, and RFT incentivizes exploration through the responses that maximize a reward. We propose new reward functions targeted at ranked-list answer formats, and conduct ablation studies for RFT. Our results show that while some SFT models generalize to certain answer formats, models trained with RFT are more robust across multiple formats. We also present a case study on a modified MedQA with multiple valid answers, finding that although MRMs might fail to select the benchmark's preferred ground truth, they can recognize valid answers. To the best of our knowledge, this is the first systematic investigation of approaches for enabling MRMs to generate answers as ranked lists. We hope this work provides a first step toward developing alternative answer formats that are beneficial beyond single answers in medical domains.

cross RecIS: Sparse to Dense, A Unified Training Framework for Recommendation Models

Authors: Hua Zong, Qingtao Zeng, Zhengxiong Zhou, Zhihua Han, Zhensong Yan, Mingjie Liu, Hechen Sun, Jiawei Liu, Yiwen Hu, Qi Wang, YiHan Xian, Wenjie Guo, Houyuan Xiang, Zhiyuan Zeng, Xiangrong Sheng, Bencheng Yan, Nan Hu, Yuheng Huang, Jinqing Lian, Ziru Xu, Yan Zhang, Ju Huang, Siran Yang, Huimin Yi, Jiamang Wang, Pengjie Wang, Han Zhu, Jian Wu, Dan Ou, Jian Xu, Haihong Tang, Yuning Jiang, Bo Zheng, Lin Qu

Abstract: In this paper, we propose RecIS, a unified Sparse-Dense training framework designed to achieve two primary goals: (1) Unified Framework: to create a unified sparse-dense training framework based on the PyTorch ecosystem that meets the training needs of industrial-grade recommendation models integrated with large models; (2) System Optimization: to optimize the sparse component, offering superior efficiency over TensorFlow-based recommendation models, while the dense component leverages existing optimization technologies within the PyTorch ecosystem. Currently, RecIS is being used in Alibaba for numerous large-model-enhanced recommendation training tasks, and some traditional sparse models have also begun training with it.

cross Nuclear Diffusion Models for Low-Rank Background Suppression in Videos

Authors: Tristan S. W. Stevens, Ois\'in Nolan, Jean-Luc Robert, Ruud J. G. van Sloun

Abstract: Video sequences often contain structured noise and background artifacts that obscure dynamic content, posing challenges for accurate analysis and restoration. Robust principal component analysis (RPCA) methods address this by decomposing data into low-rank and sparse components. However, the sparsity assumption often fails to capture the rich variability present in real video data. To overcome this limitation, a hybrid framework that integrates low-rank temporal modeling with diffusion posterior sampling is proposed. The proposed method, Nuclear Diffusion, is evaluated on a real-world medical imaging problem, namely cardiac ultrasound dehazing, and demonstrates improved dehazing performance compared to traditional RPCA in terms of contrast enhancement (gCNR) and signal preservation (KS statistic). These results highlight the potential of combining model-based temporal modeling with deep generative priors for high-fidelity video restoration.

cross Conditionally Whitened Generative Models for Probabilistic Time Series Forecasting

Authors: Yanfeng Yang, Siwei Chen, Pingping Hu, Zhaotong Shen, Yingjie Zhang, Zhuoran Sun, Shuai Li, Ziqi Chen, Kenji Fukumizu

Abstract: Probabilistic forecasting of multivariate time series is challenging due to non-stationarity, inter-variable dependencies, and distribution shifts. While recent diffusion and flow matching models have shown promise, they often ignore informative priors such as conditional means and covariances. In this work, we propose Conditionally Whitened Generative Models (CW-Gen), a framework that incorporates prior information through conditional whitening. Theoretically, we establish sufficient conditions under which replacing the traditional terminal distribution of diffusion models, namely the standard multivariate normal, with a multivariate normal distribution parameterized by estimators of the conditional mean and covariance improves sample quality. Guided by this analysis, we design a novel Joint Mean-Covariance Estimator (JMCE) that simultaneously learns the conditional mean and sliding-window covariance. Building on JMCE, we introduce Conditionally Whitened Diffusion Models (CW-Diff) and extend them to Conditionally Whitened Flow Matching (CW-Flow). Experiments on five real-world datasets with six state-of-the-art generative models demonstrate that CW-Gen consistently enhances predictive performance, capturing non-stationary dynamics and inter-variable correlations more effectively than prior-free approaches. Empirical results further demonstrate that CW-Gen can effectively mitigate the effects of distribution shift.
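
The whitening step at the heart of CW-Gen follows from standard Gaussian identities; in this sketch (ours), mu and Sigma stand in for the JMCE estimates of the conditional mean and covariance:

import numpy as np

def whiten(x, mu, Sigma, eps=1e-6):
    # Map x ~ N(mu, Sigma) to approximately N(0, I) via the Cholesky factor.
    L = np.linalg.cholesky(Sigma + eps * np.eye(len(mu)))
    return np.linalg.solve(L, x - mu)

def color(z, mu, Sigma, eps=1e-6):
    # Inverse map: recolor a whitened sample back to the data space.
    L = np.linalg.cholesky(Sigma + eps * np.eye(len(mu)))
    return mu + L @ z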

cross Reverse Fa\`a di Bruno's Formula for Cartesian Reverse Differential Categories

Authors: Aaron Biggin (Macquarie University), Jean-Simon Pacaud Lemay (Macquarie University)

Abstract: Reverse differentiation is an essential operation for automatic differentiation. Cartesian reverse differential categories axiomatize reverse differentiation in a categorical framework, where one of the primary axioms is the reverse chain rule, which is the formula that expresses the reverse derivative of a composition. Here, we present the reverse differential analogue of Faa di Bruno's Formula, which gives a higher-order reverse chain rule in a Cartesian reverse differential category. To properly do so, we also define partial reverse derivatives and higher-order reverse derivatives in a Cartesian reverse differential category.
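
For orientation, the first-order reverse chain rule that the formula generalizes reads, in standard reverse-derivative notation (our transcription):

\[
  R[g \circ f](x, dz) \;=\; R[f]\bigl(x,\, R[g](f(x), dz)\bigr)
\]
% A cotangent dz is pulled back through g at f(x) and then through f at x,
% mirroring backpropagation; the paper's formula is the higher-order analogue.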

cross Unlocking Noise-Resistant Vision: Key Architectural Secrets for Robust Models

Authors: Bum Jun Kim, Makoto Kawano, Yusuke Iwasawa, Yutaka Matsuo

Abstract: While the robustness of vision models is often measured, their dependence on specific architectural design choices is rarely dissected. We investigate why certain vision architectures are inherently more robust to additive Gaussian noise and convert these empirical insights into simple, actionable design rules. Specifically, we performed extensive evaluations on 1,174 pretrained vision models, empirically identifying four consistent design patterns for improved robustness against Gaussian noise: larger stem kernels, smaller input resolutions, average pooling, and supervised vision transformers (ViTs) rather than CLIP ViTs, which yield up to 506 rank improvements and 21.6\%p accuracy gains. We then develop a theoretical analysis that explains these findings, converting observed correlations into causal mechanisms. First, we prove that low-pass stem kernels attenuate noise with a gain that decreases quadratically with kernel size and that anti-aliased downsampling reduces noise energy roughly in proportion to the square of the downsampling factor. Second, we demonstrate that average pooling is unbiased and suppresses noise in proportion to the pooling window area, whereas max pooling incurs a positive bias that grows slowly with window size and yields a relatively higher mean-squared error and greater worst-case sensitivity. Third, we reveal and explain the vulnerability of CLIP ViTs via a pixel-space Lipschitz bound: The smaller normalization standard deviations used in CLIP preprocessing amplify worst-case sensitivity by up to 1.91 times relative to the Inception-style preprocessing common in supervised ViTs. Our results collectively disentangle robustness into interpretable modules, provide a theory that explains the observed trends, and build practical, plug-and-play guidelines for designing vision models more robust against Gaussian noise.
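
The pooling claims are easy to verify numerically; this quick numpy check (ours) on pure-noise 3x3 windows reproduces the unbiasedness of average pooling, its variance reduction by the window area, and the positive bias of max pooling:

import numpy as np

rng = np.random.default_rng(0)
win = rng.normal(size=(100_000, 9))  # 3x3 pooling windows of pure N(0, 1) noise

avg, mx = win.mean(axis=1), win.max(axis=1)
print(f"avg-pool: mean={avg.mean():+.3f}, var={avg.var():.3f}")  # ~ +0.000, ~1/9
print(f"max-pool: mean={mx.mean():+.3f}, var={mx.var():.3f}")    # positive bias ~ +1.49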

cross An Adaptor for Triggering Semi-Supervised Learning to Out-of-Box Serve Deep Image Clustering

Authors: Yue Duan, Lei Qi, Yinghuan Shi, Yang Gao

Abstract: Recently, some works integrate SSL techniques into deep clustering frameworks to enhance image clustering performance. However, they all need pretraining, clustering learning, or a trained clustering model as prerequisites, limiting the flexible and out-of-box application of SSL learners in the image clustering task. This work introduces ASD, an adaptor that enables the cold-start of SSL learners for deep image clustering without any prerequisites. Specifically, we first randomly sample pseudo-labeled data from all unlabeled data, and set an instance-level classifier to learn them with semantically aligned instance-level labels. With the ability of instance-level classification, we track the class transitions of predictions on unlabeled data to extract high-level similarities of instance-level classes, which can be utilized to assign cluster-level labels to pseudo-labeled data. Finally, we use the pseudo-labeled data with assigned cluster-level labels to trigger a general SSL learner trained on the unlabeled data for image clustering. We show the superior performance of ASD across various benchmarks against the latest deep image clustering approaches and very slight accuracy gaps compared to SSL methods using ground-truth, e.g., only 1.33% on CIFAR-10. Moreover, ASD can also further boost the performance of existing SSL-embedded deep image clustering methods.

cross Empirical PAC-Bayes bounds for Markov chains

Authors: Vahe Karagulyan, Pierre Alquier

Abstract: The core of generalization theory was developed for independent observations. Some PAC and PAC-Bayes bounds are available for data that exhibit a temporal dependence. However, there are constants in these bounds that depend on properties of the data-generating process: mixing coefficients, mixing time, spectral gap... Such constants are unknown in practice. In this paper, we prove a new PAC-Bayes bound for Markov chains. This bound depends on a quantity called the pseudo-spectral gap. The main novelty is that we can provide an empirical bound on the pseudo-spectral gap when the state space is finite. Thus, we obtain the first fully empirical PAC-Bayes bound for Markov chains. This extends beyond the finite case, although this requires additional assumptions. On simulated experiments, the empirical version of the bound is essentially as tight as the non-empirical one.

cross Fast-SEnSeI: Lightweight Sensor-Independent Cloud Masking for On-board Multispectral Sensors

Authors: Jan Kn\v{e}\v{z}\'ik, Jon\'a\v{s} Herec, Rado Pito\v{n}\'ak

Abstract: Cloud segmentation is a critical preprocessing step for many Earth observation tasks, yet most models are tightly coupled to specific sensor configurations and rely on ground-based processing. In this work, we propose Fast-SEnSeI, a lightweight, sensor-independent encoder module that enables flexible, on-board cloud segmentation across multispectral sensors with varying band configurations. Building upon SEnSeI-v2, Fast-SEnSeI integrates an improved spectral descriptor, lightweight architecture, and robust padding-band handling. It accepts arbitrary combinations of spectral bands and their wavelengths, producing fixed-size feature maps that feed into a compact, quantized segmentation model based on a modified U-Net. The module runs efficiently on embedded CPUs using Apache TVM, while the segmentation model is deployed on FPGA, forming a CPU-FPGA hybrid pipeline suitable for space-qualified hardware. Evaluations on Sentinel-2 and Landsat 8 datasets demonstrate accurate segmentation across diverse input configurations.

cross RollPacker: Mitigating Long-Tail Rollouts for Fast, Synchronous RL Post-Training

Authors: Wei Gao, Yuheng Zhao, Dakai An, Tianyuan Wu, Lunxi Cao, Shaopan Xiong, Ju Huang, Weixun Wang, Siran Yang, Wenbo Su, Jiamang Wang, Lin Qu, Bo Zheng, Wei Wang

Abstract: Reinforcement Learning (RL) is a pivotal post-training technique for enhancing the reasoning capabilities of Large Language Models (LLMs). However, synchronous RL post-training often suffers from significant GPU underutilization, referred to as bubbles, caused by imbalanced response lengths within rollout steps. Many RL systems attempt to alleviate this problem by relaxing synchronization, but this can compromise training accuracy. In this paper, we introduce tail batching, a novel rollout scheduling strategy for synchronous RL that systematically consolidates prompts leading to long-tail responses into a small subset of rollout steps (long rounds), while ensuring that the majority of steps (short rounds) involve only balanced, short rollouts. By excluding long responses from short rounds and rescheduling them into a few designated long rounds, tail batching effectively reduces GPU idle time during rollouts and significantly accelerates RL training without sacrificing accuracy. We present RollPacker, a system that fully harnesses the benefits of tail batching through holistic optimizations across all three RL stages: elastic parallelism adaptation for rollout, dynamic resource allocation and scheduling for reward, and stream-based training. Empirical results show that RollPacker achieves a 2.03x-2.56x end-to-end training time reduction compared to veRL and up to 2.24x speedup compared to RLHFuse for the Qwen2.5 family of LLMs on up to 128 H800 GPUs.
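
RollPacker schedules rollouts online; the following static sketch (ours; the predicted_len oracle and all names are assumptions) conveys only the core idea of tail batching, namely segregating long-tail prompts into a few dedicated rounds:

def tail_batch(prompts, predicted_len, batch_size, tail_frac=0.1):
    # Sort prompts by predicted response length, pack the long tail into
    # dedicated long rounds, and fill short rounds with balanced prompts.
    ranked = sorted(prompts, key=predicted_len)
    cut = int(len(ranked) * (1 - tail_frac))
    short, tail = ranked[:cut], ranked[cut:]
    rounds = [short[i:i + batch_size] for i in range(0, len(short), batch_size)]
    rounds += [tail[i:i + batch_size] for i in range(0, len(tail), batch_size)]
    return rounds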

cross Behind RoPE: How Does Causal Mask Encode Positional Information?

Authors: Junu Kim, Xiao Liu, Zhenghao Lin, Lei Ji, Yeyun Gong, Edward Choi

Abstract: While explicit positional encodings such as RoPE are a primary source of positional information in Transformer decoders, the causal mask also provides positional information. In this work, we prove that the causal mask can induce position-dependent patterns in attention scores, even without parameters or causal dependency in the input. Our theoretical analysis indicates that the induced attention pattern tends to favor nearby query-key pairs, mirroring the behavior of common positional encodings. Empirical analysis confirms that trained models exhibit the same behavior, with learned parameters further amplifying these patterns. Notably, we found that the interaction of causal mask and RoPE distorts RoPE's relative attention score patterns into non-relative ones. We consistently observed this effect in modern large language models, suggesting the importance of considering the causal mask as a source of positional information alongside explicit positional encodings.
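
The effect is simple to reproduce: with identical token embeddings and no learned parameters, the causal mask alone already produces a position-dependent attention pattern. A minimal PyTorch sketch (ours):

import torch
import torch.nn.functional as F

T, d = 6, 8
x = torch.ones(T, d)                 # identical tokens: no content signal at all
scores = (x @ x.T) / d ** 0.5        # all attention logits equal
mask = torch.triu(torch.full((T, T), float("-inf")), diagonal=1)
attn = F.softmax(scores + mask, dim=-1)
print(attn)  # row t is uniform over its t+1 visible positions (weight 1/(t+1)),
             # so the pattern depends on position even though the input does not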

cross Combinatorial Creativity: A New Frontier in Generalization Abilities

Authors: Samuel Schapiro, Sumuk Shashidhar, Alexi Gladstone, Jonah Black, Royce Moon, Dilek Hakkani-Tur, Lav R. Varshney

Abstract: Artificial intelligence (AI) systems, and large language models (LLMs) in particular, are increasingly employed for creative tasks like scientific idea generation, constituting a form of generalization from training data unaddressed by existing conceptual frameworks. Though in many ways similar to forms of compositional generalization (CG), combinatorial creativity (CC) is an open-ended ability. Instead of evaluating for accuracy or correctness against fixed targets, which would contradict the open-ended nature of CC, we propose a theoretical framework and algorithmic task for evaluating outputs by their degrees of novelty and utility. From here, we make several important empirical contributions: (1) We obtain the first insights into the scaling behavior of creativity for LLMs. (2) We discover that, for fixed compute budgets, there exist optimal model depths and widths for creative ability. (3) We find that the ideation-execution gap, whereby LLMs excel at generating novel scientific ideas but struggle to ensure their practical feasibility, may be explained by a more fundamental novelty-utility tradeoff characteristic of creativity algorithms in general. Importantly, this tradeoff remains persistent even at scale, casting doubt on the long-term creative potential of LLMs in their current form. Together, our conceptual framework and empirical findings provide a foundation for understanding and improving creativity in modern AI models, marking a new frontier in generalization abilities.

cross MPC-based Deep Reinforcement Learning Method for Space Robotic Control with Fuel Sloshing Mitigation

Authors: Mahya Ramezani, M. Amin Alandihallaj, Bar{\i}\c{s} Can Yal\c{c}{\i}n, Miguel Angel Olivares Mendez, Holger Voos

Abstract: This paper presents an integrated Reinforcement Learning (RL) and Model Predictive Control (MPC) framework for autonomous satellite docking with a partially filled fuel tank. Traditional docking control faces challenges due to fuel sloshing in microgravity, which induces unpredictable forces affecting stability. To address this, we integrate Proximal Policy Optimization (PPO) and Soft Actor-Critic (SAC) RL algorithms with MPC, leveraging MPC's predictive capabilities to accelerate RL training and improve control robustness. The proposed approach is validated through Zero-G Lab of SnT experiments for planar stabilization and high-fidelity numerical simulations for 6-DOF docking with fuel sloshing dynamics. Simulation results demonstrate that SAC-MPC achieves superior docking accuracy, higher success rates, and lower control effort, outperforming standalone RL and PPO-MPC methods. This study advances fuel-efficient and disturbance-resilient satellite docking, enhancing the feasibility of on-orbit refueling and servicing missions.

cross Communication Bias in Large Language Models: A Regulatory Perspective

Authors: Adrian Kuenzler, Stefan Schmid

Abstract: Large language models (LLMs) are increasingly central to many applications, raising concerns about bias, fairness, and regulatory compliance. This paper reviews risks of biased outputs and their societal impact, focusing on frameworks like the EU's AI Act and the Digital Services Act. We argue that beyond constant regulation, stronger attention to competition and design governance is needed to ensure fair, trustworthy AI. This is a preprint of the Communications of the ACM article of the same title.

cross Are Modern Speech Enhancement Systems Vulnerable to Adversarial Attacks?

Authors: Rostislav Makarov, Lea Sch\"onherr, Timo Gerkmann

Abstract: Machine learning approaches for speech enhancement are becoming increasingly expressive, enabling ever more powerful modifications of input signals. In this paper, we demonstrate that this expressiveness introduces a vulnerability: advanced speech enhancement models can be susceptible to adversarial attacks. Specifically, we show that adversarial noise, carefully crafted and psychoacoustically masked by the original input, can be injected such that the enhanced speech output conveys an entirely different semantic meaning. We experimentally verify that contemporary predictive speech enhancement models can indeed be manipulated in this way. Furthermore, we highlight that diffusion models with stochastic samplers exhibit inherent robustness to such adversarial attacks by design.

cross Best-of-$\infty$ -- Asymptotic Performance of Test-Time Compute

Authors: Junpei Komiyama, Daisuke Oba, Masafumi Oyamada

Abstract: We study best-of-$N$ for large language models (LLMs) where the selection is based on majority voting. In particular, we analyze the limit $N \to \infty$, which we denote as Best-of-$\infty$. While this approach achieves impressive performance in the limit, it requires an infinite test-time budget. To address this, we propose an adaptive generation scheme that selects $N$ based on answer agreement, thereby efficiently allocating inference-time computation. Beyond adaptivity, we extend the framework to weighted ensembles of multiple LLMs, showing that such mixtures can outperform any individual model. The optimal ensemble weighting is formulated and efficiently computed as a mixed-integer linear program. Extensive experiments demonstrate the effectiveness of our approach.
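
The adaptive scheme can be sketched as follows (a hedged reading of the abstract; the stopping rule and thresholds are our assumptions), where sample() draws one final answer from the LLM:

from collections import Counter

def adaptive_best_of_n(sample, n_min=4, n_max=64, agree=0.6):
    # Keep sampling until the modal answer reaches the agreement threshold
    # (or the budget n_max runs out), then return the majority-vote answer.
    answers = [sample() for _ in range(n_min)]
    while len(answers) < n_max:
        cnt = Counter(answers).most_common(1)[0][1]
        if cnt / len(answers) >= agree:
            break
        answers.append(sample())
    return Counter(answers).most_common(1)[0][0]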

cross Cross-Modal Instructions for Robot Motion Generation

Authors: William Barron, Xiaoxiang Dong, Matthew Johnson-Roberson, Weiming Zhi

Abstract: Teaching robots novel behaviors typically requires motion demonstrations via teleoperation or kinaesthetic teaching, that is, physically guiding the robot. While recent work has explored using human sketches to specify desired behaviors, data collection remains cumbersome, and demonstration datasets are difficult to scale. In this paper, we introduce an alternative paradigm, Learning from Cross-Modal Instructions, where robots are shaped by demonstrations in the form of rough annotations, which can contain free-form text labels, and are used in lieu of physical motion. We introduce the CrossInstruct framework, which integrates cross-modal instructions as examples into the context input to a foundational vision-language model (VLM). The VLM then iteratively queries a smaller, fine-tuned model, and synthesizes the desired motion over multiple 2D views. These are subsequently fused into a coherent distribution over 3D motion trajectories in the robot's workspace. By incorporating the reasoning of the large VLM with a fine-grained pointing model, CrossInstruct produces executable robot behaviors that generalize beyond the environments in the limited set of instruction examples. We then introduce a downstream reinforcement learning pipeline that leverages CrossInstruct outputs to efficiently learn policies to complete fine-grained tasks. We rigorously evaluate CrossInstruct on benchmark simulation tasks and real hardware, demonstrating effectiveness without additional fine-tuning and providing a strong initialization for policies subsequently refined via reinforcement learning.

cross Physics Informed Neural Networks for design optimisation of diamond particle detectors for charged particle fast-tracking at high luminosity hadron colliders

Authors: Alessandro Bombini, Alessandro Rosa, Clarissa Buti, Giovanni Passaleva, Lucio Anderlini

Abstract: Future high-luminosity hadron colliders demand tracking detectors with extreme radiation tolerance, high spatial precision, and sub-nanosecond timing. 3D diamond pixel sensors offer these capabilities due to diamond's radiation hardness and high carrier mobility. Conductive electrodes, produced via femtosecond IR laser pulses, exhibit high resistivity that delays signal propagation. This effect necessitates extending the classical Ramo-Shockley weighting potential formalism. We model the phenomenon through a 3rd-order, 3+1D PDE derived as a quasi-stationary approximation of Maxwell's equations. The PDE is solved numerically and coupled with charge transport simulations for realistic 3D sensor geometries. A Mixture-of-Experts Physics-Informed Neural Network, trained on Spectral Method data, provides a meshless solver to assess timing degradation from electrode resistance.

cross The Unwinnable Arms Race of AI Image Detection

Authors: Till Aczel, Lorenzo Vettor, Andreas Plesner, Roger Wattenhofer

Abstract: The rapid progress of image generative AI has blurred the boundary between synthetic and real images, fueling an arms race between generators and discriminators. This paper investigates the conditions under which discriminators are most disadvantaged in this competition. We analyze two key factors: data dimensionality and data complexity. While increased dimensionality often strengthens the discriminator's ability to detect subtle inconsistencies, complexity introduces a more nuanced effect. Using Kolmogorov complexity as a measure of intrinsic dataset structure, we show that both very simple and highly complex datasets reduce the detectability of synthetic images; generators can learn simple datasets almost perfectly, whereas extreme diversity masks imperfections. In contrast, intermediate-complexity datasets create the most favorable conditions for detection, as generators fail to fully capture the distribution and their errors remain visible.

cross Emerging Paradigms for Securing Federated Learning Systems

Authors: Amr Akmal Abouelmagd, Amr Hilal

Abstract: Federated Learning (FL) facilitates collaborative model training while keeping raw data decentralized, making it a conduit for leveraging the power of IoT devices while maintaining privacy of the locally collected data. However, existing privacy-preserving techniques present notable hurdles. Methods such as Multi-Party Computation (MPC), Homomorphic Encryption (HE), and Differential Privacy (DP) often incur high computational costs and suffer from limited scalability. This survey examines emerging approaches that hold promise for enhancing both privacy and efficiency in FL, including Trusted Execution Environments (TEEs), Physical Unclonable Functions (PUFs), Quantum Computing (QC), Chaos-Based Encryption (CBE), Neuromorphic Computing (NC), and Swarm Intelligence (SI). For each paradigm, we assess its relevance to the FL pipeline, outlining its strengths, limitations, and practical considerations. We conclude by highlighting open challenges and prospective research avenues, offering a detailed roadmap for advancing secure and scalable FL systems.

cross WISER: Segmenting watermarked region - an epidemic change-point perspective

Authors: Soham Bonnerjee, Sayar Karmakar, Subhrajyoty Roy

Abstract: With the increasing popularity of large language models, concerns over content authenticity have led to the development of myriad watermarking schemes. These schemes can be used to detect a machine-generated text via an appropriate key, while being imperceptible to readers with no such keys. The corresponding detection mechanisms usually take the form of statistical hypothesis testing for the existence of watermarks, spurring extensive research in this direction. However, the finer-grained problem of identifying which segments of a mixed-source text are actually watermarked, is much less explored; the existing approaches either lack scalability or theoretical guarantees robust to paraphrase and post-editing. In this work, we introduce a unique perspective to such watermark segmentation problems through the lens of epidemic change-points. By highlighting the similarities as well as differences of these two problems, we motivate and propose WISER: a novel, computationally efficient, watermark segmentation algorithm. We theoretically validate our algorithm by deriving finite sample error-bounds, and establishing its consistency in detecting multiple watermarked segments in a single text. Complementing these theoretical results, our extensive numerical experiments show that WISER outperforms state-of-the-art baseline methods, both in terms of computational speed as well as accuracy, on various benchmark datasets embedded with diverse watermarking schemes. Our theoretical and empirical findings establish WISER as an effective tool for watermark localization in most settings. It also shows how insights from a classical statistical problem can lead to a theoretically valid and computationally efficient solution of a modern and pertinent problem.
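
To make the epidemic change-point view concrete, here is a deliberately naive O(n^2) scan over per-token watermark scores (our illustration of the underlying statistic, not the WISER algorithm, which is more efficient and handles multiple segments):

import numpy as np

def scan_epidemic(scores, min_len=10):
    # scores: 1-D numpy array of per-token watermark scores. Returns the
    # segment [s, e) maximizing the standardized partial-sum statistic,
    # i.e., the most plausibly watermarked (elevated) stretch.
    x = (scores - scores.mean()) / scores.std()
    cs = np.concatenate([[0.0], np.cumsum(x)])
    n, best, seg = len(x), -np.inf, None
    for s in range(n - min_len + 1):
        for e in range(s + min_len, n + 1):
            stat = (cs[e] - cs[s]) / np.sqrt(e - s)
            if stat > best:
                best, seg = stat, (s, e)
    return seg, best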

cross Can Less Precise Be More Reliable? A Systematic Evaluation of Quantization's Impact on CLIP Beyond Accuracy

Authors: Aymen Bouguerra, Daniel Montoya, Alexandra Gomez-Villa, Fabio Arnez, Chokri Mraidha

Abstract: The powerful zero-shot generalization capabilities of vision-language models (VLMs) like CLIP have enabled new paradigms for safety-related tasks such as out-of-distribution (OOD) detection. However, additional aspects crucial for the computationally efficient and reliable deployment of CLIP are still overlooked. In particular, the impact of quantization on CLIP's performance beyond accuracy remains underexplored. This work presents a large-scale evaluation of quantization on CLIP models, assessing not only in-distribution accuracy but a comprehensive suite of reliability metrics and revealing counterintuitive results driven by pre-training source. We demonstrate that quantization consistently improves calibration for typically underconfident pre-trained models, while often degrading it for overconfident variants. Intriguingly, this degradation in calibration does not preclude gains in other reliability metrics; we find that OOD detection can still improve for these same poorly calibrated models. Furthermore, we identify specific quantization-aware training (QAT) methods that yield simultaneous gains in zero-shot accuracy, calibration, and OOD robustness, challenging the view of a strict efficiency-performance trade-off. These findings offer critical insights for navigating the multi-objective problem of deploying efficient, reliable, and robust VLMs by utilizing quantization beyond its conventional role.

cross Breaking the curse of dimensionality for linear rules: optimal predictors over the ellipsoid

Authors: Alexis Ayme, Bruno Loureiro

Abstract: In this work, we address the following question: What minimal structural assumptions are needed to prevent the degradation of statistical learning bounds with increasing dimensionality? We investigate this question in the classical statistical setting of signal estimation from $n$ independent linear observations $Y_i = X_i^{\top}\theta + \epsilon_i$. Our focus is on the generalization properties of a broad family of predictors that can be expressed as linear combinations of the training labels, $f(X) = \sum_{i=1}^{n} l_{i}(X) Y_i$. This class -- commonly referred to as linear prediction rules -- encompasses a wide range of popular parametric and non-parametric estimators, including ridge regression, gradient descent, and kernel methods. Our contributions are twofold. First, we derive non-asymptotic upper and lower bounds on the generalization error for this class under the assumption that the Bayes predictor $\theta$ lies in an ellipsoid. Second, we establish a lower bound for the subclass of rotationally invariant linear prediction rules when the Bayes predictor is fixed. Our analysis highlights two fundamental contributions to the risk: (a) a variance-like term that captures the intrinsic dimensionality of the data; (b) the noiseless error, a term that arises specifically in the high-dimensional regime. These findings shed light on the role of structural assumptions in mitigating the curse of dimensionality.
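
Ridge regression makes the class concrete: its kernel (dual) form gives label weights l(x) = (X X^T + lambda I)^{-1} X x, so the prediction is a linear combination of the training labels. A small numpy sketch (ours):

import numpy as np

def ridge_weights(X, x, lam=1.0):
    # Weights l_i(x) such that f(x) = sum_i l_i(x) * Y_i; note they do
    # not depend on the labels, the defining property of the class.
    G = X @ X.T + lam * np.eye(X.shape[0])
    return np.linalg.solve(G, X @ x)

# prediction: f_hat = ridge_weights(X_train, x_query) @ Y_train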

cross IntSR: An Integrated Generative Framework for Search and Recommendation

Authors: Huimin Yan, Longfei Xu, Junjie Sun, Ni Ou, Wei Luo, Xing Tan, Ran Cheng, Kaikui Liu, Xiangxiang Chu

Abstract: Generative recommendation has emerged as a promising paradigm, demonstrating remarkable results in both academic benchmarks and industrial applications. However, existing systems predominantly focus on unifying retrieval and ranking while neglecting the integration of search and recommendation (S&R) tasks. What makes search and recommendation different is how queries are formed: search uses explicit user requests, while recommendation relies on implicit user interests. As for retrieval versus ranking, the distinction comes down to whether the queries are the target items themselves. Recognizing the query as the central element, we propose IntSR, an integrated generative framework for S&R. IntSR integrates these disparate tasks using distinct query modalities. It also addresses the increased computational complexity associated with integrated S&R behaviors and the erroneous pattern learning introduced by a dynamically changing corpus. IntSR has been successfully deployed across various scenarios in Amap, leading to substantial improvements in digital asset GMV (+3.02%), POI recommendation CTR (+2.76%), and travel mode suggestion ACC (+5.13%).

cross Data-driven Neural Networks for Windkessel Parameter Calibration

Authors: Benedikt Hoock, Tobias K\"oppl

Abstract: In this work, we propose a novel method for calibrating Windkessel (WK) parameters in a dimensionally reduced 1D-0D coupled blood flow model. To this end, we design a data-driven neural network (NN) trained on simulated blood pressures in the left brachial artery. Once trained, the NN emulates the pressure pulse waves across the entire simulated domain, i.e., over time, space, and varying WK parameters, with negligible error and computational effort. To calibrate the WK parameters on a measured pulse wave, the NN is extended by dummy neurons and retrained only on these. The main objective of this work is to assess the effectiveness of the method in various scenarios -- particularly, when the exact measurement location is unknown or the data are affected by noise.

cross Learning Conformal Explainers for Image Classifiers

Authors: Amr Alkhatib, Stephanie Lowry

Abstract: Feature attribution methods are widely used for explaining image-based predictions, as they provide feature-level insights that can be intuitively visualized. However, such explanations often vary in their robustness and may fail to faithfully reflect the reasoning of the underlying black-box model. To address these limitations, we propose a novel conformal prediction-based approach that enables users to directly control the fidelity of the generated explanations. The method identifies a subset of salient features that is sufficient to preserve the model's prediction, regardless of the information carried by the excluded features, and without demanding access to ground-truth explanations for calibration. Four conformity functions are proposed to quantify the extent to which explanations conform to the model's predictions. The approach is empirically evaluated using five explainers across six image datasets. The empirical results demonstrate that FastSHAP consistently outperforms the competing methods in terms of both fidelity and informational efficiency, the latter measured by the size of the explanation regions. Furthermore, the results reveal that conformity measures based on super-pixels are more effective than their pixel-wise counterparts.
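
The four conformity functions are not specified in the abstract; the generic calibration step they plug into is standard split-conformal prediction, sketched here (ours):

import numpy as np

def conformal_cutoff(scores, alpha=0.1):
    # Given nonconformity scores on a calibration set, return the cutoff
    # that yields roughly (1 - alpha) coverage on exchangeable test points.
    n = len(scores)
    k = int(np.ceil((n + 1) * (1 - alpha)))
    return np.sort(scores)[min(k, n) - 1]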

cross Response to Promises and Pitfalls of Deep Kernel Learning

Authors: Andrew Gordon Wilson, Zhiting Hu, Ruslan Salakhutdinov, Eric P. Xing

Abstract: This note responds to "Promises and Pitfalls of Deep Kernel Learning" (Ober et al., 2021). The marginal likelihood of a Gaussian process can be compartmentalized into a data fit term and a complexity penalty. Ober et al. (2021) shows that if a kernel can be multiplied by a signal variance coefficient, then reparametrizing and substituting in the maximized value of this parameter sets a reparametrized data fit term to a fixed value. They use this finding to argue that the complexity penalty, a log determinant of the kernel matrix, then dominates in determining the other values of kernel hyperparameters, which can lead to data overcorrelation. By contrast, we show that the reparametrization in fact introduces another data-fit term which influences all other kernel hyperparameters. Thus, a balance between data fit and complexity still plays a significant role in determining kernel hyperparameters.
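
For reference, the compartmentalization under discussion is log p(y) = -1/2 y^T K^{-1} y - 1/2 log|K| - (n/2) log 2pi; this numpy sketch (ours) computes the data-fit and complexity terms separately:

import numpy as np

def log_marginal_terms(K, y):
    # Split the GP log marginal likelihood log N(y | 0, K) into its
    # data-fit term, complexity penalty, and additive constant.
    n = len(y)
    L = np.linalg.cholesky(K + 1e-8 * np.eye(n))
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    data_fit = -0.5 * y @ alpha              # -1/2 y^T K^{-1} y
    complexity = -np.log(np.diag(L)).sum()   # -1/2 log det K
    const = -0.5 * n * np.log(2 * np.pi)
    return data_fit, complexity, const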

cross Decipher-MR: A Vision-Language Foundation Model for 3D MRI Representations

Authors: Zhijian Yang, Noel DSouza, Istvan Megyeri, Xiaojian Xu, Amin Honarmandi Shandiz, Farzin Haddadpour, Krisztian Koos, Laszlo Rusko, Emanuele Valeriano, Bharadwaj Swaninathan, Lei Wu, Parminder Bhatia, Taha Kass-Hout, Erhan Bas

Abstract: Magnetic Resonance Imaging (MRI) is a critical medical imaging modality in clinical diagnosis and research, yet its complexity and heterogeneity pose challenges for automated analysis, particularly in scalable and generalizable machine learning applications. While foundation models have revolutionized natural language and vision tasks, their application to MRI remains limited due to data scarcity and narrow anatomical focus. In this work, we present Decipher-MR, a 3D MRI-specific vision-language foundation model trained on a large-scale dataset comprising 200,000 MRI series from over 22,000 studies spanning diverse anatomical regions, sequences, and pathologies. Decipher-MR integrates self-supervised vision learning with report-guided text supervision to build robust, generalizable representations, enabling effective adaptation across broad applications. To enable robust and diverse clinical tasks with minimal computational overhead, Decipher-MR supports a modular design that enables tuning of lightweight, task-specific decoders attached to a frozen pretrained encoder. Following this setting, we evaluate Decipher-MR across diverse benchmarks including disease classification, demographic prediction, anatomical localization, and cross-modal retrieval, demonstrating consistent performance gains over existing foundation models and task-specific approaches. Our results establish Decipher-MR as a scalable and versatile foundation for MRI-based AI, facilitating efficient development across clinical and research domains.

cross Does FLUX Already Know How to Perform Physically Plausible Image Composition?

Authors: Shilin Lu, Zhuming Lian, Zihan Zhou, Shaocong Zhang, Chen Zhao, Adams Wai-Kin Kong

Abstract: Image composition aims to seamlessly insert a user-specified object into a new scene, but existing models struggle with complex lighting (e.g., accurate shadows, water reflections) and diverse, high-resolution inputs. Modern text-to-image diffusion models (e.g., SD3.5, FLUX) already encode essential physical and resolution priors, yet lack a framework to unleash them without resorting to latent inversion, which often locks object poses into contextually inappropriate orientations, or brittle attention surgery. We propose SHINE, a training-free framework for Seamless, High-fidelity Insertion with Neutralized Errors. SHINE introduces manifold-steered anchor loss, leveraging pretrained customization adapters (e.g., IP-Adapter) to guide latents for faithful subject representation while preserving background integrity. Degradation-suppression guidance and adaptive background blending are proposed to further eliminate low-quality outputs and visible seams. To address the lack of rigorous benchmarks, we introduce ComplexCompo, featuring diverse resolutions and challenging conditions such as low lighting, strong illumination, intricate shadows, and reflective surfaces. Experiments on ComplexCompo and DreamEditBench show state-of-the-art performance on standard metrics (e.g., DINOv2) and human-aligned scores (e.g., DreamSim, ImageReward, VisionReward). Code and benchmark will be publicly available upon publication.

cross Taxonomy-aware Dynamic Motion Generation on Hyperbolic Manifolds

Authors: Luis Augenstein, No\'emie Jaquier, Tamim Asfour, Leonel Rozo

Abstract: Human-like motion generation for robots often draws inspiration from biomechanical studies, which categorize complex human motions into hierarchical taxonomies. While these taxonomies provide rich structural information about how movements relate to one another, this information is frequently overlooked in motion generation models, leading to a disconnect between the generated motions and their underlying hierarchical structure. This paper introduces the GPHDM, a novel approach that learns latent representations preserving both the hierarchical structure of motions and their temporal dynamics to ensure physical consistency. Our model achieves this by extending the dynamics prior of the Gaussian Process Dynamical Model (GPDM) to the hyperbolic manifold and integrating it with taxonomy-aware inductive biases. Building on this geometry- and taxonomy-aware framework, we propose three novel mechanisms for generating motions that are both taxonomically-structured and physically-consistent: two probabilistic recursive approaches and a method based on pullback-metric geodesics. Experiments on generating realistic motion sequences on the hand grasping taxonomy show that the proposed GPHDM faithfully encodes the underlying taxonomy and temporal dynamics, and generates novel physically-consistent trajectories.

cross Maxout Polytopes

Authors: Andrei Balakin, Shelby Cox, Georg Loho, Bernd Sturmfels

Abstract: Maxout polytopes are defined by feedforward neural networks with maxout activation function and non-negative weights after the first layer. We characterize the parameter spaces and extremal f-vectors of maxout polytopes for shallow networks, and we study the separating hypersurfaces which arise when a layer is added to the network. We also show that maxout polytopes are cubical for generic networks without bottlenecks.

cross RLBFF: Binary Flexible Feedback to bridge between Human Feedback & Verifiable Rewards

Authors: Zhilin Wang, Jiaqi Zeng, Olivier Delalleau, Ellie Evans, Daniel Egert, Hoo-Chang Shin, Felipe Soares, Yi Dong, Oleksii Kuchaiev

Abstract: Reinforcement Learning with Human Feedback (RLHF) and Reinforcement Learning with Verifiable Rewards (RLVR) are the main RL paradigms used in LLM post-training, each offering distinct advantages. However, RLHF struggles with interpretability and reward hacking because it relies on human judgments that usually lack explicit criteria, whereas RLVR is limited in scope by its focus on correctness-based verifiers. We propose Reinforcement Learning with Binary Flexible Feedback (RLBFF), which combines the versatility of human-driven preferences with the precision of rule-based verification, enabling reward models to capture nuanced aspects of response quality beyond mere correctness. RLBFF extracts principles that can be answered in a binary fashion (e.g. accuracy of information: yes, or code readability: no) from natural language feedback. Such principles can then be used to ground Reward Model training as an entailment task (response satisfies or does not satisfy an arbitrary principle). We show that Reward Models trained in this manner can outperform Bradley-Terry models when matched for data and achieve top performance on RM-Bench (86.2%) and JudgeBench (81.4%, #1 on leaderboard as of September 24, 2025). Additionally, users can specify principles of interest at inference time to customize the focus of our reward models, in contrast to Bradley-Terry models. Finally, we present a fully open source recipe (including data) to align Qwen3-32B using RLBFF and our Reward Model, to match or exceed the performance of o3-mini and DeepSeek R1 on general alignment benchmarks of MT-Bench, WildBench, and Arena Hard v2 (at <5% of the inference cost).

replace Learning to Bid Optimally and Efficiently in Adversarial First-price Auctions

Authors: Yanjun Han, Zhengyuan Zhou, Aaron Flores, Erik Ordentlich, Tsachy Weissman

Abstract: First-price auctions have very recently swept the online advertising industry, replacing second-price auctions as the predominant auction mechanism on many platforms. This shift has brought forth important challenges for a bidder: how should one bid in a first-price auction, where unlike in second-price auctions, it is no longer optimal to bid one's private value truthfully and hard to know the others' bidding behaviors? In this paper, we take an online learning angle and address the fundamental problem of learning to bid in repeated first-price auctions, where both the bidder's private valuations and other bidders' bids can be arbitrary. We develop the first minimax optimal online bidding algorithm that achieves an $\widetilde{O}(\sqrt{T})$ regret when competing with the set of all Lipschitz bidding policies, a strong oracle that contains a rich set of bidding strategies. This novel algorithm is built on the insight that the presence of a good expert can be leveraged to improve performance, as well as an original hierarchical expert-chaining structure, both of which could be of independent interest in online learning. Further, by exploiting the product structure that exists in the problem, we modify this algorithm--in its vanilla form statistically optimal but computationally infeasible--to a computationally efficient and space efficient algorithm that also retains the same $\widetilde{O}(\sqrt{T})$ minimax optimal regret guarantee. Additionally, through an impossibility result, we highlight that one is unlikely to compete this favorably with a stronger oracle (than the considered Lipschitz bidding policies). Finally, we test our algorithm on three real-world first-price auction datasets obtained from Verizon Media and demonstrate our algorithm's superior performance compared to several existing bidding algorithms.

replace Contextual Combinatorial Bandits with Changing Action Sets via Gaussian Processes

Authors: Andi Nika, Sepehr Elahi, Cem Tekin

Abstract: We consider a contextual bandit problem with a combinatorial action set and time-varying base arm availability. At the beginning of each round, the agent observes the set of available base arms and their contexts and then selects an action that is a feasible subset of the set of available base arms to maximize its cumulative reward in the long run. We assume that the mean outcomes of base arms are samples from a Gaussian Process (GP) indexed by the context set ${\cal X}$, and the expected reward is Lipschitz continuous in expected base arm outcomes. For this setup, we propose an algorithm called Optimistic Combinatorial Learning and Optimization with Kernel Upper Confidence Bounds (O'CLOK-UCB) and prove that it incurs $\tilde{O}(\sqrt{\lambda^*(K)KT\gamma_{KT}(\cup_{t\leq T}\mathcal{X}_t)} )$ regret with high probability, where $\gamma_{KT}(\cup_{t\leq T}\mathcal{X}_t)$ is the maximum information gain associated with the sets of base arm contexts $\mathcal{X}_t$ that appeared in the first $T$ rounds, $K$ is the maximum cardinality of any feasible action over all rounds, and $\lambda^*(K)$ is the maximum eigenvalue of all covariance matrices of selected actions up to time $T$, which is a function of $K$. To dramatically speed up the algorithm, we also propose a variant of O'CLOK-UCB that uses sparse GPs. Finally, we experimentally show that both algorithms exploit inter-base arm outcome correlation and vastly outperform the previous state-of-the-art UCB-based algorithms in realistic setups.

replace Security of Deep Reinforcement Learning for Autonomous Driving: A Survey

Authors: Ambra Demontis, Srishti Gupta, Maura Pintor, Luca Demetrio, Kathrin Grosse, Hsiao-Ying Lin, Chengfang Fang, Battista Biggio, Fabio Roli

Abstract: Reinforcement learning (RL) enables agents to learn optimal behaviors through interaction with their environment and has been increasingly deployed in safety-critical applications, including autonomous driving. Despite its promise, RL is susceptible to attacks designed either to compromise policy learning or to induce erroneous decisions by trained agents. Although the literature on RL security has grown rapidly and several surveys exist, existing categorizations often fall short in guiding the selection of appropriate defenses for specific systems. In this work, we present a comprehensive survey of 86 recent studies on RL security, addressing these limitations by systematically categorizing attacks and defenses according to defined threat models and single- versus multi-agent settings. Furthermore, we examine the relevance and applicability of state-of-the-art attacks and defense mechanisms within the context of autonomous driving, providing insights to inform the design of robust RL systems.

replace Towards the Identifiability in Noisy Label Learning: A Multinomial Mixture Modelling Approach

Authors: Cuong Nguyen, Thanh-Toan Do, Gustavo Carneiro

Abstract: Learning from noisy labels (LNL) is crucial in deep learning, in which one of the approaches is to identify clean-label samples from poorly-annotated datasets. Such an identification is challenging because the conventional LNL problem, which assumes only one noisy label per instance, is non-identifiable, i.e., clean labels cannot be estimated theoretically without additional heuristics. This paper presents a novel data-driven approach that addresses this issue without requiring any heuristics about clean samples. We discover that the LNL problem becomes identifiable if there are at least $2C - 1$ i.i.d. noisy labels per instance, where $C$ is the number of classes. Our finding relies on the assumption of i.i.d. noisy labels and multinomial mixture modelling, making it easier to interpret than previous studies that require full-rank noisy-label transition matrices. To fulfil this condition without additional manual annotations, we propose a method that automatically generates additional i.i.d. noisy labels through nearest neighbours. These noisy labels are then used in the Expectation-Maximisation algorithm to infer clean labels. Our method demonstrably estimates clean labels accurately across various label noise benchmarks, including synthetic, web-controlled, and real-world datasets. Furthermore, the model trained with our method performs competitively with many state-of-the-art methods.
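
The estimation backbone is a mixture of multinomials over per-instance noisy-label counts; a generic EM for that model (our sketch, omitting the paper's nearest-neighbour label generation) looks like:

import numpy as np

def em_multinomial_mixture(counts, C, iters=100, seed=0):
    # counts: (n, C) array; counts[i, c] = how many of instance i's noisy
    # labels equal class c. Returns class priors, per-class noisy-label
    # distributions, and posterior responsibilities (soft clean labels).
    rng = np.random.default_rng(seed)
    pi = np.full(C, 1.0 / C)
    theta = rng.dirichlet(10.0 * np.ones(C), size=C)
    for _ in range(iters):
        log_r = np.log(pi) + counts @ np.log(theta).T   # E-step
        log_r -= log_r.max(axis=1, keepdims=True)
        r = np.exp(log_r)
        r /= r.sum(axis=1, keepdims=True)
        pi = r.mean(axis=0)                             # M-step
        theta = r.T @ counts + 1e-9
        theta /= theta.sum(axis=1, keepdims=True)
    return pi, theta, r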

replace Estimating Deep Learning energy consumption based on model architecture and training environment

Authors: Santiago del Rey, Lu\'is Cruz, Xavier Franch, Silverio Mart\'inez-Fern\'andez

Abstract: To raise awareness of the environmental impact of deep learning (DL), many studies estimate the energy use of DL systems. However, energy estimates during DL training often rely on unverified assumptions. This work addresses that gap by investigating how model architecture and training environment affect energy consumption. We train a variety of computer vision models and collect energy consumption and accuracy metrics to analyze their trade-offs across configurations. Our results show that selecting the right model-training environment combination can reduce training energy consumption by up to 80.68% with less than 2% loss in $F_1$ score. We find a significant interaction effect between model and training environment: energy efficiency improves when GPU computational power scales with model complexity. Moreover, we demonstrate that common estimation practices, such as using FLOPs or GPU TDP, fail to capture these dynamics and can lead to substantial errors. To address these shortcomings, we propose the Stable Training Epoch Projection (STEP) and the Pre-training Regression-based Estimation (PRE) methods. Across evaluations, our methods outperform existing tools by a factor of two or more in estimation accuracy.

replace Energy based diffusion generator for efficient sampling of Boltzmann distributions

Authors: Yan Wang, Ling Guo, Hao Wu, Tao Zhou

Abstract: Sampling from Boltzmann distributions, particularly those tied to high dimensional and complex energy functions, poses a significant challenge in many fields. In this work, we present the Energy-Based Diffusion Generator (EDG), a novel approach that integrates ideas from variational autoencoders and diffusion models. EDG uses a decoder to generate Boltzmann-distributed samples from simple latent variables, and a diffusion-based encoder to estimate the Kullback-Leibler divergence to the target distribution. Notably, EDG is simulation-free, eliminating the need to solve ordinary or stochastic differential equations during training. Furthermore, by removing constraints such as bijectivity in the decoder, EDG allows for flexible network design. Through empirical evaluation, we demonstrate the superior performance of EDG across a variety of sampling tasks with complex target distributions, outperforming existing methods.

replace Reinforcement Learning in Categorical Cybernetics

Authors: Jules Hedges (University of Strathclyde), Riu Rodr\'iguez Sakamoto (University of Strathclyde)

Abstract: We show that several major algorithms of reinforcement learning (RL) fit into the framework of categorical cybernetics, that is to say, parametrised bidirectional processes. We build on our previous work in which we show that value iteration can be represented by precomposition with a certain optic. The outline of the main construction in this paper is: (1) We extend the Bellman operators to parametrised optics that apply to action-value functions and depend on a sample. (2) We apply a representable contravariant functor, obtaining a parametrised function that applies the Bellman iteration. (3) This parametrised function becomes the backward pass of another parametrised optic that represents the model, which interacts with an environment via an agent. Thus, parametrised optics appear in two different ways in our construction, with one becoming part of the other. As we show, many of the major classes of algorithms in RL can be seen as different extremal cases of this general setup: dynamic programming, Monte Carlo methods, temporal difference learning, and deep RL. We see this as strong evidence that this approach is a natural one and believe that it will be a fruitful way to think about RL in the future.

replace Least Volume Analysis

Authors: Qiuyi Chen, Cashen Diniz, Mark Fuge

Abstract: This paper introduces Least Volume (LV)--a simple yet effective regularization method inspired by geometric intuition--that reduces the number of latent dimensions required by an autoencoder without prior knowledge of the dataset's intrinsic dimensionality. We show that its effectiveness depends on the Lipschitz continuity of the decoder, prove that Principal Component Analysis (PCA) is a linear special case, and demonstrate that LV induces a PCA-like importance ordering in nonlinear models. We extend LV to non-Euclidean settings as Generalized Least Volume (GLV), enabling the integration of label information into the latent representation. To support implementation, we also develop an accompanying Dynamic Pruning algorithm. We evaluate LV on several benchmark problems, demonstrating its effectiveness in dimension reduction. Leveraging this, we reveal the role of low-dimensional latent spaces in data sampling and disentangled representation, and use them to probe the varying topological complexity of various datasets. GLV is further applied to labeled datasets, where it induces a contrastive learning effect in representations of discrete labels. On a continuous-label airfoil dataset, it produces representations that lead to smooth changes in aerodynamic performance, thereby stabilizing downstream optimization.
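
One plausible reading of the volume penalty (the exact form in the paper may differ; this sketch is ours):

import torch

def least_volume_penalty(z, eps=1e-2):
    # Penalize the geometric mean of per-dimension latent standard
    # deviations, a differentiable proxy for the volume of the latent
    # bounding box; unneeded dimensions are driven toward zero variance.
    std = z.std(dim=0)
    return torch.exp(torch.log(std + eps).mean())

# total_loss = reconstruction_loss + lam * least_volume_penalty(latents)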

replace Edge Probability Graph Models Beyond Edge Independency: Concepts, Analyses, and Algorithms

Authors: Fanchen Bu, Ruochen Yang, Paul Bogdan, Kijung Shin

Abstract: Desirable random graph models (RGMs) should (i) reproduce common patterns in real-world graphs (e.g., power-law degrees, small diameters, and high clustering), (ii) generate variable (i.e., not overly similar) graphs, and (iii) remain tractable to compute and control graph statistics. A common class of RGMs (e.g., Erdos-Renyi and stochastic Kronecker) outputs edge probabilities, so we need to realize (i.e., sample from) the output edge probabilities to generate graphs. Typically, the existence of each edge is assumed to be determined independently, for simplicity and tractability. However, with edge independency, RGMs provably cannot produce high subgraph densities and high output variability simultaneously. In this work, we explore RGMs beyond edge independence that can better reproduce common patterns while maintaining high tractability and variability. Theoretically, we propose an edge-dependent realization (i.e., sampling) framework called binding that provably preserves output variability, and derive closed-form tractability results on subgraph (e.g., triangle) densities. Practically, we propose algorithms for graph generation with binding and parameter fitting of binding. Our empirical results demonstrate that RGMs with binding exhibit high tractability and well reproduce common patterns, significantly improving upon edge-independent RGMs.

replace Agreement-Based Cascading for Efficient Inference

Authors: Steven Kolawole, Don Dennis, Ameet Talwalkar, Virginia Smith

Abstract: Adaptive inference schemes reduce the cost of machine learning inference by assigning smaller models to easier examples, attempting to avoid invocation of larger models when possible. In this work we explore a simple, effective adaptive inference technique we term Agreement-Based Cascading (ABC). ABC builds a cascade of models of increasing size/complexity, and uses agreement between ensembles of models at each level of the cascade as a basis for data-dependent routing. Although ensemble execution introduces additional expense, we show that these costs can be easily offset in practice due to large expected differences in model sizes, parallel inference execution capabilities, and accuracy benefits of ensembling. We examine ABC theoretically and empirically in terms of these parameters, showing that the approach can reliably act as a drop-in replacement for existing models and surpass the best single model it aims to replace in terms of both efficiency and accuracy. Additionally, we explore the performance of ABC relative to existing cascading methods in three common scenarios: (1) edge-to-cloud inference, where ABC reduces communication costs by up to 14x; (2) cloud-based model serving, where it achieves a 3x reduction in rental costs; and (3) inference via model API services, where ABC achieves a 2-25x reduction in average price per token/request relative to state-of-the-art LLM cascades.
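
The routing rule is simple enough to sketch (our reading of the abstract; ABC may use a softer agreement criterion than unanimity):

from collections import Counter

def abc_predict(cascade, x):
    # cascade: list of ensembles of increasing model size. Stop at the
    # first level whose members fully agree; otherwise escalate, and
    # finish with a majority vote at the largest level.
    for ensemble in cascade[:-1]:
        preds = [model(x) for model in ensemble]
        if len(set(preds)) == 1:
            return preds[0]
    return Counter(m(x) for m in cascade[-1]).most_common(1)[0][0]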

replace Context-Aware Reasoning On Parametric Knowledge for Inferring Causal Variables

Authors: Ivaxi Sheth, Sahar Abdelnabi, Mario Fritz

Abstract: Scientific discovery catalyzes human intellectual advances, driven by the cycle of hypothesis generation, experimental design, evaluation, and assumption refinement. Central to this process is causal inference, uncovering the mechanisms behind observed phenomena. While randomized experiments provide strong inferences, they are often infeasible due to ethical or practical constraints. However, observational studies are prone to confounding or mediating biases. While crucial, identifying such backdoor paths is expensive and heavily depends on scientists' domain knowledge to generate hypotheses. We introduce a novel benchmark where the objective is to complete a partial causal graph, comprising over 4,000 queries across varying difficulty levels. We show the strong ability of LLMs to hypothesize the backdoor variables between a cause and its effect. Unlike simple knowledge memorization of fixed associations, our task requires the LLM to reason according to the context of the entire graph.

replace FoMo-0D: A Foundation Model for Zero-shot Tabular Outlier Detection

Authors: Yuchen Shen, Haomin Wen, Leman Akoglu

Abstract: Outlier detection (OD) has a vast literature as it finds numerous real-world applications. Since OD is an unsupervised task, model selection is a key bottleneck without label supervision. Despite a long list of available OD algorithms with tunable hyperparameters, the lack of systematic approaches for unsupervised algorithm and hyperparameter selection limits their effective use in practice. In this paper, we present FoMo-0D, a pre-trained Foundation Model for zero/0-shot OD on tabular data, which bypasses the hurdle of model selection altogether. Having been pre-trained on synthetic data, FoMo-0D can directly predict the (outlier/inlier) label of test samples without parameter fine-tuning -- requiring no labeled data, and no additional training or hyperparameter tuning when given a new task. Extensive experiments on 57 real-world datasets against 26 baselines show that FoMo-0D is highly competitive, outperforming the majority of the baselines with no statistically significant difference from the 2nd best method. Further, FoMo-0D is efficient in inference time, requiring only 7.7 ms per sample on average, with at least 7x speed-up compared to previous methods. To facilitate future research, our implementations for data synthesis and pre-training as well as model checkpoints are openly available at https://github.com/A-Chicharito-S/FoMo-0D.

URLs: https://github.com/A-Chicharito-S/FoMo-0D.

replace DimINO: Dimension-Informed Neural Operator Learning

Authors: Yichen Song, Yalun Wu, Yunbo Wang, Xiaokang Yang

Abstract: In computational physics, a longstanding challenge lies in finding numerical solutions to partial differential equations (PDEs). Recently, research attention has increasingly focused on Neural Operator methods, which are notable for their ability to approximate operators, i.e., mappings between functions. Although neural operators benefit from a universal approximation theorem, achieving reliable error bounds often necessitates large model architectures, such as deep stacks of Fourier layers. This raises a natural question: Can we design lightweight models without sacrificing generalization? To address this, we introduce DimINO (Dimension-Informed Neural Operators), a framework inspired by dimensional analysis. DimINO incorporates two key components, DimNorm and a redimensionalization operation, which can be seamlessly integrated into existing neural operator architectures. These components enhance the model's ability to generalize across datasets with varying physical parameters. Theoretically, we establish a universal approximation theorem for DimINO and prove that it satisfies a critical property we term Similar Transformation Invariance (STI). Empirically, DimINO achieves up to 76.3% performance gain on PDE datasets while exhibiting clear evidence of the STI property.

replace Expressiveness of Multi-Neuron Convex Relaxations in Neural Network Certification

Authors: Yuhao Mao, Yani Zhang, Martin Vechev

Abstract: Neural network certification methods heavily rely on convex relaxations to provide robustness guarantees. However, these relaxations are often imprecise: even the most accurate single-neuron relaxation is incomplete for general ReLU networks, a limitation known as the \emph{single-neuron convex barrier}. While multi-neuron relaxations have been heuristically applied to address this issue, two central questions arise: (i) whether they overcome the convex barrier, and if not, (ii) whether they offer theoretical capabilities beyond those of single-neuron relaxations. In this work, we present the first rigorous analysis of the expressiveness of multi-neuron relaxations. Perhaps surprisingly, we show that they are inherently incomplete, even when allocated sufficient resources to capture finitely many neurons and layers optimally. This result extends the single-neuron barrier to a \textit{universal convex barrier} for neural network certification. On the positive side, we show that completeness can be achieved by either (i) augmenting the network with a polynomial number of carefully designed ReLU neurons or (ii) partitioning the input domain into convex sub-polytopes, thereby distinguishing multi-neuron relaxations from single-neuron ones which are unable to realize the former and have worse partition complexity for the latter. Our findings establish a foundation for multi-neuron relaxations and point to new directions for certified robustness, including training methods tailored to multi-neuron relaxations and verification methods with multi-neuron relaxations as the main subroutine.

replace Bias Similarity Measurement: A Black-Box Audit of Fairness Across LLMs

Authors: Hyejun Jeong, Shiqing Ma, Amir Houmansadr

Abstract: Large Language Models (LLMs) reproduce social biases, yet prevailing evaluations score models in isolation, obscuring how biases persist across families and releases. We introduce Bias Similarity Measurement (BSM), which treats fairness as a relational property between models, unifying scalar, distributional, behavioral, and representational signals into a single similarity space. Evaluating 30 LLMs on 1M+ prompts, we find that instruction tuning primarily enforces abstention rather than altering internal representations; small models gain little accuracy and can become less fair under forced choice; and open-weight models can match or exceed proprietary systems. Family signatures diverge: Gemma favors refusal, while LLaMA 3.1 approaches neutrality with fewer refusals; overall, models converge toward abstention-heavy behavior. Counterintuitively, Gemma 3 Instruct matches GPT-4-level fairness at far lower cost, whereas Gemini's heavy abstention suppresses utility. Beyond these findings, BSM offers an auditing workflow for procurement, regression testing, and lineage screening, and extends naturally to code and multilingual settings. Our results reframe fairness not as isolated scores but as comparative bias similarity, enabling systematic auditing of LLM ecosystems. Code available at https://github.com/HyejunJeong/bias_llm.

URLs: https://github.com/HyejunJeong/bias_llm.

replace Understanding Optimization in Deep Learning with Central Flows

Authors: Jeremy M. Cohen, Alex Damian, Ameet Talwalkar, J. Zico Kolter, Jason D. Lee

Abstract: Traditional theories of optimization cannot describe the dynamics of optimization in deep learning, even in the simple setting of deterministic training. The challenge is that optimizers typically operate in a complex, oscillatory regime called the "edge of stability." In this paper, we develop theory that can describe the dynamics of optimization in this regime. Our key insight is that while the *exact* trajectory of an oscillatory optimizer may be challenging to analyze, the *time-averaged* (i.e. smoothed) trajectory is often much more tractable. To analyze an optimizer, we derive a differential equation called a "central flow" that characterizes this time-averaged trajectory. We empirically show that these central flows can predict long-term optimization trajectories for generic neural networks with a high degree of numerical accuracy. By interpreting these central flows, we are able to understand how gradient descent makes progress even as the loss sometimes goes up; how adaptive optimizers "adapt" to the local loss landscape; and how adaptive optimizers implicitly navigate towards regions where they can take larger steps. Our results suggest that central flows can be a valuable theoretical tool for reasoning about optimization in deep learning.

replace Bayesian Optimization with Preference Exploration using a Monotonic Neural Network Ensemble

Authors: Hanyang Wang, Juergen Branke, Matthias Poloczek

Abstract: Many real-world black-box optimization problems have multiple conflicting objectives. Rather than attempting to approximate the entire set of Pareto-optimal solutions, interactive preference learning allows the search to be focused on the most relevant subset. However, few previous studies have exploited the fact that utility functions are usually monotonic. In this paper, we address the Bayesian Optimization with Preference Exploration (BOPE) problem and propose using a neural network ensemble as a utility surrogate model. This approach naturally integrates monotonicity and supports pairwise comparison data. Our experiments demonstrate that the proposed method outperforms state-of-the-art approaches and exhibits robustness to noise in utility evaluations. An ablation study highlights the critical role of monotonicity in enhancing performance.
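
One standard way to hard-code the monotonicity emphasized above is to constrain weights to be non-negative (via softplus) and use monotone activations; the PyTorch sketch below, with a Bradley-Terry style loss for pairwise comparisons, illustrates that construction and is not necessarily the authors' exact architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MonotonicMLP(nn.Module):
    """Utility surrogate that is non-decreasing in every input objective.

    Monotonicity holds because all effective weights are positive
    (softplus of raw parameters) and tanh is monotone increasing.
    """
    def __init__(self, n_objectives, hidden=32):
        super().__init__()
        self.w1 = nn.Parameter(torch.randn(n_objectives, hidden) * 0.1)
        self.b1 = nn.Parameter(torch.zeros(hidden))
        self.w2 = nn.Parameter(torch.randn(hidden, 1) * 0.1)
        self.b2 = nn.Parameter(torch.zeros(1))

    def forward(self, y):
        h = torch.tanh(y @ F.softplus(self.w1) + self.b1)
        return h @ F.softplus(self.w2) + self.b2

def pairwise_loss(model, y_a, y_b):
    # Logistic (Bradley-Terry) loss for "y_a preferred over y_b" data.
    return F.softplus(-(model(y_a) - model(y_b))).mean()
```

An ensemble of such networks, trained on the same comparisons from different initializations, then provides the uncertainty estimates needed for acquisition.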

replace Strassen Attention, Split VC Dimension and Compositionality in Transformers

Authors: Alexander Kozachinskiy, Felipe Urrutia, Hector Jimenez, Tomasz Steifer, Germ\'an Pizarro, Mat\'ias Fuentes, Francisco Meza, Cristian B. Calderon, Crist\'obal Rojas

Abstract: We propose the first method to show theoretical limitations for one-layer softmax transformers with arbitrarily many precision bits (even infinite). We establish those limitations for three tasks that require advanced reasoning. The first task, Match 3 (Sanford et al., 2023), requires looking at all possible token triplets in an input sequence. The second and third tasks address compositionality-based reasoning: function composition (Peng et al., 2024) and binary relations composition, respectively. We formally prove the inability of one-layer softmax Transformers to solve any of these tasks. To overcome these limitations, we introduce Strassen attention and prove that, equipped with this mechanism, a one-layer transformer can in principle solve all these tasks. Importantly, we show that it enjoys sub-cubic running-time complexity, making it more scalable than similar previously proposed mechanisms, such as higher-order attention (Sanford et al., 2023). To complement our theoretical findings, we experimentally studied Strassen attention and compared it against standard attention (Vaswani et al., 2017), higher-order attention (Sanford et al., 2023), and triangular attention (Bergen et al., 2021). Our results help to disentangle all these attention mechanisms, highlighting their strengths and limitations. In particular, Strassen attention outperforms standard attention significantly on all the tasks. Altogether, understanding the theoretical limitations can guide research towards scalable attention mechanisms that improve the reasoning abilities of Transformers.

replace Reformulation is All You Need: Addressing Malicious Text Features in DNNs

Authors: Yi Jiang, Oubo Ma, Yong Yang, Tong Zhang, Shouling Ji

Abstract: Human language encompasses a wide range of intricate and diverse implicit features, which attackers can exploit to launch adversarial or backdoor attacks, compromising DNN models for NLP tasks. Existing model-oriented defenses often require substantial computational resources as model size increases, whereas sample-oriented defenses typically focus on specific attack vectors or schemes, rendering them vulnerable to adaptive attacks. We observe that the root cause of both adversarial and backdoor attacks lies in the encoding process of DNN models, where subtle textual features, negligible for human comprehension, are erroneously assigned significant weight by less robust or trojaned models. Based on this observation, we propose a unified and adaptive defense framework that is effective against both adversarial and backdoor attacks. Our approach leverages reformulation modules to address potential malicious features in textual inputs while preserving the original semantic integrity. Extensive experiments demonstrate that our framework outperforms existing sample-oriented defense baselines across a diverse range of malicious textual features.

replace A Quotient Homology Theory of Representation in Neural Networks

Authors: Kosio Beshkov

Abstract: Previous research has proven that the set of maps implemented by neural networks with a ReLU activation function is identical to the set of piecewise linear continuous maps. Furthermore, such networks induce a hyperplane arrangement splitting the input domain of the network into convex polyhedra $G_J$ over which a network $\Phi$ operates in an affine manner. In this work, we leverage these properties to define an equivalence class $\sim_\Phi$ on top of an input dataset, which can be split into two sets related to the local rank of $\Phi_J$ and the intersections $\cap \text{Im}\Phi_{J_i}$. We refer to the latter as the \textit{overlap decomposition} $\mathcal{O}_\Phi$ and prove that if the intersections between each polyhedron and an input manifold are convex, the homology groups of neural representations are isomorphic to quotient homology groups $H_k(\Phi(\mathcal{M})) \simeq H_k(\mathcal{M}/\mathcal{O}_\Phi)$. This lets us intrinsically calculate the Betti numbers of neural representations without the choice of an external metric. We develop methods to numerically compute the overlap decomposition through linear programming and a union-find algorithm. Using this framework, we perform several experiments on toy datasets showing that, compared to standard persistent homology, our overlap homology-based computation of Betti numbers tracks purely topological rather than geometric features. Finally, we study the evolution of the overlap decomposition during training on several classification problems while varying network width and depth and discuss some shortcomings of our method.
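
The merging step behind the overlap decomposition can be pictured with a standard union-find: polyhedra whose images under the network intersect are placed in the same class. The sketch below leaves the intersection test as a placeholder for the linear-programming feasibility check the abstract mentions.

```python
class UnionFind:
    """Disjoint-set with path compression and union by size."""
    def __init__(self, n):
        self.parent = list(range(n))
        self.size = [1] * n

    def find(self, i):
        while self.parent[i] != i:
            self.parent[i] = self.parent[self.parent[i]]  # path compression
            i = self.parent[i]
        return i

    def union(self, i, j):
        ri, rj = self.find(i), self.find(j)
        if ri == rj:
            return
        if self.size[ri] < self.size[rj]:
            ri, rj = rj, ri
        self.parent[rj] = ri
        self.size[ri] += self.size[rj]

def overlap_classes(polyhedra, images_intersect):
    """Group polyhedra whose images under the network overlap.

    images_intersect(a, b) -> bool stands in for the LP feasibility
    check described in the abstract.
    """
    uf = UnionFind(len(polyhedra))
    for i in range(len(polyhedra)):
        for j in range(i + 1, len(polyhedra)):
            if images_intersect(polyhedra[i], polyhedra[j]):
                uf.union(i, j)
    return [uf.find(i) for i in range(len(polyhedra))]
```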

replace A High-Dimensional Statistical Method for Optimizing Transfer Quantities in Multi-Source Transfer Learning

Authors: Qingyue Zhang, Haohao Fu, Guanbo Huang, Yaoyuan Liang, Chang Chu, Tianren Peng, Yanru Wu, Qi Li, Yang Li, Shao-Lun Huang

Abstract: Multi-source transfer learning provides an effective solution to data scarcity in real-world supervised learning scenarios by leveraging multiple source tasks. In this field, existing works typically use all available samples from sources in training, which constrains their training efficiency and may lead to suboptimal results. To address this, we propose a theoretical framework that answers the question: what is the optimal quantity of source samples needed from each source task to jointly train the target model? Specifically, we introduce a generalization error measure based on K-L divergence, and minimize it based on high-dimensional statistical analysis to determine the optimal transfer quantity for each source task. Additionally, we develop an architecture-agnostic and data-efficient algorithm OTQMS to implement our theoretical results for target model training in multi-source transfer learning. Experimental studies on diverse architectures and two real-world benchmark datasets show that our proposed algorithm significantly outperforms state-of-the-art approaches in both accuracy and data efficiency. The code and supplementary materials are available at https://anonymous.4open.science/r/Materials.

URLs: https://anonymous.4open.science/r/Materials.

replace Training Set Reconstruction from Differentially Private Forests: How Effective is DP?

Authors: Alice Gorg\'e, Julien Ferry, S\'ebastien Gambs, Thibaut Vidal

Abstract: Recent research has shown that structured machine learning models such as tree ensembles are vulnerable to privacy attacks targeting their training data. To mitigate these risks, differential privacy (DP) has become a widely adopted countermeasure, as it offers rigorous privacy protection. In this paper, we introduce a reconstruction attack targeting state-of-the-art $\epsilon$-DP random forests. By leveraging a constraint programming model that incorporates knowledge of the forest's structure and DP mechanism characteristics, our approach formally reconstructs the most likely dataset that could have produced a given forest. Through extensive computational experiments, we examine the interplay between model utility, privacy guarantees and reconstruction accuracy across various configurations. Our results reveal that random forests trained with meaningful DP guarantees can still leak portions of their training data. Specifically, while DP reduces the success of reconstruction attacks, the only forests fully robust to our attack exhibit predictive performance no better than a constant classifier. Building on these insights, we also provide practical recommendations for the construction of DP random forests that are more resilient to reconstruction attacks while maintaining a non-trivial predictive performance.

replace Regularization can make diffusion models more efficient

Authors: Mahsa Taheri, Johannes Lederer

Abstract: Diffusion models are one of the key architectures of generative AI. Their main drawback, however, is their computational cost. This study indicates that the concept of sparsity, especially well known in statistics, can provide a pathway to more efficient diffusion pipelines. Our mathematical guarantees prove that sparsity can reduce the input dimension's influence on the computational complexity to that of a much smaller intrinsic dimension of the data. Our empirical findings confirm that inducing sparsity can indeed lead to better samples at a lower cost.

replace You Are Your Own Best Teacher: Achieving Centralized-level Performance in Federated Learning under Heterogeneous and Long-tailed Data

Authors: Shanshan Yan, Zexi Li, Chao Wu, Meng Pang, Yang Lu, Yan Yan, Hanzi Wang

Abstract: Data heterogeneity, stemming from local non-IID data and global long-tailed distributions, is a major challenge in federated learning (FL), leading to significant performance gaps compared to centralized learning. Previous research found that poor representations and biased classifiers are the main problems and proposed neural-collapse-inspired synthetic simplex ETF to help representations be closer to neural collapse optima. However, we find that the neural-collapse-inspired methods are not strong enough to reach neural collapse and still have huge gaps to centralized training. In this paper, we rethink this issue from a self-bootstrap perspective and propose FedYoYo (You Are Your Own Best Teacher), introducing Augmented Self-bootstrap Distillation (ASD) to improve representation learning by distilling knowledge between weakly and strongly augmented local samples, without needing extra datasets or models. We further introduce Distribution-aware Logit Adjustment (DLA) to balance the self-bootstrap process and correct biased feature representations. FedYoYo nearly eliminates the performance gap, achieving centralized-level performance even under mixed heterogeneity. It enhances local representation learning, reducing model drift and improving convergence, with feature prototypes closer to neural collapse optimality. Extensive experiments show FedYoYo achieves state-of-the-art results, even surpassing centralized logit adjustment methods by 5.4\% under global long-tailed settings.
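
The self-bootstrap distillation idea can be sketched as a temperature-scaled KL between a detached weak-augmentation "teacher" view and a strong-augmentation "student" view; the exact augmentations, temperature, and the logit-adjustment term in FedYoYo may differ from this illustration.

```python
import torch
import torch.nn.functional as F

def asd_loss(model, x_weak, x_strong, T=2.0):
    """Augmented self-bootstrap distillation (illustrative sketch).

    The weakly augmented view acts as its own teacher: its detached
    predictions supervise the strongly augmented view, so no extra
    model or dataset is needed.
    """
    with torch.no_grad():
        teacher = F.softmax(model(x_weak) / T, dim=-1)
    student = F.log_softmax(model(x_strong) / T, dim=-1)
    return F.kl_div(student, teacher, reduction="batchmean") * T * T

# Distribution-aware logit adjustment (DLA) could, for example, subtract
# log class priors from the teacher logits before the softmax; the exact
# form used in the paper may differ.
```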

replace What Makes a Reward Model a Good Teacher? An Optimization Perspective

Authors: Noam Razin, Zixuan Wang, Hubert Strauss, Stanley Wei, Jason D. Lee, Sanjeev Arora

Abstract: The success of Reinforcement Learning from Human Feedback (RLHF) critically depends on the quality of the reward model. However, while this quality is primarily evaluated through accuracy, it remains unclear whether accuracy fully captures what makes a reward model an effective teacher. We address this question from an optimization perspective. First, we prove that regardless of how accurate a reward model is, if it induces low reward variance, then the RLHF objective suffers from a flat landscape. Consequently, even a perfectly accurate reward model can lead to extremely slow optimization, underperforming less accurate models that induce higher reward variance. We additionally show that a reward model that works well for one language model can induce low reward variance, and thus a flat objective landscape, for another. These results establish a fundamental limitation of evaluating reward models solely based on accuracy or independently of the language model they guide. Experiments using models of up to 8B parameters corroborate our theory, demonstrating the interplay between reward variance, accuracy, and reward maximization rate. Overall, our findings highlight that beyond accuracy, a reward model needs to induce sufficient variance for efficient optimization.
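
The key diagnostic, the reward variance a reward model induces under a given policy, is straightforward to estimate by sampling. A hedged sketch with hypothetical `policy_sample` and `reward_model` helpers:

```python
import torch

@torch.no_grad()
def reward_variance(policy_sample, reward_model, prompts, k=16):
    """Monte-Carlo estimate of per-prompt reward variance under the policy.

    policy_sample(prompt, k) -> list of k responses (hypothetical helper);
    reward_model(prompt, response) -> scalar reward (hypothetical helper).
    A low average variance signals the flat RLHF landscape described above.
    """
    per_prompt = []
    for p in prompts:
        rs = torch.tensor([reward_model(p, r) for r in policy_sample(p, k)])
        per_prompt.append(rs.var(unbiased=True))
    return torch.stack(per_prompt).mean()
```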

replace Benchmarking for Practice: Few-Shot Time-Series Crop-Type Classification on the EuroCropsML Dataset

Authors: Joana Reuss, Jan Macdonald, Simon Becker, Ekaterina Gikalo, Konrad Schultka, Lorenz Richter, Marco K\"orner

Abstract: Accurate crop-type classification from satellite time series is essential for agricultural monitoring. While various machine learning algorithms have been developed to enhance performance on data-scarce tasks, their evaluation often lacks real-world scenarios. Consequently, their efficacy in challenging practical applications has not yet been profoundly assessed. To facilitate future research in this domain, we present the first comprehensive benchmark for evaluating supervised and self-supervised learning (SSL) methods for crop-type classification under real-world conditions. This benchmark study relies on the EuroCropsML time-series dataset, which combines farmer-reported crop data with Sentinel-2 satellite observations from Estonia, Latvia, and Portugal. Our findings indicate that MAML-based meta-learning algorithms achieve slightly higher accuracy compared to supervised transfer learning and SSL methods. However, compared to simpler transfer learning, the improvement of meta-learning comes at the cost of increased computational demands and training time. Moreover, supervised methods benefit most when pre-trained and fine-tuned on geographically close regions. In addition, while SSL generally lags behind meta-learning, it demonstrates advantages over training from scratch, particularly in capturing fine-grained features essential for real-world crop-type classification, and also surpasses standard transfer learning. This highlights its practical value when labeled pre-training crop data is scarce. Our insights underscore the trade-offs between accuracy and computational demand in selecting supervised machine learning methods for real-world crop-type classification tasks and highlight the difficulties of knowledge transfer across diverse geographic regions.

replace Process Reward Models That Think

Authors: Muhammad Khalifa, Rishabh Agarwal, Lajanugen Logeswaran, Jaekyeom Kim, Hao Peng, Moontae Lee, Honglak Lee, Lu Wang

Abstract: Step-by-step verifiers -- also known as process reward models (PRMs) -- are a key ingredient for test-time scaling. PRMs require step-level supervision, making them expensive to train. This work aims to build data-efficient PRMs as verbalized step-wise reward models that verify every step in the solution by generating a verification chain-of-thought (CoT). We propose ThinkPRM, a long CoT verifier fine-tuned on orders of magnitude fewer process labels than those required by discriminative PRMs. Our approach capitalizes on the inherent reasoning abilities of long CoT models, and outperforms LLM-as-a-Judge and discriminative verifiers -- using only 1% of the process labels in PRM800K -- across several challenging benchmarks. Specifically, ThinkPRM beats the baselines on ProcessBench, MATH-500, and AIME '24 under best-of-N selection and reward-guided search. In an out-of-domain evaluation on a subset of GPQA-Diamond and LiveCodeBench, our PRM surpasses discriminative verifiers trained on the full PRM800K by 8% and 4.5%, respectively. Lastly, under the same token budget, ThinkPRM scales up verification compute more effectively compared to LLM-as-a-Judge, outperforming it by 7.2% on a subset of ProcessBench. Our work highlights the value of generative, long CoT PRMs that can scale test-time compute for verification while requiring minimal supervision for training. Our code, data, and models are released at https://github.com/mukhal/thinkprm.

URLs: https://github.com/mukhal/thinkprm.

replace Rethinking Circuit Completeness in Language Models: AND, OR, and ADDER Gates

Authors: Hang Chen, Jiaying Zhu, Xinyu Yang, Wenya Wang

Abstract: Circuit discovery has gradually become one of the prominent methods for mechanistic interpretability, and research on circuit completeness has also garnered increasing attention. Methods of circuit discovery that do not guarantee completeness not only result in circuits that are not fixed across different runs but also cause key mechanisms to be omitted. The nature of incompleteness arises from the presence of OR gates within the circuit, which are often only partially detected in standard circuit discovery methods. To this end, we systematically introduce three types of logic gates: AND, OR, and ADDER gates, and decompose the circuit into combinations of these logical gates. Through the concept of these gates, we derive the minimum requirements necessary to achieve faithfulness and completeness. Furthermore, we propose a framework that combines noising-based and denoising-based interventions, which can be easily integrated into existing circuit discovery methods without significantly increasing computational complexity. This framework is capable of fully identifying the logic gates and distinguishing them within the circuit. In addition to the extensive experimental validation of the framework's ability to restore the faithfulness, completeness, and sparsity of circuits, using this framework, we uncover fundamental properties of the three logic gates, such as their proportions and contributions to the output, and explore how they behave across different functionalities of language models.

replace Identification and Optimal Nonlinear Control of Turbojet Engine Using Koopman Eigenfunction Model

Authors: David Grasev

Abstract: Gas turbine engines are complex and highly nonlinear dynamical systems. Deriving their physics-based models can be challenging because it requires performance characteristics that are not always available, often leading to many simplifying assumptions. This paper discusses the limitations of conventional experimental methods used to derive component-level and locally linear parameter-varying models, and addresses these issues by employing identification techniques based on data collected from standard engine operation under closed-loop control. The rotor dynamics are estimated using the sparse identification of nonlinear dynamics. Subsequently, the autonomous part of the dynamics is mapped into an optimally constructed Koopman eigenfunction space. This process involves eigenvalue optimization using metaheuristic algorithms and temporal projection, followed by gradient-based eigenfunction identification. The resulting Koopman model is validated against an in-house reference component-level model. A globally optimal nonlinear feedback controller and a Kalman estimator are then designed within the eigenfunction space and compared to traditional and gain-scheduled proportional-integral controllers, as well as a proposed internal model control approach. The eigenmode structure enables targeting individual modes during optimization, leading to improved performance tuning. Results demonstrate that the Koopman-based controller surpasses other benchmark controllers in both reference tracking and disturbance rejection under sea-level and varying flight conditions, due to its global nature.
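
For reference, the rotor-dynamics identification step named above, sparse identification of nonlinear dynamics (SINDy), can be implemented with sequential thresholded least squares; the candidate library below is a generic illustration, not the paper's.

```python
import numpy as np

def sindy(X, Xdot, library, threshold=0.05, iters=10):
    """Sparse identification of nonlinear dynamics (STLSQ variant).

    X: (T, n) states; Xdot: (T, n) time derivatives;
    library: callable X -> (T, p) matrix of candidate functions.
    Returns Xi (p, n) with sparse coefficients so Xdot ~= library(X) @ Xi.
    """
    Theta = library(X)
    Xi = np.linalg.lstsq(Theta, Xdot, rcond=None)[0]
    for _ in range(iters):
        small = np.abs(Xi) < threshold
        Xi[small] = 0.0
        for k in range(Xdot.shape[1]):           # refit the surviving terms
            big = ~small[:, k]
            if big.any():
                Xi[big, k] = np.linalg.lstsq(
                    Theta[:, big], Xdot[:, k], rcond=None)[0]
    return Xi

# Hypothetical polynomial library for two states (e.g., spool speed, fuel flow):
poly = lambda X: np.column_stack(
    [np.ones(len(X)), X, X**2, X[:, :1] * X[:, 1:]])
```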

replace UDDETTS: Unifying Discrete and Dimensional Emotions for Controllable Emotional Text-to-Speech

Authors: Jiaxuan Liu, Yang Xiang, Han Zhao, Xiangang Li, Yingying Gao, Shilei Zhang, Zhenhua Ling

Abstract: Recent large language models (LLMs) have made great progress in the field of text-to-speech (TTS), but they still face major challenges in synthesizing fine-grained emotional speech in an interpretable manner. Traditional methods rely on discrete emotion labels to control emotion categories and intensities, which cannot capture the complexity and continuity of human emotional perception and expression. The lack of large-scale emotional speech datasets with balanced emotion distributions and fine-grained emotional annotations often causes overfitting in synthesis models and impedes effective emotion control. To address these issues, we propose UDDETTS, a universal LLM framework unifying discrete and dimensional emotions for controllable emotional TTS. This model introduces the interpretable Arousal-Dominance-Valence (ADV) space for dimensional emotion description and supports emotion control driven by either discrete emotion labels or nonlinearly quantified ADV values. Furthermore, a semi-supervised training strategy is designed to comprehensively utilize diverse speech datasets with different types of emotional annotations to train UDDETTS. Experiments show that UDDETTS achieves linear emotion control along three interpretable dimensions, and exhibits superior end-to-end emotional speech synthesis capabilities. Code and demos are available at: https://anonymous.4open.science/w/UDDETTS.

URLs: https://anonymous.4open.science/w/UDDETTS.

replace Fractal Graph Contrastive Learning

Authors: Nero Z. Li, Xuehao Zhai, Zhichao Shi, Boshen Shi, Xuhui Jiang

Abstract: While Graph Contrastive Learning (GCL) has attracted considerable attention in the field of graph self-supervised learning, its performance heavily relies on data augmentations that are expected to generate semantically consistent positive pairs. Existing strategies typically resort to random perturbations or local structure preservation, yet lack explicit control over global structural consistency between augmented views. To address this limitation, we propose Fractal Graph Contrastive Learning (FractalGCL), a theory-driven framework introducing two key innovations: a renormalisation-based augmentation that generates structurally aligned positive views via box coverings; and a fractal-dimension-aware contrastive loss that aligns graph embeddings according to their fractal dimensions, equipping the method with a fallback mechanism guaranteeing a performance lower bound even on non-fractal graphs. While combining the two innovations markedly boosts graph-representation quality, it also adds non-trivial computational overhead. To mitigate the computational overhead of fractal dimension estimation, we derive a one-shot estimator by proving that the dimension discrepancy between original and renormalised graphs converges weakly to a centred Gaussian distribution. This theoretical insight enables a reduction in dimension computation cost by an order of magnitude, cutting overall training time by approximately 61\%. The experiments show that FractalGCL not only delivers state-of-the-art results on standard benchmarks but also outperforms traditional and latest baselines on traffic networks by a remarkable average margin of about 4\%. Code is available at https://anonymous.4open.science/r/FractalGCL-0511/.

URLs: https://anonymous.4open.science/r/FractalGCL-0511/.

replace OLMA: One Loss for More Accurate Time Series Forecasting

Authors: Tianyi Shi, Zhu Meng, Yue Chen, Siyang Zheng, Fei Su, Jin Huang, Changrui Ren, Zhicheng Zhao

Abstract: Time series forecasting faces two important but often overlooked challenges. Firstly, the inherent random noise in the time series labels sets a theoretical lower bound for the forecasting error, which is positively correlated with the entropy of the labels. Secondly, neural networks exhibit a frequency bias when modeling the state-space of time series, that is, the model performs well in learning certain frequency bands but poorly in others, thus restricting the overall forecasting performance. To address the first challenge, we prove a theorem that there exists a unitary transformation that can reduce the marginal entropy of multiple correlated Gaussian processes, thereby providing guidance for reducing the lower bound of forecasting error. Furthermore, experiments confirm that Discrete Fourier Transform (DFT) can reduce the entropy in the majority of scenarios. Correspondingly, to alleviate the frequency bias, we jointly introduce supervision in the frequency domain along the temporal dimension through DFT and Discrete Wavelet Transform (DWT). This supervision-side strategy is highly general and can be seamlessly integrated into any supervised learning method. Moreover, we propose a novel loss function named OLMA, which utilizes the frequency domain transformation across both channel and temporal dimensions to enhance forecasting. Finally, the experimental results on multiple datasets demonstrate the effectiveness of OLMA in addressing the above two challenges and the resulting improvement in forecasting accuracy. The results also indicate that the perspectives of entropy and frequency bias provide a new and feasible research direction for time series forecasting. The code is available at: https://github.com/Yuyun1011/OLMA-One-Loss-for-More-Accurate-Time-Series-Forecasting.

URLs: https://github.com/Yuyun1011/OLMA-One-Loss-for-More-Accurate-Time-Series-Forecasting.
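
The supervision-side strategy described above lends itself to a very small sketch: add a frequency-domain error term, computed with an FFT along the temporal dimension, to the usual time-domain loss. The weighting, the DWT term, and the channel-dimension transform used by OLMA are omitted here for brevity.

```python
import torch

def olma_style_loss(pred, target, w_time=1.0, w_freq=1.0):
    """Joint time- and frequency-domain supervision (illustrative sketch).

    The frequency term compares complex DFT coefficients along the
    temporal dimension. pred, target: (batch, length, channels).
    """
    time_term = torch.mean((pred - target) ** 2)
    Pf = torch.fft.rfft(pred, dim=1)
    Tf = torch.fft.rfft(target, dim=1)
    freq_term = torch.mean(torch.abs(Pf - Tf) ** 2)
    return w_time * time_term + w_freq * freq_term
```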

replace Redefining Neural Operators in $d+1$ Dimensions

Authors: Haoze Song, Zhihao Li, Xiaobo Zhang, Zecheng Gan, Zhilu Lai, Wei Wang

Abstract: Neural Operators have emerged as powerful tools for learning mappings between function spaces. Among them, the kernel integral operator has been widely validated on universally approximating various operators. Although many advancements following this definition have developed effective modules to better approximate the kernel function defined on the original domain (with $d$ dimensions, $d=1, 2, 3\dots$), the unclear evolution mechanism in the embedding spaces has prevented researchers from designing neural operators that fully capture the target system evolution. Drawing on the Schr\"odingerisation method in quantum simulations of partial differential equations (PDEs), we elucidate the linear evolution mechanism in neural operators. Based on that, we redefine neural operators on a new $d+1$ dimensional domain. Within this framework, we implement a Schr\"odingerised Kernel Neural Operator (SKNO) aligning better with the $d+1$ dimensional evolution. In experiments, the $d+1$ dimensional evolving designs in our SKNO consistently outperform other baselines across ten benchmarks of increasing difficulty, ranging from the simple 1D heat equation to the highly nonlinear 3D Rayleigh-Taylor instability. We also validate the resolution-invariance of SKNO on mixing-resolution training and zero-shot super-resolution tasks. In addition, we show the impact of different lifting and recovering operators on the prediction within the redefined NO framework, reflecting the alignment between our model and the underlying $d+1$ dimensional evolution.

replace Generative and Contrastive Graph Representation Learning

Authors: Jiali Chen, Avijit Mukherjee

Abstract: Self-supervised learning (SSL) on graphs generates node and graph representations (i.e., embeddings) that can be used for downstream tasks such as node classification, node clustering, and link prediction. Graph SSL is particularly useful in scenarios with limited or no labeled data. Existing SSL methods predominantly follow contrastive or generative paradigms, each excelling in different tasks: contrastive methods typically perform well on classification tasks, while generative methods often excel in link prediction. In this paper, we present a novel architecture for graph SSL that integrates the strengths of both approaches. Our framework introduces community-aware node-level contrastive learning, providing more robust and effective generation of positive and negative node pairs, alongside graph-level contrastive learning to capture global semantic information. Additionally, we employ a comprehensive augmentation strategy that combines feature masking, node perturbation, and edge perturbation, enabling robust and diverse representation learning. By incorporating these enhancements, our model achieves superior performance across multiple tasks, including node classification, clustering, and link prediction. Evaluations on open benchmark datasets demonstrate that our model outperforms state-of-the-art methods, achieving a performance lift of 0.23%-2.01% depending on the task and dataset.

replace Provably Sample-Efficient Robust Reinforcement Learning with Average Reward

Authors: Zachary Roch, Chi Zhang, George Atia, Yue Wang

Abstract: Robust reinforcement learning (RL) under the average-reward criterion is essential for long-term decision-making, particularly when the environment may differ from its specification. However, a significant gap exists in understanding the finite-sample complexity of these methods, as most existing work provides only asymptotic guarantees. This limitation hinders their principled understanding and practical deployment, especially in data-limited scenarios. We close this gap by proposing \textbf{Robust Halpern Iteration (RHI)}, a new algorithm designed for robust Markov Decision Processes (MDPs) with transition uncertainty characterized by $\ell_p$-norm and contamination models. Our approach offers three key advantages over previous methods: (1). Weaker Structural Assumptions: RHI only requires the underlying robust MDP to be communicating, a less restrictive condition than the commonly assumed ergodicity or irreducibility; (2). No Prior Knowledge: Our algorithm operates without requiring any prior knowledge of the robust MDP; (3). State-of-the-Art Sample Complexity: To learn an $\epsilon$-optimal robust policy, RHI achieves a sample complexity of $\tilde{\mathcal O}\left(\frac{SA\mathcal H^{2}}{\epsilon^{2}}\right)$, where $S$ and $A$ denote the numbers of states and actions, and $\mathcal H$ is the robust optimal bias span. This result represents the tightest known bound. Our work hence provides essential theoretical understanding of sample efficiency of robust average reward RL.

replace Optimal Formats for Weight Quantisation

Authors: Douglas Orr, Luka Ribar, Carlo Luschi

Abstract: Weight quantisation is an essential technique for enabling efficient training and deployment of modern deep learning models. However, the recipe book of quantisation formats is large and formats are often chosen empirically. In this paper, we propose a framework for systematic design and analysis of quantisation formats. By connecting the question of format design with the classical quantisation theory, we show that the strong practical performance of popular formats comes from their ability to represent values using variable-length codes. We frame the problem as minimising the KL divergence between original and quantised model outputs under a model size constraint, which can be approximated by minimising the squared quantisation error, a well-studied problem where entropy-constrained quantisers with variable-length codes are optimal. We develop non-linear quantisation curves for block-scaled data across multiple distribution families and observe that these formats, along with sparse outlier formats, consistently outperform fixed-length formats, indicating that they also exploit variable-length encoding. Finally, by using the relationship between the Fisher information and KL divergence, we derive the optimal allocation of bit-widths to individual parameter tensors across the model's layers, saving up to 0.25 bits per parameter when applied to large language models.
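
As a point of reference for the formats being analyzed, a plain block-scaled (absmax) quantiser, the kind of fixed-grid baseline that the paper's non-linear, entropy-constrained formats improve on, looks like this:

```python
import numpy as np

def block_absmax_quant(w, block=32, bits=4):
    """Block-scaled fixed-point quantisation (baseline sketch).

    Each block of `block` weights shares one scale (its absmax); values
    are rounded to a symmetric grid of 2^(bits-1) - 1 levels per sign.
    Assumes w.size is divisible by `block`.
    """
    w = w.reshape(-1, block)
    scale = np.abs(w).max(axis=1, keepdims=True)
    scale[scale == 0] = 1.0
    levels = 2 ** (bits - 1) - 1                  # e.g. 7 for 4 bits
    q = np.clip(np.round(w / scale * levels), -levels, levels)
    return (q / levels * scale).reshape(-1)

w = np.random.randn(1024).astype(np.float32)
err = np.mean((w - block_absmax_quant(w)) ** 2)
print(f"mean squared quantisation error: {err:.5f}")
```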

replace Time series saliency maps: explaining models across multiple domains

Authors: Christodoulos Kechris, Jonathan Dan, David Atienza

Abstract: Traditional saliency map methods, popularized in computer vision, highlight individual points (pixels) of the input that contribute the most to the model's output. However, for time series they offer limited insight, as semantically meaningful features are often found in other domains. We introduce Cross-domain Integrated Gradients, a generalization of Integrated Gradients. Our method enables feature attributions on any domain that can be formulated as an invertible, differentiable transformation of the time domain. Crucially, our derivation extends the original Integrated Gradients into the complex domain, enabling frequency-based attributions. We provide the necessary theoretical guarantees, namely, path independence and completeness. Our approach reveals interpretable, problem-specific attributions that time-domain methods cannot capture, on three real-world tasks: wearable sensor heart rate extraction, electroencephalography-based seizure detection, and zero-shot time-series forecasting. We release an open-source TensorFlow/PyTorch library to enable plug-and-play cross-domain explainability for time-series models. These results demonstrate the ability of cross-domain integrated gradients to provide semantically meaningful insights in time-series models that are impossible with traditional time-domain saliency.
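
A rough sketch of the idea for the frequency domain: attribute a scalar-output model over the real and imaginary parts of the rFFT coefficients with plain Integrated Gradients. The paper develops the complex-domain theory properly, so treat this real/imaginary stacking as a simple stand-in.

```python
import torch

def frequency_ig(model, x, steps=64):
    """Integrated Gradients in the frequency domain (illustrative sketch).

    model: maps a 1-D real signal to a scalar output.
    x: (length,) real signal. Attributions are computed for the real and
    imaginary parts of the rFFT coefficients (baseline: zero spectrum),
    then summed per frequency bin.
    """
    z = torch.view_as_real(torch.fft.rfft(x)).detach()        # (bins, 2)
    baseline = torch.zeros_like(z)
    total_grad = torch.zeros_like(z)
    for alpha in torch.linspace(0.0, 1.0, steps):
        zi = (baseline + alpha * (z - baseline)).requires_grad_(True)
        signal = torch.fft.irfft(torch.view_as_complex(zi), n=x.shape[0])
        model(signal).backward()                  # populates zi.grad
        total_grad += zi.grad / steps
    ig = (z - baseline) * total_grad              # (bins, 2)
    return ig.sum(dim=-1)                         # attribution per frequency
```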

replace Why and When Deep is Better than Shallow: An Implementation-Agnostic State-Transition View of Depth Supremacy

Authors: Sho Sonoda, Yuka Hashimoto, Isao Ishikawa, Masahiro Ikeda

Abstract: Why and when is deep better than shallow? We answer this question in a framework that is agnostic to network implementation. We formulate a deep model as an abstract state-transition semigroup acting on a general metric space, and separate the implementation (e.g., ReLU nets, transformers, and chain-of-thought) from the abstract state transition. We prove a bias-variance decomposition in which the variance depends only on the abstract depth-$k$ network and not on the implementation (Theorem 1). We further split the bounds into output and hidden parts to tie the depth dependence of the variance to the metric entropy of the state-transition semigroup (Theorem 2). We then investigate implementation-free conditions under which the variance grows polynomially or logarithmically with depth (Section 4). Combining these with exponential or polynomial bias decay identifies four canonical bias-variance trade-off regimes (EL/EP/PL/PP) and produces explicit optimal depths $k^\ast$. Across regimes, $k^\ast>1$ typically holds, giving a rigorous form of depth supremacy. The lowest generalization error bound is achieved under the EL regime (exp-decay bias + log-growth variance), explaining why and when deep is better, especially for iterative or hierarchical concept classes such as neural ODEs, diffusion/score models, and chain-of-thought reasoning.

replace Cohort-Based Active Modality Acquisition

Authors: Tillmann Rheude, Roland Eils, Benjamin Wild

Abstract: Real-world machine learning applications often involve data from multiple modalities that must be integrated effectively to make robust predictions. However, in many practical settings, not all modalities are available for every sample, and acquiring additional modalities can be costly. This raises the question: which samples should be prioritized for additional modality acquisition when resources are limited? While prior work has explored individual-level acquisition strategies and training-time active learning paradigms, test-time and cohort-based acquisition remain underexplored. We introduce Cohort-based Active Modality Acquisition (CAMA), a novel test-time setting to formalize the challenge of selecting which samples should receive additional modalities. We derive acquisition strategies that leverage a combination of generative imputation and discriminative modeling to estimate the expected benefit of acquiring missing modalities based on common evaluation metrics. We also introduce upper-bound heuristics that provide performance ceilings to benchmark acquisition strategies. Experiments on multimodal datasets with up to 15 modalities demonstrate that our proposed imputation-based strategies can more effectively guide the acquisition of additional modalities for selected samples compared with methods relying solely on unimodal information, entropy-based guidance, or random selection. We showcase the real-world relevance and scalability of our method by demonstrating its ability to effectively guide the costly acquisition of proteomics data for disease prediction in a large prospective cohort, the UK Biobank (UKBB). Our work provides an effective approach for optimizing modality acquisition at the cohort level, enabling more effective use of resources in constrained settings.

replace Runtime-Adaptive Pruning for LLM Inference

Authors: Huanrong Liu, Chunlin Tian, Xuyang Wei, Qingbiao Li, Li Li

Abstract: Large language models (LLMs) excel at language understanding and generation, but their enormous computational and memory requirements hinder deployment. Compression offers a potential solution to mitigate these constraints. However, most existing methods rely on fixed heuristics and thus fail to adapt to runtime memory variations or heterogeneous KV-cache demands arising from diverse user requests. To address these limitations, we propose RAP, an elastic pruning framework driven by reinforcement learning (RL) that dynamically adjusts compression strategies in a runtime-aware manner. Specifically, RAP dynamically tracks the evolving ratio between model parameters and KV-cache across practical execution. Recognizing that FFNs house most parameters, whereas parameter-light attention layers dominate KV-cache formation, the RL agent retains only those components that maximize utility within the current memory budget, conditioned on instantaneous workload and device state. Extensive experimental results demonstrate that RAP outperforms state-of-the-art baselines, marking the first approach to jointly consider model weights and KV-cache on the fly.

replace Supervised Graph Contrastive Learning for Gene Regulatory Networks

Authors: Sho Oshima, Yuji Okamoto, Taisei Tosaki, Ryosuke Kojima, Yasushi Okuno

Abstract: Graph Contrastive Learning (GCL) is a powerful self-supervised learning framework that performs data augmentation through graph perturbations, with growing applications in the analysis of biological networks such as Gene Regulatory Networks (GRNs). The artificial perturbations commonly used in GCL, such as node dropping, induce structural changes that can diverge from biological reality. This concern has contributed to a broader trend in graph representation learning toward augmentation-free methods, which view such structural changes as problematic and to be avoided. However, this trend overlooks the fundamental insight that structural changes from biologically meaningful perturbations are not a problem to be avoided but a rich source of information, thereby ignoring the valuable opportunity to leverage data from real biological experiments. Motivated by this insight, we propose SupGCL (Supervised Graph Contrastive Learning), a new GCL method for GRNs that directly incorporates biological perturbations from gene knockdown experiments as supervision. SupGCL is a probabilistic formulation that continuously generalizes conventional GCL, linking artificial augmentations with real perturbations measured in knockdown experiments and using the latter as explicit supervisory signals. To assess effectiveness, we train GRN representations with SupGCL and evaluate their performance on downstream tasks. The evaluation includes both node-level tasks, such as gene function classification, and graph-level tasks on patient-specific GRNs, such as patient survival hazard prediction. Across 13 tasks built from GRN datasets derived from patients with three cancer types, SupGCL consistently outperforms state-of-the-art baselines.

replace FFT-based Dynamic Subspace Selection for Low-Rank Adaptive Optimization of Large Language Models

Authors: Ionut-Vlad Modoranu, Mher Safaryan, Erik Schultheis, Max Ryabinin, Artem Chumachenko, Dan Alistarh

Abstract: Low-rank optimization has emerged as a promising direction in training large language models (LLMs) to improve running time and reduce the memory usage of adaptive optimizers by constraining learning to a lower-dimensional space. Prior work typically projects gradients of linear layers using approaches based on Singular Value Decomposition (SVD) or QR-decomposition. Applying these techniques individually to each layer in large models is computationally expensive and incurs additional memory costs due to storing the projection matrices. In this work, we propose a computationally efficient and conceptually simple, two-step procedure to approximate SVD/QR-based gradient projections into lower-dimensional spaces by using a predefined orthogonal matrix of the Discrete Cosine Transform (DCT). We dynamically select columns from the DCT matrix based on their alignment with the gradient of each layer. The effective projection matrices are obtained via a simple matmul with the DCT matrix in $O(n^3)$ time, followed by a lightweight sorting step to identify the most relevant basis vectors. For large layers, DCT can be computed via Makhoul's $N$-point algorithm based on Fast Fourier Transform (FFT) in $O(n^2 \log(n))$ time. Due to the predefined nature of the orthogonal bases, they are computed once at the start of training. Our numerical experiments on both pre-training and fine-tuning tasks demonstrate the effectiveness of our dual strategy in approximating optimal low-rank projections, obtaining an approach with rank-independent running time that matches the performance of costly SVD/QR-based methods while achieving faster runtime and reduced memory usage by up to $25\%$ across different model sizes.
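
The two-step procedure can be sketched directly: build the orthonormal DCT matrix once, then, per layer, score its columns against the gradient and keep the best-aligned ones. The shapes and the scoring rule below are illustrative assumptions.

```python
import numpy as np
from scipy.fft import dct

def dct_projection(G, rank):
    """Pick the DCT basis vectors best aligned with a gradient (sketch).

    G: (n, m) gradient of a linear layer. The orthonormal DCT-II matrix C
    is fixed, so in practice it is computed once at the start of training;
    per step, columns are scored by alignment with G and the top `rank`
    are kept, giving a projection P with P.T @ P = I at the cost of a
    matmul plus a sort.
    """
    n = G.shape[0]
    C = dct(np.eye(n), axis=0, norm="ortho")   # orthonormal DCT-II matrix
    scores = np.linalg.norm(C.T @ G, axis=1)   # alignment of each basis vector
    top = np.argsort(scores)[::-1][:rank]
    P = C[:, top]                              # (n, rank) projection matrix
    low_rank_grad = P.T @ G                    # optimizer state lives here
    return P, low_rank_grad
```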

replace MESS+: Dynamically Learned Inference-Time LLM Routing in Model Zoos with Service Level Guarantees

Authors: Herbert Woisetschl\"ager, Ryan Zhang, Shiqiang Wang, Hans-Arno Jacobsen

Abstract: Open-weight large language model (LLM) zoos provide access to numerous high-quality models, but selecting the appropriate model for specific tasks remains challenging and requires technical expertise. Most users simply want factually correct, safe, and satisfying responses without concerning themselves with model technicalities, while inference service providers prioritize minimizing operating costs. These competing interests are typically mediated through service level agreements (SLAs) that guarantee minimum service quality. We introduce MESS+, a stochastic optimization algorithm for cost-optimal LLM request routing while providing rigorous SLA compliance guarantees. MESS+ learns request satisfaction probabilities of LLMs in real-time as users interact with the system, based on which model selection decisions are made by solving a per-request optimization problem. Our algorithm includes a novel combination of virtual queues and request satisfaction prediction, along with a theoretical analysis of cost optimality and constraint satisfaction. Across a wide range of state-of-the-art LLM benchmarks, MESS+ achieves an average of $2\times$ cost savings compared to existing LLM routing techniques.
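
The virtual-queue mechanic can be sketched with a standard drift-plus-penalty rule; the interface (per-model cost and a learned satisfaction predictor) and the trade-off weight V are assumptions for illustration, not the paper's exact formulation.

```python
def messplus_route(models, x, Q, V=10.0):
    """Per-request routing with a virtual queue (illustrative sketch).

    models: list of (cost, predict_satisfaction) pairs, where
    predict_satisfaction(x) estimates the probability the model's answer
    satisfies the user. Q is the SLA virtual queue. Drift-plus-penalty:
    trade off operating cost against predicted SLA slack.
    """
    return min(range(len(models)),
               key=lambda m: V * models[m][0] - Q * models[m][1](x))

def update_queue(Q, alpha, satisfied):
    # The queue grows whenever responses miss the SLA target rate alpha,
    # pushing future routing decisions toward stronger models.
    return max(Q + alpha - float(satisfied), 0.0)
```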

replace Bridging Arbitrary and Tree Metrics via Differentiable Gromov Hyperbolicity

Authors: Pierre Houedry, Nicolas Courty, Florestan Martin-Baillon, Laetitia Chapel, Titouan Vayer

Abstract: Trees and the associated shortest-path tree metrics provide a powerful framework for representing hierarchical and combinatorial structures in data. Given an arbitrary metric space, its deviation from a tree metric can be quantified by Gromov's $\delta$-hyperbolicity. Nonetheless, designing algorithms that bridge an arbitrary metric to its closest tree metric is still a vivid subject of interest, as most common approaches are either heuristic and lack guarantees, or perform moderately well. In this work, we introduce a novel differentiable optimization framework, coined DeltaZero, that solves this problem. Our method leverages a smooth surrogate for Gromov's $\delta$-hyperbolicity which enables a gradient-based optimization, with a tractable complexity. The corresponding optimization procedure is derived from a problem with better worst case guarantees than existing bounds, and is justified statistically. Experiments on synthetic and real-world datasets demonstrate that our method consistently achieves state-of-the-art distortion.
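
For context, the quantity being relaxed, Gromov's $\delta$ via the four-point condition, can be computed exactly (if expensively) by brute force:

```python
import numpy as np
from itertools import combinations

def gromov_delta(D):
    """Exact Gromov delta via the four-point condition, O(n^4) (sketch).

    D: (n, n) symmetric distance matrix. For each quadruple, form the
    three pairwise-sum matchings; delta is half the gap between the two
    largest. delta == 0 characterizes tree metrics, which is exactly the
    quantity DeltaZero's smooth surrogate relaxes.
    """
    delta = 0.0
    n = D.shape[0]
    for x, y, z, w in combinations(range(n), 4):
        s = sorted((D[x, y] + D[z, w],
                    D[x, z] + D[y, w],
                    D[x, w] + D[y, z]))
        delta = max(delta, (s[2] - s[1]) / 2.0)
    return delta

# Toy check: a path metric is a tree metric, so delta should be 0.
P = np.abs(np.subtract.outer(np.arange(5), np.arange(5))).astype(float)
print(gromov_delta(P))  # 0.0
```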

replace CRISP-NAM: Competing Risks Interpretable Survival Prediction with Neural Additive Models

Authors: Dhanesh Ramachandram, Ananya Raval

Abstract: Competing risks are crucial considerations in survival modelling, particularly in healthcare domains where patients may experience multiple distinct event types. We propose CRISP-NAM (Competing Risks Interpretable Survival Prediction with Neural Additive Models), an interpretable neural additive model for competing risks survival analysis which extends the neural additive architecture to model cause-specific hazards while preserving feature-level interpretability. Each feature contributes independently to risk estimation through dedicated neural networks, allowing for visualization of complex non-linear relationships between covariates and each competing risk. We demonstrate competitive performance on multiple datasets compared to existing approaches.

replace On the Dynamic Regret of Following the Regularized Leader: Optimism with History Pruning

Authors: Naram Mhaisen, George Iosifidis

Abstract: We revisit the Follow the Regularized Leader (FTRL) framework for Online Convex Optimization (OCO) over compact sets, focusing on achieving dynamic regret guarantees. Prior work has highlighted the framework's limitations in dynamic environments due to its tendency to produce "lazy" iterates. However, building on insights showing FTRL's ability to produce "agile" iterates, we show that it can indeed recover known dynamic regret bounds through optimistic composition of future costs and careful linearization of past costs, which can lead to pruning some of them. This new analysis of FTRL against dynamic comparators yields a principled way to interpolate between greedy and agile updates and offers several benefits, including refined control over regret terms, optimism without cyclic dependence, and the application of minimal recursive regularization akin to AdaFTRL. More broadly, we show that it is not the "lazy" projection style of FTRL that hinders (optimistic) dynamic regret, but the decoupling of the algorithm's state (linearized history) from its iterates, allowing the state to grow arbitrarily. Instead, pruning synchronizes these two when necessary.

replace Buffer-free Class-Incremental Learning with Out-of-Distribution Detection

Authors: Srishti Gupta, Daniele Angioni, Maura Pintor, Ambra Demontis, Lea Sch\"onherr, Battista Biggio, Fabio Roli

Abstract: Class-incremental learning (CIL) poses significant challenges in open-world scenarios, where models must not only learn new classes over time without forgetting previous ones but also handle inputs from unknown classes that a closed-set model would misclassify. Recent works address both issues by (i) training multi-head models using the task-incremental learning framework, and (ii) predicting the task identity employing out-of-distribution (OOD) detectors. While effective, the latter mainly relies on joint training with a memory buffer of past data, raising concerns around privacy, scalability, and increased training time. In this paper, we present an in-depth analysis of post-hoc OOD detection methods and investigate their potential to eliminate the need for a memory buffer. We uncover that these methods, when applied appropriately at inference time, can serve as a strong substitute for buffer-based OOD detection. We show that this buffer-free approach achieves comparable or superior performance to buffer-based methods both in terms of class-incremental learning and the rejection of unknown samples. Experimental results on CIFAR-10, CIFAR-100 and Tiny ImageNet datasets support our findings, offering new insights into the design of efficient and privacy-preserving CIL systems for open-world settings.

replace Differential Gated Self-Attention

Authors: Elpiniki Maria Lygizou, M\'onika Farsang, Radu Grosu

Abstract: Transformers excel across a large variety of tasks but remain susceptible to corrupted inputs, since standard self-attention treats all query-key interactions uniformly. Inspired by lateral inhibition in biological neural circuits and building on the Differential Transformer's recent use of the subtraction of two parallel softmax attention maps for noise cancellation, we propose Multihead Differential Gated Self-Attention (M-DGSA) that learns per-head input-dependent gating to dynamically suppress attention noise. Each head splits into excitatory and inhibitory branches whose dual softmax maps are fused by a sigmoid gate predicted from the token embedding, yielding a context-aware contrast enhancement. M-DGSA integrates seamlessly into existing Transformer stacks with minimal computational overhead. We evaluate on both vision and language benchmarks, demonstrating consistent robustness gains over vanilla Transformer, Vision Transformer, and Differential Transformer baselines. Our contributions are (i) a novel input-dependent gating mechanism for self-attention grounded in lateral inhibition, (ii) a principled synthesis of biological contrast-enhancement and self-attention theory, and (iii) comprehensive experiments demonstrating noise resilience and cross-domain applicability.
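
A single head of the described mechanism is easy to sketch from the abstract: two softmax maps fused by a per-token sigmoid gate. Dimensions, initialization, and the multi-head wiring here are illustrative, not the authors' exact implementation.

```python
import torch
import torch.nn as nn

class DGSAHead(nn.Module):
    """One differential gated self-attention head (illustrative sketch).

    Two attention maps (excitatory / inhibitory) are combined as
    softmax_e - g * softmax_i, where the gate g in (0, 1) is predicted
    per token from its embedding, mimicking lateral inhibition.
    """
    def __init__(self, d_model, d_head):
        super().__init__()
        self.q_e, self.k_e = nn.Linear(d_model, d_head), nn.Linear(d_model, d_head)
        self.q_i, self.k_i = nn.Linear(d_model, d_head), nn.Linear(d_model, d_head)
        self.v = nn.Linear(d_model, d_head)
        self.gate = nn.Linear(d_model, 1)

    def forward(self, x):                        # x: (batch, seq, d_model)
        scale = self.v.out_features ** 0.5
        a_e = torch.softmax(self.q_e(x) @ self.k_e(x).transpose(-2, -1) / scale, dim=-1)
        a_i = torch.softmax(self.q_i(x) @ self.k_i(x).transpose(-2, -1) / scale, dim=-1)
        g = torch.sigmoid(self.gate(x))          # (batch, seq, 1) per-token gate
        return (a_e - g * a_i) @ self.v(x)
```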

replace A Finite-Time Analysis of TD Learning with Linear Function Approximation without Projections or Strong Convexity

Authors: Wei-Cheng Lee, Francesco Orabona

Abstract: We investigate the finite-time convergence properties of Temporal Difference (TD) learning with linear function approximation, a cornerstone algorithm in the field of reinforcement learning. We are interested in the so-called ``robust'' setting, where the convergence guarantee does not depend on the minimal curvature of the potential function. While prior work has established convergence guarantees in this setting, these results typically rely on the assumption that each iterate is projected onto a bounded set, a condition that is artificial and does not match current practice. In this paper, we challenge the necessity of such an assumption and present a refined analysis of TD learning. For the first time, we show that the simple projection-free variant converges with a rate of $\widetilde{\mathcal{O}}(\frac{||\theta^*||^2_2}{\sqrt{T}})$, even in the presence of Markovian noise. Our analysis reveals a novel self-bounding property of the TD updates and exploits it to guarantee bounded iterates.
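
A minimal sketch of the projection-free update analyzed above, i.e., plain TD(0) with linear function approximation and no projection of the iterate onto a bounded set. The random MDP, feature map, and step-size schedule are illustrative assumptions, not the paper's experimental setting:

    import numpy as np

    rng = np.random.default_rng(0)
    n_states, d, gamma, T = 20, 8, 0.9, 5000
    PHI = rng.normal(size=(n_states, d)) / np.sqrt(d)     # fixed feature map
    P = rng.dirichlet(np.ones(n_states), size=n_states)   # random transition kernel
    r = rng.uniform(size=n_states)                        # random reward vector

    theta = np.zeros(d)
    s = 0
    for t in range(1, T + 1):
        s_next = rng.choice(n_states, p=P[s])
        td_error = r[s] + gamma * PHI[s_next] @ theta - PHI[s] @ theta
        theta += (1.0 / np.sqrt(t)) * td_error * PHI[s]   # note: no projection step
        s = s_next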

replace Spiking Brain Compression: Exploring One-Shot Post-Training Pruning and Quantization for Spiking Neural Networks

Authors: Lianfeng Shi, Ao Li, Benjamin Ward-Cherrier

Abstract: Spiking Neural Networks (SNNs) have emerged as a new generation of energy-efficient neural networks suitable for implementation on neuromorphic hardware. As neuromorphic hardware has limited memory and computing resources, weight pruning and quantization have recently been explored to improve SNNs' efficiency. State-of-the-art SNN pruning/quantization methods employ multiple compression and training iterations, increasing the cost for pre-trained or very large SNNs. In this paper, we propose a new one-shot post-training pruning/quantization framework, Spiking Brain Compression (SBC), that extends the Optimal Brain Compression (OBC) method to SNNs. SBC replaces the current-based loss found in OBC with a spike train-based objective whose Hessian is cheaply computable, allowing a single backward pass to prune or quantize synapses and analytically rescale the rest. Our experiments on models trained with neuromorphic datasets (N-MNIST, CIFAR10-DVS, DVS128-Gesture) and large static datasets (CIFAR-100, ImageNet) show state-of-the-art results for one-shot post-training compression methods on SNNs, with single-digit to double-digit accuracy gains compared to OBC. SBC also approaches the accuracy of costly iterative methods, while cutting compression time by 2-3 orders of magnitude.

replace UNO: Unlearning via Orthogonalization in Generative models

Authors: Pinak Mandal, Georg A. Gottwald

Abstract: As generative models become increasingly powerful and pervasive, the ability to unlearn specific data, whether due to privacy concerns, legal requirements, or the correction of harmful content, has become increasingly important. Unlike in conventional training, where data are accumulated and knowledge is reinforced, unlearning aims to selectively remove the influence of particular data points without costly retraining from scratch. To be effective and reliable, such algorithms need to achieve (i) forgetting of the undesired data, (ii) preservation of the quality of the generation, (iii) preservation of the influence of the desired training data on the model parameters, and (iv) a small number of training steps. We propose fast unlearning algorithms based on loss gradient orthogonalization for unconditional and conditional generative models. We show that our algorithms are able to forget data while maintaining the fidelity of the original model. On standard image benchmarks, our algorithms achieve orders of magnitude faster unlearning times than their predecessors, such as gradient surgery. We demonstrate our algorithms with datasets of increasing complexity (MNIST, CelebA and ImageNet-1K) and for generative models of increasing complexity (VAEs and diffusion transformers).
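
A minimal sketch of the core idea of loss-gradient orthogonalization: ascend on the forget loss after removing its component along the retain-set gradient, so the update is, to first order, neutral for retained data. The single global projection coefficient and one-step form are illustrative assumptions, not UNO's exact algorithm:

    import torch

    def orthogonalized_unlearning_step(model, loss_forget, loss_retain, lr=1e-3):
        params = [p for p in model.parameters() if p.requires_grad]
        g_f = torch.autograd.grad(loss_forget, params, retain_graph=True)
        g_r = torch.autograd.grad(loss_retain, params)
        f = torch.cat([g.reshape(-1) for g in g_f])   # flattened forget gradient
        r = torch.cat([g.reshape(-1) for g in g_r])   # flattened retain gradient
        coef = (f @ r) / (r @ r).clamp_min(1e-12)
        with torch.no_grad():
            offset = 0
            for p, gf in zip(params, g_f):
                gr = r[offset:offset + p.numel()].reshape(p.shape)
                offset += p.numel()
                # Ascend on the forget loss, minus its component along the
                # retain gradient, leaving retained behavior locally unchanged.
                p.add_(lr * (gf - coef * gr))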

replace There Was Never a Bottleneck in Concept Bottleneck Models

Authors: Antonio Almud\'evar, Jos\'e Miguel Hern\'andez-Lobato, Alfonso Ortega

Abstract: Deep learning representations are often difficult to interpret, which can hinder their deployment in sensitive applications. Concept Bottleneck Models (CBMs) have emerged as a promising approach to mitigate this issue by learning representations that support target task performance while ensuring that each component predicts a concrete concept from a predefined set. In this work, we argue that CBMs do not impose a true bottleneck: the fact that a component can predict a concept does not guarantee that it encodes only information about that concept. This shortcoming raises concerns regarding interpretability and the validity of intervention procedures. To overcome this limitation, we propose Minimal Concept Bottleneck Models (MCBMs), which incorporate an Information Bottleneck (IB) objective to constrain each representation component to retain only the information relevant to its corresponding concept. This IB is implemented via a variational regularization term added to the training loss. As a result, MCBMs yield more interpretable representations, support principled concept-level interventions, and remain consistent with probability-theoretic foundations.
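
A minimal sketch of the variational regularization term described above: each representation component gets a Gaussian posterior whose KL divergence to a standard normal prior is added to the concept-prediction loss. The shapes, the binary concept head, and the weight beta are illustrative assumptions:

    import torch
    import torch.nn.functional as F

    def mcbm_component_loss(mu, logvar, concept_logit, concept_label, beta=1e-2):
        # KL( N(mu, sigma^2) || N(0, I) ): the IB term constraining the component
        # to retain only information relevant to its concept.
        kl = 0.5 * (mu.pow(2) + logvar.exp() - logvar - 1).sum(dim=-1).mean()
        # Concept prediction loss (labels are float 0/1 tensors).
        pred = F.binary_cross_entropy_with_logits(concept_logit, concept_label)
        return pred + beta * kl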

replace AMPED: Adaptive Multi-objective Projection for balancing Exploration and skill Diversification

Authors: Geonwoo Cho, Jaemoon Lee, Jaegyun Im, Subi Lee, Jihwan Lee, Sundong Kim

Abstract: Skill-based reinforcement learning (SBRL) enables rapid adaptation in environments with sparse rewards by pretraining a skill-conditioned policy. Effective skill learning requires jointly maximizing both exploration and skill diversity. However, existing methods often face challenges in simultaneously optimizing for these two conflicting objectives. In this work, we propose a new method, Adaptive Multi-objective Projection for balancing Exploration and skill Diversification (AMPED), which explicitly addresses both: during pre-training, a gradient-surgery projection balances the exploration and diversity gradients, and during fine-tuning, a skill selector exploits the learned diversity by choosing skills suited to downstream tasks. Our approach achieves performance that surpasses SBRL baselines across various benchmarks. Through an extensive ablation study, we identify the role of each component and demonstrate that each element of AMPED contributes to performance. We further provide theoretical and empirical evidence that, with a greedy skill selector, greater skill diversity reduces fine-tuning sample complexity. These results highlight the importance of explicitly harmonizing exploration and diversity and demonstrate the effectiveness of AMPED in enabling robust and generalizable skill learning. Project Page: https://geonwoo.me/amped/

URLs: https://geonwoo.me/amped/
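
A minimal sketch of a gradient-surgery projection balancing the two objectives named above, in the spirit of PCGrad: when the exploration and diversity gradients conflict, each has its component along the other removed before they are combined. AMPED's exact adaptive projection rule may differ:

    import torch

    def project_conflicting(g_explore: torch.Tensor, g_diverse: torch.Tensor):
        dot = g_explore @ g_diverse
        if dot < 0:  # gradients conflict: remove the opposing components
            g_e = g_explore - dot / (g_diverse @ g_diverse) * g_diverse
            g_d = g_diverse - dot / (g_explore @ g_explore) * g_explore
        else:
            g_e, g_d = g_explore, g_diverse
        return g_e + g_d

    combined = project_conflicting(torch.randn(10), torch.randn(10))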

replace Improved Scaling Laws in Linear Regression via Data Reuse

Authors: Licong Lin, Jingfeng Wu, Peter L. Bartlett

Abstract: Neural scaling laws suggest that the test error of large language models trained online decreases polynomially as the model size and data size increase. However, such scaling becomes unsustainable once new data runs out. In this work, we show that data reuse can improve existing scaling laws in linear regression. Specifically, we derive sharp test error bounds on $M$-dimensional linear models trained by multi-pass stochastic gradient descent (multi-pass SGD) on $N$ data with sketched features. Assuming that the data covariance has a power-law spectrum of degree $a$, and that the true parameter follows a prior with an aligned power-law spectrum of degree $b-a$ (with $a > b > 1$), we show that multi-pass SGD achieves a test error of $\Theta(M^{1-b} + L^{(1-b)/a})$, where $L \lesssim N^{a/b}$ is the number of iterations. In the same setting, one-pass SGD only attains a test error of $\Theta(M^{1-b} + N^{(1-b)/a})$ (see e.g., Lin et al., 2024). This suggests an improved scaling law via data reuse (i.e., choosing $L>N$) in data-constrained regimes. Numerical simulations are also provided to verify our theoretical findings.

replace CodeBrain: Towards Decoupled Interpretability and Multi-Scale Architecture for EEG Foundation Model

Authors: Jingying Ma, Feng Wu, Qika Lin, Yucheng Xing, Chenyu Liu, Ziyu Jia, Mengling Feng

Abstract: Electroencephalography (EEG) provides real-time insights into brain activity and supports diverse applications in neuroscience. While EEG foundation models (EFMs) have emerged to address the scalability issues of task-specific models, current approaches still yield clinically uninterpretable and weakly discriminative representations, inefficiently capture global dependencies, and neglect important local neural events. We present CodeBrain, a two-stage EFM designed to fill this gap. In the first stage, we introduce the TFDual-Tokenizer, which decouples heterogeneous temporal and frequency EEG signals into discrete tokens, quadratically expanding the representation space to enhance discriminative power and offering domain-specific interpretability by suggesting potential links to neural events and spectral rhythms. In the second stage, we propose the multi-scale EEGSSM architecture, which combines structured global convolution with sliding window attention to efficiently capture both sparse long-range and local dependencies, reflecting the brain's small-world topology. Pretrained on the largest public EEG corpus, CodeBrain achieves strong generalization across 8 downstream tasks and 10 datasets under distribution shifts, supported by comprehensive ablations, scaling-law analyses, and interpretability evaluations. Both code and pretrained weights will be released in a future version.

replace Robust Molecular Property Prediction via Densifying Scarce Labeled Data

Authors: Jina Kim, Jeffrey Willette, Bruno Andreis, Sung Ju Hwang

Abstract: A widely recognized limitation of molecular prediction models is their reliance on structures observed in the training data, resulting in poor generalization to out-of-distribution compounds. Yet in drug discovery, the compounds most critical for advancing research often lie beyond the training set, making the bias toward the training data particularly problematic. This mismatch introduces substantial covariate shift, under which standard deep learning models produce unstable and inaccurate predictions. Furthermore, the scarcity of labeled data -- stemming from the onerous and costly nature of experimental validation -- further exacerbates the difficulty of achieving reliable generalization. To address these limitations, we propose a novel bilevel optimization approach that leverages unlabeled data to interpolate between in-distribution (ID) and out-of-distribution (OOD) data, enabling the model to learn how to generalize beyond the training distribution. We demonstrate significant performance gains on challenging real-world datasets with substantial covariate shift, supported by t-SNE visualizations highlighting our interpolation method.

replace Exploring the Secondary Risks of Large Language Models

Authors: Jiawei Chen, Zhengwei Fang, Xiao Yang, Chao Yu, Zhaoxia Yin, Hang Su

Abstract: Ensuring the safety and alignment of Large Language Models is a significant challenge with their growing integration into critical applications and societal functions. While prior research has primarily focused on jailbreak attacks, less attention has been given to non-adversarial failures that subtly emerge during benign interactions. We introduce secondary risks, a novel class of failure modes marked by harmful or misleading behaviors elicited by benign prompts. Unlike adversarial attacks, these risks stem from imperfect generalization and often evade standard safety mechanisms. To enable systematic evaluation, we introduce two risk primitives, verbose response and speculative advice, that capture the core failure patterns. Building on these definitions, we propose SecLens, a black-box, multi-objective search framework that efficiently elicits secondary risk behaviors by optimizing task relevance, risk activation, and linguistic plausibility. To support reproducible evaluation, we release SecRiskBench, a benchmark dataset of 650 prompts covering eight diverse real-world risk categories. Experimental results from extensive evaluations on 16 popular models demonstrate that secondary risks are widespread, transferable across models, and modality-independent, emphasizing the urgent need for enhanced safety mechanisms to address benign yet harmful LLM behaviors in real-world deployments.

replace Fractional Reasoning via Latent Steering Vectors Improves Inference Time Compute

Authors: Sheng Liu, Tianlang Chen, Pan Lu, Haotian Ye, Yizheng Chen, Lei Xing, James Zou

Abstract: Test-time compute has emerged as a powerful paradigm for improving the performance of large language models (LLMs), where generating multiple outputs or refining individual chains can significantly boost answer accuracy. However, existing methods like Best-of-N, majority voting, and self-reflection typically apply reasoning in a uniform way across inputs, overlooking the fact that different problems may require different levels of reasoning depth. In this work, we propose Fractional Reasoning, a training-free and model-agnostic framework that enables continuous control over reasoning intensity at inference time, going beyond the limitations of fixed instructional prompts. Our method operates by extracting the latent steering vector associated with deeper reasoning and reapplying it with a tunable scaling factor, allowing the model to tailor its reasoning process to the complexity of each input. This supports two key modes of test-time scaling: (1) improving output quality in breadth-based strategies (e.g., Best-of-N, majority voting), and (2) enhancing the correctness of individual reasoning chains in depth-based strategies (e.g., self-reflection). Experiments on GSM8K, MATH500, and GPQA demonstrate that Fractional Reasoning consistently improves performance across diverse reasoning tasks and models.
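
A minimal sketch of reapplying a latent steering vector with a tunable scaling factor via a forward hook on one transformer layer. The layer choice and the vector's extraction procedure are assumptions for illustration and are not reproduced here:

    import torch

    def add_steering_hook(layer, steering_vec: torch.Tensor, alpha: float):
        # The scale alpha controls reasoning intensity at inference time.
        def hook(module, inputs, output):
            hidden = output[0] if isinstance(output, tuple) else output
            hidden = hidden + alpha * steering_vec
            return (hidden,) + output[1:] if isinstance(output, tuple) else hidden
        return layer.register_forward_hook(hook)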

replace VRAIL: Vectorized Reward-based Attribution for Interpretable Learning

Authors: Jina Kim, Youjin Jang, Jeongjin Han

Abstract: We propose VRAIL (Vectorized Reward-based Attribution for Interpretable Learning), a bi-level framework for value-based reinforcement learning (RL) that learns interpretable weight representations from state features. VRAIL consists of two stages: a deep learning (DL) stage that fits an estimated value function using state features, and an RL stage that uses this to shape learning via potential-based reward transformations. The estimator is modeled in either linear or quadratic form, allowing attribution of importance to individual features and their interactions. Empirical results on the Taxi-v3 environment demonstrate that VRAIL improves training stability and convergence compared to standard DQN, without requiring environment modifications. Further analysis shows that VRAIL uncovers semantically meaningful subgoals, such as passenger possession, highlighting its ability to produce human-interpretable behavior. Our findings suggest that VRAIL serves as a general, model-agnostic framework for reward shaping that enhances both learning and interpretability.
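
A minimal sketch of the RL-stage shaping: the DL-stage value estimator serves as a potential function in the standard potential-based transform, which leaves the optimal policy unchanged. `value_fn` is a placeholder for the fitted linear or quadratic estimator:

    def shaped_reward(r, s, s_next, value_fn, gamma=0.99):
        # Potential-based shaping: r' = r + gamma * Phi(s') - Phi(s).
        return r + gamma * value_fn(s_next) - value_fn(s)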

replace CopulaSMOTE: A Copula-Based Oversampling Approach for Imbalanced Classification in Diabetes Prediction

Authors: Agnideep Aich, Md Monzur Murshed, Sameera Hewage, Amanda Mayeaux

Abstract: Diabetes mellitus poses a significant health risk, as nearly 1 in 9 people are affected by it. Early detection can significantly lower this risk. Despite significant advancements in machine learning for identifying diabetic cases, results can still be influenced by the imbalanced nature of the data. To address this challenge, our study considered copula-based data augmentation, which preserves the dependency structure when generating data for the minority class, and integrated it with machine learning (ML) techniques. We selected the Pima Indian dataset and generated data using the A2 copula, then applied five machine learning algorithms: logistic regression, random forest, gradient boosting, extreme gradient boosting, and a multilayer perceptron. Overall, our findings show that Random Forest with A2 copula oversampling (theta = 10) achieved the best performance, with improvements of 5.3% in accuracy, 9.5% in precision, 5.7% in recall, 7.6% in F1-score, and 1.1% in AUC compared to the standard SMOTE method. Furthermore, we statistically validated our results using McNemar's test. This research represents the first known use of A2 copulas for data augmentation and serves as an alternative to the SMOTE technique, highlighting the efficacy of copulas as a statistical method in machine learning applications.

replace An entropy-optimal path to humble AI

Authors: Davide Bassetti, Luk\'a\v{s} Posp\'i\v{s}il, Michael Groom, Terence J. O'Kane, Illia Horenko

Abstract: Progress in AI has led to very successful, but by no means humble, models and tools, especially regarding (i) the huge and further exploding costs and resources they demand, and (ii) the over-confidence of these tools in the answers they provide. Here we introduce a novel mathematical framework for a non-equilibrium entropy-optimizing reformulation of Boltzmann machines based on the exact law of total probability and exact convex polytope representations. We show that it results in a highly performant, but much cheaper, gradient-descent-free learning framework with mathematically justified existence and uniqueness criteria, and cheaply computable confidence/reliability measures for both the model inputs and the outputs. Comparisons to state-of-the-art AI tools in terms of performance, cost and model descriptor lengths on a broad set of synthetic and real-world problems with varying complexity reveal that the proposed method results in more performant and slim models, with descriptor lengths very close to the intrinsic complexity scaling bounds for the underlying problems. Applying this framework to historical climate data results in models with systematically higher prediction skill for the onsets of the important La Ni\~na and El Ni\~no climate phenomena, requiring just a few years of climate data for training -- a small fraction of what is necessary for contemporary climate prediction tools.

replace Shift Happens: Mixture of Experts based Continual Adaptation in Federated Learning

Authors: Rahul Atul Bhope, K. R. Jayaram, Praveen Venkateswaran, Nalini Venkatasubramanian

Abstract: Federated Learning (FL) enables collaborative model training across decentralized clients without sharing raw data, yet faces significant challenges in real-world settings where client data distributions evolve dynamically over time. This paper tackles the critical problem of covariate and label shifts in streaming FL environments, where non-stationary data distributions degrade model performance and necessitate a middleware layer that adapts FL to distributional shifts. We introduce ShiftEx, a shift-aware mixture of experts framework that dynamically creates and trains specialized global models in response to detected distribution shifts, using Maximum Mean Discrepancy for covariate shifts. The framework employs a latent memory mechanism for expert reuse and implements facility location-based optimization to jointly minimize covariate mismatch, expert creation costs, and label imbalance. Through theoretical analysis and comprehensive experiments on benchmark datasets, we demonstrate 5.5-12.9 percentage point accuracy improvements and 22-95% faster adaptation compared to state-of-the-art FL baselines across diverse shift scenarios. The proposed approach offers a scalable, privacy-preserving middleware solution for FL systems operating in non-stationary, real-world conditions while minimizing communication and computational overhead.
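
A minimal sketch of using a (biased) RBF-kernel Maximum Mean Discrepancy to flag covariate shift between a reference window and a current window of client features, the detection signal named above. The bandwidth and threshold are illustrative assumptions:

    import numpy as np

    def rbf_mmd2(X, Y, sigma=1.0):
        def k(A, B):
            d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
            return np.exp(-d2 / (2 * sigma ** 2))
        return k(X, X).mean() + k(Y, Y).mean() - 2 * k(X, Y).mean()

    rng = np.random.default_rng(0)
    ref = rng.normal(size=(64, 5))            # reference client features
    cur = rng.normal(loc=0.5, size=(64, 5))   # shifted current features
    if rbf_mmd2(ref, cur) > 0.05:             # hypothetical detection threshold
        print("covariate shift detected: create or reuse an expert")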

replace TRACED: Transition-aware Regret Approximation with Co-learnability for Environment Design

Authors: Geonwoo Cho, Jaegyun Im, Jihwan Lee, Hojun Yi, Sejin Kim, Sundong Kim

Abstract: Generalizing deep reinforcement learning agents to unseen environments remains a significant challenge. One promising solution is Unsupervised Environment Design (UED), a co-evolutionary framework in which a teacher adaptively generates tasks with high learning potential, while a student learns a robust policy from this evolving curriculum. Existing UED methods typically measure learning potential via regret, the gap between optimal and current performance, approximated solely by value-function loss. Building on these approaches, we introduce the transition-prediction error as an additional term in our regret approximation. To capture how training on one task affects performance on others, we further propose a lightweight metric called Co-Learnability. By combining these two measures, we present Transition-aware Regret Approximation with Co-learnability for Environment Design (TRACED). Empirical evaluations show that TRACED produces curricula that improve zero-shot generalization over strong baselines across multiple benchmarks. Ablation studies confirm that the transition-prediction error drives rapid complexity ramp-up and that Co-Learnability delivers additional gains when paired with the transition-prediction error. These results demonstrate how refined regret approximation and explicit modeling of task relationships can be leveraged for sample-efficient curriculum design in UED. Project Page: https://geonwoo.me/traced/

URLs: https://geonwoo.me/traced/

replace Mirror Descent Policy Optimisation for Robust Constrained Markov Decision Processes

Authors: David Bossens, Atsushi Nitanda

Abstract: Safety is an essential requirement for reinforcement learning systems. The newly emerging framework of robust constrained Markov decision processes allows learning policies that satisfy long-term constraints while providing guarantees under epistemic uncertainty. This paper presents mirror descent policy optimisation for robust constrained Markov decision processes (RCMDPs), making use of policy gradient techniques to optimise both the policy (as a maximiser) and the transition kernel (as an adversarial minimiser) on the Lagrangian representing a constrained MDP. Our proposed algorithm obtains an $\tilde{\mathcal{O}}\left(1/T^{1/3}\right)$ convergence rate in the sample-based RCMDP setting. In addition to the RCMDP setting, the paper also contributes an algorithm for approximate gradient descent in the space of transition kernels, which is of independent interest for designing adversarial environments. Experiments confirm the benefits of mirror descent policy optimisation in constrained and unconstrained optimisation, and significant improvements are observed in robustness tests when compared to baseline policy optimisation algorithms.

replace Blending Supervised and Reinforcement Fine-Tuning with Prefix Sampling

Authors: Zeyu Huang, Tianhao Cheng, Zihan Qiu, Zili Wang, Yinghui Xu, Edoardo M. Ponti, Ivan Titov

Abstract: Existing post-training techniques for large language models are broadly categorized into Supervised Fine-Tuning (SFT) and Reinforcement Fine-Tuning (RFT). Each paradigm presents a distinct trade-off: SFT excels at mimicking demonstration data but can lead to problematic generalization as a form of behavior cloning. Conversely, RFT can significantly enhance a model's performance but is prone to learn unexpected behaviors, and its performance is highly sensitive to the initial policy. In this paper, we propose a unified view of these methods and introduce Prefix-RFT, a hybrid approach that synergizes learning from both demonstration and exploration. Using mathematical reasoning problems as a testbed, we empirically demonstrate that Prefix-RFT is both simple and effective. It not only surpasses the performance of standalone SFT and RFT but also outperforms parallel mixed-policy RFT methods. A key advantage is its seamless integration into existing open-source frameworks, requiring only minimal modifications to the standard RFT pipeline. Our analysis highlights the complementary nature of SFT and RFT, and validates that Prefix-RFT effectively harmonizes these two learning paradigms. Furthermore, ablation studies confirm the method's robustness to variations in the quality and quantity of demonstration data. We hope this work offers a new perspective on LLM post-training, suggesting that a unified paradigm that judiciously integrates demonstration and exploration could be a promising direction for future research.

replace Enhanced Generative Model Evaluation with Clipped Density and Coverage

Authors: Nicolas Salvy, Hugues Talbot, Bertrand Thirion

Abstract: Although generative models have made remarkable progress in recent years, their use in critical applications has been hindered by an inability to reliably evaluate the quality of their generated samples. Quality refers to at least two complementary concepts: fidelity and coverage. Current quality metrics often lack reliable, interpretable values due to an absence of calibration or insufficient robustness to outliers. To address these shortcomings, we introduce two novel metrics: Clipped Density and Clipped Coverage. By clipping individual sample contributions, as well as the radii of nearest neighbor balls for fidelity, our metrics prevent out-of-distribution samples from biasing the aggregated values. Through analytical and empirical calibration, these metrics demonstrate linear score degradation as the proportion of bad samples increases. Thus, they can be straightforwardly interpreted as equivalent proportions of good samples. Extensive experiments on synthetic and real-world datasets demonstrate that Clipped Density and Clipped Coverage outperform existing methods in terms of robustness, sensitivity, and interpretability when evaluating generative models.

replace Lost in Latent Space: An Empirical Study of Latent Diffusion Models for Physics Emulation

Authors: Fran\c{c}ois Rozet, Ruben Ohana, Michael McCabe, Gilles Louppe, Fran\c{c}ois Lanusse, Shirley Ho

Abstract: The steep computational cost of diffusion models at inference hinders their use as fast physics emulators. In the context of image and video generation, this computational drawback has been addressed by generating in the latent space of an autoencoder instead of the pixel space. In this work, we investigate whether a similar strategy can be effectively applied to the emulation of dynamical systems and at what cost. We find that the accuracy of latent-space emulation is surprisingly robust to a wide range of compression rates (up to 1000x). We also show that diffusion-based emulators are consistently more accurate than non-generative counterparts and compensate for uncertainty in their predictions with greater diversity. Finally, we cover practical design choices, spanning from architectures to optimizers, that we found critical to train latent-space emulators.

replace Reinforcement Fine-Tuning Naturally Mitigates Forgetting in Continual Post-Training

Authors: Song Lai, Haohan Zhao, Rong Feng, Changyi Ma, Wenzhuo Liu, Hongbo Zhao, Xi Lin, Dong Yi, Min Xie, Qingfu Zhang, Hongbin Liu, Gaofeng Meng, Fei Zhu

Abstract: Continual post-training (CPT) is a popular and effective technique for adapting foundation models like multimodal large language models to specific and ever-evolving downstream tasks. While existing research has primarily concentrated on methods like data replay, model expansion, or parameter regularization, the fundamental role of the learning paradigm within CPT remains largely unexplored. This paper presents a comparative analysis of two core post-training paradigms: supervised fine-tuning (SFT) and reinforcement fine-tuning (RFT), investigating their respective impacts on knowledge retention during CPT. Our experiments are conducted on a benchmark comprising seven diverse multimodal tasks, utilizing Qwen2.5-VL-7B-Instruct as the base model for continual post-training. The investigation yields two significant findings: (1) When continuously learning on downstream tasks, SFT leads to catastrophic forgetting of previously learned tasks. In contrast, RFT inherently preserves prior knowledge and achieves performance comparable to multi-task training. (2) RFT successfully protects and even enhances the model's general knowledge on standard benchmarks (e.g., MMMU and MMLU-Pro). Conversely, SFT severely degrades general model capabilities. Further analysis reveals that this stability is not primarily due to explicit mechanisms like KL penalty or chain-of-thought reasoning. Instead, we identify an implicit regularization mechanism inherent to RFT as a key contributing factor. Our theoretical analysis suggests that RFT's gradient updates are naturally scaled by the reward variance, acting as a data-dependent regularizer that inherently protects previously acquired knowledge. Finally, we propose a rollout-based instance filtering algorithm to enhance the stability and efficiency of RFT. Our comprehensive study demonstrates the superiority of RFT as a robust paradigm for continual post-training.

replace Training-Free Stein Diffusion Guidance: Posterior Correction for Sampling Beyond High-Density Regions

Authors: Van Khoa Nguyen, Lionel Blond\'e, Alexandros Kalousis

Abstract: Training-free diffusion guidance provides a flexible way to leverage off-the-shelf classifiers without additional training. Yet, current approaches hinge on posterior approximations via Tweedie's formula, which often yield unreliable guidance, particularly in low-density regions. Stochastic optimal control (SOC), in contrast, provides principled posterior simulation but is prohibitively expensive for fast sampling. In this work, we reconcile the strengths of these paradigms by introducing Stein Diffusion Guidance (SDG), a novel training-free framework grounded in a surrogate SOC objective. We establish a theoretical bound on the value function, demonstrating the necessity of correcting approximate posteriors to faithfully reflect true diffusion dynamics. Leveraging Stein variational inference, SDG identifies the steepest descent direction that minimizes the Kullback-Leibler divergence between approximate and true posteriors. By incorporating a principled Stein correction mechanism and a novel running cost functional, SDG enables effective guidance in low-density regions. Experiments on molecular low-density sampling tasks suggest that SDG consistently surpasses standard training-free guidance methods, highlighting its potential for broader diffusion-based sampling beyond high-density regions.

replace From Sorting Algorithms to Scalable Kernels: Bayesian Optimization in High-Dimensional Permutation Spaces

Authors: Zikai Xie, Linjiang Chen

Abstract: Bayesian Optimization (BO) is a powerful tool for black-box optimization, but its application to high-dimensional permutation spaces is severely limited by the challenge of defining scalable representations. The current state-of-the-art BO approach for permutation spaces relies on an exhaustive $\Omega(n^2)$ pairwise comparison, inducing a dense representation that is impractical for large-scale permutations. To break this barrier, we introduce a novel framework for generating efficient permutation representations via kernel functions derived from sorting algorithms. Within this framework, the Mallows kernel can be viewed as a special instance derived from enumeration sort. Further, we introduce the \textbf{Merge Kernel}, which leverages the divide-and-conquer structure of merge sort to produce a compact $\Theta(n\log n)$ representation, achieving the lowest possible complexity with no information loss while effectively capturing permutation structure. Our central thesis is that the Merge Kernel performs competitively with the Mallows kernel in low-dimensional settings, but significantly outperforms it in both optimization performance and computational efficiency as the dimension $n$ grows. Extensive evaluations on various permutation optimization benchmarks confirm our hypothesis, demonstrating that the Merge Kernel provides a scalable and more effective solution for Bayesian optimization in high-dimensional permutation spaces, thereby unlocking the potential for tackling previously intractable problems such as large-scale feature ordering and combinatorial neural architecture search.
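
A minimal sketch of the intuition behind the Merge Kernel: run merge sort on a permutation, record the outcome of each comparison to get a $\Theta(n\log n)$-length feature vector, and define a kernel as an inner product of these vectors. The +/-1 encoding and the truncation to a common length are assumptions for illustration; the paper's construction may differ in detail:

    import numpy as np

    def merge_features(perm):
        out = []
        def msort(a):
            if len(a) <= 1:
                return a
            mid = len(a) // 2
            left, right = msort(a[:mid]), msort(a[mid:])
            merged, i, j = [], 0, 0
            while i < len(left) and j < len(right):
                take_left = left[i] <= right[j]
                out.append(1.0 if take_left else -1.0)  # record the comparison
                if take_left:
                    merged.append(left[i]); i += 1
                else:
                    merged.append(right[j]); j += 1
            merged.extend(left[i:]); merged.extend(right[j:])
            return merged
        msort(list(perm))
        return np.array(out)

    def merge_kernel(p, q):
        fp, fq = merge_features(p), merge_features(q)
        m = min(len(fp), len(fq))  # comparison counts can differ slightly
        return float(fp[:m] @ fq[:m])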

replace ParallelTime: Dynamically Weighting the Balance of Short- and Long-Term Temporal Dependencies

Authors: Itay Katav, Aryeh Kontorovich

Abstract: Modern multivariate time series forecasting primarily relies on two architectures: the Transformer with its attention mechanism and Mamba. In natural language processing, prior work has combined local window attention, which captures short-term dependencies, with Mamba, which captures long-term dependencies, averaging their outputs to assign equal weight to both. We find that for time-series forecasting tasks, assigning equal weight to long-term and short-term dependencies is not optimal. To mitigate this, we propose a dynamic weighting mechanism, ParallelTime Weighter, which calculates interdependent weights for long-term and short-term dependencies for each token based on the input and the model's knowledge. Furthermore, we introduce the ParallelTime architecture, which incorporates the ParallelTime Weighter mechanism to deliver state-of-the-art performance across diverse benchmarks. Our architecture demonstrates robustness, achieves lower FLOPs, requires fewer parameters, scales effectively to longer prediction horizons, and significantly outperforms existing methods. These advances highlight a promising path for future developments of parallel Attention-Mamba models in time series forecasting. The implementation is readily available at: \href{https://github.com/itay1551/ParallelTime}{GitHub}.

URLs: https://github.com/itay1551/ParallelTime
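
A minimal sketch of a per-token dynamic weighter that blends a short-range (windowed attention) branch with a long-range (Mamba) branch instead of averaging them equally. The single-linear gate is an illustrative assumption; ParallelTime's weighter conditions on both the input and model state:

    import torch
    import torch.nn as nn

    class DynamicWeighter(nn.Module):
        def __init__(self, d_model):
            super().__init__()
            self.gate = nn.Linear(d_model, 1)

        def forward(self, x, short_out, long_out):
            w = torch.sigmoid(self.gate(x))       # per-token weight in (0, 1)
            return w * short_out + (1 - w) * long_out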

replace Merging Memory and Space: A State Space Neural Operator

Authors: Nodens F. Koren, Samuel Lanthaler

Abstract: We propose the State Space Neural Operator (SS-NO), a compact architecture for learning solution operators of time-dependent partial differential equations (PDEs). Our formulation extends structured state space models (SSMs) to joint spatiotemporal modeling, introducing two key mechanisms: adaptive damping, which stabilizes learning by localizing receptive fields, and learnable frequency modulation, which enables data-driven spectral selection. These components provide a unified framework for capturing long-range dependencies with parameter efficiency. Theoretically, we establish connections between SSMs and neural operators, proving a universality theorem for convolutional architectures with full field-of-view. Empirically, SS-NO achieves state-of-the-art performance across diverse PDE benchmarks-including 1D Burgers' and Kuramoto-Sivashinsky equations, and 2D Navier-Stokes and compressible Euler flows-while using significantly fewer parameters than competing approaches. A factorized variant of SS-NO further demonstrates scalable performance on challenging 2D problems. Our results highlight the effectiveness of damping and frequency learning in operator modeling, while showing that lightweight factorization provides a complementary path toward efficient large-scale PDE learning.

replace GEDAN: Learning the Edit Costs for Graph Edit Distance

Authors: Francesco Leonardi, Markus Orsi, Jean-Louis Reymond, Kaspar Riesen

Abstract: Graph Edit Distance (GED) is defined as the minimum cost transformation of one graph into another and is a widely adopted metric for measuring the dissimilarity between graphs. The major problem of GED is that its computation is NP-hard, which has in turn led to the development of various approximation methods, including approaches based on neural networks (NN). However, most NN methods assume a unit cost for edit operations -- a restrictive and often unrealistic simplification, since topological and functional distances rarely coincide in real-world data. In this paper, we propose a fully end-to-end Graph Neural Network framework for learning the edit costs for GED, at a fine-grained level, aligning topological and task-specific similarity. Our method combines an unsupervised self-organizing mechanism for GED approximation with a Generalized Additive Model that flexibly learns contextualized edit costs. Experiments demonstrate that our approach overcomes the limitations of non-end-to-end methods, yielding directly interpretable graph matchings, uncovering meaningful structures in complex graphs, and showing strong applicability to domains such as molecular analysis.

replace Causal Reflection with Language Models

Authors: Abi Aryan, Zac Liu

Abstract: While LLMs exhibit impressive fluency and factual recall, they struggle with robust causal reasoning, often relying on spurious correlations and brittle patterns. Similarly, traditional Reinforcement Learning agents also lack causal understanding, optimizing for rewards without modeling why actions lead to outcomes. We introduce Causal Reflection, a framework that explicitly models causality as a dynamic function over state, action, time, and perturbation, enabling agents to reason about delayed and nonlinear effects. Additionally, we define a formal Reflect mechanism that identifies mismatches between predicted and observed outcomes and generates causal hypotheses to revise the agent's internal model. In this architecture, LLMs serve not as black-box reasoners, but as structured inference engines translating formal causal outputs into natural language explanations and counterfactuals. Our framework lays the theoretical groundwork for Causal Reflective agents that can adapt, self-correct, and communicate causal understanding in evolving environments.

replace Reparameterization Proximal Policy Optimization

Authors: Hai Zhong, Xun Wang, Zhuoran Li, Longbo Huang

Abstract: Reparameterization policy gradient (RPG) is promising for improving sample efficiency by leveraging differentiable dynamics. However, a critical barrier is its training instability, where high-variance gradients can destabilize the learning process. To address this, we draw inspiration from Proximal Policy Optimization (PPO), which uses a surrogate objective to enable stable sample reuse in the model-free setting. We first establish a connection between this surrogate objective and RPG, which has been largely unexplored and is non-trivial. Then, we bridge this gap by demonstrating that the reparameterization gradient of a PPO-like surrogate objective can be computed efficiently using backpropagation through time. Based on this key insight, we propose Reparameterization Proximal Policy Optimization (RPO), a stable and sample-efficient RPG-based method. RPO enables stable sample reuse over multiple epochs by employing a policy gradient clipping mechanism tailored for RPG. It is further stabilized by Kullback-Leibler (KL) divergence regularization and remains fully compatible with existing variance reduction methods. We evaluate RPO on a suite of challenging locomotion and manipulation tasks, where experiments demonstrate that our method achieves superior sample efficiency and strong performance.

replace Discovery Learning accelerates battery design evaluation

Authors: Jiawei Zhang, Yifei Zhang, Baozhao Yi, Yao Ren, Qi Jiao, Hanyu Bai, Weiran Jiang, Ziyou Song

Abstract: Fast and reliable validation of novel designs in complex physical systems such as batteries is critical to accelerating technological innovation. However, battery research and development remain bottlenecked by the prohibitively high time and energy costs required to evaluate numerous new design candidates, particularly in battery prototyping and life testing. Despite recent progress in data-driven battery lifetime prediction, existing methods require labeled data of target designs to improve accuracy and cannot make reliable predictions until after prototyping, thus falling far short of the efficiency needed to enable rapid feedback for battery design. Here, we introduce Discovery Learning (DL), a scientific machine-learning paradigm that integrates active learning, physics-guided learning, and zero-shot learning into a human-like reasoning loop, drawing inspiration from learning theories in educational psychology. DL can learn from historical battery designs and actively reduce the need for prototyping, thus enabling rapid lifetime evaluation for unobserved material-design combinations without requiring additional data labeling. To test DL, we present 123 industrial-grade large-format lithium-ion pouch cells, spanning eight material-design combinations and diverse cycling protocols. Trained solely on public datasets of small-capacity cylindrical cells, DL achieves 7.2% test error in predicting the average cycle life under unknown device variability. This results in savings of 98% in time and 95% in energy compared to industrial practices. This work highlights the potential of uncovering insights from historical designs to inform and accelerate the development of next-generation battery technologies. DL represents a key advance toward efficient data-driven modeling and helps realize the promise of machine learning for accelerating scientific discovery and engineering innovation.

replace A Principled Loss Function for Direct Language Model Alignment

Authors: Yuandong Tan

Abstract: The alignment of large language models (LLMs) with human preferences is commonly achieved through Reinforcement Learning from Human Feedback (RLHF). Direct Preference Optimization (DPO) simplified this paradigm by establishing a direct mapping between the optimal policy and a reward function, eliminating the need for an explicit reward model. However, we argue that the DPO loss function is theoretically misaligned with its own derivation, as it promotes the indefinite maximization of a logits difference, which can lead to training instability and reward hacking. In this paper, we propose a novel loss function derived directly from the RLHF optimality condition. Our proposed loss targets a specific, finite value for the logits difference, which is dictated by the underlying reward, rather than its maximization. We provide a theoretical analysis, including a gradient-based comparison, to demonstrate that our method avoids the large gradients that plague DPO when the probability of dispreferred responses approaches zero. This inherent stability prevents reward hacking and leads to more effective alignment. We validate our approach by fine-tuning a Qwen2.5-7B model, showing significant win-rate improvements over a standard DPO baseline and achieving competitive performance against larger models like Llama-3.1-8B.
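
A minimal sketch contrasting DPO's unbounded objective with a loss that pulls the implicit reward margin toward a specific finite target, the key idea argued above. The squared-error form and `target_margin` are illustrative assumptions, not necessarily the paper's exact loss:

    import torch
    import torch.nn.functional as F

    def finite_target_loss(logp_chosen, logp_rejected,
                           ref_logp_chosen, ref_logp_rejected,
                           target_margin, beta=0.1):
        diff = beta * ((logp_chosen - ref_logp_chosen)
                       - (logp_rejected - ref_logp_rejected))
        # DPO maximizes diff indefinitely via -logsigmoid(diff); regressing it
        # onto a finite target keeps gradients bounded as the probability of
        # dispreferred responses approaches zero.
        return F.mse_loss(diff, torch.full_like(diff, target_margin))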

replace Causal Structure Learning in Hawkes Processes with Complex Latent Confounder Networks

Authors: Songyao Jin, Biwei Huang

Abstract: Multivariate Hawkes process provides a powerful framework for modeling temporal dependencies and event-driven interactions in complex systems. While existing methods primarily focus on uncovering causal structures among observed subprocesses, real-world systems are often only partially observed, with latent subprocesses posing significant challenges. In this paper, we show that continuous-time event sequences can be represented by a discrete-time causal model as the time interval shrinks, and we leverage this insight to establish necessary and sufficient conditions for identifying latent subprocesses and the causal influences. Accordingly, we propose a two-phase iterative algorithm that alternates between inferring causal relationships among discovered subprocesses and uncovering new latent subprocesses, guided by path-based conditions that guarantee identifiability. Experiments on both synthetic and real-world datasets show that our method effectively recovers causal structures despite the presence of latent subprocesses.

replace MDPO: Overcoming the Training-Inference Divide of Masked Diffusion Language Models

Authors: Haoyu He, Katrin Renz, Yong Cao, Andreas Geiger

Abstract: Masked diffusion language models (MDLMs), a promising alternative to traditional autoregressive (AR) models, enable faster generation and richer conditioning on bidirectional context. However, they suffer from a key discrepancy between training and inference: during inference, MDLMs progressively reveal the structure of the generated sequence by producing fewer and fewer masked tokens, whereas this structure is ignored in training as tokens are masked at random. Although this discrepancy between training and inference can lead to suboptimal performance, it has been largely overlooked by previous works, leaving closing this gap between the two stages an open problem. To address this, we frame the problem of learning effective denoising trajectories as a sequential decision-making problem and use the resulting framework to apply reinforcement learning. We propose Masked Diffusion Policy Optimization (MDPO), a novel method that exploits the Markov property of the diffusion process and explicitly trains the model under the same progressive refining schedule used at inference. MDPO matches the performance of the previous state-of-the-art (SOTA) method with 60x fewer gradient updates, while achieving average improvements of 9.6% on MATH500 and 54.2% on Countdown over SOTA when trained within the same number of weight updates. Additionally, we improve the remasking strategy of MDLMs as a plug-in replacement at inference to overcome the limitation that the model cannot refine tokens flexibly. This training-free method, termed Running Confidence Remasking (RCR), consistently enhances performance and provides further improvements when used with MDPO. Our findings establish great potential for investigating the discrepancy between pre-training and inference of MDLMs. Code: https://github.com/autonomousvision/mdpo. Project Page: https://cli212.github.io/MDPO/.

URLs: https://github.com/autonomousvision/mdpo, https://cli212.github.io/MDPO/

replace Aligning Distributionally Robust Optimization with Practical Deep Learning Needs

Authors: Dmitrii Feoktistov, Igor Ignashin, Andrey Veprikov, Nikita Borovko, Alexander Bogdanov, Savelii Chezhegov, Aleksandr Beznosikov

Abstract: While traditional Deep Learning (DL) optimization methods treat all training samples equally, Distributionally Robust Optimization (DRO) adaptively assigns importance weights to different samples. However, a significant gap exists between DRO and current DL practices. Modern DL optimizers require adaptivity and the ability to handle stochastic gradients, as these methods demonstrate superior performance. Additionally, for practical applications, a method should allow weight assignment not only to individual samples, but also to groups of objects (for example, all samples of the same class). This paper aims to bridge this gap by introducing ALSO -- Adaptive Loss Scaling Optimizer -- an adaptive algorithm for a modified DRO objective that can handle weight assignment to sample groups. We prove the convergence of our proposed algorithm for non-convex objectives, which is the typical case for DL models. Empirical evaluation across diverse Deep Learning tasks, from Tabular DL to Split Learning tasks, demonstrates that ALSO outperforms both traditional optimizers and existing DRO methods.

replace Breaking the Exploration Bottleneck: Rubric-Scaffolded Reinforcement Learning for General LLM Reasoning

Authors: Yang Zhou, Sunzhu Li, Shunyu Liu, Wenkai Fang, Kongcheng Zhang, Jiale Zhao, Jingwen Yang, Yihe Zhou, Jianwei Lv, Tongya Zheng, Hengtong Lu, Wei Chen, Yan Xie, Mingli Song

Abstract: Recent advances in Large Language Models (LLMs) have underscored the potential of Reinforcement Learning (RL) to facilitate the emergence of reasoning capabilities. Despite the encouraging results, a fundamental dilemma persists as RL improvement relies on learning from high-quality samples, yet the exploration for such samples remains bounded by the inherent limitations of LLMs. This, in effect, creates an undesirable cycle in which what cannot be explored cannot be learned. In this work, we propose Rubric-Scaffolded Reinforcement Learning (RuscaRL), a novel instructional scaffolding framework designed to break the exploration bottleneck for general LLM reasoning. Specifically, RuscaRL introduces checklist-style rubrics as (1) explicit scaffolding for exploration during rollout generation, where different rubrics are provided as external guidance within task instructions to steer diverse high-quality responses. This guidance is gradually decayed over time, encouraging the model to internalize the underlying reasoning patterns; (2) verifiable rewards for exploitation during model training, where we can obtain robust LLM-as-a-Judge scores using rubrics as references, enabling effective RL on general reasoning tasks. Extensive experiments demonstrate the superiority of the proposed RuscaRL across various benchmarks, effectively expanding reasoning boundaries under the Best-of-N evaluation. Notably, RuscaRL significantly boosts Qwen2.5-7B-Instruct from 23.6 to 50.3 on HealthBench-500, surpassing GPT-4.1. Furthermore, our fine-tuned variant on Qwen3-30B-A3B-Instruct achieves 61.1 on HealthBench-500, outperforming leading LLMs including OpenAI-o3. Our code is available at https://github.com/IANNXANG/RuscaRL.

URLs: https://github.com/IANNXANG/RuscaRL

replace Optimal Sparsity of Mixture-of-Experts Language Models for Reasoning Tasks

Authors: Taishi Nakamura, Satoki Ishikawa, Masaki Kawamura, Takumi Okamoto, Daisuke Nohara, Jun Suzuki, Rio Yokota

Abstract: Empirical scaling laws have driven the evolution of large language models (LLMs), yet their coefficients shift whenever the model architecture or data pipeline changes. Mixture-of-Experts (MoE) models, now standard in state-of-the-art systems, introduce a new sparsity dimension that current dense-model frontiers overlook. We investigate how MoE sparsity influences two distinct capability regimes: memorization skills and reasoning skills. By training MoE families that vary total parameters, active parameters, and top-$k$ routing under fixed compute budgets, we disentangle pre-training loss from downstream accuracy. Our results reveal two principles. First, Active FLOPs: models with identical training loss but greater active compute achieve higher reasoning accuracy. Second, Total tokens per parameter (TPP): memorization tasks improve with more parameters, while reasoning tasks benefit from optimal TPP, indicating that reasoning is data-hungry. Neither reinforcement learning post-training (GRPO) nor increased test-time compute alters these trends. We therefore argue that optimal MoE sparsity must be determined jointly by active FLOPs and TPP, revising the classical picture of compute-optimal scaling. Our model checkpoints, code and logs are open-source at https://github.com/rioyokotalab/optimal-sparsity.

URLs: https://github.com/rioyokotalab/optimal-sparsity

replace Data-Augmented Few-Shot Neural Emulator for Computer-Model System Identification

Authors: Sanket Jantre, Deepak Akhare, Zhiyuan Wang, Xiaoning Qian, Nathan M. Urban

Abstract: Partial differential equations (PDEs) underpin the modeling of many natural and engineered systems. It can be convenient to express such models as neural PDEs rather than using traditional numerical PDE solvers by replacing part or all of the PDE's governing equations with a neural network representation. Neural PDEs are often easier to differentiate, linearize, reduce, or use for uncertainty quantification than the original numerical solver. They are usually trained on solution trajectories obtained by long-horizon rollout of the PDE solver. Here we propose a more sample-efficient data-augmentation strategy for generating neural PDE training data from a computer model by space-filling sampling of local "stencil" states. This approach removes a large degree of spatiotemporal redundancy present in trajectory data and oversamples states that may be rarely visited but help the neural PDE generalize across the state space. We demonstrate that accurate neural PDE stencil operators can be learned from synthetic training data generated by the computational equivalent of 10 timesteps' worth of numerical simulation. Accuracy is further improved if we assume access to a single full-trajectory simulation from the computer model, which is typically available in practice. Across several PDE systems, we show that our data-augmented stencil data yield better trained neural stencil operators, with clear performance gains compared with naively sampled stencil data from simulation trajectories. Finally, with only 10 solver steps' worth of augmented stencil data, our approach outperforms traditional ML emulators trained on thousands of trajectories in long-horizon rollout accuracy and stability.

replace The Lifecycle Principle: Stabilizing Dynamic Neural Networks with State Memory

Authors: Zichuan Yang

Abstract: I investigate a stronger form of regularization by deactivating neurons for extended periods, a departure from the temporary changes of methods like Dropout. However, this long-term dynamism introduces a critical challenge: severe training instability when neurons are revived with random weights. To solve this, I propose the Lifecycle (LC) principle, a regularization mechanism centered on a key innovation: state memory. Instead of re-initializing a revived neuron, my method restores its parameters to their last known effective state. This process preserves learned knowledge and avoids destructive optimization shocks. My theoretical analysis reveals that the LC principle smooths the loss landscape, guiding optimization towards flatter minima associated with better generalization. Experiments on image classification benchmarks demonstrate that my method improves generalization and robustness. Crucially, ablation studies confirm that state memory is essential for achieving these gains.

replace On Entropy Control in LLM-RL Algorithms

Authors: Han Shen

Abstract: For RL algorithms, appropriate entropy control is crucial to their effectiveness. To control the policy entropy, a commonly used method is entropy regularization, which is adopted in various popular RL algorithms including PPO, SAC and A3C. Although entropy regularization has conventionally proved effective in robotics and game RL, studies have found that it gives weak to no gains in LLM-RL training. In this work, we study the issues of the entropy bonus in the LLM-RL setting. Specifically, we first argue that the conventional entropy regularization suffers from the LLM's extremely large response space and the sparsity of the optimal outputs. As a remedy, we propose AEnt, an entropy control method that utilizes a new clamped entropy bonus with an automatically adjusted coefficient. The clamped entropy is evaluated with the re-normalized policy defined on a certain smaller token space, which encourages exploration within a more compact response set. In addition, the algorithm automatically adjusts the entropy coefficient according to the clamped entropy value, effectively controlling the entropy-induced bias while leveraging the entropy's benefits. AEnt is tested on math-reasoning tasks under different base models and datasets, and it is observed that AEnt outperforms the baselines consistently across multiple benchmarks.
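
A minimal sketch of a clamped entropy bonus with an automatically adjusted coefficient: entropy is computed on a policy re-normalized over a smaller top-k token set, clamped to a range, and the coefficient is nudged toward a target entropy. The top-k truncation, clamp range, and coefficient update rule are illustrative assumptions:

    import torch

    def clamped_entropy(logits, k=50, low=0.5, high=2.0):
        topk = torch.topk(logits, k, dim=-1).values   # assumes vocab size >= k
        probs = torch.softmax(topk, dim=-1)           # re-normalized over top-k
        ent = -(probs * probs.clamp_min(1e-12).log()).sum(-1).mean()
        return ent.clamp(low, high)

    def update_coefficient(coef: float, ent_value: float, target=1.0, lr=1e-3):
        # Raise the bonus when entropy falls below target, lower it otherwise.
        return max(0.0, coef + lr * (target - ent_value))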

replace One-Embedding-Fits-All: Efficient Zero-Shot Time Series Forecasting by a Model Zoo

Authors: Hao-Nan Shi, Ting-Ji Huang, Lu Han, De-Chuan Zhan, Han-Jia Ye

Abstract: The proliferation of Time Series Foundation Models (TSFMs) has significantly advanced zero-shot forecasting, enabling predictions for unseen time series without task-specific fine-tuning. Extensive research has confirmed that no single TSFM excels universally, as different models exhibit preferences for distinct temporal patterns. This diversity suggests an opportunity: taking advantage of the complementary abilities of TSFMs. To this end, we propose ZooCast, which characterizes each model's distinct forecasting strengths. ZooCast can intelligently assemble current TSFMs into a model zoo that dynamically selects optimal models for different forecasting tasks. Our key innovation lies in the One-Embedding-Fits-All paradigm that constructs a unified representation space where each model in the zoo is represented by a single embedding, enabling efficient similarity matching for all tasks. Experiments demonstrate ZooCast's strong performance on the GIFT-Eval zero-shot forecasting benchmark while maintaining the efficiency of a single TSFM. In real-world scenarios with sequential model releases, the framework seamlessly adds new models for progressive accuracy gains with negligible overhead.

replace Data-Efficient Time-Dependent PDE Surrogates: Graph Neural Simulators vs. Neural Operators

Authors: Dibyajyoti Nayak, Somdatta Goswami

Abstract: Developing accurate, data-efficient surrogate models is central to advancing AI for Science. Neural operators (NOs), which approximate mappings between infinite-dimensional function spaces using conventional neural architectures, have gained popularity as surrogates for systems driven by partial differential equations (PDEs). However, their reliance on large datasets and limited ability to generalize in low-data regimes hinder their practical utility. We argue that these limitations arise from their global processing of data, which fails to exploit the local, discretized structure of physical systems. To address this, we propose Graph Neural Simulators (GNS) as a principled surrogate modeling paradigm for time-dependent PDEs. GNS leverages message-passing combined with numerical time-stepping schemes to learn PDE dynamics by modeling the instantaneous time derivatives. This design mimics traditional numerical solvers, enabling stable long-horizon rollouts and strong inductive biases that enhance generalization. We rigorously evaluate GNS on four canonical PDE systems: (1) 2D scalar Burgers', (2) 2D coupled Burgers', (3) 2D Allen-Cahn, and (4) 2D nonlinear shallow-water equations, comparing against state-of-the-art NOs including Deep Operator Network (DeepONet) and Fourier Neural Operator (FNO). Results demonstrate that GNS is markedly more data-efficient, achieving less than 1% relative L2 error using only 3% of available trajectories, and exhibits dramatically reduced error accumulation over time (82.5% lower autoregressive error than FNO, 99.9% lower than DeepONet). To choose the training data, we introduce a trajectory selection strategy that combines PCA with K-Means clustering. These findings provide compelling evidence that GNS, with its graph-based locality and solver-inspired design, is the most suitable and scalable surrogate modeling framework for AI-driven scientific discovery.
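
A minimal sketch of the solver-inspired design under stated assumptions: a learned model predicts the instantaneous time derivative du/dt and a numerical scheme (forward Euler here) advances the state; the diffusion-like stencil is a stand-in for the message-passing network.

import numpy as np

def model(u):
    # placeholder for a GNN predicting du/dt; here: nearest-neighbor smoothing
    return 0.5 * (np.roll(u, 1) + np.roll(u, -1) - 2 * u)

def rollout(u0, dt=0.01, steps=100):
    u = u0.copy()
    traj = [u0]
    for _ in range(steps):
        u = u + dt * model(u)   # forward Euler: u_{t+1} = u_t + dt * f(u_t)
        traj.append(u)
    return np.stack(traj)

u0 = np.sin(np.linspace(0, 2 * np.pi, 64, endpoint=False))
print(rollout(u0).shape)        # (101, 64): a long-horizon autoregressive rollout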

replace ALICE: An Interpretable Neural Architecture for Generalization in Substitution Ciphers

Authors: Jeff Shen, Lindsay M. Smith

Abstract: We present cryptogram solving as an ideal testbed for studying neural network reasoning and generalization; models must decrypt text encoded with substitution ciphers, choosing from 26! possible mappings without explicit access to the cipher. We develop ALICE (an Architecture for Learning Interpretable Cryptogram dEcipherment), a simple encoder-only Transformer that sets a new state-of-the-art for both accuracy and speed on this decryption problem. Surprisingly, ALICE generalizes to unseen ciphers after training on only ${\sim}1500$ unique ciphers, a minute fraction ($3.7 \times 10^{-24}$) of the possible cipher space. To enhance interpretability, we introduce a novel bijective decoding head that explicitly models permutations via the Gumbel-Sinkhorn method, enabling direct extraction of learned cipher mappings. Through early exit and probing experiments, we reveal how ALICE progressively refines its predictions in a way that appears to mirror common human strategies -- early layers place greater emphasis on letter frequencies, while later layers form word-level structures. Our architectural innovations and analysis methods are applicable beyond cryptograms and offer new insights into neural network generalization and interpretability.
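
A rough sketch of a Gumbel-Sinkhorn head over 26x26 letter-mapping scores: Gumbel noise is added and Sinkhorn normalization is iterated so the matrix approaches a doubly stochastic (soft permutation) form. The temperature and iteration count are illustrative, not the paper's settings.

import numpy as np

def gumbel_sinkhorn(scores, tau=1.0, n_iters=20, seed=0):
    rng = np.random.default_rng(seed)
    g = -np.log(-np.log(rng.uniform(size=scores.shape)))  # Gumbel(0,1) noise
    log_p = (scores + g) / tau
    for _ in range(n_iters):                  # alternate row/column normalization
        log_p -= np.log(np.exp(log_p).sum(axis=1, keepdims=True))
        log_p -= np.log(np.exp(log_p).sum(axis=0, keepdims=True))
    return np.exp(log_p)

P = gumbel_sinkhorn(np.random.default_rng(1).normal(size=(26, 26)))
print(P.sum(axis=0).round(2), P.sum(axis=1).round(2))  # both close to 1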

replace One Model for All Tasks: Leveraging Efficient World Models in Multi-Task Planning

Authors: Yuan Pu, Yazhe Niu, Jia Tang, Junyu Xiong, Shuai Hu, Hongsheng Li

Abstract: In heterogeneous multi-task decision-making, tasks not only exhibit diverse observation and action spaces but also vary substantially in their underlying complexities. While conventional multi-task world models like UniZero excel in single-task settings, we find that when handling a broad and diverse suite of tasks, gradient conflicts and the loss of model plasticity often constrain their sample efficiency. In this work, we address these challenges from two complementary perspectives: the single learning iteration and the overall learning process. First, to mitigate the gradient conflicts, we systematically investigate key architectural designs for extending UniZero. Our investigation identifies a Mixture-of-Experts (MoE) architecture as the most effective approach. We demonstrate, both theoretically and empirically, that this architecture alleviates gradient conflicts by routing task-specific representations to specialized sub-networks. This finding leads to our proposed model, \textit{ScaleZero}. Second, to dynamically allocate model capacity throughout the learning process, we introduce an online Dynamic Parameter Scaling (DPS) strategy. This strategy progressively integrates LoRA adapters in response to task-specific progress, enabling adaptive knowledge retention and parameter expansion. Evaluations on a diverse set of standard benchmarks (Atari, DMC, Jericho) demonstrate that ScaleZero, utilizing solely online reinforcement learning with one model, performs on par with specialized single-task agents. With the DPS strategy, it remains competitive while using just 71.5% of the environment interactions. These findings underscore the potential of ScaleZero for effective multi-task planning. Our code is available at \textcolor{magenta}{https://github.com/opendilab/LightZero}.

URLs: https://github.com/opendilab/LightZero

replace ButterflyQuant: Ultra-low-bit LLM Quantization through Learnable Orthogonal Butterfly Transforms

Authors: Bingxin Xu, Zhen Dong, Oussama Elachqar, Yuzhang Shang

Abstract: Large language models have massive memory footprints, severely limiting deployment on consumer hardware. Quantization reduces memory through lower numerical precision, but extreme 2-bit quantization suffers from catastrophic performance loss due to outliers in activations. Rotation-based methods such as QuIP and QuaRot apply orthogonal transforms to eliminate outliers before quantization, using computational invariance: $\mathbf{y} = \mathbf{Wx} = (\mathbf{WQ}^T)(\mathbf{Qx})$ for orthogonal $\mathbf{Q}$. However, these methods use fixed transforms--Hadamard matrices achieving optimal worst-case coherence $\mu = 1/\sqrt{n}$--that cannot adapt to specific weight distributions. We identify that different transformer layers exhibit distinct outlier patterns, motivating layer-adaptive rotations rather than one-size-fits-all approaches. In this work, we propose ButterflyQuant, which replaces Hadamard rotations with learnable butterfly transforms parameterized by continuous Givens rotation angles. Unlike Hadamard's discrete $\{+1, -1\}$ entries that are non-differentiable and thus prohibit gradient-based learning, butterfly transforms' continuous parameterization enables smooth optimization while guaranteeing orthogonality by construction. This orthogonal constraint ensures theoretical guarantees in outlier suppression while achieving $O(n \log n)$ computational complexity with only $\frac{n \log n}{2}$ learnable parameters. We further introduce a uniformity regularization on post-transformation activations to promote smoother distributions amenable to quantization. Learning requires only 128 calibration samples and converges in minutes on a single GPU--a negligible one-time cost. For LLaMA-2-7B with 2-bit quantization, ButterflyQuant achieves 15.4 perplexity versus 37.3 for QuIP. \href{https://github.com/42Shawn/Butterflyquant-llm}{Codes} are available.

URLs: https://github.com/42Shawn/Butterflyquant-llm
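
A small sketch of the butterfly construction described above: log2(n) stages of 2x2 Givens rotations, one angle per coordinate pair, giving (n log n)/2 parameters and orthogonality by construction. The angles are random here rather than learned from calibration data.

import numpy as np

def butterfly_apply(x, angles):
    n = x.size
    y = x.copy()
    k = 0
    for s in range(int(np.log2(n))):          # log2(n) stages
        stride = 1 << s
        for i in range(n):
            if i & stride:                    # visit each pair once
                continue
            j = i + stride
            c, t = np.cos(angles[k]), np.sin(angles[k])
            y[i], y[j] = c * y[i] - t * y[j], t * y[i] + c * y[j]
            k += 1
    return y

rng = np.random.default_rng(0)
n = 8
theta = rng.uniform(0, 2 * np.pi, size=n * int(np.log2(n)) // 2)
x = rng.normal(size=n)
print(np.allclose(np.linalg.norm(x), np.linalg.norm(butterfly_apply(x, theta))))  # True: norm preserved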

replace Prompt Injection Attacks on LLM Generated Reviews of Scientific Publications

Authors: Janis Keuper

Abstract: The ongoing intense discussion on rising LLM usage in the scientific peer-review process has recently been stirred by reports of authors using hidden prompt injections to manipulate review scores. Since the existence of such "attacks" - although seen by some commentators as "self-defense" - would have a great impact on the further debate, this paper investigates the practicability and technical success of the described manipulations. Our systematic evaluation, using 1k reviews of 2024 ICLR papers generated by a wide range of LLMs, shows two distinct results: I) very simple prompt injections are indeed highly effective, reaching up to 100% acceptance scores. II) LLM reviews are generally biased toward acceptance (>95% in many models). Both results have a great impact on the ongoing discussions on LLM usage in peer-review.

replace Matched-Pair Experimental Design with Active Learning

Authors: Weizhi Li, Gautam Dasarathy, Visar Berisha

Abstract: Matched-pair experimental designs aim to detect treatment effects by pairing participants and comparing within-pair outcome differences. In many situations, the overall effect size across the entire population is small. Then, the focus naturally shifts to identifying and targeting high treatment-effect regions where the intervention is most effective. This paper proposes a matched-pair experimental design that sequentially and actively enrolls patients in high treatment-effect regions. Importantly, we frame the identification of the target region as a classification problem and propose an active learning framework tailored to matched-pair designs. Our design not only reduces the experimental cost of detecting treatment efficacy, but also ensures that the identified regions enclose the entire high-treatment-effect regions. Our theoretical analysis of the framework's label complexity and experiments in practical scenarios demonstrate the efficiency and advantages of the approach.

replace Selective Risk Certification for LLM Outputs via Information-Lift Statistics: PAC-Bayes, Robustness, and Skeleton Design

Authors: Sanjeda Akter, Ibne Farabi Shihab, Anuj Sharma

Abstract: Large language models frequently generate confident but incorrect outputs, requiring formal uncertainty quantification with abstention guarantees. We develop information-lift certificates that compare model probabilities to a skeleton baseline, accumulating evidence into sub-gamma PAC-Bayes bounds valid under heavy-tailed distributions. Across eight datasets, our method achieves 77.2\% coverage at 2\% risk, outperforming recent 2023-2024 baselines by 8.6-15.1 percentage points, while blocking 96\% of critical errors in high-stakes scenarios vs 18-31\% for entropy methods. Limitations include skeleton dependence and frequency-only (not severity-aware) risk control, though performance degrades gracefully under corruption.

replace Adversarial generalization of unfolding (model-based) networks

Authors: Vicky Kouni

Abstract: Unfolding networks are interpretable networks that emerge from iterative algorithms, incorporate prior knowledge of data structure, and are designed to solve inverse problems like compressed sensing, which deals with recovering data from noisy, missing observations. Compressed sensing finds applications in critical domains, from medical imaging to cryptography, where adversarial robustness is crucial to prevent catastrophic failures. However, a solid theoretical understanding of the performance of unfolding networks in the presence of adversarial attacks is still in its infancy. In this paper, we study the adversarial generalization of unfolding networks when perturbed with $l_2$-norm constrained attacks, generated by the fast gradient sign method. Particularly, we choose a family of state-of-the-art overparameterized unfolding networks and deploy a new framework to estimate their adversarial Rademacher complexity. Given this estimate, we provide adversarial generalization error bounds for the networks under study, which are tight with respect to the attack level. To our knowledge, this is the first theoretical analysis of the adversarial generalization of unfolding networks. We further present a series of experiments on real-world data, with results corroborating our derived theory, consistently for all data. Finally, we observe that the family's overparameterization can be exploited to promote adversarial robustness, shedding light on how to efficiently robustify neural networks.
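
A minimal sketch of the $l_2$-norm constrained, FGSM-style perturbation studied here: the loss gradient with respect to the input is rescaled to the attack level in $l_2$ norm. The quadratic loss is a toy stand-in for an unfolding network's reconstruction objective.

import numpy as np

def l2_attack(x, grad_fn, eps=0.1):
    g = grad_fn(x)
    return x + eps * g / (np.linalg.norm(g) + 1e-12)  # ||x_adv - x||_2 = eps

A = np.random.default_rng(0).normal(size=(10, 20))
y = np.random.default_rng(1).normal(size=10)
loss_grad = lambda x: A.T @ (A @ x - y)               # gradient of 0.5||Ax - y||^2

x = np.zeros(20)
x_adv = l2_attack(x, loss_grad)
print(np.linalg.norm(x_adv - x))                      # ~0.1, the attack level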

replace Small LLMs with Expert Blocks Are Good Enough for Hyperparameter Tuning

Authors: Om Naphade, Saksham Bansal, Parikshit Pareek

Abstract: Hyper-parameter Tuning (HPT) is a necessary step in machine learning (ML) pipelines but becomes computationally expensive and opaque with larger models. Recently, Large Language Models (LLMs) have been explored for HPT, yet most rely on models exceeding 100 billion parameters. We propose an Expert Block Framework for HPT using Small LLMs. At its core is the Trajectory Context Summarizer (TCS), a deterministic block that transforms raw training trajectories into structured context, enabling small LLMs to analyze optimization progress with reliability comparable to larger models. Using two locally-run LLMs (phi4:reasoning14B and qwen2.5-coder:32B) and a 10-trial budget, our TCS-enabled HPT pipeline achieves average performance within ~0.9 percentage points of GPT-4 across six diverse tasks.
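
A toy sketch of what a deterministic Trajectory Context Summarizer might compute: a raw validation-loss trajectory is condensed into a small structured context a small LLM can reason over. The fields and thresholds are illustrative assumptions, not the paper's specification.

def summarize_trajectory(val_losses):
    return {
        "trials": len(val_losses),
        "best_loss": round(min(val_losses), 4),
        "last_loss": round(val_losses[-1], 4),
        "improving": val_losses[-1] <= min(val_losses[:-1]) if len(val_losses) > 1 else True,
        "plateau": len(val_losses) >= 3
                   and max(val_losses[-3:]) - min(val_losses[-3:]) < 1e-3,
    }

print(summarize_trajectory([0.92, 0.61, 0.55, 0.5493, 0.5491]))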

replace Local Mechanisms of Compositional Generalization in Conditional Diffusion

Authors: Arwen Bradley

Abstract: Conditional diffusion models appear capable of compositional generalization, i.e., generating convincing samples for out-of-distribution combinations of conditioners, but the mechanisms underlying this ability remain unclear. To make this concrete, we study length generalization, the ability to generate images with more objects than seen during training. In a controlled CLEVR setting (Johnson et al., 2017), we find that length generalization is achievable in some cases but not others, suggesting that models only sometimes learn the underlying compositional structure. We then investigate locality as a structural mechanism for compositional generalization. Prior works proposed score locality as a mechanism for creativity in unconditional diffusion models (Kamb & Ganguli, 2024; Niedoba et al., 2024), but did not address flexible conditioning or compositional generalization. In this paper, we prove an exact equivalence between a specific compositional structure ("conditional projective composition") (Bradley et al., 2025) and scores with sparse dependencies on both pixels and conditioners ("local conditional scores"). This theory also extends to feature-space compositionality. We validate our theory empirically: CLEVR models that succeed at length generalization exhibit local conditional scores, while those that fail do not. Furthermore, we show that a causal intervention explicitly enforcing local conditional scores restores length generalization in a previously failing model. Finally, we investigate feature-space compositionality in color-conditioned CLEVR, and find preliminary evidence of compositional structure in SDXL.

replace Long-Tailed Out-of-Distribution Detection with Refined Separate Class Learning

Authors: Shuai Feng, Yuxin Ge, Yuntao Du, Mingcai Chen, Chongjun Wang, Lei Feng

Abstract: Out-of-distribution (OOD) detection is crucial for deploying robust machine learning models. However, when training data follows a long-tailed distribution, the model's ability to accurately detect OOD samples is significantly compromised, due to the confusion between OOD samples and head/tail classes. To distinguish OOD samples from both head and tail classes, the separate class learning (SCL) approach has emerged as a promising solution, which conducts head-specific and tail-specific class learning separately. To this end, we examine the limitations of existing SCL works and reveal that the OOD detection performance is notably influenced by the use of a static scaling temperature value and the presence of uninformative outliers. To mitigate these limitations, we propose a novel approach termed Refined Separate Class Learning (RSCL), which leverages dynamic class-wise temperature adjustment to modulate the temperature parameter for each in-distribution class and informative outlier mining to identify diverse types of outliers based on their affinity with head and tail classes. Extensive experiments demonstrate that RSCL achieves superior OOD detection performance while improving the classification accuracy on in-distribution data.
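
As a loosely hedged illustration of dynamic class-wise temperature adjustment, the sketch below gives each in-distribution class its own temperature when scaling logits; the frequency-based schedule is an invented stand-in for the paper's actual adjustment rule.

import numpy as np

counts = np.array([5000, 1200, 300, 60, 12])    # long-tailed class frequencies
temps = 1.0 + np.log1p(counts.max() / counts)   # hypothetical: tail gets higher T

def classwise_scale(logits):
    return logits / temps                       # per-class, not a single static T

logits = np.random.default_rng(0).normal(size=5)
print(classwise_scale(logits))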

replace MolPILE - large-scale, diverse dataset for molecular representation learning

Authors: Jakub Adamczyk, Jakub Poziemski, Franciszek Job, Mateusz Kr\'ol, Maciej Makowski

Abstract: The size, diversity, and quality of pretraining datasets critically determine the generalization ability of foundation models. Despite their growing importance in chemoinformatics, the effectiveness of molecular representation learning has been hindered by limitations in existing small molecule datasets. To address this gap, we present MolPILE, a large-scale, diverse, and rigorously curated collection of 222 million compounds, constructed from 6 large-scale databases using an automated curation pipeline. We present a comprehensive analysis of current pretraining datasets, highlighting considerable shortcomings for training ML models, and demonstrate how retraining existing models on MolPILE yields improvements in generalization performance. This work provides a standardized resource for model training, addressing the pressing need for an ImageNet-like dataset in molecular chemistry.

replace APRIL: Active Partial Rollouts in Reinforcement Learning to Tame Long-tail Generation

Authors: Yuzhen Zhou, Jiajun Li, Yusheng Su, Gowtham Ramesh, Zilin Zhu, Xiang Long, Chenyang Zhao, Jin Pan, Xiaodong Yu, Ze Wang, Kangrui Du, Jialian Wu, Ximeng Sun, Jiang Liu, Qiaolin Yu, Hao Chen, Zicheng Liu, Emad Barsoum

Abstract: Reinforcement learning (RL) has become a cornerstone in advancing large-scale pre-trained language models (LLMs). Successive generations, including GPT-o series, DeepSeek-R1, Kimi-K1.5, Grok 4, and GLM-4.5, have relied on large-scale RL training to enhance reasoning and coding capabilities. To meet the community's growing RL needs, numerous RL frameworks have been proposed. However, RL training remains computationally expensive, with rollout generation accounting for more than 90% of total runtime. In addition, its efficiency is often constrained by the long-tail distribution of rollout response lengths, where a few lengthy responses stall entire batches, leaving GPUs idle and underutilized. As model and rollout sizes continue to grow, this bottleneck increasingly limits scalability. To address this challenge, we propose Active Partial Rollouts in Reinforcement Learning (APRIL), which mitigates long-tail inefficiency. In the rollout phase, APRIL over-provisions rollout requests, terminates once the target number of responses is reached, and recycles incomplete responses for continuation in future steps. This strategy ensures that no rollouts are discarded while substantially reducing GPU idle time. Experiments show that APRIL improves rollout throughput by up to 44% across commonly used RL algorithms (GRPO, DAPO, GSPO), accelerates convergence, and achieves up to 8% higher final accuracy across tasks. Moreover, APRIL is both framework and hardware agnostic, already integrated into the slime RL framework, and deployable on NVIDIA and AMD GPUs alike. Taken together, this work unifies system-level and algorithmic considerations in proposing APRIL, with the aim of advancing RL training efficiency and inspiring further optimizations in RL systems. Our codebase is available at https://github.com/RLsys-Foundation/APRIL

URLs: https://github.com/RLsys-Foundation/APRIL
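
A simplified sketch of the rollout phase described above: over-provision requests, stop as soon as the target number of responses has finished, and recycle incomplete responses into the next step. step_generate stands in for one decoding step of the inference engine, and the response lengths are synthetic.

import random

def step_generate(req):
    req["tokens"] += 1
    req["done"] = req["tokens"] >= req["length"]

def april_rollout(pending, target, over_provision):
    batch = pending + [
        {"tokens": 0, "length": random.randint(5, 200), "done": False}
        for _ in range(over_provision - len(pending))
    ]
    finished = []
    while len(finished) < target:        # stop at the target, not when all finish
        for req in batch:
            if not req["done"]:
                step_generate(req)
        finished = [r for r in batch if r["done"]]
    leftovers = [r for r in batch if not r["done"]]
    return finished, leftovers           # leftovers are resumed next step

random.seed(0)
done, carry = april_rollout([], target=8, over_provision=16)
print(len(done), len(carry))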

replace S$^2$Transformer: Scalable Structured Transformers for Global Station Weather Forecasting

Authors: Hongyi Chen, Xiucheng Li, Xinyang Chen, Yun Cheng, Jing Li, Kehai Chen, Liqiang Nie

Abstract: Global Station Weather Forecasting (GSWF) is a key meteorological research area, critical to energy, aviation, and agriculture. Existing time series forecasting methods often ignore or unidirectionally model spatial correlation when conducting large-scale global station forecasting. This contradicts the intrinsic nature underlying observations of the global weather system, limiting forecast performance. To address this, we propose a novel Spatial Structured Attention Block in this paper. It partitions the spatial graph into a set of subgraphs, instantiates Intra-subgraph Attention to learn local spatial correlation within each subgraph, and aggregates nodes into subgraph representations for message passing among the subgraphs via Inter-subgraph Attention -- considering both spatial proximity and global correlation. Building on this block, we develop a multiscale spatiotemporal forecasting model S$^2$Transformer by progressively expanding subgraph scales. The resulting model is scalable, produces structured spatial correlation, and is easy to implement. The experimental results show that it can achieve performance improvements of up to 16.8% over time series forecasting baselines at low running costs.

replace Frictional Q-Learning

Authors: Hyunwoo Kim, Hyo Kyung Lee

Abstract: We draw an analogy between static friction in classical mechanics and extrapolation error in off-policy RL, and use it to formulate a constraint that prevents the policy from drifting toward unsupported actions. In this study, we present Frictional Q-learning, a deep reinforcement learning algorithm for continuous control, which extends batch-constrained reinforcement learning. Our algorithm constrains the agent's action space to encourage behavior similar to that in the replay buffer, while maintaining a distance from the manifold of the orthonormal action space. The constraint preserves the simplicity of the batch-constrained approach and provides an intuitive physical interpretation of extrapolation error. Empirically, we further demonstrate that our algorithm trains robustly and achieves competitive performance across standard continuous control benchmarks.

replace Advancing Universal Deep Learning for Electronic-Structure Hamiltonian Prediction of Materials

Authors: Shi Yin, Zujian Dai, Xinyang Pan, Lixin He

Abstract: Deep learning methods for electronic-structure Hamiltonian prediction have offered significant computational efficiency advantages over traditional DFT methods, yet the diversity of atomic types, structural patterns, and the high-dimensional complexity of Hamiltonians pose substantial challenges to generalization performance. In this work, we contribute on both the methodology and dataset sides to advance a universal deep learning paradigm for Hamiltonian prediction. On the method side, we propose NextHAM, a neural E(3)-symmetric and expressive correction method for efficient and generalizable materials electronic-structure Hamiltonian prediction. First, we introduce the zeroth-step Hamiltonians, which can be efficiently constructed from the initial charge density of DFT, as informative descriptors for the neural regression model at the input level and as initial estimates of the target Hamiltonian at the output level, so that the regression model directly predicts the correction terms to the target ground truths, thereby significantly simplifying the input-output mapping for learning. Second, we present a neural Transformer architecture with strict E(3)-symmetry and high non-linear expressiveness for Hamiltonian prediction. Third, we propose a novel training objective to ensure the accuracy of Hamiltonians in both real space and reciprocal space, preventing error amplification and the occurrence of "ghost states" caused by the large condition number of the overlap matrix. On the dataset side, we curate a high-quality, broad-coverage large benchmark, namely Materials-HAM-SOC, comprising 17,000 material structures spanning 68 elements from six rows of the periodic table and explicitly incorporating SOC effects. Experimental results on Materials-HAM-SOC demonstrate that NextHAM achieves excellent accuracy and efficiency in predicting Hamiltonians and band structures.

replace MCGrad: Multicalibration at Web Scale

Authors: Lorenzo Perini, Daniel Haimovich, Fridolin Linder, Niek Tax, Dima Karamshuk, Milan Vojnovic, Nastaran Okati, Pavlos Athanasios Apostolopoulos

Abstract: We propose MCGrad, a novel and scalable multicalibration algorithm. Multicalibration - calibration in sub-groups of the data - is an important property for the performance of machine learning-based systems. Existing multicalibration methods have thus far gained limited traction in industry. We argue that this is because existing methods (1) require such subgroups to be manually specified, which ML practitioners often struggle with, (2) are not scalable, or (3) may harm other notions of model performance such as log loss and Area Under the Precision-Recall Curve (PRAUC). MCGrad does not require explicit specification of protected groups, is scalable, and often improves other ML evaluation metrics instead of harming them. MCGrad has been in production at Meta, and is now part of hundreds of production models. We present results from these deployments as well as results on public datasets.

replace Discovering Association Rules in High-Dimensional Small Tabular Data

Authors: Erkan Karabulut, Daniel Daza, Paul Groth, Victoria Degeler

Abstract: Association Rule Mining (ARM) aims to discover patterns between features in datasets in the form of propositional rules, supporting both knowledge discovery and interpretable machine learning in high-stakes decision-making. However, in high-dimensional settings, rule explosion and computational overhead render popular algorithmic approaches impractical without effective search space reduction, challenges that propagate to downstream tasks. Neurosymbolic methods, such as Aerial+, have recently been proposed to address the rule explosion in ARM. While they tackle the high dimensionality of the data, they also inherit limitations of neural networks, particularly reduced performance in low-data regimes. This paper makes three key contributions to association rule discovery in high-dimensional tabular data. First, we empirically show that Aerial+ scales one to two orders of magnitude better than state-of-the-art algorithmic and neurosymbolic baselines across five real-world datasets. Second, we introduce the novel problem of ARM in high-dimensional, low-data settings, such as gene expression data from the biomedicine domain with around 18k features and 50 samples. Third, we propose two fine-tuning approaches to Aerial+ using tabular foundation models. Our proposed approaches are shown to significantly improve rule quality on five real-world datasets, demonstrating their effectiveness in low-data, high-dimensional scenarios.

replace Predictive Coding-based Deep Neural Network Fine-tuning for Computationally Efficient Domain Adaptation

Authors: Matteo Cardoni, Sam Leroux

Abstract: As deep neural networks are increasingly deployed in dynamic, real-world environments, relying on a single static model is often insufficient. Changes in input data distributions caused by sensor drift or lighting variations necessitate continual model adaptation. In this paper, we propose a hybrid training methodology that enables efficient on-device domain adaptation by combining the strengths of Backpropagation and Predictive Coding. The method begins with a deep neural network trained offline using Backpropagation to achieve high initial performance. Subsequently, Predictive Coding is employed for online adaptation, allowing the model to recover accuracy lost due to shifts in the input data distribution. This approach leverages the robustness of Backpropagation for initial representation learning and the computational efficiency of Predictive Coding for continual learning, making it particularly well-suited for resource-constrained edge devices or future neuromorphic accelerators. Experimental results on the MNIST and CIFAR-10 datasets demonstrate that this hybrid strategy enables effective adaptation with a reduced computational overhead, offering a promising solution for maintaining model performance in dynamic environments.

replace-cross Predicting Male Domestic Violence Using Explainable Ensemble Learning and Exploratory Data Analysis

Authors: Md Abrar Jahin, Saleh Akram Naife, Fatema Tuj Johora Lima, M. F. Mridha, Md. Jakir Hossen

Abstract: Domestic violence is commonly viewed as a gendered issue that primarily affects women, which tends to leave male victims largely overlooked. This study presents a novel, data-driven analysis of male domestic violence (MDV) in Bangladesh, highlighting the factors that influence it and addressing the challenges posed by a significant categorical imbalance of 5:1 and limited data availability. We collected data from nine major cities in Bangladesh and conducted exploratory data analysis (EDA) to understand the underlying dynamics. EDA revealed patterns such as the high prevalence of verbal abuse, the influence of financial dependency, and the role of familial and socio-economic factors in MDV. To predict and analyze MDV, we implemented 10 traditional machine learning (ML) models, three deep learning models, and two ensemble models, including stacking and hybrid approaches. We propose a stacking ensemble model with ANN and CatBoost as base classifiers and Logistic Regression as the meta-model, which demonstrated the best performance, achieving $95\%$ accuracy, a $99.29\%$ AUC, and balanced metrics across evaluation criteria. Model-specific feature importance analysis of the base classifiers identified key features influencing their decision-making. Model-agnostic explainable AI techniques, such as SHAP and LIME, provided both local and global insights into the decision-making processes of the proposed model, thereby increasing transparency and interpretability. Statistical validation using paired $t$-tests with 10-fold cross-validation and Bonferroni correction ($\alpha = 0.0036$) confirmed the superior performance of our proposed model over alternatives. Our findings challenge the prevailing notion that domestic abuse primarily affects women, emphasizing the need for tailored interventions and support systems for male victims.

replace-cross Can social media provide early warning of retraction? Evidence from critical tweets identified by human annotation and large language models

Authors: Er-Te Zheng, Hui-Zhen Fu, Mike Thelwall, Zhichao Fang

Abstract: Timely detection of problematic research is essential for safeguarding scientific integrity. To explore whether social media commentary can serve as an early indicator of potentially problematic articles, this study analysed 3,815 tweets referencing 604 retracted articles and 3,373 tweets referencing 668 comparable non-retracted articles. Tweets critical of the articles were identified through both human annotation and large language models (LLMs). Human annotation revealed that 8.3% of retracted articles were associated with at least one critical tweet prior to retraction, compared to only 1.5% of non-retracted articles, highlighting the potential of tweets as early warning signals of retraction. However, critical tweets identified by LLMs (GPT-4o mini, Gemini 2.0 Flash-Lite, and Claude 3.5 Haiku) only partially aligned with human annotation, suggesting that fully automated monitoring of post-publication discourse should be applied with caution. A human-AI collaborative approach may offer a more reliable and scalable alternative, with human expertise helping to filter out tweets critical of issues unrelated to the research integrity of the articles. Overall, this study provides insights into how social media signals, combined with generative AI technologies, may support efforts to strengthen research integrity.

replace-cross Generalized Gradient Descent is a Hypergraph Functor

Authors: Tyler Hanks (University of Florida), James Fairbanks (University of Florida), Matthew Klawonn (Air Force Research Lab)

Abstract: Cartesian reverse derivative categories (CRDCs) provide an axiomatic generalization of the reverse derivative, which allows generalized analogues of classic optimization algorithms such as gradient descent to be applied to a broad class of problems. In this paper, we show that generalized gradient descent with respect to a given CRDC induces a hypergraph functor from a hypergraph category of optimization problems to a hypergraph category of dynamical systems. The domain of this functor consists of objective functions that are 1) general in the sense that they are defined with respect to an arbitrary CRDC, and 2) open in that they are decorated spans that can be composed with other such objective functions via variable sharing. The codomain is specified analogously as a category of general and open dynamical systems for the underlying CRDC. We describe how the hypergraph functor induces a distributed optimization algorithm for arbitrary composite problems specified in the domain. To illustrate the kinds of problems our framework can model, we show that parameter sharing models in multitask learning, a prevalent machine learning paradigm, yield a composite optimization problem for a given choice of CRDC. We then apply the gradient descent functor to this composite problem and describe the resulting distributed gradient descent algorithm for training parameter sharing models.

replace-cross Two-level overlapping additive Schwarz preconditioner for training scientific machine learning applications

Authors: Youngkyu Lee, Alena Kopani\v{c}\'akov\'a, George Em Karniadakis

Abstract: We introduce a novel two-level overlapping additive Schwarz preconditioner for accelerating the training of scientific machine learning applications. The design of the proposed preconditioner is motivated by the nonlinear two-level overlapping additive Schwarz preconditioner. The neural network parameters are decomposed into groups (subdomains) with overlapping regions. In addition, the network's feed-forward structure is indirectly imposed through a novel subdomain-wise synchronization strategy and a coarse-level training step. Through a series of numerical experiments, which consider physics-informed neural networks and operator learning approaches, we demonstrate that the proposed two-level preconditioner significantly speeds up the convergence of the standard (LBFGS) optimizer while also yielding more accurate machine learning models. Moreover, the devised preconditioner is designed to take advantage of model-parallel computations, which can further reduce the training time.

replace-cross A Deep Learning Framework for Evaluating Dynamic Network Generative Models and Anomaly Detection

Authors: Alireza Rashnu, Sadegh Aliakbary

Abstract: Understanding dynamic systems like disease outbreaks, social influence, and information diffusion requires effective modeling of complex networks. Traditional evaluation methods for static networks often fall short when applied to temporal networks. This paper introduces DGSP-GCN (Dynamic Graph Similarity Prediction based on Graph Convolutional Network), a deep learning-based framework that integrates graph convolutional networks with dynamic graph signal processing techniques to provide a unified solution for evaluating generative models and detecting anomalies in dynamic networks. DGSP-GCN assesses how well a generated network snapshot matches the expected temporal evolution, incorporating an attention mechanism to improve embedding quality and capture dynamic structural changes. The approach was tested on five real-world datasets: WikiMath, Chickenpox, PedalMe, MontevideoBus, and MetraLa. Results show that DGSP-GCN outperforms baseline methods, such as time series regression and random similarity assignment, achieving the lowest error rates (MSE of 0.0645, MAE of 0.1781, RMSE of 0.2507). These findings highlight DGSP-GCN's effectiveness in evaluating and detecting anomalies in dynamic networks, offering valuable insights for network evolution and anomaly detection research.

replace-cross Towards Complete Causal Explanation with Expert Knowledge

Authors: Aparajithan Venkateswaran, Emilija Perkovi\'c

Abstract: We study the problem of restricting a Markov equivalence class of maximal ancestral graphs (MAGs) to only those MAGs that contain certain edge marks, which we refer to as expert or orientation knowledge. Such a restriction of the Markov equivalence class can be uniquely represented by a restricted essential ancestral graph. Our contributions are several-fold. First, we prove certain properties for the entire Markov equivalence class including a conjecture from Ali et al. (2009). Second, we present several new sound graphical orientation rules for adding orientation knowledge to an essential ancestral graph. We also show that some orientation rules of Zhang (2008b) are not needed for restricting the Markov equivalence class with orientation knowledge. Third, we provide an algorithm for including this orientation knowledge and show that in certain settings the output of our algorithm is a restricted essential ancestral graph. Finally, outside of the specified settings, we provide an algorithm for checking whether a graph is a restricted essential graph and discuss its runtime. This work can be seen as a generalization of Meek (1995) to settings which allow for latent confounding.

replace-cross Text-Augmented Multimodal LLMs for Chemical Reaction Condition Recommendation

Authors: Yu Zhang, Ruijie Yu, Kaipeng Zeng, Ding Li, Feng Zhu, Xiaokang Yang, Yaohui Jin, Yanyan Xu

Abstract: Identifying reaction conditions that are broadly applicable across diverse substrates is a longstanding challenge in chemical and pharmaceutical research. While many methods are available to generate conditions with acceptable performance, a universal approach for reliably discovering effective conditions during reaction exploration is rare. Consequently, current reaction optimization processes are often labor-intensive, time-consuming, and costly, relying heavily on trial-and-error experimentation. Nowadays, large language models (LLMs) are capable of tackling chemistry-related problems, such as molecule design and chemical reasoning tasks. Here, we report the design, implementation and application of Chemma-RC, a text-augmented multimodal LLM that identifies effective conditions through task-specific dialogue and condition generation. Chemma-RC learns a unified representation of chemical reactions by aligning multiple modalities -- including text corpus, reaction SMILES, and reaction graphs -- within a shared embedding module. Performance benchmarking on datasets showed high precision in identifying optimal conditions, with up to 17% improvement over the current state-of-the-art methods. A palladium-catalysed imidazole C-H arylation reaction was investigated experimentally to evaluate the functionalities of Chemma-RC in practice. Our findings suggest that Chemma-RC holds significant potential to accelerate high-throughput condition screening in chemical synthesis.

replace-cross Real-time Hybrid System Identification with Online Deterministic Annealing

Authors: Christos Mavridis, Karl Henrik Johansson

Abstract: We introduce a real-time identification method for discrete-time state-dependent switching systems in both the input--output and state-space domains. In particular, we design a system of adaptive algorithms running on two timescales; a stochastic approximation algorithm implements an online deterministic annealing scheme at a slow timescale and estimates the mode-switching signal, and a recursive identification algorithm runs at a faster timescale and updates the parameters of the local models based on the estimate of the switching signal. We first focus on piecewise affine systems and discuss identifiability conditions and convergence properties based on the theory of two-timescale stochastic approximation. In contrast to standard identification algorithms for switched systems, the proposed approach gradually estimates the number of modes and is appropriate for real-time system identification using sequential data acquisition. The progressive nature of the algorithm improves computational efficiency and provides real-time control over the performance-complexity trade-off. Finally, we address specific challenges that arise in the application of the proposed methodology in identification of more general switching systems. Simulation results validate the efficacy of the proposed methodology.

replace-cross Lightweight Modular Parameter-Efficient Tuning for Open-Vocabulary Object Detection

Authors: Bilal Faye, Hanane Azzag, Mustapha Lebbah

Abstract: Open-vocabulary object detection (OVD) extends recognition beyond fixed taxonomies by aligning visual and textual features, as in MDETR, GLIP, or RegionCLIP. While effective, these models require updating all parameters of large vision--language backbones, leading to prohibitive training cost. Recent efficient OVD approaches, inspired by parameter-efficient fine-tuning methods such as LoRA or adapters, reduce trainable parameters but often face challenges in selecting which layers to adapt and in balancing efficiency with accuracy. We propose UniProj-Det, a lightweight modular framework for parameter-efficient OVD. UniProj-Det freezes pretrained backbones and introduces a Universal Projection module with a learnable modality token, enabling unified vision--language adaptation at minimal cost. Applied to MDETR, our framework trains only ~2-5% of the parameters while achieving competitive or superior performance on phrase grounding, referring expression comprehension, and segmentation. Comprehensive analysis of FLOPs, memory, latency, and ablations demonstrates UniProj-Det as a principled step toward scalable and efficient open-vocabulary detection.

replace-cross The Northeast Materials Database for Magnetic Materials

Authors: Suman Itani, Yibo Zhang, Jiadong Zang

Abstract: The discovery of magnetic materials with high operating temperature ranges and optimized performance is essential for advanced applications. Current data-driven approaches are limited by the lack of accurate, comprehensive, and feature-rich databases. This study aims to address this challenge by using Large Language Models (LLMs) to create a comprehensive, experiment-based magnetic materials database named the Northeast Materials Database (NEMAD), which consists of 67,573 magnetic materials entries (www.nemad.org). The database incorporates chemical composition, magnetic phase transition temperatures, structural details, and magnetic properties. Enabled by NEMAD, we trained machine learning models to classify materials and predict transition temperatures. Our classification model achieved an accuracy of 90% in categorizing materials as ferromagnetic (FM), antiferromagnetic (AFM), and non-magnetic (NM). The regression models predict Curie (N\'eel) temperatures with a coefficient of determination (R$^2$) of 0.87 (0.83) and a mean absolute error (MAE) of 56K (38K). These models identified 25 (13) FM (AFM) candidates with a predicted Curie (N\'eel) temperature above 500K (100K) from the Materials Project. This work shows the feasibility of combining LLMs for automated data extraction and machine learning models to accelerate the discovery of magnetic materials.

replace-cross Hybrid Summary Statistics

Authors: T. Lucas Makinen, Ce Sui, Benjamin D. Wandelt, Natalia Porqueres, Alan Heavens

Abstract: We present a way to capture high-information posteriors from training sets that are sparsely sampled over the parameter space for robust simulation-based inference. In physical inference problems, we can often apply domain knowledge to define traditional summary statistics to capture some of the information in a dataset. We show that augmenting these statistics with neural network outputs to maximise the mutual information improves information extraction compared to neural summaries alone or their concatenation to existing summaries, and makes inference robust in settings with low training data. We 1) introduce two loss formalisms to achieve this and 2) apply the technique to two different cosmological datasets to extract non-Gaussian parameter information.

replace-cross Scaling Laws for Online Advertisement Retrieval

Authors: Yunli Wang, Zhen Zhang, Zixuan Yang, Tianyu Xu, Zhiqiang Wang, Yu Li, Rufan Zhou, Zhiqiang Liu, Yanjie Zhu, Jian Yang, Shiyang Wen, Peng Jiang

Abstract: The scaling law is a notable property of neural network models and has significantly propelled the development of large language models. Scaling laws hold great promise in guiding model design and resource allocation. Recent research increasingly shows that scaling laws are not limited to NLP tasks or Transformer architectures; they also apply to domains such as recommendation. However, there is still a lack of literature on scaling law research in online advertisement retrieval systems. This may be because 1) identifying the scaling law for resource cost and online revenue is often expensive in both time and training resources for industrial applications, and 2) varying settings for different systems prevent the scaling law from being applied across various scenarios. To address these issues, we propose a lightweight paradigm to identify online scaling laws of retrieval models, incorporating a novel offline metric and an offline simulation algorithm. We prove that under mild assumptions, the correlation between the novel metric and online revenue asymptotically approaches 1 and empirically validates its effectiveness. The simulation algorithm can estimate the machine cost offline. Based on the lightweight paradigm, we can identify online scaling laws for retrieval models almost exclusively through offline experiments, and quickly estimate machine costs and revenues for given model configurations. We further validate the existence of scaling laws across mainstream model architectures (e.g., Transformer, MLP, and DSSM) in our real-world advertising system. With the identified scaling laws, we demonstrate practical applications for ROI-constrained model designing and multi-scenario resource allocation in the online advertising system. To the best of our knowledge, this is the first work to study identification and application of online scaling laws for online advertisement retrieval.

replace-cross The Asymptotic Behavior of Attention in Transformers

Authors: \'Alvaro Rodr\'iguez Abella, Jo\~ao Pedro Silvestre, Paulo Tabuada

Abstract: The transformer architecture has become the foundation of modern Large Language Models (LLMs), yet its theoretical properties are still not well understood. As with classic neural networks, a common approach to improve these models is to increase their size and depth. However, such strategies may be suboptimal, as several works have shown that adding more layers yields increasingly diminishing returns. More importantly, prior studies have shown that increasing depth may lead to model collapse, i.e., all the tokens converge to a single cluster, undermining the ability of LLMs to generate diverse outputs. Building on differential equation models for the transformer dynamics, we prove that all the tokens in a transformer asymptotically converge to a cluster as depth increases. At the technical level we leverage tools from control theory, including consensus dynamics on manifolds and input-to-state stability (ISS). We then extend our analysis to autoregressive models, exploiting their structure to further generalize the theoretical guarantees.

replace-cross The Value of Information in Human-AI Decision-making

Authors: Ziyang Guo, Yifan Wu, Jason Hartline, Jessica Hullman

Abstract: Multiple agents are increasingly combined to make decisions with the expectation of achieving complementary performance, where the decisions they make together outperform those made individually. However, knowing how to improve the performance of collaborating agents requires knowing what information and strategies each agent employs. With a focus on human-AI pairings, we contribute a decision-theoretic framework for characterizing the value of information. By defining complementary information, our approach identifies opportunities for agents to better exploit available information in AI-assisted decision workflows. We present a novel explanation technique (ILIV-SHAP) that adapts SHAP explanations to highlight human-complementing information. We validate the effectiveness of ACIV and ILIV-SHAP through a study of human-AI decision-making, and demonstrate the framework on examples from chest X-ray diagnosis and deepfake detection. We find that presenting ILIV-SHAP alongside AI predictions leads to reliably greater reductions in error over non-AI-assisted decisions than vanilla SHAP does.

replace-cross Quantifying depressive mental states with large language models

Authors: Jakub Onysk, Quentin J. M. Huys

Abstract: Large Language Models (LLMs) may have an important role to play in mental health by facilitating the quantification of verbal expressions used to communicate emotions, feelings and thoughts. While there has been substantial and very promising work in this area, the fundamental limits are uncertain. Here, focusing on depressive symptoms, we outline and evaluate LLM performance on three critical tests. The first test evaluates LLM performance on a novel ground-truth dataset from a large human sample (n=770). This dataset is novel as it contains both standard clinically validated quantifications of depression symptoms and specific verbal descriptions of the thoughts related to each symptom by the same individual. The performance of LLMs on this richly informative data shows an upper bound on the performance in this domain, and allow us to examine the extent to which inference about symptoms generalises. Second, we test to what extent the latent structure in LLMs can capture the clinically observed patterns. We train supervised sparse auto-encoders (sSAE) to predict specific symptoms and symptom patterns within a syndrome. We find that sSAE weights can effectively modify the clinical pattern produced by the model, and thereby capture the latent structure of relevant clinical variation. Third, if LLMs correctly capture and quantify relevant mental states, then these states should respond to changes in emotional states induced by validated emotion induction interventions. We show that this holds in a third experiment with 190 participants. Overall, this work provides foundational insights into the quantification of pathological mental states with LLMs, highlighting hard limits on the requirements of the data underlying LLM-based quantification; but also suggesting LLMs show substantial conceptual alignment.

replace-cross Thinking Outside the (Gray) Box: A Context-Based Score for Assessing Value and Originality in Neural Text Generation

Authors: Giorgio Franceschelli, Mirco Musolesi

Abstract: Despite the increasing use of large language models for creative tasks, their outputs often lack diversity. Common solutions, such as sampling at higher temperatures, can compromise the quality of the results. Dealing with this trade-off is still an open challenge in designing AI systems for creativity. Drawing on information theory, we propose a context-based score to quantitatively evaluate value and originality. This score incentivizes accuracy and adherence to the request while fostering divergence from the learned distribution. We show that our score can be used as a reward in a reinforcement learning framework to fine-tune large language models for maximum performance. We validate our strategy through experiments considering a variety of creative tasks, such as poetry generation and math problem solving, demonstrating that it enhances the value and originality of the generated solutions.

replace-cross Multimodal AI predicts clinical outcomes of drug combinations from preclinical data

Authors: Yepeng Huang, Xiaorui Su, Varun Ullanat, Intae Moon, Ivy Liang, Lindsay Clegg, Damilola Olabode, Ruthie Johnson, Nicholas Ho, Megan Gibbs, Alexander Gusev, Bino John, Marinka Zitnik

Abstract: Predicting clinical outcomes from preclinical data is essential for identifying safe and effective drug combinations, reducing late-stage clinical failures, and accelerating the development of precision therapies. Current AI models rely on structural or target-based features but fail to incorporate the multimodal data necessary for accurate, clinically relevant predictions. Here, we introduce Madrigal, a multimodal AI model that learns from structural, pathway, cell viability, and transcriptomic data to predict drug-combination effects across 953 clinical outcomes and 21,842 compounds, including combinations of approved drugs and novel compounds in development. Madrigal uses an attention bottleneck module to unify preclinical drug data modalities while handling missing data during training and inference, a major challenge in multimodal learning. It outperforms single-modality methods and state-of-the-art models in predicting adverse drug interactions, and ablations show both modality alignment and multimodality are necessary. It captures transporter-mediated interactions and aligns with head-to-head clinical trial differences for neutropenia, anemia, alopecia, and hypoglycemia. In type 2 diabetes and MASH, Madrigal supports polypharmacy decisions and prioritizes resmetirom among safer candidates. Extending to personalization, Madrigal improves patient-level adverse-event prediction in a longitudinal EHR cohort and an independent oncology cohort, and predicts ex vivo efficacy in primary acute myeloid leukemia samples and patient-derived xenograft models. Madrigal links preclinical multimodal readouts to safety risks of drug combinations and offers a generalizable foundation for safer combination design.

replace-cross Scaling Rich Style-Prompted Text-to-Speech Datasets

Authors: Anuj Diwan, Zhisheng Zheng, David Harwath, Eunsol Choi

Abstract: We introduce Paralinguistic Speech Captions (ParaSpeechCaps), a large-scale dataset that annotates speech utterances with rich style captions. While rich abstract tags (e.g. guttural, nasal, pained) have been explored in small-scale human-annotated datasets, existing large-scale datasets only cover basic tags (e.g. low-pitched, slow, loud). We combine off-the-shelf text and speech embedders, classifiers and an audio language model to automatically scale rich tag annotations for the first time. ParaSpeechCaps covers a total of 59 style tags, including both speaker-level intrinsic tags and utterance-level situational tags. It consists of 342 hours of human-labelled data (PSC-Base) and 2427 hours of automatically annotated data (PSC-Scaled). We finetune Parler-TTS, an open-source style-prompted TTS model, on ParaSpeechCaps, and achieve improved style consistency (+7.9% Consistency MOS) and speech quality (+15.5% Naturalness MOS) over the best performing baseline that combines existing rich style tag datasets. We ablate several of our dataset design choices to lay the foundation for future work in this space. Our dataset, models and code are released at https://github.com/ajd12342/paraspeechcaps .

URLs: https://github.com/ajd12342/paraspeechcaps

replace-cross Beyond SHAP and Anchors: A large-scale experiment on how developers struggle to design meaningful end-user explanations

Authors: Zahra Abba Omar, Nadia Nahar, Jacob Tjaden, In\`es M. Gilles, Fikir Mekonnen, Erica Okeh, Jane Hsieh, Christian K\"astner, Alka Menon

Abstract: Modern machine learning produces models that are impossible for users or developers to fully understand -- raising concerns about trust, oversight, safety, and human dignity when they are integrated into software products. Transparency and explainability methods aim to provide some help in understanding models, but it remains challenging for developers to design explanations that are understandable to target users and effective for their purpose. Emerging guidelines and regulations set goals but may not provide effective actionable guidance to developers. In a large-scale experiment with 124 participants, we explored how developers approach providing end-user explanations, including what challenges they face, and to what extent specific policies can guide their actions. We investigated whether and how specific forms of policy guidance help developers design explanations and provide evidence for policy compliance for an ML-powered screening tool for diabetic retinopathy. Participants across the board struggled to produce quality explanations and comply with the provided policies. Contrary to our expectations, we found that the nature and specificity of policy guidance had little effect. We posit that participant noncompliance is in part due to a failure to imagine and anticipate the needs of non-technical stakeholders. Drawing on cognitive process theory and the sociological imagination to contextualize participants' failure, we recommend educational interventions.

replace-cross Revenue Maximization Under Sequential Price Competition Via The Estimation Of s-Concave Demand Functions

Authors: Daniele Bracale, Moulinath Banerjee, Cong Shi, Yuekai Sun

Abstract: We consider price competition among multiple sellers over a selling horizon of $T$ periods. In each period, sellers simultaneously offer their prices (which are made public) and subsequently observe their respective demand (not made public). The demand function of each seller depends on all sellers' prices through a private, unknown, and nonlinear relationship. We propose a dynamic pricing policy that uses semi-parametric least-squares estimation and show that when the sellers employ our policy, their prices converge at a rate of $O(T^{-1/7})$ to the Nash equilibrium prices that sellers would reach if they were fully informed. Each seller incurs a regret of $O(T^{5/7})$ relative to a dynamic benchmark policy. A theoretical contribution of our work is proving the existence of equilibrium under shape-constrained demand functions via the concept of $s$-concavity and establishing regret bounds of our proposed policy. Technically, we also establish new concentration results for the least squares estimator under shape constraints. Our findings offer significant insights into dynamic competition-aware pricing and contribute to the broader study of non-parametric learning in strategic decision-making.

replace-cross Inverse Reinforcement Learning with Dynamic Reward Scaling for LLM Alignment

Authors: Ruoxi Cheng, Haoxuan Ma, Weixin Wang, Ranjie Duan, Jiexi Liu, Xiaoshuang Jia, Simeng Qin, Xiaochun Cao, Yang Liu, Xiaojun Jia

Abstract: Alignment is vital for safely deploying large language models (LLMs). Existing techniques are either reward-based (train a reward model on preference pairs and optimize with reinforcement learning) or reward-free (directly fine-tune on ranked outputs). Recent research shows that well-tuned reward-based pipelines remain robust, and single-response demonstrations can outperform pairwise preference data. However, two challenges persist: (1) imbalanced safety datasets that overrepresent common hazards while neglecting long-tail threats; and (2) static reward models that ignore task difficulty, limiting optimization efficiency and attainable gains. We propose DR-IRL (Dynamically adjusting Rewards through Inverse Reinforcement Learning). We first train category-specific reward models via IRL on a balanced safety dataset covering seven harmful categories. Then we enhance Group Relative Policy Optimization (GRPO) by introducing dynamic reward scaling, which adjusts rewards according to task difficulty: data-level hardness is measured by text-encoder cosine similarity, and model-level responsiveness by reward gaps. Extensive experiments across various benchmarks and LLMs demonstrate that DR-IRL outperforms all baseline methods in safety alignment while maintaining usefulness.
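
To make the scaling idea concrete, here is a minimal sketch of a difficulty-scaled reward in the spirit described above; the function names, weighting scheme, and constants are hypothetical stand-ins, not the paper's implementation.

```python
import numpy as np

def cosine_sim(a, b):
    # Cosine similarity between two embedding vectors.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def scaled_reward(base_reward, prompt_emb, reference_emb, reward_gap,
                  alpha=0.5, beta=0.5):
    """Difficulty-aware reward scaling (illustrative sketch, not DR-IRL itself).

    Data-level hardness: prompts dissimilar to a reference embedding of
    well-covered training data are treated as harder.
    Model-level responsiveness: a small gap between the best and worst
    sampled responses' rewards suggests the task is hard to improve on.
    """
    hardness = 1.0 - cosine_sim(prompt_emb, reference_emb)  # larger = harder
    responsiveness = 1.0 / (1.0 + abs(reward_gap))          # small gap = harder
    return base_reward * (1.0 + alpha * hardness + beta * responsiveness)
```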

replace-cross Inference-Time Scaling for Generalist Reward Modeling

Authors: Zijun Liu, Peiyi Wang, Runxin Xu, Shirong Ma, Chong Ruan, Peng Li, Yang Liu, Yu Wu

Abstract: Reinforcement learning (RL) has been widely adopted in post-training for large language models (LLMs) at scale. Recently, the incentivization of reasoning capabilities in LLMs from RL indicates that $\textit{proper learning methods could enable effective inference-time scalability}$. A key challenge of RL is to obtain accurate reward signals for LLMs in various domains beyond verifiable questions or artificial rules. In this work, we investigate how to improve reward modeling (RM) with more inference compute for general queries, i.e. the $\textbf{inference-time scalability of generalist RM}$, and further, how to improve the effectiveness of performance-compute scaling with proper learning methods. For the RM approach, we adopt pointwise generative reward modeling (GRM) to enable flexibility for different input types and potential for inference-time scaling. For the learning method, we propose Self-Principled Critique Tuning (SPCT) to foster scalable reward generation behaviors in GRMs through online RL, to generate principles adaptively and critiques accurately, resulting in $\textbf{DeepSeek-GRM}$ models. Furthermore, for effective inference-time scaling, we use parallel sampling to expand compute usage, and introduce a meta RM to guide the voting process for better scaling performance. Empirically, we show that SPCT significantly improves the quality and scalability of GRMs, outperforming existing methods and models in various RM benchmarks without severe biases, and could achieve better performance compared to training-time scaling. DeepSeek-GRM still faces challenges in some tasks, which we believe can be addressed by future efforts in generalist reward systems. The models are released at Hugging Face and ModelScope.
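
The inference-time recipe lends itself to a compact sketch: sample several independent reward generations in parallel, let a meta reward model filter them, and vote over the survivors. The callables below are assumptions standing in for the released models, not their actual interfaces.

```python
def scale_reward_inference(grm, meta_rm, query, response,
                           n_samples=8, top_k=4):
    """Parallel sampling with meta-RM-guided voting (illustrative sketch).

    grm(query, response) -> (critique_text, scalar_score)
    meta_rm(query, response, critique_text) -> reliability score
    """
    judgments = [grm(query, response) for _ in range(n_samples)]
    ranked = sorted(judgments,
                    key=lambda j: meta_rm(query, response, j[0]),
                    reverse=True)
    kept = ranked[:top_k]                      # drop low-quality generations
    return sum(score for _, score in kept) / len(kept)
```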

replace-cross Uncertainty-Aware Surrogate-based Amortized Bayesian Inference for Computationally Expensive Models

Authors: Stefania Scheurer, Philipp Reiser, Tim Brünnette, Wolfgang Nowak, Anneli Guthke, Paul-Christian Bürkner

Abstract: Bayesian inference typically relies on a large number of model evaluations to estimate posterior distributions. Established methods like Markov Chain Monte Carlo (MCMC) and Amortized Bayesian Inference (ABI) can become computationally challenging. While ABI enables fast inference after training, generating sufficient training data still requires thousands of model simulations, which is infeasible for expensive models. Surrogate models offer a solution by providing approximate simulations at a lower computational cost, allowing the generation of large data sets for training. However, the introduced approximation errors and uncertainties can lead to overconfident posterior estimates. To address this, we propose Uncertainty-Aware Surrogate-based Amortized Bayesian Inference (UA-SABI) -- a framework that combines surrogate modeling and ABI while explicitly quantifying and propagating surrogate uncertainties through the inference pipeline. Our experiments show that this approach enables reliable, fast, and repeated Bayesian inference for computationally expensive models, even under tight time constraints.
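
One simple way to propagate surrogate uncertainty, shown below as a hedged sketch rather than the paper's actual pipeline, is to inflate the observation noise with the surrogate's predictive variance, so that approximation error widens rather than sharpens the posterior.

```python
import numpy as np

def uncertainty_aware_loglik(y_obs, surrogate_mean, surrogate_std, noise_std):
    """Gaussian log-likelihood with surrogate uncertainty folded in (sketch).

    Total predictive variance = observation noise + surrogate uncertainty,
    which guards against overconfident posteriors when the surrogate is
    queried far from its training data.
    """
    var = noise_std ** 2 + surrogate_std ** 2
    return float(-0.5 * np.sum((y_obs - surrogate_mean) ** 2 / var
                               + np.log(2.0 * np.pi * var)))
```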

replace-cross Ambiguity Resolution in Text-to-Structured Data Mapping

Authors: Zhibo Hu, Chen Wang, Yanfeng Shu, Hye-Young Paik, Liming Zhu

Abstract: Ambiguity in natural language is a significant obstacle to accurate text-to-structured-data mapping with large language models (LLMs), affecting the performance of tasks such as mapping text to agentic tool calls and text-to-SQL queries. Existing methods for handling ambiguity either rely on the ReACT framework to obtain correct mappings through trial and error, or on supervised fine-tuning to bias models toward specific tasks. In this paper, we adopt a different approach that characterizes representation differences of ambiguous text in the latent space and leverages these differences to identify ambiguity before mapping the text to structured data. To detect sentence-level ambiguity, we focus on the relationship between ambiguous questions and their interpretations. Unlike distances calculated by dense embeddings, we introduce a new distance measure based on a path kernel over concepts. With this measure, we identify patterns that distinguish ambiguous from unambiguous questions. Furthermore, we propose a method for improving LLM performance on ambiguous agentic tool calling through missing concept prediction. Both the detection measure and the prediction method achieve state-of-the-art results.

replace-cross CONSIGN: Conformal Segmentation Informed by Spatial Groupings via Decomposition

Authors: Bruno Viti, Elias Karabelas, Martin Holler

Abstract: Most machine learning-based image segmentation models produce pixel-wise confidence scores that represent the model's predicted probability for each class label at every pixel. While this information can be particularly valuable in high-stakes domains such as medical imaging, these scores are heuristic in nature and do not constitute rigorous quantitative uncertainty estimates. Conformal prediction (CP) provides a principled framework for transforming heuristic confidence scores into statistically valid uncertainty estimates. However, applying CP directly to image segmentation ignores the spatial correlations between pixels, a fundamental characteristic of image data. This can result in overly conservative and less interpretable uncertainty estimates. To address this, we propose CONSIGN (Conformal Segmentation Informed by Spatial Groupings via Decomposition), a CP-based method that incorporates spatial correlations to improve uncertainty quantification in image segmentation. Our method generates meaningful prediction sets that come with user-specified, high-probability error guarantees. It is compatible with any pre-trained segmentation model capable of generating multiple sample outputs. We evaluate CONSIGN against two CP baselines across three medical imaging datasets and two COCO dataset subsets, using three different pre-trained segmentation models. Results demonstrate that accounting for spatial structure significantly improves performance across multiple metrics and enhances the quality of uncertainty estimates.
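
For readers unfamiliar with the CP machinery the method builds on, the split-conformal calibration step looks roughly like the following; this is the generic score-thresholding version, without CONSIGN's spatial groupings.

```python
import numpy as np

def conformal_threshold(calibration_scores, alpha=0.1):
    """Finite-sample-adjusted quantile for split conformal prediction.

    calibration_scores: one nonconformity score per calibration example.
    Prediction sets built by thresholding at the returned value attain
    >= 1 - alpha marginal coverage under exchangeability.
    """
    n = len(calibration_scores)
    level = min(1.0, np.ceil((n + 1) * (1.0 - alpha)) / n)
    return float(np.quantile(calibration_scores, level, method="higher"))
```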

replace-cross R&D-Agent-Quant: A Multi-Agent Framework for Data-Centric Factors and Model Joint Optimization

Authors: Yuante Li, Xu Yang, Xiao Yang, Minrui Xu, Xisen Wang, Weiqing Liu, Jiang Bian

Abstract: Financial markets pose fundamental challenges for asset return prediction due to their high dimensionality, non-stationarity, and persistent volatility. Despite advances in large language models and multi-agent systems, current quantitative research pipelines suffer from limited automation, weak interpretability, and fragmented coordination across key components such as factor mining and model innovation. In this paper, we propose R&D-Agent for Quantitative Finance, in short RD-Agent(Q), the first data-centric multi-agent framework designed to automate the full-stack research and development of quantitative strategies via coordinated factor-model co-optimization. RD-Agent(Q) decomposes the quant process into two iterative stages: a Research stage that dynamically sets goal-aligned prompts, formulates hypotheses based on domain priors, and maps them to concrete tasks, and a Development stage that employs a code-generation agent, Co-STEER, to implement task-specific code, which is then executed in real-market backtests. The two stages are connected through a feedback stage that thoroughly evaluates experimental outcomes and informs subsequent iterations, with a multi-armed bandit scheduler for adaptive direction selection. Empirically, RD-Agent(Q) achieves up to 2X higher annualized returns than classical factor libraries using 70% fewer factors, and outperforms state-of-the-art deep time-series models on real markets. Its joint factor-model optimization delivers a strong balance between predictive accuracy and strategy robustness. Our code is available at: https://github.com/microsoft/RD-Agent.

URLs: https://github.com/microsoft/RD-Agent

replace-cross Flexible MOF Generation with Torsion-Aware Flow Matching

Authors: Nayoung Kim, Seongsu Kim, Sungsoo Ahn

Abstract: Designing metal-organic frameworks (MOFs) with novel chemistries is a longstanding challenge due to their large combinatorial space and complex 3D arrangements of the building blocks. While recent deep generative models have enabled scalable MOF generation, they assume (1) a fixed set of building blocks and (2) known local 3D coordinates of building blocks. However, this limits their ability to (1) design novel MOFs and (2) generate the structure using novel building blocks. We propose a two-stage MOF generation framework that overcomes these limitations by modeling both chemical and geometric degrees of freedom. First, we train an SMILES-based autoregressive model to generate metal and organic building blocks, paired with a cheminformatics toolkit for 3D structure initialization. Second, we introduce a flow matching model that predicts translations, rotations, and torsional angles to assemble the blocks into valid 3D frameworks. Our experiments demonstrate improved reconstruction accuracy, the generation of valid, novel, and unique MOFs, and the ability to create novel building blocks. Our code is available at https://github.com/nayoung10/MOFFlow-2.

URLs: https://github.com/nayoung10/MOFFlow-2

replace-cross Streaming Flow Policy: Simplifying diffusion/flow-matching policies by treating action trajectories as flow trajectories

Authors: Sunshine Jiang, Xiaolin Fang, Nicholas Roy, Tomás Lozano-Pérez, Leslie Pack Kaelbling, Siddharth Ancha

Abstract: Recent advances in diffusion/flow-matching policies have enabled imitation learning of complex, multi-modal action trajectories. However, they are computationally expensive because they sample a trajectory of trajectories: a diffusion/flow trajectory of action trajectories. They discard intermediate action trajectories, and must wait for the sampling process to complete before any actions can be executed on the robot. We simplify diffusion/flow policies by treating action trajectories as flow trajectories. Instead of starting from pure noise, our algorithm samples from a narrow Gaussian around the last action. Then, it incrementally integrates a velocity field learned via flow matching to produce a sequence of actions that constitute a single trajectory. This enables actions to be streamed to the robot on-the-fly during the flow sampling process, and is well-suited for receding horizon policy execution. Despite streaming, our method retains the ability to model multi-modal behavior. We train flows that stabilize around demonstration trajectories to reduce distribution shift and improve imitation learning performance. Streaming flow policy outperforms prior methods while enabling faster policy execution and tighter sensorimotor loops for learning-based robot control. Project website: https://streaming-flow-policy.github.io/

URLs: https://streaming-flow-policy.github.io/
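
The core loop is simple enough to sketch: start from a narrow Gaussian around the last executed action and stream each Euler step of the learned velocity field as an action. `velocity_field` below is an assumed stand-in for the trained network, and the step count and noise scale are hypothetical.

```python
import numpy as np

def stream_flow_actions(velocity_field, last_action, n_steps=16, sigma=0.05):
    """Yield actions one by one while integrating the flow (sketch).

    velocity_field(action, t) -> velocity, learned via flow matching.
    Each intermediate integration state is itself an executable action,
    so the robot can act before sampling finishes.
    """
    a = last_action + sigma * np.random.randn(*np.shape(last_action))
    dt = 1.0 / n_steps
    for step in range(n_steps):
        a = a + dt * velocity_field(a, step * dt)  # one Euler step
        yield a                                    # stream immediately
```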

replace-cross Bayesian Attention Mechanism: A Probabilistic Framework for Positional Encoding and Context Length Extrapolation

Authors: Arthur S. Bianchessi, Yasmin C. Aguirre, Rodrigo C. Barros, Lucas S. Kupssinskü

Abstract: Transformer-based language models rely on positional encoding (PE) to handle token order and support context length extrapolation. However, existing PE methods lack theoretical clarity and rely on limited evaluation metrics to substantiate their extrapolation claims. We propose the Bayesian Attention Mechanism (BAM), a theoretical framework that formulates positional encoding as a prior within a probabilistic model. BAM unifies existing methods (e.g., NoPE and ALiBi) and motivates a new Generalized Gaussian positional prior that substantially improves long-context generalization. Empirically, BAM enables accurate information retrieval at $500\times$ the training context length, outperforming previous state-of-the-art context length generalization in long context retrieval accuracy while maintaining comparable perplexity and introducing minimal additional parameters.
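
As a rough illustration of what a generalized Gaussian positional prior means in practice, the sketch below turns the prior's log-density over relative distance into an additive attention bias. This is one reading of the idea rather than BAM's exact parameterization; the scale and exponent are hypothetical free parameters.

```python
import numpy as np

def generalized_gaussian_bias(seq_len, scale=64.0, power=2.0):
    """Additive attention-logit bias from a generalized Gaussian prior.

    log p(d) ~ -(d / scale) ** power over relative distance d = |i - j|.
    power=1 recovers an ALiBi-style linear penalty; power=2 is Gaussian.
    """
    pos = np.arange(seq_len)
    dist = np.abs(pos[:, None] - pos[None, :]).astype(float)
    return -(dist / scale) ** power   # add to attention logits before softmax
```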

replace-cross Tensor State Space-based Dynamic Multilayer Network Modeling

Authors: Tian Lan, Jie Guo, Chen Zhang

Abstract: Understanding the complex interactions within dynamic multilayer networks is critical for advancements in various scientific domains. Existing models often fail to capture such networks' temporal and cross-layer dynamics. This paper introduces a novel Tensor State Space Model for Dynamic Multilayer Networks (TSSDMN), utilizing a latent space model framework. TSSDMN employs a symmetric Tucker decomposition to represent latent node features, their interaction patterns, and layer transitions. Then by fixing the latent features and allowing the interaction patterns to evolve over time, TSSDMN uniquely captures both the temporal dynamics within layers and across different layers. The model identifiability conditions are discussed. By treating latent features as variables whose posterior distributions are approximated using a mean-field variational inference approach, a variational Expectation Maximization algorithm is developed for efficient model inference. Numerical simulations and case studies demonstrate the efficacy of TSSDMN for understanding dynamic multilayer networks.

replace-cross Language-Guided Multi-Agent Learning in Simulations: A Unified Framework and Evaluation

Authors: Zhengyang Li

Abstract: This paper introduces LLM-MARL, a unified framework that incorporates large language models (LLMs) into multi-agent reinforcement learning (MARL) to enhance coordination, communication, and generalization in simulated game environments. The framework features three modular components, a Coordinator, a Communicator, and a Memory, which dynamically generate subgoals, facilitate symbolic inter-agent messaging, and support episodic recall. Training combines PPO with a language-conditioned loss and LLM query gating. LLM-MARL is evaluated in Google Research Football, MAgent Battle, and StarCraft II. Results show consistent improvements over MAPPO and QMIX in win rate, coordination score, and zero-shot generalization. Ablation studies demonstrate that subgoal generation and language-based messaging each contribute significantly to performance gains. Qualitative analysis reveals emergent behaviors such as role specialization and communication-driven tactics. By bridging language modeling and policy learning, this work contributes to the design of intelligent, cooperative agents in interactive simulations. It offers a path forward for leveraging LLMs in multi-agent systems used for training, games, and human-AI collaboration.

replace-cross Generalizing while preserving monotonicity in comparison-based preference learning models

Authors: Julien Fageot, Peva Blanchard, Gilles Bareilles, Lê-Nguyên Hoang

Abstract: If you tell a learning model that you prefer an alternative $a$ over another alternative $b$, then you probably expect the model to be monotone, that is, the valuation of $a$ increases, and that of $b$ decreases. Yet, perhaps surprisingly, many widely deployed comparison-based preference learning models, including large language models, fail to have this guarantee. Until now, the only comparison-based preference learning algorithms that were proved to be monotone are the Generalized Bradley-Terry models. Yet, these models are unable to generalize to uncompared data. In this paper, we advance the understanding of the set of models with generalization ability that are monotone. Namely, we propose a new class of Linear Generalized Bradley-Terry models with Diffusion Priors, and identify sufficient conditions on alternatives' embeddings that guarantee monotonicity. Our experiments show that this monotonicity is far from being a general guarantee, and that our new class of generalizing models improves accuracy, especially when the dataset is limited.
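
For context, the baseline the paper generalizes, a linear Bradley-Terry model over alternative embeddings, can be written in a few lines; the diffusion prior that gives the new model class its generalization-with-monotonicity guarantee is omitted in this sketch.

```python
import numpy as np

def bradley_terry_nll(theta, X, comparisons):
    """Negative log-likelihood of a linear Bradley-Terry model (sketch).

    Scores are linear in embeddings, s_i = <theta, X[i]>; each pair
    (winner, loser) records that `winner` was preferred. P(w beats l)
    is the logistic function of the score difference, and
    log1p(exp(-z)) = -log sigmoid(z) is its negative log-probability.
    """
    s = X @ theta
    return float(sum(np.log1p(np.exp(-(s[w] - s[l])))
                     for w, l in comparisons))
```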

replace-cross AMLgentex: Mobilizing Data-Driven Research to Combat Money Laundering

Authors: Johan \"Ostman, Edvin Callisen, Anton Chen, Kristiina Ausmees, Emanuel G{\aa}rdh, Jovan Zamac, Jolanta Goldsteine, Hugo Wefer, Simon Whelan, Markus Reimeg{\aa}rd

Abstract: Money laundering enables organized crime by moving illicit funds into the legitimate economy. Although trillions of dollars are laundered each year, detection rates remain low because launderers evade oversight, confirmed cases are rare, and institutions see only fragments of the global transaction network. Since access to real transaction data is tightly restricted, synthetic datasets are essential for developing and evaluating detection methods. However, existing datasets fall short: they often neglect partial observability, temporal dynamics, strategic behavior, uncertain labels, class imbalance, and network-level dependencies. We introduce AMLGentex, an open-source suite for generating realistic, configurable transaction data and benchmarking detection methods. AMLGentex enables systematic evaluation of anti-money laundering systems under conditions that mirror real-world challenges. By releasing multiple country-specific datasets and practical parameter guidance, we aim to empower researchers and practitioners and provide a common foundation for collaboration and progress in combating money laundering.

replace-cross ixi-GEN: Efficient Industrial sLLMs through Domain Adaptive Continual Pretraining

Authors: Seonwu Kim, Yohan Na, Kihun Kim, Hanhee Cho, Geun Lim, Mintae Kim, Seongik Park, Ki Hyun Kim, Youngsub Han, Byoung-Ki Jeon

Abstract: The emergence of open-source large language models (LLMs) has expanded opportunities for enterprise applications; however, many organizations still lack the infrastructure to deploy and maintain large-scale models. As a result, small LLMs (sLLMs) have become a practical alternative, despite their inherent performance limitations. While Domain Adaptive Continual Pretraining (DACP) has been previously explored as a method for domain adaptation, its utility in commercial applications remains under-examined. In this study, we validate the effectiveness of applying a DACP-based recipe across diverse foundation models and service domains. Through extensive experiments and real-world evaluations, we demonstrate that DACP-applied sLLMs achieve substantial gains in target domain performance while preserving general capabilities, offering a cost-efficient and scalable solution for enterprise-level deployment.

replace-cross NoHumansRequired: Autonomous High-Quality Image Editing Triplet Mining

Authors: Maksim Kuprashevich, Grigorii Alekseenko, Irina Tolstykh, Georgii Fedorov, Bulat Suleimanov, Vladimir Dokholyan, Aleksandr Gordeev

Abstract: Recent advances in generative modeling enable image editing assistants that follow natural language instructions without additional user input. Their supervised training requires millions of triplets (original image, instruction, edited image), yet mining pixel-accurate examples is hard. Each edit must affect only prompt-specified regions, preserve stylistic coherence, respect physical plausibility, and retain visual appeal. The lack of robust automated edit-quality metrics hinders reliable automation at scale. We present an automated, modular pipeline that mines high-fidelity triplets across domains, resolutions, instruction complexities, and styles. Built on public generative models and running without human intervention, our system uses a task-tuned Gemini validator to score instruction adherence and aesthetics directly, removing any need for segmentation or grounding models. Inversion and compositional bootstrapping enlarge the mined set by approx. 2.6x, enabling large-scale high-fidelity training data. By automating the most repetitive annotation steps, the approach allows a new scale of training without human labeling effort. To democratize research in this resource-intensive area, we release NHR-Edit, an open dataset of 720k high-quality triplets, curated at industrial scale via millions of guided generations and validator passes, and we analyze the pipeline's stage-wise survival rates, providing a framework for estimating computational effort across different model stacks. In the largest cross-dataset evaluation, it surpasses all public alternatives. We also release Bagel-NHR-Edit, a fine-tuned Bagel model with state-of-the-art metrics.

replace-cross GeMix: Conditional GAN-Based Mixup for Improved Medical Image Augmentation

Authors: Hugo Carlesso, Maria Eliza Patulea, Moncef Garouani, Radu Tudor Ionescu, Josiane Mothe

Abstract: Mixup has become a popular augmentation strategy for image classification, yet its naive pixel-wise interpolation often produces unrealistic images that can hinder learning, particularly in high-stakes medical applications. We propose GeMix, a two-stage framework that replaces heuristic blending with a learned, label-aware interpolation powered by class-conditional GANs. First, a StyleGAN2-ADA generator is trained on the target dataset. During augmentation, we sample two label vectors from Dirichlet priors biased toward different classes and blend them via a Beta-distributed coefficient. Then, we condition the generator on this soft label to synthesize visually coherent images that lie along a continuous class manifold. We benchmark GeMix on the large-scale COVIDx-CT-3 dataset using three backbones (ResNet-50, ResNet-101, EfficientNet-B0). When combined with real data, our method increases macro-F1 over traditional mixup for all backbones, reducing the false negative rate for COVID-19 detection. GeMix is thus a drop-in replacement for pixel-space mixup, delivering stronger regularization and greater semantic fidelity, without disrupting existing training pipelines. We publicly release our code at https://github.com/hugocarlesso/GeMix to foster reproducibility and further research.

URLs: https://github.com/hugocarlesso/GeMix
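
The label-blending step is the easiest part of GeMix to make concrete. A hedged sketch of that step alone, with hypothetical concentration and Beta parameters:

```python
import numpy as np

def gemix_soft_label(num_classes, class_a, class_b,
                     concentration=2.0, beta_params=(2.0, 2.0)):
    """Sample a GeMix-style soft label (sketch of the blending step only).

    Draw two label vectors from Dirichlet priors biased toward different
    classes, then mix them with a Beta-distributed coefficient. The result
    conditions a class-conditional generator to synthesize an image along
    the continuous manifold between the two classes.
    """
    alpha_a = np.ones(num_classes)
    alpha_a[class_a] += concentration
    alpha_b = np.ones(num_classes)
    alpha_b[class_b] += concentration
    y_a = np.random.dirichlet(alpha_a)
    y_b = np.random.dirichlet(alpha_b)
    lam = np.random.beta(*beta_params)
    return lam * y_a + (1.0 - lam) * y_b   # still sums to 1
```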

replace-cross Robustifying Learning-Augmented Caching Efficiently without Compromising 1-Consistency

Authors: Peng Chen, Hailiang Zhao, Jiaji Zhang, Xueyan Tang, Yixuan Wang, Shuiguang Deng

Abstract: The online caching problem aims to minimize cache misses when serving a sequence of requests under a limited cache size. While naive learning-augmented caching algorithms achieve ideal $1$-consistency, they lack robustness guarantees. Existing robustification methods either sacrifice $1$-consistency or introduce significant computational overhead. In this paper, we introduce Guard, a lightweight robustification framework that enhances the robustness of a broad class of learning-augmented caching algorithms to $2H_k + 2$, while preserving their $1$-consistency. Guard achieves the current best-known trade-off between consistency and robustness, with only $O(1)$ additional per-request overhead, thereby maintaining the original time complexity of the base algorithm. Extensive experiments across multiple real-world datasets and prediction models validate the effectiveness of Guard in practice.

replace-cross Locally Adaptive Conformal Inference for Operator Models

Authors: Trevor Harris, Yan Liu

Abstract: Operator models are regression algorithms between Banach spaces of functions. They have become an increasingly critical tool for spatiotemporal forecasting and physics emulation, especially in high-stakes scenarios where robust, calibrated uncertainty quantification is required. We introduce Local Sliced Conformal Inference (LSCI), a distribution-free framework for generating function-valued, locally adaptive prediction sets for operator models. We prove finite-sample validity and derive a data-dependent upper bound on the coverage gap under local exchangeability. On synthetic Gaussian-process tasks and real applications (air quality monitoring, energy demand forecasting, and weather prediction), LSCI yields tighter sets with stronger adaptivity compared to conformal baselines. We also empirically demonstrate robustness against biased predictions and certain out-of-distribution noise regimes.

replace-cross villa-X: Enhancing Latent Action Modeling in Vision-Language-Action Models

Authors: Xiaoyu Chen, Hangxing Wei, Pushi Zhang, Chuheng Zhang, Kaixin Wang, Yanjiang Guo, Rushuai Yang, Yucen Wang, Xinquan Xiao, Li Zhao, Jianyu Chen, Jiang Bian

Abstract: Vision-Language-Action (VLA) models have emerged as a popular paradigm for learning robot manipulation policies that can follow language instructions and generalize to novel scenarios. Recent works have begun to explore the incorporation of latent actions, abstract representations of motion between two frames, into VLA pre-training. In this paper, we introduce villa-X, a novel Vision-Language-Latent-Action (ViLLA) framework that advances latent action modeling for learning generalizable robot manipulation policies. Our approach improves both how latent actions are learned and how they are incorporated into VLA pre-training. We demonstrate that villa-X can generate latent action plans in a zero-shot fashion, even for unseen embodiments and open-vocabulary symbolic understanding. This capability enables villa-X to achieve superior performance across diverse simulation tasks in SIMPLER and on two real-world robotic setups involving both gripper and dexterous hand manipulation. These results establish villa-X as a principled and scalable paradigm for learning generalizable robot manipulation policies. We believe it provides a strong foundation for future research.

replace-cross AuthPrint: Fingerprinting Generative Models Against Malicious Model Providers

Authors: Kai Yao, Marc Juarez

Abstract: Generative models are increasingly adopted in high-stakes domains, yet current deployments offer no mechanisms to verify whether a given output truly originates from the certified model. We address this gap by extending model fingerprinting techniques beyond the traditional collaborative setting to one where the model provider itself may act adversarially, replacing the certified model with a cheaper or lower-quality substitute. To our knowledge, this is the first work to study fingerprinting for provenance attribution under such a threat model. Our approach introduces a trusted verifier that, during a certification phase, extracts hidden fingerprints from the authentic model's output space and trains a detector to recognize them. During verification, this detector can determine whether new outputs are consistent with the certified model, without requiring specialized hardware or model modifications. In extensive experiments, our methods achieve near-zero FPR@95%TPR on both GANs and diffusion models, and remain effective even against subtle architectural or training changes. Furthermore, the approach is robust to adaptive adversaries that actively manipulate outputs in an attempt to evade detection.

replace-cross ML-PWS: Estimating the Mutual Information Between Experimental Time Series Using Neural Networks

Authors: Manuel Reinhardt, Gašper Tkačik, Pieter Rein ten Wolde

Abstract: The ability to quantify information transmission is crucial for the analysis and design of natural and engineered systems. The information transmission rate is the fundamental measure for systems with time-varying signals, yet computing it is extremely challenging. In particular, the rate cannot be obtained directly from experimental time-series data without approximations, because of the high dimensionality of the signal trajectory space. Path Weight Sampling (PWS) is a computational technique that makes it possible to obtain the information rate exactly for any stochastic system. However, it requires a mathematical model of the system of interest, be it described by a master equation or a set of differential equations. Here, we present a technique that employs Machine Learning (ML) to develop a generative model from experimental time-series data, which is then combined with PWS to obtain the information rate. We demonstrate the accuracy of this technique, called ML-PWS, by comparing its results on synthetic time-series data generated from a non-linear model against ground-truth results obtained by applying PWS directly to the same model. We illustrate the utility of ML-PWS by applying it to neuronal time-series data.

replace-cross ILRe: Intermediate Layer Retrieval for Context Compression in Causal Language Models

Authors: Manlai Liang, Mandi Liu, Jiangzhou Ji, Huaijun Li, Haobo Yang, Yaohan He, Jinlong Li

Abstract: Large Language Models (LLMs) have demonstrated success across many benchmarks. However, they still exhibit limitations in long-context scenarios, primarily due to their short effective context length, quadratic computational complexity, and high memory overhead when processing lengthy inputs. To mitigate these issues, we introduce a novel context compression pipeline, called Intermediate Layer Retrieval (ILRe), which determines one intermediate decoder layer offline, encodes context by streaming chunked prefill only up to that layer, and recalls tokens by the attention scores between the input query and the full key cache in that specified layer. In particular, we propose a multi-pooling kernel allocation strategy in the token-recalling process to maintain semantic completeness. Our approach not only reduces the prefilling complexity from $O(L^2)$ to $O(L)$ and trims the memory footprint to a few tenths of that required for the full context, but also delivers performance comparable to or superior to the full-context setup in long-context scenarios. Without additional post-training or operator development, ILRe can process a single $1M$-token request in less than half a minute (speedup $\approx 180\times$) and scores $\approx 79.8$ on the RULER-$1M$ benchmark with the model Llama-3.1-UltraLong-8B-1M-Instruct on a Huawei Ascend 910B NPU.
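
The recall step reduces to scoring cached keys against the query at the chosen layer and keeping the top-scoring tokens. A single-kernel sketch (the multi-pooling allocation is omitted, and the array shapes are assumptions):

```python
import numpy as np

def recall_context_tokens(query_states, key_cache, keep=1024):
    """Select context tokens by intermediate-layer attention scores (sketch).

    query_states: (q_len, d) query hidden states at the chosen layer.
    key_cache:    (ctx_len, d) full key cache at that layer.
    Returns indices of the `keep` context tokens with the highest
    attention logits against any query position, in original order.
    """
    logits = query_states @ key_cache.T / np.sqrt(key_cache.shape[-1])
    token_scores = logits.max(axis=0)        # best logit per context token
    kept = np.argsort(token_scores)[-keep:]
    return np.sort(kept)
```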

replace-cross Universal Dynamics with Globally Controlled Analog Quantum Simulators

Authors: Hong-Ye Hu, Abigail McClain Gomez, Liyuan Chen, Aaron Trowbridge, Andy J. Goldschmidt, Zachary Manchester, Frederic T. Chong, Arthur Jaffe, Susanne F. Yelin

Abstract: Analog quantum simulators with global control fields have emerged as powerful platforms for exploring complex quantum phenomena. Recent breakthroughs, such as the coherent control of thousands of atoms, highlight the growing potential for quantum applications at scale. Despite these advances, a fundamental theoretical question remains unresolved: to what extent can such systems realize universal quantum dynamics under global control? Here we establish a necessary and sufficient condition for universal quantum computation using only global pulse control, proving that a broad class of analog quantum simulators is, in fact, universal. We further extend this framework to fermionic and bosonic systems, including modern platforms such as ultracold atoms in optical superlattices. Crucially, to connect the theoretical possibility with experimental reality, we introduce a new control technique into the experiment - direct quantum optimal control. This method enables the synthesis of complex effective Hamiltonians and allows us to incorporate realistic hardware constraints. To show its practical power, we experimentally engineer three-body interactions outside the blockade regime and demonstrate topological dynamics on a Rydberg atom array. Using the new control framework, we overcome key experimental challenges, including hardware limitations and atom position fluctuations in the non-blockade regime, by identifying smooth, short-duration pulses that achieve high-fidelity dynamics. Experimental measurements reveal dynamical signatures of symmetry-protected-topological edge modes, confirming both the expressivity and feasibility of our approach. Our work opens a new avenue for quantum simulation beyond native hardware Hamiltonians, enabling the engineering of effective multi-body interactions and advancing the frontier of quantum information processing with globally-controlled analog platforms.

replace-cross Constrained Decoding for Robotics Foundation Models

Authors: Parv Kapoor, Akila Ganlath, Changliu Liu, Sebastian Scherer, Eunsuk Kang

Abstract: Recent advances in the development of robotic foundation models have led to promising end-to-end and general-purpose capabilities in robotic systems. These models are pretrained on vast datasets of robot trajectories to process multi-modal inputs and directly output a sequence of actions that the system then executes in the real world. Although this approach is attractive from the perspective of improved generalization across diverse tasks, these models are still data-driven and, therefore, lack explicit notions of behavioral correctness and safety constraints. We address these limitations by introducing a constrained decoding framework for robotics foundation models that enforces logical constraints on action trajectories in dynamical systems. Our method ensures that generated actions provably satisfy signal temporal logic (STL) specifications at runtime without retraining, while remaining agnostic to the underlying foundation model. We perform a comprehensive evaluation of our approach across state-of-the-art navigation foundation models and show that our decoding-time interventions are useful not only for filtering out unsafe actions but also for conditional action generation. Videos available on our website: https://constrained-robot-fms.github.io

URLs: https://constrained-robot-fms.github.io
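
In outline, decoding-time STL enforcement can be as simple as filtering the model's candidate trajectories through a robustness check. The sketch below assumes a `robustness` function returning the standard STL robustness value (non-negative iff the specification holds); the exact selection rule is hypothetical.

```python
def constrained_decode(candidates, robustness, model_score):
    """Pick the best model-preferred trajectory that satisfies the spec.

    candidates:  action trajectories proposed by the foundation model.
    robustness:  STL robustness of a trajectory (>= 0 means satisfied).
    model_score: the model's own preference over trajectories.
    """
    safe = [traj for traj in candidates if robustness(traj) >= 0.0]
    if not safe:
        raise RuntimeError("no candidate satisfies the STL specification")
    return max(safe, key=model_score)
```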

replace-cross PLaMo 2 Technical Report

Authors: Preferred Networks (Kaizaburo Chubachi, Yasuhiro Fujita, Shinichi Hemmi, Yuta Hirokawa, Kentaro Imajo, Toshiki Kataoka, Goro Kobayashi, Kenichi Maehashi, Calvin Metzger, Hiroaki Mikami, Shogo Murai, Daisuke Nishino, Kento Nozawa, Toru Ogawa, Shintarou Okada, Daisuke Okanohara, Shunta Saito, Shotaro Sano, Shuji Suzuki, Kuniyuki Takahashi, Daisuke Tanaka, Avinash Ummadisingu, Hanqin Wang, Sixue Wang, Tianqi Xu)

Abstract: In this report, we introduce PLaMo 2, a series of Japanese-focused large language models featuring a hybrid Samba-based architecture that transitions to full attention via continual pre-training to support 32K token contexts. Training leverages extensive synthetic corpora to overcome data scarcity, while computational efficiency is achieved through weight reuse and structured pruning. This efficient pruning methodology produces an 8B model that achieves performance comparable to our previous 100B model. Post-training further refines the models using a pipeline of supervised fine-tuning (SFT) and direct preference optimization (DPO), enhanced by synthetic Japanese instruction data and model merging techniques. Optimized for inference using vLLM and quantization with minimal accuracy loss, the PLaMo 2 models achieve state-of-the-art results on Japanese benchmarks, outperforming similarly-sized open models in instruction-following, language fluency, and Japanese-specific knowledge.

replace-cross Expressive Power of Deep Networks on Manifolds: Simultaneous Approximation

Authors: Hanfei Zhou, Lei Shi

Abstract: A key challenge in scientific machine learning is solving partial differential equations (PDEs) on complex domains, where the curved geometry complicates the approximation of functions and their derivatives required by differential operators. This paper establishes the first simultaneous approximation theory for deep neural networks on manifolds. We prove that a constant-depth $\mathrm{ReLU}^{k-1}$ network with bounded weights--a property that plays a crucial role in controlling generalization error--can approximate any function in the Sobolev space $\mathcal{W}_p^{k}(\mathcal{M}^d)$ to an error of $\varepsilon$ in the $\mathcal{W}_p^{s}(\mathcal{M}^d)$ norm, for $k\geq 3$ and $s<k$.

replace-cross Neural Audio Codecs for Prompt-Driven Universal Sound Separation

Authors: Adhiraj Banerjee, Vipul Arora

Abstract: Text-guided sound separation supports flexible audio editing across media and assistive applications, but existing models like AudioSep are too compute-heavy for edge deployment. Neural audio codec (NAC) models such as CodecFormer and SDCodec are compute-efficient but limited to fixed-class separation. We introduce CodecSep, the first NAC-based model for on-device universal, text-driven separation. CodecSep combines DAC compression with a Transformer masker modulated by CLAP-derived FiLM parameters. Across six open-domain benchmarks under matched training/prompt protocols, CodecSep surpasses AudioSep in separation fidelity (SI-SDR) while remaining competitive in perceptual quality (ViSQOL) and matching or exceeding fixed-stem baselines (TDANet, CodecFormer, SDCodec). In code-stream deployments, it needs just 1.35 GMACs end-to-end -- approximately $54\times$ less compute ($25\times$ architecture-only) than spectrogram-domain separators like AudioSep -- while remaining fully bitstream-compatible.

replace-cross Label-Efficient Grasp Joint Prediction with Point-JEPA

Authors: Jed Guzelkabaagac, Boris Petrović

Abstract: We study whether 3D self-supervised pretraining with Point-JEPA enables label-efficient grasp joint-angle prediction. Meshes are sampled to point clouds and tokenized; a ShapeNet-pretrained Point-JEPA encoder feeds a $K=5$ multi-hypothesis head trained with winner-takes-all and evaluated by top-logit selection. On a multi-finger hand dataset with strict object-level splits, Point-JEPA improves top-logit RMSE and Coverage@15$^{\circ}$ in low-label regimes (e.g., 26% lower RMSE at 25% data) and reaches parity at full supervision, suggesting JEPA-style pretraining is a practical lever for data-efficient grasp learning.

replace-cross On the System Theoretic Offline Learning of Continuous-Time LQR with Exogenous Disturbances

Authors: Sayak Mukherjee, Ramij R. Hossain, Mahantesh Halappanavar

Abstract: We analyze offline designs of linear quadratic regulator (LQR) strategies with uncertain disturbances. First, we consider the scenario where the exogenous variable can be estimated in a controlled environment, and subsequently, consider a more practical and challenging scenario where it is unknown in a stochastic setting. Our approach builds on the fundamental learning-based framework of adaptive dynamic programming (ADP), combined with a Lyapunov-based analytical methodology to design the algorithms and derive sample-based approximations motivated by Markov decision process (MDP)-based approaches. For the scenario involving non-measurable disturbances, we further establish stability and convergence guarantees for the learned control gains under sample-based approximations. The overall methodology emphasizes simplicity while providing rigorous guarantees. Finally, numerical experiments illustrate the intricacies of, and validate, the offline design of continuous-time LQR with exogenous disturbances.

replace-cross DiffSyn: A Generative Diffusion Approach to Materials Synthesis Planning

Authors: Elton Pan, Soonhyoung Kwon, Sulin Liu, Mingrou Xie, Alexander J. Hoffman, Yifei Duan, Thorben Prein, Killian Sheriff, Yuriy Roman-Leshkov, Manuel Moliner, Rafael Gomez-Bombarelli, Elsa Olivetti

Abstract: The synthesis of crystalline materials, such as zeolites, remains a significant challenge due to a high-dimensional synthesis space, intricate structure-synthesis relationships and time-consuming experiments. Considering the one-to-many relationship between structure and synthesis, we propose DiffSyn, a generative diffusion model trained on over 23,000 synthesis recipes spanning 50 years of literature. DiffSyn generates probable synthesis routes conditioned on a desired zeolite structure and an organic template. DiffSyn achieves state-of-the-art performance by capturing the multi-modal nature of structure-synthesis relationships. We apply DiffSyn to differentiate among competing phases and generate optimal synthesis routes. As a proof of concept, we synthesize a UFI material using DiffSyn-generated synthesis routes. These routes, rationalized by density functional theory binding energies, resulted in the successful synthesis of a UFI material with a high Si/Al$_{\text{ICP}}$ of 19.0, which is expected to improve thermal stability and is higher than that of any previously recorded.

replace-cross Bilateral Distribution Compression: Reducing Both Data Size and Dimensionality

Authors: Dominic Broadbent, Nick Whiteley, Robert Allison, Tom Lovett

Abstract: Existing distribution compression methods reduce dataset size by minimising the Maximum Mean Discrepancy (MMD) between original and compressed sets, but modern datasets are often large in both sample size and dimensionality. We propose Bilateral Distribution Compression (BDC), a two-stage framework that compresses along both axes while preserving the underlying distribution, with overall linear time and memory complexity in dataset size and dimension. Central to BDC is the Decoded MMD (DMMD), which quantifies the discrepancy between the original data and a compressed set decoded from a low-dimensional latent space. BDC proceeds by (i) learning a low-dimensional projection using the Reconstruction MMD (RMMD), and (ii) optimising a latent compressed set with the Encoded MMD (EMMD). We show that this procedure minimises the DMMD, guaranteeing that the compressed set faithfully represents the original distribution. Experiments show that across a variety of scenarios BDC can achieve comparable or superior performance to ambient-space compression at substantially lower cost.
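
All three objectives (RMMD, EMMD, DMMD) are MMD discrepancies between different pairings of original, encoded, and decoded samples; the shared building block is an MMD estimator like the following RBF-kernel V-statistic.

```python
import numpy as np

def mmd2_rbf(X, Y, gamma=1.0):
    """Squared MMD between samples X and Y with an RBF kernel (sketch).

    Biased V-statistic estimator; X: (n, d), Y: (m, d). BDC minimizes
    discrepancies of this form where one argument is decoded from, or
    encoded into, a low-dimensional latent space.
    """
    def gram(A, B):
        sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-gamma * sq_dists)
    return float(gram(X, X).mean() + gram(Y, Y).mean()
                 - 2.0 * gram(X, Y).mean())
```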

replace-cross Efficient & Correct Predictive Equivalence for Decision Trees

Authors: Joao Marques-Silva, Alexey Ignatiev

Abstract: The Rashomon set of decision trees (DTs) has important uses. Recent work showed that DTs computing the same classification function, i.e. predictively equivalent DTs, can represent a significant fraction of the Rashomon set. Such redundancy is undesirable. For example, feature importance based on the Rashomon set becomes inaccurate due to the existence of predictively equivalent DTs, i.e. DTs with the same prediction for every possible input. In recent work, McTavish et al. proposed solutions for several computational problems related to DTs, including that of deciding predictive equivalence of DTs. This approach, which this paper refers to as MBDSR, consists of applying the well-known Quine-McCluskey (QM) method for obtaining minimum-size DNF (disjunctive normal form) representations of DTs, which are then used for comparing DTs for predictive equivalence. Furthermore, the minimum-size DNF representation was also applied to computing explanations for the predictions made by DTs, and to finding predictions in the presence of missing data. However, the problem of formula minimization is hard for the second level of the polynomial hierarchy, and the QM method may exhibit worst-case exponential running time and space. This paper first demonstrates that there exist decision trees that trigger the worst-case exponential running time and space of the QM method. Second, the paper shows that, depending on the QM method implementation, the MBDSR approach can produce incorrect results for the problem of deciding predictive equivalence. Third, the paper shows that any of the problems to which the smallest DNF representation has been applied can be solved in polynomial time in the size of the DT. The experiments confirm that, for DTs that trigger the worst case of the QM method, the algorithms proposed in this paper are orders of magnitude faster than those proposed by McTavish et al.

replace-cross CogniLoad: A Synthetic Natural Language Reasoning Benchmark With Tunable Length, Intrinsic Difficulty, and Distractor Density

Authors: Daniel Kaiser, Arnoldo Frigessi, Ali Ramezani-Kebrya, Benjamin Ricaud

Abstract: Current benchmarks for long-context reasoning in Large Language Models (LLMs) often blur critical factors like intrinsic task complexity, distractor interference, and task length. To enable more precise failure analysis, we introduce CogniLoad, a novel synthetic benchmark grounded in Cognitive Load Theory (CLT). CogniLoad generates natural-language logic puzzles with independently tunable parameters that reflect CLT's core dimensions: intrinsic difficulty ($d$) controls intrinsic load; distractor-to-signal ratio ($\rho$) regulates extraneous load; and task length ($N$) serves as an operational proxy for conditions demanding germane load. Evaluating 22 SotA reasoning LLMs, CogniLoad reveals distinct performance sensitivities, identifying task length as a dominant constraint and uncovering varied tolerances to intrinsic complexity and U-shaped responses to distractor ratios. By offering systematic, factorial control over these cognitive load dimensions, CogniLoad provides a reproducible, scalable, and diagnostically rich tool for dissecting LLM reasoning limitations and guiding future model development.

replace-cross Reinforcement Learning on Pre-Training Data

Authors: Siheng Li, Kejiao Li, Zenan Xu, Guanhua Huang, Evander Yang, Kun Li, Haoyuan Wu, Jiajia Wu, Zihao Zheng, Chenchen Zhang, Kun Shi, Kyrierl Deng, Qi Yi, Ruibin Xiong, Tingqiang Xu, Yuhao Jiang, Jianfeng Yan, Yuyuan Zeng, Guanghui Xu, Jinbao Xue, Zhijiang Xu, Zheng Fang, Shuai Li, Qibin Liu, Xiaoxue Li, Zhuoyu Li, Yangyu Tao, Fei Gao, Cheng Jiang, Bo Chao Wang, Kai Liu, Jianchen Zhu, Wai Lam, Wayyt Wang, Bo Zhou, Di Wang

Abstract: The growing disparity between the exponential scaling of computational resources and the finite growth of high-quality text data now constrains conventional scaling approaches for large language models (LLMs). To address this challenge, we introduce Reinforcement Learning on Pre-Training data (RLPT), a new training-time scaling paradigm for optimizing LLMs. In contrast to prior approaches that scale training primarily through supervised learning, RLPT enables the policy to autonomously explore meaningful trajectories to learn from pre-training data and improve its capability through reinforcement learning (RL). While existing RL strategies such as reinforcement learning from human feedback (RLHF) and reinforcement learning with verifiable rewards (RLVR) rely on human annotation for reward construction, RLPT eliminates this dependency by deriving reward signals directly from pre-training data. Specifically, it adopts a next-segment reasoning objective, rewarding the policy for accurately predicting subsequent text segments conditioned on the preceding context. This formulation allows RL to be scaled on pre-training data, encouraging the exploration of richer trajectories across broader contexts and thereby fostering more generalizable reasoning skills. Extensive experiments on both general-domain and mathematical reasoning benchmarks across multiple models validate the effectiveness of RLPT. For example, when applied to Qwen3-4B-Base, RLPT yields absolute improvements of $3.0$, $5.1$, $8.1$, $6.0$, $6.6$, and $5.3$ on MMLU, MMLU-Pro, GPQA-Diamond, KOR-Bench, AIME24, and AIME25, respectively. The results further demonstrate favorable scaling behavior, suggesting strong potential for continued gains with more compute. In addition, RLPT provides a solid foundation, extending the reasoning boundaries of LLMs and enhancing RLVR performance.
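
The next-segment objective needs no human labels, which is the central point; a minimal sketch of the reward computation follows, with `policy` and `match` as assumed stand-ins for the generating model and the segment-agreement check.

```python
def rlpt_reward(policy, match, context, true_next_segment):
    """Self-supervised reward for next-segment reasoning (sketch).

    The policy reasons about and predicts the segment that follows
    `context` in the pre-training corpus; the reward checks agreement
    with the actual next segment, so no preference annotation is needed.
    """
    predicted = policy.generate(context)
    return 1.0 if match(predicted, true_next_segment) else 0.0
```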

replace-cross HUNT: High-Speed UAV Navigation and Tracking in Unstructured Environments via Instantaneous Relative Frames

Authors: Alessandro Saviolo, Jeffrey Mao, Giuseppe Loianno

Abstract: Search and rescue operations require unmanned aerial vehicles to both traverse unknown unstructured environments at high speed and track targets once detected. Achieving both capabilities under degraded sensing and without global localization remains an open challenge. Recent works on relative navigation have shown robust tracking by anchoring planning and control to a visible detected object, but cannot address navigation when no target is in the field of view. We present HUNT (High-speed UAV Navigation and Tracking), a real-time framework that unifies traversal, acquisition, and tracking within a single relative formulation. HUNT defines navigation objectives directly from onboard instantaneous observables such as attitude, altitude, and velocity, enabling reactive high-speed flight during search. Once a target is detected, the same perception-control pipeline transitions seamlessly to tracking. Outdoor experiments in dense forests, container compounds, and search-and-rescue operations with vehicles and mannequins demonstrate robust autonomy where global methods fail.

replace-cross Hyperspectral Adapter for Semantic Segmentation with Vision Foundation Models

Authors: Juana Valeria Hurtado, Rohit Mohan, Abhinav Valada

Abstract: Hyperspectral imaging (HSI) captures spatial information along with dense spectral measurements across numerous narrow wavelength bands. This rich spectral content has the potential to facilitate robust robotic perception, particularly in environments with complex material compositions, varying illumination, or other visually challenging conditions. However, current HSI semantic segmentation methods underperform due to their reliance on architectures and learning frameworks optimized for RGB inputs. In this work, we propose a novel hyperspectral adapter that leverages pretrained vision foundation models to effectively learn from hyperspectral data. Our architecture incorporates a spectral transformer and a spectrum-aware spatial prior module to extract rich spatial-spectral features. Additionally, we introduce a modality-aware interaction block that facilitates effective integration of hyperspectral representations and frozen vision Transformer features through dedicated extraction and injection mechanisms. Extensive evaluations on three benchmark autonomous driving datasets demonstrate that our architecture achieves state-of-the-art semantic segmentation performance while directly using HSI inputs, outperforming both vision-based and hyperspectral segmentation methods. We make the code available at https://hsi-adapter.cs.uni-freiburg.de.

URLs: https://hsi-adapter.cs.uni-freiburg.de

replace-cross Thinking Augmented Pre-training

Authors: Liang Wang, Nan Yang, Shaohan Huang, Li Dong, Furu Wei

Abstract: This paper introduces a simple and scalable approach to improve the data efficiency of large language model (LLM) training by augmenting existing text data with thinking trajectories. The compute for pre-training LLMs has been growing at an unprecedented rate, while the availability of high-quality data remains limited. Consequently, maximizing the utility of available data constitutes a significant research challenge. A primary impediment is that certain high-quality tokens are difficult to learn given a fixed model capacity, as the underlying rationale for a single token can be exceptionally complex and deep. To address this issue, we propose Thinking augmented Pre-Training (TPT), a universal methodology that augments text with automatically generated thinking trajectories. Such augmentation effectively increases the volume of the training data and makes high-quality tokens more learnable through step-by-step reasoning and decomposition. We apply TPT across diverse training configurations up to $100$B tokens, encompassing pre-training with both constrained and abundant data, as well as mid-training from strong open-source checkpoints. Experimental results indicate that our method substantially improves the performance of LLMs across various model sizes and families. Notably, TPT enhances the data efficiency of LLM pre-training by a factor of $3$. For a $3$B parameter model, it improves the post-training performance by over $10\%$ on several challenging reasoning benchmarks.