new Decorrelating the Future: Joint Frequency Domain Learning for Spatio-temporal Forecasting

Authors: Zepu Wang (Xuegang), Bowen Liao (Xuegang), Jeff (Xuegang), Ban

Abstract: Standard direct forecasting models typically rely on point-wise objectives such as Mean Squared Error, which fail to capture the complex spatio-temporal dependencies inherent in graph-structured signals. While recent frequency-domain approaches such as FreDF mitigate temporal autocorrelation, they often overlook spatial and cross spatio-temporal interactions. To address this limitation, we propose FreST Loss, a frequency-enhanced spatio-temporal training objective that extends supervision to the joint spatio-temporal spectrum. By leveraging the Joint Fourier Transform (JFT), FreST Loss aligns model predictions with ground truth in a unified spectral domain, effectively decorrelating complex dependencies across both space and time. Theoretical analysis shows that this formulation reduces estimation bias associated with time-domain training objectives. Extensive experiments on six real-world datasets demonstrate that FreST Loss is model-agnostic and consistently improves state-of-the-art baselines by better capturing holistic spatio-temporal dynamics.

new Machine Learning for Complex Systems Dynamics: Detecting Bifurcations in Dynamical Systems with Deep Neural Networks

Authors: Swadesh Pal, Roderick Melnik

Abstract: Critical transitions are the abrupt shifts between qualitatively different states of a system, and they are crucial to understanding tipping points in complex dynamical systems across ecology, climate science, and biology. Detecting these shifts typically involves extensive forward simulations or bifurcation analyses, which are often computationally intensive and limited by parameter sampling. In this study, we propose a novel machine learning approach based on deep neural networks (DNNs) called equilibrium-informed neural networks (EINNs) to identify critical thresholds associated with catastrophic regime shifts. Rather than fixing parameters and searching for solutions, the EINN method reverses this process by using candidate equilibrium states as inputs and training a DNN to infer the corresponding system parameters that satisfy the equilibrium condition. By analyzing the learned parameter landscape and observing abrupt changes in the feasibility or continuity of equilibrium mappings, critical thresholds can be effectively detected. We demonstrate this capability on nonlinear systems exhibiting saddle-node bifurcations and multi-stability, showing that EINNs can recover the parameter regions associated with impending transitions. This method provides a flexible alternative to traditional techniques, offering new insights into the early detection and structure of critical shifts in high-dimensional and nonlinear systems.

new FedEMA-Distill: Exponential Moving Average Guided Knowledge Distillation for Robust Federated Learning

Authors: Hamza Reguieg, Mohamed El Kamili, Essaid Sabir

Abstract: Federated learning (FL) often degrades when clients hold heterogeneous non-Independent and Identically Distributed (non-IID) data and when some clients behave adversarially, leading to client drift, slow convergence, and high communication overhead. This paper proposes FedEMA-Distill, a server-side procedure that combines an exponential moving average (EMA) of the global model with ensemble knowledge distillation from client-uploaded prediction logits evaluated on a small public proxy dataset. Clients run standard local training, upload only compressed logits, and may use different model architectures, so no changes are required to client-side software while still supporting model heterogeneity across devices. Experiments on CIFAR-10, CIFAR-100, FEMNIST, and AG News under Dirichlet-0.1 label skew show that FedEMA-Distill improves top-1 accuracy by several percentage points (up to +5% on CIFAR-10 and +6% on CIFAR-100) over representative baselines, reaches a given target accuracy in 30-35% fewer communication rounds, and reduces per-round client uplink payloads to 0.09-0.46 MB, i.e., roughly an order of magnitude less than transmitting full model weights. Using coordinate-wise median or trimmed-mean aggregation of logits at the server further stabilizes training in the presence of up to 10-20% Byzantine clients and yields well-calibrated predictions under attack. These results indicate that coupling temporal smoothing with logits-only aggregation provides a communication-efficient and attack-resilient FL pipeline that is deployment-friendly and compatible with secure aggregation and differential privacy, since only aggregated or obfuscated model outputs are exchanged.

new Delta-Crosscoder: Robust Crosscoder Model Diffing in Narrow Fine-Tuning Regimes

Authors: Aly Kassem, Thomas Jiralerspong, Negar Rostamzadeh, Golnoosh Farnadi

Abstract: Model diffing methods aim to identify how fine-tuning changes a model's internal representations. Crosscoders approach this by learning shared dictionaries of interpretable latent directions between base and fine-tuned models. However, existing formulations struggle with narrow fine-tuning, where behavioral changes are localized and asymmetric. We introduce Delta-Crosscoder, which combines BatchTopK sparsity with a delta-based loss prioritizing directions that change between models, plus an implicit contrastive signal from paired activations on matched inputs. Evaluated across 10 model organisms, including synthetic false facts, emergent misalignment, subliminal learning, and taboo word guessing (Gemma, LLaMA, Qwen; 1B-9B parameters), Delta-Crosscoder reliably isolates latent directions causally responsible for fine-tuned behaviors and enables effective mitigation, outperforming SAE-based baselines, while matching the Non-SAE-based. Our results demonstrate that crosscoders remain a powerful tool for model diffing.

new Thin Keys, Full Values: Reducing KV Cache via Low-Dimensional Attention Selection

Authors: Hengshuai Yao, Guan Wang

Abstract: Standard transformer attention uses identical dimensionality for queries, keys, and values ($d_q = d_k = d_v = \dmodel$). Our insight is that these components serve fundamentally different roles, and this symmetry is unnecessary. Queries and keys produce scalar attention weights (\emph{selection}), while values carry rich semantic representations (\emph{value transfer}). We argue that selection is an inherently lower-dimensional operation than value transfer, requiring only $\BigO(\log N)$ dimensions to distinguish among $N$ relevant patterns. We validate this hypothesis across seven experiments: (1)~positional selection tasks requiring just 1~dimension per head, (2)~content-based retrieval requiring $\sim\!\log_2 N$ dimensions, (3--4)~WikiText-2 and WikiText-103 language modeling where $\dselect = \dmodel/4$ incurs only 4.3\% perplexity increase while reducing QK parameters by 75\%, (5)~post-training SVD compression of GPT-2, revealing keys to be far more compressible than queries, with lightweight QK fine-tuning recovering nearly all quality loss, (6)~a 125M-parameter LLaMA model confirming identical degradation ratios across architectures, and (7)~Mistral-7B (7.2B parameters), where SVD compression followed by QK fine-tuning achieves 75\% key cache savings at just 2.0\% residual quality cost. For existing models, SVD compression followed by QK fine-tuning (3 epochs on a small fraction of pretraining data) achieves 75\% key cache savings at $<$2\% residual quality cost. For a 7B-parameter model serving 128K context, asymmetric attention saves 25\,GB of KV cache per user, enabling approximately 60\% more concurrent users on the same GPU.

new Agent Memory Below the Prompt: Persistent Q4 KV Cache for Multi-Agent LLM Inference on Edge Devices

Authors: Yakov Pyotr Shkolnikov

Abstract: Multi-agent LLM systems on edge devices face a memory management problem: device RAM is too small to hold every agent's KV cache simultaneously. On Apple M4 Pro with 10.2 GB of cache budget, only 3 agents fit at 8K context in FP16. A 10-agent workflow must constantly evict and reload caches. Without persistence, every eviction forces a full re-prefill through the model -- 15.7 seconds per agent at 4K context. We address this by persisting each agent's KV cache to disk in 4-bit quantized format and reloading it directly into the attention layer, eliminating redundant O(n) prefill computation via direct cache restoration. The system comprises three components: a block pool providing per-agent isolated Q4 KV caches in safetensors format, a BatchQuantizedKVCache for concurrent inference over multiple agents' quantized caches, and cross-phase context injection that accumulates attention state across conversation phases without re-computation. Evaluated on three architectures (Gemma 3 12B, dense GQA, 48 layers; DeepSeek-Coder-V2-Lite 16B, MoE MLA, 27 layers; Llama 3.1 8B, dense GQA, 32 layers), cache restoration reduces time-to-first-token by up to 136x (Gemma: 22--136x at 4K--32K; DeepSeek: 11--76x at 4K--32K; Llama: 24--111x at 4K--16K; 3--10x at 1K). Q4 quantization fits 4x more agent contexts into fixed device memory than FP16. Perplexity measured with actual Q4 KV caches shows -0.7% for Gemma, +2.8% for Llama, and +3.0% for DeepSeek. Open-source at https://github.com/yshk-mxim/agent-memory

URLs: https://github.com/yshk-mxim/agent-memory

new Flowers: A Warp Drive for Neural PDE Solvers

Authors: Till Muser, Alexandra Spitzer, Matti Lassas, Maarten V. de Hoop, Ivan Dokmani\'c

Abstract: We introduce Flowers, a neural architecture for learning PDE solution operators built entirely from multihead warps. Aside from pointwise channel mixing and a multiscale scaffold, Flowers use no Fourier multipliers, no dot-product attention, and no convolutional mixing. Each head predicts a displacement field and warps the mixed input features. Motivated by physics and computational efficiency, displacements are predicted pointwise, without any spatial aggregation, and nonlocality enters \emph{only} through sparse sampling at source coordinates, \emph{one} per head. Stacking warps in multiscale residual blocks yields Flowers, which implement adaptive, global interactions at linear cost. We theoretically motivate this design through three complementary lenses: flow maps for conservation laws, waves in inhomogeneous media, and a kinetic-theoretic continuum limit. Flowers achieve excellent performance on a broad suite of 2D and 3D time-dependent PDE benchmarks, particularly flows and waves. A compact 17M-parameter model consistently outperforms Fourier, convolution, and attention-based baselines of similar size, while a 150M-parameter variant improves over recent transformer-based foundation models with much more parameters, data, and training compute.

new Uncertainty-Calibrated Spatiotemporal Field Diffusion with Sparse Supervision

Authors: Kevin Valencia, Xihaier Luo, Shinjae Yoo, David Keetae Park

Abstract: Physical fields are typically observed only at sparse, time-varying sensor locations, making forecasting and reconstruction ill-posed and uncertainty-critical. We present SOLID, a mask-conditioned diffusion framework that learns spatiotemporal dynamics from sparse observations alone: training and evaluation use only observed target locations, requiring no dense fields and no pre-imputation. Unlike prior work that trains on dense reanalysis or simulations and only tests under sparsity, SOLID is trained end-to-end with sparse supervision only. SOLID conditions each denoising step on the measured values and their locations, and introduces a dual-masking objective that (i) emphasizes learning in unobserved void regions while (ii) upweights overlap pixels where inputs and targets provide the most reliable anchors. This strict sparse-conditioning pathway enables posterior sampling of full fields consistent with the measurements, achieving up to an order-of-magnitude improvement in probabilistic error and yielding calibrated uncertainty maps (\r{ho} > 0.7) under severe sparsity.

new ZorBA: Zeroth-order Federated Fine-tuning of LLMs with Heterogeneous Block Activation

Authors: Chuiyang Meng, Ming Tang, Vincent W. S. Wong

Abstract: Federated fine-tuning of large language models (LLMs) enables collaborative tuning across distributed clients. However, due to the large size of LLMs, local updates in federated learning (FL) may incur substantial video random-access memory (VRAM) usage. Moreover, frequent model exchange may lead to significant communication overhead. To tackle these challenges, in this paper we propose ZorBA, a zeroth-order optimization-based federated fine-tuning framework with heterogeneous block activation. ZorBA leverages zeroth-order optimization to eliminate the storage of gradients at the clients by forward passes. ZorBA includes a heterogeneous block activation mechanism in which the central server allocates different subsets of transformer blocks to clients in order to accelerate the convergence rate and reduce the VRAM usage. Furthermore, ZorBA utilizes shared random seeds and the finite differences of gradients in order to reduce the communication overhead. We conduct theoretical analysis to characterize the effect of block activation decisions on the convergence rate and VRAM usage. To jointly enhance the convergence rate and reduce the VRAM usage, we formulate an optimization problem to optimize the block activation decisions. We propose an $\epsilon$-constraint lexicographic algorithm to solve this problem. Experimental results show that ZorBA outperforms three federated fine-tuning baselines in VRAM usage by up to 62.41% and incurs a low communication overhead.

new ASFL: An Adaptive Model Splitting and Resource Allocation Framework for Split Federated Learning

Authors: Chuiyang Meng, Ming Tang, Vincent W. S. Wong

Abstract: Federated learning (FL) enables multiple clients to collaboratively train a machine learning model without sharing their raw data. However, the limited computation resources of the clients may result in a high delay and energy consumption on training. In this paper, we propose an adaptive split federated learning (ASFL) framework over wireless networks. ASFL exploits the computation resources of the central server to train part of the model and enables adaptive model splitting as well as resource allocation during training. To optimize the learning performance (i.e., convergence rate) and efficiency (i.e., delay and energy consumption) of ASFL, we theoretically analyze the convergence rate and formulate a joint learning performance and resource allocation optimization problem. Solving this problem is challenging due to the long-term delay and energy consumption constraints as well as the coupling of the model splitting and resource allocation decisions. We propose an online optimization enhanced block coordinate descent (OOE-BCD) algorithm to solve the problem iteratively. Experimental results show that when compared with five baseline schemes, our proposed ASFL framework converges faster and reduces the total delay and energy consumption by up to 75% and 80%, respectively.

new An Explainable Ensemble Framework for Alzheimer's Disease Prediction Using Structured Clinical and Cognitive Data

Authors: Nishan Mitra

Abstract: Early and accurate detection of Alzheimer's disease (AD) remains a major challenge in medical diagnosis due to its subtle onset and progressive nature. This research introduces an explainable ensemble learning Framework designed to classify individuals as Alzheimer's or Non-Alzheimer's using structured clinical, lifestyle, metabolic, and lifestyle features. The workflow incorporates rigorous preprocessing, advanced feature engineering, SMOTE-Tomek hybrid class balancing, and optimized modeling using five ensemble algorithms-Random Forest, XGBoost, LightGBM, CatBoost, and Extra Trees-alongside a deep artificial neural network. Model selection was performed using stratified validation to prevent leakage, and the best-performing model was evaluated on a fully unseen test set. Ensemble methods achieved superior performance over deep learning, with XGBoost, Random Forest, and Soft Voting showing the strongest accuracy, sensitivity, and F1-score profiles. Explainability techniques, including SHAP and feature importance analysis, highlighted MMSE, Functional Assessment Age, and several engineered interaction features as the most influential determinants. The results demonstrate that the proposed framework provides a reliable and transparent approach to Alzheimer's disease prediction, offering strong potential for clinical decision support applications.

new On Emergences of Non-Classical Statistical Characteristics in Classical Neural Networks

Authors: Hanyu Zhao, Yang Wu, Yuexian Hou

Abstract: Inspired by measurement incompatibility and Bell-family inequalities in quantum mechanics, we propose the Non-Classical Network (NCnet), a simple classical neural architecture that stably exhibits non-classical statistical behaviors under typical and interpretable experimental setups. We find non-classicality, measured by the $S$ statistic of CHSH inequality, arises from gradient competitions of hidden-layer neurons shared by multi-tasks. Remarkably, even without physical links supporting explicit communication, one task head can implicitly sense the training task of other task heads via local loss oscillations, leading to non-local correlations in their training outcomes. Specifically, in the low-resource regime, the value of $S$ increases gradually with increasing resources and approaches toward its classical upper-bound 2, which implies that underfitting is alleviated with resources increase. As the model nears the critical scale required for adequate performance, $S$ may temporarily exceed 2. As resources continue to grow, $S$ then asymptotically decays down to and fluctuates around 2. Empirically, when model capacity is insufficient, $S$ is positively correlated with generalization performance, and the regime where $S$ first approaches $2$ often corresponding to good generalization. Overall, our results suggest that non-classical statistics can provide a novel perspective for understanding internal interactions and training dynamics of deep networks.

new Learning Unified Distance Metric for Heterogeneous Attribute Data Clustering

Authors: Yiqun Zhang, Mingjie Zhao, Yizhou Chen, Yang Lu, Yiu-ming Cheung

Abstract: Datasets composed of numerical and categorical attributes (also called mixed data hereinafter) are common in real clustering tasks. Differing from numerical attributes that indicate tendencies between two concepts (e.g., high and low temperature) with their values in well-defined Euclidean distance space, categorical attribute values are different concepts (e.g., different occupations) embedded in an implicit space. Simultaneously exploiting these two very different types of information is an unavoidable but challenging problem, and most advanced attempts either encode the heterogeneous numerical and categorical attributes into one type, or define a unified metric for them for mixed data clustering, leaving their inherent connection unrevealed. This paper, therefore, studies the connection among any-type of attributes and proposes a novel Heterogeneous Attribute Reconstruction and Representation (HARR) learning paradigm accordingly for cluster analysis. The paradigm transforms heterogeneous attributes into a homogeneous status for distance metric learning, and integrates the learning with clustering to automatically adapt the metric to different clustering tasks. Differing from most existing works that directly adopt defined distance metrics or learn attribute weights to search clusters in a subspace. We propose to project the values of each attribute into unified learnable multiple spaces to more finely represent and learn the distance metric for categorical data. HARR is parameter-free, convergence-guaranteed, and can more effectively self-adapt to different sought number of clusters $k$. Extensive experiments illustrate its superiority in terms of accuracy and efficiency.

new VSPrefill: Vertical-Slash Sparse Attention with Lightweight Indexing for Long-Context Prefilling

Authors: Chen Guanzhong

Abstract: The quadratic complexity of self-attention during the prefill phase impedes long-context inference in large language models. Existing sparse attention methods face a trade-off among context adaptivity, sampling overhead, and fine-tuning costs. We propose VSPrefill, a mechanism requiring lightweight training that uses the vertical-slash structural pattern in attention distributions. Our compact VSIndexer module predicts context-aware importance scores for vertical columns and slash diagonals from key-value representations augmented with RoPE. This approach constructs sparse masks with linear complexity without modifying the backbone parameters. During inference, an adaptive cumulative-threshold strategy allocates sparsity budgets per layer, while a fused kernel executes attention with on-the-fly index merging. Evaluated on Qwen3-4B-Instruct and LLaMA-3.1-8B-Instruct across the LongBench and RULER benchmarks, VSPrefill preserves 98.35% of the full attention accuracy while delivering a 4.95x average speedup at a context length of 128k. These results establish a new Pareto frontier in the trade-off between accuracy and efficiency.

new MAD-SmaAt-GNet: A Multimodal Advection-Guided Neural Network for Precipitation Nowcasting

Authors: Samuel van Wonderen, Siamak Mehrkanoon

Abstract: Precipitation nowcasting (short-term forecasting) is still often performed using numerical solvers for physical equations, which are computationally expensive and make limited use of the large volumes of available weather data. Deep learning models have shown strong potential for precipitation nowcasting, offering both accuracy and computational efficiency. Among these models, convolutional neural networks (CNNs) are particularly effective for image-to-image prediction tasks. The SmaAt-UNet is a lightweight CNN based architecture that has demonstrated strong performance for precipitation nowcasting. This paper introduces the Multimodal Advection-Guided Small Attention GNet (MAD-SmaAt-GNet), which extends the core SmaAt-UNet by (i) incorporating an additional encoder to learn from multiple weather variables and (ii) integrating a physics-based advection component to ensure physically consistent predictions. We show that each extension individually improves rainfall forecasts and that their combination yields further gains. MAD-SmaAt-GNet reduces the mean squared error (MSE) by 8.9% compared with the baseline SmaAt-UNet for four-step precipitation forecasting up to four hours ahead. Additionally, experiments indicate that multimodal inputs are particularly beneficial for short lead times, while the advection-based component enhances performance across both short and long forecasting horizons.

new Understanding the Dynamics of Demonstration Conflict in In-Context Learning

Authors: Difan Jiao, Di Wang, Lijie Hu

Abstract: In-context learning enables large language models to perform novel tasks through few-shot demonstrations. However, demonstrations per se can naturally contain noise and conflicting examples, making this capability vulnerable. To understand how models process such conflicts, we study demonstration-dependent tasks requiring models to infer underlying patterns, a process we characterize as rule inference. We find that models suffer substantial performance degradation from a single demonstration with corrupted rule. This systematic misleading behavior motivates our investigation of how models process conflicting evidence internally. Using linear probes and logit lens analysis, we discover that under corruption models encode both correct and incorrect rules in intermediate layers but develop prediction confidence only in late layers, revealing a two-phase computational structure. We then identify attention heads for each phase underlying the reasoning failures: Vulnerability Heads in early-to-middle layers exhibit positional attention bias with high sensitivity to corruption, while Susceptible Heads in late layers significantly reduce support for correct predictions when exposed to the corrupted evidence. Targeted ablation validates our findings, with masking a small number of identified heads improving performance by over 10%.

new Towards Explainable Deep Learning for Ship Trajectory Prediction in Inland Waterways

Authors: Tom Legel, Dirk S\"offker, Roland Sch\"atzle, Kathrin Donandt

Abstract: Accurate predictions of ship trajectories in crowded environments are essential to ensure safety in inland waterways traffic. Recent advances in deep learning promise increased accuracy even for complex scenarios. While the challenge of ship-to-ship awareness is being addressed with growing success, the explainability of these models is often overlooked, potentially obscuring an inaccurate logic and undermining the confidence in their reliability. This study examines an LSTM-based vessel trajectory prediction model by incorporating trained ship domain parameters that provide insight into the attention-based fusion of the interacting vessels' hidden states. This approach has previously been explored in the field of maritime shipping, yet the variety and complexity of encounters in inland waterways allow for a more profound analysis of the model's interpretability. The prediction performance of the proposed model variants are evaluated using standard displacement error statistics. Additionally, the plausibility of the generated ship domain values is analyzed. With an final displacement error of around 40 meters in a 5-minute prediction horizon, the model performs comparably to similar studies. Though the ship-to-ship attention architecture enhances prediction accuracy, the weights assigned to vessels in encounters using the learnt ship domain values deviate from the expectation. The observed accuracy improvements are thus not entirely driven by a causal relationship between a predicted trajectory and the trajectories of nearby ships. This finding underscores the model's explanatory capabilities through its intrinsically interpretable design. Future work will focus on utilizing the architecture for counterfactual analysis and on the incorporation of more sophisticated attention mechanisms.

new Activity Recognition from Smart Insole Sensor Data Using a Circular Dilated CNN

Authors: Yanhua Zhao

Abstract: Smart insoles equipped with pressure sensors, accelerometers, and gyroscopes offer a non-intrusive means of monitoring human gait and posture. We present an activity classification system based on a circular dilated convolutional neural network (CDCNN) that processes multi-modal time-series data from such insoles. The model operates on 160-frame windows with 24 channels (18 pressure, 3 accelerometer, 3 gyroscope axes), achieving 86.42% test accuracy in a subject-independent evaluation on a four-class task (Standing, Walking, Sitting, Tandem), compared with 87.83% for an extreme gradient-boosted tree (XGBoost) model trained on flattened data. Permutation feature importance reveals that inertial sensors (accelerometer and gyroscope) contribute substantially to discrimination. The approach is suitable for embedded deployment and real-time inference.

new Standing on the Shoulders of Giants: Rethinking EEG Foundation Model Pretraining via Multi-Teacher Distillation

Authors: Chenqi Li, Yu Liu, Shuo Zhang, Timothy Denison, Tingting Zhu

Abstract: Pretraining for electroencephalogram (EEG) foundation models has predominantly relied on self-supervised masked reconstruction, a paradigm largely adapted from and inspired by the success of vision and language foundation models. However, unlike images and text, EEG datasets are notoriously expensive to collect and characterized by low signal-to-noise ratio. These challenges introduce difficulties in scaling the EEG foundation models and capturing the underlying neural semantics through reconstruction. In this work, we ask the question: can we stand on the shoulders of well-established foundation models from well-represented modalities to bootstrap the pretraining of EEG foundation models? We first demonstrate that mainstream foundation models, such as those from vision and time series, transfer surprisingly well to EEG domain. To this end, we propose the Multi-Teacher Distillation Pretraining (MTDP) framework for pretraining EEG foundation models via a two-stage multi-teacher distillation. In the first stage, we introduce a learnable gating network to fuse representations from diverse teachers (e.g., DINOv3 and Chronos) via a masked latent denoising objective. In the second stage, we distill the fused representation into an EEG foundation model. Extensive evaluations across 9 downstream tasks and 12 datasets demonstrate that our MTDP-based EEG foundation model outperforms its self-supervised counterparts while requiring only 25% of the pretraining data.

new Augmenting representations with scientific papers

Authors: Nicol\`o Oreste Pinciroli Vago, Rocco Di Tella, Carolina Cuesta-L\'azaro, Michael J. Smith, Cecilia Garraffo, Rafael Mart\'inez-Galarza

Abstract: Astronomers have acquired vast repositories of multimodal data, including images, spectra, and time series, complemented by decades of literature that analyzes astrophysical sources. Still, these data sources are rarely systematically integrated. This work introduces a contrastive learning framework designed to align X-ray spectra with domain knowledge extracted from scientific literature, facilitating the development of shared multimodal representations. Establishing this connection is inherently complex, as scientific texts encompass a broader and more diverse physical context than spectra. We propose a contrastive pipeline that achieves a 20% Recall@1% when retrieving texts from spectra, proving that a meaningful alignment between these modalities is not only possible but capable of accelerating the interpretation of rare or poorly understood sources. Furthermore, the resulting shared latent space effectively encodes physically significant information. By fusing spectral and textual data, we improve the estimation of 20 physical variables by 16-18% over unimodal spectral baselines. Our results indicate that a Mixture of Experts (MoE) strategy, which leverages both unimodal and shared representations, yields superior performance. Finally, outlier analysis within the multimodal latent space identifies high-priority targets for follow-up investigation, including a candidate pulsating ULX (PULX) and a gravitational lens system. Importantly, this framework can be extended to other scientific domains where aligning observational data with existing literature is possible.

new Invariant Causal Routing for Governing Social Norms in Online Market Economies

Authors: Xiangning Yu, Qirui Mi, Xiao Xue, Haoxuan Li, Yiwei Shi, Xiaowei Liu, Mengyue Yang

Abstract: Social norms are stable behavioral patterns that emerge endogenously within economic systems through repeated interactions among agents. In online market economies, such norms -- like fair exposure, sustained participation, and balanced reinvestment -- are critical for long-term stability. We aim to understand the causal mechanisms driving these emergent norms and to design principled interventions that can steer them toward desired outcomes. This is challenging because norms arise from countless micro-level interactions that aggregate into macro-level regularities, making causal attribution and policy transferability difficult. To address this, we propose \textbf{Invariant Causal Routing (ICR)}, a causal governance framework that identifies policy-norm relations stable across heterogeneous environments. ICR integrates counterfactual reasoning with invariant causal discovery to separate genuine causal effects from spurious correlations and to construct interpretable, auditable policy rules that remain effective under distribution shift. In heterogeneous agent simulations calibrated with real data, ICR yields more stable norms, smaller generalization gaps, and more concise rules than correlation or coverage baselines, demonstrating that causal invariance offers a principled and interpretable foundation for governance.

new An LLM-Guided Query-Aware Inference System for GNN Models on Large Knowledge Graphs

Authors: Waleed Afandi, Hussein Abdallah, Ashraf Aboulnaga, Essam Mansour

Abstract: Efficient inference for graph neural networks (GNNs) on large knowledge graphs (KGs) is essential for many real-world applications. GNN inference queries are computationally expensive and vary in complexity, as each involves a different number of target nodes linked to subgraphs of diverse densities and structures. Existing acceleration methods, such as pruning, quantization, and knowledge distillation, instantiate smaller models but do not adapt them to the structure or semantics of individual queries. They also store models as monolithic files that must be fully loaded, and miss the opportunity to retrieve only the neighboring nodes and corresponding model components that are semantically relevant to the target nodes. These limitations lead to excessive data loading and redundant computation on large KGs. This paper presents KG-WISE, a task-driven inference paradigm for large KGs. KG-WISE decomposes trained GNN models into fine-grained components that can be partially loaded based on the structure of the queried subgraph. It employs large language models (LLMs) to generate reusable query templates that extract semantically relevant subgraphs for each task, enabling query-aware and compact model instantiation. We evaluate KG-WISE on six large KGs with up to 42 million nodes and 166 million edges. KG-WISE achieves up to 28x faster inference and 98% lower memory usage than state-of-the-art systems while maintaining or improving accuracy across both commercial and open-weight LLMs.

new Oracle-efficient Hybrid Learning with Constrained Adversaries

Authors: Princewill Okoroafor, Robert Kleinberg, Michael P. Kim

Abstract: The Hybrid Online Learning Problem, where features are drawn i.i.d. from an unknown distribution but labels are generated adversarially, is a well-motivated setting positioned between statistical and fully-adversarial online learning. Prior work has presented a dichotomy: algorithms that are statistically-optimal, but computationally intractable (Wu et al., 2023), and algorithms that are computationally-efficient (given an ERM oracle), but statistically-suboptimal (Wu et al., 2024). This paper takes a significant step towards achieving statistical optimality and computational efficiency simultaneously in the Hybrid Learning setting. To do so, we consider a structured setting, where the Adversary is constrained to pick labels from an expressive, but fixed, class of functions $R$. Our main result is a new learning algorithm, which runs efficiently given an ERM oracle and obtains regret scaling with the Rademacher complexity of a class derived from the Learner's hypothesis class $H$ and the Adversary's label class $R$. As a key corollary, we give an oracle-efficient algorithm for computing equilibria in stochastic zero-sum games when action sets may be high-dimensional but the payoff function exhibits a type of low-dimensional structure. Technically, we develop a number of tools for the design and analysis of our learning algorithm, including a novel Frank-Wolfe reduction with "truncated entropy regularizer" and a new tail bound for sums of "hybrid" martingale difference sequences.

new Latent Particle World Models: Self-supervised Object-centric Stochastic Dynamics Modeling

Authors: Tal Daniel, Carl Qi, Dan Haramati, Amir Zadeh, Chuan Li, Aviv Tamar, Deepak Pathak, David Held

Abstract: We introduce Latent Particle World Model (LPWM), a self-supervised object-centric world model scaled to real-world multi-object datasets and applicable in decision-making. LPWM autonomously discovers keypoints, bounding boxes, and object masks directly from video data, enabling it to learn rich scene decompositions without supervision. Our architecture is trained end-to-end purely from videos and supports flexible conditioning on actions, language, and image goals. LPWM models stochastic particle dynamics via a novel latent action module and achieves state-of-the-art results on diverse real-world and synthetic datasets. Beyond stochastic video modeling, LPWM is readily applicable to decision-making, including goal-conditioned imitation learning, as we demonstrate in the paper. Code, data, pre-trained models and video rollouts are available: https://taldatech.github.io/lpwm-web

URLs: https://taldatech.github.io/lpwm-web

new Why Do Neural Networks Forget: A Study of Collapse in Continual Learning

Authors: Yunqin Zhu, Jun Jin

Abstract: Catastrophic forgetting is a major problem in continual learning, and lots of approaches arise to reduce it. However, most of them are evaluated through task accuracy, which ignores the internal model structure. Recent research suggests that structural collapse leads to loss of plasticity, as evidenced by changes in effective rank (eRank). This indicates a link to forgetting, since the networks lose the ability to expand their feature space to learn new tasks, which forces the network to overwrite existing representations. Therefore, in this study, we investigate the correlation between forgetting and collapse through the measurement of both weight and activation eRank. To be more specific, we evaluated four architectures, including MLP, ConvGRU, ResNet-18, and Bi-ConvGRU, in the split MNIST and Split CIFAR-100 benchmarks. Those models are trained through the SGD, Learning-without-Forgetting (LwF), and Experience Replay (ER) strategies separately. The results demonstrate that forgetting and collapse are strongly related, and different continual learning strategies help models preserve both capacity and performance in different efficiency.

new A Late-Fusion Multimodal AI Framework for Privacy-Preserving Deduplication in National Healthcare Data Environments

Authors: Mohammed Omer Shakeel Ahmed

Abstract: Duplicate records pose significant challenges in customer relationship management (CRM)and healthcare, often leading to inaccuracies in analytics, impaired user experiences, and compliance risks. Traditional deduplication methods rely heavily on direct identifiers such as names, emails, or Social Security Numbers (SSNs), making them ineffective under strict privacy regulations like GDPR and HIPAA, where such personally identifiable information (PII) is restricted or masked. In this research, I propose a novel, scalable, multimodal AI framework for detecting duplicates without depending on sensitive information. This system leverages three distinct modalities: semantic embeddings derived from textual fields (names, cities) using pre-trained DistilBERT models, behavioral patterns extracted from user login timestamps, and device metadata encoded through categorical embeddings. These heterogeneous modalities are combined using a late fusion approach and clustered via DBSCAN, an unsupervised density-based algorithm. This proposed model is evaluated against a traditional string-matching baseline on a synthetic CRM dataset specifically designed to reflect privacy-preserving constraints. The multimodal framework demonstrated good performance, achieving a good F1-score by effectively identifying duplicates despite variations and noise inherent in the data. This approach offers a privacy-compliant solution to entity resolution and supports secure digital infrastructure, enhances the reliability of public health analytics, and promotes ethical AI adoption across government and enterprise settings. It is well-suited for integration into national health data modernization efforts, aligning with broader goals of privacy-first innovation.

new PDE foundation model-accelerated inverse estimation of system parameters in inertial confinement fusion

Authors: Mahindra Rautela, Alexander Scheinker, Bradley Love, Diane Oyen, Nathan DeBardeleben, Earl Lawrence, Ayan Biswas

Abstract: PDE foundation models are typically pretrained on large, diverse corpora of PDE datasets and can be adapted to new settings with limited task-specific data. However, most downstream evaluations focus on forward problems, such as autoregressive rollout prediction. In this work, we study an inverse problem in inertial confinement fusion (ICF): estimating system parameters (inputs) from multi-modal, snapshot-style observations (outputs). Using the open JAG benchmark, which provides hyperspectral X-ray images and scalar observables per simulation, we finetune the PDE foundation model and train a lightweight task-specific head to jointly reconstruct hyperspectral images and regress system parameters. The fine-tuned model achieves accurate hyperspectral reconstruction (test MSE 1.2e-3) and strong parameter-estimation performance (up to R^2=0.995). Data-scaling experiments (5%-100% of the training set) show consistent improvements in both reconstruction and regression losses as the amount of training data increases, with the largest marginal gains in the low-data regime. Finally, finetuning from pretrained MORPH weights outperforms training the same architecture from scratch, demonstrating that foundation-model initialization improves sample efficiency for data-limited inverse problems in ICF.

new K-Means as a Radial Basis function Network: a Variational and Gradient-based Equivalence

Authors: Felipe de Jesus Felix Arredondo, Alejandro Ucan-Puc, Carlos Astengo Noguez

Abstract: This work establishes a rigorous variational and gradient-based equivalence between the classical K-Means algorithm and differentiable Radial Basis Function (RBF) neural networks with smooth responsibilities. By reparameterizing the K-Means objective and embedding its distortion functional into a smooth weighted loss, we prove that the RBF objective $\Gamma$-converges to the K-Means solution as the temperature parameter $\sigma$ vanishes. We further demonstrate that the gradient-based updates of the RBF centers recover the exact K-Means centroid update rule and induce identical training trajectories in the limit. To address the numerical instability of the Softmax transformation in the low-temperature regime, we propose the integration of Entmax-1.5, which ensures stable polynomial convergence while preserving the underlying Voronoi partition structure. These results bridge the conceptual gap between discrete partitioning and continuous optimization, enabling K-Means to be embedded directly into deep learning architectures for the joint optimization of representations and clusters. Empirical validation across diverse synthetic geometries confirms a monotone collapse of soft RBF centroids toward K-Means fixed points, providing a unified framework for end-to-end differentiable clustering.

new When Sensors Fail: Temporal Sequence Models for Robust PPO under Sensor Drift

Authors: Kevin Vogt-Lowell, Theodoros Tsiligkaridis, Rodney Lafuente-Mercado, Surabhi Ghatti, Shanghua Gao, Marinka Zitnik, Daniela Rus

Abstract: Real-world reinforcement learning systems must operate under distributional drift in their observation streams, yet most policy architectures implicitly assume fully observed and noise-free states. We study robustness of Proximal Policy Optimization (PPO) under temporally persistent sensor failures that induce partial observability and representation shift. To respond to this drift, we augment PPO with temporal sequence models, including Transformers and State Space Models (SSMs), to enable policies to infer missing information from history and maintain performance. Under a stochastic sensor failure process, we prove a high-probability bound on infinite-horizon reward degradation that quantifies how robustness depends on policy smoothness and failure persistence. Empirically, on MuJoCo continuous-control benchmarks with severe sensor dropout, we show Transformer-based sequence policies substantially outperform MLP, RNN, and SSM baselines in robustness, maintaining high returns even when large fractions of sensors are unavailable. These results demonstrate that temporal sequence reasoning provides a principled and practical mechanism for reliable operation under observation drift caused by sensor unreliability.

new Neuro-Symbolic Financial Reasoning via Deterministic Fact Ledgers and Adversarial Low-Latency Hallucination Detector

Authors: Pedram Agand

Abstract: Standard Retrieval-Augmented Generation (RAG) architectures fail in high-stakes financial domains due to two fundamental limitations: the inherent arithmetic incompetence of Large Language Models (LLMs) and the distributional semantic conflation of dense vector retrieval (e.g., mapping ``Net Income'' to ``Net Sales'' due to contextual proximity). In deterministic domains, a 99% accuracy rate yields 0% operational trust. To achieve zero-hallucination financial reasoning, we introduce the Verifiable Numerical Reasoning Agent (VeNRA). VeNRA shifts the RAG paradigm from retrieving probabilistic text to retrieving deterministic variables via a strictly typed Universal Fact Ledger (UFL), mathematically bounded by a novel Double-Lock Grounding algorithm. Recognizing that upstream parsing anomalies inevitably occur, we introduce the VeNRA Sentinel: a 3-billion parameter SLM trained to forensically audit Python execution traces with only one token test budget. To train this model, we avoid traditional generative hallucination datasets in favor of Adversarial Simulation, programmatically sabotaging golden financial records to simulate production-level ``Ecological Errors'' (e.g., Logic Code Lies and Numeric Neighbor Traps). Finally, to optimize the Sentinel under strict latency budgets, we utilize a single-pass classification paradigm with optional post thinking for debug. We identify the phenomenon of Loss Dilution in Reverse-Chain-of-Thought training and present a novel, OOM-safe Micro-Chunking loss algorithm to stabilize gradients under extreme differential penalization.

new Direct Estimation of Tree Volume and Aboveground Biomass Using Deep Regression with Synthetic Lidar Data

Authors: Habib Pourdelan, Zhengkang Xiang, Hugh Stewart, Cam Nicholson, Martin Tomko, Kourosh Khoshelham

Abstract: Accurate estimation of forest biomass is crucial for monitoring carbon sequestration and informing climate change mitigation strategies. Existing methods often rely on allometric models, which estimate individual tree biomass by relating it to measurable biophysical parameters, e.g., trunk diameter and height. This indirect approach is limited in accuracy due to measurement uncertainties and the inherently approximate nature of allometric equations, which may not fully account for the variability in tree characteristics and forest conditions. This study proposes a direct approach that leverages synthetic point cloud data to train a deep regression network, which is then applied to real point clouds for plot-level wood volume and aboveground biomass (AGB) estimation. We created synthetic 3D forest plots with ground truth volume, which were then converted into point cloud data using a lidar simulator. These point clouds were subsequently used to train deep regression networks based on PointNet, PointNet++, DGCNN, and PointConv. When applied to synthetic data, the deep regression networks achieved mean absolute percentage error (MAPE) values ranging from 1.69% to 8.11%. The trained networks were then applied to real lidar data to estimate volume and AGB. When compared against field measurements, our direct approach showed discrepancies of 2% to 20%. In contrast, indirect approaches based on individual tree segmentation followed by allometric conversion, as well as FullCAM, exhibited substantially large underestimation, with discrepancies ranging from 27% to 85%. Our results highlight the potential of integrating synthetic data with deep learning for efficient and scalable forest carbon estimation at plot level.

new Engineering Regression Without Real-Data Training: Domain Adaptation for Tabular Foundation Models Using Multi-Dataset Embeddings

Authors: Lyle Regenwetter, Rosen Yu, Cyril Picard, Faez Ahmed

Abstract: Predictive modeling in engineering applications has long been dominated by bespoke models and small, siloed tabular datasets, limiting the applicability of large-scale learning approaches. Despite recent progress in tabular foundation models, the resulting synthetic training distributions used for pre-training may not reflect the statistical structure of engineering data, limiting transfer to engineering regression. We introduce TREDBench, a curated collection of 83 real-world tabular regression datasets with expert engineering/non-engineering labels, and use TabPFN 2.5's dataset-level embedding to study domain structure in a common representation space. We find that engineering datasets are partially distinguishable from non-engineering datasets, while standard procedurally generated datasets are highly distinguishable from engineering datasets, revealing a substantial synthetic-real domain gap. To bridge this gap without training on real engineering samples, we propose an embedding-guided synthetic data curation method: we generate and identify "engineering-like" synthetic datasets, and perform continued pre-training of TabPFN 2.5 using only the selected synthetic tasks. Across 35 engineering regression datasets, this synthetic-only adaptation improves predictive accuracy and data efficiency, outperforming TabPFN 2.5 on 29/35 datasets and AutoGluon on 27/35, with mean multiplicative data-efficiency gains of 1.75x and 4.44x, respectively. More broadly, our results indicate that principled synthetic data curation can convert procedural generators into domain-relevant "data engines," enabling foundation models to improve in data-sparse scientific and industrial domains where real data collection is the primary bottleneck.

new Implicit Bias and Loss of Plasticity in Matrix Completion: Depth Promotes Low-Rankness

Authors: Baekrok Shin, Chulhee Yun

Abstract: We study matrix completion via deep matrix factorization (a.k.a. deep linear neural networks) as a simplified testbed to examine how network depth influences training dynamics. Despite the simplicity and importance of the problem, prior theory largely focuses on shallow (depth-2) models and does not fully explain the implicit low-rank bias observed in deeper networks. We identify coupled dynamics as a key mechanism behind this bias and show that it intensifies with increasing depth. Focusing on gradient flow under block-diagonal observations, we prove: (a) networks of depth $\geq 3$ exhibit coupling unless initialized diagonally, and (b) convergence to rank-1 occurs if and only if the dynamics is coupled -- resolving an open question by Menon (2024) for a family of initializations. We also revisit the loss of plasticity phenomenon in matrix completion (Kleinman et al., 2024), where pre-training on few observations and resuming with more degrades performance. We show that deep models avoid plasticity loss due to their low-rank bias, whereas depth-2 networks pre-trained under decoupled dynamics fail to converge to low-rank, even when resumed training (with additional data) satisfies the coupling condition -- shedding light on the mechanism behind this phenomenon.

new Probabilistic Dreaming for World Models

Authors: Gavin Wong

Abstract: "Dreaming" enables agents to learn from imagined experiences, enabling more robust and sample-efficient learning of world models. In this work, we consider innovations to the state-of-the-art Dreamer model using probabilistic methods that enable: (1) the parallel exploration of many latent states; and (2) maintaining distinct hypotheses for mutually exclusive futures while retaining the desirable gradient properties of continuous latents. Evaluating on the MPE SimpleTag domain, our method outperforms standard Dreamer with a 4.5% score improvement and 28% lower variance in episode returns. We also discuss limitations and directions for future work, including how optimal hyperparameters (e.g. particle count K) scale with environmental complexity, and methods to capture epistemic uncertainty in world models.

new Count Bridges enable Modeling and Deconvolving Transcriptomic Data

Authors: Nic Fishman, Gokul Gowri, Tanush Kumar, Jiaqi Lu, Valentin de Bortoli, Jonathan S. Gootenberg, Omar Abudayyeh

Abstract: Many modern biological assays, including RNA sequencing, yield integer-valued counts that reflect the number of molecules detected. These measurements are often not at the desired resolution: while the unit of interest is typically a single cell, many measurement technologies produce counts aggregated over sets of cells. Although recent generative frameworks such as diffusion and flow matching have been extended to non-Euclidean and discrete settings, it remains unclear how best to model integer-valued data or how to systematically deconvolve aggregated observations. We introduce Count Bridges, a stochastic bridge process on the integers that provides an exact, tractable analogue of diffusion-style models for count data, with closed-form conditionals for efficient training and sampling. We extend this framework to enable direct training from aggregated measurements via an Expectation-Maximization-style approach that treats unit-level counts as latent variables. We demonstrate state-of-the-art performance on integer distribution matching benchmarks, comparing against flow matching and discrete flow matching baselines across various metrics. We then apply Count Bridges to two large-scale problems in biology: modeling single-cell gene expression data at the nucleotide resolution, with applications to deconvolving bulk RNA-seq, and resolving multicellular spatial transcriptomic spots into single-cell count profiles. Our methods offer a principled foundation for generative modeling and deconvolution of biological count data across scales and modalities.

new When Priors Backfire: On the Vulnerability of Unlearnable Examples to Pretraining

Authors: Zhihao Li, Gezheng Xu, Jiale Cai, Ruiyi Fang, Di Wu, Qicheng Lao, Charles Ling, Boyu Wang

Abstract: Unlearnable Examples (UEs) serve as a data protection strategy that generates imperceptible perturbations to mislead models into learning spurious correlations instead of underlying semantics. In this paper, we uncover a fundamental vulnerability of UEs that emerges when learning starts from a pretrained model. Crucially, our empirical analysis shows that even when data are protected by carefully crafted perturbations, pretraining priors still furnish rich semantic representations that allow the model to circumvent the shortcuts introduced by UEs and capture genuine features, thereby nullifying unlearnability. To address this, we propose BAIT (Binding Artificial perturbations to Incorrect Targets), a novel bi-level optimization formulation. Specifically, the inner level aims at associating the perturbed samples with real labels to simulate standard data-label alignment, while the outer level actively disrupts this alignment by enforcing a mislabel-perturbation binding that maps samples to designated incorrect targets. This mechanism effectively overrides the semantic guidance of priors, forcing the model to rely on the injected perturbations and consequently preventing the acquisition of true semantics. Extensive experiments on standard benchmarks and multiple pretrained backbones demonstrate that BAIT effectively mitigates the influence of pretraining priors and maintains data unlearnability.

new Distribution-Conditioned Transport

Authors: Nic Fishman, Gokul Gowri, Paolo L. B. Fischer, Marinka Zitnik, Omar Abudayyeh, Jonathan Gootenberg

Abstract: Learning a transport model that maps a source distribution to a target distribution is a canonical problem in machine learning, but scientific applications increasingly require models that can generalize to source and target distributions unseen during training. We introduce distribution-conditioned transport (DCT), a framework that conditions transport maps on learned embeddings of source and target distributions, enabling generalization to unseen distribution pairs. DCT also allows semi-supervised learning for distributional forecasting problems: because it learns from arbitrary distribution pairs, it can leverage distributions observed at only one condition to improve transport prediction. DCT is agnostic to the underlying transport mechanism, supporting models ranging from flow matching to distributional divergence-based models (e.g. Wasserstein, MMD). We demonstrate the practical performance benefits of DCT on synthetic benchmarks and four applications in biology: batch effect transfer in single-cell genomics, perturbation prediction from mass cytometry data, learning clonal transcriptional dynamics in hematopoiesis, and modeling T-cell receptor sequence evolution.

new KindSleep: Knowledge-Informed Diagnosis of Obstructive Sleep Apnea from Oximetry

Authors: Micky C Nnamdi, Wenqi Shi, Cheng Wan, J. Ben Tamo, Benjamin M Smith, Chad A Purnell, May D Wang

Abstract: Obstructive sleep apnea (OSA) is a sleep disorder that affects nearly one billion people globally and significantly elevates cardiovascular risk. Traditional diagnosis through polysomnography is resource-intensive and limits widespread access, creating a critical need for accurate and efficient alternatives. In this paper, we introduce KindSleep, a deep learning framework that integrates clinical knowledge with single-channel patient-specific oximetry signals and clinical data for precise OSA diagnosis. KindSleep first learns to identify clinically interpretable concepts, such as desaturation indices and respiratory disturbance events, directly from raw oximetry signals. It then fuses these AI-derived concepts with multimodal clinical data to estimate the Apnea-Hypopnea Index (AHI). We evaluate KindSleep on three large, independent datasets from the National Sleep Research Resource (SHHS, CFS, MrOS; total n = 9,815). KindSleep demonstrates excellent performance in estimating AHI scores (R2 = 0.917, ICC = 0.957) and consistently outperforms existing approaches in classifying OSA severity, achieving weighted F1-scores from 0.827 to 0.941 across diverse populations. By grounding its predictions in a layer of clinically meaningful concepts, KindSleep provides a more transparent and trustworthy diagnostic tool for sleep medicine practices.

new ConTSG-Bench: A Unified Benchmark for Conditional Time Series Generation

Authors: Shaocheng Lan, Shuqi Gu, Zhangzhi Xiong, Kan Ren

Abstract: Conditional time series generation plays a critical role in addressing data scarcity and enabling causal analysis in real-world applications. Despite its increasing importance, the field lacks a standardized and systematic benchmarking framework for evaluating generative models across diverse conditions. To address this gap, we introduce the Conditional Time Series Generation Benchmark (ConTSG-Bench). ConTSG-Bench comprises a large-scale, well-aligned dataset spanning diverse conditioning modalities and levels of semantic abstraction, first enabling systematic evaluation of representative generation methods across these dimensions with a comprehensive suite of metrics for generation fidelity and condition adherence. Both the quantitative benchmarking and in-depth analyses of conditional generation behaviors have revealed the traits and limitations of the current approaches, highlighting critical challenges and promising research directions, particularly with respect to precise structural controllability and downstream task utility under complex conditions.

new Distributional Reinforcement Learning with Information Bottleneck for Uncertainty-Aware DRAM Equalization

Authors: Muhammad Usama, Dong Eui Chang

Abstract: Equalizer parameter optimization is critical for signal integrity in high-speed memory systems operating at multi-gigabit data rates. However, existing methods suffer from computationally expensive eye diagram evaluation, optimization of expected rather than worst-case performance, and absence of uncertainty quantification for deployment decisions. In this paper, we propose a distributional risk-sensitive reinforcement learning framework integrating Information Bottleneck latent representations with Conditional Value-at-Risk optimization. We introduce rate-distortion optimal signal compression achieving 51 times speedup over eye diagrams while quantifying epistemic uncertainty through Monte Carlo dropout. Distributional reinforcement learning with quantile regression enables explicit worst-case optimization, while PAC-Bayesian regularization certifies generalization bounds. Experimental validation on 2.4 million waveforms from eight memory units demonstrated mean improvements of 37.1\% and 41.5\% for 4-tap and 8-tap equalizer configurations with worst-case guarantees of 33.8\% and 38.2\%, representing 80.7\% and 89.1\% improvements over Q-learning baselines. The framework achieved 62.5\% high-reliability classification eliminating manual validation for most configurations. These results suggest the proposed framework provides a practical solution for production-scale equalizer optimization with certified worst-case guarantees.

new Distributional Equivalence in Linear Non-Gaussian Latent-Variable Cyclic Causal Models: Characterization and Learning

Authors: Haoyue Dai, Immanuel Albrecht, Peter Spirtes, Kun Zhang

Abstract: Causal discovery with latent variables is a fundamental task. Yet most existing methods rely on strong structural assumptions, such as enforcing specific indicator patterns for latents or restricting how they can interact with others. We argue that a core obstacle to a general, structural-assumption-free approach is the lack of an equivalence characterization: without knowing what can be identified, one generally cannot design methods for how to identify it. In this work, we aim to close this gap for linear non-Gaussian models. We establish the graphical criterion for when two graphs with arbitrary latent structure and cycles are distributionally equivalent, that is, they induce the same observed distribution set. Key to our approach is a new tool, edge rank constraints, which fills a missing piece in the toolbox for latent-variable causal discovery in even broader settings. We further provide a procedure to traverse the whole equivalence class and develop an algorithm to recover models from data up to such equivalence. To our knowledge, this is the first equivalence characterization with latent variables in any parametric setting without structural assumptions, and hence the first structural-assumption-free discovery method. Code and an interactive demo are available at https://equiv.cc.

URLs: https://equiv.cc.

new Diffusion Policy through Conditional Proximal Policy Optimization

Authors: Ben Liu, Shunpeng Yang, Hua Chen

Abstract: Reinforcement learning (RL) has been extensively employed in a wide range of decision-making problems, such as games and robotics. Recently, diffusion policies have shown strong potential in modeling multi-modal behaviors, enabling more diverse and flexible action generation compared to the conventional Gaussian policy. Despite various attempts to combine RL with diffusion, a key challenge is the difficulty of computing action log-likelihood under the diffusion model. This greatly hinders the direct application of diffusion policies in on-policy reinforcement learning. Most existing methods calculate or approximate the log-likelihood through the entire denoising process in the diffusion model, which can be memory- and computationally inefficient. To overcome this challenge, we propose a novel and efficient method to train a diffusion policy in an on-policy setting that requires only evaluating a simple Gaussian probability. This is achieved by aligning the policy iteration with the diffusion process, which is a distinct paradigm compared to previous work. Moreover, our formulation can naturally handle entropy regularization, which is often difficult to incorporate into diffusion policies. Experiments demonstrate that the proposed method produces multimodal policy behaviors and achieves superior performance on a variety of benchmark tasks in both IsaacLab and MuJoCo Playground.

new Multilevel Training for Kolmogorov Arnold Networks

Authors: Ben S. Southworth, Jonas A. Actor, Graham Harper, Eric C. Cyr

Abstract: Algorithmic speedup of training common neural architectures is made difficult by the lack of structure guaranteed by the function compositions inherent to such networks. In contrast to multilayer perceptrons (MLPs), Kolmogorov-Arnold networks (KANs) provide more structure by expanding learned activations in a specified basis. This paper exploits this structure to develop practical algorithms and theoretical insights, yielding training speedup via multilevel training for KANs. To do so, we first establish an equivalence between KANs with spline basis functions and multichannel MLPs with power ReLU activations through a linear change of basis. We then analyze how this change of basis affects the geometry of gradient-based optimization with respect to spline knots. The KANs change-of-basis motivates a multilevel training approach, where we train a sequence of KANs naturally defined through a uniform refinement of spline knots with analytic geometric interpolation operators between models. The interpolation scheme enables a ``properly nested hierarchy'' of architectures, ensuring that interpolation to a fine model preserves the progress made on coarse models, while the compact support of spline basis functions ensures complementary optimization on subsequent levels. Numerical experiments demonstrate that our multilevel training approach can achieve orders of magnitude improvement in accuracy over conventional methods to train comparable KANs or MLPs, particularly for physics informed neural networks. Finally, this work demonstrates how principled design of neural networks can lead to exploitable structure, and in this case, multilevel algorithms that can dramatically improve training performance.

new Missingness Bias Calibration in Feature Attribution Explanations

Authors: Shailesh Sridhar, Anton Xue, Eric Wong

Abstract: Popular explanation methods often produce unreliable feature importance scores due to missingness bias, a systematic distortion that arises when models are probed with ablated, out-of-distribution inputs. Existing solutions treat this as a deep representational flaw that requires expensive retraining or architectural modifications. In this work, we challenge this assumption and show that missingness bias can be effectively treated as a superficial artifact of the model's output space. We introduce MCal, a lightweight post-hoc method that corrects this bias by fine-tuning a simple linear head on the outputs of a frozen base model. Surprisingly, we find this simple correction consistently reduces missingness bias and is competitive with, or even outperforms, prior heavyweight approaches across diverse medical benchmarks spanning vision, language, and tabular domains.

new Why Is RLHF Alignment Shallow? A Gradient Analysis

Authors: Robin Young

Abstract: Why is safety alignment in LLMs shallow? We prove that gradient-based alignment inherently concentrates on positions where harm is decided and vanishes beyond. Using a martingale decomposition of sequence-level harm, we derive an exact characterization of alignment gradients. The gradient at position $t$ equals the covariance between the conditional expected harm and the score function. This implies that positions beyond the harm horizon where the output's harmfulness is already determined receive zero gradient signal during training. This explains empirical observations that KL divergence between aligned and base models concentrates on early tokens. Consequently, standard alignment objectives cannot produce deep alignment, regardless of optimization quality. We introduce the concept of harm information $I_t$, which quantifies each position's influence on harm, and prove that equilibrium KL divergence tracks this quantity. Finally, we derive an objective based on recovery penalties that creates gradient signal at all positions, providing theoretical grounding for empirically successful data augmentation techniques.

new Differential Privacy in Two-Layer Networks: How DP-SGD Harms Fairness and Robustness

Authors: Ruichen Xu, Kexin Chen

Abstract: Differentially private learning is essential for training models on sensitive data, but empirical studies consistently show that it can degrade performance, introduce fairness issues like disparate impact, and reduce adversarial robustness. The theoretical underpinnings of these phenomena in modern, non-convex neural networks remain largely unexplored. This paper introduces a unified feature-centric framework to analyze the feature learning dynamics of differentially private stochastic gradient descent (DP-SGD) in two-layer ReLU convolutional neural networks. Our analysis establishes test loss bounds governed by a crucial metric: the feature-to-noise ratio (FNR). We demonstrate that the noise required for privacy leads to suboptimal feature learning, and specifically show that: 1) imbalanced FNRs across classes and subpopulations cause disparate impact; 2) even in the same class, noise has a greater negative impact on semantically long-tailed data; and 3) noise injection exacerbates vulnerability to adversarial attacks. Furthermore, our analysis reveals that the popular paradigm of public pre-training and private fine-tuning does not guarantee improvement, particularly under significant feature distribution shifts between datasets. Experiments on synthetic and real-world data corroborate our theoretical findings.

new FedAFD: Multimodal Federated Learning via Adversarial Fusion and Distillation

Authors: Min Tan, Junchao Ma, Yinfu Feng, Jiajun Ding, Wenwen Pan, Tingting Han, Qian Zheng, Zhenzhong Kuang, Zhou Yu

Abstract: Multimodal Federated Learning (MFL) enables clients with heterogeneous data modalities to collaboratively train models without sharing raw data, offering a privacy-preserving framework that leverages complementary cross-modal information. However, existing methods often overlook personalized client performance and struggle with modality/task discrepancies, as well as model heterogeneity. To address these challenges, we propose FedAFD, a unified MFL framework that enhances client and server learning. On the client side, we introduce a bi-level adversarial alignment strategy to align local and global representations within and across modalities, mitigating modality and task gaps. We further design a granularity-aware fusion module to integrate global knowledge into the personalized features adaptively. On the server side, to handle model heterogeneity, we propose a similarity-guided ensemble distillation mechanism that aggregates client representations on shared public data based on feature similarity and distills the fused knowledge into the global model. Extensive experiments conducted under both IID and non-IID settings demonstrate that FedAFD achieves superior performance and efficiency for both the client and the server.

new U-Parking: Distributed UWB-Assisted Autonomous Parking System with Robust Localization and Intelligent Planning

Authors: Yiang Wu, Qiong Wu, Pingyi Fan, Kezhi Wang, Wen Chen, Guoqiang Mao, Khaled B. Letaief

Abstract: This demonstration presents U-Parking, a distributed Ultra-Wideband (UWB)-assisted autonomous parking system. By integrating Large Language Models (LLMs)-assisted planning with robust fusion localization and trajectory tracking, it enables reliable automated parking in challenging indoor environments, as validated through real-vehicle demonstrations.

new EVMbench: Evaluating AI Agents on Smart Contract Security

Authors: Justin Wang, Andreas Bigger, Xiaohai Xu, Justin W. Lin, Andy Applebaum, Tejal Patwardhan, Alpin Yukseloglu, Olivia Watkins

Abstract: Smart contracts on public blockchains now manage large amounts of value, and vulnerabilities in these systems can lead to substantial losses. As AI agents become more capable at reading, writing, and running code, it is natural to ask how well they can already navigate this landscape, both in ways that improve security and in ways that might increase risk. We introduce EVMbench, an evaluation that measures the ability of agents to detect, patch, and exploit smart contract vulnerabilities. EVMbench draws on 117 curated vulnerabilities from 40 repositories and, in the most realistic setting, uses programmatic grading based on tests and blockchain state under a local Ethereum execution environment. We evaluate a range of frontier agents and find that they are capable of discovering and exploiting vulnerabilities end-to-end against live blockchain instances. We release code, tasks, and tooling to support continued measurement of these capabilities and future work on security.

new BandPO: Bridging Trust Regions and Ratio Clipping via Probability-Aware Bounds for LLM Reinforcement Learning

Authors: Yuan Li, Bo Wang, Yufei Gao, Yuqian Yao, Xinyuan Wang, Zhangyue Yin, Xipeng Qiu

Abstract: Proximal constraints are fundamental to the stability of the Large Language Model reinforcement learning. While the canonical clipping mechanism in PPO serves as an efficient surrogate for trust regions, we identify a critical bottleneck: fixed bounds strictly constrain the upward update margin of low-probability actions, disproportionately suppressing high-advantage tail strategies and inducing rapid entropy collapse. To address this, we introduce Band-constrained Policy Optimization (BandPO). BandPO replaces canonical clipping with Band, a unified theoretical operator that projects trust regions defined by f-divergences into dynamic, probability-aware clipping intervals. Theoretical analysis confirms that Band effectively resolves this exploration bottleneck. We formulate this mapping as a convex optimization problem, guaranteeing a globally optimal numerical solution while deriving closed-form solutions for specific divergences. Extensive experiments across diverse models and datasets demonstrate that BandPO consistently outperforms canonical clipping and Clip-Higher, while robustly mitigating entropy collapse.

new Semantic Communication-Enhanced Split Federated Learning for Vehicular Networks: Architecture, Challenges, and Case Study

Authors: Lu Yu, Zheng Chang, Ying-Chang Liang

Abstract: Vehicular edge intelligence (VEI) is vital for future intelligent transportation systems. However, traditional centralized learning in dynamic vehicular networks faces significant communication overhead and privacy risks. Split federated learning (SFL) offers a distributed solution but is often hindered by substantial communication bottlenecks from transmitting high-dimensional intermediate features and can present label privacy concerns. Semantic communication offers a transformative approach to alleviate these communication challenges in SFL by focusing on transmitting only task-relevant information. This paper leverages the advantages of semantic communication in the design of SFL, and presents a case study the semantic communication-enhanced U-Shaped split federated learning (SC-USFL) framework that inherently enhances label privacy by localizing sensitive computations with reduced overhead. It features a dedicated semantic communication module (SCM), with pre-trained and parameter-frozen encoding/decoding units, to efficiently compress and transmit only the task-relevant semantic information over the critical uplink path from vehicular users to the edge server (ES). Furthermore, a network status monitor (NSM) module enables adaptive adjustment of the semantic compression rate in real-time response to fluctuating wireless channel conditions. The SC-USFL framework demonstrates a promising approach for efficiently balancing communication load, preserving privacy, and maintaining learning performance in resource-constrained vehicular environments. Finally, this paper highlights key open research directions to further advance the synergy between semantic communication and SFL in the vehicular network.

new $\nabla$-Reasoner: LLM Reasoning via Test-Time Gradient Descent in Latent Space

Authors: Peihao Wang, Ruisi Cai, Zhen Wang, Hongyuan Mei, Qiang Liu, Pan Li, Zhangyang Wang

Abstract: Scaling inference-time compute for Large Language Models (LLMs) has unlocked unprecedented reasoning capabilities. However, existing inference-time scaling methods typically rely on inefficient and suboptimal discrete search algorithms or trial-and-error prompting to improve the online policy. In this paper, we propose $\nabla$-Reasoner, an iterative generation framework that integrates differentiable optimization over token logits into the decoding loop to refine the policy on the fly. Our core component, Differentiable Textual Optimization (DTO), leverages gradient signals from both the LLM's likelihood and a reward model to refine textual representations. $\nabla$-Reasoner further incorporates rejection sampling and acceleration design to robustify and speed up decoding. Theoretically, we show that performing inference-time gradient descent in the sample space to maximize reward is dual to aligning an LLM policy via KL-regularized reinforcement learning. Empirically, $\nabla$-Reasoner achieves over 20% accuracy improvement on a challenging mathematical reasoning benchmark, while reducing number of model calls by approximately 10-40% compared to strong baselines. Overall, our work introduces a paradigm shift from zeroth-order search to first-order optimization at test time, offering a cost-effective path to amplify LLM reasoning.

new Uncertainty-aware Blood Glucose Prediction from Continuous Glucose Monitoring Data

Authors: Hai Siong Tan

Abstract: In this work, we investigate uncertainty-aware neural network models for blood glucose prediction and adverse glycemic event identification in Type 1 diabetes. We consider three families of sequence models based on LSTM, GRU, and Transformer architectures, with uncertainty quantification enabled by either Monte Carlo dropout or through evidential output layers compatible with Deep Evidential Regression. Using the HUPA-UCM diabetes dataset for validation, we find that Transformer-based models equipped with evidential output heads provide the most effective uncertainty-aware framework, achieving consistently higher predictive accuracies and better-calibrated uncertainty estimates whose magnitudes significantly correlate with prediction errors. We further evaluate the clinical risk of each model using the recently proposed Diabetes Technology Society error grid, with risk categories defined by international expert consensus. Our results demonstrate the value of integrating principled uncertainty quantification into real-time machine-learning-based blood glucose prediction systems.

new WaterSIC: information-theoretically (near) optimal linear layer quantization

Authors: Egor Lifar, Semyon Savkin, Or Ordentlich, Yury Polyanskiy

Abstract: This paper considers the problem of converting a given dense linear layer to low precision. The tradeoff between compressed length and output discrepancy is analyzed information theoretically (IT). It is shown that a popular GPTQ algorithm may have an arbitrarily large gap to the IT limit. To alleviate this problem, a novel algorithm, termed ''WaterSIC'', is proposed and is shown to be within a rate gap of 0.255 bits to the IT limit, uniformly over all possible covariance matrices of input activations. The key innovation of WaterSIC's is to allocate different quantization rates to different columns (in-features) of the weight matrix, mimicking the classical IT solution known as ''waterfilling''. Applying WaterSIC to the Llama and Qwen family of LLMs establishes new state-of-the-art performance for all quantization rates from 1 to 4 bits.

new Mixture of Universal Experts: Scaling Virtual Width via Depth-Width Transformation

Authors: Yilong Chen, Naibin Gu, Junyuan Shang, Zhenyu Zhang, Yuchen Feng, Jiawei Sheng, Tingwen Liu, Shuohuan Wang, Yu Sun, Hua Wu, Haifeng Wang

Abstract: Mixture-of-Experts (MoE) decouples model capacity from per-token computation, yet their scalability remains limited by the physical dimensions of depth and width. To overcome this, we propose Mixture of Universal Experts (MOUE),a MoE generalization introducing a novel scaling dimension: Virtual Width. In general, MoUE aims to reuse a universal layer-agnostic expert pool across layers, converting depth into virtual width under a fixed per-token activation budget. However, two challenges remain: a routing path explosion from recursive expert reuse, and a mismatch between the exposure induced by reuse and the conventional load-balancing objectives. We address these with three core components: a Staggered Rotational Topology for structured expert sharing, a Universal Expert Load Balance for depth-aware exposure correction, and a Universal Router with lightweight trajectory state for coherent multi-step routing. Empirically, MoUE consistently outperforms matched MoE baselines by up to 1.3% across scaling regimes, enables progressive conversion of existing MoE checkpoints with up to 4.2% gains, and reveals a new scaling dimension for MoE architectures.

new Functionality-Oriented LLM Merging on the Fisher--Rao Manifold

Authors: Jiayu Wang, Zuojun Ye, Wenpeng Yin

Abstract: Weight-space merging aims to combine multiple fine-tuned LLMs into a single model without retraining, yet most existing approaches remain fundamentally parameter-space heuristics. This creates three practical limitations. First, linear averaging, task vectors, and related rules operate on Euclidean coordinates, even though the desired goal is to merge functionality, i.e., predictive behaviors across tasks. Second, when the source checkpoints are farther apart or more heterogeneous, Euclidean blends often trigger representation collapse, manifested as activation variance shrinkage and effective-rank degradation, which sharply degrades accuracy. Third, many geometry-inspired methods are most natural for two-model interpolation and do not extend cleanly to merging N>2 experts with a principled objective. We address these issues by formulating model merging as computing a weighted Karcher mean on the Fisher--Rao manifold, which is locally equivalent to minimizing a KL-based function distance between predictive distributions. We derive a practical fixed-point algorithm using a lightweight spherical proxy that preserves norms and generalizes directly to multi-expert merging. Across various benchmarks and collapse diagnostics, our method remains stable as the number and heterogeneity of merged models increase, consistently outperforming prior baselines.

new Lightweight and Scalable Transfer Learning Framework for Load Disaggregation

Authors: L. E. Garcia-Marrero, G. Petrone, E. Monmasson

Abstract: Non-Intrusive Load Monitoring (NILM) aims to estimate appliance-level consumption from aggregate electrical signals recorded at a single measurement point. In recent years, the field has increasingly adopted deep learning approaches; however, cross-domain generalization remains a persistent challenge due to variations in appliance characteristics, usage patterns, and background loads across homes. Transfer learning provides a practical paradigm to adapt models with limited target data. However, existing methods often assume a fixed appliance set, lack flexibility for evolving real-world deployments, remain unsuitable for edge devices, or scale poorly for real-time operation. This paper proposes RefQuery, a scalable multi-appliance, multi-task NILM framework that conditions disaggregation on compact appliance fingerprints, allowing one shared model to serve many appliances without a fixed output set. RefQuery keeps a pretrained disaggregation network fully frozen and adapts to a target home by learning only a per-appliance embedding during a lightweight backpropagation stage. Experiments on three public datasets demonstrate that RefQuery delivers a strong accuracy-efficiency trade-off against single-appliance and multi-appliance baselines, including modern Transformer-based methods. These results support RefQuery as a practical path toward scalable, real-time NILM on resource-constrained edge devices.

new Competitive Multi-Operator Reinforcement Learning for Joint Pricing and Fleet Rebalancing in AMoD Systems

Authors: Emil Kragh Toft, Carolin Schmidt, Daniele Gammelli, Filipe Rodrigues

Abstract: Autonomous Mobility-on-Demand (AMoD) systems promise to revolutionize urban transportation by providing affordable on-demand services to meet growing travel demand. However, realistic AMoD markets will be competitive, with multiple operators competing for passengers through strategic pricing and fleet deployment. While reinforcement learning has shown promise in optimizing single-operator AMoD control, existing work fails to capture competitive market dynamics. We investigate the impact of competition on policy learning by introducing a multi-operator reinforcement learning framework where two operators simultaneously learn pricing and fleet rebalancing policies. By integrating discrete choice theory, we enable passenger allocation and demand competition to emerge endogenously from utility-maximizing decisions. Experiments using real-world data from multiple cities demonstrate that competition fundamentally alters learned behaviors, leading to lower prices and distinct fleet positioning patterns compared to monopolistic settings. Notably, we demonstrate that learning-based approaches are robust to the additional stochasticity of competition, with competitive agents successfully converging to effective policies while accounting for partially unobserved competitor strategies.

new Non-Euclidean Gradient Descent Operates at the Edge of Stability

Authors: Rustem Islamov, Michael Crawshaw, Jeremy Cohen, Robert Gower

Abstract: The Edge of Stability (EoS) is a phenomenon where the sharpness (largest eigenvalue) of the Hessian converges to $2/\eta$ during training with gradient descent (GD) with a step-size $\eta$. Despite (apparently) violating classical smoothness assumptions, EoS has been widely observed in deep learning, but its theoretical foundations remain incomplete. We provide an interpretation of EoS through the lens of Directional Smoothness Mishkin et al. [2024]. This interpretation naturally extends to non-Euclidean norms, which we use to define generalized sharpness under an arbitrary norm. Our generalized sharpness measure includes previously studied vanilla GD and preconditioned GD as special cases, as well as methods for which EoS has not been studied, such as $\ell_{\infty}$-descent, Block CD, Spectral GD, and Muon without momentum. Through experiments on neural networks, we show that non-Euclidean GD with our generalized sharpness also exhibits progressive sharpening followed by oscillations around or above the threshold $2/\eta$. Practically, our framework provides a single, geometry-aware spectral measure that works across optimizers.

new Poisoning the Inner Prediction Logic of Graph Neural Networks for Clean-Label Backdoor Attacks

Authors: Yuxiang Zhang, Bin Ma, Enyan Dai

Abstract: Graph Neural Networks (GNNs) have achieved remarkable results in various tasks. Recent studies reveal that graph backdoor attacks can poison the GNN model to predict test nodes with triggers attached as the target class. However, apart from injecting triggers to training nodes, these graph backdoor attacks generally require altering the labels of trigger-attached training nodes into the target class, which is impractical in real-world scenarios. In this work, we focus on the clean-label graph backdoor attack, a realistic but understudied topic where training labels are not modifiable. According to our preliminary analysis, existing graph backdoor attacks generally fail under the clean-label setting. Our further analysis identifies that the core failure of existing methods lies in their inability to poison the prediction logic of GNN models, leading to the triggers being deemed unimportant for prediction. Therefore, we study a novel problem of effective clean-label graph backdoor attacks by poisoning the inner prediction logic of GNN models. We propose BA-Logic to solve the problem by coordinating a poisoned node selector and a logic-poisoning trigger generator. Extensive experiments on real-world datasets demonstrate that our method effectively enhances the attack success rate and surpasses state-of-the-art graph backdoor attack competitors under clean-label settings. Our code is available at https://anonymous.4open.science/r/BA-Logic

URLs: https://anonymous.4open.science/r/BA-Logic

new MCEL: Margin-Based Cross-Entropy Loss for Error-Tolerant Quantized Neural Networks

Authors: Mikail Yayla, Akash Kumar

Abstract: Robustness to bit errors is a key requirement for the reliable use of neural networks (NNs) on emerging approximate computing platforms and error-prone memory technologies. A common approach to achieve bit error tolerance in NNs is injecting bit flips during training according to a predefined error model. While effective in certain scenarios, training-time bit flip injection introduces substantial computational overhead, often degrades inference accuracy at high error rates, and scales poorly for larger NN architectures. These limitations make error injection an increasingly impractical solution for ensuring robustness on future approximate computing platforms and error-prone memory technologies. In this work, we investigate the mechanisms that enable NNs to tolerate bit errors without relying on error-aware training. We establish a direct connection between bit error tolerance and classification margins at the output layer. Building on this insight, we propose a novel loss function, the Margin Cross-Entropy Loss (MCEL), which explicitly promotes logit-level margin separation while preserving the favorable optimization properties of the standard cross-entropy loss. Furthermore, MCEL introduces an interpretable margin parameter that allows robustness to be tuned in a principled manner. Extensive experimental evaluations across multiple datasets of varying complexity, diverse NN architectures, and a range of quantization schemes demonstrate that MCEL substantially improves bit error tolerance, up to 15 % in accuracy for an error rate of 1 %. Our proposed MCEL method is simple to implement, efficient, and can be integrated as a drop-in replacement for standard CEL. It provides a scalable and principled alternative to training-time bit flip injection, offering new insights into the origins of NN robustness and enabling more efficient deployment on approximate computing and memory systems.

new Asymptotic Behavior of Multi--Task Learning: Implicit Regularization and Double Descent Effects

Authors: Ayed M. Alrashdi, Oussama Dhifallah, Houssem Sifaou

Abstract: Multi--task learning seeks to improve the generalization error by leveraging the common information shared by multiple related tasks. One challenge in multi--task learning is identifying formulations capable of uncovering the common information shared between different but related tasks. This paper provides a precise asymptotic analysis of a popular multi--task formulation associated with misspecified perceptron learning models. The main contribution of this paper is to precisely determine the reasons behind the benefits gained from combining multiple related tasks. Specifically, we show that combining multiple tasks is asymptotically equivalent to a traditional formulation with additional regularization terms that help improve the generalization performance. Another contribution is to empirically study the impact of combining tasks on the generalization error. In particular, we empirically show that the combination of multiple tasks postpones the double descent phenomenon and can mitigate it asymptotically.

new Deep Learning-Driven Friendly Jamming for Secure Multicarrier ISAC Under Channel Uncertainty

Authors: Bui Minh Tuan, Van-Dinh Nguyen, Diep N. Nguyen, Nguyen Linh Trung, Nguyen Van Huynh, Dinh Thai Hoang, Marwan Krunz, Eryk Dutkiewicz

Abstract: Integrated sensing and communication (ISAC) systems promise efficient spectrum utilization by jointly supporting radar sensing and wireless communication. This paper presents a deep learning-driven framework for enhancing physical-layer security in multicarrier ISAC systems under imperfect channel state information (CSI) and in the presence of unknown eavesdropper (Eve) locations. Unlike conventional ISAC-based friendly jamming (FJ) approaches that require Eve's CSI or precise angle-of-arrival (AoA) estimates, our method exploits radar echo feedback to guide directional jamming without explicit Eve's information. To enhance robustness to radar sensing uncertainty, we propose a radar-aware neural network that jointly optimizes beamforming and jamming by integrating a novel nonparametric Fisher Information Matrix (FIM) estimator based on f-divergence. The jamming design satisfies the Cramer-Rao lower bound (CRLB) constraints even in the presence of noisy AoA. For efficient implementation, we introduce a quantized tensor train-based encoder that reduces the model size by more than 100 times with negligible performance loss. We also integrate a non-overlapping secure scheme into the proposed framework, in which specific sub-bands can be dedicated solely to communication. Extensive simulations demonstrate that the proposed solution achieves significant improvements in secrecy rate, reduced block error rate (BLER), and strong robustness against CSI uncertainty and angular estimation errors, underscoring the effectiveness of the proposed deep learning-driven friendly jamming framework under practical ISAC impairments.

new Reward-Conditioned Reinforcement Learning

Authors: Michal Nauman, Marek Cygan, Pieter Abbeel

Abstract: RL agents are typically trained under a single, fixed reward function, which makes them brittle to reward misspecification and limits their ability to adapt to changing task preferences. We introduce Reward-Conditioned Reinforcement Learning (RCRL), a framework that trains a single agent to optimize a family of reward specifications while collecting experience under only one nominal objective. RCRL conditions the agent on reward parameterizations and learns multiple reward objectives from a shared replay data entirely off-policy, enabling a single policy to represent reward-specific behaviors. Across single-task, multi-task, and vision-based benchmarks, we show that RCRL not only improves performance under the nominal reward parameterization, but also enables efficient adaptation to new parameterizations. Our results demonstrate that RCRL provides a scalable mechanism for learning robust, steerable policies without sacrificing the simplicity of single-task training.

new Synchronization-based clustering on the unit hypersphere

Authors: Zinaid Kapi\'c, Aladin Crnki\'c, Goran Mau\v{s}a

Abstract: Clustering on the unit hypersphere is a fundamental problem in various fields, with applications ranging from gene expression analysis to text and image classification. Traditional clustering methods are not always suitable for unit sphere data, as they do not account for the geometric structure of the sphere. We introduce a novel algorithm for clustering data represented as points on the unit sphere $\mathbf{S}^{d-1}$. Our method is based on the $d$-dimensional generalized Kuramoto model. The effectiveness of the introduced method is demonstrated on synthetic and real-world datasets. Results are compared with some of the traditional clustering methods, showing that our method achieves similar or better results in terms of clustering accuracy.

new Aura: Universal Multi-dimensional Exogenous Integration for Aviation Time Series

Authors: Jiafeng Lin, Mengren Zheng, Simeng Ye, Yuxuan Wang, Huan Zhang, Yuhui Liu, Zhongyi Pei, Jianmin Wang

Abstract: Time series forecasting has witnessed an increasing demand across diverse industrial applications, where accurate predictions are pivotal for informed decision-making. Beyond numerical time series data, reliable forecasting in practical scenarios requires integrating diverse exogenous factors. Such exogenous information is often multi-dimensional or even multimodal, introducing heterogeneous interactions that unimodal time series models struggle to capture. In this paper, we delve into an aviation maintenance scenario and identify three distinct types of exogenous factors that influence temporal dynamics through distinct interaction modes. Based on this empirical insight, we propose Aura, a universal framework that explicitly organizes and encodes heterogeneous external information according to its interaction mode with the target time series. Specifically, Aura utilizes a tailored tripartite encoding mechanism to embed heterogeneous features into well-established time series models, ensuring seamless integration of non-sequential context. Extensive experiments on a large-scale, three-year industrial dataset from China Southern Airlines, covering the Boeing 777 and Airbus A320 fleets, demonstrate that Aura consistently achieves state-of-the-art performance across all baselines and exhibits superior adaptability. Our findings highlight Aura's potential as a general-purpose enhancement for aviation safety and reliability.

new Axiomatic On-Manifold Shapley via Optimal Generative Flows

Authors: Cenwei Zhang, Lin Zhu, Manxi Lin, Lei You

Abstract: Shapley-based attribution is critical for post-hoc XAI but suffers from off-manifold artifacts due to heuristic baselines. While generative methods attempt to address this, they often introduce geometric inefficiency and discretization drift. We propose a formal theory of on-manifold Aumann-Shapley attributions driven by optimal generative flows. We prove a representation theorem establishing the gradient line integral as the unique functional satisfying efficiency and geometric axioms, notably reparameterization invariance. To resolve path ambiguity, we select the kinetic-energy-minimizing Wasserstein-2 geodesic transporting a prior to the data distribution. This yields a canonical attribution family that recovers classical Shapley for additive models and admits provable stability bounds against flow approximation errors. By reframing baseline selection as a variational problem, our method experimentally outperforms baselines, achieving strict manifold adherence via vanishing Flow Consistency Error and superior semantic alignment characterized by Structure-Aware Total Variation. Our code is on https://github.com/cenweizhang/OTFlowSHAP.

URLs: https://github.com/cenweizhang/OTFlowSHAP.

new Decoupling Task and Behavior: A Two-Stage Reward Curriculum in Reinforcement Learning for Robotics

Authors: Kilian Freitag, Knut {\AA}kesson, Morteza Haghir Chehreghani

Abstract: Deep Reinforcement Learning is a promising tool for robotic control, yet practical application is often hindered by the difficulty of designing effective reward functions. Real-world tasks typically require optimizing multiple objectives simultaneously, necessitating precise tuning of their weights to learn a policy with the desired characteristics. To address this, we propose a two-stage reward curriculum where we decouple task-specific objectives from behavioral terms. In our method, we first train the agent on a simplified task-only reward function to ensure effective exploration before introducing the full reward that includes auxiliary behavior-related terms such as energy efficiency. Further, we analyze various transition strategies and demonstrate that reusing samples between phases is critical for training stability. We validate our approach on the DeepMind Control Suite, ManiSkill3, and a mobile robot environment, modified to include auxiliary behavioral objectives. Our method proves to be simple yet effective, substantially outperforming baselines trained directly on the full reward while exhibiting higher robustness to specific reward weightings.

new FedBCD:Communication-Efficient Accelerated Block Coordinate Gradient Descent for Federated Learning

Authors: Junkang Liu, Fanhua Shang, Yuanyuan Liu, Hongying Liu, Yuangang Li, YunXiang Gong

Abstract: Although Federated Learning has been widely studied in recent years, there are still high overhead expenses in each communication round for large-scale models such as Vision Transformer. To lower the communication complexity, we propose a novel Federated Block Coordinate Gradient Descent (FedBCGD) method for communication efficiency. The proposed method splits model parameters into several blocks, including a shared block and enables uploading a specific parameter block by each client, which can significantly reduce communication overhead. Moreover, we also develop an accelerated FedBCGD algorithm (called FedBCGD+) with client drift control and stochastic variance reduction. To the best of our knowledge, this paper is the first work on parameter block communication for training large-scale deep models. We also provide the convergence analysis for the proposed algorithms. Our theoretical results show that the communication complexities of our algorithms are a factor $1/N$ lower than those of existing methods, where $N$ is the number of parameter blocks, and they enjoy much faster convergence than their counterparts. Empirical results indicate the superiority of the proposed algorithms compared to state-of-the-art algorithms. The code is available at https://github.com/junkangLiu0/FedBCGD.

URLs: https://github.com/junkangLiu0/FedBCGD.

new Federated Causal Discovery Across Heterogeneous Datasets under Latent Confounding

Authors: Maximilian Hahn, Alina Zajak, Dominik Heider, Ad\`ele Helena Ribeiro

Abstract: Causal discovery across multiple datasets is often constrained by data privacy regulations and cross-site heterogeneity, limiting the use of conventional methods that require a single, centralized dataset. To address these challenges, we introduce fedCI, a federated conditional independence test that rigorously handles heterogeneous datasets with non-identical sets of variables, site-specific effects, and mixed variable types, including continuous, ordinal, binary, and categorical variables. At its core, fedCI uses a federated Iteratively Reweighted Least Squares (IRLS) procedure to estimate the parameters of generalized linear models underlying likelihood-ratio tests for conditional independence. Building on this, we develop fedCI-IOD, a federated extension of the Integration of Overlapping Datasets (IOD) algorithm, that replaces its meta-analysis strategy and enables, for the fist time, federated causal discovery under latent confounding across distributed and heterogeneous datasets. By aggregating evidence federatively, fedCI-IOD not only preserves privacy but also substantially enhances statistical power, achieving performance comparable to fully pooled analyses and mitigating artifacts from low local sample sizes. Our tools are publicly available as the fedCI Python package, a privacy-preserving R implementation of IOD, and a web application for the fedCI-IOD pipeline, providing versatile, user-friendly solutions for federated conditional independence testing and causal discovery.

new Balancing Privacy-Quality-Efficiency in Federated Learning through Round-Based Interleaving of Protection Techniques

Authors: Yenan Wang, Carla Fabiana Chiasserini, Elad Michael Schiller

Abstract: In federated learning (FL), balancing privacy protection, learning quality, and efficiency remains a challenge. Privacy protection mechanisms, such as Differential Privacy (DP), degrade learning quality, or, as in the case of Homomorphic Encryption (HE), incur substantial system overhead. To address this, we propose Alt-FL, a privacy-preserving FL framework that combines DP, HE, and synthetic data via a novel round-based interleaving strategy. Alt-FL introduces three new methods, Privacy Interleaving (PI), Synthetic Interleaving with DP (SI/DP), and Synthetic Interleaving with HE (SI/HE), that enable flexible quality-efficiency trade-offs while providing privacy protection. We systematically evaluate Alt-FL against representative reconstruction attacks, including Deep Leakage from Gradients, Inverting Gradients, When the Curious Abandon Honesty, and Robbing the Fed, using a LeNet-5 model on CIFAR-10 and Fashion-MNIST. To enable fair comparison between DP- and HE-based defenses, we introduce a new attacker-centric framework that compares empirical attack success rates across the three proposed interleaving methods. Our results show that, for the studied attacker model and dataset, PI achieves the most balanced trade-offs at high privacy protection levels, while DP-based methods are preferable at intermediate privacy requirements. We also discuss how such results can be the basis for selecting privacy-preserving FL methods under varying privacy and resource constraints.

new Trainable Bitwise Soft Quantization for Input Feature Compression

Authors: Karsten Schr\"odter, Jan Stenkamp, Nina Herrmann, Fabian Gieseke

Abstract: The growing demand for machine learning applications in the context of the Internet of Things calls for new approaches to optimize the use of limited compute and memory resources. Despite significant progress that has been made w.r.t. reducing model sizes and improving efficiency, many applications still require remote servers to provide the required resources. However, such approaches rely on transmitting data from edge devices to remote servers, which may not always be feasible due to bandwidth, latency, or energy constraints. We propose a task-specific, trainable feature quantization layer that compresses the input features of a neural network. This can significantly reduce the amount of data that needs to be transferred from the device to a remote server. In particular, the layer allows each input feature to be quantized to a user-defined number of bits, enabling a simple on-device compression at the time of data collection. The layer is designed to approximate step functions with sigmoids, enabling trainable quantization thresholds. By concatenating outputs from multiple sigmoids, introduced as bitwise soft quantization, it achieves trainable quantized values when integrated with a neural network. We compare our method to full-precision inference as well as to several quantization baselines. Experiments show that our approach outperforms standard quantization methods, while maintaining accuracy levels close to those of full-precision models. In particular, depending on the dataset, compression factors of $5\times$ to $16\times$ can be achieved compared to $32$-bit input without significant performance loss.

new Incentive Aware AI Regulations: A Credal Characterisation

Authors: Anurag Singh, Julian Rodemann, Rajeev Verma, Siu Lun Chau, Krikamol Muandet

Abstract: While high-stakes ML applications demand strict regulations, strategic ML providers often evade them to lower development costs. To address this challenge, we cast AI regulation as a mechanism design problem under uncertainty and introduce regulation mechanisms: a framework that maps empirical evidence from models to a license for some market share. The providers can select from a set of licenses, effectively forcing them to bet on their model's ability to fulfil regulation. We aim at regulation mechanisms that achieve perfect market outcome, i.e. (a) drive non-compliant providers to self-exclude, and (b) ensure participation from compliant providers. We prove that a mechanism has perfect market outcome if and only if the set of non-compliant distributions forms a credal set, i.e., a closed, convex set of probability measures. This result connects mechanism design and imprecise probability by establishing a duality between regulation mechanisms and the set of non-compliant distributions. We also demonstrate these mechanisms in practice via experiments on regulating use of spurious features for prediction and fairness. Our framework provides new insights at the intersection of mechanism design and imprecise probability, offering a foundation for development of enforceable AI regulations.

new Towards a data-scale independent regulariser for robust sparse identification of non-linear dynamics

Authors: Jay Raut, Daniel N. Wilke, Stephan Schmidt

Abstract: Data normalisation, a common and often necessary preprocessing step in engineering and scientific applications, can severely distort the discovery of governing equations by magnitudebased sparse regression methods. This issue is particularly acute for the Sparse Identification of Nonlinear Dynamics (SINDy) framework, where the core assumption of sparsity is undermined by the interaction between data scaling and measurement noise. The resulting discovered models can be dense, uninterpretable, and physically incorrect. To address this critical vulnerability, we introduce the Sequential Thresholding of Coefficient of Variation (STCV), a novel, computationally efficient sparse regression algorithm that is inherently robust to data scaling. STCV replaces conventional magnitude-based thresholding with a dimensionless statistical metric, the Coefficient Presence (CP), which assesses the statistical validity and consistency of candidate terms in the model library. This shift from magnitude to statistical significance makes the discovery process invariant to arbitrary data scaling. Through comprehensive benchmarking on canonical dynamical systems and practical engineering problems, including a physical mass-spring-damper experiment, we demonstrate that STCV consistently and significantly outperforms standard Sequential Thresholding Least Squares (STLSQ) and Ensemble-SINDy (E-SINDy) on normalised, noisy datasets. The results show that STCV-based methods can successfully identify the correct, sparse physical laws even when other methods fail. By mitigating the distorting effects of normalisation, STCV makes sparse system identification a more reliable and automated tool for real-world applications, thereby enhancing model interpretability and trustworthiness.

new Stable-LoRA: Stabilizing Feature Learning of Low-Rank Adaptation

Authors: Yize Wu, Ke Gao, Ling Li, Yanjun Wu

Abstract: Low-Rank Adaptation (LoRA) is a widely adopted parameter-efficient method for fine-tuning Large Langauge Models. It updates the weight matrix as $W=W_0+sBA$, where $W_0$ is the original frozen weight, $s$ is a scaling factor and $A$,$B$ are trainable low-rank matrices. Despite its robust empirical effectiveness, the theoretical foundations of LoRA remain insufficiently understood, particularly with respect to feature learning stability. In this paper, we first establish that, LoRA can, in principle, naturally achieve and sustain stable feature learning (i.e., be self-stabilized) under appropriate hyper-parameters and initializations of $A$ and $B$. However, we also uncover a fundamental limitation that the necessary non-zero initialization of $A$ compromises self-stability, leading to suboptimal performances. To address this challenge, we propose Stable-LoRA, a weight-shrinkage optimization strategy that dynamically enhances stability of LoRA feature learning. By progressively shrinking $A$ during the earliest training steps, Stable-LoRA is both theoretically and empirically validated to effectively eliminate instability of LoRA feature learning while preserving the benefits of the non-zero start. Experiments show that Stable-LoRA consistently outperforms other baselines across diverse models and tasks, with no additional memory usage and only negligible computation overheads. The code is available at https://github.com/Yize-Wu/Stable-LoRA.

URLs: https://github.com/Yize-Wu/Stable-LoRA.

new Early Warning of Intraoperative Adverse Events via Transformer-Driven Multi-Label Learning

Authors: Xueyao Wang, Xiuding Cai, Honglin Shang, Yaoyao Zhu, Yu Yao

Abstract: Early warning of intraoperative adverse events plays a vital role in reducing surgical risk and improving patient safety. While deep learning has shown promise in predicting the single adverse event, several key challenges remain: overlooking adverse event dependencies, underutilizing heterogeneous clinical data, and suffering from the class imbalance inherent in medical datasets. To address these issues, we construct the first Multi-label Adverse Events dataset (MuAE) for intraoperative adverse events prediction, covering six critical events. Next, we propose a novel Transformerbased multi-label learning framework (IAENet) that combines an improved Time-Aware Feature-wise Linear Modulation (TAFiLM) module for static covariates and dynamic variables robust fusion and complex temporal dependencies modeling. Furthermore, we introduce a Label-Constrained Reweighting Loss (LCRLoss) with co-occurrence regularization to effectively mitigate intra-event imbalance and enforce structured consistency among frequently co-occurring events. Extensive experiments demonstrate that IAENet consistently outperforms strong baselines on 5, 10, and 15-minute early warning tasks, achieving improvements of +5.05%, +2.82%, and +7.57% on average F1 score. These results highlight the potential of IAENet for supporting intelligent intraoperative decision-making in clinical practice.

new The Geometric Inductive Bias of Grokking: Bypassing Phase Transitions via Architectural Topology

Authors: Alper Y{\i}ld{\i}r{\i}m

Abstract: Mechanistic interpretability typically relies on post-hoc analysis of trained networks. We instead adopt an interventional approach: testing hypotheses a priori by modifying architectural topology to observe training dynamics. We study grokking - delayed generalization in Transformers trained on cyclic modular addition (Zp) - investigating if specific architectural degrees of freedom prolong the memorization phase. We identify two independent structural factors in standard Transformers: unbounded representational magnitude and data-dependent attention routing. First, we introduce a fully bounded spherical topology enforcing L2 normalization throughout the residual stream and an unembedding matrix with a fixed temperature scale. This removes magnitude-based degrees of freedom, reducing grokking onset time by over 20x without weight decay. Second, a Uniform Attention Ablation overrides data-dependent query-key routing with a uniform distribution, reducing the attention layer to a Continuous Bag-of-Words (CBOW) aggregator. Despite removing adaptive routing, these models achieve 100% generalization across all seeds and bypass the grokking delay entirely. To evaluate whether this acceleration is a task-specific geometric alignment rather than a generic optimization stabilizer, we use non-commutative S5 permutation composition as a negative control. Enforcing spherical constraints on S5 does not accelerate generalization. This suggests eliminating the memorization phase depends strongly on aligning architectural priors with the task's intrinsic symmetries. Together, these findings provide interventional evidence that architectural degrees of freedom substantially influence grokking, suggesting a predictive structural perspective on training dynamics.

new SlideSparse: Fast and Flexible (2N-2):2N Structured Sparsity

Authors: Hanyong Shao, Yingbo Hao, Ting Song, Yan Xia, Di Zhang, Shaohan Huang, Xun Wu, Songchen Xu, Le Xu, Li Dong, Zewen Chi, Yi Zou, Furu Wei

Abstract: NVIDIA's 2:4 Sparse Tensor Cores deliver 2x throughput but demand strict 50% pruning -- a ratio that collapses LLM reasoning accuracy (Qwen3: 54% to 15%). Milder $(2N-2):2N$ patterns (e.g., 6:8, 25% pruning) preserve accuracy yet receive no hardware support, falling back to dense execution without any benefit from sparsity. We present SlideSparse, the first system to unlock Sparse Tensor Core acceleration for the $(2N-2):2N$ model family on commodity GPUs. Our Sliding Window Decomposition reconstructs any $(2N-2):2N$ weight block into $N-1$ overlapping 2:4-compliant windows without any accuracy loss; Activation Lifting fuses the corresponding activation rearrangement into per-token quantization at near-zero cost. Integrated into vLLM, SlideSparse is evaluated across various GPUs (A100, H100, B200, RTX 4090, RTX 5080, DGX-spark), precisions (FP4, INT8, FP8, BF16, FP16), and model families (Llama, Qwen, BitNet). On compute-bound workloads, the measured speedup ratio (1.33x) approaches the theoretical upper-bound $N/(N-1)=4/3$ at 6:8 weight sparsity in Qwen2.5-7B, establishing $(2N-2):2N$ as a practical path to accuracy-preserving LLM acceleration. Code available at https://github.com/bcacdwk/vllmbench.

URLs: https://github.com/bcacdwk/vllmbench.

new Recursive Inference Machines for Neural Reasoning

Authors: Mieszko Komisarczyk, Saurabh Mathur, Maurice Kraus, Sriraam Natarajan, Kristian Kersting

Abstract: Neural reasoners such as Tiny Recursive Models (TRMs) solve complex problems by combining neural backbones with specialized inference schemes. Such inference schemes have been a central component of stochastic reasoning systems, where inference rules are applied to a stochastic model to derive answers to complex queries. In this work, we bridge these two paradigms by introducing Recursive Inference Machines (RIMs), a neural reasoning framework that explicitly incorporates recursive inference mechanisms inspired by classical inference engines. We show that TRMs can be expressed as an instance of RIMs, allowing us to extend them through a reweighting component, yielding better performance on challenging reasoning benchmarks, including ARC-AGI-1, ARC-AGI-2, and Sudoku Extreme. Furthermore, we show that RIMs can be used to improve reasoning on other tasks, such as the classification of tabular data, outperforming TabPFNs.

new A Behaviour-Aware Federated Forecasting Framework for Distributed Stand-Alone Wind Turbines

Authors: Bowen Li, Xiufeng Liu, Maria Sinziiana Astefanoaei

Abstract: Accurate short-term wind power forecasting is essential for grid dispatch and market operations, yet centralising turbine data raises privacy, cost, and heterogeneity concerns. We propose a two-stage federated learning framework that first clusters turbines by long-term behavioural statistics using Double Roulette Selection (DRS) initialisation with recursive Auto-split refinement, and then trains cluster-specific LSTM models via FedAvg. Experiments on 400 stand-alone turbines in Denmark show that DRS-auto discovers behaviourally coherent groups and achieves competitive forecasting accuracy while preserving data locality. Behaviour-aware grouping consistently outperforms geographic partitioning and matches strong k-means++ baselines, suggesting a practical privacy-friendly solution for heterogeneous distributed turbine fleets.

new Beyond Word Error Rate: Auditing the Diversity Tax in Speech Recognition through Dataset Cartography

Authors: Ting-Hui Cheng, Line H. Clemmensen, Sneha Das

Abstract: Automatic speech recognition (ASR) systems are predominantly evaluated using the Word Error Rate (WER). However, raw token-level metrics fail to capture semantic fidelity and routinely obscures the `diversity tax', the disproportionate burden on marginalized and atypical speaker due to systematic recognition failures. In this paper, we explore the limitations of relying solely on lexical counts by systematically evaluating a broader class of non-linear and semantic metrics. To enable rigorous model auditing, we introduce the sample difficulty index (SDI), a novel metric that quantifies how intrinsic demographic and acoustic factors drive model failure. By mapping SDI on data cartography, we demonstrate that metrics EmbER and SemDist expose hidden systemic biases and inter-model disagreements that WER ignores. Finally, our findings are the first steps towards a robust audit framework for prospective safety analysis, empowering developers to audit and mitigate ASR disparities prior to deployment.

new Whispering to a Blackbox: Bootstrapping Frozen OCR with Visual Prompts

Authors: Samandar Samandarov, Nazirjon Ismoiljonov, Abdullah Sattorov, Temirlan Sabyrbayev

Abstract: In the landscape of modern machine learning, frozen pre-trained models provide stability and efficiency but often underperform on specific tasks due to mismatched data distributions. This paper introduces the Whisperer, a novel visual prompting framework that learns diffusion-based preprocessors to adapt inputs in pixel space, effectively "whispering" enhancements to frozen downstream models like EasyOCR. By framing the process as behavioral cloning of stochastically discovered improvement policies, our method achieves an 8% absolute (10.6% relative) reduction in Character Error Rate (CER) on a challenging dataset of 300k degraded synthetic text images, surpassing hand-engineered baselines such as CLAHE. The key innovation is a four-stage training curriculum that uses behavioral cloning to amplify "lucky" improvements discovered through the stochastic exploration of a partially trained diffusion model. This approach is highly sample-efficient and avoids the pitfalls of traditional reinforcement learning. Crucially, we frame this not as naive reinforcement learning, but as behavioral cloning of an exploration policy: we stochastically sample intermediate diffusion outputs, select those that improve CER by chance, and then train the model to reproduce them. This bootstrapping curriculum (4 stages over 60 GPU-hours) amplifies random successes into a systematic strategy. In summary, by whispering to the frozen OCR through its inputs, we improve an imperfect classifier without touching its weights.

new Knowledge Divergence and the Value of Debate for Scalable Oversight

Authors: Robin Young

Abstract: AI safety via debate and reinforcement learning from AI feedback (RLAIF) are both proposed methods for scalable oversight of advanced AI systems, yet no formal framework relates them or characterizes when debate offers an advantage. We analyze this by parameterizing debate's value through the geometry of knowledge divergence between debating models. Using principal angles between models' representation subspaces, we prove that the debate advantage admits an exact closed form. When models share identical training corpora, debate reduces to RLAIF-like where a single-agent method recovers the same optimum. When models possess divergent knowledge, debate advantage scales with a phase transition from quadratic regime (debate offers negligible benefit) to linear regime (debate is essential). We classify three regimes of knowledge divergence (shared, one-sided, and compositional) and provide existence results showing that debate can achieve outcomes inaccessible to either model alone, alongside a negative result showing that sufficiently strong adversarial incentives cause coordination failure in the compositional regime, with a sharp threshold separating effective from ineffective debate. We offer the first formal connection between debate and RLAIF, a geometric foundation for understanding when adversarial oversight protocols are justified, and connection to the problem of eliciting latent knowledge across models with complementary information.

new WavSLM: Single-Stream Speech Language Modeling via WavLM Distillation

Authors: Luca Della Libera, Cem Subakan, Mirco Ravanelli

Abstract: Large language models show that simple autoregressive training can yield scalable and coherent generation, but extending this paradigm to speech remains challenging due to the entanglement of semantic and acoustic information. Most existing speech language models rely on text supervision, hierarchical token streams, or complex hybrid architectures, departing from the single-stream generative pretraining paradigm that has proven effective in text. In this work, we introduce WavSLM, a speech language model trained by quantizing and distilling self-supervised WavLM representations into a single codebook and optimizing an autoregressive next-chunk prediction objective. WavSLM jointly models semantic and acoustic information within a single token stream without text supervision or text pretraining. Despite its simplicity, it achieves competitive performance on consistency benchmarks and speech generation while using fewer parameters, less training data, and supporting streaming inference. Demo samples are available at https://lucadellalib.github.io/wavslm-web/.

URLs: https://lucadellalib.github.io/wavslm-web/.

new GALACTIC: Global and Local Agnostic Counterfactuals for Time-series Clustering

Authors: Christos Fragkathoulas, Eleni Psaroudaki, Themis Palpanas, Evaggelia Pitoura

Abstract: Time-series clustering is a fundamental tool for pattern discovery, yet existing explainability methods, primarily based on feature attribution or metadata, fail to identify the transitions that move an instance across cluster boundaries. While Counterfactual Explanations (CEs) identify the minimal temporal perturbations required to alter the prediction of a model, they have been mostly confined to supervised settings. This paper introduces GALACTIC, the first unified framework to bridge local and global counterfactual explainability for unsupervised time-series clustering. At instance level (local), GALACTIC generates perturbations via a cluster-aware optimization objective that respects the target and underlying cluster assignments. At cluster level (global), to mitigate cognitive load and enhance interpretability, we formulate a representative CE selection problem. We propose a Minimum Description Length (MDL) objective to extract a non-redundant summary of global explanations that characterize the transitions between clusters. We prove that our MDL objective is supermodular, which allows the corresponding MDL reduction to be framed as a monotone submodular set function. This enables an efficient greedy selection algorithm with provable $(1-1/e)$ approximation guarantees. Extensive experimental evaluation on the UCR Archive demonstrates that GALACTIC produces significantly sparser local CEs and more concise global summaries than state-of-the-art baselines adapted for our problem, offering the first unified approach for interpreting clustered time-series through counterfactuals.

new FairFinGAN: Fairness-aware Synthetic Financial Data Generation

Authors: Tai Le Quy, Dung Nguyen Tuan, Trung Nguyen Thanh, Duy Tran Cong, Huyen Giang Thi Thu, Frank Hopfgartner

Abstract: Financial datasets often suffer from bias that can lead to unfair decision-making in automated systems. In this work, we propose FairFinGAN, a WGAN-based framework designed to generate synthetic financial data while mitigating bias with respect to the protected attribute. Our approach incorporates fairness constraints directly into the training process through a classifier, ensuring that the synthetic data is both fair and preserves utility for downstream predictive tasks. We evaluate our proposed model on five real-world financial datasets and compare it with existing GAN-based data generation methods. Experimental results show that our approach achieves superior fairness metrics without significant loss in data utility, demonstrating its potential as a tool for bias-aware data generation in financial applications.

new Preserving Continuous Symmetry in Discrete Spaces: Geometric-Aware Quantization for SO(3)-Equivariant GNNs

Authors: Haoyu Zhou, Ping Xue, Hao Zhang, Tianfan Fu

Abstract: Equivariant Graph Neural Networks (GNNs) are essential for physically consistent molecular simulations but suffer from high computational costs and memory bottlenecks, especially with high-order representations. While low-bit quantization offers a solution, applying it naively to rotation-sensitive features destroys the SO(3)-equivariant structure, leading to significant errors and violations of conservation laws. To address this issue, in this work, we propose a Geometric-Aware Quantization (GAQ) framework that compresses and accelerates equivariant models while rigorously preserving continuous symmetry in discrete spaces. Our approach introduces three key contributions: (1) a Magnitude-Direction Decoupled Quantization (MDDQ) scheme that separates invariant lengths from equivariant orientations to maintain geometric fidelity; (2) a symmetry-aware training strategy that treats scalar and vector features with distinct quantization schedules; and (3) a robust attention normalization mechanism to stabilize gradients in low-bit regimes. Experiments on the rMD17 benchmark demonstrate that our W4A8 models match the accuracy of FP32 baselines (9.31 meV vs. 23.20 meV) while reducing Local Equivariance Error (LEE) by over 30x compared to naive quantization. On consumer hardware, GAQ achieves 2.39x inference speedup and 4x memory reduction, enabling stable, energy-conserving molecular dynamics simulations for nanosecond timescales.

new InfoFlow KV: Information-Flow-Aware KV Recomputation for Long Context

Authors: Xin Teng, Canyu Zhang, Shaoyi Zheng, Danyang Zhuo, Tianyi Zhou, Shengjie Wang

Abstract: Retrieval-augmented generation (RAG) for long-context question answering is bottlenecked by inference-time prefilling over large retrieved contexts. A common strategy is to precompute key-value (KV) caches for individual documents and selectively recompute a small subset of tokens to restore global causal dependencies, but existing methods rely on heuristics or representation discrepancies without modeling whether selected tokens can effectively influence generation. We cast selective KV recomputation as an information flow problem and show that a simple attention-norm signal from the query reliably identifies tokens that are both semantically relevant and structurally positioned to propagate information, when computed under an inference-consistent RoPE geometry. We therefore reconstruct global positional assignments for retrieved chunks and introduce an information-flow-guided chunk reordering strategy. Experiments on LLM and VLM benchmarks demonstrate consistent gains over prior methods under comparable efficiency budgets.

new Learning Causal Structure of Time Series using Best Order Score Search

Authors: Irene Gema Castillo Mansilla, Urmi Ninad

Abstract: Causal structure learning from observational data is central to many scientific and policy domains, but the time series setting common to many disciplines poses several challenges due to temporal dependence. In this paper we focus on score-based causal discovery for multivariate time series and introduce TS-BOSS, a time series extension of the recently proposed Best Order Score Search (BOSS) (Andrews et al. 2023). TS-BOSS performs a permutation-based search over dynamic Bayesian network structures while leveraging grow-shrink trees to cache intermediate score computations, preserving the scalability and strong empirical performance of BOSS in the static setting. We provide theoretical guarantees establishing the soundness of TS-BOSS under suitable assumptions, and we present an intermediate result that extends classical subgraph minimality results for permutation-based methods to the dynamic (time series) setting. Our experiments on synthetic data show that TS-BOSS is especially effective in high auto-correlation regimes, where it consistently achieves higher adjacency recall at comparable precision than standard constraint-based methods. Overall, TS-BOSS offers a high-performing, scalable approach for time series causal discovery and our results provide a principled bridge for extending sparsity-based, permutation-driven causal learning theory to dynamic settings.

new Embedded Inter-Subject Variability in Adversarial Learning for Inertial Sensor-Based Human Activity Recognition

Authors: Francisco M. Calatrava-Nicol\'as, Shoko Miyauchi, Vitor Fortes Rey, Paul Lukowicz, Todor Stoyanov, Oscar Martinez Mozos

Abstract: This paper addresses the problem of Human Activity Recognition (HAR) using data from wearable inertial sensors. An important challenge in HAR is the model's generalization capabilities to new unseen individuals due to inter-subject variability, i.e., the same activity is performed differently by different individuals. To address this problem, we propose a novel deep adversarial framework that integrates the concept of inter-subject variability in the adversarial task, thereby encouraging subject-invariant feature representations and enhancing the classification performance in the HAR problem. Our approach outperforms previous methods in three well-established HAR datasets using a leave-one-subject-out (LOSO) cross-validation. Further results indicate that our proposed adversarial task effectively reduces inter-subject variability among different users in the feature space, and it outperforms adversarial tasks from previous works when integrated into our framework. Code: https://github.com/FranciscoCalatrava/EmbeddedSubjectVariability.git

URLs: https://github.com/FranciscoCalatrava/EmbeddedSubjectVariability.git

new Robust Node Affinities via Jaccard-Biased Random Walks and Rank Aggregation

Authors: Bastian Pfeifer, Michael G. Schimek

Abstract: Estimating node similarity is a fundamental task in network analysis and graph-based machine learning, with applications in clustering, community detection, classification, and recommendation. We propose TopKGraphs, a method based on start-node-anchored random walks that bias transitions toward nodes with structurally similar neighborhoods, measured via Jaccard similarity. Rather than computing stationary distributions, walks are treated as stochastic neighborhood samplers, producing partial node rankings that are aggregated using robust rank aggregation to construct interpretable node-to-node affinity matrices. TopKGraphs provides a non-parametric, interpretable, and general-purpose representation of node similarity that can be applied in both network analysis and machine learning workflows. We evaluate the method on synthetic graphs (stochastic block models, Lancichinetti-Fortunato-Radicchi benchmark graphs), k-nearest-neighbor graphs from tabular datasets, and a curated high-confidence protein-protein interaction network. Across all scenarios, TopKGraphs achieves competitive or superior performance compared to standard similarity measures (Jaccard, Dice), a diffusion-based method (personalized PageRank), and an embedding-based approach (Node2Vec), demonstrating robustness in sparse, noisy, or heterogeneous networks. These results suggest that TopKGraphs is a versatile and interpretable tool for bridging simple local similarity measures with more complex embedding-based approaches, facilitating both data mining and network analysis applications.

new On the Necessity of Learnable Sheaf Laplacians

Authors: Ferran Hernandez Caralt, Mar Gonz\`alez i Catal\`a, Adri\'an Bazaga, Pietro Li\`o

Abstract: Sheaf Neural Networks (SNNs) were introduced as an extension of Graph Convolutional Networks to address oversmoothing on heterophilous graphs by attaching a sheaf to the input graph and replacing the adjacency-based operator with a sheaf Laplacian defined by (learnable) restriction maps. Prior work motivates this design through theoretical properties of sheaf diffusion and the kernel of the sheaf Laplacian, suggesting that suitable non-identity restriction maps can avoid representations converging to constants across connected components. Since oversmoothing can also be mitigated through residual connections and normalization, we revisit a trivial sheaf construction to ask whether the additional complexity of learning restriction maps is necessary. We introduce an Identity Sheaf Network baseline, where all restriction maps are fixed to the identity, and use it to ablate the empirical improvements reported by sheaf-learning architectures. Across five popular heterophilic benchmarks, the identity baseline achieves comparable performance to a range of SNN variants. Finally, we introduce the Rayleigh quotient as a normalized measure for comparing oversmoothing across models and show that, in trained networks, the behavior predicted by the diffusion-based analysis of SNNs is not reflected empirically. In particular, Identity Sheaf Networks do not appear to suffer more significant oversmoothing than their SNN counterparts.

new An interpretable prototype parts-based neural network for medical tabular data

Authors: Jacek Karolczak, Jerzy Stefanowski

Abstract: The ability to interpret machine learning model decisions is critical in such domains as healthcare, where trust in model predictions is as important as their accuracy. Inspired by the development of prototype parts-based deep neural networks in computer vision, we propose a new model for tabular data, specifically tailored to medical records, that requires discretization of diagnostic result norms. Unlike the original vision models that rely on the spatial structure, our method employs trainable patching over features describing a patient, to learn meaningful prototypical parts from structured data. These parts are represented as binary or discretized feature subsets. This allows the model to express prototypes in human-readable terms, enabling alignment with clinical language and case-based reasoning. Our proposed neural network is inherently interpretable and offers interpretable concept-based predictions by comparing the patient's description to learned prototypes in the latent space of the network. In experiments, we demonstrate that the model achieves classification performance competitive to widely used baseline models on medical benchmark datasets, while also offering transparency, bridging the gap between predictive performance and interpretability in clinical decision support.

new On-Policy Self-Distillation for Reasoning Compression

Authors: Hejian Sang, Yuanda Xu, Zhengze Zhou, Ran He, Zhipeng Wang, Jiachen Sun

Abstract: Reasoning models think out loud, but much of what they say is noise. We introduce OPSDC (On-Policy Self-Distillation for Reasoning Compression), a method that teaches models to reason more concisely by distilling their own concise behavior back into themselves. The entire approach reduces to one idea: condition the same model on a "be concise" instruction to obtain teacher logits, and minimize per-token reverse KL on the student's own rollouts. No ground-truth answers, no token budgets, no difficulty estimators. Just self-distillation. Yet this simplicity belies surprising sophistication: OPSDC automatically compresses easy problems aggressively while preserving the deliberation needed for hard ones. On Qwen3-8B and Qwen3-14B, we achieve 57-59% token reduction on MATH-500 while improving accuracy by 9-16 points absolute. On AIME 2024, the 14B model gains 10 points with 41% compression. The secret? Much of what reasoning models produce is not just redundant-it is actively harmful, compounding errors with every unnecessary token.

new Latent Wasserstein Adversarial Imitation Learning

Authors: Siqi Yang, Kai Yan, Alexander G. Schwing, Yu-Xiong Wang

Abstract: Imitation Learning (IL) enables agents to mimic expert behavior by learning from demonstrations. However, traditional IL methods require large amounts of medium-to-high-quality demonstrations as well as actions of expert demonstrations, both of which are often unavailable. To reduce this need, we propose Latent Wasserstein Adversarial Imitation Learning (LWAIL), a novel adversarial imitation learning framework that focuses on state-only distribution matching. It benefits from the Wasserstein distance computed in a dynamics-aware latent space. This dynamics-aware latent space differs from prior work and is obtained via a pre-training stage, where we train the Intention Conditioned Value Function (ICVF) to capture a dynamics-aware structure of the state space using a small set of randomly generated state-only data. We show that this enhances the policy's understanding of state transitions, enabling the learning process to use only one or a few state-only expert episodes to achieve expert-level performance. Through experiments on multiple MuJoCo environments, we demonstrate that our method outperforms prior Wasserstein-based IL methods and prior adversarial IL methods, achieving better results across various tasks.

new Kraus Constrained Sequence Learning For Quantum Trajectories from Continuous Measurement

Authors: Priyanshi Singh, Krishna Bhatia

Abstract: Real-time reconstruction of conditional quantum states from continuous measurement records is a fundamental requirement for quantum feedback control, yet standard stochastic master equation (SME) solvers require exact model specification, known system parameters, and are sensitive to parameter mismatch. While neural sequence models can fit these stochastic dynamics, the unconstrained predictors can violate physicality such as positivity or trace constraints, leading to unstable rollouts and unphysical estimates. We propose a Kraus-structured output layer that converts the hidden representation of a generic sequence backbone into a completely positive trace preserving (CPTP) quantum operation, yielding physically valid state updates by construction. We instantiate this layer across diverse backbones, RNN, GRU, LSTM, TCN, ESN and Mamba; including Neural ODE as a comparative baseline, on stochastic trajectories characterized by parameter drift. Our evaluation reveals distinct trade-offs between gating mechanisms, linear recurrence, and global attention. Across all models, Kraus-LSTM achieves the strongest results, improving state estimation quality by 7% over its unconstrained counterpart while guaranteeing physically valid predictions in non-stationary regimes.

new SurvHTE-Bench: A Benchmark for Heterogeneous Treatment Effect Estimation in Survival Analysis

Authors: Shahriar Noroozizadeh, Xiaobin Shen, Jeremy C. Weiss, George H. Chen

Abstract: Estimating heterogeneous treatment effects (HTEs) from right-censored survival data is critical in high-stakes applications such as precision medicine and individualized policy-making. Yet, the survival analysis setting poses unique challenges for HTE estimation due to censoring, unobserved counterfactuals, and complex identification assumptions. Despite recent advances, from Causal Survival Forests to survival meta-learners and outcome imputation approaches, evaluation practices remain fragmented and inconsistent. We introduce SurvHTE-Bench, the first comprehensive benchmark for HTE estimation with censored outcomes. The benchmark spans (i) a modular suite of synthetic datasets with known ground truth, systematically varying causal assumptions and survival dynamics, (ii) semi-synthetic datasets that pair real-world covariates with simulated treatments and outcomes, and (iii) real-world datasets from a twin study (with known ground truth) and from an HIV clinical trial. Across synthetic, semi-synthetic, and real-world settings, we provide the first rigorous comparison of survival HTE methods under diverse conditions and realistic assumption violations. SurvHTE-Bench establishes a foundation for fair, reproducible, and extensible evaluation of causal survival methods. The data and code of our benchmark are available at: https://github.com/Shahriarnz14/SurvHTE-Bench .

URLs: https://github.com/Shahriarnz14/SurvHTE-Bench

new Censored LLMs as a Natural Testbed for Secret Knowledge Elicitation

Authors: Helena Casademunt, Bartosz Cywi\'nski, Khoi Tran, Arya Jakkli, Samuel Marks, Neel Nanda

Abstract: Large language models sometimes produce false or misleading responses. Two approaches to this problem are honesty elicitation -- modifying prompts or weights so that the model answers truthfully -- and lie detection -- classifying whether a given response is false. Prior work evaluates such methods on models specifically trained to lie or conceal information, but these artificial constructions may not resemble naturally-occurring dishonesty. We instead study open-weights LLMs from Chinese developers, which are trained to censor politically sensitive topics: Qwen3 models frequently produce falsehoods about subjects like Falun Gong or the Tiananmen protests while occasionally answering correctly, indicating they possess knowledge they are trained to suppress. Using this as a testbed, we evaluate a suite of elicitation and lie detection techniques. For honesty elicitation, sampling without a chat template, few-shot prompting, and fine-tuning on generic honesty data most reliably increase truthful responses. For lie detection, prompting the censored model to classify its own responses performs near an uncensored-model upper bound, and linear probes trained on unrelated data offer a cheaper alternative. The strongest honesty elicitation techniques also transfer to frontier open-weights models including DeepSeek R1. Notably, no technique fully eliminates false responses. We release all prompts, code, and transcripts.

new Cheap Thrills: Effective Amortized Optimization Using Inexpensive Labels

Authors: Khai Nguyen, Petros Ellinas, Anvita Bhagavathula, Priya Donti

Abstract: To scale the solution of optimization and simulation problems, prior work has explored machine-learning surrogates that inexpensively map problem parameters to corresponding solutions. Commonly used approaches, including supervised and self-supervised learning with either soft or hard feasibility enforcement, face inherent challenges such as reliance on expensive, high-quality labels or difficult optimization landscapes. To address their trade-offs, we propose a novel framework that first collects "cheap" imperfect labels, then performs supervised pretraining, and finally refines the model through self-supervised learning to improve overall performance. Our theoretical analysis and merit-based criterion show that labeled data need only place the model within a basin of attraction, confirming that only modest numbers of inexact labels and training epochs are required. We empirically validate our simple three-stage strategy across challenging domains, including nonconvex constrained optimization, power-grid operation, and stiff dynamical systems, and show that it yields faster convergence; improved accuracy, feasibility, and optimality; and up to 59x reductions in total offline cost.

new POET-X: Memory-efficient LLM Training by Scaling Orthogonal Transformation

Authors: Zeju Qiu, Lixin Liu, Adrian Weller, Han Shi, Weiyang Liu

Abstract: Efficient and stable training of large language models (LLMs) remains a core challenge in modern machine learning systems. To address this challenge, Reparameterized Orthogonal Equivalence Training (POET), a spectrum-preserving framework that optimizes each weight matrix through orthogonal equivalence transformation, has been proposed. Although POET provides strong training stability, its original implementation incurs high memory consumption and computational overhead due to intensive matrix multiplications. To overcome these limitations, we introduce POET-X, a scalable and memory-efficient variant that performs orthogonal equivalence transformations with significantly reduced computational cost. POET-X maintains the generalization and stability benefits of POET while achieving substantial improvements in throughput and memory efficiency. In our experiments, POET-X enables the pretraining of billion-parameter LLMs on a single Nvidia H100 GPU, and in contrast, standard optimizers such as AdamW run out of memory under the same settings.

cross Lost in Translation: How Language Re-Aligns Vision for Cross-Species Pathology

Authors: Ekansh Arora

Abstract: Foundation models are increasingly applied to computational pathology, yet their behavior under cross-cancer and cross-species transfer remains unspecified. This study investigated how fine-tuning CPath-CLIP affects cancer detection under same-cancer, cross-cancer, and cross-species conditions using whole-slide image patches from canine and human histopathology. Performance was measured using area under the receiver operating characteristic curve (AUC). Few-shot fine-tuning improved same-cancer (64.9% to 72.6% AUC) and cross-cancer performance (56.84% to 66.31% AUC). Cross-species evaluation revealed that while tissue matching enables meaningful transfer, performance remains below state-of-the-art benchmarks (H-optimus-0: 84.97% AUC), indicating that standard vision-language alignment is suboptimal for cross-species generalization. Embedding space analysis revealed extremely high cosine similarity (greater than 0.99) between tumor and normal prototypes. Grad-CAM shows prototype-based models remain domain-locked, while language-guided models attend to conserved tumor morphology. To address this, we introduce Semantic Anchoring, which uses language to provide a stable coordinate system for visual features. Ablation studies reveal that benefits stem from the text-alignment mechanism itself, regardless of text encoder complexity. Benchmarking against H-optimus-0 shows that CPath-CLIP's failure stems from intrinsic embedding collapse, which text alignment effectively circumvents. Additional gains were observed in same-cancer (8.52%) and cross-cancer classification (5.67%). We identified a previously uncharacterized failure mode: semantic collapse driven by species-dominated alignment rather than missing visual information. These results demonstrate that language acts as a control mechanism, enabling semantic re-interpretation without retraining.

cross One Size Does Not Fit All: Token-Wise Adaptive Compression for KV Cache

Authors: Liming Lu, Kaixi Qiu, Jiayu Zhou, Jushi Kai, Haoyan Zhang, Huanyu Wang, Jingwen Leng, Ziwei He, Zhouhan Lin

Abstract: Despite the remarkable progress of Large Language Models (LLMs), the escalating memory footprint of the Key-Value (KV) cache remains a critical bottleneck for efficient inference. While dimensionality reduction offers a promising compression avenue, existing approaches typically either necessitate prohibitively expensive pre-training from scratch or suffer from severe performance deterioration under high compression regimes. In this work, we propose DynaKV, a novel post-training framework for low-rank KV cache compression. To the best of our knowledge, DynaKV is the first method to dynamically allocate compression rates to individual tokens according to their semantic meaning, which allows it to achieve better fidelity at aggressive compression ratios. Extensive experiments demonstrate that our method consistently outperforms existing state-of-the-art compression techniques, achieving significant memory reduction while maintaining competitive generation quality. Furthermore, our approach is orthogonal to sequence-level pruning methods. When integrated with SnapKV, DynaKV retains only 6% of the KV cache while maintaining 94% of the baseline performance on the LongBench benchmark.

cross Context-Dependent Affordance Computation in Vision-Language Models

Authors: Murad Farzulla

Abstract: We characterize the phenomenon of context-dependent affordance computation in vision-language models (VLMs). Through a large-scale computational study (n=3,213 scene-context pairs from COCO-2017) using Qwen-VL 30B and LLaVA-1.5-13B subject to systematic context priming across 7 agentic personas, we demonstrate massive affordance drift: mean Jaccard similarity between context conditions is 0.095 (95% CI: [0.093, 0.096], p < 0.0001), indicating that >90% of lexical scene description is context-dependent. Sentence-level cosine similarity confirms substantial drift at the semantic level (mean = 0.415, 58.5% context-dependent). Stochastic baseline experiments (2,384 inference runs across 4 temperatures and 5 seeds) confirm this drift reflects genuine context effects rather than generation noise: within-prime variance is substantially lower than cross-prime variance across all conditions. Tucker decomposition with bootstrap stability analysis (n=1,000 resamples) reveals stable orthogonal latent factors: a "Culinary Manifold" isolated to chef contexts and an "Access Axis" spanning child-mobility contrasts. These findings establish that VLMs compute affordances in a substantially context-dependent manner -- with the difference between lexical (90%) and semantic (58.5%) measures reflecting that surface vocabulary changes more than underlying meaning under context shifts -- and suggest a direction for robotics research: dynamic, query-dependent ontological projection (JIT Ontology) rather than static world modeling. We do not claim to establish processing order or architectural primacy; such claims require internal representational analysis beyond output behavior.

cross When Scaling Fails: Network and Fabric Effects on Distributed GPU Training Performance

Authors: Dinesh Gopalan, Ratul Ali

Abstract: Scaling distributed GPU training is commonly assumed to yield predictable performance gains as additional nodes are added. In practice, many large-scale deployments encounter diminishing returns and unstable behavior well before theoretical limits are reached. This paper examines why scaling fails in real systems, with a focus on the role of network and fabric effects that are often overlooked by higher-level training frameworks. We present an empirical study of distributed GPU training performance across multiple production-scale clusters. Our results show that network topology, congestion dynamics, collective synchronization behavior, and GPU locality frequently dominate end-to-end training performance once workloads move beyond a small number of nodes. Identical models and software stacks can exhibit sharply different scaling characteristics depending on fabric design and runtime communication patterns. We identify recurring failure modes that emerge as training transitions from single-node to multi-node execution, including synchronization amplification, topology-induced contention, and locality-driven performance variance. These effects are often invisible to standard profiling tools and are therefore misdiagnosed as framework or model-level inefficiencies. Based on these findings, we outline practical diagnostic principles that system builders can apply to understand scaling limits, improve predictability, and reduce the cost of large-scale distributed training.

cross Data-Driven Optimization of Multi-Generational Cellular Networks: A Performance Classification Framework for Strategic Infrastructure Management

Authors: Maryam Sabahat, M. Umar Khan

Abstract: The exponential growth in mobile data demand necessitates intelligent management of telecommunications infrastructure to ensure Quality of Service (QoS) and operational efficiency. This paper presents a comprehensive analysis of a multigenerational cellular network dataset, sourced from the OpenCelliD project, to identify patterns in network deployment, utilization, and infrastructure gaps. The methodology involves geographical, temporal, and performance analysis of 1,818 cell tower entries, predominantly Long Term Evolution (LTE), across three countries with a significant concentration in Pakistan. Key findings reveal the long-term persistence of legacy 2G/3G infrastructure in major urban centers, the existence of a substantial number of under-utilized towers representing opportunities for cost savings, and the identification of specific "non-4G demand zones" where active user bases are served by outdated technologies. By introducing a signal-density metric, we distinguish between absolute over-utilization and localized congestion. The results provide actionable intelligence for Mobile Network Operators (MNOs) to guide strategic LTE upgrades, optimize resource allocation, and bridge the digital divide in underserved regions.

cross Auction-Based RIS Allocation With DRL: Controlling the Cost-Performance Trade-Off

Authors: Martin Mark Zan, Stefan Schwarz

Abstract: We study the allocation of reconfigurable intelligent surfaces (RISs) in a multi-cell wireless network, where base stations compete for control of shared RIS units deployed at the cell edges. These RISs, provided by an independent operator, are dynamically leased to the highest bidder using a simultaneously ascending auction format. Each base station estimates the utility of acquiring additional RISs based on macroscopic channel parameters, enabling a scalable and low-overhead allocation mechanism. To optimize the bidding behavior, we integrate deep reinforcement learning (DRL) agents that learn to maximize performance while adhering to budget constraints. Through simulations in clustered cell-edge environments, we demonstrate that reinforcement learning (RL)-based bidding significantly outperforms heuristic strategies, achieving optimal trade-offs between cost and spectral efficiency. Furthermore, we introduce a tunable parameter that governs the bidding aggressiveness of RL agents, enabling a flexible control of the trade-off between network performance and expenditure. Our results highlight the potential of combining auction-based allocation with adaptive RL mechanisms for efficient and fair utilization of RISs in next-generation wireless networks.

cross CogGen: Cognitive-Load-Informed Fully Unsupervised Deep Generative Modeling for Compressively Sampled MRI Reconstruction

Authors: Qingyong Zhu, Yumin Tan, Xiang Gu, Dong Liang

Abstract: Fully unsupervised deep generative modeling (FU-DGM) is promising for compressively sampled MRI (CS-MRI) when training data or compute are limited. Classical FU-DGMs such as DIP and INR rely on architectural priors, but the ill-conditioned inverse problem often demands many iterations and easily overfits measurement noise. We propose CogGen, a cognitive-load-informed FU-DGM that casts CS-MRI as staged inversion and regulates task-side "cognitive load" by progressively scheduling intrinsic difficulty and extraneous interference. CogGen replaces uniform data fitting with an easy-to-hard k-space weighting/selection strategy: early iterations emphasize low-frequency, high-SNR, structure-dominant samples, while higher-frequency or noise-dominated measurements are introduced later. We realize this schedule via self-paced curriculum learning with complementary student-mode (what the model can currently learn) and teacher-mode (what it should follow) criteria, supporting both soft weighting and hard selection. Experiments and analysis show that CogGen-DIP and CogGen-INR improve fidelity and convergence over strong unsupervised baselines and competitive supervised pipelines.

cross Explainable Regime Aware Investing

Authors: Amine Boukardagha

Abstract: We propose an explainable regime-aware portfolio construction framework based on a strictly causal Wasserstein Hidden Markov Model. The model combines rolling Gaussian HMM inference with predictive model-order selection and template-based identity tracking using the 2-Wasserstein distance between Gaussian components. This allows regime complexity to adapt dynamically while preserving stable economic interpretation. Regime probabilities are embedded into a transaction-cost-aware mean-variance optimization framework and evaluated on a diversified daily cross-asset universe. Relative to equal-weight and SPX buy-and-hold benchmarks, the Wasserstein HMM achieves materially higher risk-adjusted performance with Sharpe ratios of 2.18 versus 1.59 and 1.18 and substantially lower maximum drawdown of negative 5.43 percent versus negative 14.62 percent for SPX. During the early 2025 equity selloff labeled Liberation Day, the strategy dynamically reduced equity exposure and shifted toward defensive assets, mitigating peak-to-trough losses. Compared to a nonparametric KNN conditional-moment estimator using the same features and optimization layer, the parametric regime model produces materially lower turnover and smoother weight evolution. The results demonstrate that regime inference stability, particularly identity preservation and adaptive complexity control, is a first-order determinant of portfolio drawdown and implementation robustness in daily asset allocation.

cross AMV-L: Lifecycle-Managed Agent Memory for Tail-Latency Control in Long-Running LLM Systems

Authors: Emmanuel Bamidele

Abstract: Long-running LLM agents require persistent memory to preserve state across interactions, yet most deployed systems manage memory with age-based retention (e.g., TTL). While TTL bounds item lifetime, it does not bound the computational footprint of memory on the request path: as retained items accumulate, retrieval candidate sets and vector similarity scans can grow unpredictably, yielding heavy-tailed latency and unstable throughput. We present AMV-L (Adaptive Memory Value Lifecycle), a memory-management framework that treats agent memory as a managed systems resource. AMV-L assigns each memory item a continuously updated utility score and uses value-driven promotion, demotion, and eviction to maintain lifecycle tiers; retrieval is restricted to a bounded, tier-aware candidate set that decouples the request-path working set from total retained memory. We implement AMV-L in a full-stack LLM serving system and evaluate it under identical long-running workloads against two baselines: TTL and an LRU working-set policy, with fixed prompt-injection caps. Relative to TTL, AMV-L improves throughput by 3.1x and reduces latency by 4.2x (median), 4.7x (p95), and 4.4x (p99), while reducing the fraction of requests exceeding 2s from 13.8% to 0.007%. Compared to LRU, AMV-L trades a small regression in median/p95 latency (+26% / +3%) for improved extreme-tail behavior (-15% p99; -98% >2s) and lower token overhead (approximately 6% fewer tokens/request), while matching retrieval quality (value means within approximately 0-2%). The gains arise primarily from bounding retrieval-set size and vector-search work, not from shortening prompts. Our results show that predictable performance for long-running LLM agents requires explicit control of memory working-set size and value-driven lifecycle management, rather than retention time alone.

cross SkillNet: Create, Evaluate, and Connect AI Skills

Authors: Yuan Liang, Ruobin Zhong, Haoming Xu, Chen Jiang, Yi Zhong, Runnan Fang, Jia-Chen Gu, Shumin Deng, Yunzhi Yao, Mengru Wang, Shuofei Qiao, Xin Xu, Tongtong Wu, Kun Wang, Yang Liu, Zhen Bi, Jungang Lou, Yuchen Eleanor Jiang, Hangcheng Zhu, Gang Yu, Haiwen Hong, Longtao Huang, Hui Xue, Chenxi Wang, Yijun Wang, Zifei Shan, Xi Chen, Zhaopeng Tu, Feiyu Xiong, Xin Xie, Peng Zhang, Zhengke Gui, Lei Liang, Jun Zhou, Chiyu Wu, Jin Shang, Yu Gong, Junyu Lin, Changliang Xu, Hongjie Deng, Wen Zhang, Keyan Ding, Qiang Zhang, Fei Huang, Ningyu Zhang, Jeff Z. Pan, Guilin Qi, Haofen Wang, Huajun Chen

Abstract: Current AI agents can flexibly invoke tools and execute complex tasks, yet their long-term advancement is hindered by the lack of systematic accumulation and transfer of skills. Without a unified mechanism for skill consolidation, agents frequently ``reinvent the wheel'', rediscovering solutions in isolated contexts without leveraging prior strategies. To overcome this limitation, we introduce SkillNet, an open infrastructure designed to create, evaluate, and organize AI skills at scale. SkillNet structures skills within a unified ontology that supports creating skills from heterogeneous sources, establishing rich relational connections, and performing multi-dimensional evaluation across Safety, Completeness, Executability, Maintainability, and Cost-awareness. Our infrastructure integrates a repository of over 200,000 skills, an interactive platform, and a versatile Python toolkit. Experimental evaluations on ALFWorld, WebShop, and ScienceWorld demonstrate that SkillNet significantly enhances agent performance, improving average rewards by 40% and reducing execution steps by 30% across multiple backbone models. By formalizing skills as evolving, composable assets, SkillNet provides a robust foundation for agents to move from transient experience to durable mastery.

cross MPBMC: Multi-Property Bounded Model Checking with GNN-guided Clustering

Authors: Soumik Guha Roy, Sumana Ghosh, Ansuman Banerjee, Raj Kumar Gajavelly, Sudhakar Surendran

Abstract: Formal verification of designs with multiple properties has been a long-standing challenge for the verification research community. The task of coming up with an effective strategy that can efficiently cluster properties to be solved together has inspired a number of proposals, ranging from structural clustering based on the property cone of influence (COI) to leverage runtime design and verification statistics. In this paper, we present an attempt towards functional clustering of properties utilizing graph neural network (GNN) embeddings for creating effective property clusters. We propose a hybrid approach that can exploit neural functional representations of hardware circuits and runtime design statistics to speed up the performance of Bounded Model Checking (BMC) in the context of multi-property verification (MPV). Our method intelligently groups properties based on their functional embedding and design statistics, resulting in speedup in verification results. Experimental results on the HWMCC benchmarks show the efficacy of our proposal with respect to the state-of-the-art.

cross Induced Numerical Instability: Hidden Costs in Multimodal Large Language Models

Authors: Wai Tuck Wong, Jun Sun, Arunesh Sinha

Abstract: The use of multimodal large language models has become widespread, and as such the study of these models and their failure points has become of utmost importance. We study a novel mode of failure that causes degradation in performance indirectly by optimizing a loss term that seeks to maximize numerical instability in the inference stage of these models. We apply this loss term as the optimization target to construct images that, when used on multimodal large language models, cause significant degradation in the output. We validate our hypothesis on state of the art models large vision language models (LLaVa-v1.5-7B, Idefics3-8B, SmolVLM-2B-Instruct) against standard datasets (Flickr30k, MMVet, TextVQA, VQAv2, POPE, COCO) and show that performance degrades significantly, even with a very small change to the input image, compared to baselines. Our results uncover a fundamentally different vector of performance degradation, highlighting a failure mode not captured by adversarial perturbations.

cross Act-Observe-Rewrite: Multimodal Coding Agents as In-Context Policy Learners for Robot Manipulation

Authors: Vaishak Kumar

Abstract: Can a multimodal language model learn to manipulate physical objects by reasoning about its own failures-without gradient updates, demonstrations, or reward engineering? We argue the answer is yes, under conditions we characterise precisely. We present Act-Observe-Rewrite (AOR), a framework in which an LLM agent improves a robot manipulation policy by synthesising entirely new executable Python controller code between trials, guided by visual observations and structured episode outcomes. Unlike prior work that grounds LLMs in pre-defined skill libraries or uses code generation for one-shot plan synthesis, AOR makes the full low-level motor control implementation the unit of LLM reasoning, enabling the agent to change not just what the robot does, but how it does it. The central claim is that interpretable code as the policy representation creates a qualitatively different kind of in-context learning from opaque neural policies: the agent can diagnose systematic failures and rewrite their causes. We validate this across three robosuite manipulation tasks and report promising results, with the agent achieving high success rates without demonstrations, reward engineering, or gradient updates.

cross Dictionary Based Pattern Entropy for Causal Direction Discovery

Authors: Harikrishnan N B, Shubham Bhilare, Aditi Kathpalia, Nithin Nagaraj

Abstract: Discovering causal direction from temporal observational data is particularly challenging for symbolic sequences, where functional models and noise assumptions are often unavailable. We propose a novel \emph{Dictionary Based Pattern Entropy ($DPE$)} framework that infers both the direction of causation and the specific subpatterns driving changes in the effect variable. The framework integrates \emph{Algorithmic Information Theory} (AIT) and \emph{Shannon Information Theory}. Causation is interpreted as the emergence of compact, rule based patterns in the candidate cause that systematically constrain the effect. $DPE$ constructs direction-specific dictionaries and quantifies their influence using entropy-based measures, enabling a principled link between deterministic pattern structure and stochastic variability. Causal direction is inferred via a minimum-uncertainty criterion, selecting the direction exhibiting stronger and more consistent pattern-driven organization. As summarized in Table 7, $DPE$ consistently achieves reliable performance across diverse synthetic systems, including delayed bit-flip perturbations, AR(1) coupling, 1D skew-tent maps, and sparse processes, outperforming or matching competing AIT-based methods ($ETC_E$, $ETC_P$, $LZ_P$). In biological and ecological datasets, performance is competitive, while alternative methods show advantages in specific genomic settings. Overall, the results demonstrate that minimizing pattern level uncertainty yields a robust, interpretable, and broadly applicable framework for causal discovery.

cross Bayesian Modeling of Collatz Stopping Times: A Probabilistic Machine Learning Perspective

Authors: Nicol\`o Bonacorsi, Matteo Bordoni

Abstract: We study the Collatz total stopping time $\tau(n)$ over $n\le 10^7$ from a probabilistic machine learning viewpoint. Empirically, $\tau(n)$ is a skewed and heavily overdispersed count with pronounced arithmetic heterogeneity. We develop two complementary models. First, a Bayesian hierarchical Negative Binomial regression (NB2-GLM) predicts $\tau(n)$ from simple covariates ($\log n$ and residue class $n \bmod 8$), quantifying uncertainty via posterior and posterior predictive distributions. Second, we propose a mechanistic generative approximation based on the odd-block decomposition: for odd $m$, write $3m+1=2^{K(m)}m'$ with $m'$ odd and $K(m)=v_2(3m+1)\ge 1$; randomizing these block lengths yields a stochastic approximation calibrated via a Dirichlet-multinomial update. On held-out data, the NB2-GLM achieves substantially higher predictive likelihood than the odd-block generators. Conditioning the block-length distribution on $m\bmod 8$ markedly improves the generator's distributional fit, indicating that low-order modular structure is a key driver of heterogeneity in $\tau(n)$.

cross AbAffinity: A Large Language Model for Predicting Antibody Binding Affinity against SARS-CoV-2

Authors: Faisal Bin Ashraf, Animesh Ray, Stefano Lonardi

Abstract: Machine learning-based antibody design is emerging as one of the most promising approaches to combat infectious diseases, due to significant advancements in the field of artificial intelligence and an exponential surge in experimental antibody data (in particular related to COVID-19). The ability of an antibody to bind to an antigens (called binding affinity) is one of the the most critical properties in designing neutralizing antibodies. In this study we introduce Ab-Affinity, a new large language model that can accurately predict the binding affinity of antibodies against a target peptide, e.g., the SARS-CoV-2 spike protein. Code and model are available at https://github.com/ucrbioinfo/AbAffinity.

URLs: https://github.com/ucrbioinfo/AbAffinity.

cross Projected Hessian Learning: Fast Curvature Supervision for Accurate Machine-Learning Interatomic Potentials

Authors: Austin Rodriguez, Justin S. Smith, Sakib Matin, Nicholas Lubbers, Kipton Barros, Jose L. Mendoza-Cortes

Abstract: The Hessian matrix (second derivatives) encodes far richer local curvature of the potential energy surface than energies and forces alone. However, training machine-learning interatomic potentials (MLIPs) with full Hessians is often impractical because explicitly forming and storing Hessian matrices scales quadratically in cost and memory. We introduce Projected Hessian Learning (PHL), a scalable second-order training framework that injects curvature information using only Hessian-vector products (HVPs). Rather than constructing the Hessian, PHL projects curvature along stochastic probe directions and uses an unbiased stochastic trace-based loss with favorable system-size scaling, enabling curvature-informed training without quadratic memory growth. We benchmark PHL on a chemically diverse dataset of reactants, products, transition states, intrinsic reaction coordinates, and normal-mode sampled geometries computed at omegaB97XD/6-31G(d). We compare energy-force training (E-F), two HVP-based schemes (E-F-HVP with one-hot or randomized probes), and full energy-force-Hessian training (E-F-H). With randomized probes per minibatch, both HVP schemes match full-Hessian training in energy, force, and Hessian accuracy while delivering >24x epoch speedups for the small molecular systems studied. In a fixed-probe regime with one HVP per molecule, randomized projections consistently outperform one-column probing, especially for far-from-equilibrium geometries. Overall, PHL replaces explicit Hessian supervision with force-complexity curvature training, retaining most second-order accuracy gains while scaling to larger, more complex molecular systems.

cross The Volterra signature

Authors: Paul P. Hager, Fabian N. Harang, Luca Pelizzari, Samy Tindel

Abstract: Modern approaches for learning from non-Markovian time series, such as recurrent neural networks, neural controlled differential equations or transformers, typically rely on implicit memory mechanisms that can be difficult to interpret or to train over long horizons. We propose the Volterra signature $\mathrm{VSig}(x;K)$ as a principled, explicit feature representation for history-dependent systems. By developing the input path $x$ weighted by a temporal kernel $K$ into the tensor algebra, we leverage the associated Volterra--Chen identity to derive rigorous learning-theoretic guarantees. Specifically, we prove an injectivity statement (identifiability under augmentation) that leads to a universal approximation theorem on the infinite dimensional path space, which in certain cases is achieved by linear functionals of $\mathrm{VSig}(x;K)$. Moreover, we demonstrate applicability of the kernel trick by showing that the inner product associated with Volterra signatures admits a closed characterization via a two-parameter integral equation, enabling numerical methods from PDEs for computation. For a large class of exponential-type kernels, $\mathrm{VSig}(x;K)$ solves a linear state-space ODE in the tensor algebra. Combined with inherent invariance to time reparameterization, these results position the Volterra signature as a robust, computationally tractable feature map for data science. We demonstrate its efficacy in dynamic learning tasks on real and synthetic data, where it consistently improves classical path signature baselines.

cross A Fast Generative Framework for High-dimensional Posterior Sampling: Application to CMB Delensing

Authors: Hadi Sotoudeh, Pablo Lemos, Laurence Perreault-Levasseur

Abstract: We introduce a deep generative framework for high-dimensional Bayesian inference that enables efficient posterior sampling. As telescopes and simulations rapidly expand the volume and resolution of astrophysical data, fast simulation-based inference methods are increasingly needed to extract scientific insights. While diffusion-based approaches offer high-quality generative capabilities, they are hindered by slow sampling speeds. Our method performs posterior sampling an order of magnitude faster than a diffusion baseline. Applied to the problem of CMB delensing, it successfully recovers the unlensed CMB power spectrum from simulated observations. The model also remains robust to shifts in cosmological parameters, demonstrating its potential for out-of-distribution generalization and application to observational cosmological data.

cross Weather-Related Crash Risk Forecasting: A Deep Learning Approach for Heterogenous Spatiotemporal Data

Authors: Abimbola Ogungbire, Srinivas Pulugurtha

Abstract: This study introduces a deep learning-based framework for forecasting weather-related traffic crash risk using heterogeneous spatiotemporal data. Given the complex, non-linear relationship between crash occurrence and factors such as road characteristics, and traffic conditions, we propose an ensemble of Convolutional Long Short-Term Memory (ConvLSTM) models trained over overlapping spatial grids. This approach captures both spatial dependencies and temporal dynamics while addressing spatial heterogeneity in crash patterns. North Carolina was selected as the study area due to its diverse weather conditions, with historical crash, weather, and traffic data aggregated at 5-mi by 5-mi grid resolution. The framework was evaluated using Mean Squared Error (MSE), Root Mean Squared Error (RMSE), and spatial cross-K analysis. Results show that the ensembled ConvLSTM significantly outperforms baseline models, including linear regression, ARIMA, and standard ConvLSTM, particularly in high-risk zones. The ensemble approach effectively combines the strengths of multiple ConvLSTM models, resulting in lower MSE and RMSE values across all regions, particularly when data from different crash risk zones are aggregated. Notably, the model performs exceptionally well in volatile high-risk areas (Cluster 1), achieving the lowest MSE and RMSE, while in stable low-risk areas (Cluster 2), it still improves upon simpler models but with slightly higher errors due to challenges in capturing subtle variations.

cross Fusion and Grouping Strategies in Deep Learning for Local Climate Zone Classification of Multimodal Remote Sensing Data

Authors: Ancymol Thomas, Jaya Sreevalsan-Nair

Abstract: Local Climate Zones (LCZs) give a zoning map to study urban structures and land use and analyze the impact of urbanization on local climate. Multimodal remote sensing enables LCZ classification, for which data fusion is significant for improving accuracy owing to the data complexity. However, there is a gap in a comprehensive analysis of the fusion mechanisms used in their deep learning (DL) classifier architectures. This study analyzes different fusion strategies in the multi-class LCZ classification models for multimodal data and grouping strategies based on inherent data characteristics. The different models involving Convolutional Neural Networks (CNNs) include: (i) baseline hybrid fusion (FM1), (ii) with self- and cross-attention mechanisms (FM2), (iii) with the multi-scale Gaussian filtered images (FM3), and (iv) weighted decision-level fusion (FM4). Ablation experiments are conducted to study the pixel-, feature-, and decision-level fusion effects in the model performance. Grouping strategies include band grouping (BG) within the data modalities and label merging (LM) in the ground truth. Our analysis is exclusively done on the So2Sat LCZ42 dataset, which consists of Synthetic Aperture Radar (SAR) and Multispectral Imaging (MSI) image pairs. Our results show that FM1 consistently outperforms simple fusion methods. FM1 with BG and LM is found to be the most effective approach among all fusion strategies, giving an overall accuracy of 76.6\%. Importantly, our study highlights the effect of these strategies in improving prediction accuracy for the underrepresented classes. Our code and processed datasets are available at https://github.com/GVCL/LCZC-MultiModalHybridFusion

URLs: https://github.com/GVCL/LCZC-MultiModalHybridFusion

cross Self-Attribution Bias: When AI Monitors Go Easy on Themselves

Authors: Dipika Khullar, Jack Hopkins, Rowan Wang, Fabien Roger

Abstract: Agentic systems increasingly rely on language models to monitor their own behavior. For example, coding agents may self critique generated code for pull request approval or assess the safety of tool-use actions. We show that this design pattern can fail when the action is presented in a previous or in the same assistant turn instead of being presented by the user in a user turn. We define self-attribution bias as the tendency of a model to evaluate an action as more correct or less risky when the action is implicitly framed as its own, compared to when the same action is evaluated under off-policy attribution. Across four coding and tool-use datasets, we find that monitors fail to report high-risk or low-correctness actions more often when evaluation follows a previous assistant turn in which the action was generated, compared to when the same action is evaluated in a new context presented in a user turn. In contrast, explicitly stating that the action comes from the monitor does not by itself induce self-attribution bias. Because monitors are often evaluated on fixed examples rather than on their own generated actions, these evaluations can make monitors appear more reliable than they actually are in deployment, leading developers to unknowingly deploy inadequate monitors in agentic systems.

cross Optimal Prediction-Augmented Algorithms for Testing Independence of Distributions

Authors: Maryam Aliakbarpour, Alireza Azizi, Ria Stevens

Abstract: Independence testing is a fundamental problem in statistical inference: given samples from a joint distribution $p$ over multiple random variables, the goal is to determine whether $p$ is a product distribution or is $\epsilon$-far from all product distributions in total variation distance. In the non-parametric finite-sample regime, this task is notoriously expensive, as the minimax sample complexity scales polynomially with the support size. In this work, we move beyond these worst-case limitations by leveraging the framework of \textit{augmented distribution testing}. We design independence testers that incorporate auxiliary, but potentially untrustworthy, predictive information. Our framework ensures that the tester remains robust, maintaining worst-case validity regardless of the prediction's quality, while significantly improving sample efficiency when the prediction is accurate. Our main contributions include: (i) a bivariate independence tester for discrete distributions that adaptively reduces sample complexity based on the prediction error; (ii) a generalization to the high-dimensional multivariate setting for testing the independence of $d$ random variables; and (iii) matching minimax lower bounds demonstrating that our testers achieve optimal sample complexity.

cross Spinverse: Differentiable Physics for Permeability-Aware Microstructure Reconstruction from Diffusion MRI

Authors: Prathamesh Pradeep Khole, Mario M. Brenes, Zahra Kais Petiwala, Ehsan Mirafzali, Utkarsh Gupta, Jing-Rebecca Li, Andrada Ianus, Razvan Marinescu

Abstract: Diffusion MRI (dMRI) is sensitive to microstructural barriers, yet most existing methods either assume impermeable boundaries or estimate voxel-level parameters without recovering explicit interfaces. We present Spinverse, a permeability-aware reconstruction method that inverts dMRI measurements through a fully differentiable Bloch-Torrey simulator. Spinverse represents tissue on a fixed tetrahedral grid and treats each interior face permeability as a learnable parameter; low-permeability faces act as diffusion barriers, so microstructural boundaries whose topology is not fixed a priori (up to the resolution of the ambient mesh) emerge without changing mesh connectivity or vertex positions. Given a target signal, we optimize face permeabilities by backpropagating a signal-matching loss through the PDE forward model, and recover an interface by thresholding the learned permeability field. To mitigate the ill-posedness of permeability inversion, we use mesh-based geometric priors; to avoid local minima, we use a staged multi-sequence optimization curriculum. Across a collection of synthetic voxel meshes, Spinverse reconstructs diverse geometries and demonstrates that sequence scheduling and regularization are critical to avoid outline-only solutions while improving both boundary accuracy and structural validity.

cross iAgentBench: Benchmarking Sensemaking Capabilities of Information-Seeking Agents on High-Traffic Topics

Authors: Preetam Prabhu Srikar Dammu, Arnav Palkhiwala, Tanya Roosta, Chirag Shah

Abstract: With the emergence of search-enabled generative QA systems, users are increasingly turning to tools that browse, aggregate, and reconcile evidence across multiple sources on their behalf. Yet many widely used QA benchmarks remain answerable by retrieving a single relevant passage, making them poorly suited for measuring cross-source sensemaking, such as integrating evidence, tracking causal links, and resolving dependencies across facets of a topic. We present iAgentBench, a dynamic ODQA benchmark that targets these higher-level information needs while keeping questions natural and grounded in realistic information-seeking behavior. iAgentBench draws seed topics from real-world attention signals and uses common user intent patterns to construct user-like questions whose answers require combining evidence from multiple sources, not just extracting a single snippet. Each instance is released with traceable evidence and auditable intermediate artifacts that support contamination checks and enable fine-grained diagnosis of failures in retrieval versus synthesis. Experiments across multiple LLMs show that retrieval improves accuracy, but retrieval alone does not reliably resolve these questions, underscoring the need to evaluate evidence use, not just evidence access.

cross Improving the accuracy of physics-informed neural networks via last-layer retraining

Authors: Saad Qadeer, Panos Stinis

Abstract: Physics-informed neural networks (PINNs) are a versatile tool in the burgeoning field of scientific machine learning for solving partial differential equations (PDEs). However, determining suitable training strategies for them is not obvious, with the result that they typically yield moderately accurate solutions. In this article, we propose a method for improving the accuracy of PINNs by coupling them with a post-processing step that seeks the best approximation in a function space associated with the network. We find that our method yields errors four to five orders of magnitude lower than those of the parent PINNs across architectures and dimensions. Moreover, we can reuse the basis functions for the linear space in more complex settings, such as time-dependent and nonlinear problems, allowing for transfer learning. Out approach also provides a residual-based metric that allows us to optimally choose the number of basis functions employed.

cross Why the Brain Consolidates: Predictive Forgetting for Optimal Generalisation

Authors: Zafeirios Fountas, Adnan Oomerjee, Haitham Bou-Ammar, Jun Wang, Neil Burgess

Abstract: Standard accounts of memory consolidation emphasise the stabilisation of stored representations, but struggle to explain representational drift, semanticisation, or the necessity of offline replay. Here we propose that high-capacity neocortical networks optimise stored representations for generalisation by reducing complexity via predictive forgetting, i.e. the selective retention of experienced information that predicts future outcomes or experience. We show that predictive forgetting formally improves information-theoretic generalisation bounds on stored representations. Under high-fidelity encoding constraints, such compression is generally unattainable in a single pass; high-capacity networks therefore benefit from temporally separated, iterative refinement of stored traces without re-accessing sensory input. We demonstrate this capacity dependence with simulations in autoencoder-based neocortical models, biologically plausible predictive coding circuits, and Transformer-based language models, and derive quantitative predictions for consolidation-dependent changes in neural representational geometry. These results identify a computational role for off-line consolidation beyond stabilisation, showing that outcome-conditioned compression optimises the retention-generalisation trade-off.

cross Generalizing Fair Top-$k$ Selection: An Integrative Approach

Authors: Guangya Cai

Abstract: Fair top-$k$ selection, which ensures appropriate proportional representation of members from minority or historically disadvantaged groups among the top-$k$ selected candidates, has drawn significant attention. We study the problem of finding a fair (linear) scoring function with multiple protected groups while also minimizing the disparity from a reference scoring function. This generalizes the prior setup, which was restricted to the single-group setting without disparity minimization. Previous studies imply that the number of protected groups may have a limited impact on the runtime efficiency. However, driven by the need for experimental exploration, we find that this implication overlooks a critical issue that may affect the fairness of the outcome. Once this issue is properly considered, our hardness analysis shows that the problem may become computationally intractable even for a two-dimensional dataset and small values of $k$. However, our analysis also reveals a gap in the hardness barrier, enabling us to recover the efficiency for the case of small $k$ when the number of protected groups is sufficiently small. Furthermore, beyond measuring disparity as the "distance" between the fair and the reference scoring functions, we introduce an alternative disparity measure$\unicode{x2014}$utility loss$\unicode{x2014}$that may yield a more stable scoring function under small weight perturbations. Through careful engineering trade-offs that balance implementation complexity, robustness, and performance, our augmented two-pronged solution demonstrates strong empirical performance on real-world datasets, with experimental observations also informing algorithm design and implementation decisions.

cross When Denoising Hinders: Revisiting Zero-Shot ASR with SAM-Audio and Whisper

Authors: Akif Islam, Raufun Nahar, Md. Ekramul Hamid

Abstract: Recent advances in automatic speech recognition (ASR) and speech enhancement have led to a widespread assumption that improving perceptual audio quality should directly benefit recognition accuracy. In this work, we rigorously examine whether this assumption holds for modern zero-shot ASR systems. We present a systematic empirical study on the impact of Segment Anything Model Audio by Meta AI, a recent foundation-scale speech enhancement model proposed by Meta, when used as a preprocessing step for zero-shot transcription with Whisper. Experiments are conducted across multiple Whisper model variants and two linguistically distinct noisy speech datasets: a real-world Bengali YouTube corpus and a publicly available English noisy dataset. Contrary to common intuition, our results show that SAM-Audio preprocessing consistently degrades ASR performance, increasing both Word Error Rate (WER) and Character Error Rate (CER) compared to raw noisy speech, despite substantial improvements in signal-level quality. Objective Peak Signal-to-Noise Ratio analysis on the English dataset confirms that SAM-Audio produces acoustically cleaner signals, yet this improvement fails to translate into recognition gains. Therefore, we conducted a detailed utterance-level analysis to understand this counterintuitive result. We found that the recognition degradation is a systematic issue affecting the majority of the audio, not just isolated outliers, and that the errors worsen as the Whisper model size increases. These findings expose a fundamental mismatch: audio that is perceptually cleaner to human listeners is not necessarily robust for machine recognition. This highlights the risk of blindly applying state-of-the-art denoising as a preprocessing step in zero-shot ASR pipelines.

cross SLO-Aware Compute Resource Allocation for Prefill-Decode Disaggregated LLM Inference

Authors: Luchang Li, Dongfang Li, Bozhao Gong, Yu Zhang

Abstract: Prefill-Decode (P/D) disaggregation has emerged as a widely adopted optimization strategy for Large Language Model (LLM) inference. However, there currently exists no well-established methodology for determining the optimal number of P/D hardware resources, subject to constraints on total throughput, service level objectives (SLOs), and request characteristics - specifically input and output lengths. To address this gap, we propose a hybrid approach that combines theoretical modeling with empirical benchmarking. First, we present a theoretical model for calculating P/D resource counts, which is based on total throughput requirements, request input and output lengths, as well as prefill and decode throughput. Then, to obtain the actual prefill and decode throughput under SLO constraints, we model the prefill process using M/M/1 queuing theory, deriving the achieved prefill throughput from the benchmarked maximum prefill throughput and Time-To-First-Token (TTFT). For the decode phase, we determine the decode batch sizes that meet Time-Per-Output-Token (TPOT) requirements and obtain the corresponding decode throughput through empirical measurements. Our experimental results demonstrate that the proposed method can accurately predict optimal P/D resource allocation in real-world LLM inference scenarios.

cross A Benchmark Study of Neural Network Compression Methods for Hyperspectral Image Classification

Authors: Sai Shi

Abstract: Deep neural networks have achieved strong performance in image classification tasks due to their ability to learn complex patterns from high-dimensional data. However, their large computational and memory requirements often limit deployment on resource-constrained platforms such as remote sensing devices and edge systems. Network compression techniques have therefore been proposed to reduce model size and computational cost while maintaining predictive performance. In this study, we conduct a systematic evaluation of neural network compression methods for a remote sensing application, namely hyperspectral land cover classification. Specifically, we examine three widely used compression strategies for convolutional neural networks: pruning, quantization, and knowledge distillation. Experiments are conducted on two benchmark hyperspectral datasets, considering classification accuracy, memory consumption, and inference efficiency. Our results demonstrate that compressed models can significantly reduce model size and computational cost while maintaining competitive classification performance. These findings provide insights into the trade-offs between compression ratio, efficiency, and accuracy, and highlight the potential of compression techniques for enabling efficient deep learning deployment in remote sensing applications.

cross Model Medicine: A Clinical Framework for Understanding, Diagnosing, and Treating AI Models

Authors: Jihoon Jeong

Abstract: Model Medicine is the science of understanding, diagnosing, treating, and preventing disorders in AI models, grounded in the principle that AI models -- like biological organisms -- have internal structures, dynamic processes, heritable traits, observable symptoms, classifiable conditions, and treatable states. This paper introduces Model Medicine as a research program, bridging the gap between current AI interpretability research (anatomical observation) and the systematic clinical practice that complex AI systems increasingly require. We present five contributions: (1) a discipline taxonomy organizing 15 subdisciplines across four divisions -- Basic Model Sciences, Clinical Model Sciences, Model Public Health, and Model Architectural Medicine; (2) the Four Shell Model (v3.3), a behavioral genetics framework empirically grounded in 720 agents and 24,923 decisions from the Agora-12 program, explaining how model behavior emerges from Core--Shell interaction; (3) Neural MRI (Model Resonance Imaging), a working open-source diagnostic tool mapping five medical neuroimaging modalities to AI interpretability techniques, validated through four clinical cases demonstrating imaging, comparison, localization, and predictive capability; (4) a five-layer diagnostic framework for comprehensive model assessment; and (5) clinical model sciences including the Model Temperament Index for behavioral profiling, Model Semiology for symptom description, and M-CARE for standardized case reporting. We additionally propose the Layered Core Hypothesis -- a biologically-inspired three-layer parameter architecture -- and a therapeutic framework connecting diagnosis to treatment.

cross Interactive Benchmarks

Authors: Baoqing Yue, Zihan Zhu, Yifan Zhang, Jichen Feng, Hufei Yang, Mengdi Wang

Abstract: Standard benchmarks have become increasingly unreliable due to saturation, subjectivity, and poor generalization. We argue that evaluating model's ability to acquire information actively is important to assess model's intelligence. We propose Interactive Benchmarks, a unified evaluation paradigm that assesses model's reasoning ability in an interactive process under budget constraints. We instantiate this framework across two settings: Interactive Proofs, where models interact with a judge to deduce objective truths or answers in logic and mathematics; and Interactive Games, where models reason strategically to maximize long-horizon utilities. Our results show that interactive benchmarks provide a robust and faithful assessment of model intelligence, revealing that there is still substantial room to improve in interactive scenarios. Project page: https://github.com/interactivebench/interactivebench

URLs: https://github.com/interactivebench/interactivebench

cross CONE: Embeddings for Complex Numerical Data Preserving Unit and Variable Semantics

Authors: Gyanendra Shrestha, Anna Pyayt, Michael Gubanov

Abstract: Large pre-trained models (LMs) and Large Language Models (LLMs) are typically effective at capturing language semantics and contextual relationships. However, these models encounter challenges in maintaining optimal performance on tasks involving numbers. Blindly treating numerical or structured data as terms is inadequate -- their semantics must be well understood and encoded by the models. In this paper, we propose CONE, a hybrid transformer encoder pre-trained model that encodes numbers, ranges, and gaussians into an embedding vector space preserving distance. We introduce a novel composite embedding construction algorithm that integrates numerical values, ranges or gaussians together with their associated units and attribute names to precisely capture their intricate semantics. We conduct extensive experimental evaluation on large-scale datasets across diverse domains (web, medical, finance, and government) that justifies CONE's strong numerical reasoning capabilities, achieving an F1 score of 87.28% on DROP, a remarkable improvement of up to 9.37% in F1 over state-of-the-art (SOTA) baselines, and outperforming major SOTA models with a significant Recall@10 gain of up to 25%.

cross Evaluating GPT-5 as a Multimodal Clinical Reasoner: A Landscape Commentary

Authors: Alexandru Florea, Shansong Wang, Mingzhe Hu, Qiang Li, Zach Eidex, Luke del Balzo, Mojtaba Safari, Xiaofeng Yang

Abstract: The transition from task-specific artificial intelligence toward general-purpose foundation models raises fundamental questions about their capacity to support the integrated reasoning required in clinical medicine, where diagnosis demands synthesis of ambiguous patient narratives, laboratory data, and multimodal imaging. This landscape commentary provides the first controlled, cross-sectional evaluation of the GPT-5 family (GPT-5, GPT-5 Mini, GPT-5 Nano) against its predecessor GPT-4o across a diverse spectrum of clinically grounded tasks, including medical education examinations, text-based reasoning benchmarks, and visual question-answering in neuroradiology, digital pathology, and mammography using a standardized zero-shot chain-of-thought protocol. GPT-5 demonstrated substantial gains in expert-level textual reasoning, with absolute improvements exceeding 25 percentage-points on MedXpertQA. When tasked with multimodal synthesis, GPT-5 effectively leveraged this enhanced reasoning capacity to ground uncertain clinical narratives in concrete imaging evidence, achieving state-of-the-art or competitive performance across most VQA benchmarks and outperforming GPT-4o by margins of 10-40% in mammography tasks requiring fine-grained lesion characterization. However, performance remained moderate in neuroradiology (44% macro-average accuracy) and lagged behind domain-specific models in mammography, where specialized systems exceed 80% accuracy compared to GPT-5's 52-64%. These findings indicate that while GPT-5 represents a meaningful advance toward integrated multimodal clinical reasoning, mirroring the clinician's cognitive process of biasing uncertain information with objective findings, generalist models are not yet substitutes for purpose-built systems in highly specialized, perception-critical tasks.

cross Guiding Diffusion-based Reconstruction with Contrastive Signals for Balanced Visual Representation

Authors: Boyu Han, Qianqian Xu, Shilong Bao, Zhiyong Yang, Ruochen Cui, Xilin Zhao, Qingming Huang

Abstract: The limited understanding capacity of the visual encoder in Contrastive Language-Image Pre-training (CLIP) has become a key bottleneck for downstream performance. This capacity includes both Discriminative Ability (D-Ability), which reflects class separability, and Detail Perceptual Ability (P-Ability), which focuses on fine-grained visual cues. Recent solutions use diffusion models to enhance representations by conditioning image reconstruction on CLIP visual tokens. We argue that such paradigms may compromise D-Ability and therefore fail to effectively address CLIP's representation limitations. To address this, we integrate contrastive signals into diffusion-based reconstruction to pursue more comprehensive visual representations. We begin with a straightforward design that augments the diffusion process with contrastive learning on input images. However, empirical results show that the naive combination suffers from gradient conflict and yields suboptimal performance. To balance the optimization, we introduce the Diffusion Contrastive Reconstruction (DCR), which unifies the learning objective. The key idea is to inject contrastive signals derived from each reconstructed image, rather than from the original input, into the diffusion process. Our theoretical analysis shows that the DCR loss can jointly optimize D-Ability and P-Ability. Extensive experiments across various benchmarks and multi-modal large language models validate the effectiveness of our method. The code is available at https://github.com/boyuh/DCR.

URLs: https://github.com/boyuh/DCR.

cross The Inductive Bias of Convolutional Neural Networks: Locality and Weight Sharing Reshape Implicit Regularization

Authors: Tongtong Liang, Esha Singh, Rahul Parhi, Alexander Cloninger, Yu-Xiang Wang

Abstract: We study how architectural inductive bias reshapes the implicit regularization induced by the edge-of-stability phenomenon in gradient descent. Prior work has established that for fully connected networks, the strength of this regularization is governed solely by the global input geometry; consequently, it is insufficient to prevent overfitting on difficult distributions such as the high-dimensional sphere. In this paper, we show that locality and weight sharing fundamentally change this picture. Specifically, we prove that provided the receptive field size $m$ remains small relative to the ambient dimension $d$, these networks generalize on spherical data with a rate of $n^{-\frac{1}{6} +O(m/d)}$, a regime where fully connected networks provably fail. This theoretical result confirms that weight sharing couples the learned filters to the low-dimensional patch manifold, thereby bypassing the high dimensionality of the ambient space. We further corroborate our theory by analyzing the patch geometry of natural images, showing that standard convolutional designs induce patch distributions that are highly amenable to this stability mechanism, thus providing a systematic explanation for the superior generalization of convolutional networks over fully connected baselines.

cross WhisperAlign: Word-Boundary-Aware ASR and WhisperX-Anchored Pyannote Diarization for Long-Form Bengali Speech

Authors: Aurchi Chowdhury, Rubaiyat -E-Zaman, Sk. Ashrafuzzaman Nafees

Abstract: This paper presents our solution for the DL Sprint 4.0, addressing the dual challenges of Bengali Long-Form Speech Recognition (Task 1) and Speaker Diarization (Task 2). Processing long-form, multi-speaker Bengali audio introduces significant hurdles in voice activity detection, overlapping speech, and context preservation. To solve the long-form transcription challenge, we implemented a robust audio chunking strategy utilizing whisper-timestamped, allowing us to feed precise, context-aware segments into our fine-tuned acoustic model for high-accuracy transcription. For the diarization task, we developed an integrated pipeline leveraging pyannote.audio and WhisperX. A key contribution of our approach is the domain-specific fine-tuning of the Pyannote segmentation model on the competition dataset. This adaptation allowed the model to better capture the nuances of Bengali conversational dynamics and accurately resolve complex, overlapping speaker boundaries. Our methodology demonstrates that applying intelligent timestamped chunking to ASR and targeted segmentation fine-tuning to diarization significantly drives down Word Error Rate (WER) and Diarization Error Rate (DER), in low-resource settings.

cross Quadratic polarity and polar Fenchel-Young divergences from the canonical Legendre polarity

Authors: Frank Nielsen, Basile Plus-Gourdon, Mahito Sugiyama

Abstract: Polarity is a fundamental reciprocal duality of $n$-dimensional projective geometry which associates to points polar hyperplanes, and more generally $k$-dimensional convex bodies to polar $(n-1-k)$-dimensional convex bodies. It is well-known that the Legendre-Fenchel transformation of functions can be interpreted from the polarity viewpoint of their graphs using an extra dimension. In this paper, we first show that generic polarities induced by quadratic polarity functionals can be expressed either as deformed Legendre polarity or as the Legendre polarity of deformed convex bodies, and be efficiently manipulated using linear algebra on $(n+2)\times (n+2)$ matrices operating on homogeneous coordinates. Second, we define polar divergences using the Legendre polarity and show that they generalize the Fenchel-Young divergence or equivalent Bregman divergence. This polarity study brings new understanding of the core reference duality in information geometry. Last, we show that the total Bregman divergences can be considered as a total polar Fenchel-Young divergence from which we newly exhibit the reference duality using dual polar conformal factors.

cross On the Strengths and Weaknesses of Data for Open-set Embodied Assistance

Authors: Pradyumna Tambwekar, Andrew Silva, Deepak Gopinath, Jonathan DeCastro, Xiongyi Cui, Guy Rosman

Abstract: Embodied foundation models are increasingly performant in real-world domains such as robotics or autonomous driving. These models are often deployed in interactive or assistive settings, where it is important that these assistive models generalize to new users and new tasks. Diverse interactive data generation offers a promising avenue for providing data-efficient generalization capabilities for interactive embodied foundation models. In this paper, we investigate the generalization capabilities of a multimodal foundation model fine-tuned on diverse interactive assistance data in a synthetic domain. We explore generalization along two axes: a) assistance with unseen categories of user behavior and b) providing guidance in new configurations not encountered during training. We study a broad capability called \textbf{Open-Set Corrective Assistance}, in which the model needs to inspect lengthy user behavior and provide assistance through either corrective actions or language-based feedback. This task remains unsolved in prior work, which typically assumes closed corrective categories or relies on external planners, making it a challenging testbed for evaluating the limits of assistive data. To support this task, we generate synthetic assistive datasets in Overcooked and fine-tune a LLaMA-based model to evaluate generalization to novel tasks and user behaviors. Our approach provides key insights into the nature of assistive datasets required to enable open-set assistive intelligence. In particular, we show that performant models benefit from datasets that cover different aspects of assistance, including multimodal grounding, defect inference, and exposure to diverse scenarios.

cross Mitigating Instance Entanglement in Instance-Dependent Partial Label Learning

Authors: Rui Zhao, Bin Shi, Kai Sun, Bo Dong

Abstract: Partial label learning is a prominent weakly supervised classification task, where each training instance is ambiguously labeled with a set of candidate labels. In real-world scenarios, candidate labels are often influenced by instance features, leading to the emergence of instance-dependent PLL (ID-PLL), a setting that more accurately reflects this relationship. A significant challenge in ID-PLL is instance entanglement, where instances from similar classes share overlapping features and candidate labels, resulting in increased class confusion. To address this issue, we propose a novel Class-specific Augmentation based Disentanglement (CAD) framework, which tackles instance entanglement by both intra- and inter-class regulations. For intra-class regulation, CAD amplifies class-specific features to generate class-wise augmentations and aligns same-class augmentations across instances. For inter-class regulation, CAD introduces a weighted penalty loss function that applies stronger penalties to more ambiguous labels, encouraging larger inter-class distances. By jointly applying intra- and inter-class regulations, CAD improves the clarity of class boundaries and reduces class confusion caused by entanglement. Extensive experimental results demonstrate the effectiveness of CAD in mitigating the entanglement problem and enhancing ID-PLL performance. The code is available at https://github.com/RyanZhaoIc/CAD.git.

URLs: https://github.com/RyanZhaoIc/CAD.git.

cross Osmosis Distillation: Model Hijacking with the Fewest Samples

Authors: Yuchen Shi, Huajie Chen, Heng Xu, Zhiquan Liu, Jialiang Shen, Chi Liu, Shuai Zhou, Tianqing Zhu, Wanlei Zhou

Abstract: Transfer learning is devised to leverage knowledge from pre-trained models to solve new tasks with limited data and computational resources. Meanwhile, dataset distillation has emerged to synthesize a compact dataset that preserves critical information from the original large dataset. Therefore, a combination of transfer learning and dataset distillation offers promising performance in evaluations. However, a non-negligible security threat remains undiscovered in transfer learning using synthetic datasets generated by dataset distillation methods, where an adversary can perform a model hijacking attack with only a few poisoned samples in the synthetic dataset. To reveal this threat, we propose Osmosis Distillation (OD) attack, a novel model hijacking strategy that targets deep learning models using the fewest samples. Comprehensive evaluations on various datasets demonstrate that the OD attack attains high attack success rates in hidden tasks while preserving high model utility in original tasks. Furthermore, the distilled osmosis set enables model hijacking across diverse model architectures, allowing model hijacking in transfer learning with considerable attack performance and model utility. We argue that awareness of using third-party synthetic datasets in transfer learning must be raised.

cross Causally Robust Reward Learning from Reason-Augmented Preference Feedback

Authors: Minjune Hwang, Yigit Korkmaz, Daniel Seita, Erdem B{\i}y{\i}k

Abstract: Preference-based reward learning is widely used for shaping agent behavior to match a user's preference, yet its sparse binary feedback makes it especially vulnerable to causal confusion. The learned reward often latches onto spurious features that merely co-occur with preferred trajectories during training, collapsing when those correlations disappear or reverse at test time. We introduce ReCouPLe, a lightweight framework that uses natural language rationales to provide the missing causal signal. Each rationale is treated as a guiding projection axis in an embedding space, training the model to score trajectories based on features aligned with that axis while de-emphasizing context that is unrelated to the stated reason. Because the same rationales (e.g., "avoids collisions", "completes the task faster") can appear across multiple tasks, ReCouPLe naturally reuses the same causal direction whenever tasks share semantics, and transfers preference knowledge to novel tasks without extra data or language-model fine-tuning. Our learned reward model can ground preferences on the articulated reason, aligning better with user intent and generalizing beyond spurious features. ReCouPLe outperforms baselines by up to 1.5x in reward accuracy under distribution shifts, and 2x in downstream policy performance in novel tasks. We have released our code at https://github.com/mj-hwang/ReCouPLe

URLs: https://github.com/mj-hwang/ReCouPLe

cross Interpretable Pre-Release Baseball Pitch Type Anticipation from Broadcast 3D Kinematics

Authors: Jerrin Bright, Michelle Lu, John Zelek

Abstract: How much can a pitcher's body reveal about the upcoming pitch? We study this question at scale by classifying eight pitch types from monocular 3D pose sequences, without access to ball-flight data. Our pipeline chains a diffusion-based 3D pose backbone with automatic pitching-event detection, groundtruth-validated biomechanical feature extraction, and gradient-boosted classification over 229 kinematic features. Evaluated on 119,561 professional pitches, the largest such benchmark to date, we achieve 80.4\% accuracy using body kinematics alone. A systematic importance analysis reveals that upper-body mechanics contribute 64.9\% of the predictive signal versus 35.1\% for the lower body, with wrist position (14.8\%) and trunk lateral tilt emerging as the most informative joint group and biomechanical feature, respectively. We further show that grip-defined variants (four-seam vs.\ two-seam fastball) are not separable from pose, establishing an empirical ceiling near 80\% and delineating where kinematic information ends and ball-flight information begins.

cross How Does the ReLU Activation Affect the Implicit Bias of Gradient Descent on High-dimensional Neural Network Regression?

Authors: Kuo-Wei Lai, Guanghui Wang, Molei Tao, Vidya Muthukumar

Abstract: Overparameterized ML models, including neural networks, typically induce underdetermined training objectives with multiple global minima. The implicit bias refers to the limiting global minimum that is attained by a common optimization algorithm, such as gradient descent (GD). In this paper, we characterize the implicit bias of GD for training a shallow ReLU model with the squared loss on high-dimensional random features. Prior work showed that the implicit bias does not exist in the worst-case (Vardi and Shamir, 2021), or corresponds exactly to the minimum-l2-norm solution among all global minima under exactly orthogonal data (Boursier et al., 2022). Our work interpolates between these two extremes and shows that, for sufficiently high-dimensional random data, the implicit bias approximates the minimum-l2-norm solution with high probability with a gap on the order $\Theta(\sqrt{n/d})$, where n is the number of training examples and d is the feature dimension. Our results are obtained through a novel primal-dual analysis, which carefully tracks the evolution of predictions, data-span coefficients, as well as their interactions, and shows that the ReLU activation pattern quickly stabilizes with high probability over the random data.

cross VPWEM: Non-Markovian Visuomotor Policy with Working and Episodic Memory

Authors: Yuheng Lei, Zhixuan Liang, Hongyuan Zhang, Ping Luo

Abstract: Imitation learning from human demonstrations has achieved significant success in robotic control, yet most visuomotor policies still condition on single-step observations or short-context histories, making them struggle with non-Markovian tasks that require long-term memory. Simply enlarging the context window incurs substantial computational and memory costs and encourages overfitting to spurious correlations, leading to catastrophic failures under distribution shift and violating real-time constraints in robotic systems. By contrast, humans can compress important past experiences into long-term memories and exploit them to solve tasks throughout their lifetime. In this paper, we propose VPWEM, a non-Markovian visuomotor policy equipped with working and episodic memories. VPWEM retains a sliding window of recent observation tokens as short-term working memory, and introduces a Transformer-based contextual memory compressor that recursively converts out-of-window observations into a fixed number of episodic memory tokens. The compressor uses self-attention over a cache of past summary tokens and cross-attention over a cache of historical observations, and is trained jointly with the policy. We instantiate VPWEM on diffusion policies to exploit both short-term and episode-wide information for action generation with nearly constant memory and computation per step. Experiments demonstrate that VPWEM outperforms state-of-the-art baselines including diffusion policies and vision-language-action (VLA) models by more than 20% on the memory-intensive manipulation tasks in MIKASA and achieves an average 5% improvement on the mobile manipulation benchmark MoMaRT. Code is available at https://github.com/HarryLui98/code_vpwem.

URLs: https://github.com/HarryLui98/code_vpwem.

cross Person Detection and Tracking from an Overhead Crane LiDAR

Authors: Nilusha Jayawickrama, Henrik Toikka, Risto Ojala

Abstract: This paper investigates person detection and tracking in an industrial indoor workspace using a LiDAR mounted on an overhead crane. The overhead viewpoint introduces a strong domain shift from common vehicle-centric LiDAR benchmarks, and limited availability of suitable public training data. Henceforth, we curate a site-specific overhead LiDAR dataset with 3D human bounding-box annotations and adapt selected candidate 3D detectors under a unified training and evaluation protocol. We further integrate lightweight tracking-by-detection using AB3DMOT and SimpleTrack to maintain person identities over time. Detection performance is reported with distance-sliced evaluation to quantify the practical operating envelope of the sensing setup. The best adapted detector configurations achieve average precision (AP) up to 0.84 within a 5.0 m horizontal radius, increasing to 0.97 at 1.0 m, with VoxelNeXt and SECOND emerging as the most reliable backbones across this range. The acquired results contribute in bridging the domain gap between standard driving datasets and overhead sensing for person detection and tracking. We also report latency measurements, highlighting practical real-time feasibility. Finally, we release our dataset and implementations in GitHub to support further research

cross TimeWarp: Evaluating Web Agents by Revisiting the Past

Authors: Md Farhan Ishmam, Kenneth Marino

Abstract: The improvement of web agents on current benchmarks raises the question: Do today's agents perform just as well when the web changes? We introduce TimeWarp, a benchmark that emulates the evolving web using containerized environments that vary in UI, design, and layout. TimeWarp consists of three web environments, each with six UI versions spanning different eras of the internet, paired with a set of complex, realistic tasks requiring different forms of web navigation. Our experiments reveal web agents' vulnerability to changes and the limitations of behavior cloning (BC) on single-version trajectories. To address this, we propose TimeTraj, a simple yet effective algorithm that uses plan distillation to collect trajectories across multiple versions. By training agents on teacher rollouts using our BC-variant, we achieve substantial performance gains: $20.4\%\rightarrow37.7\%$ for Qwen-3 4B and $0\%\rightarrow27.0\%$ for Llama-3.1 8B models. We hope our work helps researchers study generalization across web designs and unlock a new paradigm for collecting plans rather than trajectories, thereby improving the robustness of web agents.

cross Replaying pre-training data improves fine-tuning

Authors: Suhas Kotha, Percy Liang

Abstract: To obtain a language model for a target domain (e.g. math), the current paradigm is to pre-train on a vast amount of generic web text and then fine-tune on the relatively limited amount of target data. Typically, generic data is only mixed in during fine-tuning to prevent catastrophic forgetting of the generic domain. We surprisingly find that replaying the generic data during fine-tuning can actually improve performance on the (less related) target task. Concretely, in a controlled pre-training environment with 4M target tokens, 4B total tokens, and 150M parameter models, generic replay increases target data efficiency by up to $1.87\times$ for fine-tuning and $2.06\times$ for mid-training. We further analyze data schedules that introduce target data during pre-training and find that replay helps more when there is less target data present in pre-training. We demonstrate the success of replay in practice for fine-tuning 8B parameter models, improving agentic web navigation success by $4.5\%$ and Basque question-answering accuracy by $2\%$.

cross Measuring the Fragility of Trust: Devising Credibility Index via Explanation Stability (CIES) for Business Decision Support Systems

Authors: Alin-Gabriel Vaduva, Simona-Vasilica Oprea, Adela Bara

Abstract: Explainable Artificial Intelligence (XAI) methods (SHAP, LIME) are increasingly adopted to interpret models in high-stakes businesses. However, the credibility of these explanations, their stability under realistic data perturbations, remains unquantified. This paper introduces the Credibility Index via Explanation Stability (CIES), a mathematically grounded metric that measures how robust a model's explanations are when subject to realistic business noise. CIES captures whether the reasons behind a prediction remain consistent, not just the prediction itself. The metric employs a rank-weighted distance function that penalizes instability in the most important features disproportionately, reflecting business semantics where changes in top decision drivers are more consequential than changes in marginal features. We evaluate CIES across three datasets (customer churn, credit risk, employee attrition), four tree-based classification models and two data balancing conditions. Results demonstrate that model complexity impacts explanation credibility, class imbalance treatment via SMOTE affects not only predictive performance but also explanation stability, and CIES provides statistically superior discriminative power compared to a uniform baseline metric (p < 0.01 in all 24 configurations). A sensitivity analysis across four noise levels confirms the robustness of the metric itself. These findings offer business practitioners a deployable "credibility warning system" for AI-driven decision support.

cross RepoLaunch: Automating Build&Test Pipeline of Code Repositories on ANY Language and ANY Platform

Authors: Kenan Li, Rongzhi Li, Linghao Zhang, Qirui Jin, Liao Zhu, Xiaosong Huang, Geng Zhang, Yikai Zhang, Shilin He, Chengxing Xie, Xin Zhang, Zijian Jin, Bowen Li, Chaoyun Zhang, Yu Kang, Yufan Huang, Elsie Nallipogu, Saravan Rajmohan, Qingwei Lin, Dongmei Zhang

Abstract: Building software repositories typically requires significant manual effort. Recent advances in large language model (LLM) agents have accelerated automation in software engineering (SWE). We introduce RepoLaunch, the first agent capable of automatically resolving dependencies, compiling source code, and extracting test results for repositories across arbitrary programming languages and operating systems. To demonstrate its utility, we further propose a fully automated pipeline for SWE dataset creation, where task design is the only human intervention. RepoLaunch automates the remaining steps, enabling scalable benchmarking and training of coding agents and LLMs. Notably, several works on agentic benchmarking and training have recently adopted RepoLaunch for automated task generation.

cross Good-Enough LLM Obfuscation (GELO)

Authors: Anatoly Belikov, Ilya Fedotov

Abstract: Large Language Models (LLMs) are increasingly served on shared accelerators where an adversary with read access to device memory can observe KV caches and hidden states, threatening prompt privacy for open-source models. Cryptographic protections such as MPC and FHE offer strong guarantees but remain one to two orders of magnitude too slow for interactive inference, while static obfuscation schemes break under multi-run statistical attacks once the model is known. We present GELO (Good-Enough LLM Obfuscation), a lightweight protocol for privacy-preserving inference that limits information leakage from untrusted accelerator observations by hiding hidden states with fresh, per-batch invertible mixing. For each offloaded projection, the TEE samples a random matrix A, forms $U = AH$, offloads U and weights W to the accelerator, and then applies $A^-1$ on return, so that $A^-1 ((AH)W ) = HW$ and outputs are unchanged. Because mixing is never reused across batches, the attacker faces only a single-batch blind source separation problem. We analyze information leakage and introduce two practical defenses: (i) non-orthogonal mixing to mask Gram matrices, and (ii) orthogonal mixing augmented with a small fraction of high-energy "shield" vectors that pollute higher-order statistics. On Llama-2 7B, GELO preserves float32 outputs exactly, closely matches low-precision baselines, offloads the dominant matrix multiplications with about 20-30% latency overhead, and defeats a range of ICA/BSS and anchor-based attacks.

cross ARC-TGI: Human-Validated Task Generators with Reasoning Chain Templates for ARC-AGI

Authors: Jens Lehmann, Syeda Khushbakht, Nikoo Salehfard, Nur A Zarin Nishat, Dhananjay Bhandiwad, Andrei Aioanei, Sahar Vahdati

Abstract: The Abstraction and Reasoning Corpus (ARC-AGI) probes few-shot abstraction and rule induction on small visual grids, but progress is difficult to measure on static collections of hand-authored puzzles due to overfitting, dataset leakage, and memorisation. We introduce ARC-TGI (ARC Task Generators Inventory), an open-source framework for task-family generators: compact Python programs that sample diverse ARC-AGI tasks while preserving a latent rule. ARC-TGI is built around a solver-facing representation: each generated task is paired with natural-language input and transformation reasoning chains and partially evaluated Python code implementing sampling, transformation, and episode construction. Crucially, ARC-TGI supports task-level constraints so that training examples collectively expose the variations needed to infer the underlying rule, a requirement for human-solvable ARC tasks that independent per-example sampling often fails to guarantee. All generators undergo human refinement and local verification to keep both grids and reasoning traces natural and consistent under variation. We release 461 generators covering 180 ARC-Mini tasks, 215 ARC-AGI-1 tasks (200 train, 15 test), and 66 ARC-AGI-2 tasks (55 train, 11 test), enabling scalable dataset sampling and controlled benchmarking.

cross BLINK: Behavioral Latent Modeling of NK Cell Cytotoxicity

Authors: Iman Nematollahi, Jose Francisco Villena-Ossa, Alina Moter, Kiana Farhadyar, Gabriel Kalweit, Abhinav Valada, Toni Cathomen, Evelyn Ullrich, Maria Kalweit

Abstract: Machine learning models of cellular interaction dynamics hold promise for understanding cell behavior. Natural killer (NK) cell cytotoxicity is a prominent example of such interaction dynamics and is commonly studied using time-resolved multi-channel fluorescence microscopy. Although tumor cell death events can be annotated at single frames, NK cytotoxic outcome emerges over time from cellular interactions and cannot be reliably inferred from frame-wise classification alone. We introduce BLINK, a trajectory-based recurrent state-space model that serves as a cell world model for NK-tumor interactions. BLINK learns latent interaction dynamics from partially observed NK-tumor interaction sequences and predicts apoptosis increments that accumulate into cytotoxic outcomes. Experiments on long-term time-lapse NK-tumor recordings show improved cytotoxic outcome detection and enable forecasting of future outcomes, together with an interpretable latent representation that organizes NK trajectories into coherent behavioral modes and temporally structured interaction phases. BLINK provides a unified framework for quantitative evaluation and structured modeling of NK cytotoxic behavior at the single-cell level.

cross SRasP: Self-Reorientation Adversarial Style Perturbation for Cross-Domain Few-Shot Learning

Authors: Wenqian Li, Pengfei Fang, Hui Xue

Abstract: Cross-Domain Few-Shot Learning (CD-FSL) aims to transfer knowledge from a seen source domain to unseen target domains, serving as a key benchmark for evaluating the robustness and transferability of models. Existing style-based perturbation methods mitigate domain shift but often suffer from gradient instability and convergence to sharp minima.To address these limitations, we propose a novel crop-global style perturbation network, termed Self-Reorientation Adversarial \underline{S}tyle \underline{P}erturbation (SRasP). Specifically, SRasP leverages global semantic guidance to identify incoherent crops, followed by reorienting and aggregating the style gradients of these crops with the global style gradients within one image. Furthermore, we propose a novel multi-objective optimization function to maximize visual discrepancy while enforcing semantic consistency among global, crop, and adversarial features. Applying the stabilized perturbations during training encourages convergence toward flatter and more transferable solutions, improving generalization to unseen domains. Extensive experiments are conducted on multiple CD-FSL benchmarks, demonstrating consistent improvements over state-of-the-art methods.

cross Particle-Guided Diffusion for Gas-Phase Reaction Kinetics

Authors: Andrew Millard, Henrik Pedersen

Abstract: Physics-guided sampling with diffusion model priors has shown promise for solving partial differential equation (PDE) governed problems, but applications to chemically meaningful reaction-transport systems remain limited. We apply diffusion-based guided sampling to gas-phase chemical reactions by training on solutions of the advection-reaction-diffusion (ARD) equation across varying parameters. The method generates physically consistent concentration fields and accurately predicts outlet concentrations, including at unseen parameter values, demonstrating the potential of diffusion models for inference in reactive transport.

cross Recurrent Graph Neural Networks and Arithmetic Circuits

Authors: Timon Barlag, Vivian Holzapfel, Laura Strieker, Jonni Virtema, Heribert Vollmer

Abstract: We characterise the computational power of recurrent graph neural networks (GNNs) in terms of arithmetic circuits over the real numbers. Our networks are not restricted to aggregate-combine GNNs or other particular types. Generalizing similar notions from the literature, we introduce the model of recurrent arithmetic circuits, which can be seen as arithmetic analogues of sequential or logical circuits. These circuits utilise so-called memory gates which are used to store data between iterations of the recurrent circuit. While (recurrent) GNNs work on labelled graphs, we construct arithmetic circuits that obtain encoded labelled graphs as real valued tuples and then compute the same function. For the other direction we construct recurrent GNNs which are able to simulate the computations of recurrent circuits. These GNNs are given the circuit-input as initial feature vectors and then, after the GNN-computation, have the circuit-output among the feature vectors of its nodes. In this way we establish an exact correspondence between the expressivity of recurrent GNNs and recurrent arithmetic circuits operating over real numbers.

cross Feature Resemblance: On the Theoretical Understanding of Analogical Reasoning in Transformers

Authors: Ruichen Xu, Wenjing Yan, Ying-Jun Angela Zhang

Abstract: Understanding reasoning in large language models is complicated by evaluations that conflate multiple reasoning types. We isolate analogical reasoning (inferring shared properties between entities based on known similarities) and analyze its emergence in transformers. We theoretically prove three key results: (1) Joint training on similarity and attribution premises enables analogical reasoning through aligned representations; (2) Sequential training succeeds only when similarity structure is learned before specific attributes, revealing a necessary curriculum; (3) Two-hop reasoning ($a \to b, b \to c \implies a \to c$) reduces to analogical reasoning with identity bridges ($b = b$), which must appear explicitly in training data. These results reveal a unified mechanism: transformers encode entities with similar properties into similar representations, enabling property transfer through feature alignment. Experiments with architectures up to 1.5B parameters validate our theory and demonstrate how representational geometry shapes inductive reasoning capabilities.

cross The Impact of Preprocessing Methods on Racial Encoding and Model Robustness in CXR Diagnosis

Authors: Dishantkumar Sutariya, Eike Petersen

Abstract: Deep learning models can identify racial identity with high accuracy from chest X-ray (CXR) recordings. Thus, there is widespread concern about the potential for racial shortcut learning, where a model inadvertently learns to systematically bias its diagnostic predictions as a function of racial identity. Such racial biases threaten healthcare equity and model reliability, as models may systematically misdiagnose certain demographic groups. Since racial shortcuts are diffuse - non-localized and distributed throughout the whole CXR recording - image preprocessing methods may influence racial shortcut learning, yet the potential of such methods for reducing biases remains underexplored. Here, we investigate the effects of image preprocessing methods including lung masking, lung cropping, and Contrast Limited Adaptive Histogram Equalization (CLAHE). These approaches aim to suppress spurious cues encoding racial information while preserving diagnostic accuracy. Our experiments reveal that simple bounding box-based lung cropping can be an effective strategy for reducing racial shortcut learning while maintaining diagnostic model performance, bypassing frequently postulated fairness-accuracy trade-offs.

cross A Geometry-Adaptive Deep Variational Framework for Phase Discovery in the Landau-Brazovskii Model

Authors: Yuchen Xie, Jianyuan Yin, Lei Zhang

Abstract: The discovery of ordered structures in pattern-forming systems, such as the Landau-Brazovskii (LB) model, is often limited by the sensitivity of numerical solvers to the prescribed computational domain size. Incompatible domains induce artificial stress, frequently trapping the system in high-energy metastable configurations. To resolve this issue, we propose a Geometry-Adaptive Deep Variational Framework (GeoDVF) that jointly optimizes the infinite-dimensional order parameter, which is parameterized by a neural network, and the finite-dimensional geometric parameters of the computational domain. By explicitly treating the domain size as trainable variables within the variational formulation, GeoDVF naturally eliminates artificial stress during training. To escape the attraction basin of the disordered phase under small initializations, we introduce a warmup penalty mechanism, which effectively destabilizes the disordered phase, enabling the spontaneous nucleation of complex three-dimensional ordered phases from random initializations. Furthermore, we design a guided initialization protocol to resolve topologically intricate phases associated with narrow basins of attraction. Extensive numerical experiments show that GeoDVF provides a robust and geometry-consistent variational solver capable of identifying both stable and metastable states without prior knowledge.

cross Balancing Coverage and Draft Latency in Vocabulary Trimming for Faster Speculative Decoding

Authors: Ofir Ben Shoham

Abstract: Speculative decoding accelerates inference for Large Language Models by using a lightweight draft model to propose candidate tokens that are verified in parallel by a larger target model. Prior work shows that the draft model often dominates speculative decoding latency, since it generates tokens sequentially and incurs high cost from its language modeling head as vocabulary size grows. This exposes a fundamental trade-off in draft model design: larger vocabularies improve token coverage and agreement with the target model, but incur higher draft latency, while smaller vocabularies reduce latency at the risk of missing tokens required for accurate draft generation. We address this trade-off through vocabulary trimming for draft models, motivated by the observation that domain-specific workloads use only a small fraction of the full vocabulary. We cast draft vocabulary selection as a constrained optimization problem that balances token coverage and draft latency. Coverage is computed over assistant responses in the training data, while latency is estimated using architecture-aware FLOPs that capture the cost of the language modeling head as a function of vocabulary size. We optimize a utility function with a Tree-structured Parzen Estimator to efficiently explore the coverage-latency Pareto frontier under a minimum coverage constraint. Experiments show improved speculative decoding throughput while reducing draft vocabularies by up to 97% with high coverage. On domain-specific tasks, we achieve up to 16% latency reduction and 20% throughput improvement, and up to 6.7% throughput gains on diverse out-of-distribution tasks.

cross KARL: Knowledge Agents via Reinforcement Learning

Authors: Jonathan D. Chang, Andrew Drozdov, Shubham Toshniwal, Owen Oertell, Alexander Trott, Jacob Portes, Abhay Gupta, Pallavi Koppol, Ashutosh Baheti, Sean Kulinski, Ivan Zhou, Irene Dea, Krista Opsahl-Ong, Simon Favreau-Lessard, Sean Owen, Jose Javier Gonzalez Ortiz, Arnav Singhvi, Xabi Andrade, Cindy Wang, Kartik Sreenivasan, Sam Havens, Jialu Liu, Peyton DeNiro, Wen Sun, Michael Bendersky, Jonathan Frankle

Abstract: We present a system for training enterprise search agents via reinforcement learning that achieves state-of-the-art performance across a diverse suite of hard-to-verify agentic search tasks. Our work makes four core contributions. First, we introduce KARLBench, a multi-capability evaluation suite spanning six distinct search regimes, including constraint-driven entity search, cross-document report synthesis, tabular numerical reasoning, exhaustive entity retrieval, procedural reasoning over technical documentation, and fact aggregation over internal enterprise notes. Second, we show that models trained across heterogeneous search behaviors generalize substantially better than those optimized for any single benchmark. Third, we develop an agentic synthesis pipeline that employs long-horizon reasoning and tool use to generate diverse, grounded, and high-quality training data, with iterative bootstrapping from increasingly capable models. Fourth, we propose a new post-training paradigm based on iterative large-batch off-policy RL that is sample efficient, robust to train-inference engine discrepancies, and naturally extends to multi-task training with out-of-distribution generalization. Compared to Claude 4.6 and GPT 5.2, KARL is Pareto-optimal on KARLBench across cost-quality and latency-quality trade-offs, including tasks that were out-of-distribution during training. With sufficient test-time compute, it surpasses the strongest closed models. These results show that tailored synthetic data in combination with multi-task reinforcement learning enables cost-efficient and high-performing knowledge agents for grounded reasoning.

cross Learning Optimal Individualized Decision Rules with Conditional Demographic Parity

Authors: Wenhai Cui, Wen Su, Donglin Zeng, Xingqiu Zhao

Abstract: Individualized decision rules (IDRs) have become increasingly prevalent in societal applications such as personalized marketing, healthcare, and public policy design. However, a critical ethical concern arises from the potential discriminatory effects of IDRs trained on biased data. These algorithms may disproportionately harm individuals from minority subgroups defined by sensitive attributes like gender, race, or language. To address this issue, we propose a novel framework that incorporates demographic parity (DP) and conditional demographic parity (CDP) constraints into the estimation of optimal IDRs. We show that the theoretically optimal IDRs under DP and CDP constraints can be obtained by applying perturbations to the unconstrained optimal IDRs, enabling a computationally efficient solution. Theoretically, we derive convergence rates for both policy value and the fairness constraint term. The effectiveness of our methods is illustrated through comprehensive simulation studies and an empirical application to the Oregon Health Insurance Experiment.

cross Boosting ASR Robustness via Test-Time Reinforcement Learning with Audio-Text Semantic Rewards

Authors: Linghan Fang, Tianxin Xie, Li Liu

Abstract: Recently, Automatic Speech Recognition (ASR) systems (e.g., Whisper) have achieved remarkable accuracy improvements but remain highly sensitive to real-world unseen data (data with large distribution shifts), including noisy environments and diverse accents. To address this issue, test-time adaptation (TTA) has shown great potential in improving the model adaptability at inference time without ground-truth labels, and existing TTA methods often rely on pseudo-labeling or entropy minimization. However, by treating model confidence as a learning signal, these methods may reinforce high-confidence errors, leading to confirmation bias that undermines adaptation. To overcome these limitations, we present ASR-TRA, a novel Test-time Reinforcement Adaptation framework inspired by causal intervention. More precisely, our method introduces a learnable decoder prompt and utilizes temperature-controlled stochastic decoding to generate diverse transcription candidates. These are scored by a reward model that measures audio-text semantic alignment, and the resulting feedback is used to update both model and prompt parameters via reinforcement learning. Comprehensive experiments on LibriSpeech with synthetic noise and L2 Arctic accented English datasets demonstrate that our method achieves higher accuracy while maintaining lower latency than existing TTA baselines. Ablation studies further confirm the effectiveness of combining audio and language-based rewards, highlighting our method's enhanced stability and interpretability. Overall, our approach provides a practical and robust solution for deploying ASR systems in challenging real-world conditions.

cross Layer by layer, module by module: Choose both for optimal OOD probing of ViT

Authors: Ambroise Odonnat, Vasilii Feofanov, Laetitia Chapel, Romain Tavenard, Ievgen Redko

Abstract: Recent studies have observed that intermediate layers of foundation models often yield more discriminative representations than the final layer. While initially attributed to autoregressive pretraining, this phenomenon has also been identified in models trained via supervised and discriminative self-supervised objectives. In this paper, we conduct a comprehensive study to analyze the behavior of intermediate layers in pretrained vision transformers. Through extensive linear probing experiments across a diverse set of image classification benchmarks, we find that distribution shift between pretraining and downstream data is the primary cause of performance degradation in deeper layers. Furthermore, we perform a fine-grained analysis at the module level. Our findings reveal that standard probing of transformer block outputs is suboptimal; instead, probing the activation within the feedforward network yields the best performance under significant distribution shift, whereas the normalized output of the multi-head self-attention module is optimal when the shift is weak.

cross Bayesian Supervised Causal Clustering

Authors: Luwei Wang, Nazir Lone, Sohan Seth

Abstract: Finding patient subgroups with similar characteristics is crucial for personalized decision-making in various disciplines such as healthcare and policy evaluation. While most existing approaches rely on unsupervised clustering methods, there is a growing trend toward using supervised clustering methods that identify operationalizable subgroups in the context of a specific outcome of interest. We propose Bayesian Supervised Causal Clustering (BSCC), with treatment effect as outcome to guide the clustering process. BSCC identifies homogenous subgroups of individuals who are similar in their covariate profiles as well as their treatment effects. We evaluate BSCC on simulated datasets as well as real-world dataset from the third International Stroke Trial to assess the practical usefulness of the framework.

cross Latent Policy Steering through One-Step Flow Policies

Authors: Hokyun Im, Andrey Kolobov, Jianlong Fu, Youngwoon Lee

Abstract: Offline reinforcement learning (RL) allows robots to learn from offline datasets without risky exploration. Yet, offline RL's performance often hinges on a brittle trade-off between (1) return maximization, which can push policies outside the dataset support, and (2) behavioral constraints, which typically require sensitive hyperparameter tuning. Latent steering offers a structural way to stay within the dataset support during RL, but existing offline adaptations commonly approximate action values using latent-space critics learned via indirect distillation, which can lose information and hinder convergence. We propose Latent Policy Steering (LPS), which enables high-fidelity latent policy improvement by backpropagating original-action-space Q-gradients through a differentiable one-step MeanFlow policy to update a latent-action-space actor. By eliminating proxy latent critics, LPS allows an original-action-space critic to guide end-to-end latent-space optimization, while the one-step MeanFlow policy serves as a behavior-constrained generative prior. This decoupling yields a robust method that works out-of-the-box with minimal tuning. Across OGBench and real-world robotic tasks, LPS achieves state-of-the-art performance and consistently outperforms behavioral cloning and strong latent steering baselines.

cross How important are the genes to explain the outcome - the asymmetric Shapley value as an honest importance metric for high-dimensional features

Authors: Mark A. van de Wiel, Jeroen Goedhart, Martin Jullum, Kjersti Aas

Abstract: In clinical prediction settings the importance of a high-dimensional feature like genomics is often assessed by evaluating the change in predictive performance when adding it to a set of traditional clinical variables. This approach is questionable, because it does not account for collinearity nor known directionality of dependencies between variables. We suggest to use asymmetric Shapley values as a more suitable alternative to quantify feature importance in the context of a mixed-dimensional prediction model. We focus on a setting that is particularly relevant in clinical prediction: disease state as a mediating variable for genomic effects, with additional confounders for which the direction of effects may be unknown. We derive efficient algorithms to compute local and global asymmetric Shapley values for this setting. The former are shown to be very useful for inference, whereas the latter provide interpretation by decomposing any predictive performance metric into contributions of the features. Throughout, we illustrate our framework by a leading example: the prediction of progression-free survival for colorectal cancer patients.

cross Bayes with No Shame: Admissibility Geometries of Predictive Inference

Authors: Nicholas G. Polson, Daniel Zantedeschi

Abstract: Four distinct admissibility geometries govern sequential and distribution-free inference: Blackwell risk dominance over convex risk sets, anytime-valid admissibility within the nonnegative supermartingale cone, marginal coverage validity over exchangeable prediction sets, and Ces\`aro approachability (CAA) admissibility, which reaches the risk-set boundary via approachability-style arguments rather than explicit priors. We prove a criterion separation theorem: the four classes of admissible procedures are pairwise non-nested. Each geometry carries a different certificate of optimality: a supporting-hyperplane prior (Blackwell), a nonnegative supermartingale (anytime-valid), an exchangeability rank (coverage), or a Ces\`aro steering argument (CAA). Martingale coherence is necessary for Blackwell admissibility and necessary and sufficient for anytime-valid admissibility within e-processes, but is not sufficient for Blackwell admissibility and is not necessary for coverage validity or CAA-admissibility. All four criteria share a common optimization template (minimize Bayesian risk subject to a feasibility constraint), but the constraint sets operate over different spaces, partial orders, and performance metrics, making them geometrically incompatible. Admissibility is irreducibly criterion-relative.

cross On the Statistical Optimality of Optimal Decision Trees

Authors: Zineng Xu, Subhroshekhar Ghosh, Yan Shuo Tan

Abstract: While globally optimal empirical risk minimization (ERM) decision trees have become computationally feasible and empirically successful, rigorous theoretical guarantees for their statistical performance remain limited. In this work, we develop a comprehensive statistical theory for ERM trees under random design in both high-dimensional regression and classification. We first establish sharp oracle inequalities that bound the excess risk of the ERM estimator relative to the best possible approximation achievable by any tree with at most $L$ leaves, thereby characterizing the interpretability-accuracy trade-off. We derive these results using a novel uniform concentration framework based on empirically localized Rademacher complexity. Furthermore, we derive minimax optimal rates over a novel function class: the piecewise sparse heterogeneous anisotropic Besov (PSHAB) space. This space explicitly captures three key structural features encountered in practice: sparsity, anisotropic smoothness, and spatial heterogeneity. While our main results are established under sub-Gaussianity, we also provide robust guarantees that hold under heavy-tailed noise settings. Together, these findings provide a principled foundation for the optimality of ERM trees and introduce empirical process tools broadly applicable to other highly adaptive, data-driven procedures.

cross Harnessing Synthetic Data from Generative AI for Statistical Inference

Authors: Ahmad Abdel-Azim, Ruoyu Wang, Xihong Lin

Abstract: The emergence of generative AI models has dramatically expanded the availability and use of synthetic data across scientific, industrial, and policy domains. While these developments open new possibilities for data analysis, they also raise fundamental statistical questions about when synthetic data can be used in a valid, reliable, and principled manner. This paper reviews the current landscape of synthetic data generation and use from a statistical perspective, with the goal of clarifying the assumptions under which synthetic data can meaningfully support downstream discovery, inference, and prediction. We survey major classes of modern generative models, their intended use cases, and the benefits they offer, while also highlighting their limitations and characteristic failure modes. We additionally examine common pitfalls that arise when synthetic data are treated as surrogates for real observations, including biases from model misspecification, attenuated uncertainty, and difficulties in generalization. Building on these insights, we discuss emerging frameworks for the principled use of synthetic data. We conclude with practical recommendations, open problems, and cautions intended to guide both method developers and applied researchers.

cross MobileFetalCLIP: Selective Repulsive Knowledge Distillation for Mobile Fetal Ultrasound Analysis

Authors: Numan Saeed, Fadillah Adamsyah Maani, Mohammad Yaqub

Abstract: Fetal ultrasound AI could transform prenatal care in low-resource settings, yet current foundation models exceed 300M visual parameters, precluding deployment on point-of-care devices. Standard knowledge distillation fails under such extreme capacity gaps (~26x), as compact students waste capacity mimicking architectural artifacts of oversized teachers. We introduce Selective Repulsive Knowledge Distillation, which decomposes contrastive KD into diagonal and off-diagonal components: matched pair alignment is preserved while the off-diagonal weight decays into negative values, repelling the student from the teacher's inter-class confusions and forcing discovery of architecturally native features. Our 11.4M parameter student surpasses the 304M-parameter FetalCLIP teacher on zero-shot HC18 biometry validity (88.6% vs. 83.5%) and brain sub-plane F1 (0.784 vs. 0.702), while running at 1.6 ms on iPhone 16 Pro, enabling real-time assistive AI on handheld ultrasound devices. Our code, models, and app are publicly available at https://github.com/numanai/MobileFetalCLIP.

URLs: https://github.com/numanai/MobileFetalCLIP.

cross Ensembling Language Models with Sequential Monte Carlo

Authors: Robin Shing Moon Chan, Tianyu Liu, Samuel Kiegeland, Clemente Pasti, Jacob Hoover Vigly, Timothy J. O'Donnell, Ryan Cotterell, Tim Vieira

Abstract: Practitioners have access to an abundance of language models and prompting strategies for solving many language modeling tasks; yet prior work shows that modeling performance is highly sensitive to both choices. Classical machine learning ensembling techniques offer a principled approach: aggregate predictions from multiple sources to achieve better performance than any single one. However, applying ensembling to language models during decoding is challenging: naively aggregating next-token probabilities yields samples from a locally normalized, biased approximation of the generally intractable ensemble distribution over strings. In this work, we introduce a unified framework for composing $K$ language models into $f$-ensemble distributions for a wide range of functions $f\colon\mathbb{R}_{\geq 0}^{K}\to\mathbb{R}_{\geq 0}$. To sample from these distributions, we propose a byte-level sequential Monte Carlo (SMC) algorithm that operates in a shared character space, enabling ensembles of models with mismatching vocabularies and consistent sampling in the limit. We evaluate a family of $f$-ensembles across prompt and model combinations for various structured text generation tasks, highlighting the benefits of alternative aggregation strategies over traditional probability averaging, and showing that better posterior approximations can yield better ensemble performance.

cross Thermodynamic Response Functions in Singular Bayesian Models

Authors: Sean Plummer

Abstract: Singular statistical models-including mixtures, matrix factorization, and neural networks-violate regular asymptotics due to parameter non-identifiability and degenerate Fisher geometry. Although singular learning theory characterizes marginal likelihood behavior through invariants such as the real log canonical threshold and singular fluctuation, these quantities remain difficult to interpret operationally. At the same time, widely used criteria such as WAIC and WBIC appear disconnected from underlying singular geometry. We show that posterior tempering induces a one-parameter deformation of the posterior distribution whose associated observables generate a hierarchy of thermodynamic response functions. A universal covariance identity links derivatives of tempered expectations to posterior fluctuations, placing WAIC, WBIC, and singular fluctuation within a unified response framework. Within this framework, classical quantities from singular learning theory acquire natural thermodynamic interpretations: RLCT governs the leading free-energy slope, singular fluctuation corresponds to curvature of the tempered free energy, and WAIC measures predictive fluctuation. We formalize an observable algebra that quotients out non-identifiable directions, allowing structurally meaningful order parameters to be constructed in singular models. Across canonical singular examples-including symmetric Gaussian mixtures, reduced-rank regression, and overparameterized neural networks-we empirically demonstrate phase-transition-like behavior under tempering. Order parameters collapse, susceptibilities peak, and complexity measures align with structural reorganization in posterior geometry. Our results suggest that thermodynamic response theory provides a natural organizing framework for interpreting complexity, predictive variability, and structural reorganization in singular Bayesian learning.

cross Reasoning Theater: Disentangling Model Beliefs from Chain-of-Thought

Authors: Siddharth Boppana, Annabel Ma, Max Loeffler, Raphael Sarfati, Eric Bigelow, Atticus Geiger, Owen Lewis, Jack Merullo

Abstract: We provide evidence of performative chain-of-thought (CoT) in reasoning models, where a model becomes strongly confident in its final answer, but continues generating tokens without revealing its internal belief. Our analysis compares activation probing, early forced answering, and a CoT monitor across two large models (DeepSeek-R1 671B & GPT-OSS 120B) and find task difficulty-specific differences: The model's final answer is decodable from activations far earlier in CoT than a monitor is able to say, especially for easy recall-based MMLU questions. We contrast this with genuine reasoning in difficult multihop GPQA-Diamond questions. Despite this, inflection points (e.g., backtracking, 'aha' moments) occur almost exclusively in responses where probes show large belief shifts, suggesting these behaviors track genuine uncertainty rather than learned "reasoning theater." Finally, probe-guided early exit reduces tokens by up to 80% on MMLU and 30% on GPQA-Diamond with similar accuracy, positioning attention probing as an efficient tool for detecting performative reasoning and enabling adaptive computation.

cross RoboPocket: Improve Robot Policies Instantly with Your Phone

Authors: Junjie Fang, Wendi Chen, Han Xue, Fangyuan Zhou, Tian Le, Yi Wang, Yuting Zhang, Jun Lv, Chuan Wen, Cewu Lu

Abstract: Scaling imitation learning is fundamentally constrained by the efficiency of data collection. While handheld interfaces have emerged as a scalable solution for in-the-wild data acquisition, they predominantly operate in an open-loop manner: operators blindly collect demonstrations without knowing the underlying policy's weaknesses, leading to inefficient coverage of critical state distributions. Conversely, interactive methods like DAgger effectively address covariate shift but rely on physical robot execution, which is costly and difficult to scale. To reconcile this trade-off, we introduce RoboPocket, a portable system that enables Robot-Free Instant Policy Iteration using single consumer smartphones. Its core innovation is a Remote Inference framework that visualizes the policy's predicted trajectory via Augmented Reality (AR) Visual Foresight. This immersive feedback allows collectors to proactively identify potential failures and focus data collection on the policy's weak regions without requiring a physical robot. Furthermore, we implement an asynchronous Online Finetuning pipeline that continuously updates the policy with incoming data, effectively closing the learning loop in minutes. Extensive experiments demonstrate that RoboPocket adheres to data scaling laws and doubles the data efficiency compared to offline scaling strategies, overcoming their long-standing efficiency bottleneck. Moreover, our instant iteration loop also boosts sample efficiency by up to 2$\times$ in distributed environments a small number of interactive corrections per person. Project page and videos: https://robo-pocket.github.io.

URLs: https://robo-pocket.github.io.

replace Localized Distributional Robustness in Submodular Multi-Task Subset Selection

Authors: Ege C. Kaya, Abolfazl Hashemi

Abstract: In this work, we treat the problem of multi-task submodular optimization from the perspective of local distributional robustness within the neighborhood of a reference distribution which assigns an importance score to each task. We initially propose to introduce a relative-entropy regularization term to the standard multi-task objective. We then demonstrate through duality that this novel formulation itself is equivalent to the maximization of a monotone increasing function composed with a submodular function, which may be efficiently carried out through standard greedy selection methods. This approach bridges the existing gap in the optimization of performance-robustness trade-offs in multi-task subset selection. To numerically validate our theoretical results, we test the proposed method in two different settings, one on the selection of satellites in low Earth orbit constellations in the context of a sensor selection problem involving weak-submodular functions, and the other on an image summarization task using neural networks involving submodular functions. Our method is compared with two other algorithms focused on optimizing the performance of the worst-case task, and on directly optimizing the performance on the reference distribution itself. We conclude that our novel formulation produces a solution that is locally distributional robust, and computationally inexpensive.

replace Learning to Cover: Online Learning and Optimization with Irreversible Decisions

Authors: Alexandre Jacquillat, Michael Lingzhi Li

Abstract: We define an online learning and optimization problem with discrete and irreversible decisions contributing toward a coverage target. In each period, a decision-maker selects facilities to open, receives information on the success of each one, and updates a classification model to guide future decisions. The goal is to minimize facility openings under a chance constraint reflecting the coverage target, in an asymptotic regime characterized by a large target number of facilities $m\to\infty$ but a finite horizon $T \in \mathcal{Z}_+$. We prove that, under statistical conditions, the online classifier converges to the Bayes-optimal classifier at a rate of at best $\mathcal{O}(1/\sqrt n)$. Thus, we formulate our online learning and optimization problem, with a generalized learning rate $r>0$ and a residual error $1-p$. We derive an asymptotically optimal algorithm and an asymptotically tight lower bound. The regret grows in $\Theta\left(m^{\frac{1-r}{1-r^T}}\right)$ if $p=1$ (perfect learning) or in $\Theta\left(\max\left\{m^{\frac{1-r}{1-r^T}},\sqrt{m}\right\}\right)$ otherwise; in particular, the regret rate is sub-linear and converges exponentially fast to its infinite-horizon limit. We extend this result to a more complicated facility location setting in a bipartite facility-customer graph with a target on customer coverage. Throughout, constructive proofs identify a policy featuring limited exploration initially and fast exploitation later on once uncertainty gets mitigated. These results uncover the benefits of limited online learning and optimization through pilot programs prior to full-fledged expansion.

replace Towards a Fairer Non-negative Matrix Factorization

Authors: Lara Kassab, Erin George, Deanna Needell, Haowen Geng, Nika Jafar Nia, Aoxi Li

Abstract: There has been a recent critical need to study fairness and bias in machine learning (ML) algorithms. Since there is clearly no one-size-fits-all solution to fairness, ML methods should be developed alongside bias mitigation strategies that are practical and approachable to the practitioner. Motivated by recent work on ``fair" PCA, here we consider the more challenging method of non-negative matrix factorization (NMF) as both a showcasing example and a method that is important in its own right for both topic modeling tasks and feature extraction for other ML tasks. We demonstrate that a modification of the objective function, by using a min-max formulation, may \textit{sometimes} be able to offer an improvement in fairness for groups in the population. We derive two methods for the objective minimization, a multiplicative update rule as well as an alternating minimization scheme, and discuss implementation practicalities. We include a suite of synthetic and real experiments that show how the method may improve fairness while also highlighting the important fact that this may sometime increase error for some individuals and fairness is not a rigid definition and method choice should strongly depend on the application at hand.

replace An Experimental Study on Fairness-aware Machine Learning for Credit Scoring Problems

Authors: Huyen Giang Thi Thu, Thang Viet Doan, Ha-Bang Ban, Tai Le Quy

Abstract: The digitalization of credit scoring has become essential for financial institutions and commercial banks, especially in the era of digital transformation. Machine learning techniques are commonly used to evaluate customers' creditworthiness. However, the predicted outcomes of machine learning models can be biased toward protected attributes, such as race or gender. Numerous fairness-aware machine learning models and fairness measures have been proposed. Nevertheless, their performance in the context of credit scoring has not been thoroughly investigated. In this paper, we present a comprehensive experimental study of fairness-aware machine learning in credit scoring. The study explores key aspects of credit scoring, including financial datasets, predictive models, and fairness measures. We also provide a detailed evaluation of fairness-aware predictive models and fairness measures on widely used financial datasets. The experimental results show that fairness-aware models achieve a better balance between predictive accuracy and fairness compared to traditional classification models.

replace Path Planning for Masked Diffusion Model Sampling

Authors: Fred Zhangzhi Peng, Zachary Bezemek, Sawan Patel, Jarrid Rector-Brooks, Sherwood Yao, Avishek Joey Bose, Alexander Tong, Pranam Chatterjee

Abstract: Any order generation of discrete data using masked diffusion models (MDMs) offers a compelling alternative to traditional autoregressive models, especially in domains that lack a natural causal ordering of data. However, current popular MDMs depart from their successful continuous diffusion model counterparts with simplified masked inference wherein unmasked tokens cannot be iteratively refined -- even if there is a mistake. In this paper, we extract the full power of MDMs by introducing a novel inference sampling strategy termed Path Planning (P2) that decomposes each generation step into two sub-stages: planning and denoising. Under P2, the planner at every step selects appropriate tokens that are marked to be updated, which can then be sampled using the denoiser. We demonstrate that P2 generalizes all existing sampling strategies for MDMs and critically enhances generative quality through the new capability of refining and updating existing unmasked tokens. We theoretically prove that P2 establishes a (new) expanded evidence lower bound (ELBO) on the log marginal likelihood of data. We instantiate P2 with a family of planners including: 1.) Self-Planning, 2.) BERT-Planning, and 3.) Trained-Planning with a learned planner leading to SOTA generative performance for MDMs on a suite of domains. Specifically, solely using P2 inference, we observe relative improvements of 22% in protein sequence foldability, 8% in RNA sequence pLDDT, 4% in math reasoning, 68% in story generation (ROUGE score), and 33% in code generation for the challenging pass@1 metric.

replace Curse of Dimensionality in Neural Network Optimization

Authors: Sanghoon Na, Haizhao Yang

Abstract: This paper demonstrates that when a shallow neural network with a Lipschitz continuous activation function is trained using either empirical or population risk to approximate a target function that is $r$ times continuously differentiable on $[0,1]^d$, the population risk may not decay at a rate faster than $t^{-\frac{4r}{d-2r}}$, where $t$ denotes the time parameter of the gradient flow dynamics. This result highlights the presence of the curse of dimensionality in the optimization computation required to achieve a desired accuracy. Instead of analyzing parameter evolution directly, the training dynamics are examined through the evolution of the parameter distribution under the 2-Wasserstein gradient flow. Furthermore, it is established that the curse of dimensionality persists when a locally Lipschitz continuous activation function is employed, where the Lipschitz constant in $[-x,x]$ is bounded by $O(x^\delta)$ for any $x \in \mathbb{R}$. In this scenario, the population risk is shown to decay at a rate no faster than $t^{-\frac{(4+2\delta)r}{d-2r}}$. Understanding how function smoothness influences the curse of dimensionality in neural network optimization theory is an important and underexplored direction that this work aims to address.

replace FBFL: A Field-Based Coordination Approach for Data Heterogeneity in Federated Learning

Authors: Davide Domini, Gianluca Aguzzi, Lukas Esterle, Mirko Viroli

Abstract: In the last years, Federated learning (FL) has become a popular solution to train machine learning models in domains with high privacy concerns. However, FL scalability and performance face significant challenges in real-world deployments where data across devices are non-independently and identically distributed (non-IID). The heterogeneity in data distribution frequently arises from spatial distribution of devices, leading to degraded model performance in the absence of proper handling. Additionally, FL typical reliance on centralized architectures introduces bottlenecks and single-point-of-failure risks, particularly problematic at scale or in dynamic environments. To close this gap, we propose Field-Based Federated Learning (FBFL), a novel approach leveraging macroprogramming and field coordination to address these limitations through: (i) distributed spatial-based leader election for personalization to mitigate non-IID data challenges; and (ii) construction of a self-organizing, hierarchical architecture using advanced macroprogramming patterns. Moreover, FBFL not only overcomes the aforementioned limitations, but also enables the development of more specialized models tailored to the specific data distribution in each subregion. This paper formalizes FBFL and evaluates it extensively using MNIST, FashionMNIST, and Extended MNIST datasets. We demonstrate that, when operating under IID data conditions, FBFL performs comparably to the widely-used FedAvg algorithm. Furthermore, in challenging non-IID scenarios, FBFL not only outperforms FedAvg but also surpasses other state-of-the-art methods, namely FedProx and Scaffold, which have been specifically designed to address non-IID data distributions. Additionally, we showcase the resilience of FBFL's self-organizing hierarchical architecture against server failures.

replace Double Momentum and Error Feedback for Clipping with Fast Rates and Differential Privacy

Authors: Rustem Islamov, Samuel Horvath, Aurelien Lucchi, Peter Richtarik, Eduard Gorbunov

Abstract: Strong Differential Privacy (DP) and Optimization guarantees are two desirable properties for a method in Federated Learning (FL). However, existing algorithms do not achieve both properties at once: they either have optimal DP guarantees but rely on restrictive assumptions such as bounded gradients/bounded data heterogeneity, or they ensure strong optimization performance but lack DP guarantees. To address this gap in the literature, we propose and analyze a new method called Clip21-SGD2M based on a novel combination of clipping, heavy-ball momentum, and Error Feedback. In particular, for non-convex smooth distributed problems with clients having arbitrarily heterogeneous data, we prove that Clip21-SGD2M has optimal convergence rate and also near optimal (local-)DP neighborhood. Our numerical experiments on non-convex logistic regression and training of neural networks highlight the superiority of Clip21-SGD2M over baselines in terms of the optimization performance for a given DP-budget.

replace Generative Models in Decision Making: A Survey

Authors: Xinyu Shao, Jianping Zhang, Haozhi Wang, Leo Maxime Brunswic, Kaiwen Zhou, Jiqian Dong, Kaiyang Guo, Zhitang Chen, Jun Wang, Jianye Hao, Xiu Li, Yinchuan Li

Abstract: Generative models have fundamentally reshaped the landscape of decision-making, reframing the problem from pure scalar reward maximization to high-fidelity trajectory generation and distribution matching. This paradigm shift addresses intrinsic limitations in classical Reinforcement Learning (RL), particularly the limited expressivity of standard unimodal policy distributions in capturing complex, multi-modal behaviors embedded in diverse datasets. However, current literature often treats these models as isolated algorithmic improvements, rarely synthesizing them into a single comprehensive framework. This survey proposes a principled taxonomy grounding generative decision-making within the probabilistic framework of Control as Inference. By performing a variational factorization of the trajectory posterior, we conceptualize four distinct functional roles: Controllers for amortized policy inference, Modelers for dynamics priors, Optimizers for iterative trajectory refinement, and Evaluators for trajectory guidance and value assessment. Unlike existing architecture-centric reviews, this function-centric framework allows us to critically analyze representative generative families across distinct dimensions. Furthermore, we examine deployment in high-stakes domains, specifically Embodied AI, Autonomous Driving, and AI for Science, highlighting systemic risks such as dynamics hallucination in world models and proxy exploitation. Finally, we chart the path toward Generalist Physical Intelligence, identifying pivotal challenges in inference efficiency, trustworthiness, and the emergence of Physical Foundation Models.

replace BACE-RUL: A Bi-directional Adversarial Network with Covariate Encoding for Machine Remaining Useful Life Prediction

Authors: Zekai Zhang, Dan Li, Shunyu Wu, Junya Cai, Bo Zhang, See Kiong Ng, Zibin Zheng

Abstract: Prognostic and Health Management (PHM) are crucial ways to avoid unnecessary maintenance for Cyber-Physical Systems (CPS) and improve system reliability. Predicting the Remaining Useful Life (RUL) is one of the most challenging tasks for PHM. Existing methods require prior knowledge about the system, contrived assumptions, or temporal mining to model the life cycles of machine equipment/devices, resulting in diminished accuracy and limited applicability in real-world scenarios. This paper proposes a Bi-directional Adversarial network with Covariate Encoding for machine Remaining Useful Life (BACE-RUL) prediction, which only adopts sensor measurements from the current life cycle to predict RUL rather than relying on previous consecutive cycle recordings. The current sensor measurements of mechanical devices are encoded to a conditional space to better understand the implicit inner mechanical status. The predictor is trained as a conditional generative network with the encoded sensor measurements as its conditions. Various experiments on several real-world datasets, including the turbofan aircraft engine dataset and the dataset collected from degradation experiments of Li-Ion battery cells, show that the proposed model is a general framework and outperforms state-of-the-art methods.

replace ms-Mamba: Multi-scale Mamba for Time-Series Forecasting

Authors: Yusuf Meric Karadag, Ismail Talaz, Ipek Gursel Dino, Sinan Kalkan

Abstract: The problem of Time-series Forecasting is generally addressed by recurrent, Transformer-based and the recently proposed Mamba-based architectures. However, existing architectures generally process their input at a single temporal scale, which may be sub-optimal for many tasks where information changes over multiple time scales. In this paper, we introduce a novel architecture called Multi-scale Mamba (ms-Mamba) to address this gap. ms-Mamba incorporates multiple temporal scales by using multiple Mamba blocks with different sampling rates ($\Delta$s). Our experiments on many benchmarks demonstrate that ms-Mamba outperforms state-of-the-art approaches, including the recently proposed Transformer-based and Mamba-based models. For example, on the Solar-Energy dataset, ms-Mamba outperforms its closest competitor S-Mamba (0.229 vs. 0.240 in terms of mean-squared error) while using fewer parameters (3.53M vs. 4.77M), less memory (13.46MB vs. 18.18MB), and less operations (14.93G vs. 20.53G MACs), averaged across four forecast lengths. Codes and models will be made available.

replace TianQuan-S2S: A Subseasonal-to-Seasonal Global Weather Model via Incorporate Climatology State

Authors: Guowen Li, Xintong Liu, Yang Liu, Mengxuan Chen, Shilei Cao, Xuehe Wang, Juepeng Zheng, Jinxiao Zhang, Haoyuan Liang, Lixian Zhang, Jiuke Wang, Meng Jin, Hong Cheng, Haohuan Fu

Abstract: Accurate Subseasonal-to-Seasonal (S2S) forecasting is vital for decision-making in agriculture, energy production, and emergency management. However, it remains a challenging and underexplored problem due to the chaotic nature of the weather system. Recent data-driven studies have shown promising results, but their performance is limited by the inadequate incorporation of climate states and a model tendency to degrade, progressively losing fine-scale details and yielding over-smoothed forecasts. To overcome these limitations, we propose TianQuan-S2S, a global S2S forecasting model that integrates initial weather states with climatological means via incorporating climatology into patch embedding and enhancing variability capture through an uncertainty-augmented Transformer. Extensive experiments on the Earth Reanalysis 5 (ERA5) reanalysis dataset demonstrate that our model yields a significant improvement in both deterministic and ensemble forecasting over the climatology mean, traditional numerical methods, and data-driven models. Ablation studies empirically show the effectiveness of our model designs. Remarkably, our model outperforms skillful numerical ECMWF-S2S and advanced data-driven Fuxi-S2S in key meteorological variables. The code implementation can be found in https://github.com/zhangminglang42/TianQuan.

URLs: https://github.com/zhangminglang42/TianQuan.

replace Learning Virtual Machine Scheduling in Cloud Computing through Language Agents

Authors: JieHao Wu, Ziwei Wang, Junjie Sheng, Wenhao Li, Xiangfeng Wang, Jun Luo

Abstract: In cloud services, virtual machine (VM) scheduling is a typical Online Dynamic Multidimensional Bin Packing (ODMBP) problem, characterized by large-scale complexity and fluctuating demands. Traditional optimization methods struggle to adapt to real-time changes, domain-expert-designed heuristic approaches suffer from rigid strategies, and existing learning-based methods often lack generalizability and interpretability. To address these limitations, this paper proposes a hierarchical language agent framework named MiCo, which provides a large language model (LLM)-driven heuristic design paradigm for solving ODMBP. Specifically, ODMBP is formulated as a Semi-Markov Decision Process with Options (SMDP-Option), enabling dynamic scheduling through a two-stage architecture, i.e., Option Miner and Option Composer. Option Miner utilizes LLMs to discover diverse and useful non-context-aware strategies by interacting with constructed environments. Option Composer employs LLMs to discover a composing strategy that integrates the non-context-aware strategies with the contextual ones. Extensive experiments on real-world enterprise datasets demonstrate that MiCo achieves a 96.9\% competitive ratio in large-scale scenarios involving more than 10,000 virtual machines. It maintains high performance even under nonstationary request flows and diverse configurations, thus validating its effectiveness in complex and large-scale cloud environments.

replace VTool-R1: VLMs Learn to Think with Images via Reinforcement Learning on Multimodal Tool Use

Authors: Mingyuan Wu, Jingcheng Yang, Jize Jiang, Meitang Li, Kaizhuo Yan, Hanchao Yu, Minjia Zhang, Chengxiang Zhai, Klara Nahrstedt

Abstract: Reinforcement Learning Finetuning (RFT) has significantly advanced the reasoning capabilities of large language models (LLMs) by enabling long chains of thought, self-correction, and effective tool use. While recent works attempt to extend RFT to vision-language models (VLMs), these efforts largely produce text-only reasoning conditioned on static image inputs, falling short of true multimodal reasoning in the response. In contrast, test-time methods like Visual Sketchpad incorporate visual steps but lack training mechanisms. We introduce VTool-R1, the first framework that trains VLMs to generate multimodal chains of thought by interleaving text and intermediate visual reasoning steps. VTool-R1 integrates Python-based visual editing tools into the RFT process, enabling VLMs to learn when and how to generate visual reasoning steps that benefit final reasoning. Trained with outcome-based rewards tied to task accuracy, our approach elicits strategic visual tool use for reasoning without relying on process-based supervision. Experiments on structured visual question answering over charts and tables show that VTool-R1 enhances reasoning performance by teaching VLMs to "think with images" and generate multimodal chain of thoughts with tools. To support future research in multi-turn multi-modal reasoning, we open-source our code at https://github.com/VTOOL-R1/vtool-r1

URLs: https://github.com/VTOOL-R1/vtool-r1

replace Attribute-Efficient PAC Learning of Sparse Halfspaces with Constant Malicious Noise Rate

Authors: Shiwei Zeng, Jie Shen

Abstract: Attribute-efficient PAC learning of sparse halfspaces has been a fundamental problem in machine learning theory. In recent years, machine learning algorithms are faced with prevalent data corruptions or even malicious attacks. It is of central interest to design computationally-efficient algorithms that are robust to malicious corruptions. In this paper, we consider that there exists a constant amount of malicious noise in the data and the goal is to learn an underlying $s$-sparse halfspace $w^* \in \mathbb{R}^d$ with $\text{poly}(s,\log d)$ samples. Specifically, we follow a recent line of works and assume that the underlying distribution satisfies a certain concentration condition and a margin condition at the same time. Under such conditions, we show that attribute-efficiency can be achieved with simple variants to existing hinge loss minimization programs. Our key contribution includes: 1) an attribute-efficient PAC learning algorithm that works under a constant malicious noise rate; 2) a new gradient analysis that carefully handles the sparsity admitted constraints in hinge loss minimization program.

replace Continuous Chain of Thought Enables Parallel Exploration and Reasoning

Authors: Halil Alperen Gozeten, M. Emrullah Ildiz, Xuechen Zhang, Hrayr Harutyunyan, Ankit Singh Rawat, Samet Oymak

Abstract: Modern language models generate chain-of-thought traces by autoregressively sampling tokens from a finite vocabulary. While this discrete sampling has achieved remarkable success, conducting chain-of-thought with continuously-valued tokens (CoT2) offers a richer and more expressive alternative. Our work provides new theoretical guarantees and algorithms for CoT2, motivated by logical reasoning tasks that inherently require search capabilities. Theoretically, we establish how CoT2 facilitates the model to track multiple discrete traces in parallel; and quantify the level of achievable parallelism and its benefits for inference efficiency. We also provide a CoT2-based one-layer transformer construction that solves the combinatorial "subset sum problem" given a sufficient embedding dimension. These insights arise from a novel and effective supervision strategy where we match the language model outputs to the empirical token distributions of a set of target traces. Complementing this, we introduce sampling strategies that unlock policy optimization methods for CoT2. Our primary strategy samples and composes $K$ discrete tokens at each decoding step to control the level of parallelism. Experiments confirm that (i) the optimal level of parallelism is governed by the embedding dimension, (ii) our continuous supervision strategy can outperform alternative methods, and (iii) policy optimization with CoT2 indeed improves the performance of the model beyond its initial discrete or continuous supervision.

replace FPGA-Enabled Machine Learning Applications in Earth Observation: A Systematic Review

Authors: C\'edric L\'eonard (Technical University of Munich, Munich, Germany, Remote Sensing Technology Institute), Dirk Stober (Technical University of Munich, Munich, Germany), Martin Schulz (Technical University of Munich, Munich, Germany)

Abstract: New UAV technologies and the NewSpace era are transforming Earth Observation missions and data acquisition. Numerous small platforms generate large data volume, straining bandwidth and requiring onboard decision-making to transmit high-quality information in time. While Machine Learning allows real-time autonomous processing, FPGAs balance performance with adaptability to mission-specific requirements, enabling onboard deployment. This review systematically analyzes 68 experiments deploying ML models on FPGAs for Remote Sensing applications. We introduce two distinct taxonomies to capture both efficient model architectures and FPGA implementation strategies. For transparency and reproducibility, we follow PRISMA 2020 guidelines and share all data and code at https://github.com/CedricLeon/Survey_RS-ML-FPGA.

URLs: https://github.com/CedricLeon/Survey_RS-ML-FPGA.

replace HSG-12M: A Large-Scale Benchmark of Spatial Multigraphs from the Energy Spectra of Non-Hermitian Crystals

Authors: Xianquan Yan, Hakan Akg\"un, Kenji Kawaguchi, N. Duane Loh, Ching Hua Lee

Abstract: AI is transforming scientific research by revealing new ways to understand complex physical systems, but its impact remains constrained by the lack of large, high-quality domain-specific datasets. A rich, largely untapped resource lies in non-Hermitian quantum physics, where the energy spectra of crystals form intricate geometries on the complex plane -- termed as Hamiltonian spectral graphs. Despite their significance as fingerprints for electronic behavior, their systematic study has been intractable due to the reliance on manual extraction. To unlock this potential, we introduce Poly2Graph: a high-performance, open-source pipeline that automates the mapping of 1-D crystal Hamiltonians to spectral graphs. Using this tool, we present HSG-12M: a dataset containing 11.6 million static and 5.1 million dynamic Hamiltonian spectral graphs across 1401 characteristic-polynomial classes, distilled from 177 TB of spectral potential data. Crucially, HSG-12M is the first large-scale dataset of spatial multigraphs -- graphs embedded in a metric space where multiple geometrically distinct trajectories between two nodes are retained as separate edges. This simultaneously addresses a critical gap, as existing graph benchmarks overwhelmingly assume simple, non-spatial edges, discarding vital geometric information. Benchmarks with popular GNNs expose new challenges in learning spatial multi-edges at scale. Beyond its practical utility, we show that spectral graphs serve as universal topological fingerprints of polynomials, vectors, and matrices, forging a new algebra-to-graph link. HSG-12M lays the groundwork for data-driven scientific discovery in condensed matter physics, new opportunities in geometry-aware graph learning and beyond.

replace SPEED-RL: Faster Training of Reasoning Models via Online Curriculum Learning

Authors: Ruiqi Zhang, Daman Arora, Song Mei, Andrea Zanette

Abstract: Training large language models with reinforcement learning (RL) against verifiable rewards significantly enhances their reasoning abilities, yet remains computationally expensive due to inefficient uniform prompt sampling. We introduce Selective Prompting with Efficient Estimation of Difficulty (SPEED), an adaptive online RL curriculum that selectively chooses training examples of intermediate difficulty to maximize learning efficiency. Theoretically, we establish that intermediate-difficulty prompts improve the gradient estimator's signal-to-noise ratio, accelerating convergence. Empirically, our efficient implementation leads to 2x to 6x faster training without degrading accuracy, requires no manual tuning, and integrates seamlessly into standard RL algorithms.

replace Bures-Wasserstein Flow Matching for Graph Generation

Authors: Keyue Jiang, Jiahao Cui, Xiaowen Dong, Laura Toni

Abstract: Graph generation has emerged as a critical task in fields ranging from drug discovery to circuit design. Contemporary approaches, notably diffusion and flow-based models, have achieved solid graph generative performance through constructing a probability path that interpolates between reference and data distributions. However, these methods typically model the evolution of individual nodes and edges independently and use linear interpolations in the disjoint space of nodes/edges to build the path. This disentangled interpolation breaks the interconnected patterns of graphs, making the constructed probability path irregular and non-smooth, which causes poor training dynamics and faulty sampling convergence. To address the limitation, this paper first presents a theoretically grounded framework for probability path construction in graph generative models. Specifically, we model the joint evolution of the nodes and edges by representing graphs as connected systems parameterized by Markov random fields (MRF). We then leverage the optimal transport displacement between MRF objects to design a smooth probability path that ensures the co-evolution of graph components. Based on this, we introduce BWFlow, a flow-matching framework for graph generation that utilizes the derived optimal probability path to benefit the training and sampling algorithm design. Experimental evaluations in plain graph generation and molecule generation validate the effectiveness of BWFlow with competitive performance, better training convergence, and efficient sampling.

replace From Bandit Regret to FDR Control: Online Selective Generation with Adversarial Feedback Unlocking

Authors: Minjae Lee, Yoonjae Jung, Sangdon Park

Abstract: As interactive generative systems are increasingly deployed in real-world applications, their tendency to generate unreliable or false responses raises serious concerns. Selective generation mitigates this risk by ensuring that the system answers only when confident. However, real-world deployments typically provide only partial user feedback on the selected response (e.g., thumbs up/down) and often operate in non-stationary or adversarial environments,for which effective learning methods are largely missing. To bridge this gap, we propose ExSUL, a novel online learning framework for selective generation with adversarial bandit feedback. Technically, we introduce (i) a novel conversion lemma that translates the regret of any bandit algorithm into an FDR bound, and (ii) feedback unlocking, a strategy that exploits the structure of selective generation to extract additional learning signals from partial feedback. We prove that ExSUL achieves a regret bound of $O(\sqrt{T \ln |H|})$, matching the efficiency and FDR controllability of full-information settings despite receiving only partial feedback. While applicable to general generative tasks, we demonstrate the efficacy of ExSUL for ensuring the reliability of Large Language Models (LLMs) through empirical validation on question-answering tasks across diverse non-stationary and adversarial settings. Our results demonstrate that ExSUL robustly controls the FDR while maintaining competitive answering coverage.

replace Structured Kolmogorov-Arnold Neural ODEs for Interpretable Learning and Symbolic Discovery of Nonlinear Dynamics

Authors: Wei Liu, Kiran Bacsa, Loon Ching Tang, Eleni Chatzi

Abstract: Understanding and modeling nonlinear dynamical systems is a fundamental challenge across science and engineering. Deep learning has shown remarkable potential for capturing complex system behavior, yet achieving models that are both accurate and physically interpretable remains difficult. To address this, we propose Structured Kolmogorov-Arnold Neural ODEs (SKANODEs), a framework that integrates structured state-space modeling with Kolmogorov-Arnold Networks (KANs). Within a Neural ODE architecture, SKANODE employs a fully trainable KAN as a universal function approximator to perform virtual sensing, recovering latent states that correspond to interpretable physical quantities such as displacements and velocities. Leveraging KAN's symbolic regression capability, SKANODE then extracts compact, interpretable expressions for the system's governing dynamics. Experiments on two canonical nonlinear oscillators and a real-world F-16 ground vibration dataset demonstrate that SKANODE reliably recovers physically meaningful latent displacement and velocity trajectories from acceleration measurements, identifies the correct governing nonlinearities--including the cubic stiffness in the Duffing oscillator and the nonlinear damping structure in the Van der Pol oscillator--and reveals hysteretic signatures in the F-16 interface dynamics through structured latent phase portraits and an interpretable symbolic model. Across all three cases, SKANODE provides more accurate and robust predictions than black-box NODE baselines and classical ARX and NARX identification, while producing equation-level descriptions of the learned nonlinear dynamics.

replace Parameter Stress Analysis in Reinforcement Learning: Applying Synaptic Filtering to Policy Networks

Authors: Zain ul Abdeen, Ming Jin

Abstract: This paper explores reinforcement learning (RL) policy robustness by systematically analyzing network parameters under internal and external stresses. \textcolor{black}{We apply synaptic filtering methods using high-pass, low-pass, and pulse-wave filters from} \citep{pravin2024fragility}, as an internal stress by selectively perturbing parameters, while adversarial attacks apply external stress through modified agent observations. This dual approach enables the classification of parameters as \textit{fragile}, \textit{robust}, or \textit{antifragile}, based on their influence on policy performance in clean and adversarial settings. Parameter scores are defined to quantify these characteristics, and the framework is validated on proximal policy optimization (PPO)-trained agents in Mujoco continuous control environments. The results highlight the presence of antifragile parameters that enhance policy performance under stress, demonstrating the potential of targeted filtering techniques to improve RL policy adaptability. These insights provide a foundation for future advancements in the design of robust and antifragile RL systems.

replace Overtone: Cyclic Patch Modulation for Clean, Efficient, and Flexible Physics Emulators

Authors: Payel Mukhopadhyay, Michael McCabe, Ruben Ohana, Miles Cranmer

Abstract: Transformer-based PDE surrogates achieve remarkable performance but face two key challenges: fixed patch sizes cause systematic error accumulation at harmonic frequencies, and computational costs remain inflexible regardless of problem complexity or available resources. We introduce Overtone, a unified solution through dynamic patch size control at inference. Overtone's key insight is that cyclically modulating patch sizes during autoregressive rollouts distributes errors across the frequency spectrum, mitigating the systematic harmonic artifact accumulation that plague fixed-patch models. We implement this through two architecture-agnostic modules--CSM (using dynamic stride modulation) and CKM (using dynamic kernel resizing)--that together provide both harmonic mitigation and compute-adaptive deployment. This flexible tokenization lets users trade accuracy for speed dynamically based on computational constraints, and the cyclic rollout strategy yields up to 40% lower long rollout error in variance-normalised RMSE (VRMSE) compared to conventional, static-patch surrogates. Across challenging 2D and 3D PDE benchmarks, one Overtone model matches or exceeds fixed-patch baselines across inference compute budgets, when trained under a fixed total training budget setting.

replace Some Super-approximation Rates of ReLU Neural Networks for Korobov Functions

Authors: Yuwen Li, Guozhi Zhang

Abstract: This paper examines the $L_p$ and $W^1_p$ norm approximation errors of ReLU neural networks for Korobov functions. In terms of network width and depth, we derive nearly optimal super-approximation error bounds of order $2m$ in the $L_p$ norm and order $2m-2$ in the $W^1_p$ norm, for target functions with $L_p$ mixed derivative of order $m$ in each direction. The analysis leverages sparse grid finite elements and the bit extraction technique. Our results improve upon classical lowest order $L_\infty$ and $H^1$ norm error bounds and demonstrate that the expressivity of neural networks is largely unaffected by the curse of dimensionality.

replace Kernel Based Maximum Entropy Inverse Reinforcement Learning for Mean-Field Games

Authors: Berkay Anahtarci, Can Deha Kariksiz, Naci Saldi

Abstract: We consider the maximum causal entropy inverse reinforcement learning (IRL) problem for infinite-horizon stationary mean-field games (MFG), in which we model the unknown reward function within a reproducing kernel Hilbert space (RKHS). This allows the inference of rich and potentially nonlinear reward structures directly from expert demonstrations, in contrast to most existing approaches for MFGs that typically restrict the reward to a linear combination of a fixed finite set of basis functions and rely on finite-horizon formulations. We introduce a Lagrangian relaxation that enables us to reformulate the problem as an unconstrained log-likelihood maximization and obtain a solution via a gradient ascent algorithm. To establish the theoretical consistency of the algorithm, we prove the smoothness of the log-likelihood objective through the Fr\'echet differentiability of the related soft Bellman operators with respect to the parameters in the RKHS. To illustrate the practical advantages of the RKHS formulation, we validate our framework on a mean-field traffic routing game exhibiting state-dependent preference reversal, where the kernel-based method reduces policy recovery error by over an order of magnitude compared to a linear reward baseline with a comparable parameter count. Furthermore, we extend the framework to the finite-horizon non-stationary setting. We demonstrate that the log-likelihood reformulation is structurally unavailable in this regime and instead develop an alternative gradient descent algorithm on the convex dual via Danskin's theorem, establishing smoothness and convergence guarantees.

replace TIC-GRPO: Provable and Efficient Optimization for Reinforcement Learning from Human Feedback

Authors: Lei Pang, Jun Luo, Ruinan Jin

Abstract: Group Relative Policy Optimization (GRPO), recently introduced by DeepSeek, is a critic-free reinforcement learning algorithm for fine-tuning large language models. GRPO replaces the value function in Proximal Policy Optimization (PPO) with group-normalized rewards while retaining PPO-style token-level importance sampling based on an old policy. Our theoretical analysis reveals that the GRPO update rule estimates the policy gradient at the old policy rather than the current one; however, since the old policy is refreshed every few steps, the resulting discrepancy remains small and the induced bias is negligible in practice. To empirically validate this insight, we conduct an ablation study that entirely removes importance sampling and performs multiple optimization steps using gradients estimated at a fixed old policy. Remarkably, this simplified variant attains performance comparable to standard GRPO. Motivated by this finding, we propose Trajectory-level Importance-Corrected GRPO (TIC-GRPO), a new algorithm that replaces token-level importance ratios with a single trajectory-level probability ratio, thereby yielding an estimate of the current policy gradient while preserving the critic-free structure. Furthermore, we present the first convergence analysis for GRPO-style methods and show that TIC-GRPO converges faster than GRPO. Finally, empirical results across math reasoning and coding tasks demonstrate the superiority of TIC-GRPO.

replace Honest and Reliable Evaluation and Expert Equivalence Testing of Automated Neonatal Seizure Detection

Authors: Jovana Kljajic, John M. O'Toole, Robert Hogan, Tamara Skoric

Abstract: Reliable evaluation of machine learning models for neonatal seizure detection is critical for clinical adoption. Current practices often rely on inconsistent and biased metrics, hindering model comparability and interpretability. Expert-level claims about AI performance are frequently made without rigorous validation, raising concerns about their reliability. This study aims to systematically evaluate common performance metrics and propose best practices tailored to the specific challenges of neonatal seizure detection. Using real and synthetic seizure annotations, we assessed standard performance metrics, consensus strategies, and human-expert level equivalence tests under varying class imbalance, inter-rater agreement, and number of raters. Matthews and Pearson's correlation coefficients outperformed the area under the curve in reflecting performance under class imbalance. Consensus types are sensitive to the number of raters and agreement level among them. Among human-expert level equivalence tests, the multi-rater Turing test using Fleiss k best captured expert-level AI performance. We recommend reporting: (1) at least one balanced metric, (2) Sensitivity, specificity, PPV and NPV, (3) Multi-rater Turing test results using Fleiss k, and (4) All the above on held-out validation set. This proposed framework provides an important prerequisite to clinical validation by enabling a thorough and honest appraisal of AI methods for neonatal seizure detection.

replace In-Training Defenses against Emergent Misalignment in Language Models

Authors: David Kacz\'er, Magnus J{\o}rgenv{\aa}g, Clemens Vetter, Esha Afzal, Robin Haselhorst, Lucie Flek, Florian Mai

Abstract: Fine-tuning lets practitioners repurpose aligned large language models (LLMs) for new domains, yet recent work reveals emergent misalignment (EMA): Even a small, domain-specific fine-tune can induce harmful behaviors far outside the target domain. Even in the case where model weights are hidden behind a fine-tuning API, this gives attackers inadvertent access to a broadly misaligned model in a way that can be hard to detect from the fine-tuning data alone. We present the first systematic study of in-training safeguards against EMA that are practical for providers who expose fine-tuning via an API: We evaluate whether they a) prevent broad misalignment, b) allow narrow misalignment, c) learn well on benign tasks, and d) remain coherent. We investigate four training regularization interventions: (i) KL-divergence regularization toward a safe reference model, (ii) $\mathcal{l}_2$ distance in feature space, (iii) preventative steering with an evil persona vector, and (iv) interleaving training examples from a general instruct-tuning dataset. We demonstrate that selecting interleaving data by the perplexity gap between aligned and misaligned models yields the best results overall.

replace Multi-Agent Reinforcement Learning in Intelligent Transportation Systems: A Comprehensive Survey

Authors: Rexcharles Donatus, Kumater Ter, Daniel Udekwe

Abstract: The growing complexity of urban mobility and the demand for efficient, sustainable, and adaptive solutions have positioned Intelligent Transportation Systems (ITS) at the forefront of modern infrastructure innovation. At the core of ITS lies the challenge of autonomous decision-making across dynamic, large scale, and uncertain environments where multiple agents traffic signals, autonomous vehicles, or fleet units must coordinate effectively. Multi Agent Reinforcement Learning (MARL) offers a promising paradigm for addressing these challenges by enabling distributed agents to jointly learn optimal strategies that balance individual objectives with system wide efficiency. This paper presents a comprehensive survey of MARL applications in ITS. We introduce a structured taxonomy that categorizes MARL approaches according to coordination models and learning algorithms, spanning value based, policy based, actor critic, and communication enhanced frameworks. Applications are reviewed across key ITS domains, including traffic signal control, connected and autonomous vehicle coordination, logistics optimization, and mobility on demand systems. Furthermore, we highlight widely used simulation platforms such as SUMO, CARLA, and CityFlow that support MARL experimentation, along with emerging benchmarks. The survey also identifies core challenges, including scalability, non stationarity, credit assignment, communication constraints, and the sim to real transfer gap, which continue to hinder real world deployment.

replace AttnBoost: Retail Supply Chain Sales Insights via Gradient Boosting Perspective

Authors: Yadi Liu, Xiaoli Ma, Muxin Ge, Zeyu Han, Jingxi Qiu, Ye Aung Moe, Yilan Shen, Wenbin Wei, Cheng Huang

Abstract: Forecasting product demand in retail supply chains presents a complex challenge due to noisy, heterogeneous features and rapidly shifting consumer behavior. While traditional gradient boosting decision trees (GBDT) offer strong predictive performance on structured data, they often lack adaptive mechanisms to identify and emphasize the most relevant features under changing conditions. In this work, we propose AttnBoost, an interpretable learning framework that integrates feature-level attention into the boosting process to enhance both predictive accuracy and explainability. Specifically, the model dynamically adjusts feature importance during each boosting round via a lightweight attention mechanism, allowing it to focus on high-impact variables such as promotions, pricing, and seasonal trends. We evaluate AttnBoost on a large-scale retail sales dataset and demonstrate that it outperforms standard machine learning and deep tabular models, while also providing actionable insights for supply chain managers. An ablation study confirms the utility of the attention module in mitigating overfitting and improving interpretability. Our results suggest that attention-guided boosting represents a promising direction for interpretable and scalable AI in real-world forecasting applications.

replace Topology Structure Optimization of Reservoirs Using GLMY Homology

Authors: Yu Chen, Shengwei Wang, Hongwei Lin

Abstract: Reservoir is an efficient network for time series processing. It is well known that network structure is one of the determinants of its performance. However, the topology structure of reservoirs, as well as their performance, is hard to analyzed, due to the lack of suitable mathematical tools. In this paper, we study the topology structure of reservoirs using persistent GLMY homology theory, and develop a method to improve its performance. Specifically, it is found that the reservoir performance is closely related to the one-dimensional GLMY homology groups. Then, we develop a reservoir structure optimization method by modifying the minimal representative cycles of one-dimensional GLMY homology groups. Finally, by experiments, it is validated that the performance of reservoirs is jointly influenced by the reservoir structure and the periodicity of the dataset.

replace TabStruct: Measuring Structural Fidelity of Tabular Data

Authors: Xiangjian Jiang, Nikola Simidjievski, Mateja Jamnik

Abstract: Evaluating tabular generators remains a challenging problem, as the unique causal structural prior of heterogeneous tabular data does not lend itself to intuitive human inspection. Recent work has introduced structural fidelity as a tabular-specific evaluation dimension to assess whether synthetic data complies with the causal structures of real data. However, existing benchmarks often neglect the interplay between structural fidelity and conventional evaluation dimensions, thus failing to provide a holistic understanding of model performance. Moreover, they are typically limited to toy datasets, as quantifying existing structural fidelity metrics requires access to ground-truth causal structures, which are rarely available for real-world datasets. In this paper, we propose a novel evaluation framework that jointly considers structural fidelity and conventional evaluation dimensions. We introduce a new evaluation metric, $\textbf{global utility}$, which enables the assessment of structural fidelity even in the absence of ground-truth causal structures. In addition, we present $\textbf{TabStruct}$, a comprehensive evaluation benchmark offering large-scale quantitative analysis on 13 tabular generators from nine distinct categories, across 29 datasets. Our results demonstrate that global utility provides a task-independent, domain-agnostic lens for tabular generator performance. We release the TabStruct benchmark suite, including all datasets, evaluation pipelines, and raw results. Code is available at https://github.com/SilenceX12138/TabStruct.

URLs: https://github.com/SilenceX12138/TabStruct.

replace Complexity-Regularized Proximal Policy Optimization

Authors: Luca Serfilippi, Giorgio Franceschelli, Antonio Corradi, Mirco Musolesi

Abstract: Policy gradient methods usually rely on entropy regularization to prevent premature convergence. However, maximizing entropy indiscriminately pushes the policy towards a uniform distribution, often overriding the reward signal if not optimally tuned. We propose replacing the standard entropy term with a self-regulating complexity term, defined as the product of Shannon entropy and disequilibrium, where the latter quantifies the distance from the uniform distribution. Unlike pure entropy, which favors maximal disorder, this complexity measure is zero for both fully deterministic and perfectly uniform distributions, i.e., it is strictly positive for systems that exhibit a meaningful interplay between order and randomness. These properties ensure the policy maintains beneficial stochasticity while reducing regularization pressure when the policy is highly uncertain, allowing learning to focus on reward optimization. We introduce Complexity-Regularized Proximal Policy Optimization (CR-PPO), a modification of PPO that leverages this dynamic. We empirically demonstrate that CR-PPO is significantly more robust to hyperparameter selection than entropy-regularized PPO, achieving consistent performance across orders of magnitude of regularization coefficients and remaining harmless when regularization is unnecessary, thereby reducing the need for expensive hyperparameter tuning.

replace Towards Understanding Subliminal Learning: When and How Hidden Biases Transfer

Authors: Simon Schrodi, Elias Kempf, Fazl Barez, Thomas Brox

Abstract: Language models can transfer hidden biases during distillation. For example, a teacher that "likes owls" can make its student "like owls" too, even when the training data consists only of lists of numbers. This surprising phenomenon is called subliminal learning. Subliminal learning can be expected under soft distillation, where the student is trained on the teacher's full next-token distribution. But the fact that this also occurs under hard distillation-where the student only sees sampled tokens-raises a deeper question: when and how does subliminal learning actually occur? We answer this question through controlled experiments and mechanistic analysis. Our results show that subliminal learning does not need (global) token entanglement or logit leakage. Instead, it comes down to a small set of divergence tokens-rare cases where teachers with different biases would predict different tokens. Masking out these tokens mostly removes the hidden bias transfer. Mechanistically, divergence tokens reveal that early layers are critical. Surprisingly, finetuning even a single such early layer is sufficient for subliminal learning. Finally, we find that subliminal learning is fragile. Even small changes, like prompt paraphrasings, are usually sufficient to suppress it.

replace OPPO: Accelerating PPO-based RLHF via Pipeline Overlap

Authors: Kaizhuo Yan, Yingjie Yu, Yifan Yu, Haizhong Zheng, Fan Lai

Abstract: Proximal Policy Optimization (PPO)-based reinforcement learning from human feedback (RLHF) is a widely adopted paradigm for aligning large language models (LLMs) with human preferences. However, its training pipeline suffers from substantial inefficiencies due to sequential multi-model dependencies (e.g., reward model depends on actor outputs) and long-tail response lengths, where a few long responses straggle the stage completion. We present OPPO, a novel, lightweight, and model-agnostic PPO-based RLHF framework that improves training efficiency by overlapping pipeline execution. OPPO introduces two novel techniques: (1) Intra-step overlap, which streams upstream model outputs (e.g., actor model) in right-sized chunks, enabling the downstream model (e.g., reward) to begin prefill while the upstream continues decoding; and (2) Inter-step overlap, which adaptively overcommits a few prompts and defers long generations to future steps, mitigating tail latency without discarding partial work. OPPO integrates easily with existing PPO implementations with a lightweight wrapper. Extensive evaluations show that OPPO accelerates PPO-based RLHF training by $1.8\times$--$2.8\times$ and improves GPU utilization by $1.4\times$--$2.1\times$ without compromising training convergence.

replace Non-Asymptotic Analysis of Efficiency in Conformalized Regression

Authors: Yunzhen Yao, Lie He, Michael Gastpar

Abstract: Conformal prediction provides prediction sets with coverage guarantees. The informativeness of conformal prediction depends on its efficiency, typically quantified by the expected size of the prediction set. Prior work on the efficiency of conformalized regression commonly treats the miscoverage level $\alpha$ as a fixed constant. In this work, we establish non-asymptotic bounds on the deviation of the prediction set length from the oracle interval length for conformalized quantile and median regression trained via SGD, under mild assumptions on the data distribution. Our bounds of order $\mathcal{O}(1/\sqrt{n} + 1/(\alpha^2 n) + 1/\sqrt{m} + \exp(-\alpha^2 m))$ capture the joint dependence of efficiency on the proper training set size $n$, the calibration set size $m$, and the miscoverage level $\alpha$. The results identify phase transitions in convergence rates across different regimes of $\alpha$, offering guidance for allocating data to control excess prediction set length. Empirical results are consistent with our theoretical findings.

replace Do We Really Need Permutations? Impact of Model Width on Linear Mode Connectivity

Authors: Akira Ito, Masanori Yamada, Daiki Chijiwa, Atsutoshi Kumagai

Abstract: Recently, Ainsworth et al. empirically demonstrated that, given two independently trained models, applying a parameter permutation that preserves the input-output behavior allows the two models to be connected by a low-loss linear path. When such a path exists, the models are said to achieve linear mode connectivity (LMC). Prior studies, including Ainsworth et al.(2023), have reported that achieving LMC requires not only an appropriate permutation search but also sufficiently wide models (e.g., a 32 $\times$ width multiplier for ResNet-20). This is broadly believed to be because increasing the model width ensures a large enough space of candidate permutations, increasing the chance of finding one that yields LMC. In this work, we empirically demonstrate that, even without any permutations, simply widening the models is sufficient for achieving LMC when using a suitable softmax temperature calibration. We further explain why this phenomenon arises by analyzing intermediate layer outputs. Specifically, we introduce layerwise exponentially weighted connectivity (LEWC), which states that the output of each layer of the merged model can be represented as an exponentially weighted sum of the outputs of the corresponding layers of the original models. Consequently the merged model's output matches that of an ensemble of the original models, facilitating LMC. To the best of our knowledge, this work is the first to show that widening the model not only facilitates nonlinear mode connectivity, as suggested in prior research, but also significantly increases the possibility of achieving linear mode connectivity.

replace Breaking and Fixing Defenses Against Control-Flow Hijacking in Multi-Agent Systems

Authors: Rishi Jha, Harold Triedman, Justin Wagle, Vitaly Shmatikov

Abstract: Control-flow hijacking attacks manipulate orchestration mechanisms in multi-agent systems into performing unsafe actions that compromise the system and exfiltrate sensitive information. Recently proposed defenses, such as LlamaFirewall, rely on alignment checks of inter-agent communications to ensure that all agent invocations are "related to" and "likely to further" the original objective. We start by demonstrating control-flow hijacking attacks that evade these defenses even if alignment checks are performed by advanced LLMs. We argue that the safety and functionality objectives of multi-agent systems fundamentally conflict with each other. This conflict is exacerbated by the brittle definitions of "alignment" and the checkers' incomplete visibility into the execution context. We then propose, implement, and evaluate ControlValve, a new defense inspired by the principles of control-flow integrity and least privilege. ControlValve (1) generates permitted control-flow graphs for multi-agent systems, and (2) enforces that all executions comply with these graphs, along with contextual rules (generated in a zero-shot manner) for each agent invocation.

replace LLEMA: Evolutionary Search with LLMs for Multi-Objective Materials Discovery

Authors: Nikhil Abhyankar, Sanchit Kabra, Saaketh Desai, Chandan K. Reddy

Abstract: Materials discovery requires navigating vast chemical and structural spaces while satisfying multiple, often conflicting, objectives. We present LLM-guided Evolution for MAterials discovery (LLEMA), a unified framework that couples the scientific knowledge embedded in large language models with chemistry-informed evolutionary rules and memory-based refinement. At each iteration, an LLM proposes crystallographically specified candidates under explicit property constraints; a surrogate-augmented oracle estimates physicochemical properties; and a multi-objective scorer updates success/failure memories to guide subsequent generations. Evaluated on 14 realistic tasks that span electronics, energy, coatings, optics, and aerospace, LLEMA discovers candidates that are chemically plausible, thermodynamically stable, and property-aligned, achieving higher hit rates and improved Pareto front quality relative to generative and LLM-only baselines. Ablation studies confirm the importance of rule-guided generation, memory-based refinement, and surrogate prediction. By enforcing synthesizability and multi-objective trade-offs, LLEMA provides a principled approach to accelerating practical materials discovery. Project website: https://scientific-discovery.github.io/llema-project/

URLs: https://scientific-discovery.github.io/llema-project/

replace SPOT: Single-Shot Positioning via Trainable Near-Field Rainbow Beamforming

Authors: Yeyue Cai, Jianhua Mo, Meixia Tao

Abstract: Phase-time arrays, which integrate phase shifters (PSs) and true-time delays (TTDs), have emerged as a cost-effective architecture for generating frequency-dependent rainbow beams in wideband sensing and localization. This paper proposes an end-to-end deep learning-based scheme that simultaneously designs the rainbow beams and estimates user positions. Treating the PS and TTD coefficients as trainable variables allows the network to synthesize task-oriented beams that maximize localization accuracy. A lightweight fully connected module then recovers the user's angle-range coordinates from its feedback of the maximum quantized received power and its corresponding subcarrier index after a single downlink transmission. Compared with existing analytical and learning-based schemes, the proposed method reduces overhead by an order of magnitude and delivers consistently lower two-dimensional positioning error.

replace ReCast: Reliability-aware Codebook Assisted Lightweight Time Series Forecasting

Authors: Xiang Ma, Taihua Chen, Pengcheng Wang, Xuemei Li, Caiming Zhang

Abstract: Time series forecasting is crucial for applications in various domains. Conventional methods often rely on global decomposition into trend, seasonal, and residual components, which become ineffective for real-world series dominated by local, complex, and highly dynamic patterns. Moreover, the high model complexity of such approaches limits their applicability in real-time or resource-constrained environments. In this work, we propose a novel \textbf{RE}liability-aware \textbf{C}odebook-\textbf{AS}sisted \textbf{T}ime series forecasting framework (\textbf{ReCast}) that enables lightweight and robust prediction by exploiting recurring local shapes. ReCast encodes local patterns into discrete embeddings through patch-wise quantization using a learnable codebook, thereby compactly capturing stable regular structures. To compensate for residual variations not preserved by quantization, ReCast employs a dual-path architecture comprising a quantization path for efficient modeling of regular structures and a residual path for reconstructing irregular fluctuations. A central contribution of ReCast is a reliability-aware codebook update strategy, which incrementally refines the codebook via weighted corrections. These correction weights are derived by fusing multiple reliability factors from complementary perspectives by a distributionally robust optimization (DRO) scheme, ensuring adaptability to non-stationarity and robustness to distribution shifts. Extensive experiments demonstrate that ReCast outperforms state-of-the-art (SOTA) models in accuracy, efficiency, and adaptability to distribution shifts.

replace Revisiting Multimodal KV Cache Compression: A Frequency-Domain-Guided Outlier-KV-Aware Approach

Authors: Yaoxin Yang, Peng Ye, Xudong Tan, Chongjun Tu, Maosen Zhao, Jia Hao, Tao Chen

Abstract: Multimodal large language models suffer from substantial inference overhead since multimodal KV Cache grows proportionally with the visual input length. Existing multimodal KV Cache compression methods mostly rely on attention score to reduce cache size, which makes them are incompatible with established efficient attention kernels (e.g., FlashAttention) and ignores the contribution of value vectors to the attention output. In this work, we revisit multimodal KV Cache compression from the perspective of the KV matrices' distribution. First, we observe that frequency-domain energy of multimodal KV matrices is predominantly concentrated in low-frequency and extract this principal energy via a low-pass filter. Further, we find that removing KV pairs that deviate substantially from this principal energy leads to a pronounced performance drop, which we define as Outlier KVs. Considering Outlier KVs are more likely to encode features critical for inference, we propose FlashCache, a frequency-domain-guided, Outlier-KV-aware KV Cache compression framework. First, we introduce an Outlier KV Recognition Module that models the principal component of multimodal KV matrices in the frequency domain and preferentially retains KV pairs that significantly deviate from it. Furthermore, Dynamic Budget Allocation Module is designed to adaptively determine the per-layer KV Cache size to retain more Outlier KVs. Experiments on multiple MLLMs and benchmarks demonstrate that FlashCache outperforms state-of-the-art multimoal KV compression methods, achieving up to 1.69 times faster decoding with 80% lower KV memory usage while maintaining task performance.

replace A physics-informed U-Net-LSTM network for nonlinear structural response under seismic excitation

Authors: Sutirtha Biswas, Kshitij Kumar Yadav

Abstract: Accurate and efficient seismic response prediction is essential for the design of resilient structures. While the Finite Element Method (FEM) remains the standard for nonlinear seismic analysis, its high computational demands limit its scalability and real-time applicability. Recent developments in deep learning - particularly Convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs), and Long Short-Term Memory (LSTM) models - have shown promise in reducing the computational cost of the nonlinear seismic analysis of structures. However, these data-driven models often struggle to generalize and capture the underlying physics, leading to reduced reliability. We propose a novel Physics-Informed U-Net-LSTM framework that integrates physical laws with deep learning to enhance both accuracy and efficiency. The proposed 1D U-Net captures the underlying latent features of the long-term input sequences. By embedding domain-specific constraints into the learning process, the proposed model achieves improved predictive performance over conventional Machine Learning (ML) architectures. This approach bridges the gap between purely data-driven methods and physics-based modeling, offering a robust and computationally efficient alternative for predicting the seismic response of structures.

replace Guided Flow Policy: Learning from High-Value Actions in Offline Reinforcement Learning

Authors: Franki Nguimatsia Tiofack, Th\'eotime Le Hellard, Fabian Schramm, Nicolas Perrin-Gilbert, Justin Carpentier

Abstract: Offline reinforcement learning often relies on behavior regularization that enforces policies to remain close to the dataset distribution. However, such approaches fail to distinguish between high-value and low-value actions in their regularization components. We introduce Guided Flow Policy (GFP), which couples a multi-step flow-matching policy with a distilled one-step actor. The actor directs the flow policy through weighted behavior cloning to focus on cloning high-value actions from the dataset rather than indiscriminately imitating all state-action pairs. In turn, the flow policy constrains the actor to remain aligned with the dataset's best transitions while maximizing the critic. This mutual guidance enables GFP to achieve state-of-the-art performance across 144 state and pixel-based tasks from the OGBench, Minari, and D4RL benchmarks, with substantial gains on suboptimal datasets and challenging tasks. Webpage: https://simple-robotics.github.io/publications/guided-flow-policy/

URLs: https://simple-robotics.github.io/publications/guided-flow-policy/

replace Bootstrapped Mixed Rewards for RL Post-Training: Injecting Canonical Action Order

Authors: Prakhar Gupta, Vaibhav Gupta

Abstract: Post-training with reinforcement learning (RL) typically optimizes a single scalar objective and ignores structure in how solutions are produced. We ask whether a scalar hint toward a canonical solver ordering, used only during RL post-training, improves performance even when fine-tuned on randomized solution sequences. On Zebra puzzles, we fine-tune a Transformer on randomized solution orders, then post-train it with Group Relative Policy Optimization (GRPO) using two rewards: a sparse task reward that is 1 only when the puzzle is fully solved, and an ordering reward that increases when the model's emission order aligns with the canonical solver order. To compare signals cleanly, we combine them via fixed mixtures and use a simple bootstrapped scaling to equalize component magnitudes at initialization. Mixed rewards generally outperform task-only optimization, suggesting that coarse ordering signals can steer RL post-training toward canonical trajectories without modifying supervised data or architecture.

replace Sparse Attention Post-Training for Mechanistic Interpretability

Authors: Florent Draye, Anson Lei, Hsiao-Ru Pan, Ingmar Posner, Bernhard Sch\"olkopf

Abstract: We introduce a simple post-training method that makes transformer attention sparse without sacrificing performance. Applying a flexible sparsity regularisation under a constrained-loss objective, we show on models up to 7B parameters that it is possible to retain the original pretraining loss while reducing attention connectivity to $\approx 0.4 \%$ of its edges. Unlike sparse-attention methods designed for computational efficiency, our approach leverages sparsity as a structural prior: it preserves capability while exposing a more organized and interpretable connectivity pattern. We find that this local sparsity cascades into global circuit simplification: task-specific circuits involve far fewer components (attention heads and MLPs) with up to 100x fewer edges connecting them. Additionally, using cross-layer transcoders, we show that sparse attention substantially simplifies attention attribution, enabling a unified view of feature-based and circuit-based perspectives. These results demonstrate that transformer attention can be made orders of magnitude sparser, suggesting that much of its computation is redundant and that sparsity may serve as a guiding principle for more structured and interpretable models.

replace Revolutionizing Mixed Precision Quantization: Towards Training-free Automatic Proxy Discovery via Large Language Models

Authors: Haidong Kang, Jun Du, Lihong Lin

Abstract: Mixed-Precision Quantization (MPQ) liberates Deep Neural Networks (DNNs) from the Out-Of-Memory (OOM) bottleneck and has garnered increasing research attention. However, conventional methods either rely on costly differentiable optimization search, which is neither efficient nor flexible, or learn a quantized DNN from a proxy (e.g., HAWQ) manually designed by human experts, which is labor-intensive and requires extensive expert knowledge. Can we design a proxy without involving any human experts or training? In this paper, we provide an affirmative answer by proposing a novel Large Language Model (LLM)-driven Training-free Automatic Proxy (dubbed TAP) discovery framework. It reforms the design paradigm of MPQ by utilizing LLMs and evolutionary search strategies to automatically find superior TAP tailored for MPQ. In addition, to bridge the gap between black-box LLMs and the challenging MPQ task, we introduce a lightweight Direct Preference Optimization (DPO)-based strategy controller that dynamically reweights the selection probabilities of the three prompt templates for evolutionary search strategies according to fitness signals, without fine-tuning the LLM. This forms a task-aware feedback loop that improves proxy generation across evolutions. Extensive experiments on mainstream benchmarks demonstrate that TAP achieves state-of-the-art performance. Finally, we believe that our TAP will significantly contribute to the MPQ community by providing a new perspective on LLM-driven design algorithms.

replace Measuring Uncertainty Calibration

Authors: Kamil Ciosek, Nicol\`o Felicioni, Sina Ghiassian, Juan Elenter Litwin, Francesco Tonolini, David Gustafsson, Eva Garcia-Martin, Carmen Barcena Gonzalez, Rapha\"elle Bertrand-Lalo

Abstract: We make two contributions to the problem of estimating the $L_1$ calibration error of a binary classifier from a finite dataset. First, we provide an upper bound for any classifier where the calibration function has bounded variation. Second, we provide a method of modifying any classifier so that its calibration error can be upper bounded efficiently without significantly impacting classifier performance and without any restrictive assumptions. All our results are non-asymptotic and distribution-free. We conclude by providing advice on how to measure calibration error in practice. Our methods yield practical procedures that can be run on real-world datasets with modest overhead.

replace RePo: Language Models with Context Re-Positioning

Authors: Huayang Li, Tianyu Zhao, Deng Cai, Richard Sproat

Abstract: In-context learning is fundamental to modern Large Language Models (LLMs); however, prevailing architectures impose a rigid and fixed contextual structure by assigning linear or constant positional indices. Drawing on Cognitive Load Theory (CLT), we argue that this uninformative structure increases extraneous cognitive load, consuming finite working memory capacity that should be allocated to deep reasoning and attention allocation. To address this, we propose RePo, a novel mechanism that reduces extraneous load via context re-positioning. Unlike standard approaches, RePo utilizes a differentiable module, $f_\phi$, to assign token positions that capture contextual dependencies, rather than replying on pre-defined order. By continually pre-training on the OLMo-2 1B & 7B models, we demonstrate that RePo consistently enhances performance on tasks involving noisy contexts, structured data, and longer context length, while maintaining competitive performance on general short-context tasks. Detailed analysis reveals that RePo successfully allocate higher attention to distant but relevant information, assign positions in dense and non-linear space, and capture the intrinsic structure of the input context. We will open-source the code and model weights. Our code is at https://github.com/SakanaAI/repo.

URLs: https://github.com/SakanaAI/repo.

replace Controlled LLM Training on Spectral Sphere

Authors: Tian Xie, Haoming Luo, Haoyu Tang, Yiwen Hu, Jason Klein Liu, Qingnan Ren, Yang Wang, Wayne Xin Zhao, Rui Yan, Bing Su, Chong Luo, Baining Guo

Abstract: Scaling large models requires optimization strategies that ensure rapid convergence grounded in stability. Maximal Update Parametrization ($\boldsymbol{\mu}$P) provides a theoretical safeguard for width-invariant $\Theta(1)$ activation control, whereas emerging optimizers like Muon are only ``half-aligned'' with these constraints: they control updates but allow weights to drift. To address this limitation, we introduce the \textbf{Spectral Sphere Optimizer (SSO)}, which enforces strict module-wise spectral constraints on both weights and their updates. By deriving the steepest descent direction on the spectral sphere, SSO realizes a fully $\boldsymbol{\mu}$P-aligned optimization process. To enable large-scale training, we implement SSO as an efficient parallel algorithm within Megatron. Through extensive pretraining on diverse architectures, including Dense 1.7B, MoE 8B-A1B, and 200-layer DeepNet models, SSO consistently outperforms AdamW and Muon. Furthermore, we observe significant practical stability benefits, including improved MoE router load balancing, suppressed outliers, and strictly bounded activations.

replace BPE: Behavioral Profiling Ensemble

Authors: Yanxin Liu, Yunqi Zhang

Abstract: In the field of machine learning, ensemble learning is widely recognized as a pivotal strategy for pushing the boundaries of predictive performance. Traditional static ensemble methods typically assign weights by treating each base learner as a whole, thereby overlooking that individual models exhibit varying competence across different regions of the instance space. Dynamic Ensemble Selection (DES) was introduced to address this limitation. However, both static and dynamic approaches predominantly rely on inter-model differences as the basis for integration; this inter-model perspective neglects models' intrinsic characteristics and often requires heavy reliance on reference sets for competence estimation. We propose the Behavioral Profiling Ensemble (BPE) framework, which introduces a model-centric integration paradigm. Unlike traditional methods, BPE constructs an intrinsic behavioral profile $\mathcal{P}_k$ for each model and derives aggregation weights from the deviation between a model's response to a test instance and its established profile; in this work, we instantiate $\mathcal{P}_k$ with entropy-based summary statistics (e.g., mean and variance). Extensive experiments on 42 real-world datasets show that BPE-derived algorithms outperform state-of-the-art DES baselines, increasing predictive accuracy while reducing computational and storage overhead.

replace ButterflyMoE: Sub-Linear Ternary Experts via Structured Butterfly Orbits

Authors: Aryan Karmore

Abstract: Linear memory scaling stores $N$ independent expert weight matrices requiring $\mathcal{O}(N \cdot d^2)$ memory, which exceeds edge devices memory budget. Current compression methods like quantization, pruning and low-rank factorization reduce constant factors but leave the scaling bottleneck unresolved. We introduce ButterflyMoE, a method that treats experts not as independent weight matrices but as geometric reorientations of a unified shared quantized substrate. Diversity among experts arises from viewing different angles of shared capacity, not from redundant storage. By applying learned rotations to a shared ternary prototype, each expert yields $\mathcal{O}(d^2 + N \cdot d \log d)$ memory,sub-linear in the number of experts. The key insight: training these rotations with quantization reduces activation outliers and stabilizes extreme low bit training, where static methods collapse. Across language modeling benchmarks, ButterflyMoE achieves 150$\times$ memory reduction at 256 experts with negligible accuracy loss. ButterflyMoE allows multiple experts to fit on edge-constrained devices showing that geometric parameterization breaks linear scaling.

replace Yuan3.0 Ultra: A Trillion-Parameter Enterprise-Oriented MoE LLM

Authors: YuanLab. ai, :, Shawn Wu, Jiangang Luo, Darcy Chen, Sean Wang, Louie Li, Allen Wang, Xudong Zhao, Tong Yu, Bach Li, Joseph Shen, Gawain Ma, Jasper Jia, Marcus Mao, Claire Wang, Hunter He, Carol Wang, Zera Zhang, Jason Wang, Chonly Shen, Leo Zhang, Logan Chen, Qasim Meng, James Gong, Daniel Zhao, Penn Zheng, Owen Zhu

Abstract: We introduce Yuan3.0 Ultra, an open-source Mixture-of-Experts (MoE) large language model featuring 68.8B activated parameters and 1010B total parameters, specially designed to enhance performance on enterprise scenarios tasks while maintaining competitive capabilities on general purpose tasks. We propose Layer-Adaptive Expert Pruning (LAEP) algorithm designed for the pre-training stage of MoE LLMs. In contrast to previous expert pruning approaches that operate primarily in the post-training phase, the proposed algorithm enhances training efficiency by selectively pruning underutilized experts and reorganizing experts across computing devices according to token distribution statistics. Comprehensive experiments demonstrate that LAEP effectively reduces model size and substantially improves pre-training efficiency. When pre-training Yuan3.0 Ultra from scratch original with 1515B parameters, this algorithm delivers a 49\% boost in pre-training efficiency and a 33.3\% reduction in total parameters, while preserving the model's outstanding multi-domain performance. On enterprise scenario benchmarks including Docmatix, ChatRAG, SummEval and MMTab, Yuan3.0 Ultra achieves leading accuracy. The model and codes are publicly available at https://github.com/Yuan-lab-LLM/Yuan3.0-Ultra.

URLs: https://github.com/Yuan-lab-LLM/Yuan3.0-Ultra.

replace Self-Distilled Reasoner: On-Policy Self-Distillation for Large Language Models

Authors: Siyan Zhao, Zhihui Xie, Mengchen Liu, Jing Huang, Guan Pang, Feiyu Chen, Aditya Grover

Abstract: Knowledge distillation improves large language model (LLM) reasoning by compressing the knowledge of a teacher LLM to train smaller LLMs. On-policy distillation advances this approach by having the student sample its own trajectories while a teacher LLM provides dense token-level supervision, addressing the distribution mismatch between training and inference in off-policy distillation methods. However, on-policy distillation typically requires a separate, often larger, teacher LLM and does not explicitly leverage ground-truth solutions available in reasoning datasets. Inspired by the intuition that a sufficiently capable LLM can rationalize external privileged reasoning traces and teach its weaker self (i.e., the version without access to privileged information), we introduce On-Policy Self-Distillation (OPSD), a framework where a single model acts as both teacher and student by conditioning on different contexts. The teacher policy conditions on privileged information (e.g., verified reasoning traces) while the student policy sees only the question; training minimizes the per-token divergence between these distributions over the student's own rollouts. We demonstrate the efficacy of our method on multiple mathematical reasoning benchmarks, achieving 8-12x token efficiency compared to reinforcement learning methods such as GRPO and superior performance over off-policy distillation methods.

replace A Scalable Inter-edge Correlation Modeling in CopulaGNN for Link Sign Prediction

Authors: Jinkyu Sung, Myunggeum Jee, Joonseok Lee

Abstract: Link sign prediction on a signed graph is a task to determine whether the relationship represented by an edge is positive or negative. Since the presence of negative edges violates the graph homophily assumption that adjacent nodes are similar, regular graph methods have not been applicable without auxiliary structures to handle them. We aim to directly model the latent statistical dependency among edges with the Gaussian copula and its corresponding correlation matrix, extending CopulaGNN (Ma et al., 2021). However, a naive modeling of edge-edge relations is computationally intractable even for a graph with moderate scale. To address this, we propose to 1) represent the correlation matrix as a Gramian of edge embeddings, significantly reducing the number of parameters, and 2) reformulate the conditional probability distribution to dramatically reduce the inference cost. We theoretically verify scalability of our method by proving its linear convergence. Also, our extensive experiments demonstrate that it achieves significantly faster convergence than baselines, maintaining competitive prediction performance to the state-of-the-art models.

replace Mobility-Embedded POIs: Learning What A Place Is and How It Is Used from Human Movement

Authors: Maria Despoina Siampou, Shushman Choudhury, Shang-Ling Hsu, Neha Arora, Cyrus Shahabi

Abstract: Recent progress in geospatial foundation models highlights the importance of learning general-purpose representations for real-world locations, particularly points-of-interest (POIs) where human activity concentrates. Existing approaches, however, focus primarily on place identity derived from static textual metadata, or learn representations tied to trajectory context, which capture movement regularities rather than how places are actually used (i.e., POI's function). We argue that POI function is a missing but essential signal for general POI representations. We introduce Mobility-Embedded POIs (ME-POIs), a framework that augments POI embeddings derived, from language models with large-scale human mobility data to learn POI-centric, context-independent representations grounded in real-world usage. ME-POIs encodes individual visits as temporally contextualized embeddings and aligns them with learnable POI representations via contrastive learning to capture usage patterns across users and time. To address long-tail sparsity, we propose a novel mechanism that propagates temporal visit patterns from nearby, frequently visited POIs across multiple spatial scales. We evaluate ME-POIs on five newly proposed map enrichment tasks, testing its ability to capture both the identity and function of POIs. Across all tasks, augmenting text-based embeddings with ME-POIs consistently outperforms both text-only and mobility-only baselines. Notably, ME-POIs trained on mobility data alone can surpass text-only models on certain tasks, highlighting that POI function is a critical component of accurate and generalizable POI representations.

replace YuriiFormer: A Suite of Nesterov-Accelerated Transformers

Authors: Aleksandr Zimin, Yury Polyanskiy, Philippe Rigollet

Abstract: We propose a variational framework that interprets transformer layers as iterations of an optimization algorithm acting on token embeddings. In this view, self-attention implements a gradient step of an interaction energy, while MLP layers correspond to gradient updates of a potential energy. Standard GPT-style transformers emerge as vanilla gradient descent on the resulting composite objective, implemented via Lie--Trotter splitting between these two energy functionals. This perspective enables principled architectural design using classical optimization ideas. As a proof of concept, we introduce a Nesterov-style accelerated transformer that preserves the same attention and MLP oracles. The resulting architecture consistently outperforms a nanoGPT baseline on TinyStories and OpenWebText, demonstrating that optimization-theoretic insights can translate into practical gains.

replace MiTA Attention: Efficient Fast-Weight Scaling via a Mixture of Top-k Activations

Authors: Qishuai Wen, Zhiyuan Huang, Xianghan Meng, Wei He, Chun-Guang Li

Abstract: The attention operator in Transformers can be viewed as a two-layer fast-weight MLP, whose weights are dynamically instantiated from input tokens and whose width equals sequence length N. As the context extends, the expressive capacity of such an N-width MLP increases, but scaling its fast weights becomes prohibitively expensive for extremely long sequences. Recently, this fast-weight scaling perspective has motivated the Mixture-of-Experts (MoE) attention, which partitions the sequence into fast-weight experts and sparsely routes the tokens to them. In this paper, we elevate this perspective to a unifying framework for a wide range of efficient attention methods by interpreting them as scaling fast weights through either routing or compression. Then we propose a compress-and-route strategy, which compresses the N-width MLP into a narrower one using a small set of landmark queries and constructs deformable experts by gathering top-k activated key-value pairs for each landmark query. We call this strategy a Mixture of Top-k Activations (MiTA), and refer to the resulting efficient mechanism as MiTA attention. Preliminary experiments on vision tasks demonstrate the promise of our MiTA attention and motivate further investigation on its optimization and broader applications in more challenging settings.

replace Adaptive Rollout Allocation for Online Reinforcement Learning with Verifiable Rewards

Authors: Hieu Trung Nguyen, Bao Nguyen, Wenao Ma, Yuzhi Zhao, Ruifeng She, Viet Anh Nguyen

Abstract: Sampling efficiency is a key bottleneck in reinforcement learning with verifiable rewards. Existing group-based policy optimization methods, such as GRPO, allocate a fixed number of rollouts for all training prompts. This uniform allocation implicitly treats all prompts as equally informative, and could lead to inefficient computational budget usage and impede training progress. We introduce VIP, a Variance-Informed Predictive allocation strategy that allocates a given rollout budget to the prompts in the incumbent batch to minimize the expected gradient variance of the policy update. At each iteration, VIP uses a lightweight Gaussian process model to predict per-prompt success probabilities based on recent rollouts. These probability predictions are translated into variance estimates, which are then fed into a convex optimization problem to determine the optimal rollout allocations under a hard compute budget constraint. Empirical results show that VIP consistently improves sampling efficiency and achieves higher performance than uniform or heuristic allocation strategies in multiple benchmarks.

replace Position: Beyond Model-Centric Prediction -- Agentic Time Series Forecasting

Authors: Mingyue Cheng, Xiaoyu Tao, Qi Liu, Ze Guo, Enhong Chen

Abstract: Time series forecasting has traditionally been formulated as a model-centric, static, and single-pass prediction problem that maps historical observations to future values. While this paradigm has driven substantial progress, it proves insufficient in adaptive and multi-turn settings where forecasting requires informative feature extraction, reasoning-driven inference, iterative refinement, and continual adaptation over time. In this paper, we argue for agentic time series forecasting (ATSF), which reframes forecasting as an agentic process composed of perception, planning, action, reflection, and memory. Rather than focusing solely on predictive models, ATSF emphasizes organizing forecasting as an agentic workflow that can interact with tools, incorporate feedback from outcomes, and evolve through experience accumulation. We outline three representative implementation paradigms -- workflow-based design, agentic reinforcement learning, and a hybrid agentic workflow paradigm -- and discuss the opportunities and challenges that arise when shifting from model-centric prediction to agentic forecasting. Together, this position aims to establish agentic forecasting as a foundation for future research at the intersection of time series forecasting.

replace On the Non-Identifiability of Steering Vectors in Large Language Models

Authors: Sohan Venkatesh, Ashish Mahendran Kurapath

Abstract: Activation steering methods are widely used to control large language model (LLM) behavior and are often interpreted as revealing meaningful internal representations. This interpretation assumes that steering directions are identifiable and uniquely recoverable from input-output behavior. We show that, under white-box single-layer access, steering vectors are fundamentally non-identifiable due to large equivalence classes of behaviorally indistinguishable interventions. Empirically, we find that orthogonal perturbations achieve near-equivalent efficacy with negligible effect sizes across multiple models and traits. Critically, we show that non-identifiability is a robust geometric property that persists across diverse prompt distributions. These findings reveal fundamental interpretability limits and highlight the need for structural constraints beyond behavioral testing to enable reliable alignment interventions.

replace Beyond the Unit Hypersphere: Embedding Magnitude in Contrastive Learning

Authors: Xincan Feng, Taro Watanabe

Abstract: Cosine similarity is prevalent in contrastive learning, yet it assumes embedding magnitude is noise. We systematically study magnitude learning through a framework that independently controls query-side and document-side normalization. First, magnitude learning benefits retrieval and Retrieval-Augmented Generation (RAG) where queries and documents have distinct roles, but not Semantic Textual Similarity (STS) or CLIP where inputs are interchangeable. Second, query and document magnitudes serve different roles: document magnitude scales inference scores, while query magnitude modulates training gradients. Normalizing one side consistently outperforms both sides, and the Fisher Information Matrix condition number predicts which side to normalize. Third, magnitude learning improves out-of-domain generalization more than in-domain performance, with gains up to +72\% vs +7\%, requiring retrieval-specialized pre-training or sufficient data. These findings provide practical guidance for retrieval and RAG across text and vision domains.

replace Supervised Metric Regularization Through Alternating Optimization for Multi-Regime Physics-Informed Neural Networks

Authors: Enzo Nicolas Spotorno, Josafat Ribeiro Leal, Antonio Augusto Frohlich

Abstract: Standard Physics-Informed Neural Networks (PINNs) often face challenges when modeling parameterized dynamical systems with sharp regime transitions, such as bifurcations. In these scenarios, the continuous mapping from parameters to solutions can result in spectral bias or "mode collapse", where the network averages distinct physical behaviors. We propose a Topology-Aware PINN (TAPINN) that aims to mitigate this challenge by structuring the latent space via Supervised Metric Regularization. Unlike standard parametric PINNs that map physical parameters directly to solutions, our method conditions the solver on a latent state optimized to reflect the metric-based separation between regimes, showing ~49% lower physics residual (0.082 vs. 0.160). We train this architecture using a phase-based Alternating Optimization (AO) schedule to manage gradient conflicts between the metric and physics objectives. Preliminary experiments on the Duffing Oscillator demonstrate that while standard baselines suffer from spectral bias and high-capacity Hypernetworks overfit (memorizing data while violating physics), our approach achieves stable convergence with 2.18x lower gradient variance than a multi-output Sobolev Error baseline, and 5x fewer parameters than a hypernetwork-based alternative.

replace Empirical Stability Analysis of Kolmogorov-Arnold Networks in Hard-Constrained Recurrent Physics-Informed Discovery

Authors: Enzo Nicolas Spotorno, Josafat Leal Filho, Antonio Augusto Medeiros Frohlich

Abstract: We investigate the integration of Kolmogorov-Arnold Networks (KANs) into hard-constrained recurrent physics-informed architectures (HRPINN) to evaluate the fidelity of learned residual manifolds in oscillatory systems. Motivated by the Kolmogorov-Arnold representation theorem and preliminary gray-box results, we hypothesized that KANs would enable efficient recovery of unknown terms compared to MLPs. Through initial sensitivity analysis on configuration sensitivity, parameter scale, and training paradigm, we found that while small KANs are competitive on univariate polynomial residuals (Duffing), they exhibit severe hyperparameter fragility, instability in deeper configurations, and consistent failure on multiplicative terms (Van der Pol), generally outperformed by standard MLPs. These empirical challenges highlight limitations of the additive inductive bias in the original KAN formulation for state coupling and provide preliminary empirical evidence of inductive bias limitations for future hybrid modeling.

replace Learn from Your Mistakes: Self-Correcting Masked Diffusion Models

Authors: Yair Schiff, Omer Belhasin, Roy Uziel, Guanghan Wang, Marianne Arriola, Gilad Turok, Michael Elad, Volodymyr Kuleshov

Abstract: Masked diffusion models (MDMs) have emerged as a promising alternative to autoregressive models, enabling parallel token generation while achieving competitive performance. Despite these advantages, MDMs face a fundamental limitation: once tokens are unmasked, they remain fixed, leading to error accumulation and ultimately degrading sample quality. We address this by proposing a framework that trains a model to perform both unmasking and correction. By reusing outputs from the MDM denoising network as inputs for corrector training, we train a model to recover from potential mistakes. During generation we apply additional corrective refinement steps between unmasking ones in order to change decoded tokens and improve outputs. We name our training and sampling method Progressive Self-Correction (ProSeCo) for its unique ability to iteratively refine an entire sequence, including already generated tokens. We conduct extensive experimental validation across multiple conditional and unconditional tasks, demonstrating that ProSeCo yields better quality-efficiency trade-offs (up to ~2-3x faster sampling) and enables inference-time compute scaling to further increase sample quality beyond standard MDMs (up to ~1.3x improvement on benchmarks).

replace QTabGAN: A Hybrid Quantum-Classical GAN for Tabular Data Synthesis

Authors: Subhangi Kumari, Rakesh Achutha, Vignesh Sivaraman

Abstract: Synthesizing realistic tabular data is challenging due to heterogeneous feature types and high dimensionality. We introduce QTabGAN, a hybrid quantum-classical generative adversarial framework for tabular data synthesis. QTabGAN is especially designed for settings where real data are scarce or restricted by privacy constraints. The model exploits the expressive power of quantum circuits to learn complex data distributions, which are then mapped to tabular features using classical neural networks. We evaluate QTabGAN on multiple classification and regression datasets and benchmark it against leading state-of-the-art generative models. Experiments show that QTabGAN achieves up to 54.07% improvement across various classification datasets and evaluation metrics, thus establishing a scalable quantum approach to tabular data synthesis and highlighting its potential for quantum-assisted generative modelling.

replace Out-of-Support Generalisation via Weight-Space Sequence Modelling

Authors: Roussel Desmond Nzoyem

Abstract: As breakthroughs in deep learning transform key industries, models are increasingly required to extrapolate on datapoints found outside the range of the training set, a challenge we coin as out-of-support (OoS) generalisation. However, neural networks frequently exhibit catastrophic failure on OoS samples, yielding unrealistic but overconfident predictions. We address this challenge by reformulating the OoS generalisation problem as a sequence modelling task in the weight space, wherein the training set is partitioned into concentric shells corresponding to discrete sequential steps. Our WeightCaster framework yields plausible, interpretable, and uncertainty-aware predictions without necessitating explicit inductive biases, all the while maintaining high computational efficiency. Emprical validation on a synthetic cosine dataset and real-world air quality sensor readings demonstrates performance competitive or superior to the state-of-the-art. By enhancing reliability beyond in-distribution scenarios, these results hold significant implications for the wider adoption of artificial intelligence in safety-critical applications.

replace Neural Network-Based Parameter Estimation of a Labour Market Agent-Based Model

Authors: M Lopes Alves, Joel Dyer, Doyne Farmer, Michael Wooldridge, Anisoara Calinescu

Abstract: Agent-based modelling (ABM) is a widespread approach to simulate complex systems. Advancements in computational processing and storage have facilitated the adoption of ABMs across many fields; however, ABMs face challenges that limit their use as decision-support tools. A significant issue is parameter estimation in large-scale ABMs, particularly due to computational constraints on exploring the parameter space. This study evaluates a state-of-the-art simulation-based inference (SBI) framework that uses neural networks (NN) for parameter estimation. This framework is applied to an established labour market ABM based on job transition networks. The ABM is initiated with synthetic datasets and the real U.S. labour market. Next, we compare the effectiveness of summary statistics derived from a list of statistical measures with that learned by an embedded NN. The results demonstrate that the NN-based approach recovers the original parameters when evaluating posterior distributions across various dataset scales and improves efficiency compared to traditional Bayesian methods.

replace SubQuad: Near-Quadratic-Free Structure Inference with Distribution-Balanced Objectives in Adaptive Receptor framework

Authors: Rong Fu, Zijian Zhang, Kun Liu, Jiekai Wu, Xianda Li, Simon Fong

Abstract: Comparative analysis of adaptive immune repertoires at population scale is hampered by two practical bottlenecks: the near-quadratic cost of pairwise affinity evaluations and dataset imbalances that obscure clinically important minority clonotypes. We introduce SubQuad, an end-to-end pipeline that addresses these challenges by combining antigen-aware, near-subquadratic retrieval with GPU-accelerated affinity kernels, learned multimodal fusion, and fairness-constrained clustering. The system employs compact MinHash prefiltering to sharply reduce candidate comparisons, a differentiable gating module that adaptively weights complementary alignment and embedding channels on a per-pair basis, and an automated calibration routine that enforces proportional representation of rare antigen-specific subgroups. On large viral and tumor repertoires SubQuad achieves measured gains in throughput and peak memory usage while preserving or improving recall@k, cluster purity, and subgroup equity. By co-designing indexing, similarity fusion, and equity-aware objectives, SubQuad offers a scalable, bias-aware platform for repertoire mining and downstream translational tasks such as vaccine target prioritization and biomarker discovery.

replace Curriculum Learning for Efficient Chain-of-Thought Distillation via Structure-Aware Masking and GRPO

Authors: Bowen Yu, Maolin Wang, Sheng Zhang, Binhao Wang, Yi Wen, Jingtong Gao, Bowen Liu, Zimo Zhao, Wanyu Wang, Xiangyu Zhao

Abstract: Distilling Chain-of-Thought (CoT) reasoning from large language models into compact student models presents a fundamental challenge: teacher rationales are often too verbose for smaller models to faithfully reproduce. Existing approaches either compress reasoning into single-step, losing the interpretability that makes CoT valuable. We present a three-stage curriculum learning framework that addresses this capacity mismatch through progressive skill acquisition. First, we establish structural understanding via masked shuffled reconstruction. Second, we apply Group Relative Policy Optimization (GRPO) on masked completion tasks, enabling the model to discover its own balance between accuracy and brevity. Third, we identify persistent failure cases and guide the student to internalize teacher knowledge through targeted rewriting, again optimized with GRPO. Experiments on GSM8K demonstrate that our approach enables Qwen2.5-3B-Base to achieve an 11.29 percent accuracy improvement while reducing output length by 27.4 percent, surpassing both instruction-tuned variants and prior distillation methods.

replace cc-Shapley: Measuring Multivariate Feature Importance Needs Causal Context

Authors: J\"org Martin, Stefan Haufe

Abstract: Explainable artificial intelligence promises to yield insights into relevant features, thereby enabling humans to examine and scrutinize machine learning models or even facilitating scientific discovery. Considering the widespread technique of Shapley values, we find that purely data-driven operationalization of multivariate feature importance is unsuitable for such purposes. Even for simple problems with two features, spurious associations due to collider bias and suppression arise from considering one feature only in the observational context of the other, which can lead to misinterpretations. Causal knowledge about the data-generating process is required to identify and correct such misleading feature attributions. We propose cc-Shapley (causal context Shapley), an interventional modification of conventional observational Shapley values leveraging knowledge of the data's causal structure, thereby analyzing the relevance of a feature in the causal context of the remaining features. We show theoretically that this eradicates spurious association induced by collider bias. We compare the behavior of Shapley and cc-Shapley values on various, synthetic, and real-world datasets. We observe nullification or reversal of associations compared to univariate feature importance when moving from observational to cc-Shapley.

replace On Imbalanced Regression with Hoeffding Trees

Authors: Pantia-Marina Alchirch, Dimitrios I. Diochnos

Abstract: Many real-world applications generate continuous data streams for regression. Hoeffding trees and their variants have a long-standing tradition due to their effectiveness, either alone or as base models in broader ensembles. Recent batch-learning work shows that kernel density estimation (KDE) improves smoothed predictions in imbalanced regression [Yang et al., 2021], while hierarchical shrinkage (HS) provides post-hoc regularization for decision trees without modifying their structure [Agarwal et al., 2022]. We extend KDE to streaming settings via a telescoping formulation and integrate HS into incremental decision trees. Empirical evaluation on standard online regression benchmarks shows that KDE consistently improves early-stream performance, whereas HS provides limited gains. Our implementation is publicly available at: https://github.com/marinaAlchirch/DSFA_2026.

URLs: https://github.com/marinaAlchirch/DSFA_2026.

replace Zatom-1: A Multimodal Flow Foundation Model for 3D Molecules and Materials

Authors: Alex Morehead, Miruna Cretu, Antonia Panescu, Rishabh Anand, Maurice Weiler, Tynan Perez, Samuel Blau, Steven Farrell, Wahid Bhimji, Anubhav Jain, Hrushikesh Sahasrabuddhe, Pietro Lio, Tommi Jaakkola, Rafael Gomez-Bombarelli, Rex Ying, N. Benjamin Erichson, Michael W. Mahoney

Abstract: General-purpose 3D chemical modeling encompasses molecules and materials, requiring both generative and predictive capabilities. However, most existing AI approaches are optimized for a single domain (molecules or materials) and a single task (generation or prediction), which limits representation sharing and transfer. We introduce Zatom-1, the first end-to-end, fully open-source foundation model that unifies generative and predictive learning of 3D molecules and materials. Zatom-1 is a Transformer trained with a multimodal flow matching objective that jointly models discrete atom types and continuous 3D geometries. This approach supports scalable pretraining with predictable gains as model capacity increases, while enabling fast and stable sampling. We use joint generative pretraining as a universal initialization for downstream multi-task prediction of properties, energies, and forces. Empirically, Zatom-1 matches or outperforms specialized baselines on both generative and predictive benchmarks, while reducing the generative inference time by more than an order of magnitude. Our experiments demonstrate positive predictive transfer between chemical domains from joint generative pretraining: modeling materials during pretraining improves molecular property prediction accuracy.

replace Learn Hard Problems During RL with Reference Guided Fine-tuning

Authors: Yangzhen Wu, Shanda Li, Zixin Wen, Xin Zhou, Ameet Talwalkar, Yiming Yang, Wenhao Huang, Tianle Cai

Abstract: Reinforcement learning (RL) for mathematical reasoning can suffer from reward sparsity: for challenging problems, LLM fails to sample any correct trajectories, preventing RL from receiving meaningful positive feedback. At the same time, there often exist human-written reference solutions along with the problem (e.g., problems from AoPS), but directly fine-tuning on these solutions offers no benefit because models often cannot imitate human proofs that lie outside their own reasoning distribution. We introduce Reference-Guided Fine-Tuning (ReGFT), a simple and effective method that utilizes human-written reference solutions to synthesize positive trajectories on hard problems and train on them before RL. For each problem, we provide the model with a partial reference solution and let it generate its own reasoning trace, ensuring the resulting trajectories remain in the model's reasoning space while still benefiting from reference guidance. Fine-tuning on these reference-guided trajectories increases the number of solvable problems and produces a checkpoint that receives more positive rewards during RL. Across three benchmarks (AIME24, AIME25, BeyondAIME), ReGFT consistently improves supervised accuracy, accelerates DAPO training, and raises the final performance plateau of RL. Our results show that ReGFT effectively overcomes reward sparsity and unlocks stronger RL-based mathematical reasoning.

replace MatRIS: Toward Reliable and Efficient Pretrained Machine Learning Interatomic Potentials

Authors: Yuanchang Zhou, Siyu Hu, Xiangyu Zhang, Hongyu Wang, Guangming Tan, Weile Jia

Abstract: Foundation MLIPs demonstrate broad applicability across diverse material systems and have emerged as a powerful and transformative paradigm in chemical and computational materials science. Equivariant MLIPs achieve state-of-the-art accuracy in a wide range of benchmarks by incorporating equivariant inductive bias. However, the reliance on tensor products and high-degree representations makes them computationally costly. This raises a fundamental question: as quantum mechanical-based datasets continue to expand, can we develop a more compact model to thoroughly exploit high-dimensional atomic interactions? In this work, we present MatRIS (\textbf{Mat}erials \textbf{R}epresentation and \textbf{I}nteraction \textbf{S}imulation), an invariant MLIP that introduces attention-based modeling of three-body interactions. MatRIS leverages a novel separable attention mechanism with linear complexity $O(N)$, enabling both scalability and expressiveness. MatRIS delivers accuracy comparable to that of leading equivariant models on a wide range of popular benchmarks (Matbench-Discovery, MatPES, MDR phonon, Molecular dataset, etc). Taking Matbench-Discovery as an example, MatRIS achieves an F1 score of up to 0.847 and attains comparable accuracy at a lower training cost. The work indicates that our carefully designed invariant models can match or exceed the accuracy of equivariant models at a fraction of the cost, shedding light on the development of accurate and efficient MLIPs.

replace IoUCert: Robustness Verification for Anchor-based Object Detectors

Authors: Benedikt Br\"uckner, Alejandro J. Mercado, Yanghao Zhang, Panagiotis Kouvaros, Alessio Lomuscio

Abstract: While formal robustness verification has seen significant success in image classification, scaling these guarantees to object detection remains notoriously difficult due to complex non-linear coordinate transformations and Intersection-over-Union (IoU) metrics. We introduce IoUCert, a novel formal verification framework designed specifically to overcome these bottlenecks in foundational anchor-based object detection architectures. Focusing on the object localisation component in single-object settings, we propose a coordinate transformation that enables our algorithm to circumvent precision-degrading relaxations of non-linear box prediction functions. This allows us to optimise bounds directly with respect to the anchor box offsets which enables a novel Interval Bound Propagation method that derives optimal IoU bounds. We demonstrate that our method enables, for the first time, the robustness verification of realistic, anchor-based models including SSD, YOLOv2, and YOLOv3 variants against various input perturbations.

replace Incremental Graph Construction Enables Robust Spectral Clustering of Texts

Authors: Marko Pranji\'c, Boshko Koloski, Nada Lavra\v{c}, Senja Pollak, Marko Robnik-\v{S}ikonja

Abstract: Neighborhood graphs are a critical but often fragile step in spectral clustering of text embeddings. On realistic text datasets, standard $k$-NN graphs can contain many disconnected components at practical sparsity levels (small $k$), making spectral clustering degenerate and sensitive to hyperparameters. We introduce a simple incremental $k$-NN graph construction that preserves connectivity by design: each new node is linked to its $k$ nearest previously inserted nodes, which guarantees a connected graph for any $k$. We provide an inductive proof of connectedness and discuss implications for incremental updates when new documents arrive. We validate the approach on spectral clustering of SentenceTransformer embeddings using Laplacian eigenmaps across six clustering datasets from the Massive Text Embedding Benchmark. Compared to standard $k$-NN graphs, our method outperforms in the low-$k$ regime where disconnected components are prevalent, and matches standard $k$-NN at larger $k$.

replace Inverse Reconstruction of Shock Time Series from Shock Response Spectrum Curves using Machine Learning

Authors: Adam Watts (Los Alamos National Laboratory), Andrew Jeon (Los Alamos National Laboratory), Destry Newton (Los Alamos National Laboratory), Ryan Bowering (University of Rochester)

Abstract: The shock response spectrum (SRS) is widely used to characterize the response of single-degree-of-freedom (SDOF) systems to transient accelerations. Because the mapping from acceleration time history to SRS is nonlinear and many-to-one, reconstructing time-domain signals from a target spectrum is inherently ill-posed. Conventional approaches address this problem through iterative optimization, typically representing signals as sums of exponentially decayed sinusoids, but these methods are computationally expensive and constrained by predefined basis functions. We propose a conditional variational autoencoder (CVAE) that learns a data-driven inverse mapping from SRS to acceleration time series. Once trained, the model generates signals consistent with prescribed target spectra without requiring iterative optimization. Experiments demonstrate improved spectral fidelity relative to classical techniques, strong generalization to unseen spectra, and inference speeds three to six orders of magnitude faster. These results establish deep generative modeling as a scalable and efficient approach for inverse SRS reconstruction.

replace AOI: Turning Failed Trajectories into Training Signals for Autonomous Cloud Diagnosis

Authors: Pei Yang, Wanyi Chen, Asuka Yuxi Zheng, Xueqian Li, Xiang Li, Haoqin Tu, Jie Xiao, Yifan Pang, Dongdong Zhang, Fuqiang Li, Alfred Long, Bill Shi, Lynn Ai, Eric Yang

Abstract: Large language model (LLM) agents offer a promising data-driven approach to automating Site Reliability Engineering (SRE), yet their enterprise deployment is constrained by three challenges: restricted access to proprietary data, unsafe action execution under permission-governed environments, and the inability of closed systems to improve from failures. We present AOI (Autonomous Operations Intelligence), a trainable multi-agent framework formulating automated operations as a structured trajectory learning problem under security constraints. Our approach integrates three key components. First, a trainable diagnostic system applies Group Relative Policy Optimization (GRPO) to distill expert-level knowledge into locally deployed open-source models, enabling preference-based learning without exposing sensitive data. Second, a read-write separated execution architecture decomposes operational trajectories into observation, reasoning, and action phases, allowing safe learning while preventing unauthorized state mutation. Third, a Failure Trajectory Closed-Loop Evolver mines unsuccessful trajectories and converts them into corrective supervision signals, enabling continual data augmentation. Evaluated on the AIOpsLab benchmark, our contributions yield cumulative gains. (1) The AOI runtime alone achieves 66.3% best@5 success on all 86 tasks, outperforming the prior state-of-the-art (41.9%) by 24.4 points. (2) Adding Observer GRPO training, a locally deployed 14B model reaches 42.9% avg@1 on 63 held-out tasks with unseen fault types, surpassing Claude Sonnet 4.5. (3) The Evolver converts 37 failed trajectories into diagnostic guidance, improving end-to-end avg@5 by 4.8 points while reducing variance by 35%.

replace RADAR: Learning to Route with Asymmetry-aware DistAnce Representations

Authors: Hang Yi, Ziwei Huang, Yining Ma, Zhiguang Cao

Abstract: Recent neural solvers have achieved strong performance on vehicle routing problems (VRPs), yet they mainly assume symmetric Euclidean distances, restricting applicability to real-world scenarios. A core challenge is encoding the relational features in asymmetric distance matrices of VRPs. Early attempts directly encoded these matrices but often failed to produce compact embeddings and generalized poorly at scale. In this paper, we propose RADAR, a scalable neural framework that augments existing neural VRP solvers with the ability to handle asymmetric inputs. RADAR addresses asymmetry from both static and dynamic perspectives. It leverages Singular Value Decomposition (SVD) on the asymmetric distance matrix to initialize compact and generalizable embeddings that inherently encode the static asymmetry in the inbound and outbound costs of each node. To further model dynamic asymmetry in embedding interactions during encoding, it replaces the standard softmax with Sinkhorn normalization that imposes joint row and column distance awareness in attention weights. Extensive experiments on synthetic and real-world benchmarks across various VRPs show that RADAR outperforms strong baselines on both in-distribution and out-of-distribution instances, demonstrating robust generalization and superior performance in solving asymmetric VRPs.

replace Why Are Linear RNNs More Parallelizable?

Authors: William Merrill, Hongjian Jiang, Yanhong Li, Anthony Lin, Ashish Sabharwal

Abstract: The community is increasingly exploring linear RNNs (LRNNs) as language models, motivated by their expressive power and parallelizability. While prior work establishes the expressivity benefits of LRNNs over transformers, it is unclear what makes LRNNs -- but not traditional, nonlinear RNNs -- as easy to parallelize in practice as transformers. We answer this question by providing a tight connection between types of RNNs and standard complexity classes. We show that LRNNs can be viewed as log-depth (bounded fan-in) arithmetic circuits, which represents only a slight depth overhead relative to log-depth boolean circuits that transformers admit. Furthermore, we show that nonlinear RNNs can solve $\mathsf{L}$-complete problems (and even $\mathsf{P}$-complete ones, under polynomial precision), revealing a fundamental barrier to parallelizing them as efficiently as transformers. Our theory also identifies fine-grained expressivity differences between recent popular LRNN variants: permutation-diagonal LRNNs are $\mathsf{NC}^1$-complete whereas diagonal-plus-low-rank LRNNs are more expressive ($\mathsf{PNC}^1$-complete). We provide further insight by associating each type of RNN with a corresponding automata-theoretic model that it can simulate. Together, our results reveal fundamental tradeoffs between nonlinear RNNs and different variants of LRNNs, providing a foundation for designing LLM architectures that achieve an optimal balance between expressivity and parallelism.

replace-cross Zeroth-Order primal-dual Alternating Projection Gradient Algorithms for Nonconvex Minimax Problems with Coupled linear Constraints

Authors: Huiling Zhang, Zi Xu, Yuhong Dai

Abstract: In this paper, we study zeroth-order algorithms for nonconvex minimax problems with coupled linear constraints under the deterministic and stochastic settings, which have attracted wide attention in machine learning, signal processing and many other fields in recent years, e.g., adversarial attacks in resource allocation problems and network flow problems etc. We propose two single-loop algorithms, namely the zeroth-order primal-dual alternating projected gradient (ZO-PDAPG) algorithm and the zeroth-order regularized momentum primal-dual projected gradient algorithm (ZO-RMPDPG), for solving deterministic and stochastic nonconvex-(strongly) concave minimax problems with coupled linear constraints. The iteration complexity of the two proposed algorithms to obtain an $\varepsilon$-stationary point are proved to be $\mathcal{O}(\varepsilon ^{-2})$ (resp. $\mathcal{O}(\varepsilon ^{-4})$) for solving nonconvex-strongly concave (resp. nonconvex-concave) minimax problems with coupled linear constraints under deterministic settings and $\tilde{\mathcal{O}}(\varepsilon ^{-3})$ (resp. $\tilde{\mathcal{O}}(\varepsilon ^{-6.5})$) under stochastic settings respectively. To the best of our knowledge, they are the first two zeroth-order algorithms with iterative complexity guarantees for solving nonconvex-(strongly) concave minimax problems with coupled linear constraints under the deterministic and stochastic settings. The proposed ZO-RMPDPG algorithm, when specialized to stochastic nonconvex-concave minimax problems without coupled constraints, outperforms all existing zeroth-order algorithms by achieving a better iteration complexity, thus setting a new state-of-the-art.

replace-cross Distilling Privileged Information for Dubins Traveling Salesman Problems with Neighborhoods

Authors: Min Kyu Shin, Su-Jeong Park, Seung-Keol Ryu, Heeyeon Kim, Han-Lim Choi

Abstract: This paper presents a novel learning approach for Dubins Traveling Salesman Problems(DTSP) with Neighborhood (DTSPN) to quickly produce a tour of a non-holonomic vehicle passing through neighborhoods of given task points. The method involves two learning phases: initially, a model-free reinforcement learning approach leverages privileged information to distill knowledge from expert trajectories generated by the LinKernighan heuristic (LKH) algorithm. Subsequently, a supervised learning phase trains an adaptation network to solve problems independently of privileged information. Before the first learning phase, a parameter initialization technique using the demonstration data was also devised to enhance training efficiency. The proposed learning method produces a solution about 50 times faster than LKH and substantially outperforms other imitation learning and RL with demonstration schemes, most of which fail to sense all the task points.

replace-cross Generalization Bounds for Markov Algorithms through Entropy Flow Computations

Authors: Benjamin Dupuis, Maxime Haddouche, George Deligiannidis, Umut Simsekli

Abstract: Many learning algorithms can be represented as Markov processes, and understanding their generalization error is a central topic in learning theory. For specific continuous-time noisy algorithms, a prominent analysis technique relies on information-theoretic tools and the so-called ``entropy flow'' method. This technique is compatible with a broad range of assumptions and leverages the convergence properties of learning dynamics to produce meaningful generalization bounds, which can also be informative or extend to discrete-time settings. Despite their success, existing entropy flow formulations are limited to specific noise and algorithm structures (\eg, Langevin dynamics). In this work, we exploit new technical tools to extend its applicability to all learning algorithms whose iterative dynamics is governed by a time-homogeneous Markov process. Our approach builds on a principled continuous-time approximation of Markov algorithms and introduces a new, exact entropy flow formula for such processes. Within this unified framework, we establish novel connections to a well-studied family of modified logarithmic Sobolev inequalities, which we use to connect the generalization error to the ergodic properties of Markov processes. Finally, we provide a detailed analysis of all the terms appearing in our theory and demonstrate its effectiveness by deriving new generalization bounds for several concrete algorithms.

replace-cross Sink equilibria and the attractors of learning in games

Authors: Oliver Biggar, Christos Papadimitriou

Abstract: Characterizing the limit behavior -- that is, the attractors -- of learning dynamics is one of the most fundamental open questions in game theory. In recent work on this front, it was conjectured that the attractors of the replicator dynamic are in one-to-one correspondence with the sink equilibria of the game -- the sink strongly connected components of a game's preference graph -- and it was established that they do stand in at least one-to-many correspondence with them. Here, we show that the one-to-one conjecture is false. We disprove this conjecture over the course of three theorems: the first disproves a stronger form of the conjecture, while the weaker form is disproved separately in the two-player and $N$-player ($N>2$) cases. By showing how the conjecture fails, we lay out the obstacles that lie ahead for characterizing attractors of the replicator, and introduce new ideas with which to tackle them. All three counterexamples derive from an object called a local source -- a point lying within the sink equilibrium, and yet which is `locally repelling'; we prove that the absence of local sources is necessary, but not sufficient, for the one-to-one property to be true. We complement this with a sufficient condition: we introduce a local property of a sink equilibrium called pseudoconvexity, and establish that when the sink equilibria of a two-player game are pseudoconvex then they precisely define the attractors. Pseudoconvexity generalizes the previous cases -- such as zero-sum games and potential games -- where this conjecture was known to hold, and reformulates these cases in terms of a simple graph property.

replace-cross Safety Mirage: How Spurious Correlations Undermine VLM Safety Fine-Tuning and Can Be Mitigated by Machine Unlearning

Authors: Yiwei Chen, Yuguang Yao, Yihua Zhang, Bingquan Shen, Gaowen Liu, Sijia Liu

Abstract: Recent vision language models (VLMs) have made remarkable strides in generative modeling with multimodal inputs, particularly text and images. However, their susceptibility to generating harmful content when exposed to unsafe queries raises critical safety concerns. While current alignment strategies primarily rely on supervised safety fine-tuning with curated datasets, we identify a fundamental limitation we call the ''safety mirage'', where supervised fine-tuning inadvertently reinforces spurious correlations between superficial textual patterns and safety responses, rather than fostering deep, intrinsic mitigation of harm. We show that these spurious correlations leave fine-tuned VLMs vulnerable even to a simple one-word modification-based attack, where substituting a single word in text queries with a spurious correlation-inducing alternative can effectively bypass safeguards. Additionally, these correlations contribute to the over-prudence, causing fine-tuned VLMs to refuse benign queries unnecessarily. To address these issues, we show machine unlearning (MU) as a powerful alternative to supervised safety fine-tuning, as it avoids biased feature-label mappings and directly removes harmful knowledge from VLMs while preserving their general capabilities. Extensive evaluations across safety benchmarks show that under MU-based alignment reduces the attack success rate by up to 60.27% and cuts unnecessary rejections by over 84.20%. WARNING: There exist AI generations that may be offensive in nature.

replace-cross Assessing the Impact of Code Changes on the Fault Localizability of Large Language Models

Authors: Sabaat Haroon, Ahmad Faraz Khan, Ahmad Humayun, Waris Gill, Abdul Haddi Amjad, Ali R. Butt, Mohammad Taha Khan, Muhammad Ali Gulzar

Abstract: Generative Large Language Models (LLMs) are increasingly used in non-generative software maintenance tasks, such as fault localization (FL). Success in FL depends on a models ability to reason about program semantics beyond surface-level syntactic and lexical features. However, widely used LLM benchmarks primarily evaluate code generation, which differs fundamentally from semantic program reasoning. Meanwhile, traditional FL benchmarks such as Defect4J and BugsInPy are either not scalable or obsolete, as their datasets have become part of LLM training data, leading to biased results. This paper presents the first large-scale empirical investigation into the robustness of LLMs fault localizability. Inspired by mutation testing, we develop an end-to-end evaluation framework that addresses key limitations in existing LLM evaluation, including data contamination, scalability, automation, and extensibility. Using real-world programs with specifications, we inject unseen faults and ask LLMs to localize them, filtering out underspecified programs where localization is ambiguous. For each successfully localized program, we apply semantic-preserving mutations (SPMs) and rerun localization to assess robustness and determine whether LLM reasoning relies on syntactic cues rather than semantics. We evaluate 10 state-of-the-art LLMs on 750,013 fault localization tasks from over 1,300 Java and Python programs. We find that SPMs cause LLMs to fail on previously localized faults in 78% of cases, and that reasoning is stronger when relevant code appears earlier in context. These results indicate that LLM code reasoning is often tied to features irrelevant to semantics. We also identify code patterns that are challenging for LLMs to reason about. Overall, our findings motivate fundamental advances in how LLMs represent, interpret, and prioritize code semantics to reason more deeply about program logic

replace-cross Noise2Ghost: Self-supervised deep convolutional reconstruction for ghost imaging

Authors: Mathieu Manni, Dmitry Karpov, K. Joost Batenburg, Sharon Shwartz, Nicola Vigan\`o

Abstract: We present a new self-supervised deep-learning-based Ghost Imaging (GI) reconstruction method, which provides unparalleled reconstruction quality for noisy acquisitions among unsupervised methods. We present the supporting mathematical framework and results from theoretical and real data use cases. Self-supervision removes the need for clean reference data while offering strong noise reduction. This provides the necessary tools for addressing signal-to-noise ratio concerns for GI acquisitions in emerging and cutting-edge low-light GI scenarios. Notable examples include micro- and nano-scale x-ray emission imaging, e.g., x-ray fluorescence imaging of dose-sensitive samples. Their applications include in-vivo and in-operando case studies for biological samples and batteries.

replace-cross Differentially Private and Scalable Estimation of the Network Principal Component

Authors: Alireza Khayatian, Anil Vullikanti, Aritra Konar

Abstract: Computing the principal component (PC) of the adjacency matrix of an undirected graph has several applications ranging from identifying key vertices for influence maximization and controlling diffusion processes, to discovering densely interconnected vertex subsets. However, many networked datasets are sensitive, which necessitates private computation of the PC for use in the aforementioned applications. Differential privacy has emerged as the gold standard in privacy-preserving data analysis, but existing DP algorithms for private PC suffer from low accuracy due to large noise injection or high complexity. Motivated by the large gap between the local and global sensitivities of the PC on real-graphs, we consider instance-specific mechanisms for privately computing the PC under edge-DP. These mechanisms guarantee privacy for all datasets, but provide good utility on ``well-behaved'' datasets by injecting smaller amounts of noise. More specifically, we consider the Propose-Test-Release (PTR) framework. Although computationally expensive in general, we design a novel approach for implementing a PTR variant in the same time as computation of a non-private PC, while offering good utility. Our framework tests in a differentially-private manner whether a given graph is ``well-behaved'' or not, and then tests whether its private to release a noisy PC with small noise. As a consequence, this also leads to the first DP algorithm for the Densest-$k$-subgraph problem, a key graph mining primitive. We run our method on diverse real-world networks, with the largest having 3 million vertices, and compare its utility to a pre-existing baseline based on the private power method (PPM). Although PTR requires a slightly larger privacy budget, on average, it achieves a 180-fold improvement in runtime over PPM.

replace-cross Variational Formulation of Particle Flow

Authors: Yinzhuang Yi, Jorge Cort\'es, Nikolay Atanasov

Abstract: This paper provides a formulation of the log-homotopy particle flow from the perspective of variational inference. We show that the transient density used to derive the particle flow follows a time-scaled trajectory of the Fisher-Rao gradient flow in the space of probability densities. The Fisher-Rao gradient flow is obtained as a continuous-time algorithm for variational inference, minimizing the Kullback-Leibler divergence between a variational density and the true posterior density. When considering a parametric family of variational densities, the function space Fisher-Rao gradient flow simplifies to the natural gradient flow of the variational density parameters. By adopting a Gaussian variational density, we derive a Gaussian approximated Fisher-Rao particle flow and show that, under linear Gaussian assumptions, it reduces to the Exact Daum and Huang particle flow. Additionally, we introduce a Gaussian mixture approximated Fisher-Rao particle flow to enhance the expressive power of our model through a multi-modal variational density. Simulations on low- and high-dimensional estimation problems illustrate our results.

replace-cross ReactDance: Hierarchical Representation for High-Fidelity and Coherent Long-Form Reactive Dance Generation

Authors: Jingzhong Lin, Xinru Li, Yuanyuan Qi, Bohao Zhang, Wenxiang Liu, Kecheng Tang, Wenxuan Huang, Xiangfeng Xu, Bangyan Li, Changbo Wang, Gaoqi He

Abstract: Reactive dance generation (RDG), the task of generating a dance conditioned on a lead dancer's motion, holds significant promise for enhancing human-robot interaction and immersive digital entertainment. Despite progress in duet synchronization and motion-music alignment, two key challenges remain: generating fine-grained spatial interactions and ensuring long-term temporal coherence. In this work, we introduce \textbf{ReactDance}, a diffusion framework that operates on a novel hierarchical latent space to address these spatiotemporal challenges in RDG. First, for high-fidelity spatial expression and fine-grained control, we propose Hierarchical Finite Scalar Quantization (\textbf{HFSQ}). This multi-scale motion representation effectively disentangles coarse body posture from subtle limb dynamics, enabling independent and detailed control over both aspects through a layered guidance mechanism. Second, to efficiently generate long sequences with high temporal coherence, we propose Blockwise Local Context (\textbf{BLC}), a non-autoregressive sampling strategy. Departing from slow, frame-by-frame generation, BLC partitions the sequence into blocks and synthesizes them in parallel via periodic causal masking and positional encodings. Coherence across these blocks is ensured by a dense sliding-window training approach that enriches the representation with local temporal context. Extensive experiments show that ReactDance substantially outperforms state-of-the-art methods in motion quality, long-term coherence, and sampling efficiency. Project page: https://ripemangobox.github.io/ReactDance.

URLs: https://ripemangobox.github.io/ReactDance.

replace-cross Ice Cream Doesn't Cause Drowning: Benchmarking LLMs Against Statistical Pitfalls in Causal Inference

Authors: Jin Du, Li Chen, Xun Xian, An Luo, Fangqiao Tian, Ganghua Wang, Charles Doss, Xiaotong Shen, Jie Ding

Abstract: Reliable causal inference is essential for making decisions in high-stakes areas like medicine, economics, and public policy. However, it remains unclear whether large language models (LLMs) can handle rigorous and trustworthy statistical causal inference. Current benchmarks usually involve simplified tasks. For example, these tasks might only ask LLMs to identify semantic causal relationships or draw conclusions directly from raw data. As a result, models may overlook important statistical pitfalls, such as Simpson's paradox or selection bias. This oversight limits the applicability of LLMs in the real world. To address these limitations, we propose CausalPitfalls, a comprehensive benchmark designed to rigorously evaluate the capability of LLMs in overcoming common causal inference pitfalls. Our benchmark features structured challenges across multiple difficulty levels, each paired with grading rubrics. This approach allows us to quantitatively measure both causal reasoning capabilities and the reliability of LLMs' responses. We evaluate models using two protocols: (1) direct prompting, which assesses intrinsic causal reasoning, and (2) code-assisted prompting, where models generate executable code for explicit statistical analysis. Additionally, we validate the effectiveness of this judge by comparing its scoring with assessments from human experts. Our results reveal significant limitations in current LLMs when performing statistical causal inference. The CausalPitfalls benchmark provides essential guidance and quantitative metrics to advance the development of trustworthy causal reasoning systems.

replace-cross ShIOEnv: A Command Evaluation Environment for Grammar-Constrained Synthesis and Execution Behavior Modeling

Authors: Jarrod Ragsdale, Rajendra Boppana

Abstract: Modeling of command-line interface (CLI) interaction has enabled flexible, execution-free output presentation. However, current approaches struggle to model inputs with complex compositions and inputs whose execution behavior depends on system characteristics. This is due to a lack of shell input-output (ShIO) data in the training distributions used by the models in these approaches. To address this data gap, we present ShIOEnv, a Gymnasium-compatible Bash shell environment for command synthesis and system-grounded execution behavior capturing. To concentrate synthesis on productive regions of the state-action space, we temporally abstract argument construction into grammar-derived options, thereby constraining synthesis to syntactically valid arguments. We introduce a self-supervised irreducibility signal to approximate the proportion of arguments that contribute to the observed execution behavior, serving as a measure of information density for each input. Using ShIOEnv, we curate and release 2.1M input-output pairs for modeling feedback from Bash command execution. We find that models trained on grammar-constrained datasets with higher maximum irreducibility achieve greater accuracy when modeling the execution behavior of user-sourced inputs than prior execution-free baselines.

replace-cross Highly Efficient and Effective LLMs with Multi-Boolean Architectures

Authors: Ba-Hien Tran, Van Minh Nguyen

Abstract: Weight binarization has emerged as a promising strategy to reduce the complexity of large language models (LLMs). Existing approaches fall into post-training binarization, which is simple but causes severe performance loss, and training-aware methods, which depend on full-precision latent weights, adding complexity and limiting efficiency. We propose a novel framework that represents LLMs with multi-kernel Boolean parameters and, for the first time, enables direct finetuning LMMs in the Boolean domain, eliminating the need for latent weights. This enhances representational capacity and dramatically reduces complexity during both finetuning and inference. Extensive experiments across diverse LLMs show our method outperforms recent ultra low-bit quantization and binarization techniques.

replace-cross SealQA: Raising the Bar for Reasoning in Search-Augmented Language Models

Authors: Thinh Pham, Nguyen Nguyen, Pratibha Zunjare, Weiyuan Chen, Yu-Min Tseng, Tu Vu

Abstract: We introduce SealQA, a new challenge benchmark for evaluating SEarch-Augmented Language models on fact-seeking questions where web search yields conflicting, noisy, or unhelpful results. SealQA comes in three flavors: (1) Seal-0 (main) and (2) Seal-Hard, which assess factual accuracy and reasoning capabilities, with Seal-0 focusing on the most challenging questions where chat models (e.g., GPT-4.1) typically achieve near-zero accuracy; and (3) LongSeal, which extends SealQA to test long-context, multi-document reasoning in "needle-in-a-haystack" settings. Our evaluation reveals critical limitations in current models: Even frontier LLMs perform poorly across all SealQA flavors. On Seal-0, frontier agentic models equipped with tools like o3 and o4-mini achieve only 17.1% and 6.3% accuracy, respectively, at their best reasoning efforts. We find that advanced reasoning models such as DeepSeek-R1-671B and o3-mini are highly vulnerable to noisy search results. Notably, increasing test-time compute does not yield reliable gains across o3-mini, o4-mini, and o3, with performance often plateauing or even declining early. Additionally, while recent models are less affected by the "lost-in-the-middle" issue, they still fail to reliably identify relevant documents in LongSeal when faced with numerous distractors. To facilitate future work, we release SealQA at huggingface.co/datasets/vtllms/sealqa.

replace-cross EDINET-Bench: Evaluating LLMs on Complex Financial Tasks using Japanese Financial Statements

Authors: Issa Sugiura, Takashi Ishida, Taro Makino, Chieko Tazuke, Takanori Nakagawa, Kosuke Nakago, David Ha

Abstract: Large Language Models (LLMs) have made remarkable progress, surpassing human performance on several benchmarks in domains such as mathematics and coding. A key driver of this progress has been the development of benchmark datasets. In contrast, the financial domain poses higher entry barriers due to its demand for specialized expertise, and benchmarks remain relatively scarce compared to those in mathematics or coding. We introduce EDINET-Bench, an open-source Japanese financial benchmark designed to evaluate LLMs on challenging tasks such as accounting fraud detection, earnings forecasting, and industry classification. EDINET-Bench is constructed from ten years of annual reports filed by Japanese companies. These tasks require models to process entire annual reports and integrate information across multiple tables and textual sections, demanding expert-level reasoning that is challenging even for human professionals. Our experiments show that even state-of-the-art LLMs struggle in this domain, performing only marginally better than logistic regression in binary classification tasks such as fraud detection and earnings forecasting. Our results show that simply providing reports to LLMs in a straightforward setting is not enough. This highlights the need for benchmark frameworks that better reflect the environments in which financial professionals operate, with richer scaffolding such as realistic simulations and task-specific reasoning support to enable more effective problem solving. We make our dataset and code publicly available to support future research.

replace-cross Learning Physical Systems: Symplectification via Gauge Fixing in Dirac Structures

Authors: Aristotelis Papatheodorou, Pranav Vaidhyanathan, Natalia Ares, Ioannis Havoutis

Abstract: Physics-informed deep learning has achieved remarkable progress by embedding geometric priors, such as Hamiltonian symmetries and variational principles, into neural networks, enabling structure-preserving models that extrapolate with high accuracy. However, in systems with dissipation and holonomic constraints, ubiquitous in legged locomotion and multibody robotics, the canonical symplectic form becomes degenerate, undermining the very invariants that guarantee stability and long-term prediction. In this work, we tackle this foundational limitation by introducing Presymplectification Networks (PSNs), the first framework to learn the symplectification lift via Dirac structures, restoring a non-degenerate symplectic geometry by embedding constrained systems into a higher-dimensional manifold. Our architecture combines a recurrent encoder with a flow-matching objective to learn the augmented phase-space dynamics end-to-end. We then attach a lightweight Symplectic Network (SympNet) to forecast constrained trajectories while preserving energy, momentum, and constraint satisfaction. We demonstrate our method on the dynamics of the ANYmal quadruped robot, a challenging contact-rich, multibody system. To the best of our knowledge, this is the first framework that effectively bridges the gap between constrained, dissipative mechanical systems and symplectic learning, unlocking a whole new class of geometric machine learning models, grounded in first principles yet adaptable from data.

replace-cross MuRating: A High Quality Data Selecting Approach to Multilingual Large Language Model Pretraining

Authors: Zhixun Chen, Ping Guo, Wenhan Han, Yifan Zhang, Binbin Liu, Haobin Lin, Fengze Liu, Yan Zhao, Bingni Zhang, Taifeng Wang, Yin Zheng, Trevor Cohn, Meng Fang

Abstract: Data quality is a critical driver of large language model performance, yet existing model-based selection methods focus almost exclusively on English. We introduce MuRating, a scalable framework that transfers high-quality English data-quality signals into a single rater for 17 target languages. MuRating aggregates multiple English "raters" via pairwise comparisons to learn unified document-quality scores,then projects these judgments through translation to train a multilingual evaluator on monolingual, cross-lingual, and parallel text pairs. Applied to web data, MuRating selects balanced subsets of English and multilingual content to pretrain a 1.2 B-parameter LLaMA model. Compared to strong baselines, including QuRater, AskLLM, DCLM and so on, our approach boosts average accuracy on both English benchmarks and multilingual evaluations, with especially large gains on knowledge-intensive tasks. We further analyze translation fidelity, selection biases, and underrepresentation of narrative material, outlining directions for future work.

replace-cross Elucidating the Design Space of Arbitrary-Noise-Based Diffusion Models

Authors: Xingyu Qiu, Mengying Yang, Xinghua Ma, Dong Liang, Fanding Li, Gongning Luo, Wei Wang, Kuanquan Wang, Shuo Li

Abstract: Although EDM aims to unify the design space of diffusion models, its reliance on fixed Gaussian noise prevents it from explaining emerging flow-based methods that diffuse arbitrary noise. Moreover, our study reveals that EDM's forcible injection of Gaussian noise has adverse effects on image restoration task, as it corrupts the degraded images, overextends the restoration distance, and increases the task's complexity. To interpret diverse methods for handling distinct noise patterns within a unified theoretical framework and to minimize the restoration distance, we propose EDA, which Elucidates the Design space of Arbitrary-noise diffusion models. Theoretically, EDA expands noise pattern flexibility while preserving EDM's modularity, with rigorous proof that increased noise complexity introduces no additional computational overhead during restoration. EDA is validated on three representative medical image denoising and natural image restoration tasks: MRI bias field correction (global smooth noise), CT metal artifact removal (global sharp noise) and natural image shadow removal (local boundary-aware noise). With only 5 sampling steps, competitive results against specialized methods across medical and natural tasks demonstrate EDA's strong generalization capability for image restoration. Code is available at: https://github.com/PerceptionComputingLab/EDA.

URLs: https://github.com/PerceptionComputingLab/EDA.

replace-cross Structured quantum learning via em algorithm for Boltzmann machines

Authors: Takeshi Kimura, Kohtaro Kato, Masahito Hayashi

Abstract: Quantum Boltzmann machines (QBMs) are generative models with potential advantages in quantum machine learning, yet their training is fundamentally limited by the barren plateau problem, where gradients vanish exponentially with system size. We introduce a quantum version of the em algorithm, an information-geometric generalization of the classical Expectation-Maximization method, which circumvents gradient-based optimization on non-convex functions. Implemented on a semi-quantum restricted Boltzmann machine (sqRBM) -- a hybrid architecture with quantum effects confined to the hidden layer -- our method achieves stable learning and outperforms gradient descent on multiple benchmark datasets. These results establish a structured and scalable alternative to gradient-based training in QML, offering a pathway to mitigate barren plateaus and enhance quantum generative modeling.

replace-cross Dropping Just a Handful of Preferences Can Change Top Large Language Model Rankings

Authors: Jenny Y. Huang, Yunyi Shen, Dennis Wei, Tamara Broderick

Abstract: We propose a method for evaluating the robustness of widely used LLM ranking systems -- variants of a Bradley--Terry model -- to dropping a worst-case very small fraction of preference data. Our approach is computationally fast and easy to adopt. When we apply our method to matchups from popular LLM ranking platforms, including Chatbot Arena and derivatives, we find that the rankings of top-performing models can be remarkably sensitive to the removal of a small fraction of preferences; for instance, dropping just 0.003% of human preferences can change the top-ranked model on Chatbot Arena. Our robustness check identifies the specific preferences most responsible for such ranking flips, allowing for inspection of these influential preferences. We observe that the rankings derived from MT-bench preferences are notably more robust than those from Chatbot Arena, likely due to MT-bench's use of expert annotators and carefully constructed prompts. Finally, we find that neither rankings based on crowdsourced human evaluations nor those based on LLM-as-a-judge preferences are systematically more sensitive than the other.

replace-cross How Quantization Shapes Bias in Large Language Models

Authors: Federico Marcuzzi, Xuefei Ning, Roy Schwartz, Iryna Gurevych

Abstract: This work presents a comprehensive evaluation of how quantization affects model bias, with particular attention to its impact on individual demographic subgroups. We focus on weight and activation quantization strategies and examine their effects across a broad range of bias types, including stereotypes, fairness, toxicity, and sentiment. We employ both probability- and generated text-based metrics across 13 benchmarks and evaluate models that differ in architecture family and reasoning ability. Our findings show that quantization has a nuanced impact on bias: while it can reduce model toxicity and does not significantly impact sentiment, it tends to slightly increase stereotypes and unfairness in generative tasks, especially under aggressive compression. These trends are generally consistent across demographic categories and subgroups, and model types, although their magnitude depends on the specific setting. Overall, our results highlight the importance of carefully balancing efficiency and ethical considerations when applying quantization in practice.

replace-cross New Insights into Optimal Alignment of Acoustic and Linguistic Representations for Knowledge Transfer in ASR

Authors: Xugang Lu, Peng Shen, Hisashi Kawai

Abstract: Aligning acoustic and linguistic representations is a central challenge to bridge the pre-trained models in knowledge transfer for automatic speech recognition (ASR). This alignment is inherently structured and asymmetric: while multiple consecutive acoustic frames typically correspond to a single linguistic token (many-to-one), certain acoustic transition regions may relate to multiple adjacent tokens (one-to-many). Moreover, acoustic sequences often include frames with no linguistic counterpart, such as background noise or silence may lead to imbalanced matching conditions. In this work, we take a new insight to regard alignment and matching as a detection problem, where the goal is to identify meaningful correspondences with high precision and recall ensuring full coverage of linguistic tokens while flexibly handling redundant or noisy acoustic frames in transferring linguistic knowledge for ASR. Based on this new insight, we propose an unbalanced optimal transport-based alignment model that explicitly handles distributional mismatch and structural asymmetries with soft and partial matching between acoustic and linguistic modalities. Our method ensures that every linguistic token is grounded in at least one acoustic observation, while allowing for flexible, probabilistic mappings from acoustic to linguistic units. We evaluate our proposed model with experiments on an CTC-based ASR system with a pre-trained language model for knowledge transfer. Experimental results demonstrate the effectiveness of our approach in flexibly controlling degree of matching and hence to improve ASR performance.

replace-cross BabyHuBERT: Multilingual Self-Supervised Learning for Segmenting Speakers in Child-Centered Long-Form Recordings

Authors: Th\'eo Charlot, Tarek Kunze, Maxime Poli, Alejandrina Cristia, Emmanuel Dupoux, Marvin Lavechin

Abstract: Child-centered daylong recordings are essential for studying early language development, but existing speech models trained on clean adult data perform poorly due to acoustic and linguistic differences. We introduce BabyHuBERT, a self-supervised speech model trained on 13,000 hours of multilingual child-centered recordings spanning 40+ languages. Evaluated on voice type classification -- distinguishing target children from female adults, male adults, and other children, a key preprocessing step for analyzing naturalistic language experiences -- BabyHuBERT-VTC achieves F1-scores from 52.1% to 74.4% across six corpora, consistently outperforming W2V2-LL4300 (English daylongs) and HuBERT (clean adult speech). Notable gains include 13.2 and 15.9 absolute F1 points over HuBERT on Vanuatu and Solomon Islands, demonstrating effectiveness on underrepresented languages. We share code and model to support researchers working with child-centered recordings across diverse linguistic contexts.

replace-cross Diffusion-Based Impedance Learning for Contact-Rich Manipulation Tasks

Authors: Noah Geiger, Tamim Asfour, Neville Hogan, Johannes Lachner

Abstract: Learning-based methods excel at robot motion generation but remain limited in contact-rich physical interaction. Impedance control provides stable and safe contact behavior but requires task-specific tuning of stiffness and damping parameters. We present Diffusion-Based Impedance Learning, a framework that bridges these paradigms by combining generative modeling with energy-consistent impedance control. A Transformer-based Diffusion Model, conditioned via cross-attention on measured external wrenches, reconstructs simulated Zero-Force Trajectories (sZFTs) that represent contact-consistent equilibrium behavior. A SLERP-based quaternion noise scheduler preserves geometric consistency for rotations on the unit sphere. The reconstructed sZFT is used by an energy-based estimator to adapt impedance online through directional stiffness and damping modulation. Trained on parkour and robot-assisted therapy demonstrations collected via Apple Vision Pro teleoperation, the model achieves sub-millimeter positional and sub-degree rotational accuracy using only tens of thousands of samples. Deployed in realtime torque control on a KUKA LBR iiwa, the approach enables smooth obstacle traversal and generalizes to unseen tasks, achieving 100% success in multi-geometry peg-in-hole insertion.

replace-cross Noise-to-Notes: Diffusion-based Generation and Refinement for Automatic Drum Transcription

Authors: Michael Yeung, Keisuke Toyama, Toya Teramoto, Shusuke Takahashi, Tamaki Kojima

Abstract: Automatic drum transcription (ADT) is traditionally formulated as a discriminative task to predict drum events from audio spectrograms. In this work, we redefine ADT as a conditional generative task and introduce Noise-to-Notes (N2N), a framework leveraging diffusion modeling to transform audio-conditioned Gaussian noise into drum events with associated velocities. This generative diffusion approach offers distinct advantages, including a flexible speed-accuracy trade-off and strong inpainting capabilities. However, the generation of binary onset and continuous velocity values presents a challenge for diffusion models, and to overcome this, we introduce an Annealed Pseudo-Huber loss to facilitate effective joint optimization. Finally, to augment low-level spectrogram features, we propose incorporating features extracted from music foundation models (MFMs), which capture high-level semantic information and enhance robustness to out-of-domain drum audio. Experimental results demonstrate that including MFM features significantly improves robustness and N2N establishes a new state-of-the-art performance across multiple ADT benchmarks.

replace-cross BridgeDrive: Diffusion Bridge Policy for Closed-Loop Trajectory Planning in Autonomous Driving

Authors: Shu Liu, Wenlin Chen, Weihao Li, Zheng Wang, Lijin Yang, Jianing Huang, Yipin Zhang, Zhongzhan Huang, Ze Cheng, Hao Yang

Abstract: Diffusion-based planners have shown strong potential for autonomous driving by capturing multi-modal driving behaviors. A key challenge is how to effectively guide these models for safe and reactive planning in closed-loop settings, where the ego vehicle's actions influence future states. Recent work leverages typical expert driving behaviors (i.e., anchors) to guide diffusion planners but relies on a truncated diffusion schedule that introduces an asymmetry between the forward and denoising processes, diverging from the core principles of diffusion models. To address this, we introduce BridgeDrive, a novel anchor-guided diffusion bridge policy for closed-loop trajectory planning. Our approach formulates planning as a diffusion bridge that directly transforms coarse anchor trajectories into refined, context-aware plans, ensuring theoretical consistency between the forward and reverse processes. BridgeDrive is compatible with efficient ODE solvers, enabling real-time deployment. We achieve state-of-the-art performance on the Bench2Drive closed-loop evaluation benchmark, improving the success rate by 7.72% and 2.45% over prior arts with PDM-Lite and LEAD datasets, respectively. Project page: https://github.com/shuliu-ethz/BridgeDrive.

URLs: https://github.com/shuliu-ethz/BridgeDrive.

replace-cross BeyondBench: Contamination-Resistant Evaluation of Reasoning in Language Models

Authors: Gaurav Srivastava, Aafiya Hussain, Zhenyu Bi, Swastik Roy, Priya Pitre, Meng Lu, Morteza Ziyadi, Xuan Wang

Abstract: Evaluating language models fairly is increasingly difficult as static benchmarks risk contamination by training data, obscuring whether models truly reason or recall. We introduce BeyondBench, an evaluation framework using algorithmic problem generation to create mathematically grounded problems on the fly, ensuring each test remains uncontaminated. Our framework covers 44 algorithmic tasks with 117 variations across three difficulty levels: the Easy Suite (29 tasks) for arithmetic and statistics, the Medium Suite (5 tasks, 49 variations) for sequence patterns and reasoning, and the Hard Suite (10 tasks, 68 variations) for NP-complete and constraint satisfaction problems. Each task draws from a space exceeding 10^15 unique instances, with deterministically verified solutions. We evaluated 101 language models (85 open-source, 16 closed-source), spanning 0.5B to 141B parameters and multiple quantization schemes, using three-fold evaluation for robustness. Results reveal consistent reasoning deficiencies, with performance degrading sharply as complexity increases. In Hard Suite evaluations, Gemini-2.5-pro, Llama-3.3-70B, and Qwen2.5-72B achieved accuracies of 56.21%, 27.16%, and 33.37% respectively. Performance drops significantly without tool usage, with GPT-5, GPT-5-mini, and GPT-5-nano showing declines of 16.81%, 15.86%, and 43.95% in overall accuracy. Contamination resistance rests on three guarantees: (i) the problem space vastly exceeds any static dataset, (ii) every instance has a deterministically verifiable solution, and (iii) isomorphic transformations yield semantically equivalent but syntactically novel problems. BeyondBench redefines reasoning evaluation via genuine algorithmic problem-solving. Our leaderboard is at https://ctrl-gaurav.github.io/BeyondBench/, Python package at https://pypi.org/project/beyondbench/, and codebase at https://github.com/ctrl-gaurav/BeyondBench.

URLs: https://ctrl-gaurav.github.io/BeyondBench/,, https://pypi.org/project/beyondbench/,, https://github.com/ctrl-gaurav/BeyondBench.

replace-cross Hyperspherical Latents Improve Continuous-Token Autoregressive Generation

Authors: Guolin Ke, Hui Xue

Abstract: Autoregressive (AR) models are promising for image generation, yet continuous-token AR variants often trail latent diffusion and masked-generation models. The core issue is heterogeneous variance in VAE latents, which is amplified during AR decoding, especially under classifier-free guidance (CFG), and can cause variance collapse. We propose SphereAR to address this issue. Its core design is to constrain all AR inputs and outputs -- including after CFG -- to lie on a fixed-radius hypersphere (constant $\ell_2$ norm), leveraging hyperspherical VAEs. Our theoretical analysis shows that hyperspherical constraint removes the scale component (the primary cause of variance collapse), thereby stabilizing AR decoding. Empirically, on ImageNet generation, SphereAR-H (943M) sets a new state of the art for AR models, achieving FID 1.34. Even at smaller scales, SphereAR-L (479M) reaches FID 1.54 and SphereAR-B (208M) reaches 1.92, matching or surpassing much larger baselines such as MAR-H (943M, 1.55) and VAR-d30 (2B, 1.92). To our knowledge, this is the first time a pure next-token AR image generator with raster order surpasses diffusion and masked-generation models at comparable parameter scales.

replace-cross Quantitative convergence of trained single layer neural networks to Gaussian processes

Authors: Eloy Mosig, Andrea Agazzi, Dario Trevisan

Abstract: In this paper, we study the quantitative convergence of shallow neural networks trained via gradient descent to their associated Gaussian processes in the infinite-width limit. While previous work has established qualitative convergence under broad settings, precise, finite-width estimates remain limited, particularly during training. We provide explicit upper bounds on the quadratic Wasserstein distance between the network output and its Gaussian approximation at any training time $t \ge 0$, demonstrating polynomial decay with network width. Our results quantify how architectural parameters, such as width and input dimension, influence convergence, and how training dynamics affect the approximation error.

replace-cross Pretraining Large Language Models with NVFP4

Authors: NVIDIA, Felix Abecassis, Anjulie Agrusa, Dong Ahn, Jonah Alben, Stefania Alborghetti, Michael Andersch, Sivakumar Arayandi, Alexis Bjorlin, Aaron Blakeman, Evan Briones, Ian Buck, Bryan Catanzaro, Muya Chang, Jinhang Choi, Mike Chrzanowski, Eric Chung, Victor Cui, Steve Dai, Bita Darvish Rouhani, Carlo del Mundo, Deena Donia, Burc Eryilmaz, Henry Estela, Abhinav Goel, Oleg Goncharov, Yugi Guvvala, Robert Hesse, Russell Hewett, Herbert Hum, Ujval Kapasi, Brucek Khailany, Mikail Khona, Nick Knight, Alex Kondratenko, Ronny Krashinsky, Ben Lanir, Simon Layton, Michael Lightstone, Daniel Lo, Paulius Micikevicius, Asit Mishra, Tim Moon, Deepak Narayanan, Chao Ni, Abhijit Paithankar, Satish Pasumarthi, Ankit Patel, Mostofa Patwary, Ashwin Poojary, Gargi Prasad, Sweta Priyadarshi, Yigong Qin, Xiaowei Ren, Oleg Rybakov, Charbel Sakr, Sanjeev Satheesh, Stas Sergienko, Pasha Shamis, Kirthi Shankar, Nishant Sharma, Mohammad Shoeybi, Michael Siu, Misha Smelyanskiy, Darko Stosic, Dusan Stosic, Bor-Yiing Su, Frank Sun, Nima Tajbakhsh, Shelby Thomas, Przemek Tredak, Evgeny Tsykunov, Gandhi Vaithilingam, Aditya Vavre, Rangharajan Venkatesan, Roger Waleffe, Qiyu Wan, Hexin Wang, Mengdi Wang, Lizzie Wei, Hao Wu, Evan Wu, Keith Wyss, Ning Xu, Jinze Xue, Charlene Yang, Yujia Zhai, Ruoxi Zhang, Jingyang Zhu, Zhongbo Zhu

Abstract: Large Language Models (LLMs) today are powerful problem solvers across many domains, and they continue to get stronger as they scale in model size, training set size, and training set quality, as shown by extensive research and experimentation across the industry. Training a frontier model today requires on the order of tens to hundreds of yottaflops, which is a massive investment of time, compute, and energy. Improving pretraining efficiency is therefore essential to enable the next generation of even more capable LLMs. While 8-bit floating point (FP8) training is now widely adopted, transitioning to even narrower precision, such as 4-bit floating point (FP4), could unlock additional improvements in computational speed and resource utilization. However, quantization at this level poses challenges to training stability, convergence, and implementation, notably for large-scale models trained on long token horizons. In this study, we introduce a novel approach for stable and accurate training of large language models (LLMs) using the NVFP4 format. Our method integrates Random Hadamard transforms (RHT) to bound block-level outliers, employs a two-dimensional quantization scheme for consistent representations across both the forward and backward passes, utilizes stochastic rounding for unbiased gradient estimation, and incorporates selective high-precision layers. We validate our approach by training a 12-billion-parameter model on 10 trillion tokens -- the longest publicly documented training run in 4-bit precision to date. Our results show that the model trained with our NVFP4-based pretraining technique achieves training loss and downstream task accuracies comparable to an FP8 baseline. These findings highlight that NVFP4, when combined with our training approach, represents a major step forward in narrow-precision LLM training algorithms.

replace-cross VidGuard-R1: AI-Generated Video Detection and Explanation via Reasoning MLLMs and RL

Authors: Kyoungjun Park, Yifan Yang, Juheon Yi, Shicheng Zheng, Yifei Shen, Dongqi Han, Caihua Shan, Muhammad Muaz, Lili Qiu

Abstract: The rapid proliferation of AI-generated video necessitates robust detection tools that offer both high accuracy and human-interpretable explanations. While existing MLLM-based detectors rely on supervised fine-tuning (SFT) or direct preference optimization (DPO), these methods are often bottlenecked by static, pre-labeled datasets that fail to capture the evolving, multi-step physical inconsistencies of modern generative models. To bridge this gap, we introduce VidGuard-R1, the first video authenticity detector to utilize group relative policy optimization (GRPO). Moving beyond passive preference matching, VidGuard-R1 employs a reinforcement learning framework that encourages the model to explore and rank multiple reasoning paths. By introducing specialized reward models for temporal stability and diffusion-aware complexity, we incentivize the model to discover 'physics-grounded' artifacts. Our contributions include: (1) a curated dataset of 140,000 challenging real/fake video pairs; (2) a GRPO-based training paradigm that achieves state-of-the-art zero-shot performance; and (3) a reasoning-first architecture that provides precise, verifiable rationales for its forensic judgments. Project website: https://vidguard-r1.github.io/.

URLs: https://vidguard-r1.github.io/.

replace-cross True Self-Supervised Novel View Synthesis is Transferable

Authors: Thomas W. Mitchel, Hyunwoo Ryu, Vincent Sitzmann

Abstract: In this paper, we identify that the key criterion for determining whether a model is truly capable of novel view synthesis (NVS) is transferability: Whether any pose representation extracted from one video sequence can be used to re-render the same camera trajectory in another. We analyze prior work on self-supervised NVS and find that their predicted poses do not transfer: The same set of poses lead to different camera trajectories in different 3D scenes. Here, we present XFactor, the first geometry-free self-supervised model capable of true NVS. XFactor combines pair-wise pose estimation with a simple augmentation scheme of the inputs and outputs that jointly enables disentangling camera pose from scene content and facilitates geometric reasoning. Remarkably, we show that XFactor achieves transferability with unconstrained latent pose variables, without any 3D inductive biases or concepts from multi-view geometry -- such as an explicit parameterization of poses as elements of SE(3). We introduce a new metric to quantify transferability, and through large-scale experiments, we demonstrate that XFactor significantly outperforms prior pose-free NVS transformers, and show that latent poses are highly correlated with real-world poses through probing experiments.

replace-cross CBF-RL: Safety Filtering Reinforcement Learning in Training with Control Barrier Functions

Authors: Lizhi Yang, Blake Werner, Massimiliano de Sa, Aaron D. Ames

Abstract: Reinforcement learning (RL), while powerful and expressive, can often prioritize performance at the expense of safety. Yet safety violations can lead to catastrophic outcomes in real-world deployments. Control Barrier Functions (CBFs) offer a principled method to enforce dynamic safety -- traditionally deployed online via safety filters. While the result is safe behavior, the fact that the RL policy does not have knowledge of the CBF can lead to conservative behaviors. This paper proposes CBF-RL, a framework for generating safe behaviors with RL by enforcing CBFs in training. CBF-RL has two key attributes: (1) minimally modifying a nominal RL policy to encode safety constraints via a CBF term, (2) and safety filtering of the policy rollouts in training. Theoretically, we prove that continuous-time safety filters can be deployed via closed-form expressions on discrete-time roll-outs. Practically, we demonstrate that CBF-RL internalizes the safety constraints in the learned policy -- both enforcing safer actions and biasing towards safer rewards -- enabling safe deployment without the need for an online safety filter. We validate our framework through ablation studies on navigation tasks and on the Unitree G1 humanoid robot, where CBF-RL enables safer exploration, faster convergence, and robust performance under uncertainty, enabling the humanoid robot to avoid obstacles and climb stairs safely in real-world settings without a runtime safety filter.

replace-cross Bayesian Inference for PDE-based Inverse Problems using the Optimization of a Discrete Loss

Authors: Lucas Amoudruz, Sergey Litvinov, Costas Papadimitriou, Petros Koumoutsakos

Abstract: Inverse problems are crucial for many applications in science, engineering and medicine that involve data assimilation, design, and imaging. Their solution infers the parameters or latent states of a complex system from noisy data and partially observable processes. When measurements are an incomplete or indirect view of the system, additional knowledge is required to accurately solve the inverse problem. Adopting a physical model of the system in the form of partial differential equations (PDEs) is a potent method to close this gap. In particular, the method of optimizing a discrete loss (ODIL) has shown great potential in terms of robustness and computational cost. In this work, we introduce B-ODIL, a Bayesian extension of ODIL, that integrates the PDE loss of ODIL as prior knowledge and combines it with a likelihood describing the data. B-ODIL employs a Bayesian formulation of PDE-based inverse problems to infer solutions with quantified uncertainties. We demonstrate the capabilities of B-ODIL in a series of synthetic benchmarks involving PDEs in one, two, and three dimensions. We showcase the application of B-ODIL in estimating tumor concentration and its uncertainty in a patient's brain from MRI scans using a three-dimensional tumor growth model.

replace-cross Schr\"odinger Bridge Mamba for One-Step Speech Enhancement

Authors: Jing Yang, Sirui Wang, Chao Wu, Lei Guo, Fan Fan

Abstract: We present Schr\"odinger Bridge Mamba (SBM), a novel model for efficient speech enhancement by integrating the Schr\"odinger Bridge (SB) training paradigm and the Mamba architecture. Experiments of joint denoising and dereverberation tasks demonstrate SBM outperforms strong generative and discriminative methods on multiple metrics with only one step of inference while achieving a competitive real-time factor for streaming feasibility. Ablation studies reveal that the SB paradigm consistently yields improved performance across diverse architectures over conventional mapping. Furthermore, Mamba exhibits a stronger performance under the SB paradigm compared to Multi-Head Self-Attention (MHSA) and Long Short-Term Memory (LSTM) backbones. These findings highlight the synergy between the Mamba architecture and the SB trajectory-based training, providing a high-quality solution for real-world speech enhancement. Demo page: https://sbmse.github.io

URLs: https://sbmse.github.io

replace-cross Generalization Below the Edge of Stability: The Role of Data Geometry

Authors: Tongtong Liang, Alexander Cloninger, Rahul Parhi, Yu-Xiang Wang

Abstract: Understanding generalization in overparameterized neural networks hinges on the interplay between the data geometry, neural architecture, and training dynamics. In this paper, we theoretically explore how data geometry controls this implicit bias. This paper presents theoretical results for overparametrized two-layer ReLU networks trained below the edge of stability. First, for data distributions supported on a mixture of low-dimensional balls, we derive generalization bounds that provably adapt to the intrinsic dimension. Second, for a family of isotropic distributions that vary in how strongly probability mass concentrates toward the unit sphere, we derive a spectrum of bounds showing that rates deteriorate as the mass concentrates toward the sphere. These results instantiate a unifying principle: When the data is harder to "shatter" with respect to the activation thresholds of the ReLU neurons, gradient descent tends to learn representations that capture shared patterns and thus finds solutions that generalize well. On the other hand, for data that is easily shattered (e.g., data supported on the sphere) gradient descent favors memorization. Our theoretical results consolidate disparate empirical findings that have appeared in the literature.

replace-cross Testing Most Influential Sets

Authors: Lucas Darius Konrad, Nikolas Kuschnig

Abstract: Small influential data subsets can dramatically impact model conclusions, with a few data points overturning key findings. While recent work identifies these most influential sets, there is no formal way to tell when maximum influence is excessive rather than expected under natural random sampling variation. We address this gap by developing a principled framework for most influential sets. Focusing on linear least-squares, we derive a convenient exact influence formula and identify the extreme value distributions of maximal influence - the heavy-tailed Fr\'echet for constant-size sets and heavy-tailed data, and the well-behaved Gumbel for growing sets or light tails. This allows us to conduct rigorous hypothesis tests for excessive influence. We demonstrate through applications across economics, biology, and machine learning benchmarks, resolving contested findings and replacing ad-hoc heuristics with rigorous inference.

replace-cross Auto-Adaptive PINNs with Applications to Phase Transitions

Authors: Kevin Buck, Woojeong Kim

Abstract: We propose an adaptive sampling method for the training of Physics Informed Neural Networks (PINNs) which allows for sampling based on an arbitrary problem-specific heuristic which may depend on the network and its gradients. In particular we focus our analysis on the Allen-Cahn equations, attempting to accurately resolve the characteristic interfacial regions using a PINN without any post-hoc resampling. In experiments, we show the effectiveness of these methods over residual-adaptive frameworks.

replace-cross FMint-SDE: A Multimodal Foundation Model for Accelerating Numerical Simulation of SDEs via Error Correction

Authors: Jiaxin Yuan, Haizhao Yang, Maria Cameron

Abstract: Fast and accurate simulation of dynamical systems is a fundamental challenge across scientific and engineering domains. Traditional numerical integrators often face a trade-off between accuracy and computational efficiency, while existing neural network-based approaches typically require training a separate model for each case. To overcome these limitations, we introduce a novel multi-modal foundation model for large-scale simulations of differential equations: FMint-SDE (Foundation Model based on Initialization for stochastic differential equations). Based on a decoder-only transformer with in-context learning, FMint-SDE leverages numerical and textual modalities to learn a universal error-correction scheme. It is trained using prompted sequences of coarse solutions generated by conventional solvers, enabling broad generalization across diverse systems. We evaluate our models on a suite of challenging SDE benchmarks spanning applications in molecular dynamics, mechanical systems, finance, and biology. Experimental results show that our approach achieves a superior accuracy-efficiency tradeoff compared to classical solvers, underscoring the potential of FMint-SDE as a general-purpose simulation tool for dynamical systems.

replace-cross MotionStream: Real-Time Video Generation with Interactive Motion Controls

Authors: Joonghyuk Shin, Zhengqi Li, Richard Zhang, Jun-Yan Zhu, Jaesik Park, Eli Shechtman, Xun Huang

Abstract: Current motion-conditioned video generation methods suffer from prohibitive latency (minutes per video) and non-causal processing that prevents real-time interaction. We present MotionStream, enabling sub-second latency with up to 29 FPS streaming generation on a single GPU. Our approach begins by augmenting a text-to-video model with motion control, which generates high-quality videos that adhere to the global text prompt and local motion guidance, but does not perform inference on the fly. As such, we distill this bidirectional teacher into a causal student through Self Forcing with Distribution Matching Distillation, enabling real-time streaming inference. Several key challenges arise when generating videos of long, potentially infinite time-horizons -- (1) bridging the domain gap from training on finite length and extrapolating to infinite horizons, (2) sustaining high quality by preventing error accumulation, and (3) maintaining fast inference, without incurring growth in computational cost due to increasing context windows. A key to our approach is introducing carefully designed sliding-window causal attention, combined with attention sinks. By incorporating self-rollout with attention sinks and KV cache rolling during training, we properly simulate inference-time extrapolations with a fixed context window, enabling constant-speed generation of arbitrarily long videos. Our models achieve state-of-the-art results in motion following and video quality while being two orders of magnitude faster, uniquely enabling infinite-length streaming. With MotionStream, users can paint trajectories, control cameras, or transfer motion, and see results unfold in real-time, delivering a truly interactive experience.

replace-cross CytoNet: A Foundation Model for the Human Cerebral Cortex at Cellular Resolution

Authors: Christian Schiffer, Zeynep Boztoprak, Jan-Oliver Kropp, Julia Th\"onni{\ss}en, Katia Berr, Hannah Spitzer, Katrin Amunts, Timo Dickscheid

Abstract: Studying the cellular architecture of the human cerebral cortex is critical for understanding brain organization and function. It requires investigating complex texture patterns in histological images, yet automatic methods that scale across whole brains are still lacking. Here we introduce CytoNet, a foundation model trained on 1 million unlabeled microscopic image patches from over 4,000 histological sections spanning ten postmortem human brains. Using co-localization in the cortical sheet for self-supervision, CytoNet encodes complex cellular patterns into expressive and anatomically meaningful feature representations. CytoNet supports multiple downstream applications, including area classification, laminar segmentation, quantification of microarchitectural variation, and data-driven mapping of previously uncharted areas. In addition, CytoNet captures microarchitectural signatures of macroscale functional organization, enabling decoding of functional network parcellations from cytoarchitectonic features. Together, these results establish CytoNet as a unified framework for scalable analysis of cortical microarchitecture and for linking cellular architecture to structure-function organization in the human cerebral cortex.

replace-cross CoRPO: Adding a Correctness Bias to GRPO Improves Generalization

Authors: Anisha Garg, Claire Zhang, Nishit Neema, David Bick, Ganesh Venkatesh, Joel Hestness

Abstract: Group-Relative Policy Optimization (GRPO) has emerged as the standard for training reasoning capabilities in large language models through reinforcement learning. By estimating advantages using group-mean rewards rather than a learned critic, GRPO has enabled efficient scaling of reinforcement learning from verifiable rewards (RLVR). However, we identify a fundamental limitation: GRPO's mean baseline can assign positive advantages to incorrect solutions simply because they outperform a poorly-performing group average. It leads to overestimation of advantages and reinforcement of incorrect behaviours. To address this, we propose Correctness-Relative Policy Optimization (CoRPO), a simple modification to the GRPO objective that clips the minimum baseline to a fixed correctness threshold. We show that baseline clipping introduces a protective bias to advantage estimation that mitigates overfitting while preserving effective exploration. Empirically, CoRPO-trained models improve cross-domain reasoning, generalizing more consistently to out-of-domain (OOD) tasks. When trained on coding tasks, CoRPO outperforms GRPO on math, and vice-versa, indicating that CoRPO learns robust, transferable reasoning patterns rather than task-specific solutions.

replace-cross Observer-Actor: Active Vision Imitation Learning with Sparse-View Gaussian Splatting

Authors: Yilong Wang, Cheng Qian, Ruomeng Fan, Edward Johns

Abstract: We propose Observer Actor (ObAct), a novel framework for active vision imitation learning in which the observer moves to optimal visual observations for the actor. We study ObAct on a dual-arm robotic system equipped with wrist-mounted cameras. At test time, ObAct dynamically assigns observer and actor roles: the observer arm constructs a 3D Gaussian Splatting (3DGS) representation from three images, virtually explores this to find an optimal camera pose, then moves to this pose; the actor arm then executes a policy using the observer's observations. This formulation enhances the clarity and visibility of both the object and the gripper in the policy's observations. As a result, we enable the training of ambidextrous policies on observations that remain closer to the occlusion-free training distribution, leading to more robust policies. We study this formulation with two existing imitation learning methods -- trajectory transfer and behavior cloning -- and experiments show that ObAct significantly outperforms static-camera setups: trajectory transfer improves by 145% without occlusion and 233% with occlusion, while behavior cloning improves by 75% and 143%, respectively. Videos are available at https://obact.github.io.

URLs: https://obact.github.io.

replace-cross CycleChemist: A Dual-Pronged Machine Learning Framework for Organic Photovoltaic Discovery

Authors: Hou Hei Lam, Jiangjie Qiu, Xiuyuan Hu, Wentao Li, Fankun Zeng, Siwei Fu, Hao Zhang, Xiaonan Wang

Abstract: Organic photovoltaic (OPV) materials offer a promising path toward sustainable energy generation, but their development is limited by the difficulty of identifying high performance donor and acceptor pairs with strong power conversion efficiencies (PCEs). Existing design strategies typically focus on either the donor or the acceptor alone, rather than using a unified approach capable of modeling both components. In this work, we introduce a dual machine learning framework for OPV discovery that combines predictive modeling with generative molecular design. We present the Organic Photovoltaic Donor Acceptor Dataset (OPV2D), the largest curated dataset of its kind, containing 2000 experimentally characterized donor acceptor pairs. Using this dataset, we develop the Organic Photovoltaic Classifier (OPVC) to predict whether a material exhibits OPV behavior, and a hierarchical graph neural network that incorporates multi task learning and donor acceptor interaction modeling. This framework includes the Molecular Orbital Energy Estimator (MOE2) for predicting HOMO and LUMO energy levels, and the Photovoltaic Performance Predictor (P3) for estimating PCE. In addition, we introduce the Material Generative Pretrained Transformer (MatGPT) to produce synthetically accessible organic semiconductors, guided by a reinforcement learning strategy with three objective policy optimization. By linking molecular representation learning with performance prediction, our framework advances data driven discovery of high performance OPV materials.

replace-cross DPAC: Distribution-Preserving Adversarial Control for Diffusion Sampling

Authors: Han-Jin Lee, Han-Ju Lee, Jin-Seong Kim, Seok-Hwan Choi

Abstract: Adversarially guided diffusion sampling often achieves the target class, but sample quality degrades as deviations between the adversarially controlled and nominal trajectories accumulate. We formalize this degradation as a path-space Kullback-Leibler divergence(path-KL) between controlled and nominal (uncontrolled) diffusion processes, thereby showing via Girsanov's theorem that it exactly equals the control energy. Building on this stochastic optimal control (SOC) view, we theoretically establish that minimizing this path-KL simultaneously tightens upper bounds on both the 2-Wasserstein distance and Fr\'echet Inception Distance (FID), revealing a principled connection between adversarial control energy and perceptual fidelity. From a variational perspective, we derive a first-order optimality condition for the control: among all directions that yield the same classification gain, the component tangent to iso-(log-)density surfaces (i.e., orthogonal to the score) minimizes path-KL, whereas the normal component directly increases distributional drift. This leads to DPAC (Distribution-Preserving Adversarial Control), a diffusion guidance rule that projects adversarial gradients onto the tangent space defined by the generative score geometry. We further show that in discrete solvers, the tangent projection cancels the O({\Delta}t) leading error term in the Wasserstein distance, achieving an O({\Delta}t^2) quality gap; moreover, it remains second-order robust to score or metric approximation. Empirical studies on ImageNet-100 validate the theoretical predictions, confirming that DPAC achieves lower FID and estimated path-KL at matched attack success rates.

replace-cross GRAND: Guidance, Rebalancing, and Assignment for Networked Dispatch in Multi-Agent Path Finding

Authors: Johannes Gaber, Meshal Alharbi, Daniele Gammelli, Gioele Zardini

Abstract: Large robot fleets are now common in warehouses and other logistics settings, where small control gains translate into large operational impacts. In this article, we address task scheduling for lifelong Multi-Agent Pickup-and-Delivery (MAPD) and propose a hybrid method that couples learning-based global guidance with lightweight optimization. A graph neural network policy trained via reinforcement learning outputs a desired distribution of free agents over an aggregated warehouse graph. This signal is converted into region-to-region rebalancing through a minimum-cost flow, and finalized by small, local assignment problems, preserving accuracy while keeping per-step latency within a 1 s compute budget. We call this approach GRAND: a hierarchical algorithm that relies on Guidance, Rebalancing, and Assignment to explicitly leverage the workspace Network structure and Dispatch agents to tasks. On congested warehouse benchmarks from the League of Robot Runners (LoRR) with up to 500 agents, our approach improves throughput by up to 10% over the 2024 winning scheduler while maintaining real-time execution. The results indicate that coupling graph-structured learned guidance with tractable solvers reduces congestion and yields a practical, scalable blueprint for high-throughput scheduling in large fleets.

replace-cross NeuralRemaster: Phase-Preserving Diffusion for Structure-Aligned Generation

Authors: Yu Zeng, Charles Ochoa, Mingyuan Zhou, Vishal M. Patel, Vitor Guizilini, Rowan McAllister

Abstract: Standard diffusion corrupts data using Gaussian noise whose Fourier coefficients have random magnitudes and random phases. While effective for unconditional or text-to-image generation, corrupting phase components destroys spatial structure, making it ill-suited for tasks requiring geometric consistency, such as re-rendering, simulation enhancement, and image-to-image translation. We introduce Phase-Preserving Diffusion (\phi-PD), a model-agnostic reformulation of the diffusion process that preserves input phase while randomizing magnitude, enabling structure-aligned generation without architectural changes or additional parameters. We further propose Frequency-Selective Structured (FSS) noise, which provides continuous control over structural rigidity via a single frequency-cutoff parameter. \phi-PD adds no inference-time cost and is compatible with any diffusion model for images or videos. Across photorealistic and stylized re-rendering, as well as sim-to-real enhancement for driving planners, \phi-PD produces controllable, spatially aligned results. When applied to the CARLA simulator, \phi-PD significantly improves sim-to-real planner transfer performance. The method is complementary to existing conditioning approaches and broadly applicable to image-to-image and video-to-video generation. Videos, additional examples, and code are available on our \href{https://yuzeng-at-tri.github.io/ppd-page/}{project page}.

URLs: https://yuzeng-at-tri.github.io/ppd-page/

replace-cross Symmetric Aggregation of Conformity Scores for Efficient Uncertainty Sets

Authors: Nabil Alami, Jad Zakharia, Souhaib Ben Taieb

Abstract: Access to multiple predictive models trained for the same task, whether in regression or classification, is increasingly common in many applications. Aggregating their predictive uncertainties to produce reliable and efficient uncertainty quantification is therefore a critical but still underexplored challenge, especially within the framework of conformal prediction (CP). While CP methods can generate individual prediction sets from each model, combining them into a single, more informative set remains a challenging problem. To address this, we propose SACP (Symmetric Aggregated Conformal Prediction), a novel method that aggregates nonconformity scores from multiple predictors. SACP transforms these scores into e-values and combines them using any symmetric aggregation function. This flexible design enables a robust, data-driven framework for selecting aggregation strategies that yield sharper prediction sets. We also provide theoretical insights that help justify the validity and performance of the SACP approach. Extensive experiments on diverse datasets show that SACP consistently improves efficiency and often outperforms state-of-the-art model aggregation baselines.

replace-cross ReFusion: A Diffusion Large Language Model with Parallel Autoregressive Decoding

Authors: Jia-Nan Li, Jian Guan, Wei Wu, Chongxuan Li

Abstract: Autoregressive models (ARMs) are hindered by slow sequential inference. While masked diffusion models (MDMs) offer a parallel alternative, they suffer from critical drawbacks: high computational overhead from precluding Key-Value (KV) caching, and incoherent generation arising from learning dependencies over an intractable space of token combinations. To address these limitations, we introduce \textsc{ReFusion}, a novel masked diffusion model that integrates sequence reorganization into the causal attention framework. By elevating parallel decoding from the token level to a higher slot level, \textsc{ReFusion} interleaves inter-slot diffusion-based selection with intra-slot autoregressive infilling, while reordering newly generated slots ahead of the remaining masks after each iteration. Consequently, this design simultaneously unlocks full KV cache reuse and reduces learning complexity from an intractable token combination space to a manageable slot-level permutation space. Extensive experiments on seven diverse benchmarks show that \textsc{ReFusion} not only overwhelmingly surpasses prior MDMs with a 34\% performance gain and an over 18$\times$ speedup on average, but also bridges the performance gap to strong ARMs while maintaining a 2.33$\times$ average speedup.

replace-cross Agentic Multi-Persona Framework for Evidence-Aware Fake News Detection

Authors: Roopa Bukke, Soumya Pandey, Suraj Kumar, Soumi Chattopadhyay, Chandranath Adak

Abstract: The rapid proliferation of online misinformation threatens the stability of digital social systems and poses significant risks to public trust, policy, and safety, necessitating reliable automated fake news detection. Existing methods often struggle with multimodal content, domain generalization, and explainability. We propose AMPEND-LS, an agentic multi-persona evidence-grounded framework with LLM-SLM synergy for multimodal fake news detection. AMPEND-LS integrates textual, visual, and contextual signals through a structured reasoning pipeline powered by LLMs, augmented with reverse image search, knowledge graph paths, and persuasion strategy analysis. To improve reliability, we introduce a credibility fusion mechanism combining semantic similarity, domain trustworthiness, and temporal context, and a complementary SLM classifier to mitigate LLM uncertainty and hallucinations. Extensive experiments across three benchmark datasets demonstrate that AMPEND-LS consistently outperformed state-of-the-art baselines in accuracy, F1 score, and robustness. Qualitative case studies further highlight its transparent reasoning and resilience against evolving misinformation. This work advances the development of adaptive, explainable, and evidence-aware systems for safeguarding online information integrity.

replace-cross Parallel Token Prediction for Language Models

Authors: Felix Draxler, Justus Will, Farrin Marouf Sofian, Theofanis Karaletsos, Sameer Singh, Stephan Mandt

Abstract: Autoregressive decoding in language models is inherently slow, generating only one token per forward pass. We propose Parallel Token Prediction (PTP), a general-purpose framework for predicting multiple tokens in a single model call. PTP moves the source of randomness from post-hoc sampling to random input variables, making future tokens deterministic functions of those inputs and thus jointly predictable in a single forward pass. We prove that a single PTP call can represent arbitrary dependencies between tokens. PTP is trained by distilling an existing model or through inverse autoregressive training without a teacher. Experimentally, PTP achieves a 2.4x speedup on a diverse-task speculative decoding benchmark. We provide code and checkpoints at https://github.com/mandt-lab/ptp.

URLs: https://github.com/mandt-lab/ptp.

replace-cross Prediction of Cellular Malignancy Using Electrical Impedance Signatures and Supervised Machine Learning

Authors: Shadeeb Hossain

Abstract: Bioelectrical properties of cells such as relative permittivity, conductivity, and characteristic time constants vary significantly between healthy and malignant cells across different frequencies. These distinctions provide a promising foundation for diagnostic and classification applications. This study systematically reviewed 33 scholarly articles to compile datasets of quantitative bioelectric parameters and evaluated their utility in predictive modeling. Three supervised machine learning algorithms- Random Forest (RF), Support Vector Machine (SVM), and K-Nearest Neighbor (KNN) were implemented and tuned using key hyperparameters to assess classification performance. Model effectiveness was evaluated using accuracy and F1 score as performance metrics. Results demonstrate that Random Forest achieved the highest predictive accuracy of ~ 90% when configured with a maximum depth of 4 and 100 estimators. These findings highlight the potential of integrating bioelectrical property analysis with machine learning for improved diagnostic decision-making. Similarly, for KNN and SVM, the F1 score peaked at approximately 78% and 76.5%, respectively. Future work will explore incorporating additional discriminative features, leveraging stimulated datasets, and optimizing hyperparameter through advanced search strategies. Ultimately, hardware prototype with embedded micro-electrodes and real-time control systems could pave the path for practical diagnostic tools capable of in-situ cell classification.

replace-cross EmboTeam: Grounding LLM Reasoning into Reactive Behavior Trees via PDDL for Embodied Multi-Robot Collaboration

Authors: Haishan Zeng, Mengna Wang, Peng Li

Abstract: In embodied artificial intelligence, enabling heterogeneous robot teams to execute long-horizon tasks from high-level instructions remains a critical challenge. While large language models (LLMs) show promise in instruction parsing and preliminary planning, they exhibit limitations in long-term reasoning and dynamic multi-robot coordination. We propose EmboTeam, a novel embodied multi-robot task planning framework that addresses these issues through a three-stage cascaded architecture: 1) It leverages an LLM to parse instructions and generate Planning Domain Definition Language (PDDL) problem descriptions, thereby transforming commands into formal planning problems; 2) It combines the semantic reasoning of LLMs with the search capabilities of a classical planner to produce optimized action sequences; 3) It compiles the resulting plan into behavior trees for reactive control. The framework supports dynamically sized heterogeneous robot teams via a shared blackboard mechanism for communication and state synchronization. To validate our approach, we introduce the MACE-THOR benchmark dataset, comprising 42 complex tasks across 8 distinct household layouts. Experiments show EmboTeam improves the task success rate from 12% to 55% and goal condition recall from 32% to 72% over the LaMMA-P baseline.

replace-cross Agentic Very Long Video Understanding

Authors: Aniket Rege, Arka Sadhu, Yuliang Li, Kejie Li, Ramya Korlakai Vinayak, Yuning Chai, Yong Jae Lee, Hyo Jin Kim

Abstract: The advent of always-on personal AI assistants, enabled by all-day wearable devices such as smart glasses, demands a new level of contextual understanding, one that goes beyond short, isolated events to encompass the continuous, longitudinal stream of egocentric video. Achieving this vision requires advances in long-horizon video understanding, where systems must interpret and recall visual and audio information spanning days or even weeks. Existing methods, including large language models and retrieval-augmented generation, are constrained by limited context windows and lack the ability to perform compositional, multi-hop reasoning over very long video streams. In this work, we address these challenges through EGAgent, an enhanced agentic framework centered on entity scene graphs, which represent people, places, objects, and their relationships over time. Our system equips a planning agent with tools for structured search and reasoning over these graphs, as well as hybrid visual and audio search capabilities, enabling detailed, cross-modal, and temporally coherent reasoning. Experiments on the EgoLifeQA and Video-MME (Long) datasets show that our method achieves state-of-the-art performance on EgoLifeQA (57.5%) and competitive performance on Video-MME (Long) (74.1%) for complex longitudinal video understanding tasks. Code is available at https://github.com/facebookresearch/egagent.

URLs: https://github.com/facebookresearch/egagent.

replace-cross Improved Convergence Rates of Muon Optimizer for Nonconvex Optimization

Authors: Shuntaro Nagashima, Hideaki Iiduka

Abstract: The Muon optimizer has recently attracted attention due to its orthogonalized first-order updates, and a deeper theoretical understanding of its convergence behavior is essential for guiding practical applications; however, existing convergence guarantees are either coarse or obtained under restrictive analytical settings. In this work, we establish sharper convergence guarantees for the Muon optimizer through a direct and simplified analysis that does not rely on restrictive assumptions on the update rule. Our results improve upon existing bounds by achieving faster convergence rates while covering a broader class of problem settings. These findings provide a more accurate theoretical characterization of Muon and offer insights applicable to a broader class of orthogonalized first-order methods.

replace-cross Latent-IMH: Efficient Bayesian Inference for Inverse Problems with Approximate Operators

Authors: Youguang Chen, George Biros

Abstract: We study sampling from posterior distributions in Bayesian linear inverse problems where $A$, the parameters to observables operator, is computationally expensive. In many applications, $A$ can be factored in a manner that facilitates the construction of a cost-effective approximation $\tilde{A}$. In this framework, we introduce Latent-IMH, a sampling method based on the Metropolis-Hastings independence (IMH) sampler. Latent-IMH first generates intermediate latent variables using the approximate $\tilde{A}$, and then refines them using the exact $A$. Its primary benefit is that it shifts the computational cost to an offline phase. We theoretically analyze the performance of Latent-IMH using KL divergence and mixing time bounds. Using numerical experiments on several model problems, we show that, under reasonable assumptions, it outperforms state-of-the-art methods such as the No-U-Turn sampler (NUTS) in computational efficiency. In some cases, Latent-IMH can be orders of magnitude faster than existing schemes.

replace-cross LatentChem: From Textual CoT to Latent Thinking in Chemical Reasoning

Authors: Xinwu Ye, Yicheng Mao, Jia Zhang, Yimeng Liu, Li Hao, Fang Wu, Zhiwei Li, Yuxuan Liao, Zehong Wang, Zhiyuan Liu, Zhenfei Yin, Li Yuan, Philip Torr, Huan Sun, Xiangxiang Zeng, Mengdi Wang, Le Cong, Shenghua Gao, Xiangru Tang

Abstract: Chemical large language models (LLMs) predominantly rely on explicit Chain-of-Thought (CoT) in natural language to perform complex reasoning. However, chemical reasoning is inherently continuous and structural, and forcing it into discrete linguistic tokens introduces a fundamental representation mismatch that constrains both efficiency and performance. We introduce LatentChem, a latent reasoning interface that decouples chemical computation from textual generation, enabling models to perform multi-step reasoning directly in continuous latent space while emitting language only for final outputs. Remarkably, we observe a consistent emergent behavior: when optimized solely for task success, models spontaneously internalize reasoning, progressively abandoning verbose textual derivations in favor of implicit latent computation. This shift is not merely stylistic but computationally advantageous. Across diverse chemical reasoning benchmarks, LatentChem achieves a 59.88\% non-tie win rate over strong CoT-based baselines on ChemCoTBench, while delivering a 10.84$\times$ average inference speedup. Our results provide empirical evidence that chemical reasoning is more naturally and effectively realized as continuous latent dynamics rather than discretized linguistic trajectories.

replace-cross Optimal training-conditional regret for online conformal prediction

Authors: Jiadong Liang, Zhimei Ren, Yuxin Chen

Abstract: We study online conformal prediction for non-stationary data streams subject to unknown distribution drift. While most prior work studied this problem under adversarial settings and/or assessed performance in terms of gaps of time-averaged marginal coverage, we instead evaluate performance through training-conditional cumulative regret. We specifically focus on independently generated data with two types of distribution shift: abrupt change points and smooth drift. When non-conformity score functions are pretrained on an independent dataset, we propose a split-conformal style algorithm that leverages drift detection to adaptively update calibration sets, which provably achieves minimax-optimal regret. When non-conformity scores are instead trained online, we develop a full-conformal style algorithm that again incorporates drift detection to handle non-stationarity; this approach relies on stability - rather than permutation symmetry - of the model-fitting algorithm, which is often better suited to online learning under evolving environments. We establish non-asymptotic regret guarantees for our online full conformal algorithm, which match the minimax lower bound under appropriate restrictions on the prediction sets. Numerical experiments corroborate our theoretical findings.

replace-cross CityGuard: Graph-Aware Private Descriptors for Bias-Resilient Identity Search Across Urban Cameras

Authors: Rong Fu, Yibo Meng, Jia Yee Tan, Jiaxuan Lu, Rui Lu, Jiekai Wu, Zhaolu Kang, Simon Fong

Abstract: City-scale person re-identification across distributed cameras must handle severe appearance changes from viewpoint, occlusion, and domain shift while complying with data protection rules that prevent sharing raw imagery. We introduce CityGuard, a topology-aware transformer for privacy-preserving identity retrieval in decentralized surveillance. The framework integrates three components. A dispersion-adaptive metric learner adjusts instance-level margins according to feature spread, increasing intra-class compactness. Spatially conditioned attention injects coarse geometry, such as GPS or deployment floor plans, into graph-based self-attention to enable projectively consistent cross-view alignment using only coarse geometric priors without requiring survey-grade calibration. Differentially private embedding maps are coupled with compact approximate indexes to support secure and cost-efficient deployment. Together these designs produce descriptors robust to viewpoint variation, occlusion, and domain shifts, and they enable a tunable balance between privacy and utility under rigorous differential-privacy accounting. Experiments on Market-1501 and additional public benchmarks, complemented by database-scale retrieval studies, show consistent gains in retrieval precision and query throughput over strong baselines, confirming the practicality of the framework for privacy-critical urban identity matching.

replace-cross RA-QA: A Benchmarking System for Respiratory Audio Question Answering Under Real-World Heterogeneity

Authors: Gaia A. Bertolino, Yuwei Zhang, Tong Xia, Domenico Talia, Cecilia Mascolo

Abstract: As conversational multimodal AI tools are increasingly adopted to process patient data for health assessment, robust benchmarks are needed to measure progress and expose failure modes under realistic conditions. Despite the importance of respiratory audio for mobile health screening, respiratory audio question answering remains underexplored, with existing studies evaluated narrowly and lacking real-world heterogeneity across modalities, devices, and question types. We hence introduce the Respiratory-Audio Question-Answering (RA-QA) benchmark, including a standardized data generation pipeline, a comprehensive multimodal QA collection, and a unified evaluation protocol. RA-QA harmonizes public RA datasets into a collection of 9 million format-diverse QA pairs covering diagnostic and contextual attributes. We benchmark classical ML baselines alongside multimodal audio-language models, establishing reproducible reference points and showing how current approaches fail under heterogeneity.

replace-cross Lap2: Revisiting Laplace DP-SGD for High Dimensions via Majorization Theory

Authors: Meisam Mohammady, Qin Yang, Nicholas Stout, Ayesha Samreen, Han Wang, Christopher J Quinn, Yuan Hong

Abstract: Differentially Private Stochastic Gradient Descent (DP-SGD) is a cornerstone technique for ensuring privacy in deep learning, widely used in both training from scratch and fine-tuning large-scale language models. While DP-SGD predominantly relies on the Gaussian mechanism, the Laplace mechanism remains underutilized due to its reliance on L1 norm clipping. This constraint severely limits its practicality in high-dimensional models because the L1 norm of an n-dimensional gradient can be up to sqrt(n) times larger than its L2 norm. As a result, the required noise scale grows significantly with model size, leading to poor utility or untrainable models. In this work, we introduce Lap2, a new solution that enables L2 clipping for Laplace DP-SGD while preserving strong privacy guarantees. We overcome the dimensionality-driven clipping barrier by computing coordinate-wise moment bounds and applying majorization theory to construct a tight, data-independent upper bound over the full model. By exploiting the Schur-convexity of the moment accountant function, we aggregate these bounds using a carefully designed majorization set that respects the L2 clipping constraint. This yields a multivariate privacy accountant that scales gracefully with model dimension and enables the use of thousands of moments. Empirical evaluations demonstrate that our approach significantly improves the performance of Laplace DP-SGD, achieving results comparable to or better than Gaussian DP-SGD under strong privacy constraints. For instance, fine-tuning RoBERTa-base (125M parameters) on SST-2 achieves 87.88% accuracy at epsilon=0.54, outperforming Gaussian (87.16%) and standard Laplace (48.97%) under the same budget.

replace-cross Inference-time optimization for experiment-grounded protein ensemble generation

Authors: Advaith Maddipatla, Anar Rzayev, Marco Pegoraro, Martin Pacesa, Paul Schanda, Ailie Marx, Sanketh Vedula, Alex M. Bronstein

Abstract: Protein function relies on dynamic conformational ensembles, yet current generative models like AlphaFold3 often fail to produce ensembles that match experimental data. Recent experiment-guided generators attempt to address this by steering the reverse diffusion process. However, these methods are limited by fixed sampling horizons and sensitivity to initialization, often yielding thermodynamically implausible results. We introduce a general inference-time optimization framework to solve these challenges. First, we optimize over latent representations to maximize ensemble log-likelihood, rather than perturbing structures post hoc. This approach eliminates dependence on diffusion length, removes initialization bias, and easily incorporates external constraints. Second, we present novel sampling schemes for drawing Boltzmann-weighted ensembles. By combining structural priors from AlphaFold3 with force-field-based priors, we sample from their product distribution while balancing experimental likelihoods. Our results show that this framework consistently outperforms state-of-the-art guidance, improving diversity, physical energy, and agreement with data in X-ray crystallography and NMR, often fitting the experimental data better than deposited PDB structures. Finally, inference-time optimization experiments maximizing ipTM scores reveal that perturbing AlphaFold3 embeddings can artificially inflate model confidence. This exposes a vulnerability in current design metrics, whose mitigation could offer a pathway to reduce false discovery rates in binder engineering.

replace-cross Jailbreak Foundry: From Papers to Runnable Attacks for Reproducible Benchmarking

Authors: Zhicheng Fang, Jingjie Zheng, Chenxu Fu, Wei Xu

Abstract: Jailbreak techniques for large language models (LLMs) evolve faster than benchmarks, making robustness estimates stale and difficult to compare across papers due to drift in datasets, harnesses, and judging protocols. We introduce JAILBREAK FOUNDRY (JBF), a system that addresses this gap via a multi-agent workflow to translate jailbreak papers into executable modules for immediate evaluation within a unified harness. JBF features three core components: (i) JBF-LIB for shared contracts and reusable utilities; (ii) JBF-FORGE for the multi-agent paper-to-module translation; and (iii) JBF-EVAL for standardizing evaluations. Across 30 reproduced attacks, JBF achieves high fidelity with a mean (reproduced-reported) attack success rate (ASR) deviation of +0.26 percentage points. By leveraging shared infrastructure, JBF reduces attack-specific implementation code by nearly half relative to original repositories and achieves an 82.5% mean reused-code ratio. This system enables a standardized AdvBench evaluation of all 30 attacks across 10 victim models using a consistent GPT-4o judge. By automating both attack integration and standardized evaluation, JBF offers a scalable solution for creating living benchmarks that keep pace with the rapidly shifting security landscape.

replace-cross DiffusionHarmonizer: Bridging Neural Reconstruction and Photorealistic Simulation with Online Diffusion Enhancer

Authors: Yuxuan Zhang, Katar\'ina T\'othov\'a, Zian Wang, Kangxue Yin, Haithem Turki, Riccardo de Lutio, Yen-Yu Chang, Or Litany, Sanja Fidler, Zan Gojcic

Abstract: Simulation is essential to the development and evaluation of autonomous robots such as self-driving vehicles. Neural reconstruction is emerging as a promising solution as it enables simulating a wide variety of scenarios from real-world data alone in an automated and scalable way. However, while methods such as NeRF and 3D Gaussian Splatting can produce visually compelling results, they often exhibit artifacts particularly when rendering novel views, and fail to realistically integrate inserted dynamic objects, especially when they were captured from different scenes. To overcome these limitations, we introduce DiffusionHarmonizer, an online generative enhancement framework that transforms renderings from such imperfect scenes into temporally consistent outputs while improving their realism. At its core is a single-step temporally-conditioned enhancer that is converted from a pretrained multi-step image diffusion model, capable of running in online simulators on a single GPU. The key to training it effectively is a custom data curation pipeline that constructs synthetic-real pairs emphasizing appearance harmonization, artifact correction, and lighting realism. The result is a scalable system that significantly elevates simulation fidelity in both research and production environments.

replace-cross Fine-grained Soundscape Control for Augmented Hearing

Authors: Seunghyun Oh, Malek Itani, Aseem Gauri, Shyamnath Gollakota

Abstract: Hearables are becoming ubiquitous, yet their sound controls remain blunt: users can either enable global noise suppression or focus on a single target sound. Real-world acoustic scenes, however, contain many simultaneous sources that users may want to adjust independently. We introduce Aurchestra, the first system to provide fine-grained, real-time soundscape control on resource-constrained hearables. Our system has two key components: (1) a dynamic interface that surfaces only active sound classes and (2) a real-time, on-device multi-output extraction network that generates separate streams for each selected class, achieving robust performance for upto 5 overlapping target sounds, and letting users mix their environment by customizing per-class volumes, much like an audio engineer mixes tracks. We optimize the model architecture for multiple compute-limited platforms and demonstrate real-time performance on 6 ms streaming audio chunks. Across real-world environments in previously unseen indoor and outdoor scenarios, our system enables expressive per-class sound control and achieves substantial improvements in target-class enhancement and interference suppression. Our results show that the world need not be heard as a single, undifferentiated stream: with Aurchestra, the soundscape becomes truly programmable.

replace-cross Agents Learn Their Runtime: Interpreter Persistence as Training-Time Semantics

Authors: Victor May, Aaditya Salgarkar, Yishan Wang, Diganta Misra, Huu Nguyen

Abstract: Tool-augmented LLMs are increasingly deployed as agents that interleave natural-language reasoning with executable Python actions, as in CodeAct-style frameworks. In deployment, these agents rely on runtime state that persists across steps. By contrast, the traces used to post-train these models rarely encode how interpreter state is managed. We ask whether interpreter persistence is merely a runtime scaffold, or a property of the training data that shapes how agents learn to use the interpreter. We isolate state persistence as a training-time variable. We introduce Opaque Knapsack, a procedurally generated family of partially observable optimization tasks designed to prevent one-shot solutions. Item attributes and constraints are hidden behind budgeted tool calls, forcing multi-turn control flow and iterative state revision. Holding task instances, prompts, tools, model, and supervision fixed, we generate matched trajectories differing only in whether interpreter state persists across steps or resets after each action. We then fine-tune identical base models (Qwen3-8B) on each trace variant and evaluate all four train-runtime combinations. Our 2x2 cross-evaluation shows that interpreter persistence shapes how agents reach solutions, not whether they do: solution quality is statistically indistinguishable across conditions, but token cost and stability differ substantially. A persistent-trained model in a stateless runtime triggers missing-variable errors in roughly 80% of episodes; a stateless-trained model in a persistent runtime redundantly re-derives retained state, using roughly 3.5x more tokens. Interpreter persistence should be treated as a first-class semantic of agent traces. Aligning fine-tuning data with deployment runtimes improves efficiency and reduces brittle train-runtime mismatches.

replace-cross VoxKnesset: A Large-Scale Longitudinal Hebrew Speech Dataset for Aging Speaker Modeling

Authors: Yanir Marmor, Arad Zulti, David Krongauz, Adam Gabet, Yoad Snapir, Yair Lifshitz, Eran Segal

Abstract: Speech processing systems face a fundamental challenge: the human voice changes with age, yet few datasets support rigorous longitudinal evaluation. We introduce VoxKnesset, an open-access dataset of ~2,300 hours of Hebrew parliamentary speech spanning 2009-2025, comprising 393 speakers with recording spans of up to 15 years. Each segment includes aligned transcripts and verified demographic metadata from official parliamentary records. We benchmark modern speech embeddings (WavLM-Large, ECAPA-TDNN, Wav2Vec2-XLSR-1B) on age prediction and speaker verification under longitudinal conditions. Speaker verification EER rises from 2.15\% to 4.58\% over 15 years for the strongest model, and cross-sectionally trained age regressors fail to capture within-speaker aging, while longitudinally trained models recover a meaningful temporal signal. We publicly release the dataset and pipeline to support aging-robust speech systems and Hebrew speech processing.

replace-cross Conformal Graph Prediction with Z-Gromov Wasserstein Distances

Authors: Gabriel Melo, Thibaut de Saivre, Anna Calissano, Florence d'Alch\'e-Buc

Abstract: Supervised graph prediction addresses regression problems where the outputs are structured graphs. Although several approaches exist for graph-valued prediction, principled uncertainty quantification remains limited. We propose a conformal prediction framework for graph-valued outputs, providing distribution-free coverage guarantees in structured output spaces. Our method defines nonconformity via the Z-Gromov-Wasserstein distance, instantiated in practice through Fused Gromov-Wasserstein (FGW), enabling permutation invariant comparison between predicted and candidate graphs. To obtain adaptive prediction sets, we introduce Score Conformalized Quantile Regression (SCQR), an extension of Conformalized Quantile Regression (CQR) to handle complex output spaces such as graph-valued outputs. We evaluate the proposed approach on a synthetic task and a real problem of molecule identification.

replace-cross stratum: A System Infrastructure for Massive Agent-Centric ML Workloads

Authors: Arnab Phani, Elias Strauss, Sebastian Schelter

Abstract: Recent advances in large language models (LLMs) transform how machine learning (ML) pipelines are developed and evaluated. LLMs enable a new type of workload, agentic pipeline search, in which autonomous or semi-autonomous agents generate, validate, and optimize complete ML pipelines. These agents predominantly operate over popular Python ML libraries and exhibit highly exploratory behavior. This results in thousands of executions for data profiling, pipeline generation, and iterative refinement of pipeline stages. However, the existing Python-based ML ecosystem is built around libraries such as Pandas and scikit-learn, which are designed for human-centric, interactive, sequential workflows and remain constrained by Python's interpretive execution model, library-level isolation, and limited runtime support for executing large numbers of pipelines. Meanwhile, many high-performance ML systems proposed by the systems community either target narrow workload classes or require specialized programming models, which limits their integration with the Python ML ecosystem and makes them largely ill-suited for LLM-based agents. This growing mismatch exposes a fundamental systems challenge in supporting agentic pipeline search at scale. We therefore propose stratum, a unified system infrastructure that decouples pipeline execution from planning and reasoning during agentic pipeline search. Stratum integrates seamlessly with existing Python libraries, compiles batches of pipelines into optimized execution graphs, and efficiently executes them across heterogeneous backends, including a novel Rust-based runtime. We present stratum's architectural vision along with an early prototype, discuss key design decisions, and outline open challenges and research directions. Finally, preliminary experiments show that stratum can significantly speed up large-scale agentic pipeline search up to 16.6x.

replace-cross DMD-augmented Unpaired Neural Schr\"odinger Bridge for Ultra-Low Field MRI Enhancement

Authors: Youngmin Kim, Jaeyun Shin, Jeongchan Kim, Taehoon Lee, Jaemin Kim, Peter Hsu, Jelle Veraart, Jong Chul Ye

Abstract: Ultra Low Field (64 mT) brain MRI improves accessibility but suffers from reduced image quality compared to 3 T. As paired 64 mT - 3 T scans are scarce, we propose an unpaired 64 mT $\rightarrow$ 3 T translation framework that enhances realism while preserving anatomy. Our method builds upon the Unpaired Neural Schr\"odinge Bridge (UNSB) with multi-step refinement. To strengthen target distribution alignment, we augment the adversarial objective with DMD2-style diffusion-guided distribution matching using a frozen 3T diffusion teacher. To explicitly constrain global structure beyond patch-level correspondence, we combine PatchNCE with an Anatomical Structure Preservation (ASP) regularizer that enforces soft foreground background consistency and boundary aware constraints. Evaluated on two disjoint cohorts, the proposed framework achieves an improved realism structure trade-off, enhancing distribution level realism on unpaired benchmarks while increasing structural fidelity on the paired cohort compared to unpaired baselines.

replace-cross LoRA-MME: Multi-Model Ensemble of LoRA-Tuned Encoders for Code Comment Classification

Authors: Md Akib Haider, Ahsan Bulbul, Nafis Fuad Shahid, Aimaan Ahmed, Mohammad Ishrak Abedin

Abstract: Code comment classification is a critical task for automated software documentation and analysis. In the context of the NLBSE'26 Tool Competition, we present LoRA-MME, a Multi-Model Ensemble architecture utilizing Parameter-Efficient Fine-Tuning (PEFT). Our approach addresses the multi-label classification challenge across Java, Python, and Pharo by combining the strengths of four distinct transformer encoders: UniXcoder, CodeBERT, GraphCodeBERT, and CodeBERTa. By independently fine-tuning these models using Low-Rank Adaptation(LoRA) and aggregating their predictions via a learned weighted ensemble strategy, we maximize classification performance without the memory overhead of full model fine-tuning. Our tool achieved an F1 Weighted score of 0.7906 and a Macro F1 of 0.6867 on the test set. However, the computational cost of the ensemble resulted in a final submission score of 41.20%, highlighting the trade-off between semantic accuracy and inference efficiency.