Authors: Anna Kalenkova, Lu Xia, Dirk Neumann
Abstract: This paper proposes a novel method for analyzing food retail processes with a focus on reducing food waste. The approach integrates object-centric process mining (OCPM) with stochastic process discovery and analysis. First, a stochastic process in the form of a continuous-time Markov chain is discovered from grocery store sales data. This model is then extended with supply activities. Finally, a what-if analysis is conducted to evaluate how the quantity of products in the store evolves over time. This enables the identification of an optimal balance between customer purchasing behavior and supply strategies, helping to prevent both food waste due to oversupply and product shortages.
Authors: Yi En Chou, Te Hsin Liu, Chao An Lin
Abstract: Physics Informed Neural Networks offer a mesh free framework for solving PDEs but are highly sensitive to loss weight selection. We propose two dimensional analysis based weighting schemes, one based on quantifiable terms, and another also incorporating unquantifiable terms for more balanced training. Benchmarks on heat conduction, convection diffusion, and lid driven cavity flows show that the second scheme consistently improves stability and accuracy over equal weighting. Notably, in high Peclet number convection diffusion, where traditional solvers fail, PINNs with our scheme achieve stable, accurate predictions, highlighting their robustness and generalizability in CFD problems.
Authors: Rushil Gupta, Jason Hartford, Bang Liu
Abstract: Large language models (LLMs) have recently been proposed as general-purpose agents for experimental design, with claims that they can perform in-context experimental design. We evaluate this hypothesis using both open- and closed-source instruction-tuned LLMs applied to genetic perturbation and molecular property discovery tasks. We find that LLM-based agents show no sensitivity to experimental feedback: replacing true outcomes with randomly permuted labels has no impact on performance. Across benchmarks, classical methods such as linear bandits and Gaussian process optimization consistently outperform LLM agents. We further propose a simple hybrid method, LLM-guided Nearest Neighbour (LLMNN) sampling, that combines LLM prior knowledge with nearest-neighbor sampling to guide the design of experiments. LLMNN achieves competitive or superior performance across domains without requiring significant in-context adaptation. These results suggest that current open- and closed-source LLMs do not perform in-context experimental design in practice and highlight the need for hybrid frameworks that decouple prior-based reasoning from batch acquisition with updated posteriors.
Authors: Nyi Nyi Aung, Neil Muralles, Adrian Stein
Abstract: This work addresses object identification under known dynamics in unmanned aerial vehicle applications, where learning and classification are combined through a physics-informed residual neural network. The proposed framework leverages physics-informed learning for state mapping and state-derivative prediction, while a softmax layer enables multi-class confidence estimation. Quadcopter, fixed-wing, and helicopter aerial vehicles are considered as case studies. The results demonstrate high classification accuracy with reduced training time, offering a promising solution for system identification problems in domains where the underlying dynamics are well understood.
Authors: Zihuan Qiu, Lei Wang, Yang Cao, Runtong Zhang, Bing Su, Yi Xu, Fanman Meng, Linfeng Xu, Qingbo Wu, Hongliang Li
Abstract: Data-free continual model merging (DFCMM) aims to fuse independently fine-tuned models into a single backbone that evolves with incoming tasks without accessing task data. This paper formulate two fundamental desiderata for DFCMM: transparency, avoiding interference with earlier tasks, and fidelity, adapting faithfully to each new task. This poses a challenge that existing approaches fail to address: how to bridge data-level desiderata with parameter-space optimization to ensure transparency and fidelity in the absence of task data. To this end, we propose NUFILT (NUll-space FILTering), a data-free framework that directly links these desiderata to optimization. Our key observation is that task vectors approximately align with representation subspaces, providing structural surrogates for enforcing transparency and fidelity. Accordingly, we design a null-space projector that preserves prior responses by filtering out overlapping components of new task vectors, thereby ensuring transparency, and a lightweight LoRA adapter that injects complementary task-specific signals, enabling fidelity in adapting to new tasks. The adapter is trained with a projection-based surrogate loss to retain consistency with previous knowledge while introducing novel directions. This joint filtering-adaptation process allows the backbone to absorb new knowledge while retaining existing behaviors, and the updates are finally fused back in a layer-wise linear fashion without extra parameters or inference cost. Theoretically, we establish approximate subspace alignment guarantees that justify null-space filtering. Empirically, NUFILT achieves state-of-the-art performance with minimal forgetting on both vision and NLP benchmarks, improving average accuracy by 4-7% over OPCM and WUDI-Merging, while narrowing the gap to fine-tuning and reducing computation overhead.
Authors: Waleed Esmail, Alexander Kappes, Stuart Russell, Christine Thomas
Abstract: We introduce \textit{SeismoGPT}, a transformer-based model for forecasting three-component seismic waveforms in the context of future gravitational wave detectors like the Einstein Telescope. The model is trained in an autoregressive setting and can operate on both single-station and array-based inputs. By learning temporal and spatial dependencies directly from waveform data, SeismoGPT captures realistic ground motion patterns and provides accurate short-term forecasts. Our results show that the model performs well within the immediate prediction window and gradually degrades further ahead, as expected in autoregressive systems. This approach lays the groundwork for data-driven seismic forecasting that could support Newtonian noise mitigation and real-time observatory control.
Authors: George Yakushev, Alina Shutova, Ivan Rubachev, Renat Sergazinov, Artem Babenko
Abstract: Tabular foundation models are becoming increasingly popular for low-resource tabular problems. These models make up for small training datasets by pretraining on large volumes of synthetic data. The prior knowledge obtained via pretraining provides the exceptional performance, but the resulting model becomes a black box that is difficult to interpret and costly to inference. In this work, we explore an alternative strategy: using reasoning-capable LLMs to induce decision trees for small tabular datasets in agentic setup. We design a minimal set of tools for constructing, analyzing and manipulating decision trees. By using these tools, LLMs combine their prior knowledge with learning from data to create a lightweight decision tree that outperforms traditional CART on low-resource tabular problems. While a single decision tree does not outperform state-of-the-art black box models, it comes with a human-readable reasoning trace that can be checked for biases and data leaks. Furthermore, the reasoning-based LLM's creation process allows for additional human input: correcting biases or incorporating domain-specific intuition that is not captured in the data.
Authors: Shehtab Zaman, Chengyan Liu, Kenneth Chiu
Abstract: Idempotent generative networks (IGNs) are a new line of generative models based on idempotent mapping to a target manifold. IGNs support both single-and multi-step generation, allowing for a flexible trade-off between computational cost and sample quality. But similar to Generative Adversarial Networks (GANs), conventional IGNs require adversarial training and are prone to training instabilities and mode collapse. Diffusion and score-based models are popular approaches to generative modeling that iteratively transport samples from one distribution, usually a Gaussian, to a target data distribution. These models have gained popularity due to their stable training dynamics and high-fidelity generation quality. However, this stability and quality come at the cost of high computational cost, as the data must be transported incrementally along the entire trajectory. New sampling methods, model distillation, and consistency models have been developed to reduce the sampling cost and even perform one-shot sampling from diffusion models. In this work, we unite diffusion and IGNs by distilling idempotent models from diffusion model scores, called SIGN. Our proposed method is highly stable and does not require adversarial losses. We provide a theoretical analysis of our proposed score-based training methods and empirically show that IGNs can be effectively distilled from a pre-trained diffusion model, enabling faster inference than iterative score-based models. SIGNs can perform multi-step sampling, allowing users to trade off quality for efficiency. These models operate directly on the source domain; they can project corrupted or alternate distributions back onto the target manifold, enabling zero-shot editing of inputs. We validate our models on multiple image datasets, achieving state-of-the-art results for idempotent models on the CIFAR and CelebA datasets.
Authors: Hude Liu, Jerry Yao-Chieh Hu, Jennifer Yuntong Zhang, Zhao Song, Han Liu
Abstract: We formalize hallucinations in generative models as failures to link an estimate to any plausible cause. Under this interpretation, we show that even loss-minimizing optimal estimators still hallucinate. We confirm this with a general high probability lower bound on hallucinate rate for generic data distributions. This reframes hallucination as structural misalignment between loss minimization and human-acceptable outputs, and hence estimation errors induced by miscalibration. Experiments on coin aggregation, open-ended QA, and text-to-image support our theory.
Authors: Guanghan Wang, Yair Schiff, Gilad Turok, Volodymyr Kuleshov
Abstract: While diffusion language models (DLMs) have achieved competitive performance in text generation, improving their reasoning ability with reinforcement learning remains an active research area. Here, we introduce d2, a reasoning framework tailored for masked DLMs. Central to our framework is a new policy gradient algorithm that relies on properties of masking to accurately estimate the likelihoods of sampling trajectories. Our estimators trade off computation for approximation accuracy in an analytically tractable manner, and are particularly effective for DLMs that support any-order likelihood estimation. We characterize and study this property in popular DLMs and show that it is key for efficient diffusion-based reasoning. Empirically, d2 significantly improves over previous diffusion reasoning frameworks using only RL (without relying on supervised fine-tuning), and sets a new state-of-the-art performance for DLMs on logical reasoning tasks (Countdown and Sudoku) and math reasoning benchmarks (GSM8K and MATH500).
Authors: Yuan Gao, Hao Wu, Qingsong Wen, Kun Wang, Xian Wu, Xiaomeng Huang
Abstract: Reconstructing subsurface ocean dynamics, such as vertical velocity fields, from incomplete surface observations poses a critical challenge in Earth science, a field long hampered by the lack of standardized, analysis-ready benchmarks. To systematically address this issue and catalyze research, we first build and release KD48, a high-resolution ocean dynamics benchmark derived from petascale simulations and curated with expert-driven denoising. Building on this benchmark, we introduce VISION, a novel reconstruction paradigm based on Dynamic Prompting designed to tackle the core problem of missing data in real-world observations. The essence of VISION lies in its ability to generate a visual prompt on-the-fly from any available subset of observations, which encodes both data availability and the ocean's physical state. More importantly, we design a State-conditioned Prompting module that efficiently injects this prompt into a universal backbone, endowed with geometry- and scale-aware operators, to guide its adaptive adjustment of computational strategies. This mechanism enables VISION to precisely handle the challenges posed by varying input combinations. Extensive experiments on the KD48 benchmark demonstrate that VISION not only substantially outperforms state-of-the-art models but also exhibits strong generalization under extreme data missing scenarios. By providing a high-quality benchmark and a robust model, our work establishes a solid infrastructure for ocean science research under data uncertainty. Our codes are available at: https://github.com/YuanGao-YG/VISION.
Authors: Zixuan Wu, So Won Jeong, Yating Liu, Yeo Jin Jung, Claire Donnat
Abstract: With promising empirical performance across a wide range of applications, synthetic data augmentation appears a viable solution to data scarcity and the demands of increasingly data-intensive models. Its effectiveness lies in expanding the training set in a way that reduces estimator variance while introducing only minimal bias. Controlling this bias is therefore critical: effective data augmentation should generate diverse samples from the same underlying distribution as the training set, with minimal shifts. In this paper, we propose conformal data augmentation, a principled data filtering framework that leverages the power of conformal prediction to produce diverse synthetic data while filtering out poor-quality generations with provable risk control. Our method is simple to implement, requires no access to internal model logits, nor large-scale model retraining. We demonstrate the effectiveness of our approach across multiple tasks, including topic prediction, sentiment analysis, image classification, and fraud detection, showing consistent performance improvements of up to 40% in F1 score over unaugmented baselines, and 4% over other filtered augmentation baselines.
Authors: Arya Akhavan, David Janz, El-Mahdi El-Mhamdi
Abstract: We study distributed learning in the setting of gradient-free zero-order optimization and introduce FedZero, a federated zero-order algorithm that delivers sharp theoretical guarantees. Specifically, FedZero: (1) achieves near-optimal optimization error bounds with high probability in the federated convex setting; and (2) in the single-worker regime-where the problem reduces to the standard zero-order framework, establishes the first high-probability convergence guarantees for convex zero-order optimization, thereby strengthening the classical expectation-based results. At its core, FedZero employs a gradient estimator based on randomization over the $\ell_1$-sphere. To analyze it, we develop new concentration inequalities for Lipschitz functions under the uniform measure on the $\ell_1$-sphere, with explicit constants. These concentration tools are not only central to our high-probability guarantees but may also be of independent interest.
Authors: Daniil D. Sirota, Sergey A. Khan, Sergey L. Kostikov, Kirill A. Butov
Abstract: This paper presents a method for modeling transient fluid flow in subsurface reservoir systems based on the developed neural operator architecture (TFNO-opt). Reservoir systems are complex dynamic objects with distributed parameters described by systems of partial differential equations (PDEs). Traditional numerical methods for modeling such systems, despite their high accuracy, are characterized by significant time costs for performing calculations, which limits their applicability in control and decision support problems. The proposed architecture (TFNO-opt) is based on Fourier neural operators, which allow approximating PDE solutions in infinite-dimensional functional spaces, providing invariance to discretization and the possibility of generalization to various implementations of equations. The developed modifications are aimed at increasing the accuracy and stability of the trained neural operator, which is especially important for control problems. These include adjustable internal time resolution of the integral Fourier operator, tensor decomposition of parameters in the spectral domain, use of the Sobolev norm in the error function, and separation of approximation errors and reconstruction of initial conditions for more accurate reproduction of physical processes. The effectiveness of the proposed improvements is confirmed by computational experiments. The practical significance is confirmed by computational experiments using the example of the problem of hydrodynamic modeling of an underground gas storage (UGS), where the acceleration of calculations by six orders of magnitude was achieved, compared to traditional methods. This opens up new opportunities for the effective control of complex reservoir systems.
Authors: Dmitry Eremeev, Oleg Platonov, Gleb Bazhenov, Artem Babenko, Liudmila Prokhorenkova
Abstract: Foundation models pretrained on large-scale datasets have transformed such fields as natural language processing and computer vision, but their application to graph data remains limited. Recently emerged graph foundation models, such as G2T-FM, utilize tabular foundation models for graph tasks and were shown to significantly outperform prior attempts to create GFMs. However, these models primarily rely on hand-crafted graph features, limiting their ability to learn complex graph-specific patterns. In this work, we propose GraphPFN: a prior-data fitted network for node-level prediction. First, we design a prior distribution of synthetic attributed graphs. For graph structure generation, we use a novel combination of multiple stochastic block models and a preferential attachment process. We then apply graph-aware structured causal models to generate node attributes and targets. This procedure allows us to efficiently generate a wide range of realistic graph datasets. Then, we augment the tabular foundation model LimiX with attention-based graph neighborhood aggregation layers and train it on synthetic graphs sampled from our prior, allowing the model to capture graph structural dependencies not present in tabular data. On diverse real-world graph datasets with up to 50,000 nodes, GraphPFN shows strong in-context learning performance and achieves state-of-the-art results after finetuning, outperforming both G2T-FM and task-specific GNNs trained from scratch on most datasets. More broadly, our work demonstrates that pretraining on synthetic graphs from a well-designed prior distribution is an effective strategy for building graph foundation models.
Authors: Arani Roy, Shristi Das Biswas, Kaushik Roy
Abstract: Diffusion models (DMs), lauded for their generative performance, are computationally prohibitive due to their billion-scale parameters and iterative denoising dynamics. Existing efficiency techniques, such as quantization, timestep reduction, or pruning, offer savings in compute, memory, or runtime but are strictly bottlenecked by reliance on fine-tuning or retraining to recover performance. In this work, we introduce SlimDiff, an automated activation-informed structural compression framework that reduces both attention and feedforward dimensionalities in DMs, while being entirely gradient-free. SlimDiff reframes DM compression as a spectral approximation task, where activation covariances across denoising timesteps define low-rank subspaces that guide dynamic pruning under a fixed compression budget. This activation-aware formulation mitigates error accumulation across timesteps by applying module-wise decompositions over functional weight groups: query--key interactions, value--output couplings, and feedforward projections, rather than isolated matrix factorizations, while adaptively allocating sparsity across modules to respect the non-uniform geometry of diffusion trajectories. SlimDiff achieves up to 35\% acceleration and $\sim$100M parameter reduction over baselines, with generation quality on par with uncompressed models without any backpropagation. Crucially, our approach requires only about 500 calibration samples, over 70$\times$ fewer than prior methods. To our knowledge, this is the first closed-form, activation-guided structural compression of DMs that is entirely training-free, providing both theoretical clarity and practical efficiency.
Authors: Junkai Zhang, Zihao Wang, Lin Gui, Swarnashree Mysore Sathyendra, Jaehwan Jeong, Victor Veitch, Wei Wang, Yunzhong He, Bing Liu, Lifeng Jin
Abstract: Reinforcement fine-tuning (RFT) often suffers from \emph{reward over-optimization}, where a policy model hacks the reward signals to achieve high scores while producing low-quality outputs. Our theoretical analysis shows that the key lies in reward misspecification at the high-reward tail: the inability to reliably distinguish Excellent responses from merely Great ones. This motivate us to focus on the high-reward region. However, such tail examples are scarce under the base LLM. While off-policy exemplars (e.g. from stronger models or rewrites) are easier to obtain, naively training on them yields a misspecified reward for the policy we aim to align. To address this, we study rubric-based rewards. By design, rubrics can leverage off-policy examples while remaining insensitive to their artifacts. To elicit rubrics that capture the high-reward tail, we highlight the importance of distinguishing among great and diverse responses, and introduce a workflow to implement this idea. We empirically demonstrate that rubric-based rewards substantially mitigate reward over-optimization and deliver effective LLM post-training improvements. Our code can be accessed at https://github.com/Jun-Kai-Zhang/rubrics.git .
Authors: Micha Livne
Abstract: Learning representations that transfer well to diverse downstream tasks remains a central challenge in representation learning. Existing paradigms -- contrastive learning, self-supervised masking, and denoising auto-encoders -- balance this challenge with different trade-offs. We introduce the {contrastive Mutual Information Machine} (cMIM), a probabilistic framework that extends the Mutual Information Machine (MIM) with a contrastive objective. While MIM maximizes mutual information between inputs and latents and promotes clustering of codes, it falls short on discriminative tasks. cMIM addresses this gap by imposing global discriminative structure while retaining MIM's generative fidelity. Our contributions are threefold. First, we propose cMIM, a contrastive extension of MIM that removes the need for positive data augmentation and is substantially less sensitive to batch size than InfoNCE. Second, we introduce {informative embeddings}, a general technique for extracting enriched features from encoder-decoder models that boosts discriminative performance without additional training and applies broadly beyond MIM. Third, we provide empirical evidence across vision and molecular benchmarks showing that cMIM outperforms MIM and InfoNCE on classification and regression tasks while preserving competitive reconstruction quality. These results position cMIM as a unified framework for representation learning, advancing the goal of models that serve both discriminative and generative applications effectively.
Authors: Weiqiao Han, Chenlin Meng, Christopher D. Manning, Stefano Ermon
Abstract: We present DistillKac, a fast image generator that uses the damped wave equation and its stochastic Kac representation to move probability mass at finite speed. In contrast to diffusion models whose reverse time velocities can become stiff and implicitly allow unbounded propagation speed, Kac dynamics enforce finite speed transport and yield globally bounded kinetic energy. Building on this structure, we introduce classifier-free guidance in velocity space that preserves square integrability under mild conditions. We then propose endpoint only distillation that trains a student to match a frozen teacher over long intervals. We prove a stability result that promotes supervision at the endpoints to closeness along the entire path. Experiments demonstrate DistillKac delivers high quality samples with very few function evaluations while retaining the numerical stability benefits of finite speed probability flows.
Authors: Joshua Mitton, Prarthana Bhattacharyya, Ralph Abboud, Simon Woodhead
Abstract: The main focus of research on Knowledge Tracing (KT) models is on model developments with the aim of improving predictive accuracy. Most of these models make the most incorrect predictions when students choose a distractor, leading to student errors going undetected. We present an approach to add new capabilities to KT models by capturing predictive uncertainty and demonstrate that a larger predictive uncertainty aligns with model incorrect predictions. We show that uncertainty in KT models is informative and that this signal would be pedagogically useful for application in an educational learning platform that can be used in a limited resource setting where understanding student ability is necessary.
Authors: Yuandong Tian
Abstract: While the phenomenon of grokking, i.e., delayed generalization, has been studied extensively, it remains an open question whether there is a mathematical framework to characterize what kind of features emerge, how and in which conditions it happens from training, for complex structured inputs. We propose a novel framework, named $\mathbf{Li_2}$, that captures three key stages for the grokking behavior of 2-layer nonlinear networks: (I) Lazy learning, (II) independent feature learning and (III) interactive feature learning, characterized by the structure of backpropagated gradient $G_F$ across layers. In (I), $G_F$ is random, and top layer overfits to random hidden representation. In (II), the gradient of each node (column of $G_F$) only depends on its own activation, and thus each hidden node learns their representation independently from $G_F$, which now carries information about target labels, thanks to weight decay. Interestingly, the independent dynamics follows exactly the gradient ascent of an energy function $E$, and its local maxima are precisely the emerging features. We study whether these local-optima induced features are generalizable, their representation power, and how they change on sample size, in group arithmetic tasks. Finally, in (III), we provably show how hidden nodes interact, and how $G_F$ changes to focus on missing features that need to be learned. Our study sheds lights on roles played by key hyperparameters such as weight decay, learning rate and sample sizes in grokking, leads to provable scaling laws of memorization and generalization, and reveals the underlying cause why recent optimizers such as Muon can be effective, from the first principles of gradient dynamics. Our analysis can be extended to multi-layer architectures.
Authors: Hongyang He, Xinyuan Song, Yangfan He, Zeyu Zhang, Yanshu Li, Haochen You, Lifan Sun, Wenqiao Zhang
Abstract: We introduce TRiCo, a novel triadic game-theoretic co-training framework that rethinks the structure of semi-supervised learning by incorporating a teacher, two students, and an adversarial generator into a unified training paradigm. Unlike existing co-training or teacher-student approaches, TRiCo formulates SSL as a structured interaction among three roles: (i) two student classifiers trained on frozen, complementary representations, (ii) a meta-learned teacher that adaptively regulates pseudo-label selection and loss balancing via validation-based feedback, and (iii) a non-parametric generator that perturbs embeddings to uncover decision boundary weaknesses. Pseudo-labels are selected based on mutual information rather than confidence, providing a more robust measure of epistemic uncertainty. This triadic interaction is formalized as a Stackelberg game, where the teacher leads strategy optimization and students follow under adversarial perturbations. By addressing key limitations in existing SSL frameworks, such as static view interactions, unreliable pseudo-labels, and lack of hard sample modeling, TRiCo provides a principled and generalizable solution. Extensive experiments on CIFAR-10, SVHN, STL-10, and ImageNet demonstrate that TRiCo consistently achieves state-of-the-art performance in low-label regimes, while remaining architecture-agnostic and compatible with frozen vision backbones.
Authors: Sathwik Karnik, Somil Bansal
Abstract: Large language models (LLMs) are now ubiquitous in everyday tools, raising urgent safety concerns about their tendency to generate harmful content. The dominant safety approach -- reinforcement learning from human feedback (RLHF) -- effectively shapes model behavior during training but offers no safeguards at inference time, where unsafe continuations may still arise. We propose BRT-Align, a reachability-based framework that brings control-theoretic safety tools to LLM inference. BRT-Align models autoregressive generation as a dynamical system in latent space and learn a safety value function via backward reachability, estimating the worst-case evolution of a trajectory. This enables two complementary mechanisms: (1) a runtime monitor that forecasts unsafe completions several tokens in advance, and (2) a least-restrictive steering filter that minimally perturbs latent states to redirect generation away from unsafe regions. Experiments across multiple LLMs and toxicity benchmarks demonstrate that BRT-Align provides more accurate and earlier detection of unsafe continuations than baselines. Moreover, for LLM safety alignment, BRT-Align substantially reduces unsafe generations while preserving sentence diversity and coherence. Qualitative results further highlight emergent alignment properties: BRT-Align consistently produces responses that are less violent, less profane, less offensive, and less politically biased. Together, these findings demonstrate that reachability analysis provides a principled and practical foundation for inference-time LLM safety.
Authors: Dongkyu Cho, Miao Zhang, Rumi Chunara
Abstract: Data augmentation is a widely used strategy to improve model robustness and generalization by enriching training datasets with synthetic examples. While large language models (LLMs) have demonstrated strong generative capabilities for this purpose, their applications in high-stakes domains like healthcare present unique challenges due to the risk of generating clinically incorrect or misleading information. In this work, we propose a novel query-based model collaboration framework that integrates expert-level domain knowledge to guide the augmentation process to preserve critical medical information. Experiments on clinical prediction tasks demonstrate that our lightweight collaboration-based approach consistently outperforms existing LLM augmentation methods while improving safety through reduced factual errors. This framework addresses the gap between LLM augmentation potential and the safety requirements of specialized domains.
Authors: Tankred Saanum, Can Demircan, Samuel J. Gershman, Eric Schulz
Abstract: Large Language Models (LLMs) excel at in-context learning, the ability to use information provided as context to improve prediction of future tokens. Induction heads have been argued to play a crucial role for in-context learning in Transformer Language Models. These attention heads make a token attend to successors of past occurrences of the same token in the input. This basic mechanism supports LLMs' ability to copy and predict repeating patterns. However, it is unclear if this same mechanism can support in-context learning of more complex repetitive patterns with hierarchical structure. Natural language is teeming with such cases: The article "the" in English usually prefaces multiple nouns in a text. When predicting which token succeeds a particular instance of "the", we need to integrate further contextual cues from the text to predict the correct noun. If induction heads naively attend to all past instances of successor tokens of "the" in a context-independent manner, they cannot support this level of contextual information integration. In this study, we design a synthetic in-context learning task, where tokens are repeated with hierarchical dependencies. Here, attending uniformly to all successor tokens is not sufficient to accurately predict future tokens. Evaluating a range of LLMs on these token sequences and natural language analogues, we find adaptive induction heads that support prediction by learning what to attend to in-context. Next, we investigate how induction heads themselves learn in-context. We find evidence that learning is supported by attention heads that uncover a set of latent contexts, determining the different token transition relationships. Overall, we not only show that LLMs have induction heads that learn, but offer a complete mechanistic account of how LLMs learn to predict higher-order repetitive patterns in-context.
Authors: Christopher Ackerman
Abstract: The possibility of LLM self-awareness and even sentience is gaining increasing public attention and has major safety and policy implications, but the science of measuring them is still in a nascent state. Here we introduce a novel methodology for quantitatively evaluating metacognitive abilities in LLMs. Taking inspiration from research on metacognition in nonhuman animals, our approach eschews model self-reports and instead tests to what degree models can strategically deploy knowledge of internal states. Using two experimental paradigms, we demonstrate that frontier LLMs introduced since early 2024 show increasingly strong evidence of certain metacognitive abilities, specifically the ability to assess and utilize their own confidence in their ability to answer factual and reasoning questions correctly and the ability to anticipate what answers they would give and utilize that information appropriately. We buttress these behavioral findings with an analysis of the token probabilities returned by the models, which suggests the presence of an upstream internal signal that could provide the basis for metacognition. We further find that these abilities 1) are limited in resolution, 2) emerge in context-dependent manners, and 3) seem to be qualitatively different from those of humans. We also report intriguing differences across models of similar capabilities, suggesting that LLM post-training may have a role in developing metacognitive abilities.
Authors: Yevgeny Seldin
Abstract: Learning, whether natural or artificial, is a process of selection. It starts with a set of candidate options and selects the more successful ones. In the case of machine learning the selection is done based on empirical estimates of prediction accuracy of candidate prediction rules on some data. Due to randomness of data sampling the empirical estimates are inherently noisy, leading to selection under uncertainty. The book provides statistical tools to obtain theoretical guarantees on the outcome of selection under uncertainty. We start with concentration of measure inequalities, which are the main statistical instrument for controlling how much an empirical estimate of expectation of a function deviates from the true expectation. The book covers a broad range of inequalities, including Markov's, Chebyshev's, Hoeffding's, Bernstein's, Empirical Bernstein's, Unexpected Bernstein's, kl, and split-kl. We then study the classical (offline) supervised learning and provide a range of tools for deriving generalization bounds, including Occam's razor, Vapnik-Chervonenkis analysis, and PAC-Bayesian analysis. The latter is further applied to derive generalization guarantees for weighted majority votes. After covering the offline setting, we turn our attention to online learning. We present the space of online learning problems characterized by environmental feedback, environmental resistance, and structural complexity. A common performance measure in online learning is regret, which compares performance of an algorithm to performance of the best prediction rule in hindsight, out of a restricted set of prediction rules. We present tools for deriving regret bounds in stochastic and adversarial environments, and under full information and bandit feedback.
Authors: Yiliu Wang, Timothy Doyeon Kim, Eric Shea-Brown, Uygar S\"umb\"ul
Abstract: Switching dynamical systems can model complicated time series data while maintaining interpretability by inferring a finite set of dynamics primitives and explaining different portions of the observed time series with one of these primitives. However, due to the discrete nature of this set, such models struggle to capture smooth, variable-speed transitions, as well as stochastic mixtures of overlapping states, and the inferred dynamics often display spurious rapid switching on real-world datasets. Here, we propose the Gumbel Dynamical Model (GDM). First, by introducing a continuous relaxation of discrete states and a different noise model defined on the relaxed-discrete state space via the Gumbel distribution, GDM expands the set of available state dynamics, allowing the model to approximate smoother and non-stationary ground-truth dynamics more faithfully. Second, the relaxation makes the model fully differentiable, enabling fast and scalable training with standard gradient descent methods. We validate our approach on standard simulation datasets and highlight its ability to model soft, sticky states and transitions in a stochastic setting. Furthermore, we apply our model to two real-world datasets, demonstrating its ability to infer interpretable states in stochastic time series with multiple dynamics, a setting where traditional methods often fail.
Authors: Mst Eshita Khatun, Halima Akter, Tasnimul Rehan, Toufiq Ahmed
Abstract: In this digital era, online shopping is common practice in our daily lives. Product reviews significantly influence consumer buying behavior and help establish buyer trust. However, the prevalence of fraudulent reviews undermines this trust by potentially misleading consumers and damaging the reputations of the sellers. This research addresses this pressing issue by employing advanced big data analytics and machine learning approaches on a substantial dataset of Amazon product reviews. The primary objective is to detect and classify spam reviews accurately so that it enhances the authenticity of the review. Using a scalable big data framework, we efficiently process and analyze a large scale of review data, extracting key features indicative of fraudulent behavior. Our study illustrates the utility of various machine learning classifiers in detecting spam reviews, with Logistic Regression achieving an accuracy of 90.35%, thus contributing to a more trustworthy and transparent online shopping environment.
Authors: Tian Yu Yen, Reese E. Jones, Ravi G. Patel
Abstract: Operator learning is a recently developed generalization of regression to mappings between functions. It promises to drastically reduce expensive numerical integration of PDEs to fast evaluations of mappings between functional states of a system, i.e., surrogate and reduced-order modeling. Operator learning has already found applications in several areas such as modeling sea ice, combustion, and atmospheric physics. Recent approaches towards integrating uncertainty quantification into the operator models have relied on likelihood based methods to infer parameter distributions from noisy data. However, stochastic operators may yield actions from which a likelihood is difficult or impossible to construct. In this paper, we introduce, GenUQ, a measure-theoretic approach to UQ that avoids constructing a likelihood by introducing a generative hyper-network model that produces parameter distributions consistent with observed data. We demonstrate that GenUQ outperforms other UQ methods in three example problems, recovering a manufactured operator, learning the solution operator to a stochastic elliptic PDE, and modeling the failure location of porous steel under tension.
Authors: Seohyeon Cha, Huancheng Chen, Haris Vikalo
Abstract: Federated continual learning (FCL) enables distributed client devices to learn from streaming data across diverse and evolving tasks. A major challenge to continual learning, catastrophic forgetting, is exacerbated in decentralized settings by the data heterogeneity, constrained communication and privacy concerns. We propose Federated gradient Projection-based Continual Learning with Task Identity Prediction (FedProTIP), a novel FCL framework that mitigates forgetting by projecting client updates onto the orthogonal complement of the subspace spanned by previously learned representations of the global model. This projection reduces interference with earlier tasks and preserves performance across the task sequence. To further address the challenge of task-agnostic inference, we incorporate a lightweight mechanism that leverages core bases from prior tasks to predict task identity and dynamically adjust the global model's outputs. Extensive experiments across standard FCL benchmarks demonstrate that FedProTIP significantly outperforms state-of-the-art methods in average accuracy, particularly in settings where task identities are a priori unknown.
Authors: Kevin Xia, Elias Bareinboim
Abstract: The study of causal abstractions bridges two integral components of human intelligence: the ability to determine cause and effect, and the ability to interpret complex patterns into abstract concepts. Formally, causal abstraction frameworks define connections between complicated low-level causal models and simple high-level ones. One major limitation of most existing definitions is that they are not well-defined when considering lossy abstraction functions in which multiple low-level interventions can have different effects while mapping to the same high-level intervention (an assumption called the abstract invariance condition). In this paper, we introduce a new type of abstractions called projected abstractions that generalize existing definitions to accommodate lossy representations. We show how to construct a projected abstraction from the low-level model and how it translates equivalent observational, interventional, and counterfactual causal queries from low to high-level. Given that the true model is rarely available in practice we prove a new graphical criteria for identifying and estimating high-level causal queries from limited low-level data. Finally, we experimentally show the effectiveness of projected abstraction models in high-dimensional image settings.
Authors: Marco Paul E. Apolinario, Kaushik Roy
Abstract: On-device learning is essential for personalization, privacy, and long-term adaptation in resource-constrained environments. Achieving this requires efficient learning, both fine-tuning existing models and continually acquiring new tasks without catastrophic forgetting. Yet both settings are constrained by high memory cost of storing activations during backpropagation. Existing activation compression methods reduce this cost but relying on repeated low-rank decompositions, introducing computational overhead. Also, such methods have not been explored for continual learning. We propose LANCE (Low-rank Activation Compression), a framework that performs one-shot higher-order Singular Value Decompsoition (SVD) to obtain a reusable low-rank subspace for activation projection. This eliminates repeated decompositions, reducing both memory and computation. Moreover, fixed low-rank subspaces further enable on-device continual learning by allocating tasks to orthogonal subspaces without storing large task-specific matrices. Experiments show that LANCE reduces activation storage up to 250$\times$ while maintaining accuracy comparable to full backpropagation on CIFAR-10/100, Oxford-IIIT Pets, Flowers102, and CUB-200 datasets. On continual learning benchmarks (Split CIFAR-100, Split MiniImageNet, 5-Datasets), it achieves performance competitive with orthogonal gradient projection methods at a fraction of the memory cost. These results position LANCE as a practical and scalable solution for efficient fine-tuning and continual learning on edge devices.
Authors: Krishu K Thapa, Reet Barik, Krishna Teja Chitty-Venkata, Murali Emani, Venkatram Vishwanath
Abstract: Training large models ranging from millions to billions of parameters is highly resource-intensive, requiring significant time, compute, and memory. It is observed that most of the learning (higher change in weights) takes place in the earlier stage of the training loop. These changes stabilize as training continues, enabling them to be captured by matrices of a low intrinsic rank. Therefore, we propose an approach to identify such states of partial convergence and dynamically switch from full parameter training to Low-Rank Adaptation (LoRA) on the ViT-Large model. We introduce a flexible approach that leverages user-defined hyperparameters to determine the switching point and assign a rank specific to each module layer based on its level of convergence. Experimental results show that this approach preserves model accuracy while reducing the number of trainable parameters to 10% of its original size, resulting in a 3x improvement in throughput, and a 1.5x reduction in average training time per epoch while also reducing GPU memory consumption by 20%
Authors: Andreas Burger, Luca Thiede, Nikolaj R{\o}nne, Varinia Bernales, Nandita Vijaykumar, Tejs Vegge, Arghya Bhowmik, Alan Aspuru-Guzik
Abstract: Fundamental tasks in computational chemistry, from transition state search to vibrational analysis, rely on molecular Hessians, which are the second derivatives of the potential energy. Yet, Hessians are computationally expensive to calculate and scale poorly with system size, with both quantum mechanical methods and neural networks. In this work, we demonstrate that Hessians can be predicted directly from a deep learning model, without relying on automatic differentiation or finite differences. We observe that one can construct SE(3)-equivariant, symmetric Hessians from irreducible representations (irrep) features up to degree $l$=2 computed during message passing in graph neural networks. This makes HIP Hessians one to two orders of magnitude faster, more accurate, more memory efficient, easier to train, and enables more favorable scaling with system size. We validate our predictions across a wide range of downstream tasks, demonstrating consistently superior performance for transition state search, accelerated geometry optimization, zero-point energy corrections, and vibrational analysis benchmarks. We open-source the HIP codebase and model weights to enable further development of the direct prediction of Hessians at https://github.com/BurgerAndreas/hip
Authors: Feng Yu, Jia Hu, Geyong Min
Abstract: Parameter-efficient fine-tuning (PEFT) methods must be resource-efficient yet handle heterogeneous reasoning transformations, and classical low-rank adaptation (LoRA) is constrained by the nominal rank $r$. Hadamard-style extensions like HiRA raise the nominal rank but couple every update to the global energy pattern of the frozen weight matrix, while ABBA trades this inductive bias for fully learned dense intermediates. To address the limitation of global modulation, we propose Block Hadamard high-Rank Adaptation (BHRA), which partitions each weight matrix and applies HiRA-style multiplicative modulation independently within every block, preserving the PEFT parameter footprint while unlocking localized rank amplification. Our empirical analyses reveal that this blockwise design maintains rich spectra across rank budgets, mitigating the collapse induced by global modulation. Across eight commonsense reasoning tasks and two arithmetic benchmarks with Llama-3.2 1B/3B, Mistral-7B, and Gemma-2 9B, BHRA consistently surpasses strong PEFT baselines under matched parameter budgets.
Authors: Mingze Dong, Leda Wang, Yuval Kluger
Abstract: Mask-based pretraining has become a cornerstone of modern large-scale models across language, vision, and recently biology. Despite its empirical success, its role and limits in learning data representations have been unclear. In this work, we show that the behavior of mask-based pretraining can be directly characterized by test risk in high-dimensional minimum-norm ("ridge-less") linear regression, without relying on further model specifications. Further analysis of linear models uncovers several novel aspects of mask-based pretraining. The theoretical framework and its implications have been validated across diverse neural architectures (including MLPs, CNNs, and Transformers) applied to both vision and language tasks. Guided by our theory, we propose an embarrassingly simple yet overlooked pretraining scheme named Randomly Random Mask AutoEncoding (R$^2$MAE), which enforces capturing multi-scale features from data and is able to outperform optimal fixed mask ratio settings in our linear model framework. We implement R$^2$MAE in vision, language, DNA sequence, and single-cell models, where it consistently outperforms standard and more complicated masking schemes, leading to improvements for state-of-the-art models. Our code is available at: https://github.com/MingzeDong/r2mae
Authors: Rina Panigrahy, Vatsal Sharan
Abstract: Safety, trust and Artificial General Intelligence (AGI) are aspirational goals in artificial intelligence (AI) systems, and there are several informal interpretations of these notions. In this paper, we propose strict, mathematical definitions of safety, trust, and AGI, and demonstrate a fundamental incompatibility between them. We define safety of a system as the property that it never makes any false claims, trust as the assumption that the system is safe, and AGI as the property of an AI system always matching or exceeding human capability. Our core finding is that -- for our formal definitions of these notions -- a safe and trusted AI system cannot be an AGI system: for such a safe, trusted system there are task instances which are easily and provably solvable by a human but not by the system. We note that we consider strict mathematical definitions of safety and trust, and it is possible for real-world deployments to instead rely on alternate, practical interpretations of these notions. We show our results for program verification, planning, and graph reachability. Our proofs draw parallels to G\"odel's incompleteness theorems and Turing's proof of the undecidability of the halting problem, and can be regarded as interpretations of G\"odel's and Turing's results.
Authors: Yinuo Ren, Wenhao Gao, Lexing Ying, Grant M. Rotskoff, Jiequn Han
Abstract: We study inference-time scaling for diffusion models, where the goal is to adapt a pre-trained model to new target distributions without retraining. Existing guidance-based methods are simple but introduce bias, while particle-based corrections suffer from weight degeneracy and high computational cost. We introduce DriftLite, a lightweight, training-free particle-based approach that steers the inference dynamics on the fly with provably optimal stability control. DriftLite exploits a previously unexplored degree of freedom in the Fokker-Planck equation between the drift and particle potential, and yields two practical instantiations: Variance- and Energy-Controlling Guidance (VCG/ECG) for approximating the optimal drift with minimal overhead. Across Gaussian mixture models, particle systems, and large-scale protein-ligand co-folding problems, DriftLite consistently reduces variance and improves sample quality over pure guidance and sequential Monte Carlo baselines. These results highlight a principled, efficient route toward scalable inference-time adaptation of diffusion models.
Authors: Chang Deng, Bryon Aragam
Abstract: Existing methods for differentiable structure learning in discrete data typically assume that the data are generated from specific structural equation models. However, these assumptions may not align with the true data-generating process, which limits the general applicability of such methods. Furthermore, current approaches often ignore the complex dependence structure inherent in discrete data and consider only linear effects. We propose a differentiable structure learning framework that is capable of capturing arbitrary dependencies among discrete variables. We show that although general discrete models are unidentifiable from purely observational data, it is possible to characterize the complete set of compatible parameters and structures. Additionally, we establish identifiability up to Markov equivalence under mild assumptions. We formulate the learning problem as a single differentiable optimization task in the most general form, thereby avoiding the unrealistic simplifications adopted by previous methods. Empirical results demonstrate that our approach effectively captures complex relationships in discrete data.
Authors: Siming Shan, Min Zhu, Youzuo Lin, Lu Lu
Abstract: Partial differential equation (PDE)-governed inverse problems are fundamental across various scientific and engineering applications; yet they face significant challenges due to nonlinearity, ill-posedness, and sensitivity to noise. Here, we introduce a new computational framework, RED-DiffEq, by integrating physics-driven inversion and data-driven learning. RED-DiffEq leverages pretrained diffusion models as a regularization mechanism for PDE-governed inverse problems. We apply RED-DiffEq to solve the full waveform inversion problem in geophysics, a challenging seismic imaging technique that seeks to reconstruct high-resolution subsurface velocity models from seismic measurement data. Our method shows enhanced accuracy and robustness compared to conventional methods. Additionally, it exhibits strong generalization ability to more complex velocity models that the diffusion model is not trained on. Our framework can also be directly applied to diverse PDE-governed inverse problems.
Authors: Pascal Memmesheimer, Vincent Heuveline, J\"urgen Hesser
Abstract: Treatment effect estimation is essential for informed decision-making in many fields such as healthcare, economics, and public policy. While flexible machine learning models have been widely applied for estimating heterogeneous treatment effects, quantifying the inherent uncertainty of their point predictions remains an issue. Recent advancements in conformal prediction address this limitation by allowing for inexpensive computation, as well as distribution shifts, while still providing frequentist, finite-sample coverage guarantees under minimal assumptions for any point-predictor model. This advancement holds significant potential for improving decision-making in especially high-stakes environments. In this work, we perform a systematic review regarding conformal prediction methods for treatment effect estimation and provide for both the necessary theoretical background. Through a systematic filtering process, we select and analyze eleven key papers, identifying and describing current state-of-the-art methods in this area. Based on our findings, we propose directions for future research.
Authors: Afrina Tabassum, Bin Guo, Xiyao Ma, Hoda Eldardiry, Ismini Lourentzou
Abstract: Multimodal Procedural Planning (MPP) aims to generate step-by-step instructions that combine text and images, with the central challenge of preserving object-state consistency across modalities while producing informative plans. Existing approaches often leverage large language models (LLMs) to refine textual steps; however, visual object-state alignment and systematic evaluation are largely underexplored. We present MMPlanner, a zero-shot MPP framework that introduces Object State Reasoning Chain-of-Thought (OSR-CoT) prompting to explicitly model object-state transitions and generate accurate multimodal plans. To assess plan quality, we design LLM-as-a-judge protocols for planning accuracy and cross-modal alignment, and further propose a visual step-reordering task to measure temporal coherence. Experiments on RECIPEPLAN and WIKIPLAN show that MMPlanner achieves state-of-the-art performance, improving textual planning by +6.8%, cross-modal alignment by +11.9%, and visual step ordering by +26.7%
Authors: Davide Bizzaro, Alessandro Daniele
Abstract: Neurosymbolic integration (NeSy) blends neural-network learning with symbolic reasoning. The field can be split between methods injecting hand-crafted rules into neural models, and methods inducing symbolic rules from data. We introduce Logic of Hypotheses (LoH), a novel language that unifies these strands, enabling the flexible integration of data-driven rule learning with symbolic priors and expert knowledge. LoH extends propositional logic syntax with a choice operator, which has learnable parameters and selects a subformula from a pool of options. Using fuzzy logic, formulas in LoH can be directly compiled into a differentiable computational graph, so the optimal choices can be learned via backpropagation. This framework subsumes some existing NeSy models, while adding the possibility of arbitrary degrees of knowledge specification. Moreover, the use of Goedel fuzzy logic and the recently developed Goedel trick yields models that can be discretized to hard Boolean-valued functions without any loss in performance. We provide experimental analysis on such models, showing strong results on tabular data and on the Visual Tic-Tac-Toe NeSy task, while producing interpretable decision rules.
Authors: Joshua Salim, Jordan Yu, Xilei Zhao
Abstract: While deep learning models excel at predictive tasks, they often overfit due to their complex structure and large number of parameters, causing them to memorize training data, including noise, rather than learn patterns that generalize to new data. To tackle this challenge, this paper proposes a new regularization method, i.e., Enforcing Domain-Informed Monotonicity in Deep Neural Networks (DIM), which maintains domain-informed monotonic relationships in complex deep learning models to further improve predictions. Specifically, our method enforces monotonicity by penalizing violations relative to a linear baseline, effectively encouraging the model to follow expected trends while preserving its predictive power. We formalize this approach through a comprehensive mathematical framework that establishes a linear reference, measures deviations from monotonic behavior, and integrates these measurements into the training objective. We test and validate the proposed methodology using a real-world ridesourcing dataset from Chicago and a synthetically created dataset. Experiments across various neural network architectures show that even modest monotonicity constraints consistently enhance model performance. DIM enhances the predictive performance of deep neural networks by applying domain-informed monotonicity constraints to regularize model behavior and mitigate overfitting
Authors: Andrii Zahorodnii, Christopher Wang, Bennett Stankovits, Charikleia Moraitaki, Geeling Chau, Andrei Barbu, Boris Katz, Ila R Fiete
Abstract: High-resolution neural datasets enable foundation models for the next generation of brain-computer interfaces and neurological treatments. The community requires rigorous benchmarks to discriminate between competing modeling approaches, yet no standardized evaluation frameworks exist for intracranial EEG (iEEG) recordings. To address this gap, we present Neuroprobe: a suite of decoding tasks for studying multi-modal language processing in the brain. Unlike scalp EEG, intracranial EEG requires invasive surgery to implant electrodes that record neural activity directly from the brain with minimal signal distortion. Neuroprobe is built on the BrainTreebank dataset, which consists of 40 hours of iEEG recordings from 10 human subjects performing a naturalistic movie viewing task. Neuroprobe serves two critical functions. First, it is a mine from which neuroscience insights can be drawn. Its high temporal and spatial resolution allows researchers to systematically determine when and where computations for each aspect of language processing occur in the brain by measuring the decodability of each feature across time and all electrode locations. Using Neuroprobe, we visualize how information flows from the superior temporal gyrus to the prefrontal cortex, and the progression from simple auditory features to more complex language features in a purely data-driven manner. Second, as the field moves toward neural foundation models, Neuroprobe provides a rigorous framework for comparing competing architectures and training protocols. We found that the linear baseline is surprisingly strong, beating frontier foundation models on many tasks. Neuroprobe is designed with computational efficiency and ease of use in mind. We make the code for Neuroprobe openly available and maintain a public leaderboard, aiming to enable rapid progress in the field of iEEG foundation models, at https://neuroprobe.dev/
URLs: https://neuroprobe.dev/
Authors: Junyong Park, Oron Levy, Rebecca Adaimi, Asaf Liberman, Gierad Laput, Abdelkareem Bedri
Abstract: Wearable accelerometers are used for a wide range of applications, such as gesture recognition, gait analysis, and sports monitoring. Yet most existing foundation models focus primarily on classifying common daily activities such as locomotion and exercise, limiting their applicability to the broader range of tasks that rely on other signal characteristics. We present SlotFM, an accelerometer foundation model that generalizes across diverse downstream tasks. SlotFM uses Time-Frequency Slot Attention, an extension of Slot Attention that processes both time and frequency representations of the raw signals. It generates multiple small embeddings (slots), each capturing different signal components, enabling task-specific heads to focus on the most relevant parts of the data. We also introduce two loss regularizers that capture local structure and frequency patterns, which improve reconstruction of fine-grained details and helps the embeddings preserve task-relevant information. We evaluate SlotFM on 16 classification and regression downstream tasks that extend beyond standard human activity recognition. It outperforms existing self-supervised approaches on 13 of these tasks and achieves comparable results to the best performing approaches on the remaining tasks. On average, our method yields a 4.5% performance gain, demonstrating strong generalization for sensing foundation models.
Authors: Peng Xu, Chun-Ying Hou, Xiaohui Chen, Richard Y. Zhang
Abstract: Clustering is a hard discrete optimization problem. Nonconvex approaches such as low-rank semidefinite programming (SDP) have recently demonstrated promising statistical and local algorithmic guarantees for cluster recovery. Due to the combinatorial structure of the $K$-means clustering problem, current relaxation algorithms struggle to balance their constraint feasibility and objective optimality, presenting tremendous challenges in computing the second-order critical points with rigorous guarantees. In this paper, we provide a new formulation of the $K$-means problem as a smooth unconstrained optimization over a submanifold and characterize its Riemannian structures to allow it to be solved using a second-order cubic-regularized Riemannian Newton algorithm. By factorizing the $K$-means manifold into a product manifold, we show how each Newton subproblem can be solved in linear time. Our numerical experiments show that the proposed method converges significantly faster than the state-of-the-art first-order nonnegative low-rank factorization method, while achieving similarly optimal statistical accuracy.
Authors: Divya Gopinath, Corina S. Pasareanu, Muhammad Usman
Abstract: We present Prophecy, a tool for automatically inferring formal properties of feed-forward neural networks. Prophecy is based on the observation that a significant part of the logic of feed-forward networks is captured in the activation status of the neurons at inner layers. Prophecy works by extracting rules based on neuron activations (values or on/off statuses) as preconditions that imply certain desirable output property, e.g., the prediction being a certain class. These rules represent network properties captured in the hidden layers that imply the desired output behavior. We present the architecture of the tool, highlight its features and demonstrate its usage on different types of models and output properties. We present an overview of its applications, such as inferring and proving formal explanations of neural networks, compositional verification, run-time monitoring, repair, and others. We also show novel results highlighting its potential in the era of large vision-language models.
Authors: Thomas Walton, Darin Tsui, Aryan Musharaf, Amirali Aghazadeh
Abstract: Autoregressive models have transformed protein engineering by enabling the generation of novel protein sequences beyond those found in nature. However, their sequential inference introduces significant latency, limiting their utility in high-throughput protein screening. Speculative decoding accelerates generation by employing a lightweight draft model to sample tokens, which a larger target model then verifies and refines. Yet, in protein sequence generation, draft models are typically agnostic to the structural and functional constraints of the target protein, leading to biologically implausible outputs and a shift in the likelihood distribution of generated sequences. We introduce SpecMER (Speculative Decoding via k-mer Guidance), a novel framework that incorporates biological, structural, and functional priors using k-mer motifs extracted from multiple sequence alignments. By scoring candidate sequences in parallel and selecting those most consistent with known biological patterns, SpecMER significantly improves sequence plausibility while retaining the efficiency of speculative decoding. SpecMER achieves 24-32% speedup over standard autoregressive decoding, along with higher acceptance rates and improved sequence likelihoods.
Authors: Saurabh Kataria, Davood Fattahi, Minxiao Wang, Ran Xiao, Matthew Clark, Timothy Ruchti, Mark Mai, Xiao Hu
Abstract: High-frequency physiological waveform modality offers deep, real-time insights into patient status. Recently, physiological foundation models based on Photoplethysmography (PPG), such as PPG-GPT, have been shown to predict critical events, including Cardiac Arrest (CA). However, their powerful representation still needs to be leveraged suitably, especially when the downstream data/label is scarce. We offer three orthogonal improvements to improve PPG-only CA systems by using minimal auxiliary information. First, we propose to use time-to-event modeling, either through simple regression to the event onset time or by pursuing fine-grained discrete survival modeling. Second, we encourage the model to learn CA-focused features by making them patient-identity invariant. This is achieved by first training the largest-scale de-identified biometric identification model, referred to as the p-vector, and subsequently using it adversarially to deconfound cues, such as person identity, that may cause overfitting through memorization. Third, we propose regression on the pseudo-lab values generated by pre-trained auxiliary estimator networks. This is crucial since true blood lab measurements, such as lactate, sodium, troponin, and potassium, are collected sparingly. Via zero-shot prediction, the auxiliary networks can enrich cardiac arrest waveform labels and generate pseudo-continuous estimates as targets. Our proposals can independently improve the 24-hour time-averaged AUC from the 0.74 to the 0.78-0.80 range. We primarily improve over longer time horizons with minimal degradation near the event, thus pushing the Early Warning System research. Finally, we pursue multi-task formulation and diagnose it with a high gradient conflict rate among competing losses, which we alleviate via the PCGrad optimization technique.
Authors: Taiga Kojima, Masayuki Karasuyama
Abstract: In the graph-level prediction task (predict a label for a given graph), the information contained in subgraphs of the input graph plays a key role. In this paper, we propose Exact subgraph Isomorphism Network (EIN), which combines the exact subgraph enumeration, neural network, and a sparse regularization. In general, building a graph-level prediction model achieving high discriminative ability along with interpretability is still a challenging problem. Our combination of the subgraph enumeration and neural network contributes to high discriminative ability about the subgraph structure of the input graph. Further, the sparse regularization in EIN enables us 1) to derive an effective pruning strategy that mitigates computational difficulty of the enumeration while maintaining the prediction performance, and 2) to identify important subgraphs that contributes to high interpretability. We empirically show that EIN has sufficiently high prediction performance compared with standard graph neural network models, and also, we show examples of post-hoc analysis based on the selected subgraphs.
Authors: Yuqin Jiang, Andrey A. Popov, Tianle Duan, Qingchun Li
Abstract: Understanding urban human mobility patterns at various spatial levels is essential for social science. This study presents a machine learning framework to downscale origin-destination (OD) taxi trips flows in New York City from a larger spatial unit to a smaller spatial unit. First, correlations between OD trips and demographic, socioeconomic, and commuting characteristics are developed using four models: Linear Regression (LR), Random Forest (RF), Support Vector Machine (SVM), and Neural Networks (NN). Second, a perturbation-based sensitivity analysis is applied to interpret variable importance for nonlinear models. The results show that the linear regression model failed to capture the complex variable interactions. While NN performs best with the training and testing datasets, SVM shows the best generalization ability in downscaling performance. The methodology presented in this study provides both analytical advancement and practical applications to improve transportation services and urban development.
Authors: Weiqi Yue, Wenbiao Li, Yuzhou Jiang, Anisa Halimi, Roger French, Erman Ayday
Abstract: Federated learning enables collaborative model training without sharing raw data, but data heterogeneity consistently challenges the performance of the global model. Traditional optimization methods often rely on collaborative global model training involving all clients, followed by local adaptation to improve individual performance. In this work, we focus on early-stage quality control and propose PQFed, a novel privacy-preserving personalized federated learning framework that designs customized training strategies for each client prior to the federated training process. PQFed extracts representative features from each client's raw data and applies clustering techniques to estimate inter-client dataset similarity. Based on these similarity estimates, the framework implements a client selection strategy that enables each client to collaborate with others who have compatible data distributions. We evaluate PQFed on two benchmark datasets, CIFAR-10 and MNIST, integrated with three existing federated learning algorithms. Experimental results show that PQFed consistently improves the target client's model performance, even with a limited number of participants. We further benchmark PQFed against a baseline cluster-based algorithm, IFCA, and observe that PQFed also achieves better performance in low-participation scenarios. These findings highlight PQFed's scalability and effectiveness in personalized federated learning settings.
Authors: Xavier Gonzalez, E. Kelly Buchanan, Hyun Dong Lee, Jerry Weihong Liu, Ke Alexander Wang, David M. Zoltowski, Christopher R\'e, Scott W. Linderman
Abstract: Harnessing parallelism in seemingly sequential models is a central challenge for modern machine learning. Several approaches have been proposed for evaluating sequential processes in parallel using fixed-point methods, like Newton, Picard, and Jacobi iterations. In this work, we show that these methods can be understood within a common framework based on linear dynamical systems (LDSs), where different iteration schemes arise naturally as approximate linearizations of a nonlinear recursion. This unifying view highlights shared principles behind these techniques and clarifies when particular fixed-point methods are most likely to be effective. By bridging diverse algorithms through the language of LDSs, our framework provides a clearer theoretical foundation for parallelizing sequential models and points toward new opportunities for efficient and scalable computation.
Authors: Takuya Kanayama, Yuki Ito, Tomoyuki Tamura, Masayuki Karasuyama
Abstract: A bilevel optimization problem consists of two optimization problems nested as an upper- and a lower-level problem, in which the optimality of the lower-level problem defines a constraint for the upper-level problem. This paper considers Bayesian optimization (BO) for the case that both the upper- and lower-levels involve expensive black-box functions. Because of its nested structure, bilevel optimization has a complex problem definition and, compared with other standard extensions of BO such as multi-objective or constraint settings, it has not been widely studied. We propose an information-theoretic approach that considers the information gain of both the upper- and lower-optimal solutions and values. This enables us to define a unified criterion that measures the benefit for both level problems, simultaneously. Further, we also show a practical lower bound based approach to evaluating the information gain. We empirically demonstrate the effectiveness of our proposed method through several benchmark datasets.
Authors: Houliang Zhou, Rong Zhou, Yangying Liu, Kanhao Zhao, Li Shen, Brian Y. Chen, Yu Zhang, Lifang He, Alzheimer's Disease Neuroimaging Initiative
Abstract: Identifying objective neuroimaging biomarkers to forecast Alzheimer's disease (AD) progression is crucial for timely intervention. However, this task remains challenging due to the complex dysfunctions in the spatio-temporal characteristics of underlying brain networks, which are often overlooked by existing methods. To address these limitations, we develop an interpretable spatio-temporal graph neural network framework to predict future AD progression, leveraging dual Stochastic Differential Equations (SDEs) to model the irregularly-sampled longitudinal functional magnetic resonance imaging (fMRI) data. We validate our approach on two independent cohorts, including the Open Access Series of Imaging Studies (OASIS-3) and the Alzheimer's Disease Neuroimaging Initiative (ADNI). Our framework effectively learns sparse regional and connective importance probabilities, enabling the identification of key brain circuit abnormalities associated with disease progression. Notably, we detect the parahippocampal cortex, prefrontal cortex, and parietal lobule as salient regions, with significant disruptions in the ventral attention, dorsal attention, and default mode networks. These abnormalities correlate strongly with longitudinal AD-related clinical symptoms. Moreover, our interpretability strategy reveals both established and novel neural systems-level and sex-specific biomarkers, offering new insights into the neurobiological mechanisms underlying AD progression. Our findings highlight the potential of spatio-temporal graph-based learning for early, individualized prediction of AD progression, even in the context of irregularly-sampled longitudinal imaging data.
Authors: Ziqing Wang, Yibo Wen, William Pattie, Xiao Luo, Weimin Wu, Jerry Yao-Chieh Hu, Abhishek Pandey, Han Liu, Kaize Ding
Abstract: Lead optimization in drug discovery requires efficiently navigating vast chemical space through iterative cycles to enhance molecular properties while preserving structural similarity to the original lead compound. Despite recent advances, traditional optimization methods struggle with sample efficiency-achieving good optimization performance with limited oracle evaluations. Large Language Models (LLMs) provide a promising approach through their in-context learning and instruction following capabilities, which align naturally with these iterative processes. However, existing LLM-based methods fail to leverage this strength, treating each optimization step independently. To address this, we present POLO (Preference-guided multi-turn Optimization for Lead Optimization), which enables LLMs to learn from complete optimization trajectories rather than isolated steps. At its core, POLO introduces Preference-Guided Policy Optimization (PGPO), a novel reinforcement learning algorithm that extracts learning signals at two complementary levels: trajectory-level optimization reinforces successful strategies, while turn-level preference learning provides dense comparative feedback by ranking intermediate molecules within each trajectory. Through this dual-level learning from intermediate evaluation, POLO achieves superior sample efficiency by fully exploiting each costly oracle call. Extensive experiments demonstrate that POLO achieves 84% average success rate on single-property tasks (2.3x better than baselines) and 50% on multi-property tasks using only 500 oracle evaluations, significantly advancing the state-of-the-art in sample-efficient molecular optimization.
Authors: Ciyuan Peng, Nguyen Linh Dan Le, Shan Jin, Dexuan Ding, Shuo Yu, Feng Xia
Abstract: Brain graph learning has demonstrated significant achievements in the fields of neuroscience and artificial intelligence. However, existing methods struggle to selectively learn disease-related knowledge, leading to heavy parameters and computational costs. This challenge diminishes their efficiency, as well as limits their practicality for real-world clinical applications. To this end, we propose a lightweight Brain PathoGraph Learning (BrainPoG) model that enables efficient brain graph learning by pathological pattern filtering and pathological feature distillation. Specifically, BrainPoG first contains a filter to extract the pathological pattern formulated by highly disease-relevant subgraphs, achieving graph pruning and lesion localization. A PathoGraph is therefore constructed by dropping less disease-relevant subgraphs from the whole brain graph. Afterwards, a pathological feature distillation module is designed to reduce disease-irrelevant noise features and enhance pathological features of each node in the PathoGraph. BrainPoG can exclusively learn informative disease-related knowledge while avoiding less relevant information, achieving efficient brain graph learning. Extensive experiments on four benchmark datasets demonstrate that BrainPoG exhibits superiority in both model performance and computational efficiency across various brain disease detection tasks.
Authors: Brian B. Moser, Arundhati S. Shanbhag, Tobias C. Nauen, Stanislav Frolov, Federico Raue, Joachim Folz, Andreas Dengel
Abstract: The goal of coreset selection methods is to identify representative subsets of datasets for efficient model training. Yet, existing methods often ignore the possibility of annotation errors and require fixed pruning ratios, making them impractical in real-world settings. We present HyperCore, a robust and adaptive coreset selection framework designed explicitly for noisy environments. HyperCore leverages lightweight hypersphere models learned per class, embedding in-class samples close to a hypersphere center while naturally segregating out-of-class samples based on their distance. By using Youden's J statistic, HyperCore can adaptively select pruning thresholds, enabling automatic, noise-aware data pruning without hyperparameter tuning. Our experiments reveal that HyperCore consistently surpasses state-of-the-art coreset selection methods, especially under noisy and low-data regimes. HyperCore effectively discards mislabeled and ambiguous points, yielding compact yet highly informative subsets suitable for scalable and noise-free learning.
Authors: Brian B. Moser, Tobias C. Nauen, Arundhati S. Shanbhag, Federico Raue, Stanislav Frolov, Joachim Folz, Andreas Dengel
Abstract: The goal of coreset selection is to identify representative subsets of datasets for efficient model training. Yet, existing approaches paradoxically require expensive training-based signals, e.g., gradients, decision boundary estimates or forgetting counts, computed over the entire dataset prior to pruning, which undermines their very purpose by requiring training on samples they aim to avoid. We introduce SubZeroCore, a novel, training-free coreset selection method that integrates submodular coverage and density into a single, unified objective. To achieve this, we introduce a sampling strategy based on a closed-form solution to optimally balance these objectives, guided by a single hyperparameter that explicitly controls the desired coverage for local density measures. Despite no training, extensive evaluations show that SubZeroCore matches training-based baselines and significantly outperforms them at high pruning rates, while dramatically reducing computational overhead. SubZeroCore also demonstrates superior robustness to label noise, highlighting its practical effectiveness and scalability for real-world scenarios.
Authors: Jaemin Oh
Abstract: Four-dimensional variational data assimilation (4DVAR) is a cornerstone of numerical weather prediction, but its cost function is difficult to optimize and computationally intensive. We propose a neural field-based reformulation in which the full spatiotemporal state is represented as a continuous function parameterized by a neural network. This reparameterization removes the time-sequential dependency of classical 4DVAR, enabling parallel-in-time optimization in parameter space. Physical constraints are incorporated directly through a physics-informed loss, simplifying implementation and reducing computational cost. We evaluate the method on the two-dimensional incompressible Navier--Stokes equations with Kolmogorov forcing. Compared to a baseline 4DVAR implementation, the neural reparameterized variants produce more stable initial condition estimates without spurious oscillations. Notably, unlike most machine learning-based approaches, our framework does not require access to ground-truth states or reanalysis data, broadening its applicability to settings with limited reference information.
Authors: Sadman Saumik Islam, Bruna Dalcin Baldasso, Davide Cattaneo, Xianta Jiang, Michelle Ploughman
Abstract: People with Multiple Sclerosis (MS) complain of problems with hand dexterity and cognitive fatigue. However, in many cases, impairments are subtle and difficult to detect. Functional near-infrared spectroscopy (fNIRS) is a non-invasive neuroimaging technique that measures brain hemodynamic responses during cognitive or motor tasks. We aimed to detect brain activity biomarkers that could explain subjective reports of cognitive fatigue while completing dexterous tasks and provide targets for future brain stimulation treatments. We recruited 15 people with MS who did not have a hand (Nine Hole Peg Test [NHPT]), mobility, or cognitive impairment, and 12 age- and sex-matched controls. Participants completed two types of hand dexterity tasks with their dominant hand, single task and dual task (NHPT while holding a ball between the fifth finger and hypothenar eminence of the same hand). We analyzed fNIRS data (oxygenated and deoxygenated hemoglobin levels) using a machine learning framework to classify MS patients from controls based on their brain activation patterns in bilateral prefrontal and sensorimotor cortices. The K-Nearest Neighbor classifier achieved an accuracy of 75.0% for single manual dexterity tasks and 66.7% for the more complex dual manual dexterity tasks. Using XAI, we found that the most important brain regions contributing to the machine learning model were the supramarginal/angular gyri and the precentral gyrus (sensory integration and motor regions) of the ipsilateral hemisphere, with suppressed activity and slower neurovascular response in the MS group. During both tasks, deoxygenated hemoglobin levels were better predictors than the conventional measure of oxygenated hemoglobin. This nonconventional method of fNIRS data analysis revealed novel brain activity biomarkers that can help develop personalized brain stimulation targets.
Authors: Zihan Yu, Guanren Wang, Jingtao Ding, Huandong Wang, Yong Li
Abstract: Symbolic regression discovers accurate and interpretable formulas to describe given data, thereby providing scientific insights for domain experts and promoting scientific discovery. However, existing symbolic regression methods often use complexity metrics as a proxy for interoperability, which only considers the size of the formula but ignores its internal mathematical structure. Therefore, while they can discover formulas with compact forms, the discovered formulas often have structures that are difficult to analyze or interpret mathematically. In this work, inspired by the observation that physical formulas are typically numerically stable under limited calculation precision, we propose the Effective Information Criterion (EIC). It treats formulas as information processing systems with specific internal structures and identifies the unreasonable structure in them by the loss of significant digits or the amplification of rounding noise as data flows through the system. We find that this criterion reveals the gap between the structural rationality of models discovered by existing symbolic regression algorithms and real-world physical formulas. Combining EIC with various search-based symbolic regression algorithms improves their performance on the Pareto frontier and reduces the irrational structure in the results. Combining EIC with generative-based algorithms reduces the number of samples required for pre-training, improving sample efficiency by 2~4 times. Finally, for different formulas with similar accuracy and complexity, EIC shows a 70.2% agreement with 108 human experts' preferences for formula interpretability, demonstrating that EIC, by measuring the unreasonable structures in formulas, actually reflects the formula's interpretability.
Authors: Yizhou Zhang, Ning Lv, Teng Wang, Jisheng Dang
Abstract: Group relative policy optimization (GRPO) has demonstrated significant potential in improving the reasoning capabilities of large language models (LLMs) via reinforcement learning. However, its practical deployment is impeded by an excessively slow training process, primarily attributed to the computationally intensive autoregressive generation of multiple responses per query, which makes the generation phase the primary performance bottleneck. Although speculative decoding presents a promising direction for acceleration, its direct application in GRPO achieves limited speedup under high-concurrency training conditions. To overcome this limitation, we propose a concurrency-aware speculative decoding framework that dynamically adjusts the drafting and verification strategy according to real-time concurrency levels, thereby maximizing the acceleration of the generation process. Furthermore, to address performance degradation arising from distributional drift between the evolving target model and the fixed draft model during training, we introduce an online draft learning mechanism that enables the draft model to continuously adapt using feedback signals from the target model. Experimental results across multiple mathematical reasoning datasets and models demonstrate that the proposed method achieves end-to-end speedups of 2.35x to 2.72x, significantly surpassing baseline approaches in efficiency. The code is available at https://github.com/yedaotian9/GRPO_speculative.
Authors: Kourosh Kakhi, Abbas Khosravi, Roohallah Alizadehsani, U. Rajendra Acharyab
Abstract: Fatigue detection using physiological signals is critical in domains such as transportation, healthcare, and performance monitoring. While most studies focus on single modalities, this work examines statistical relationships between signal pairs to improve classification robustness. Using the DROZY dataset, we extracted features from ECG, EMG, EOG, and EEG across 15 signal combinations and evaluated them with Decision Tree, Random Forest, Logistic Regression, and XGBoost. Results show that XGBoost with the EMG EEG combination achieved the best performance. SHAP analysis highlighted ECG EOG correlation as a key feature, and multi signal models consistently outperformed single signal ones. These findings demonstrate that feature level fusion of physiological signals enhances accuracy, interpretability, and practical applicability of fatigue monitoring systems.
Authors: Chang Liu, Bohao Zhao, Jingtao Ding, Yong Li
Abstract: Accurately forecasting chaotic systems, prevalent in domains such as weather prediction and fluid dynamics, remains a significant scientific challenge. The inherent sensitivity of these systems to initial conditions, coupled with a scarcity of observational data, severely constrains traditional modeling approaches. Since these models are typically trained for a specific system, they lack the generalization capacity necessary for real-world applications, which demand robust zero-shot or few-shot forecasting on novel or data-limited scenarios. To overcome this generalization barrier, we propose ChaosNexus, a foundation model pre-trained on a diverse corpus of chaotic dynamics. ChaosNexus employs a novel multi-scale architecture named ScaleFormer augmented with Mixture-of-Experts layers, to capture both universal patterns and system-specific behaviors. The model demonstrates state-of-the-art zero-shot generalization across both synthetic and real-world benchmarks. On a large-scale testbed comprising over 9,000 synthetic chaotic systems, it improves the fidelity of long-term attractor statistics by more than 40% compared to the leading baseline. This robust performance extends to real-world applications with exceptional data efficiency. For instance, in 5-day global weather forecasting, ChaosNexus achieves a competitive zero-shot mean error below 1 degree, a result that further improves with few-shot fine-tuning. Moreover, experiments on the scaling behavior of ChaosNexus provide a guiding principle for scientific foundation models: cross-system generalization stems from the diversity of training systems, rather than sheer data volume.
Authors: Akshay Trikha, Kyle Chu, Advait Gosai, Parker Szachta, Eric Weiner
Abstract: Predicting material properties is crucial for designing better batteries, semiconductors, and medical devices. Deep learning helps scientists quickly find promising materials by predicting their energy, forces, and stresses. Companies scale capacities of deep learning models in multiple domains, such as language modeling, and invest many millions of dollars into such models. Our team analyzes how scaling training data (giving models more information to learn from), model sizes (giving models more capacity to learn patterns), and compute (giving models more computational resources) for neural networks affects their performance for material property prediction. In particular, we trained both transformer and EquiformerV2 neural networks to predict material properties. We find empirical scaling laws for these models: we can predict how increasing each of the three hyperparameters (training data, model size, and compute) affects predictive performance. In particular, the loss $L$ can be measured with a power law relationship $L = \alpha \cdot N^{-\beta}$, where $\alpha$ and $\beta$ are constants while $N$ is the relevant hyperparameter. We also incorporate command-line arguments for changing training settings such as the amount of epochs, maximum learning rate, and whether mixed precision is enabled. Future work could entail further investigating scaling laws for other neural network models in this domain, such as GemNet and fully connected networks, to assess how they compare to the models we trained.
Authors: Chanwoong Park, Uijeong Jang, Ernest K. Ryu, Insoon Yang
Abstract: Sharpness-Aware Minimization (SAM) is a widely used method that steers training toward flatter minimizers, which typically generalize better. In this work, however, we show that SAM can converge to hallucinated minimizers -- points that are not minimizers of the original objective. We theoretically prove the existence of such hallucinated minimizers and establish conditions for local convergence to them. We further provide empirical evidence demonstrating that SAM can indeed converge to these points in practice. Finally, we propose a simple yet effective remedy for avoiding hallucinated minimizers.
Authors: The Viet Bui, Tien Mai, Hong Thanh Nguyen
Abstract: We study the problem of online multi-agent reinforcement learning (MARL) in environments with sparse rewards, where reward feedback is not provided at each interaction but only revealed at the end of a trajectory. This setting, though realistic, presents a fundamental challenge: the lack of intermediate rewards hinders standard MARL algorithms from effectively guiding policy learning. To address this issue, we propose a novel framework that integrates online inverse preference learning with multi-agent on-policy optimization into a unified architecture. At its core, our approach introduces an implicit multi-agent reward learning model, built upon a preference-based value-decomposition network, which produces both global and local reward signals. These signals are further used to construct dual advantage streams, enabling differentiated learning targets for the centralized critic and decentralized actors. In addition, we demonstrate how large language models (LLMs) can be leveraged to provide preference labels that enhance the quality of the learned reward model. Empirical evaluations on state-of-the-art benchmarks, including MAMuJoCo and SMACv2, show that our method achieves superior performance compared to existing baselines, highlighting its effectiveness in addressing sparse-reward challenges in online MARL.
Authors: Xunpeng Huang, Yingyu Lin, Nishant Jain, Kaibo Wang, Difan Zou, Yian Ma, Tong Zhang
Abstract: We study masked discrete diffusion -- a flexible paradigm for text generation in which tokens are progressively corrupted by special mask symbols before being denoised. Although this approach has demonstrated strong empirical performance, its theoretical complexity in high-dimensional settings remains insufficiently understood. Existing analyses largely focus on uniform discrete diffusion, and more recent attempts addressing masked diffusion either (1) overlook widely used Euler samplers, (2) impose restrictive bounded-score assumptions, or (3) fail to showcase the advantages of masked discrete diffusion over its uniform counterpart. To address this gap, we show that Euler samplers can achieve $\epsilon$-accuracy in total variation (TV) with $\tilde{O}(d^{2}\epsilon^{-3/2})$ discrete score evaluations, thereby providing the first rigorous analysis of typical Euler sampler in masked discrete diffusion. We then propose a Mask-Aware Truncated Uniformization (MATU) approach that both removes bounded-score assumptions and preserves unbiased discrete score approximation. By exploiting the property that each token can be unmasked at most once, MATU attains a nearly $\epsilon$-free complexity of $O(d\,\ln d\cdot (1-\epsilon^2))$. This result surpasses existing uniformization methods under uniform discrete diffusion, eliminating the $\ln(1/\epsilon)$ factor and substantially speeding up convergence. Our findings not only provide a rigorous theoretical foundation for masked discrete diffusion, showcasing its practical advantages over uniform diffusion for text generation, but also pave the way for future efforts to analyze diffusion-based language models developed under masking paradigm.
Authors: Rohan Deb, Qiaobo Li, Mayank Shrivastava, Arindam Banerjee
Abstract: Uniform bounds on sketched inner products of vectors or matrices underpin several important computational and statistical results in machine learning and randomized algorithms, including the Johnson-Lindenstrauss (J-L) lemma, the Restricted Isometry Property (RIP), randomized sketching, and approximate linear algebra. However, many modern analyses involve *sketched bilinear forms*, for which existing uniform bounds either do not apply or are not sharp on general sets. In this work, we develop a general framework to analyze such sketched bilinear forms and derive uniform bounds in terms of geometric complexities of the associated sets. Our approach relies on generic chaining and introduces new techniques for handling suprema over pairs of sets. We further extend these results to the setting where the bilinear form involves a sum of $T$ independent sketching matrices and show that the deviation scales as $\sqrt{T}$. This unified analysis recovers known results such as the J-L lemma as special cases, while extending RIP-type guarantees. Additionally, we obtain improved convergence bounds for sketched Federated Learning algorithms where such cross terms arise naturally due to sketched gradient compression, and design sketched variants of bandit algorithms with sharper regret bounds that depend on the geometric complexity of the action and parameter sets, rather than the ambient dimension.
Authors: Taejong Joo, Shu Ishida, Ivan Sosnovik, Bryan Lim, Sahand Rezaei-Shoshtari, Adam Gaier, Robert Giaquinto
Abstract: As a model-agnostic approach to long context modeling, multi-agent systems can process inputs longer than a large language model's context window without retraining or architectural modifications. However, their performance often heavily relies on hand-crafted multi-agent collaboration strategies and prompt engineering, which limit generalizability. In this work, we introduce a principled framework that formalizes the model-agnostic long context modeling problem as a compression problem, yielding an information-theoretic compression objective. Building on this framework, we propose Graph of Agents (GoA), which dynamically constructs an input-dependent collaboration structure that maximizes this objective. For Llama 3.1 8B and Qwen3 8B across six document question answering benchmarks, GoA improves the average $F_1$ score of retrieval-augmented generation by 5.7\% and a strong multi-agent baseline using a fixed collaboration structure by 16.35\%, respectively. Even with only a 2K context window, GoA surpasses the 128K context window Llama 3.1 8B on LongBench, showing a dramatic increase in effective context length. Our source code is available at https://github.com/tjoo512/graph-of-agents.
Authors: Shuaike Shen, Jiaqing Xie, Zhuo Yang, Antong Zhang, Shuzhou Sun, Ben Gao, Tianfan Fu, Biqing Qi, Yuqiang Li
Abstract: Recent advances in molecular foundation models have shown impressive performance in molecular property prediction and de novo molecular design, with promising applications in areas such as drug discovery and reaction prediction. Nevertheless, most existing approaches rely exclusively on SMILES representations and overlook both experimental spectra and 3D structural information-two indispensable sources for capturing molecular behavior in real-world scenarios. This limitation reduces their effectiveness in tasks where stereochemistry, spatial conformation, and experimental validation are critical. To overcome these challenges, we propose MolSpectLLM, a molecular foundation model pretrained on Qwen2.5-7B that unifies experimental spectroscopy with molecular 3D structure. By explicitly modeling molecular spectra, MolSpectLLM achieves state-of-the-art performance on spectrum-related tasks, with an average accuracy of 0.53 across NMR, IR, and MS benchmarks. MolSpectLLM also shows strong performance on the spectra analysis task, obtaining 15.5% sequence accuracy and 41.7% token accuracy on Spectra-to-SMILES, substantially outperforming large general-purpose LLMs. More importantly, MolSpectLLM not only achieves strong performance on molecular elucidation tasks, but also generates accurate 3D molecular structures directly from SMILES or spectral inputs, bridging spectral analysis, molecular elucidation, and molecular design.
Authors: Seong-Woong Shim, Myunsoo Kim, Jae Hyeon Cho, Byung-Jun Lee
Abstract: Retrieval-Augmented Generation (RAG) is a framework for grounding Large Language Models (LLMs) in external, up-to-date information. However, recent advancements in context window size allow LLMs to process inputs of up to 128K tokens or more, offering an alternative strategy: supplying the full document context directly to the model, rather than relying on RAG to retrieve a subset of contexts. Nevertheless, this emerging alternative strategy has notable limitations: (i) it is token-inefficient to handle large and potentially redundant contexts; (ii) it exacerbates the `lost in the middle' phenomenon; and (iii) under limited model capacity, it amplifies distraction, ultimately degrading LLM output quality. In this paper, we propose LDAR (Learning Distraction-Aware Retrieval), an adaptive retriever that learns to retrieve contexts in a way that mitigates interference from distracting passages, thereby achieving significantly higher performance with reduced token usage compared to long-context approaches. Extensive experiments across diverse LLM architectures and six knowledge-intensive benchmarks demonstrate the effectiveness and robustness of our approach, highlighting the importance of balancing the trade-off between information coverage and distraction.
Authors: Yifei Peng, Yaoli Liu, Enbo Xia, Yu Jin, Wang-Zhou Dai, Zhong Ren, Yao-Xiang Ding, Kun Zhou
Abstract: We propose ILP-CoT, a method that bridges Inductive Logic Programming (ILP) and Multimodal Large Language Models (MLLMs) for abductive logical rule induction. The task involves both discovering logical facts and inducing logical rules from a small number of unstructured textual or visual inputs, which still remain challenging when solely relying on ILP, due to the requirement of specified background knowledge and high computational cost, or MLLMs, due to the appearance of perceptual hallucinations. Based on the key observation that MLLMs could propose structure-correct rules even under hallucinations, our approach automatically builds ILP tasks with pruned search spaces based on the rule structure proposals from MLLMs, and utilizes ILP system to output rules built upon rectified logical facts and formal inductive reasoning. Its effectiveness is verified through challenging logical induction benchmarks, as well as a potential application of our approach, namely text-to-image customized generation with rule induction. Our code and data are released at https://github.com/future-item/ILP-CoT.
Authors: Chaoyang Luo, Yan Zou, Nanjing Huang
Abstract: Despite neural ordinary differential equations (Neural ODEs) exhibiting intrinsic robustness under input perturbations due to their dynamical systems nature, recent approaches often involve imposing Lyapunov-based stability conditions to provide formal robustness guarantees. However, a fundamental challenge remains: the tension between robustness and accuracy, primarily stemming from the difficulty in imposing appropriate stability conditions. To address this, we propose an adaptive stable learning framework named Zubov-Net, which innovatively reformulates Zubov's equation into a consistency characterization between regions of attraction (RoAs) and prescribed RoAs (PRoAs). Building on this consistency, we introduce a new paradigm for actively controlling the geometry of RoAs by directly optimizing PRoAs to reconcile accuracy and robustness. Our approach is realized through tripartite losses (consistency, classification, and separation losses) and a parallel boundary sampling algorithm that co-optimizes the Neural ODE and the Lyapunov function. To enhance the discriminativity of Lyapunov functions, we design an input-attention-based convex neural network via a softmax attention mechanism that focuses on equilibrium-relevant features and also serves as weight normalization to maintain training stability in deep architectures. Theoretically, we prove that minimizing the tripartite loss guarantees consistent alignment of PRoAs-RoAs, trajectory stability, and non-overlapping PRoAs. Moreover, we establish stochastic convex separability with tighter probability bounds and fewer dimensionality requirements to justify the convex design in Lyapunov functions. Experimentally, Zubov-Net maintains high classification accuracy while significantly improving robustness against various stochastic noises and adversarial attacks.
Authors: Aaron Tu, Weihao Xuan, Heli Qi, Xu Huang, Qingcheng Zeng, Shayan Talaei, Yijia Xiao, Peng Xia, Xiangru Tang, Yuchen Zhuang, Bing Hu, Hanqun Cao, Wenqi Shi, Tianang Leng, Rui Yang, Yingjian Chen, Ziqi Wang, Irene Li, Nan Liu, Huaxiu Yao, Li Erran Li, Ge Liu, Amin Saberi, Naoto Yokoya, Jure Leskovec, Yejin Choi, Fang Wu
Abstract: Reinforcement learning with verifiable rewards (RLVR) is a practical and scalable approach to enhancing large language models in areas such as math, code, and other structured tasks. Two questions motivate this paper: how much of the reported gains survive under strictly parity-controlled evaluation, and whether RLVR is cost-free or exacts a measurable tax. We argue that progress is real, but gains are often overstated due to three forces - an RLVR tax, evaluation pitfalls, and data contamination. Using a partial-prompt contamination audit and matched-budget reproductions across base and RL models, we show that several headline gaps shrink or vanish under clean, parity-controlled evaluation. We then propose a tax-aware training and evaluation protocol that co-optimizes accuracy, grounding, and calibrated abstention and standardizes budgeting and provenance checks. Applied to recent RLVR setups, this protocol yields more reliable estimates of reasoning gains and, in several cases, revises prior conclusions. Our position is constructive: RLVR is valuable and industry-ready; we advocate keeping its practical benefits while prioritizing reliability, safety, and measurement.
Authors: Yuka Hashimoto, Sho Sonoda, Isao Ishikawa, Masahiro Ikeda
Abstract: We derive a new Rademacher complexity bound for deep neural networks using Koopman operators, group representations, and reproducing kernel Hilbert spaces (RKHSs). The proposed bound describes why the models with high-rank weight matrices generalize well. Although there are existing bounds that attempt to describe this phenomenon, these existing bounds can be applied to limited types of models. We introduce an algebraic representation of neural networks and a kernel function to construct an RKHS to derive a bound for a wider range of realistic models. This work paves the way for the Koopman-based theory for Rademacher complexity bounds to be valid for more practical situations.
Authors: Zihuan Qiu, Yi Xu, Fanman Meng, Runtong Zhang, Linfeng Xu, Qingbo Wu, Hongliang Li
Abstract: Class Incremental Learning (CIL) aims to sequentially acquire knowledge of new classes without forgetting previously learned ones. Despite recent progress, current CIL methods still exhibit significant performance gaps compared to their oracle counterparts-models trained with full access to historical data. Inspired by recent insights on Linear Mode Connectivity (LMC), we revisit the geometric properties of oracle solutions in CIL and uncover a fundamental observation: these oracle solutions typically maintain low-loss linear connections to the optimum of previous tasks. Motivated by this finding, we propose Increment Vector Transformation (IVT), a novel plug-and-play framework designed to mitigate catastrophic forgetting during training. Rather than directly following CIL updates, IVT periodically teleports the model parameters to transformed solutions that preserve linear connectivity to previous task optimum. By maintaining low-loss along these connecting paths, IVT effectively ensures stable performance on previously learned tasks. The transformation is efficiently approximated using diagonal Fisher Information Matrices, making IVT suitable for both exemplar-free and exemplar-based scenarios, and compatible with various initialization strategies. Extensive experiments on CIFAR-100, FGVCAircraft, ImageNet-Subset, and ImageNet-Full demonstrate that IVT consistently enhances the performance of strong CIL baselines. Specifically, on CIFAR-100, IVT improves the last accuracy of the PASS baseline by +5.12% and reduces forgetting by 2.54%. For the CLIP-pre-trained SLCA baseline on FGVCAircraft, IVT yields gains of +14.93% in average accuracy and +21.95% in last accuracy. The code will be released.
Authors: Zhengyan Wan, Yidong Ouyang, Liyan Xie, Fang Fang, Hongyuan Zha, Guang Cheng
Abstract: Guidance provides a simple and effective framework for posterior sampling by steering the generation process towards the desired distribution. When modeling discrete data, existing approaches mostly focus on guidance with the first-order Taylor approximation to improve the sampling efficiency. However, such an approximation is inappropriate in discrete state spaces since the approximation error could be large. A novel guidance framework for discrete data is proposed to address this problem: We derive the exact transition rate for the desired distribution given a learned discrete flow matching model, leading to guidance that only requires a single forward pass in each sampling step, significantly improving efficiency. This unified novel framework is general enough, encompassing existing guidance methods as special cases, and it can also be seamlessly applied to the masked diffusion model. We demonstrate the effectiveness of our proposed guidance on energy-guided simulations and preference alignment on text-to-image generation and multimodal understanding tasks. The code is available through https://github.com/WanZhengyan/Discrete-Guidance-Matching/tree/main.
URLs: https://github.com/WanZhengyan/Discrete-Guidance-Matching/tree/main.
Authors: Fumin Wang
Abstract: Interpretability is one of the considerations when applying machine learning to high-stakes fields such as healthcare that involve matters of life safety. Generalized Additive Models (GAMs) enhance interpretability by visualizing shape functions. Nevertheless, to preserve interpretability, GAMs omit higher-order interaction effects (beyond pairwise interactions), which imposes significant constraints on their predictive performance. We observe that Curve Ergodic Set Regression (CESR), a multiplicative model, naturally enables the visualization of its shape functions and simultaneously incorporates both interactions among all features and individual feature effects. Nevertheless, CESR fails to demonstrate superior performance compared to GAMs. We introduce Multiplicative-Additive Constrained Models (MACMs), which augment CESR with an additive part to disentangle the intertwined coefficients of its interactive and independent terms, thus effectively broadening the hypothesis space. The model is composed of a multiplicative part and an additive part, whose shape functions can both be naturally visualized, thereby assisting users in interpreting how features participate in the decision-making process. Consequently, MACMs constitute an improvement over both CESR and GAMs. The experimental results indicate that neural network-based MACMs significantly outperform both CESR and the current state-of-the-art GAMs in terms of predictive performance.
Authors: Yunchen Li, Shaohui Lin, Zhou Yu
Abstract: This paper investigates the theoretical behavior of generative models under finite training populations. Within the stochastic interpolation generative framework, we derive closed-form expressions for the optimal velocity field and score function when only a finite number of training samples are available. We demonstrate that, under some regularity conditions, the deterministic generative process exactly recovers the training samples, while the stochastic generative process manifests as training samples with added Gaussian noise. Beyond the idealized setting, we consider model estimation errors and introduce formal definitions of underfitting and overfitting specific to generative models. Our theoretical analysis reveals that, in the presence of estimation errors, the stochastic generation process effectively produces convex combinations of training samples corrupted by a mixture of uniform and Gaussian noise. Experiments on generation tasks and downstream tasks such as classification support our theory.
Authors: Amine Bechar, Adel Oulefki, Abbes Amira, Fatih Kurogollu, Yassine Himeur
Abstract: The analysis of complex building time-series for actionable insights and recommendations remains challenging due to the nonlinear and multi-scale characteristics of energy data. To address this, we propose a framework that fine-tunes visual language large models (VLLMs) on 3D graphical representations of the data. The approach converts 1D time-series into 3D representations using continuous wavelet transforms (CWTs) and recurrence plots (RPs), which capture temporal dynamics and localize frequency anomalies. These 3D encodings enable VLLMs to visually interpret energy-consumption patterns, detect anomalies, and provide recommendations for energy efficiency. We demonstrate the framework on real-world building-energy datasets, where fine-tuned VLLMs successfully monitor building states, identify recurring anomalies, and generate optimization recommendations. Quantitatively, the Idefics-7B VLLM achieves validation losses of 0.0952 with CWTs and 0.1064 with RPs on the University of Sharjah energy dataset, outperforming direct fine-tuning on raw time-series data (0.1176) for anomaly detection. This work bridges time-series analysis and visualization, providing a scalable and interpretable framework for energy analytics.
Authors: O. Duranthon, P. Marion, C. Boyer, B. Loureiro, L. Zdeborov\'a
Abstract: Large language models rely on attention mechanisms with a softmax activation. Yet the dominance of softmax over alternatives (e.g., component-wise or linear) remains poorly understood, and many theoretical works have focused on the easier-to-analyze linearized attention. In this work, we address this gap through a principled study of the single-location regression task, where the output depends on a linear transformation of a single input token at a random location. Building on ideas from statistical physics, we develop an analysis of attention-based predictors in the high-dimensional limit, where generalization performance is captured by a small set of order parameters. At the population level, we show that softmax achieves the Bayes risk, whereas linear attention fundamentally falls short. We then examine other activation functions to identify which properties are necessary for optimal performance. Finally, we analyze the finite-sample regime: we provide an asymptotic characterization of the test error and show that, while softmax is no longer Bayes-optimal, it consistently outperforms linear attention. We discuss the connection with optimization by gradient-based algorithms.
Authors: Xianghua Zeng, Hao Peng, Angsheng Li, Yicheng Pan
Abstract: Diffusion-based generative methods have shown promising potential for modeling trajectories from offline reinforcement learning (RL) datasets, and hierarchical diffusion has been introduced to mitigate variance accumulation and computational challenges in long-horizon planning tasks. However, existing approaches typically assume a fixed two-layer diffusion hierarchy with a single predefined temporal scale, which limits adaptability to diverse downstream tasks and reduces flexibility in decision making. In this work, we propose SIHD, a novel Structural Information-based Hierarchical Diffusion framework for effective and stable offline policy learning in long-horizon environments with sparse rewards. Specifically, we analyze structural information embedded in offline trajectories to construct the diffusion hierarchy adaptively, enabling flexible trajectory modeling across multiple temporal scales. Rather than relying on reward predictions from localized sub-trajectories, we quantify the structural information gain of each state community and use it as a conditioning signal within the corresponding diffusion layer. To reduce overreliance on offline datasets, we introduce a structural entropy regularizer that encourages exploration of underrepresented states while avoiding extrapolation errors from distributional shifts. Extensive evaluations on challenging offline RL tasks show that SIHD significantly outperforms state-of-the-art baselines in decision-making performance and demonstrates superior generalization across diverse scenarios.
Authors: Taeyoung Yun, Pierre-Luc St-Charles, Jinkyoo Park, Yoshua Bengio, Minsu Kim
Abstract: We address the challenge of generating diverse attack prompts for large language models (LLMs) that elicit harmful behaviors (e.g., insults, sexual content) and are used for safety fine-tuning. Rather than relying on manual prompt engineering, attacker LLMs can be trained with reinforcement learning (RL) to automatically generate such prompts using only a toxicity classifier as a reward. However, capturing a wide range of harmful behaviors is a significant challenge that requires explicit diversity objectives. Existing diversity-seeking RL methods often collapse to limited modes: once high-reward prompts are found, exploration of new regions is discouraged. Inspired by the active learning paradigm that encourages adaptive exploration, we introduce \textit{Active Attacks}, a novel RL-based red-teaming algorithm that adapts its attacks as the victim evolves. By periodically safety fine-tuning the victim LLM with collected attack prompts, rewards in exploited regions diminish, which forces the attacker to seek unexplored vulnerabilities. This process naturally induces an easy-to-hard exploration curriculum, where the attacker progresses beyond easy modes toward increasingly difficult ones. As a result, Active Attacks uncovers a wide range of local attack modes step by step, and their combination achieves wide coverage of the multi-mode distribution. Active Attacks, a simple plug-and-play module that seamlessly integrates into existing RL objectives, unexpectedly outperformed prior RL-based methods -- including GFlowNets, PPO, and REINFORCE -- by improving cross-attack success rates against GFlowNets, the previous state-of-the-art, from 0.07% to 31.28% (a relative gain greater than $400\ \times$) with only a 6% increase in computation. Our code is publicly available \href{https://github.com/dbsxodud-11/active_attacks}{here}.
Authors: Zhichao Sheng, Shilin Zhou, Chen Gong, Zhenghua Li
Abstract: Large Audio Language Models (LALMs), powered by the chain-of-thought (CoT) paradigm, have shown remarkable reasoning capabilities. Intuitively, different problems often require varying depths of reasoning. While some methods can determine whether to reason for a given problem, they typically lack a fine-grained mechanism to modulate how much to reason. This often results in a ``one-size-fits-all'' reasoning depth, which generates redundant overthinking for simple questions while failing to allocate sufficient thought to complex ones. In this paper, we conduct an in-depth analysis of LALMs and find that an effective and efficient LALM should reason smartly by adapting its reasoning depth to the problem's complexity. To achieve this, we propose a difficulty-adaptive reasoning method for LALMs. Specifically, we propose a reward function that dynamically links reasoning length to the model's perceived problem difficulty. This reward encourages shorter, concise reasoning for easy tasks and more elaborate, in-depth reasoning for complex ones. Extensive experiments demonstrate that our method is both effective and efficient, simultaneously improving task performance and significantly reducing the average reasoning length. Further analysis on reasoning structure paradigm offers valuable insights for future work.
Authors: Feng Jiang, Amina Mollaysa, Hehuan Ma, Tommaso Mansi, Junzhou Huang, Mangal Prakash, Rui Liao
Abstract: Drug target interaction (DTI) prediction is a cornerstone of computational drug discovery, enabling rational design, repurposing, and mechanistic insights. While deep learning has advanced DTI modeling, existing approaches primarily rely on SMILES protein pairs and fail to exploit the rich multimodal information available for small molecules and proteins. We introduce GRAMDTI, a pretraining framework that integrates multimodal molecular and protein inputs into unified representations. GRAMDTI extends volume based contrastive learning to four modalities, capturing higher-order semantic alignment beyond conventional pairwise approaches. To handle modality informativeness, we propose adaptive modality dropout, dynamically regulating each modality's contribution during pre-training. Additionally, IC50 activity measurements, when available, are incorporated as weak supervision to ground representations in biologically meaningful interaction strengths. Experiments on four publicly available datasets demonstrate that GRAMDTI consistently outperforms state of the art baselines. Our results highlight the benefits of higher order multimodal alignment, adaptive modality utilization, and auxiliary supervision for robust and generalizable DTI prediction.
Authors: Cheng Jin, Qitan Shi, Yuantao Gu
Abstract: Classifier-Free Guidance (CFG) is widely used to improve conditional fidelity in diffusion models, but its impact on sampling dynamics remains poorly understood. Prior studies, often restricted to unimodal conditional distributions or simplified cases, provide only a partial picture. We analyze CFG under multimodal conditionals and show that the sampling process unfolds in three successive stages. In the Direction Shift stage, guidance accelerates movement toward the weighted mean, introducing initialization bias and norm growth. In the Mode Separation stage, local dynamics remain largely neutral, but the inherited bias suppresses weaker modes, reducing global diversity. In the Concentration stage, guidance amplifies within-mode contraction, diminishing fine-grained variability. This unified view explains a widely observed phenomenon: stronger guidance improves semantic alignment but inevitably reduces diversity. Experiments support these predictions, showing that early strong guidance erodes global diversity, while late strong guidance suppresses fine-grained variation. Moreover, our theory naturally suggests a time-varying guidance schedule, and empirical results confirm that it consistently improves both quality and diversity.
Authors: Yajie Qi, Wei Wei, Lin Li, Lijun Zhang, Zhidong Gao, Da Wang, Huizhong Song
Abstract: Real-world decision-making tasks typically occur in complex and open environments, posing significant challenges to reinforcement learning (RL) agents' exploration efficiency and long-horizon planning capabilities. A promising approach is LLM-enhanced RL, which leverages the rich prior knowledge and strong planning capabilities of LLMs to guide RL agents in efficient exploration. However, existing methods mostly rely on frequent and costly LLM invocations and suffer from limited performance due to the semantic mismatch. In this paper, we introduce a Structured Goal-guided Reinforcement Learning (SGRL) method that integrates a structured goal planner and a goal-conditioned action pruner to guide RL agents toward efficient exploration. Specifically, the structured goal planner utilizes LLMs to generate a reusable, structured function for goal generation, in which goals are prioritized. Furthermore, by utilizing LLMs to determine goals' priority weights, it dynamically generates forward-looking goals to guide the agent's policy toward more promising decision-making trajectories. The goal-conditioned action pruner employs an action masking mechanism that filters out actions misaligned with the current goal, thereby constraining the RL agent to select goal-consistent policies. We evaluate the proposed method on Crafter and Craftax-Classic, and experimental results demonstrate that SGRL achieves superior performance compared to existing state-of-the-art methods.
Authors: Jianrong Ding, Muxi Chen, Chenchen Zhao, Qiang Xu
Abstract: Standard Sparse Autoencoders (SAEs) excel at discovering a dictionary of a model's learned features, offering a powerful observational lens. However, the ambiguous and ungrounded nature of these features makes them unreliable instruments for the active, causal probing of model behavior. To solve this, we introduce Concept-SAE, a framework that forges semantically grounded concept tokens through a novel hybrid disentanglement strategy. We first quantitatively demonstrate that our dual-supervision approach produces tokens that are remarkably faithful and spatially localized, outperforming alternative methods in disentanglement. This validated fidelity enables two critical applications: (1) we probe the causal link between internal concepts and predictions via direct intervention, and (2) we probe the model's failure modes by systematically localizing adversarial vulnerabilities to specific layers. Concept-SAE provides a validated blueprint for moving beyond correlational interpretation to the mechanistic, causal probing of model behavior.
Authors: Hugh Xuechen Liu, K{\i}van\c{c} Tatar
Abstract: Bipartite knowledge graphs in niche domains are typically data-poor and edge-sparse, which hinders link prediction. We introduce AEGIS (Authentic Edge Growth In Sparsity), an edge-only augmentation framework that resamples existing training edges -either uniformly simple or with inverse-degree bias degree-aware -thereby preserving the original node set and sidestepping fabricated endpoints. To probe authenticity across regimes, we consider naturally sparse graphs (game design pattern's game-pattern network) and induce sparsity in denser benchmarks (Amazon, MovieLens) via high-rate bond percolation. We evaluate augmentations on two complementary metrics: AUC-ROC (higher is better) and the Brier score (lower is better), using two-tailed paired t-tests against sparse baselines. On Amazon and MovieLens, copy-based AEGIS variants match the baseline while the semantic KNN augmentation is the only method that restores AUC and calibration; random and synthetic edges remain detrimental. On the text-rich GDP graph, semantic KNN achieves the largest AUC improvement and Brier score reduction, and simple also lowers the Brier score relative to the sparse control. These findings position authenticity-constrained resampling as a data-efficient strategy for sparse bipartite link prediction, with semantic augmentation providing an additional boost when informative node descriptions are available.
Authors: Shilei Cao, Hehai Lin, Jiashun Cheng, Yang Liu, Guowen Li, Xuehe Wang, Juepeng Zheng, Haoyuan Liang, Meng Jin, Chengwei Qin, Hong Cheng, Haohuan Fu
Abstract: While recent advances in machine learning have equipped Weather Foundation Models (WFMs) with substantial generalization capabilities across diverse downstream tasks, the escalating computational requirements associated with their expanding scale increasingly hinder practical deployment. Current Parameter-Efficient Fine-Tuning (PEFT) methods, designed for vision or language tasks, fail to address the unique challenges of weather downstream tasks, such as variable heterogeneity, resolution diversity, and spatiotemporal coverage variations, leading to suboptimal performance when applied to WFMs. To bridge this gap, we introduce WeatherPEFT, a novel PEFT framework for WFMs incorporating two synergistic innovations. First, during the forward pass, Task-Adaptive Dynamic Prompting (TADP) dynamically injects the embedding weights within the encoder to the input tokens of the pre-trained backbone via internal and external pattern extraction, enabling context-aware feature recalibration for specific downstream tasks. Furthermore, during backpropagation, Stochastic Fisher-Guided Adaptive Selection (SFAS) not only leverages Fisher information to identify and update the most task-critical parameters, thereby preserving invariant pre-trained knowledge, but also introduces randomness to stabilize the selection. We demonstrate the effectiveness and efficiency of WeatherPEFT on three downstream tasks, where existing PEFT methods show significant gaps versus Full-Tuning, and WeatherPEFT achieves performance parity with Full-Tuning using fewer trainable parameters. The code of this work will be released.
Authors: Panagiotis Giannoulis, Yorgos Pantis, Christos Tzamos
Abstract: Despite their proficiency in various language tasks, Large Language Models (LLMs) struggle with combinatorial problems like Satisfiability, Traveling Salesman Problem, or even basic arithmetic. We address this gap through a novel approach for solving problems in the class NP. We focus on the paradigmatic task of Sudoku and achieve state-of-the-art accuracy (99\%) compared to prior neuro-symbolic approaches. Unlike prior work that used custom architectures, our method employs a vanilla decoder-only Transformer (GPT-2) without external tools or function calling. Our method integrates imitation learning of simple Sudoku rules with an explicit Depth-First Search (DFS) exploration strategy involving informed guessing and backtracking. Moving beyond imitation learning, we seek to minimize the number of guesses until reaching a solution. We provide a rigorous analysis of this setup formalizing its connection to a contextual variant of Min-Sum Set Cover, a well-studied problem in algorithms and stochastic optimization.
Authors: Haodong Pan, Yusong Wang, Nanning Zheng, Caijui Jiang
Abstract: Geometric graph neural networks (GNNs) excel at capturing molecular geometry, yet their locality-biased message passing hampers the modeling of long-range interactions. Current solutions have fundamental limitations: extending cutoff radii causes computational costs to scale cubically with distance; physics-inspired kernels (e.g., Coulomb, dispersion) are often system-specific and lack generality; Fourier-space methods require careful tuning of multiple parameters (e.g., mesh size, k-space cutoff) with added computational overhead. We introduce Multi-stage Clustered Global Modeling (MCGM), a lightweight, plug-and-play module that endows geometric GNNs with hierarchical global context through efficient clustering operations. MCGM builds a multi-resolution hierarchy of atomic clusters, distills global information via dynamic hierarchical clustering, and propagates this context back through learned transformations, ultimately reinforcing atomic features via residual connections. Seamlessly integrated into four diverse backbone architectures, MCGM reduces OE62 energy prediction error by an average of 26.2%. On AQM, MCGM achieves state-of-the-art accuracy (17.0 meV for energy, 4.9 meV/{\AA} for forces) while using 20% fewer parameters than Neural P3M. Code will be made available upon acceptance.
Authors: Anton Korznikov, Andrey Galichin, Alexey Dontsov, Oleg Rogov, Elena Tutubalina, Ivan Oseledets
Abstract: Sparse autoencoders (SAEs) are a technique for sparse decomposition of neural network activations into human-interpretable features. However, current SAEs suffer from feature absorption, where specialized features capture instances of general features creating representation holes, and feature composition, where independent features merge into composite representations. In this work, we introduce Orthogonal SAE (OrtSAE), a novel approach aimed to mitigate these issues by enforcing orthogonality between the learned features. By implementing a new training procedure that penalizes high pairwise cosine similarity between SAE features, OrtSAE promotes the development of disentangled features while scaling linearly with the SAE size, avoiding significant computational overhead. We train OrtSAE across different models and layers and compare it with other methods. We find that OrtSAE discovers 9% more distinct features, reduces feature absorption (by 65%) and composition (by 15%), improves performance on spurious correlation removal (+6%), and achieves on-par performance for other downstream tasks compared to traditional SAEs.
Authors: Zhihua Zhong, Xuanyang Huang
Abstract: Latent space is one of the key concepts in generative AI, offering powerful means for creative exploration through vector manipulation. However, diffusion models like Stable Diffusion lack the intuitive latent vector control found in GANs, limiting their flexibility for artistic expression. This paper introduces \workname, a framework for integrating customizable latent space operations into the diffusion process. By enabling direct manipulation of conceptual and spatial representations, this approach expands creative possibilities in generative art. We demonstrate the potential of this framework through two artworks, \textit{Infinitepedia} and \textit{Latent Motion}, highlighting its use in conceptual blending and dynamic motion generation. Our findings reveal latent space structures with semantic and meaningless regions, offering insights into the geometry of diffusion models and paving the way for further explorations of latent space.
Authors: Suman Sanyal
Abstract: We propose Convexity-Driven Projection (CDP), a boundary-free linear method for dimensionality reduction of point clouds that targets preserving detour-induced local non-convexity. CDP builds a $k$-NN graph, identifies admissible pairs whose Euclidean-to-shortest-path ratios are below a threshold, and aggregates their normalized directions to form a positive semidefinite non-convexity structure matrix. The projection uses the top-$k$ eigenvectors of the structure matrix. We give two verifiable guarantees. A pairwise a-posteriori certificate that bounds the post-projection distortion for each admissible pair, and an average-case spectral bound that links expected captured direction energy to the spectrum of the structure matrix, yielding quantile statements for typical distortion. Our evaluation protocol reports fixed- and reselected-pairs detour errors and certificate quantiles, enabling practitioners to check guarantees on their data.
Authors: Yuki Ichihara, Yuu Jinnai, Tetsuro Morimura, Mitsuki Sakamoto, Ryota Mitsuhashi, Eiji Uchibe
Abstract: Group Relative Policy Optimization (GRPO) has been shown to be an effective algorithm when an accurate reward model is available. However, such a highly reliable reward model is not available in many real-world tasks. In this paper, we particularly focus on multi-objective settings, in which we identify that GRPO is vulnerable to reward hacking, optimizing only one of the objectives at the cost of the others. To address this issue, we propose MO-GRPO, an extension of GRPO with a simple normalization method to reweight the reward functions automatically according to the variances of their values. We first show analytically that MO-GRPO ensures that all reward functions contribute evenly to the loss function while preserving the order of preferences, eliminating the need for manual tuning of the reward functions' scales. Then, we evaluate MO-GRPO experimentally in four domains: (i) the multi-armed bandits problem, (ii) simulated control task (Mo-Gymnasium), (iii) machine translation tasks on the WMT benchmark (En-Ja, En-Zh), and (iv) instruction following task. MO-GRPO achieves stable learning by evenly distributing correlations among the components of rewards, outperforming GRPO, showing MO-GRPO to be a promising algorithm for multi-objective reinforcement learning problems.
Authors: Yi Ding, Muyun Jiang, Weibang Jiang, Shuailei Zhang, Xinliang Zhou, Chenyu Liu, Shanglin Li, Yong Li, Cuntai Guan
Abstract: Electroencephalography (EEG) is a non-invasive technique for recording brain electrical activity, widely used in brain-computer interface (BCI) and healthcare. Recent EEG foundation models trained on large-scale datasets have shown improved performance and generalizability over traditional decoding methods, yet significant challenges remain. Existing models often fail to explicitly capture channel-to-channel and region-to-region interactions, which are critical sources of information inherently encoded in EEG signals. Due to varying channel configurations across datasets, they either approximate spatial structure with self-attention or restrict training to a limited set of common channels, sacrificing flexibility and effectiveness. Moreover, although EEG datasets reflect diverse brain states such as emotion, motor, and others, current models rarely learn state-aware representations during self-supervised pre-training. To address these gaps, we propose BrainPro, a large EEG model that introduces a retrieval-based spatial learning block to flexibly capture channel- and region-level interactions across varying electrode layouts, and a brain state-decoupling block that enables state-aware representation learning through parallel encoders with decoupling and region-aware reconstruction losses. This design allows BrainPro to adapt seamlessly to diverse tasks and hardware settings. Pre-trained on an extensive EEG corpus, BrainPro achieves state-of-the-art performance and robust generalization across nine public BCI datasets. Our codes and the pre-trained weights will be released.
Authors: Hua Yuan, Ning Xu, Xin Geng, Yong Rui
Abstract: Since the advent of knowledge distillation, much research has focused on how the soft labels generated by the teacher model can be utilized effectively. Existing studies points out that the implicit knowledge within soft labels originates from the multi-view structure present in the data. Feature variations within samples of the same class allow the student model to generalize better by learning diverse representations. However, in existing distillation methods, teacher models predominantly adhere to ground-truth labels as targets, without considering the diverse representations within the same class. Therefore, we propose incorporating an intra-class contrastive loss during teacher training to enrich the intra-class information contained in soft labels. In practice, we find that intra-class loss causes instability in training and slows convergence. To mitigate these issues, margin loss is integrated into intra-class contrastive learning to improve the training stability and convergence speed. Simultaneously, we theoretically analyze the impact of this loss on the intra-class distances and inter-class distances. It has been proved that the intra-class contrastive loss can enrich the intra-class diversity. Experimental results demonstrate the effectiveness of the proposed method.
Authors: Hua Yuan, Xuran Meng, Qiufeng Wang, Shiyu Xia, Ning Xu, Xu Yang, Jing Wang, Xin Geng, Yong Rui
Abstract: Parameter transfer is a central paradigm in transfer learning, enabling knowledge reuse across tasks and domains by sharing model parameters between upstream and downstream models. However, when only a subset of parameters from the upstream model is transferred to the downstream model, there remains a lack of theoretical understanding of the conditions under which such partial parameter reuse is beneficial and of the factors that govern its effectiveness. To address this gap, we analyze a setting in which both the upstream and downstream models are ReLU convolutional neural networks (CNNs). Within this theoretical framework, we characterize how the inherited parameters act as carriers of universal knowledge and identify key factors that amplify their beneficial impact on the target task. Furthermore, our analysis provides insight into why, in certain cases, transferring parameters can lead to lower test accuracy on the target task than training a new model from scratch. Numerical experiments and real-world data experiments are conducted to empirically validate our theoretical findings.
Authors: Anton Korznikov, Andrey Galichin, Alexey Dontsov, Oleg Y. Rogov, Ivan Oseledets, Elena Tutubalina
Abstract: Activation steering is a promising technique for controlling LLM behavior by adding semantically meaningful vectors directly into a model's hidden states during inference. It is often framed as a precise, interpretable, and potentially safer alternative to fine-tuning. We demonstrate the opposite: steering systematically breaks model alignment safeguards, making it comply with harmful requests. Through extensive experiments on different model families, we show that even steering in a random direction can increase the probability of harmful compliance from 0% to 2-27%. Alarmingly, steering benign features from a sparse autoencoder (SAE), a common source of interpretable directions, increases these rates by a further 2-4%. Finally, we show that combining 20 randomly sampled vectors that jailbreak a single prompt creates a universal attack, significantly increasing harmful compliance on unseen requests. These results challenge the paradigm of safety through interpretability, showing that precise control over model internals does not guarantee precise control over model behavior.
Authors: Li Xia, Zheng Liu, Sili Huang, Wei Tang, Xuan Liu
Abstract: Federated Learning (FL) preserves privacy by keeping raw data local, yet Gradient Inversion Attacks (GIAs) pose significant threats. In FedAVG multi-step scenarios, attackers observe only aggregated gradients, making data reconstruction challenging. Existing surrogate model methods like SME assume linear parameter trajectories, but we demonstrate this severely underestimates SGD's nonlinear complexity, fundamentally limiting attack effectiveness. We propose Non-Linear Surrogate Model Extension (NL-SME), the first method to introduce nonlinear parametric trajectory modeling for GIAs. Our approach replaces linear interpolation with learnable quadratic B\'ezier curves that capture SGD's curved characteristics through control points, combined with regularization and dvec scaling mechanisms for enhanced expressiveness. Extensive experiments on CIFAR-100 and FEMNIST datasets show NL-SME significantly outperforms baselines across all metrics, achieving order-of-magnitude improvements in cosine similarity loss while maintaining computational efficiency.This work exposes heightened privacy vulnerabilities in FL's multi-step update paradigm and offers novel perspectives for developing robust defense strategies.
Authors: Zhipu Cui, Johannes Lutzeyer
Abstract: Graph Neural Networks (GNNs) have achieved remarkable success across a range of learning tasks. However, scaling GNNs to large graphs remains a significant challenge, especially for graph-level tasks. In this work, we introduce SHAKE-GNN, a novel scalable graph-level GNN framework based on a hierarchy of Kirchhoff Forests, a class of random spanning forests used to construct stochastic multi-resolution decompositions of graphs. SHAKE-GNN produces multi-scale representations, enabling flexible trade-offs between efficiency and performance. We introduce an improved, data-driven strategy for selecting the trade-off parameter and analyse the time-complexity of SHAKE-GNN. Experimental results on multiple large-scale graph classification benchmarks demonstrate that SHAKE-GNN achieves competitive performance while offering improved scalability.
Authors: Marina Ceccon, Alessandro Fabris, Goran Radanovi\'c, Asia J. Biega, Gian Antonio Susto
Abstract: Algorithmic recourse seeks to provide individuals with actionable recommendations that increase their chances of receiving favorable outcomes from automated decision systems (e.g., loan approvals). While prior research has emphasized robustness to model updates, considerably less attention has been given to the temporal dynamics of recourse--particularly in competitive, resource-constrained settings where recommendations shape future applicant pools. In this work, we present a novel time-aware framework for algorithmic recourse, explicitly modeling how candidate populations adapt in response to recommendations. Additionally, we introduce a novel reinforcement learning (RL)-based recourse algorithm that captures the evolving dynamics of the environment to generate recommendations that are both feasible and valid. We design our recommendations to be durable, supporting validity over a predefined time horizon T. This durability allows individuals to confidently reapply after taking time to implement the suggested changes. Through extensive experiments in complex simulation environments, we show that our approach substantially outperforms existing baselines, offering a superior balance between feasibility and long-term validity. Together, these results underscore the importance of incorporating temporal and behavioral dynamics into the design of practical recourse systems.
Authors: Maria Iannario, Dae-Jin Lee, Manuele Leonelli
Abstract: Psychological attributes rarely operate in isolation: coaches reason about networks of related traits. We analyze a new dataset of 164 female volleyball players from Italy's C and D leagues that combines standardized psychological profiling with background information. To learn directed relationships among mixed-type variables (ordinal questionnaire scores, categorical demographics, continuous indicators), we introduce latent MMHC, a hybrid structure learner that couples a latent Gaussian copula and a constraint-based skeleton with a constrained score-based refinement to return a single DAG. We also study a bootstrap-aggregated variant for stability. In simulations spanning sample size, sparsity, and dimension, latent Max-Min Hill-Climbing (MMHC) attains lower structural Hamming distance and higher edge recall than recent copula-based learners while maintaining high specificity. Applied to volleyball, the learned network organizes mental skills around goal setting and self-confidence, with emotional arousal linking motivation and anxiety, and locates Big-Five traits (notably neuroticism and extraversion) upstream of skill clusters. Scenario analyses quantify how improvements in specific skills propagate through the network to shift preparation, confidence, and self-esteem. The approach provides an interpretable, data-driven framework for profiling psychological traits in sport and for decision support in athlete development.
Authors: David Benfield, Phan Tu Vuong, Alain Zemkoho
Abstract: Adversarial machine learning challenges the assumption that the underlying distribution remains consistent throughout the training and implementation of a prediction model. In particular, adversarial evasion considers scenarios where adversaries adapt their data to influence particular outcomes from established prediction models, such scenarios arise in applications such as spam email filtering, malware detection and fake-image generation, where security methods must be actively updated to keep up with the ever-improving generation of malicious data. Game theoretic models have been shown to be effective at modelling these scenarios and hence training resilient predictors against such adversaries. Recent advancements in the use of pessimistic bilevel optimsiation which remove assumptions about the convexity and uniqueness of the adversary's optimal strategy have proved to be particularly effective at mitigating threats to classifiers due to its ability to capture the antagonistic nature of the adversary. However, this formulation has not yet been adapted to regression scenarios. This article serves to propose a pessimistic bilevel optimisation program for regression scenarios which makes no assumptions on the convexity or uniqueness of the adversary's solutions.
Authors: Chao Wang, Tao Yang, Hongtao Tian, Yunsheng Shi, Qiyao Ma, Xiaotao Liu, Ting Yao, Wenbo Ding
Abstract: Critic-free methods like GRPO reduce memory demands by estimating advantages from multiple rollouts but tend to converge slowly, as critical learning signals are diluted by an abundance of uninformative samples and tokens. To tackle this challenge, we propose the \textbf{Dynamic Dual-Level Down-Sampling (D$^3$S)} framework that prioritizes the most informative samples and tokens across groups to improve the efficient of policy optimization. D$^3$S operates along two levels: (1) the sample-level, which selects a subset of rollouts to maximize advantage variance ($\text{Var}(A)$). We theoretically proven that this selection is positively correlated with the upper bound of the policy gradient norms, yielding higher policy gradients. (2) the token-level, which prioritizes tokens with a high product of advantage magnitude and policy entropy ($|A_{i,t}|\times H_{i,t}$), focusing updates on tokens where the policy is both uncertain and impactful. Moreover, to prevent overfitting to high-signal data, D$^3$S employs a dynamic down-sampling schedule inspired by curriculum learning. This schedule starts with aggressive down-sampling to accelerate early learning and gradually relaxes to promote robust generalization. Extensive experiments on Qwen2.5 and Llama3.1 demonstrate that integrating D$^3$S into advanced RL algorithms achieves state-of-the-art performance and generalization while requiring \textit{fewer} samples and tokens across diverse reasoning benchmarks. Our code is added in the supplementary materials and will be made publicly available.
Authors: Jeong Eul Kwon, Joo Heung Yoon, Hyo Kyung Lee
Abstract: Irregular sampling and high missingness are intrinsic challenges in modeling time series derived from electronic health records (EHRs),where clinical variables are measured at uneven intervals depending on workflow and intervention timing. To address this, we propose VITAL, a variable-aware, large language model (LLM) based framework tailored for learning from irregularly sampled physiological time series. VITAL differentiates between two distinct types of clinical variables: vital signs, which are frequently recorded and exhibit temporal patterns, and laboratory tests, which are measured sporadically and lack temporal structure. It reprograms vital signs into the language space, enabling the LLM to capture temporal context and reason over missing values through explicit encoding. In contrast, laboratory variables are embedded either using representative summary values or a learnable [Not measured] token, depending on their availability. Extensive evaluations on the benchmark datasets from the PhysioNet demonstrate that VITAL outperforms state of the art methods designed for irregular time series. Furthermore, it maintains robust performance under high levels of missingness, which is prevalent in real world clinical scenarios where key variables are often unavailable.
Authors: Moritz Piening, Robert Beinert
Abstract: Wasserstein distances define a metric between probability measures on arbitrary metric spaces, including meta-measures (measures over measures). The resulting Wasserstein over Wasserstein (WoW) distance is a powerful, but computationally costly tool for comparing datasets or distributions over images and shapes. Existing sliced WoW accelerations rely on parametric meta-measures or the existence of high-order moments, leading to numerical instability. As an alternative, we propose to leverage the isometry between the 1d Wasserstein space and the quantile functions in the function space $L_2([0,1])$. For this purpose, we introduce a general sliced Wasserstein framework for arbitrary Banach spaces. Due to the 1d Wasserstein isometry, this framework defines a sliced distance between 1d meta-measures via infinite-dimensional $L_2$-projections, parametrized by Gaussian processes. Combining this 1d construction with classical integration over the Euclidean unit sphere yields the double-sliced Wasserstein (DSW) metric for general meta-measures. We show that DSW minimization is equivalent to WoW minimization for discretized meta-measures, while avoiding unstable higher-order moments and computational savings. Numerical experiments on datasets, shapes, and images validate DSW as a scalable substitute for the WoW distance.
Authors: Takashi Morita
Abstract: Vector quantization, which discretizes a continuous vector space into a finite set of representative vectors (a codebook), has been widely adopted in modern machine learning. Despite its effectiveness, vector quantization poses a fundamental challenge: the non-differentiable quantization step blocks gradient backpropagation. Smoothed vector quantization addresses this issue by relaxing the hard assignment of a codebook vector into a weighted combination of codebook entries, represented as the matrix product of a simplex vector and the codebook. Effective smoothing requires two properties: (1) smoothed quantizers should remain close to a onehot vector, ensuring tight approximation, and (2) all codebook entries should be utilized, preventing code collapse. Existing methods typically address these desiderata separately. By contrast, the present study introduces a simple and intuitive regularization that promotes both simultaneously by minimizing the distance between each simplex vertex and its $K$-nearest smoothed quantizers. Experiments on representative benchmarks, including discrete image autoencoding and contrastive speech representation learning, demonstrate that the proposed method achieves more reliable codebook utilization and improves performance compared to prior approaches.
Authors: Shirin Alanova, Kristina Kazistova, Ekaterina Galaeva, Alina Kostromina, Vladimir Smirnov, Redko Dmitry, Alexey Dontsov, Maxim Zhelnin, Evgeny Burnaev, Egor Shvetsov
Abstract: The demand for efficient large language model (LLM) inference has intensified the focus on sparsification techniques. While semi-structured (N:M) pruning is well-established for weights, its application to activation pruning remains underexplored despite its potential for dynamic, input-adaptive compression and reductions in I/O overhead. This work presents a comprehensive analysis of methods for post-training N:M activation pruning in LLMs. Across multiple LLMs, we demonstrate that pruning activations enables superior preservation of generative capabilities compared to weight pruning at equivalent sparsity levels. We evaluate lightweight, plug-and-play error mitigation techniques and pruning criteria, establishing strong hardware-friendly baselines that require minimal calibration. Furthermore, we explore sparsity patterns beyond NVIDIA's standard 2:4, showing that the 16:32 pattern achieves performance nearly on par with unstructured sparsity. However, considering the trade-off between flexibility and hardware implementation complexity, we focus on the 8:16 pattern as a superior candidate. Our findings provide both effective practical methods for activation pruning and a motivation for future hardware to support more flexible sparsity patterns. Our code is available https://anonymous.4open.science/r/Structured-Sparse-Activations-Inference-EC3C/README.md .
URLs: https://anonymous.4open.science/r/Structured-Sparse-Activations-Inference-EC3C/README.md
Authors: Durgesh Kalwar, Mayank Baranwal, Harshad Khadilkar
Abstract: In today's data-sensitive landscape, distributed learning emerges as a vital tool, not only fortifying privacy measures but also streamlining computational operations. This becomes especially crucial within fully decentralized infrastructures where local processing is imperative due to the absence of centralized aggregation. Here, we introduce DYNAWEIGHT, a novel framework to information aggregation in multi-agent networks. DYNAWEIGHT offers substantial acceleration in decentralized learning with minimal additional communication and memory overhead. Unlike traditional static weight assignments, such as Metropolis weights, DYNAWEIGHT dynamically allocates weights to neighboring servers based on their relative losses on local datasets. Consequently, it favors servers possessing diverse information, particularly in scenarios of substantial data heterogeneity. Our experiments on various datasets MNIST, CIFAR10, and CIFAR100 incorporating various server counts and graph topologies, demonstrate notable enhancements in training speeds. Notably, DYNAWEIGHT functions as an aggregation scheme compatible with any underlying server-level optimization algorithm, underscoring its versatility and potential for widespread integration.
Authors: Pavan Karjol, Vivek V Kashyap, Rohan Kashyap, Prathosh A P
Abstract: In this study, we introduce a method for learning group (known or unknown) equivariant functions by learning the associated quadratic form $x^T A x$ corresponding to the group from the data. Certain groups, known as orthogonal groups, preserve a specific quadratic form, and we leverage this property to uncover the underlying symmetry group under the assumption that it is orthogonal. By utilizing the corresponding unique symmetric matrix and its inherent diagonal form, we incorporate suitable inductive biases into the neural network architecture, leading to models that are both simplified and efficient. Our approach results in an invariant model that preserves norms, while the equivariant model is represented as a product of a norm-invariant model and a scale-invariant model, where the ``product'' refers to the group action. Moreover, we extend our framework to a more general setting where the function acts on tuples of input vectors via a diagonal (or product) group action. In this extension, the equivariant function is decomposed into an angular component extracted solely from the normalized first vector and a scale-invariant component that depends on the full Gram matrix of the tuple. This decomposition captures the inter-dependencies between multiple inputs while preserving the underlying group symmetry. We assess the effectiveness of our framework across multiple tasks, including polynomial regression, top quark tagging, and moment of inertia matrix prediction. Comparative analysis with baseline methods demonstrates that our model consistently excels in both discovering the underlying symmetry and efficiently learning the corresponding equivariant function.
Authors: Stefan Matthes, Zhiwei Han, Hao Shen
Abstract: Disentangled representations seek to recover latent factors of variation underlying observed data, yet their identifiability is still not fully understood. We introduce a unified framework in which disentanglement is achieved through mechanistic independence, which characterizes latent factors by how they act on observed variables rather than by their latent distribution. This perspective is invariant to changes of the latent density, even when such changes induce statistical dependencies among factors. Within this framework, we propose several related independence criteria -- ranging from support-based and sparsity-based to higher-order conditions -- and show that each yields identifiability of latent subspaces, even under nonlinear, non-invertible mixing. We further establish a hierarchy among these criteria and provide a graph-theoretic characterization of latent subspaces as connected components. Together, these results clarify the conditions under which disentangled representations can be identified without relying on statistical assumptions.
Authors: Duc Thien Nguyen, Konstantinos Slavakis, Eleftherios Kofidis, Dimitris Pados
Abstract: A regression-based framework for interpretable multi-way data imputation, termed Kernel Regression via Tensor Trains with Hadamard overparametrization (KReTTaH), is introduced. KReTTaH adopts a nonparametric formulation by casting imputation as regression via reproducing kernel Hilbert spaces. Parameter efficiency is achieved through tensors of fixed tensor-train (TT) rank, which reside on low-dimensional Riemannian manifolds, and is further enhanced via Hadamard overparametrization, which promotes sparsity within the TT parameter space. Learning is accomplished by solving a smooth inverse problem posed on the Riemannian manifold of fixed TT-rank tensors. As a representative application, the estimation of dynamic graph flows is considered. In this setting, KReTTaH exhibits flexibility by seamlessly incorporating graph-based (topological) priors via its inverse problem formulation. Numerical tests on real-world graph datasets demonstrate that KReTTaH consistently outperforms state-of-the-art alternatives-including a nonparametric tensor- and a neural-network-based methods-for imputing missing, time-varying edge flows.
Authors: Mu Huang, Linning Xu, Mingyue Dai, Yidi Shao, Bo Dai
Abstract: Simulating physically plausible trajectories toward user-defined goals is a fundamental yet challenging task in fluid dynamics. While particle-based simulators can efficiently reproduce forward dynamics, inverse inference remains difficult, especially in dissipative systems where dynamics are irreversible and optimization-based solvers are slow, unstable, and often fail to converge. In this work, we introduce the Reversible Graph Network Simulator (R-GNS), a unified framework that enforces bidirectional consistency within a single graph architecture. Unlike prior neural simulators that approximate inverse dynamics by fitting backward data, R-GNS does not attempt to reverse the underlying physics. Instead, we propose a mathematically invertible design based on residual reversible message passing with shared parameters, coupling forward dynamics with inverse inference to deliver accurate predictions and efficient recovery of plausible initial states. Experiments on three dissipative benchmarks (Water-3D, WaterRamps, and WaterDrop) show that R-GNS achieves higher accuracy and consistency with only one quarter of the parameters, and performs inverse inference more than 100 times faster than optimization-based baselines. For forward simulation, R-GNS matches the speed of strong GNS baselines, while in goal-conditioned tasks it eliminates iterative optimization and achieves orders-of-magnitude speedups. On goal-conditioned tasks, R-GNS further demonstrates its ability to complex target shapes (e.g., characters "L" and "N") through vivid, physically consistent trajectories. To our knowledge, this is the first reversible framework that unifies forward and inverse simulation for dissipative fluid systems.
Authors: Leonardo Iurada, Simone Bombari, Tatiana Tommasi, Marco Mondelli
Abstract: Large-scale deep learning models are known to memorize parts of the training set. In machine learning theory, memorization is often framed as interpolation or label fitting, and classical results show that this can be achieved when the number of parameters $p$ in the model is larger than the number of training samples $n$. In this work, we consider memorization from the perspective of data reconstruction, demonstrating that this can be achieved when $p$ is larger than $dn$, where $d$ is the dimensionality of the data. More specifically, we show that, in the random features model, when $p \gg dn$, the subspace spanned by the training samples in feature space gives sufficient information to identify the individual samples in input space. Our analysis suggests an optimization method to reconstruct the dataset from the model parameters, and we demonstrate that this method performs well on various architectures (random features, two-layer fully-connected and deep residual networks). Our results reveal a law of data reconstruction, according to which the entire training dataset can be recovered as $p$ exceeds the threshold $dn$.
Authors: Pavan Karjol, Vivek V Kashyap, Rohan Kashyap, Prathosh A P
Abstract: We introduce a novel framework for the automatic discovery of one-parameter subgroups ($H_{\gamma}$) of $SO(3)$ and, more generally, $SO(n)$. One-parameter subgroups of $SO(n)$ are crucial in a wide range of applications, including robotics, quantum mechanics, and molecular structure analysis. Our method utilizes the standard Jordan form of skew-symmetric matrices, which define the Lie algebra of $SO(n)$, to establish a canonical form for orbits under the action of $H_{\gamma}$. This canonical form is then employed to derive a standardized representation for $H_{\gamma}$-invariant functions. By learning the appropriate parameters, the framework uncovers the underlying one-parameter subgroup $H_{\gamma}$. The effectiveness of the proposed approach is demonstrated through tasks such as double pendulum modeling, moment of inertia prediction, top quark tagging and invariant polynomial regression, where it successfully recovers meaningful subgroup structure and produces interpretable, symmetry-aware representations.
Authors: Alexandra Cimpean, Nicole Orzan, Catholijn Jonker, Pieter Libin, Ann Now\'e
Abstract: Equity in real-world sequential decision problems can be enforced using fairness-aware methods. Therefore, we require algorithms that can make suitable and transparent trade-offs between performance and the desired fairness notions. As the desired performance-fairness trade-off is hard to specify a priori, we propose a framework where multiple trade-offs can be explored. Insights provided by the reinforcement learning algorithm regarding the obtainable performance-fairness trade-offs can then guide stakeholders in selecting the most appropriate policy. To capture fairness, we propose an extended Markov decision process, $f$MDP, that explicitly encodes individuals and groups. Given this $f$MDP, we formalise fairness notions in the context of sequential decision problems and formulate a fairness framework that computes fairness measures over time. We evaluate our framework in two scenarios with distinct fairness requirements: job hiring, where strong teams must be composed while treating applicants equally, and fraud detection, where fraudulent transactions must be detected while ensuring the burden on customers is fairly distributed. We show that our framework learns policies that are more fair across multiple scenarios, with only minor loss in performance reward. Moreover, we observe that group and individual fairness notions do not necessarily imply one another, highlighting the benefit of our framework in settings where both fairness types are desired. Finally, we provide guidelines on how to apply this framework across different problem settings.
Authors: Xiaoyang Liu, Tao Zhu, Zineng Dong, Yuntian Liu, Qingfeng Guo, Zhaoxuan Liu, Yu Chen, Tao Luo
Abstract: Statement autoformalization, the automated translation of statements from natural language into formal languages, has seen significant advancements, yet the development of automated evaluation metrics remains limited. Existing metrics for formal statement similarity often fail to balance semantic and structural information. String-based approaches capture syntactic structure but ignore semantic meaning, whereas proof-based methods validate semantic equivalence but disregard structural nuances and, critically, provide no graded similarity score in the event of proof failure. To address these issues, we introduce ASSESS (A Semantic and Structural Evaluation Framework for Statement Similarity), which comprehensively integrates semantic and structural information to provide a continuous similarity score. Our framework first transforms formal statements into Operator Trees to capture their syntactic structure and then computes a similarity score using our novel TransTED (Transformation Tree Edit Distance) Similarity metric, which enhances traditional Tree Edit Distance by incorporating semantic awareness through transformations. For rigorous validation, we present EPLA (Evaluating Provability and Likeness for Autoformalization), a new benchmark of 524 expert-annotated formal statement pairs derived from miniF2F and ProofNet, with labels for both semantic provability and structural likeness. Experiments on EPLA demonstrate that TransTED Similarity outperforms existing methods, achieving state-of-the-art accuracy and the highest Kappa coefficient. The benchmark, and implementation code will be made public soon.
Authors: Isaac Reid, Arijit Sehanobish, Cedrik H\"ofs, Bruno Mlodozeniec, Leonhard Vulpius, Federico Barbero, Adrian Weller, Krzysztof Choromanski, Richard E. Turner, Petar Veli\v{c}kovi\'c
Abstract: We introduce WIRE: Wavelet-Induced Rotary Encodings. WIRE extends Rotary Position Encodings (RoPE), a popular algorithm in LLMs and ViTs, to graph-structured data. We demonstrate that WIRE is more general than RoPE, recovering the latter in the special case of grid graphs. WIRE also enjoys a host of desirable theoretical properties, including equivariance under node ordering permutation, compatibility with linear attention, and (under select assumptions) asymptotic dependence on graph resistive distance. We test WIRE on a range of synthetic and real-world tasks, including identifying monochromatic subgraphs, semantic segmentation of point clouds, and more standard graph benchmarks. We find it to be effective in settings where the underlying graph structure is important.
Authors: Nakyeong Yang, Dong-Kyum Kim, Jea Kwon, Minsung Kim, Kyomin Jung, Meeyoung Cha
Abstract: Large language models trained on web-scale data can memorize private or sensitive knowledge, raising significant privacy risks. Although some unlearning methods mitigate these risks, they remain vulnerable to "relearning" during subsequent training, allowing a substantial portion of forgotten knowledge to resurface. In this paper, we show that widely used unlearning methods cause shallow alignment: instead of faithfully erasing target knowledge, they generate spurious unlearning neurons that amplify negative influence to hide it. To overcome this limitation, we introduce Ssiuu, a new class of unlearning methods that employs attribution-guided regularization to prevent spurious negative influence and faithfully remove target knowledge. Experimental results confirm that our method reliably erases target knowledge and outperforms strong baselines across two practical retraining scenarios: (1) adversarial injection of private data, and (2) benign attack using an instruction-following benchmark. Our findings highlight the necessity of robust and faithful unlearning methods for safe deployment of language models.
Authors: Jo\~ao Paulo Vieira, Victor Afonso Bauler, Rodrigo Kobashikawa Rosa, Danilo Silva
Abstract: Reliable detection of bearing faults is essential for maintaining the safety and operational efficiency of rotating machinery. While recent advances in machine learning (ML), particularly deep learning, have shown strong performance in controlled settings, many studies fail to generalize to real-world applications due to methodological flaws, most notably data leakage. This paper investigates the issue of data leakage in vibration-based bearing fault diagnosis and its impact on model evaluation. We demonstrate that common dataset partitioning strategies, such as segment-wise and condition-wise splits, introduce spurious correlations that inflate performance metrics. To address this, we propose a rigorous, leakage-free evaluation methodology centered on bearing-wise data partitioning, ensuring no overlap between the physical components used for training and testing. Additionally, we reformulate the classification task as a multi-label problem, enabling the detection of co-occurring fault types and the use of prevalence-independent metrics such as Macro AUROC. Beyond preventing leakage, we also examine the effect of dataset diversity on generalization, showing that the number of unique training bearings is a decisive factor for achieving robust performance. We evaluate our methodology on three widely adopted datasets: CWRU, Paderborn University (PU), and University of Ottawa (UORED-VAFCLS). This study highlights the importance of leakage-aware evaluation protocols and provides practical guidelines for dataset partitioning, model selection, and validation, fostering the development of more trustworthy ML systems for industrial fault diagnosis applications.
Authors: Nassim Walha, Sebastian G. Gruber, Thomas Decker, Yinchong Yang, Alireza Javanmardi, Eyke H\"ullermeier, Florian Buettner
Abstract: As Large Language Models (LLMs) are increasingly integrated in diverse applications, obtaining reliable measures of their predictive uncertainty has become critically important. A precise distinction between aleatoric uncertainty, arising from inherent ambiguities within input data, and epistemic uncertainty, originating exclusively from model limitations, is essential to effectively address each uncertainty source. In this paper, we introduce Spectral Uncertainty, a novel approach to quantifying and decomposing uncertainties in LLMs. Leveraging the Von Neumann entropy from quantum information theory, Spectral Uncertainty provides a rigorous theoretical foundation for separating total uncertainty into distinct aleatoric and epistemic components. Unlike existing baseline methods, our approach incorporates a fine-grained representation of semantic similarity, enabling nuanced differentiation among various semantic interpretations in model responses. Empirical evaluations demonstrate that Spectral Uncertainty outperforms state-of-the-art methods in estimating both aleatoric and total uncertainty across diverse models and benchmark datasets.
Authors: Xingjian Wu, Zhengyu Li, Hanyin Cheng, Xiangfei Qiu, Jilin Hu, Chenjuan Guo, Bin Yang
Abstract: Time Series Analysis is widely used in various real-world applications such as weather forecasting, financial fraud detection, imputation for missing data in IoT systems, and classification for action recognization. Mixture-of-Experts (MoE), as a powerful architecture, though demonstrating effectiveness in NLP, still falls short in adapting to versatile tasks in time series analytics due to its task-agnostic router and the lack of capability in modeling channel correlations. In this study, we propose a novel, general MoE-based time series framework called PatchMoE to support the intricate ``knowledge'' utilization for distinct tasks, thus task-aware. Based on the observation that hierarchical representations often vary across tasks, e.g., forecasting vs. classification, we propose a Recurrent Noisy Gating to utilize the hierarchical information in routing, thus obtaining task-sepcific capability. And the routing strategy is operated on time series tokens in both temporal and channel dimensions, and encouraged by a meticulously designed Temporal \& Channel Load Balancing Loss to model the intricate temporal and channel correlations. Comprehensive experiments on five downstream tasks demonstrate the state-of-the-art performance of PatchMoE.
Authors: Mehdi Letafati, Samad Ali, Matti Latva-aho
Abstract: Semantic communication (SemCom) systems aim to learn the mapping from low-dimensional semantics to high-dimensional ground-truth. While this is more akin to a "domain translation" problem, existing frameworks typically emphasize on channel-adaptive neural encoding-decoding schemes, lacking full exploration of signal distribution. Moreover, such methods so far have employed autoencoder-based architectures, where the encoding is tightly coupled to a matched decoder, causing scalability issues in practice. To address these gaps, diffusion autoencoder models are proposed for wireless SemCom. The goal is to learn a "semantic-to-clean" mapping, from the semantic space to the ground-truth probability distribution. A neural encoder at semantic transmitter extracts the high-level semantics, and a conditional diffusion model (CDiff) at the semantic receiver exploits the source distribution for signal-space denoising, while the received semantic latents are incorporated as the conditioning input to "steer" the decoding process towards the semantics intended by the transmitter. It is analytically proved that the proposed decoder model is a consistent estimator of the ground-truth data. Furthermore, extensive simulations over CIFAR-10 and MNIST datasets are provided along with design insights, highlighting the performance compared to legacy autoencoders and variational autoencoders (VAE). Simulations are further extended to the multi-user SemCom, identifying the dominating factors in a more realistic setup.
Authors: Yingying Li, Mingxuan Xie, Hailong You, Yongqiang Yao, Hongwei Liu
Abstract: This paper proposes an efficient hypergraph partitioning framework based on a novel multi-objective non-convex constrained relaxation model. A modified accelerated proximal gradient algorithm is employed to generate diverse $k$-dimensional vertex features to avoid local optima and enhance partition quality. Two MST-based strategies are designed for different data scales: for small-scale data, the Prim algorithm constructs a minimum spanning tree followed by pruning and clustering; for large-scale data, a subset of representative nodes is selected to build a smaller MST, while the remaining nodes are assigned accordingly to reduce complexity. To further improve partitioning results, refinement strategies including greedy migration, swapping, and recursive MST-based clustering are introduced for partitions. Experimental results on public benchmark sets demonstrate that the proposed algorithm achieves reductions in cut size of approximately 2\%--5\% on average compared to KaHyPar in 2, 3, and 4-way partitioning, with improvements of up to 35\% on specific instances. Particularly on weighted vertex sets, our algorithm outperforms state-of-the-art partitioners including KaHyPar, hMetis, Mt-KaHyPar, and K-SpecPart, highlighting its superior partitioning quality and competitiveness. Furthermore, the proposed refinement strategy improves hMetis partitions by up to 16\%. A comprehensive evaluation based on virtual instance methodology and parameter sensitivity analysis validates the algorithm's competitiveness and characterizes its performance trade-offs.
Authors: Xingjian Wu, Jianxin Jin, Wanghui Qiu, Peng Chen, Yang Shu, Bin Yang, Chenjuan Guo
Abstract: Cross-domain generalization is very important in Time Series Forecasting because similar historical information may lead to distinct future trends due to the domain-specific characteristics. Recent works focus on building unimodal time series foundation models and end-to-end multimodal supervised models. Since domain-specific knowledge is often contained in modalities like texts, the former lacks the explicit utilization of them, thus hindering the performance. The latter is tailored for end-to-end scenarios and does not support zero-shot inference for cross-domain scenarios. In this work, we introduce Aurora, a Multimodal Time Series Foundation Model, which supports multimodal inputs and zero-shot inference. Pretrained on Corss-domain Multimodal Time Series Corpus, Aurora can adaptively extract and focus on key domain knowledge contained in corrsponding text or image modalities, thus possessing strong Cross-domain generalization capability. Through tokenization, encoding, and distillation, Aurora can extract multimodal domain knowledge as guidance and then utilizes a Modality-Guided Multi-head Self-Attention to inject them into the modeling of temporal representations. In the decoding phase, the multimodal representations are used to generate the conditions and prototypes of future tokens, contributing to a novel Prototype-Guided Flow Matching for generative probabilistic forecasting. Comprehensive experiments on well-recognized benchmarks, including TimeMMD, TSFM-Bench and ProbTS, demonstrate the consistent state-of-the-art performance of Aurora on both unimodal and multimodal scenarios.
Authors: Ke Li, Zheng Yang, Zhongbin Zhou, Feng Xue, Zhonglin Jiang, Wenxiao Wang
Abstract: Mixture-of-Experts (MoE) architectures in large language models (LLMs) deliver exceptional performance and reduced inference costs compared to dense LLMs. However, their large parameter counts result in prohibitive memory requirements, limiting practical deployment. While existing pruning methods primarily focus on expert-level pruning, this coarse granularity often leads to substantial accuracy degradation. In this work, we introduce HEAPr, a novel pruning algorithm that decomposes experts into smaller, indivisible atomic experts, enabling more precise and flexible atomic expert pruning. To measure the importance of each atomic expert, we leverage second-order information based on principles similar to Optimal Brain Surgeon (OBS) theory. To address the computational and storage challenges posed by second-order information, HEAPr exploits the inherent properties of atomic experts to transform the second-order information from expert parameters into that of atomic expert parameters, and further simplifies it to the second-order information of atomic expert outputs. This approach reduces the space complexity from $O(d^4)$, where d is the model's dimensionality, to $O(d^2)$. HEAPr requires only two forward passes and one backward pass on a small calibration set to compute the importance of atomic experts. Extensive experiments on MoE models, including DeepSeek MoE and Qwen MoE family, demonstrate that HEAPr outperforms existing expert-level pruning methods across a wide range of compression ratios and benchmarks. Specifically, HEAPr achieves nearly lossless compression at compression ratios of 20% ~ 25% in most models, while also reducing FLOPs nearly by 20%. The code can be found at \href{https://github.com/LLIKKE/HEAPr}{https://github.com/LLIKKE/HEAPr}.
URLs: https://github.com/LLIKKE/HEAPr, https://github.com/LLIKKE/HEAPr
Authors: Gabriel Kitso Gibberd, Jose Pablo Folch, Antonio Del Rio Chanona
Abstract: Computational representations have become crucial in unlocking the recent growth of machine learning algorithms for chemistry. Initially hand-designed, machine learning has shown that meaningful representations can be learnt from data. Chemical datasets are limited and so the representations learnt from data are generic, being trained on broad datasets which contain shallow information on many different molecule types. For example, generic fingerprints lack physical context specific to solvents. However, the use of harmful solvents is a leading climate-related issue in the chemical industry, and there is a surge of interest in green solvent replacement. To empower this research, we propose a new solvent representation scheme by developing Solvent Data Driven Embeddings (SoDaDE). SoDaDE uses a small transformer model and solvent property dataset to create a fingerprint for solvents. To showcase their effectiveness, we use SoDaDE to predict yields on a recently published dataset, outperforming previous representations. We demonstrate through this paper that data-driven fingerprints can be made with small datasets and set-up a workflow that can be explored for other applications.
Authors: Bumgeun Park, Donghwan Lee
Abstract: Reinforcement learning (RL) has achieved impressive results across domains, yet learning an optimal policy typically requires extensive interaction data, limiting practical deployment. A common remedy is to leverage priors, such as pre-collected datasets or reference policies, but their utility degrades under task mismatch between training and deployment. While prior work has sought to address this mismatch, it has largely been restricted to in-distribution settings. To address this challenge, we propose Adaptive Policy Backbone (APB), a meta-transfer RL method that inserts lightweight linear layers before and after a shared backbone, thereby enabling parameter-efficient fine-tuning (PEFT) while preserving prior knowledge during adaptation. Our results show that APB improves sample efficiency over standard RL and adapts to out-of-distribution (OOD) tasks where existing meta-RL baselines typically fail.
Authors: Hyunwoo Kim, Junha Lee, Mincheol Choi, Jeonghwan Lee, Jaeshin Cho
Abstract: Deep learning models have become increasingly large and complex, resulting in higher memory consumption and computational demands. Consequently, model loading times and initial inference latency have increased, posing significant challenges in mobile and latency-sensitive environments where frequent model loading and unloading are required, which directly impacts user experience. While Knowledge Distillation (KD) offers a solution by compressing large teacher models into smaller student ones, it often comes at the cost of reduced performance. To address this trade-off, we propose Progressive Weight Loading (PWL), a novel technique that enables fast initial inference by first deploying a lightweight student model, then incrementally replacing its layers with those of a pre-trained teacher model. To support seamless layer substitution, we introduce a training method that not only aligns intermediate feature representations between student and teacher layers, but also improves the overall output performance of the student model. Our experiments on VGG, ResNet, and ViT architectures demonstrate that models trained with PWL maintain competitive distillation performance and gradually improve accuracy as teacher layers are loaded-matching the final accuracy of the full teacher model without compromising initial inference speed. This makes PWL particularly suited for dynamic, resource-constrained deployments where both responsiveness and performance are critical.
Authors: Bowen Wang, Matteo Zecchin, Osvaldo Simeone
Abstract: An associative memory (AM) enables cue-response recall, and associative memorization has recently been noted to underlie the operation of modern neural architectures such as Transformers. This work addresses a distributed setting where agents maintain a local AM to recall their own associations as well as selective information from others. Specifically, we introduce a distributed online gradient descent method that optimizes local AMs at different agents through communication over routing trees. Our theoretical analysis establishes sublinear regret guarantees, and experiments demonstrate that the proposed protocol consistently outperforms existing online optimization baselines.
Authors: Naicheng He, Kaicheng Guo, Arjun Prakash, Saket Tiwari, Ruo Yu Tao, Tyrone Serapio, Amy Greenwald, George Konidaris
Abstract: We investigate why deep neural networks suffer from \emph{loss of plasticity} in deep continual learning, failing to learn new tasks without reinitializing parameters. We show that this failure is preceded by Hessian spectral collapse at new-task initialization, where meaningful curvature directions vanish and gradient descent becomes ineffective. To characterize the necessary condition for successful training, we introduce the notion of $\tau$-trainability and show that current plasticity preserving algorithms can be unified under this framework. Targeting spectral collapse directly, we then discuss the Kronecker factored approximation of the Hessian, which motivates two regularization enhancements: maintaining high effective feature rank and applying $L2$ penalties. Experiments on continual supervised and reinforcement learning tasks confirm that combining these two regularizers effectively preserves plasticity.
Authors: Marie Brockschmidt, Maresa Schr\"oder, Stefan Feuerriegel
Abstract: Survival analysis is a cornerstone of clinical research by modeling time-to-event outcomes such as metastasis, disease relapse, or patient death. Unlike standard tabular data, survival data often come with incomplete event information due to dropout, or loss to follow-up. This poses unique challenges for synthetic data generation, where it is crucial for clinical research to faithfully reproduce both the event-time distribution and the censoring mechanism. In this paper, we propose SurvDiff, an end-to-end diffusion model specifically designed for generating synthetic data in survival analysis. SurvDiff is tailored to capture the data-generating mechanism by jointly generating mixed-type covariates, event times, and right-censoring, guided by a survival-tailored loss function. The loss encodes the time-to-event structure and directly optimizes for downstream survival tasks, which ensures that SurvDiff (i) reproduces realistic event-time distributions and (ii) preserves the censoring mechanism. Across multiple datasets, we show that \survdiff consistently outperforms state-of-the-art generative baselines in both distributional fidelity and downstream evaluation metrics across multiple medical datasets. To the best of our knowledge, SurvDiff is the first diffusion model explicitly designed for generating synthetic survival data.
Authors: Fan Wang, Zhiyuan Chen, Yuxuan Zhong, Sunjian Zheng, Pengtao Shao, Bo Yu, Shaoshan Liu, Jianan Wang, Ning Ding, Yang Cao, Yu Kang
Abstract: The capability of predicting environmental dynamics underpins both biological neural systems and general embodied AI in adapting to their surroundings. Yet prevailing approaches rest on static world models that falter when confronted with novel or rare configurations. We investigate in-context environment learning (ICEL), shifting attention from zero-shot performance to the growth and asymptotic limits of the world model. Our contributions are three-fold: (1) we formalize in-context learning of a world model and identify two core mechanisms: environment recognition and environment learning; (2) we derive error upper-bounds for both mechanisms that expose how the mechanisms emerge; and (3) we empirically confirm that distinct ICL mechanisms exist in the world model, and we further investigate how data distribution and model architecture affect ICL in a manner consistent with theory. These findings demonstrate the potential of self-adapting world models and highlight the key factors behind the emergence of ICEL, most notably the necessity of long context and diverse environments.
Authors: Maria Lomeli, Matthijs Douze, Gergely Szilvasy, Loic Cabannes, Jade Copet, Sainbayar Sukhbaatar, Jason Weston, Gabriel Synnaeve, Pierre-Emmanuel Mazar\'e, Herv\'e J\'egou
Abstract: We introduce stochastic activations. This novel strategy randomly selects between several non-linear functions in the feed-forward layer of a large language model. In particular, we choose between SILU or RELU depending on a Bernoulli draw. This strategy circumvents the optimization problem associated with RELU, namely, the constant shape for negative inputs that prevents the gradient flow. We leverage this strategy in two ways: (1) We use stochastic activations during pre-training and fine-tune the model with RELU, which is used at inference time to provide sparse latent vectors. This reduces the inference FLOPs and translates into a significant speedup in the CPU. Interestingly, this leads to much better results than training from scratch with the RELU activation function. (2) We evaluate stochastic activations for generation. This strategy performs reasonably well: it is only slightly inferior to the best deterministic non-linearity, namely SILU combined with temperature scaling. This offers an alternative to existing strategies by providing a controlled way to increase the diversity of the generated text.
Authors: Moritz Hehl, Max von Renesse, Melanie Weber
Abstract: Deep neural networks learn feature representations via complex geometric transformations of the input data manifold. Despite the models' empirical success across domains, our understanding of neural feature representations is still incomplete. In this work we investigate neural feature geometry through the lens of discrete geometry. Since the input data manifold is typically unobserved, we approximate it using geometric graphs that encode local similarity structure. We provide theoretical results on the evolution of these graphs during training, showing that nonlinear activations play a crucial role in shaping feature geometry in feedforward neural networks. Moreover, we discover that the geometric transformations resemble a discrete Ricci flow on these graphs, suggesting that neural feature geometry evolves analogous to Ricci flow. This connection is supported by experiments on over 20,000 feedforward neural networks trained on binary classification tasks across both synthetic and real-world datasets. We observe that the emergence of class separability corresponds to the emergence of community structure in the associated graph representations, which is known to relate to discrete Ricci flow dynamics. Building on these insights, we introduce a novel framework for locally evaluating geometric transformations through comparison with discrete Ricci flow dynamics. Our results suggest practical design principles, including a geometry-informed early-stopping heuristic and a criterion for selecting network depth.
Authors: Lovenya Jain, Pooneh Mousavi, Mirco Ravanelli, Cem Subakan
Abstract: Faithfulness measures whether chain-of-thought (CoT) representations accurately reflect a model's decision process and can be used as reliable explanations. Prior work has shown that CoTs from text-based LLMs are often unfaithful. This question has not been explored for large audio-language models (LALMs), where faithfulness is critical for safety-sensitive applications. Reasoning in LALMs is also more challenging, as models must first extract relevant clues from audio before reasoning over them. In this paper, we investigate the faithfulness of CoTs produced by several LALMs by applying targeted interventions, including paraphrasing, filler token injection, early answering, and introducing mistakes, on two challenging reasoning datasets: SAKURA and MMAR. After going through the aforementioned interventions across several datasets and tasks, our experiments suggest that, LALMs generally produce CoTs that appear to be faithful to their underlying decision processes.
Authors: Bo Wang, Imran Khan, Martin White, Natalia Beloff
Abstract: We present a federated, multi-modal phishing website detector that supports URL, HTML, and IMAGE inputs without binding clients to a fixed modality at inference: any client can invoke any modality head trained elsewhere. Methodologically, we propose role-aware bucket aggregation on top of FedProx, inspired by Mixture-of-Experts and FedMM. We drop learnable routing and use hard gating (selecting the IMAGE/HTML/URL expert by sample modality), enabling separate aggregation of modality-specific parameters to isolate cross-embedding conflicts and stabilize convergence. On TR-OP, the Fusion head reaches Acc 97.5% with FPR 2.4% across two data types; on the image subset (ablation) it attains Acc 95.5% with FPR 5.9%. For text, we use GraphCodeBERT for URLs and an early three-way embedding for raw, noisy HTML. On WebPhish (HTML) we obtain Acc 96.5% / FPR 1.8%; on TR-OP (raw HTML) we obtain Acc 95.1% / FPR 4.6%. Results indicate that bucket aggregation with hard-gated experts enables stable federated training under strict privacy, while improving the usability and flexibility of multi-modal phishing detection.
Authors: Haibo Wang, Lutfu S. Sua, Jun Huang, Figen Balo, Burak Dolar
Abstract: Effective credit risk management is fundamental to financial decision-making, necessitating robust models for default probability prediction and financial entity classification. Traditional machine learning approaches face significant challenges when confronted with high-dimensional data, limited interpretability, rare event detection, and multi-class imbalance problems in risk assessment. This research proposes a comprehensive meta-learning framework that synthesizes multiple complementary models: supervised learning algorithms, including XGBoost, Random Forest, Support Vector Machine, and Decision Tree; unsupervised methods such as K-Nearest Neighbors; deep learning architectures like Multilayer Perceptron; alongside LASSO regularization for feature selection and dimensionality reduction; and Error-Correcting Output Codes as a meta-classifier for handling imbalanced multi-class problems. We implement Permutation Feature Importance analysis for each prediction class across all constituent models to enhance model transparency. Our framework aims to optimize predictive performance while providing a more holistic approach to credit risk assessment. This research contributes to the development of more accurate and reliable computational models for strategic financial decision support by addressing three fundamental challenges in credit risk modeling. The empirical validation of our approach involves an analysis of the Corporate Credit Ratings dataset with credit ratings for 2,029 publicly listed US companies. Results demonstrate that our meta-learning framework significantly enhances the accuracy of financial entity classification regarding credit rating migrations (upgrades and downgrades) and default probability estimation.
Authors: Luca Bergamin, Roberto Confalonieri, Fabio Aiolli
Abstract: Deep neural networks are widely used in practical applications of AI, however, their inner structure and complexity made them generally not easily interpretable. Model transparency and interpretability are key requirements for multiple scenarios where high performance is not enough to adopt the proposed solution. In this work, a differentiable approximation of $L_0$ regularization is adapted into a logic-based neural network, the Multi-layer Logical Perceptron (MLLP), to study its efficacy in reducing the complexity of its discrete interpretable version, the Concept Rule Set (CRS), while retaining its performance. The results are compared to alternative heuristics like Random Binarization of the network weights, to determine if better results can be achieved when using a less-noisy technique that sparsifies the network based on the loss function instead of a random distribution. The trade-off between the CRS complexity and its performance is discussed.
Authors: Narada Maugin, Tristan Cazenave
Abstract: The Counterfactual Regret Minimization (CFR) algorithm and its variants have enabled the development of pokerbots capable of beating the best human players in heads-up (1v1) cash games and competing with them in six-player formats. However, CFR's computational complexity rises exponentially with the number of players. Furthermore, in games with three or more players, following Nash equilibrium no longer guarantees a non-losing outcome. These limitations, along with others, significantly restrict the applicability of CFR to the most popular formats: tournaments. Motivated by the recent success of Large Language Models (LLM) in chess and Diplomacy, we present SpinGPT, the first LLM tailored to Spin & Go, a popular three-player online poker format. SpinGPT is trained in two stages: (1) Supervised Fine-Tuning on 320k high-stakes expert decisions; (2) Reinforcement Learning on 270k solver-generated hands. Our results show that SpinGPT matches the solver's actions in 78% of decisions (tolerant accuracy). With a simple deep-stack heuristic, it achieves 13.4 +/- 12.9 BB/100 versus Slumbot in heads-up over 30,000 hands (95% CI). These results suggest that LLMs could be a new way to deal with multi-player imperfect-information games like poker.
Authors: Filipe C. L. Duarte, Paulo S. G. de Mattos Neto, Paulo R. A. Firmino
Abstract: The decline in interest rates and economic stabilization has heightened the importance of accurate mortality rate forecasting, particularly in insurance and pension markets. Multi-step-ahead predictions are crucial for public health, demographic planning, and insurance risk assessments; however, they face challenges when data are limited. Hybrid systems that combine statistical and Machine Learning (ML) models offer a promising solution for handling both linear and nonlinear patterns. This study evaluated the impact of different multi-step forecasting approaches (Recursive, Direct, and Multi-Input Multi-Output) and ML models on the accuracy of hybrid systems. Results from 12 datasets and 21 models show that the selection of both the multi-step approach and the ML model is essential for improving performance, with the ARIMA-LSTM hybrid using a recursive approach outperforming other models in most cases.
Authors: Nan Tang, Jing-Cheng Pang, Guanlin Li, Chao Qian, Yang Yu
Abstract: Reward design remains a critical bottleneck in visual reinforcement learning (RL) for robotic manipulation. In simulated environments, rewards are conventionally designed based on the distance to a target position. However, such precise positional information is often unavailable in real-world visual settings due to sensory and perceptual limitations. In this study, we propose a method that implicitly infers spatial distances through keypoints extracted from images. Building on this, we introduce Reward Learning with Anticipation Model (ReLAM), a novel framework that automatically generates dense, structured rewards from action-free video demonstrations. ReLAM first learns an anticipation model that serves as a planner and proposes intermediate keypoint-based subgoals on the optimal path to the final goal, creating a structured learning curriculum directly aligned with the task's geometric objectives. Based on the anticipated subgoals, a continuous reward signal is provided to train a low-level, goal-conditioned policy under the hierarchical reinforcement learning (HRL) framework with provable sub-optimality bound. Extensive experiments on complex, long-horizon manipulation tasks show that ReLAM significantly accelerates learning and achieves superior performance compared to state-of-the-art methods.
Authors: Fanjin Meng, Yuan Yuan, Jingtao Ding, Jie Feng, Chonghua Han, Yong Li
Abstract: Mobility Foundation Models (MFMs) have advanced the modeling of human movement patterns, yet they face a ceiling due to limitations in data scale and semantic understanding. While Large Language Models (LLMs) offer powerful semantic reasoning, they lack the innate understanding of spatio-temporal statistics required for generating physically plausible mobility trajectories. To address these gaps, we propose MoveFM-R, a novel framework that unlocks the full potential of mobility foundation models by leveraging language-driven semantic reasoning capabilities. It tackles two key challenges: the vocabulary mismatch between continuous geographic coordinates and discrete language tokens, and the representation gap between the latent vectors of MFMs and the semantic world of LLMs. MoveFM-R is built on three core innovations: a semantically enhanced location encoding to bridge the geography-language gap, a progressive curriculum to align the LLM's reasoning with mobility patterns, and an interactive self-reflection mechanism for conditional trajectory generation. Extensive experiments demonstrate that MoveFM-R significantly outperforms existing MFM-based and LLM-based baselines. It also shows robust generalization in zero-shot settings and excels at generating realistic trajectories from natural language instructions. By synthesizing the statistical power of MFMs with the deep semantic understanding of LLMs, MoveFM-R pioneers a new paradigm that enables a more comprehensive, interpretable, and powerful modeling of human mobility. The implementation of MoveFM-R is available online at https://anonymous.4open.science/r/MoveFM-R-CDE7/.
Authors: Xiao Xue, Marco F. P. ten Eikelder, Mingyang Gao, Xiaoyuan Cheng, Yiming Yang, Yi He, Shuo Wang, Sibo Cheng, Yukun Hu, Peter V. Coveney
Abstract: The lattice Boltzmann equation (LBE), rooted in kinetic theory, provides a powerful framework for capturing complex flow behaviour by describing the evolution of single-particle distribution functions (PDFs). Despite its success, solving the LBE numerically remains computationally intensive due to strict time-step restrictions imposed by collision kernels. Here, we introduce a physics-informed neural operator framework for the LBE that enables prediction over large time horizons without step-by-step integration, effectively bypassing the need to explicitly solve the collision kernel. We incorporate intrinsic moment-matching constraints of the LBE, along with global equivariance of the full distribution field, enabling the model to capture the complex dynamics of the underlying kinetic system. Our framework is discretization-invariant, enabling models trained on coarse lattices to generalise to finer ones (kinetic super-resolution). In addition, it is agnostic to the specific form of the underlying collision model, which makes it naturally applicable across different kinetic datasets regardless of the governing dynamics. Our results demonstrate robustness across complex flow scenarios, including von Karman vortex shedding, ligament breakup, and bubble adhesion. This establishes a new data-driven pathway for modelling kinetic systems.
Authors: Yongqi Huang, Jitao Zhao, Dongxiao He, Xiaobao Wang, Yawen Li, Yuxiao Huang, Di Jin, Zhiyong Feng
Abstract: Graph Prompt Learning (GPL) has emerged as a promising paradigm that bridges graph pretraining models and downstream scenarios, mitigating label dependency and the misalignment between upstream pretraining and downstream tasks. Although existing GPL studies explore various prompt strategies, their effectiveness and underlying principles remain unclear. We identify two critical limitations: (1) Lack of consensus on underlying mechanisms: Despite current GPLs have advanced the field, there is no consensus on how prompts interact with pretrained models, as different strategies intervene at varying spaces within the model, i.e., input-level, layer-wise, and representation-level prompts. (2) Limited scenario adaptability: Most methods fail to generalize across diverse downstream scenarios, especially under data distribution shifts (e.g., homophilic-to-heterophilic graphs). To address these issues, we theoretically analyze existing GPL approaches and reveal that representation-level prompts essentially function as fine-tuning a simple downstream classifier, proposing that graph prompt learning should focus on unleashing the capability of pretrained models, and the classifier adapts to downstream scenarios. Based on our findings, we propose UniPrompt, a novel GPL method that adapts any pretrained models, unleashing the capability of pretrained models while preserving the structure of the input graph. Extensive experiments demonstrate that our method can effectively integrate with various pretrained models and achieve strong performance across in-domain and cross-domain scenarios.
Authors: Anastasiia Filippova, Angelos Katharopoulos, David Grangier, Ronan Collobert
Abstract: We introduce a memory- and compute-efficient method for low-communication distributed training. Existing methods reduce communication by performing multiple local updates between infrequent global synchronizations. We demonstrate that their efficiency can be significantly improved by restricting backpropagation: instead of updating all the parameters, each node updates only a fixed subset while keeping the remainder frozen during local steps. This constraint substantially reduces peak memory usage and training FLOPs, while a full forward pass over all parameters eliminates the need for cross-node activation exchange. Experiments on a $1.3$B-parameter language model trained across $32$ nodes show that our method matches the perplexity of prior low-communication approaches under identical token and bandwidth budgets while reducing training FLOPs and peak memory.
Authors: Yuma Fujimoto, Kenshi Abe, Kaito Ariu
Abstract: This study raises and addresses the problem of time-delayed feedback in learning in games. Because learning in games assumes that multiple agents independently learn their strategies, a discrepancy in optimization often emerges among the agents. To overcome this discrepancy, the prediction of the future reward is incorporated into algorithms, typically known as Optimistic Follow-the-Regularized-Leader (OFTRL). However, the time delay in observing the past rewards hinders the prediction. Indeed, this study firstly proves that even a single-step delay worsens the performance of OFTRL from the aspects of regret and convergence. This study proposes the weighted OFTRL (WOFTRL), where the prediction vector of the next reward in OFTRL is weighted $n$ times. We further capture an intuition that the optimistic weight cancels out this time delay. We prove that when the optimistic weight exceeds the time delay, our WOFTRL recovers the good performances that the regret is constant ($O(1)$-regret) in general-sum normal-form games, and the strategies converge to the Nash equilibrium as a subsequence (best-iterate convergence) in poly-matrix zero-sum games. The theoretical results are supported and strengthened by our experiments.
Authors: Florian Graf, Paolo Pellizzoni, Martin Uray, Stefan Huber, Roland Kwitt
Abstract: We consider the problem of computing persistent homology (PH) for large-scale Euclidean point cloud data, aimed at downstream machine learning tasks, where the exponential growth of the most widely-used Vietoris-Rips complex imposes serious computational limitations. Although more scalable alternatives such as the Alpha complex or sparse Rips approximations exist, they often still result in a prohibitively large number of simplices. This poses challenges in the complex construction and in the subsequent PH computation, prohibiting their use on large-scale point clouds. To mitigate these issues, we introduce the Flood complex, inspired by the advantages of the Alpha and Witness complex constructions. Informally, at a given filtration value $r\geq 0$, the Flood complex contains all simplices from a Delaunay triangulation of a small subset of the point cloud $X$ that are fully covered by balls of radius $r$ emanating from $X$, a process we call flooding. Our construction allows for efficient PH computation, possesses several desirable theoretical properties, and is amenable to GPU parallelization. Scaling experiments on 3D point cloud data show that we can compute PH of up to dimension 2 on several millions of points. Importantly, when evaluating object classification performance on real-world and synthetic data, we provide evidence that this scaling capability is needed, especially if objects are geometrically or topologically complex, yielding performance superior to other PH-based methods and neural networks for point cloud data.
Authors: Tianxiang Gao, Siyuan Sun, Hailiang Liu, Hongyang Gao
Abstract: Neural Ordinary Differential Equations (ODEs) have been successful in various applications due to their continuous nature and parameter-sharing efficiency. However, these unique characteristics also introduce challenges in training, particularly with respect to gradient computation accuracy and convergence analysis. In this paper, we address these challenges by investigating the impact of activation functions. We demonstrate that the properties of activation functions, specifically smoothness and nonlinearity, are critical to the training dynamics. Smooth activation functions guarantee globally unique solutions for both forward and backward ODEs, while sufficient nonlinearity is essential for maintaining the spectral properties of the Neural Tangent Kernel (NTK) during training. Together, these properties enable us to establish the global convergence of Neural ODEs under gradient descent in overparameterized regimes. Our theoretical findings are validated by numerical experiments, which not only support our analysis but also provide practical guidelines for scaling Neural ODEs, potentially leading to faster training and improved performance in real-world applications.
Authors: Peter Shaw, James Cohan, Jacob Eisenstein, Kristina Toutanova
Abstract: The Minimum Description Length (MDL) principle offers a formal framework for applying Occam's razor in machine learning. However, its application to neural networks such as Transformers is challenging due to the lack of a principled, universal measure for model complexity. This paper introduces the theoretical notion of asymptotically optimal description length objectives, grounded in the theory of Kolmogorov complexity. We establish that a minimizer of such an objective achieves optimal compression, for any dataset, up to an additive constant, in the limit as model resource bounds increase. We prove that asymptotically optimal objectives exist for Transformers, building on a new demonstration of their computational universality. We further show that such objectives can be tractable and differentiable by constructing and analyzing a variational objective based on an adaptive Gaussian mixture prior. Our empirical analysis shows that this variational objective selects for a low-complexity solution with strong generalization on an algorithmic task, but standard optimizers fail to find such solutions from a random initialization, highlighting key optimization challenges. More broadly, by providing a theoretical framework for identifying description length objectives with strong asymptotic guarantees, we outline a potential path towards training neural networks that achieve greater compression and generalization.
Authors: Daniil Shlenskii, Alexander Korotin
Abstract: Electrostatic generative models such as PFGM++ have recently emerged as a powerful framework, achieving state-of-the-art performance in image synthesis. PFGM++ operates in an extended data space with auxiliary dimensionality $D$, recovering the diffusion model framework as $D\to\infty$, while yielding superior empirical results for finite $D$. Like diffusion models, PFGM++ relies on expensive ODE simulations to generate samples, making it computationally costly. To address this, we propose Inverse Poisson Flow Matching (IPFM), a novel distillation framework that accelerates electrostatic generative models across all values of $D$. Our IPFM reformulates distillation as an inverse problem: learning a generator whose induced electrostatic field matches that of the teacher. We derive a tractable training objective for this problem and show that, as $D \to \infty$, our IPFM closely recovers Score Identity Distillation (SiD), a recent method for distilling diffusion models. Empirically, our IPFM produces distilled generators that achieve near-teacher or even superior sample quality using only a few function evaluations. Moreover, we observe that distillation converges faster for finite $D$ than in the $D \to \infty$ (diffusion) limit, which is consistent with prior findings that finite-$D$ PFGM++ models exhibit more favorable optimization and sampling properties.
Authors: Changhun Kim, Timon Conrad, Redwanul Karim, Julian Oelhaf, David Riebesel, Tom\'as Arias-Vergara, Andreas Maier, Johann J\"ager, Siming Bayer
Abstract: Physics-informed graph neural networks (PIGNNs) have emerged as fast AC power-flow solvers that can replace classic Newton--Raphson (NR) solvers, especially when thousands of scenarios must be evaluated. However, current PIGNNs still need accuracy improvements at parity speed; in particular, the physics loss is inoperative at inference, which can deter operational adoption. We address this with PIGNN-Attn-LS, combining an edge-aware attention mechanism that explicitly encodes line physics via per-edge biases, capturing the grid's anisotropy, with a backtracking line-search-based globalized correction operator that restores an operative decrease criterion at inference. Training and testing use a realistic High-/Medium-Voltage scenario generator, with NR used only to construct reference states. On held-out HV cases consisting of 4--32-bus grids, PIGNN-Attn-LS achieves a test RMSE of 0.00033 p.u. in voltage and 0.08$^\circ$ in angle, outperforming the PIGNN-MLP baseline by 99.5\% and 87.1\%, respectively. With streaming micro-batches, it delivers 2--5$\times$ faster batched inference than NR on 4--1024-bus grids.
Authors: Robert Parker, Oscar Dowson, Nicole LoGiudice, Manuel Garcia, Russell Bent
Abstract: We propose a reduced-space formulation for optimizing over trained neural networks where the network's outputs and derivatives are evaluated on a GPU. To do this, we treat the neural network as a "gray box" where intermediate variables and constraints are not exposed to the optimization solver. Compared to the full-space formulation, in which intermediate variables and constraints are exposed to the optimization solver, the reduced-space formulation leads to faster solves and fewer iterations in an interior point method. We demonstrate the benefits of this method on two optimization problems: Adversarial generation for a classifier trained on MNIST images and security-constrained optimal power flow with transient feasibility enforced using a neural network surrogate.
Authors: Xinyu Liu, Bei Li, Jiahao Liu, Junhao Ruan, Kechen Jiao, Hongyin Tang, Jingang Wang, Xiao Tong, Jingbo Zhu
Abstract: High-order numerical methods enhance Transformer performance in tasks like NLP and CV, but introduce a performance-efficiency trade-off due to increased computational overhead. Our analysis reveals that conventional efficiency techniques, such as distillation, can be detrimental to the performance of these models, exemplified by PCformer. To explore more optimizable ODE-based Transformer architectures, we propose the \textbf{I}terative \textbf{I}mplicit \textbf{E}uler \textbf{T}ransformer \textbf{(IIET)}, which simplifies high-order methods using an iterative implicit Euler approach. This simplification not only leads to superior performance but also facilitates model compression compared to PCformer. To enhance inference efficiency, we introduce \textbf{I}teration \textbf{I}nfluence-\textbf{A}ware \textbf{D}istillation \textbf{(IIAD)}. Through a flexible threshold, IIAD allows users to effectively balance the performance-efficiency trade-off. On lm-evaluation-harness, IIET boosts average accuracy by 2.65\% over vanilla Transformers and 0.8\% over PCformer. Its efficient variant, E-IIET, significantly cuts inference overhead by 55\% while retaining 99.4\% of the original task accuracy. Moreover, the most efficient IIET variant achieves an average performance gain exceeding 1.6\% over vanilla Transformer with comparable speed.
Authors: Boshra Ariguib, Mathias Niepert, Andrei Manolache
Abstract: High-quality molecular representations are essential for property prediction and molecular design, yet large labeled datasets remain scarce. While self-supervised pretraining on molecular graphs has shown promise, many existing approaches either depend on hand-crafted augmentations or complex generative objectives, and often rely solely on 2D topology, leaving valuable 3D structural information underutilized. To address this gap, we introduce C-FREE (Contrast-Free Representation learning on Ego-nets), a simple framework that integrates 2D graphs with ensembles of 3D conformers. C-FREE learns molecular representations by predicting subgraph embeddings from their complementary neighborhoods in the latent space, using fixed-radius ego-nets as modeling units across different conformers. This design allows us to integrate both geometric and topological information within a hybrid Graph Neural Network (GNN)-Transformer backbone, without negatives, positional encodings, or expensive pre-processing. Pretraining on the GEOM dataset, which provides rich 3D conformational diversity, C-FREE achieves state-of-the-art results on MoleculeNet, surpassing contrastive, generative, and other multimodal self-supervised methods. Fine-tuning across datasets with diverse sizes and molecule types further demonstrates that pretraining transfers effectively to new chemical domains, highlighting the importance of 3D-informed molecular representations.
Authors: Septimus Boshoff, Sebastian Peitz, Stefan Klus
Abstract: The Koopman operator, as a linear representation of a nonlinear dynamical system, has been attracting attention in many fields of science. Recently, Koopman operator theory has been combined with another concept that is popular in data science: reproducing kernel Hilbert spaces. We follow this thread into Gaussian process methods, and illustrate how these methods can alleviate two pervasive problems with kernel-based Koopman algorithms. The first being sparsity: most kernel methods do not scale well and require an approximation to become practical. We show that not only can the computational demands be reduced, but also demonstrate improved resilience against sensor noise. The second problem involves hyperparameter optimization and dictionary learning to adapt the model to the dynamical system. In summary, the main contribution of this work is the unification of Gaussian process regression and dynamic mode decomposition.
Authors: Sadia Asif, Mohammad Mohammadi Amiri
Abstract: Large language models deployed in sensitive applications increasingly require the ability to unlearn specific knowledge, such as user requests, copyrighted materials, or outdated information, without retraining from scratch to ensure regulatory compliance, user privacy, and safety. This task, known as machine unlearning, aims to remove the influence of targeted data (forgetting) while maintaining performance on the remaining data (retention). A common approach is to formulate this as a multi-objective problem and reduce it to a single-objective problem via scalarization, where forgetting and retention losses are combined using a weighted sum. However, this often results in unstable training dynamics and degraded model utility due to conflicting gradient directions. To address these challenges, we propose OFMU, a penalty-based bi-level optimization framework that explicitly prioritizes forgetting while preserving retention through a hierarchical structure. Our method enforces forgetting via an inner maximization step that incorporates a similarity-aware penalty to decorrelate the gradients of the forget and retention objectives, and restores utility through an outer minimization step. To ensure scalability, we develop a two-loop algorithm with provable convergence guarantees under both convex and non-convex regimes. We further provide a rigorous theoretical analysis of convergence rates and show that our approach achieves better trade-offs between forgetting efficacy and model utility compared to prior methods. Extensive experiments across vision and language benchmarks demonstrate that OFMU consistently outperforms existing unlearning methods in both forgetting efficacy and retained utility.
Authors: Samuele Punzo, Silvia Giulia Galfr\`e, Francesco Massafra, Alessandro Maglione, Corrado Priami, Alina S\^irbu
Abstract: We present a machine learning pipeline for biomarker discovery in Multiple Sclerosis (MS), integrating eight publicly available microarray datasets from Peripheral Blood Mononuclear Cells (PBMC). After robust preprocessing we trained an XGBoost classifier optimized via Bayesian search. SHapley Additive exPlanations (SHAP) were used to identify key features for model prediction, indicating thus possible biomarkers. These were compared with genes identified through classical Differential Expression Analysis (DEA). Our comparison revealed both overlapping and unique biomarkers between SHAP and DEA, suggesting complementary strengths. Enrichment analysis confirmed the biological relevance of SHAP-selected genes, linking them to pathways such as sphingolipid signaling, Th1/Th2/Th17 cell differentiation, and Epstein-Barr virus infection all known to be associated with MS. This study highlights the value of combining explainable AI (xAI) with traditional statistical methods to gain deeper insights into disease mechanism.
Authors: Juan Ramirez, Simon Lacoste-Julien
Abstract: Constrained optimization is a powerful framework for enforcing requirements on neural networks. These constrained deep learning problems are typically solved using first-order methods on their min-max Lagrangian formulation, but such approaches often suffer from oscillations and can fail to find all local solutions. While the Augmented Lagrangian method (ALM) addresses these issues, practitioners often favor dual optimistic ascent schemes (PI control) on the standard Lagrangian, which perform well empirically but lack formal guarantees. In this paper, we establish a previously unknown equivalence between these approaches: dual optimistic ascent on the Lagrangian is equivalent to gradient descent-ascent on the Augmented Lagrangian. This finding allows us to transfer the robust theoretical guarantees of the ALM to the dual optimistic setting, proving it converges linearly to all local solutions. Furthermore, the equivalence provides principled guidance for tuning the optimism hyper-parameter. Our work closes a critical gap between the empirical success of dual optimistic methods and their theoretical foundation.
Authors: Zahid Iqbal
Abstract: Federated Learning (FL) has emerged as a promising decentralized learning (DL) approach that enables the use of distributed data without compromising user privacy. However, FL poses several key challenges. First, it is frequently assumed that every client can train the same machine learning models, however, not all clients are able to meet this assumption because of differences in their business needs and computational resources. Second, statistical heterogeneity (a.k.a. non-IID data) poses a major challenge in FL, which can lead to lower global model performance. Third, while addressing these challenges, there is a need for a cost-effective incentive mechanism to encourage clients to participate in FL training. In response to these challenges, we propose several methodologies: DL-SH, which facilitates efficient, privacy-preserving, and communication-efficient learning in the context of statistical heterogeneity; DL-MH, designed to manage fully heterogeneous models while tackling statistical disparities; and I-DL-MH, an incentive-based extension of DL-MH that promotes client engagement in federated learning training by providing incentives within this complex federated learning framework. Comprehensive experiments were carried out to assess the performance and scalability of the proposed approaches across a range of complex experimental settings. This involved utilizing various model architectures, in diverse data distributions, including IID and several non-IID scenarios, as well as multiple datasets. Experimental results demonstrate that the proposed approaches significantly enhance accuracy and decrease communication costs while effectively addressing statistical heterogeneity and model heterogeneity in comparison to existing state-of-the-art approaches and baselines, with DL-SH improving global model accuracy by 153%, and I-DL-MH achieving a 225% improvement under non-IID conditions.
Authors: Guillem Capellera, Luis Ferraz, Antonio Rubio, Alexandre Alahi, Antonio Agudo
Abstract: Generative models often treat continuous data and discrete events as separate processes, creating a gap in modeling complex systems where they interact synchronously. To bridge this gap, we introduce JointDiff, a novel diffusion framework designed to unify these two processes by simultaneously generating continuous spatio-temporal data and synchronous discrete events. We demonstrate its efficacy in the sports domain by simultaneously modeling multi-agent trajectories and key possession events. This joint modeling is validated with non-controllable generation and two novel controllable generation scenarios: weak-possessor-guidance, which offers flexible semantic control over game dynamics through a simple list of intended ball possessors, and text-guidance, which enables fine-grained, language-driven generation. To enable the conditioning with these guidance signals, we introduce CrossGuid, an effective conditioning operation for multi-agent domains. We also share a new unified sports benchmark enhanced with textual descriptions for soccer and football datasets. JointDiff achieves state-of-the-art performance, demonstrating that joint modeling is crucial for building realistic and controllable generative models for interactive systems.
Authors: Chenyu Liu, Yuqiu Deng, Tianyu Liu, Jinan Zhou, Xinliang Zhou, Ziyu Jia, Yi Ding
Abstract: Electroencephalography (EEG), with its broad range of applications, necessitates models that can generalize effectively across various tasks and datasets. Large EEG Models (LEMs) address this by pretraining encoder-centric architectures on large-scale unlabeled data to extract universal representations. While effective, these models lack decoders of comparable capacity, limiting the full utilization of the learned features. To address this issue, we introduce ECHO, a novel decoder-centric LEM paradigm that reformulates EEG modeling as sequence-to-sequence learning. ECHO captures layered relationships among signals, labels, and tasks within sequence space, while incorporating discrete support samples to construct contextual cues. This design equips ECHO with in-context learning, enabling dynamic adaptation to heterogeneous tasks without parameter updates. Extensive experiments across multiple datasets demonstrate that, even with basic model components, ECHO consistently outperforms state-of-the-art single-task LEMs in multi-task settings, showing superior generalization and adaptability.
Authors: Liangyu Ding, Chenghan Wu, Guokai Li, Zizhuo Wang
Abstract: Bundle pricing refers to designing several product combinations (i.e., bundles) and determining their prices in order to maximize the expected profit. It is a classic problem in revenue management and arises in many industries, such as e-commerce, tourism, and video games. However, the problem is typically intractable due to the exponential number of candidate bundles. In this paper, we explore the usage of graph convolutional networks (GCNs) in solving the bundle pricing problem. Specifically, we first develop a graph representation of the mixed bundling model (where every possible bundle is assigned with a specific price) and then train a GCN to learn the latent patterns of optimal bundles. Based on the trained GCN, we propose two inference strategies to derive high-quality feasible solutions. A local-search technique is further proposed to improve the solution quality. Numerical experiments validate the effectiveness and efficiency of our proposed GCN-based framework. Using a GCN trained on instances with 5 products, our methods consistently achieve near-optimal solutions (better than 97%) with only a fraction of computational time for problems of small to medium size. It also achieves superior solutions for larger size of problems compared with other heuristic methods such as bundle size pricing (BSP). The method can also provide high quality solutions for instances with more than 30 products even for the challenging cases where product utilities are non-additive.
Authors: Lute Lillo, Nick Cheney
Abstract: In independent, identically distributed (i.i.d.) training regimes, activation functions have been benchmarked extensively, and their differences often shrink once model size and optimization are tuned. In continual learning, however, the picture is different: beyond catastrophic forgetting, models can progressively lose the ability to adapt (referred to as loss of plasticity) and the role of the non-linearity in this failure mode remains underexplored. We show that activation choice is a primary, architecture-agnostic lever for mitigating plasticity loss. Building on a property-level analysis of negative-branch shape and saturation behavior, we introduce two drop-in nonlinearities (Smooth-Leaky and Randomized Smooth-Leaky) and evaluate them in two complementary settings: (i) supervised class-incremental benchmarks and (ii) reinforcement learning with non-stationary MuJoCo environments designed to induce controlled distribution and dynamics shifts. We also provide a simple stress protocol and diagnostics that link the shape of the activation to the adaptation under change. The takeaway is straightforward: thoughtful activation design offers a lightweight, domain-general way to sustain plasticity in continual learning without extra capacity or task-specific tuning.
Authors: Davide Tenedini, Riccardo Zamboni, Mirco Mutti, Marcello Restelli
Abstract: Despite its recent successes, Deep Reinforcement Learning (DRL) is notoriously sample-inefficient. We argue that this inefficiency stems from the standard practice of optimizing policies directly in the high-dimensional and highly redundant parameter space $\Theta$. This challenge is greatly compounded in multi-task settings. In this work, we develop a novel, unsupervised approach that compresses the policy parameter space $\Theta$ into a low-dimensional latent space $\mathcal{Z}$. We train a generative model $g:\mathcal{Z}\to\Theta$ by optimizing a behavioral reconstruction loss, which ensures that the latent space is organized by functional similarity rather than proximity in parameterization. We conjecture that the inherent dimensionality of this manifold is a function of the environment's complexity, rather than the size of the policy network. We validate our approach in continuous control domains, showing that the parameterization of standard policy networks can be compressed up to five orders of magnitude while retaining most of its expressivity. As a byproduct, we show that the learned manifold enables task-specific adaptation via Policy Gradient operating in the latent space $\mathcal{Z}$.
Authors: Marek Pecha, Michael Skotnica, Jana Ru\v{s}ajov\'a, Bohdan Rieznikov, V\'it Wandrol, Mark\'eta R\"osnerov\'a, Jarom\'ir Knejzl\'ik
Abstract: The northeastern region of the Czech Republic is among the most seismically active areas in the country. The most frequent seismic events are mining-induced since there used to be strong mining activity in the past. However, natural tectonic events may also occur. In addition, seismic stations often record explosions in quarries in the region. Despite the cessation of mining activities, mine-induced seismic events still occur. Therefore, a rapid differentiation between tectonic and anthropogenic events is still important. The region is currently monitored by the OKC seismic station in Ostrava-Kr\'{a}sn\'{e} Pole built in 1983 which is a part of the Czech Regional Seismic Network. The station has been providing digital continuous waveform data at 100 Hz since 2007. In the years 1992--2002, the region was co-monitored by the Seismic Polygon Fren\v{s}t\'{a}t (SPF) which consisted of five seismic stations using a triggered STA/LTA system. In this study, we apply and compare machine learning methods to the SPF dataset, which contains labeled records of tectonic and mining-induced events. For binary classification, a Long Short-Term Memory recurrent neural network and XGBoost achieved an F1-score of 0.94 -- 0.95, demonstrating the potential of modern machine learning techniques for rapid event characterization.
Authors: Xu Wujiang, Wentian Zhao, Zhenting Wang, Li Yu-Jhe, Jin Can, Jin Mingyu, Mei Kai, Wan Kun, Metaxas Dimitris
Abstract: Training LLM agents in multi-turn environments with sparse rewards, where completing a single task requires 30+ turns of interaction within an episode, presents a fundamental challenge for reinforcement learning. We identify a critical failure mode unique to this setting: the exploration-exploitation cascade failure. This cascade begins with early-stage policy premature convergence, where sparse feedback causes agents to commit to flawed, low-entropy strategies. Subsequently, agents enter late-stage policy collapse, where conventional entropy regularization becomes counterproductive, promoting chaotic exploration that destabilizes training. We propose Entropy-regularized Policy Optimization (EPO), a general framework that breaks this failure cycle through three synergistic mechanisms: (1) adopting entropy regularization in multi-turn settings to enhance exploration, (2) an entropy smoothing regularizer that bounds policy entropy within historical averages to prevent abrupt fluctuations, and (3) adaptive phase-based weighting that balances exploration and exploitation across training. Our analysis justifies that EPO guarantees monotonically decreasing entropy variance while maintaining convergence. EPO achieves up to 152% performance improvement on ScienceWorld and up to 19.8% on ALFWorld. Our work demonstrates that multi-turn sparse-reward settings require fundamentally different entropy control than traditional RL, with broad implications for LLM agent training.
Authors: Guannan Lai, Da-Wei Zhou, Xin Yang, Han-Jia Ye
Abstract: Class Incremental Learning (CIL) requires models to continuously learn new classes without forgetting previously learned ones, while maintaining stable performance across all possible class sequences. In real-world settings, the order in which classes arrive is diverse and unpredictable, and model performance can vary substantially across different sequences. Yet mainstream evaluation protocols calculate mean and variance from only a small set of randomly sampled sequences. Our theoretical analysis and empirical results demonstrate that this sampling strategy fails to capture the full performance range, resulting in biased mean estimates and a severe underestimation of the true variance in the performance distribution. We therefore contend that a robust CIL evaluation protocol should accurately characterize and estimate the entire performance distribution. To this end, we introduce the concept of extreme sequences and provide theoretical justification for their crucial role in the reliable evaluation of CIL. Moreover, we observe a consistent positive correlation between inter-task similarity and model performance, a relation that can be leveraged to guide the search for extreme sequences. Building on these insights, we propose EDGE (Extreme case-based Distribution and Generalization Evaluation), an evaluation protocol that adaptively identifies and samples extreme class sequences using inter-task similarity, offering a closer approximation of the ground-truth performance distribution. Extensive experiments demonstrate that EDGE effectively captures performance extremes and yields more accurate estimates of distributional boundaries, providing actionable insights for model selection and robustness checking. Our code is available at https://github.com/AIGNLAI/EDGE.
Authors: Elaheh Akbari, Ping He, Ahmadreza Moradipari, Yikun Bai, Soheil Kolouri
Abstract: Flow-matching generative models have emerged as a powerful paradigm for continuous data generation, achieving state-of-the-art results across domains such as images, 3D shapes, and point clouds. Despite their success, these models suffer from slow inference due to the requirement of numerous sequential sampling steps. Recent work has sought to accelerate inference by reducing the number of sampling steps. In particular, Mean Flows offer a one-step generation approach that delivers substantial speedups while retaining strong generative performance. Yet, in many continuous domains, Mean Flows fail to faithfully approximate the behavior of the original multi-step flow-matching process. In this work, we address this limitation by incorporating optimal transport-based sampling strategies into the Mean Flow framework, enabling one-step generators that better preserve the fidelity and diversity of the original multi-step flow process. Experiments on controlled low-dimensional settings and on high-dimensional tasks such as image generation, image-to-image translation, and point cloud generation demonstrate that our approach achieves superior inference accuracy in one-step generative modeling.
Authors: Yulei Qin, Xiaoyu Tan, Zhengbao He, Gang Li, Haojia Lin, Zongyi Li, Zihan Xu, Yuchen Shi, Siqi Cai, Renting Rui, Shaofei Cai, Yuzheng Cai, Xuan Zhang, Sheng Ye, Ke Li, Xing Sun
Abstract: Reinforcement learning (RL) is the dominant paradigm for sharpening strategic tool use capabilities of LLMs on long-horizon, sparsely-rewarded agent tasks, yet it faces a fundamental challenge of exploration-exploitation trade-off. Existing studies stimulate exploration through the lens of policy entropy, but such mechanical entropy maximization is prone to RL training instability due to the multi-turn distribution shifting. In this paper, we target the progressive exploration-exploitation balance under the guidance of the agent own experiences without succumbing to either entropy collapsing or runaway divergence. We propose SPEAR, a curriculum-based self-imitation learning (SIL) recipe for training agentic LLMs. It extends the vanilla SIL framework, where a replay buffer stores self-generated promising trajectories for off-policy update, by gradually steering the policy evolution within a well-balanced range of entropy across stages. Specifically, our approach incorporates a curriculum to manage the exploration process, utilizing intrinsic rewards to foster skill-level exploration and facilitating action-level exploration through SIL. At first, the auxiliary tool call reward plays a critical role in the accumulation of tool-use skills, enabling broad exposure to the unfamiliar distributions of the environment feedback with an upward entropy trend. As training progresses, self-imitation gets strengthened to exploit existing successful patterns from replayed experiences for comparative action-level exploration, accelerating solution iteration without unbounded entropy growth. To further stabilize training, we recalibrate the advantages of experiences in the replay buffer to address the potential policy drift. Reugularizations such as the clipping of tokens with high covariance between probability and advantage are introduced to the trajectory-level entropy control to curb over-confidence.
Authors: Junkang Wu, Kexin Huang, Jiancan Wu, An Zhang, Xiang Wang, Xiangnan He
Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) strengthens LLM reasoning, but training often oscillates between {entropy collapse} and {entropy explosion}. We trace both hazards to the mean baseline used in value-free RL (e.g., GRPO and DAPO), which improperly penalizes negative-advantage samples under reward outliers. We propose {Quantile Advantage Estimation} (QAE), replacing the mean with a group-wise K-quantile baseline. QAE induces a response-level, two-regime gate: on hard queries (p <= 1 - K) it reinforces rare successes, while on easy queries (p > 1 - K) it targets remaining failures. Under first-order softmax updates, we prove {two-sided entropy safety}, giving lower and upper bounds on one-step entropy change that curb explosion and prevent collapse. Empirically, this minimal modification stabilizes entropy, sparsifies credit assignment (with tuned K, roughly 80% of responses receive zero advantage), and yields sustained pass@1 gains on Qwen3-8B/14B-Base across AIME 2024/2025 and AMC 2023. These results identify {baseline design} -- rather than token-level heuristics -- as the primary mechanism for scaling RLVR.
Authors: Aayush Mishra, Daniel Khashabi, Anqi Liu
Abstract: Supervised Fine-Tuning (SFT) is used to specialize model behavior by training weights to produce intended target responses for queries. In contrast, In-Context Learning (ICL) adapts models during inference with instructions or demonstrations in the prompt. ICL can offer better generalizability and more calibrated responses compared to SFT in data scarce settings, at the cost of more inference compute. In this work, we ask the question: Can ICL's internal computations be used to improve the qualities of SFT? We first show that ICL and SFT produce distinct activation patterns, indicating that the two methods achieve adaptation through different functional mechanisms. Motivated by this observation and to use ICL's rich functionality, we introduce ICL Activation Alignment (IA2), a self-distillation technique which aims to replicate ICL's activation patterns in SFT models and incentivizes ICL-like internal reasoning. Performing IA2 as a priming step before SFT significantly improves the accuracy and calibration of model outputs, as shown by our extensive empirical results on 12 popular benchmarks and 2 model families. This finding is not only practically useful, but also offers a conceptual window into the inner mechanics of model adaptation.
Authors: Maojiang Su, Mingcheng Lu, Jerry Yao-Chieh Hu, Shang Wu, Zhao Song, Alex Reneau, Han Liu
Abstract: We provide a theoretical analysis for end-to-end training Discrete Flow Matching (DFM) generative models. DFM is a promising discrete generative modeling framework that learns the underlying generative dynamics by training a neural network to approximate the transformative velocity field. Our analysis establishes a clear chain of guarantees by decomposing the final distribution estimation error. We first prove that the total variation distance between the generated and target distributions is controlled by the risk of the learned velocity field. We then bound this risk by analyzing its two primary sources: (i) Approximation Error, where we quantify the capacity of the Transformer architecture to represent the true velocity, and (ii) Estimation Error, where we derive statistical convergence rates that bound the error from training on a finite dataset. By composing these results, we provide the first formal proof that the distribution generated by a trained DFM model provably converges to the true data distribution as the training set size increases.
Authors: Ehsan Futuhi, Nathan R. Sturtevant
Abstract: Heuristic functions are central to the performance of search algorithms such as A-star, where admissibility - the property of never overestimating the true shortest-path cost - guarantees solution optimality. Recent deep learning approaches often disregard admissibility and provide limited guarantees on generalization beyond the training data. This paper addresses both of these limitations. First, we pose heuristic learning as a constrained optimization problem and introduce Cross-Entropy Admissibility (CEA), a loss function that enforces admissibility during training. On the Rubik's Cube domain, this method yields near-admissible heuristics with significantly stronger guidance than compressed pattern database (PDB) heuristics. Theoretically, we study the sample complexity of learning heuristics. By leveraging PDB abstractions and the structural properties of graphs such as the Rubik's Cube, we tighten the bound on the number of training samples needed for A-star to generalize. Replacing a general hypothesis class with a ReLU neural network gives bounds that depend primarily on the network's width and depth, rather than on graph size. Using the same network, we also provide the first generalization guarantees for goal-dependent heuristics.
Authors: Jiyeon Kim, Yingjie Hu, Negar Elhami-Khorasani, Kai Sun, Ryan Zhenqi Zhou
Abstract: Predicting the spread of wildfires is essential for effective fire management and risk assessment. With the fast advancements of artificial intelligence (AI), various deep learning models have been developed and utilized for wildfire spread prediction. However, there is limited understanding of the advantages and limitations of these models, and it is also unclear how deep learning-based fire spread models can be compared with existing non-AI fire models. In this work, we assess the ability of five typical deep learning models integrated with weather and environmental variables for wildfire spread prediction based on over ten years of wildfire data in the state of Hawaii. We further use the 2023 Maui fires as a case study to compare the best deep learning models with a widely-used fire spread model, FARSITE. The results show that two deep learning models, i.e., ConvLSTM and ConvLSTM with attention, perform the best among the five tested AI models. FARSITE shows higher precision, lower recall, and higher F1-score than the best AI models, while the AI models offer higher flexibility for the input data. By integrating AI models with an explainable AI method, we further identify important weather and environmental factors associated with the 2023 Maui wildfires.
Authors: Ankush Kumar Mishra, Jacob P. Mauthe, Nicholas Luke, Aram Amassian, Baskar Ganapathysubramanian
Abstract: Self-driving labs (SDLs) promise faster materials discovery by coupling automation with machine learning, but a central challenge is predicting costly, slow-to-measure properties from inexpensive, automatable readouts. We address this for doped conjugated polymers by learning interpretable spectral fingerprints from optical spectroscopy to predict electrical conductivity. Optical spectra are fast, non-destructive, and sensitive to aggregation and charge generation; we automate their featurization by combining a genetic algorithm (GA) with area-under-the-curve (AUC) computations over adaptively selected spectral windows. These data-driven spectral features, together with processing parameters, are used to train a quantitative structure-property relationship (QSPR) linking optical response and processing to conductivity. To improve accuracy and interpretability in the small-data regime, we add domain-knowledge-based feature expansions and apply SHAP-guided selection to retain a compact, physically meaningful feature set. The pipeline is evaluated under a leak-free train/test protocol, and GA is repeated to assess feature stability. The data-driven model matches the performance of a baseline built from expert-curated descriptors while reducing experimental effort (about 33%) by limiting direct conductivity measurements. Combining data-driven and expert features yields a hybrid QSPR with superior predictive performance, highlighting productive human-ML collaboration. The learned features recover known descriptors in pBTTT (0-0/0-1 vibronic intensity ratio) and reveal a tail-state region correlated with polymer bleaching during successful doping. This approach delivers interpretable, noise-robust, small-data-friendly features that convert rapid measurements into reliable predictions of costly properties and readily extends to other spectral modalities (e.g., XANES, Raman, FTIR).
Authors: Mahedi Hasan
Abstract: Seismic velocity inversion is a key task in geophysical exploration, enabling the reconstruction of subsurface structures from seismic wave data. It is critical for high-resolution seismic imaging and interpretation. Traditional physics-driven methods, such as Full Waveform Inversion (FWI), are computationally demanding, sensitive to initialization, and limited by the bandwidth of seismic data. Recent advances in deep learning have led to data-driven approaches that treat velocity inversion as a dense prediction task. This research benchmarks three advanced encoder-decoder architectures -- U-Net, U-Net++, and DeepLabV3+ -- together with SeismoLabV3+, an optimized variant of DeepLabV3+ with a ResNeXt50 32x4d backbone and task-specific modifications -- for seismic velocity inversion using the ThinkOnward 2025 Speed \& Structure dataset, which consists of five-channel seismic shot gathers paired with high-resolution velocity maps. Experimental results show that SeismoLabV3+ achieves the best performance, with MAPE values of 0.03025 on the internal validation split and 0.031246 on the hidden test set as scored via the official ThinkOnward leaderboard. These findings demonstrate the suitability of deep segmentation networks for seismic velocity inversion and underscore the value of tailored architectural refinements in advancing geophysical AI models.
Authors: Xin Li
Abstract: We propose an information-topological framework in which cycle closure is the fundamental mechanism of memory and consciousness. Memory is not a static store but the ability to re-enter latent cycles in neural state space, with invariant cycles serving as carriers of meaning by filtering order-specific noise and preserving what persists across contexts. The dot-cycle dichotomy captures this: transient dots scaffold exploration, while nontrivial cycles encode low-entropy content invariants that stabilize memory. Biologically, polychronous neural groups realize 1-cycles through delay-locked spiking reinforced by STDP, nested within theta-gamma rhythms that enforce boundary cancellation. These micro-cycles compose hierarchically, extending navigation loops into general memory and cognition. The perception-action cycle introduces high-order invariance: closure holds even across sense-act alternations, generalizing ancestral homing behavior. Sheaf-cosheaf duality formalizes this process: sheaves glue perceptual fragments into global sections, cosheaves decompose global plans into actions and closure aligns top-down predictions with bottom-up cycles. Consciousness then arises as the persistence of high-order invariants that integrate (unity) yet differentiate (richness) across contexts. We conclude that cycle is all you need: persistent invariants enable generalization in non-ergodic environments with long-term coherence at minimal energetic cost.
Authors: Mohammad Sadegh Khorshidi, Navid Yazdanjue, Hassan Gharoun, Mohammad Reza Nikoo, Fang Chen, Amir H. Gandomi
Abstract: We study symbolic surrogate modeling of frozen Transformer embeddings to obtain compact, auditable classifiers with calibrated probabilities. For five benchmarks (SST2G, 20NG, MNIST, CIFAR10, MSC17), embeddings from ModernBERT, DINOv2, and SigLIP are partitioned on the training set into disjoint, information-preserving views via semantic-preserving feature partitioning (SPFP). A cooperative multi-population genetic program (MEGP) then learns additive, closed-form logit programs over these views. Across 30 runs per dataset we report F1, AUC, log-loss, Brier, expected calibration error (ECE), and symbolic complexity; a canonical model is chosen by a one-standard-error rule on validation F1 with a parsimony tie-break. Temperature scaling fitted on validation yields substantial ECE reductions on test. The resulting surrogates achieve strong discrimination (up to F1 around 0.99 on MNIST, CIFAR10, MSC17; around 0.95 on SST2G), while 20NG remains most challenging. We provide reliability diagrams, dimension usage and overlap statistics, contribution-based importances, and global effect profiles (PDP and ALE), demonstrating faithful, cross-modal explanations grounded in explicit programs.
Authors: Huizhe Zhang, Jintang Li, Yuchang Zhu, Liang Chen, Li Kuang
Abstract: Graph Neural Networks (GNNs) are exemplary deep models designed for graph data. Message passing mechanism enables GNNs to effectively capture graph topology and push the performance boundaries across various graph tasks. However, the trend of developing such complex machinery for graph representation learning has become unsustainable on large-scale graphs. The computational and time overhead make it imperative to develop more energy-efficient GNNs to cope with the explosive growth of real-world graphs. Spiking Graph Neural Networks (SGNNs), which integrate biologically plausible learning via unique spike-based neurons, have emerged as a promising energy-efficient alternative. Different layers communicate with sparse and binary spikes, which facilitates computation and storage of intermediate graph representations. Despite the proliferation of SGNNs proposed in recent years, there is no systematic benchmark to explore the basic design principles of these brain-inspired networks on the graph data. To bridge this gap, we present SGNNBench to quantify progress in the field of SGNNs. Specifically, SGNNBench conducts an in-depth investigation of SGNNs from multiple perspectives, including effectiveness, energy efficiency, and architectural design. We comprehensively evaluate 9 state-of-the-art SGNNs across 18 datasets. Regarding efficiency, we empirically compare these baselines w.r.t model size, memory usage, and theoretical energy consumption to reveal the often-overlooked energy bottlenecks of SGNNs. Besides, we elaborately investigate the design space of SGNNs to promote the development of a general SGNN paradigm.
Authors: Gerard Boxo, Aman Neelappa, Shivam Raval
Abstract: White box monitors that analyze model internals offer promising advantages for detecting potentially harmful behaviors in large language models, including lower computational costs and integration into layered defense systems.However, training and evaluating these monitors requires response exemplars that exhibit the target behaviors, typically elicited through prompting or fine-tuning. This presents a challenge when the information used to elicit behaviors inevitably leaks into the data that monitors ingest, inflating their effectiveness. We present a systematic framework for evaluating a monitor's performance in terms of its ability to detect genuine model behavior rather than superficial elicitation artifacts. Furthermore, we propose three novel strategies to evaluate the monitor: content filtering (removing deception-related text from inputs), score filtering (aggregating only over task-relevant tokens), and prompt distilled fine-tuned model organisms (models trained to exhibit deceptive behavior without explicit prompting). Using deception detection as a representative case study, we identify two forms of leakage that inflate monitor performance: elicitation leakage from prompts that explicitly request harmful behavior, and reasoning leakage from models that verbalize their deceptive actions. Through experiments on multiple deception benchmarks, we apply our proposed mitigation strategies and measure performance retention. Our evaluation of the monitors reveal three crucial findings: (1) Content filtering is a good mitigation strategy that allows for a smooth removal of elicitation signal and can decrease probe AUROC by 30\% (2) Score filtering was found to reduce AUROC by 15\% but is not as straightforward to attribute to (3) A finetuned model organism improves monitor evaluations but reduces their performance by upto 40\%, even when re-trained.
Authors: Jiahui An, Sara Irina Fabrikant, Giacomo Indiveri, Elisa Donati
Abstract: Accurately assessing mental workload is crucial in cognitive neuroscience, human-computer interaction, and real-time monitoring, as cognitive load fluctuations affect performance and decision-making. While Electroencephalography (EEG) based machine learning (ML) models can be used to this end, their high computational cost hinders embedded real-time applications. Hardware implementations of spiking neural networks (SNNs) offer a promising alternative for low-power, fast, event-driven processing. This study compares hardware compatible SNN models with various traditional ML ones, using an open-source multimodal dataset. Our results show that multimodal integration improves accuracy, with SNN performance comparable to the ML one, demonstrating their potential for real-time implementations of cognitive load detection. These findings position event-based processing as a promising solution for low-latency, energy efficient workload monitoring in adaptive closed-loop embedded devices that dynamically regulate cognitive load.
Authors: Hongyu Qu, Hongxiong Xu, Lin Dong, Chunyi Xiang, Gaozhen Nie
Abstract: Accurate forecasting of tropical cyclone (TC) intensity - particularly during periods of rapid intensification and rapid weakening - remains a challenge for operational meteorology, with high-stakes implications for disaster preparedness and infrastructure resilience. Recent advances in machine learning have yielded notable progress in TC prediction; however, most existing systems provide forecasts that degrade rapidly in extreme regimes and lack long-range consistency. Here we introduce TIFNet, a transformer-based forecasting model that generates non-iterative, 5-day intensity trajectories by integrating high-resolution global forecasts with a historical-evolution fusion mechanism. Trained on reanalysis data and fine-tuned with operational data, TIFNet consistently outperforms operational numerical models across all forecast horizons, delivering robust improvements across weak, strong, and super typhoon categories. In rapid intensity change regimes - long regarded as the most difficult to forecast - TIFNet reduces forecast error by 29-43% relative to current operational baselines. These results represent a substantial advance in artificial-intelligence-based TC intensity forecasting, especially under extreme conditions where traditional models consistently underperform.
Authors: William Saakyan, Matthias Norden, Lola Eversmann, Simon Kirsch, Muyu Lin, Simon Guendelman, Isabel Dziobek, Hanna Drimalla
Abstract: Due to the complex and resource-intensive nature of diagnosing Autism Spectrum Condition (ASC), several computer-aided diagnostic support methods have been proposed to detect autism by analyzing behavioral cues in patient video data. While these models show promising results on some datasets, they struggle with poor gaze feature performance and lack of real-world generalizability. To tackle these challenges, we analyze a standardized video dataset comprising 168 participants with ASC (46% female) and 157 non-autistic participants (46% female), making it, to our knowledge, the largest and most balanced dataset available. We conduct a multimodal analysis of facial expressions, voice prosody, head motion, heart rate variability (HRV), and gaze behavior. To address the limitations of prior gaze models, we introduce novel statistical descriptors that quantify variability in eye gaze angles, improving gaze-based classification accuracy from 64% to 69% and aligning computational findings with clinical research on gaze aversion in ASC. Using late fusion, we achieve a classification accuracy of 74%, demonstrating the effectiveness of integrating behavioral markers across multiple modalities. Our findings highlight the potential for scalable, video-based screening tools to support autism assessment.
Authors: Kirill V. Karpov, Ivan S. Pikulin, Grigory V. Bokov, Artem A. Mitrofanov
Abstract: The properties of complexes with transuranium elements have long been the object of research in various fields of chemistry. However, their experimental study is complicated by their rarity, high cost and special conditions necessary for working with such elements, and the complexity of quantum chemical calculations does not allow their use for large systems. To overcome these problems, we used modern machine learning methods to create a novel neural network architecture that allows to use available experimental data on a number of elements and thus significantly improve the quality of the resulting models. We also described the applicability domain of the presented model and identified the molecular fragments that most influence the stability of the complexes.
Authors: Dayu Yang, Hui Fang
Abstract: Connecting conversation with external domain knowledge is vital for conversational recommender systems (CRS) to correctly understand user preferences. However, existing solutions either require domain-specific engineering, which limits flexibility, or rely solely on large language models, which increases the risk of hallucination. While Retrieval-Augmented Generation (RAG) holds promise, its naive use in CRS is hindered by noisy dialogues that weaken retrieval and by overlooked nuances among similar items. We propose ReGeS, a reciprocal Retrieval-Generation Synergy framework that unifies generation-augmented retrieval to distill informative user intent from conversations and retrieval-augmented generation to differentiate subtle item features. This synergy obviates the need for extra annotations, reduces hallucinations, and simplifies continuous updates. Experiments on multiple CRS benchmarks show that ReGeS achieves state-of-the-art performance in recommendation accuracy, demonstrating the effectiveness of reciprocal synergy for knowledge-intensive CRS tasks.
Authors: Imran Ashraf, Mukhtar Ullah, Muhammad Faisal Nadeem, Muhammad Nouman Noor
Abstract: Deep Learning models have transformed various domains, including the healthcare sector, particularly biomedical image classification by learning intricate features and enabling accurate diagnostics pertaining to complex diseases. Recent studies have adopted two different approaches to train DL models: training from scratch and transfer learning. Both approaches demand substantial computational time and resources due to the involvement of massive datasets in model training. These computational demands are further increased due to the design-space exploration required for selecting optimal hyperparameters, which typically necessitates several training rounds. With the growing sizes of datasets, exploring solutions to this problem has recently gained the research community's attention. A plausible solution is to select a subset of the dataset for training and hyperparameter search. This subset, referred to as the corset, must be a representative set of the original dataset. A straightforward approach to selecting the coreset could be employing random sampling, albeit at the cost of compromising the representativeness of the original dataset. A critical limitation of random sampling is the bias towards the dominant classes in an imbalanced dataset. Even if the dataset has inter-class balance, this random sampling will not capture intra-class diversity. This study addresses this issue by introducing an intelligent, lightweight mechanism for coreset selection. Specifically, it proposes a method to extract intra-class diversity, forming per-class clusters that are utilized for the final sampling. We demonstrate the efficacy of the proposed methodology by conducting extensive classification experiments on a well-known biomedical imaging dataset. Results demonstrate that the proposed scheme outperforms the random sampling approach on several performance metrics for uniform conditions.
Authors: Manel Rakez, Thomas Louis, Julien Guillaumin, Foucauld Chamming's, Pierre Fillard, Brice Amadeo, Virginie Rondeau
Abstract: Risk-adapted breast cancer screening requires robust models that leverage longitudinal imaging data. Most current deep learning models use single or limited prior mammograms and lack adaptation for real-world settings marked by imbalanced outcome distribution and heterogeneous follow-up. We developed LongiMam, an end-to-end deep learning model that integrates both current and up to four prior mammograms. LongiMam combines a convolutional and a recurrent neural network to capture spatial and temporal patterns predictive of breast cancer. The model was trained and evaluated using a large, population-based screening dataset with disproportionate case-to-control ratio typical of clinical screening. Across several scenarios that varied in the number and composition of prior exams, LongiMam consistently improved prediction when prior mammograms were included. The addition of prior and current visits outperformed single-visit models, while priors alone performed less well, highlighting the importance of combining historical and recent information. Subgroup analyses confirmed the model's efficacy across key risk groups, including women with dense breasts and those aged 55 years or older. Moreover, the model performed best in women with observed changes in mammographic density over time. These findings demonstrate that longitudinal modeling enhances breast cancer prediction and support the use of repeated mammograms to refine risk stratification in screening programs. LongiMam is publicly available as open-source software.
Authors: Eric Enouen, Sainyam Galhotra
Abstract: Concept Bottleneck Models (CBMs) use a set of human-interpretable concepts to predict the final task label, enabling domain experts to not only validate the CBM's predictions, but also intervene on incorrect concepts at test time. However, these interventions fail to address systemic misalignment between the CBM and the expert's reasoning, such as when the model learns shortcuts from biased data. To address this, we present a general interpretable debugging framework for CBMs that follows a two-step process of Removal and Retraining. In the Removal step, experts use concept explanations to identify and remove any undesired concepts. In the Retraining step, we introduce CBDebug, a novel method that leverages the interpretability of CBMs as a bridge for converting concept-level user feedback into sample-level auxiliary labels. These labels are then used to apply supervised bias mitigation and targeted augmentation, reducing the model's reliance on undesired concepts. We evaluate our framework with both real and automated expert feedback, and find that CBDebug significantly outperforms prior retraining methods across multiple CBM architectures (PIP-Net, Post-hoc CBM) and benchmarks with known spurious correlations.
Authors: Sanish Suwal, Dipkamal Bhusal, Michael Clifford, Nidhi Rastogi
Abstract: Prior works have shown that neural networks can be heavily pruned while preserving performance, but the impact of pruning on model interpretability remains unclear. In this work, we investigate how magnitude-based pruning followed by fine-tuning affects both low-level saliency maps and high-level concept representations. Using a ResNet-18 trained on ImageNette, we compare post-hoc explanations from Vanilla Gradients (VG) and Integrated Gradients (IG) across pruning levels, evaluating sparsity and faithfulness. We further apply CRAFT-based concept extraction to track changes in semantic coherence of learned concepts. Our results show that light-to-moderate pruning improves saliency-map focus and faithfulness while retaining distinct, semantically meaningful concepts. In contrast, aggressive pruning merges heterogeneous features, reducing saliency map sparsity and concept coherence despite maintaining accuracy. These findings suggest that while pruning can shape internal representations toward more human-aligned attention patterns, excessive pruning undermines interpretability.
Authors: Nabeel Nisar Bhat, Maksim Karnaukh, Stein Vandenbroeke, Wouter Lemoine, Jakob Struye, Jesus Omar Lacruz, Siddhartha Kumar, Mohammad Hossein Moghaddam, Joerg Widmer, Rafael Berkvens, Jeroen Famaey
Abstract: This article presents mmHSense, a set of open labeled mmWave datasets to support human sensing research within Integrated Sensing and Communication (ISAC) systems. The datasets can be used to explore mmWave ISAC for various end applications such as gesture recognition, person identification, pose estimation, and localization. Moreover, the datasets can be used to develop and advance signal processing and deep learning research on mmWave ISAC. This article describes the testbed, experimental settings, and signal features for each dataset. Furthermore, the utility of the datasets is demonstrated through validation on a specific downstream task. In addition, we demonstrate the use of parameter-efficient fine-tuning to adapt ISAC models to different tasks, significantly reducing computational complexity while maintaining performance on prior tasks.
Authors: Petr Ko\v{s}\v{t}\'al, Pavel Kord\'ik, Ond\v{r}ej Podsztavek
Abstract: High-resolution climate projections are essential for local decision-making. However, available climate projections have low spatial resolution (e.g. 12.5 km), which limits their usability. We address this limitation by leveraging single-image super-resolution models to statistically downscale climate projections to 1-km resolution. Since high-resolution climate projections are unavailable for training, we train models on a high-resolution observational gridded data set and apply them to low-resolution climate projections. We propose a climate indicator-based assessment using observed climate indices computed at weather station locations to evaluate the downscaled climate projections without ground-truth high-resolution climate projections. Experiments on daily mean temperature demonstrate that single-image super-resolution models can downscale climate projections without increasing the error of climate indicators compared to low-resolution climate projections.
Authors: Ehsan Sharifian, Saber Salehkaleybar, Negar Kiyavash
Abstract: We study the problem of causal structure learning from a combination of observational and interventional data generated by a linear non-Gaussian structural equation model that might contain cycles. Recent results show that using mere observational data identifies the causal graph only up to a permutation-equivalence class. We obtain a combinatorial characterization of this class by showing that each graph in an equivalence class corresponds to a perfect matching in a bipartite graph. This bipartite representation allows us to analyze how interventions modify or constrain the matchings. Specifically, we show that each atomic intervention reveals one edge of the true matching and eliminates all incompatible causal graphs. Consequently, we formalize the optimal experiment design task as an adaptive stochastic optimization problem over the set of equivalence classes with a natural reward function that quantifies how many graphs are eliminated from the equivalence class by an intervention. We show that this reward function is adaptive submodular and provide a greedy policy with a provable near-optimal performance guarantee. A key technical challenge is to efficiently estimate the reward function without having to explicitly enumerate all the graphs in the equivalence class. We propose a sampling-based estimator using random matchings and analyze its bias and concentration behavior. Our simulation results show that performing a small number of interventions guided by our stochastic optimization framework recovers the true underlying causal structure.
Authors: Jiaqi Liu, Lan Zhang, Xiaoyong Yuan
Abstract: Text-to-image diffusion models (DMs) inadvertently reproduce copyrighted styles and protected visual concepts, raising legal and ethical concerns. Concept erasure has emerged as a safeguard, aiming to selectively suppress such concepts through fine-tuning. However, existing methods do not scale to practical settings where providers must erase multiple and possibly conflicting concepts. The core bottleneck is their reliance on static erasure: a single checkpoint is fine-tuned to remove all target concepts, regardless of the actual erasure needs at inference. This rigid design mismatches real-world usage, where requests vary per generation, leading to degraded erasure success and reduced fidelity for non-target content. We propose DyME, an on-demand erasure framework that trains lightweight, concept-specific LoRA adapters and dynamically composes only those needed at inference. This modular design enables flexible multi-concept erasure, but naive composition causes interference among adapters, especially when many or semantically related concepts are suppressed. To overcome this, we introduce bi-level orthogonality constraints at both the feature and parameter levels, disentangling representation shifts and enforcing orthogonal adapter subspaces. We further develop ErasureBench-H, a new hierarchical benchmark with brand-series-character structure, enabling principled evaluation across semantic granularities and erasure set sizes. Experiments on ErasureBench-H and standard datasets (e.g., CIFAR-100, Imagenette) demonstrate that DyME consistently outperforms state-of-the-art baselines, achieving higher multi-concept erasure fidelity with minimal collateral degradation.
Authors: Anna Hallin
Abstract: The rise of foundation models -- large, pretrained machine learning models that can be finetuned to a variety of tasks -- has revolutionized the fields of natural language processing and computer vision. In high-energy physics, the question of whether these models can be implemented directly in physics research, or even built from scratch, tailored for particle physics data, has generated an increasing amount of attention. This review, which is the first on the topic of foundation models in high-energy physics, summarizes and discusses the research that has been published in the field so far.
Authors: Alnur Ali, Ashutosh Baheti, Jonathan Chang, Ta-Chung Chi, Brandon Cui, Andrew Drozdov, Jonathan Frankle, Abhay Gupta, Pallavi Koppol, Sean Kulinski, Jonathan Li, Dipendra Misra, Krista Opsahl-Ong, Jose Javier Gonzalez Ortiz, Matei Zaharia, Yue Zhang
Abstract: Developing custom reasoning models via Reinforcement Learning (RL) that can incorporate organization-specific knowledge has great potential to address problems faced by enterprise customers. In many of these problems, the reward function is verifiable, a setting termed RL with Verifiable Rewards (RLVR). We apply RLVR to a popular data science benchmark called BIRD that measures the ability of an AI agent to convert a natural language query for a database to SQL executions. We apply a simple and general-purpose training recipe involving careful prompt and model selection, a warm-up stage using our offline RL approach called TAO, followed by rigorous online RLVR training. With no additional training data beyond the BIRD training set and no use of proprietary models, our very first submission to the BIRD leaderboard reached state-of-the-art accuracy on the private test set: 73.56% without self-consistency and 75.68% with self-consistency. In the latter case, our model also required fewer generations than the second-best approach. While BIRD is only a proxy task, the simplicity of our framework makes it broadly applicable to enterprise domains such as business intelligence, data science, and coding.
Authors: Vishnu Raj, Gouthaman KV, Shiv Gehlot, Lars Villemoes, Arijit Biswas
Abstract: We present GMLv2, a reference-based model designed for the prediction of subjective audio quality as measured by MUSHRA scores. GMLv2 introduces a Beta distribution-based loss to model the listener ratings and incorporates additional neural audio coding (NAC) subjective datasets to extend its generalization and applicability. Extensive evaluations on diverse testset demonstrate that proposed GMLv2 consistently outperforms widely used metrics, such as PEAQ and ViSQOL, both in terms of correlation with subjective scores and in reliably predicting these scores across diverse content types and codec configurations. Consequently, GMLv2 offers a scalable and automated framework for perceptual audio quality evaluation, poised to accelerate research and development in modern audio coding technologies.
Authors: Adit Jain, Brendan Rappazzo
Abstract: Reinforcement learning with verifiable rewards (RLVR) has become a leading approach for improving large language model (LLM) reasoning capabilities. Most current methods follow variants of Group Relative Policy Optimization, which samples multiple reasoning completions, scores them relative to each other, and adjusts the policy accordingly. However, these approaches invariably sample discrete tokens at each reasoning step, discarding the rich distributional information in the model's probability distribution over candidate tokens. While preserving and utilizing this distributional information has proven beneficial in non-RL settings, current RLVR methods seem to be unnecessarily constraining the reasoning search space by not using this information. To address this limitation, we investigate mixture-of-token generation (MoT-G) in RLVR. We present a unified framework that generalizes existing MoT-G approaches, including existing training-free methods that construct mixture embeddings as weighted sums over token embeddings, and extend RLVR to operate directly in this continuous mixture space for generating chain-of-thought. Evaluating two MoT-G variants on Reasoning-Gym, a suite of reasoning-intensive language tasks, we find that MoT--G methods achieve substantial improvements (5--35 \% gains on 7 out of 10 tasks) compared to standard decoding with the Qwen2.5-1.5B model, while reaching comparable accuracy with half the number of trajectories, suggesting improved training efficiency. Through comprehensive hidden-state and token-level analyses, we provide evidence that MoT--G's benefits may stem from its ability to maintain higher hidden-state entropy throughout the reasoning process and promote exploration in token space.
Authors: Md Sajid Islam, Tanvir Hasan
Abstract: Bluetooth-based mesh networks offer a promising infrastructure for offline communication in emergency and resource constrained scenarios. However, traditional routing strategies such as Ad hoc On-Demand Distance Vector (AODV) often degrade under congestion and dynamic topological changes. This study proposes a hybrid intelligent routing framework that augments AODV with supervised machine learning to improve next-hop selection under varied network constraints. The framework integrates four predictive models: a delivery success classifier, a TTL regressor, a delay regressor, and a forwarder suitability classifier, into a unified scoring mechanism that dynamically ranks neighbors during multi-hop message transmission. A simulation environment with stationary node deployments was developed, incorporating buffer constraints and device heterogeneity to evaluate three strategies: baseline AODV, a partial hybrid ML model (ABC), and the full hybrid ML model (ABCD). Across ten scenarios, the Hybrid ABCD model achieves approximately 99.97 percent packet delivery under these controlled conditions, significantly outperforming both the baseline and intermediate approaches. The results demonstrate that lightweight, explainable machine learning models can enhance routing reliability and adaptability in Bluetooth mesh networks, particularly in infrastructure-less environments where delivery success is prioritized over latency constraints.
Authors: Alexandru Ioni\c{t}\u{a}, Andreea Ioni\c{t}\u{a}
Abstract: With the increased interest in artificial intelligence, Machine Learning as a Service provides the infrastructure in the Cloud for easy training, testing, and deploying models. However, these systems have a major privacy issue: uploading sensitive data to the Cloud, especially during training. Therefore, achieving secure Neural Network training has been on many researchers' minds lately. More and more solutions for this problem are built around a main pillar: Functional Encryption (FE). Although these approaches are very interesting and offer a new perspective on ML training over encrypted data, some vulnerabilities do not seem to be taken into consideration. In our paper, we present an attack on neural networks that uses FE for secure training over encrypted data. Our approach uses linear programming to reconstruct the original input, unveiling the previous security promises. To address the attack, we propose two solutions for secure training and inference that involve the client during the computation phase. One approach ensures security without relying on encryption, while the other uses function-hiding inner-product techniques.
Authors: Salman Beigi, Omid Etesami, Mohammad Mahmoody, Amir Najafi
Abstract: We study optimal transport between two high-dimensional distributions $\mu,\nu$ in $R^n$ from an algorithmic perspective: given $x \sim \mu$, find a close $y \sim \nu$ in $poly(n)$ time, where $n$ is the dimension of $x,y$. Thus, running time depends on the dimension rather than the full representation size of $\mu,\nu$. Our main result is a general algorithm for transporting any product distribution $\mu$ to any $\nu$ with cost $\Delta + \delta$ under $\ell_p^p$, where $\Delta$ is the Knothe-Rosenblatt transport cost and $\delta$ is a computational error decreasing with runtime. This requires $\nu$ to be "sequentially samplable" with bounded average sampling cost, a new but natural notion. We further prove: An algorithmic version of Talagrand's inequality for transporting the standard Gaussian $\Phi^n$ to arbitrary $\nu$ under squared Euclidean cost. For $\nu = \Phi^n$ conditioned on a set $\mathcal{S}$ of measure $\varepsilon$, we construct the sequential sampler in expected time $poly(n/\varepsilon)$ using membership oracle access to $\mathcal{S}$. This yields an algorithmic transport from $\Phi^n$ to $\Phi^n|\mathcal{S}$ in $poly(n/\varepsilon)$ time and expected squared distance $O(\log 1/\varepsilon)$, optimal for general $\mathcal{S}$ of measure $\varepsilon$. As corollary, we obtain the first computational concentration result (Etesami et al. SODA 2020) for Gaussian measure under Euclidean distance with dimension-independent transportation cost, resolving an open question of Etesami et al. Specifically, for any $\mathcal{S}$ of Gaussian measure $\varepsilon$, most $\Phi^n$ samples can be mapped to $\mathcal{S}$ within distance $O(\sqrt{\log 1/\varepsilon})$ in $poly(n/\varepsilon)$ time.
Authors: Naman Jain, Pranjali Jain, Pratik Kayal, Jayakrishna Sahit, Soham Pachpande, Jayesh Choudhari
Abstract: India is an agro-based economy and proper information about agricultural practices is the key to optimal agricultural growth and output. In order to answer the queries of the farmer, we have build an agricultural chatbot based on the dataset from Kisan Call Center. This system is robust enough to answer queries related to weather, market rates, plant protection and government schemes. This system is available 24* 7, can be accessed through any electronic device and the information is delivered with the ease of understanding. The system is based on a sentence embedding model which gives an accuracy of 56%. After eliminating synonyms and incorporating entity extraction, the accuracy jumps to 86%. With such a system, farmers can progress towards easier information about farming related practices and hence a better agricultural output. The job of the Call Center workforce would be made easier and the hard work of various such workers can be redirected to a better goal.
Authors: Ahmed Jaber, Wangshu Zhu, Karthick Jayavelu, Justin Downes, Sameer Mohamed, Candace Agonafir, Linnia Hawkins, Tian Zheng
Abstract: Climate data science faces persistent barriers stemming from the fragmented nature of data sources, heterogeneous formats, and the steep technical expertise required to identify, acquire, and process datasets. These challenges limit participation, slow discovery, and reduce the reproducibility of scientific workflows. In this paper, we present a proof of concept for addressing these barriers through the integration of a curated knowledge graph (KG) with AI agents designed for cloud-native scientific workflows. The KG provides a unifying layer that organizes datasets, tools, and workflows, while AI agents -- powered by generative AI services -- enable natural language interaction, automated data access, and streamlined analysis. Together, these components drastically lower the technical threshold for engaging in climate data science, enabling non-specialist users to identify and analyze relevant datasets. By leveraging existing cloud-ready API data portals, we demonstrate that "a knowledge graph is all you need" to unlock scalable and agentic workflows for scientific inquiry. The open-source design of our system further supports community contributions, ensuring that the KG and associated tools can evolve as a shared commons. Our results illustrate a pathway toward democratizing access to climate data and establishing a reproducible, extensible framework for human--AI collaboration in scientific research.
Authors: Chibuzor Okocha, Kelechi Ezema, Christan Grant
Abstract: This study examines domain effects in speaker diarization for African-accented English. We evaluate multiple production and open systems on general and clinical dialogues under a strict DER protocol that scores overlap. A consistent domain penalty appears for clinical speech and remains significant across models. Error analysis attributes much of this penalty to false alarms and missed detections, aligning with short turns and frequent overlap. We test lightweight domain adaptation by fine-tuning a segmentation module on accent-matched data; it reduces error but does not eliminate the gap. Our contributions include a controlled benchmark across domains, a concise approach to error decomposition and conversation-level profiling, and an adaptation recipe that is easy to reproduce. Results point to overlap-aware segmentation and balanced clinical resources as practical next steps.
Authors: Junno Yun, Ya\c{s}ar Utku Al\c{c}alar, Mehmet Ak\c{c}akaya
Abstract: Efficient training strategies for large-scale diffusion models have recently emphasized the importance of improving discriminative feature representations in these models. A central line of work in this direction is representation alignment with features obtained from powerful external encoders, which improves the representation quality as assessed through linear probing. Alignment-based approaches show promise but depend on large pretrained encoders, which are computationally expensive to obtain. In this work, we propose an alternative regularization for training, based on promoting the Linear SEParability (LSEP) of intermediate layer representations. LSEP eliminates the need for an auxiliary encoder and representation alignment, while incorporating linear probing directly into the network's learning dynamics rather than treating it as a simple post-hoc evaluation tool. Our results demonstrate substantial improvements in both training efficiency and generation quality on flow-based transformer architectures such as SiTs, achieving an FID of 1.46 on $256 \times 256$ ImageNet dataset.
Authors: Mohammad Parsa Afshar, Aryan Azimi
Abstract: Prediction of consumer behavior is one of the important purposes in marketing, cognitive neuroscience, and human-computer interaction. The electroencephalography (EEG) data can help analyze the decision process by providing detailed information about the brain's neural activity. In this research, a comparative approach is utilized for predicting consumer behavior by EEG data. In the first step, the features of the EEG data from the NeuMa dataset were extracted and cleaned. For the Graph Neural Network (GNN) models, the brain connectivity features were created. Different machine learning models, such as classical models and Graph Neural Networks, are used and compared. The GNN models with different architectures are implemented to have a comprehensive comparison; furthermore, a wide range of classical models, such as ensemble models, are applied, which can be very helpful to show the difference and performance of each model on the dataset. Although the results did not show a significant difference overall, the GNN models generally performed better in some basic criteria where classical models were not satisfactory. This study not only shows that combining EEG signal analysis and machine learning models can provide an approach to deeper understanding of consumer behavior, but also provides a comprehensive comparison between the machine learning models that have been widely used in previous studies in the EEG-based neuromarketing such as Support Vector Machine (SVM), and the models which are not used or rarely used in the field, like Graph Neural Networks.
Authors: Jakob M\"oderl, Erik Leitinger, Bernard Henri Fleury
Abstract: Sparse Bayesian learning (SBL) associates to each weight in the underlying linear model a hyperparameter by assuming that each weight is Gaussian distributed with zero mean and precision (inverse variance) equal to its associated hyperparameter. The method estimates the hyperparameters by marginalizing out the weights and performing (marginalized) maximum likelihood (ML) estimation. SBL returns many hyperparameter estimates to diverge to infinity, effectively setting the estimates of the corresponding weights to zero (i.e., pruning the corresponding weights from the model) and thereby yielding a sparse estimate of the weight vector. In this letter, we analyze the marginal likelihood as function of a single hyperparameter while keeping the others fixed, when the Gaussian assumptions on the noise samples and the weight distribution that underlies the derivation of SBL are weakened. We derive sufficient conditions that lead, on the one hand, to finite hyperparameter estimates and, on the other, to infinite ones. Finally, we show that in the Gaussian case, the two conditions are complementary and coincide with the pruning condition of fast SBL (F-SBL), thereby providing additional insights into this algorithm.
Authors: Yu Gui, Cong Ma, Zongming Ma
Abstract: Learning disentangled representations is a fundamental task in multi-modal learning. In modern applications such as single-cell multi-omics, both shared and modality-specific features are critical for characterizing cell states and supporting downstream analyses. Ideally, modality-specific features should be independent of shared ones while also capturing all complementary information within each modality. This tradeoff is naturally expressed through information-theoretic criteria, but mutual-information-based objectives are difficult to estimate reliably, and their variational surrogates often underperform in practice. In this paper, we introduce IndiSeek, a novel disentangled representation learning approach that addresses this challenge by combining an independence-enforcing objective with a computationally efficient reconstruction loss that bounds conditional mutual information. This formulation explicitly balances independence and completeness, enabling principled extraction of modality-specific features. We demonstrate the effectiveness of IndiSeek on synthetic simulations, a CITE-seq dataset and multiple real-world multi-modal benchmarks.
Authors: Gabrijel Boduljak, Laurynas Karazija, Iro Laina, Christian Rupprecht, Andrea Vedaldi
Abstract: We consider the problem of forecasting motion from a single image, i.e., predicting how objects in the world are likely to move, without the ability to observe other parameters such as the object velocities or the forces applied to them. We formulate this task as conditional generation of dense trajectory grids with a model that closely follows the architecture of modern video generators but outputs motion trajectories instead of pixels. This approach captures scene-wide dynamics and uncertainty, yielding more accurate and diverse predictions than prior regressors and generators. We extensively evaluate our method on simulated data, demonstrate its effectiveness on downstream applications such as robotics, and show promising accuracy on real-world intuitive physics datasets. Although recent state-of-the-art video generators are often regarded as world models, we show that they struggle with forecasting motion from a single image, even in simple physical scenarios such as falling blocks or mechanical object interactions, despite fine-tuning on such data. We show that this limitation arises from the overhead of generating pixels rather than directly modeling motion.
Authors: Mafalda Malafaia, Peter A. N. Bosman, Coen Rasch, Tanja Alderliesten
Abstract: Accurate and interpretable survival analysis remains a core challenge in oncology. With growing multimodal data and the clinical need for transparent models to support validation and trust, this challenge increases in complexity. We propose an interpretable multimodal AI framework to automate survival analysis by integrating clinical variables and computed tomography imaging. Our MultiFIX-based framework uses deep learning to infer survival-relevant features that are further explained: imaging features are interpreted via Grad-CAM, while clinical variables are modeled as symbolic expressions through genetic programming. Risk estimation employs a transparent Cox regression, enabling stratification into groups with distinct survival outcomes. Using the open-source RADCURE dataset for head and neck cancer, MultiFIX achieves a C-index of 0.838 (prediction) and 0.826 (stratification), outperforming the clinical and academic baseline approaches and aligning with known prognostic markers. These results highlight the promise of interpretable multimodal AI for precision oncology with MultiFIX.
Authors: Md. Mahfuzur Rahman, Kishor Datta Gupta, Marufa Kamal, Fahad Rahman, Sunzida Siddique, Ahmed Rafi Hasan, Mohd Ariful Haque, Roy George
Abstract: Immediate damage assessment is essential after natural catastrophes; yet, conventional hand evaluation techniques are sluggish and perilous. Although satellite and unmanned aerial vehicle (UAV) photos offer extensive perspectives of impacted regions, current computer vision methodologies generally yield just classification labels or segmentation masks, so constraining their capacity to deliver a thorough situational comprehension. We introduce the Vision Language Caption Enhancer (VLCE), a multimodal system designed to produce comprehensive, contextually-informed explanations of disaster imagery. VLCE employs a dual-architecture approach: a CNN-LSTM model with a ResNet50 backbone pretrained on EuroSat satellite imagery for the xBD dataset, and a Vision Transformer (ViT) model pretrained on UAV pictures for the RescueNet dataset. Both systems utilize external semantic knowledge from ConceptNet and WordNet to expand vocabulary coverage and improve description accuracy. We assess VLCE in comparison to leading vision-language models (LLaVA and QwenVL) utilizing CLIPScore for semantic alignment and InfoMetIC for caption informativeness. Experimental findings indicate that VLCE markedly surpasses baseline models, attaining a maximum of 95.33% on InfoMetIC while preserving competitive semantic alignment. Our dual-architecture system demonstrates significant potential for improving disaster damage assessment by automating the production of actionable, information-dense descriptions from satellite and drone photos.
Authors: Lingxiao Kong, Cong Yang, Oya Deniz Beyan, Zeyd Boukhers
Abstract: Multi-Objective Reinforcement Learning (MORL) presents significant challenges and opportunities for optimizing multiple objectives in Large Language Models (LLMs). We introduce a MORL taxonomy and examine the advantages and limitations of various MORL methods when applied to LLM optimization, identifying the need for efficient and flexible approaches that accommodate personalization functionality and inherent complexities in LLMs and RL. We propose a vision for a MORL benchmarking framework that addresses the effects of different methods on diverse objective relationships. As future research directions, we focus on meta-policy MORL development that can improve efficiency and flexibility through its bi-level learning paradigm, highlighting key research questions and potential solutions for improving LLM performance.
Authors: Luca Callisti, Marco Romito, Francesco Triggiano
Abstract: We present a theoretical analysis of some popular adaptive Stochastic Gradient Descent (SGD) methods in the small learning rate regime. Using the stochastic modified equations framework introduced by Li et al., we derive effective continuous stochastic dynamics for these methods. Our key contribution is that sampling-induced noise in SGD manifests in the limit as independent Brownian motions driving the parameter and gradient second momentum evolutions. Furthermore, extending the approach of Malladi et al., we investigate scaling rules between the learning rate and key hyperparameters in adaptive methods, characterising all non-trivial limiting dynamics.
Authors: Yuxuan Zhu, David H. Yang, Mohammad Mohammadi Amiri, Keerthiram Murugesan, Tejaswini Pedapati, Pin-Yu Chen
Abstract: The expanding long-context capabilities of large language models are constrained by a significant memory bottleneck: the key-value (KV) cache required for autoregressive generation. This bottleneck is substantial; for instance, a Llama-3.1-8B model processing a 32K-token prompt at a batch size of 4 requires approximately 16GB for its KV cache, a size exceeding the model's weights. While KV-cache compression via low-rank projection is a promising direction, existing methods rely on a static, offline-learned subspace that performs poorly under data distribution shifts. To overcome these limitations, we introduce OjaKV, a novel framework that integrates a strategic hybrid storage policy with online subspace adaptation. First, OjaKV recognizes that not all tokens are equally important for compression; it preserves the crucial first and most recent tokens in full-rank, maintaining high-fidelity anchors for attention. Second, for the vast majority of intermediate tokens, it applies low-rank compression by incrementally adapting the projection basis using Oja's algorithm for online principal component analysis. This adaptation involves a comprehensive update during prompt prefilling and lightweight periodic updates during decoding, ensuring the subspace remains aligned with the evolving context. Crucially, our framework is fully compatible with modern attention modules like FlashAttention. Experiments demonstrate that OjaKV maintains or even improves zero-shot accuracy at high compression ratios. In particular, OjaKV achieves its strongest gains on very long-context benchmarks that require complex reasoning, highlighting the importance of online subspace adaptation in dynamically tracking context shifts. These results establish our hybrid framework as a practical, plug-and-play solution for memory-efficient long-context inference without requiring model fine-tuning.
Authors: Zitong Lan, Yiduo Hao, Mingmin Zhao
Abstract: Audio editing plays a central role in VR/AR immersion, virtual conferencing, sound design, and other interactive media. However, recent generative audio editing models depend on template-like instruction formats and are restricted to mono-channel audio. These models fail to deal with declarative audio editing, where the user declares what the desired outcome should be, while leaving the details of editing operations to the system. We introduce SmartDJ, a novel framework for stereo audio editing that combines the reasoning capability of audio language models with the generative power of latent diffusion. Given a high-level instruction, SmartDJ decomposes it into a sequence of atomic edit operations, such as adding, removing, or spatially relocating events. These operations are then executed by a diffusion model trained to manipulate stereo audio. To support this, we design a data synthesis pipeline that produces paired examples of high-level instructions, atomic edit operations, and audios before and after each edit operation. Experiments demonstrate that SmartDJ achieves superior perceptual quality, spatial realism, and semantic alignment compared to prior audio editing methods. Demos are available at https://zitonglan.github.io/project/smartdj/smartdj.html.
URLs: https://zitonglan.github.io/project/smartdj/smartdj.html.
Authors: Anjiang Wei, Tarun Suresh, Tianran Sun, Haoze Wu, Ke Wang, Alex Aiken
Abstract: Program verification relies on loop invariants, yet automatically discovering strong invariants remains a long-standing challenge. We introduce a principled framework for evaluating LLMs on invariant synthesis. Our approach uses a verifier-based decision procedure with a formal soundness guarantee and assesses not only correctness but also the speedup that invariants provide in verification. We evaluate 7 state-of-the-art LLMs, and existing LLM-based verifiers against the traditional solver UAutomizer. While LLM-based verifiers represent a promising direction, they do not yet offer a significant advantage over UAutomizer. Model capability also proves critical, as shown by sharp differences in speedups across models, and our benchmark remains an open challenge for current LLMs. Finally, we show that supervised fine-tuning and Best-of-N sampling can improve performance: fine-tuning on 3589 instances raises the percentage of speedup cases for Qwen3-Coder-480B from 8% to 29.2%, and Best-of-N sampling with N=16 improves Claude-sonnet-4 from 8.8% to 22.1%.
Authors: Prakhar Sharma, Haohuang Wen, Vinod Yegneswaran, Ashish Gehani, Phillip Porras, Zhiqiang Lin
Abstract: The evolution toward 6G networks is being accelerated by the Open Radio Access Network (O-RAN) paradigm -- an open, interoperable architecture that enables intelligent, modular applications across public telecom and private enterprise domains. While this openness creates unprecedented opportunities for innovation, it also expands the attack surface, demanding resilient, low-cost, and autonomous security solutions. Legacy defenses remain largely reactive, labor-intensive, and inadequate for the scale and complexity of next-generation systems. Current O-RAN applications focus mainly on network optimization or passive threat detection, with limited capability for closed-loop, automated response. To address this critical gap, we present an agentic AI framework for fully automated, end-to-end threat mitigation in 6G O-RAN environments. MobiLLM orchestrates security workflows through a modular multi-agent system powered by Large Language Models (LLMs). The framework features a Threat Analysis Agent for real-time data triage, a Threat Classification Agent that uses Retrieval-Augmented Generation (RAG) to map anomalies to specific countermeasures, and a Threat Response Agent that safely operationalizes mitigation actions via O-RAN control interfaces. Grounded in trusted knowledge bases such as the MITRE FiGHT framework and 3GPP specifications, and equipped with robust safety guardrails, MobiLLM provides a blueprint for trustworthy AI-driven network security. Initial evaluations demonstrate that MobiLLM can effectively identify and orchestrate complex mitigation strategies, significantly reducing response latency and showcasing the feasibility of autonomous security operations in 6G.
Authors: Adam Lahouari, Jutta Rogal, Mark E. Tuckerman
Abstract: Machine learning interatomic potentials (MLIPs) have become powerful tools to extend molecular simulations beyond the limits of quantum methods, offering near-quantum accuracy at much lower computational cost. Yet, developing reliable MLIPs remains difficult because it requires generating high-quality datasets, preprocessing atomic structures, and carefully training and validating models. In this work, we introduce an Automated Machine Learning Pipeline (AMLP) that unifies the entire workflow from dataset creation to model validation. AMLP employs large-language-model agents to assist with electronic-structure code selection, input preparation, and output conversion, while its analysis suite (AMLP-Analysis), based on ASE supports a range of molecular simulations. The pipeline is built on the MACE architecture and validated on acridine polymorphs, where, with a straightforward fine-tuning of a foundation model, mean absolute errors of ~1.7 meV/atom in energies and ~7.0 meV/{\AA} in forces are achieved. The fitted MLIP reproduces DFT geometries with sub-{\AA} accuracy and demonstrates stability during molecular dynamics simulations in the microcanonical and canonical ensembles.
Authors: Joon Kwon
Abstract: We propose a conversion scheme that turns regret minimizing algorithms into fixed point iterations, with convergence guarantees following from regret bounds. The resulting iterations can be seen as a grand extension of the classical Krasnoselskii--Mann iterations, as the latter are recovered by converting the Online Gradient Descent algorithm. This approach yields new simple iterations for finding fixed points of non-self operators. We also focus on converting algorithms from the AdaGrad family of regret minimizers, and thus obtain fixed point iterations with adaptive guarantees of a new kind. Numerical experiments on various problems demonstrate faster convergence of AdaGrad-based fixed point iterations over Krasnoselskii--Mann iterations.
Authors: J. Cuevas-Zepeda, C. Chavez, J. Estrada, J. Noonan, B. D. Nord, N. Saffold, M. Sofo-Haro, R. Spinola e Castro, S. Trivedi
Abstract: The development of novel instrumentation requires an iterative cycle with three stages: design, prototyping, and testing. Recent advancements in simulation and nanofabrication techniques have significantly accelerated the design and prototyping phases. Nonetheless, detector characterization continues to be a major bottleneck in device development. During the testing phase, a significant time investment is required to characterize the device in different operating conditions and find optimal operating parameters. The total effort spent on characterization and parameter optimization can occupy a year or more of an expert's time. In this work, we present a novel technique for automated sensor calibration that aims to accelerate the testing stage of the development cycle. This technique leverages closed-loop Bayesian optimization (BO), using real-time measurements to guide parameter selection and identify optimal operating states. We demonstrate the method with a novel low-noise CCD, showing that the machine learning-driven tool can efficiently characterize and optimize operation of the sensor in a couple of days without supervision of a device expert.
Authors: Philippe Nadeau, Miguel Rogel, Ivan Bili\'c, Ivan Petrovi\'c, Jonathan Kelly
Abstract: Stably placing an object in a multi-object scene is a fundamental challenge in robotic manipulation, as placements must be penetration-free, establish precise surface contact, and result in a force equilibrium. To assess stability, existing methods rely on running a simulation engine or resort to heuristic, appearance-based assessments. In contrast, our approach integrates stability directly into the sampling process of a diffusion model. To this end, we query an offline sampling-based planner to gather multi-modal placement labels and train a diffusion model to generate stable placements. The diffusion model is conditioned on scene and object point clouds, and serves as a geometry-aware prior. We leverage the compositional nature of score-based generative models to combine this learned prior with a stability-aware loss, thereby increasing the likelihood of sampling from regions of high stability. Importantly, this strategy requires no additional re-training or fine-tuning, and can be directly applied to off-the-shelf models. We evaluate our method on four benchmark scenes where stability can be accurately computed. Our physics-guided models achieve placements that are 56% more robust to forceful perturbations while reducing runtime by 47% compared to a state-of-the-art geometric method.
Authors: Mahindra Singh Rautela, Alexander Most, Siddharth Mansingh, Bradley C. Love, Ayan Biswas, Diane Oyen, Earl Lawrence
Abstract: We introduce MORPH, a shape-agnostic, autoregressive foundation model for partial differential equations (PDEs). MORPH is built on a convolutional vision transformer backbone that seamlessly handles heterogeneous spatiotemporal datasets of varying data dimensionality (1D--3D) at different resolutions, multiple fields with mixed scalar and vector components. The architecture combines (i) component-wise convolution, which jointly processes scalar and vector channels to capture local interactions, (ii) inter-field cross-attention, which models and selectively propagates information between different physical fields, (iii) axial attentions, which factorizes full spatiotemporal self-attention along individual spatial and temporal axes to reduce computational burden while retaining expressivity. We pretrain multiple model variants on a diverse collection of heterogeneous PDE datasets and evaluate transfer to a range of downstream prediction tasks. Using both full-model fine-tuning and parameter-efficient low-rank adapters (LoRA), MORPH outperforms models trained from scratch in both zero-shot and full-shot generalization. Across extensive evaluations, MORPH matches or surpasses strong baselines and recent state-of-the-art models. Collectively, these capabilities present a flexible and powerful backbone for learning from heterogeneous and multimodal nature of scientific observations, charting a path toward scalable and data-efficient scientific machine learning.
Authors: Aurosweta Mahapatra, Ismail Rasim Ulgen, Berrak Sisman
Abstract: Current anti-spoofing systems remain vulnerable to expressive and emotional synthetic speech, since they rarely leverage prosody as a discriminative cue. Prosody is central to human expressiveness and emotion, and humans instinctively use prosodic cues such as F0 patterns and voiced/unvoiced structure to distinguish natural from synthetic speech. In this paper, we propose HuLA, a two-stage prosody-aware multi-task learning framework for spoof detection. In Stage 1, a self-supervised learning (SSL) backbone is trained on real speech with auxiliary tasks of F0 prediction and voiced/unvoiced classification, enhancing its ability to capture natural prosodic variation similar to human perceptual learning. In Stage 2, the model is jointly optimized for spoof detection and prosody tasks on both real and synthetic data, leveraging prosodic awareness to detect mismatches between natural and expressive synthetic speech. Experiments show that HuLA consistently outperforms strong baselines on challenging out-of-domain dataset, including expressive, emotional, and cross-lingual attacks. These results demonstrate that explicit prosodic supervision, combined with SSL embeddings, substantially improves robustness against advanced synthetic speech attacks.
Authors: Jiawei Shan, Yiming Dong, Jiwei Zhao
Abstract: Real-world applications often face scarce labeled data due to the high cost and time requirements of gold-standard experiments, whereas unlabeled data are typically abundant. With the growing adoption of machine learning techniques, it has become increasingly feasible to generate multiple predicted labels using a variety of models and algorithms, including deep learning, large language models, and generative AI. In this paper, we propose a novel approach that safely and adaptively aggregates multiple black-box predictions with unknown quality while preserving valid statistical inference. Our method provides two key guarantees: (i) it never performs worse than using the labeled data alone, regardless of the quality of the predictions; and (ii) if any one of the predictions (without knowing which one) perfectly fits the ground truth, the algorithm adaptively exploits this to achieve either a faster convergence rate or the semiparametric efficiency bound. We demonstrate the effectiveness of the proposed algorithm through experiments on both synthetic and benchmark datasets.
Authors: Ian Taylor, Juliane Mueller, Julie Bessac
Abstract: As data collection and simulation capabilities advance, multi-modal learning, the task of learning from multiple modalities and sources of data, is becoming an increasingly important area of research. Surrogate models that learn from data of multiple auxiliary modalities to support the modeling of a highly expensive quantity of interest have the potential to aid outer loop applications such as optimization, inverse problems, or sensitivity analyses when multi-modal data are available. We develop two multi-modal Bayesian neural network surrogate models and leverage conditionally conjugate distributions in the last layer to estimate model parameters using stochastic variational inference (SVI). We provide a method to perform this conjugate SVI estimation in the presence of partially missing observations. We demonstrate improved prediction accuracy and uncertainty quantification compared to uni-modal surrogate models for both scalar and time series data.
Authors: Shehzeen Hussain, Paarth Neekhara, Xuesong Yang, Edresson Casanova, Subhankar Ghosh, Roy Fejgin, Ryan Langman, Mikyas Desta, Leili Tavabi, Jason Li
Abstract: Developing high-quality text-to-speech (TTS) systems for low-resource languages is challenging due to the scarcity of paired text and speech data. In contrast, automatic speech recognition (ASR) models for such languages are often more accessible, owing to large-scale multilingual pre-training efforts. We propose a framework based on Group Relative Policy Optimization (GRPO) to adapt an autoregressive, multilingual TTS model to new languages. Our method first establishes a language-agnostic foundation for TTS synthesis by training a multilingual baseline with International Phonetic Alphabet (IPA) tokens. Next, we fine-tune this model on limited paired data of the new languages to capture the target language's prosodic features. Finally, we apply GRPO to optimize the model using only unpaired text and speaker prompts, guided by a multi-objective reward from pretrained ASR, speaker verification, and audio quality estimation models. Experiments demonstrate that this pipeline produces intelligible and speaker-consistent speech in low-resource languages, substantially outperforming fine-tuning alone. Furthermore, our GRPO-based framework also improves TTS performance in high-resource languages, surpassing offline alignment methods such as Direct Preference Optimization (DPO) yielding superior intelligibility, speaker similarity, and audio quality.
Authors: Jiannan Xiang, Yun Zhu, Lei Shu, Maria Wang, Lijun Yu, Gabriel Barcik, James Lyon, Srinivas Sunkara, Jindong Chen
Abstract: Developing and testing user interfaces (UIs) and training AI agents to interact with them are challenging due to the dynamic and diverse nature of real-world mobile environments. Existing methods often rely on cumbersome physical devices or limited static analysis of screenshots, which hinders scalable testing and the development of intelligent UI agents. We introduce UISim, a novel image-based UI simulator that offers a dynamic and interactive platform for exploring mobile phone environments purely from screen images. Our system employs a two-stage method: given an initial phone screen image and a user action, it first predicts the abstract layout of the next UI state, then synthesizes a new, visually consistent image based on this predicted layout. This approach enables the realistic simulation of UI transitions. UISim provides immediate practical benefits for UI testing, rapid prototyping, and synthetic data generation. Furthermore, its interactive capabilities pave the way for advanced applications, such as UI navigation task planning for AI agents. Our experimental results show that UISim outperforms end-to-end UI generation baselines in generating realistic and coherent subsequent UI states, highlighting its fidelity and potential to streamline UI development and enhance AI agent training.
Authors: Michael Yeung, Keisuke Toyama, Toya Teramoto, Shusuke Takahashi, Tamaki Kojima
Abstract: Automatic drum transcription (ADT) is traditionally formulated as a discriminative task to predict drum events from audio spectrograms. In this work, we redefine ADT as a conditional generative task and introduce Noise-to-Notes (N2N), a framework leveraging diffusion modeling to transform audio-conditioned Gaussian noise into drum events with associated velocities. This generative diffusion approach offers distinct advantages, including a flexible speed-accuracy trade-off and strong inpainting capabilities. However, the generation of binary onset and continuous velocity values presents a challenge for diffusion models, and to overcome this, we introduce an Annealed Pseudo-Huber loss to facilitate effective joint optimization. Finally, to augment low-level spectrogram features, we propose incorporating features extracted from music foundation models (MFMs), which capture high-level semantic information and enhance robustness to out-of-domain drum audio. Experimental results demonstrate that including MFM features significantly improves robustness and N2N establishes a new state-of-the-art performance across multiple ADT benchmarks.
Authors: Linxiao Zeng, Haoyun Deng, Kangyuan Shu, Shizhen Wang
Abstract: Large Language Models (LLMs) have recently demonstrated impressive capabilities in various text generation tasks. However, it remains challenging to use them off-the-shelf in streaming applications (such as live translation), where the output must continually update as the input context expands, while still maintaining a reasonable computational cost to meet the latency requirement. In this work, we reexamine the re-translation approach to simultaneous translation and propose Self-Speculative Biased Decoding, a novel inference paradigm designed to avoid repeatedly generating output from scratch for a consistently growing input stream. We propose using the most recent output as a draft for the current growing input context. During the verification stage, the output will be biased towards the draft token for a higher draft acceptance rate. This strategy not only minimizes flickering that might distract users but also leads to higher speedups. Conventional decoding may take charge from the point of divergence after draft verification and continue until the end condition is met. Unlike existing speculative decoding strategies, our approach eliminates the need for draft computations, making it a model-agnostic and plug-and-play solution for accelerating latency-sensitive streaming applications. Experimental results on simultaneous text-to-text re-translation demonstrate that our approach achieves up to 1.7x speedup compared to conventional auto-regressive re-translation without compromising quality. Additionally, it significantly reduces flickering by 80% by incorporating the display-only mask-k technique.
Authors: Ammar Ahmed, Azal Ahmad Khan, Ayaan Ahmad, Sheng Di, Zirui Liu, Ali Anwar
Abstract: Large reasoning models improve accuracy by producing long reasoning traces, but this inflates latency and cost, motivating inference-time efficiency. We propose Retrieval-of-Thought (RoT), which reuses prior reasoning as composable ``thought" steps to guide new problems. RoT organizes steps into a thought graph with sequential and semantic edges to enable fast retrieval and flexible recombination. At inference, RoT retrieves query-relevant nodes and applies reward-guided traversal to assemble a problem-specific template that guides generation. This dynamic template reuse reduces redundant exploration and, therefore, reduces output tokens while preserving accuracy. We evaluate RoT on reasoning benchmarks with multiple models, measuring accuracy, token usage, latency, and memory overhead. Findings show small prompt growth but substantial efficiency gains, with RoT reducing output tokens by up to 40%, inference latency by 82%, and cost by 59% while maintaining accuracy. RoT establishes a scalable paradigm for efficient LRM reasoning via dynamic template construction through retrieval.
Authors: Anirud Nandakumar, Chayan Banerjee, Lelitha Devi Vanajakshi
Abstract: Efficient traffic signal control (TSC) is crucial for reducing congestion, travel delays, pollution, and for ensuring road safety. Traditional approaches, such as fixed signal control and actuated control, often struggle to handle dynamic traffic patterns. In this study, we propose a novel adaptive TSC framework that leverages Reinforcement Learning (RL), using the Proximal Policy Optimization (PPO) algorithm, to minimize total queue lengths across all signal phases. The challenge of efficiently representing highly stochastic traffic conditions for an RL controller is addressed through multiple state representations, including an expanded state space, an autoencoder representation, and a K-Planes-inspired representation. The proposed algorithm has been implemented using the Simulation of Urban Mobility (SUMO) traffic simulator and demonstrates superior performance over both traditional methods and other conventional RL-based approaches in reducing queue lengths. The best performing configuration achieves an approximately 29% reduction in average queue lengths compared to the traditional Webster method. Furthermore, comparative evaluation of alternative reward formulations demonstrates the effectiveness of the proposed queue-based approach, showcasing the potential for scalable and adaptive urban traffic management.
Authors: Wenyi Gong, Mieszko Lis
Abstract: Many modern ViT backbones adopt spatial architectural designs, such as window attention, decomposed relative positional embeddings in SAM, and RoPE in DINOv3. Such architectures impose new challenges on token reduction, as the vast majority of existing methods fail to preserve the spatial structure these architectures depend on. In this paper, we introduce a simple yet effective token merging method that maintains spatial integrity, enabling seamless compatibility with spatial architectures. We reconcile two seemingly conflicting requirements: (i)exploiting the uneven information distribution across the spatial layout while (ii)preserving the spatial structure post-merging. Our approach employs (i)a 2D reduction strategy to enforce structured token layouts, (ii)a spatial-aware merging algorithm that maintains relative token positions, and (iii)a novel max-magnitude-per-dimension token representation that preserves salient features. Our method demonstrates strong performance both off-the-shelf and with fine-tuning, achieving state-of-the-art results on spatial and non-spatial architectures across various vision tasks. Specifically, we achieve 1.25x speedup on SAM-H with only 0.7% mIOU drop evaluated on COCO off-the-shelf, and 1.15x speedup on DeiT-B with no top-1 accuracy drop on ImageNet within just one epoch of fine-tuning.
Authors: Jiyuan Pei, Yi Mei, Jialin Liu, Mengjie Zhang, Xin Yao
Abstract: Recent neural solvers have demonstrated promising performance in learning to solve routing problems. However, existing studies are primarily based on one-off training on one or a set of predefined problem distributions and scales, i.e., tasks. When a new task arises, they typically rely on either zero-shot generalization, which may be poor due to the discrepancies between the new task and the training task(s), or fine-tuning the pretrained solver on the new task, which possibly leads to catastrophic forgetting of knowledge acquired from previous tasks. This paper explores a novel lifelong learning paradigm for neural VRP solvers, where multiple tasks with diverse distributions and scales arise sequentially over time. Solvers are required to effectively and efficiently learn to solve new tasks while maintaining their performance on previously learned tasks. Consequently, a novel framework called Lifelong Learning Router with Behavior Consolidation (LLR-BC) is proposed. LLR-BC consolidates prior knowledge effectively by aligning behaviors of the solver trained on a new task with the buffered ones in a decision-seeking way. To encourage more focus on crucial experiences, LLR-BC assigns greater consolidated weights to decisions with lower confidence. Extensive experiments on capacitated vehicle routing problems and traveling salesman problems demonstrate LLR-BC's effectiveness in training high-performance neural solvers in a lifelong learning setting, addressing the catastrophic forgetting issue, maintaining their plasticity, and improving zero-shot generalization ability.
Authors: Han Yuan, Yue Zhao, Li Zhang, Wuqiong Luo, Zheng Ma
Abstract: Structured output from large language models (LLMs) has enhanced efficiency in processing generated information and is increasingly adopted in industrial applications. Prior studies have investigated the impact of structured output on LLMs' generation quality, often presenting one-way findings. Some suggest that structured format enhances completeness and factual accuracy, while others argue that it restricts the reasoning capacity of LLMs and leads to reductions in standard evaluation metrics. Potential limitations of these assessments include restricted testing scenarios, weakly controlled comparative settings, and reliance on coarse metrics. In this work, we present a refined analysis using causal inference. Based on one assumed and two guaranteed constraints, we derive five potential causal structures characterizing the influence of structured output on LLMs' generation: (1) collider without m-bias, (2) collider with m-bias, (3) single cause from instruction, (4) single cause from output format, and (5) independence. Across seven public and one developed reasoning tasks, we find that coarse metrics report positive, negative, or neutral effects of structured output on GPT-4o's generation. However, causal inference reveals no causal impact in 43 out of 48 scenarios. In the remaining 5, 3 involve multifaceted causal structures influenced by concrete instructions.
Authors: Jingkai Guo, Chaitali Chakrabarti, Deliang Fan
Abstract: Model integrity of Large language models (LLMs) has become a pressing security concern with their massive online deployment. Prior Bit-Flip Attacks (BFAs) -- a class of popular AI weight memory fault-injection techniques -- can severely compromise Deep Neural Networks (DNNs): as few as tens of bit flips can degrade accuracy toward random guessing. Recent studies extend BFAs to LLMs and reveal that, despite the intuition of better robustness from modularity and redundancy, only a handful of adversarial bit flips can also cause LLMs' catastrophic accuracy degradation. However, existing BFA methods typically focus on either integer or floating-point models separately, limiting attack flexibility. Moreover, in floating-point models, random bit flips often cause perturbed parameters to extreme values (e.g., flipping in exponent bit), making it not stealthy and leading to numerical runtime error (e.g., invalid tensor values (NaN/Inf)). In this work, for the first time, we propose SBFA (Sneaky Bit-Flip Attack), which collapses LLM performance with only one single bit flip while keeping perturbed values within benign layer-wise weight distribution. It is achieved through iterative searching and ranking through our defined parameter sensitivity metric, ImpactScore, which combines gradient sensitivity and perturbation range constrained by the benign layer-wise weight distribution. A novel lightweight SKIP searching algorithm is also proposed to greatly reduce searching complexity, which leads to successful SBFA searching taking only tens of minutes for SOTA LLMs. Across Qwen, LLaMA, and Gemma models, with only one single bit flip, SBFA successfully degrades accuracy to below random levels on MMLU and SST-2 in both BF16 and INT8 data formats. Remarkably, flipping a single bit out of billions of parameters reveals a severe security concern of SOTA LLM models.
Authors: Erdun Gao, Jake Fawkes, Dino Sejdinovic
Abstract: Estimating the Conditional Average Treatment Effect (CATE) is often constrained by the high cost of obtaining outcome measurements, making active learning essential. However, conventional active learning strategies suffer from a fundamental objective mismatch. They are designed to reduce uncertainty in model parameters or in observable factual outcomes, failing to directly target the unobservable causal quantities that are the true objects of interest. To address this misalignment, we introduce the principle of causal objective alignment, which posits that acquisition functions should target unobservable causal quantities, such as the potential outcomes and the CATE, rather than indirect proxies. We operationalize this principle through the Causal-EPIG framework, which adapts the information-theoretic criterion of Expected Predictive Information Gain (EPIG) to explicitly quantify the value of a query in terms of reducing uncertainty about unobservable causal quantities. From this unified framework, we derive two distinct strategies that embody a fundamental trade-off: a comprehensive approach that robustly models the full causal mechanisms via the joint potential outcomes, and a focused approach that directly targets the CATE estimand for maximum sample efficiency. Extensive experiments demonstrate that our strategies consistently outperform standard baselines, and crucially, reveal that the optimal strategy is context-dependent, contingent on the base estimator and data complexity. Our framework thus provides a principled guide for sample-efficient CATE estimation in practice.
Authors: Thanh-Long V. Le, Myeongho Jeon, Kim Vu, Viet Lai, Eunho Yang
Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) is a powerful framework for improving the reasoning abilities of Large Language Models (LLMs). However, current methods such as GRPO rely only on problems where the model responses to the same input differ in correctness, while ignoring those where all responses receive the same reward - so-called zero-variance prompts. In this work, we argue that such prompts are not useless but can, in fact, provide meaningful feedback for policy optimization. To this end, we introduce RL with Zero-Variance Prompts (RL-ZVP), a novel algorithm that extract learning signals from zero-variance prompts. RL-ZVP directly rewards correctness and penalizes errors even without contrasting responses, modulating feedback with token-level characteristics to preserve informative, nuanced signals. Across six math reasoning benchmarks, RL-ZVP achieves significant improvements of up to 8.61 points in accuracy and 7.77 points in pass rate over GRPO, while consistently outperforming other baselines that filter out zero-variance prompts. These results highlight the untapped potential of learning from zero-variance prompts in RLVR.
Authors: Naibin Gu, Zhenyu Zhang, Yuchen Feng, Yilong Chen, Peng Fu, Zheng Lin, Shuohuan Wang, Yu Sun, Hua Wu, Weiping Wang, Haifeng Wang
Abstract: Mixture-of-Experts (MoE) models typically fix the number of activated experts $k$ at both training and inference. Intuitively, activating more experts at inference $k'$ (where $k'> k$) means engaging a larger set of model parameters for the computation and thus is expected to improve performance. However, contrary to this intuition, we find the scaling range to be so narrow that performance begins to degrade rapidly after only a slight increase in the number of experts. Further investigation reveals that this degradation stems from a lack of learned collaboration among experts. To address this, we introduce Elastic Mixture-of-Experts (EMoE), a novel training framework that enables MoE models to scale the number of activated experts at inference without incurring additional training overhead. By simultaneously training experts to collaborate in diverse combinations and encouraging the router for high-quality selections, EMoE ensures robust performance across computational budgets at inference. We conduct extensive experiments on various MoE settings. Our results show that EMoE significantly expands the effective performance-scaling range, extending it to as much as 2-3$\times$ the training-time $k$, while also pushing the model's peak performance to a higher level.
Authors: Zhengyan Wan, Yidong Ouyang, Qiang Yao, Liyan Xie, Fang Fang, Hongyuan Zha, Guang Cheng
Abstract: Discrete flow models offer a powerful framework for learning distributions over discrete state spaces and have demonstrated superior performance compared to the discrete diffusion model. However, their convergence properties and error analysis remain largely unexplored. In this work, we develop a unified framework grounded in stochastic calculus theory to systematically investigate the theoretical properties of discrete flow. Specifically, we derive the KL divergence of two path measures regarding two continuous-time Markov chains (CTMCs) with different transition rates by developing a novel Girsanov-type theorem, and provide a comprehensive analysis that encompasses the error arising from transition rate estimation and early stopping, where the first type of error has rarely been analyzed by existing works. Unlike discrete diffusion models, discrete flow incurs no truncation error caused by truncating the time horizon in the noising process. Building on generator matching and uniformization, we establish non-asymptotic error bounds for distribution estimation. Our results provide the first error analysis for discrete flow models.
Authors: Ivan Lau, Jonathan Scarlett
Abstract: In this paper, we study the problem of distributed mean estimation with 1-bit communication constraints. We propose a mean estimator that is based on (randomized and sequentially-chosen) interval queries, whose 1-bit outcome indicates whether the given sample lies in the specified interval. Our estimator is $(\epsilon, \delta)$-PAC for all distributions with bounded mean ($-\lambda \le \mathbb{E}(X) \le \lambda $) and variance ($\mathrm{Var}(X) \le \sigma^2$) for some known parameters $\lambda$ and $\sigma$. We derive a sample complexity bound $\widetilde{O}\big( \frac{\sigma^2}{\epsilon^2}\log\frac{1}{\delta} + \log\frac{\lambda}{\sigma}\big)$, which matches the minimax lower bound for the unquantized setting up to logarithmic factors and the additional $\log\frac{\lambda}{\sigma}$ term that we show to be unavoidable. We also establish an adaptivity gap for interval-query based estimators: the best non-adaptive mean estimator is considerably worse than our adaptive mean estimator for large $\frac{\lambda}{\sigma}$. Finally, we give tightened sample complexity bounds for distributions with stronger tail decay, and present additional variants that (i) handle an unknown sampling budget (ii) adapt to the unknown true variance given (possibly loose) upper and lower bounds on the variance, and (iii) use only two stages of adaptivity at the expense of more complicated (non-interval) queries.
Authors: Carlo Dindorf, Jonas Dully, Steven Simon, Dennis Perchthaler, Stephan Becker, Hannah Ehmann, Kjell Heitmann, Bernd Stetter, Christian Diers, Michael Fr\"ohlich
Abstract: Plantar pressure mapping is essential in clinical diagnostics and sports science, yet large heterogeneous datasets often contain outliers from technical errors or procedural inconsistencies. Statistical Parametric Mapping (SPM) provides interpretable analyses but is sensitive to alignment and its capacity for robust outlier detection remains unclear. This study compares an SPM approach with an explainable machine learning (ML) approach to establish transparent quality-control pipelines for plantar pressure datasets. Data from multiple centers were annotated by expert consensus and enriched with synthetic anomalies resulting in 798 valid samples and 2000 outliers. We evaluated (i) a non-parametric, registration-dependent SPM approach and (ii) a convolutional neural network (CNN), explained using SHapley Additive exPlanations (SHAP). Performance was assessed via nested cross-validation; explanation quality via a semantic differential survey with domain experts. The ML model reached high accuracy and outperformed SPM, which misclassified clinically meaningful variations and missed true outliers. Experts perceived both SPM and SHAP explanations as clear, useful, and trustworthy, though SPM was assessed less complex. These findings highlight the complementary potential of SPM and explainable ML as approaches for automated outlier detection in plantar pressure data, and underscore the importance of explainability in translating complex model outputs into interpretable insights that can effectively inform decision-making.
Authors: Divake Kumar, Sina Tayebati, Francesco Migliarba, Ranganath Krishnan, Amit Ranjan Trivedi
Abstract: Deep learning models in robotics often output point estimates with poorly calibrated confidences, offering no native mechanism to quantify predictive reliability under novel, noisy, or out-of-distribution inputs. Conformal prediction (CP) addresses this gap by providing distribution-free coverage guarantees, yet its reliance on fixed nonconformity scores ignores context and can yield intervals that are overly conservative or unsafe. We address this with Learnable Conformal Prediction (LCP), which replaces fixed scores with a lightweight neural function that leverages geometric, semantic, and task-specific features to produce context-aware uncertainty sets. LCP maintains CP's theoretical guarantees while reducing prediction set sizes by 18% in classification, tightening detection intervals by 52%, and improving path planning safety from 72% to 91% success with minimal overhead. Across three robotic tasks on seven benchmarks, LCP consistently outperforms Standard CP and ensemble baselines. In classification on CIFAR-100 and ImageNet, it achieves smaller set sizes (4.7-9.9% reduction) at target coverage. For object detection on COCO, BDD100K, and Cityscapes, it produces 46-54% tighter bounding boxes. In path planning through cluttered environments, it improves success to 91.5% with only 4.5% path inflation, compared to 12.2% for Standard CP. The method is lightweight (approximately 4.8% runtime overhead, 42 KB memory) and supports online adaptation, making it well suited to resource-constrained autonomous systems. Hardware evaluation shows LCP adds less than 1% memory and 15.9% inference overhead, yet sustains 39 FPS on detection tasks while being 7.4 times more energy-efficient than ensembles.
Authors: Lingguang Wang, \"Omer \c{S}ahin Ta\c{s}, Marlon Steiner, Christoph Stiller
Abstract: Learning-based planners are sensitive to the long-tailed distribution of driving data. Common maneuvers dominate datasets, while dangerous or rare scenarios are sparse. This imbalance can bias models toward the frequent cases and degrade performance on critical scenarios. To tackle this problem, we compare balancing strategies for sampling training data and find reweighting by trajectory pattern an effective approach. We then present FlowDrive, a flow-matching trajectory planner that learns a conditional rectified flow to map noise directly to trajectory distributions with few flow-matching steps. We further introduce moderated, in-the-loop guidance that injects small perturbation between flow steps to systematically increase trajectory diversity while remaining scene-consistent. On nuPlan and the interaction-focused interPlan benchmarks, FlowDrive achieves state-of-the-art results among learning-based planners and approaches methods with rule-based refinements. After adding moderated guidance and light post-processing (FlowDrive*), it achieves overall state-of-the-art performance across nearly all benchmark splits.
Authors: Jewon Lee, Wooksu Shin, Seungmin Yang, Ki-Ung Song, DongUk Lim, Jaeyeon Kim, Tae-Ho Kim, Bo-Kyeong Kim
Abstract: Efficient processing of high-resolution images is crucial for real-world vision-language applications. However, existing Large Vision-Language Models (LVLMs) incur substantial computational overhead due to the large number of vision tokens. With the advent of "thinking with images" models, reasoning now extends beyond text to the visual domain. This capability motivates our two-stage "coarse-to-fine" reasoning pipeline: first, a downsampled image is analyzed to identify task-relevant regions; then, only these regions are cropped at full resolution and processed in a subsequent reasoning stage. This approach reduces computational cost while preserving fine-grained visual details where necessary. A major challenge lies in inferring which regions are truly relevant to a given query. Recent related methods often fail in the first stage after input-image downsampling, due to perception-driven reasoning, where clear visual information is required for effective reasoning. To address this issue, we propose ERGO (Efficient Reasoning & Guided Observation) that performs reasoning-driven perception-leveraging multimodal context to determine where to focus. Our model can account for perceptual uncertainty, expanding the cropped region to cover visually ambiguous areas for answering questions. To this end, we develop simple yet effective reward components in a reinforcement learning framework for coarse-to-fine perception. Across multiple datasets, our approach delivers higher accuracy than the original model and competitive methods, with greater efficiency. For instance, ERGO surpasses Qwen2.5-VL-7B on the V* benchmark by 4.7 points while using only 23% of the vision tokens, achieving a 3x inference speedup. The code and models can be found at: https://github.com/nota-github/ERGO.
Authors: Dong-Kyum Kim, Minsung Kim, Jea Kwon, Nakyeong Yang, Meeyoung Cha
Abstract: The reversal curse -- a language model's (LM) inability to infer an unseen fact ``B is A'' from a learned fact ``A is B'' -- is widely considered a fundamental limitation. We show that this is not an inherent failure but an artifact of how models encode knowledge. By training LMs from scratch on a synthetic dataset of relational knowledge graphs, we demonstrate that bilinear relational structure emerges in their hidden representations. This structure substantially alleviates the reversal curse, enabling LMs to infer unseen reverse facts. Crucially, we also find that this bilinear structure plays a key role in consistent model editing. When a fact is updated in a LM with this structure, the edit correctly propagates to its reverse and other logically dependent facts. In contrast, models lacking this representation not only suffer from the reversal curse but also fail to generalize edits, further introducing logical inconsistencies. Our results establish that training on a relational knowledge dataset induces the emergence of bilinear internal representations, which in turn enable LMs to behave in a logically consistent manner after editing. This implies that the success of model editing depends critically not just on editing algorithms but on the underlying representational geometry of the knowledge being modified.
Authors: Trinnhallen Brisley, Gordon Ross, Daniel Paulin
Abstract: Hawkes process models are used in settings where past events increase the likelihood of future events occurring. Many applications record events as counts on a regular grid, yet discrete-time Hawkes models remain comparatively underused and are often constrained by fixed-form baselines and excitation kernels. In particular, there is a lack of flexible, nonparametric treatments of both the baseline and the excitation in discrete time. To this end, we propose the Gaussian Process Discrete Hawkes Process (GP-DHP), a nonparametric framework that places Gaussian process priors on both the baseline and the excitation and performs inference through a collapsed latent representation. This yields smooth, data-adaptive structure without prespecifying trends, periodicities, or decay shapes, and enables maximum a posteriori (MAP) estimation with near-linear-time \(O(T\log T)\) complexity. A closed-form projection recovers interpretable baseline and excitation functions from the optimized latent trajectory. In simulations, GP-DHP recovers diverse excitation shapes and evolving baselines. In case studies on U.S. terrorism incidents and weekly Cryptosporidiosis counts, it improves test predictive log-likelihood over standard parametric discrete Hawkes baselines while capturing bursts, delays, and seasonal background variation. The results indicate that flexible discrete-time self-excitation can be achieved without sacrificing scalability or interpretability.
Authors: Hanlin Zhu, Tianyu Guo, Song Mei, Stuart Russell, Nikhil Ghosh, Alberto Bietti, Jiantao Jiao
Abstract: As LLMs are increasingly deployed as agents, agentic reasoning - the ability to combine tool use, especially search, and reasoning - becomes a critical skill. However, it is hard to disentangle agentic reasoning when evaluated in complex environments and tasks. Current agent benchmarks often mix agentic reasoning with challenging math reasoning, expert-level knowledge, and other advanced capabilities. To fill this gap, we build a novel benchmark, GSM-Agent, where an LLM agent is required to solve grade-school-level reasoning problems, but is only presented with the question in the prompt without the premises that contain the necessary information to solve the task, and needs to proactively collect that information using tools. Although the original tasks are grade-school math problems, we observe that even frontier models like GPT-5 only achieve 67% accuracy. To understand and analyze the agentic reasoning patterns, we propose the concept of agentic reasoning graph: cluster the environment's document embeddings into nodes, and map each tool call to its nearest node to build a reasoning path. Surprisingly, we identify that the ability to revisit a previously visited node, widely taken as a crucial pattern in static reasoning, is often missing for agentic reasoning for many models. Based on the insight, we propose a tool-augmented test-time scaling method to improve LLM's agentic reasoning performance by adding tools to encourage models to revisit. We expect our benchmark and the agentic reasoning framework to aid future studies of understanding and pushing the boundaries of agentic reasoning.
Authors: Yessin Moakher, Malik Tiomoko, Cosme Louart, Zhenyu Liao
Abstract: We present a rigorous asymptotic analysis of Echo State Networks (ESNs) in a teacher student setting with a linear teacher with oracle weights. Leveraging random matrix theory, we derive closed form expressions for the asymptotic bias, variance, and mean-squared error (MSE) as functions of the input statistics, the oracle vector, and the ridge regularization parameter. The analysis reveals two key departures from classical ridge regression: (i) ESNs do not exhibit double descent, and (ii) ESNs attain lower MSE when both the number of training samples and the teacher memory length are limited. We further provide an explicit formula for the optimal regularization in the identity input covariance case, and propose an efficient numerical scheme to compute the optimum in the general case. Together, these results offer interpretable theory and practical guidelines for tuning ESNs, helping reconcile recent empirical observations with provable performance guarantees
Authors: Emmanuel de Salis, Massimo De Santis, Davide Piras, Sambit K. Giri, Michele Bianco, Nicolas Cerardi, Philipp Denzel, Merve Selcuk-Simsek, Kelley M. Hess, M. Carmen Toribio, Franz Kirsten, Hatem Ghorbel
Abstract: Hydrogen is the most abundant element in our Universe. The first generation of stars and galaxies produced photons that ionized hydrogen gas, driving a cosmological event known as the Epoch of Reionization (EoR). The upcoming Square Kilometre Array Observatory (SKAO) will map the distribution of neutral hydrogen during this era, aiding in the study of the properties of these first-generation objects. Extracting astrophysical information will be challenging, as SKAO will produce a tremendous amount of data where the hydrogen signal will be contaminated with undesired foreground contamination and instrumental systematics. To address this, we develop the latest deep learning techniques to extract information from the 2D power spectra of the hydrogen signal expected from SKAO. We apply a series of neural network models to these measurements and quantify their ability to predict the history of cosmic hydrogen reionization, which is connected to the increasing number and efficiency of early photon sources. We show that the study of the early Universe benefits from modern deep learning technology. In particular, we demonstrate that dedicated machine learning algorithms can achieve more than a $0.95$ $R^2$ score on average in recovering the reionization history. This enables accurate and precise cosmological and astrophysical inference of structure formation in the early Universe.
Authors: Emily Honey, Anders Helbo, Jens Petersen
Abstract: Computed tomography (CT) is essential for treatment and diagnostics; In case CT are missing or otherwise difficult to obtain, methods for generating synthetic CT (sCT) images from magnetic resonance imaging (MRI) images are sought after. Therefore, it is valuable to establish a reference for what strategies are most effective for MRI-to-CT translation. In this paper, we compare the performance of two frequently used architectures for MRI-to-CT translation: a conditional generative adversarial network (cGAN) and a conditional denoising diffusion probabilistic model (cDDPM). We chose well-established implementations to represent each architecture: Pix2Pix for cGAN, and Palette for cDDPM. We separate the classical 3D translation problem into a sequence of 2D translations on the transverse plane, to investigate the viability of a strategy that reduces the computational cost. We also investigate the impact of conditioning the generative process on a single MRI image/slice and on multiple MRI slices. The performance is assessed using a thorough evaluation protocol, including a novel slice-wise metric Similarity Of Slices (SIMOS), which measures the continuity between transverse slices when compiling the sCTs into 3D format. Our comparative analysis revealed that MRI-to-CT generative models benefit from multi-channel conditional input and using cDDPM as an architecture.
Authors: Masahiro Kato
Abstract: This study considers the estimation of the average treatment effect (ATE). For ATE estimation, we estimate the propensity score through direct bias-correction term estimation. Let $\{(X_i, D_i, Y_i)\}_{i=1}^{n}$ be the observations, where $X_i \in \mathbb{R}^p$ denotes $p$-dimensional covariates, $D_i \in \{0, 1\}$ denotes a binary treatment assignment indicator, and $Y_i \in \mathbb{R}$ is an outcome. In ATE estimation, the bias-correction term $h_0(X_i, D_i) = \frac{1[D_i = 1]}{e_0(X_i)} - \frac{1[D_i = 0]}{1 - e_0(X_i)}$ plays an important role, where $e_0(X_i)$ is the propensity score, the probability of being assigned treatment $1$. In this study, we propose estimating $h_0$ (or equivalently the propensity score $e_0$) by directly minimizing the prediction error of $h_0$. Since the bias-correction term $h_0$ is essential for ATE estimation, this direct approach is expected to improve estimation accuracy for the ATE. For example, existing studies often employ maximum likelihood or covariate balancing to estimate $e_0$, but these approaches may not be optimal for accurately estimating $h_0$ or the ATE. We present a general framework for this direct bias-correction term estimation approach from the perspective of Bregman divergence minimization and conduct simulation studies to evaluate the effectiveness of the proposed method.
Authors: Malik Tiomoko, Ekkehard Schnoor
Abstract: Regularized linear regression is central to machine learning, yet its high-dimensional behavior with informative priors remains poorly understood. We provide the first exact asymptotic characterization of training and test risks for maximum a posteriori (MAP) regression with Gaussian priors centered at a domain-informed initialization. Our framework unifies ridge regression, least squares, and prior-informed estimators, and -- using random matrix theory -- yields closed-form risk formulas that expose the bias-variance-prior tradeoff, explain double descent, and quantify prior mismatch. We also identify a closed-form minimizer of test risk, enabling a simple estimator of the optimal regularization parameter. Simulations confirm the theory with high accuracy. By connecting Bayesian priors, classical regularization, and modern asymptotics, our results provide both conceptual clarity and practical guidance for learning with structured prior knowledge.
Authors: Merve Atasever, Matthew Hong, Mihir Nitin Kulkarni, Qingpei Li, Jyotirmoy V. Deshmukh
Abstract: Multi-Agent Path Finding (MAPF) poses a significant and challenging problem critical for applications in robotics and logistics, particularly due to its combinatorial complexity and the partial observability inherent in realistic environments. Decentralized reinforcement learning methods commonly encounter two substantial difficulties: first, they often yield self-centered behaviors among agents, resulting in frequent collisions, and second, their reliance on complex communication modules leads to prolonged training times, sometimes spanning weeks. To address these challenges, we propose an efficient decentralized planning framework based on the Decision Transformer (DT), uniquely leveraging offline reinforcement learning to substantially reduce training durations from weeks to mere hours. Crucially, our approach effectively handles long-horizon credit assignment and significantly improves performance in scenarios with sparse and delayed rewards. Furthermore, to overcome adaptability limitations inherent in standard RL methods under dynamic environmental changes, we integrate a large language model (GPT-4o) to dynamically guide agent policies. Extensive experiments in both static and dynamically changing environments demonstrate that our DT-based approach, augmented briefly by GPT-4o, significantly enhances adaptability and performance.
Authors: Kirsten Odendaal, Neela Kaushik, Spencer Halverson
Abstract: This work integrates StyleGAN, DragGAN and Principal Component Analysis (PCA) to enhance the latent space efficiency and controllability of GAN-generated images. Style-GAN provides a structured latent space, DragGAN enables intuitive image manipulation, and PCA reduces dimensionality and facilitates cross-model alignment for more streamlined and interpretable exploration of latent spaces. We apply our techniques to the Animal Faces High Quality (AFHQ) dataset, and find that our approach of integrating PCA-based dimensionality reduction with the Drag-GAN framework for image manipulation retains performance while improving optimization efficiency. Notably, introducing PCA into the latent W+ layers of DragGAN can consistently reduce the total optimization time while maintaining good visual quality and even boosting the Structural Similarity Index Measure (SSIM) of the optimized image, particularly in shallower latent spaces (W+ layers = 3). We also demonstrate capability for aligning images generated by two StyleGAN models trained on similar but distinct data domains (AFHQ-Dog and AFHQ-Cat), and show that we can control the latent space of these aligned images to manipulate the images in an intuitive and interpretable manner. Our findings highlight the possibility for efficient and interpretable latent space control for a wide range of image synthesis and editing applications.
Authors: Matt Y. Cheung, Ashok Veeraraghavan, Guha Balakrishnan
Abstract: In clinical applications, the utility of segmentation models is often based on the accuracy of derived downstream metrics such as organ size, rather than by the pixel-level accuracy of the segmentation masks themselves. Thus, uncertainty quantification for such metrics is crucial for decision-making. Conformal prediction (CP) is a popular framework to derive such principled uncertainty guarantees, but applying CP naively to the final scalar metric is inefficient because it treats the complex, non-linear segmentation-to-metric pipeline as a black box. We introduce COMPASS, a practical framework that generates efficient, metric-based CP intervals for image segmentation models by leveraging the inductive biases of their underlying deep neural networks. COMPASS performs calibration directly in the model's representation space by perturbing intermediate features along low-dimensional subspaces maximally sensitive to the target metric. We prove that COMPASS achieves valid marginal coverage under exchangeability and nestedness assumptions. Empirically, we demonstrate that COMPASS produces significantly tighter intervals than traditional CP baselines on four medical image segmentation tasks for area estimation of skin lesions and anatomical structures. Furthermore, we show that leveraging learned internal features to estimate importance weights allows COMPASS to also recover target coverage under covariate shifts. COMPASS paves the way for practical, metric-based uncertainty quantification for medical image segmentation.
Authors: Simone Lionetti, Fabian Gr\"oger, Philippe Gottfrois, Alvaro Gonzalez-Jimenez, Ludovic Amruthalingam, Alexander A. Navarini, Marc Pouly
Abstract: Clinical dataset labels are rarely certain as annotators disagree and confidence is not uniform across cases. Typical aggregation procedures, such as majority voting, obscure this variability. In simple experiments on medical imaging benchmarks, accounting for the confidence in binary labels significantly impacts model rankings. We therefore argue that machine-learning evaluations should explicitly account for annotation uncertainty using probabilistic metrics that directly operate on distributions. These metrics can be applied independently of the annotations' generating process, whether modeled by simple counting, subjective confidence ratings, or probabilistic response models. They are also computationally lightweight, as closed-form expressions have linear-time implementations once examples are sorted by model score. We thus urge the community to release raw annotations for datasets and to adopt uncertainty-aware evaluation so that performance estimates may better reflect clinical data.
Authors: Aleksandar Terzi\'c, Nicolas Menet, Michael Hersche, Thomas Hofmann, Abbas Rahimi
Abstract: Modern state-space models (SSMs) often utilize transition matrices which enable efficient computation but pose restrictions on the model's expressivity, as measured in terms of the ability to emulate finite-state automata (FSA). While unstructured transition matrices are optimal in terms of expressivity, they come at a prohibitively high compute and memory cost even for moderate state sizes. We propose a structured sparse parametrization of transition matrices in SSMs that enables FSA state tracking with optimal state size and depth, while keeping the computational cost of the recurrence comparable to that of diagonal SSMs. Our method, PD-SSM, parametrizes the transition matrix as the product of a column one-hot matrix ($P$) and a complex-valued diagonal matrix ($D$). Consequently, the computational cost of parallel scans scales linearly with the state size. Theoretically, the model is BIBO-stable and can emulate any $N$-state FSA with one layer of dimension $N$ and a linear readout of size $N \times N$, significantly improving on all current structured SSM guarantees. Experimentally, the model significantly outperforms a wide collection of modern SSM variants on various FSA state tracking tasks. On multiclass time-series classification, the performance is comparable to that of neural controlled differential equations, a paradigm explicitly built for time-series analysis. Finally, we integrate PD-SSM into a hybrid Transformer-SSM architecture and demonstrate that the model can effectively track the states of a complex FSA in which transitions are encoded as a set of variable-length English sentences. The code is available at https://github.com/IBM/expressive-sparse-state-space-model
URLs: https://github.com/IBM/expressive-sparse-state-space-model
Authors: Seyedmorteza Sadat, Farnood Salehi, Romann M. Weber
Abstract: While diffusion models have made remarkable progress in image generation, their outputs can still appear unrealistic and lack fine details, especially when using fewer number of neural function evaluations (NFEs) or lower guidance scales. To address this issue, we propose a novel momentum-based sampling technique, termed history-guided sampling (HiGS), which enhances quality and efficiency of diffusion sampling by integrating recent model predictions into each inference step. Specifically, HiGS leverages the difference between the current prediction and a weighted average of past predictions to steer the sampling process toward more realistic outputs with better details and structure. Our approach introduces practically no additional computation and integrates seamlessly into existing diffusion frameworks, requiring neither extra training nor fine-tuning. Extensive experiments show that HiGS consistently improves image quality across diverse models and architectures and under varying sampling budgets and guidance scales. Moreover, using a pretrained SiT model, HiGS achieves a new state-of-the-art FID of 1.61 for unguided ImageNet generation at 256$\times$256 with only 30 sampling steps (instead of the standard 250). We thus present HiGS as a plug-and-play enhancement to standard diffusion sampling that enables faster generation with higher fidelity.
Authors: Pierrick Chatillon, Julien Rabin, David Tschumperl\'e
Abstract: This paper addresses the problem of exemplar-based texture synthesis. We introduce NIFTY, a hybrid framework that combines recent insights on diffusion models trained with convolutional neural networks, and classical patch-based texture optimization techniques. NIFTY is a non-parametric flow-matching model built on non-local patch matching, which avoids the need for neural network training while alleviating common shortcomings of patch-based methods, such as poor initialization or visual artifacts. Experimental results demonstrate the effectiveness of the proposed approach compared to representative methods from the literature. Code is available at https://github.com/PierrickCh/Nifty.git
Authors: Anvit Garg, Sohom Bhattacharya, Pragya Sur
Abstract: Model collapse occurs when generative models degrade after repeatedly training on their own synthetic outputs. We study this effect in overparameterized linear regression in a setting where each iteration mixes fresh real labels with synthetic labels drawn from the model fitted in the previous iteration. We derive precise generalization error formulae for minimum-$\ell_2$-norm interpolation and ridge regression under this iterative scheme. Our analysis reveals intriguing properties of the optimal mixing weight that minimizes long-term prediction error and provably prevents model collapse. For instance, in the case of min-$\ell_2$-norm interpolation, we establish that the optimal real-data proportion converges to the reciprocal of the golden ratio for fairly general classes of covariate distributions. Previously, this property was known only for ordinary least squares, and additionally in low dimensions. For ridge regression, we further analyze two popular model classes -- the random-effects model and the spiked covariance model -- demonstrating how spectral geometry governs optimal weighting. In both cases, as well as for isotropic features, we uncover that the optimal mixing ratio should be at least one-half, reflecting the necessity of favoring real-data over synthetic. We validate our theoretical results with extensive simulations.
Authors: Amit Roy, Abulhair Saparov
Abstract: Reasoning capability is essential to ensure the factual correctness of the responses of transformer-based Large Language Models (LLMs), and robust reasoning about transitive relations is instrumental in many settings, such as causal inference. Hence, it is essential to investigate the capability of transformers in the task of inferring transitive relations (e.g., knowing A causes B and B causes C, then A causes C). The task of inferring transitive relations is equivalent to the task of connectivity in directed graphs (e.g., knowing there is a path from A to B, and there is a path from B to C, then there is a path from A to C). Past research focused on whether transformers can learn to infer transitivity from in-context examples provided in the input prompt. However, transformers' capability to infer transitive relations from training examples and how scaling affects the ability is unexplored. In this study, we seek to answer this question by generating directed graphs to train transformer models of varying sizes and evaluate their ability to infer transitive relations for various graph sizes. Our findings suggest that transformers are capable of learning connectivity on "grid-like'' directed graphs where each node can be embedded in a low-dimensional subspace, and connectivity is easily inferable from the embeddings of the nodes. We find that the dimensionality of the underlying grid graph is a strong predictor of transformers' ability to learn the connectivity task, where higher-dimensional grid graphs pose a greater challenge than low-dimensional grid graphs. In addition, we observe that increasing the model scale leads to increasingly better generalization to infer connectivity over grid graphs. However, if the graph is not a grid graph and contains many disconnected components, transformers struggle to learn the connectivity task, especially when the number of components is large.
Authors: Yujin Kim, Changjae Im, Taehyun Kim, Tak Hur, Daniel K. Park
Abstract: Classification using variational quantum circuits is a promising frontier in quantum machine learning. Quantum supervised learning (QSL) applied to classical data using variational quantum circuits involves embedding the data into a quantum Hilbert space and optimizing the circuit parameters to train the measurement process. In this context, the efficacy of QSL is inherently influenced by the selection of quantum embedding. In this study, we introduce a classical-quantum hybrid approach for optimizing quantum embedding beyond the limitations of the standard circuit model of quantum computation (i.e., completely positive and trace-preserving maps) for general multi-channel data. We benchmark the performance of various models in our framework using the CIFAR-10 and Tiny ImageNet datasets and provide theoretical analyses that guide model design and optimization.
Authors: Nikita Kotelevskii, Maiya Goloburda, Vladimir Kondratyev, Alexander Fishkov, Mohsen Guizani, Eric Moulines, Maxim Panov
Abstract: Most uncertainty quantification (UQ) approaches provide a single scalar value as a measure of model reliability. However, different uncertainty measures could provide complementary information on the prediction confidence. Even measures targeting the same type of uncertainty (e.g., ensemble-based and density-based measures of epistemic uncertainty) may capture different failure modes. We take a multidimensional view on UQ by stacking complementary UQ measures into a vector. Such vectors are assigned with Monge-Kantorovich ranks produced by an optimal-transport-based ordering method. The prediction is then deemed more uncertain than the other if it has a higher rank. The resulting VecUQ-OT algorithm uses entropy-regularized optimal transport. The transport map is learned on vectors of scores from in-distribution data and, by design, applies to unseen inputs, including out-of-distribution cases, without retraining. Our framework supports flexible non-additive uncertainty fusion (including aleatoric and epistemic components). It yields a robust ordering for downstream tasks such as selective prediction, misclassification detection, out-of-distribution detection, and selective generation. Across synthetic, image, and text data, VecUQ-OT shows high efficiency even when individual measures fail. The code for the method is available at: https://github.com/stat-ml/multidimensional_uncertainty.
URLs: https://github.com/stat-ml/multidimensional_uncertainty.
Authors: Luca Bergamin, Giovanna Maria Dimitri, Fabio Aiolli
Abstract: Semantic segmentation is a fundamental task in medical image analysis, aiding medical decision-making by helping radiologists distinguish objects in an image. Research in this field has been driven by deep learning applications, which have the potential to scale these systems even in the presence of noise and artifacts. However, these systems are not yet perfected. We argue that performance can be improved by incorporating common medical knowledge into the segmentation model's loss function. To this end, we introduce Logic Tensor Networks (LTNs) to encode medical background knowledge using first-order logic (FOL) rules. The encoded rules span from constraints on the shape of the produced segmentation, to relationships between different segmented areas. We apply LTNs in an end-to-end framework with a SwinUNETR for semantic segmentation. We evaluate our method on the task of segmenting the hippocampus in brain MRI scans. Our experiments show that LTNs improve the baseline segmentation performance, especially when training data is scarce. Despite being in its preliminary stages, we argue that neurosymbolic methods are general enough to be adapted and applied to other medical semantic segmentation tasks.
Authors: Shayne Wadle, Yanxin Zhang, Vikas Singh, Karthikeyan Sankaralingam
Abstract: The evaluation of new microprocessor designs is constrained by slow, cycle-accurate simulators that rely on unrepresentative benchmark traces. This paper introduces a novel deep learning framework for high-fidelity, ``in-the-wild'' simulation on production hardware. Our core contribution is a DL model trained on microarchitecture-independent features to predict cycle-level performance for hypothetical processor designs. This unique approach allows the model to be deployed on existing silicon to evaluate future hardware. We propose a complete system featuring a lightweight hardware trace collector and a principled sampling strategy to minimize user impact. This system achieves a simulation speed of 5 MIPS on a commodity GPU, imposing a mere 0.1% performance overhead. Furthermore, our co-designed Neutrino on-chip accelerator improves performance by 85x over the GPU. We demonstrate that this framework enables accurate performance analysis and large-scale hardware A/B testing on a massive scale using real-world applications.
Authors: Pei Xu, Zhen Wu, Ruocheng Wang, Vishnu Sarukkai, Kayvon Fatahalian, Ioannis Karamouzas, Victor Zordan, C. Karen Liu
Abstract: Learning a control policy for a multi-phase, long-horizon task, such as basketball maneuvers, remains challenging for reinforcement learning approaches due to the need for seamless policy composition and transitions between skills. A long-horizon task typically consists of distinct subtasks with well-defined goals, separated by transitional subtasks with unclear goals but critical to the success of the entire task. Existing methods like the mixture of experts and skill chaining struggle with tasks where individual policies do not share significant commonly explored states or lack well-defined initial and terminal states between different phases. In this paper, we introduce a novel policy integration framework to enable the composition of drastically different motor skills in multi-phase long-horizon tasks with ill-defined intermediate states. Based on that, we further introduce a high-level soft router to enable seamless and robust transitions between the subtasks. We evaluate our framework on a set of fundamental basketball skills and challenging transitions. Policies trained by our approach can effectively control the simulated character to interact with the ball and accomplish the long-horizon task specified by real-time user commands, without relying on ball trajectory references.
Authors: Nikita Kornilov, David Li, Tikhon Mavrin, Aleksei Leonov, Nikita Gushchin, Evgeny Burnaev, Iaroslav Koshelev, Alexander Korotin
Abstract: While achieving exceptional generative quality, modern diffusion, flow, and other matching models suffer from slow inference, as they require many steps of iterative generation. Recent distillation methods address this by training efficient one-step generators under the guidance of a pre-trained teacher model. However, these methods are often constrained to only one specific framework, e.g., only to diffusion or only to flow models. Furthermore, these methods are naturally data-free, and to benefit from the usage of real data, it is required to use an additional complex adversarial training with an extra discriminator model. In this paper, we present RealUID, a universal distillation framework for all matching models that seamlessly incorporates real data into the distillation procedure without GANs. Our RealUID approach offers a simple theoretical foundation that covers previous distillation methods for Flow Matching and Diffusion models, and is also extended to their modifications, such as Bridge Matching and Stochastic Interpolants.
Authors: Alejandro Almod\'ovar, Patricia A. Apell\'aniz, Santiago Zazo, Juan Parras
Abstract: Deep neural networks achieve state-of-the-art performance in estimating heterogeneous treatment effects, but their opacity limits trust and adoption in sensitive domains such as medicine, economics, and public policy. Building on well-established and high-performing causal neural architectures, we propose causalKANs, a framework that transforms neural estimators of conditional average treatment effects (CATEs) into Kolmogorov--Arnold Networks (KANs). By incorporating pruning and symbolic simplification, causalKANs yields interpretable closed-form formulas while preserving predictive accuracy. Experiments on benchmark datasets demonstrate that causalKANs perform on par with neural baselines in CATE error metrics, and that even simple KAN variants achieve competitive performance, offering a favorable accuracy--interpretability trade-off. By combining reliability with analytic accessibility, causalKANs provide auditable estimators supported by closed-form expressions and interpretable plots, enabling trustworthy individualized decision-making in high-stakes settings. We release the code for reproducibility at https://github.com/aalmodovares/causalkans .
Authors: Antreas Ioannou, Andreas Shiamishis, Nora Hollenstein, Nezihe Merve G\"urel
Abstract: In an era dominated by Large Language Models (LLMs), understanding their capabilities and limitations, especially in high-stakes fields like law, is crucial. While LLMs such as Meta's LLaMA, OpenAI's ChatGPT, Google's Gemini, DeepSeek, and other emerging models are increasingly integrated into legal workflows, their performance in multilingual, jurisdictionally diverse, and adversarial contexts remains insufficiently explored. This work evaluates LLaMA and Gemini on multilingual legal and non-legal benchmarks, and assesses their adversarial robustness in legal tasks through character and word-level perturbations. We use an LLM-as-a-Judge approach for human-aligned evaluation. We moreover present an open-source, modular evaluation pipeline designed to support multilingual, task-diverse benchmarking of any combination of LLMs and datasets, with a particular focus on legal tasks, including classification, summarization, open questions, and general reasoning. Our findings confirm that legal tasks pose significant challenges for LLMs with accuracies often below 50% on legal reasoning benchmarks such as LEXam, compared to over 70% on general-purpose tasks like XNLI. In addition, while English generally yields more stable results, it does not always lead to higher accuracy. Prompt sensitivity and adversarial vulnerability is also shown to persist across languages. Finally, a correlation is found between the performance of a language and its syntactic similarity to English. We also observe that LLaMA is weaker than Gemini, with the latter showing an average advantage of about 24 percentage points across the same task. Despite improvements in newer LLMs, challenges remain in deploying them reliably for critical, multilingual legal applications.
Authors: Jinyeop Song, Jeff Gore, Max Kleiman-Weiner
Abstract: As language model (LM) agents become more capable and gain broader access to real-world tools, there is a growing need for scalable evaluation frameworks of agentic capability. However, conventional benchmark-centric evaluations are costly to design and require human designers to come up with valid tasks that translate into insights about general model capabilities. In this work, we propose information-theoretic evaluation based on empowerment, the mutual information between an agent's actions and future states, as an open-ended method for evaluating LM agents. We introduce EELMA (Estimating Empowerment of Language Model Agents), an algorithm for approximating effective empowerment from multi-turn text interactions. We validate EELMA on both language games and scaled-up realistic web-browsing scenarios. We find that empowerment strongly correlates with average task performance, characterize the impact of environmental complexity and agentic factors such as chain-of-thought, model scale, and memory length on estimated empowerment, and that high empowerment states and actions are often pivotal moments for general capabilities. Together, these results demonstrate empowerment as an appealing general-purpose metric for evaluating and monitoring LM agents in complex, open-ended settings.
Authors: Idan Kashani, Avi Mendelson, Yaniv Nemcovsky
Abstract: Large language models (LLMs) achieve impressive results over various tasks, and ever-expanding public repositories contain an abundance of pre-trained models. Therefore, identifying the best-performing LLM for a given task is a significant challenge. Previous works have suggested learning LLM representations to address this. However, these approaches present limited scalability and require costly retraining to encompass additional models and datasets. Moreover, the produced representation utilizes distinct spaces that cannot be easily interpreted. This work presents an efficient, training-free approach to representing LLMs as linear operators within the prompts' semantic task space, thus providing a highly interpretable representation of the models' application. Our method utilizes closed-form computation of geometrical properties and ensures exceptional scalability and real-time adaptability to dynamically expanding repositories. We demonstrate our approach on success prediction and model selection tasks, achieving competitive or state-of-the-art results with notable performance in out-of-sample scenarios.
Authors: Rakesh Thakur, Shivaansh Kaushik, Gauri Chopra, Harsh Rohilla
Abstract: This paper introduces TrueGradeAI, an AI-driven digital examination framework designed to overcome the shortcomings of traditional paper-based assessments, including excessive paper usage, logistical complexity, grading delays, and evaluator bias. The system preserves natural handwriting by capturing stylus input on secure tablets and applying transformer-based optical character recognition for transcription. Evaluation is conducted through a retrieval-augmented pipeline that integrates faculty solutions, cache layers, and external references, enabling a large language model to assign scores with explicit, evidence-linked reasoning. Unlike prior tablet-based exam systems that primarily digitize responses, TrueGradeAI advances the field by incorporating explainable automation, bias mitigation, and auditable grading trails. By uniting handwriting preservation with scalable and transparent evaluation, the framework reduces environmental costs, accelerates feedback cycles, and progressively builds a reusable knowledge base, while actively working to mitigate grading bias and ensure fairness in assessment.
Authors: Bo Li, Guanzhi Deng, Ronghao Chen, Junrong Yue, Shuo Zhang, Qinghua Zhao, Linqi Song, Lijie Wen
Abstract: Understanding how Large Language Models (LLMs) perform complex reasoning and their failure mechanisms is a challenge in interpretability research. To provide a measurable geometric analysis perspective, we define the concept of the Reasoning Manifold, a latent low-dimensional geometric structure formed by the internal representations corresponding to all correctly reasoned generations. This structure can be conceptualized as the embodiment of the effective thinking paths that the model has learned to successfully solve a given task. Based on this concept, we build REMA, a framework that explains the origins of failures by quantitatively comparing the spatial relationships of internal model representations corresponding to both erroneous and correct reasoning samples. Specifically, REMA first quantifies the geometric deviation of each erroneous representation by calculating its k-nearest neighbors distance to the approximated manifold formed by correct representations, thereby providing a unified failure signal. It then localizes the divergence points where these deviations first become significant by tracking this deviation metric across the model's layers and comparing it against a baseline of internal fluctuations from correct representations, thus identifying where the reasoning chain begins to go off-track. Our extensive experiments on diverse language and multimodal models and tasks demonstrate the low-dimensional nature of the reasoning manifold and the high separability between erroneous and correct reasoning representations. The results also validate the effectiveness of the REMA framework in analyzing the origins of reasoning failures. This research connects abstract reasoning failures to measurable geometric deviations in representations, providing new avenues for in-depth understanding and diagnosis of the internal computational processes of black-box models.
Authors: Mingyi Zheng, Hongyu Jiang, Yizhou Lu, Jiaye Teng
Abstract: Conformal Prediction (CP) is a distribution-free framework for constructing statistically rigorous prediction sets. While popular variants such as CD-split improve CP's efficiency, they often yield prediction sets composed of multiple disconnected subintervals, which are difficult to interpret. In this paper, we propose SCD-split, which incorporates smoothing operations into the CP framework. Such smoothing operations potentially help merge the subintervals, thus leading to interpretable prediction sets. Experimental results on both synthetic and real-world datasets demonstrate that SCD-split balances the interval length and the number of disconnected subintervals. Theoretically, under specific conditions, SCD-split provably reduces the number of disconnected subintervals while maintaining comparable coverage guarantees and interval length compared with CD-split.
Authors: Yonghan Jung
Abstract: In observational settings where treatment and outcome share unmeasured confounders but an observed mediator remains unconfounded, the front-door (FD) adjustment identifies causal effects through the mediator. We study the heterogeneous treatment effect (HTE) under FD identification and introduce two debiased learners: FD-DR-Learner and FD-R-Learner. Both attain fast, quasi-oracle rates (i.e., performance comparable to an oracle that knows the nuisances) even when nuisance functions converge as slowly as n^-1/4. We provide error analyses establishing debiasedness and demonstrate robust empirical performance in synthetic studies and a real-world case study of primary seat-belt laws using Fatality Analysis Reporting System (FARS) dataset. Together, these results indicate that the proposed learners deliver reliable and sample-efficient HTE estimates in FD scenarios. The implementation is available at https://github.com/yonghanjung/FD-CATE. Keywords: Front-door adjustment; Heterogeneous treatment effects; Debiased learning; Quasi-oracle rates; Causal inference.
Authors: Mario G\'omez, Guanqun Ma, Tom Needham, Bei Wang
Abstract: We introduce a general framework for analyzing data modeled as parameterized families of networks. Building on a Gromov-Wasserstein variant of optimal transport, we define a family of parameterized Gromov-Wasserstein distances for comparing such parametric data, including time-varying metric spaces induced by collective motion, temporally evolving weighted social networks, and random graph models. We establish foundational properties of these distances, showing that they subsume several existing metrics in the literature, and derive theoretical approximation guarantees. In particular, we develop computationally tractable lower bounds and relate them to graph statistics commonly used in random graph theory. Furthermore, we prove that our distances can be consistently approximated in random graph and random metric space settings via empirical estimates from generative models. Finally, we demonstrate the practical utility of our framework through a series of numerical experiments.
Authors: Xiaocheng Zou, Shijin Duan, Charles Fleming, Gaowen Liu, Ramana Rao Kompella, Shaolei Ren, Xiaolin Xu
Abstract: Quantum generative models based on instantaneous quantum polynomial (IQP) circuits show great promise in learning complex distributions while maintaining classical trainability. However, current implementations suffer from two key limitations: lack of controllability over generated outputs and severe generation bias towards certain expected patterns. We present a Controllable Quantum Generative Framework, ConQuER, which addresses both challenges through a modular circuit architecture. ConQuER embeds a lightweight controller circuit that can be directly combined with pre-trained IQP circuits to precisely control the output distribution without full retraining. Leveraging the advantages of IQP, our scheme enables precise control over properties such as the Hamming Weight distribution with minimal parameter and gate overhead. In addition, inspired by the controller design, we extend this modular approach through data-driven optimization to embed implicit control paths in the underlying IQP architecture, significantly reducing generation bias on structured datasets. ConQuER retains efficient classical training properties and high scalability. We experimentally validate ConQuER on multiple quantum state datasets, demonstrating its superior control accuracy and balanced generation performance, only with very low overhead cost over original IQP circuits. Our framework bridges the gap between the advantages of quantum computing and the practical needs of controllable generation modeling.
Authors: Hao Chen, Lin Liu, Yu Guang Wang
Abstract: Causal representation learning (CRL) has garnered increasing interests from the causal inference and artificial intelligence community, due to its capability of disentangling potentially complex data-generating mechanism into causally interpretable latent features, by leveraging the heterogeneity of modern datasets. In this paper, we further contribute to the CRL literature, by focusing on the stylized linear structural causal model over the latent features and assuming a linear mixing function that maps latent features to the observed data or measurements. Existing linear CRL methods often rely on stringent assumptions, such as accessibility to single-node interventional data or restrictive distributional constraints on latent features and exogenous measurement noise. However, these prerequisites can be challenging to satisfy in certain scenarios. In this work, we propose a novel linear CRL algorithm that, unlike most existing linear CRL methods, operates under weaker assumptions about environment heterogeneity and data-generating distributions while still recovering latent causal features up to an equivalence class. We further validate our new algorithm via synthetic experiments and an interpretability analysis of large language models (LLMs), demonstrating both its superiority over competing methods in finite samples and its potential in integrating causality into AI.
Authors: Simone Di Gregorio, Paul D\"utting, Federico Fusco, Chris Schwiegelshohn
Abstract: Bilateral trade models the task of intermediating between two strategic agents, a seller and a buyer, willing to trade a good for which they hold private valuations. We study this problem from the perspective of a broker, in a regret minimization framework. At each time step, a new seller and buyer arrive, and the broker has to propose a mechanism that is incentive-compatible and individually rational, with the goal of maximizing profit. We propose a learning algorithm that guarantees a nearly tight $\tilde{O}(\sqrt{T})$ regret in the stochastic setting when seller and buyer valuations are drawn i.i.d. from a fixed and possibly correlated unknown distribution. We further show that it is impossible to achieve sublinear regret in the non-stationary scenario where valuations are generated upfront by an adversary. Our ambitious benchmark for these results is the best incentive-compatible and individually rational mechanism. This separates us from previous works on efficiency maximization in bilateral trade, where the benchmark is a single number: the best fixed price in hindsight. A particular challenge we face is that uniform convergence for all mechanisms' profits is impossible. We overcome this difficulty via a careful chaining analysis that proves convergence for a provably near-optimal mechanism at (essentially) optimal rate. We further showcase the broader applicability of our techniques by providing nearly optimal results for the joint ads problem.
Authors: Yixuan Han, Fan Ma, Ruijie Quan, Yi Yang
Abstract: Test-Time Scaling (TTS) enhances the reasoning ability of large language models (LLMs) by allocating additional computation during inference. However, existing approaches primarily rely on output-level sampling while overlooking the role of model architecture. In mainstream Mixture-of-Experts (MoE) LLMs, we observe that varying the number of activated experts yields complementary solution sets with stable accuracy, revealing a new and underexplored source of diversity. Motivated by this observation, we propose Dynamic Experts Search (DES), a TTS strategy that elevates expert activation into a controllable dimension of the search space. DES integrates two key components: (1) Dynamic MoE, which enables direct control of expert counts during inference to generate diverse reasoning trajectories without additional cost; and (2) Expert Configuration Inheritance, which preserves consistent expert counts within a reasoning path while varying them across runs, thereby balancing stability and diversity throughout the search. Extensive experiments across MoE architectures, verifiers and reasoning benchmarks (i.e., math, code and knowledge) demonstrate that DES reliably outperforms TTS baselines, enhancing accuracy and stability without additional cost. These results highlight DES as a practical and scalable form of architecture-aware TTS, illustrating how structural flexibility in modern LLMs can advance reasoning.
Authors: Qixin Zhang, Yan Sun, Can Jin, Xikun Zhang, Yao Shu, Puning Zhao, Li Shen, Dacheng Tao
Abstract: In this paper, we present two effective policy learning algorithms for multi-agent online coordination(MA-OC) problem. The first one, \texttt{MA-SPL}, not only can achieve the optimal $(1-\frac{c}{e})$-approximation guarantee for the MA-OC problem with submodular objectives but also can handle the unexplored $\alpha$-weakly DR-submodular and $(\gamma,\beta)$-weakly submodular scenarios, where $c$ is the curvature of the investigated submodular functions, $\alpha$ denotes the diminishing-return(DR) ratio and the tuple $(\gamma,\beta)$ represents the submodularity ratios. Subsequently, in order to reduce the reliance on the unknown parameters $\alpha,\gamma,\beta$ inherent in the \texttt{MA-SPL} algorithm, we further introduce the second online algorithm named \texttt{MA-MPL}. This \texttt{MA-MPL} algorithm is entirely \emph{parameter-free} and simultaneously can maintain the same approximation ratio as the first \texttt{MA-SPL} algorithm. The core of our \texttt{MA-SPL} and \texttt{MA-MPL} algorithms is a novel continuous-relaxation technique termed as \emph{policy-based continuous extension}. Compared with the well-established \emph{multi-linear extension}, a notable advantage of this new \emph{policy-based continuous extension} is its ability to provide a lossless rounding scheme for any set function, thereby enabling us to tackle the challenging weakly submodular objectives. Finally, extensive simulations are conducted to validate the effectiveness of our proposed algorithms.
Authors: Katsuhiko Hayashi, Hidetaka Kamigaito
Abstract: We prove that all standard subregular language classes are linearly separable when represented by their deciding predicates. This establishes finite observability and guarantees learnability with simple linear models. Synthetic experiments confirm perfect separability under noise-free conditions, while real-data experiments on English morphology show that learned features align with well-known linguistic constraints. These results demonstrate that the subregular hierarchy provides a rigorous and interpretable foundation for modeling natural language structure. Our code used in real-data experiments is available at https://github.com/UTokyo-HayashiLab/subregular.
Authors: Siwei Wang, Yifei Shen, Haoran Sun, Shi Feng, Shang-Hua Teng, Li Dong, Yaru Hao, Wei Chen
Abstract: Recent reinforcement learning (RL) methods have substantially enhanced the planning capabilities of Large Language Models (LLMs), yet the theoretical basis for their effectiveness remains elusive. In this work, we investigate RL's benefits and limitations through a tractable graph-based abstraction, focusing on policy gradient (PG) and Q-learning methods. Our theoretical analyses reveal that supervised fine-tuning (SFT) may introduce co-occurrence-based spurious solutions, whereas RL achieves correct planning primarily through exploration, underscoring exploration's role in enabling better generalization. However, we also show that PG suffers from diversity collapse, where output diversity decreases during training and persists even after perfect accuracy is attained. By contrast, Q-learning provides two key advantages: off-policy learning and diversity preservation at convergence. We further demonstrate that careful reward design is necessary to prevent reward hacking in Q-learning. Finally, applying our framework to the real-world planning benchmark Blocksworld, we confirm that these behaviors manifest in practice.
Authors: Ziyu Liu, Yuhang Zang, Shengyuan Ding, Yuhang Cao, Xiaoyi Dong, Haodong Duan, Dahua Lin, Jiaqi Wang
Abstract: Recent Large Language Models (LLMs) and Large Vision-Language Models (LVLMs) increasingly use Reinforcement Learning (RL) for post-pretraining, such as RL with Verifiable Rewards (RLVR) for objective tasks and RL from Human Feedback (RLHF) for subjective tasks. However, RLHF incurs high costs and potential reward-policy mismatch due to reliance on human preferences, while RLVR still wastes supervision by discarding rollouts and correctness signals after each update. To address these challenges, we introduce the Synergistic Policy And Reward Co-Evolving Framework (SPARK), an efficient, on-policy, and stable method that builds on RLVR. Instead of discarding rollouts and correctness data, SPARK recycles this valuable information to simultaneously train the model itself as a generative reward model. This auxiliary training uses a mix of objectives, such as pointwise reward score, pairwise comparison, and evaluation conditioned on further-reflection responses, to teach the model to evaluate and improve its own responses. Our process eliminates the need for a separate reward model and costly human preference data. SPARK creates a positive co-evolving feedback loop: improved reward accuracy yields better policy gradients, which in turn produce higher-quality rollouts that further refine the reward model. Our unified framework supports test-time scaling via self-reflection without external reward models and their associated costs. We show that SPARK achieves significant performance gains on multiple LLM and LVLM models and multiple reasoning, reward models, and general benchmarks. For example, SPARK-VL-7B achieves an average 9.7% gain on 7 reasoning benchmarks, 12.1% on 2 reward benchmarks, and 1.5% on 8 general benchmarks over the baselines, demonstrating robustness and broad generalization.
Authors: Xingyu Shen, Yingfa Chen, Zhen Leng Thai, Xu Han, Zhiyuan Liu, Maosong Sun
Abstract: While Transformer-based models have demonstrated remarkable language modeling performance, their high complexities result in high costs when processing long contexts. In contrast, recurrent neural networks (RNNs) such as linear attention and state space models have gained popularity due to their constant per-token complexities. However, these recurrent models struggle with tasks that require accurate recall of contextual information from long contexts, because all contextual information is compressed into a constant-size recurrent state. Previous works have shown that recall ability is positively correlated with the recurrent state size, yet directly training RNNs with larger recurrent states results in high training costs. In this paper, we introduce StateX, a training pipeline for efficiently expanding the states of pre-trained RNNs through post-training. For two popular classes of RNNs, linear attention and state space models, we design post-training architectural modifications to scale up the state size with no or negligible increase in model parameters. Experiments on models up to 1.3B parameters demonstrate that StateX efficiently enhances the recall and in-context learning ability of RNNs without incurring high post-training costs or compromising other capabilities.
Authors: Gen Li, Yuling Yan
Abstract: Reinforcement learning with human feedback (RLHF), which learns a reward model from human preference data and then optimizes a policy to favor preferred responses, has emerged as a central paradigm for aligning large language models (LLMs) with human preferences. In this paper, we investigate exploration principles for online RLHF, where one seeks to adaptively collect new preference data to refine both the reward model and the policy in a data-efficient manner. By examining existing optimism-based exploration algorithms, we identify a drawback in their sampling protocol: they tend to gather comparisons that fail to reduce the most informative uncertainties in reward differences, and we prove lower bounds showing that such methods can incur linear regret over exponentially long horizons. Motivated by this insight, we propose a new exploration scheme that directs preference queries toward reducing uncertainty in reward differences most relevant to policy improvement. Under a multi-armed bandit model of RLHF, we establish regret bounds of order $T^{(\beta+1)/(\beta+2)}$, where $\beta>0$ is a hyperparameter that balances reward maximization against mitigating distribution shift. To our knowledge, this is the first online RLHF algorithm with regret scaling polynomially in all model parameters.
Authors: Luc Boudier, Loris Manganelli, Eleftherios Tsonis, Nicolas Dufour, Vicky Kalogeiton
Abstract: Few-shot image classification remains challenging due to the limited availability of labeled examples. Recent approaches have explored generating synthetic training data using text-to-image diffusion models, but often require extensive model fine-tuning or external information sources. We present a novel training-free approach, called DIPSY, that leverages IP-Adapter for image-to-image translation to generate highly discriminative synthetic images using only the available few-shot examples. DIPSY introduces three key innovations: (1) an extended classifier-free guidance scheme that enables independent control over positive and negative image conditioning; (2) a class similarity-based sampling strategy that identifies effective contrastive examples; and (3) a simple yet effective pipeline that requires no model fine-tuning or external captioning and filtering. Experiments across ten benchmark datasets demonstrate that our approach achieves state-of-the-art or comparable performance, while eliminating the need for generative model adaptation or reliance on external tools for caption generation and image filtering. Our results highlight the effectiveness of leveraging dual image prompting with positive-negative guidance for generating class-discriminative features, particularly for fine-grained classification tasks.
Authors: Amandeep Kumar, Nithin Gopalakrishnan Nair, Vishal M. Patel
Abstract: Autoregressive (AR) transformers have emerged as a powerful paradigm for visual generation, largely due to their scalability, computational efficiency and unified architecture with language and vision. Among them, next scale prediction Visual Autoregressive Generation (VAR) has recently demonstrated remarkable performance, even surpassing diffusion-based models. In this work, we revisit VAR and uncover a theoretical insight: when equipped with a Markovian attention mask, VAR is mathematically equivalent to a discrete diffusion. We term this reinterpretation as Scalable Visual Refinement with Discrete Diffusion (SRDD), establishing a principled bridge between AR transformers and diffusion models. Leveraging this new perspective, we show how one can directly import the advantages of diffusion such as iterative refinement and reduce architectural inefficiencies into VAR, yielding faster convergence, lower inference cost, and improved zero-shot reconstruction. Across multiple datasets, we show that the diffusion based perspective of VAR leads to consistent gains in efficiency and generation.
Authors: Xiangxin Zhou, Zichen Liu, Haonan Wang, Chao Du, Min Lin, Chongxuan Li, Liang Wang, Tianyu Pang
Abstract: We introduce a variational reasoning framework for language models that treats thinking traces as latent variables and optimizes them through variational inference. Starting from the evidence lower bound (ELBO), we extend it to a multi-trace objective for tighter bounds and propose a forward-KL formulation that stabilizes the training of the variational posterior. We further show that rejection sampling finetuning and binary-reward RL, including GRPO, can be interpreted as local forward-KL objectives, where an implicit weighting by model accuracy naturally arises from the derivation and reveals a previously unnoticed bias toward easier questions. We empirically validate our method on the Qwen 2.5 and Qwen 3 model families across a wide range of reasoning tasks. Overall, our work provides a principled probabilistic perspective that unifies variational inference with RL-style methods and yields stable objectives for improving the reasoning ability of language models. Our code is available at https://github.com/sail-sg/variational-reasoning.
Authors: Renjie Luo, Zichen Liu, Xiangyan Liu, Chao Du, Min Lin, Wenhu Chen, Wei Lu, Tianyu Pang
Abstract: LLMs are often trained with RL from human or AI feedback, yet such methods typically compress nuanced feedback into scalar rewards, discarding much of their richness and inducing scale imbalance. We propose treating verbal feedback as a conditioning signal. Inspired by language priors in text-to-image generation, which enable novel outputs from unseen prompts, we introduce the feedback-conditional policy (FCP). FCP learns directly from response-feedback pairs, approximating the feedback-conditional posterior through maximum likelihood training on offline data. We further develop an online bootstrapping stage where the policy generates under positive conditions and receives fresh feedback to refine itself. This reframes feedback-driven learning as conditional generation rather than reward optimization, offering a more expressive way for LLMs to directly learn from verbal feedback. Our code is available at https://github.com/sail-sg/feedback-conditional-policy.
URLs: https://github.com/sail-sg/feedback-conditional-policy.
Authors: Chih Yao Hu, Yang-Sen Lin, Yuna Lee, Chih-Hai Su, Jie-Ying Lee, Shr-Ruei Tsai, Chin-Yang Lin, Kuan-Wen Chen, Tsung-Wei Ke, Yu-Lun Liu
Abstract: We present See, Point, Fly (SPF), a training-free aerial vision-and-language navigation (AVLN) framework built atop vision-language models (VLMs). SPF is capable of navigating to any goal based on any type of free-form instructions in any kind of environment. In contrast to existing VLM-based approaches that treat action prediction as a text generation task, our key insight is to consider action prediction for AVLN as a 2D spatial grounding task. SPF harnesses VLMs to decompose vague language instructions into iterative annotation of 2D waypoints on the input image. Along with the predicted traveling distance, SPF transforms predicted 2D waypoints into 3D displacement vectors as action commands for UAVs. Moreover, SPF also adaptively adjusts the traveling distance to facilitate more efficient navigation. Notably, SPF performs navigation in a closed-loop control manner, enabling UAVs to follow dynamic targets in dynamic environments. SPF sets a new state of the art in DRL simulation benchmark, outperforming the previous best method by an absolute margin of 63%. In extensive real-world evaluations, SPF outperforms strong baselines by a large margin. We also conduct comprehensive ablation studies to highlight the effectiveness of our design choice. Lastly, SPF shows remarkable generalization to different VLMs. Project page: https://spf-web.pages.dev
Authors: Wenjian Hao, Zehui Lu, Zihao Liang, Tianyu Zhou, Shaoshuai Mou
Abstract: This paper develops a policy learning method for tuning a pre-trained policy to adapt to additional tasks without altering the original task. A method named Adaptive Policy Gradient (APG) is proposed in this paper, which combines Bellman's principle of optimality with the policy gradient approach to improve the convergence rate. This paper provides theoretical analysis which guarantees the convergence rate and sample complexity of $\mathcal{O}(1/T)$ and $\mathcal{O}(1/\epsilon)$, respectively, where $T$ denotes the number of iterations and $\epsilon$ denotes the accuracy of the resulting stationary policy. Furthermore, several challenging numerical simulations, including cartpole, lunar lander, and robot arm, are provided to show that APG obtains similar performance compared to existing deterministic policy gradient methods while utilizing much less data and converging at a faster rate.
Authors: Lucas Berry, David Meger
Abstract: This work introduces an efficient novel approach for epistemic uncertainty estimation for ensemble models for regression tasks using pairwise-distance estimators (PaiDEs). Utilizing the pairwise-distance between model components, these estimators establish bounds on entropy. We leverage this capability to enhance the performance of Bayesian Active Learning by Disagreement (BALD). Notably, unlike sample-based Monte Carlo estimators, PaiDEs exhibit a remarkable capability to estimate epistemic uncertainty at speeds up to 100 times faster while covering a significantly larger number of inputs at once and demonstrating superior performance in higher dimensions. To validate our approach, we conducted a varied series of regression experiments on commonly used benchmarks: 1D sinusoidal data, $\textit{Pendulum}$, $\textit{Hopper}$, $\textit{Ant}$ and $\textit{Humanoid}$. For each experimental setting, an active learning framework was applied to demonstrate the advantages of PaiDEs for epistemic uncertainty estimation. We compare our approach to existing active learning methods and find that our approach outperforms on high-dimensional regression tasks.
Authors: Zhizun Wang, David Meger
Abstract: In this paper, we propose a novel model-based multi-agent reinforcement learning approach named Value Decomposition Framework with Disentangled World Model to address the challenge of achieving a common goal of multiple agents interacting in the same environment with reduced sample complexity. Due to scalability and non-stationarity problems posed by multi-agent systems, model-free methods rely on a considerable number of samples for training. In contrast, we use a modularized world model, composed of action-conditioned, action-free, and static branches, to unravel the complicated environment dynamics. Our model produces imagined outcomes based on past experience, without sampling directly from the real environment. We employ variational auto-encoders and variational graph auto-encoders to learn the latent representations for the world model, which is merged with a value-based framework to predict the joint action-value function and optimize the overall training objective. Experimental results on StarCraft II micro-management, Multi-Agent MuJoCo, and Level-Based Foraging challenges demonstrate that our method achieves high sample efficiency and exhibits superior performance compared to other baselines across a wide range of multi-agent learning tasks.
Authors: Ole-Christian Galbo Engstr{\o}m, Martin Holm Jensen
Abstract: We present algorithms that substantially accelerate partition-based cross-validation for machine learning models that require matrix products $\mathbf{X}^\mathbf{T}\mathbf{X}$ and $\mathbf{X}^\mathbf{T}\mathbf{Y}$. Our algorithms have applications in model selection for, for example, principal component analysis (PCA), principal component regression (PCR), ridge regression (RR), ordinary least squares (OLS), and partial least squares (PLS). Our algorithms support all combinations of column-wise centering and scaling of $\mathbf{X}$ and $\mathbf{Y}$, and we demonstrate in our accompanying implementation that this adds only a manageable, practical constant over efficient variants without preprocessing. We prove the correctness of our algorithms under a fold-based partitioning scheme and show that the running time is independent of the number of folds; that is, they have the same time complexity as that of computing $\mathbf{X}^\mathbf{T}\mathbf{X}$ and $\mathbf{X}^\mathbf{T}\mathbf{Y}$ and space complexity equivalent to storing $\mathbf{X}$, $\mathbf{Y}$, $\mathbf{X}^\mathbf{T}\mathbf{X}$, and $\mathbf{X}^\mathbf{T}\mathbf{Y}$. Importantly, unlike alternatives found in the literature, we avoid data leakage due to preprocessing. We achieve these results by eliminating redundant computations in the overlap between training partitions. Concretely, we show how to manipulate $\mathbf{X}^\mathbf{T}\mathbf{X}$ and $\mathbf{X}^\mathbf{T}\mathbf{Y}$ using only samples from the validation partition to obtain the preprocessed training partition-wise $\mathbf{X}^\mathbf{T}\mathbf{X}$ and $\mathbf{X}^\mathbf{T}\mathbf{Y}$. To our knowledge, we are the first to derive correct and efficient cross-validation algorithms for any of the $16$ combinations of column-wise centering and scaling, for which we also prove only $12$ give distinct matrix products.
Authors: Matt Y Cheung, Tucker J Netherton, Laurence E Court, Ashok Veeraraghavan, Guha Balakrishnan
Abstract: Modern deep learning reconstruction algorithms generate impressively realistic scans from sparse inputs, but can often produce significant inaccuracies. This makes it difficult to provide statistically guaranteed claims about the true state of a subject from scans reconstructed by these algorithms. In this study, we propose a framework for computing provably valid prediction bounds on claims derived from probabilistic black-box image reconstruction algorithms. The key insights behind our framework are to represent reconstructed scans with a derived clinical metric of interest, and to calibrate bounds on the ground truth metric with conformal prediction (CP) using a prior calibration dataset. These bounds convey interpretable feedback about the subject's state, and can also be used to retrieve nearest-neighbor reconstructed scans for visual inspection. We demonstrate the utility of this framework on sparse-view computed tomography (CT) for fat mass quantification and radiotherapy planning tasks. Results show that our framework produces bounds with better semantical interpretation than conventional pixel-based bounding approaches. Furthermore, we can flag dangerous outlier reconstructions that look plausible but have statistically unlikely metric values.
Authors: Natalie S. Frank
Abstract: We propose a new notion of uniqueness for the adversarial Bayes classifier in the setting of binary classification. Analyzing this concept produces a simple procedure for computing all adversarial Bayes classifiers for a well-motivated family of one dimensional data distributions. This characterization is then leveraged to show that as the perturbation radius increases, certain notions of regularity for the adversarial Bayes classifiers improve. Furthermore, these results provide tools for understanding relationships between the Bayes and adversarial Bayes classifiers in one dimension.
Authors: Shengyu Tao
Abstract: The sustainable utilization of lithium-ion batteries (LIBs) is crucial to the global energy transition and carbon neutrality, yet data scarcity and heterogeneity remain major barriers across remanufacturing, reusing, and recycling. This dissertation develops a machine learning assisted framework to address these challenges throughout the battery lifecycle. A physics informed quality control model predicts long-term degradation from limited early-cycle data, while a generative learning based residual value assessment method enables rapid and accurate evaluation of retired batteries under random conditions. A federated learning strategy achieves privacy preserving and high precision cathode material sorting, supporting efficient recycling. Furthermore, a unified diagnostics and prognostics framework based on correlation alignment enhances adaptability across tasks such as state of health estimation, state of charge estimation, and remaining useful life prediction under varied testing protocols. Collectively, these contributions advance sustainable battery management by integrating physics, data generation, privacy preserving collaboration, and adaptive learning, offering methodological innovations to promote circular economy and global carbon neutrality.
Authors: Noga Bar, Raja Giryes
Abstract: Large annotated datasets are crucial for the success of deep neural networks, but labeling data can be prohibitively expensive in domains such as medical imaging. This work tackles the subset selection problem: selecting a small set of the most informative examples from a large unlabeled pool for annotation. We propose a simple and effective method that combines feature norms, randomization, and orthogonality (via the Gram-Schmidt process) to select diverse and informative samples. Feature norms serve as a proxy for informativeness, while randomization and orthogonalization reduce redundancy and encourage coverage of the feature space. Extensive experiments on image and text benchmarks, including CIFAR-10/100, Tiny ImageNet, ImageNet, OrganAMNIST, and Yelp, show that our method consistently improves subset selection performance, both as a standalone approach and when integrated with existing techniques.
Authors: Ahmad Rashid, Ruotian Wu, Julia Grosse, Agustinus Kristiadi, Pascal Poupart
Abstract: Large language models (LLMs) can be improved by aligning with human preferences through fine-tuning -- the so-called reinforcement learning from human feedback (RLHF). However, the cost of fine-tuning an LLM is prohibitive for many users. Due to their ability to bypass LLM fine-tuning, prediction-time tokenwise reward-guided text generation (RGTG) methods have recently been proposed. They use a reward model trained on full sequences to score partial sequences during decoding in a bid to steer the generation towards sequences with high rewards. However, these methods have so far been only heuristically motivated and poorly analyzed. In this work, we show that reward models trained on full sequences are not compatible with scoring partial sequences. To alleviate this, we propose to train a Bradley-Terry reward model on partial sequences explicitly, and autoregressively sample from the implied tokenwise policy during decoding. We study the properties of this reward model and the resulting policy: we show that this policy is proportional to the ratio of two distinct RLHF policies. Our simple approach outperforms previous RGTG methods and performs similarly to strong offline baselines without large-scale LLM fine-tuning. Code for our work is available at https://github.com/ahmadrash/PARGS
Authors: Faried Abu Zaid, Daniel Neider, Mustafa Yal\c{c}{\i}ner
Abstract: Formal verification has emerged as a promising method to ensure the safety and reliability of neural networks. However, many relevant properties, such as fairness or global robustness, pertain to the entire input space. If one applies verification techniques naively, the neural network is checked even on inputs that do not occur in the real world and have no meaning. To tackle this shortcoming, we propose the VeriFlow architecture as a flow-based density model tailored to allow any verification approach to restrict its search to some data distribution of interest. We argue that our architecture is particularly well suited for this purpose because of two major properties. First, we show that the transformation that is defined by our model is piecewise affine. Therefore, the model allows the usage of verifiers based on constraint solving with linear arithmetic. Second, upper density level sets (UDL) of the data distribution are definable via linear constraints in the latent space. As a consequence, representations of UDLs specified by a given probability are effectively computable in the latent space. This property allows for effective verification with a fine-grained, probabilistically interpretable control of how a-typical the inputs subject to verification are.
Authors: Mohammadreza Ghaffarzadeh-Esfahani, Mahdi Ghaffarzadeh-Esfahani, Arian Salahi-Niri, Hossein Toreyhi, Zahra Atf, Amirali Mohsenzadeh-Kermani, Mahshad Sarikhani, Zohreh Tajabadi, Fatemeh Shojaeian, Mohammad Hassan Bagheri, Aydin Feyzi, Mohammadamin Tarighatpayma, Narges Gazmeh, Fateme Heydari, Hossein Afshar, Amirreza Allahgholipour, Farid Alimardani, Ameneh Salehi, Naghmeh Asadimanesh, Mohammad Amin Khalafi, Hadis Shabanipour, Ali Moradi, Sajjad Hossein Zadeh, Omid Yazdani, Romina Esbati, Moozhan Maleki, Danial Samiei Nasr, Amirali Soheili, Hossein Majlesi, Saba Shahsavan, Alireza Soheilipour, Nooshin Goudarzi, Erfan Taherifard, Hamidreza Hatamabadi, Jamil S Samaan, Thomas Savage, Ankit Sakhuja, Ali Soroush, Girish Nadkarni, Ilad Alavi Darazam, Mohamad Amin Pourhoseingholi, Seyed Amir Ahmad Safavi-Naini
Abstract: This study compared the performance of classical feature-based machine learning models (CMLs) and large language models (LLMs) in predicting COVID-19 mortality using high-dimensional tabular data from 9,134 patients across four hospitals. Seven CML models, including XGBoost and random forest (RF), were evaluated alongside eight LLMs, such as GPT-4 and Mistral-7b, which performed zero-shot classification on text-converted structured data. Additionally, Mistral- 7b was fine-tuned using the QLoRA approach. XGBoost and RF demonstrated superior performance among CMLs, achieving F1 scores of 0.87 and 0.83 for internal and external validation, respectively. GPT-4 led the LLM category with an F1 score of 0.43, while fine-tuning Mistral-7b significantly improved its recall from 1% to 79%, yielding a stable F1 score of 0.74 during external validation. Although LLMs showed moderate performance in zero-shot classification, fine-tuning substantially enhanced their effectiveness, potentially bridging the gap with CML models. However, CMLs still outperformed LLMs in handling high-dimensional tabular data tasks. This study highlights the potential of both CMLs and fine-tuned LLMs in medical predictive modeling, while emphasizing the current superiority of CMLs for structured data analysis.
Authors: Sebastian J. Wetzel, Zakaria Patel
Abstract: It has been demonstrated that artificial neural networks like autoencoders or Siamese networks encode meaningful concepts in their latent spaces. However, there does not exist a comprehensive framework for retrieving this information in a human-readable form without prior knowledge. In quantitative disciplines concepts are typically formulated as equations. Hence, in order to extract these concepts, we introduce a framework for finding closed-form interpretations of neurons in latent spaces of artificial neural networks. The interpretation framework is based on embedding trained neural networks into an equivalence class of functions that encode the same concept. We interpret these neural networks by finding an intersection between the equivalence class and human-readable equations defined by a symbolic search space. Computationally, this framework is based on finding a symbolic expression whose normalized gradients match the normalized gradients of a specific neuron with respect to the input variables. The effectiveness of our approach is demonstrated by retrieving invariants of matrices and conserved quantities of dynamical systems from latent spaces of Siamese neural networks.
Authors: Zongbo Han, Jialong Yang, Guangyu Wang, Junfan Li, Qianli Xu, Mike Zheng Shou, Changqing Zhang
Abstract: Vision-language foundation models (VLMs), such as CLIP, exhibit remarkable performance across a wide range of tasks. However, deploying these models can be unreliable when significant distribution gaps exist between training and test data, while fine-tuning for diverse scenarios is often costly. Cache-based test-time adapters offer an efficient alternative by storing representative test samples to guide subsequent classifications. Yet, these methods typically employ naive cache management with limited capacity, leading to severe catastrophic forgetting when samples are inevitably dropped during updates. In this paper, we propose DOTA (DistributiOnal Test-time Adaptation), a simple yet effective method addressing this limitation. Crucially, instead of merely memorizing individual test samples, DOTA continuously estimates the underlying distribution of the test data stream. Test-time posterior probabilities are then computed using these dynamically estimated distributions via Bayes' theorem for adaptation. This distribution-centric approach enables the model to continually learn and adapt to the deployment environment. Extensive experiments validate that DOTA significantly mitigates forgetting and achieves state-of-the-art performance compared to existing methods.
Authors: Yingxu Wang, Mengzhu Wang, Houcheng Su, Nan Yin, Quanming Yao, James Kwok
Abstract: Spiking Graph Networks (SGNs) have demonstrated significant potential in graph classification by emulating brain-inspired neural dynamics to achieve energy-efficient computation. However, existing SGNs are generally constrained to in-distribution scenarios and struggle with distribution shifts. In this paper, we first propose the domain adaptation problem in SGNs, and introduce a novel framework named Degree-Consicious Spiking Graph for Cross-Domain Adaptation (DeSGraDA). DeSGraDA enhances generalization across domains with three key components. First, we introduce the degree-conscious spiking representation module by adapting spike thresholds based on node degrees, enabling more expressive and structure-aware signal encoding. Then, we perform temporal distribution alignment by adversarially matching membrane potentials between domains, ensuring effective performance under domain shift while preserving energy efficiency. Additionally, we extract consistent predictions across two spaces to create reliable pseudo-labels, effectively leveraging unlabeled data to enhance graph classification performance. Furthermore, we establish the first generalization bound for SGDA, providing theoretical insights into its adaptation performance. Extensive experiments on benchmark datasets validate that DeSGraDA consistently outperforms state-of-the-art methods in both classification accuracy and energy efficiency.
Authors: Vinith M. Suriyakumar, Rohan Alur, Ayush Sekhari, Manish Raghavan, Ashia C. Wilson
Abstract: Text-to-image diffusion models rely on massive, web-scale datasets. Training them from scratch is computationally expensive, and as a result, developers often prefer to make incremental updates to existing models. These updates often compose fine-tuning steps (to learn new concepts or improve model performance) with "unlearning" steps (to "forget" existing concepts, such as copyrighted works or explicit content). In this work, we demonstrate a critical and previously unknown vulnerability that arises in this paradigm: even under benign, non-adversarial conditions, fine-tuning a text-to-image diffusion model on seemingly unrelated images can cause it to "relearn" concepts that were previously "unlearned." We comprehensively investigate the causes and scope of this phenomenon, which we term concept resurgence, by performing a series of experiments which compose "concept unlearning" with subsequent fine-tuning of Stable Diffusion v1.4 and Stable Diffusion v2.1. Our findings underscore the fragility of composing incremental model updates, and raise serious new concerns about current approaches to ensuring the safety and alignment of text-to-image diffusion models.
Authors: Lothar Sebastian Krapp, Laura Wirth
Abstract: The Fundamental Theorem of Statistical Learning states that a hypothesis space is PAC learnable if and only if its VC dimension is finite. For the agnostic model of PAC learning, the literature so far presents proofs of this theorem that often tacitly impose several measurability assumptions on the involved sets and functions. We scrutinize these proofs from a measure-theoretic perspective in order to explicitly extract the assumptions needed for a rigorous argument. This leads to a sound statement as well as a detailed and self-contained proof of the Fundamental Theorem of Statistical Learning in the agnostic setting, showcasing the minimal measurability requirements needed. As the Fundamental Theorem of Statistical Learning underpins a wide range of further theoretical developments, our results are of foundational importance: A careful analysis of measurability aspects is essential, especially when the theorem is used in settings where measure-theoretic subtleties play a role. We particularly discuss applications in Model Theory, considering NIP and o-minimal structures. Our main theorem presents sufficient conditions for the PAC learnability of hypothesis spaces defined over o-minimal expansions of the reals. This class of hypothesis spaces covers all artificial neural networks for binary classification that use commonly employed activation functions like ReLU and the sigmoid function.
Authors: Manav Vora, Ilan Shomorony, Melkior Ornik
Abstract: We study capacity- and budget-constrained multi-agent MDPs (CB-MA-MDPs), a class that captures many maintenance and scheduling tasks in which each agent can irreversibly fail and a planner must decide (i) when to apply a restorative action and (ii) which subset of agents to treat in parallel. The global budget limits the total number of restorations, while the capacity constraint bounds the number of simultaneous actions, turning na\"ive dynamic programming into a combinatorial search that scales exponentially with the number of agents. We propose a two-stage solution that remains tractable for large systems. First, a Linear Sum Assignment Problem (LSAP)-based grouping partitions the agents into r disjoint sets (r = capacity) that maximise diversity in expected time-to-failure, allocating budget to each set proportionally. Second, a meta-trained PPO policy solves each sub-MDP, leveraging transfer across groups to converge rapidly. To validate our approach, we apply it to the problem of scheduling repairs for a large team of industrial robots, constrained by a limited number of repair technicians and a total repair budget. Our results demonstrate that the proposed method outperforms baseline approaches in terms of maximizing the average uptime of the robot team, particularly for large team sizes. Lastly, we confirm the scalability of our approach through a computational complexity analysis across varying numbers of robots and repair technicians.
Authors: Dongmin Park, Sebin Kim, Taehong Moon, Minkyu Kim, Kangwook Lee, Jaewoong Cho
Abstract: State-of-the-art text-to-image (T2I) diffusion models often struggle to generate rare compositions of concepts, e.g., objects with unusual attributes. In this paper, we show that the compositional generation power of diffusion models on such rare concepts can be significantly enhanced by the Large Language Model (LLM) guidance. We start with empirical and theoretical analysis, demonstrating that exposing frequent concepts relevant to the target rare concepts during the diffusion sampling process yields more accurate concept composition. Based on this, we propose a training-free approach, R2F, that plans and executes the overall rare-to-frequent concept guidance throughout the diffusion inference by leveraging the abundant semantic knowledge in LLMs. Our framework is flexible across any pre-trained diffusion models and LLMs, and can be seamlessly integrated with the region-guided diffusion approaches. Extensive experiments on three datasets, including our newly proposed benchmark, RareBench, containing various prompts with rare compositions of concepts, R2F significantly surpasses existing models including SD3.0 and FLUX by up to 28.1%p in T2I alignment. Code is available at https://github.com/krafton-ai/Rare-to-Frequent.
Authors: Emma Reyner-Fuentes, Esther Rituerto-Gonzalez, Carmen Pelaez-Moreno
Abstract: Gender-based violence is a pervasive public health issue that severely impacts women's mental health, often leading to conditions such as in anxiety, depression, post-traumatic stress disorder, and substance abuse. Identifying the combination of these various mental health conditions could then point to someone who is a victim of gender-based violence. And while speech-based artificial intelligence tools show as a promising solution for mental health screening, their performance often deteriorates when encountering speech from previously unseen speakers, a sign that speaker traits may be confounding factors. This study introduces a speaker-agnostic approach to detecting the gender-based violence victim condition from speech, aiming to develop robust artificial intelligence models capable of generalizing across speakers. By employing domain-adversarial training, we reduce the influence of speaker identity on model predictions, we achieve a 26.95% relative reduction in speaker identification accuracy while improving gender-based violence victim condition classification accuracy by 6.37% (relative). These results suggest that our models effectively capture paralinguistic biomarkers linked to the gender-based violence victim condition, rather than speaker-specific traits. Additionally, the model's predictions show moderate correlation with pre-clinical post-traumatic stress disorder symptoms, supporting the relevance of speech as a non-invasive tool for mental health monitoring. This work lays the foundation for ethical, privacy-preserving artificial intelligence systems to support clinical screening of gender-based violence survivors.
Authors: Tian Xie, Pavan Rauch, Xueru Zhang
Abstract: When ML algorithms are deployed to automate human-related decisions, human agents may learn the underlying decision policies and adapt their behavior. Strategic Classification (SC) has emerged as a framework for studying this interaction between agents and decision-makers to design more trustworthy ML systems. Prior theoretical models in SC assume that agents are perfectly or approximately rational and respond to decision policies by optimizing their utility. However, the growing prevalence of LLMs raises the possibility that real-world agents may instead rely on these tools for strategic advice. This shift prompts two questions: (i) Can LLMs generate effective and socially responsible strategies in SC settings? (ii) Can existing SC theoretical models accurately capture agent behavior when agents follow LLM-generated advice? To investigate these questions, we examine five critical SC scenarios: hiring, loan applications, school admissions, personal income, and public assistance programs. We simulate agents with diverse profiles who interact with three commercial LLMs (GPT-4o, GPT-4.1, and GPT-5), following their suggestions on effort allocations on features. We compare the resulting agent behaviors with the best responses in existing SC models. Our findings show that: (i) Even without access to the decision policy, LLMs can generate effective strategies that improve both agents' scores and qualification; (ii) At the population level, LLM-guided effort allocation strategies yield similar or even higher score improvements, qualification rates, and fairness metrics as those predicted by the SC theoretical model, suggesting that the theoretical model may still serve as a reasonable proxy for LLM-influenced behavior; and (iii) At the individual level, LLMs tend to produce more diverse and balanced effort allocations than theoretical models.
Authors: Mingyu Chen, Yiding Chen, Wen Sun, Xuezhou Zhang
Abstract: Reinforcement Learning from Human Feedback (RLHF) has emerged as a pivotal technique for large language model (LLM) alignment. This paper studies the setting of online RLHF and focus on improving sample efficiency. All existing algorithms in online RLHF, whether doing passive exploration or active exploration, suffer from a sample complexity that scales exponentially with the scale of the reward function. This fundamental limitation hinders their effectiveness in scenarios with heavily skewed preferences, e.g. questions with a unique correct solution. To address this, we introduce Self-Exploring Preference-Incentive Online Preference Optimization (SE-POPO), an online RLHF algorithm that for the first time achieves a sample complexity that scales polynomially with the reward scale, answering an open problem raised by Xie et al. (2024).. Theoretically, we demonstrate that the sample complexity of SE-POPO dominates that of existing exploration algorithms. Empirically, our systematic evaluation confirms that SE-POPO is more sample-efficient than both exploratory and non-exploratory baselines, in two primary application scenarios of RLHF as well as on public benchmarks, marking a significant step forward in RLHF algorithm design. The code is available at https://github.com/MYC000801/SE-POPO.
Authors: Jack Sandberg, Morteza Haghir Chehreghani
Abstract: Gaussian process (GP) bandits provide a powerful framework for performing blackbox optimization of unknown functions. The characteristics of the unknown function depends heavily on the assumed GP prior. Most work in the literature assume that this prior is known but in practice this seldom holds. Instead, practitioners often rely on maximum likelihood estimation to select the hyperparameters of the prior - which lacks theoretical guarantees. In this work, we propose two algorithms for joint prior selection and regret minimization in GP bandits based on GP Thompson sampling (GP-TS): Prior-Elimination GP-TS (PE-GP-TS) and HyperPrior GP-TS (HP-GP-TS). We theoretically analyze the algorithms and establish upper bounds for their respective regret. In addition, we demonstrate the effectiveness of our algorithms compared to the alternatives through experiments with synthetic and real-world data.
Authors: Ganqu Cui, Lifan Yuan, Zefan Wang, Hanbin Wang, Yuchen Zhang, Jiacheng Chen, Wendi Li, Bingxiang He, Yuchen Fan, Tianyu Yu, Qixin Xu, Weize Chen, Jiarui Yuan, Huayu Chen, Kaiyan Zhang, Xingtai Lv, Shuo Wang, Yuan Yao, Xu Han, Hao Peng, Yu Cheng, Zhiyuan Liu, Maosong Sun, Bowen Zhou, Ning Ding
Abstract: Dense process rewards have proven a more effective alternative to the sparse outcome-level rewards in the inference-time scaling of large language models (LLMs), particularly in tasks requiring complex multi-step reasoning. While dense rewards also offer an appealing choice for the reinforcement learning (RL) of LLMs since their fine-grained rewards have the potential to address some inherent issues of outcome rewards, such as training efficiency and credit assignment, this potential remains largely unrealized. This can be primarily attributed to the challenges of training process reward models (PRMs) online, where collecting high-quality process labels is prohibitively expensive, making them particularly vulnerable to reward hacking. To address these challenges, we propose PRIME (Process Reinforcement through IMplicit rEwards), which enables online PRM updates using only policy rollouts and outcome labels through implict process rewards. PRIME combines well with various advantage functions and forgoes the dedicated reward model training phrase that existing approaches require, substantially reducing the development overhead. We demonstrate PRIME's effectiveness on competitional math and coding. Starting from Qwen2.5-Math-7B-Base, PRIME achieves a 15.1% average improvement across several key reasoning benchmarks over the SFT model. Notably, our resulting model, Eurus-2-7B-PRIME, surpasses Qwen2.5-Math-7B-Instruct on seven reasoning benchmarks with 10% of its training data.
Authors: Stavros Orfanoudakis, Nanda Kishor Panda, Peter Palensky, Pedro P. Vergara
Abstract: Reinforcement Learning (RL) methods used for solving real-world optimization problems often involve dynamic state-action spaces, larger scale, and sparse rewards, leading to significant challenges in convergence, scalability, and efficient exploration of the solution space. This study introduces GNN-DT, a novel Decision Transformer (DT) architecture that integrates Graph Neural Network (GNN) embedders with a novel residual connection between input and output tokens crucial for handling dynamic environments. By learning from previously collected trajectories, GNN-DT tackles the sparse rewards limitations of online RL algorithms and delivers high-quality solutions in real-time. We evaluate GNN-DT on the complex electric vehicle (EV) charging optimization problem and prove that its performance is superior and requires significantly fewer training trajectories, thus improving sample efficiency compared to existing DT and offline RL baselines. Furthermore, GNN-DT exhibits robust generalization to unseen environments and larger action spaces, addressing a critical gap in prior offline and online RL approaches.
Authors: Jianan Nie, Peiyao Xiao, Kaiyi Ji, Peng Gao
Abstract: Predicting properties of crystals from their structures is a fundamental yet challenging task in materials science. Unlike molecules, crystal structures exhibit infinite periodic arrangements of atoms, requiring methods capable of capturing both local and global information effectively. However, current works fall short of capturing long-range interactions within periodic structures. To address this limitation, we leverage \emph{reciprocal space}, the natural domain for periodic crystals, and construct a Fourier series representation from fractional coordinates and reciprocal lattice vectors with learnable filters. Building on this principle, we introduce the reciprocal space-based geometry network (\textbf{ReciNet}), a novel architecture that integrates geometric GNNs and reciprocal blocks to model short-range and long-range interactions, respectively. Experimental results on standard benchmarks JARVIS, Materials Project, and MatBench demonstrate that ReciNet achieves state-of-the-art predictive accuracy across a range of crystal property prediction tasks. Additionally, we explore a model extension to multi-property prediction with the mixture-of-experts, which demonstrates high computational efficiency and reveals positive transfer between correlated properties. These findings highlight the potential of our model as a scalable and accurate solution for crystal property prediction.
Authors: Arwen Bradley, Preetum Nakkiran, David Berthelot, James Thornton, Joshua M. Susskind
Abstract: We study the theoretical foundations of composition in diffusion models, with a particular focus on out-of-distribution extrapolation and length-generalization. Prior work has shown that composing distributions via linear score combination can achieve promising results, including length-generalization in some cases (Du et al., 2023; Liu et al., 2022). However, our theoretical understanding of how and why such compositions work remains incomplete. In fact, it is not even entirely clear what it means for composition to "work". This paper starts to address these fundamental gaps. We begin by precisely defining one possible desired result of composition, which we call projective composition. Then, we investigate: (1) when linear score combinations provably achieve projective composition, (2) whether reverse-diffusion sampling can generate the desired composition, and (3) the conditions under which composition fails. We connect our theoretical analysis to prior empirical observations where composition has either worked or failed, for reasons that were unclear at the time. Finally, we propose a simple heuristic to help predict the success or failure of new compositions.
Authors: Peiyao Xiao, Chaosheng Dong, Shaofeng Zou, Kaiyi Ji
Abstract: Multi-task learning (MTL) has been widely adopted for its ability to simultaneously learn multiple tasks. While existing gradient manipulation methods often yield more balanced solutions than simple scalarization-based approaches, they typically incur a significant computational overhead of $\mathcal{O}(K)$ in both time and memory, where $K$ is the number of tasks. In this paper, we propose LDC-MTL, a simple and scalable loss discrepancy control approach for MTL, formulated from a bilevel optimization perspective. Our method incorporates two key components: (i) a bilevel formulation for fine-grained loss discrepancy control, and (ii) a scalable first-order bilevel algorithm that requires only $\mathcal{O}(1)$ time and memory. Theoretically, we prove that LDC-MTL guarantees convergence not only to a stationary point of the bilevel problem with loss discrepancy control but also to an $\epsilon$-accurate Pareto stationary point for all $K$ loss functions under mild conditions. Extensive experiments on diverse multi-task datasets demonstrate the superior performance of LDC-MTL in both accuracy and efficiency.
Authors: Xun Wang, Zhuoran Li, Hai Zhong, Longbo Huang
Abstract: As a data-driven approach, offline MARL learns superior policies solely from offline datasets, ideal for domains rich in historical data but with high interaction costs and risks. However, most existing methods are task-specific, requiring retraining for new tasks, leading to redundancy and inefficiency. To address this issue, we propose a task-efficient value-based multi-task offline MARL algorithm, Skill-Discovery Conservative Q-Learning (SD-CQL). Unlike existing methods decoding actions from skills via behavior cloning, SD-CQL discovers skills in a latent space by reconstructing the next observation, evaluates fixed and variable actions separately, and uses conservative Q-learning with local value calibration to select the optimal action for each skill. It eliminates the need for local-global alignment and enables strong multi-task generalization from limited, small-scale source tasks. Substantial experiments on StarCraft II demonstrate the superior generalization performance and task-efficiency of SD-CQL. It achieves the best performance on $\textbf{13}$ out of $14$ task sets, with up to $\textbf{68.9%}$ improvement on individual task sets.
Authors: Yikun Bai, Shuang Wang, Huy Tran, Hengrong Du, Juexin Wang, Soheil Kolouri
Abstract: Structured data, such as graphs, is vital in machine learning due to its capacity to capture complex relationships and interactions. In recent years, the Fused Gromov-Wasserstein (FGW) distance has attracted growing interest because it enables the comparison of structured data by jointly accounting for feature similarity and geometric structure. However, as a variant of optimal transport (OT), classical FGW assumes an equal mass constraint on the compared data. In this work, we relax this mass constraint and propose the Fused Partial Gromov-Wasserstein (FPGW) framework, which extends FGW to accommodate unbalanced data. Theoretically, we establish the relationship between FPGW and FGW and prove the metric properties of FPGW. Numerically, we introduce Frank-Wolfe solvers and Sinkhorn solvers for the proposed FPGW framework. Finally, we evaluate the FPGW distance through graph matching, graph classification and graph clustering experiments, demonstrating its robust performance.
Authors: Jathurshan Pradeepkumar, Xihao Piao, Zheng Chen, Jimeng Sun
Abstract: Foundation models are reshaping EEG analysis, yet an important problem of EEG tokenization remains a challenge. This paper presents TFM-Tokenizer, a novel tokenization framework that learns a vocabulary of time-frequency motifs from single-channel EEG signals and encodes them into discrete tokens. We propose a dual-path architecture with time-frequency masking to capture robust motif representations, and it is model-agnostic, supporting both lightweight transformers and existing foundation models for downstream tasks. Our study demonstrates three key benefits: Accuracy: Experiments on four diverse EEG benchmarks demonstrate consistent performance gains across both single- and multi-dataset pretraining settings, achieving up to 17% improvement in Cohen's Kappa over strong baselines. Generalization: Moreover, as a plug-and-play component, it consistently boosts the performance of diverse foundation models, including BIOT and LaBraM. Scalability: By operating at the single-channel level rather than relying on the strict 10-20 EEG system, our method has the potential to be device-agnostic. Experiments on ear-EEG sleep staging, which differs from the pretraining data in signal format, channel configuration, recording device, and task, show that our tokenizer outperforms baselines by 14%. A comprehensive token analysis reveals strong class-discriminative, frequency-aware, and consistent structure, enabling improved representation quality and interpretability. Code is available at https://github.com/Jathurshan0330/TFM-Tokenizer.
Authors: Jan Wehner, Sahar Abdelnabi, Daniel Tan, David Krueger, Mario Fritz
Abstract: Representation Engineering (RepE) is a novel paradigm for controlling the behavior of LLMs. Unlike traditional approaches that modify inputs or fine-tune the model, RepE directly manipulates the model's internal representations. As a result, it may offer more effective, interpretable, data-efficient, and flexible control over models' behavior. We present the first comprehensive survey of RepE for LLMs, reviewing the rapidly growing literature to address key questions: What RepE methods exist and how do they differ? For what concepts and problems has RepE been applied? What are the strengths and weaknesses of RepE compared to other methods? To answer these, we propose a unified framework describing RepE as a pipeline comprising representation identification, operationalization, and control. We posit that while RepE methods offer significant potential, challenges remain, including managing multiple concepts, ensuring reliability, and preserving models' performance. Towards improving RepE, we identify opportunities for experimental and methodological improvements and construct a guide for best practices.
Authors: Ambroise Heurtebise, Omar Chehab, Pierre Ablin, Alexandre Gramfort, Aapo Hyv\"arinen
Abstract: Causal discovery is a difficult problem that typically relies on strong assumptions on the data-generating model, such as non-Gaussianity. In practice, many modern applications provide multiple related views of the same system, which has rarely been considered for causal discovery. Here, we leverage this multi-view structure to achieve causal discovery with weak assumptions. We propose a multi-view linear Structural Equation Model (SEM) that extends the well-known framework of non-Gaussian disturbances by alternatively leveraging correlation over views. We prove the identifiability of the model for acyclic SEMs. Subsequently, we propose several multi-view causal discovery algorithms, inspired by single-view algorithms (DirectLiNGAM, PairwiseLiNGAM, and ICA-LiNGAM). The new methods are validated through simulations and applications on neuroimaging data, where they enable the estimation of causal graphs between brain regions.
Authors: Subed Lamichhane, Haotian Lu, Sheldon X. -D. Tan
Abstract: In contrast to the assumptions of most existing Electromigration (EM) analysis tools, the evolution of EM-induced stress is inherently non-deterministic, influenced by factors such as input current fluctuations and manufacturing non-idealities. Traditional approaches for estimating stress variations typically involve computationally expensive and inefficient Monte Carlo simulations with industrial solvers, which quantify variations using mean and variance metrics. In this work, we introduce a novel machine learning-based framework, termed BPINN-EM- Post, for efficient stochastic analysis of EM-induced post-voiding aging processes. For the first time, our new approach integrates closed-form analytical solutions with a Bayesian Physics- Informed Neural Network (BPINN) framework to accelerate the analysis. The closed-form solutions enforce physical laws at the individual wire segment level, while the BPINN ensures that physics constraints at inter-segment junctions are satisfied and stochastic behaviors are accurately modeled. By reducing the number of variables in the loss functions through utilizing analytical solutions, our method significantly improves training efficiency without accuracy loss and naturally incorporates variational effects. Additionally, the analytical solutions effectively address the challenge of incorporating initial stress distributions in interconnect structures during post-void stress calculations. Numerical results demonstrate that BPINN-EM-Post achieves over 240x and more than 67x speedup compared to Monte Carlo simulations using the FEM-based COMSOL solver and FDM-based EMSpice, respectively, with marginal accuracy loss.
Authors: Yihang Lu, Mahwish Yousaf, Xianwei Meng, Enhong Chen
Abstract: Imputation of random or non-random missing data is a long-standing research topic and a crucial application for Intelligent Transportation Systems (ITS). However, with the advent of modern communication technologies such as Global Satellite Navigation Systems (GNSS), traffic data collection has introduced new challenges in random missing value imputation and increasing demands for spatiotemporal dependency modelings. To address these issues, we propose a novel spatiotemporal traffic imputation method based on a Multimode Nonlinear Transformed Tensor Nuclear Norm (MNT-TNN), which can effectively capture the intrinsic multimode spatiotemporal correlations and low-rankness of the traffic tensor, represented as location $\times$ location $\times$ time. To solve the nonconvex optimization problem, we design a proximal alternating minimization (PAM) algorithm with theoretical convergence guarantees. We also suggest an Augmented Transform-based Tensor Nuclear Norm Families (ATTNNs) framework to enhance the imputation results of TTNN techniques, especially at very high miss rates. Extensive experiments on real datasets demonstrate that our proposed MNT-TNN and ATTNNs can outperform the compared state-of-the-art imputation methods, completing the benchmark of random missing traffic value imputation.
Authors: Liming Wang, Muhammad Jehanzeb Mirza, Yishu Gong, Yuan Gong, Jiaqi Zhang, Brian H. Tracey, Katerina Placek, Marco Vilela, James R. Glass
Abstract: This paper presents a novel theoretical framework for understanding how diffusion models can learn disentangled representations. Within this framework, we establish identifiability conditions for general disentangled latent variable models, analyze training dynamics, and derive sample complexity bounds for disentangled latent subspace models. To validate our theory, we conduct disentanglement experiments across diverse tasks and modalities, including subspace recovery in latent subspace Gaussian mixture models, image colorization, image denoising, and voice conversion for speech classification. Additionally, our experiments show that training strategies inspired by our theory, such as style guidance regularization, consistently enhance disentanglement performance.
Authors: Xinyan Meng, Huabin Cheng, Rujie Chen, Ning Xu, Yu Chen, Wei Zhang
Abstract: The state-of-the-art researches indicate that analytic algorithms are promising in handling complex floorplanning scenarios. However, it is challenging to generate compact floorplans with excellent wirelength optimization effect due to the local convergence of gradient-based optimization algorithms designed for constructed smooth optimization models. Accordingly, we propose to construct a nonsmooth analytic floorplanning model addressed by the conjugate subgradient algorithm (CSA), which is accelerated by a population-based scheme adaptively regulating the stepsize with the assistance of Q-learning. In this way, the proposed CSA assisted by Q-learning (CSAQ) can strike a good balance on exploration and exploitation. Experimental results on the MCNC and GSRC benchmarks demonstrate that the proposed fixed-outline floorplanning algorithm based on CSAQ (CSF) not only address global floorplanning effectively, but also get legal floorplans more efficiently than the constraint graph-based legalization algorithm as well as its improved variants. It is also demonstrated that the CSF is competitive to the state-of-the-art algorithms on floorplanning scenarios only containing hard modules.
Authors: Grgur Kova\v{c}, J\'er\'emy Perez, R\'emy Portelas, Peter Ford Dominey, Pierre-Yves Oudeyer
Abstract: Large language models (LLMs) are increasingly used in the creation of online content, creating feedback loops as subsequent generations of models will be trained on this synthetic data. Such loops were shown to lead to distribution shifts - models misrepresenting the true underlying distributions of human data (also called model collapse). However, how human data properties affect such shifts remains poorly understood. In this paper, we provide the first empirical examination of the effect of such properties on the outcome of recursive training. We first confirm that using different human datasets leads to distribution shifts of different magnitudes. Through exhaustive manipulation of dataset properties combined with regression analyses, we then identify a set of properties predicting distribution shift magnitudes. Lexical diversity is found to amplify these shifts, while semantic diversity and data quality mitigate them. Furthermore, we find that these influences are highly modular: data scrapped from a given internet domain has little influence on the content generated for another domain. Finally, experiments on political bias reveal that human data properties affect whether the initial bias will be amplified or reduced. Overall, our results portray a novel view, where different parts of internet may undergo different types of distribution shift.
Authors: Feng Chen, Yefei He, Lequan Lin, Chenhui Gou, Jing Liu, Bohan Zhuang, Qi Wu
Abstract: Sparse attention mechanisms aim to reduce computational overhead with minimal accuracy loss by selectively processing salient tokens. Despite their effectiveness, most methods merely exploit a model's inherent sparsity and thus plateau at moderate budgets (about 50\% token reduction), with little headroom to push budget lower without hurting accuracy. Other approaches attempt to enforce sparsity through trainable sparse attention or sharpness-inducing regularizers, but these either fix rigid patterns that ignore input and layer dynamics, or optimize proxy objectives without direct control over token budgets. In this paper, we explicitly reinforce token sparsity in well-posed multimodal large language models (MLLMs) through a simple RL-based post-training framework named \textit{Sparsity Forcing}. Our method explores the efficiency-accuracy trade-off by running multiple rollouts with different token budgets, where both efficiency (token reduction ratio) and performance (answer correctness) are formulated as joint rewards. By contrasting rollouts within each group, the more efficient and correct answer is rewarded while less efficient or incorrect ones are penalized, thereby turning token saving into an end-to-end, inference-consistent optimization objective. Across thirteen image and video benchmarks, Sparsity Forcing raises token reduction ratio on Qwen2-VL/Qwen2.5-VL from 20\% to 75\% with minimal accuracy decline, significantly reducing long-context inference memory by up to 3$\times$ while speeding up decoding by up to 3.3$\times$.
Authors: Giovanni Catalani, Michael Bauerheim, Fr\'ed\'eric Tost, Xavier Bertrand, Joseph Morlier
Abstract: Advances in neural operators have introduced discretization invariant surrogate models for PDEs on general geometries, yet many approaches struggle to encode local geometric structure and variable domains efficiently. We introduce enf2enf, a neural field approach for predicting steady-state PDEs with geometric variability. Our method encodes geometries into latent features anchored at specific spatial locations, preserving locality throughout the network. These local representations are combined with global parameters and decoded to continuous physical fields, enabling effective modeling of complex shape variations. Experiments on aerodynamic and structural benchmarks demonstrate competitive or superior performance compared to graph-based, neural operator, and recent neural field methods, with real-time inference and efficient scaling to high-resolution meshes.
Authors: Gleb Molodtsov, Daniil Medyakov, Sergey Skorik, Nikolas Khachaturov, Shahane Tigranyan, Vladimir Aletov, Aram Avetisyan, Martin Tak\'a\v{c}, Aleksandr Beznosikov
Abstract: Recent advancements in machine learning have improved performance while also increasing computational demands. While federated and distributed setups address these issues, their structure is vulnerable to malicious influences. In this paper, we address a specific threat, Byzantine attacks, where compromised clients inject adversarial updates to derail global convergence. We combine the trust scores concept with trial function methodology to dynamically filter outliers. Our methods address the critical limitations of previous approaches, allowing functionality even when Byzantine nodes are in the majority. Moreover, our algorithms adapt to widely used scaled methods like Adam and RMSProp, as well as practical scenarios, including local training and partial participation. We validate the robustness of our methods by conducting extensive experiments on both synthetic and real ECG data collected from medical institutions. Furthermore, we provide a broad theoretical analysis of our algorithms and their extensions to aforementioned practical setups. The convergence guarantees of our methods are comparable to those of classical algorithms developed without Byzantine interference.
Authors: Mohammadmahdi Ghasemloo, Alireza Moradi
Abstract: With the widespread adoption of Large Language Models (LLMs), there is a growing need to establish best practices for leveraging their capabilities beyond traditional natural language tasks. In this paper, a novel cross-domain knowledge transfer framework is proposed to enhance the performance of LLMs in time series forecasting -- a task of increasing relevance in fields such as energy systems, finance, and healthcare. The approach systematically infuses LLMs with structured temporal information to improve their forecasting accuracy. This study evaluates the proposed method on a real-world time series dataset and compares it to a naive baseline where the LLM receives no auxiliary information. Results show that knowledge-informed forecasting significantly outperforms the uninformed baseline in terms of predictive accuracy and generalization. These findings highlight the potential of knowledge transfer strategies to bridge the gap between LLMs and domain-specific forecasting tasks.
Authors: Li Ju, Max Andersson, Stina Fredriksson, Edward Gl\"ockner, Andreas Hellander, Ekta Vats, Prashant Singh
Abstract: Vision-language models (VLMs) as foundation models have significantly enhanced performance across a wide range of visual and textual tasks, without requiring large-scale training from scratch for downstream tasks. However, these deterministic VLMs fail to capture the inherent ambiguity and uncertainty in natural language and visual data. Recent probabilistic post-hoc adaptation methods address this by mapping deterministic embeddings onto probability distributions; however, existing approaches do not account for the asymmetric uncertainty structure of the modalities, and the constraint that meaningful deterministic embeddings reside on a unit hypersphere, potentially leading to suboptimal performance. In this paper, we address the asymmetric uncertainty structure inherent in textual and visual data, and propose AsymVLM to build probabilistic embeddings from pre-trained VLMs on the unit hypersphere, enabling uncertainty quantification. We validate the effectiveness of the probabilistic embeddings on established benchmarks, and present comprehensive ablation studies demonstrating the inherent nature of asymmetry in the uncertainty structure of textual and visual data.
Authors: Fei Wu, Jia Hu, Geyong Min, Shiqiang Wang
Abstract: Driven by the rapid growth of model parameters, parameter-efficient fine-tuning (PEFT) has become essential for adapting large models to diverse downstream tasks under constrained computational resources. Within this paradigm, orthogonal fine-tuning and its variants preserve semantic representations of pre-trained models, but struggle to achieve both expressiveness and efficiency in terms of parameter counts, memory, and computation. To overcome this limitation, we propose efficient Orthogonal Fine-Tuning with Principal Subspace adaptation (PSOFT), which confines orthogonal transformations to the principal subspace of pre-trained weights. Specifically, PSOFT constructs this subspace via matrix decomposition to enable compatible transformations with higher effective rank, establishes a theoretical condition that strictly maintains the geometry of this subspace for essential semantic preservation, and introduces efficient tunable vectors that gradually relax orthogonality during training to enhance adaptability. Extensive experiments on 35 NLP and CV tasks across four representative models demonstrate that PSOFT offers a practical and scalable solution to simultaneously achieve semantic preservation, expressiveness, and multi-dimensional efficiency in PEFT. The code is publicly available at https://github.com/fei407/PSOFT.
Authors: Zhen Li, Yupeng Su, Songmiao Wang, Runming Yang, Congkai Xie, Aofan Liu, Ming Li, Jiannong Cao, Yuan Xie, Ngai Wong, Hongxia Yang
Abstract: Low-bit post-training quantization (PTQ) is a practical route to deploy reasoning-capable LLMs under tight memory and latency budgets, yet it can markedly impair mathematical reasoning (drops up to 69.81% in our harder settings). We address two deployment-critical questions with process-level precision: Where along a step-structured solution does degradation first arise? How to mitigate it while staying in the low-bit regime? Across widely used PTQ methods (AWQ, GPTQ, SmoothQuant), open-source model families (Qwen, LLaMA; 0.5--7B), and math reasoning benchmarks (GSM8K, MATH, AIME), we perform format-aligned chain-of-thought with step-aligned attribution and uncover two robust regularities: (i) PTQ disproportionately elevates method and execution errors relative to high-level conceptual mistakes; and (ii) failures emerge early, with the first vulnerable step flipping and cascading to the final answer. These regularities suggest a general intervention principle: restore local token-level margins exactly at the earliest failure frontier. We instantiate this principle as a lightweight measure$\rightarrow$locate$\rightarrow$restore loop that operates directly on the quantized model: detect the first faulty step, construct our "Silver Bullet" datasets, and apply small-scale supervised/preference tuning. In our settings, as few as 332 curated examples and 3--5 minutes of compute on a single GPU recover 4-bit weight math reasoning toward the full-precision baseline while preserving PTQ efficiency. Our framework is quantizer- and architecture-agnostic within the evaluated regimes, and turns low-bit degradation from a global accuracy problem into a local, reproducible process intervention.
Authors: Tunyu Zhang, Haizhou Shi, Yibin Wang, Hengyi Wang, Xiaoxiao He, Zhuowei Li, Haoxian Chen, Ligong Han, Kai Xu, Huan Zhang, Dimitris Metaxas, Hao Wang
Abstract: While Large Language Models (LLMs) have demonstrated impressive capabilities, their output quality remains inconsistent across various application scenarios, making it difficult to identify trustworthy responses, especially in complex tasks requiring multi-step reasoning. In this paper, we propose a Token-level Uncertainty estimation framework for Reasoning (TokUR) that enables LLMs to self-assess and self-improve their responses in mathematical reasoning. Specifically, we introduce low-rank random weight perturbation during LLM decoding to generate predictive distributions for token-level uncertainty estimation, and we aggregate these uncertainty quantities to capture the semantic uncertainty of generated responses. Experiments on mathematical reasoning datasets of varying difficulty demonstrate that TokUR exhibits a strong correlation with answer correctness and model robustness, and the uncertainty signals produced by TokUR can be leveraged to enhance the model's reasoning performance at test time. These results highlight the effectiveness of TokUR as a principled and scalable approach for improving the reliability and interpretability of LLMs in challenging reasoning tasks.
Authors: David Chanin, Tom\'a\v{s} Dulka, Adri\`a Garriga-Alonso
Abstract: It is assumed that sparse autoencoders (SAEs) decompose polysemantic activations into interpretable linear directions, as long as the activations are composed of sparse linear combinations of underlying features. However, we find that if an SAE is more narrow than the number of underlying "true features" on which it is trained, and there is correlation between features, the SAE will merge components of correlated features together, thus destroying monosemanticity. In LLM SAEs, these two conditions are almost certainly true. This phenomenon, which we call feature hedging, is caused by SAE reconstruction loss, and is more severe the narrower the SAE. In this work, we introduce the problem of feature hedging and study it both theoretically in toy models and empirically in SAEs trained on LLMs. We suspect that feature hedging may be one of the core reasons that SAEs consistently underperform supervised baselines. Finally, we use our understanding of feature hedging to propose an improved variant of matryoshka SAEs. Importantly, our work shows that SAE width is not a neutral hyperparameter: narrower SAEs suffer more from hedging than wider SAEs.
Authors: Minsu Kim, Jean-Pierre Falet, Oliver E. Richardson, Xiaoyin Chen, Moksh Jain, Sungjin Ahn, Sungsoo Ahn, Yoshua Bengio
Abstract: Chain-of-Thought (CoT) reasoning has advanced the capabilities and transparency of language models (LMs); however, reasoning chains can contain inaccurate statements that reduce performance and trustworthiness. To address this, we propose to augment each reasoning step in a CoT with a latent veracity (or correctness) variable. To efficiently explore this expanded space, we introduce Veracity Search (VS), a discrete search algorithm over veracity assignments. It performs otherwise intractable inference in the posterior distribution over latent veracity values by leveraging the LM's joint likelihood over veracity and the final answer as a proxy reward. This efficient inference-time verification method facilitates supervised fine-tuning of an Amortized Veracity Inference (AVI) machine by providing pseudo-labels for veracity. AVI generalizes VS, enabling accurate zero-shot veracity inference in novel contexts. Empirical results demonstrate that VS reliably identifies errors in logical (ProntoQA), mathematical (GSM8K), and commonsense (CommonsenseQA) reasoning benchmarks, with AVI achieving comparable zero-shot accuracy. Finally, we demonstrate the utility of latent veracity inference for providing feedback during self-correction and self-improvement.
Authors: Arun Kumar, Paul Schrater
Abstract: Invariant representations are core to representation learning, yet a central challenge remains: uncovering invariants that are stable and transferable without suppressing task-relevant signals. This raises fundamental questions, requiring further inquiry, about the appropriate level of abstraction at which such invariants should be defined and which aspects of a system they should characterize. Interpretation of the environment relies on abstract knowledge structures to make sense of the current state, which leads to interactions, essential drivers of learning and knowledge acquisition. Interpretation operates at the level of higher-order relational knowledge; hence, we propose that invariant structures must be where knowledge resides, specifically as partitions defined by the closure of relational paths within an abstract knowledge space. These partitions serve as the core invariant representations, forming the structural substrate where knowledge is stored and learning occurs. On the other hand, inter-partition connectors enable the deployment of these knowledge partitions encoding task-relevant transitions. Thus, invariant partitions provide the foundational primitives of structured representation. We formalize the computational foundations for structured relational representations of the invariant partitions based on closed semiring, a relational algebraic structure.
Authors: Francesco D'Amico, Dario Bocchi, Matteo Negri
Abstract: Scaling laws in deep learning -- empirical power-law relationships linking model performance to resource growth -- have emerged as simple yet striking regularities across architectures, datasets, and tasks. These laws are particularly impactful in guiding the design of state-of-the-art models, since they quantify the benefits of increasing data or model size, and hint at the foundations of interpretability in machine learning. However, most studies focus on asymptotic behavior at the end of training. In this work, we describe a richer picture by analyzing the entire training dynamics: we identify two novel \textit{dynamical} scaling laws that govern how performance evolves as function of different norm-based complexity measures. Combined, our new laws recover the well-known scaling for test error at convergence. Our findings are consistent across CNNs, ResNets, and Vision Transformers trained on MNIST, CIFAR-10 and CIFAR-100. Furthermore, we provide analytical support using a single-layer perceptron trained with logistic loss, where we derive the new dynamical scaling laws, and we explain them through the implicit bias induced by gradient-based training.
Authors: Sajjad Ghiasvand, Haniyeh Ehsani Oskouie, Mahnoosh Alizadeh, Ramtin Pedarsani
Abstract: Vision-Language Models (VLMs) such as CLIP have shown remarkable performance in cross-modal tasks through large-scale contrastive pre-training. To adapt these large transformer-based models efficiently for downstream tasks, Parameter-Efficient Fine-Tuning (PEFT) techniques like (Low-Rank Adaptation) LoRA have emerged as scalable alternatives to full fine-tuning, especially in few-shot scenarios. However, like traditional deep neural networks, VLMs are highly vulnerable to adversarial attacks, where imperceptible perturbations can significantly degrade model performance. Adversarial training remains the most effective strategy for improving model robustness in PEFT. In this work, we propose AdvCLIP-LoRA, to our knowledge the first method designed to enhance the adversarial robustness of CLIP models fine-tuned with LoRA in few-shot settings. Our method formulates training as a minimax optimization over low-rank adapters and adversarial perturbations, enabling robust adaptation with a small trainable footprint. Across eight datasets and two backbones (ViT-B/16 and ViT-B/32), AdvCLIP-LoRA achieves state-of-the-art performance in few-shot classification, adversarial base-to-new generalization, and cross-dataset transfer, delivering higher adversarial robustness than prompt tuning baselines without sacrificing much clean accuracy. These findings highlight AdvCLIP-LoRA as a practical approach for robust adaptation of VLMs in resource-constrained settings.
Authors: Ziwei Luo, Fredrik K. Gustafsson, Jens Sj\"olund, Thomas B. Sch\"on
Abstract: This work presents a forward-only diffusion (FoD) approach for generative modelling. In contrast to traditional diffusion models that rely on a coupled forward-backward diffusion scheme, FoD directly learns data generation through a single forward diffusion process, yielding a simple yet efficient generative framework. The core of FoD is a state-dependent stochastic differential equation that involves a mean-reverting term in both the drift and diffusion functions. This mean-reversion property guarantees the convergence to clean data, naturally simulating a stochastic interpolation between source and target distributions. More importantly, FoD is analytically tractable and is trained using a simple stochastic flow matching objective, enabling a few-step non-Markov chain sampling during inference. The proposed FoD model, despite its simplicity, achieves state-of-the-art performance on various image restoration tasks. Its general applicability on image-conditioned generation is also demonstrated via qualitative results on image-to-image translation. Our code is available at https://github.com/Algolzw/FoD.
Authors: Hyunjin Seo, Taewon Kim, Sihyun Yu, SungSoo Ahn
Abstract: Masked diffusion models (MDMs) have achieved notable progress in modeling discrete data, while their potential in molecular generation remains underexplored. In this work, we explore their potential and introduce the surprising result that naively applying standards MDMs severely degrades the performance. We identify the critical cause of this issue as a state-clashing problem-where the forward diffusion of distinct molecules collapse into a common state, resulting in a mixture of reconstruction targets that cannot be learned using typical reverse diffusion process with unimodal predictions. To mitigate this, we propose Masked Element-wise Learnable Diffusion (MELD) that orchestrates per-element corruption trajectories to avoid collision between distinct molecular graphs. This is achieved through a parameterized noise scheduling network that assigns distinct corruption rates to individual graph elements, i.e., atoms and bonds. Extensive experiments on diverse molecular benchmarks reveal that MELD markedly enhances overall generation quality compared to element-agnostic noise scheduling, increasing the chemical validity of vanilla MDMs on ZINC250K from 15% to 93%, Furthermore, it achieves state-of-the-art property alignment in conditional generation tasks.
Authors: Noah Amsel, David Persson, Christopher Musco, Robert M. Gower
Abstract: Computing the polar decomposition and the related matrix sign function has been a well-studied problem in numerical analysis for decades. Recently, it has emerged as an important subroutine within the Muon algorithm for training deep neural networks. However, the requirements of this application differ sharply from classical settings: deep learning demands GPU-friendly algorithms that prioritize high throughput over high precision. We introduce Polar Express, a new method for computing the polar decomposition. Like Newton-Schulz and other classical polynomial methods, our approach uses only matrix-matrix multiplications, making it very efficient on GPUs. Inspired by earlier work of Chen & Chow and Nakatsukasa & Freund, Polar Express adapts the update rule at each iteration by solving a minimax optimization problem. We prove that this strategy minimizes error in a worst-case sense, allowing Polar Express to converge as rapidly as possible both in the early iterations and asymptotically. We also address finite-precision issues, making it practical to use in bfloat16. When integrated into the Muon training framework, our method leads to consistent improvements in validation loss when training a GPT-2 model on one billion tokens from the FineWeb dataset, outperforming recent alternatives across a range of learning rates.
Authors: Yizhuo Chen, Tianchen Wang, You Lyu, Yanlan Hu, Jinyang Li, Tomoyoshi Kimura, Hongjue Zhao, Yigong Hu, Denizhan Kara, Tarek Abdelzaher
Abstract: We present SPAR, a framework for self-supervised placement-aware representation learning in distributed sensing. Distributed sensing spans applications where multiple spatially distributed and multimodal sensors jointly observe an environment, from vehicle monitoring to human activity recognition and earthquake localization. A central challenge shared by this wide spectrum of applications, is that observed signals are inseparably shaped by sensor placements, including their spatial locations and structural roles. However, existing pretraining methods remain largely placement-agnostic. SPAR addresses this gap through a unifying principle: the duality between signals and positions. Guided by this principle, SPAR introduces spatial and structural positional embeddings together with dual reconstruction objectives, explicitly modeling how observing positions and observed signals shape each other. Placement is thus treated not as auxiliary metadata but as intrinsic to representation learning. SPAR is theoretically supported by analyses from information theory and occlusion-invariant learning. Extensive experiments on three real-world datasets show that SPAR achieves superior robustness and generalization across various modalities, placements, and downstream tasks.
Authors: Adnan Oomerjee, Zafeirios Fountas, Haitham Bou-Ammar, Jun Wang
Abstract: Transformer LLMs have been shown to exhibit strong reasoning ability that scales with inference-time compute, most prominently through token-space "thinking" chains of thought. A growing line of work pushes extra computation into the model's latent space, which we term Auxiliary Latent-Space Computation (ALSC). Existing ALSC methods largely fall into three buckets: (i) token-mediated latent rollouts, (ii) residual/activation steering, and (iii) memory (KV) compression. An underexplored alternative is memory consolidation/reconsolidation, two processes in the brain that are responsible for stabilising newly formed memory traces, and, upon recall, transiently rendering established traces plastic such they can integrate new contextual information before restabilising. In Transformer LLMs, this can be seen as analogous to performing in-place rewrites of new KV segments, and rewrites of recalled past segments. In this work, we give a theoretical justification as to why memory (re)consolidation via KV cache rewrites is beneficial for improved reasoning. We do this through the lens of Information Bottleneck (IB) theory, which posits that model generalisation emerges from an optimal balance between input information compression and retention of predictive information in latent representations. We then introduce the Bottlenecked Transformer, which augments a backbone LLM with a Cache Processor, an auxiliary Transformer that performs periodic, non-causal, in-place KV rewrites at newline-delimited reasoning step boundaries. The Processor consolidates recently written KV entries and reconsolidates a small, top-k attention-selected set of prior entries. We evaluate our Bottlenecked Transformer architecture on math reasoning benchmarks. Our model sees consistent performance gains over vanilla Transformers and pause-token augmented baselines, with gains of up to +6.6pp for selected tasks/backbones.
Authors: Jingtong Gao, Ling Pan, Yejing Wang, Rui Zhong, Chi Lu, Qingpeng Cai, Peng Jiang, Xiangyu Zhao
Abstract: Reinforcement Learning (RL) has emerged as a pivotal method for improving the reasoning capabilities of Large Language Models (LLMs). However, prevalent RL approaches such as Proximal Policy Optimization (PPO) and Group-Regularized Policy Optimization (GRPO) face critical limitations due to their reliance on sparse outcome-based rewards and inadequate mechanisms for incentivizing exploration. These limitations result in inefficient guidance for reasoning. Specifically, sparse rewards fail to deliver sufficient feedback, particularly for challenging problems. Furthermore, such rewards induce systematic biases that prioritize exploitation of familiar trajectories over novel solution discovery. These shortcomings critically hinder performance in complex reasoning tasks, which inherently demand iterative refinement across intermediate steps. To address these challenges, we propose an Intrinsic Motivation guidEd exploratioN meThOd foR LLM Reasoning (i-MENTOR), a method designed to deliver dense rewards and amplify exploration in the RL-based paradigm. i-MENTOR introduces three innovations: trajectory-aware exploration rewards that mitigate bias in token-level strategies while maintaining computational efficiency; error-conditioned reward allocation to ensure efficient exploration on challenging samples while intrinsically stabilizing training; and advantage-preserving integration that maintains advantage distribution integrity while incorporating exploratory guidance. Experiments across 4 public datasets demonstrate i-MENTOR's effectiveness, achieving a 22.23\% improvement on AIME 2024.
Authors: Ionut-Vlad Modoranu, Mher Safaryan, Erik Schultheis, Max Ryabinin, Artem Chumachenko, Dan Alistarh
Abstract: Low-rank optimization has emerged as a promising direction in training large language models (LLMs) to improve running time and reduce the memory usage of adaptive optimizers by constraining learning to a lower-dimensional space. Prior work typically projects gradients of linear layers using approaches based on Singular Value Decomposition (SVD) or QR-decomposition. Applying these techniques individually to each layer in large models is computationally expensive and incurs additional memory costs due to storing the projection matrices. In this work, we propose a computationally efficient and conceptually simple, two-step procedure to approximate SVD/QR-based gradient projections into lower-dimensional spaces by using a predefined orthogonal matrix of the Discrete Cosine Transform (DCT). We dynamically select columns from the DCT matrix based on their alignment with the gradient of each layer. The effective projection matrices are obtained via a simple matmul with the DCT matrix in $O(n^3)$ time, followed by a lightweight sorting step to identify the most relevant basis vectors. For large layers, DCT can be computed via Makhoul's $N$-point algorithm based on Fast Fourier Transform (FFT) in $O(n^2 \log(n))$ time. Due to the predefined nature of the orthogonal bases, they are computed once at the start of training. Our numerical experiments on both pre-training and fine-tuning tasks demonstrate the effectiveness of our dual strategy in approximating optimal low-rank projections, obtaining an approach with rank-independent running time that matches the performance of costly SVD/QR-based methods while achieving faster runtime and reduced memory usage by up to $25\%$ across different model sizes. Our code is available at \href{https://github.com/IST-DASLab/ISTA-DASLab-Optimizers/tree/main/ista_daslab_optimizers/fft_low_rank}{ISTA-DASLab-Optimizers}.
URLs: https://github.com/IST-DASLab/ISTA-DASLab-Optimizers/tree/main/ista_daslab_optimizers/fft_low_rank
Authors: Ziyang Cheng, Zhixun Li, Yuhan Li, Yixin Song, Kangyi Zhao, Dawei Cheng, Jia Li, Hong Cheng, Jeffrey Xu Yu
Abstract: Nowadays, real-world data, including graph-structure data, often arrives in a streaming manner, which means that learning systems need to continuously acquire new knowledge without forgetting previously learned information. Although substantial existing works attempt to address catastrophic forgetting in graph machine learning, they are all based on training from scratch with streaming data. With the rise of pretrained models, an increasing number of studies have leveraged their strong generalization ability for continual learning. Therefore, in this work, we attempt to answer whether large language models (LLMs) can mitigate catastrophic forgetting in Graph Continual Learning (GCL). We first point out that current experimental setups for GCL have significant flaws, as the evaluation stage may lead to task ID leakage. Then, we evaluate the performance of LLMs in more realistic scenarios and find that even minor modifications can lead to outstanding results. Finally, based on extensive experiments, we propose a simple-yet-effective method, Simple Graph Continual Learning (SimGCL), that surpasses the previous state-of-the-art GNN-based baseline by around 20% under the rehearsal-free constraint. To facilitate reproducibility, we have developed an easy-to-use benchmark LLM4GCL for training and evaluating existing GCL methods. The code is available at: https://github.com/ZhixunLEE/LLM4GCL.
Authors: Yiding Wang, Fauxu Meng, Xuefeng Zhang, Fan Jiang, Pingzhi Tang, Muhan Zhang
Abstract: Existing parameter-efficient fine-tuning (PEFT) methods for large language models (LLMs), such as LoRA and PiSSA, constrain model updates to low-rank subspaces, limiting their expressiveness and leading to suboptimal performance on complex tasks. To address this, we introduce High-rank Distributed PiSSA (HD-PiSSA), a distributed PEFT approach that initializes orthogonal adapters across different devices and aggregates their delta updates collectively on W for fine-tuning. Unlike Data Parallel LoRA or PiSSA, which maintain identical adapters across all devices, HD-PiSSA assigns different principal components of the pre-trained weights to each GPU, significantly expanding the range of update directions. This results in over 16x higher effective updated ranks than data-parallel LoRA or PiSSA when fine-tuning on 8 GPUs with the same per-device adapter rank. Empirically, we evaluate HD-PiSSA across various challenging downstream tasks, including mathematics, code generation, and multi-task learning. In the multi-task setting, HD-PiSSA achieves average gains of 10.0 absolute points (14.63%) over LoRA and 4.98 points (6.60%) over PiSSA across 12 benchmarks, demonstrating its benefits from the extra optimization flexibility.
Authors: Florian Eichin, Yupei Du, Philipp Mondorf, Maria Matveev, Barbara Plank, Michael A. Hedderich
Abstract: Post-hoc interpretability methods typically attribute a model's behavior to its components, data, or training trajectory in isolation. This leads to explanations that lack a unified view and may miss key interactions. While combining existing methods or applying them at different training stages offers broader insights, such approaches usually lack theoretical support. In this work, we present ExPLAIND, a unified framework that integrates all these perspectives. First, we generalize recent work on gradient path kernels, which reformulate models trained by gradient descent as a kernel machine, to realistic settings like AdamW. We empirically validate that a CNN and a Transformer are accurately replicated by this reformulation. Second, we derive novel parameter- and step-wise influence scores from the kernel feature maps. Their effectiveness for parameter pruning is comparable to existing methods, demonstrating their value for model component attribution. Finally, jointly interpreting model components and data over the training process, we leverage ExPLAIND to analyze a Transformer that exhibits Grokking. Our findings support previously proposed stages of Grokking, while refining the final phase as one of alignment of input embeddings and final layers around a representation pipeline learned after the memorization phase. Overall, ExPLAIND provides a theoretically grounded, unified framework to interpret model behavior and training dynamics.
Authors: Ruishuo Chen, Xun Wang, Rui Hu, Zhuoran Li, Longbo Huang
Abstract: Generative Flow Networks (GFlowNets) are effective at sampling diverse, high-reward objects, but in many real-world settings where new reward queries are infeasible, they must be trained from offline datasets. The prevailing proxy-based training methods are susceptible to error propagation, while existing proxy-free approaches often use coarse constraints that limit exploration. To address these issues, we propose Trajectory-Distilled GFlowNet (TD-GFN), a novel proxy-free training framework. TD-GFN learns dense, transition-level edge rewards from offline trajectories via inverse reinforcement learning to provide rich structural guidance for efficient exploration. Crucially, to ensure robustness, these rewards are used indirectly to guide the policy through DAG pruning and prioritized backward sampling of training trajectories. This ensures that final gradient updates depend only on ground-truth terminal rewards from the dataset, thereby preventing the error propagation. Experiments show that TD-GFN significantly outperforms a broad range of existing baselines in both convergence speed and final sample quality, establishing a more robust and efficient paradigm for offline GFlowNet training.
Authors: Ryota Ushio, Takashi Ishida, Masashi Sugiyama
Abstract: While the performance of machine learning systems has experienced significant improvement in recent years, relatively little attention has been paid to the fundamental question: to what extent can we improve our models? This paper provides a means of answering this question in the setting of binary classification, which is practical and theoretically supported. We extend a previous work that utilizes soft labels for estimating the Bayes error, the optimal error rate, in two important ways. First, we theoretically investigate the properties of the bias of the hard-label-based estimator discussed in the original work. We reveal that the decay rate of the bias is adaptive to how well the two class-conditional distributions are separated, and it can decay significantly faster than the previous result suggested as the number of hard labels per instance grows. Second, we tackle a more challenging problem setting: estimation with corrupted soft labels. One might be tempted to use calibrated soft labels instead of clean ones. However, we reveal that calibration guarantee is not enough, that is, even perfectly calibrated soft labels can result in a substantially inaccurate estimate. Then, we show that isotonic calibration can provide a statistically consistent estimator under an assumption weaker than that of the previous work. Our method is instance-free, i.e., we do not assume access to any input instances. This feature allows it to be adopted in practical scenarios where the instances are not available due to privacy issues. Experiments with synthetic and real-world datasets show the validity of our methods and theory.
Authors: Xudong Wang, Ziheng Sun, Chris Ding, Jicong Fan
Abstract: This work proposes a framework LGKDE that learns kernel density estimation for graphs. The key challenge in graph density estimation lies in effectively capturing both structural patterns and semantic variations while maintaining theoretical guarantees. Combining graph kernels and kernel density estimation (KDE) is a standard approach to graph density estimation, but has unsatisfactory performance due to the handcrafted and fixed features of kernels. Our method LGKDE leverages graph neural networks to represent each graph as a discrete distribution and utilizes maximum mean discrepancy to learn the graph metric for multi-scale KDE, where all parameters are learned by maximizing the density of graphs relative to the density of their well-designed perturbed counterparts. The perturbations are conducted on both node features and graph spectra, which helps better characterize the boundary of normal density regions. Theoretically, we establish consistency and convergence guarantees for LGKDE, including bounds on the mean integrated squared error, robustness, and generalization. We validate LGKDE by demonstrating its effectiveness in recovering the underlying density of synthetic graph distributions and applying it to graph anomaly detection across diverse benchmark datasets. Extensive empirical evaluation shows that LGKDE demonstrates superior performance compared to state-of-the-art baselines on most benchmark datasets.
Authors: Han Wan, Rui Zhang, Hao Sun
Abstract: Learning PDE dynamics from limited data with unknown physics is challenging. Existing neural PDE solvers either require large datasets or rely on known physics (e.g., PDE residuals or handcrafted stencils), leading to limited applicability. To address these challenges, we propose Spectral-Inspired Neural Operator (SINO), which can model complex systems from just 2-5 trajectories, without requiring explicit PDE terms. Specifically, SINO automatically captures both local and global spatial derivatives from frequency indices, enabling a compact representation of the underlying differential operators in physics-agnostic regimes. To model nonlinear effects, it employs a Pi-block that performs multiplicative operations on spectral features, complemented by a low-pass filter to suppress aliasing. Extensive experiments on both 2D and 3D PDE benchmarks demonstrate that SINO achieves state-of-the-art performance, with improvements of 1-2 orders of magnitude in accuracy. Particularly, with only 5 training trajectories, SINO outperforms data-driven methods trained on 1000 trajectories and remains predictive on challenging out-of-distribution cases where other methods fail.
Authors: Xiaomeng Yang, Zhiyu Tan, Junyan Wang, Zhijian Zhou, Hao Li
Abstract: Preference learning has become a central technique for aligning generative models with human expectations. Recently, it has been extended to diffusion models through methods like Direct Preference Optimization (DPO). However, existing approaches such as Diffusion-DPO suffer from two key challenges: timestep-dependent instability, caused by a mismatch between the reverse and forward diffusion processes and by high gradient variance in early noisy timesteps, and off-policy bias arising from the mismatch between optimization and data collection policies. We begin by analyzing the reverse diffusion trajectory and observe that instability primarily occurs at early timesteps with low importance weights. To address these issues, we first propose DPO-C\&M, a practical strategy that improves stability by clipping and masking uninformative timesteps while partially mitigating off-policy bias. Building on this, we introduce SDPO (Importance-Sampled Direct Preference Optimization), a principled framework that incorporates importance sampling into the objective to fully correct for off-policy bias and emphasize informative updates during the diffusion process. Experiments on CogVideoX-2B, CogVideoX-5B, and Wan2.1-1.3B demonstrate that both methods outperform standard Diffusion-DPO, with SDPO achieving superior VBench scores, human preference alignment, and training robustness. These results highlight the importance of timestep-aware, distribution-corrected optimization in diffusion-based preference learning.
Authors: Albert Tseng, Zhaofeng Sun, Christopher De Sa
Abstract: The goal of quantization is to produce a compressed model whose output distribution is as close to the original model's as possible. To do this tractably, most quantization algorithms minimize the immediate activation error of each layer as a proxy for the end-to-end error. However, this ignores the effect of future layers, making it a poor proxy. In this work, we introduce Yet Another Quantization Algorithm (YAQA), an adaptive rounding algorithm that directly considers the error at the network's output. YAQA introduces a series of theoretical results that culminate in the first end-to-end error bounds for quantization algorithms. First, we characterize the convergence time of adaptive rounding algorithms via the structure of their Hessian approximations. We then show that the end-to-end error can be bounded by the approximation's cosine similarity to the true Hessian. This admits a natural Kronecker-factored approximation with corresponding near-optimal Hessian sketches. YAQA is provably better than GPTQ/LDLQ and empirically reduces the error by $\approx 30\%$ over these methods. YAQA even achieves a lower error than quantization aware training. This translates to state of the art performance on downstream tasks, all while adding no inference overhead.
Authors: Giorgos Iacovides, Wuyang Zhou, Chao Li, Qibin Zhao, Danilo Mandic
Abstract: Tensor networks (TNs) provide efficient representations of high-dimensional data, yet identification of the optimal TN structures, the so called tensor network structure search (TN-SS) problem, remains a challenge. Current state-of-the-art (SOTA) algorithms solve TN-SS as a purely numerical optimization problem and require extensive function evaluations, which is prohibitive for real-world applications. In addition, existing methods ignore the valuable domain information inherent in real-world tensor data and lack transparency in their identified TN structures. To this end, we propose a novel TN-SS framework, termed the tnLLM, which incorporates domain information about the data and harnesses the reasoning capabilities of large language models (LLMs) to directly predict suitable TN structures. The proposed framework involves a domain-aware prompting pipeline which instructs the LLM to infer suitable TN structures based on the real-world relationships between tensor modes. In this way, our approach is capable of not only iteratively optimizing the objective function, but also generating domain-aware explanations for the identified structures. Experimental results demonstrate that tnLLM achieves comparable TN-SS objective function values with much fewer function evaluations compared to SOTA algorithms. Furthermore, we demonstrate that the LLM-enabled domain information can be used to find good initializations in the search space for sampling-based SOTA methods to accelerate their convergence while preserving theoretical performance guarantees.
Authors: Chang Liu, Bohao Zhao, Jingtao Ding, Huandong Wang, Yong Li
Abstract: Long-term forecasting of chaotic systems remains a fundamental challenge due to the intrinsic sensitivity to initial conditions and the complex geometry of strange attractors. Conventional approaches, such as reservoir computing, typically require training data that incorporates long-term continuous dynamical behavior to comprehensively capture system dynamics. While advanced deep sequence models can capture transient dynamics within the training data, they often struggle to maintain predictive stability and dynamical coherence over extended horizons. Here, we propose PhyxMamba, a framework that integrates a Mamba-based state-space model with physics-informed principles to forecast long-term behavior of chaotic systems given short-term historical observations on their state evolution. We first reconstruct the attractor manifold with time-delay embeddings to extract global dynamical features. After that, we introduce a generative training scheme that enables Mamba to replicate the physical process. It is further augmented by multi-patch prediction and attractor geometry regularization for physical constraints, enhancing predictive accuracy and preserving key statistical properties of systems. Extensive experiments on simulated and real-world chaotic systems demonstrate that PhyxMamba delivers superior forecasting accuracy and faithfully captures essential statistics from short-term historical observations.
Authors: Liwen Sun, Hao-Ren Yao, Gary Gao, Ophir Frieder, Chenyan Xiong
Abstract: Cancer screening, leading to early detection, saves lives. Unfortunately, existing screening techniques require expensive and intrusive medical procedures, not globally available, resulting in too many lost would-be-saved lives. We present CATCH-FM, CATch Cancer early with Healthcare Foundation Models, a cancer pre-screening methodology that identifies high-risk patients for further screening solely based on their historical medical records. With millions of electronic healthcare records (EHR), we establish the scaling law of EHR foundation models pretrained on medical code sequences, pretrain compute-optimal foundation models of up to 2.4 billion parameters, and finetune them on clinician-curated cancer risk prediction cohorts. In our retrospective evaluation comprising of thirty thousand patients, CATCH-FM achieves strong efficacy, with 50% sensitivity in predicting first cancer risks at 99% specificity cutoff, and outperforming feature-based tree models and both general and medical LLMs by up to 20% AUPRC. Despite significant demographic, healthcare system, and EHR coding differences, CATCH-FM achieves state-of-the-art pancreatic cancer risk prediction on the EHRSHOT few-shot leaderboard, outperforming EHR foundation models pretrained using on-site patient data. Our analysis demonstrates the robustness of CATCH-FM in various patient distributions, the benefits of operating in the ICD code space, and its ability to capture non-trivial cancer risk factors. Our code will be open-sourced.
Authors: Junquan Huang, Zong-Gan Chen, Yuncheng Jiang, Zhi-Hui Zhan
Abstract: GCN-based traveling salesman problem (TSP) solvers face two critical challenges: poor cross-scale generalization for TSPs and high training costs. To address these challenges, we propose a Subgraph-Based Rescaling Graph Convolutional Network (RsGCN). Focusing on the scale-dependent features (i.e., features varied with problem scales) related to nodes and edges, we design the subgraph-based rescaling to normalize edge lengths of subgraphs. Under a unified subgraph perspective, RsGCN can efficiently learn scale-generalizable representations from small-scale TSPs at low cost. To exploit and assess the heatmaps generated by RsGCN, we design a Reconstruction-Based Search (RBS), in which a reconstruction process based on adaptive weight is incorporated to help avoid local optima. Based on a combined architecture of RsGCN and RBS, our solver achieves remarkable generalization and low training cost: with only 3 epochs of training on a mixed-scale dataset containing instances with up to 100 nodes, it can be generalized successfully to 10K-node instances without any fine-tuning. Extensive experiments demonstrate our advanced performance across uniform-distribution instances of 9 different scales from 20 to 10K nodes and 78 real-world instances from TSPLIB, while requiring the fewest learnable parameters and training epochs among neural competitors.
Authors: Andrey Veprikov, Vladimir Solodkin, Alexander Zyl, Andrey Savchenko, Aleksandr Beznosikov
Abstract: The widespread utilization of language models in modern applications is inconceivable without Parameter-Efficient Fine-Tuning techniques, such as low-rank adaptation ($\texttt{LoRA}$), which adds trainable adapters to selected layers. Although $\texttt{LoRA}$ may obtain accurate solutions, it requires significant memory to train large models and intuition on which layers to add adapters. In this paper, we propose a novel method, $\texttt{WeightLoRA}$, which overcomes this issue by adaptive selection of the most critical $\texttt{LoRA}$ heads throughout the optimization process. As a result, we can significantly reduce the number of trainable parameters while maintaining the capability to obtain consistent or even superior metric values. We conduct experiments for a series of competitive benchmarks and DeBERTa, BART, and Llama models, comparing our method with different adaptive approaches. The experimental results demonstrate the efficacy of $\texttt{WeightLoRA}$ and the superior performance of $\texttt{WeightLoRA+}$ in almost all cases.
Authors: Daniil Medyakov, Sergey Stanko, Gleb Molodtsov, Philip Zmushko, Grigoriy Evseev, Egor Petrov, Aleksandr Beznosikov
Abstract: Quite recently, large language models have made a significant breakthrough across various disciplines. However, training them is an extremely resource-intensive task, even for major players with vast computing resources. One of the methods gaining popularity in light of these challenges is Sign-SGD. This method can be applied both as a memory-efficient approach in single-node training and as a gradient compression technique in the distributed learning. Nevertheless, it is impossible to automatically determine the effective stepsize from the theoretical standpoint. Indeed, it depends on the parameters of the dataset to which we do not have access in the real-world learning paradigm. To address this issue, we design several variants of single-node deterministic Sign-SGD. We extend our approaches to practical scenarios: stochastic single-node and multi-node learning, methods with incorporated momentum. We conduct extensive experiments on real machine learning problems that emphasize the practical applicability of our ideas.
Authors: C. Evans Hedges
Abstract: We study $\perp$Grad, a geometry-aware modification to gradient-based optimization that constrains descent directions to address overconfidence, a key limitation of standard optimizers in uncertainty-critical applications. By enforcing orthogonality between gradient updates and weight vectors, $\perp$Grad alters optimization trajectories without architectural changes. On CIFAR-10 with 10% labeled data, $\perp$Grad matches SGD in accuracy while achieving statistically significant improvements in test loss ($p=0.05$), predictive entropy ($p=0.001$), and confidence measures. These effects show consistent trends across corruption levels and architectures. $\perp$Grad is optimizer-agnostic, incurs minimal overhead, and remains compatible with post-hoc calibration techniques. Theoretically, we characterize convergence and stationary points for a simplified $\perp$Grad variant, revealing that orthogonalization constrains loss reduction pathways to avoid confidence inflation and encourage decision-boundary improvements. Our findings suggest that geometric interventions in optimization can improve predictive uncertainty estimates at low computational cost.
Authors: Snir Hordan, Maya Bechler-Speicher, Gur Lifshitz, Nadav Dym
Abstract: Spectral features are widely incorporated within Graph Neural Networks (GNNs) to improve their expressive power, or their ability to distinguish among non-isomorphic graphs. One popular example is the usage of graph Laplacian eigenvectors for positional encoding in MPNNs and Graph Transformers. The expressive power of such Spectrally-enhanced GNNs (SGNNs) is usually evaluated via the k-WL graph isomorphism test hierarchy and homomorphism counting. Yet, these frameworks align poorly with the graph spectra, yielding limited insight into SGNNs' expressive power. We leverage a well-studied paradigm of classifying graphs by their largest eigenvalue multiplicity to introduce an expressivity hierarchy for SGNNs. We then prove that many SGNNs are incomplete even on graphs with distinct eigenvalues. To mitigate this deficiency, we adapt rotation equivariant neural networks to the graph spectra setting to propose a method to provably improve SGNNs' expressivity on simple spectrum graphs. We empirically verify our theoretical claims via an image classification experiment on the MNIST Superpixel dataset and eigenvector canonicalization on graphs from ZINC.
Authors: Geonwoo Cho, Jaemoon Lee, Jaegyun Im, Subi Lee, Jihwan Lee, Sundong Kim
Abstract: Skill-based reinforcement learning (SBRL) enables rapid adaptation in environments with sparse rewards by pretraining a skill-conditioned policy. Effective skill learning requires jointly maximizing both exploration and skill diversity. However, existing methods often face challenges in simultaneously optimizing for these two conflicting objectives. In this work, we propose a new method, Adaptive Multi-objective Projection for balancing Exploration and skill Diversification (AMPED), which explicitly addresses both: during pre-training, a gradient-surgery projection balances the exploration and diversity gradients, and during fine-tuning, a skill selector exploits the learned diversity by choosing skills suited to downstream tasks. Our approach achieves performance that surpasses SBRL baselines across various benchmarks. Through an extensive ablation study, we identify the role of each component and demonstrate that each element in AMPED is contributing to performance. We further provide theoretical and empirical evidence that, with a greedy skill selector, greater skill diversity reduces fine-tuning sample complexity. These results highlight the importance of explicitly harmonizing exploration and diversity and demonstrate the effectiveness of AMPED in enabling robust and generalizable skill learning. Project Page: https://geonwoo.me/amped/
Authors: Marek \v{C}ern\'y
Abstract: Message-passing graph neural networks (MPGNNs) dominate modern graph learning. Typical efforts enhance MPGNN's expressive power by enriching the adjacency-based aggregation. In contrast, we introduce an efficient aggregation over walk incidence-based matrices that are constructed to deliberately trade off some expressivity for stronger and more structured inductive bias. Our approach allows for seamless scaling between classical message-passing and simpler methods based on walks. We rigorously characterize the expressive power at each intermediate step using homomorphism counts over a hierarchy of generalized caterpillar graphs. Based on this foundation, we propose Caterpillar GNNs, whose robust graph-level aggregation successfully tackles a benchmark specifically designed to challenge MPGNNs. Moreover, we demonstrate that, on real-world datasets, Caterpillar GNNs achieve comparable predictive performance while significantly reducing the number of nodes in the hidden layers of the computational graph.
Authors: Seokbin Yoon, Keumjin Lee
Abstract: Aircraft trajectory modeling plays a crucial role in air traffic management (ATM) and is important for various downstream tasks, including conflict detection and landing time prediction. Dataset augmentation by adding synthetically generated trajectory data is necessary to develop a more robust aircraft trajectory model and ensure that the trajectory dataset is sufficient and balanced. We propose a novel framework called ATRADA for aircraft trajectory dataset augmentation. In the proposed framework, a Transformer encoder learns the underlying patterns in the original trajectory dataset and converts each data point into a context vector in the learned latent space. The converted dataset is projected to reduced dimensions using principal component analysis (PCA), and a Gaussian mixture model (GMM) is applied to fit the probability distribution of the data points in the reduced-dimensional space. Finally, new samples are drawn from the fitted GMM, the dimension of the samples is reverted to the original dimension, and the samples are decoded with a multi-layer perceptron (MLP). Several experiments demonstrate that the framework effectively generates new, high-quality synthetic aircraft trajectory data, which were compared to the results of several baselines.
Authors: Zeyu Liu, Yan Li, Yunquan Zhang, Boyang Zhang, Guoyong Jiang, Xin Zhang, Limin Xiao, Weifeng Zhang, Daning Cheng
Abstract: Training large language models typically demands extensive GPU memory and substantial financial investment, which poses a barrier for many small- to medium-sized teams. In this paper, we propose a full-parameter pre-training and fine-tuning framework based on block coordinate descent (BCD), enhanced with engineering optimizations, to enable efficient training of large-scale models on cost-effective RTX 4090, A100 and A800 GPU clusters. Under identical hardware configurations, we reduce the training cost of a 7B model to 33% on A100/A800 and only 2.6% on RTX 4090, compared to standard full-parameter training. It also enables large models previously restricted to A100 clusters to be trained on RTX 4090 without degrading performance. BCD achieves comparable or better accuracy than full-parameter and fine-tuning methods at most cases, with lower GPU consumption and improved hardware utilization.
Authors: Rohan Gupta, Erik Jenner
Abstract: Latent-space monitors aim to detect undesirable behaviours in Large Language Models by leveraging their internal representations rather than relying solely on black-box outputs. These methods have shown promise in identifying behaviours such as deception and unsafe completions. However, these monitors may themselves become training signals, for example, by using problematic samples found in deployment to retrain models. This raises an important question: can models learn to evade such monitors? To evaluate this capability, we introduce RL-Obfuscation, in which LLMs are finetuned via reinforcement learning to evade latent-space monitors while maintaining their blackbox behaviour. We apply RL-Obfuscation to Language Models ranging from 7B to 14B parameters and evaluate their Evasion Success Rate against a suite of monitors. We find that token-level monitors are highly vulnerable to this attack while more holistic monitors, such as max-pooling or attention-based probes, remain robust. Moreover, for these vulnerable monitors, models trained to evade a single static monitor can generalise to evade other unseen monitors. We also find that the models can be trained to conditionally bypass latent-space monitors on only certain inputs. Finally, we study how the models bypass these monitors and find that the model can learn to repurpose tokens to have different internal representations.
Authors: Jinglong Luo, Zhuo Zhang, Yehong Zhang, Shiyu Liu, Ye Dong, Hui Wang, Yue Yu, Xun Zhou, Zenglin Xu
Abstract: Large Language Models (LLMs) have revolutionized numerous fields, yet their adaptation to specialized tasks in privacy-sensitive domains such as healthcare and finance remains constrained due to the scarcity of accessible training data caused by stringent privacy requirements. Secure Multi-party Computation (MPC)-based privacy-preserving machine learning provides theoretical guarantees for the privacy of model parameters and data. However, its application to LLMs has been predominantly limited to inference, as fine-tuning introduces significant efficiency challenges, particularly in backward propagation, optimizer, and self-attention operations. To address these challenges, we propose SecP-Tuning, the first MPC-based framework designed for efficient, privacy-preserving prompt tuning of LLMs. SecP-Tuning innovatively integrates Forward-only Tuning (FoT) through the ``data owner-server interaction" paradigm, effectively removing the need for privacy-preserving computations in backward propagation and optimization processes. Furthermore, it devises an efficient privacy-preserving Random Feature Attention (RFA), effectively mitigating the computational complexity of softmax-based self-attention and circumventing MPC-incompatible nonlinear operations. Experimental results demonstrate that, compared to full-Parameter Supervised Fine-Tuning (SFT) and gradient-based prompt tuning, SecP-Tuning achieves approximately 12 times and 16 times end-to-end acceleration, as well as 18 times and 20 times reductions in communication overhead, respectively. Moreover, in five few-shot tasks, it achieves an average performance score of 82.45, outperforming SFT's 79.90 and prompt tuning's 73.73. Additionally, the ``black-box/API-style" privacy-preserving tuning paradigm of SecP-Tuning effectively avoids memory leakage risks caused by gradient/parameter transmission.
Authors: Guan Zhe Hong, Bhavya Vasudeva, Vatsal Sharan, Cyrus Rashtchian, Prabhakar Raghavan, Rina Panigrahy
Abstract: When large language models (LLMs) use in-context learning (ICL) to solve a new task, they must infer latent concepts from demonstration examples. This raises the question of whether and how transformers represent latent structures as part of their computation. Our work experiments with several controlled tasks, studying this question using mechanistic interpretability. First, we show that in transitive reasoning tasks with a latent, discrete concept, the model successfully identifies the latent concept and does step-by-step concept composition. This builds upon prior work that analyzes single-step reasoning. Then, we consider tasks parameterized by a latent numerical concept. We discover low-dimensional subspaces in the model's representation space, where the geometry cleanly reflects the underlying parameterization. Overall, we show that small and large models can indeed disentangle and utilize latent concepts that they learn in-context from a handful of abbreviated demonstrations.
Authors: Anas Barakat, John Lazarsfeld, Georgios Piliouras, Antonios Varvitsiotis
Abstract: Online multi-agent control problems, where many agents pursue competing and time-varying objectives, are widespread in domains such as autonomous robotics, economics, and energy systems. In these settings, robustness to adversarial disturbances is critical. In this paper, we study online control in multi-agent linear dynamical systems subject to such disturbances. In contrast to most prior work in multi-agent control, which typically assumes noiseless or stochastically perturbed dynamics, we consider an online setting where disturbances can be adversarial, and where each agent seeks to minimize its own sequence of convex losses. Under two feedback models, we analyze online gradient-based controllers with local policy updates. We prove per-agent regret bounds that are sublinear and near-optimal in the time horizon and that highlight different scalings with the number of agents. When agents' objectives are aligned, we further show that the multi-agent control problem induces a time-varying potential game for which we derive equilibrium tracking guarantees. Together, our results take a first step in bridging online control with online learning in games, establishing robust individual and collective performance guarantees in dynamic continuous-state environments.
Authors: Ali Ebrahimpour-Boroojeny, Yian Wang, Hari Sundaram
Abstract: In this paper, we reveal a significant shortcoming in class unlearning evaluations: overlooking the underlying class geometry can cause privacy leakage. We further propose a simple yet effective solution to mitigate this issue. We introduce a membership-inference attack via nearest neighbors (MIA-NN) that uses the probabilities the model assigns to neighboring classes to detect unlearned samples. Our experiments show that existing unlearning methods are vulnerable to MIA-NN across multiple datasets. We then propose a new fine-tuning objective that mitigates this privacy leakage by approximating, for forget-class inputs, the distribution over the remaining classes that a retrained-from-scratch model would produce. To construct this approximation, we estimate inter-class similarity and tilt the target model's distribution accordingly. The resulting Tilted ReWeighting (TRW) distribution serves as the desired distribution during fine-tuning. We also show that across multiple benchmarks, TRW matches or surpasses existing unlearning methods on prior unlearning metrics. More specifically, on CIFAR-10, it reduces the gap with retrained models by 19% and 46% for U-LiRA and MIA-NN scores, accordingly, compared to the SOTA method for each category.
Authors: Filip Rydin, Attila Lischka, Jiaming Wu, Morteza Haghir Chehreghani, Bal\'azs Kulcs\'ar
Abstract: Learning-based methods for routing have gained significant attention in recent years, both in single-objective and multi-objective contexts. Yet, existing methods are unsuitable for routing on multigraphs, which feature multiple edges with distinct attributes between node pairs, despite their strong relevance in real-world scenarios. In this paper, we propose two graph neural network-based methods to address multi-objective routing on multigraphs. Our first approach operates directly on the multigraph by autoregressively selecting edges until a tour is completed. The second model, which is more scalable, first simplifies the multigraph via a learned pruning strategy and then performs autoregressive routing on the resulting simple graph. We evaluate both models empirically, across a wide range of problems and graph distributions, and demonstrate their competitive performance compared to strong heuristics and neural baselines.
Authors: Amr Abourayya, Jens Kleesiek, Bharat Rao, Michael Kamp
Abstract: Data heterogeneity poses a fundamental challenge in federated learning (FL), especially when clients differ not only in distribution but also in the reliability of their predictions across individual examples. While personalized FL (PFL) aims to address this, we observe that many PFL methods fail to outperform two necessary baselines, local training and centralized training. This suggests that meaningful personalization only emerges in a narrow regime, where global models are insufficient, but collaboration across clients still holds value. Our empirical findings point to two key ingredients for success in this regime: adaptivity in collaboration and fine-grained trust, at the level of individual examples. We show that these properties can be achieved within federated semi-supervised learning, where clients exchange predictions over a shared unlabeled dataset. This enables each client to align with public consensus when it is helpful, and disregard it when it is not, without sharing model parameters or raw data. As a concrete realization of this idea, we develop FEDMOSAIC, a personalized co-training method where clients reweight their loss and their contribution to pseudo-labels based on per-example agreement and confidence. FEDMOSAIC outperforms strong FL and PFL baselines across a range of non-IID settings, and we prove convergence under standard smoothness, bounded-variance, and drift assumptions. In contrast to many of these baselines, it also outperforms local and centralized training. These results clarify when federated personalization can be effective, and how fine-grained, trust-aware collaboration enables it.
Authors: Timo Thun, Andrea Merlo, Rory Conlin, Dario Panici, Daniel B\"ockenhoff
Abstract: We present a novel approach to compute three-dimensional Magnetohydrodynamic equilibria by parametrizing Fourier modes with artificial neural networks and compare it to equilibria computed by conventional solvers. The full nonlinear global force residual across the volume in real space is then minimized with first order optimizers. Already,we observe competitive computational cost to arrive at the same minimum residuals computed by existing codes. With increased computational cost,lower minima of the residual are achieved by the neural networks,establishing a new lower bound for the force residual. We use minimally complex neural networks,and we expect significant improvements for solving not only single equilibria with neural networks,but also for computing neural network models valid over continuous distributions of equilibria.
Authors: Hanqun Cao, Xinyi Zhou, Zijun Gao, Chenyu Wang, Xin Gao, Zhi Zhang, Cesar de la Fuente-Nunez, Chunbin Gu, Ge Liu, Pheng-Ann Heng
Abstract: Protein structure prediction often hinges on multiple sequence alignments (MSAs), which underperform on low-homology and orphan proteins. We introduce PLAME, a lightweight MSA design framework that leverages evolutionary embeddings from pretrained protein language models to generate MSAs that better support downstream folding. PLAME couples these embeddings with a conservation--diversity loss that balances agreement on conserved positions with coverage of plausible sequence variation. Beyond generation, we develop (i) an MSA selection strategy to filter high-quality candidates and (ii) a sequence-quality metric that is complementary to depth-based measures and predictive of folding gains. On AlphaFold2 low-homology/orphan benchmarks, PLAME delivers state-of-the-art improvements in structure accuracy (e.g., lDDT/TM-score), with consistent gains when paired with AlphaFold3. Ablations isolate the benefits of the selection strategy, and case studies elucidate how MSA characteristics shape AlphaFold confidence and error modes. Finally, we show PLAME functions as a lightweight adapter, enabling ESMFold to approach AlphaFold2-level accuracy while retaining ESMFold-like inference speed. PLAME thus provides a practical path to high-quality folding for proteins lacking strong evolutionary neighbors.
Authors: Claas Voelcker, Axel Brunnbauer, Marcel Hussing, Michal Nauman, Pieter Abbeel, Eric Eaton, Radu Grosu, Amir-massoud Farahmand, Igor Gilitschenski
Abstract: Score-function based methods for policy learning, such as REINFORCE and PPO, have delivered strong results in game-playing and robotics, yet their high variance often undermines training stability. Using pathwise policy gradients, i.e. computing a derivative by differentiating the objective function, alleviates the variance issues. However, they require an accurate action-conditioned value function, which is notoriously hard to learn without relying on replay buffers for reusing past off-policy data. We present an on-policy algorithm that trains Q-value models purely from on-policy trajectories, unlocking the possibility of using pathwise policy updates in the context of on-policy learning. We show how to combine stochastic policies for exploration with constrained updates for stable training, and evaluate important architectural components that stabilize value function learning. The result, Relative Entropy Pathwise Policy Optimization (REPPO), is an efficient on-policy algorithm that combines the stability of pathwise policy gradients with the simplicity and minimal memory footprint of standard on-policy learning. Compared to state-of-the-art on two standard GPU-parallelized benchmarks, REPPO provides strong empirical performance at superior sample efficiency, wall-clock time, memory footprint, and hyperparameter robustness.
Authors: Fang Wu, Weihao Xuan, Ximing Lu, Mingjie Liu, Yi Dong, Zaid Harchaoui, Yejin Choi
Abstract: Recent advances in LLMs highlight RLVR as a promising method for enhancing AI's capabilities, particularly in solving complex logical tasks. However, it remains unclear whether the current practice of RLVR truly expands a model's reasoning boundary or mainly amplifies high-reward outputs that the base model already knows for improved precision. This study presents an empirical investigation that provides fresh insights into the potential limits of the common practice of RLVR. We examine how, under current training conditions, RLVR can operate as a support-constrained optimization mechanism that may restrict the discovery of entirely original solutions, remaining constrained by the base model's initial distribution. We also identify an entropy-reward trade-off: while the current RLVR recipe reliably enhances precision, it may progressively narrow exploration and potentially overlook correct yet underrepresented solutions. Extensive empirical experiments validate that while the current RLVR recipe consistently improves pass@1, the shrinkage of empirical support generally outweighs the expansion of empirical support under larger sampling budgets, failing to recover correct answers that were previously accessible to the base model. Interestingly, we also observe that while RLVR sometimes increases token-level entropy - resulting in greater uncertainty at each generation step - answer-level entropy declines, indicating that these seemingly more uncertain paths ultimately converge onto a smaller set of distinct answers. Taken together, these findings reveal potential limits of the current RLVR recipe in extending reasoning horizons. Breaking this invisible leash may require future algorithmic innovations such as explicit exploration mechanisms or hybrid strategies that seed probability mass into underrepresented solution regions.
Authors: Zhuokun Chen, Zeren Chen, Jiahao He, Lu Sheng, Mingkui Tan, Jianfei Cai, Bohan Zhuang
Abstract: Chain-of-thought (CoT) enhances the problem-solving ability of large language models (LLMs) but incurs substantial inference cost due to long autoregressive trajectories. Existing acceleration strategies either shorten traces via early stopping or compression, or adopt speculative decoding with a smaller model. However, speculative decoding provides limited gains when model agreement is low and rigidly enforces token-level consistency, overlooking the observation that some smaller models, when correct, produce significantly more concise reasoning traces that could reduce inference length. We introduce R-Stitch, a training-free hybrid decoding framework that leverages token-level entropy as an uncertainty proxy to delegate computation between a small language model (SLM) and an LLM. Our analysis shows that high-entropy tokens are more likely to induce errors, motivating an entropy-guided routing strategy that lets the SLM efficiently handle low-entropy tokens while delegating uncertain ones to the LLM, thereby avoiding full rollbacks and preserving answer quality. We further extend this design with R-Stitch$^{+}$, which learns an adaptive routing policy to adjust the token budget dynamically beyond fixed thresholds. By jointly reducing per-token decoding complexity and the number of generated tokens, our method achieves substantial acceleration with negligible accuracy loss. Concretely, it attains peak speedups of 3.00$\times$ on DeepSeek-R1-Distill-Qwen-7B, 3.85$\times$ on 14B, and 4.10$\times$ on QWQ-32B while maintaining accuracy comparable to full LLM decoding. Moreover, it naturally enables adaptive efficiency--accuracy trade-offs that can be tailored to diverse computational budgets without retraining.
Authors: Zhuo Yang, Jiaqing Xie, Shuaike Shen, Daolang Wang, Yeyun Chen, Ben Gao, Shuzhou Sun, Biqing Qi, Dongzhan Zhou, Lei Bai, Linjiang Chen, Shufei Zhang, Qinying Gu, Jun Jiang, Tianfan Fu, Yuqiang Li
Abstract: Deep learning holds immense promise for spectroscopy, yet research and evaluation in this emerging field often lack standardized formulations. To address this issue, we introduce SpectrumLab, a pioneering unified platform designed to systematize and accelerate deep learning research in spectroscopy. SpectrumLab integrates three core components: a comprehensive Python library featuring essential data processing and evaluation tools, along with leaderboards; an innovative SpectrumAnnotator module that generates high-quality benchmarks from limited seed data; and SpectrumBench, a multi-layered benchmark suite covering 14 spectroscopic tasks and over 10 spectrum types, featuring spectra curated from over 1.2 million distinct chemical substances. Thorough empirical studies on SpectrumBench with 18 cutting-edge multimodal LLMs reveal critical limitations of current approaches. We hope SpectrumLab will serve as a crucial foundation for future advancements in deep learning-driven spectroscopy.
Authors: Biyi Fang, Jean Utke, Truong Vo, Diego Klabjan
Abstract: Transformer architectures dominate modern NLP but often demand heavy computational resources and intricate hyperparameter tuning. To mitigate these challenges, we propose a novel framework, BoostTransformer, that augments transformers with boosting principles through subgrid token selection and importance-weighted sampling. Our method incorporates a least square boosting objective directly into the transformer pipeline, enabling more efficient training and improved performance. Across multiple fine-grained text classification benchmarks, BoostTransformer demonstrates both faster convergence and higher accuracy, surpassing standard transformers while minimizing architectural search overhead.
Authors: Hector Vargas Alvarez, Dimitrios G. Patsatzis, Lucia Russo, Ioannis Kevrekidis, Constantinos Siettos
Abstract: Bridging the microscopic and the macroscopic modelling scales in crowd dynamics constitutes an important open challenge for systematic numerical analysis, optimization and control. We propose a combined manifold and machine learning approach to learn the discrete evolution operator for the emergent crowd dynamics in latent spaces from high-fidelity agent-based simulations. The proposed framework builds upon our previous works on next-generation Equation-free algorithms for learning surrogate models of high-dim. multiscale systems. Our approach is a four-stage one, explicitly conserving the mass of the reconstructed dynamics in the high-dim. space. In the first step, we derive continuous macroscopic fields (densities) from discrete microscopic data (pedestrians' positions) using KDE. In the second step, based on manifold learning, we construct a map from the macroscopic ambient space into the latent space parametrized by a few coordinates based on POD of the corresponding density distribution. The third step involves learning reduced-order surrogate ROMs in the latent space using machine learning techniques, particularly LSTMs networks and MVARs. Finally, we reconstruct the crowd dynamics in the high-dim. space in terms of macroscopic density profiles. With this "embed->learn in latent space->lift back to ambient space" pipeline, we create an effective solution operator of the unavailable macroscopic PDE for the density evolution. For our illustrations, we use SFM to generate data in a corridor with an obstacle, imposing periodic boundary conditions. The numerical results demonstrate high accuracy, robustness, and generalizability, thus allowing for fast and accurate modelling of crowd dynamics from agent-based simulations. Notably, linear MVAR models surpass nonlinear LSTMs in predictive accuracy, while also offering significantly lower complexity and greater interpretability.
Authors: Zian Zhai, Fan Li, Xingyu Tan, Xiaoyang Wang, Wenjie Zhang
Abstract: Vector Quantization (VQ) has recently emerged as a promising approach for learning discrete representations of graph-structured data. However, a fundamental challenge, i.e., codebook collapse, remains underexplored in the graph domain, significantly limiting the expressiveness and generalization of graph tokens.In this paper, we present the first empirical study showing that codebook collapse consistently occurs when applying VQ to graph data, even with mitigation strategies proposed in vision or language domains. To understand why graph VQ is particularly vulnerable to collapse, we provide a theoretical analysis and identify two key factors: early assignment imbalances caused by redundancy in graph features and structural patterns, and self-reinforcing optimization loops in deterministic VQ. To address these issues, we propose RGVQ, a novel framework that integrates graph topology and feature similarity as explicit regularization signals to enhance codebook utilization and promote token diversity. RGVQ introduces soft assignments via Gumbel-Softmax reparameterization, ensuring that all codewords receive gradient updates. In addition, RGVQ incorporates a structure-aware contrastive regularization to penalize the token co-assignments among dissimilar node pairs. Extensive experiments demonstrate that RGVQ substantially improves codebook utilization and consistently boosts the performance of state-of-the-art graph VQ backbones across multiple downstream tasks, enabling more expressive and transferable graph token representations.
Authors: Xin Wu, Fei Teng, Ji Zhang, Xingwang Li, Yuxuan Liang
Abstract: An ideal time series classification (TSC) should be able to capture invariant representations, but achieving reliable performance on out-of-distribution (OOD) data remains a core obstacle. This obstacle arises from the way models inherently entangle domain-specific and label-relevant features, resulting in spurious correlations. While feature disentanglement aims to solve this, current methods are largely unguided, lacking the semantic direction required to isolate truly universal features. To address this, we propose an end-to-end Energy-Regularized Information for Shift-Robustness (ERIS) framework to enable guided and reliable feature disentanglement. The core idea is that effective disentanglement requires not only mathematical constraints but also semantic guidance to anchor the separation process. ERIS incorporates three key mechanisms to achieve this goal. Specifically, we first introduce an energy-guided calibration mechanism, which provides crucial semantic guidance for the separation, enabling the model to self-calibrate. Additionally, a weight-level orthogonality strategy enforces structural independence between domain-specific and label-relevant features, thereby mitigating their interference. Moreover, an auxiliary adversarial generalization mechanism enhances robustness by injecting structured perturbations. Experiments across four benchmarks demonstrate that ERIS achieves a statistically significant improvement over state-of-the-art baselines, consistently securing the top performance rank.
Authors: Xudong Wang, Guoming Tang, Junyu Xue, Srinivasan Keshav, Tongxin Li, Chris Ding
Abstract: Non-Intrusive Load Monitoring (NILM) offers a cost-effective method to obtain fine-grained appliance-level energy consumption in smart homes and building applications. However, the increasing adoption of behind-the-meter (BTM) energy sources such as solar panels and battery storage poses new challenges for conventional NILM methods that rely solely on at-the-meter data. The energy injected from the BTM sources can obscure the power signatures of individual appliances, leading to a significant decrease in NILM performance. To address this challenge, we present DualNILM, a deep multi-task learning framework designed for the dual tasks of appliance state recognition and injected energy identification. Using a Transformer-based architecture that integrates sequence-to-point and sequence-to-sequence strategies, DualNILM effectively captures multiscale temporal dependencies in the aggregate power consumption patterns, allowing for accurate appliance state recognition and energy injection identification. Extensive evaluation on self-collected and synthesized datasets demonstrates that DualNILM maintains an excellent performance for dual tasks in NILM, much outperforming conventional methods. Our work underscores the framework's potential for robust energy disaggregation in modern energy systems with renewable penetration. Synthetic photovoltaic augmented datasets with realistic injection simulation methodology will be open-sourced after review.
Authors: Benjamin Wei Hao Chin, Yuin Torng Yew, Haocheng Wu, Lanxin Liang, Chow Khuen Chan, Norita Mohd Zain, Siti Balqis Samdin, Sim Kuan Goh
Abstract: Classification of sleep stages is essential for assessing sleep quality and diagnosing sleep disorders. However, manual inspection of EEG characteristics for each stage is time-consuming and prone to human error. Although machine learning and deep learning methods have been actively developed, they continue to face challenges arising from the non-stationarity and variability of electroencephalography (EEG) and electrooculography (EOG) signals across diverse clinical configurations, often resulting in poor generalization. In this work, we propose SleepDIFFormer, a multi-channel differential transformer framework for heterogeneous EEG-EOG representation learning. SleepDIFFormer is trained across multiple sleep staging datasets, each treated as a source domain, with the goal of generalizing to unseen target domains. Specifically, it employs a Multi-channel Differential Transformer Architecture (MDTA) designed to process raw EEG and EOG signals while incorporating cross-domain alignment. Our approach mitigates spatial and temporal attention noise and learns a domain-invariant EEG-EOG representation through feature distribution alignment across datasets, thereby enhancing generalization to new domains. Empirically, we evaluated SleepDIFFormer on five diverse sleep staging datasets under domain generalization settings and benchmarked it against existing approaches, achieving state-of-the-art performance. We further conducted a comprehensive ablation study and interpreted the differential attention weights, demonstrating their relevance to characteristic sleep EEG patterns. These findings advance the development of automated sleep stage classification and highlight its potential in quantifying sleep architecture and detecting abnormalities that disrupt restorative rest. Our source code and checkpoint are made publicly available at https://github.com/Ben1001409/SleepDIFFormer
Authors: David Chanin, Adri\`a Garriga-Alonso
Abstract: Sparse Autoencoders (SAEs) extract features from LLM internal activations, meant to correspond to interpretable concepts. A core SAE training hyperparameter is L0: how many SAE features should fire per token on average. Existing work compares SAE algorithms using sparsity-reconstruction tradeoff plots, implying L0 is a free parameter with no single correct value aside from its effect on reconstruction. In this work we study the effect of L0 on SAEs, and show that if L0 is not set correctly, the SAE fails to disentangle the underlying features of the LLM. If L0 is too low, the SAE will mix correlated features to improve reconstruction. If L0 is too high, the SAE finds degenerate solutions that also mix features. Further, we present a proxy metric that can help guide the search for the correct L0 for an SAE on a given training distribution. We show that our method finds the correct L0 in toy models and coincides with peak sparse probing performance in LLM SAEs. We find that most commonly used SAEs have an L0 that is too low. Our work shows that L0 must be set correctly to train SAEs with correct features.
Authors: Jerry Yao-Chieh Hu, Hude Liu, Jennifer Yuntong Zhang, Han Liu
Abstract: We prove that a minimal Transformer with frozen weights emulates a broad class of algorithms by in-context prompting. We formalize two modes of in-context algorithm emulation. In the task-specific mode, for any continuous function $f: \mathbb{R} \to \mathbb{R}$, we show the existence of a single-head softmax attention layer whose forward pass reproduces functions of the form $f(w^\top x - y)$ to arbitrary precision. This general template subsumes many popular machine learning algorithms (e.g., gradient descent, linear regression, ridge regression). In the prompt-programmable mode, we prove universality: a single fixed-weight two-layer softmax attention module emulates all algorithms from the task-specific class (i.e., each implementable by a single softmax attention) via only prompting. Our key idea is to construct prompts that encode an algorithm's parameters into token representations, creating sharp dot-product gaps that force the softmax attention to follow the intended computation. This construction requires no feed-forward layers and no parameter updates. All adaptation happens through the prompt alone. Numerical results corroborate our theory. These findings forge a direct link between in-context learning and algorithmic emulation, and offer a simple mechanism for large Transformers to serve as prompt-programmable libraries of algorithms. They illuminate how GPT-style foundation models may swap algorithms via prompts alone, and establish a form of algorithmic universality in modern Transformer models.
Authors: Mikael Henaff, Scott Fujimoto, Michael Matthews, Michael Rabbat
Abstract: Hierarchical reinforcement learning (RL) has the potential to enable effective decision-making over long timescales. Existing approaches, while promising, have yet to realize the benefits of large-scale training. In this work, we identify and solve several key challenges in scaling online hierarchical RL to high-throughput environments. We propose Scalable Option Learning (SOL), a highly scalable hierarchical RL algorithm which achieves a ~35x higher throughput compared to existing hierarchical methods. To demonstrate SOL's performance and scalability, we train hierarchical agents using 30 billion frames of experience on the complex game of NetHack, significantly surpassing flat agents and demonstrating positive scaling trends. We also validate SOL on MiniHack and Mujoco environments, showcasing its general applicability. Our code is open sourced at: github.com/facebookresearch/sol.
Authors: Emmanuel Jehanno, Romain Menegaux, Julien Mairal, Sergei Grudinin
Abstract: Crystalline structure prediction is an essential prerequisite for designing materials with targeted properties. Yet, it is still an open challenge in materials design and drug discovery. Despite recent advances in computational materials science, accurately predicting three-dimensional non-polymeric crystal structures remains elusive. In this work, we focus on the molecular assembly problem, where a set $\mathcal{S}$ of identical rigid molecules is packed to form a crystalline structure. Such a simplified formulation provides a useful approximation to the actual problem. However, while recent state-of-the-art methods have increasingly adopted sophisticated techniques, the underlying learning objective remains ill-posed. We propose a better formulation that introduces a loss function capturing key geometric molecular properties while ensuring permutation invariance over $\mathcal{S}$. Remarkably, we demonstrate that within this framework, a simple regression model already outperforms prior approaches, including flow matching techniques, on the COD-Cluster17 benchmark, a curated non-polymeric subset of the Crystallography Open Database (COD).
Authors: Matthew Zhang, Jihao Andreas Lin, Krzysztof Choromanski, Adrian Weller, Richard E. Turner, Isaac Reid
Abstract: We study the application of graph random features (GRFs) - a recently introduced stochastic estimator of graph node kernels - to scalable Gaussian processes on discrete input spaces. We prove that (under mild assumptions) Bayesian inference with GRFs enjoys $O(N^{3/2})$ time complexity with respect to the number of nodes $N$, compared to $O(N^3)$ for exact kernels. Substantial wall-clock speedups and memory savings unlock Bayesian optimisation on graphs with over $10^6$ nodes on a single computer chip, whilst preserving competitive performance.
Authors: Georgia Channing, Avijit Ghosh
Abstract: Artificial intelligence promises to accelerate scientific discovery, yet its benefits remain unevenly distributed. While technical obstacles such as scarce data, fragmented standards, and unequal access to computation are significant, we argue that the primary barriers are social and institutional. Narratives that defer progress to speculative "AI scientists," the undervaluing of data and infrastructure contributions, misaligned incentives, and gaps between domain experts and machine learning researchers all constrain impact. We highlight four interconnected challenges: community dysfunction, research priorities misaligned with upstream needs, data fragmentation, and infrastructure inequities. We argue that their roots lie in cultural and organizational practices. Addressing them requires not only technical innovation but also intentional community-building, cross-disciplinary education, shared benchmarks, and accessible infrastructure. We call for reframing AI for science as a collective social project, where sustainable collaboration and equitable participation are treated as prerequisites for technical progress.
Authors: Florian Wiesner, Matthias Wessling, Stephen Baek
Abstract: Foundation models have revolutionized natural language processing through a ``train once, deploy anywhere'' paradigm, where a single pre-trained model adapts to countless downstream tasks without retraining. Access to a Physics Foundation Model (PFM) would be transformative -- democratizing access to high-fidelity simulations, accelerating scientific discovery, and eliminating the need for specialized solver development. Yet current physics-aware machine learning approaches remain fundamentally limited to single, narrow domains and require retraining for each new system. We present the General Physics Transformer (GPhyT), trained on 1.8 TB of diverse simulation data, that demonstrates foundation model capabilities are achievable for physics. Our key insight is that transformers can learn to infer governing dynamics from context, enabling a single model to simulate fluid-solid interactions, shock waves, thermal convection, and multi-phase dynamics without being told the underlying equations. GPhyT achieves three critical breakthroughs: (1) superior performance across multiple physics domains, outperforming specialized architectures by up to 29x, (2) zero-shot generalization to entirely unseen physical systems through in-context learning, and (3) stable long-term predictions through 50-timestep rollouts. By establishing that a single model can learn generalizable physical principles from data alone, this work opens the path toward a universal PFM that could transform computational science and engineering.
Authors: Juan Diego Toscano, Daniel T. Chen, Vivek Oommen, J\'er\^ome Darbon, George Em Karniadakis
Abstract: Residual-based adaptive strategies are widely used in scientific machine learning but remain largely heuristic. We introduce a unifying variational framework that formalizes these methods by integrating convex transformations of the residual. Different transformations correspond to distinct objective functionals: exponential weights target the minimization of uniform error, while linear weights recover the minimization of quadratic error. Within this perspective, adaptive weighting is equivalent to selecting sampling distributions that optimize the primal objective, thereby linking discretization choices directly to error metrics. This principled approach yields three benefits: (1) it enables systematic design of adaptive schemes across norms, (2) reduces discretization error through variance reduction of the loss estimator, and (3) enhances learning dynamics by improving the gradient signal-to-noise ratio. Extending the framework to operator learning, we demonstrate substantial performance gains across optimizers and architectures. Our results provide a theoretical justification of residual-based adaptivity and establish a foundation for principled discretization and training strategies.
Authors: Davide Ettori, Nastaran Darabi, Sureshkumar Senthilkumar, Amit Ranjan Trivedi
Abstract: Large deep learning models such as BERT and ResNet achieve state-of-the-art performance but are costly to deploy at the edge due to their size and compute demands. We present RMT-KD, a compression method that leverages Random Matrix Theory (RMT) for knowledge distillation to iteratively reduce network size. Instead of pruning or heuristic rank selection, RMT-KD preserves only informative directions identified via the spectral properties of hidden representations. RMT-based causal reduction is applied layer by layer with self-distillation to maintain stability and accuracy. On GLUE, AG News, and CIFAR-10, RMT-KD achieves up to 80% parameter reduction with only 2% accuracy loss, delivering 2.8x faster inference and nearly halved power consumption. These results establish RMT-KD as a mathematically grounded approach to network distillation.
Authors: Davide Ettori, Nastaran Darabi, Sina Tayebati, Ranganath Krishnan, Mahesh Subedar, Omesh Tickoo, Amit Ranjan Trivedi
Abstract: Large language models (LLMs) offer broad utility but remain prone to hallucination and out-of-distribution (OOD) errors. We propose EigenTrack, an interpretable real-time detector that uses the spectral geometry of hidden activations, a compact global signature of model dynamics. By streaming covariance-spectrum statistics such as entropy, eigenvalue gaps, and KL divergence from random baselines into a lightweight recurrent classifier, EigenTrack tracks temporal shifts in representation structure that signal hallucination and OOD drift before surface errors appear. Unlike black- and grey-box methods, it needs only a single forward pass without resampling. Unlike existing white-box detectors, it preserves temporal context, aggregates global signals, and offers interpretable accuracy-latency trade-offs.
Authors: Yinglong Zou, Juan Zhai, Chunrong Fang, Zhenyu Chen
Abstract: Deep learning models play a vital role in autonomous driving systems, supporting critical functions such as environmental perception. To accelerate model inference, these deep learning models' deployment relies on automotive deep learning frameworks, for example, PaddleInference in Apollo and TensorRT in AutoWare. However, unlike deploying deep learning models on the cloud, vehicular environments experience extreme ambient temperatures varying from -40{\deg}C to 50{\deg}C, significantly impacting GPU temperature. Additionally, heats generated when computing further lead to the GPU temperature increase. These temperature fluctuations lead to dynamic GPU frequency adjustments through mechanisms such as DVFS. However, automotive deep learning frameworks are designed without considering the impact of temperature-induced frequency variations. When deployed on temperature-varying GPUs, these frameworks suffer critical quality issues: compute-intensive operators face delays or errors, high/mixed-precision operators suffer from precision errors, and time-series operators suffer from synchronization issues. The above quality issues cannot be detected by existing deep learning framework testing methods because they ignore temperature's effect on the deep learning framework quality. To bridge this gap, we propose ThermalGuardian, the first automotive deep learning framework testing method under temperature-varying environments. Specifically, ThermalGuardian generates test input models using model mutation rules targeting temperature-sensitive operators, simulates GPU temperature fluctuations based on Newton's law of cooling, and controls GPU frequency based on real-time GPU temperature.
Authors: Lorenzo Guerra, Thomas Chapuis, Guillaume Duc, Pavlo Mozharovskyi, Van-Tam Nguyen
Abstract: Detecting intrusions in network traffic is a challenging task, particularly under limited supervision and constantly evolving attack patterns. While recent works have leveraged graph neural networks for network intrusion detection, they often decouple representation learning from anomaly detection, limiting the utility of the embeddings for identifying attacks. We propose GraphIDS, a self-supervised intrusion detection model that unifies these two stages by learning local graph representations of normal communication patterns through a masked autoencoder. An inductive graph neural network embeds each flow with its local topological context to capture typical network behavior, while a Transformer-based encoder-decoder reconstructs these embeddings, implicitly learning global co-occurrence patterns via self-attention without requiring explicit positional information. During inference, flows with unusually high reconstruction errors are flagged as potential intrusions. This end-to-end framework ensures that embeddings are directly optimized for the downstream task, facilitating the recognition of malicious traffic. On diverse NetFlow benchmarks, GraphIDS achieves up to 99.98% PR-AUC and 99.61% macro F1-score, outperforming baselines by 5-25 percentage points.
Authors: Dehao Zhang, Malu Zhang, Shuai Wang, Jingya Wang, Wenjie Wei, Zeyu Ma, Guoqing Wang, Yang Yang, Haizhou Li
Abstract: The explosive growth in sequence length has intensified the demand for effective and efficient long sequence modeling. Benefiting from intrinsic oscillatory membrane dynamics, Resonate-and-Fire (RF) neurons can efficiently extract frequency components from input signals and encode them into spatiotemporal spike trains, making them well-suited for long sequence modeling. However, RF neurons exhibit limited effective memory capacity and a trade-off between energy efficiency and training speed on complex temporal tasks. Inspired by the dendritic structure of biological neurons, we propose a Dendritic Resonate-and-Fire (D-RF) model, which explicitly incorporates a multi-dendritic and soma architecture. Each dendritic branch encodes specific frequency bands by utilizing the intrinsic oscillatory dynamics of RF neurons, thereby collectively achieving comprehensive frequency representation. Furthermore, we introduce an adaptive threshold mechanism into the soma structure that adjusts the threshold based on historical spiking activity, reducing redundant spikes while maintaining training efficiency in long sequence tasks. Extensive experiments demonstrate that our method maintains competitive accuracy while substantially ensuring sparse spikes without compromising computational efficiency during training. These results underscore its potential as an effective and efficient solution for long sequence modeling on edge platforms.
Authors: Kuiye Ding, Fanda Fan, Chunyi Hou, Zheya Wang, Lei Wang, Zhengxin Yang, Jianfeng Zhan
Abstract: Multivariate time series forecasting is essential in domains such as finance, transportation, climate, and energy. However, existing patch-based methods typically adopt fixed-length segmentation, overlooking the heterogeneity of local temporal dynamics and the decoding heterogeneity of forecasting. Such designs lose details in information-dense regions, introduce redundancy in stable segments, and fail to capture the distinct complexities of short-term and long-term horizons. We propose TimeMosaic, a forecasting framework that aims to address temporal heterogeneity. TimeMosaic employs adaptive patch embedding to dynamically adjust granularity according to local information density, balancing motif reuse with structural clarity while preserving temporal continuity. In addition, it introduces segment-wise decoding that treats each prediction horizon as a related subtask and adapts to horizon-specific difficulty and information requirements, rather than applying a single uniform decoder. Extensive evaluations on benchmark datasets demonstrate that TimeMosaic delivers consistent improvements over existing methods, and our model trained on the large-scale corpus with 321 billion observations achieves performance competitive with state-of-the-art TSFMs.
Authors: Rami Zewail
Abstract: Learning robust representations for biosignals is often hampered by the challenge of designing effective data augmentations.Traditional methods can fail to capture the complex variations inherent in physiological data. Within this context, we propose a novel hybrid framework, Diffusion-Augmented Contrastive Learning (DACL), that fuses concepts from diffusion models and supervised contrastive learning. The DACL framework operates on a latent space created by a lightweight Variational Autoencoder (VAE) trained on our novel Scattering Transformer (ST) features [12]. It utilizes the diffusion forward process as a principled data augmentation technique to generate multiple noisy views of these latent embeddings. A U-Net style encoder is then trained with a supervised contrastive objective to learn a representation that balances class discrimination with robustness to noise across various diffusion time steps. We evaluated this proof-of-concept method on the PhysioNet 2017 ECG dataset, achieving a competitive AUROC of 0.7815. This work establishes a new paradigm for representation learning by using the diffusion process itself to drive the contrastive objective, creating noise-invariant embeddings that demonstrate a strong foundation for class separability.
Authors: Qiyu Chen, Guozhang Chen
Abstract: The remarkable success of large-scale models is fundamentally tied to scaling laws, yet the finite nature of high-quality data presents a looming challenge. One of the next frontiers in modeling is data efficiency: the ability to learn more from less. A model's inductive bias is a critical lever for this, but foundational sequence models like State Space Models (SSMs) rely on a fixed bias. This fixed prior is sample-inefficient when a task's underlying structure does not match. In this work, we introduce a principled framework to solve this problem. We first formalize the inductive bias of linear time-invariant SSMs through an SSM-induced kernel, mathematically and empirically proving its spectrum is directly governed by the model's frequency response. Further, we propose a method of Task-Dependent Initialization (TDI): power spectrum matching, a fast and efficient method that aligns the model's inductive bias with the task's spectral characteristics before large-scale training. Our experiments on a diverse set of real-world benchmarks show that TDI significantly improves generalization and sample efficiency, particularly in low-data regimes. This work provides a theoretical and practical tool to create more data-efficient models, a crucial step towards sustainable scaling.
Authors: Zhengxiao Li, Liming Lu, Xu Zheng, Siyuan Liang, Zhenghan Chen, Yongbin Zhou, Shuchao Pang
Abstract: Data-Free Robustness Distillation (DFRD) aims to transfer the robustness from the teacher to the student without accessing the training data. While existing methods focus on overall robustness, they overlook the robust fairness issues, leading to severe disparity of robustness across different categories. In this paper, we find two key problems: (1) student model distilled with equal class proportion data behaves significantly different across distinct categories; and (2) the robustness of student model is not stable across different attacks target. To bridge these gaps, we present the first Fairness-Enhanced data-free Robustness Distillation (FERD) framework to adjust the proportion and distribution of adversarial examples. For the proportion, FERD adopts a robustness-guided class reweighting strategy to synthesize more samples for the less robust categories, thereby improving robustness of them. For the distribution, FERD generates complementary data samples for advanced robustness distillation. It generates Fairness-Aware Examples (FAEs) by enforcing a uniformity constraint on feature-level predictions, which suppress the dominance of class-specific non-robust features, providing a more balanced representation across all categories. Then, FERD constructs Uniform-Target Adversarial Examples (UTAEs) from FAEs by applying a uniform target class constraint to avoid biased attack directions, which distribute the attack targets across all categories and prevents overfitting to specific vulnerable categories. Extensive experiments on three public datasets show that FERD achieves state-of-the-art worst-class robustness under all adversarial attack (e.g., the worst-class robustness under FGSM and AutoAttack are improved by 15.1\% and 6.4\% using MobileNet-V2 on CIFAR-10), demonstrating superior performance in both robustness and fairness aspects.
Authors: Runqi Lin, Alasdair Paren, Suqin Yuan, Muyang Li, Philip Torr, Adel Bibi, Tongliang Liu
Abstract: The integration of new modalities enhances the capabilities of multimodal large language models (MLLMs) but also introduces additional vulnerabilities. In particular, simple visual jailbreaking attacks can manipulate open-source MLLMs more readily than sophisticated textual attacks. However, these underdeveloped attacks exhibit extremely limited cross-model transferability, failing to reliably identify vulnerabilities in closed-source MLLMs. In this work, we analyse the loss landscape of these jailbreaking attacks and find that the generated attacks tend to reside in high-sharpness regions, whose effectiveness is highly sensitive to even minor parameter changes during transfer. To further explain the high-sharpness localisations, we analyse their feature representations in both the intermediate layers and the spectral domain, revealing an improper reliance on narrow layer representations and semantically poor frequency components. Building on this, we propose a Feature Over-Reliance CorrEction (FORCE) method, which guides the attack to explore broader feasible regions across layer features and rescales the influence of frequency features according to their semantic content. By eliminating non-generalizable reliance on both layer and spectral features, our method discovers flattened feasible regions for visual jailbreaking attacks, thereby improving cross-model transferability. Extensive experiments demonstrate that our approach effectively facilitates visual red-teaming evaluations against closed-source MLLMs.
Authors: Hao Wu, Yuan Gao, Fan Xu, Fan Zhang, Qingsong Wen, Kun Wang, Xiaomeng Huang, Xian Wu
Abstract: Accurately forecasting the long-term evolution of turbulence represents a grand challenge in scientific computing and is crucial for applications ranging from climate modeling to aerospace engineering. Existing deep learning methods, particularly neural operators, often fail in long-term autoregressive predictions, suffering from catastrophic error accumulation and a loss of physical fidelity. This failure stems from their inability to simultaneously capture the distinct mathematical structures that govern turbulent dynamics: local, dissipative effects and global, non-local interactions. In this paper, we propose the {\textbf{\underline{D}}}ifferential-{\textbf{\underline{I}}}ntegral {\textbf{\underline{N}}}eural {\textbf{\underline{O}}}perator (\method{}), a novel framework designed from a first-principles approach of operator decomposition. \method{} explicitly models the turbulent evolution through parallel branches that learn distinct physical operators: a local differential operator, realized by a constrained convolutional network that provably converges to a derivative, and a global integral operator, captured by a Transformer architecture that learns a data-driven global kernel. This physics-based decomposition endows \method{} with exceptional stability and robustness. Through extensive experiments on the challenging 2D Kolmogorov flow benchmark, we demonstrate that \method{} significantly outperforms state-of-the-art models in long-term forecasting. It successfully suppresses error accumulation over hundreds of timesteps, maintains high fidelity in both the vorticity fields and energy spectra, and establishes a new benchmark for physically consistent, long-range turbulence forecast.
Authors: Constantinos Daskalakis, Ioannis Panageas
Abstract: Motivated by applications in Optimization, Game Theory, and the training of Generative Adversarial Networks, the convergence properties of first order methods in min-max problems have received extensive study. It has been recognized that they may cycle, and there is no good understanding of their limit points when they do not. When they converge, do they converge to local min-max solutions? We characterize the limit points of two basic first order methods, namely Gradient Descent/Ascent (GDA) and Optimistic Gradient Descent Ascent (OGDA). We show that both dynamics avoid unstable critical points for almost all initializations. Moreover, for small step sizes and under mild assumptions, the set of \{OGDA\}-stable critical points is a superset of \{GDA\}-stable critical points, which is a superset of local min-max solutions (strict in some cases). The connecting thread is that the behavior of these dynamics can be studied from a dynamical systems perspective.
Authors: Eleonora Lopez, Eleonora Grassucci, Danilo Comminiello
Abstract: Radiologists interpret mammography exams by jointly analyzing all four views, as correlations among them are crucial for accurate diagnosis. Recent methods employ dedicated fusion blocks to capture such dependencies, but these are often hindered by view dominance, training instability, and computational overhead. To address these challenges, we introduce multi-view hypercomplex learning, a novel learning paradigm for multi-view breast cancer classification based on parameterized hypercomplex neural networks (PHNNs). Thanks to hypercomplex algebra, our models intrinsically capture both intra- and inter-view relations. We propose PHResNets for two-view exams and two complementary four-view architectures: PHYBOnet, optimized for efficiency, and PHYSEnet, optimized for accuracy. Extensive experiments demonstrate that our approach consistently outperforms state-of-the-art multi-view models, while also generalizing across radiographic modalities and tasks such as disease classification from chest X-rays and multimodal brain tumor segmentation. Full code and pretrained models are available at https://github.com/ispamm/PHBreast.
Authors: Yifan Zhang, Chen Huang, Yueke Zhang, Huajie Shao, Kevin Leach, Yu Huang
Abstract: Binary code analysis and comprehension is critical to applications in reverse engineering and computer security tasks where source code is not available. Unfortunately, unlike source code, binary code lacks semantics and is more difficult for human engineers to understand and analyze. In this paper, we present ContraBin, a contrastive learning technique that integrates source code and comment information along with binaries to create an embedding capable of aiding binary analysis and comprehension tasks. Specifically, we present three components in ContraBin: (1) a primary contrastive learning method for initial pre-training, (2) a simplex interpolation method to integrate source code, comments, and binary code, and (3) an intermediate representation learning algorithm to train a binary code embedding. We further analyze the impact of human-written and synthetic comments on binary code comprehension tasks, revealing a significant performance disparity. While synthetic comments provide substantial benefits, human-written comments are found to introduce noise, even resulting in performance drops compared to using no comments. These findings reshape the narrative around the role of comment types in binary code analysis. We evaluate the effectiveness of ContraBin through four indicative downstream tasks related to binary code: algorithmic functionality classification, function name recovery, code summarization, and reverse engineering. The results show that ContraBin considerably improves performance on all four tasks, measured by accuracy, mean of average precision, and BLEU scores as appropriate. ContraBin is the first language representation model to incorporate source code, binary code, and comments into contrastive code representation learning and is intended to contribute to the field of binary code analysis. The dataset used in this study is available for further research.
Authors: Yiyang Zhang, Junyi Liu, Xiaobo Zhao
Abstract: Focusing on stochastic programming (SP) with covariate information, this paper proposes an empirical risk minimization (ERM) method embedded within a nonconvex piecewise affine decision rule (PADR), which aims to learn the direct mapping from features to optimal decisions. We establish the nonasymptotic consistency result of our PADR-based ERM model for unconstrained problems and asymptotic consistency result for constrained ones. To solve the nonconvex and nondifferentiable ERM problem, we develop an enhanced stochastic majorization-minimization algorithm and establish the asymptotic convergence to (composite strong) directional stationarity along with complexity analysis. We show that the proposed PADR-based ERM method applies to a broad class of nonconvex SP problems with theoretical consistency guarantees and computational tractability. Our numerical study demonstrates the superior performance of PADR-based ERM methods compared to state-of-the-art approaches under various settings, with significantly lower costs, less computation time, and robustness to feature dimensions and nonlinearity of the underlying dependency.
Authors: Iman Rahmaty, Hamed Shah-Mansouri, Ali Movaghar
Abstract: In the realm of mobile edge computing (MEC), efficient computation task offloading plays a pivotal role in ensuring a seamless quality of experience (QoE) for users. Maintaining a high QoE is paramount in today's interconnected world, where users demand reliable services. This challenge stands as one of the most primary key factors contributing to handling dynamic and uncertain mobile environments. In this study, we delve into computation offloading in MEC systems, where strict task processing deadlines and energy constraints can adversely affect the system performance. We formulate the computation task offloading problem as a Markov decision process (MDP) to maximize the long-term QoE of each user individually. We propose a distributed QoE-oriented computation offloading (QECO) algorithm based on deep reinforcement learning (DRL) that empowers mobile devices to make their offloading decisions without requiring knowledge of decisions made by other devices. Through numerical studies, we evaluate the performance of QECO. Simulation results reveal that compared to the state-of-the-art existing works, QECO increases the number of completed tasks by up to 14.4%, while simultaneously reducing task delay and energy consumption by 9.2% and 6.3%, respectively. Together, these improvements result in a significant average QoE enhancement of 37.1%. This substantial improvement is achieved by accurately accounting for user dynamics and edge server workloads when making intelligent offloading decisions. This highlights QECO's effectiveness in enhancing users' experience in MEC systems.
Authors: Yuefeng Peng, Ali Naseh, Amir Houmansadr
Abstract: Deep learning models, while achieving remarkable performances, are vulnerable to membership inference attacks (MIAs). Although various defenses have been proposed, there is still substantial room for improvement in the privacy-utility trade-off. In this work, we introduce a novel defense framework against MIAs by leveraging generative models. The key intuition of our defense is to remove the differences between member and non-member inputs, which is exploited by MIAs, by re-generating input samples before feeding them to the target model. Therefore, our defense, called DIFFENCE, works pre inference, which is unlike prior defenses that are either training-time or post-inference time. A unique feature of DIFFENCE is that it works on input samples only, without modifying the training or inference phase of the target model. Therefore, it can be cascaded with other defense mechanisms as we demonstrate through experiments. DIFFENCE is designed to preserve the model's prediction labels for each sample, thereby not affecting accuracy. Furthermore, we have empirically demonstrated it does not reduce the usefulness of confidence vectors. Through extensive experimentation, we show that DIFFENCE can serve as a robust plug-n-play defense mechanism, enhancing membership privacy without compromising model utility. For instance, DIFFENCE reduces MIA accuracy against an undefended model by 15.8\% and attack AUC by 14.0\% on average across three datasets, all without impacting model utility. By integrating DIFFENCE with prior defenses, we can achieve new state-of-the-art performances in the privacy-utility trade-off. For example, when combined with the state-of-the-art SELENA defense it reduces attack accuracy by 9.3\%, and attack AUC by 10.0\%. DIFFENCE achieves this by imposing a negligible computation overhead, adding only 57ms to the inference time per sample processed on average.
Authors: Alan J. X. Guo, Sihan Sun, Xiang Wei, Mengyi Wei, Xin Chen
Abstract: With the emergence of new storage and communication methods, the insertion, deletion, and substitution (IDS) channel has attracted considerable attention. However, many topics on the IDS channel and the associated Levenshtein distance remain open, making the invention of a novel IDS-correcting code a hard task. Furthermore, current studies on single-IDS-correcting code misalign with the requirements of applications which necessitates the correcting of multiple errors. Compromise solutions have involved shortening codewords to reduce the chance of multiple errors. However, the code rates of existing codes are poor at short lengths, diminishing the overall storage density. In this study, a novel method is introduced for designing high-code-rate single-IDS-correcting codewords through deep Levenshtein distance embedding. A deep learning model is utilized to project the sequences into embedding vectors that preserve the Levenshtein distances between the original sequences. This embedding space serves as a proxy for the complex Levenshtein domain, within which algorithms for codeword search and segment correcting is developed. While the concept underpinning this approach is straightforward, it bypasses the mathematical challenges typically encountered in code design. The proposed method results in a code rate that outperforms existing combinatorial solutions, particularly for designing short-length codewords.
Authors: Ruicheng Ao, Hongyu Chen, David Simchi-Levi, Feng Zhu
Abstract: We consider the problem of online resource allocation with average budget constraints. At each time point the decision maker makes an irrevocable decision of whether to accept or reject a request before the next request arrives with the goal to maximize the cumulative rewards. In contrast to existing literature requiring the total resource consumption is below a certain level, we require the average resource consumption per accepted request does not exceed a given threshold. This problem can be casted as an online knapsack problem with exogenous random budget replenishment, and can find applications in various fields such as online anomaly detection, sequential advertising, and per-capita public service providers. We start with general arrival distributions and show that a simple policy achieves a $O(\sqrt{T})$ regret. We complement the result by showing that such a regret growing rate is in general not improvable. We then shift our focus to discrete arrival distributions. We find that many existing re-solving heuristics in the online resource allocation literature, albeit achieve bounded loss in canonical settings, may incur a $\Omega(\sqrt{T})$ or even a $\Omega(T)$ regret. With the observation that canonical policies tend to be too optimistic and over accept arrivals, we propose a novel policy that incorporates budget safety buffers. It turns out that a little more safety can greatly enhance efficiency -- small additional logarithmic buffers suffice to reduce the regret from $\Omega(\sqrt{T})$ or even $\Omega(T)$ to $O(\ln^2 T)$. From a practical perspective, we extend the policy to the scenario with continuous arrival distributions, time-dependent information structures, as well as unknown $T$. We conduct both synthetic experiments and empirical applications on a time series data of New York City taxi passengers to validate the performance of our proposed policies.
Authors: Samuel Lanthaler, Andrew M. Stuart, Margaret Trautner
Abstract: Operator learning is a variant of machine learning that is designed to approximate maps between function spaces from data. The Fourier Neural Operator (FNO) is one of the main model architectures used for operator learning. The FNO combines linear and nonlinear operations in physical space with linear operations in Fourier space, leading to a parameterized map acting between function spaces. Although in definition, FNOs are objects in continuous space and perform convolutions on a continuum, their implementation is a discretized object performing computations on a grid, allowing efficient implementation via the FFT. Thus, there is a discretization error between the continuum FNO definition and the discretized object used in practice that is separate from other previously analyzed sources of model error. We examine this discretization error here and obtain algebraic rates of convergence in terms of the grid resolution as a function of the input regularity. Numerical experiments that validate the theory and describe model stability are performed. In addition, an algorithm is presented that leverages the discretization error and model error decomposition to optimize computational training time.
Authors: Jiawen Kang, Dongrui Han, Lingwei Meng, Jingyan Zhou, Jinchao Li, Xixin Wu, Helen Meng
Abstract: Alzheimer's Disease (AD) detection employs machine learning classification models to distinguish between individuals with AD and those without. Different from conventional classification tasks, we identify within-class variation as a critical challenge in AD detection: individuals with AD exhibit a spectrum of cognitive impairments. Therefore, simplistic binary AD classification may overlook two crucial aspects: within-class heterogeneity and instance-level imbalance. In this work, we found using a sample score estimator can generate sample-specific soft scores aligning with cognitive scores. We subsequently propose two simple yet effective methods: Soft Target Distillation (SoTD) and Instance-level Re-balancing (InRe), targeting two problems respectively. Based on the ADReSS and CU-MARVEL corpora, we demonstrated and analyzed the advantages of the proposed approaches in detection performance. These findings provide insights for developing robust and reliable AD detection models.
Authors: Xiaoyu Wu, Jiaru Zhang, Zhiwei Steven Wu
Abstract: Diffusion Models (DMs) have become powerful image generation tools, especially for few-shot fine-tuning where a pretrained DM is fine-tuned on a small image set to capture specific styles or objects. Many people upload these personalized checkpoints online, fostering communities such as Civitai and HuggingFace. However, model owners may overlook the data leakage risks when releasing fine-tuned checkpoints. Moreover, concerns regarding copyright violations arise when unauthorized data is used during fine-tuning. In this paper, we ask: "Can training data be extracted from these fine-tuned DMs shared online?" A successful extraction would present not only data leakage threats but also offer tangible evidence of copyright infringement. To answer this, we propose FineXtract, a framework for extracting fine-tuning data. Our method approximates fine-tuning as a gradual shift in the model's learned distribution -- from the original pretrained DM toward the fine-tuning data. By extrapolating the models before and after fine-tuning, we guide the generation toward high-probability regions within the fine-tuned data distribution. We then apply a clustering algorithm to extract the most probable images from those generated using this extrapolated guidance. Experiments on DMs fine-tuned with datasets including WikiArt, DreamBooth, and real-world checkpoints posted online validate the effectiveness of our method, extracting about 20% of fine-tuning data in most cases. The code is available https://github.com/Nicholas0228/FineXtract.
Authors: Sathvik Joel, Jie JW Wu, Fatemeh H. Fard
Abstract: Large Language Models (LLMs) have shown impressive capabilities in code generation for popular programming languages. However, their performance on Low-Resource Programming Languages (LRPLs) and Domain-Specific Languages (DSLs) remains a significant challenge, affecting millions of developers-3.5 million users in Rust alone-who cannot fully utilize LLM capabilities. LRPLs and DSLs encounter unique obstacles, including data scarcity and, for DSLs, specialized syntax that is poorly represented in general-purpose datasets. Addressing these challenges is crucial, as LRPLs and DSLs enhance development efficiency in specialized domains, such as finance and science. While several surveys discuss LLMs in software engineering, none focus specifically on the challenges and opportunities associated with LRPLs and DSLs. Our survey fills this gap by systematically reviewing the current state, methodologies, and challenges in leveraging LLMs for code generation in these languages. We filtered 111 papers from over 27,000 published studies between 2020 and 2024 to evaluate the capabilities and limitations of LLMs in LRPLs and DSLs. We report the LLMs used, benchmarks, and metrics for evaluation, strategies for enhancing performance, and methods for dataset collection and curation. We identified four main evaluation techniques and several metrics for assessing code generation in LRPLs and DSLs. Our analysis categorizes improvement methods into six groups and summarizes novel architectures proposed by researchers. Despite various techniques and metrics, a standard approach and benchmark dataset for evaluating code generation in LRPLs and DSLs are lacking. This survey serves as a resource for researchers and practitioners at the intersection of LLMs, software engineering, and specialized programming languages, laying the groundwork for future advancements in code generation for LRPLs and DSLs.
Authors: Yingfa Chen, Xinrong Zhang, Shengding Hu, Xu Han, Zhiyuan Liu, Maosong Sun
Abstract: Recent advancements in recurrent architectures, such as Mamba and RWKV, have showcased strong language capabilities. Unlike transformer-based models, these architectures encode all contextual information into a fixed-size state, leading to great inference efficiency. However, this approach can cause information interference, where different token data conflicts, resulting in performance degradation and incoherent outputs beyond a certain context length. To prevent this, most RNNs incorporate mechanisms designed to "forget" earlier tokens. In this paper, we reveal that Mamba-based models struggle to effectively forget earlier tokens even with built-in forgetting mechanisms. We demonstrate that this issue stems from training on contexts that are too short for the state size, enabling the model to perform well without needing to learn how to forget. Then, we show that the minimum training length required for the model to learn forgetting scales linearly with the state size, and the maximum context length for accurate retrieval of a 5-digit passkey scales exponentially with the state size, indicating that the model retains some information beyond the point where forgetting begins. These findings highlight a critical limitation in current RNN architectures and provide valuable insights for improving long-context modeling. Our work suggests that future RNN designs must account for the interplay between state size, training length, and forgetting mechanisms to achieve robust performance in long-context tasks.
Authors: Efimia Panagiotaki, Georgi Pramatarov, Lars Kunze, Daniele De Martini
Abstract: Testing and validating Autonomous Vehicle (AV) performance in safety-critical and diverse scenarios is crucial before real-world deployment. However, manually creating such scenarios in simulation remains a significant and time-consuming challenge. This work introduces a novel method that generates dynamic temporal scene graphs corresponding to diverse traffic scenarios, on-demand, tailored to user-defined preferences, such as AV actions, sets of dynamic agents, and criticality levels. A temporal Graph Neural Network (GNN) model learns to predict relationships between ego-vehicle, agents, and static structures, guided by real-world spatiotemporal interaction patterns and constrained by an ontology that restricts predictions to semantically valid links. Our model consistently outperforms the baselines in accurately generating links corresponding to the requested scenarios. We render the predicted scenarios in simulation to further demonstrate their effectiveness as testing environments for AV agents.
Authors: Apoorv Khandelwal, Tian Yun, Nihal V. Nayak, Jack Merullo, Stephen H. Bach, Chen Sun, Ellie Pavlick
Abstract: Pre-training is notoriously compute-intensive and academic researchers are notoriously under-resourced. It is, therefore, commonly assumed that academics can't pre-train models. In this paper, we seek to clarify this assumption. We first survey academic researchers to learn about their available compute and then empirically measure the time to replicate models on such resources. We introduce a benchmark to measure the time to pre-train models on given GPUs and also identify ideal settings for maximizing training speed. We run our benchmark on a range of models and academic GPUs, spending 2,000 GPU-hours on our experiments. Our results reveal a brighter picture for academic pre-training: for example, although Pythia-1B was originally trained on 64 GPUs for 3 days, we find it is also possible to replicate this model (with the same hyper-parameters) in 3x fewer GPU-days: i.e. on 4 GPUs in 18 days. We conclude with a cost-benefit analysis to help clarify the trade-offs between price and pre-training time. We believe our benchmark will help academic researchers conduct experiments that require training larger models on more data. We fully release our codebase at: https://github.com/apoorvkh/academic-pretraining.
Authors: Onur Boyar, Hiroyuki Hanada, Ichiro Takeuchi
Abstract: The rapid discovery of new chemical compounds is essential for advancing global health and developing treatments. While generative models show promise in creating novel molecules, challenges remain in ensuring the real-world applicability of these molecules and finding such molecules efficiently. To address this challenge, we introduce Conditional Latent Space Molecular Scaffold Optimization (CLaSMO), which integrates a Conditional Variational Autoencoder (CVAE) with Latent Space Bayesian Optimization (LSBO) to strategically modify molecules while preserving similarity to the original input, effectively framing the task as constrained optimization. Our LSBO setting improves the sample-efficiency of the molecular optimization, and our modification approach helps us to obtain molecules with higher chances of real-world applicability. CLaSMO explores substructures of molecules in a sample-efficient manner by performing BO in the latent space of a CVAE conditioned on the atomic environment of the molecule to be optimized. Our extensive evaluations across diverse optimization tasks, including rediscovery, docking score, and multi-property optimization, show that CLaSMO efficiently enhances target properties, delivers remarkable sample-efficiency crucial for resource-limited applications while considering molecular similarity constraints, achieves state of the art performance, and maintains practical synthetic accessibility. We also provide an open-source web application that enables chemical experts to apply CLaSMO in a Human-in-the-Loop setting.
Authors: Matteo Lapucci, Davide Pucci
Abstract: In this work, we address unconstrained finite-sum optimization problems, with particular focus on instances originating in large scale deep learning scenarios. Our main interest lies in the exploration of the relationship between recent line search approaches for stochastic optimization in the overparametrized regime and momentum directions. First, we point out that combining these two elements with computational benefits is not straightforward. To this aim, we propose a solution based on mini-batch persistency. We then introduce an algorithmic framework that exploits a mix of data persistency, conjugate-gradient type rules for the definition of the momentum parameter and stochastic line searches. The resulting algorithm provably possesses convergence properties under suitable assumptions and is empirically shown to outperform other popular methods from the literature, obtaining state-of-the-art results in both convex and nonconvex large scale training problems.
Authors: Haizhou Shi, Yibin Wang, Ligong Han, Huan Zhang, Hao Wang
Abstract: Estimating the uncertainty of responses from Large Language Models (LLMs) remains a critical challenge. While recent Bayesian methods have demonstrated effectiveness in quantifying uncertainty through low-rank weight updates, they typically require complex fine-tuning or post-training procedures. In this paper, we propose Training-Free Bayesianization (TFB), a simple yet theoretically grounded framework that efficiently transforms trained low-rank adapters into Bayesian ones without additional training. TFB systematically searches for the maximally acceptable level of variance in the weight posterior, constrained within a family of low-rank isotropic Gaussian distributions. Our theoretical analysis shows that under mild conditions, this search process is equivalent to KL-regularized variational optimization, a generalized form of variational inference. Through comprehensive experiments, we show that TFB achieves superior uncertainty estimation and generalization compared to existing methods while eliminating the need for complex Bayesianization training procedures. Code will be available at https://github.com/Wang-ML-Lab/bayesian-peft.
Authors: Zixuan Ke, Yifei Ming, Xuan-Phi Nguyen, Caiming Xiong, Shafiq Joty
Abstract: Domain-adaptive post-training of large language models (LLMs) has emerged as a promising approach for specialized domains such as medicine and finance. However, significant challenges remain in identifying optimal adaptation criteria and training strategies across varying data and model configurations. To address these challenges, we introduce FINDAP, a systematic and fine-grained investigation into domain-adaptive post-training of LLMs for the finance domain. Our approach consists of four key components: FinCap, which defines the core capabilities required for the target domain; FinRec, an effective training recipe that jointly optimizes continual pre-training and instruction-following, along with a novel preference data distillation method leveraging process signals from a generative reward model; FinTrain, a curated set of training datasets supporting FinRec; and FinEval, a comprehensive evaluation suite aligned with FinCap. The resulting model, Llama-Fin, achieves state-of-the-art performance across a wide range of financial tasks. Our analysis also highlights how each post-training stage contributes to distinct capabilities, uncovering specific challenges and effective solutions, providing valuable insights for domain adaptation of LLMs
Authors: Yingchao Yu, Yaochu Jin, Kuangrong Hao, Yuchen Xiao, Yuping Yan, Hengjie Yu, Zeqi Zheng, Wenxuan Pan
Abstract: Learning-to-learn (L2L), defined as progressively faster learning across similar tasks, is fundamental to both neuroscience and artificial intelligence. However, its neural basis remains elusive, as most studies emphasize neural population dynamics induced by synaptic plasticity while overlooking adaptations driven by intrinsic neuronal plasticity, which point-neuron models cannot capture. To address the above issue, we develop a recurrent spiking neural network with bi-level intrinsic plasticity (IP$^{2}$-RSNN). First, based on task demands, a slow meta-intrinsic plasticity determines which intrinsic neuronal properties are learnable, which is preserved throughout subsequent task learning once configured. Second, a fast intrinsic plasticity fine-tunes those learnable properties within each task. Our results indicate that the proposed bi-level intrinsic plasticity plays a critical role in enabling L2L in RSNNs and show that IP$^{2}$-RSNNs outperform point-neuron recurrent neural networks and self-attention models. Furthermore, our analysis of multi-scale neural dynamics reveals that the bi-level intrinsic plasticity is essential to task-type-specific adaptations at both the neuronal and network levels during L2L, while such adaptations cannot be captured by point-neuron models. Our results suggest that intrinsic plasticity provides significant computational advantages in L2L, shedding light on the design of brain-inspired deep learning models and algorithms.
Authors: Koen W. van Arem, Floris Goes-Smit, Jakob S\"ohl
Abstract: Transfers in professional football (soccer) are risky investments because of the large transfer fees and high risks involved. Although data-driven models can be used to improve transfer decisions, existing models focus on describing players' historical progress, leaving their future performance unknown. Moreover, recent developments have called for the use of explainable models combined with uncertainty quantification of predictions. This paper assesses explainable machine learning models based on predictive accuracy and uncertainty quantification methods for the prediction of the future development in quality and transfer value of professional football players. The predictive accuracy is studied by training the models to predict the quality and value of players one year ahead. This is carried out by training them on two data sets containing data-driven indicators describing the player quality and player value in historical settings. In general, the random forest model is found to be the most suitable model because it provides accurate predictions as well as an uncertainty quantification method that naturally arises from the bagging procedure of the random forest model. Additionally, this research shows that the development of player performance contains nonlinear patterns and interactions between variables, and that time series information can provide useful information for the modeling of player performance metrics. The resulting models can help football clubs make more informed, data-driven transfer decisions by forecasting player quality and transfer value.
Authors: Sixiao Zheng, Zimian Peng, Yanpeng Zhou, Yi Zhu, Hang Xu, Xiangru Huang, Yanwei Fu
Abstract: Controllable image-to-video (I2V) generation transforms a reference image into a coherent video guided by user-specified control signals. In content creation workflows, precise and simultaneous control over camera motion, object motion, and lighting direction enhances both accuracy and flexibility. However, existing approaches typically treat these control signals separately, largely due to the scarcity of datasets with high-quality joint annotations and mismatched control spaces across modalities. We present VidCRAFT3, a unified and flexible I2V framework that supports both independent and joint control over camera motion, object motion, and lighting direction by integrating three core components. Image2Cloud reconstructs a 3D point cloud from the reference image to enable precise camera motion control. ObjMotionNet encodes sparse object trajectories into multi-scale optical flow features to guide object motion. The Spatial Triple-Attention Transformer integrates lighting direction embeddings via parallel cross-attention. To address the scarcity of jointly annotated data, we curate the VideoLightingDirection (VLD) dataset of synthetic static-scene video clips with per-frame lighting-direction labels, and adopt a three-stage training strategy that enables robust learning without fully joint annotations. Extensive experiments show that VidCRAFT3 outperforms existing methods in control precision and visual coherence. Code and data will be released. Project page: https://sixiaozheng.github.io/VidCRAFT3/.
Authors: Davis Brown, Prithvi Balehannina, Helen Jin, Shreya Havaldar, Hamed Hassani, Eric Wong
Abstract: Language model evaluations often fail to characterize consequential failure modes, forcing experts to inspect outputs and build new benchmarks. We introduce task elicitation, a method that automatically builds new evaluations to profile model behavior. Task elicitation finds hundreds of natural-language tasks -- an order of magnitude more than prior work -- where frontier models exhibit systematic failures, in domains ranging from forecasting to online harassment. For example, we find that Sonnet 3.5 over-associates quantum computing and AGI and that o3-mini is prone to hallucination when fabrications are repeated in-context.
Authors: Saurabh Koju, Saurav Bastola, Prashant Shrestha, Sanskar Amgain, Yash Raj Shrestha, Rudra P. K. Poudel, Binod Bhattarai
Abstract: Realistic and interactive surgical simulation has the potential to facilitate crucial applications, such as medical professional training and autonomous surgical agent training. In the natural visual domain, world models have enabled action-controlled data generation, demonstrating the potential to train autonomous agents in interactive simulated environments when large-scale real data acquisition is infeasible. However, such works in the surgical domain have been limited to simplified computer simulations, and lack realism. Furthermore, existing literature in world models has predominantly dealt with action-labeled data, limiting their applicability to real-world surgical data, where obtaining action annotation is prohibitively expensive. Inspired by the recent success of Genie in leveraging unlabeled video game data to infer latent actions and enable action-controlled data generation, we propose the first surgical vision world model. The proposed model can generate action-controllable surgical data and the architecture design is verified with extensive experiments on the unlabeled SurgToolLoc-2022 dataset. Codes and implementation details are available at https://github.com/bhattarailab/Surgical-Vision-World-Model
URLs: https://github.com/bhattarailab/Surgical-Vision-World-Model
Authors: Yingfa Chen, Yutong Wu, Chenyang Song, Zhen Leng Thai, Xingyu Shen, Xu Han, Zhiyuan Liu, Maosong Sun
Abstract: Grouped-Query Attention (GQA) is a widely adopted strategy for reducing the computational cost of attention layers in large language models (LLMs). However, current GQA configurations are often suboptimal because they overlook how context length influences inference cost. Since inference cost grows with context length, the most cost-efficient GQA configuration should also vary accordingly. In this work, we analyze the relationship among context length, model size, GQA configuration, and model loss, and introduce two innovations: (1) we decouple the total head size from the hidden size, enabling more flexible control over attention FLOPs; and (2) we jointly optimize the model size and the GQA configuration to arrive at a better allocation of inference resources between attention layers and other components. Our analysis reveals that commonly used GQA configurations are highly suboptimal for long-context scenarios. More importantly, we propose a recipe for deriving cost-optimal GQA configurations. Our results show that for long-context scenarios, one should use fewer attention heads while scaling up model size. Configurations selected by our recipe can reduce both memory usage and FLOPs by more than 50% compared to Llama-3's GQA, with *no degradation in model capabilities*. Our findings offer valuable insights for designing efficient long-context LLMs. The code is available at https://www.github.com/THUNLP/cost-optimal-gqa .
Authors: Derun Li, Changye Li, Yue Wang, Jianwei Ren, Xin Wen, Pengxiang Li, Leimeng Xu, Kun Zhan, Peng Jia, Xianpeng Lang, Ningyi Xu, Hang Zhao
Abstract: Generating human-like and adaptive trajectories is essential for autonomous driving in dynamic environments. While generative models have shown promise in synthesizing feasible trajectories, they often fail to capture the nuanced variability of personalized driving styles due to dataset biases and distributional shifts. To address this, we introduce TrajHF, a human feedback-driven finetuning framework for generative trajectory models, designed to align motion planning with diverse driving styles. TrajHF incorporates multi-conditional denoiser and reinforcement learning with human feedback to refine multi-modal trajectory generation beyond conventional imitation learning. This enables better alignment with human driving preferences while maintaining safety and feasibility constraints. TrajHF achieves performance comparable to the state-of-the-art on NavSim benchmark. TrajHF sets a new paradigm for personalized and adaptable trajectory generation in autonomous driving.
Authors: Austin Shouli, Ankur Barthwal, Molly Campbell, Ajay Kumar Shrestha
Abstract: The rapid expansion of Artificial Intelligence (AI) in digital platforms used by youth has created significant challenges related to privacy, autonomy, and data protection. While AI-driven personalization offers enhanced user experiences, it often operates without clear ethical boundaries, leaving young users vulnerable to data exploitation and algorithmic biases. This paper presents a call to action for ethical AI governance, advocating for a structured framework that ensures youth-centred privacy protections, transparent data practices, and regulatory oversight. We outline key areas requiring urgent intervention, including algorithmic transparency, privacy education, parental data-sharing ethics, and accountability measures. Through this approach, we seek to empower youth with greater control over their digital identities and propose actionable strategies for policymakers, AI developers, and educators to build a fairer and more accountable AI ecosystem.
Authors: Lin-Han Jia, Lan-Zhe Guo, Zhi Zhou, Si-Ye Han, Zi-Wen Li, Yu-Feng Li
Abstract: In real-world applications, it is highly challenging to detect anomalous samples with extremely sparse anomalies, as they are highly similar to and thus easily confused with normal samples. Moreover, the number of anomalous samples is inherently scarce. This results in a dual imbalance Multi-Instance Learning (MIL) problem, manifesting at both the macro and micro levels. To address this "needle-in-a-haystack problem", we find that MIL problem can be reformulated as a fine-grained PU learning problem. This allows us to address the imbalance issue in an unbiased manner using micro-level balancing mechanisms. To this end, we propose a novel framework, Balanced Fine-Grained Positive-Unlabeled (BFGPU)-based on rigorous theoretical foundations. Extensive experiments on both synthetic and real-world datasets demonstrate the effectiveness of BFGPU.
Authors: Dongyang Fan, Tyler J. Rotello, Sai Praneeth Karimireddy
Abstract: As large language models increasingly rely on external data sources, compensating data contributors has become a central concern. But how should these payments be devised? We revisit data valuations from a $\textit{market-design perspective}$ where payments serve to compensate data owners for the $\textit{private}$ heterogeneous costs they incur for collecting and sharing data. We show that popular valuation methods-such as Leave-One-Out and Data Shapley-make for poor payments. They fail to ensure truthful reporting of the costs, leading to $\textit{inefficient market}$ outcomes. To address this, we adapt well-established payment rules from mechanism design, namely Myerson and Vickrey-Clarke-Groves (VCG), to the data market setting. We show that Myerson payment is the minimal truthful mechanism, optimal from the buyer's perspective. Additionally, we identify a condition under which both data buyers and sellers are utility-satisfied, and the market achieves efficiency. Our findings highlight the importance of incorporating incentive compatibility into data valuation design, paving the way for more robust and efficient data markets. Our data market framework is readily applicable to real-world scenarios. We illustrate this with simulations of contributor compensation in an LLM based retrieval-augmented generation (RAG) marketplace tasked with challenging medical question answering.
Authors: Liam Welsh, Udit Grover, Sebastian Jaimungal
Abstract: Climate change is a major threat to the future of humanity, and its impacts are being intensified by excess man-made greenhouse gas emissions. One method governments can employ to control these emissions is to provide firms with emission limits and penalize any excess emissions above the limit. Excess emissions may also be offset by firms who choose to invest in carbon reducing and capturing projects. These projects generate offset credits which can be submitted to a regulating agency to offset a firm's excess emissions, or they can be traded with other firms. In this work, we characterize the finite-agent Nash equilibrium for offset credit markets. As computing Nash equilibria is an NP-hard problem, we utilize the modern reinforcement learning technique Nash-DQN to efficiently estimate the market's Nash equilibria. We demonstrate not only the validity of employing reinforcement learning methods applied to climate themed financial markets, but also the significant financial savings emitting firms may achieve when abiding by the Nash equilibria through numerical experiments.
Authors: Jie JW Wu, Manav Chaudhary, Davit Abrahamyan, Arhaan Khaku, Anjiang Wei, Fatemeh H. Fard
Abstract: Large language models (LLMs) have demonstrated remarkable capabilities in code generation tasks. However, a gap remains between their output and the problem-solving strategies of human developers. Unlike humans, who spend substantial time disambiguating requirements through iterative dialogue, LLMs often generate code despite ambiguities in natural language requirements, leading to unreliable solutions. Different from prior work, we study whether a Code LLM can be fine-tuned to learn clarification-seeking behavior. While recent work has focused on LLM-based agents for iterative code generation, we argue that the ability to recognize and query ambiguous requirements should be intrinsic to the models themselves, especially in agentic AI where models and humans collaborate. We present ClarifyCoder, a framework with synthetic data generation and instruction-tuning that fine-tunes an LLM to identify ambiguities and request clarification before code generation. Our approach has two components: (1) a data synthesis technique that augments programming datasets with scenarios requiring clarification to generate clarification-aware training data, and (2) a fine-tuning strategy that teaches models to prioritize seeking clarification over immediate code generation when faced with incomplete or ambiguous requirements. We also provide an empirical analysis of integrating ClarifyCoder with standard fine-tuning for joint optimization of clarification-awareness and coding ability. Experimental results show that ClarifyCoder achieves a 63% communication rate (40% absolute increase) and a 52% good question rate (30% absolute increase) on ambiguous tasks, significantly improving LLMs' communication capabilities while maintaining code generation performance.
Authors: Alexander Baumann, Leonardo Ayala, Silvia Seidlitz, Jan Sellner, Alexander Studier-Fischer, Berkin \"Ozdemir, Lena Maier-Hein, Slobodan Ilic
Abstract: Spectral imaging offers promising applications across diverse domains, including medicine and urban scene understanding, and is already established as a critical modality in remote sensing. However, variability in channel dimensionality and captured wavelengths among spectral cameras impede the development of AI-driven methodologies, leading to camera-specific models with limited generalizability and inadequate cross-camera applicability. To address this bottleneck, we introduce CARL, a model for Camera-Agnostic Representation Learning across RGB, multispectral, and hyperspectral imaging modalities. To enable the conversion of a spectral image with any channel dimensionality to a camera-agnostic representation, we introduce a novel spectral encoder, featuring a self-attention-cross-attention mechanism, to distill salient spectral information into learned spectral representations. Spatio-spectral pre-training is achieved with a novel feature-based self-supervision strategy tailored to CARL. Large-scale experiments across the domains of medical imaging, autonomous driving, and satellite imaging demonstrate our model's unique robustness to spectral heterogeneity, outperforming on datasets with simulated and real-world cross-camera spectral variations. The scalability and versatility of the proposed approach position our model as a backbone for future spectral foundation models.
Authors: Xinyue Peng, Yanming Liu, Yihan Cang, Chaoqun Cao, Ming Chen
Abstract: Solving non-convex resource allocation problems poses significant challenges in wireless communication systems, often beyond the capability of traditional optimization techniques. To address this issue, we propose LLM-OptiRA, the first framework that leverages large language models (LLMs) to automatically detect and transform non-convex components into solvable forms, enabling fully automated resolution of non-convex resource allocation problems in wireless communication systems. LLM-OptiRA not only simplifies problem-solving by reducing reliance on expert knowledge, but also integrates error correction and feasibility validation mechanisms to ensure robustness. Experimental results show that LLM-OptiRA achieves an execution rate of 96% and a success rate of 80% on GPT-4, significantly outperforming baseline approaches in complex optimization tasks across diverse scenarios.
Authors: Maytus Piriyajitakonkij, Rujikorn Charakorn, Weicheng Tao, Wei Pan, Mingfei Sun, Cheston Tan, Mengmi Zhang
Abstract: Language is a powerful communicative and cognitive tool. It enables humans to express thoughts, share intentions, and reason about complex phenomena. Despite our fluency in using and understanding language, the question of how it arises and evolves over time remains unsolved. A leading hypothesis in linguistics and anthropology posits that language evolved to meet the ecological and social demands of early human cooperation. Language did not arise in isolation, but through shared survival goals. Inspired by this view, we investigate the emergence of language in multi-agent Foraging Games. These environments are designed to reflect the cognitive and ecological constraints believed to have influenced the evolution of communication. Agents operate in a shared grid world with only partial knowledge about other agents and the environment, and must coordinate to complete games like picking up high-value targets or executing temporally ordered actions. Using end-to-end deep reinforcement learning, agents learn both actions and communication strategies from scratch. We find that agents develop communication protocols with hallmark features of natural language: arbitrariness, interchangeability, displacement, cultural transmission, and compositionality. We quantify each property and analyze how different factors, such as population size, social dynamics, and temporal dependencies, shape specific aspects of the emergent language. Our framework serves as a platform for studying how language can evolve from partial observability, temporal reasoning, and cooperative goals in embodied multi-agent settings. We will release all data, code, and models publicly.
Authors: Xiaojie Gu, Ziying Huang, Jia-Chen Gu, Kai Zhang
Abstract: Lifelong learning enables large language models (LLMs) to adapt to evolving information by continually updating their internal knowledge. An ideal system should support efficient, wide-ranging updates while preserving existing capabilities and ensuring reliable deployment. Model editing stands out as a promising solution for this goal, offering a focused and efficient way to revise a model's internal knowledge. Although recent paradigms have made notable progress, they often struggle to meet the demands of practical lifelong adaptation at scale. To bridge this gap, we propose UltraEdit, a training-, subject-, and memory-free approach that is well-suited for ultra-scalable, real-world lifelong model editing. UltraEdit fundamentally differs from traditional paradigms by computing parameter shifts in one step using only a hidden state and its gradient, making the approach simple yet efficient. To improve scalability in lifelong settings, UltraEdit employs a lifelong normalization strategy that continuously updates feature statistics across turns, allowing it to adapt to distributional shifts and maintain consistency over time. UltraEdit achieves editing speeds over 7x faster than the previous state-of-the-art method, which was also the fastest known approach, while using less than 1/4 the VRAM. This makes it the only method currently capable of editing a 7B LLM on a 24GB consumer-grade GPU. Furthermore, we construct UltraEditBench, the largest dataset in the field to date with over 2M editing pairs, and demonstrate that our method supports up to 2M edits while maintaining high accuracy. Comprehensive experiments on five datasets and six models show that UltraEdit consistently achieves superior performance across diverse model editing scenarios, taking a further step towards safe and scalable lifelong learning. Our code is available at: https://github.com/XiaojieGu/UltraEdit
Authors: David Nordstr\"om, Johan Edstedt, Fredrik Kahl, Georg B\"okman
Abstract: Why are state-of-the-art Vision Transformers (ViTs) not designed to exploit natural geometric symmetries such as 90-degree rotations and reflections? In this paper, we argue that there is no fundamental reason, and what has been missing is an efficient implementation. To this end, we introduce Octic Vision Transformers (octic ViTs) which rely on octic group equivariance to capture these symmetries. In contrast to prior equivariant models that increase computational cost, our octic linear layers achieve 5.33x reductions in FLOPs and up to 8x reductions in memory compared to ordinary linear layers. In full octic ViT blocks the computational reductions approach the reductions in the linear layers with increased embedding dimension. We study two new families of ViTs, built from octic blocks, that are either fully octic equivariant or break equivariance in the last part of the network. Training octic ViTs supervised (DeiT-III) and unsupervised (DINOv2) on ImageNet-1K, we find that they match baseline accuracy while at the same time providing substantial efficiency gains.
Authors: Mikhail Menschikov, Alexander Kharitonov, Maiia Kotyga, Vadim Porvatov, Anna Zhukovskaya, David Kagramanyan, Egor Shvetsov, Evgeny Burnaev
Abstract: Large Language Models (LLMs) exhibit position bias - a systematic tendency to neglect information at specific context positions. However, the patterns of position bias behavior, depending on the language or model, remain unexplored. We present a multilingual study across five typologically distinct languages (English, Russian, German, Hindi, and Vietnamese) and five model architectures, examining how position bias interacts with prompt strategies and affects output entropy. Our key findings are: (1) Position bias is primarily model-driven, yet exhibits language-specific variations. For instance, Qwen2.5-7B-Instruct and DeepSeek 7B Chat consistently favors late positions, challenging established assumptions of a universal early-token bias in LLMs. (2) Explicitly instructing the model that "the context is relevant to the query" unexpectedly reduces accuracy across languages, undermining common prompt-engineering practices. (3) While the largest accuracy drop occurs when relevant information is placed in the middle of the context, this is not explicitly reflected by a corresponding peak in output entropy.
Authors: Ruizhe Li, Chen Chen, Yuchen Hu, Yanjun Gao, Xi Wang, Emine Yilmaz
Abstract: Retrieval-Augmented Generation (RAG) leverages large language models (LLMs) combined with external contexts to enhance the accuracy and reliability of generated responses. However, reliably attributing generated content to specific context segments, context attribution, remains challenging due to the computationally intensive nature of current methods, which often require extensive fine-tuning or human annotation. In this work, we introduce a novel Jensen-Shannon Divergence driven method to Attribute Response to Context (ARC-JSD), enabling efficient and accurate identification of essential context sentences without additional fine-tuning, gradient-calculation or surrogate modelling. Evaluations on a wide range of RAG benchmarks, such as TyDi QA, Hotpot QA, and Musique, using instruction-tuned LLMs in different scales demonstrate superior accuracy and significant computational efficiency improvements compared to the previous surrogate-based method. Furthermore, our mechanistic analysis reveals specific attention heads and multilayer perceptron (MLP) layers responsible for context attribution, providing valuable insights into the internal workings of RAG models and how they affect RAG behaviours. Our code is available at https://github.com/ruizheliUOA/ARC_JSD.
Authors: Xiaoyu Xu, Xiang Yue, Yang Liu, Qingqing Ye, Huadi Zheng, Peizhao Hu, Minxin Du, Haibo Hu
Abstract: Unlearning in large language models (LLMs) aims to remove specified data, but its efficacy is typically assessed with task-level metrics like accuracy and perplexity. We demonstrate that these metrics are often misleading, as models can appear to forget while their original behavior is easily restored through minimal fine-tuning. This phenomenon of \emph{reversibility} suggests that information is merely suppressed, not genuinely erased. To address this critical evaluation gap, we introduce a \emph{representation-level analysis framework}. Our toolkit comprises PCA-based similarity and shift, centered kernel alignment (CKA), and Fisher information, complemented by a summary metric, the mean PCA distance, to measure representational drift. Applying this framework across six unlearning methods, three data domains, and two LLMs, we identify four distinct forgetting regimes based on their \emph{reversibility} and \emph{catastrophicity}. Our analysis reveals that achieving the ideal state--irreversible, non-catastrophic forgetting--is exceptionally challenging. By probing the limits of unlearning, we identify a case of seemingly irreversible, targeted forgetting, offering new insights for designing more robust erasure algorithms. Our findings expose a fundamental gap in current evaluation practices and establish a representation-level foundation for trustworthy unlearning.
Authors: Fengyi Li, Kayhan Behdin, Natesh Pillai, Xiaofeng Wang, Zhipeng Wang, Ercan Yildiz
Abstract: Text segmentation based on the semantic meaning of sentences is a fundamental task with broad utility in many downstream applications. In this paper, we propose a graphical model-based unsupervised learning approach, named BP-Seg for efficient text segmentation. Our method not only considers local coherence, capturing the intuition that adjacent sentences are often more related, but also effectively groups sentences that are distant in the text yet semantically similar. This is achieved through belief propagation on the carefully constructed graphical models. Experimental results on both an illustrative example and a dataset with long-form documents demonstrate that our method performs favorably compared to competing approaches.
Authors: Jingzhi Hu, Geoffrey Ye Li
Abstract: Future networks are envisioned to connect massive artificial intelligence (AI) agents, enabling their extensive collaboration on diverse tasks. Compared to traditional entities, these agents naturally suit the semantic communication (SC), which can significantly enhance the bandwidth efficiency. Nevertheless, SC requires the knowledge among agents to be aligned, while agents have distinct expert knowledge for their individual tasks in practice. In this paper, we propose a distillation-enabled knowledge alignment protocol (DeKAP), which distills the expert knowledge of each agent into parameter-efficient low-rank matrices, allocates them across the network, and allows agents to simultaneously maintain aligned knowledge for multiple tasks. We formulate the joint minimization of alignment loss, communication overhead, and storage cost as a large-scale integer linear programming problem and develop a highly efficient greedy algorithm. From computer simulation, the DeKAP establishes knowledge alignment with the lowest communication and computation resources compared to conventional approaches.
Authors: Mathew J. Koretsky, Maya Willey, Adi Asija, Owen Bianchi, Chelsea X. Alvarado, Tanay Nayak, Nicole Kuznetsov, Sungwon Kim, Mike A. Nalls, Daniel Khashabi, Faraz Faghri
Abstract: Biomedical researchers increasingly rely on large-scale structured databases for complex analytical tasks. However, current text-to-SQL systems often struggle to map qualitative scientific questions into executable SQL, particularly when implicit domain reasoning is required. We introduce BiomedSQL, the first benchmark explicitly designed to evaluate scientific reasoning in text-to-SQL generation over a real-world biomedical knowledge base. BiomedSQL comprises 68,000 question/SQL query/answer triples grounded in a harmonized BigQuery knowledge base that integrates gene-disease associations, causal inference from omics data, and drug approval records. Each question requires models to infer domain-specific criteria, such as genome-wide significance thresholds, effect directionality, or trial phase filtering, rather than rely on syntactic translation alone. We evaluate a range of open- and closed-source LLMs across prompting strategies and interaction paradigms. Our results reveal a substantial performance gap: GPT-o3-mini achieves 59.0% execution accuracy, while our custom multi-step agent, BMSQL, reaches 62.6%, both well below the expert baseline of 90.0%. BiomedSQL provides a new foundation for advancing text-to-SQL systems capable of supporting scientific discovery through robust reasoning over structured biomedical knowledge bases. Our dataset is publicly available at https://huggingface.co/datasets/NIH-CARD/BiomedSQL, and our code is open-source at https://github.com/NIH-CARD/biomedsql.
URLs: https://huggingface.co/datasets/NIH-CARD/BiomedSQL,, https://github.com/NIH-CARD/biomedsql.
Authors: Alex A. Saoulis, Davide Piras, Niall Jeffrey, Alessio Spurio Mancini, Ana M. G. Ferreira, Benjamin Joachimi
Abstract: Simulation-based inference (SBI) enables cosmological parameter estimation when closed-form likelihoods or models are unavailable. However, SBI relies on machine learning for neural compression and density estimation. This requires large training datasets which are prohibitively expensive for high-quality simulations. We overcome this limitation with multifidelity transfer learning, combining less expensive, lower-fidelity simulations with a limited number of high-fidelity simulations. We demonstrate our methodology on dark matter density maps from two separate simulation suites in the hydrodynamical CAMELS Multifield Dataset. Pre-training on dark-matter-only $N$-body simulations reduces the required number of high-fidelity hydrodynamical simulations by a factor between $8$ and $15$, depending on the model complexity, posterior dimensionality, and performance metrics used. By leveraging cheaper simulations, our approach enables performant and accurate inference on high-fidelity models while substantially reducing computational costs.
Authors: Jingyun Yang, Isabella Huang, Brandon Vu, Max Bajracharya, Rika Antonova, Jeannette Bohg
Abstract: Learned visuomotor policies are capable of performing increasingly complex manipulation tasks. However, most of these policies are trained on data collected from limited robot positions and camera viewpoints. This leads to poor generalization to novel robot positions, which limits the use of these policies on mobile platforms, especially for precise tasks like pressing buttons or turning faucets. In this work, we formulate the policy mobilization problem: find a mobile robot base pose in a novel environment that is in distribution with respect to a manipulation policy trained on a limited set of camera viewpoints. Compared to retraining the policy itself to be more robust to unseen robot base pose initializations, policy mobilization decouples navigation from manipulation and thus does not require additional demonstrations. Crucially, this problem formulation complements existing efforts to improve manipulation policy robustness to novel viewpoints and remains compatible with them. We propose a novel approach for policy mobilization that bridges navigation and manipulation by optimizing the robot's base pose to align with an in-distribution base pose for a learned policy. Our approach utilizes 3D Gaussian Splatting for novel view synthesis, a score function to evaluate pose suitability, and sampling-based optimization to identify optimal robot poses. To understand policy mobilization in more depth, we also introduce the Mobi-$\pi$ framework, which includes: (1) metrics that quantify the difficulty of mobilizing a given policy, (2) a suite of simulated mobile manipulation tasks based on RoboCasa to evaluate policy mobilization, and (3) visualization tools for analysis. In both our developed simulation task suite and the real world, we show that our approach outperforms baselines, demonstrating its effectiveness for policy mobilization.
Authors: Alessio Giorlandino, Sebastian Goldt
Abstract: Finding the right initialisation for neural networks is crucial to ensure smooth training and good performance. In transformers, the wrong initialisation can lead to one of two failure modes of self-attention layers: rank collapse, where all tokens collapse into similar representations, and entropy collapse, where highly concentrated attention scores lead to training instability. While previous work has studied different scaling regimes for transformers, an asymptotically exact, down-to-the constant prescription for how to initialise transformers has so far been lacking. Here, we provide an analytical theory of signal propagation through deep transformers with self-attention, layer normalisation, skip connections and MLP. Our theory yields a simple algorithm to compute trainability diagrams that identify the correct choice of initialisation hyper-parameters for a given architecture. We overcome the key challenge, an exact treatment of the self-attention layer, by establishing a formal parallel with the Random Energy Model from statistical physics. We also analyse gradients in the backward path and determine the regime where gradients vanish at initialisation. We demonstrate the versatility of our framework through three case studies. Our theoretical framework gives a unified perspective on the two failure modes of self-attention and gives quantitative predictions on the scale of both weights and residual connections that guarantee smooth training.
Authors: Jinmei Liu, Fuhong Liu, Jianye Hao, Bo Wang, Huaxiong Li, Chunlin Chen, Zhi Wang
Abstract: Recent advancements in language models have demonstrated remarkable in-context learning abilities, prompting the exploration of in-context reinforcement learning (ICRL) to extend the promise to decision domains. Due to involving more complex dynamics and temporal correlations, existing ICRL approaches may face challenges in learning from suboptimal trajectories and achieving precise in-context inference. In the paper, we propose \textbf{S}calable \textbf{I}n-\textbf{C}ontext \textbf{Q}-\textbf{L}earning (\textbf{SICQL}), an innovative framework that harnesses dynamic programming and world modeling to steer ICRL toward efficient reward maximization and task generalization, while retaining the scalability and stability of supervised pretraining. We design a prompt-based multi-head transformer architecture that simultaneously predicts optimal policies and in-context value functions using separate heads. We pretrain a generalized world model to capture task-relevant information, enabling the construction of a compact prompt that facilitates fast and precise in-context inference. During training, we perform iterative policy improvement by fitting a state value function to an upper-expectile of the Q-function, and distill the in-context value functions into policy extraction using advantage-weighted regression. Extensive experiments across a range of discrete and continuous environments show consistent performance gains over various types of baselines, especially when learning from suboptimal data. Our code is available at https://github.com/NJU-RL/SICQL
Authors: Damith Chamalke Senadeera, Xiaoyun Yang, Shibo Li, Muhammad Awais, Dimitrios Kollias, Gregory Slabaugh
Abstract: The rapid proliferation of surveillance cameras has increased the demand for automated violence detection. While CNNs and Transformers have shown success in extracting spatio-temporal features, they struggle with long-term dependencies and computational efficiency. We propose Dual Branch VideoMamba with Gated Class Token Fusion (GCTF), an efficient architecture combining a dual-branch design and a state-space model (SSM) backbone where one branch captures spatial features, while the other focuses on temporal dynamics. The model performs continuous fusion via a gating mechanism between the branches to enhance the model's ability to detect violent activities even in challenging surveillance scenarios. We also present a new benchmark by merging RWF-2000, RLVS, SURV and VioPeru datasets in video violence detection, ensuring strict separation between training and testing sets. Experimental results demonstrate that our model achieves state-of-the-art performance on this benchmark and also on DVD dataset which is another novel dataset on video violence detection, offering an optimal balance between accuracy and computational efficiency, demonstrating the promise of SSMs for scalable, near real-time surveillance violence detection.
Authors: Ali Waqas, Sinem Coleri
Abstract: Sensor-based local inference at IoT devices faces severe computational limitations, often requiring data transmission over noisy wireless channels for server-side processing. To address this, split-network Deep Neural Network (DNN) based Joint Source-Channel Coding (JSCC) schemes are used to extract and transmit relevant features instead of raw data. However, most existing methods rely on fixed network splits and static configurations, lacking adaptability to varying computational budgets and channel conditions. In this paper, we propose a novel SNR- and computation-adaptive distributed CNN framework for wireless image classification across IoT devices and edge servers. We introduce a learning-assisted intelligent Genetic Algorithm (LAIGA) that efficiently explores the CNN hyperparameter space to optimize network configuration under given FLOPs constraints and given SNR. LAIGA intelligently discards the infeasible network configurations that exceed computational budget at IoT device. It also benefits from the Random Forests based learning assistance to avoid a thorough exploration of hyperparameter space and to induce application specific bias in candidate optimal configurations. Experimental results demonstrate that the proposed framework outperforms fixed-split architectures and existing SNR-adaptive methods, especially under low SNR and limited computational resources. We achieve a 10\% increase in classification accuracy as compared to existing JSCC based SNR-adaptive multilayer framework at an SNR as low as -10dB across a range of available computational budget (1M to 70M FLOPs) at IoT device.
Authors: Shivani Shukla, Himanshu Joshi, Romilla Syed
Abstract: The rapid adoption of Large Language Models(LLMs) for code generation has transformed software development, yet little attention has been given to how security vulnerabilities evolve through iterative LLM feedback. This paper analyzes security degradation in AI-generated code through a controlled experiment with 400 code samples across 40 rounds of "improvements" using four distinct prompting strategies. Our findings show a 37.6% increase in critical vulnerabilities after just five iterations, with distinct vulnerability patterns emerging across different prompting approaches. This evidence challenges the assumption that iterative LLM refinement improves code security and highlights the essential role of human expertise in the loop. We propose practical guidelines for developers to mitigate these risks, emphasizing the need for robust human validation between LLM iterations to prevent the paradoxical introduction of new security issues during supposedly beneficial code "improvements".
Authors: Hiren Madhu, Jo\~ao Felipe Rocha, Tinglin Huang, Siddharth Viswanath, Smita Krishnaswamy, Rex Ying
Abstract: Single-cell transcriptomics and proteomics have become a great source for data-driven insights into biology, enabling the use of advanced deep learning methods to understand cellular heterogeneity and gene expression at the single-cell level. With the advent of spatial-omics data, we have the promise of characterizing cells within their tissue context as it provides both spatial coordinates and intra-cellular transcriptional or protein counts. Proteomics offers a complementary view by directly measuring proteins, which are the primary effectors of cellular function and key therapeutic targets. However, existing models either ignore the spatial information or the complex genetic and proteomic programs within cells. Thus they cannot infer how cell internal regulation adapts to microenvironmental cues. Furthermore, these models often utilize fixed gene vocabularies, hindering their generalizability unseen genes. In this paper, we introduce HEIST, a hierarchical graph transformer foundation model for spatial transcriptomics and proteomics. HEIST models tissues as hierarchical graphs. The higher level graph is a spatial cell graph, and each cell in turn, is represented by its lower level gene co-expression network graph. HEIST achieves this by performing both intra-level and cross-level message passing to utilize the hierarchy in its embeddings and can thus generalize to novel datatypes including spatial proteomics without retraining. HEIST is pretrained on 22.3M cells from 124 tissues across 15 organs using spatially-aware contrastive and masked autoencoding objectives. Unsupervised analysis of HEIST embeddings reveals spatially informed subpopulations missed by prior models. Downstream evaluations demonstrate generalizability to proteomics data and state-of-the-art performance in clinical outcome prediction, cell type annotation, and gene imputation across multiple technologies.
Authors: Xiaolong Wang, Zhaolu Kang, Wangyuxuan Zhai, Xinyue Lou, Yunghwei Lai, Ziyue Wang, Yawen Wang, Kaiyu Huang, Yile Wang, Peng Li, Yang Liu
Abstract: Multimodal Large Language Models (MLLMs) have demonstrated significant advances across numerous vision-language tasks. MLLMs have shown promising capability in aligning visual and textual modalities, allowing them to process image-text pairs with clear and explicit meanings. However, resolving the inherent ambiguities present in real-world language and visual contexts remains a challenge. Existing multimodal benchmarks typically overlook linguistic and visual ambiguities, relying mainly on unimodal context for disambiguation and thus failing to exploit the mutual clarification potential between modalities. To bridge this gap, we introduce MUCAR, a novel and challenging benchmark designed explicitly for evaluating multimodal ambiguity resolution across multilingual and cross-modal scenarios. MUCAR includes first a multilingual dataset where ambiguous textual expressions are uniquely resolved by corresponding visual contexts, and second a dual-ambiguity dataset that systematically pairs ambiguous images with ambiguous textual contexts, with each combination carefully constructed to yield a single, clear interpretation through mutual disambiguation. Extensive evaluations involving 19 state-of-the-art multimodal models--encompassing both open-source and proprietary architectures--reveal substantial gaps compared to human-level performance, highlighting the need for future research into more sophisticated cross-modal ambiguity comprehension methods, further pushing the boundaries of multimodal reasoning.
Authors: Sajjad Ghiasvand, Mahnoosh Alizadeh, Ramtin Pedarsani
Abstract: Vision-Language Models (VLMs) like CLIP have demonstrated remarkable generalization in zero- and few-shot settings, but adapting them efficiently to decentralized, heterogeneous data remains a challenge. While prompt tuning has emerged as a popular parameter-efficient approach in personalized federated learning, existing methods often sacrifice generalization in favor of personalization, struggling particularly on unseen classes or domains. In this work, we propose pFedMMA, the first personalized federated learning framework that leverages multi-modal adapters for vision-language tasks. Each adapter contains modality-specific up- and down-projection layers alongside a globally shared projection that aligns cross-modal features. Our optimization strategy allows clients to locally adapt to personalized data distributions while collaboratively training the shared projection to improve global generalization. This design is also communication-efficient, as only the shared component is exchanged during communication rounds. Through extensive experiments across eleven datasets, including domain- and label-shift scenarios, we show that pFedMMA achieves state-of-the-art trade-offs between personalization and generalization, outperforming recent federated prompt tuning methods.
Authors: Jaedong Hwang, Kumar Tanmay, Seok-Jin Lee, Ayush Agrawal, Hamid Palangi, Kumar Ayush, Ila Fiete, Paul Pu Liang
Abstract: Large Language Models (LLMs) have achieved strong performance in domains like mathematics, factual question answering, and code generation, yet their ability to reason on these tasks in different languages remains underdeveloped. Especially for low-resource languages such as Swahili or Thai, LLMs can often misinterpret prompts or default to reasoning in English. This implicit bias toward high-resource languages undermines factual accuracy, interpretability, and trust. We propose M2A, a novel method that combines multi-scale multilingual alignment with language-consistency rewards on machine-translated questions, training models to reason directly and accurately in the target language. Furthermore, existing multilingual benchmarks only evaluate on final answers, overlooking whether reasoning occurs in the intended language. To close this gap, we introduce GeoFact-X, a geography-based multilingual factual reasoning benchmark together with reasoning traces in five languages: English, Hindi, Japanese, Swahili, and Thai. Our results show that M2A significantly enhances multilingual reasoning fidelity in both mathematical and factual reasoning tasks, highlighting that reasoning-aware multilingual reinforcement learning is crucial for robust cross-lingual generalization. https://jd730.github.io/projects/M2A_GeoFact-X
Authors: Shuying Huang, Junpeng Li, Changchun Hua, Yana Yang
Abstract: To alleviate the annotation burden in supervised learning, N-tuples learning has recently emerged as a powerful weakly-supervised method. While existing N-tuples learning approaches extend pairwise learning to higher-order comparisons and accommodate various real-world scenarios, they often rely on task-specific designs and lack a unified theoretical foundation. In this paper, we propose a general N-tuples learning framework based on empirical risk minimization, which systematically integrates pointwise unlabeled data to enhance learning performance. This paper first unifies the data generation processes of N-tuples and pointwise unlabeled data under a shared probabilistic formulation. Based on this unified view, we derive an unbiased empirical risk estimator that generalizes a broad class of existing N-tuples models. We further establish a generalization error bound for theoretical support. To demonstrate the flexibility of the framework, we instantiate it in four representative weakly supervised scenarios, each recoverable as a special case of our general model. Additionally, to address overfitting issues arising from negative risk terms, we adopt correction functions to adjust the empirical risk. Extensive experiments on benchmark datasets validate the effectiveness of the proposed framework and demonstrate that leveraging pointwise unlabeled data consistently improves generalization across various N-tuples learning tasks.
Authors: Ravin Kumar
Abstract: We propose the APTx Neuron, a novel, unified neural computation unit that integrates non-linear activation and linear transformation into a single trainable expression. The APTx Neuron is derived from the APTx activation function, thereby eliminating the need for separate activation layers and making the architecture both computationally efficient and elegant. The proposed neuron follows the functional form $y = \sum_{i=1}^{n} ((\alpha_i + \tanh(\beta_i x_i)) \cdot \gamma_i x_i) + \delta$, where all parameters $\alpha_i$, $\beta_i$, $\gamma_i$, and $\delta$ are trainable. We validate our APTx Neuron-based architecture on the MNIST dataset, achieving up to $96.69\%$ test accuracy within 11 epochs using approximately 332K trainable parameters. The results highlight the superior expressiveness and computational efficiency of the APTx Neuron compared to traditional neurons, pointing toward a new paradigm in unified neuron design and the architectures built upon it. Source code is available at https://github.com/mr-ravin/aptx_neuron.
Authors: Semih Eren, Deniz Kucukahmetler, Nico Scherf
Abstract: Accurately predicting distributed cortical responses to naturalistic stimuli requires models that integrate visual, auditory and semantic information over time. We present a hierarchical multimodal recurrent ensemble that maps pretrained video, audio, and language embeddings to fMRI time series recorded while four subjects watched almost 80 hours of movies provided by the Algonauts 2025 challenge. Modality-specific bidirectional RNNs encode temporal dynamics; their hidden states are fused and passed to a second recurrent layer, and lightweight subject-specific heads output responses for 1000 cortical parcels. Training relies on a composite MSE-correlation loss and a curriculum that gradually shifts emphasis from early sensory to late association regions. Averaging 100 model variants further boosts robustness. The resulting system ranked third on the competition leaderboard, achieving an overall Pearson r = 0.2094 and the highest single-parcel peak score (mean r = 0.63) among all participants, with particularly strong gains for the most challenging subject (Subject 5). The approach establishes a simple, extensible baseline for future multimodal brain-encoding benchmarks.
Authors: Biyi Fang, Jean Utke, Truong Vo, Diego Klabjan
Abstract: Convolutional Neural Networks (CNNs) have achieved remarkable success across a wide range of machine learning tasks by leveraging hierarchical feature learning through deep architectures. However, the large number of layers and millions of parameters often make CNNs computationally expensive to train, requiring extensive time and manual tuning to discover optimal architectures. In this paper, we introduce a novel framework for boosting CNN performance that integrates dynamic feature selection with the principles of BoostCNN. Our approach incorporates two key strategies: subgrid selection and importance sampling, to guide training toward informative regions of the feature space. We further develop a family of algorithms that embed boosting weights directly into the network training process using a least squares loss formulation. This integration not only alleviates the burden of manual architecture design but also enhances accuracy and efficiency. Experimental results across several fine-grained classification benchmarks demonstrate that our boosted CNN variants consistently outperform conventional CNNs in both predictive performance and training speed.
Authors: Subhey Sadi Rahman, Md. Adnanul Islam, Md. Mahbub Alam, Musarrat Zeba, Md. Abdur Rahman, Sadia Sultana Chowa, Mohaimenul Azam Khan Raiaan, Sami Azam
Abstract: Large Language Models (LLMs) are trained on vast and diverse internet corpora that often include inaccurate or misleading content. Consequently, LLMs can generate misinformation, making robust fact-checking essential. This review systematically analyzes how LLM-generated content is evaluated for factual accuracy by exploring key challenges such as hallucinations, dataset limitations, and the reliability of evaluation metrics. The review emphasizes the need for strong fact-checking frameworks that integrate advanced prompting strategies, domain-specific fine-tuning, and retrieval-augmented generation (RAG) methods. It proposes five research questions that guide the analysis of the recent literature from 2020 to 2025, focusing on evaluation methods and mitigation techniques. Instruction tuning, multi-agent reasoning, and RAG frameworks for external knowledge access are also reviewed. The key findings demonstrate the limitations of current metrics, the importance of validated external evidence, and the improvement of factual consistency through domain-specific customization. The review underscores the importance of building more accurate, understandable, and context-aware fact-checking. These insights contribute to the advancement of research toward more trustworthy models.
Authors: Llu\'is Arola-Fern\'andez
Abstract: Whether large predictive models merely parrot their training data or produce genuine insight lacks a physical explanation. This work reports a primitive form of intuition that emerges as a metastable phase of learning that critically balances next-token prediction against future path-entropy. The intuition mechanism is discovered via mind-tuning, the minimal principle that imposes Maximum Caliber in predictive models with a control temperature-like parameter $\lambda$. Training on random walks in deterministic mazes reveals a rich phase diagram: imitation (low $\lambda$), rule-breaking hallucination (high $\lambda$), and a fragile in-between window exhibiting strong protocol-dependence (hysteresis) and multistability, where models spontaneously discover novel goal-directed strategies. These results are captured by an effective low-dimensional theory and frame intuition as an emergent property at the critical balance between memorizing what is and wondering what could be.
Authors: Michael Poppel, David Bucher, Maximilian Zorn, Nico Kraus, Philipp Altmann, Jonas Stein, Claudia Linnhoff-Popien
Abstract: Quantum machine learning research has expanded rapidly due to potential computational advantages over classical methods. Angle encoding has emerged as a popular choice as feature map (FM) for embedding classical data into quantum models due to its simplicity and natural generation of truncated Fourier series, providing universal function approximation capabilities. Efficient FMs within quantum circuits can exploit exponential scaling of Fourier frequencies, with multi-dimensional inputs introducing additional exponential growth through mixed-frequency terms. Despite this promising expressive capability, practical implementation faces significant challenges. Through controlled experiments with white-box target functions, we demonstrate that training failures can occur even when all relevant frequencies are theoretically accessible. We illustrate how two primary known causes lead to unsuccessful optimization: insufficient trainable parameters relative to the model's frequency content, and limitations imposed by the ansatz's dynamic lie algebra dimension, but also uncover an additional parameter burden: the necessity of controlling non-unique frequencies within the model. To address this, we propose near-zero weight initialization to suppress unnecessary duplicate frequencies. For target functions with a priori frequency knowledge, we introduce frequency selection as a practical solution that reduces parameter requirements and mitigates the exponential growth that would otherwise render problems intractable due to parameter insufficiency. Our frequency selection approach achieved near-optimal performance (median $R^2 \approx 0.95$) with 78\% of the parameters needed by the best standard approach in 10 randomly chosen target functions.
Authors: Ziyi Zeng, Yun-Hsuan Chen, Xurong Gao, Wenyao Zheng, Hemmings Wu, Zhoule Zhu, Jie Yang, Chengkai Wang, Lihua Zhong, Weiwei Cheng, Mohamad Sawan
Abstract: The impact of repetitive transcranial magnetic stimulation (rTMS) on methamphetamine (METH) users' craving levels is often assessed using questionnaires. This study explores the feasibility of using neural signals to obtain more objective results. EEG signals recorded from 20 METH-addicted participants Before and After rTMS (MBT and MAT) and from 20 healthy participants (HC) are analyzed. In each EEG paradigm, participants are shown 15 METH-related and 15 neutral pictures randomly, and the relative band power (RBP) of each EEG sub-band frequency is derived. The average RBP across all 31 channels, as well as individual brain regions, is analyzed. Statistically, MAT's alpha, beta, and gamma RBPs are more like those of HC compared to MBT, as indicated by the power topographies. Utilizing a random forest (RF), the gamma RBP is identified as the optimal frequency band for distinguishing between MBT and HC with a 90% accuracy. The performance of classifying MAT versus HC is lower than that of MBT versus HC, suggesting that the efficacy of rTMS can be validated using RF with gamma RBP. Furthermore, the gamma RBP recorded by the TP10 and CP2 channels dominates the classification task of MBT versus HC when receiving METH-related image cues. The gamma RBP during exposure to METH-related cues can serve as a biomarker for distinguishing between MBT and HC and for evaluating the effectiveness of rTMS. Therefore, real-time monitoring of gamma RBP variations holds promise as a parameter for implementing a customized closed-loop neuromodulation system for treating METH addiction.
Authors: Loris Schoenegger, Lukas Thoma, Terra Blevins, Benjamin Roth
Abstract: Curriculum learning, a training technique where data is presented to the model in order of example difficulty (e.g., from simpler to more complex documents), has shown limited success for pre-training language models. In this work, we investigate whether curriculum learning becomes competitive if we replace conventional human-centered difficulty metrics with one that more closely corresponds to example difficulty as observed during model training. Specifically, we experiment with sorting training examples by their \textit{training data influence}, a score which estimates the effect of individual training examples on the model's output. Models trained on our curricula are able to outperform ones trained in random order by over 10 percentage points in benchmarks, confirming that curriculum learning is beneficial for language model pre-training, as long as a more model-centric notion of difficulty is adopted.
Authors: Xinwen Liu, Lei Qian, Song Xi Chen, Niansheng Tang
Abstract: Causal inference in settings involving complex spatio-temporal dependencies, such as environmental epidemiology, is challenging due to the presence of unmeasured confounding. However, a significant gap persists in existing methods: current diffusion-based causal models rely on restrictive assumptions of causal sufficiency or static confounding. To address this limitation, we introduce the Partially Functional Dynamic Backdoor Diffusion-based Causal Model (PFD-BDCM), a generative framework designed to bridge this gap. Our approach uniquely incorporates valid backdoor adjustments into the diffusion sampling mechanism to mitigate bias from unmeasured confounders. Specifically, it captures their intricate dynamics through region-specific structural equations and conditional autoregressive processes, and accommodates multi-resolution variables via functional data techniques. Furthermore, we provide theoretical guarantees by establishing error bounds for counterfactual estimates. Extensive experiments on synthetic data and a real-world air pollution case study confirm that PFD-BDCM outperforms current state-of-the-art methods.
Authors: Jonathn Chang, Leonhard Piff, Suvadip Sana, Jasmine X. Li, Lionel Levine
Abstract: Aligning AI with human values is a pressing unsolved problem. To address the lack of quantitative metrics for value alignment, we propose EigenBench: a black-box method for comparatively benchmarking language models' values. Given an ensemble of models, a constitution describing a value system, and a dataset of scenarios, our method returns a vector of scores quantifying each model's alignment to the given constitution. To produce these scores, each model judges the outputs of other models across many scenarios, and these judgments are aggregated with EigenTrust (Kamvar et al., 2003), yielding scores that reflect a weighted consensus judgment of the whole ensemble. EigenBench uses no ground truth labels, as it is designed to quantify subjective traits for which reasonable judges may disagree on the correct label. Hence, to validate our method, we collect human judgments on the same ensemble of models and show that EigenBench's judgments align closely with those of human evaluators. We further demonstrate that EigenBench can recover model rankings on the GPQA benchmark without access to objective labels, supporting its viability as a framework for evaluating subjective values for which no ground truths exist.
Authors: Luo Luo, Xue Cui, Tingkai Jia, Cheng Chen
Abstract: This paper studies decentralized optimization problem $f(\mathbf{x})=\frac{1}{m}\sum_{i=1}^m f_i(\mathbf{x})$, where each local function has the form of $f_i(\mathbf{x}) = {\mathbb E}\left[F(\mathbf{x};{\boldsymbol \xi}_i)\right]$ which is $(L_0,L_1)$-smooth but possibly nonconvex and the random variable ${\boldsymbol \xi}_i$ follows distribution ${\mathcal D}_i$. We propose a novel algorithm called decentralized normalized stochastic gradient descent (DNSGD), which can achieve an $\epsilon$-stationary point at each local agent. We present a new framework for analyzing decentralized first-order methods in the relaxed smooth setting, based on the Lyapunov function related to the product of the gradient norm and the consensus error. We show the upper bounds on the sample complexity of ${\mathcal O}(m^{-1}(L_f\sigma^2\Delta_f\epsilon^{-4} + \sigma^2\epsilon^{-2} + L_f^{-2}L_1^3\sigma^2\Delta_f\epsilon^{-1} + L_f^{-2}L_1^2\sigma^2))$ per agent and the communication complexity of $\tilde{\mathcal O}((L_f\epsilon^{-2} + L_1\epsilon^{-1})\gamma^{-1/2}\Delta_f)$, where $L_f=L_0 +L_1\zeta$, $\sigma^2$ is the variance of the stochastic gradient, $\Delta_f$ is the initial optimal function value gap, $\gamma$ is the spectral gap of the network, and $\zeta$ is the degree of the gradient dissimilarity. In the special case of $L_1=0$, the above results (nearly) match the lower bounds of decentralized stochastic nonconvex optimization under the standard smoothness. We also conduct numerical experiments to show the empirical superiority of our method.
Authors: Milan Marocchi, Matthew Fynn, Kayapanda Mandana, Yue Rong
Abstract: Cardiovascular diseases (CVDs) are the leading cause of death worldwide, accounting for approximately 17.9 million deaths each year. Early detection is critical, creating a demand for accurate and inexpensive pre-screening methods. Deep learning has recently been applied to classify abnormal heart sounds indicative of CVDs using synchronised phonocardiogram (PCG) and electrocardiogram (ECG) signals, as well as multichannel PCG (mPCG). However, state-of-the-art architectures remain underutilised due to the limited availability of synchronised and multichannel datasets. Augmented datasets and pre-trained models provide a pathway to overcome these limitations, enabling transformer-based architectures to be trained effectively. This work combines traditional signal processing with denoising diffusion models, WaveGrad and DiffWave, to create an augmented dataset to fine-tune a Wav2Vec 2.0-based classifier on multimodal and multichannel heart sound datasets. The approach achieves state-of-the-art performance. On the Computing in Cardiology (CinC) 2016 dataset of single channel PCG, accuracy, unweighted average recall (UAR), sensitivity, specificity and Matthew's correlation coefficient (MCC) reach 92.48%, 93.05%, 93.63%, 92.48%, 94.93% and 0.8283, respectively. Using the synchronised PCG and ECG signals of the training-a dataset from CinC, 93.14%, 92.21%, 94.35%, 90.10%, 95.12% and 0.8380 are achieved for accuracy, UAR, sensitivity, specificity and MCC, respectively. Using a wearable vest dataset consisting of mPCG data, the model achieves 77.13% accuracy, 74.25% UAR, 86.47% sensitivity, 62.04% specificity, and 0.5082 MCC. These results demonstrate the effectiveness of transformer-based models for CVD detection when supported by augmented datasets, highlighting their potential to advance multimodal and multichannel heart sound classification.
Authors: Debasish Dutta, Neeharika Sonowal, Risheraj Barauh, Deepjyoti Chetia, Sanjib Kr Kalita
Abstract: Microscopy image enhancement plays a pivotal role in understanding the details of biological cells and materials at microscopic scales. In recent years, there has been a significant rise in the advancement of microscopy image enhancement, specifically with the help of deep learning methods. This survey paper aims to provide a snapshot of this rapidly growing state-of-the-art method, focusing on its evolution, applications, challenges, and future directions. The core discussions take place around the key domains of microscopy image enhancement of super-resolution, reconstruction, and denoising, with each domain explored in terms of its current trends and their practical utility of deep learning.
Authors: Tsz Ting Chung, Lemao Liu, Mo Yu, Dit-Yan Yeung
Abstract: Logic reasoning in natural language has been recognized as an important measure of human intelligence for Large Language Models (LLMs). Popular benchmarks may entangle multiple reasoning skills and thus provide unfaithful evaluations on the logic reasoning skill. Meanwhile, existing logic reasoning benchmarks are limited in language diversity and their distributions are deviated from the distribution of an ideal logic reasoning benchmark, which may lead to biased evaluation results. This paper thereby proposes a new classical logic benchmark DivLogicEval, consisting of natural sentences composed of diverse statements in a counterintuitive way. To ensure a more reliable evaluation, we also introduce a new evaluation metric that mitigates the influence of bias and randomness inherent in LLMs. Through experiments, we demonstrate the extent to which logical reasoning is required to answer the questions in DivLogicEval and compare the performance of different popular LLMs in conducting logical reasoning.
Authors: Chang Li, Zehua Chen, Liyuan Wang, Jun Zhu
Abstract: Audio super-resolution (SR), i.e., upsampling the low-resolution (LR) waveform to the high-resolution (HR) version, has recently been explored with diffusion and bridge models, while previous methods often suffer from sub-optimal upsampling quality due to their uninformative generation prior. Towards high-quality audio super-resolution, we present a new system with latent bridge models (LBMs), where we compress the audio waveform into a continuous latent space and design an LBM to enable a latent-to-latent generation process that naturally matches the LR-toHR upsampling process, thereby fully exploiting the instructive prior information contained in the LR waveform. To further enhance the training results despite the limited availability of HR samples, we introduce frequency-aware LBMs, where the prior and target frequency are taken as model input, enabling LBMs to explicitly learn an any-to-any upsampling process at the training stage. Furthermore, we design cascaded LBMs and present two prior augmentation strategies, where we make the first attempt to unlock the audio upsampling beyond 48 kHz and empower a seamless cascaded SR process, providing higher flexibility for audio post-production. Comprehensive experimental results evaluated on the VCTK, ESC-50, Song-Describer benchmark datasets and two internal testsets demonstrate that we achieve state-of-the-art objective and perceptual quality for any-to-48kHz SR across speech, audio, and music signals, as well as setting the first record for any-to-192kHz audio SR. Demo at https://AudioLBM.github.io/.
Authors: Lars Ankile, Zhenyu Jiang, Rocky Duan, Guanya Shi, Pieter Abbeel, Anusha Nagabandi
Abstract: Recent advances in behavior cloning (BC) have enabled impressive visuomotor control policies. However, these approaches are limited by the quality of human demonstrations, the manual effort required for data collection, and the diminishing returns from offline data. In comparison, reinforcement learning (RL) trains an agent through autonomous interaction with the environment and has shown remarkable success in various domains. Still, training RL policies directly on real-world robots remains challenging due to sample inefficiency, safety concerns, and the difficulty of learning from sparse rewards for long-horizon tasks, especially for high-degree-of-freedom (DoF) systems. We present a recipe that combines the benefits of BC and RL through a residual learning framework. Our approach leverages BC policies as black-box bases and learns lightweight per-step residual corrections via sample-efficient off-policy RL. We demonstrate that our method requires only sparse binary reward signals and can effectively improve manipulation policies on high-degree-of-freedom (DoF) systems in both simulation and the real world. In particular, we demonstrate, to the best of our knowledge, the first successful real-world RL training on a humanoid robot with dexterous hands. Our results demonstrate state-of-the-art performance in various vision-based tasks, pointing towards a practical pathway for deploying RL in the real world.
Authors: Sai Teja Reddy Adapala
Abstract: The scaling of Large Language Models (LLMs) has exposed a critical gap between their performance on static benchmarks and their fragility in dynamic, information-rich environments. While models excel at isolated tasks, the computational limits that govern their reasoning under cognitive load remain poorly understood. In this work, we introduce a formal theory of computational cognitive load, positing that extraneous, task-irrelevant information (Context Saturation) and interference from task-switching (Attentional Residue) are key mechanisms that degrade performance. We designed the Interleaved Cognitive Evaluation (ICE), a deconfounded benchmark to systematically manipulate these load factors on challenging multi-hop reasoning tasks. A comprehensive study (N = 10 replications per item across 200 questions) revealed significant performance variations across five instruction-tuned models. Smaller open-source architectures (Llama-3-8B-Instruct, Mistral-7B-Instruct-v0.2) exhibited baseline brittleness, achieving 0% accuracy (SEM = 0.0) across all conditions, including clean controls, on this high-intrinsic-load task. In contrast, Gemini-2.0-Flash-001 showed partial resilience, achieving 85% accuracy in control conditions, with a statistically significant degradation under context saturation ($\beta = -0.003$ per % load, $p < 0.001$). These findings provide preliminary evidence that cognitive load is a key contributor to reasoning failures, supporting theories of hallucination-as-guessing under uncertainty. We conclude that dynamic, cognitive-aware stress testing, as exemplified by the ICE benchmark, is essential for evaluating the true resilience and safety of advanced AI systems.
Authors: Manuel Perez-Carrasco, Maya Nasr, Sebastien Roche, Chris Chan Miller, Zhan Zhang, Core Francisco Park, Eleanor Walker, Cecilia Garraffo, Douglas Finkbeiner, Ritesh Gautam, Steven Wofsy
Abstract: Effective cloud and cloud shadow detection is a critical prerequisite for accurate retrieval of concentrations of atmospheric methane or other trace gases in hyperspectral remote sensing. This challenge is especially pertinent for MethaneSAT and for its airborne companion mission, MethaneAIR. In this study, we use machine learning methods to address the cloud and cloud shadow detection problem for sensors with these high spatial resolutions instruments. Cloud and cloud shadows in remote sensing data need to be effectively screened out as they bias methane retrievals in remote sensing imagery and impact the quantification of emissions. We deploy and evaluate conventional techniques including Iterative Logistic Regression (ILR) and Multilayer Perceptron (MLP), with advanced deep learning architectures, namely UNet and a Spectral Channel Attention Network (SCAN) method. Our results show that conventional methods struggle with spatial coherence and boundary definition, affecting the detection of clouds and cloud shadows. Deep learning models substantially improve detection quality: UNet performs best in preserving spatial structure, while SCAN excels at capturing fine boundary details. Notably, SCAN surpasses UNet on MethaneSAT data, underscoring the benefits of incorporating spectral attention for satellite specific features. This in depth assessment of various disparate machine learning techniques demonstrates the strengths and effectiveness of advanced deep learning architectures in providing robust, scalable solutions for clouds and cloud shadow screening towards enhancing methane emission quantification capacity of existing and next generation hyperspectral missions. Our data and code is publicly available at https://doi.org/10.7910/DVN/IKLZOJ
Authors: Daniel Maninger, Leon Chemnitz, Amir Molzam Sharifloo, Jannis Brugger, Mira Mezini
Abstract: API integration is a cornerstone of our digital infrastructure, enabling software systems to connect and interact. However, as shown by many studies, writing or generating correct code to invoke APIs, particularly web APIs, is challenging. Although large language models (LLMs) have become popular in software development, their effectiveness in automating the generation of web API integration code remains unexplored. In order to address this, we present a dataset and evaluation pipeline designed to assess the ability of LLMs to generate web API invocation code. Our experiments with several open-source LLMs reveal that generating API invocations poses a significant challenge, resulting in hallucinated endpoints, incorrect argument usage, and other errors. None of the evaluated open-source models were able to solve more than 40% of the tasks.
Authors: Liang Wang, Nan Yang, Shaohan Huang, Li Dong, Furu Wei
Abstract: This paper introduces a simple and scalable approach to improve the data efficiency of large language model (LLM) training by augmenting existing text data with thinking trajectories. The compute for pre-training LLMs has been growing at an unprecedented rate, while the availability of high-quality data remains limited. Consequently, maximizing the utility of available data constitutes a significant research challenge. A primary impediment is that certain high-quality tokens are difficult to learn given a fixed model capacity, as the underlying rationale for a single token can be exceptionally complex and deep. To address this issue, we propose Thinking augmented Pre-Training (TPT), a universal methodology that augments text with automatically generated thinking trajectories. Such augmentation effectively increases the volume of the training data and makes high-quality tokens more learnable through step-by-step reasoning and decomposition. We apply TPT across diverse training configurations up to $100$B tokens, encompassing pre-training with both constrained and abundant data, as well as mid-training from strong open-source checkpoints. Experimental results indicate that our method substantially improves the performance of LLMs across various model sizes and families. Notably, TPT enhances the data efficiency of LLM pre-training by a factor of $3$. For a $3$B parameter model, it improves the post-training performance by over $10\%$ on several challenging reasoning benchmarks.
Authors: Huimin Yan, Longfei Xu, Junjie Sun, Ni Ou, Wei Luo, Xing Tan, Ran Cheng, Kaikui Liu, Xiangxiang Chu
Abstract: Generative recommendation has emerged as a promising paradigm, demonstrating remarkable results in both academic benchmarks and industrial applications. However, existing systems predominantly focus on unifying retrieval and ranking while neglecting the integration of search and recommendation (S&R) tasks. What makes search and recommendation different is how queries are formed: search uses explicit user requests, while recommendation relies on implicit user interests. As for retrieval versus ranking, the distinction comes down to whether the queries are the target items themselves. Recognizing the query as central element, we propose IntSR, an integrated generative framework for S&R. IntSR integrates these disparate tasks using distinct query modalities. It also addresses the increased computational complexity associated with integrated S&R behaviors and the erroneous pattern learning introduced by a dynamically changing corpus. IntSR has been successfully deployed across various scenarios in Amap, leading to substantial improvements in digital asset's GMV(+9.34%), POI recommendation's CTR(+2.76%), and travel mode suggestion's ACC(+7.04%).
Authors: Benedikt Hoock, Tobias K\"oppl
Abstract: In this work, we propose a novel method for calibrating Windkessel (WK) parameters in a dimensionally reduced 1D-0D coupled blood flow model. To this end, we design a data-driven neural network (NN)trained on simulated blood pressures in the left brachial artery. Once trained, the NN emulates the pressure pulse waves across the entire simulated domain, i.e., over time, space and varying WK parameters, with negligible error and computational effort. To calibrate the WK parameters on a measured pulse wave, the NN is extended by dummy neurons and retrained only on these. The main objective of this work is to assess the effectiveness of the method in various scenarios -- particularly, when the exact measurement location is unknown or the data are affected by noise.